Title: Forty-Five Years of Digitizing Ebooks: Project Gutenberg's Practices
Author: Gregory B. Newby
Release date: October 18, 2019 [eBook #60600]
Most recently updated: July 6, 2021
Language: English
Credits: an Anonymous Project Gutenberg Volunteer
By Gregory B. Newby
CEO Project Gutenberg Literary Archive Foundation
Contents
ITEMS ARE IN THE PUBLIC DOMAIN FOR ONE OF THREE REASONS
COLLECTION DEVELOPMENT POLICY AND EARLY MARKUP
EVOLUTION IN PROOFREADING: DISTRIBUTED PROOFREADERS
COPYRIGHT CLEARANCE OR PERMISSION
EVOLUTION OF MASTER SOURCE FORMATS
"NO SWEAT OF THE BROW COPYRIGHT"
PAST INNOVATIONS AND FUTURE INITIATIVES
Project Gutenberg creates and freely distributes electronic books (eBooks). This document offers elements of the story of Project Gutenberg’s methods and practices for creating those eBooks, and the surrounding procedures for making them as widely available as possible. Project Gutenberg seeks to make the world’s great literature enjoyable and accessible.
The first Project Gutenberg eBook was created on July 4, 1971. Michael S. Hart had been granted access to a powerful mainframe computer at the University of Illinois at Urbana-Champaign, and realized that his greatest impact would be by digitizing and distributing free literature (for more history, see: The eBook is 40 (1971-2011), by Marie Lebert).
Michael took a printed copy of the United States Declaration of Independence to the computer laboratory, where he sat at the teletype terminal and typed this first eBook. He distributed it via email to the people he knew about via the Internet’s predecessor, ARPAnet, which was available at UIUC. At that moment, the first eBook had been freely distributed to the online community of the day.
Digitization and production techniques, at the time of this first eBook, were ad hoc and informal. A single eBook producer would edit a single file, from a single source. The first eBook’s printed source was a single sheet of paper, without hyphenation, a book cover, images, or other characteristics of book-length sources. In 1971, capitalization was not an issue, as only upper case letters were available in the character set used by the system.
Figure 1: Top view of a Model 33 Teletype, salvaged from the computer laboratory where Michael Hart typed the first eBook. The paper roll was where output would be printed.
During the next twenty years, from approximately 1971 to 1991, techniques of digitization would be dramatically improved and regularized. Ongoing developments since then have tracked the available technologies for eBook creation and use, as well as the preferences and interests of the many volunteers who would produce those eBooks. Throughout the history of Project Gutenberg, these techniques, while refined and clearly articulated, have remained flexible (see the Volunteers’ FAQ).
Project Gutenberg’s founder, Michael Hart, was motivated by completely free and unencumbered redistribution of literary works. Access to literary works enables literacy, which in turn opens the door to education and, it is hoped, opportunity. Interest in literary works that could be freely redistributed led to an emphasis on books and other items that are in the public domain.
The public domain is, today, understood to be those items that are not copyrighted. Copyright in the United States, where Project Gutenberg operates, is defined as a temporary monopoly granted to authors (or their agents), so that they can benefit from a work's commercial potential and thereby be encouraged to continue creating:
“To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries” United States Constitution
Items are in the public domain for one of three reasons:
1. They are ineligible for copyright. In the US, this includes works created by the US Government;
2. Their copyright term has expired; or
3. They are granted to the public domain by the creator or their agent (i.e., the rights holder).
Because of its emphasis on literary works, Project Gutenberg has mostly focused on items for which the copyright term has expired. Until 1998, this included items published 75 years earlier. For example, items from 1920 entered the public domain when their copyrights expired in 1995. The US Copyright Term Extension Act of 1998 changed the term to 95 years for most literary works, so new items (from 1923 onward) will not enter the public domain before 2019.
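The term arithmetic in the paragraph above can be sketched in a few lines of Python. This is a simplified illustration only: the function names are hypothetical, and the sketch assumes a flat term counted from the publication year, with protection lasting through the end of the final year. Renewal requirements and other rules are ignored.

```python
def copyright_expiry_year(publication_year: int, term_years: int) -> int:
    """Return the calendar year in which a flat copyright term runs out."""
    return publication_year + term_years


def first_public_domain_year(publication_year: int, term_years: int) -> int:
    """Return the first calendar year in which the work is in the public domain."""
    # Assumption: protection lasts through December 31 of the expiry year.
    return copyright_expiry_year(publication_year, term_years) + 1


if __name__ == "__main__":
    # 75-year term (the pre-1998 rule described in the text).
    print(copyright_expiry_year(1920, 75))
    # 95-year term (the post-1998 rule), applied to a work from 1923.
    print(first_public_domain_year(1923, 95))
```

With these assumptions, a 1920 work under the 75-year term expires in 1995, and a 1923 work under the 95-year term first becomes public domain in 2019, matching the dates given in the text.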
Figure 2: Michael Hart’s sunroom workspace in his Urbana home
There are over one million published works from before 1923, and these are the main items that Project Gutenberg continues to digitize and distribute. In addition, there were approximately one million works published in the United States from 1923 to 1964 whose copyrights were not renewed. Those items entered the public domain when their first copyright term ended, 28 years after publication. The copyright procedures used are available online (see Copyright procedures).
The eBook collection, and all other aspects of Project Gutenberg, relies on volunteers to grow. Therefore, selection of items is done mainly by volunteers. Project Gutenberg seeks to limit duplication in the collection, and instead prefers to add items not already in the collection. Improvement of existing items is ongoing, mainly when errata reports are submitted by readers.
It took over two decades to release the first 100 eBooks, with #100 being published in 1994. Most of those first eBooks were collected through personal interaction with Hart. He would guide or participate in the digitization process, often developing procedures to deal with new characteristics. Footnotes and endnotes, italics and underscores, bold text, and different fonts all presented challenges for representation as plain text. Primitive markup techniques were developed, such as using an underscore character to surround underscored text, _like this_.
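As an illustration of how such a plain text convention can be processed mechanically, here is a minimal Python sketch that turns _underscore-delimited_ spans into HTML italics. The function name, the choice of the `<i>` tag, and the matching rules are assumptions made for this example; the text does not describe how Project Gutenberg itself converts this markup.

```python
import html
import re

# Assumption: a marked span starts and ends with an underscore, contains no
# underscores or line breaks, and is not glued to a neighbouring word character
# (so identifiers such as snake_case_names are left alone).
EMPHASIS = re.compile(r"(?<!\w)_([^_\n]+)_(?!\w)")


def underscores_to_html(text: str) -> str:
    """Escape plain text for HTML, then wrap _marked spans_ in <i> tags."""
    escaped = html.escape(text, quote=False)
    return EMPHASIS.sub(r"<i>\1</i>", escaped)


if __name__ == "__main__":
    print(underscores_to_html("Emphasis is marked _like this_."))
```

Spans that cross a line break are deliberately left untouched here, since guessing where such a span ends is error-prone.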
It was not until the mid-1990s that hypertext markup language (HTML) was first used, and at the time it was decided that Project Gutenberg eBooks should be wholly self-contained. A zip file would include all of the needed images, and external links were discouraged.
Throughout the entire history of Project Gutenberg, volunteers have been encouraged to work on items they are interested in, and to make their own decisions about how to best represent the content.
The first eBooks were created by typing the text of printed books into word processor or text editing programs, and then submitting the files for final formatting and redistribution. Typists would perform basic formatting, including:
Plain text eBooks, which were the only major format until HTML became more frequent by the mid- to late-1990s, were designed to be viewed on computer monitors with fixed-width fonts with 80-character lines. Plain text is still provided for nearly all Project Gutenberg eBooks today, although HTML and other formats are also provided.
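Fixed-length lines of this kind can be produced with Python's standard `textwrap` module. The sketch below is illustrative: the function name is hypothetical, the 80-character width is taken from the paragraph above, and paragraphs are assumed to be separated by blank lines.

```python
import textwrap


def wrap_paragraphs(text: str, width: int = 80) -> str:
    """Re-wrap blank-line-separated paragraphs to lines of at most `width` characters."""
    # Assumption: a blank line marks a paragraph boundary.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    # Collapse internal whitespace before wrapping so old line breaks disappear.
    wrapped = [textwrap.fill(" ".join(p.split()), width=width) for p in paragraphs]
    return "\n\n".join(wrapped)


if __name__ == "__main__":
    sample = "Plain text is readable on any device. " * 6
    print(wrap_paragraphs(sample, width=80))
```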
Once an item is typed into an electronic file, and basic formatting is completed, one or more rounds of proofreading will help to improve quality. This includes correcting typos, poor formatting, and inconsistency of presentation. In practice, all eBooks published by Project Gutenberg still have errors, even if they are far better than 99% accurate. For example, an eBook that is 99.99% accurate (i.e., “four nines”) will still have one wrong character in 10,000. That amounts to approximately 30 errors in a typical 50,000 word novel. Proofreading is, by its nature, asymptotic. Subsequent rounds of proofreading improve an eBook, but that eBook is still likely to contain some errors.
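The arithmetic behind such estimates can be sketched as follows. The sketch expresses the error rate as one wrong character per N characters and assumes roughly six characters per word, counting the space that follows it. That figure is an assumption of this example, not something stated in the text, and the function name is illustrative.

```python
def expected_errors(words: int, chars_per_error: int, chars_per_word: int = 6) -> float:
    """Estimate the number of wrong characters left in a text.

    chars_per_error: one wrong character is expected per this many characters.
    chars_per_word:  assumed average word length including one space.
    """
    total_chars = words * chars_per_word
    return total_chars / chars_per_error


if __name__ == "__main__":
    # A 50,000 word novel with one wrong character in 10,000.
    print(expected_errors(50_000, 10_000))
    # The same novel with one wrong character in 100,000.
    print(expected_errors(50_000, 100_000))
```

Under these assumptions the novel has about 300,000 characters, so one error per 10,000 characters leaves about 30 errors, and one per 100,000 leaves about 3.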
Errors in eBooks often reflect errors in their printed sources, and Project Gutenberg encourages fixing those errors.
From 2002-2004 an important innovation was developed, in support of the creation of new Project Gutenberg eBooks. This was Distributed Proofreaders, an early example of what is now known as crowdsourcing. Through Distributed Proofreaders, volunteers engage in a portion of the eBook creation process — whether it is copyright clearances, proofreading (a page at a time!), or formatting, checking, and finalization before uploading. Those portions, when coordinated together, lead to the creation of new eBooks from printed sources.
Distributed Proofreaders has become the single largest source for new eBooks to the collection, accounting for approximately half of all titles. Distributed Proofreaders has also innovated substantially in the use of HTML+CSS (cascading style sheets) for very attractive presentation of eBooks in Web browsers.
By the early 1990s, scanning and optical character recognition (OCR) started to become widely available. Hart received a full scanning station via a grant from a computer manufacturer, which was used to produce several of the first 100 eBooks. The scanner was a flatbed model, which required the user to hold the book open, scan a page (or pair of pages) for ingest to the OCR software, then flip to the next page.
The OCR software would then automatically recognize the characters from the scan, and create an editable view of the text. Proofreading and formatting would then occur in the same way as for a typed text.
A few years later, Project Gutenberg worked with Distributed Proofreaders to acquire sheet-fed scanners. These scanners, which are still in operation, are faster. They also tend to produce an image that is properly aligned, versus the skewing that sometimes occurs with flatbed scanners. An important difference is that the printed books are damaged: prior to scanning, the spines of the books are cut off, so that the individual pages can be fed into the scanner.
Figure 3: Image from the Doré illustrations of Dante’s Inferno
It has been Project Gutenberg’s intention to make all the original images from the scanners available, alongside the finished eBook. This is to have a more complete record of the eBook’s source(s), and also to facilitate improvements by finding typos. Most eBook producers to date have chosen to not provide the scans, however.
Scanners are used for images within printed books, which are typically included as JPEG, GIF or PNG items within HTML and other formats. Inline images may be at a lower resolution, and then clickable to obtain higher resolution images. Color scanners are used, whenever possible, for color images.
Project Gutenberg has no prohibition against using items scanned by other parties. Several excellent sources of scans are freely available, including Google Books, Gallica, and The Internet Archive. Scans, and raw OCR output (if available), may then be transformed into Project Gutenberg eBooks by volunteers.
From approximately 1994-2004, procedures for digitization became more clearly articulated. This included the notion that a copyright “clearance” was the necessary first step for starting any new eBook for contribution to Project Gutenberg. The “copyright how-to” mentioned above was developed and refined, with guidance from a number of lawyers with expertise in US copyright law.
Project Gutenberg has always operated within the copyright laws of the US, and includes text in each eBook, and online at Project Gutenberg, making it clear that readers in other countries must follow the laws that apply to them. Project Gutenberg affiliates, which operate completely independently, exist to emphasize the literary works and languages of different countries, and they follow the copyright laws of the country or region in which they operate.
Generally, copyright clearance is simple. Items published prior to 1923, anywhere in the world, are in the public domain in the US. Prior to 1993, all copyright clearance actions required mailing a photocopy of the title page and verso (reverse) page of a candidate book to Michael Hart or Greg Newby, but then an online system was developed that accepted scans of those pages. A database maintains records of cleared items, and who submitted them. A few other copyright rules are sometimes applied, for items published after 1923.
Sometimes, copyrighted items are submitted by authors. For many years, Project Gutenberg was one of few online repositories of user-contributed literary works, and therefore accepted items from contemporary authors. The two requirements for such content were:
1. A perpetual, worldwide, non-exclusive, irrevocable license be granted to Project Gutenberg, for unlimited redistribution of the item; and
2. The item must be made available as plain text, (valid) HTML, or both.
However, user-contributed content is generally no longer accepted for the main collection at Project Gutenberg. Instead, a new self-publishing portal, operated by an affiliate, The World EBook Library, is available at self.gutenberg.org. With the self-publishing portal, authors may use any license they wish (such as a Creative Commons license), and can provide items in PDF or other formats. This simplifies the process for the authors, and removes the need for Project Gutenberg’s volunteers to be involved with author-contributed content.
Project Gutenberg encourages the use of multiple printed sources to create an eBook. For many historical works, including the US Declaration of Independence (the first Project Gutenberg eBook), there are variations among the printed sources. Another early example is the works of William Shakespeare. Project Gutenberg has several different versions of Shakespeare, including one based on the first edition folios. It has been typical, throughout the modern history of publishing, for different versions of a book to have variations.
In practice, the majority of Project Gutenberg eBooks rely on a single printed source. However, even those items might benefit from other sources — such as when some pages are missing, or illustrations come from a different version, or when typos/errata reports come from other sources.
It is a principle of Project Gutenberg that the eBooks in the collection are denoted as Project Gutenberg eBooks. Even if the publisher imprint and frontispiece from a printed work are included, there is no assurance that the content exactly matches that printed work. And, in fact, it will not match: minimally, the headers/footers will be removed, and paragraphs will flow together such that they span the pages of the printed source. Many other adjustments are typically made, as mentioned above.
For this reason, Project Gutenberg’s online catalog metadata does not include a citation to the source(s) used to create an eBook. Instead, Project Gutenberg should be cited as the publisher. For example, a bibliographic citation might have a form such as this:
Carroll, Lewis. “Alice’s Adventures in Wonderland.” Urbana, Illinois: Project Gutenberg. Available: www.gutenberg.org/ebooks/11
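A citation of this form can be assembled programmatically. The helper below is a hypothetical sketch, not a Project Gutenberg tool: it reproduces the example's layout and assumes that the eBook number is the last component of the address, as in the example above.

```python
def format_citation(author: str, title: str, ebook_number: int) -> str:
    """Build a citation naming Project Gutenberg as the publisher."""
    # \u201c and \u201d are the opening and closing curly quotation marks.
    return (
        f"{author}. \u201c{title}.\u201d Urbana, Illinois: Project Gutenberg. "
        f"Available: www.gutenberg.org/ebooks/{ebook_number}"
    )


if __name__ == "__main__":
    print(format_citation("Carroll, Lewis", "Alice\u2019s Adventures in Wonderland", 11))
```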
Project Gutenberg is, arguably, the oldest continuously operating online content project in the world. From 1971 until the mid-1990s, there were relatively few online resources for literary content. For this reason, and also due to a general willingness to experiment and reach out to broader audiences, Project Gutenberg has a great variety in the content types offered.
Among the first 100 items, there are mathematical constants and a musical performance. Government publications, notably the 1990 US Census and the CIA World Factbook from 1990 onward, were also included. The next few hundred items include movies, photographs of ancient cave paintings, and the first non-English items (Virgil’s Aeneid, Cicero’s Orations, and Caesar’s Commentaries, all in Latin).
Hundreds of audio eBooks are in the collection. Many were automatically generated via text-to-speech software. There are also a number of readings/performances by human readers, including from Project Gutenberg’s partner, Librivox. Today, automated text-to-speech is accessible by most people with a computer or mobile phone, so there is less emphasis on that format. Human readings/performances continue to be of interest, especially when the performance, as well as the original Project Gutenberg source eBook, is granted to the public domain.
Non-English languages have some additional characteristics that were not well-suited for the plain text ASCII of Project Gutenberg’s early days. By the early 1990s, it was necessary to display accented characters, to accommodate languages such as French and Spanish. Later, languages such as Chinese would require entirely separate character sets.
OCR software may be poorly suited for several non-English languages, or may fail due to older styles of typesetting (the old German “Fraktur” is notorious in this regard).
Also, it is necessary to have proofreaders who are fluent in the language, to assure the eBook is enjoyable and reasonably free of errors. Despite these challenges, nearly 20% of the collection is in a language other than English, with 65 separate languages or dialects other than English. This emphasis on language diversity continues today, and is limited only by the willingness of volunteers to submit copyright clearances and prepare items for distribution.
Table 1: Language counts as of August 1, 2016, for 52615 eBooks.
# of eBooks  Language code  Language or dialect
      43095  en             English
       2711  fr             French
       1469  de             German
       1421  fi             Finnish
        739  nl             Dutch
        678  it             Italian
        540  pt             Portuguese
        504  es             Spanish
        427  zh             Chinese
        219  el             Greek
        128  sv             Swedish
        112  hu             Hungarian
        112  eo             Esperanto
        102  la             Latin
         66  da             Danish
         60  tl             Tagalog
         31  pl             Polish
         31  ca             Catalan
         22  ja             Japanese
         17  no             Norwegian
         11  cy             Welsh
         10  cs             Czech
          9  ru             Russian
          7  is             Icelandic
          7  fur            Friulian
          6  te             Telugu
          6  he             Hebrew
          6  enm            Middle English
          6  bg             Bulgarian
          4  sr             Serbian
          4  ang            Old English
          4  af             Afrikaans
          3  nai            North American Indian
          3  nah            Nahuatl
          3  ilo            Iloko
          3  ceb            Cebuano
          2  ro             Romanian
          2  nav            Navajo
          2  myn            Mayan Languages
          2  mi             Maori
          2  grc            Greek, Ancient
          2  gla            Gaelic, Scottish
          2  ga             Irish
          2  fy             Frisian
          2  arp            Arapaho
          1  yi             Yiddish
          1  sl             Slovenian
          1  sa             Sanskrit
          1  rmr            Calo
          1  oji            Ojibwa
          1  oc             Occitan
          1  nap            Napoletano-Calabrese
          1  lt             Lithuanian
          1  ko             Korean
          1  kld            Gamilaraay
          1  kha            Khasi
          1  iu             Inuktitut
          1  ia             Interlingua
          1  gl             Galician
          1  fa             Farsi
          1  et             Estonian
          1  csb            Kashubian
          1  br             Breton
          1  bgi            Giangan
          1  ar             Arabic
          1  ale            Aleut
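The "nearly 20%" figure can be checked against Table 1 with a short calculation. The sketch below uses only the total from the table caption and the English count from the table. It assumes each eBook is counted under exactly one language, which the table does not state, and the function name is illustrative.

```python
TOTAL_EBOOKS = 52615    # from the caption of Table 1
ENGLISH_EBOOKS = 43095  # "en" row of Table 1


def non_english_share(total: int, english: int) -> float:
    """Return the fraction of the collection that is not in English."""
    # Assumption: every eBook is counted under exactly one language.
    return (total - english) / total


if __name__ == "__main__":
    share = non_english_share(TOTAL_EBOOKS, ENGLISH_EBOOKS)
    print(f"{share:.1%} of the collection is in a language other than English")
```

Under that assumption, 9,520 of the 52,615 eBooks are non-English, or roughly 18%.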
Plain text was the first master source type/format for Project Gutenberg, and remains important today. Plain text is readable on any device. Plain text is printable, and efficient to store (including for compression, or sharing by email). For decades, international standards have defined computerized encodings for the basic characters of the American Standard Code for Information Interchange (ASCII), and for extensions with accents and other special characters (Latin1 or ISO 8859-1). Encodings exist for other languages, and Unicode (with 8- and 16-bit encoding forms, such as UTF-8 and UTF-16) provides encoding for much larger groups of characters.
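The practical difference between these encodings can be seen by encoding the same text in each of them. The following Python sketch uses the standard codec names `ascii`, `latin-1`, `utf-8` and `utf-16-le`; the function name and the sample strings are illustrative choices for this example.

```python
from typing import Dict, Optional


def encoded_sizes(text: str) -> Dict[str, Optional[int]]:
    """Return the byte length of `text` in several encodings.

    A value of None means the encoding cannot represent the text.
    """
    sizes: Dict[str, Optional[int]] = {}
    for name in ("ascii", "latin-1", "utf-8", "utf-16-le"):
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            sizes[name] = None
    return sizes


if __name__ == "__main__":
    print(encoded_sizes("Dor\u00e9"))       # an accented Latin letter
    print(encoded_sizes("\u4e2d\u6587"))    # two Chinese characters
```

An accented name fits in Latin1 but not in ASCII, while Chinese text needs one of the Unicode encodings.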
Within the first few hundred Project Gutenberg eBooks, some encoding was offered which seemed promising, but did not withstand the test of time. An early PostScript file was rendered unusable due to insertion of the Project Gutenberg standard header; a dictionary included markup that, today, might be reminiscent of XML or ReStructured Text, but without any sort of codebook for proper presentation; a few word processor native formats, including WordStar and WordPerfect, were used but are no longer readable with modern computers.
Even HTML (and related markup, such as XML variants) was viewed with skepticism, since the longevity of formats is notoriously difficult to predict when they first become available.
For these reasons, Project Gutenberg still prefers to make plain text available for essentially every eBook. The only exceptions are those for which no plain text encoding is reasonable — such as Chinese, or mathematical texts, or music. In this way, the collection is “future proof,” so that even if all content cannot be fully represented as text, the files themselves will still be readable and enjoyable to read.
Figure 4: Typical text view, showing fixed-length lines and spacing among components.
"A CONNECTICUT YANKEE IN KING ARTHUR'S COURT"
by

MARK TWAIN
(Samuel L. Clemens)

PREFACE

The ungentle laws and customs touched upon in this tale are
historical, and the episodes which are used to illustrate them are
also historical. It is not pretended that these laws and customs
existed in England in the sixth century; no, it is only pretended
that inasmuch as they existed in the English and other civilizations
of far later times, it is safe to consider that it is no libel upon
the sixth century to suppose them to have been in practice in that
day also. One is quite justified in inferring that whatever one of
these laws or customs was lacking in that remote time, its place was
competently filled by a worse one.
Today, Project Gutenberg’s plain text offerings are most often derived automatically from another master format. The most common master format is HTML, which offers advantages of ubiquity and ease of authoring. LaTeX is also used as a master, mainly for mathematical texts. ReStructured Text (RST) was encouraged by Project Gutenberg, due to the ease of conversion to other formats. However, RST has not been widely adopted by eBook producers.
The ubiquity of reading devices — from mobile phones, to tablets, to electronic paper — was predicted by Project Gutenberg. Rather than creating separate master files for each native format for the devices, automatic conversion is applied to one of the master formats. For years, Java-format eBooks were automatically created, and these were usable on many mobile phones.
Today, EPUB and MOBI (also known as Kindle) formats are the most common. Free conversion software called ebookmaker (previously called epubmaker) is used to create derivative formats. This helps to assure compatibility with different reader devices.
Volunteers upload the master format for their completed eBook to the Project Gutenberg server, where it undergoes automated and manual checks before the new eBook is posted and announced online. Prior to the upload, the copyright clearance must be completed.
Upon uploading, automated checks include:
The conversion check consists of using the ebookmaker application (previously called epubmaker) to automatically generate derived formats. Ideally, resulting files will include:
For HTML, EPUB and MOBI, pairs of files are generated: one with images, and one without. The set of files without images is intended to be friendlier to readers with limited bandwidth, or without the necessary storage space for any images included with the eBook.
After uploading, a team of human experts — known as the “whitewashers,” after a scene in Mark Twain’s “The Adventures of Tom Sawyer” — does final formatting, attaches the Project Gutenberg header and footer, and uploads the new item to the server at www.gutenberg.org.
The Project Gutenberg catalog database includes metadata from within each eBook: the author, title, available file formats, upload/publication date, language, etc. Human catalogers eventually add additional metadata, including Library of Congress Subject Headings. This catalog is available for free download in machine readable form (XML/RDF or MARC).
Organizations that desire to redistribute Project Gutenberg’s content, freely and without limitations, are invited to do so. The catalog may be used for this purpose, and various mechanisms are available to automatically maintain a copy of the collection itself (i.e., “mirroring”), including for generated content.
An important innovation during the evolution of Project Gutenberg was to clarify the notion of “authorship” and its critical role for establishing copyright. In early days, it was common to think that applying HTML markup, or reformatting, or spelling changes, qualified an item for a new copyright. Historically, some print publishers even claimed new copyrights simply for typesetting a new edition.
Today, we know US copyright is based on the creative expression of ideas through authorship. Markup and spelling changes do not qualify. As a result, Project Gutenberg volunteers are able to “harvest” public domain materials on the Internet, once they are determined to match public domain print materials. This is not a frequent occurrence, however, since most volunteers prefer to work on items that are not yet digitized.
Similarly, Project Gutenberg claims no copyright on the “sweat of the brow” labor which is applied to make eBooks from print sources. There were a few earlier items where such copyright was claimed erroneously, but this is no longer done.
Project Gutenberg has over 50,000 eBooks in its collection. This is far fewer than Google Books, or The Internet Archive, or other large-scale digitization projects of historical items. An important distinction is that Project Gutenberg engages in the proofreading, formatting, markup/encoding, and other activities described above. Those other very large projects are primarily devoted to scanning, and then provide raw OCR output with a few automatically generated formats.
Such items are only partial eBooks — really, they are pictures (scans) of books, with some additional automated features. These are valuable, but do not provide the reading experience or quality of presentation that Project Gutenberg strives for. Using current technology, it takes human intellect and effort to convert a picture of a book to a true, functional, eBook.
Project Gutenberg has evolved its practices over the years, and has often been a leader in the creation and distribution of eBooks. Some past innovations include the following, and all are still in active use today:
Project Gutenberg has ongoing initiatives to improve service offerings to readers. There are no definite timelines for these, and assistance (or partnerships!) are always of interest. Some future initiatives may include:
Project Gutenberg is thankful to the tens of thousands of volunteers who, over more than 45 years, have contributed to the creation and distribution of free electronic books. It is through the efforts of these volunteers that Project Gutenberg has been successful, and continues to thrive.