16  Corpus Building for Literary History

Evgeniia Fileva (Trier)

16.1 Introduction

Corpus building in the field of literary history is a complex and challenging process. It is characterized primarily by the collection and modeling of large amounts of literary data (both metadata and full text) covering several time periods. Scholars use these data to analyze language use, authorial style, genre change, and other literary, linguistic, historical and social phenomena. In addition, corpora in the context of literary history help preserve valuable artifacts such as books, manuscripts and letters. Collections of literary data provide an ordered, systematically organized body of material that scholars can study thoroughly, and over long time spans, using computational methods.

Indeed, the digital age has brought about a wide range of resources, standards and tools for the construction of corpora and for collecting and processing information. As Merja Kytö (2011) notes, the turning point came in the 1970s and 1980s, when it became possible to piece together large amounts of textual information.

16.2 Early and/or large diachronic corpora

One of the first historical literary corpus projects is the Thesaurus Linguae Graecae: A Digital Library of Greek Literature (TLG), a digital library of Greek literature from Homer to the fall of Byzantium in AD 1453 (more recently extended further into the modern era). It was founded in 1972 by Marianne McDonald and is currently housed at the University of California, Irvine. The TLG includes over 110 million words from more than 10,000 works by approximately 4,000 authors. It is designed to be a comprehensive resource for scholars, students, and anyone interested in Greek literature, language, and culture. Users reach the TLG’s database and search tools through a subscription service. The database includes a range of texts, from well-known works like Plato’s dialogues and the works of Aristotle to lesser-known texts like the poetry of Hesiod and the works of the Church Fathers. Other pioneering, large and important digital libraries include the Perseus Digital Library, founded in 1987 by Gregory Crane, with a focus on Greek (32 million words) and Latin (16 million words) texts. Created even earlier, in 1971, by Michael S. Hart, the Project Gutenberg digital library remains an important source of texts, notably in English. The Oxford Text Archive (OTA) is another pioneering, large-scale text repository for the Humanities, founded by Lou Burnard and Susan Hockey in 1976.

Collections of historical literary data for the English language are traditionally the most elaborate and extensive. Several large projects, such as HathiTrust, the Text Creation Partnership (TCP) and Project Gutenberg, preserve and digitize large collections of text documents. For example, HathiTrust has about 17 million volumes in its digital library. From a historical perspective, HathiTrust represents a significant achievement in the digitization of historical texts. By making millions of books and other materials available in digital form, HathiTrust has transformed the way scholars and researchers access and study historical texts. It has also provided new opportunities for the analysis and interpretation of historical data, offering new insights into the past and its relationship to the present. Metadata for HathiTrust are stored in MARC format and typically describe each book in the library: author information, contributor information, publication information, identification numbers (ISBN, ISSN), etc.

More than 150 libraries around the world are participating in the TCP project to create reliable XML-encoded ebooks. Such projects are very important to the scholarly community and provide the basis for many studies. The TCP originally encoded its texts in SGML, using a subset of the TEI P3 schema. The original SGML DTD was later replaced by an XML DTD for distributing and indexing the texts. The current standard for text encoding is TEI P5, but the TCP’s XML schema is closer to TEI P4. Conversion between the TCP markup and TEI P5 is possible, and stylesheets are available to generate P5-conformant versions of TCP texts. These P5 files are available for all TCP output, but in some cases they may lag behind the TCP XML files in terms of revisions.

In addition, it is worth mentioning the Helsinki Corpus of English Texts, a collection of English texts spanning from the earliest Old English period to the end of the Early Modern English period. It contains about 1.6 million words and is divided into three main periods (Old English, Middle English, and Early Modern English), each with subperiods of 100 or 70 years. The corpus covers various genres, regional varieties, and sociolinguistic variables, including gender, age, education, and social class. Overall, the Helsinki Corpus is a representative and comprehensive resource for studying the development of the English language over time. The corpus was later updated, while preserving all information from the original version, using XML markup, which ensures longevity and easy conversion to future formats. Valid, rules-compliant markup is crucial for automatic processing, as even slight deviations can cause software errors. The updated corpus follows the latest TEI guidelines, which required considerable rethinking of the original encoding. The XML version is a single file with a teiHeader giving general data and individual text headers giving bibliographic and descriptive metadata. The annotation model used in the original corpus, the COCOA format, could not be fully converted to XML using a single conversion script.
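The single-file layout described above (one corpus-level teiHeader plus one header per text) can be illustrated with a minimal sketch. The sample document, its titles, and the function name `corpus_titles` are invented for illustration and are not taken from the Helsinki Corpus; the sample is well-formed but deliberately incomplete, so it would not validate against the TEI schema.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; the "tei" prefix is a local choice for the queries below.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical toy corpus: one corpus-level header, two texts with own headers.
SAMPLE = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title>Toy diachronic corpus</title></titleStmt></fileDesc>
  </teiHeader>
  <TEI>
    <teiHeader>
      <fileDesc><titleStmt><title>Text A</title></titleStmt></fileDesc>
    </teiHeader>
    <text><body><p>First sample text.</p></body></text>
  </TEI>
  <TEI>
    <teiHeader>
      <fileDesc><titleStmt><title>Text B</title></titleStmt></fileDesc>
    </teiHeader>
    <text><body><p>Second sample text.</p></body></text>
  </TEI>
</teiCorpus>"""


def corpus_titles(xml_string):
    """Return the corpus-level title and the list of per-text titles."""
    root = ET.fromstring(xml_string)
    # Only the teiHeader that is a direct child of the root is corpus-level.
    corpus_title = root.findtext("tei:teiHeader//tei:title", namespaces=TEI_NS)
    # Each TEI child of the root carries its own header with text metadata.
    text_titles = [
        tei.findtext("tei:teiHeader//tei:title", namespaces=TEI_NS)
        for tei in root.findall("tei:TEI", TEI_NS)
    ]
    return corpus_title, text_titles


if __name__ == "__main__":
    print(corpus_titles(SAMPLE))
```

Separating the two header levels in this way keeps corpus-wide information (such as the project description) apart from the bibliographic metadata of individual texts.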

Other examples of relatively large, diachronic corpora of literary texts frequently used in CLS research include the TextGrid Digitale Bibliothek (containing a very large number of texts in German, including translations into German, and covering multiple centuries); the Deutsches Textarchiv (DTA), intended as a diachronic reference corpus of German literary and non-literary texts (see Geyken and Gloning 2015); and Théâtre classique, a platform providing a large number of French dramatic texts, primarily for the time period ca. 1620 to 1810 (see Fièvre 2007). More recent players include DraCor, an innovative platform providing corpora of dramatic texts in multiple languages in an interactive environment based on the idea of ‘programmable corpora’ (see Fischer et al. 2019), and the European Literary Text Collection (ELTeC), a collection of corpora of novels in multiple European languages covering the period 1840 to 1920 (see Burnard, Odebrecht, and Schöch 2021; Schöch et al. 2021).1

16.3 The Use of Diachronic Corpora in DH/CLS Research

The historical corpora named above are essentially examples of diachronic corpora: collections of texts that make it possible to trace the development and change of literature over time. Diachronic corpora allow researchers to discover and analyze trends in the literature of a particular period (see also Chapter 18).

Diachronic corpora and historical text collections underlie several of the studies we have found in our corpus of research literature. Some of these studies have created research-driven corpora, such as the one by José Calvo Tello, who built the Corpus of Novels of the Spanish Silver Age (CoNSSA) from texts obtained from a variety of sources, among them Wikisource, Project Gutenberg, Google Books, ePubLibre, and the Spanish National Library (BNE) (Calvo Tello 2021). The time period covered by CoNSSA, 1880–1936, is however relatively narrow for a diachronic corpus.

The study “Quantitative patterns of stylistic influence in the evolution of literature” by James M. Hughes and colleagues (2012) is based on Project Gutenberg’s text collection. It is an example of an examination of trends in authorial style over time. The researchers used a subset of works by authors from the Project Gutenberg database. Authors were selected for the subcorpus based on certain criteria (year of publication, availability of birth and death information about the author, at least five works in the Project Gutenberg collection), resulting in a final group of 537 authors. A representative feature vector was then created for each author by aggregating the frequencies of function words across each of their works, with a total of 7,733 works being analyzed.
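The idea of an author-level feature vector built from function-word frequencies can be sketched as follows. This is only an illustration under stated assumptions: the six-word list, the tokenizer, and the choice to average relative frequencies across works are placeholders, and are not the word list or the aggregation procedure of the study.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative function-word list, not the study's list.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "a"]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def work_vector(text):
    """Relative frequency of each function word within one work."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty texts
    return [counts[word] / total for word in FUNCTION_WORDS]


def author_vector(works):
    """Average the per-work vectors into one vector per author.

    Averaging is an assumption made for this sketch; other aggregation
    schemes (for example pooling all tokens first) are equally possible.
    """
    if not works:
        raise ValueError("an author needs at least one work")
    vectors = [work_vector(text) for text in works]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


if __name__ == "__main__":
    sample_works = ["the cat and the dog", "a bird in the tree"]
    print(dict(zip(FUNCTION_WORDS, author_vector(sample_works))))
```

Because function words are largely independent of topic, vectors of this kind are commonly used as a proxy for style; comparing them across authors from different periods is what makes a diachronic reading possible.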

Another interesting study was carried out by Trilcke, Fischer, and Göbel (2016) on the basis of a popular corpus in German. They analyzed 465 German-language dramas from 1730–1930 for semantic connections in social interactions. The authors cite the Digitale Bibliothek in the TextGrid repository as the source for the corpus. The metadata were collected manually and stored in DLINA, an XML format developed specifically for this study.

A study by Grace Muzny and colleagues (Muzny, Algee-Hewitt, and Jurafsky 2017) proposes a new metric for measuring dialogism, that is, direct speech, in novels across 230 years. The collection of texts analyzed includes 1,100 canonical English-language novels by 422 authors. It covers the period from 1782 to 2011, divided into three time periods (late 18th century, turn of the 19th and mid-20th centuries). In order to obtain a diverse corpus of dialogues, the authors developed new tools to extract dialogue from a large corpus of novels spanning three centuries, resulting in a new dialogue corpus with over 2 million instances of quoted speech. The corpus represents various genres and styles, and is crucial in understanding the abstract grammatical features that characterize spoken dialogue, which is essential for understanding literary style. Notably, the researchers encountered a problem with the recognition of direct speech in the texts, because the OCR did not always capture it correctly. To address this problem, they proposed the quote extraction system QuoteAnnotator.2
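To make the task of quote extraction concrete, the following is a deliberately naive sketch based on regular expressions. It is not QuoteAnnotator and does not reproduce its method; the pattern and the function name `extract_quotes` are assumptions made for illustration, and the sketch simply skips unbalanced quotation marks, which is exactly where noisy OCR causes trouble.

```python
import re

# Assumption: quoted speech is delimited either by straight double quotes
# or by a pair of curly double quotes; single quotes are ignored.
QUOTE_PATTERN = re.compile(r'"([^"]+)"|“([^”]+)”')


def extract_quotes(text):
    """Return the contents of all balanced double-quoted spans in the text."""
    return [
        match.group(1) or match.group(2)
        for match in QUOTE_PATTERN.finditer(text)
    ]


if __name__ == "__main__":
    sample = 'She said, "Come in." He replied, “Not yet.” Then silence.'
    for quote in extract_quotes(sample):
        print(quote)
```

A pattern of this kind fails as soon as an opening or closing mark is missing or misrecognized, which gives a sense of why a dedicated extraction system was needed for a corpus of this size.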

Sociological and cultural changes in literature are represented in the studies of Richard Jean So and Ted Underwood. Both authors explore gender in the context of literary change. So, Long, and Zhu (2018), for example, studied American novels from 1880–2000 and conducted a cultural analysis of racial and gender criticism. To do so, they analyzed some 10,000 books by 6,000 authors, manually annotated race and gender, and identified the 4,000 authors for whom these data were found. Based on this, the authors examine whether author style, language, and narrative depend on racial identification.

Ted Underwood has conducted a notable diachronic analysis of the representation of women in literature. This study covers English-language fiction from the late 18th century to the early 21st (Underwood, Bamman, and Lee 2018).3 Another study by Underwood addresses the literary and historical development of genres. In his article on “The Life Cycle of Genres” (2017), he discusses the problem of historical comparison in literary studies and the difficulty of reaching a consensus about the life cycles of novelistic genres. Underwood cites Franco Moretti’s research, which suggests that genres display a rather regular changing of the guard with a 25-year rhythm. Similarly, Matthew Jockers’ statistical model shows that genres framed on a 25- to 30-year scale are linguistically coherent phenomena in the 19th century. To investigate these questions, Underwood collected lists of titles assigned to a genre in 18 different sites of reception and gathered the corresponding texts from the Chicago Text Lab and the HathiTrust Digital Library. He then compared groups of texts associated with different sites of reception and segments of the timeline to determine how stable the different categories have been. The genres he examined include detective fiction, science fiction, and the Gothic. The metadata were collectively developed at the Stanford Literary Lab and contain genre tags, dates and source descriptions for a total of 962 texts.

Šeļa, Plecháč, and Lassche (2022) conducted a study that provides evidence of the association between poetic meter and semantics in 18th- and 19th-century European literatures. The study uses five metrically annotated poetry collections in different languages (Czech, Dutch, English, German and Russian) to compare and analyze the metrical types used in the poems. It focuses on iambic and trochaic metrical types, which are the most widespread ways to organize verse in European accentual-syllabic traditions. Data for the Czech collection are sourced from the Corpus of Czech Verse, the German data are from the Metricalizer corpus, and the Russian corpus is part of the Russian National Corpus. The English texts are taken from the Gutenberg English Poetry Corpus, and the early modern Dutch songs are from the Dutch Song Database compiled and hosted by the Meertens Institute in Amsterdam. The analysis concentrates on the Czech, German, and Russian corpora, as they cover a comparable cultural niche and time span, while the English and Dutch collections are used as secondary sources to show the general validity of the study’s claims for material with substantially different structures and origins.

16.4 Conclusion

To summarize, building a corpus for literary history is a challenging and multifaceted task that necessitates careful attention to several factors. The corpus should be a representative selection of texts from various genres, authors, and regions that accurately reflects the literary period being studied. Once created, a literary corpus can be used to explore a range of research questions, from the study of individual authors and texts to broader questions about literary movements, genres, and cultural trends. Through the use of computational methods, scholars can analyze large bodies of text in new and innovative ways, uncovering patterns and trends that might not be visible through traditional methods. Metadata are not always documented in studies based on diachronic corpora, but where they are presented, they usually focus on year of publication and author, since research based on diachronic corpora takes a historical perspective. A sufficiently large and balanced corpus is necessary to ensure the validity of any results and to avoid biases toward certain genres or periods.


See works cited and further reading for this chapter on Zotero.

Citation suggestion

Evgeniia Fileva (2023): “Corpus Building for Literary History”. In: Survey of Methods in Computational Literary Studies (= D 3.2: Series of Five Short Survey Papers on Methodological Issues). Edited by Christof Schöch, Julia Dudar, Evgeniia Fileva. Trier: CLS INFRA. URL: https://methods.clsinfra.io/corpus-lithist.html, DOI: 10.5281/zenodo.7892112.

License: Creative Commons Attribution 4.0 International (CC BY).

  1. For more information on some of these corpora, see also the section “Corpus Building for Genre”, Chapter 11.

  2. Challenges in digitization of historical texts are presented in the chapter “Annotation for Literary History” (Chapter 17).

  3. Readers can find out more about this study in the chapter on “Analysis for Gender” (Chapter 23).