Introduction to Corpus Building

Fileva, Evgeniia

doi:10.5281/zenodo.7892112

Evgeniia Fileva (Trier)

1.1 What is corpus building?

Not only are corpora and text collections a way of preserving literary texts for the long-term transmission of literary heritage, but they are also an essential foundation for research in both Digital Humanities generally and Computational Literary Studies in particular. The number of available corpora is large, and increases every year. This is primarily due to the fact that most research projects have specific requirements with respect to the corpus of texts being used. As a consequence, many corpora are created for specific research purposes and are therefore as diverse as the projects they support.

Corpus building is used here as a cover term for a range of activities related to the creation of both linguistic corpora specifically – i.e. systematically composed and linguistically-annotated sets of machine-readable texts (McEnery and Hardie 2012) – and text collections more generally – i.e. small or large collections of literary texts or historical documents with various degrees of rigour in their composition and a broad range of possible annotations. These activities first of all include conceptual work, such as deciding on the scope and size of a corpus, designing the criteria for the composition of the corpus, designing the text encoding scheme and modeling the information captured by the metadata. But they also include the practical implementation of these design decisions, such as identifying texts for inclusion in the corpus, digitizing the texts (if they are not available in digital form), applying the encoding scheme to the texts, collecting the required metadata and making the corpus available to others.

All in all, our analysis of the research literature in CLS covered here contains relatively few studies specifically devoted to the issue of corpus building. It appears that knowledge on this topic is mostly derived from adjacent, relevant fields, such as Corpus Linguistics, text encoding and scholarly digital editing, from which many lessons are probably taken. A limited number of studies, however, do exist also within CLS proper that describe and reflect on the corpus building process in particular cases and for particular purposes. These are presented in other chapters of this survey that treat corpus building for specific areas of investigation.

Some publications from within CLS also reflect in a more general or theoretical manner on what corpora are, how they can be designed and what kind of role they play in the research-based process. This category includes much work by Katherine Bode, concerned in particular with the biases baked into corpora due to the ways in which they have been created, often using material that has become available due to many historical and contextual factors rather than only scholarly concerns (e.g. Bode 2020). A recent addition to this body of work is Michael Gavin’s book on Literary Mathematics (Gavin 2023), in which the author underlines the importance of the construction of the corpus as a place where words (in the texts themselves) and contexts (through the metadata describing each individual text) come together and support multiple avenues of mathematical analysis. A more modest summary of best practices in the design of datasets, including corpora for research in CLS, has been proposed by Christof Schöch (Schöch 2017).

1.2 What are the major issues in corpus building?

1.2.1 Corpus size

The size of a corpus can vary considerably. In addition, the size of a corpus may be calculated in various ways, notably as the number of texts it contains (irrespective of the texts’ length), or as the number of tokens (words) or types (distinct words) it contains. In terms of the number of texts, corpora can range from a relatively small size, as in the European Literary Text Collection (ELTeC), where each separate corpus contains 100 novels, to millions of books, as in the case of the HathiTrust Collection, made up of over 17 million items (Schöch et al. 2021; Organisciak et al. 2014). Keep in mind, however, that novels are relatively long texts in terms of the number of tokens, so that virtually all ELTeC corpora of just 100 novels contain more than 5 million tokens each, whereas the Diachronic Spanish Sonnet Corpus (DISCO), with more than 4,300 texts, is made up of slightly more than 400,000 tokens (Ruiz Fabo, Martínez Cantón, and Calvo Tello 2018).
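The three measures of corpus size mentioned above (texts, tokens, types) can be made concrete with a minimal Python sketch. The tokenization rule (lowercased runs of word characters) and the two miniature texts are assumptions made purely for illustration; real projects usually rely on a language-specific tokenizer, and the resulting counts depend on that choice.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately naive rule)."""
    return re.findall(r"\w+", text.lower())


def corpus_size(texts):
    """Return the number of texts, tokens and types for a dict {identifier: text}."""
    counts = Counter()
    for text in texts.values():
        counts.update(tokenize(text))
    return {
        "texts": len(texts),            # number of documents
        "tokens": sum(counts.values()),  # running words
        "types": len(counts),            # distinct word forms
    }


# Two invented miniature "texts", for illustration only.
toy_corpus = {
    "text_a": "The cat sat on the mat.",
    "text_b": "The dog sat.",
}
sizes = corpus_size(toy_corpus)
print(sizes)
```

The same corpus thus has three different sizes depending on the unit counted, which is why comparisons between corpora should state the unit explicitly.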

The intended corpus size is generally determined by the goal of the researchers building the corpus. If the goal is to show the linguistic and/or literary diversity of a language or literary tradition, e.g. to preserve literary heritage in a digital format, or to provide a background for further research, a corpus can consist of a large number of texts. However, a small, representative corpus can also be useful for DH and CLS research. In addition, technical limitations cannot be excluded: they do not always allow working with a large amount of data, in which case scholars are limited to a smaller number of texts. Overall, corpus size depends on the goals of a study, and larger corpora may be necessary to capture rarer or more diverse phenomena or to discover and test for trends over time (Schöch 2017).

1.2.2 Markup, annotation, format

Markup is a way to add information to the corpus in a machine-readable format. Markup can be divided into two types: document-level markup and annotations of parts of the texts (Reppen 2022). Depending on the community of origin, corpora may be encoded in one of a number of widespread formats. The simplest way to provide the texts in a corpus is the plain text format. However, in this case, neither information on textual structure nor token-based linguistic annotation can be provided. In addition, metadata can in this case only be transmitted in a separate file or via the filenames. If filenames are used for this purpose, the names of units in the corpus should be standardized and reduced to the same format, containing, for example, the name of the book, author, language, etc. A more suitable method is to use filenames as identifiers, but store metadata in a separate, tabular file for the corpus as a whole using the identifiers to link files and relevant metadata entries.
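The last option described above, filenames as identifiers plus a separate metadata table, can be sketched in Python as follows. The column names, the identifiers (`nov001`, `nov002`) and the table contents are hypothetical; the table is embedded as a string only to keep the example self-contained, whereas in practice it would be a file stored alongside the texts.

```python
import csv
import io
from pathlib import Path

# Hypothetical metadata table; in practice this would be a separate CSV file.
METADATA_CSV = """id,author,title,year,language
nov001,Author A,First Novel,1881,spa
nov002,Author B,Second Novel,1902,spa
"""


def load_metadata(csv_text):
    """Read the table into a dict keyed by the text identifier."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["id"]: row for row in reader}


def metadata_for(filename, metadata):
    """Look up the metadata entry for a corpus file via its filename stem."""
    identifier = Path(filename).stem  # "corpus/nov002.txt" -> "nov002"
    return metadata.get(identifier)   # None if the identifier is not in the table


metadata = load_metadata(METADATA_CSV)
entry = metadata_for("corpus/nov002.txt", metadata)
print(entry["author"], entry["year"])
```

Keeping the filename free of descriptive information means that a correction to the metadata only touches the table, not the names of the text files.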

In DH and CLS, the accepted standard for textual encoding is the Guidelines of the Text Encoding Initiative (TEI). This XML-based format allows for the inclusion of detailed metadata, the encoding of textual structure and token-based linguistic annotation (see Burnard 2014). Using a formal language like XML to represent analytic annotation in a text has a significant advantage: automatic validation. This means that the annotation used in a document can be checked to see whether it conforms to a previously defined model of which types of annotation are permitted and in which contexts (Burnard 2014). Examples of recent literary corpora that use XML-TEI include the Deutsches Textarchiv, the Diachronic Spanish Sonnet Corpus, the European Literary Text Collection, DraCor, TextGrid’s Digitale Bibliothek, Théâtre classique and many more.
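The idea of validation against a model can be illustrated with a toy sketch. The model below is an invented, drastically simplified set of rules stating which element may appear inside which other element; it is not the TEI schema, and it ignores namespaces and attributes. Actual TEI projects typically validate against a schema written in a schema language such as RELAX NG, using a dedicated validator rather than hand-written code.

```python
import xml.etree.ElementTree as ET

# Toy model (an assumption for illustration, NOT the TEI schema):
# for each element, the set of child elements it may contain.
ALLOWED_CHILDREN = {
    "text": {"body"},
    "body": {"div"},
    "div": {"head", "p"},
    "head": set(),
    "p": {"name"},
    "name": set(),
}


def validate(xml_string):
    """Return messages for elements that occur in a context the model forbids."""
    root = ET.fromstring(xml_string)
    if root.tag not in ALLOWED_CHILDREN:
        return [f"unknown root element <{root.tag}>"]
    errors = []
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in parent:
            if child.tag in ALLOWED_CHILDREN[parent.tag]:
                stack.append(child)  # permitted here, so check its children too
            else:
                errors.append(f"<{child.tag}> is not allowed inside <{parent.tag}>")
    return errors


valid_doc = "<text><body><div><head>Chapter 1</head><p>It was <name>Ana</name>.</p></div></body></text>"
invalid_doc = "<text><body><p>Loose paragraph</p></body></text>"

print(validate(valid_doc))
print(validate(invalid_doc))
```

The second document is well-formed XML but violates the model, because a paragraph appears directly inside the body; this is the kind of error that plain text formats cannot detect automatically.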

1.2.3 Metadata

Metadata contains information about various aspects of the books included in the corpus, such as the author, publication and format. Metadata is usually contained in the XML-TEI markup of the document itself, as is the case in the Deutsches Textarchiv, CoNSSA and ELTeC corpora. Metadata, i.e. knowledge about the text, can also be stored as a CSV table. This type of metadata storage can be seen, for instance, in the Zeta project corpus: the table, published in the project’s GitHub repository, contains information about each novel, including genre and subgenre. A particular feature of Project Gutenberg’s editorial metadata is the absence of information on the publication date of the print source. Since the corpus consists of e-books, this information has been replaced by a release date. In addition, Project Gutenberg uses the XML/RDF format for metadata. This scheme has not always been used in the project, however; previously, metadata was stored in the MARC (Machine-Readable Cataloging) format. MARC is now used by HathiTrust, which stores content and bibliographic information. More information about the metadata requirements of HathiTrust and the process of data submission can be found on its website.

Corpora can provide different kinds of document-level information about texts (metadata). There are various approaches to classifying metadata types. The standard metadata types defined by NISO are descriptive, structural and administrative metadata (see National Information Standards Organization). Some studies, such as those by Burnard and Calvo Tello, differ slightly from the NISO typology. For example, Calvo Tello lists editorial metadata instead of structural metadata (Calvo Tello 2021), whereas Burnard includes all four types mentioned here (Burnard 2014). The ELTeC corpus also distinguishes four types of metadata, similar in structure to Burnard’s typology. Storing metadata in the markup according to the standardized TEI format, where the TEI Header provides a large number of metadata elements and attributes at the beginning of each file, is a very flexible and powerful solution.
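To illustrate how metadata stored in a TEI Header can be read programmatically, the following sketch extracts a few fields using Python's standard library. The sample document, its title, author and date are invented; the element paths follow the general structure of a TEI file description, but individual corpora place information such as the source date in different elements, so the paths are assumptions that would need to be adapted to a given corpus.

```python
import xml.etree.ElementTree as ET

# The TEI namespace, needed to address elements in a TEI document.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented minimal example, not taken from any real corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>An Invented Novel</title>
        <author>Invented Author</author>
      </titleStmt>
      <publicationStmt><p>Invented example, not a real edition.</p></publicationStmt>
      <sourceDesc><bibl><date>1890</date></bibl></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p>...</p></body></text>
</TEI>"""


def header_metadata(tei_string):
    """Extract a few document-level fields from the TEI Header."""
    root = ET.fromstring(tei_string)

    def first(path):
        # Return the stripped text of the first match, or None if absent or empty.
        element = root.find(path, TEI_NS)
        if element is None or not element.text:
            return None
        return element.text.strip()

    file_desc = "tei:teiHeader/tei:fileDesc/"
    return {
        "title": first(file_desc + "tei:titleStmt/tei:title"),
        "author": first(file_desc + "tei:titleStmt/tei:author"),
        "source_date": first(file_desc + "tei:sourceDesc/tei:bibl/tei:date"),
    }


record = header_metadata(SAMPLE)
print(record)
```

Running such a function over all files of a corpus yields a metadata table for the corpus as a whole, so that header-based and table-based metadata storage can be combined.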

1.2.4 Data accessibility and copyright

Corpora and corresponding corpus documentation (information about metadata, format, DOI and URI identifiers for each text, etc.) can then be deposited in appropriate places, such as libraries and archives, repositories like Zenodo or Figshare, or cross-project infrastructure initiatives (e.g. DARIAH or CLARIN). In this way, the data are published and archived, and can be used, verified or reproduced by other researchers for further study. It is important to make sure that the data meet standards (e.g. TEI encoding) and are technically available (Schöch 2017).

It is not always the case that the materials in a corpus are fully open access or in the public domain. If a corpus consists of closed-access materials, permission must be obtained for their use as well as for the use of the corpus itself. In such scenarios, the corpus cannot be made available to third parties. However, there are solutions and proposals available that support so-called non-consumptive research and can help address this issue, such as XSample (Andresen et al. 2022) or derived text formats (Organisciak et al. 2014; Schöch et al. 2020).
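The general idea behind derived formats can be hinted at with a small sketch: the text is cut into segments and only token frequencies per segment are kept, so that the original word order is no longer available while counts remain usable for quantitative analysis. The segment length, the tokenization rule and the sample sentence are assumptions for illustration; this is not the specific procedure of any of the cited proposals, and whether a given format is legally sufficient is a separate question that the code does not address.

```python
import re
from collections import Counter


def derived_counts(text, segment_length=5):
    """Split a text into fixed-size token segments and keep only per-segment counts."""
    tokens = re.findall(r"\w+", text.lower())
    segments = [
        tokens[i:i + segment_length]
        for i in range(0, len(tokens), segment_length)
    ]
    # Sorting the entries alphabetically discards the order of first occurrence.
    return [dict(sorted(Counter(segment).items())) for segment in segments]


# Invented sample sentence, for illustration only.
sample_text = "the rain fell and the river rose and the town slept"
segments = derived_counts(sample_text)
for segment in segments:
    print(segment)
```

Larger segments preserve less information about the original sequence of words, at the cost of a coarser view of where in the text a word occurs.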

1.3 Is representativeness possible?

For Computational Linguistics tasks, a corpus must be balanced and representative (Calvo Tello 2021). Collections of texts are created to provide access to texts and the ability to analyze them using machine learning algorithms. The creation of text corpora for analysis in linguistics and Digital Humanities has several aspects. Schöch mentions in particular the aspect of the “population”, which designates the totality of relevant items from which a corpus could be sampled or selected, given a specific scope of the corpus in terms of language, period, text type etc. (Schöch 2017). The data selected for inclusion in a corpus can be a random selection of cases from the population, if the population is known and all texts are accessible, in which case we can speak of a representative corpus. Alternatively, a corpus can contain a minimum number of cases for each possible combination of values of the criteria, and may in this case be called a balanced corpus. Finally, a corpus can include a selection not from the population, but from a readily available source of data, in which case one may speak of an opportunistic corpus (Schöch 2017; Calvo Tello 2021). Balanced and opportunistic selection are alternatives to representative sampling, because the latter is a very labor-intensive strategy, if it is realistic at all. Representative sampling allows for valid statements to be made about the population based on the sample and serves as a standard of comparison for other data collections. However, the population must be finite and known, which can be challenging and costly to achieve. Additionally, the selected texts must be available in digital or analog form, and difficult decisions must be made regarding whether to treat all relevant works equally or to weight them based on factors such as their distribution and reception (Schöch 2017).
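The difference between representative and balanced selection can be sketched in Python. The population of 60 texts, the two criteria (decade and genre) and the sample sizes are invented for illustration. Note that the balanced selection below only covers combinations of values that actually occur in the population, and that an opportunistic corpus would require no sampling step at all, since it simply takes what is readily available.

```python
import random
from collections import defaultdict

# Invented population: 60 texts described by decade and genre.
population = [
    {
        "id": f"t{i:03d}",
        "decade": 1880 + 10 * (i % 3),
        "genre": ["novel", "novella"][i % 2],
    }
    for i in range(60)
]


def representative_sample(population, size, seed=0):
    """Simple random sample: every text has the same chance of being selected."""
    rng = random.Random(seed)
    return rng.sample(population, size)


def balanced_sample(population, criteria, per_cell, seed=0):
    """Select the same number of texts for each observed combination of criteria."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for item in population:
        cells[tuple(item[c] for c in criteria)].append(item)
    sample = []
    for key in sorted(cells):
        members = cells[key]
        if len(members) < per_cell:
            raise ValueError(f"not enough texts for combination {key}")
        sample.extend(rng.sample(members, per_cell))
    return sample


random_selection = representative_sample(population, 12)
balanced_selection = balanced_sample(population, ["decade", "genre"], per_cell=2)
print(len(random_selection), len(balanced_selection))
```

In the random sample, the proportions of decades and genres approximate those of the population only on average, whereas the balanced sample fixes them by design, regardless of how the categories are distributed in the population.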

The representativeness of corpora is actively discussed in the research community. Randi Reppen points out that in some cases representativeness is possible, especially if all texts of a particular time period, author, or event can be collected, but this is rather the exception. More often than not, representativeness is governed by corpus size and material selection. Thus, smaller corpora are more often used to study grammatical phenomena and patterns, while larger corpora are needed for vocabulary to be adequately represented (Reppen 2022). Katherine Bode suggests that debates about corpus representativeness should focus on identifying ontological gaps and epistemological biases in evidence, and on adapting editorial theories and practices from textual studies to the digital context, rather than demanding a single, “correctly balanced sample”. The aim is to characterize the transformations that have produced the available evidence, rather than to construct a dataset in which all types of literature from all periods are equally represented (Bode 2020).

1.4 Research-driven or curation-driven corpus building?

In the literature we have surveyed, we encounter different approaches to the building of a corpus. Depending on whether a corpus is assembled for the study of a specific linguistic or literary phenomenon or whether the goal is to assemble a collection of texts, research-driven and curation-driven approaches can be distinguished.

There are many collections of texts that are actively used by researchers as a basis for their work. The list is particularly extensive for the English language: HathiTrust, the Brown Corpus and Project Gutenberg, for example, are widely used collections of texts in CLS. For the German language, notable examples of such corpora are the Deutsches Textarchiv and the Digital Library in the TextGrid repository, which provide a large amount of literary data. Research based on these and many other corpora is covered in the following chapters.

Sometimes creating a corpus for an individual task is more effective than using an already existing collection of texts. There are examples of corpora for which materials are selected according to language and genre, such as the Corpus of Novels of the Spanish Silver Age (CoNSSA) (Calvo Tello 2021), as well as the Diachronic Spanish Sonnet Corpus (DISCO) and the DISCO PAL corpus of poetic texts for Spanish (Ruiz Fabo, Martínez Cantón, and Calvo Tello 2018). Similarly, the corpus created as part of the Riddle of Literary Quality project covers texts published in the Netherlands, and is used in several studies of that project (Koolen 2018). Overall, such corpora are defined along many dimensions relevant to literary research, such as time period, genre or language. Some of them are also covered in the later chapters on corpus building in specific scenarios.

References

See works cited and further reading for this chapter on Zotero.

Citation suggestion

Evgeniia Fileva (2023): “Introduction to Corpus Building”. In: Survey of Methods in Computational Literary Studies (= D 3.2: Series of Five Short Survey Papers on Methodological Issues). Edited by Christof Schöch, Julia Dudar, Evgeniia Fileva. Trier: CLS INFRA. URL: https://methods.clsinfra.io/corpus-intro.html, DOI: 10.5281/zenodo.7892112.

License: Creative Commons Attribution 4.0 International (CC BY).

Andresen, Melanie, Markus Gärtner, Sibylle Hermann, Janina Jacke, Nora Ketschik, Felicitas Lea Kleinkopf, Jonas Kuhn, and Axel Pichler. 2022. “Vorzüge von Auszügen – Urheberrechtlich geschützte Texte in den digitalen Geisteswissenschaften (nach-)nutzen.” https://doi.org/10.17175/2022_007.

Bode, Katherine. 2020. “Why You Can’t Model Away Bias.” Modern Language Quarterly 81 (1): 95–124. https://doi.org/10.1215/00267929-7933102.

Burnard, Lou. 2014. What Is the Text Encoding Initiative? How to Add Intelligent Markup to Digital Resources. Encyclopédie Numérique. Marseille: OpenEditionPress.

Calvo Tello, José. 2021. The Novel in the Spanish Silver Age: A Digital Analysis of Genre Using Machine Learning. Bielefeld University Press. https://doi.org/10.1515/9783839459256.

Gavin, Michael. 2023. Literary Mathematics: Quantitative Theory for Textual Studies. Stanford Text Technologies. Stanford, California: Stanford University Press.

Koolen, Corina. 2018. “Reading Beyond the Female: The Relationship Between Perception of Author Gender and Literary Quality.” PhD thesis, University of Amsterdam.

McEnery, Tony, and Andrew Hardie. 2012. Corpus Linguistics: Method, Theory and Practice. Cambridge Textbooks in Linguistics. Cambridge; New York: Cambridge University Press.

Organisciak, Peter, Sayan Bhattacharyya, Loretta Auvil, J. Stephen Downie, and Beth Plale. 2014. “Large-Scale Text Analysis Through the HathiTrust Research Center.” In Digital Humanities Conference. https://dh-abstracts.library.virginia.edu/works/1979.

Reppen, Randi. 2022. “Building a Spoken Corpus: What Are the Basics?” In The Routledge Handbook of Corpus Linguistics, edited by Anne O’Keeffe and Michael McCarthy, 2nd ed. Routledge Handbooks in Applied Linguistics. Abingdon, Oxon; New York, NY: Routledge.

Ruiz Fabo, Pablo, Clara Martínez Cantón, and José Calvo Tello. 2018. “DISCO: Diachronic Spanish Sonnet Corpus.” In Digital Humanities Im Deutschsprachigen Raum. http://e-spacio.uned.es/fez/eserv/bibliuned:363-Pruiz2/Ruiz_Fabo_Pablo_DISCO_corpus.pdf.

Schöch, Christof. 2017. “Aufbau von Datensammlungen.” In Digital Humanities: Eine Einführung, edited by Fotis Jannidis, Hubertus Kohle, and Malte Rehbein, 223–32. Stuttgart: Metzler.

Schöch, Christof, Frédéric Döhl, Achim Rettinger, Evelyn Gius, Peer Trilcke, Peter Leinen, Fotis Jannidis, Maria Hinzmann, and Jörg Röpke. 2020. “Abgeleitete Textformate: Text und Data Mining mit urheberrechtlich geschützten Textbeständen.” https://doi.org/10.17175/2020_006.

Schöch, Christof, Tomaž Erjavec, Roxana Patraș, and Diana Santos. 2021. “Creating the European Literary Text Collection (ELTeC): Challenges and Perspectives.” Modern Languages Open 1 (25): 1–19. https://doi.org/10.3828/mlo.v0i0.364.