11  Corpus Building for Genre Analysis

Evgeniia Fileva (Trier)

11.1 Introduction

The key challenge in corpus building for genre analysis is the composition of the corpus in terms of the genres or subgenres targeted, on the one hand, and in terms of competing or potentially interfering categories such as authorship, period or formal characteristics, on the other. Often corpora used for research in Computational Linguistics or Digital Humanities (such as stylometry) are designed from the start to include only one genre or subgenre (e.g. newspaper articles, encyclopedias, theater plays, tragedies, etc.), although the actual level of homogeneity of such corpora can be a matter of debate. For such tasks, the primary genre-based classification of texts is important because texts belonging to the same genre (or class) are assumed to demonstrate similar linguistic features.

Often, however, classifying texts by genre, or in a more fine-grained manner by subgenre within a broader genre, particularly for the purpose of creating a corpus, presents a separate, specific problem. The main reasons for this are that genre categories are hard to define in a coherent, systematic way; that sensible categorizations of genres are very much bound to the respective literary tradition (language, period) they apply to; and that a single text may participate in multiple genres at the same time (see the chapter on “What is Genre Analysis?” (Chapter 10) for more information). As a consequence, Calvo Tello (2021), for instance, approached the problem of genre classification as a “multi-level classification task”. Overall, for genre-based research it is important that corpora contain specific information about genre stored in metadata.
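Genre information stored in metadata, as mentioned above, is typically encoded in the TEI header of each text. The following sketch shows how such keywords might be read programmatically; the placement of genre terms under `textClass/keywords` follows the TEI P5 guidelines, but the `scheme="genre"` attribute and the sample values are illustrative assumptions, not the schema of any particular corpus.

```python
# Minimal sketch: reading genre keywords from a TEI header.
# The element structure follows TEI P5; the exact location of genre
# information varies between corpora, so this layout is illustrative.
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <profileDesc>
      <textClass>
        <keywords scheme="genre">
          <term>novel</term>
          <term>crime fiction</term>
        </keywords>
      </textClass>
    </profileDesc>
  </teiHeader>
</TEI>"""

def genre_terms(tei_xml: str) -> list[str]:
    """Collect all genre keywords from the textClass section."""
    root = ET.fromstring(tei_xml)
    path = ".//tei:textClass/tei:keywords[@scheme='genre']/tei:term"
    return [t.text for t in root.findall(path, TEI_NS)]

print(genre_terms(sample))  # ['novel', 'crime fiction']
```

A corpus builder can run such a function over a whole collection to filter or group texts by genre before analysis.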

11.2 Corpora frequently used for genre analysis

11.2.1 Curation-driven, general-purpose corpora

Typically but not necessarily, curation-driven, general-purpose corpora are relatively large in volume and relatively heterogeneous with respect to the genres or subgenres represented in them. However, there are also curation-driven corpora for CLS that focus on one particular, larger genre, such as drama, narrative or poetry. In addition, it is typical for curation-driven corpora to be monolingual and for texts contained in them to be encoded in XML-TEI, although exceptions to both points certainly exist.

The Brown Corpus is among the oldest and most well-established corpora used for genre-specific tasks, particularly in Computational Linguistics. Conceptually and practically, the Brown Corpus is constructed very differently from, for example, the vast Hathi Trust Library: it consists of 500 samples of about 2,000 words each, with a total of 802 texts, all first published in 1961, representing a range of styles and varieties of prose (see Brown Corpus Manual and Francis and Kucera 1979). The corpus is not meant to represent standard English, but rather to provide a standardized body of data for comparative studies. For their study, Kessler, Nunberg, and Schutze (1997) used only 499 of the 802 texts from the Brown Corpus, selected using a custom classification system. The texts were analyzed based on three categorical facets: Brow (the level of intellectual background required of the target audience), Narrative (binary, whether a text is written in narrative mode), and Genre (reportage, editorial, scitech, legal, nonfiction, fiction). For the study, the corpus was divided into a training subcorpus (402 texts) and an evaluation subcorpus (97 texts), selected to have roughly equal numbers of all represented combinations of facet levels. The texts in the evaluation subcorpus were chosen using a pseudo-random number generator, resulting in different quantitative compositions of the training and evaluation sets, with some genre levels being more frequent in one set than the other (Kessler, Nunberg, and Schutze 1997). Each text in the evaluation set was then analyzed.
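The kind of facet-stratified split described above can be sketched as follows: group texts by their combination of facet values, then draw a fixed share of each group into the evaluation set with a seeded pseudo-random generator. The facet names mirror those in the study, but the proportions, field names, and sampling procedure here are illustrative assumptions, not the original authors' exact method.

```python
# Sketch of a facet-stratified train/evaluation split: texts sharing
# the same combination of facet values are split in equal proportion,
# using a seeded pseudo-random generator for reproducibility.
import random
from collections import defaultdict

def stratified_split(texts, eval_share=0.2, seed=42):
    """texts: list of dicts with 'id', 'brow', 'narrative', 'genre'."""
    groups = defaultdict(list)
    for t in texts:
        groups[(t["brow"], t["narrative"], t["genre"])].append(t)
    rng = random.Random(seed)
    train, evaluation = [], []
    for members in groups.values():
        rng.shuffle(members)
        k = round(len(members) * eval_share)
        evaluation.extend(members[:k])
        train.extend(members[k:])
    return train, evaluation

# Toy corpus of 100 texts spread over two facet combinations.
texts = [{"id": i, "brow": i % 2, "narrative": i % 2 == 0,
          "genre": ["reportage", "fiction"][i % 2]} for i in range(100)]
train, evaluation = stratified_split(texts)
print(len(train), len(evaluation))  # 80 20
```

Because the generator is seeded, the same split can be reproduced across experiments, which matters when training and evaluation compositions differ, as noted above.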

Closer to current concerns in CLS research, corpora such as the Deutsches Textarchiv (DTA), the Corpus of Novels of the Spanish Silver Age (CoNSSA) or Théâtre classique are also monolingual. The DTA covers the period from 1600 to 1900 and includes about 4,400 works in total, of which 1,500 belong to a balanced core corpus and about 700 are literary works (‘Belletristik’). As stated in the corpus description, the collection was created to reflect the diversity of the German language, and it is characterized by genre richness as well as scrupulous supervision during the digitization of the material. Furthermore, the corpus website offers a researcher-oriented navigation tool, the “linguistic search”, built on the DDC (Dialing/DWDS-Concordancer) linguistic search engine. For each text, DDC creates a machine-readable index file containing additional information for each word that can be used during queries. These index files are created from inputs in a DDC-specific XML format, which contains all the information available to DDC but cannot itself be queried efficiently. The developers suggest using the DTA as a reference corpus for linguistic research.

Project Gutenberg’s text collection contains texts in more than 60 languages, all open access and in the public domain. The project description indicates that in addition to books, it also includes units such as manuals, pamphlets, periodicals, travelogues, theses, journals, and chapbooks. One feature of this collection is that instead of genres it uses “categories”, that is, topics by which books are sorted. The official website gives the following description: “The collection includes eBooks on many topics. There is emphasis on literary works and reference items of historical significance, because volunteers have focused on digitizing such works. Any eligible item, on any topic, is welcome.” (Gutenberg).

The most widely used text collections for German are the Digitale Bibliothek in the TextGrid Repository and the Deutsches Textarchiv (DTA). The TextGrid Digital Library offers a comprehensive collection of XML/TEI-encoded texts from fiction and non-fiction literature from the beginning of book printing until the first decades of the 20th century, written in or translated into German. The collection is of particular interest to German and comparative literary studies as it contains almost all important canonical texts and numerous other texts of literary-historical relevance whose copyright protection period has expired. The texts are mostly from reliable editions and are therefore citeable. The TextGrid Repository makes these texts available to the general public not only for reading but also for further processing, such as in editions and corpora. The XML files were converted into a valid TEI format, which allows for precise searching and analysis. The metadata only contain some very broad genre-specific information such as verse, prose, and drama. Both corpora have, for example, been used in the research of Trilcke, Fischer and colleagues (Trilcke, Fischer, and Göbel 2016; Trilcke, Fischer, and Kampkaspar 2015), among other corpora such as Wikisource and Projekt Gutenberg-DE. The authors describe the DTA corpus as having high-quality TEI markup, but containing relatively few texts. The German-language branch of Wikisource also offers only a limited number of texts. The Projekt Gutenberg-DE archive has poor markup with only basic XHTML. The TextGrid Repository, which contains basic TEI markup, is the most applicable option in their view (Trilcke, Fischer, and Kampkaspar 2015).

The European Literary Text Collection (ELTeC) belongs to the group of curation-driven corpora that have a clear focus on one particular literary genre, in this case the novel. ELTeC consists of a number of sets of novels in different European languages, where each set contains 100 novels first published in the period 1840 to 1920 in one given language-based literary tradition. ELTeC is not a representative corpus, but uses a number of corpus composition criteria to ensure that the variety of production in each literary tradition is reflected in each collection. These criteria concern aspects such as author gender, text length, degree of canonicity, and publication time (Schöch et al. 2021). However, ELTeC provides virtually no metadata on the subgenres of the novel represented in each collection, because of the challenges connected to establishing a subgenre taxonomy across multiple literary traditions. ELTeC is a multilingual collection (covering more than 17 European languages) consisting of 12 complete corpora of 100 individually selected novels each, with the corpora being comparable to each other in their internal structure, size, and composition. ELTeC also contains additional corpora of various sizes. All metadata (title, author, publication date, etc.) are modeled according to the same standard and encoded in XML-TEI, using a set of ELTeC-specific schemas (Burnard, Odebrecht, and Schöch 2021).
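Composition criteria like the ones ELTeC uses can be monitored on corpus metadata by counting texts per category and comparing the shares against targets. The sketch below illustrates this idea; the field names, values, and the tiny metadata table are illustrative assumptions, not ELTeC's actual encoding.

```python
# Sketch: checking the balance of a corpus against composition
# criteria (e.g. author gender, text length band, canonicity) by
# computing the share of texts per category value.
from collections import Counter

# Hypothetical per-text metadata records.
metadata = [
    {"author_gender": "F", "length_band": "short", "canonicity": "high"},
    {"author_gender": "M", "length_band": "long", "canonicity": "low"},
    {"author_gender": "F", "length_band": "medium", "canonicity": "low"},
    {"author_gender": "M", "length_band": "short", "canonicity": "high"},
]

def composition_report(records, field):
    """Share of texts per value of one composition criterion."""
    counts = Counter(r[field] for r in records)
    total = len(records)
    return {value: count / total for value, count in counts.items()}

print(composition_report(metadata, "author_gender"))  # {'F': 0.5, 'M': 0.5}
```

Running such a report per criterion and per collection makes imbalances visible early, before they can bias downstream analyses.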

Théâtre classique is a collection of French dramatic texts edited by Paul Fièvre since 2007, with (currently) 1,700 French plays, the majority of which were written or published between 1630 and 1800 (see Fièvre 2007). The platform provides contextual information and several statistical and analytical perspectives on the textual data. The XML-TEI data contains good document-level metadata and detailed structural markup typical of dramatic works (acts, scenes, stage directions, character speeches), while the TXT and HTML files lack much of this detail. The corpus has been described as “an essential enabling force for recent, quantitative approaches to French drama” (Schöch 2018) and is now available also via the DraCor platform.

11.2.2 Research-driven corpora built for specific purposes

Sometimes new corpora are created for specific tasks. One example of the creation of an original text collection can be seen in Calvo Tello (2021). The Corpus of Novels of the Spanish Silver Age (CoNSSA) was created on the basis of Spanish-language literary prose written between 1880 and 1939 by authors from Spain. The corpus includes novels by 107 authors who met all the criteria that Calvo Tello set for the corpus: novels in Spanish by Spanish authors published between 1880 and 1939. In addition to basic bibliographic information, Calvo Tello collected the following metadata about authors and their works: name (full and short), years of life, author’s preferred genre, number of pages dedicated to the author in the Manual de literatura española (MdLE), and provenance information. Calvo Tello states that categorizing genres and subgenres has been a particularly challenging task. The author proposes a method of encoding information of unprecedented detail about genre and subgenre that takes into account, for instance, the hierarchical levels and sources of information, including super-genre, genre, subtitle, and subgenre from different sources such as literary histories, editorial information, and annotations by Calvo Tello himself. The aim is to capture the complexity of genre and subgenre categorization and to reflect the blending of different subgenres in many texts. Another corpus designed specifically for the investigation of subgenres of the Spanish-language novel, but this time for novels from Argentina, Cuba and Mexico, is the Corpus de novelas hispanoamericanas del siglo XIX (conha19) edited by Ulrike Henny-Krahmer, who also curated a bibliography as a basis for the corpus building process (see Henny-Krahmer 2017) and describes the process of modeling, building and encoding the corpus in Henny-Krahmer (2023).

As Ruiz Fabo, Martínez Cantón, and Calvo Tello (2018) argue, there is a lack of research on poetic corpora, especially for Spanish. A good example of a poetry corpus widely used in the research community is the Diachronic Spanish Sonnet Corpus (DISCO). DISCO consists of 2,677 sonnets in Spanish from the 19th century written by 685 authors from Spain and Latin America. The corpus is intended to provide a wide sample, inspired by distant reading approaches, and is being updated with additional sonnets from other centuries. The poetic texts were extracted from the Biblioteca Virtual Miguel de Cervantes and encoded in XML-TEI P5 format. The metadata, stored in the TEI header, include year of birth and death, place of birth, and gender. The corpus is available on GitHub, archived in Zenodo, and includes VIAF identifiers to enhance the corpus’s findability in the linked open data (Ruiz Fabo, Martínez Cantón, and Calvo Tello 2018).

DISCO PAL was created on the basis of the DISCO corpus and is available for data mining tasks on Spanish poetry, particularly for obtaining the Global Affective Measure (GAM) of poetry (Barbado et al. 2022). While DISCO does not provide information directly usable for text modeling tasks, Barbado et al. (2022) presented DISCO PAL, the Diachronic Spanish Sonnet Corpus with Psychological and Affective Labels, a subset of the DISCO corpus. DISCO PAL includes 274 sonnets in Spanish from different time periods annotated with affective, lexico-semantic, and psychological labels, and aims to make poetry available as machine-readable data for linking, indexing, and extracting new information. While DISCO includes metadata only about authors, sonnet scansion, rhyme scheme and enjambment, the DISCO PAL corpus includes binary labels for psychological concepts and integer values for affective and lexico-semantic features, and thus provides a rich source of data for text mining tasks on Spanish poetry (Barbado et al. 2022).

Another example of a poetic corpus is the one presented by Šeļa, Plecháč, and Lassche (2022). Their research employs five poetry collections in five languages (Czech, Dutch, English, German, and Russian) to compare and analyze the metrical types used in poetry.

The EmoTales corpus (Francisco et al. 2012) is a specific corpus designed for narrative applications. The corpus focuses on fairy tales due to their explicit representation of emotions and suitability for the identification and study of emotions. It includes 18 tales of different lengths, written in English, with a total of 1,389 sentences and 16,816 words, chosen to cover a broad spectrum of styles by having tales from different authors and time periods. It contains in-context sentences with emotional tags based on subjective human evaluations.

Another example of a research-driven corpus is the one developed by the Zeta and Company project. This corpus consists of French novels (both written in French and translated into French) from the second half of the 20th century, currently with 1,200 novels in total. The corpus has several special properties. First of all, it is specifically designed for the investigation of several popular subgenres of the French novel, namely science fiction, crime fiction and sentimental novels, in comparison with a wide range of highbrow novels. For this reason, the corpus is balanced with respect to these four groups, within each decade (as far as feasible) as well as across the period 1950-2000. Second, the subgenre information is derived from the bibliographic metadata about publishers and collections in which the novels were published, under the assumption that specific collections (such as Harlequin’s ‘Duo Romance’ or Gallimard’s ‘Série noire’) publish only novels that can be categorized as sentimental novels or crime fiction, respectively. And third, because the novels are all still under copyright, their digitisation and usage in the research project is covered by the ‘Text and Data Mining Exception’ for research purposes in German copyright law. As the full texts cannot be published, the project makes available a subset of (currently) 320 texts as a so-called ‘derived text format’ that allows statistical calculations but makes reading the novels impossible (see Organisciak et al. 2014; Schöch et al. 2020).
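The core idea of a derived text format can be sketched in a few lines: reduce each novel to unordered token frequencies, so that statistical analysis remains possible while the running text can no longer be read or reconstructed. This is a deliberate simplification of the formats discussed in the literature cited above, not the project's actual pipeline.

```python
# Sketch of a 'derived text format': lowercased token frequencies.
# Word order is discarded, so the original text cannot be recovered,
# but frequency-based statistics remain computable.
import re
from collections import Counter

def derive(text: str) -> dict[str, int]:
    """Reduce a text to a bag of lowercased token counts."""
    tokens = re.findall(r"\w+", text.lower())
    return dict(Counter(tokens))

novel = "The detective watched the rain. The rain did not stop."
print(derive(novel))
```

Real derived formats (such as the extracted-features datasets of Organisciak et al.) typically keep more structure, for example per-page or per-chapter counts, but the legal rationale is the same: the derivative supports computation without republishing the copyrighted text.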

11.3 Conclusion

Thus, building a corpus for genre analysis in Computational Linguistics and Digital Humanities can be a quite challenging task. The composition of the corpus in terms of targeted genres, and the classification of texts by genre or subgenre, present specific challenges, since genre categories are hard to define and a text may belong to multiple genres. Curation-driven general-purpose corpora tend to be large and heterogeneous, whereas research-driven corpora focus on the selection of research material for a specific purpose.


See works cited and further readings for this chapter on Zotero.

Citation suggestion

Evgeniia Fileva (2023): “Corpus Building for Genre Analysis”. In: Survey of Methods in Computational Literary Studies (= D 3.2: Series of Five Short Survey Papers on Methodological Issues). Edited by Christof Schöch, Julia Dudar, Evgeniia Fileva. Trier: CLS INFRA. URL: https://methods.clsinfra.io/corpus-genre.html, DOI: 10.5281/zenodo.7892112.

License: Creative Commons Attribution 4.0 International (CC BY).