12  Annotation for Genre Analysis

Christof Schöch (Trier)

12.1 Introduction

Apart from document-level bibliographic metadata, which we describe in the chapter on “Corpus Building for Genre Analysis” (Chapter 11), annotation for genre analysis comprises, in the perspective of this survey, two major areas: (a) the manual identification or automatic extraction of document-level information relevant to genre from literary texts available as digital full text; and (b) the manual or automatic annotation of tokens or spans within the literary texts with information relevant to genre.

What kinds of information, however, are relevant to genres or subgenres? As we have seen in the chapter “What is Genre Analysis?” (Chapter 10), the answer to this question can be derived from the information that is relevant to the definition and/or description of literary genres and subgenres in literary practice, as established by literary studies. As it turns out, a great variety of aspects are relevant to the genre or subgenre of a text, among them form, style, theme, personnel, setting, mode of publication, and audience. Which features are best suited for genre analysis, or most relevant for specific genres, remains a matter of debate within CLS. In addition, however, computational approaches can (and frequently do) simply use the frequencies of word forms as features for approaching genre, in which case only minimal annotation, in the sense of tokenization, is required.

Finally, as noted also in the chapter on “Corpus Building for Genre” (Chapter 11), we find studies that annotate and analyse either corpora containing one specific genre, to be characterised more closely through analysis, or corpora containing several distinct genres or subgenres, intended for contrastive analysis or for developing methods able to distinguish between texts of different genres and classify them accordingly.

Annotation for genre analysis, then, in Computational Literary Studies, consists primarily in collecting relevant information (other than bibliographic metadata) about literary texts, whether at the document level or within texts, whether regarding form, style, theme, personnel or setting (and others), and making this information available for the analysis step described in “Data Analysis for Genre” (Chapter 13). In some cases, this step could also be described as feature engineering or feature generation, rather than annotation or tagging in the conventional sense.

12.2 Minimal annotation for genre analysis

In many cases, the analysis of genres or subgenres operates with simple word forms and a bag-of-words model. This appears to be particularly the case in studies that use methods such as regression, clustering or classification, and when large collections of texts are analyzed. In such cases, the identification of word forms (or lemmas) characteristic of a given genre or subgenre is either a goal in and of itself, or the features derived in this way serve as input for a subsequent clustering-, regression- or classification-based analysis.
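The minimal-annotation setting described above can be sketched in a few lines: a bag-of-words representation requires nothing beyond tokenization, after which each text becomes a vector of word-form frequencies over a shared vocabulary. The toy sentences below are invented purely for illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Tokenize on word characters and count word-form frequencies."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)

# Invented toy texts standing in for a comedy and a tragedy.
comedy = bag_of_words("Love, love and laughter fill the stage.")
tragedy = bag_of_words("Death and sorrow fill the stage.")

# A shared vocabulary turns both counts into comparable feature vectors,
# ready for clustering, regression or classification.
vocab = sorted(set(comedy) | set(tragedy))
vec_comedy = [comedy[w] for w in vocab]
vec_tragedy = [tragedy[w] for w in vocab]
```

Everything downstream, from distinctiveness measures to classifiers, can operate on such vectors without any further linguistic annotation.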

A classic of this kind of analysis is Kessler, Nunberg, and Schütze (1997), albeit written from a Computational Linguistics rather than a CLS perspective. This paper formulated the influential idea of distinguishing between generic facets (abstract properties of texts relevant to genre) and generic cues (surface features indicative of specific generic facets). The authors also argue that simple surface features are just as useful for (broad) genre classification tasks as more complex features representing structure or content.

Other examples of studies using virtually no annotation (other than bibliographic, document-level metadata) include Underwood et al. (2013), Schöch and Riddell (2014) and Worsham and Kalita (2018). Underwood et al. (2013) analyze a very large collection of texts, namely 469,000 volumes from the HathiTrust Digital Library, with the aim of generating page-level genre assignments. The authors used a statistical method, the Wilcoxon rank-sum test, to identify subsets of words from the overall vocabulary for use in various genre-oriented classification tasks. Worsham and Kalita (2018) analyze a subset of the Gutenberg corpus, which they call the Gutenberg Dataset for Genre Identification and which contains 3577 texts that can each be assigned to one of six types of fiction. They use a classification approach based on deep learning and deliberately perform no preprocessing other than tokenization and a reduction of the vocabulary to the 5000 most frequent words in the corpus overall.
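The idea behind the rank-sum test in this context can be sketched with standard-library code only: for a given word, compare its relative frequencies in two groups of pages and compute the normal-approximation z-statistic; a large absolute value suggests the word is distributed differently across the genres. The word “castle”, the frequency values, and the omission of tie handling are all simplifications of this sketch, not features of Underwood et al.’s actual implementation.

```python
import math

def rank_sum_z(x, y):
    """Wilcoxon rank-sum z-statistic (normal approximation).
    Assumes all values are distinct (no tie correction)."""
    pooled = sorted(x + y)
    rank_of = {v: r for r, v in enumerate(pooled, start=1)}
    w = sum(rank_of[v] for v in x)          # rank sum of the first group
    n, m = len(x), len(y)
    mean = n * (n + m + 1) / 2
    sd = math.sqrt(n * m * (n + m + 1) / 12)
    return (w - mean) / sd

# Hypothetical per-page relative frequencies of "castle".
gothic = [0.004, 0.006, 0.005, 0.007, 0.009]
biography = [0.000, 0.001, 0.0005, 0.002, 0.0015]

z = rank_sum_z(gothic, biography)  # large |z| -> genre-distinctive word
```

Ranking all vocabulary items by |z| and keeping the top of the list yields a genre-sensitive word subset of the kind used for the classification tasks.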

Schöch (2018) used a measure of distinctiveness first proposed by John Burrows, Zeta, as a way to derive words that are characteristic of several subgenres of French drama, namely comedy, tragedy and tragi-comedy. Other than tokenisation and lemmatisation, no preprocessing or linguistic annotation was used. The resulting lists of characteristic words for each subgenre, however, were used as a basis for a subsequent cluster analysis.
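The core of Zeta, as commonly implemented, is a simple document-proportion difference: each group of texts is split into segments, and a word’s Zeta score is the proportion of target-group segments containing the word minus the proportion of comparison-group segments containing it, yielding values between -1 and 1. The toy segments below (represented as sets of word types) are invented for illustration.

```python
def zeta(word, target_segments, comparison_segments):
    """Burrows' Zeta: difference in the proportion of segments
    containing a word between two groups (range -1 to 1)."""
    in_target = sum(word in seg for seg in target_segments) / len(target_segments)
    in_comp = sum(word in seg for seg in comparison_segments) / len(comparison_segments)
    return in_target - in_comp

# Invented toy segments: each set holds the word types of one segment.
comedies = [{"love", "wit"}, {"love", "jest"}, {"wit", "marriage"}]
tragedies = [{"death", "blood"}, {"death", "fate"}, {"grief", "love"}]

print(zeta("love", comedies, tragedies))   # positive: comedy-marker
print(zeta("death", comedies, tragedies))  # negative: tragedy-marker
```

Sorting the full vocabulary by Zeta score produces the lists of subgenre-characteristic words that can then feed a cluster analysis.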

Similarly, Du, Dudar, and Schöch (2022) use a variety of distinctiveness (or keyness) measures to obtain lists of words that can be understood as typical or characteristic of a number of subgenres of the contemporary French novel. In their pipeline, which also includes tokenization, lemmatisation and POS-tagging, the extracted information is not collected at the level of individual documents, but at the level of groups of novels defined by their subgenre.

12.3 Genre information at the document level

In recent years, an increasing number of studies have generated document-level annotations (in addition to bibliographic metadata) for genre analysis.

Because of the importance of the semantic level for genre analysis and genre distinctions, topic modeling is in fact one of the methods of choice for genre analysis. An early example of this is Jockers’s influential monograph (2013), in which he used topic modeling to investigate trends and distinctions in a large corpus of English-language novels from North America and Ireland.

Another example is Wilkens (2017), who analysed 8500 twentieth-century American novels in English for their subgenre assignments, using topics in combination with other features. Taking up the distinction between properties of genres and textual cues that make it possible to measure the prevalence of these properties in each novel, he first extracted the following features: “1. Subject matter, measured in the present case by topic-modeled word frequencies. 2. Style, form, and diction, measured by volume-level statistics including reading-level score, verb fraction, text length, etc. 3. Setting and location, assessed via geolocation extraction and geosimilarity measures. 4. A limited range of extra-textual features, including publication date and author gender” (2017, 6).

In a similar manner, Hettinger et al. (2015) have generated features for each novel in a corpus of 1700 German novels. These include common stylometric features (such as word frequencies), but also features concerning topics (derived using topic modeling) and character interaction (derived from character network data).1

Also operating at the level of entire documents, but not using topics as features, is the study by Coll Ardanuy and Sporleder (2014). The authors created social networks of characters in novels and then derived feature vectors from these networks to characterise the novels in terms of character-based structural properties. They then used these vectors for cluster analysis with a perspective on authorship and genre. Falk (2015) is another study generating character network features for genre analysis, in this case for a set of dramatic works. A series of studies on dramatic character networks has been published by the DLINA (Digital Literary Network Analysis) group, with Trilcke et al. (2016) in particular using network data to investigate, among other things, different kinds of German dramas, for example “open” and “closed” dramas.
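The general idea of turning a character network into a document-level feature vector can be sketched with standard-library code: build an undirected graph from co-occurrence edges and read off global measures such as network size, density and degree statistics. The character names, the edge list, and the particular feature set are invented for illustration; the studies cited above use richer networks and measures.

```python
from collections import defaultdict

def network_features(edges):
    """Derive simple document-level features from a character network
    given as a list of undirected (character, character) edges."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    n = len(adj)
    e = len({frozenset(edge) for edge in edges})  # deduplicated edges
    degrees = [len(neighbours) for neighbours in adj.values()]
    return {
        "n_characters": n,
        "density": 2 * e / (n * (n - 1)) if n > 1 else 0.0,
        "max_degree": max(degrees),
        "avg_degree": sum(degrees) / n,
    }

# Invented scene co-occurrence edges from a comedy.
edges = [("Alceste", "Philinte"), ("Alceste", "Célimène"),
         ("Philinte", "Éliante"), ("Célimène", "Éliante")]
print(network_features(edges))
```

One such vector per play or novel is exactly the kind of input a subsequent cluster analysis over a whole corpus would consume.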

Finally, a recent study by Nicholas D. Paige (2020) employs an unusual approach in which the analysis of digital full texts plays virtually no role; instead, a considerable number of properties of a large selection of French novels published between ca. 1600 and 1830 were established manually by the author. These properties include length, narrative perspective, generic subtitle, presence or absence of chapters and of inset narratives, subject matter, type of protagonists, and several more. Based on these properties and their patterns of evolution, correlation and co-evolution over time, the author then constructs a history of the French novel and its subgenres in the seventeenth and eighteenth centuries.

12.4 Annotating genre information within texts

Instead of generating various kinds of features describing entire novels, some studies investigate genre by adding genre-relevant annotations to words or spans within texts and analysing them.

An example of manual annotation of small spans within texts is Reinig and Rehbein (2019), in which the authors have manually annotated metaphorical expressions in a corpus of German expressionist poetry, that is, a corpus which is homogeneous with respect to genre. Also concerned with poetry, specifically Spanish-language sonnets, Navarro-Colorado (2017) presents a pipeline that adds annotations regarding the syllabic structure of words and, building on this, the metrical properties of individual verses.

Schöch et al. (2016) first automatically tagged the full texts of a corpus of French novels with token-level semantic (WordNet) and morphosyntactic annotations. A sample of sentences was then manually annotated with respect to character vs. narrator speech. Based on these annotations, a classifier was trained to perform sentence-level annotation of the entire corpus for the presence of character speech within the novels. The authors then use the information on the proportion of direct speech in each novel to investigate genre-based differences. In a similar manner, Brunner et al. (2020) have used direct/indirect speech annotations to investigate the distinction between highbrow and lowbrow literature.
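This workflow, training a classifier on a small manually annotated sample and then applying it to the rest of the corpus, can be illustrated in miniature. The sentences, labels and choice of a bag-of-words logistic regression are invented stand-ins; the actual study uses its own feature set and model.

```python
# A hypothetical miniature of sample-based sentence classification.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Manually annotated sample: character speech vs. narrator text.
labelled = [
    ('"Come here!" she cried.', "character"),
    ('"I will not go," he said.', "character"),
    ('"Leave me alone!" shouted the count.', "character"),
    ("The rain fell softly on the roofs of the town.", "narrator"),
    ("Years passed before anyone returned to the house.", "narrator"),
    ("The road wound slowly through the hills.", "narrator"),
]
texts, labels = zip(*labelled)

# Train on the annotated sample.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# Apply the trained model to unseen sentences from the rest of the corpus.
unseen = ['"Stop!" he said.', "The garden lay quiet in the sun."]
predicted = clf.predict(vectorizer.transform(unseen))
```

Aggregating such sentence-level predictions per novel yields the proportion of direct speech used as a document-level feature in the genre analysis.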

Kim, Padó, and Klinger (2017) have investigated genre in a subset of Project Gutenberg containing five subgenres of narrative fiction. They apply sentence-based emotion annotation (Sentiment Analysis) to all texts and use this annotation for a genre classification task. In addition, they investigate emotional trends over the course of novels, which form patterns that are characteristic, to some degree, of the different subgenres. (See also Jannidis et al. (2017) for a similar investigation focusing on happy endings.)
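Such emotional trajectories can be sketched by averaging per-sentence scores over a fixed number of equal-sized segments, turning novels of different lengths into comparable arcs. The valence scores below are invented, and the segmentation scheme (which drops a trailing remainder for simplicity) is this sketch’s own, not the cited study’s.

```python
def emotion_arc(sentence_scores, n_segments=5):
    """Average per-sentence emotion scores over equal-sized segments
    to obtain a coarse emotional trajectory of a novel.
    A trailing remainder of sentences is dropped for simplicity."""
    seg_len = max(1, len(sentence_scores) // n_segments)
    return [
        sum(sentence_scores[i:i + seg_len]) / seg_len
        for i in range(0, seg_len * n_segments, seg_len)
    ]

# Invented valence scores suggesting a "happy ending" plot shape.
scores = [-0.2, -0.4, -0.1, -0.5, -0.3, 0.0, 0.2, 0.4, 0.6, 0.8]
print(emotion_arc(scores, n_segments=2))  # falling half, rising half
```

Because every novel is reduced to the same number of values, the resulting arcs can be compared or averaged per subgenre.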

Simple kinds of automatic, token-level annotation are often combined with automatic document-level feature generation. Schöch (2017), for example, combines both: he used topic modeling to automatically obtain thematic information about each play in his corpus as a way to investigate differences between comedies, tragedies and tragicomedies. As part of the preprocessing for topic modeling, token-level annotation was performed, namely lemmatization, in order to work with lemmata instead of word forms, and POS-tagging, in order to filter the lemmata and retain only content words such as nouns, verbs, adjectives and adverbs. Both steps reduce the dimensionality of the dataset and increase the semantic coherence and interpretability of the resulting topics. Generally speaking, this is done more often for languages other than English, as many of them are more highly inflected, so that lemmatisation has a larger impact on the results than it does for English.
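The filtering step described above can be sketched as follows. The tagged triples are invented; in a real pipeline they would come from a tagger such as TreeTagger or spaCy, and the POS labels here follow the Universal Dependencies coarse tagset as an assumption.

```python
# Toy tagged output: (word form, lemma, coarse POS tag).
tagged = [
    ("Les", "le", "DET"),
    ("rois", "roi", "NOUN"),
    ("mouraient", "mourir", "VERB"),
    ("tristement", "tristement", "ADV"),
    (",", ",", "PUNCT"),
    ("hélas", "hélas", "INTJ"),
]

# Content-word classes kept as input for topic modeling.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}

# Replace word forms by lemmata and drop all non-content words.
content_lemmas = [lemma for _, lemma, pos in tagged if pos in CONTENT_POS]
print(content_lemmas)  # ['roi', 'mourir', 'tristement']
```

The topic model then sees only this reduced lemma sequence, which is what shrinks the vocabulary and sharpens the topics.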

One of the most elaborate examples of this strategy, combining bibliographic metadata, additional document-level annotations relevant to genre, and classification tasks based on most frequent words, is José Calvo Tello’s study of the Spanish novel (2021). Calvo Tello collected, primarily through close reading of the novels, a large number of document-level annotations regarding, for instance, the characters, themes, or settings of the novels in his corpus, for subsequent use in a wide range of classification tasks. He found, for example, that semantic features can be highly useful for subgenre classification of novels. In his case, these semantic features were obtained by annotating the tokens of the novels using a vocabulary derived from a semantically organized dictionary, the Diccionario María Moliner.

12.5 Conclusion

As can be seen from this survey as a whole, the domain of computational analysis of literary genre has emerged as a dynamic area of investigation within CLS over the last 10-15 years. In terms of feature engineering and annotation, a wide range of feature types is now used to cover the many abstract properties relevant to literary genres and subgenres. It is striking, at least in this survey, that a majority of studies appear to focus on novels when investigating literary subgenres, with fewer studies on drama and even fewer on poetry. In addition, while complex processes of feature generation and annotation are rather the exception in authorship attribution, the inverse is true for genre analysis. This may shift again, however, with the increasing use of deep learning for literary genre analysis, where active feature engineering as part of the preprocessing and annotation step may become obsolete, at the price of decreased transparency and interpretability of the results.

References

See works cited and further reading for this chapter on Zotero.

Citation suggestion

Christof Schöch (2023): “Annotation for Genre Analysis”. In: Survey of Methods in Computational Literary Studies (= D 3.2: Series of Five Short Survey Papers on Methodological Issues). Edited by Christof Schöch, Julia Dudar, Evegniia Fileva. Trier: CLS INFRA. URL: https://methods.clsinfra.io/annotation-genre.html, DOI: 10.5281/zenodo.7892112.



License: Creative Commons Attribution 4.0 International (CC BY).


  1. See also Hettinger et al. (2016) for another perspective on this data.↩︎