2  Introduction to Preprocessing and Annotation

Andressa Gomide (Porto)

2.1 Introduction

A corpus is rarely a collection of raw texts grouped together. To derive meaning from corpora, we normally need to first (a) remove noise or unwanted text; (b) divide the corpus into segments (normally tokens and sentences); (c) add layers of annotation (or information) to the text; and (d) document and store information about the texts themselves (metadata). Ideally, these pieces of information should be structured in a way that allows them to be easily retrieved from the corpus.

The issue of annotation and preprocessing in Computational Literary Studies covers a number of aspects. The two key categories of annotation are document-level annotations (metadata) and token-level annotations (tagging). For the purposes of this survey, we deal with document-level annotation or metadata in the sections on Corpus Building, because metadata is an essential aspect of corpus design and corpus composition. Tagging can be performed at the level of individual tokens, but can also concern longer spans or regions of the text. For example, it is typical of text preparation in CLS research that text structure is annotated using markup such as XML-TEI. In addition to (and usually before) tagging texts, a certain number of preprocessing steps are typically performed, such as cleaning and normalization. Finally, preprocessing steps such as segmentation and filtering of tokens can be performed. These two latter manipulations lead into the analysis, where they are sometimes integrated and performed on-the-fly rather than as part of the data preparation.

2.2 Preprocessing

Annotation is best applied to a raw text, without extra information other than the text itself. However, some pre-processing to normalise the text can be desirable (Wolfe 2002; Dash 2021). In addition, if the source text already has structural annotations, for example in XML-TEI, a strategy needs to be found to preserve this information, or its loss has to be accepted. For the sake of clarity, in this text, we divide this pre-processing into two groups: cleaning and data preparation.

2.2.1 Cleaning

Cleaning refers to removing noise or unwanted information from the text. These undesirable elements can be, for example, page headings and footnotes, page numbers, line breaks, sensitive data, and boilerplate text (text fragments repeated without major changes). In some cases, the cleaning has to be done manually (e.g. searching for sensitive data). Very often, the cleaning can be done (semi-)automatically by identifying mark-up and/or using regular expressions.
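As an illustration of regular-expression-based cleaning, the following is a minimal sketch in Python. The function name `clean_page`, the sample text and the specific patterns are assumptions made for this example; real corpora usually need patterns adapted to their own layout, and the naive de-hyphenation shown here would also merge genuine hyphenated compounds that happen to break at a line end.

```python
import re

def clean_page(text: str) -> str:
    """Remove simple layout noise from one plain-text page (illustrative only)."""
    # Drop lines that consist only of a page number, e.g. "12" or "- 12 -".
    text = re.sub(r"(?m)^[ \t]*-?[ \t]*\d+[ \t]*-?[ \t]*$\n?", "", text)
    # Rejoin words hyphenated across a line break: "cor-\npus" -> "corpus".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace single line breaks inside a paragraph with a space,
    # leaving blank lines (paragraph separators) untouched.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of three or more line breaks into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

sample = "The cor-\npus was col-\nlected in 1900.\n\n- 12 -\n\nIt is small."
print(clean_page(sample))
```

Keeping each rule as a separate, commented substitution makes it easier to document which material was discarded during cleaning.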

Once the cleaning is completed, the discarded data can no longer be accessed. The cleaned text is the first version of the corpus and is normally kept as canonical raw data. It is also at this point that the information about each text (metadata) is documented; see chapter “General Issues in Corpus Building” (Chapter 1).

2.2.2 OCR and spelling

There are two additional areas where a correction of texts directly concerns problems of preprocessing and annotation: OCR errors and spelling variation. We discuss these areas as directly related to language variation in a corpus.

The consensus for dealing with OCR errors seems to be the following: if noise that alters texts is ‘moderate’ and uniform across a given corpus, it can be ignored. This can save a lot of resources by making perfectly proofread OCR texts unnecessary. Research that simulates uniform OCR errors shows that authorship attribution scenarios, for instance, can handle up to 20-30% of erroneous characters globally without losing too much classification strength (Eder 2013). This finding also holds for recognized handwriting, making noisy HTR output viable for attribution studies (Franzini et al. 2018). The role of heterogeneous noise, however, is poorly understood. It can be a significant ‘invisible’ problem when texts come from large digital libraries and collections (e.g. Google Books, Gutenberg, Gallica), since OCR quality may differ across images, book formats, typesetting and print quality. Even with comparable OCR frameworks, noise might not be uniform. In addition, in scenarios other than authorship attribution, noisy data is likely to be more of an issue.
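To make the notion of uniform noise concrete, the sketch below corrupts a fixed share of the letters in a text, chosen uniformly at random. It is not the procedure used in the studies cited above; the function name `add_uniform_noise`, the restriction to letter substitutions and the fixed random seed are assumptions made for illustration.

```python
import random
import string

def add_uniform_noise(text: str, error_rate: float, seed: int = 0) -> str:
    """Replace a share of the letters in `text` with different random letters."""
    rng = random.Random(seed)  # fixed seed so the corruption is reproducible
    chars = list(text)
    letter_positions = [i for i, c in enumerate(chars) if c.isalpha()]
    n_errors = round(error_rate * len(letter_positions))
    # Every letter position has the same chance of being corrupted.
    for i in rng.sample(letter_positions, n_errors):
        alternatives = [c for c in string.ascii_lowercase if c != chars[i].lower()]
        chars[i] = rng.choice(alternatives)
    return "".join(chars)

original = "the quick brown fox"
noisy = add_uniform_noise(original, error_rate=0.2)
print(noisy)
```

Heterogeneous noise could be approximated in the same way by using a different `error_rate` for each document, which is the situation described above as poorly understood.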

Spelling normalization becomes a larger problem for Western texts the further back we go from the 19th- and 20th-century sources, which largely had standardized grammar, spelling and typographic conventions. It is common to account for major orthographic variation (e.g. Umlauts in German and the long s in Latin-based orthographies (Rebora et al. 2018), or pre-1917 Cyrillic). In medieval manuscript sources, regional and individual variation in orthography increases dramatically. The problem is vividly illustrated in a study by Kestemont, Moens, and Deploige (2013), which goes to great lengths to normalize 12th-century Latin texts by lemmatizing, isolating clitics, splitting contractions and generating possible spellings of words to catch their out-of-dictionary occurrences. At the same time, spelling normalization becomes less relevant with modern fiction, where dialectal and regional variation is seen as a conscious authorial choice and is left untouched for the analysis (Gladwin, Lavin, and Look 2015).

2.3 Data Preparation

2.3.1 Segmentation: tokens and sentences

Tokenization and sentence segmentation are performed in the early stages of most text analyses. They are normally fast processes that use deterministic algorithms (e.g. Bird, Klein, and Loper 2009) to establish token and sentence boundaries (Grefenstette 1999).

Some examples of commonly used tokenizers are OpenNLP, spaCy and TreeTagger (Schmid 1994). Understanding (and customizing) the segmentation process is important when choosing a tokenizer, as tokenizers vary in how they handle multiword expressions and ambiguous separators, for example. Inexact tokenization can negatively affect later processes and applications that use the corpus. For instance, running a dependency parser on a badly tokenized sequence may yield errors beyond the span of the problematic token or sentence.
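The effect of ambiguous separators can be seen with a deliberately simple rule-based segmenter. The sketch below is not how the tools named above work internally; the pattern `TOKEN_PATTERN` and the functions `tokenize` and `split_sentences` are hypothetical and only serve to show where naive rules go wrong.

```python
import re

# A token is a word (optionally with internal hyphens or apostrophes)
# or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:[-'’]\w+)*|[^\w\s]")

def tokenize(text: str) -> list[str]:
    """Split a string into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

def split_sentences(text: str) -> list[str]:
    """Split after ., ! or ? when followed by whitespace and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

text = "Dr. Smith arrived. He didn't stay."
print(split_sentences(text))  # the abbreviation "Dr." is wrongly treated as a sentence
print(tokenize("He didn't stay."))
```

The full stop after the abbreviation is an ambiguous separator: the naive rule splits the first sentence in two, and any parser run on that output would inherit the error.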

2.3.2 Normalization

Words might have different spellings that should be preserved or normalized, depending on the research question, especially when dealing with historical texts. When variation of orthography is not relevant, a common step is to normalize the text by establishing a standard for words with alternative spellings.

A more sophisticated way of dealing with spelling normalization is to encode both the original and the normalized or modernized version, as is possible in XML-TEI, in order to retrieve the version one is interested in as needed in the analysis step. Again, this is an area where corpus design, preprocessing and analysis interact.
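In TEI, such parallel encoding is commonly expressed with a `choice` element containing an `orig` and a `reg` child. The following sketch, using only the Python standard library, shows how either version could be retrieved at analysis time; the sample fragment and the function name `extract_text` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hand-written sample fragment with original and regularized spellings.
fragment = (
    f'<p xmlns="{TEI_NS}">The <choice><orig>olde</orig><reg>old</reg></choice> '
    '<choice><orig>booke</orig><reg>book</reg></choice> is here.</p>'
)

def extract_text(elem, keep: str) -> str:
    """Return the text of `elem`, keeping only `orig` or only `reg` readings."""
    drop = "reg" if keep == "orig" else "orig"
    parts = [elem.text or ""]
    for child in elem:
        local_name = child.tag.split("}")[-1]  # strip the namespace prefix
        if local_name != drop:
            parts.append(extract_text(child, keep))
        parts.append(child.tail or "")  # text following the child element
    return "".join(parts)

root = ET.fromstring(fragment)
print(extract_text(root, "orig"))
print(extract_text(root, "reg"))
```

Because both readings stay in the source file, the choice between them can be deferred to the analysis step without altering the corpus.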

Another form of normalization, used especially with highly inflected languages, is to annotate each token with its lemma (base word form) or stem (word root) or even, in simpler scenarios, to replace the token with its lemma or stem. Although this is described here as a preprocessing step, it already requires a basic level of annotation.

2.3.3 Data filtering

Depending on the type of analysis that will follow, it might be necessary to remove unwanted tokens that affect the statistical analysis. This is normally done by using a stopword list to filter the texts. Stopwords, or empty words, are usually words that occur very frequently in a language, such as prepositions and articles. Punctuation marks might also be filtered out of a corpus, especially where spoken data is concerned. However, they are crucial for identifying boundaries and context, e.g. questions, exclamations and citations (Jurafsky and Martin 2023, 11). In addition, this kind of filtering can also be done based on token-level annotation (e.g. removing all but the nouns and verbs) or based on relative frequency or document frequency (e.g. removing all words present at least once in more than 90% of the documents). As mentioned above, such a filtering step is often performed ‘on-the-fly’ as part of the analysis pipeline.
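Both filtering strategies can be sketched in a few lines of Python. The stopword list, the toy documents and the function names `remove_stopwords` and `filter_by_document_frequency` are assumptions made for this example; in practice, a language-specific stopword list and a threshold suited to the corpus would be used.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "in"}  # toy list, for illustration only

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token found in the stopword list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]

def filter_by_document_frequency(docs, max_share=0.9):
    """Drop tokens that occur in more than `max_share` of the documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    limit = max_share * len(docs)
    return [[t for t in doc if doc_freq[t] <= limit] for doc in docs]

docs = [
    ["the", "whale", "sank", "the", "ship"],
    ["the", "ship", "left", "the", "port"],
    ["the", "captain", "left"],
]
print(remove_stopwords(docs[0]))
print(filter_by_document_frequency(docs))
```

In this toy corpus only "the" appears in more than 90% of the documents, so the document-frequency filter removes it without needing a predefined list.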

2.4 In-text annotation

Adding extra information to the texts makes analyses easier and more precise (see chapter “General Issues in Data Analysis”, Chapter 3). Annotation can add different layers of knowledge to a text: it can relate to grammar, meaning, orthography, etc. Leech (1993: 275) proposes seven maxims of text annotation, of which we emphasize the following three:

  • “It should always be easy to dispense with annotation and revert to the raw corpus.” (1)
  • “(…) the annotated corpus is made available to a research community on a caveat emptor [engl.: buyer beware] principle.” (5)
  • “(…) to avoid misapplication, annotation schemes should preferably be based as far as possible on ‘consensual’, and ‘theory-neutral analyses’ of the corpus data”. (6)

There are many categorization schemes for the different types of annotation (see, e.g., Dash 2021). Annotation can be organised in different layers, and various linguistic perspectives can be applied to units such as tokens, sentences, paragraphs, and chapters. Gardt’s schema for the analysis of textual semantics, as referred to by Calvo Tello (2021), includes three large groups of textual components: communicative-pragmatic frame, macrostructure, and microstructure. The communicative-pragmatic frame corresponds to the basic metadata and covers the information about the producer, the reader, the situation, and the medium. Data about genre, text type and related information are stored in the macrostructure component. The microstructure frame includes several linguistic layers such as layout, morphology, lexicology, phraseology, semantics, forms of argumentation, syntax, and punctuation (Calvo Tello 2021). The use of templates or schemas for corpus mark-up enables texts to have multiple annotations, allowing users to access the version that meets their needs (Reppen 2022). As Schöch (2017) points out, the most crucial aspect of any annotation is that its annotation scheme follows an established standard.

For the sake of clarity and to avoid theory bias, we follow the IMS Open Corpus Workbench (CWB) encoding manual (Evert 2022; Evert and Hardie 2011) and divide the annotation types into positional (token-level) and structural (region-level) attributes.

2.4.1 Token-level / positional

Token-level annotation is normally done automatically, attributing a value to each token in the corpus. The most common types of annotation of this kind are lemmas (the word in its uninflected form) and part-of-speech annotation (POS). Annotation of named entities (such as names of people, places, organizations or works) is also very common. Token-level annotation might also indicate richer morpho-syntactical information beyond simple POS-tags or the relationship among different tokens, as is the case when parsing a text (adding syntactic annotation).

Some common tools for token-level annotation are TreeTagger, spaCy, Stanza, Spark and Freeling. Among the most frequently cited standards are CLAWS, the Constituent Likelihood Automatic Word-tagging System (Garside and Rayson 1987); the EAGLES annotation guidelines (Ide and Suderman 2014); the Text Encoding Initiative (TEI, Sperberg-McQueen and Burnard 1994); and Universal Dependencies, notably using the CoNLL-U format.
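As an illustration of what token-level annotation looks like in practice, the sketch below reads a small sample in the CoNLL-U format, in which each token occupies one line with ten tab-separated fields. The sample sentence and its annotation values are hand-written for this example rather than produced by one of the tools above, and the function name `parse_conllu_sentence` is hypothetical.

```python
# The ten columns of a CoNLL-U token line, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

sample = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

def parse_conllu_sentence(block: str) -> list[dict]:
    """Turn the token lines of one sentence into a list of dictionaries."""
    tokens = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        tokens.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    return tokens

tokens = parse_conllu_sentence(sample)
print([(t["form"], t["lemma"], t["upos"]) for t in tokens])
```

Each dictionary carries the lemma, part of speech and syntactic head of one token, which corresponds to the positional attributes described above.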

2.4.2 Region-level

Region-level annotations are particularly useful to restrict the corpus queries to specific corpus regions (see chapter “General Issues in Data Analysis”, Chapter 3). Leech (1993) distinguishes between (a) representational and (b) interpretative information that can be added to a text. The first refers to the structure and the form of a text, such as sentence boundaries, pauses, words, and spellings. Interpretative information, by contrast, refers to the ‘hidden’ information in the text, and its annotation is normally done manually by experts.

Annotations targeting textual aspects relevant to CLS and often spanning multiple tokens, like figures of speech (comparisons, metaphors, etc.) or direct and indirect speech and thought representation, straddle this distinction, as they are usually annotated manually, often (but not necessarily) with the intention of providing training data for an automatic process based on machine learning. This kind of region-level annotation can also be specific to certain literary forms, like speeches and stage directions in drama, chapters and parts in novels, or verses and stanzas in poetry.

2.5 Format

The previous sections discussed the different ways of obtaining and adding information to a corpus before proceeding to the analysis. However, the way this information is stored varies.

Corpora are often prepared using XML. TEI is a frequently adopted de facto standard. However, there are simplified suggestions, such as Hardie (2014), which do not impose strict rules. Simplified versions of TEI might be a good choice for smaller projects, but departing from TEI does come with considerable downsides with respect to interoperability.

To make a corpus ready for query systems such as CWB and Manatee (Rychlý 2007), we often need to provide the corpus in a vertical format (vrt). This means that each token is placed on its own line and its respective tags are added in subsequent columns (Evert 2022). Other corpus query and analysis systems, however, are able to ingest annotated XML-TEI, for example TXM or TEITOK. The decision of how to format the corpus is normally made according to how the analysis will be performed. The chapter “General Issues in Data Analysis” (Chapter 3) deals with that.
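A minimal sketch of how annotated tokens could be serialised into such a vertical layout is given below: one token per line, tab-separated attribute columns, and XML-style tags on their own lines to mark regions. The function name `to_vrt`, the choice of columns (word, part of speech, lemma) and the region names `text` and `s` are assumptions for illustration; the sketch does not escape special characters such as `&` or `<`, which a real export would need to handle.

```python
def to_vrt(text_id: str, sentences: list) -> str:
    """Serialise annotated sentences as a vertical-format string.

    `sentences` is a list of sentences, each a list of
    (word, pos, lemma) tuples.
    """
    lines = [f'<text id="{text_id}">']  # region-level (structural) tag
    for sentence in sentences:
        lines.append("<s>")
        for word, pos, lemma in sentence:
            # One token per line; token-level attributes in tab-separated columns.
            lines.append("\t".join((word, pos, lemma)))
        lines.append("</s>")
    lines.append("</text>")
    return "\n".join(lines) + "\n"

annotated = [[("Dogs", "NNS", "dog"), ("bark", "VBP", "bark"), (".", ".", ".")]]
vrt = to_vrt("t1", annotated)
print(vrt)
```

The columns hold the positional attributes and the tags hold the structural attributes, mirroring the distinction drawn in the section on in-text annotation.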

2.6 Conclusion

As this section has hopefully shown, the preprocessing and annotation of texts is a complex topic with many different aspects. Depending on the type of analysis, different levels and degrees of sophistication of annotation may be required. In addition, preprocessing and annotation interact and overlap both with corpus building and data analysis. Finally, it can be observed that, in CLS research, the combination of structural markup (such as XML-TEI) with token-level and region-level linguistic annotation is increasingly desirable. While perfectly possible in principle, this combination raises challenges, in particular regarding the best sequence for the various annotation steps: most taggers do not handle existing XML markup gracefully, so that integrating structural XML markup with token-level annotations becomes quite challenging.


Citation suggestion

Andressa Gomide (2023): “Introduction to Preprocessing and Annotation”. In: Survey of Methods in Computational Literary Studies (= D 3.2: Series of Five Short Survey Papers on Methodological Issues). Edited by Christof Schöch, Julia Dudar, Evgeniia Fileva. Trier: CLS INFRA. URL: https://methods.clsinfra.io/annotation-intro.html, DOI: 10.5281/zenodo.7892112.

License: Creative Commons Attribution 4.0 International (CC BY).