Annotation for Literary History

Dudar, Julia; Fileva, Evgeniia

doi:10.5281/zenodo.7892112

Julia Dudar and Evgeniia Fileva (Trier)

17.1 Introduction

In the context of literary history, the annotation techniques and the items that need to be annotated depend on the research question and on the analysis methods used, such as clustering, keyness analysis, or classification. These methods are discussed in other chapters; see in particular “Data Analysis for Literary History” (Chapter 18). Historical corpora, however, raise specific challenges for researchers. Spelling conventions and digitization quality both vary considerably across the periods that a diachronic corpus covers, which can be problematic for any text analysis method.

17.2 Digitization of Historical Print Media: Challenges and Limitations

Historical languages and old print typefaces were not designed for digital formats, so the transfer from one medium to another is not without difficulties. OCR and manual transcription inevitably introduce errors and also entail interpretation (Piotrowski 2012).

In historical corpora, the language itself is historical. Historical languages often lack standard spelling and orthography, so it is hard to determine the norm, and the spelling of the same word can vary even within works by the same author. Handling spelling correctly is critical from the tokenization phase onward, so during NLP preprocessing scholars consult lexical resources (such as WordNet) and use information retrieval and statistical NLP methods (such as POS tagging and lemmatization) to address this problem (Piotrowski 2012).

The choice of digitization method is an important issue when preparing to scan a document. For handwritten texts, the method is chosen based on the condition of the document: problems such as a damaged medium, bleed-through, and fading particularly complicate the task (Piotrowski 2012). Automatic handwritten text recognition (HTR) is a popular NLP research task today, and recent advances using neural networks are impressive (see e.g. the Transkribus tool, frequently used in Digital Humanities, and a survey of its uses: Nockels et al. 2022). A tool gaining popularity in Digital Humanities for OCR of historical print is OCR4all (see Reul et al. 2019). However, in many contexts and depending on the materials, manual double keying by qualified personnel still often yields the most reliable results.

OCR likewise involves a number of factors to consider when scanning and when processing the result. At the outset, it is important to choose a suitable scanner (depending on the format and condition of the printed book or other material) and the right settings. The quality of the source itself also affects the result: historical documents are often damaged, faded, or blurred, which makes it difficult for OCR software to recognize characters accurately. They may also contain non-standard or archaic fonts that OCR software does not recognize, and the software often struggles with columns or lines of text that are broken or unevenly spaced. All of these lead to recognition errors.

OCR technology has made significant progress in recent years, but several limitations remain when it comes to historical texts, and researchers continue to explore new methods to improve OCR accuracy for historical documents. The open-source tool OCR4all (Reul et al. 2019) allows researchers who work with historical texts to make corrections during text recognition, or even to train their own OCR model on their corpora, provided that a gold standard is available. This is especially useful for historical texts that traditional OCR tools find difficult to recognize because of aged or damaged paper, varying font styles, or non-standard layouts. The tool enables scholars to improve OCR performance, to automate the digitization process, and to minimize post-processing effort.

In some cases, low OCR quality can be compensated for by using supplementary tools, manual correction, or more intricate annotation methods. For example, Muzny, Algee-Hewitt, and Jurafsky (2017) required a reliable dialogue extraction mechanism in order to measure dialogism and analyze changes in the use of direct speech in the novel over time. As the OCR quality of old texts was not sufficient for this purpose, the authors developed a quote extraction tool named QuoteAnnotator. The tool is based on simple rules that enable accurate extraction of dialogue sections from raw text and offers greater precision than regular expressions.

17.3 Normalization of Historical Texts

Spelling variations are a prevalent characteristic of historical corpora, as these texts typically have not undergone standardization. During the evolution of a language, lexical, grammatical, morphological, and syntactical changes naturally occur. Often these changes are well known to researchers and do not pose significant difficulties in the preprocessing of historical texts. Where the changes are extensive, the historical variety can be treated as a separate language in its own right; dedicated NLP tools for it either already exist or can be created with some additional effort. Where the changes are few, a simple rule-based approach can be developed instead. Some research questions require both language variants, the historical and the modern one. In such a scenario, language modernization can be accomplished by adding an additional annotation layer while retaining the original historical spelling at the primary level (Bollmann 2018).
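
The rule-based approach and the additional annotation layer described above can be sketched as follows. This is a minimal illustration, not a method taken from the cited works: the rewrite rules, the `Token` class, and the function names are hypothetical, and the example spellings (older German forms such as "seyn" and "thun") are chosen only for demonstration.

```python
import re
from dataclasses import dataclass

# Hypothetical, illustrative rewrite rules for older German spellings.
# A real rule set would be derived from the corpus and checked by an expert.
RULES = [
    (re.compile(r"ſ"), "s"),     # long s -> round s: "Waſſer" -> "Wasser"
    (re.compile(r"\bth"), "t"),  # word-initial "th": "thun" -> "tun"
    (re.compile(r"\bTh"), "T"),  # capitalized variant: "Theil" -> "Teil"
    (re.compile(r"ey"), "ei"),   # "seyn" -> "sein"
]


@dataclass(frozen=True)
class Token:
    original: str    # primary layer: spelling as found in the source
    normalized: str  # additional layer: modernized spelling


def normalize(word: str) -> str:
    """Apply every rewrite rule, in order, to a single word."""
    for pattern, replacement in RULES:
        word = pattern.sub(replacement, word)
    return word


def annotate(text: str) -> list[Token]:
    """Split on whitespace and pair each token with its normalized form."""
    return [Token(w, normalize(w)) for w in text.split()]


for token in annotate("Theil thun seyn Waſſer"):
    print(f"{token.original}\t{token.normalized}")
```

Because the original spelling is kept alongside the normalized form, analyses can use either layer. Context-free rules of this kind tend to overgenerate (the word-initial "th" rule would also rewrite a modern word such as "Thema"), which is one reason they are suitable mainly when the variation is limited and well understood.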

A further and more complicated problem for researchers is posed by spelling variants that arise from dialectal influences or from the individual style of a writer. Even within the works of one author, different spelling variants are possible. Such variants are often unknown to researchers, and there are no specially adapted NLP tools that can deal with them. As a result, time-consuming manual annotation often becomes necessary.

The issue of normalization of different spelling variants can be addressed through different solutions. They can broadly be divided into three categories:

Domain adaptation: the historical language is treated as the target domain and the modern language as the source domain; labeled data from the source domain are combined with labeled and unlabeled data from the target domain;
Retraining the tool: an existing tool can be retrained using additional, manually annotated data (see Bollmann 2018);
Data adaptation: historical variants are mapped to their contemporary equivalents (see Piotrowski 2012).

If normalization is a necessary step in the preprocessing and annotation of a corpus, how can it be automated? Several approaches are available. As mentioned earlier, minor spelling variation can be handled with simple rule-based methods. Other approaches rely on the similarity between historical and modern spellings and employ string distance metrics to identify the closest modern equivalent of a historical word (Jurish 2010; Pettersson, Megyesi, and Tiedemann 2013).
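
The similarity-based idea can be illustrated with a small sketch that looks up the closest entry in a modern word list using Levenshtein (edit) distance. It is a simplified illustration under stated assumptions, not the procedure of the cited studies: the tiny lexicon, the `max_distance` threshold, and the function names are hypothetical, and plain Levenshtein distance stands in for whatever metric a given study actually uses.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs for insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def closest_modern_form(historical: str,
                        lexicon: Iterable[str],
                        max_distance: int = 2) -> Optional[str]:
    """Return the lexicon entry nearest to the historical form, or None
    if even the nearest entry is further away than max_distance."""
    scored = [(levenshtein(historical.lower(), w.lower()), w) for w in lexicon]
    if not scored:
        return None
    distance, best = min(scored)  # ties are broken alphabetically
    return best if distance <= max_distance else None


# Hypothetical miniature lexicon of modern forms.
MODERN_LEXICON = ["sein", "tun", "teil", "wasser"]

for form in ["seyn", "thun", "xyzzy"]:
    print(form, "->", closest_modern_form(form, MODERN_LEXICON))
```

The threshold keeps the lookup from forcing a match when no plausible modern equivalent exists. With a realistic lexicon, several candidates are often equally close, so such a lookup is usually combined with frequency information or context.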

In some cases, character-based statistical machine translation can be applied for normalization. In such scenarios, normalization is modeled as the translation of character sequences rather than of word sequences, as is usual in machine translation (e.g. Sánchez-Martínez et al. 2013; Pettersson, Megyesi, and Tiedemann 2013; Schneider, Pettersson, and Percillier 2017).
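
To make the character-level framing concrete, the following sketch shows one way to prepare training data in which each character plays the role that a word normally plays in machine translation. The word pairs and function names are hypothetical, and the sketch covers only data preparation under that assumption; the exact preprocessing in the cited works may differ, and no translation system is trained here.

```python
def to_char_sequence(word: str) -> str:
    """Represent a word as space-separated characters, so that a
    translation system treats every character as one 'token'."""
    return " ".join(word)


def from_char_sequence(sequence: str) -> str:
    """Rebuild a word from its space-separated character sequence."""
    return "".join(sequence.split(" "))


def build_parallel_corpus(pairs):
    """Turn (historical, modern) word pairs into two aligned lists of
    character sequences: a source side and a target side."""
    source = [to_char_sequence(historical) for historical, _ in pairs]
    target = [to_char_sequence(modern) for _, modern in pairs]
    return source, target


# Hypothetical manually normalized word pairs.
PAIRS = [("seyn", "sein"), ("thun", "tun"), ("Theil", "Teil")]

source_lines, target_lines = build_parallel_corpus(PAIRS)
for src, tgt in zip(source_lines, target_lines):
    print(f"{src}  =>  {tgt}")
```

A system trained on such aligned lines learns character-level correspondences (for example "e y" to "e i") and can therefore propose normalizations for word forms it has never seen, which a word-level lookup cannot do.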

Although neural networks are widely used for solving various NLP problems, their use for spelling normalization is relatively infrequent. Inspired by machine translation techniques, Bollmann (2018) proposes a neural network approach to historical spelling normalization.

17.4 Conclusion

Preprocessing and annotation of historical literary texts pose several challenges for researchers, including spelling variation and uneven digitization quality. OCR technology has improved significantly in recent years, but historical texts still present challenges due to damaged or faded documents, non-standard fonts, and unevenly spaced text. To address these issues, researchers sometimes need supplementary annotation tools. Spelling variation can be addressed in several ways, including domain adaptation, tool retraining, and data adaptation.

References

See works cited and further readings on Zotero.

Citation suggestion

Evgeniia Fileva, Julia Dudar (2023): “Annotation for Literary History”. In: Survey of Methods in Computational Literary Studies (= D 3.2: Series of Five Short Survey Papers on Methodological Issues). Edited by Christof Schöch, Julia Dudar, Evgeniia Fileva. Trier: CLS INFRA. URL: https://methods.clsinfra.io/annotation-lithist.html, DOI: 10.5281/zenodo.7892112.

License: Creative Commons Attribution 4.0 International (CC BY).

Bollmann, Marcel. 2018. “Normalization of Historical Texts with Neural Network Models.” PhD thesis, Bochum: Ruhr-Universität Bochum. https://doi.org/10.13154/294-6213.

Jurish, Bryan. 2010. “Comparing Canonicalizations of Historical German Text.” In Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, 72–77. Uppsala, Sweden: Association for Computational Linguistics. https://aclanthology.org/W10-2209.

Muzny, Grace, Mark Algee-Hewitt, and Dan Jurafsky. 2017. “Dialogism in the Novel: A Computational Model of the Dialogic Nature of Narration and Quotations.” Digital Scholarship in the Humanities. https://doi.org/10.1093/llc/fqx031.

Nockels, Joe, Paul Gooding, Sarah Ames, and Melissa Terras. 2022. “Understanding the Application of Handwritten Text Recognition Technology in Heritage Contexts: A Systematic Review of Transkribus in Published Research.” Archival Science 22 (3): 367–92. https://doi.org/10.1007/s10502-022-09397-0.

Pettersson, Eva, Beáta Megyesi, and Jörg Tiedemann. 2013. “An SMT Approach to Automatic Annotation of Historical Text.” In Proceedings from the Workshop on Computational Historical Linguistics at NoDaLiDa 2013. NEALT. https://cl.lingfil.uu.se/~bea/publ//pettersson-megyesi-tiedemann.pdf.

Piotrowski, Michael. 2012. Natural Language Processing for Historical Texts. Synthesis Lectures on Human Language Technologies. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-031-02146-6.

Reul, Christian, Dennis Christ, Alexander Hartelt, Nico Balbach, Maximilian Wehner, Uwe Springmann, Christoph Wick, Christine Grundig, Andreas Büttner, and Frank Puppe. 2019. “OCR4all a (Semi-)Automatic OCR Workflow for Historical Printings.” Applied Sciences 9 (22): 4853. https://doi.org/10.3390/app9224853.

Sánchez-Martínez, Felipe, Isabel Martínez-Sempere, Xavier Ivars-Ribes, and Rafael C. Carrasco. 2013. “An Open Diachronic Corpus of Historical Spanish: Annotation Criteria and Automatic Modernisation of Spelling.” arXiv. https://doi.org/10.48550/arXiv.1306.3692.

Schneider, Gerold, Eva Pettersson, and Michael Percillier. 2017. “Comparing Rule-based and SMT-based Spelling Normalisation for English Historical Texts.” In Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language, 40–46. Gothenburg: Linköping University Electronic Press. https://aclanthology.org/W17-0508.