4  Introduction to Evaluation

Christof Schöch (Trier)

4.1 Introduction

Evaluation, in the context of CLS research, refers to the processes and procedures for verifying the quality of a method of analysis in terms of its performance, accuracy, robustness or generalisability. Within the workflow model of research we follow here, evaluation is a step that becomes especially relevant towards the end of the workflow, after the analysis has been run. However, multiple cycles of annotation, analysis and evaluation may of course be required before any given study is completed.

Note that evaluation as described here, primarily concerned with assessing the performance or accuracy of an analytical process, is different from evaluating the validity of a method (or the quality of the operationalization), where the key question is how well a chosen measure or method is really indicative of the phenomenon that one intends to study. This consideration about the “fit between a measurement that is applied and the theoretical construct it aims to represent” (Herrmann, Jacobs, and Piper 2021, 472) is sometimes discussed under the heading of construct validity.

Evaluation is a key element in the research process in CLS. One may argue that one of the core features of Computational Literary Studies is that it proceeds by using formalized, algorithmic and/or statistical methods to investigate literary texts and literary history represented in the form of digital data. Some of this research and the methods it uses are exploratory, in which case evaluation is in large part a matter of plausibility and contextualization of results, although some exploratory methods (such as clustering) also support formal evaluation. Other research and the methods it uses are based on the paradigm of hypothesis testing and classification, in which case formal evaluation is probably the dominant mode. In addition, and usually prior to the application of exploratory and/or hypothesis-driven methods of analysis, automatic or manual annotation of the data may be performed, and needs to be evaluated as well.

Given the current state of the art in evaluation, very much focused on measuring the agreement of annotators or measuring the performance of machine learning algorithms, we can structure the issue of evaluation in CLS research into several main areas briefly described in the following section and taken up, as appropriate, in the different chapters of this survey that are concerned with evaluation.

4.2 Research and evaluation scenarios

4.2.1 Evaluation of manual annotations

Manual, qualitative annotations can be evaluated, usually in the absence of a gold standard, by calculating the inter-annotator agreement (sometimes also called inter-rater reliability) using a range of measures intended for this purpose. Measures used in CLS include Cohen’s Kappa, Fleiss’ Kappa and Krippendorff’s Alpha (see Fleiss 1971; Artstein and Poesio 2008) as well as the more recently proposed Gamma (Mathet, Widlöcher, and Métivier 2015).
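
To give a concrete impression of how such a measure is computed in practice, the following minimal Python sketch calculates Cohen’s Kappa for two hypothetical annotators who have labelled the same ten text segments. The labels and the annotation scheme are invented for illustration, and scikit-learn is only one of several libraries implementing such measures.

    # Minimal sketch: inter-annotator agreement for two annotators on ten segments.
    # Labels and categories are invented for illustration.
    from sklearn.metrics import cohen_kappa_score

    annotator_a = ["speech", "narration", "speech", "thought", "narration",
                   "speech", "narration", "narration", "thought", "speech"]
    annotator_b = ["speech", "narration", "thought", "thought", "narration",
                   "speech", "speech", "narration", "thought", "speech"]

    kappa = cohen_kappa_score(annotator_a, annotator_b)
    print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect, 0.0 = chance-level agreement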

4.2.2 Evaluation of automatic annotation

Automatic, token-based annotations produced in a classification-based approach can usually be evaluated against a gold standard or reference annotation. Such a reference annotation is usually required for training the algorithms in any case; examples that have not been used in the training phase can then be used for evaluating the resulting performance. This performance is measured in terms of true and false positives and true and false negatives, based on which indicators such as precision and recall, and further derived scores such as the F-score, can be calculated. A very useful introduction to the evaluation of classification-based methods is chapter 8.5.1, “Metrics for Evaluating Classifier Performance”, in Han and Kamber (2012).
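
The following minimal sketch illustrates, with invented gold-standard and system labels for a hypothetical binary token-level annotation task, how precision, recall and the F-score are derived from the counts of true positives, false positives and false negatives.

    # Minimal sketch: precision, recall and F1 from true/false positives and
    # false negatives for a hypothetical binary token-level annotation.
    gold = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]   # reference annotation (invented)
    pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # system output (invented)

    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")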

4.2.3 Evaluation of classificatory approaches beyond annotation

Evaluation in the context of classification-based methods beyond automatic token-level annotation, for example document-level classification tasks, is not fundamentally different. Again, performance can be measured in terms of true and false positives and true and false negatives, based on which indicators such as precision and recall, and further derived scores such as the F-score, can be calculated. An error analysis based on a confusion matrix can in many cases yield important insights into the structure of the classification problem. Finally, it can be useful to determine which features have had the strongest influence on the trained model’s decision-making, for example by extracting feature weights from the model.
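
As an illustration of such an error analysis and feature inspection, the following Python sketch trains a toy document classifier, prints its confusion matrix and lists the most strongly weighted features. The texts, labels and the choice of scikit-learn’s TF-IDF features with logistic regression are assumptions made for the sake of the example, not a prescribed setup.

    # Minimal sketch: confusion matrix and feature weights for a toy document
    # classifier (texts, labels and model choice are invented for illustration).
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix

    texts = ["the detective examined the corpse",
             "she sighed and gazed at the moon",
             "the inspector questioned the suspect",
             "his heart ached with longing"]
    labels = ["crime", "sentimental", "crime", "sentimental"]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts)
    clf = LogisticRegression().fit(X, labels)

    pred = clf.predict(X)                  # in a real study: predict on held-out data
    print(confusion_matrix(labels, pred))  # rows: gold classes, columns: predictions

    # Features with the strongest positive weights, i.e. most indicative of the
    # second class in clf.classes_ (binary case):
    top = np.argsort(clf.coef_[0])[-5:]
    print([vectorizer.get_feature_names_out()[i] for i in top])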

4.2.4 Evaluation of clustering methods

When a gold standard is available, clustering methods are usually evaluated using measures such as the adjusted Rand index or cluster purity. A useful introductory chapter on this topic is chapter 10.6, “Evaluation of Clustering”, in Han and Kamber (2012).
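
As a minimal illustration, the following sketch computes the adjusted Rand index with scikit-learn and cluster purity with a small helper function, using invented cluster assignments and gold labels.

    # Minimal sketch: external evaluation of a clustering against a gold standard
    # (cluster assignments and gold labels are invented for illustration).
    from sklearn.metrics import adjusted_rand_score

    gold = ["comedy", "comedy", "tragedy", "tragedy", "tragedy", "comedy"]
    clusters = [0, 0, 1, 1, 0, 0]

    print("adjusted Rand index:", adjusted_rand_score(gold, clusters))

    def purity(gold, clusters):
        # Each cluster counts towards the gold class that dominates it.
        total = 0
        for c in set(clusters):
            members = [g for g, k in zip(gold, clusters) if k == c]
            total += max(members.count(label) for label in set(members))
        return total / len(gold)

    print("purity:", purity(gold, clusters))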

4.2.5 Evaluation of exploratory approaches other than clustering

Evaluation of other exploratory methods, such as topic modeling or keyword extraction, can be done either using qualitative and subjective methods of evaluation (such as establishing the plausibility of results by comparing them to results from earlier research), or using downstream tasks that can, for example, be framed as a classification problem, in which case the metrics mentioned above for this scenario can be used. Examples of scenarios using such downstream classification tasks include Schöch (2017) and Du, Dudar, and Schöch (2022).
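
The following sketch illustrates the general logic of such a downstream evaluation under invented assumptions: document-topic proportions from a topic model are used as features for a classification task (here, hypothetical subgenre labels), and the cross-validated F1 score then serves as an indirect measure of how much task-relevant structure the topics capture. The toy corpus, labels and parameter values are purely illustrative.

    # Minimal sketch: downstream evaluation of a topic model via a classification
    # task (toy corpus, labels and parameters are invented for illustration).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    texts = ["knights and castles and swords", "love letters and longing hearts",
             "dragons guard the castle gate", "her heart wrote letters of love",
             "the sword of the knight", "longing hearts exchange letters"]
    labels = ["adventure", "sentimental", "adventure", "sentimental",
              "adventure", "sentimental"]

    counts = CountVectorizer().fit_transform(texts)
    doc_topics = LatentDirichletAllocation(n_components=2,
                                           random_state=0).fit_transform(counts)

    # If the topics capture task-relevant structure, a classifier trained on the
    # document-topic matrix should reach a reasonable cross-validated F1 score.
    scores = cross_val_score(LogisticRegression(), doc_topics, labels,
                             cv=3, scoring="f1_macro")
    print("mean macro-F1:", scores.mean())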

4.3 Further forms of quality assurance

There are of course other aspects of evaluation, or quality assurance more generally, that apply to research in CLS. Apart from traditional forms of peer review, it is worth mentioning at least two aspects: the issue of replication / reproducibility, closely linked to Open Science principles, and the issue of code, data and tool review and criticism.

4.3.1 Replication and reproducibility: an emerging issue

Research strategies that include enabling and/or performing replication, reproduction or other forms of follow-up research should be mentioned here, because they are increasingly becoming an issue not just in Artificial Intelligence and Natural Language Processing, but also in CLS research. Such strategies can indeed be used to assess the quality, coherence, robustness and/or reliability of earlier research, in particular when performed by others. The demands on documentation as well as on data and code availability are rather high, however, if they are to truly support replication and/or reproduction. As a consequence, this aspect of CLS is still at an emerging stage (see Huber and Çöltekin 2020; Schöch 2023).

4.3.2 Data, code and tool review

There is one further aspect of evaluation and quality assurance in CLS research, in some ways connected to the previously mentioned issues of replication and reproduction: critical approaches and reviewing practices applied not just to traditional scholarly publications, but also to other forms of scholarly output that are crucial to research in CLS. For example, digital editions, corpora and other datasets have been reviewed regularly, for a number of years already, in RIDE – A review journal for digital editions and resources. Similarly, datasets can be described, whether in short form or with examples of applications in research, in the peer-reviewed Journal of Open Humanities Data (JOHD). Journals such as the Journal of Computational Literary Studies are taking first steps towards integrating code and data review into their peer review process.

Finally, tool criticism is another aspect of these critical practices. Traub and van Ossenbruggen define tool criticism as “the evaluation of the suitability of a given digital tool for a specific task” and stress that its primary objective is to “better understand the bias of the tool on the specific task”, improving the tool itself being only a secondary concern (Traub and van Ossenbruggen 2015, 1). When tool criticism is carried over into actual research practice, however, its implications become increasingly concrete.

4.4 Conclusion

Evaluation of methods is an important aspect of research in CLS and, depending on the approach used, takes various forms. There is a clear trend towards more formal and more reflective evaluation in CLS across the various approaches and research issues, one that goes beyond the area of classification, as becomes visible from the entries on evaluation in the specific areas of research covered by this survey.

References

See works cited and further reading for this chapter on Zotero.

Citation suggestion

Christof Schöch (2023): “Introduction to Evaluation”. In: Survey of Methods in Computational Literary Studies (= D 3.2: Series of Five Short Survey Papers on Methodological Issues). Edited by Christof Schöch, Julia Dudar, Evgeniia Fileva. Trier: CLS INFRA. URL: https://methods.clsinfra.io/evaluation-intro.html, DOI: 10.5281/zenodo.7892112.



License: Creative Commons Attribution 4.0 International (CC BY).