9  Evaluation in Authorship Attribution

Joanna Byszuk (Kraków)

Conducting a reliable and reproducible study in authorship attribution can be challenging, and the topic has attracted considerable attention in recent years. For a long time, authorship attribution studies relied on rather simple methods of evaluation – reporting classification accuracy for known cases in the examined corpora, or using two or three different methods to ensure that the same result is reached by various means. Most papers cite the specific settings (number and type of features, classification algorithm and distance measure) that were used. Many studies also mention determining the best settings for the particular corpus before using those settings to classify a text of unknown authorship.

The evaluation of authorship attribution is done not only in application papers that attempt to solve actual attribution problems, but also in methodological papers that attempt to validate certain approaches or understand particular parameters. Examples include studies mentioned in the previous section, examining how particular methods perform in attribution tasks (e.g. Jockers and Witten 2010; Stamatatos 2009) or what the limitations of the methods are when it comes to the length of the samples or the language (e.g. Rybicki and Eder 2011; Eder 2013b).

One of the biggest problems for reproducibility is the lack of access to corpora or proprietary code. While it is nowadays standard to publish both openly, e.g. on GitHub or Zenodo, unless they are under strict copyright, for many older studies it is impossible to find the exact corpora used, or even a full description of their content that would facilitate recreating them.

The panel at the Digital Humanities Conference 2020 by Schöch et al. (2020) provided the most comprehensive discussion yet of the issues related to the reproducibility of computational literary studies. Contributions ranged from developing a typology of repeating research (distinguishing features of replication, reproduction, etc.), to the difficulties of funding and conducting replications of old studies, to the scarcity of evaluation methods.

9.1 Methods of Evaluation

9.1.1 Evaluation in the case of clustering-based authorship attribution

Clustering approaches do not have a straightforward way of evaluating the results; therefore, studies employing such methods usually rely on good practice rules for conducting a reliable study. As noted in Eder (2013a), “[a] vast majority of methods used in stylometry establish a classification of samples and strive to find the nearest neighbors among them. Unfortunately, these techniques of classification are not resistant to a common mis-classification error: any two nearest samples are claimed to be similar, no matter how distant they are.” This is particularly characteristic of clustering and network approaches: since they group the elements of the dataset by similarity (finding the said nearest neighbors), they will always find some connection between elements, even if those elements actually have little in common. Corpus design is therefore of crucial importance – failing to include probable authors will produce a misattribution, while including too many candidates (some of whom could not have authored the text for objective reasons, such as being already dead or not yet born) can introduce noise that makes the classification more difficult or even impossible.

As Maciej Eder further argues, clustering approaches are very sensitive to the number and type of features used, which can and usually does influence the results. While in the case of strong and highly distinctive authorial signals the differences might be minor, they can nevertheless lead to erroneous conclusions. To counter this risk, Eder proposes applying bootstrapping (a technique used across various scholarly fields employing statistics), a procedure in which a series of experiments is performed, each varying in the number of features used, e.g. ten experiments increasing from the 100 to the 1,000 most frequent words in 100-word intervals. The results of each round are then put together, and only texts that were recognized as each other’s nearest neighbors in at least as many cases as a set threshold (usually 50%) are considered reliably close. The method has since gained popularity in authorship attribution circles and is now commonly applied.
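The procedure described above can be sketched as follows, assuming a texts-by-words frequency matrix whose columns are sorted by overall frequency. The function name, the use of classic (Manhattan) Delta on z-scored frequencies as the underlying distance, and the default step sizes are illustrative choices, not the exact implementation found in existing stylometric software:

```python
import numpy as np
from collections import Counter

def bootstrap_consensus(freqs, labels, mfw_steps=range(100, 1001, 100),
                        threshold=0.5):
    """Count how often each pair of texts are nearest neighbours across
    runs using increasingly long most-frequent-word (MFW) vectors.

    freqs  : 2D array (texts x words), columns sorted by overall frequency
    labels : one text identifier per row of freqs
    """
    mfw_steps = list(mfw_steps)
    pair_counts = Counter()
    for mfw in mfw_steps:
        sub = freqs[:, :mfw]
        # z-score each word column (Burrows' Delta style scaling);
        # the small constant guards against zero-variance columns
        z = (sub - sub.mean(axis=0)) / (sub.std(axis=0) + 1e-12)
        # Manhattan distance matrix between all pairs of texts
        dist = np.abs(z[:, None, :] - z[None, :, :]).sum(axis=2)
        np.fill_diagonal(dist, np.inf)
        # each run, record the set of (text, nearest neighbour) pairs
        run_pairs = {frozenset((labels[i], labels[j]))
                     for i, j in enumerate(dist.argmin(axis=1))}
        pair_counts.update(run_pairs)
    # keep only pairs that reach the consensus threshold
    return {tuple(sorted(p)): c / len(mfw_steps)
            for p, c in pair_counts.items()
            if c / len(mfw_steps) >= threshold}
```

Pairs that are nearest neighbors only under a particular feature-vector length fall below the threshold and are discarded, which is exactly the safeguard against the spurious "any two nearest samples are similar" links discussed above.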

Evert et al. (2017) compare the performance of various distance measures across feature vectors of 100, 1,000 and 5,000 words, using (1) the difference between z-transformed means, (2) the Adjusted Rand Index, and (3) clustering errors, finding that Cosine Delta produced the most accurate results and that longer vectors usually resulted in fewer errors. It should be noted, however, that for each corpus a different number of features might be enough to produce a reliable result, as observed in Eder (2017), and efforts are being made to adjust the methods and the selection of frequencies to work on shorter texts (Eder 2022).

9.1.2 Evaluation in the case of classification-based authorship attribution

Classification approaches to authorship attribution are evaluated in all studies. While good practices described in methodological and benchmark papers are of course followed, most, if not all, studies also report classification performance scores with the results. These take various forms – accuracy, precision, recall, F1-score, or AUC (Area Under the Curve) – all traditionally used in machine learning.
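For a single candidate author treated as the positive class, these scores can be computed directly from the counts of true and false positives and negatives. A self-contained sketch (function and parameter names are ours):

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 for one class
    treated as 'positive' (e.g. one candidate author)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

The distinction matters in practice: with an imbalanced candidate set, a classifier can reach high accuracy simply by favoring the majority author, while per-author precision and recall (and their harmonic mean, F1) expose that failure.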

A straightforward and often applied method of evaluation is cross-validation, in particular the leave-one-out method, in which a series of experiments is performed, each time taking one text out of the corpus and classifying it against the remaining ones. This makes it possible to verify the success rate of the classification both overall and in detail, and to identify texts that are misclassified or introduce noise into the corpus.
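Leave-one-out evaluation of a simple nearest-neighbor attribution step can be sketched as follows. For brevity this sketch z-scores the frequency matrix once over the whole corpus rather than recomputing the scaling for each fold, and uses Manhattan (Delta-style) distance; both are illustrative simplifications:

```python
import numpy as np

def leave_one_out(freqs, authors):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    on z-scored word frequencies; also returns misclassified texts.

    freqs   : 2D array (texts x words)
    authors : true author label for each row
    """
    z = (freqs - freqs.mean(axis=0)) / (freqs.std(axis=0) + 1e-12)
    hits, misclassified = 0, []
    for i in range(len(authors)):
        # Manhattan distance from the held-out text to all others
        dists = np.abs(z - z[i]).sum(axis=1)
        dists[i] = np.inf          # exclude the held-out text itself
        nearest = int(dists.argmin())
        if authors[nearest] == authors[i]:
            hits += 1
        else:
            misclassified.append(i)
    return hits / len(authors), misclassified
```

Returning the indices of misclassified texts, not just the overall rate, supports exactly the detailed inspection described above: a text that is repeatedly misattributed may be a translation, a collaboration, or simply too short for a stable signal.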

More complex evaluation procedures include testing the performance of various classification methods and types of features before coming to conclusions (e.g. Grieve 2007).

9.2 Conclusion

In conclusion, one can say that authorship attribution is probably the domain within CLS that has the most developed practice of evaluation. The reason for this is not just that it is a very long-standing domain, but also that there are sufficient numbers of undisputed, single-author publications to make systematic evaluation, and evaluation-driven development of methods, feasible. In addition, authorship is – despite edge-cases of various modes of collaboration – a category with much clearer class boundaries than in the case of canonicity (which can be understood as a gradual attribute) or genre (where one text can participate in more than one genre, to various degrees). As a consequence, evaluation in authorship attribution is, arguably, more feasible than in the case of categories such as genre or canonicity.


See works cited and further readings for this chapter on Zotero.

Citation suggestion

Joanna Byszuk (2023): “Evaluation in Authorship Attribution”. In: Survey of Methods in Computational Literary Studies (= D 3.2: Series of Five Short Survey Papers on Methodological Issues). Edited by Christof Schöch, Julia Dudar, Evegniia Fileva. Trier: CLS INFRA. URL: https://methods.clsinfra.io/evaluation-author.html, DOI: 10.5281/zenodo.7892112.

License: Creative Commons Attribution 4.0 International (CC BY).