Analysis in Authorship Attribution

Byszuk, Joanna

doi:10.5281/zenodo.7892112

Joanna Byszuk (Kraków)

As mentioned in the chapter “What is Authorship Attribution?” (Chapter 5), computational studies into authorship are divided into attribution and verification problems. Both of these areas make use of machine learning methods (especially supervised and unsupervised learning, that is, classification and clustering methods) that have long proved their usefulness across numerous scholarly fields.

8.1 Attribution versus verification

To summarize the distinction between attribution and verification explicated above, “authorship attribution includes determining which of the candidate authors in the examined dataset is most likely to be the author of the text in question. In turn, authorship verification deals with checking whether any of the candidate authors is at all likely to be the author of the examined text” (Hernández-Lorenzo and Byszuk 2022).

Good preparation for authorship attribution requires a thorough study of the relevant literature, as many of the texts of disputed authorship have been subject to literary and philological (even palaeographic) investigations that serve as an invaluable source of information on possible authors. In open-set authorship studies it is even more crucial than in other cases to include the step of authorship verification alongside (or before) the step of authorship attribution.

8.2 Authorship attribution methods

Apart from selecting a supervised (classification) or unsupervised (clustering) method – see chapter “General Issues in Data Analysis” (Chapter 3) for details – and the type and number of features, many methods of authorship attribution analysis require a choice of distance measure (i.e. a similarity measure projected onto a multidimensional geometric space) to examine the stylistic relations between texts, an area that has received considerable research attention (see e.g. Grieve 2007; Argamon 2008; Eder 2015; Evert et al. 2017). Relatively little work has been done on the performance of various features, with the evidence pointing to the most frequent words as the most reliable carrier of the authorial style signal (Eder 2011; Camps, Clérice, and Pinche 2020). For more information on the issue of the features, see the chapter “Annotation for Authorship Attribution” (Chapter 7).
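As a minimal illustration of one such distance measure, the sketch below computes Burrows's Delta in its classic form, that is, the mean absolute difference between z-scored relative word frequencies (cf. Argamon 2008). The function names, the use of the population standard deviation, and the toy frequency values are assumptions made for illustration only; in a real study the frequencies would be derived from a corpus.

```python
from statistics import mean, pstdev

def zscore_table(freqs):
    """Z-score each feature (column) across all texts.

    `freqs` maps a text label to a list of relative frequencies,
    one per feature, in a fixed feature order.
    """
    labels = list(freqs)
    n_features = len(freqs[labels[0]])
    columns = [[freqs[label][j] for label in labels] for j in range(n_features)]
    means = [mean(col) for col in columns]
    # Population standard deviation (an assumption); fall back to 1.0
    # when a feature has zero variance to avoid division by zero.
    stdevs = [pstdev(col) or 1.0 for col in columns]
    return {
        label: [(freqs[label][j] - means[j]) / stdevs[j] for j in range(n_features)]
        for label in labels
    }

def burrows_delta(z_a, z_b):
    """Mean absolute difference between two z-scored frequency vectors."""
    return mean(abs(a - b) for a, b in zip(z_a, z_b))

# Hypothetical relative frequencies of three very frequent words.
toy_freqs = {
    "author_A_text1": [0.060, 0.031, 0.025],
    "author_A_text2": [0.058, 0.033, 0.024],
    "author_B_text1": [0.045, 0.040, 0.018],
}
z = zscore_table(toy_freqs)
d_same = burrows_delta(z["author_A_text1"], z["author_A_text2"])
d_diff = burrows_delta(z["author_A_text1"], z["author_B_text1"])
print(d_same < d_diff)  # the two texts by author A are closer to each other
```

In this toy table the two texts labelled as author A end up closer to each other than to the text labelled as author B; with real data, the outcome depends heavily on the number of features and on the corpus used for z-scoring.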

Authorship attribution studies frequently use various unsupervised machine learning methods for exploratory data analysis. The most common are clustering approaches, including bootstrapped cluster analyses visualized in the form of networks; methods of dimensionality reduction, such as Principal Component Analysis (PCA) and Multidimensional Scaling (MDS), are also used. All of these are included in the most commonly used R package ‘stylo’, but as they are standard statistical methods, they and others can also be found in more general libraries in both R and Python.

There are relatively few examinations comparing the performance of particular methods. Most work in the field focuses on cluster and network approaches; these were examined by Ochab et al. (2019), who tested the performance of typical methods of cluster and network grouping and found the Ward linkage method and the Louvain method of community detection to be the most reliable.

While in many simpler cases clustering will be a good enough method for authorship attribution, more precise classification methods give greater certainty in cases where the authorial signal is weak or blurred, or in the case of shorter texts. Supervised machine learning classification with SVM (support vector machines), the Delta method, or NSC (nearest shrunken centroids) is often considered more reliable. In this type of analysis, the chosen classifier ‘learns’ the style of each author from texts of known authorship and, on the basis of that knowledge, indicates which author the text of disputed authorship most resembles.

Benchmark comparisons of machine learning methods for authorship attribution are also relatively outdated, the most notable being Stamatatos 2009 (listing dozens of studies describing methods) and a study by Jockers and Witten (2010), which proposes NSC and RDA (regularized discriminant analysis) as the best performing. Jockers and Witten criticize SVM at length; however, this method is often used and recommended by computer science experts such as Stamatatos, Koppel, and Argamon, and has proven to be more reliable and stable when dealing with high-dimensional and sparse data, such as historical and shorter writings (Stamatatos 2013; Franzini et al. 2018), and with two- and three-class problems (Luyckx and Daelemans).

Savoy (2020b, 109, reformulated here) distinguishes the following steps of a quantitative inquiry, which can also be applied to attribution:

Defining a precise research question or formulating a hypothesis.
Preparing a selection of texts that will make up the evaluation corpus. (See also: section “Corpus building for Authorship Attribution”)
Preprocessing data to ensure its quality (which can include normalization of spelling and/or removing extra-textual elements such as page numbers).
Choosing a text representation strategy, that is, deciding which stylistic features are to serve as markers of style.
(optional) Removing noisy attribute features to lower computational costs or improve method performance.
Choosing and applying a machine learning algorithm to perform the classification.
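The preprocessing and text representation steps listed above can be sketched as follows, assuming that the most frequent words serve as features (cf. Eder 2011). The naive tokenizer, the function names, and the two toy texts are hypothetical; real preprocessing decisions depend on the language and the corpus, and the sketch assumes that no text is empty.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep runs of letters only (a deliberately naive tokenizer)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def mfw_table(texts, n_mfw):
    """Relative frequencies of the n most frequent words of the whole corpus."""
    counts = {label: Counter(tokenize(text)) for label, text in texts.items()}
    corpus_counts = Counter()
    for text_counts in counts.values():
        corpus_counts.update(text_counts)
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(corpus_counts.items(), key=lambda item: (-item[1], item[0]))
    features = [word for word, _ in ranked[:n_mfw]]
    table = {}
    for label, text_counts in counts.items():
        total = sum(text_counts.values())
        table[label] = [text_counts[word] / total for word in features]
    return features, table

# Two hypothetical miniature "texts".
texts = {
    "text_1": "The cat and the dog.",
    "text_2": "The dog and the bird and the fish.",
}
features, table = mfw_table(texts, n_mfw=3)
print(features)
print(table)
```

The resulting table (one row of relative frequencies per text) is the kind of input that distance measures and classifiers operate on; the optional feature-removal step would simply drop columns from it.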

In the case of authorship attribution problems, the question of the right selection of texts that are to make the corpus is particularly important.

Once we have formulated our hypothesis and identified all texts that might relevantly represent possible (candidate) authors, we proceed with dividing them into a training set and a test set. The texts included in the training set will serve as the learning data for our classifier, so they should reflect the style of particular authors as well as possible. The test set will include the investigated text as well as other texts by all authors represented in the training set. Including candidate authors in both the training and the test sets helps us measure the performance and reliability of the classification. If our classifier has learned to recognize particular candidate authors with, say, 80-90% accuracy (although what counts as ‘good’ depends on the number of classes in an experiment), we know that it has a fairly good idea of how particular authors write and what stylistic features distinguish them, which allows us to trust that it recognizes the author of the investigated text similarly well.
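The evaluation logic described here can be sketched with a deliberately simple nearest-centroid classifier based on Manhattan distance. This is an illustrative stand-in and not SVM, Delta, or NSC as implemented in any particular package; all function names and the toy feature vectors are assumptions.

```python
def manhattan(a, b):
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def train_centroids(training_set):
    """Average the feature vectors of each author's training texts."""
    grouped = {}
    for author, vector in training_set:
        grouped.setdefault(author, []).append(vector)
    return {
        author: [sum(column) / len(vectors) for column in zip(*vectors)]
        for author, vectors in grouped.items()
    }

def predict(centroids, vector):
    """Return the author whose centroid is closest to the given vector."""
    return min(centroids, key=lambda author: manhattan(centroids[author], vector))

def accuracy(centroids, test_set):
    """Share of test texts of known authorship that are attributed correctly."""
    hits = sum(predict(centroids, vector) == author for author, vector in test_set)
    return hits / len(test_set)

# Hypothetical feature vectors (e.g. relative frequencies of two words).
training_set = [
    ("A", [0.060, 0.031]), ("A", [0.058, 0.033]),
    ("B", [0.045, 0.040]), ("B", [0.047, 0.042]),
]
test_set = [("A", [0.059, 0.030]), ("B", [0.046, 0.039])]
disputed = [0.057, 0.032]

centroids = train_centroids(training_set)
print(accuracy(centroids, test_set))   # performance on texts of known authorship
print(predict(centroids, disputed))    # attribution of the investigated text
```

The accuracy is computed only on test texts whose authors are known; the prediction for the disputed text is worth trusting only to the extent that this accuracy is high.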

8.3 Authorship verification methods

Authorship verification deals with the question of whether any of the candidate authors is at all likely to have written a particular text. The main difference from the attribution approaches described above is that in this case, rather than trying to guess the author, the algorithm compares pairs of texts against the rest of the dataset to see whether any two of them are significantly more similar to one another than to the remaining texts.

The best-described approach is the General Imposters (GI) framework, first proposed by Moshe Koppel (Koppel and Schler 2004; Koppel and Winter 2014) and further examined and developed by Mike Kestemont (Kestemont et al. 2016), Patrick Juola (Juola 2015), and others. As explained in Kestemont et al. (2016), “[the] general intuition behind the GI, is not to assess whether two documents are simply similar in writing style, given a static feature vocabulary, but rather, it aims to assess whether two documents are significantly more similar to one another than other documents, across a variety of stochastically impaired feature spaces (Eder 2012; Houvardas and Stamatatos 2006) and compared to random selections of so-called distractor authors (Juola 2015), also called ‘imposters’” (Kestemont et al. 2016, 88).¹

To put this procedure in simple terms, it is based on running a series of classifications comparing each text in the dataset to the examined text (Target) and to a random and changing subset of candidate authors (Imposters). The final outcome is a value between 0 and 1, showing whether each text was closer to the Imposters (0) or to the Target text (1). Importantly, apart from this general score (the higher the value, the greater the confidence in the attribution), the framework includes calculating “Periods of Confidence”, that is, dividing the 0-1 range into three parts: definitely not Target text, cannot say for sure, definitely Target text.

8.4 Applications of authorship attribution

As far as applications of stylometric authorship attribution methods are concerned, they are of course legion, given that the field has been active at least since the late 1950s, with important precursors reaching far back into the 19th century.² Early and pioneering work using modern statistics from the late 1950s onwards includes Grayston and Herdan (1959), Ellegård (1962), Mosteller and Wallace (1963), and Levison, Morton, and Wake (1966). Today, such applications can be found for cases of disputed or uncertain authorship in a wide range of languages, periods, and literary genres; from this wide range, we can only list a somewhat arbitrary selection (ordered simply by year of publication):

Labbé and Labbé (2001) and Cafiero and Camps (2019) analysed French drama of the seventeenth century (the Molière-Corneille case);
Binongo (2003) uses Principal Component Analysis to find out who wrote the 15th book in the Wizard of Oz series (written primarily by L. Frank Baum);
van Dalen-Oskam and van Zundert (2007) investigate questions of authorship and scribal influence of the Middle Dutch Walewein;
Craig and Kinney (2009) investigated several authorship cases concerning English drama of the late 16th and early 17th century (around Shakespeare, Marlowe and others);
Jannidis and Lauer (2014) investigate German 19th-century novels;
Juola (2015) describes the process by which J. K. Rowling’s authorship of The Cuckoo’s Calling was established;
Rißler-Pipka (2016) conducts experiments regarding authorship attribution of Spanish novels of the Early Modern period;
Kestemont et al. (2017) investigate several hypotheses regarding the authorship of the Wilhelmus, the Dutch folk song that is the national anthem of the Netherlands;
Tuzzi and Cortelazzo (2018) and Savoy (2020a) built suitable corpora and used several authorship attribution methods to attempt to identify the true author behind the contemporary Italian bestselling novels published under the name of Elena Ferrante;
Grieve et al. (2019) dealt with a mid-19th-century American-English letter (often attributed to Lincoln, but attributed to Hay by Grieve and colleagues);
Mazurko and Walkowiak (2020) developed stylometric methods for examining authorship of literary texts in Ukrainian;
Hadjadj and Sayoud (2021) used over-sampling and PCA to deal with authorship attribution challenges in imbalanced corpora in Arabic;
Ai et al. (2021) applied an LDA-Transformer model to perform Authorship Attribution of Chinese Poetry;
Vega García-Luengos (2021) worked on a corpus of plays by Spanish writer Lope de Vega and proposed new attribution hypotheses;
Most recently, Jungmannová and Plecháč (2022) investigated the authorship of an unsigned 20th-century Czech play (regarding Milan Kundera).

This list, though far from complete, hopefully conveys at least some of the richness of methods and domains of authentic authorship problems addressed with computational approaches.

8.5 Conclusion

In conclusion, it is worth noting that, while many methodological questions remain open in stylometric authorship attribution, it remains one of the oldest and most developed areas of investigation within Computational Literary Studies.

References

See works cited and further readings for this chapter on Zotero.

Citation suggestion

Joanna Byszuk (2023): “Analysis in Authorship Attribution”. In: Survey of Methods in Computational Literary Studies (= D 3.2: Series of Five Short Survey Papers on Methodological Issues). Edited by Christof Schöch, Julia Dudar, Evgeniia Fileva. Trier: CLS INFRA. URL: https://methods.clsinfra.io/analysis-author.html, DOI: 10.5281/zenodo.7892112.

License: Creative Commons Attribution 4.0 International (CC BY).

Ai, Zhou, Zhang Yijia, Wei Hao, and Lu Mingyu. 2021. “LDA-Transformer Model in Chinese Poetry Authorship Attribution.” In Information Retrieval, edited by Hongfei Lin, Min Zhang, and Liang Pang, 13026:59–73. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-88189-4_5.

Argamon, Shlomo. 2008. “Interpreting Burrows’s Delta: Geometric and Probabilistic Foundations.” Literary and Linguistic Computing 23 (2): 131–47. https://doi.org/10.1093/llc/fqn003.

Binongo, Jose Nilo G. 2003. “Who Wrote the 15th Book of Oz? An Application of Multivariate Analysis to Authorship Attribution.” Chance 16 (2): 9–17. https://doi.org/10.1080/09332480.2003.10554843.

Cafiero, Florian, and Jean-Baptiste Camps. 2019. “Why Molière Most Likely Did Write His Plays.” Science Advances 5 (11): eaax5489. https://doi.org/10.1126/sciadv.aax5489.

Camps, Jean-Baptiste, Thibault Clérice, and Ariane Pinche. 2020. “Stylometry for Noisy Medieval Data: Evaluating Paul Meyer’s Hagiographic Hypothesis.” arXiv. https://doi.org/10.48550/arXiv.2012.03845.

Craig, Hugh, and Arthur F. Kinney, eds. 2009. Shakespeare, Computers, and the Mystery of Authorship. Cambridge University Press. https://doi.org/10.1017/CBO9780511605437.

Eder, Maciej. 2011. “Style-Markers in Authorship Attribution: A Cross-Language Study of the Authorial Fingerprint.” Studies in Polish Linguistics 6 (1). https://ruj.uj.edu.pl/xmlui/handle/item/68325.

———. 2012. “Mind Your Corpus: Systematic Errors in Authorship Attribution.” In Digital Humanities Conference, 28:4. https://doi.org/10.1093/llc/fqt039.

———. 2015. “Taking Stylometry to the Limits: Benchmark Study on 5,281 Texts from Patrologia Latina.” In Digital Humanities Conference. Sydney. https://dh-abstracts.library.virginia.edu/works/2364.

Ellegård, Alvar. 1962. A Statistical Method for Determining Authorship: The Junius Letters, 1769-1772. Gothenburg: University of Gothenburg.

Evert, St., Thomas Proisl, Fotis Jannidis, Isabella Reger, Steffen Pielström, Christof Schöch, and Thorsten Vitt. 2017. “Understanding and Explaining Delta Measures for Authorship Attribution.” Digital Scholarship in the Humanities 32 (suppl_2). https://doi.org/10.1093/llc/fqx023.

Franzini, Greta, Mike Kestemont, Gabriela Rotari, Melina Jander, Jeremi K. Ochab, Emily Franzini, Joanna Byszuk, and Jan Rybicki. 2018. “Attributing Authorship in the Noisy Digitized Correspondence of Jacob and Wilhelm Grimm.” Frontiers in Digital Humanities 5. https://www.frontiersin.org/articles/10.3389/fdigh.2018.00004.

Grayston, K., and G. Herdan. 1959. “The Authorship of the Pastorals in the Light of Statistical Linguistics.” New Testament Studies 6 (1): 1–15. https://doi.org/10.1017/S0028688500001284.

Grieve, Jack. 2007. “Quantitative Authorship Attribution: An Evaluation of Techniques.” Literary and Linguistic Computing 22 (3): 251–70. https://doi.org/10.1093/llc/fqm020.

Grieve, Jack, Isobelle Clarke, Emily Chiang, Hannah Gideon, Annina Heini, Andrea Nini, and Emily Waibel. 2019. “Attributing the Bixby Letter Using n-Gram Tracing.” Digital Scholarship in the Humanities 34 (3): 493–512. https://doi.org/10.1093/llc/fqy042.

Hadjadj, Hassina, and Halim Sayoud. 2021. “Arabic Authorship Attribution Using Synthetic Minority Over-Sampling Technique and Principal Components Analysis for Imbalanced Documents.” International Journal of Cognitive Informatics and Natural Intelligence 15 (4): 1–17. https://doi.org/10.4018/IJCINI.20211001.oa33.

Hernández-Lorenzo, Laura, and Joanna Byszuk. 2022. “Challenging Stylometry: The Authorship of the Baroque Play La Segunda Celestina.” Digital Scholarship in the Humanities, November, fqac063. https://doi.org/10.1093/llc/fqac063.

Houvardas, John, and Efstathios Stamatatos. 2006. “N-Gram Feature Selection for Authorship Identification.” In Artificial Intelligence: Methodology, Systems, and Applications, edited by Jérôme Euzenat and John Domingue, 4183:77–86. Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/11861461_10.

Jannidis, Fotis, and Gerhard Lauer. 2014. “Burrows’s Delta and Its Use in German Literary History.” In Distant Readings. Topologies of German Culture in the Long Nineteenth Century, edited by Matt Erlin and Lynne Tatlock, 29–54. Rochester: Camden House. gerhardlauer.de/index.php/download_file/view/335/1/.

Jockers, Matt, and Daniela Witten. 2010. “A Comparative Study of Machine Learning Methods for Authorship Attribution.” Literary and Linguistic Computing 25 (2): 215–23. https://doi.org/10.1093/llc/fqq001.

Jungmannová, Lenka, and Petr Plecháč. 2022. “Unsigned Play by Milan Kundera? An Authorship Attribution Study.” https://doi.org/10.48550/ARXIV.2212.09879.

Juola, Patrick. 2015. “The Rowling Case: A Proposed Standard Analytic Protocol for Authorship Questions.” Digital Scholarship in the Humanities 30 (suppl_1): i100–113. https://doi.org/10.1093/llc/fqv040.

Kestemont, Mike, Justin Stover, Moshe Koppel, Folgert Karsdorp, and Walter Daelemans. 2016. “Authorship Verification with the Ruzicka Metric.” In Digital Humanities Conference 2016 (DH2016) Book of Abstracts. Krakow: ADHO. https://dh-abstracts.library.virginia.edu/works/2542.

Kestemont, Mike, Els Stronks, Martine de Bruin, and Tim de Winkel. 2017. Van Wie Is Het Wilhelmus? De Auteur van Het Nederlandse Volkslied Met de Computer Onderzocht. Amsterdam: Amsterdam University Press.

Koppel, Moshe, and Jonathan Schler. 2004. “Authorship Verification as a One-Class Classification Problem.” In Twenty-First International Conference on Machine Learning - ICML ’04, 62. Banff, Alberta, Canada: ACM Press. https://doi.org/10.1145/1015330.1015448.

Koppel, Moshe, and Yaron Winter. 2014. “Determining If Two Documents Are Written by the Same Author.” Journal of the Association for Information Science and Technology 65 (1): 178–87. https://doi.org/10.1002/asi.22954.

Labbé, Cyril, and Dominique Labbé. 2001. “Inter-Textual Distance and Authorship Attribution Corneille and Molière.” Journal of Quantitative Linguistics 8 (3): 213–31. https://doi.org/10.1076/jqul.8.3.213.4100.

Levison, M., A. Q. Morton, and W. C. Wake. 1966. “On Certain Statistical Features of the Pauline Epistles.” The Philosophical Journal 3: 129–48.

Mazurko, Anton, and Tomasz Walkowiak. 2020. “Computer Based Stylometric Analysis of Texts in Ukrainian Language.” In Artificial Intelligence and Soft Computing, edited by Leszek Rutkowski, Rafał Scherer, Marcin Korytkowski, Witold Pedrycz, Ryszard Tadeusiewicz, and Jacek M. Zurada, 12416:220–30. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-61534-5_20.

Mosteller, Frederick, and David L. Wallace. 1963. “Inference in an Authorship Problem.” Journal of the American Statistical Association 58 (302): 275–309. http://www.jstor.org/stable/2283270.

Ochab, Jeremi K., Joanna Byszuk, Steffen Pielström, and Maciej Eder. 2019. “Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) Versus Network Clustering (Community Detection).” In Digital Humanities 2019: Book of Abstracts. Utrecht: ADHO. https://doi.org/10.34894/DSVVAC.

Rißler-Pipka, Nanette. 2016. “Der falsche Quijote? Autorschaftsattribution für spanische Prosa der frühen Neuzeit.” In DHd 2016 Modellierung, Vernetzung, Visualisierung, 212–17. Leipzig. http://dhd2016.de/boa.pdf.

Savoy, Jacques. 2020a. “Elena Ferrante: A Case Study in Authorship Attribution.” In Machine Learning Methods for Stylometry, 191–210. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-53360-1_8.

———. 2020b. Machine Learning Methods for Stylometry: Authorship Attribution and Author Profiling. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-53360-1.

Stamatatos, Efstathios. 2013. “On the Robustness of Authorship Attribution Based on Character N-Gram Features.” Journal of Law and Policy 21 (2). https://brooklynworks.brooklaw.edu/jlp/vol21/iss2/7.

Tuzzi, Arjuna, and Michele A. Cortelazzo, eds. 2018. Drawing Elena Ferrante’s Profile: Workshop Proceedings, Padova, 7 September 2017. Padova: Padova UP.

van Dalen-Oskam, Karina, and Joris van Zundert. 2007. “Delta for Middle Dutch and Copyist Distinction in Walewein.” Literary and Linguistic Computing 22 (3): 345–62. https://doi.org/10.1093/llc/fqm012.

Vega García-Luengos, Germán. 2021. “Las Comedias de Lope de Vega: Confirmaciones de Autoría y Nuevas Atribuciones Desde La Estilometría (I).” Talía. Revista de Estudios Teatrales 3 (May): 91–108. https://doi.org/10.5209/tret.74625.

¹ Cited after https://computationalstylistics.github.io/blog/imposters/.
² For an extensive list of early and later applications, see the Stylometry Bibliography curated by Christof Schöch since 2016 and containing around 3500 entries.