8  Analysis in Authorship Attribution

Joanna Byszuk (Kraków)

As mentioned in the chapter “What is Authorship Attribution?” (Chapter 5), computational studies into authorship are divided into attribution and verification problems. Both of these areas make use of machine learning methods (especially supervised and unsupervised learning, that is, classification and clustering methods) that have long proved their usefulness across numerous scholarly fields.

8.1 Attribution versus verification

To summarize the distinction between attribution and verification explicated above, “authorship attribution includes determining which of the candidate authors in the examined dataset is most likely to be the author of the text in question. In turn, authorship verification deals with checking whether any of the candidate authors is at all likely to be the author of the examined text” (Hernández-Lorenzo and Byszuk 2022).

Good preparation for authorship attribution requires a thorough study of the relevant literature, as many texts of disputed authorship have been subject to literary and philological (even palaeographic) investigations that serve as an invaluable source of information on possible authors. It is also even more crucial here than in other cases that open-set authorship studies include a step of authorship verification alongside (or before) the step of authorship attribution.

8.2 Authorship attribution methods

Apart from selecting a supervised (classification) or unsupervised (clustering) approach – see the chapter “General Issues in Data Analysis” (Chapter 3) for details – and the type and number of features, many methods of authorship attribution analysis require a choice of distance measure (i.e. a similarity measure projected onto a multidimensional geometric space) to examine stylistic relations between texts, an area to which much research has been devoted (see e.g. Grieve 2007; Argamon 2008; Eder 2015; Evert et al. 2017). Relatively little work has been done on the performance of various features, with the evidence pointing to the most frequent words as the most reliable carrier of the authorial style signal (Eder 2011; Camps, Clérice, and Pinche 2020). For more information on the issue of features, see the chapter “Annotation for Authorship Attribution” (Chapter 7).
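The combination of most-frequent-word features and a distance measure can be sketched as follows. The example implements Burrows' classic Delta measure (the mean absolute difference of z-scored word frequencies); the frequency values are invented toy data, not taken from any study cited here.

```python
from statistics import mean, stdev

def burrows_delta(freqs_a, freqs_b, corpus_freqs):
    """Burrows' Delta: mean absolute difference of z-scored word frequencies.

    freqs_a, freqs_b: {word: relative frequency} for the two texts compared.
    corpus_freqs: list of such dicts for all texts in the corpus, used to
    estimate each word's mean and standard deviation for z-scoring.
    """
    deltas = []
    for word in corpus_freqs[0]:
        column = [f[word] for f in corpus_freqs]
        mu, sigma = mean(column), stdev(column)
        z_a = (freqs_a[word] - mu) / sigma
        z_b = (freqs_b[word] - mu) / sigma
        deltas.append(abs(z_a - z_b))
    return mean(deltas)

# Toy example: relative frequencies of three very frequent words in four texts.
corpus = [
    {"the": 0.060, "of": 0.030, "and": 0.025},
    {"the": 0.055, "of": 0.032, "and": 0.027},
    {"the": 0.070, "of": 0.020, "and": 0.035},
    {"the": 0.052, "of": 0.033, "and": 0.026},
]
# Texts 0 and 1 come out as stylistically closer than texts 0 and 2.
print(burrows_delta(corpus[0], corpus[1], corpus))
print(burrows_delta(corpus[0], corpus[2], corpus))
```

In practice the feature vocabulary would contain hundreds of the most frequent words, and the 'stylo' package offers this and several variant distance measures out of the box.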

Authorship attribution studies frequently use various unsupervised machine learning methods for exploratory data analysis. Most common are clustering approaches, including bootstrapped cluster analyses visualized in the form of networks, but methods of dimensionality reduction, such as Principal Component Analysis (PCA) and Multidimensional Scaling (MDS), are also used. All of the above are included in the most commonly used R package ‘stylo’, but as these are standard statistical methods, they and others can also be found in more general libraries in both R and Python.
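The core of any such clustering approach can be illustrated with a minimal sketch. The example below implements agglomerative clustering with single linkage for brevity (Ward linkage, which stylo and most studies use, differs only in its variance-based merge criterion); the distance matrix is toy data.

```python
def agglomerative_clusters(dist, n_clusters):
    """Minimal single-linkage agglomerative clustering.

    dist: symmetric matrix of pairwise stylistic distances between texts.
    Repeatedly merges the two closest clusters until n_clusters remain.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)
    return clusters

# Toy distance matrix for four texts: 0/1 and 2/3 form two stylistic groups.
D = [
    [0.0, 0.4, 1.6, 1.7],
    [0.4, 0.0, 1.5, 1.8],
    [1.6, 1.5, 0.0, 0.3],
    [1.7, 1.8, 0.3, 0.0],
]
print(agglomerative_clusters(D, 2))  # -> [[0, 1], [2, 3]]
```

Real analyses would of course not reimplement this: scipy's `cluster.hierarchy` in Python or `hclust` in R provide the standard linkage methods, including Ward.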

There are relatively few examinations comparing the performance of particular methods. Most work in the field focuses on cluster and network approaches, an area examined by Ochab et al. (2019), who tested the performance of typical methods of cluster and network grouping and found the Ward linkage method and the Louvain method of community detection to be the most reliable.

While in many simpler cases clustering will be a good enough method for authorship attribution, more precise classification methods give better certainty in cases where the authorial signal is weak or blurred, or in the case of shorter texts. Supervised machine learning classification with SVM (support vector machines), the Delta method, or NSC (nearest shrunken centroids) is often considered more reliable. As hinted above, in this type of analysis one of these classifiers ‘learns’ the style of each of the candidate authors and, based on that knowledge, indicates which of them the text of disputed authorship most resembles.
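The logic of such a classifier can be sketched with a plain nearest-centroid model, a simplified stand-in for NSC (the real NSC additionally 'shrinks' centroids toward the overall mean to suppress noisy features); the author names and frequency vectors are invented for illustration.

```python
def nearest_centroid_author(train, disputed):
    """Attribute a disputed text to the author whose centroid (the mean
    feature vector over that author's training texts) lies closest.

    train: {author: [feature vector, ...]}; disputed: feature vector.
    """
    def centroid(vectors):
        return [sum(col) / len(col) for col in zip(*vectors)]

    def euclidean(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

    centroids = {author: centroid(vecs) for author, vecs in train.items()}
    return min(centroids, key=lambda a: euclidean(centroids[a], disputed))

# Toy word-frequency vectors for two candidate authors, two texts each.
train = {
    "Author A": [[0.060, 0.030, 0.020], [0.058, 0.031, 0.022]],
    "Author B": [[0.045, 0.040, 0.035], [0.047, 0.042, 0.033]],
}
print(nearest_centroid_author(train, [0.059, 0.030, 0.021]))  # -> Author A
```

SVM and Delta follow the same train-then-attribute pattern but draw the decision boundary differently; scikit-learn's `SVC` and `NearestCentroid` (which supports shrinkage) are common off-the-shelf choices.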

Benchmark comparisons of machine learning methods for authorship attribution are also relatively outdated, the most notable being Stamatatos (2009), which lists dozens of studies describing methods, and a study by Jockers and Witten (2010) proposing NSC and RDA (regularized discriminant analysis) as the best performing. Jockers and Witten criticize SVM at length; however, this method is often used and recommended by computer science experts such as Stamatatos, Koppel, and Argamon, and has proven more reliable and stable when dealing with high-dimensional and sparse data, such as historical and shorter writings (Stamatatos 2013; Franzini et al. 2018), and with two- and three-class problems (Luyckx and Daelemans).

Savoy (2020b, 109, reformulated here) distinguishes the following steps of a quantitative inquiry, which can also be applied to attribution:

  • Defining a precise research question or formulating a hypothesis.
  • Preparing a selection of texts that will make up the evaluation corpus. (See also: section “Corpus building for Authorship Attribution”)
  • Preprocessing data to ensure its quality (which can include normalization of spelling and/or removing extra-textual elements such as page numbers).
  • Choosing a text representation strategy, that is which stylistic features are to serve as a marker of style.
  • (optional) Removing noisy features to lower computational costs or improve method performance.
  • Choosing and applying a machine learning algorithm to perform the classification.

In the case of authorship attribution problems, the question of the right selection of texts that are to make the corpus is particularly important.

Once we have formulated our hypothesis and identified all texts that might relevantly represent possible (candidate) authors, we proceed to divide them into a training set and a test set. The texts included in the training set will serve as the learning data for our classifier, so they should best reflect the style of particular authors. The test set will include the investigated text as well as other texts by all authors represented in the training set. Including candidate authors in both the training and the test sets helps us measure the performance and reliability of the classification. If our classifier has learned to recognize particular candidate authors, say with 80-90% accuracy (although different values will be considered ‘good’ depending on the number of classes in an experiment), we know that it has a fairly good idea of how particular authors write and what stylistic features distinguish them – this allows us to trust that it recognizes the author of the investigated text similarly well.
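One common way to obtain such an accuracy figure is leave-one-out evaluation: each training text is in turn held out and attributed using the remaining texts, and the proportion of correct attributions is reported. The sketch below uses a minimal 1-nearest-neighbour attribution for the held-out text; the corpus is invented toy data.

```python
def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def leave_one_out_accuracy(texts):
    """texts: list of (author, feature_vector) pairs.

    Each text is held out in turn and attributed to the author of its
    nearest remaining neighbour; returns the fraction attributed correctly.
    """
    correct = 0
    for i, (author, vec) in enumerate(texts):
        rest = [t for j, t in enumerate(texts) if j != i]
        nearest_author, _ = min(rest, key=lambda t: euclidean(t[1], vec))
        correct += nearest_author == author
    return correct / len(texts)

# Toy corpus: two texts per candidate author, clearly separated styles.
texts = [
    ("A", [0.060, 0.030]), ("A", [0.058, 0.031]),
    ("B", [0.045, 0.041]), ("B", [0.047, 0.042]),
]
print(leave_one_out_accuracy(texts))  # -> 1.0
```

Only if this score is satisfactory does it make sense to trust the classifier's verdict on the disputed text itself.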

8.3 Authorship verification methods

Authorship verification deals with the question of whether any of the candidate authors is at all likely to have written a particular text. The main difference from the attribution approaches described above is that, rather than trying to guess the author, the algorithm compares pairs of texts to see whether any two of them are significantly more similar to one another than to the rest of the dataset.

The best-described approach is the General Imposters (GI) framework, first proposed by Moshe Koppel (2004; Koppel and Winter 2014) and further examined and developed by Mike Kestemont (Kestemont et al. 2016), Patrick Juola (Juola 2015), and others. As explained in Kestemont et al. (2016), “[the] general intuition behind the GI, is not to assess whether two documents are simply similar in writing style, given a static feature vocabulary, but rather, it aims to assess whether two documents are significantly more similar to one another than other documents, across a variety of stochastically impaired feature spaces (Eder 2012; Houvardas and Stamatatos 2006) and compared to random selections of so-called distractor authors (Juola 2015), also called ‘imposters’” (Kestemont et al. 2016, 88).1

To put this procedure in simple terms, it is based on running a series of classifications comparing each text in the dataset to the examined text (Target) and to a random, changing subset of candidate authors (Imposters). The final outcome is a value between 0 and 1, showing how much closer each text was to the Imposters (0) or to the Target text (1). Importantly, beyond the general ‘the higher the value, the more confident the attribution’ interpretation, the framework includes calculating “Periods of Confidence”, that is, dividing the 0-1 range into three parts: definitely not the Target text, cannot say for sure, definitely the Target text.
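The core loop of this procedure can be sketched as follows. This is a deliberately minimal illustration of the GI idea, not Kestemont's or Koppel's implementation: each trial draws a random subset of features (the 'stochastically impaired feature spaces') and a random subset of imposters, and votes for the target author if the questioned text is closer to the target than to every sampled imposter. All vectors are toy data.

```python
import random

def gi_score(questioned, target, imposters, n_trials=200, feat_frac=0.5, seed=0):
    """Minimal General Imposters sketch; returns a score in [0, 1].

    questioned, target: feature vectors; imposters: list of feature vectors.
    The score is the fraction of randomized trials in which the questioned
    text was closer to the target than to any sampled imposter.
    """
    rng = random.Random(seed)
    n_feats = len(questioned)
    k = max(1, int(feat_frac * n_feats))
    votes = 0
    for _ in range(n_trials):
        feats = rng.sample(range(n_feats), k)
        rivals = rng.sample(imposters, max(1, len(imposters) // 2))
        def dist(v):
            return sum((questioned[i] - v[i]) ** 2 for i in feats) ** 0.5
        if dist(target) < min(dist(v) for v in rivals):
            votes += 1
    return votes / n_trials

# Toy vectors: the questioned text closely resembles the target author.
questioned = [0.060, 0.030, 0.021, 0.015]
target = [0.059, 0.031, 0.022, 0.014]
imposters = [[0.045, 0.041, 0.034, 0.025],
             [0.070, 0.020, 0.010, 0.030],
             [0.050, 0.036, 0.030, 0.020]]
print(gi_score(questioned, target, imposters))  # close to 1.0
```

The Periods of Confidence are then obtained empirically, by observing which score ranges correspond to known same-author and different-author pairs in a development corpus.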

8.4 Applications of authorship attribution

As far as applications of stylometric authorship attribution methods are concerned, they are of course legion, given that the field has been active at least since the late 1950s, with important precursors reaching back into the 19th century.2 Early and pioneering work using modern statistics from the late 1950s onwards includes Grayston and Herdan (1959), Ellegård (1962), Mosteller and Wallace (1963), and Levison, Morton, and Wake (1966). Today, such applications can be found for cases of disputed or uncertain authorship in a wide range of languages, periods, and literary genres; from this wide range of applications, we can only list a somewhat arbitrary selection (ordered simply by year of publication):

  • Labbé and Labbé (2001) and Cafiero and Camps (2019) analysed French drama of the seventeenth century (the Molière-Corneille case); 
  • Binongo (2003) uses Principal Component Analysis to find out who wrote the 15th book in the Wizard of Oz series (written primarily by L. Frank Baum);  
  • van Dalen-Oskam and van Zundert (2007) investigate questions of authorship and scribal influence of the Middle Dutch Walewein;  
  • Craig and Kinney (2009) investigated several authorship cases concerning English drama of the late 16th and early 17th century (around Shakespeare, Marlowe and others); 
  • Jannidis and Lauer (2014) investigate German 19th-century novels; 
  • Juola (2015) describes the process by which J. K. Rowling’s authorship of The Cuckoo’s Calling was established; 
  • Rißler-Pipka (2016) conducts experiments regarding authorship attribution of Spanish novels of the Early Modern period;  
  • Kestemont et al. (2017) investigate several hypotheses regarding the authorship of the Wilhelmus, the Dutch folk song that is the national anthem of the Netherlands;
  • Tuzzi and Cortelazzo (2018) and Savoy (2020a) built suitable corpora and use several authorship attribution methods to attempt to identify the true author behind the contemporary Italian bestselling novels published under the name of Elena Ferrante;
  • Grieve et al. (2019) dealt with a mid-19th-century American-English letter (often attributed to Lincoln, but attributed to Hay by Grieve and colleagues); 
  • Mazurko and Walkowiak (2020) developed stylometric methods for examining the authorship of literary texts in Ukrainian; 
  • Hadjadj and Sayoud (2021) used over-sampling and PCA to deal with authorship attribution challenges in imbalanced corpora in Arabic; 
  • Ai et al. (2021) applied an LDA-Transformer model to perform authorship attribution of Chinese poetry;
  • Vega García-Luengos (2021) worked on a corpus of plays by Spanish writer Lope de Vega and proposed new attribution hypotheses; 
  • Most recently, Jungmannová and Plecháč (2022) investigated 20th-century novels in Czech (regarding Milan Kundera). 

This list, though far from complete, hopefully conveys at least some of the richness of the methods and domains of real-world authorship problems addressed with computational approaches.

8.5 Conclusion

In conclusion, it is worth noting that, while many methodological questions remain open in stylometric authorship attribution, it remains one of the oldest and most developed areas of investigation within Computational Literary Studies.


See works cited and further readings for this chapter on Zotero.

Citation suggestion

Joanna Byszuk (2023): “Analysis in Authorship Attribution”. In: Survey of Methods in Computational Literary Studies (= D 3.2: Series of Five Short Survey Papers on Methodological Issues). Edited by Christof Schöch, Julia Dudar, Evgeniia Fileva. Trier: CLS INFRA. URL: https://methods.clsinfra.io/analysis-author.html, DOI: 10.5281/zenodo.7892112.

License: Creative Commons Attribution 4.0 International (CC BY).

  1. Cited after https://computationalstylistics.github.io/blog/imposters/.↩︎

  2. For an extensive list of early and later applications, see the Stylometry Bibliography curated by Christof Schöch since 2016 and containing around 3500 entries.↩︎