What is Authorship Attribution?

Byszuk, Joanna

doi:10.5281/zenodo.7892112

Joanna Byszuk (Kraków)

5.1 Introduction

Authorship attribution can be simply explained as a study whose goal it is to answer the question: who is the author of some text we examine?

Nowadays, computational authorship attribution is one of the more popular tasks in Natural Language Processing and Computational Literary Studies, and arguably a subdiscipline in its own right. It is also called non-traditional or stylometric authorship attribution, or stylometry for short, to distinguish it from the more long-standing philological methods of authorship attribution. Its roots, however, reach deep into the past. The questions “Is this text authentic? Was it really written by this person?” and “Who could have authored this unsigned or otherwise anonymous text?” have probably accompanied humans since the beginning of written communication. In fact, some of the early approaches to authorship attribution that we know of use methods very much like the ones we apply today. For example, in 1851, Augustus De Morgan suggested that Biblical authors could be identified by features of their writing, an idea that was also used in early stylometric ventures dedicated to studying the chronology of texts, verifying whether a text was falsified, etc.

A similar, feature-based approach was used by Frederick Mosteller and David Wallace in their study (Mosteller and Wallace 1963) of The Federalist Papers, a famous collection of essays published by three American Founding Fathers, Alexander Hamilton, James Madison, and John Jay, to promote the new Constitution. Using the frequencies of selected words from the essays and analyzing them with a Bayesian method, they identified the author of the twelve papers whose authorship had previously been disputed. While their method had its faults and better ones have been developed since, Mosteller and Wallace are and will be remembered as pioneers of computational authorship attribution.
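The idea of weighing word frequencies as evidence for one author over another can be illustrated with a minimal sketch. It is not a reconstruction of Mosteller and Wallace's analysis, which was considerably more elaborate; it assumes a plain Poisson model for each marker word and simple additive smoothing. The function names, the marker words, and the toy texts are all invented for illustration.

```python
import math
import re


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def rate(texts, word, smoothing=0.5):
    """Smoothed per-token rate of `word` in an author's known texts.

    The smoothing constant is an assumption; it keeps the rate above zero
    for words the author never uses in the sample.
    """
    tokens = [tok for text in texts for tok in tokenize(text)]
    return (tokens.count(word) + smoothing) / (len(tokens) + smoothing)


def poisson_log_odds(disputed, texts_a, texts_b, markers):
    """Sum of per-word Poisson log-likelihood ratios.

    A positive value favours author A, a negative value favours author B.
    """
    tokens = tokenize(disputed)
    n = len(tokens)
    total = 0.0
    for word in markers:
        k = tokens.count(word)           # observed count in the disputed text
        lam_a = n * rate(texts_a, word)  # expected count if A wrote it
        lam_b = n * rate(texts_b, word)  # expected count if B wrote it
        # log Poisson(k; lam_a) - log Poisson(k; lam_b); the k! terms cancel.
        total += k * (math.log(lam_a) - math.log(lam_b)) - (lam_a - lam_b)
    return total


# Invented toy data: author A prefers "upon", author B prefers "whilst".
texts_a = ["upon the whole the plan depends upon the people",
           "the states rely upon the union"]
texts_b = ["whilst the plan is sound the people decide",
           "the union holds whilst the states agree"]
markers = ["upon", "whilst"]
disputed = "the people decide whilst the union holds"

score = poisson_log_odds(disputed, texts_a, texts_b, markers)
print("favours A" if score > 0 else "favours B", round(score, 3))
```

With real data, the marker words would be chosen from words that the candidates demonstrably use at different rates, and the known samples would need to be far longer than these toy sentences for the rates to be meaningful.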

5.2 Authorship Attribution or Verification?

There are two distinguishable types of authorship examination – attribution and verification (Koppel, Schler, and Argamon 2009).

Authorship attribution typically concerns so-called closed-set problems, that is, inquiries in which the real author must be one of a finite, known set of candidates. The Federalist Papers are the perfect example: a collection of 85 articles and essays written by three Founding Fathers whose names are well known to us, Alexander Hamilton, James Madison, and John Jay. While an extremely skeptical person could argue that it is impossible to know whether they had another, anonymous co-writer unknown to history, no evidence points to such a scenario, and it is safe to assume that the 12 essays of long-disputed authorship must have been written by one of these three. In the case of closed sets, the focus is on distinguishing between the individual candidates’ stylistic fingerprints and finding out which of them best fits the disputed text.
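A closed-set decision of this kind can be sketched as a nearest-candidate search. The example below uses a Burrows-style Delta, one of the family of measures discussed by Evert et al. (2017): word frequencies are standardized to z-scores, and the candidate whose profile has the smallest mean absolute z-score difference from the disputed text is chosen. This is a simplified sketch under several assumptions: each candidate is represented by a single text, the standard deviation is taken across the candidate profiles only, and the function names and toy texts are invented.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def rel_freqs(text):
    """Relative frequency of every word in a text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}


def attribute(disputed, candidates, n_words=50):
    """Return (closest_candidate, distances) for a closed set of candidates.

    `candidates` maps an author label to one known text by that author.
    """
    profiles = {author: rel_freqs(text) for author, text in candidates.items()}

    # Pick the most frequent words across all candidate profiles.
    totals = Counter()
    for profile in profiles.values():
        totals.update(profile)
    vocabulary = [word for word, _ in totals.most_common(n_words)]

    target = rel_freqs(disputed)
    distances = dict.fromkeys(profiles, 0.0)
    used = 0
    for word in vocabulary:
        values = [profile.get(word, 0.0) for profile in profiles.values()]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # the word does not discriminate between candidates
        used += 1
        z_target = (target.get(word, 0.0) - mu) / sigma
        for author, profile in profiles.items():
            z_author = (profile.get(word, 0.0) - mu) / sigma
            distances[author] += abs(z_author - z_target)

    if used == 0:
        raise ValueError("no word distinguishes the candidates")
    distances = {author: d / used for author, d in distances.items()}
    return min(distances, key=distances.get), distances


# Invented toy texts standing in for three candidates' known writings.
candidates = {
    "A": "upon the whole the plan depends upon the people and upon the union",
    "B": "whilst the plan is sound the people decide whilst the union holds",
    "C": "although a plan may fail a people may still keep a union",
}
disputed = "the union holds whilst the people decide"

best, distances = attribute(disputed, candidates)
print(best, distances)
```

Note that the function always returns one of the candidates, however poor the fit. That is appropriate only when the closed-set assumption actually holds.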

In contrast, authorship verification usually concerns so-called open-set problems, that is, inquiries in which there is some suspicion as to the possible authors, usually based on philological evidence, but the possibility that the real author is not included in the corpus cannot be excluded. This kind of situation can result from many factors, such as a large number of possible writers (especially in under-researched problems), a scarcity of available data, and, finally, the possibility that no other texts by the real author have been preserved (or that the author hid themselves completely behind a pseudonym) and so cannot be included in the reference corpus.
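The difference from the closed-set case is that a verification procedure must be able to answer "none of the above". A minimal sketch of that idea follows. It borrows only the Ruzicka (min-max) similarity named in the title of Kestemont et al. (2016); averaging over the known texts and accepting above a fixed threshold are illustrative assumptions, not that paper's procedure. The function names and the default threshold of 0.5 are invented, and a real threshold would have to be calibrated on texts of known authorship.

```python
import re
from collections import Counter


def rel_freqs(text):
    """Relative frequency of every word in a text (empty dict if no words)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}


def ruzicka_similarity(p, q):
    """Ruzicka (min-max) similarity of two frequency profiles, in [0, 1]."""
    words = set(p) | set(q)
    numerator = sum(min(p.get(w, 0.0), q.get(w, 0.0)) for w in words)
    denominator = sum(max(p.get(w, 0.0), q.get(w, 0.0)) for w in words)
    return numerator / denominator if denominator else 0.0


def verify(disputed, known_texts, threshold=0.5):
    """Return (accepted, score) for one candidate author.

    `score` is the mean similarity between the disputed text and each of the
    candidate's known texts; `threshold` is a hypothetical cut-off.
    """
    if not known_texts:
        raise ValueError("at least one known text is required")
    target = rel_freqs(disputed)
    scores = [ruzicka_similarity(target, rel_freqs(t)) for t in known_texts]
    score = sum(scores) / len(scores)
    return score >= threshold, score


# Invented toy texts.
known = ["whilst the plan is sound the people decide",
         "the union holds whilst the states agree"]
accepted, score = verify("the people decide whilst the union holds", known)
print(accepted, round(score, 3))
```

If the score falls below the threshold for every suspected author, the sketch rejects all of them, which corresponds to the open-set outcome that the real author is not in the corpus.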

While both have been applied in numerous stylometric investigations, attribution has received significantly more attention than verification, both in applications (starting with Mosteller and Wallace 1963) and in reviews of methods (Grieve 2007; Stamatatos 2009; Evert et al. 2017). Verification has developed more intensively only in the last decade or so (e.g. Koppel and Schler 2004; Kestemont et al. 2016; Halvani, Winter, and Graner 2019).

5.3 Tools for Authorship Attribution

The most widely used tool in stylometric authorship attribution is stylo (Eder, Rybicki, and Kestemont 2016), a package for R. It offers users the choice between a simple graphical user interface and a command-line interface, and it implements most currently used methods of stylometric authorship attribution, many of which can also be used in applications beyond authorship attribution. It is continually developed and has been popularized through more than a decade’s worth of introductory and advanced workshops given by its developers. Another well-established tool for authorship attribution is JGAAP (Java Graphical Authorship Attribution Program).

References

See works cited and further readings on Zotero.

Citation suggestion

Joanna Byszuk (2023): “What is Authorship Attribution?”. In: Survey of Methods in Computational Literary Studies (= D 3.2: Series of Five Short Survey Papers on Methodological Issues). Edited by Christof Schöch, Julia Dudar, Evgeniia Fileva. Trier: CLS INFRA. URL: https://methods.clsinfra.io/what-author.html, DOI: 10.5281/zenodo.7892112.

License: Creative Commons Attribution 4.0 International (CC BY).

Eder, Maciej, Jan Rybicki, and Mike Kestemont. 2016. “Stylometry with R: A Package for Computational Text Analysis.” The R Journal 8 (1): 107. https://doi.org/10.32614/RJ-2016-007.

Evert, St., Thomas Proisl, Fotis Jannidis, Isabella Reger, Steffen Pielström, Christof Schöch, and Thorsten Vitt. 2017. “Understanding and Explaining Delta Measures for Authorship Attribution.” Digital Scholarship in the Humanities 32 (suppl_2). https://doi.org/10.1093/llc/fqx023.

Grieve, Jack. 2007. “Quantitative Authorship Attribution: An Evaluation of Techniques.” Literary and Linguistic Computing 22 (3): 251–70. https://doi.org/10.1093/llc/fqm020.

Halvani, Oren, Christian Winter, and Lukas Graner. 2019. “Assessing the Applicability of Authorship Verification Methods.” arXiv:1906.10551 [Cs, Stat], June. https://doi.org/10.1145/3339252.3340508.

Kestemont, Mike, Justin Stover, Moshe Koppel, Folgert Karsdorp, and Walter Daelemans. 2016. “Authorship Verification with the Ruzicka Metric.” In Digital Humanities Conference 2016 (DH2016) Book of Abstracts. Krakow: ADHO. https://dh-abstracts.library.virginia.edu/works/2542.

Koppel, Moshe, and Jonathan Schler. 2004. “Authorship Verification as a One-Class Classification Problem.” In Twenty-First International Conference on Machine Learning - ICML ’04, 62. Banff, Alberta, Canada: ACM Press. https://doi.org/10.1145/1015330.1015448.

Koppel, Moshe, Jonathan Schler, and Shlomo Argamon. 2009. “Computational Methods in Authorship Attribution.” Journal of the American Society for Information Science and Technology 60 (1): 9–26. https://doi.org/10.1002/asi.20961.

Mosteller, Frederick, and David L. Wallace. 1963. “Inference in an Authorship Problem.” Journal of the American Statistical Association 58 (302): 275–309. http://www.jstor.org/stable/2283270.

Stamatatos, Efstathios. 2009. “A Survey of Modern Authorship Attribution Methods.” Journal of the American Society for Information Science and Technology 60 (3): 538–56. https://doi.org/10.1002/asi.21001.