7  Annotation for Authorship Attribution

Julia Dudar (Trier)

Generally speaking, stylometric authorship attribution relies primarily on unannotated texts, simply using word forms as the fundamental feature. However, even this requires tokenization. In addition, some studies, especially when working with short texts, aim to use more information than just word forms for stylometric analysis.

Please note that the remarks on OCR and spelling normalization explained in the chapter “General issues in Preprocessing and Annotation” (Chapter 2) are also relevant for authorship attribution. Indeed, if texts written by different authors come from different sources, they may have undergone systematically different editorial preparation, including spelling normalization or modernization, which can interfere with authorship attribution.

7.1 Tokenization

In most authorship attribution studies, only relatively basic processing of the character string is performed, notably tokenization in order to identify individual word forms. The reason for this is that most stylometric authorship attribution operates at the level of individual word frequencies. Tokenization is a genuine challenge, however, for ideographic scripts such as Chinese, where one word (in the sense of a semantic unit that would usually be translated as a single word in alphabetic languages) typically consists of 1-3 characters and word boundaries are not marked by spaces.
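As a concrete illustration, tokenization and word-form counting can be sketched in a few lines of Python. The regular expression used here is a deliberately simple assumption; real studies typically rely on language-specific tokenizers with more elaborate rules for clitics, hyphens, and abbreviations.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into word-form tokens.

    A minimal regex-based sketch: runs of letters, optionally joined
    by apostrophes or hyphens; nothing language-specific is handled.
    """
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())

# Word-form frequencies, the basic feature set of most stylometric studies
tokens = tokenize("The cat sat on the mat. The dog didn't care.")
frequencies = Counter(tokens)  # e.g. frequencies["the"] == 3
```

The resulting frequency table is the input to standard distance-based methods such as Burrows's Delta.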

7.2 N-Gram creation

Only a relatively small number of studies have used n-grams at the word or character level as an alternative to single word forms (unigrams). One prominent proponent of n-gram methods for authorship attribution is Brian Vickers. He has authored a number of works on authorship attribution in Early Modern drama whose method rests on rare matches of 3- to 6-word n-grams between a corpus of plays by a target author and a play of unknown authorship (Vickers 2008, 2011, 2012). His main hypothesis is that word n-grams are more suitable for authorship attribution tasks than single tokens, because authors tend to choose semi-constructed word combinations (Vickers 2011).

The usefulness of word n-grams for authorship attribution was also demonstrated by Antonia, Craig, and Elliott (2014). For their analysis, the authors evaluated two n-gram methods: “strict n-grams”, based on straightforward sequences of 1-5 words, and “skip n-grams”, which involves the omission of function words. As corpora they used a collection of English Renaissance plays and a collection of articles written for Victorian periodicals. Their analysis revealed that, with the strict n-gram method, 3-grams achieved the best results in various evaluation scenarios for both corpora, whereas with the skip n-gram method, 4-grams performed best.
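The two variants can be sketched as follows. The function-word list is a hypothetical placeholder, and treating “skip n-grams” as n-grams computed over the sequence with function words removed is one plausible reading of the method described above, not a reconstruction of the authors' exact implementation.

```python
def ngrams(tokens, n):
    """All contiguous ("strict") n-grams over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Illustrative, hypothetical function-word list
FUNCTION_WORDS = {"the", "a", "an", "of", "on", "and", "to", "in"}

def skip_ngrams(tokens, n, stoplist=FUNCTION_WORDS):
    """N-grams formed after omitting function words: one reading of
    the 'skip n-gram' method described above."""
    content = [t for t in tokens if t not in stoplist]
    return ngrams(content, n)

tokens = "the quality of mercy is not strained".split()
strict = ngrams(tokens, 3)       # trigrams over all tokens
skipped = skip_ngrams(tokens, 3) # trigrams over content words only
```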

However, other authors do not support this idea and have obtained different results using the n-gram method (see Craig and Kinney 2009; Jackson 2010; Hoover 2015). David Hoover (2015), for instance, replicated tests of the Vickers method using 3- to 6-word n-grams on Victorian drama. His results demonstrated that frequent single words were the most effective indicators of an author’s style, while the Vickers method proved unsuccessful in his experiment. He subsequently expanded his research (Hoover 2018) to assess not only rare 3- to 6-word n-grams, but also character 2-grams, 3-grams, and 4-grams, as well as word 2-grams, 3-grams, and 4-grams. The outcomes revealed that frequent single words still yielded the best results. Although character 2-grams performed relatively well at times, they still produced inferior results compared to frequent words. Longer sequences displayed much weaker results for both character and word n-grams. A similar study, in a cross-language setup, was conducted by Eder (2011).
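Character n-grams of the kind tested in these studies are straightforward to extract. The sketch below computes them over the raw string, spaces included; individual studies differ in how they treat spaces, punctuation, and case.

```python
def char_ngrams(text, n):
    """Overlapping character n-grams of a string, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

char_ngrams("cats", 2)  # ['ca', 'at', 'ts']
```

Frequencies of such n-grams can then be compared across texts in exactly the same way as word frequencies.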

López-Escobedo et al. (2013) combined function-word and content-word n-grams with various other features (see the following sections). They extracted both types of n-grams as straightforward sequences and as sequences with a gap. In terms of length, they used unigrams, bigrams, and trigrams.

7.3 Lemmatization or POS-Tagging

Even less frequently, linguistic annotation is used. There are some instances where lemmatization is employed (Labbé and Labbé 2001; Eder and Górski 2023). Sometimes, POS-tagging is used, usually in combination with n-gram calculation, because POS tags alone reduce the feature set to a very small number of categories, while forming n-grams multiplies the number of distinct features again. POS-tagging can also be used to disambiguate word forms, as is routinely done by Hugh Craig in the “Intelligent Archive” (Craig and Whipp 2010). Besides word n-grams, López-Escobedo et al. (2013) also performed authorship attribution combining POS n-grams with other features, with lengths ranging from unigrams to trigrams.
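A POS n-gram feature of this kind can be sketched as follows. The sentence is hand-tagged here with Universal POS tags purely for illustration; in real studies the tags would come from an automatic tagger.

```python
# A short sentence already annotated with (word, POS) pairs; the tags
# follow the Universal POS tagset and are assumed to come from an
# external tagger.
tagged = [("the", "DET"), ("old", "ADJ"), ("man", "NOUN"),
          ("saw", "VERB"), ("the", "DET"), ("boat", "NOUN")]

def pos_ngrams(tagged_tokens, n):
    """Contiguous n-grams over the POS tags of a tagged sentence."""
    tags = [pos for _, pos in tagged_tokens]
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

pos_ngrams(tagged, 2)
# [('DET', 'ADJ'), ('ADJ', 'NOUN'), ('NOUN', 'VERB'),
#  ('VERB', 'DET'), ('DET', 'NOUN')]
```

Note how six word tokens collapse into a handful of tag categories, which is why forming n-grams is needed to regain a usefully large feature set.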

7.4 Syntactic annotation

In some rare cases, more sophisticated linguistic annotation, including dependencies or other syntactic information, is used in stylometric authorship attribution. Cinkova and Rybicki (2020) developed an approach that enables the comparison of an original text and its translation in order to identify an author’s style across languages. They constructed a parallel corpus of Czech-German literary texts and enriched the texts with Universal Dependencies, a framework for consistent annotation of parts of speech, morphological characteristics, and syntactic dependencies across diverse languages. Despite this, the method’s effectiveness remained limited. The authors then associated a shared pseudolemma with each annotated lemma in both languages, which resulted in a clear improvement, pushing performance up to 95.6%.
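To give a sense of what such annotations look like, the sketch below reads a minimal hand-written CoNLL-U sentence (the tab-separated format used by Universal Dependencies) and extracts lemma, POS, and dependency-relation features. In practice the annotation would be produced by a parser such as UDPipe, and the example sentence here is invented for illustration.

```python
# One hand-annotated CoNLL-U sentence ("she reads novels"); columns are
# ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
rows = [
    ["1", "she", "she", "PRON", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "reads", "read", "VERB", "_", "_", "0", "root", "_", "_"],
    ["3", "novels", "novel", "NOUN", "_", "_", "2", "obj", "_", "_"],
]
conllu = "\n".join("\t".join(row) for row in rows)

def deprel_features(conllu_sentence):
    """Extract (lemma, UPOS, dependency relation) triples from one
    CoNLL-U sentence; the relation labels can then serve as
    stylometric features alongside words and POS tags."""
    triples = []
    for line in conllu_sentence.strip().splitlines():
        cols = line.split("\t")
        triples.append((cols[2], cols[3], cols[7]))
    return triples

deprel_features(conllu)
# [('she', 'PRON', 'nsubj'), ('read', 'VERB', 'root'), ('novel', 'NOUN', 'obj')]
```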

7.5 Other features

Equally rarely, additional features beyond standard linguistic annotation are used in authorship attribution. An example is Jacobs and Kinder (2020), who also used stylistic features such as the type-token ratio as well as content-oriented features such as sentiments. Another rare case is Suzuki et al. (2012), who used co-occurrence-based features.

An interesting study was conducted by Gómez-Adorno et al. (2018), where the authors evaluated a variety of stylometric features categorized into three groups: phraseological, punctuational, and lexical. It is important to note that they avoided using typical stylometric features such as word frequency and instead incorporated features like lexical diversity, average word length, average sentence length, standard deviation of sentence length, average paragraph length, standard deviation of paragraph length, document length, and punctuation marks. Similar additional stylometric features, including the distribution of word-length frequencies, the type-token ratio, and the hapax legomena count, were also addressed by López-Escobedo et al. (2013).
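Several of the document-level features listed above can be computed without any linguistic annotation at all. The sketch below is a naive version: the word regex and sentence splitting on end punctuation are crude assumptions, not the procedures used in the cited studies.

```python
import re
import statistics

def surface_features(text):
    """Compute three of the document-level features mentioned above:
    type-token ratio, average word length, and average sentence
    length in words. Sentence splitting on .!? is a crude
    approximation that mishandles abbreviations."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "type_token_ratio": len(set(w.lower() for w in words)) / len(words),
        "avg_word_length": statistics.mean(len(w) for w in words),
        "avg_sentence_length": len(words) / len(sentences),
    }
```

Note that the type-token ratio is sensitive to text length, which is why comparisons are usually made over samples of equal size.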

7.6 Conclusion

Summarizing the provided information, we can conclude that although function- or content-word frequency is a popular, reliable, and well-performing feature for authorship attribution, it may in some contexts be beneficial to incorporate additional features into the research design. Depending on the language of the corpus and the authorial style, word n-grams, POS n-grams, or even syntactic dependencies can aid in identifying unique authorial styles or other previously unseen characteristics in literary works.


See works cited and further readings on Zotero.

Citation suggestion

Julia Dudar (2023): “Annotation for Authorship Attribution”. In: Survey of Methods in Computational Literary Studies (= D 3.2: Series of Five Short Survey Papers on Methodological Issues). Edited by Christof Schöch, Julia Dudar, Evegniia Fileva. Trier: CLS INFRA. URL: https://methods.clsinfra.io/annotation-author.html, DOI: 10.5281/zenodo.7892112.

License: Creative Commons Attribution 4.0 International (CC BY).