6  Corpus Building for Authorship Attribution

Artjoms Šeļa (Kraków)

6.1 Introduction: Corpus building and language variation

Corpus building is a crucial step in many authorship attribution studies, because it is primarily through corpus design that researchers exercise control over the multitude of factors that might influence textual differences. This control is necessary if we are to isolate the authorial signal on the grounds of linguistic evidence. Research on language variation (e.g. Tagliamonte 2011; see also Grieve 2023) shows that variation is driven by many factors simultaneously and tends to be nested: we can detect differences between large groups of speakers (dialects), but also between different regions of the same group, different cities within a region, and different neighborhoods within a city. Similar effects have been observed empirically in written texts many times: given a large corpus, we can detect overarching genre splits (e.g. poetry vs. prose), period and gender differences, register differences, and, finally, authorial differences that can additionally be organized by the social background and education of authors. This led some researchers (C. Labbé and Labbé 2001; D. Labbé 2007) to seek a single ‘similarity scale’ for text analysis that would allow one to tell apart possible influences at different values of distance scores. It is, however, impossible to define such a scale universally across the multitude of relevant methods, languages and corpora. Instead, the dominant strategy in practice, especially in method evaluation research, is control of the corpus. Below we outline three areas where corpus preparation is most visible and most discussed: (1) chronology, (2) genre and register, and (3) document size and corpus sampling. At the end, we discuss alternative approaches to handling language variation in a corpus.

6.2 Chronology

Language is continuously changing and each generation of writers adopts a different version of it. Texts written in similar periods will tend to group together naturally, and stylistic distances between texts will tend to increase with the distance between them in time (Juola 2003; Hughes et al. 2012), with strong generational effects visible as well (Underwood et al. 2022). This creates a natural problem of difference inflation in chronologically unbalanced corpora. As Patrick Juola put it (Juola 2015, 106): “A collection of blog posts in modern English would not provide an adequate control sample for Elizabethan. […] Marlowe would still probably have written Shakespeare’s plays if the alternatives were […] bloggers for the New York Times”. In practice, it is common to limit the background corpus to the narrow time period of the text or corpus in question (e.g. Grieve 2007; Forsyth and Holmes 2018; Rebora et al. 2018a; Kocher and Savoy 2018).

Additionally, the styles of individual authors tend to have their own chronological dynamics: they change over time, and this change is traceable by stylometry. In fact, early European stylometrists at the end of the 19th century gathered around the problem of dating Plato’s dialogues rather than around attribution goals (Grzybek 2014). Studies also demonstrate that stylistic change is unequal across authors: stylistic drift might be very pronounced in some cases (Hoover 2007), but it is far from universal (Reeve 2018; Nagy 2023). The existence of stylistic drift makes a case for distinguishing early and late works in corpus preparation for authorship attribution, or calls for nuanced approaches to sampling authorial style across time (Barber 2018) in order to reduce the regular effect of chronology.

6.3 Genre and register

It has been observed many times that literary genre — at different levels of definition — has systematic effects on stylistic differences: fiction is distinct from non-fiction (Piper 2017), poetry differs from prose (Chaudhuri et al. 2018), tragedy differs from comedy (Schöch and Riddell 2014). These effects can be united under the linguistic notion of ‘register’: the specific situation under which a text was produced and the specific audience it addressed (Biber 1988). The practice of corpus normalization for register is uneven and differs in its resolution. Narrative fiction is often considered (at least tacitly) to be a uniform register in attribution and evaluation practice (e.g. Eder 2013b; Evert et al. 2017; Kocher and Savoy 2018), but there is obvious linguistic variation within fiction, too, for example as an effect of narrative perspective. Many studies strictly limit registers, e.g. working only on letters addressed to one person (Tuccinardi 2016), only on Latin historiographies (Kestemont et al. 2016) or only on newspaper columns (Grieve 2007). This illustrates the nested nature of linguistic variation, under which building a fully ‘controlled’ corpus potentially becomes an endless process of zooming into the specific conditions and constraints under which a group of texts was produced.

A special case of register difference is poetry in verse, which is linguistically very distinct from prose. The major driver of this effect is the non-arbitrary division of speech into periods (lines) and poetic meter which, by strictly governing prosody, systematically (re)organizes natural language. As a result, poetic texts in verse differ radically from prose even on the basis of a handful of the most frequent linguistic features (Chaudhuri et al. 2018; Storey and Mimno 2020). This poses a significant obstacle for any study that deals with mixed corpora, or even with early modern drama, which regularly mixes prose and verse (and the proportions of this mix can differ across dramatic genres).

In addition, different meters twist language in different ways; the form is a major confound of any quantitative claim made about poetic texts. Thus, major evaluation studies (Plecháč 2021; Nagy 2021) always work on metrically homogeneous corpora.

6.4 Size and sampling

One of the crucial questions of corpus design for authorship attribution is the question of size. How large should texts be in order to be ‘large enough’? Usually this is asked in relation to the established ‘nearest neighbors’ methodology, which is based on measuring distances between frequency distributions of the most frequent features (words, character n-grams). Empirical tests offer a range of answers: from 5000 words as a minimum in a scenario where all quantified texts are of the same size (Eder 2013a), to 1000-2000 words in a scenario where a much larger reference or training corpus is available (Eder 2017). In practice, researchers often work with texts as small as 500-2000 words. The effective sample size depends on the nature of the features used, and can be decreased further when, instead of linguistic ‘rare events’, a study uses features that are densely distributed even in short texts: these can be derived from the formal structure of the text (poetic rhythm, the phonic organization of rhyme; see Plecháč 2021; Nagy 2021). Alternatively, a range of derived summary statistics (like vocabulary richness, complexity, etc.) can also be used effectively in combination with frequency-based vectors (Weerasinghe, Singh, and Greenstadt 2021).
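
The ‘nearest neighbors’ methodology mentioned above can be illustrated with a minimal sketch of one of its most common distance measures, Burrows’s Delta: texts are represented as relative frequencies of the corpus-wide most frequent words, z-scored against the corpus, and compared by mean absolute difference. All function names and the toy documents below are hypothetical illustrations, not code from any of the cited studies.

```python
from collections import Counter
from statistics import mean, stdev

def mfw_vocab(docs, n):
    """The n most frequent words across all tokenized documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    return [w for w, _ in counts.most_common(n)]

def rel_freqs(doc, vocab):
    """Relative frequencies of the vocabulary words in one document."""
    counts = Counter(doc)
    return [counts[w] / len(doc) for w in vocab]

def burrows_delta(freqs_a, freqs_b, means, stds):
    """Mean absolute difference of z-scored MFW frequencies."""
    total = 0.0
    for a, b, m, s in zip(freqs_a, freqs_b, means, stds):
        total += abs((a - m) / s - (b - m) / s)
    return total / len(means)

# toy tokenized 'documents'
docs = [
    "the cat sat on the mat and the dog sat too".split(),
    "the dog ran and the cat ran after the dog".split(),
    "a bird flew over the house and the bird sang".split(),
]
vocab = mfw_vocab(docs, 5)
vectors = [rel_freqs(d, vocab) for d in docs]
# corpus-wide mean and standard deviation per feature, for z-scoring
means = [mean(col) for col in zip(*vectors)]
stds = [stdev(col) for col in zip(*vectors)]
```

The attribution step is then a nearest-neighbor search: the candidate whose samples have the smallest Delta to the disputed text is preferred. The size estimates cited above concern how long `docs` entries must be for such frequency vectors to stabilize.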

The question of size, again, cannot be answered universally; even the stylistic recognizability of individual authors is not even or constant (Eder 2017), and some might require larger samples than others. Forensic authorship attribution often deals with extremely short texts (letters, notes), and many frequency-based methods are simply not applicable in this domain. The whole analysis paradigm differs, as research resorts to various word n-gram tracing methods (Nini 2018; Grieve et al. 2019) or to non-linguistic attribution (graphology, material evidence).
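
The core intuition behind n-gram tracing can be sketched in a few lines: rather than comparing frequency distributions, one checks which of the disputed text’s unique word n-grams appear anywhere in each candidate author’s corpus. This is a simplified illustration of the general idea, not the exact procedure of Nini (2018) or Grieve et al. (2019); the function names are hypothetical.

```python
def word_ngrams(tokens, n):
    """Set of unique word n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(disputed, candidate_corpus, n=2):
    """Fraction of the disputed text's unique word n-grams that occur
    anywhere in a candidate author's corpus; the candidate with the
    highest overlap is preferred."""
    target = word_ngrams(disputed, n)
    pool = word_ngrams(candidate_corpus, n)
    return len(target & pool) / len(target)
```

Because the method asks only whether an n-gram occurs at all, it remains usable for very short disputed texts where frequency estimates would be meaningless.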

The question of document size is part of a larger problem of uneven text representation. In the vast majority of cases, stylometry is not taking samples of style from authors in controlled conditions, but is sampling the historical record of style, which is, more often than not, uneven. The multivariate comparison of large texts alongside small texts introduces biases and can provide unreliable results (Moisl 2011). The common solution to this problem is to use various sampling strategies, like ‘downsampling’ a corpus to its smallest text to make comparisons even (oversampling, which instead generates synthetic data points to match the larger samples, is exceedingly rare; e.g. Feng and Hirst 2013).
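
Downsampling of this kind is straightforward to implement: every document is reduced to the length of the shortest one before any frequencies are computed. A minimal sketch (the function name and the fixed seed are illustrative assumptions):

```python
import random

def downsample(docs, seed=42):
    """Reduce every tokenized document to the length of the shortest
    one by drawing tokens without replacement, so that all documents
    contribute equally sized samples to the comparison."""
    rng = random.Random(seed)
    target = min(len(d) for d in docs)
    return [rng.sample(d, target) for d in docs]
```

The trade-off is visible in the code: all information in the larger documents beyond `target` tokens is discarded, which is precisely the information loss that oversampling strategies try to avoid.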

Moisl (2011), however, approached the problem of document length from the direction of the behavior of individual features. He introduced a screening method that determines the minimal ‘reliable’ sample size for each of the features used (the original study was based on English bigrams). Based on a normality assumption and the properties of the binomial distribution, he adapted a sample size function that estimates, at a given confidence threshold, the document size required to reliably represent the probability of a given feature. This allows the corpus and feature space to be culled non-arbitrarily, balancing the number of features against the number of documents used. In turn, this can avoid bootstrapping or extensive sampling strategies and can be useful in unsupervised scenarios (Cafiero and Camps 2019), but it is currently under-tested.
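
The standard binomial sample-size estimate that this kind of screening adapts can be written down directly. The sketch below uses the textbook normal approximation with assumed confidence and error parameters; Moisl’s own thresholds and derivation details may differ.

```python
from math import ceil

def min_sample_size(p, error, z=1.96):
    """Minimum document length (in tokens) for the observed relative
    frequency of a feature with true probability p to fall within
    +/- error of p, at the confidence level implied by the z-score
    (z = 1.96 for ~95%), via the normal approximation to the binomial:
        n = z^2 * p * (1 - p) / error^2
    """
    return ceil(z ** 2 * p * (1 - p) / error ** 2)
```

For example, a feature with a true probability of 0.01 (roughly the frequency of a mid-ranked function word) needs documents of about 1500 tokens before its observed frequency is reliable to within ±0.005 at 95% confidence; rarer features require far longer documents, which is what motivates culling them.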

It is also common to combine smaller texts of known authorship into even samples (Plecháč 2021; Rebora et al. 2018a, 2018b). Random samples that combine different sources can be particularly effective, since they draw stylistic information that is not stratified by any artificial condition (e.g. one text, one chapter, one theme). Eder (2013b) has shown for 19th-century English novels that random sampling, which draws evenly from the available bag of words, performs much better than consecutive samples, or ‘chunks’, of novels, which may be too dependent on local context to represent style effectively.
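
The contrast between the two sampling strategies is easy to state in code: consecutive chunks preserve local context (and its thematic noise), while random bag-of-words samples deliberately discard it. A minimal sketch, with hypothetical function names, of the two strategies compared by Eder (2013b):

```python
import random

def chunk_samples(tokens, size):
    """Consecutive, non-overlapping chunks of the token stream."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]

def random_samples(tokens, size, k, seed=42):
    """k samples of `size` tokens each, drawn without replacement
    from the whole bag of words, deliberately ignoring local context."""
    rng = random.Random(seed)
    return [rng.sample(tokens, size) for _ in range(k)]
```

A chunk taken from a single battle scene will over-represent that scene’s vocabulary; a random sample spreads the draw across the whole text (or several texts), which is why it tends to represent authorial style more evenly.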

6.5 Beyond corpus control

In an authorship attribution evaluation study, Jack Grieve writes that “the anonymous text is the product of a single situation and so each author-based corpus should be composed of texts produced in the most similar register, for the most similar audience, and around the same point in time as the anonymous text” (Grieve 2007, 255). Obtaining such a fully controlled corpus is increasingly demanding: one can add the preference for texts to be of similar length, or to have been produced, typeset, and machine-recognized under similar conditions. John Burrows noted that uniform corpus representation is in many cases unrealistic and will always come with an information trade-off: “we need to determine what is most appropriate, accepting only such limitations as we must, and resisting them when we can” (Burrows 2007, 46). There is also the concern of the generalizability of methods: if most evaluation studies are made using highly specific slices of language, then how can we be sure they will perform similarly in any other scenario? At the extreme level of granularity, each text can be viewed as its own ‘register’, written under conditions that never existed before and will never exist again. Here, a ‘laboratory’ control of textual conditions becomes an unreachable utopia.

The concerns and pressures of appropriate corpus construction created a sub-branch of stylometry devoted to cross-genre or cross-register authorship attribution, with corpora constructed specifically to be heterogeneous (Kestemont 2012; Barlas and Stamatatos 2020; Wang, Xie, and Riddell 2021). Here linguistic variation is taken as a methodological challenge, not solely as an issue of corpus design. For example, ‘unmasking’ techniques (Koppel and Schler 2004) test the stability of the authorial signal by iteratively removing the most distinctive features that might drive surface differences between samples (i.e. those induced by different genres or themes). Attempts have also been made to ‘control’ for various stylistic influences a posteriori, penalizing genre or chronological signals in a corpus (Calvo Tello 2017; Underwood and So 2021). Authorship verification problems, which ask about the likelihood of sample X coming from author A, are conceptually defined as ‘one-class’ classification problems, even if, in practice, they are usually set up as multi-class (Halvani, Winter, and Graner 2019). The minimal context required to solve a verification question is just the writings of one author, and there are notable attempts to remove the control corpus (and ‘distractor authors’) from the analysis altogether, focusing all inference on the information coming from the stylistic behavior of same-author texts (Noecker Jr and Ryan 2012; Halvani, Winter, and Graner 2019). In the end, we see that the language variation which fuels the concern around the corpus in stylometry can either be embraced and recognized, or bracketed out by radically minimizing the surrounding context.
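
The logic of unmasking can be sketched in a simplified form: measure how separable two sets of feature vectors are, remove the features that separate them most, and repeat, recording a degradation curve. A shared authorial signal should degrade slowly; surface genre or theme differences collapse quickly. The sketch below substitutes a simple difference-of-means separation score for the SVM cross-validation accuracy used by Koppel and Schler (2004), and all names are hypothetical.

```python
def unmasking_curve(group_a, group_b, rounds=3, drop_k=1):
    """Toy unmasking sketch: at each round, score how far apart two
    groups of frequency vectors are (summed absolute difference of
    per-feature means, a stand-in for classifier accuracy), then
    remove the drop_k most discriminating features and repeat."""
    a = [list(v) for v in group_a]
    b = [list(v) for v in group_b]
    curve = []
    while len(a[0]) > drop_k and len(curve) < rounds:
        mean_a = [sum(col) / len(col) for col in zip(*a)]
        mean_b = [sum(col) / len(col) for col in zip(*b)]
        gaps = [abs(x - y) for x, y in zip(mean_a, mean_b)]
        curve.append(sum(gaps))
        # drop the indices of the drop_k most discriminating features
        drop = set(sorted(range(len(gaps)), key=gaps.__getitem__,
                          reverse=True)[:drop_k])
        keep = [i for i in range(len(gaps)) if i not in drop]
        a = [[v[i] for i in keep] for v in a]
        b = [[v[i] for i in keep] for v in b]
    return curve
```

The diagnostic is the shape of the returned curve, not any single score: a steep drop after removing a handful of features suggests the apparent difference between the groups was superficial.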


See works cited and further readings on Zotero.

Citation suggestion

Artjoms Šeļa (2023): “Corpus Building for Authorship Attribution”. In: Survey of Methods in Computational Literary Studies (= D 3.2: Series of Five Short Survey Papers on Methodological Issues). Edited by Christof Schöch, Julia Dudar, Evgeniia Fileva. Trier: CLS INFRA. URL: https://methods.clsinfra.io/corpus-author.html, DOI: 10.5281/zenodo.7892112.

License: Creative Commons Attribution 4.0 International (CC BY).