Artjoms Šeļa (Kraków)
18.1 The Trend Line
If there is one method that unites the diverse body of work in computational literary history, it must be the trend line. A plot with a line tracing change (or continuity) in some derived textual or bibliographical feature over time has become the central device for making arguments and telling coherent stories about literary history at large and small scales. Examples include the decline of abstract lexicon in fiction (Heuser and Le-Khac 2012), continuity in genre classification strength over centuries (Underwood 2017), long-term stability of Ancient Greek literary style (Storey and Mimno 2020), the rise of dialogue share (Sobchuk 2016) and ‘dialogism’ scores (Muzny, Algee-Hewitt, and Jurafsky 2017), an increase in linguistic repetition (Gemma, Glorieux, and Frédéric 2015), the expansion and saturation of literary and publishing markets (e.g. Bode 2012), an increase in character network complexity (Krautter 2020), the fall of the dramatic protagonist (Algee-Hewitt 2017), and numerous keyword and topic trajectories (through one work, one author’s lifetime, or a large corpus). These studies rely on different information sources, text analysis procedures, and historical and theoretical assumptions. What unites them is time, which consistently appears as the X-axis along which different trends, bins, lines and curves unfold.
Inferring the trend in data (the average chronological direction of whatever variable currently sits on the Y-axis) is routinely done in computational literary studies with linear regression models that estimate the linear relationship between a response variable and time (y ~ x; Feature ~ Time). A regression line is then superimposed on empirical observations: an upward slope shows an average linear rise of the Feature per unit of time, a downward slope an average linear decline. When data follow clearly non-linear trajectories, researchers usually engage in curve-fitting: they use exponential functions (Heuser and Le-Khac 2012) or polynomials of different degrees (Gemma, Glorieux, and Frédéric 2015; Trilcke et al. 2016); they rely on smoothing functions (often unspecified) that follow changes in data locally, such as Generalized Additive Models or local regression (Underwood and Sellers 2012; Underwood, Bamman, and Lee 2018; Krautter 2020; Pianzola, Acerbi, and Rebora 2020), or on rolling means or medians (Erlin 2017; Sharmaa et al. 2020); they look at averages over binned time in 2-, 5-, or 10-year bins (Bode 2012; Sobchuk 2016); or they eyeball the empirical distribution of features through time.
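The two most common devices mentioned above, a linear fit of Feature ~ Time and averages over binned time, can be sketched in a few lines. This is only a minimal illustration using ordinary least squares written out by hand; the function names and the toy values for dialogue share are invented for the example and do not come from any of the cited studies.

```python
from statistics import mean

def ols_trend(years, values):
    """Ordinary least squares fit of values ~ years; returns (intercept, slope)."""
    x_bar, y_bar = mean(years), mean(values)
    sxx = sum((x - x_bar) ** 2 for x in years)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, values))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def binned_means(years, values, width=10):
    """Average the feature inside consecutive bins of `width` years."""
    bins = {}
    for x, y in zip(years, values):
        bins.setdefault(x - x % width, []).append(y)
    return {start: mean(ys) for start, ys in sorted(bins.items())}

# Toy data (invented for illustration): share of dialogue per novel.
years = [1800, 1805, 1812, 1818, 1824, 1831, 1837, 1843]
dialogue_share = [0.20, 0.22, 0.21, 0.25, 0.27, 0.26, 0.30, 0.31]

intercept, slope = ols_trend(years, dialogue_share)  # slope = change per year
decades = binned_means(years, dialogue_share, width=10)
```

A positive `slope` corresponds to the upward regression line described above, while `decades` holds one average per 10-year bin. Both summaries discard the spread of individual observations, which is the concern raised later in this section.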
What is central to the vast majority of the studies surveyed is that linear models and smoothing functions are used as a simple trend-revealing technique, almost as a rhetorical device (‘see, this is clearly what is happening with this data in time’) rather than a statistical one. Indeed, literary history and causal inference are yet to stand together. With few exceptions, explicit statistical modeling that tries to tease out the different factors influencing the distribution of a feature of interest happens, if it happens at all, outside of the dimension of time. So, Long, and Zhu (2018) use logistic regression to show a higher probability of Black writers using Bible citation in a social context; Koolen (2018) in her thesis models factors (gender, genre, being a translation) that might influence people’s judgement of literary quality; Manjavacas, Karsdorp, and Kestemont (2020) build a bivariate model (one with two outcome variables) with group (or ‘random’) effects to estimate the relationship between quotation use in Patrologia Latina and the surrounding topical context.
Exceptions include early work in literary sociology by van Rees and Vermunt (1996): the authors use discrete-time event history models (also known as ‘survival’ models) to understand writers’ debuts and the factors that shape their reputation, measured by the number of reviews. Essentially, they build a logistic regression for estimating the probability of an increase in the number of reviews, in which each variable is allowed to change state at discrete events (i.e. the publication of a new book). The advantage of a survival model (which also pushes the authors to bin count variables, like the number of reviews, into discrete categories) over a simpler multiple regression is not apparent from the study, but this research embraces a complex modeling of change that is unusual in the current CLS landscape.
More recently, Jockers (2013) and Underwood et al. (2022) treated time not as a neutral dimension that carries (and reveals) change, but as a factor in the data. Jockers fit hundreds of independent linear regressions (one per predictor per textual feature) and tabulated statistically significant results (based on p-values) to estimate the extent to which different factors (time, gender, genre, author) can describe observed similarities between texts. Underwood et al. (2022) used a similarly designed barrage of linear models (relying, instead, on the measure of explained variance, \(R^{2}\)) to argue that the birth year, or generation, of authors explains change in literary topics better than publication year, thus revealing a cohort effect in literary history. Additionally, their study asks how the use of topics changes during individual careers. Do authors continuously ‘update’ their topic use through their lifetimes, or do they use topics somewhat consistently, according to some random baseline? Their research compares the two processes with structural equation models that treat books as discrete events (similarly to van Rees and Vermunt). On average, they find that topics driven by generation effects tend to recur in authors without a clear trend in usage, while topics resulting from specific historical times (war and other localized events) tend to be updated, revealing rising and falling patterns.
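The logic of comparing predictors by explained variance can be illustrated with a deliberately small sketch. It is not the design of either study: it assumes a single topic, simple one-predictor regressions, and invented records in which topic share was constructed to follow the author's birth year.

```python
def r_squared(xs, ys):
    """R^2 of the simple regression ys ~ xs (the squared Pearson correlation)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# Invented toy records: (author birth year, publication year, topic share).
books = [
    (1800, 1830, 0.10), (1800, 1850, 0.11), (1800, 1870, 0.10),
    (1820, 1850, 0.20), (1820, 1870, 0.21),
    (1840, 1870, 0.30), (1840, 1890, 0.29),
]
birth, published, share = zip(*books)

r2_birth = r_squared(birth, share)          # variance explained by generation
r2_published = r_squared(published, share)  # variance explained by publication date
```

In this constructed example `r2_birth` is much higher than `r2_published`, which is the kind of contrast a cohort-effect argument relies on. With real data, birth year and publication year are strongly correlated, so such a comparison needs far more care than this sketch suggests.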
When it comes to detecting change in the usage of keywords or topics over time, Wadsworth, Vasseur, and Damby (2016) proposed a series of stand-out methods to trace the evolution of vocabulary. They focus on Sylvia Plath and look at the cumulative distribution of each word through her works over time. The authors fit two simple functions to each cumulative trend: a linear function and a power law. Using the fits of the two curves and the differences between them, the study identifies groups of words with similar behavior: those that are used consistently over time, and those that are accelerating or decelerating. Additionally, by looking at the distribution of word-onset times (times before which Plath did not use a word regularly), the authors find clusters of words that only appear in specific periods and can potentially discriminate shifts in the poet’s style, or aid periodization.
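The idea of reading acceleration off a cumulative curve can be sketched as follows. This is a simplification and not the authors' exact procedure: it assumes equally spaced time steps, fits only the power law (via least squares on log-transformed values), and uses an arbitrary tolerance around an exponent of 1, which corresponds to steady, linear accumulation.

```python
import math
from itertools import accumulate

def power_law_exponent(counts):
    """Fit log(cumulative count) ~ log(t) by least squares; return the exponent."""
    cumulative = list(accumulate(counts))
    pts = [(math.log(t), math.log(c))
           for t, c in enumerate(cumulative, start=1) if c > 0]
    x_bar = sum(x for x, _ in pts) / len(pts)
    y_bar = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - x_bar) ** 2 for x, _ in pts)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pts)
    return sxy / sxx

def usage_pattern(counts, tolerance=0.15):
    """Label a word by how its cumulative usage grows (tolerance is an assumption)."""
    b = power_law_exponent(counts)
    if b > 1 + tolerance:
        return "accelerating"
    if b < 1 - tolerance:
        return "decelerating"
    return "consistent"

# Invented per-period counts of three hypothetical words.
steady = [3] * 10
rising = list(range(1, 11))
fading = list(range(10, 0, -1))
```

Calling `usage_pattern` on the three toy series separates a word used at a constant rate from one that gains ground and one that fades, mirroring the grouping described above.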
We spent more time with these cases because they present a more nuanced approach to chronology than simply fitting linear trends to data. The fascination with trend lines was already critiqued conceptually by Moretti and Sobchuk (2019): the authors note that trends smooth over complicated cases, conceal non-linearity and “remove conflict from history”. Even if the line is smooth and monotonic, the forces that generated it might not be. Additionally, there are purely technical considerations. Off-the-shelf usage of linear models introduces the usual pitfalls known from other disciplines: over-reliance on significance testing (p-values) and explained variance (\(R^{2}\)) in reporting instead of explicit statistical modeling; and the use of linear regression, tied to the normal distribution, to model unsuitable data like discrete counts, probabilities and ratios, which can introduce unreasonable predictions such as negative ratios or probabilities larger than 1 (see Winter and Bürkner 2021 for a relevant discussion of using the Poisson distribution for modeling counts in linguistics).
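The problem of unreasonable predictions is easy to reproduce. The sketch below fits a straight line to invented proportions and extrapolates it, then repeats the fit on logit-transformed values. Note the assumption: least squares on a logit scale is used here only as a quick stand-in for a proper model with a suitable distribution (such as a binomial or Poisson regression), not as a recommendation.

```python
import math

def ols(xs, ys):
    """Least squares fit of ys ~ xs; returns (intercept, slope)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

# Invented toy data: share of texts with some feature, per decade.
decade = [0, 1, 2, 3, 4, 5]
share = [0.30, 0.22, 0.15, 0.10, 0.06, 0.03]

# A straight line on the raw proportions, extrapolated to decade 8.
a, b = ols(decade, share)
linear_forecast = a + b * 8

# The same fit on the logit scale, mapped back to a proportion.
logit = [math.log(p / (1 - p)) for p in share]
a_l, b_l = ols(decade, logit)
logit_forecast = 1 / (1 + math.exp(-(a_l + b_l * 8)))
```

Here `linear_forecast` falls below zero, an impossible proportion, while `logit_forecast` stays between 0 and 1 by construction.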
A reliance on linear trend lines can also negatively impact the representation of uncertainty in the results when lines are fit on top of aggregated data such as yearly averages and medians (e.g. Trilcke et al. 2016; Muzny, Algee-Hewitt, and Jurafsky 2017; Underwood and Sellers 2012; Šeļa and Sobchuk 2017). The resulting ‘average of averages’ line is fitted on a handful of yearly observations, conceals the dispersion of values and overemphasizes the direction of the trend in time. If the purpose of a linear model is just to demonstrate the trend, showing empirical values together with an indication of the range of their distribution, or bootstrapped confidence intervals (Sharmaa et al. 2020), might in many cases be a much more direct and transparent approach.
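How an average hides dispersion, and how a bootstrapped interval brings it back, can be shown with a percentile bootstrap of the mean. This is a generic sketch, not the procedure of any cited study; the number of resamples, the fixed random seed and the two invented groups of values are assumptions made for the example.

```python
import random
from statistics import mean

def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = sorted(mean(rng.choices(values, k=len(values)))
                   for _ in range(n_resamples))
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# Two invented years with the same average (0.50) but different dispersion.
tight = [0.48, 0.50, 0.52, 0.49, 0.51]
wide = [0.10, 0.90, 0.30, 0.70, 0.50]

tight_ci = bootstrap_ci(tight)
wide_ci = bootstrap_ci(wide)
```

A trend line through yearly averages would treat both years as the same point. The intervals differ sharply, which is the information lost when only the averages are plotted.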
Overall, the analysis of chronological change and continuity in CLS is in the early stages of development. When compared to sophisticated text analysis, annotation and representation techniques, it is apparent that the dimension of time remains a largely open methodological and conceptual problem.
18.2 Inferring historical relationships between texts
Chronological trends can be ubiquitous and useful, but they usually do not make any judgement about historical connections between data points, the study of which is another large subfield of (computational) literary history. Which texts are connected through influence, citation and rewriting? Can similarity also be used to infer genealogy? How are the known manuscripts or editions of a text related? These questions belong to the domain of intertextuality, and the main methodological approach to intertextuality, at least in CLS, is text reuse detection: techniques for tracing matching parts between texts.
18.2.1 Difference and similarity
Seo and Croft (2008) identify two major approaches to text reuse detection: string-based methods and similarity-based methods. While the calculation of similarity (or distance) between full texts is rarely considered a text reuse method as such, some studies (explicitly or implicitly) rely on language-based similarity to make arguments about influence and infer pathways of intertextuality. Jockers (2013) constructs networks based on pairwise stylistic similarities between texts and then identifies the most influential texts as the most central nodes. A similar goal, tracing patterns of stylistic similarity beyond authorship, is pursued by Eder (2017) with so-called bootstrap consensus networks. Broadwell and Tangherlini (2017) use heatmaps based on distance matrices to understand the boundaries of Modernism in Scandinavian literature: they use areas of overarching (dis-)similarity between books to argue about periodization. Iwata (2012) directly equates text similarity with historical relationships when dealing with collections of Japanese Noh plays. This equation is always a considerable leap of faith: there is no guarantee that similarity in the frequency distributions of linguistic elements also signals any kind of historical relationship. Stylistically central texts might just be the texts closest to an unobserved ‘average’ language, simultaneously similar to many but related to none.
Recently, more attention has been drawn to non-symmetrical measures of difference based on divergence (Barron et al. 2018; Chang and DeDeo 2020). Divergence (specifically Kullback-Leibler divergence, KLD) measures the amount of ‘surprise’ at encountering a probability distribution P, given a prior probability distribution Q. Texts are usually represented as probability distributions over inferred topics (e.g. with LDA) to keep divergence measures interpretable. A difference between P → Q and Q → P can, for example, signal an enclosure relationship, in which one text covers more ground than the other, which is topically focused and narrow. It is more surprising to encounter the general text having seen only the focused one than the other way around.
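The asymmetry can be checked directly with the standard definition of KLD, D(P || Q) = Σ p_i log2(p_i / q_i). The two topic distributions below are invented: one text spreads evenly over four topics, the other concentrates almost entirely on one.

```python
import math

def kld(p, q):
    """KL divergence D(P || Q) in bits: the surprise of seeing P when expecting Q.

    Assumes q is non-zero wherever p is non-zero.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Invented topic distributions over four topics.
general = [0.25, 0.25, 0.25, 0.25]  # covers all topics evenly
focused = [0.97, 0.01, 0.01, 0.01]  # almost entirely about one topic

surprise_general_after_focused = kld(general, focused)
surprise_focused_after_general = kld(focused, general)
```

With these values the first quantity is the larger one: a reader primed by the narrow text is more surprised by the general text than the reverse, as described above.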
It is argued that the asymmetry of KLD is a good fit for modeling cultural points of view and their inherent subjectivity (Chang and DeDeo 2020). In relation to historical data, KLD was used to derive measures of novelty and resonance: the former describes how surprising a text T is given the past, the latter describes how persistent the information from a text T is in the future. High-novelty, high-resonance texts introduce innovations that also leave their mark on the future. These measures were originally used to detect a novelty bias in the debates of the first revolutionary French Assembly (Barron et al. 2018) and were recently extended to various historical cases, including the detection of events in Dutch chronicles (Lassche, Kostkan, and Nielbo 2022). The divergence-based framework provides an alternative approach to trends, focusing on uncovering relative, context-dependent positions of data points in time. The question of historical relationships between these points, however, remains open.
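One common way to operationalize these measures, sketched here under stated assumptions, is to average KLD over a window of neighboring documents: novelty compares a text with the texts before it, a companion quantity (often called transience) compares it with the texts after it, and resonance is the difference between the two. The window size, the function name and the toy topic distributions are choices made for this example.

```python
import math

def kld(p, q):
    """KL divergence D(P || Q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def novelty_resonance(docs, j, window):
    """Novelty and resonance of docs[j] against `window` texts on each side.

    docs is a chronologically ordered list of topic distributions.
    """
    if j - window < 0 or j + window >= len(docs):
        raise ValueError("not enough texts before or after position j")
    past = docs[j - window:j]
    future = docs[j + 1:j + 1 + window]
    novelty = sum(kld(docs[j], d) for d in past) / window
    transience = sum(kld(docs[j], d) for d in future) / window
    return novelty, novelty - transience

old = [0.7, 0.2, 0.1]  # invented topic mix of the established discourse
new = [0.1, 0.2, 0.7]  # invented topic mix of an innovation

lasting_shift = [old, old, new, new, new]  # the innovation persists
passing_fad = [old, old, new, old, old]    # the innovation disappears

nov_a, res_a = novelty_resonance(lasting_shift, j=2, window=2)
nov_b, res_b = novelty_resonance(passing_fad, j=2, window=2)
```

Both texts at position 2 are equally novel, but only the one in `lasting_shift` has positive resonance, because the texts that follow it look like it.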
18.2.2 Text reuse and alignment
When two texts share matching or very similar fragments, this provides firmer ground for establishing (or suspecting) an intertextual link. The majority of text reuse research focuses on string matching, either to discover potential quotations for future work and integration with search engines (Roe et al. 2016; Sturgeon 2017; Janicki, Kallio, and Sarv 2022), or to derive text reuse scores that capture the intensity of text reuse between documents (Bernstein, Gervais, and Lin 2015; Gawley and Diddams 2017; Shang and Underwood 2021).
There are numerous solutions and existing frameworks for text reuse detection, such as Tesserae (Coffee et al. 2012), Tracer (Büchler et al. 2014), Passim (Smith et al. 2014) and Text Matcher (Reeve 2020). Most of them focus on matching literal, or near-literal, repetitions of strings, a task that can be solved relatively easily by n-gram matching and sequence alignment algorithms that build up similar regions from local matching units. Many of the approaches originate in bioinformatics and are related to the longest common substring problem (see Olsen, Horton, and Roe 2011). In literary studies, identified string matches are often weighted (e.g. by TF-IDF) to distinguish ‘interesting’ matches based on document-specific words from simple linguistic repetition (Bernstein, Gervais, and Lin 2015; Shang and Underwood 2021).
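The principle of building longer matching regions from local n-gram matches can be sketched without any of the frameworks named above. The code below is a toy illustration, not how those tools are implemented: it assumes exact word matches, a crude whitespace tokenizer, and word trigrams as seeds, and it does no weighting of the matches.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?\"'()"

def tokenize(text):
    """Lowercase words with surrounding punctuation stripped."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]

def shared_passages(source, target, n=3):
    """Maximal runs of identical words in both texts, seeded by shared n-grams."""
    a, b = tokenize(source), tokenize(target)
    index = defaultdict(list)  # n-gram -> positions in the source
    for i in range(len(a) - n + 1):
        index[tuple(a[i:i + n])].append(i)
    passages = set()
    for j in range(len(b) - n + 1):
        for i in index.get(tuple(b[j:j + n]), []):
            start_a, start_b = i, j
            # Extend the seed to the left while the words keep matching.
            while start_a > 0 and start_b > 0 and a[start_a - 1] == b[start_b - 1]:
                start_a -= 1
                start_b -= 1
            end_a, end_b = i + n, j + n
            # Extend the seed to the right while the words keep matching.
            while end_a < len(a) and end_b < len(b) and a[end_a] == b[end_b]:
                end_a += 1
                end_b += 1
            passages.add(" ".join(a[start_a:end_a]))
    return sorted(passages, key=len, reverse=True)

source = "In the beginning God created the heaven and the earth."
target = "As it is written, in the beginning God created all things."
matches = shared_passages(source, target)
```

Here three overlapping trigram seeds collapse into a single five-word passage. Near-literal reuse (spelling variation, inserted words) would need approximate matching or a proper local alignment algorithm instead of exact extension.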
Tracing non-literal intertexts, like allusions, which invoke the source text through a paraphrase or through the development of a shared theme or semantics, is more difficult and rare (for an overview see Manjavacas, Long, and Kestemont 2019). Detection of allusions is an open interpretative practice: little ground-truth data is available because there is little scholarly agreement, which complicates the adjustment of formal techniques to non-literal cases of text reuse. The nature of the problem, which shifts the focus from lexical correspondence to semantic overlap, invites injecting semantic information into reuse detection.
Manjavacas, Long, and Kestemont (2019), using a subset of manually identified Biblical allusions in the writings of Bernard of Clairvaux, show that hybrid methods combining lexical and semantic information perform best in retrieving allusion sources. Specifically, they use the soft cosine, which incorporates word similarity information directly into the cosine measure (calculated using bag-of-words vectors): the method can trace non-trivial similarities between target and source texts.
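In its usual formulation, the soft cosine between two bag-of-words vectors a and b is Σ s_ij a_i b_j divided by the square root of (Σ s_ij a_i a_j)(Σ s_ij b_i b_j), where s_ij is the similarity between words i and j. The sketch below implements that formula on a three-word vocabulary; the vocabulary and the similarity values are invented, whereas in practice they would typically come from word embeddings.

```python
import math

def soft_cosine(a, b, sim):
    """Soft cosine of bag-of-words vectors a and b under word similarity matrix sim."""
    def form(u, v):
        return sum(sim[i][j] * u[i] * v[j]
                   for i in range(len(u)) for j in range(len(v)))
    return form(a, b) / math.sqrt(form(a, a) * form(b, b))

vocab = ["shepherd", "pastor", "sword"]

# Invented word similarities: "shepherd" and "pastor" are treated as close.
sim = [[1.0, 0.8, 0.0],
       [0.8, 1.0, 0.0],
       [0.0, 0.0, 1.0]]
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

target = [1, 0, 0]  # a passage containing only "shepherd"
source = [0, 1, 0]  # a passage containing only "pastor"

soft = soft_cosine(target, source, sim)
plain = soft_cosine(target, source, identity)  # reduces to the ordinary cosine
```

With the identity matrix the two passages share no words and the score is zero. With word similarities included, the same pair receives a high score, which is how the measure can surface paraphrase-like correspondences.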
Text reuse also poses a specific challenge to folklore studies, which deal with large archives of written records of oral song performances. Automatic navigation through these archives is complicated by a high degree of textual variation, even though large parts of the records repeat on different levels. All texts are, in principle, related through a shared oral tradition, which conditions the reproduction of formulas and motifs, the arrangement of parts and the transmission of texts. Recent work by Janicki, Kallio, and Sarv (2022) shows that simple character bigram similarity of verse lines can be used effectively to identify ‘equivalent verses’: clusters of very similar strings. The study proceeds to align different texts in the collection using cluster indices instead of actual strings, which aims to generalize from strings to their types and serves as a good foundation for inferring similarity between folklore records with a lot of variation. Poetic form makes the task somewhat simpler, since it provides a natural discrete ‘unit’ of text (the line), but this study is a promising development on the path to intertextual generalization.
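The two steps, grouping similar verse lines and then re-encoding texts as sequences of group indices, can be sketched as follows. The details are assumptions made for illustration and are not taken from the study: cosine similarity over character bigram counts, a threshold of 0.7, a greedy single-pass clustering, and invented English pseudo-verses.

```python
import math
from collections import Counter

def bigrams(line):
    """Character bigram counts of a lowercased line."""
    s = line.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def cosine(c1, c2):
    dot = sum(c1[g] * c2[g] for g in c1.keys() & c2.keys())
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    return dot / norm if norm else 0.0

def cluster_lines(lines, threshold=0.7):
    """Greedy clustering: a line joins the first cluster whose first member
    is similar enough, otherwise it starts a new cluster."""
    representatives, ids = [], []
    for line in lines:
        vec = bigrams(line)
        for k, rep in enumerate(representatives):
            if cosine(vec, rep) >= threshold:
                ids.append(k)
                break
        else:
            representatives.append(vec)
            ids.append(len(representatives) - 1)
    return ids

# Two invented records of "the same" song with spelling variation.
song_a = ["the old man sang his song", "a bird flew over the sea"]
song_b = ["the olde man sange his song", "a birde flew over the see"]

ids = cluster_lines(song_a + song_b)
encoded_a, encoded_b = ids[:len(song_a)], ids[len(song_a):]
```

Although no line is shared verbatim, both records are encoded as the same sequence of cluster indices, so they can be aligned and compared at the level of verse types rather than strings.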
18.2.3 Trees
A geneticist and one of the founders of cultural evolution, Cavalli-Sforza (with P. Menozzi and A. Piazza) wrote that “a tree can be viewed as a simplified description of a matrix of distances” (Cavalli-Sforza, Menozzi, and Piazza 1994, 33). These can be distances between anything: cities, species, languages, or individual texts. CLS routinely relies on dendrograms (built by grouping the closest pairs in a distance matrix) as tools of unsupervised clustering. We are interested in a pattern of similarities and dissimilarities: which texts sit on the closest branches (maybe written by one author?), which texts form distinct clusters (maybe belonging to one genre?). These trees do not usually assume any historical relationships between the leaves, although sorting out the history of life was the original purpose of the tree as a model. Thus, the majority of CLS trees are not phylogenies and are of little interest to literary history.
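How a dendrogram arises from repeatedly grouping the closest pair can be made concrete with a small agglomerative procedure. The sketch assumes average linkage (one of several possible linkage rules) and an invented distance matrix over four hypothetical texts. The result describes similarity only and makes no claim about descent.

```python
from itertools import combinations

def average_distance(leaves_a, leaves_b, dist):
    """Mean pairwise distance between the leaves of two clusters."""
    total = sum(dist[frozenset((a, b))] for a in leaves_a for b in leaves_b)
    return total / (len(leaves_a) * len(leaves_b))

def build_dendrogram(labels, dist):
    """Average-linkage clustering; returns the tree as nested tuples."""
    clusters = [(label, [label]) for label in labels]  # (tree, leaves)
    while len(clusters) > 1:
        # Find the closest pair of current clusters and merge it.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(
                clusters[ij[0]][1], clusters[ij[1]][1], dist),
        )
        tree_i, leaves_i = clusters[i]
        tree_j, leaves_j = clusters[j]
        merged = ((tree_i, tree_j), leaves_i + leaves_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Invented distances between four hypothetical texts.
labels = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 0.10, frozenset(("C", "D")): 0.20,
    frozenset(("A", "C")): 0.80, frozenset(("A", "D")): 0.90,
    frozenset(("B", "C")): 0.85, frozenset(("B", "D")): 0.90,
}
tree = build_dendrogram(labels, dist)
```

The nested tuple pairs A with B and C with D before joining the two pairs, which is exactly the "simplified description of a matrix of distances" from the quotation above.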
Here we instead focus on trees that are phylogenies and that assume descent with modification as the primary mechanism that produced their objects. There are at least three areas where phylogenetic methods or phylogenetic thinking are used in relation to (literary) texts:
- Literary morphology: Gasparov (1996) reconstructed the branching and merging history of European verse forms, while Moretti (2005) used a tree model to manually chart the histories of clues in detective fiction and of free indirect discourse.
- Oral traditions: several groups of anthropologists use phylogenies to study the branching of folktales (Tehrani, Nguyen, and Roos 2015; da Silva and Tehrani 2016) and mythological motifs (Thuillard et al. 2018), an analysis that also opens connections to evidence from human migration history, paleogenetics and archaeology.
- Paleography and codicology: these fields explicitly engage with phylogeny reconstruction. Copies (or witnesses) of a manuscript are all related, but it is usually hard to tell exactly how: the historical record, like the archaeological one, is fragmented and incomplete, and it needs hypothetical reconstruction. To do this, one can use the accumulation of errors and changes in copies, which, not unlike genetic sequences, reflects transmission history. Similarity between documents here can signal common origins. The use of phylogenetic trees to infer relationships between manuscript copies is a modern extension of stemmatology, the manual philological reconstruction of manuscript histories (see the recent handbook edited by Roelli 2020).
Phylogenetic methods provide two clear advantages: automatic alignment (comparing texts) and automatic inference of relationships (building a tree as a possible history). Phylogenies of texts started out relying on manual coding of traits, for example to test relationships between manuscripts of the Canterbury Tales (Barbrook et al. 1998) or to reconstruct the branching history of Little Red Riding Hood (Tehrani 2013). More recently, sequence alignment has been used more frequently to compare strings of text directly, as in the phylogeny of print editions of The Wandering Jew’s Chronicle (Bergel, Howe, and Windram 2015).
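The starting point of a trait-based analysis, turning manually coded traits into pairwise distances between witnesses, can be sketched as below. The witnesses, the binary coding and the decision to skip traits that are missing in either witness are all assumptions for illustration; actual studies feed such codings into dedicated phylogenetic software rather than stopping at a distance table.

```python
from itertools import combinations

def trait_distance(a, b):
    """Share of coded traits on which two witnesses differ.

    Traits coded as None (lost or illegible) in either witness are skipped.
    """
    compared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return sum(x != y for x, y in compared) / len(compared)

# Invented coding: 1 = variant reading present, 0 = absent, None = missing.
witnesses = {
    "W1": [1, 0, 1, 1, 0],
    "W2": [1, 0, 1, 0, 0],
    "W3": [0, 1, None, 0, 1],
}

distances = {
    (a, b): trait_distance(witnesses[a], witnesses[b])
    for a, b in combinations(witnesses, 2)
}
```

In this toy case W1 and W2 differ on a single trait and would sit close together in any tree, while W3 stands apart. Whether that closeness reflects shared ancestry is precisely what the tree-building step has to argue.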
Phylogenetic methods that originated in evolutionary biology have obvious design problems when used with cultural and textual data: all ‘ancestors’ are thought to be unobservable (all we have are the ‘leaves’ of the tree), and methods often assume strong vertical transmission (like maximum parsimony trees, which rely on minimizing independent mutations). Manuscript transmission simulated in the lab (Spencer et al. 2004), however, shows that these trees can still be useful and reflect historical clusters faithfully, although they are not able to represent known ancestry relationships. Several methods other than maximum parsimony trees have been adopted that do not expect vertical transmission in the data, e.g. split decomposition and, more recently, network-based approaches such as Neighbor-Net (Iwata 2012; Bergel, Howe, and Windram 2015). A few alternatives to the phylogenetic framework as a whole have also been proposed: Andrews and Macé (2012) model stemmata and witness readings as directed acyclic graphs (DAGs), which are also used in causal inference, in order to retain the formal approach but escape the assumptions of evolutionary biology.
Since the anthropologist Alfred Kroeber’s critique in the early 20th century, evolutionary trees have often been dismissed as incompatible with cultural histories, where there is no reason to expect a strong tree-like signal: after all, culture mashes and mixes branches together. However, this is a misconception about evolution: branching, tree-like phylogenies are characteristic only of vertebrate evolutionary histories. Culture is more akin to viruses and bacteria, with extensive horizontal exchange of information, and can still, for various reasons, be very tree-like and not fundamentally different from biological evolution (Durham 1990; Collard, Shennan, and Tehrani 2006). In adjacent disciplines we also see the expanding use of unrooted, network-based methods that do not assume a strong tree-like pattern in the transmission process: one example is the use of dynamic phylogenetic networks with community inference to trace the histories of interacting individuals in electronic music (Youngblood, Baraghith, and Savage 2020). Patterns of collaboration produce distinct subgenres of music that stay recognizable despite an increase in transmission between branches in the age of the internet.
18.3 Conclusion
We centered this survey on two areas of analysis that, we think, are indispensable for historical inquiry. First, the use of trends and trend lines deals with describing change: we see a vast array of techniques and applications, but most research stays at the curve-fitting step and rarely engages in explicit causal inference. Second, distances, text reuse techniques and trees all try to infer relationships between texts, often under historical assumptions that can be further used to reconstruct lineages of texts, traditions, or forms.
References
See works cited and further reading on Zotero.
Citation suggestion
Artjoms Šeļa (2023): “Analysis in Literary History”. In: Survey of Methods in Computational Literary Studies (= D 3.2: Series of Five Short Survey Papers on Methodological Issues). Edited by Christof Schöch, Julia Dudar, Evegniia Fileva. Trier: CLS INFRA. URL: https://methods.clsinfra.io/analysis-lithist.html, DOI: 10.5281/zenodo.7892112.
License: Creative Commons Attribution 4.0 International (CC BY).