28 Analysis of Canonicity

Lisanne van Rossum (Amsterdam)

28.1 Introduction: Deconstructing the canon

Computational analysis of canonicity and prestige on large corpora of texts remains rare; sociological approaches have been explored more often. The following overview of applied practice presents a sample of analytical techniques currently in use and briefly explains their main analytical principles.

28.2 Current applied practice

Much academic enquiry into the analysis of canonicity focuses on identifying the stylistic and bibliographical differences between prestigious and non-prestigious contemporary fiction. Piper and Portelance (2016), for example, compared a collection of prizewinning and best-selling fiction (different ‘social value groups’) using the Linguistic Inquiry and Word Count (LIWC) software, which uses pre-validated word categories to index different textual properties. The researchers also mapped the differences in their corpus across ‘genre groups’ and constructed a machine learning model that effectively predicted whether a newly introduced text was a prize nominee or a romance novel. Earlier, Kao and Jurafsky (2012) had used LIWC, among other tools, to develop a computational framework for measuring perceptions of poetic beauty in reference corpora of amateur versus prizewinning poetry. Their work also features a useful overview of statistical approaches to poetic aesthetics up to 2012.
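The core idea behind LIWC-style analysis can be illustrated with a minimal sketch. The real LIWC dictionaries are proprietary and contain dozens of validated categories; the two toy categories below are invented for illustration only, as is the function name.

```python
import re
from collections import Counter

# Toy stand-ins for LIWC-style word categories; the real LIWC
# dictionaries are far larger and empirically validated.
CATEGORIES = {
    "affect": {"love", "hate", "joy", "grief", "fear"},
    "perception": {"see", "hear", "feel", "bright", "loud"},
}

def category_profile(text: str) -> dict:
    """Return, for each category, its share of all tokens in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    return {
        cat: sum(counts[w] for w in words) / total
        for cat, words in CATEGORIES.items()
    }

profile = category_profile("I love the bright morning; I fear the loud night.")
```

Such per-category proportions, computed for every text in a corpus, become the feature vectors on which group comparisons or classifiers can be built.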

Similarly, Jannidis, Konle, and Leinen (2019) performed an extensive comparison of ‘high’ and ‘low’ literary genres published in Germany between 2009 and 2017. They used a document-term matrix of the 8000 most common nouns in the texts to investigate stylistic homogeneity within genres, as well as Cosine Delta, a statistical distance measure, computed over the 2000 most common words. Jannidis, Konle, and Leinen (2019) also employed type-token ratio and word length to measure stylistic complexity, and used topic modeling and Zeta-based methods to explore themes and topics in the texts. Topic modeling is a machine learning technique that automatically detects clusters of co-occurring words (‘topics’) in a text or set of texts. Zeta is a keyness measure introduced by John Burrows that is particularly useful for identifying content words characteristic of one group of texts in comparison to another.
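Cosine Delta can be sketched in a few lines: relative word frequencies are standardised per word across the corpus (z-scores), and the distance between two texts is then the cosine distance between their z-score profiles. This is a minimal illustration, not the authors' actual implementation.

```python
import math

def zscores(freq_matrix):
    """Standardise each word column of a documents-by-words
    relative-frequency matrix (list of equal-length lists)."""
    n_docs = len(freq_matrix)
    n_words = len(freq_matrix[0])
    out = [[0.0] * n_words for _ in range(n_docs)]
    for j in range(n_words):
        col = [doc[j] for doc in freq_matrix]
        mean = sum(col) / n_docs
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / n_docs) or 1.0
        for i in range(n_docs):
            out[i][j] = (freq_matrix[i][j] - mean) / sd
    return out

def cosine_delta(a, b):
    """Cosine distance between the z-score profiles of two documents."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(y * y for y in b)) or 1.0
    return 1 - dot / (na * nb)
```

Texts with similar habits across the most frequent words end up with a Delta close to zero; stylistically opposed texts approach the maximum of two.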

van Zundert et al. (2020) investigated the notion of timelessness through stylometric analysis. The researchers compared ‘evergreens,’ or fiction that remains popular across multiple decades, with former bestsellers, using TF-IDF vectorization and UMAP dimension reduction. In Natural Language Processing, vectorization converts texts into numerical feature vectors that machine learning models can process. Dimension reduction then projects a data set with many features onto a much smaller number of dimensions that best preserve the distances between data points. A popular example of a dimension reduction technique is Principal Component Analysis (PCA).
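TF-IDF weighting itself is simple enough to sketch: a word's term frequency within a document is multiplied by its inverse document frequency across the corpus, so corpus-wide function words are downweighted and distinctive words stand out. The function below is a toy illustration under simplified assumptions (pre-tokenised input, unsmoothed IDF), not the pipeline of van Zundert et al.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenised documents (lists of words) into TF-IDF vectors
    over a shared, sorted vocabulary."""
    vocab = sorted({w for doc in docs for w in doc})
    n = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = {w: sum(1 for doc in docs if w in doc) for w in vocab}
    # Words occurring in every document get an IDF (and weight) of zero.
    idf = {w: math.log(n / df[w]) for w in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        vectors.append([tf[w] / total * idf[w] for w in vocab])
    return vocab, vectors
```

The resulting high-dimensional vectors are what a technique such as UMAP or PCA would then reduce to two or three dimensions for clustering and visualisation.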

Also focused on literary value and history, Underwood (2019) analysed fiction from the mid-nineteenth to the mid-twentieth century. Embedded in an argument for ‘distant reading’ as a lens of interpretation rather than simply a means to an end, Underwood used predictive modelling to evaluate whether the semantic contents of a novel or poetry collection indicated a higher probability of inclusion in reviews in literary periodicals of the time, and thus an elevated prestige status.

In a similar vein, van Cranenburgh (2016) created a predictive model of textual markers of literary prestige, such as textual complexity, trained on novels that Dutch readers had rated in terms of literary quality. Interestingly, the model identified several works with a discrepancy between style and rating, making reader bias visible. Ashok, Feng, and Choi (2013), too, focused on textual complexity and its correlation with literariness, including factors such as readability indices and lexical choices. The researchers also used part-of-speech (POS) distributions across genres to investigate which word categories were predictive of literary success, then compared their data set to journalistic writing styles and applied the same analysis to film scripts.
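Readability indices of the kind mentioned above typically combine sentence length and word length or syllable counts into a single score. The sketch below implements the well-known Flesch Reading Ease formula with a deliberately naive syllable heuristic; production readability tools use dictionaries or more careful rules, and the helper names are our own.

```python
import re

def naive_syllables(word: str) -> int:
    # Very rough heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores indicate easier text.
    206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(naive_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

Short sentences of monosyllabic words score high, while long, polysyllabic prose scores low, which is why such indices can serve as crude complexity features in a predictive model.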

Finally, for the Stanford Literary Lab, J. D. Porter conducted a large-scale investigation of canonicity (2018) by plotting GoodReads reviews (indicative of ‘popular appeal’) against MLA reviews (indicative of ‘elite appeal’) for 1406 published authors over roughly the last century. These plots, in which authors were listed individually, were condensed and categorized to map genre spaces and to investigate which works clustered together. Porter also constructed a figure visualizing the ‘consecration trajectory’ in the rankings of the 20 most frequently referenced MLA authors across the decades 1940–2010.

28.3 Limitations

Current approaches to degrees of canonicity, prestige, or literariness diverge widely, and many more options need to be explored in detail. Future work is expected to zoom in on questions such as: Which indicators are most important, and which may be useful for larger-scale comparisons across geographical areas, time periods, and perhaps even different kinds of audiences? Which approaches can only be used for individual or local case studies, and which may be useful for analysing longitudinal developments or canonicity issues across language and country borders? How can the historical and sociological context be usefully included in the analysis? Another important element relates to the textual level in relation to languages: if linguistic features are found that correlate with canonicity or prestige, are these features comparable to those found in corpora in other languages or other time periods? And how should such differences be explored in more detail, in search of more knowledge about canon formation, from both a historical and a (socio-)linguistic perspective? These are only some of the topics expected to be addressed in future work.


See works cited and further reading for this chapter on Zotero.

Citation suggestion

Lisanne van Rossum (2023): “Analysis of Canonicity”. In: Survey of Methods in Computational Literary Studies (= D 3.2: Series of Five Short Survey Papers on Methodological Issues). Edited by Christof Schöch, Julia Dudar, Evgeniia Fileva. Trier: CLS INFRA. URL: https://methods.clsinfra.io/analysis-canon.html, DOI: 10.5281/zenodo.7892112.

License: Creative Commons Attribution 4.0 International (CC BY).