26  Corpus Building for Canonicity

Lisanne van Rossum (Amsterdam)

26.1 The corpus and the canon

Data sets define the object of inquiry. Corpus building is therefore an important methodological step in the textual analysis of canonicity and literary prestige. The canon and the academy are intrinsically connected: a place in the canon facilitates the availability and transmissibility of a text, and this availability in turn aids its institutionalization in research, a dynamic that has been widely conceptualized in scholarship (Moretti 2000; Van Rees 1983, 1987; van Rees and Vermunt 1996; Bode 2020). As a computational rendering of a complex sociological process of value attribution, the corpus is thus both a result of canonicity and a contributor to it.

A methodological overlap between corpus building and canonicity is that both are limited to what they contain and defined in large part by what they exclude. To paraphrase: selection is central both to canonicity and to corpus building. In computational literary studies, such boundaries can be approached as discrete by using pre-selected lists. Marc Verboord writes that “…a common way of dealing with the problem of selection, besides ignoring it, is to refer to selections other actors in the literary field have already produced” (Verboord 2003, 260).

However, working with data frequently rests on arbitrary and less evident criteria, especially in the case of prestige. Verboord observes that a main challenge on the methodological level “is that the term ‘canon’ suggests that literary quality is dichotomous in nature: an author either belongs to the ‘canon’ or s/he doesn’t. In reality, more levels can be perceived. As will be shown, these levels are largely a result of the various dimensions of which literary prestige consists” (Verboord 2003, 261). The formation of a corpus to operationalize canonicity is, as such, an act of approximation. In the end, the results produced through corpus building remain relative to the data set, and careful consideration of corpus composition yields a more robust understanding of its outcomes.

26.2 Current applied practice

Recent studies have frequently taken the route of using one or more actors or indicators in the literary valuation process to build a corpus for the analysis of prestige. The Stanford Literary Lab, for example, has focused on expert opinions as the basis for its canonicity analysis by using the MLA International Bibliography as a corpus to measure how frequently academics publish articles about any given writer (Porter 2018, 4). Similarly, José Calvo Tello has used the number of pages devoted to an author in a standard literary history as an indicator of prestige when creating the Corpus of Novels of the Spanish Silver Age (CONSSA; see Calvo Tello 2021). Ted Underwood and Jordan Sellers have focused instead on reviews in periodicals, creating a dual reference corpus consisting of two samples each of poetry and fiction from the period 1820–1919, drawn from differing sources (Underwood and Sellers 2015). The first was drawn from fourteen trans-Atlantic magazines that selectively reviewed literary works; the other was a randomized sample from the HathiTrust Digital Library, which features a substantially wider array of poetry and fiction.
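The counting logic behind such an expert-attention indicator is straightforward. As a minimal, hypothetical sketch (the bibliographic records below are invented, not drawn from the MLA International Bibliography):

```python
from collections import Counter

# Invented bibliographic records; in practice these would be parsed from a
# bibliography export, with one entry per scholarly article about a writer.
articles = [
    {"subject_author": "Austen, Jane"},
    {"subject_author": "Austen, Jane"},
    {"subject_author": "Brunton, Mary"},
]

# Attention score per writer: how many articles take them as their subject.
attention = Counter(a["subject_author"] for a in articles)
print(attention.most_common(2))  # [('Austen, Jane', 2), ('Brunton, Mary', 1)]
```

The resulting frequencies can then serve as one prestige indicator among several, to be cross-referenced with other sources of valuation.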

For quantitative studies of prestige, literary prizes also offer corpus building potential by producing discrete shortlists or longlists. James F. English has investigated the relationship between the literary awards industry and the production of cultural value (English 2008). Several corpora have been compiled, such as a Dutch corpus of 190 literary prizes (Boudewijn 2020, 33) and a nominees corpus of 50 novels (Koolen 2018, 271), to take stock of the position of women in Dutch literature as reflected by prize nominations, jury compositions and reports, and prize wins.

Another strategy to operationalize canonicity through tangible indicators is the use of data from publishing history. In the case of the European Literary Text Collection (ELTeC), for example, the criterion of reprint count during a specific, recent period (1979–2009) was used to classify novels into two categories: those with two or more reprints, understood as still being in active circulation (and hence canonised), and those with no or just one reprint, understood as largely marginalized in the contemporary period (Schöch et al. 2021).
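Reduced to its decision rule, this reprint criterion can be sketched as follows (a hypothetical illustration; the titles, counts, and labels are invented and not taken from the actual ELTeC metadata):

```python
def canonicity_level(reprint_count: int) -> str:
    """ELTeC-style two-way split: two or more reprints in the reference
    period count as still in circulation ("high"); zero or one reprint
    counts as largely marginalized ("low")."""
    return "high" if reprint_count >= 2 else "low"

# Invented examples of the rule in action.
novels = {"Novel A": 0, "Novel B": 1, "Novel C": 7}
levels = {title: canonicity_level(n) for title, n in novels.items()}
# {'Novel A': 'low', 'Novel B': 'low', 'Novel C': 'high'}
```

The appeal of such a rule is its transparency: the threshold is explicit and can be varied to test how sensitive downstream results are to the cut-off.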

Alongside these strategies, popular valuation has recently become a focus for corpus building. As part of the aforementioned experiment, the Stanford Literary Lab used rating frequency on GoodReads, since 2007 the world’s largest website for recording, sharing, and recommending books, as a selection criterion to cross-reference with expert opinion data (Porter 2018, 3). Likewise, the research project The Riddle of Literary Quality selected its corpus of 401 novels on the basis of sales and public library lending figures in the Netherlands between 2009 and 2012, focusing on the novels with the widest circulation and readership (Koolen et al. 2020).
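A popularity-based selection of this kind can be sketched roughly as follows (a hypothetical simplification with invented figures; the actual selection procedure of The Riddle of Literary Quality was more involved than a single combined ranking):

```python
def select_most_circulated(novels, n):
    """Keep the n candidate novels with the highest combined
    sales and library-loan figures."""
    return sorted(novels, key=lambda x: x["sales"] + x["loans"], reverse=True)[:n]

# Invented candidate pool.
candidates = [
    {"title": "A", "sales": 1000, "loans": 500},
    {"title": "B", "sales": 200, "loans": 100},
    {"title": "C", "sales": 800, "loans": 900},
]
corpus = select_most_circulated(candidates, 2)
# selected titles: ['C', 'A']
```

The design choice here is to define "widest circulation" operationally, as a rank over observable figures, rather than as a judgment about quality.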

Pierre Bourdieu’s work on the sources of literary distinction suggests, moreover, that beyond experts and the public, information about co-referencing and peer esteem among artists can act as a proxy for prestige (Bourdieu 1996, 50–51). To the best of the author’s knowledge, this avenue for operationalizing literary prestige remains understudied, although it would be an interesting and worthwhile challenge to take up.

26.3 Limitations

A first, very fundamental limitation of corpus design practice in the context of canonicity and prestige is that many corpora do not take these factors into account at all, often for pragmatic reasons of metadata availability. As a consequence, indicators relevant to the canonicity or prestige of the texts are not included in the metadata, even though differences in this respect are to be expected. This holds in particular for very large, minimally-curated corpora such as Project Gutenberg. Especially when small to medium-sized corpora have been designed using the opportunistic model, the effects of availability (especially in digital form) on canonicity are strengthened even further, because there is no conscious intention to counterbalance them; this leads to a strong, unacknowledged positive bias toward canonised works.1

Several of the aforementioned studies are limited in scope and representativeness: they foreground a limited set of data associated with prestige to represent the concept as a whole. From several viewpoints, this is an understandable approach to corpus building: it trades comprehensiveness for feasibility, for example, or avoids the problem of compromising data quality when mapping heterogeneous data sets onto one another. As shown in the previous section, mixed and integrated approaches to corpus building have been successfully undertaken. Yet in a field whose digital data and methods are both relatively young, comprehensive, multi-dimensional models of literary prestige remain lacking, and consequently no long-term accounts of canon formation and transformation based on substantial data sets exist. Lastly, it should be noted that current approaches are culturally specific and often focused on Western systems and markets of cultural consumption.

Another potential limitation of current corpus building practices with regard to canonicity or prestige is that they are for the most part focused on building a corpus of the canon, that is, one containing only texts belonging to the canon according to a particular indicator of canonicity. Most of the indicators observed so far, however, could equally well be used to build a corpus that contains both texts that are part of the canon (by some measure) and texts that are not (by the same measure). Examples of this practice include the European Literary Text Collection mentioned above, which makes a point of including both canonised and largely forgotten texts, precisely to allow comparison between the two groups. Similarly, Jodie Archer and Matthew Jockers, in their study of textual properties that correlate positively with bestselling novels (Archer and Jockers 2016), included both commercially successful and commercially unsuccessful novels in their corpus. Taking this even further, following Verboord’s lead above, and assuming suitable quantitative rather than categorical indicators, corpora could be built that contain texts of different, defined degrees of canonicity.
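With a quantitative indicator in hand, such graded degrees can be defined directly. A minimal sketch, assuming a per-author article count as the indicator (the thresholds and labels below are invented for illustration, not taken from any of the cited studies):

```python
def canonicity_degree(article_count: int) -> str:
    """Map a quantitative prestige indicator onto graded levels
    rather than a binary in/out-of-canon distinction."""
    if article_count >= 100:
        return "core canon"
    if article_count >= 10:
        return "peripheral canon"
    if article_count >= 1:
        return "noticed"
    return "unnoticed"

# Invented examples.
print(canonicity_degree(250))  # core canon
print(canonicity_degree(3))    # noticed
print(canonicity_degree(0))    # unnoticed
```

Texts could then be sampled from each stratum, so that the resulting corpus reflects degrees of canonicity rather than a single in-or-out boundary.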


See works cited and further readings on Zotero.

Citation suggestion

Lisanne van Rossum (2023): “Corpus Building for Canonicity”. In: Survey of Methods in Computational Literary Studies (= D 3.2: Series of Five Short Survey Papers on Methodological Issues). Edited by Christof Schöch, Julia Dudar, Evgeniia Fileva. Trier: CLS INFRA. URL: https://methods.clsinfra.io/corpus-canon.html, DOI: 10.5281/zenodo.7892112.

License: Creative Commons Attribution 4.0 International (CC BY).

  1. In the absence of the relevant metadata to check for these effects, it is difficult to provide a clear-cut example for this case. One collection where such an effect could be expected is the 450 Multilingual Novels collection (Piper and Portelance 2016). More generally, pioneering providers of digital literary texts clearly started out with an unquestioned bias toward canonised authors, but in many cases broadened their scope over time, e.g. the Théâtre classique corpus (Fièvre 2007).↩︎