21  Corpus Building for Gender Analysis

Evgeniia Fileva (Trier)

Gender studies in literature that aim to analyze authorial style or character behavior necessarily depend on a careful selection of material for the corpus. In this context, the representativeness of the corpus is a critical factor in obtaining reliable results. In most of the papers that we found on this topic, the focus is on the number of male and female authors. For gender studies, it is desirable to build a balanced corpus, that is, one that is not skewed towards one gender or the other. Due to historical and social circumstances, as well as the effects of gender on canonisation and, in turn, of canonisation on preservation and digitization, almost all original corpora have more male than female authors. To make sure a comparison of novels based on author gender is possible without creating insurmountable obstacles for corpus design, the creators of ELTeC, for example, set a very wide range for the proportion of novels written by female authors. Rybicki (2015) formed similarly sized collections of texts by male and female authors for his analysis. The gender balance of the corpus is also addressed in studies by Koolen, Underwood, Calvo Tello, Schöch, and others.

21.1 Approaches to corpus building in gender-based research

The articles in our corpus of research articles take various approaches to constructing a corpus for gender analysis. We were able to distinguish two main approaches: the re-use of an already existing corpus, and the creation of a dedicated collection of texts that takes the goals of the research into account.

21.1.1 Re-using existing corpora

The first type includes the work of Koolen (2018), who conducted a study based on the Riddle of Literary Quality corpus. This corpus was created by the Royal Netherlands Academy of Arts and Sciences (KNAW) to investigate the textual characteristics of Dutch contemporary fictional prose. Koolen herself was involved in the Riddle project and, in her monograph Reading Beyond the Female, used the corpus created by the project for a specific task: the study of the relationship between gender and the (perceived) quality of literary texts based on readers’ reactions to bestselling fiction. The original corpus is a collection of texts by Dutch authors from 2007–2012 and consists of 401 novels. Koolen provides a number of statistics on the gender distribution in the corpus, which is an important part of such studies. In this corpus, 55% of the works were written by men, 36% by women, and 9% have authors of unspecified gender (Koolen 2018).
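Statistics of this kind can be derived directly from a corpus metadata table. The following is a minimal sketch in Python, not the procedure used by Koolen: the field name `author_gender`, the category labels, and the toy records are assumptions made for illustration and are not taken from the Riddle corpus metadata.

```python
from collections import Counter

def gender_distribution(records, field="author_gender"):
    """Return the share of each gender label as a percentage of all records.

    Records with a missing or empty value are counted as 'unspecified'.
    """
    counts = Counter(rec.get(field) or "unspecified" for rec in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

# Toy metadata, invented for illustration only (hypothetical field names).
toy_corpus = [
    {"title": "Novel A", "author_gender": "female"},
    {"title": "Novel B", "author_gender": "male"},
    {"title": "Novel C", "author_gender": "male"},
    {"title": "Novel D", "author_gender": None},  # gender not recorded
]

print(gender_distribution(toy_corpus))
```

Reporting the "unspecified" share explicitly, rather than silently dropping such records, keeps the size of the metadata gap visible to readers of the study.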

Rybicki (2015) conducted a stylometric study using the Chawton House Library’s corpus, which is part of a digitization project at Chawton House that aims to build a historical collection of women’s literature in English from the period 1600–1860. Rybicki compared word frequencies in this corpus with two reference corpora of famous female (22 novels) and male (21 novels) writers, containing texts from the 19th and 20th centuries. The collection from the Chawton House corpus contained 46 novels written by women between 1723 and 1830. Interestingly, like the Riddle corpus, this collection contains, among other works, texts by anonymous authors, which are very likely to be by women. In the course of the study, Rybicki combined materials from the reference corpora in different ways and used several combinations of tools for stylometric analysis (Rybicki 2015).

Underwood, Bamman, and Lee (2018) completed one of the most extensive studies in terms of time period and number of units analyzed. They analyzed the development of character gender in English-language literature over several centuries, selecting 104,000 books that cover a period of 306 years. The collection is based on books in English from the HathiTrust Digital Library, which were compared with the Chicago Text Lab corpus and the Publishers Weekly collection to check for representativeness. The essay examines the changing importance of gender in fiction and characterisation from the late 18th century to the early 21st century. Underwood et al. argue that while gender divisions between characters have become less defined over time, there has been a decline in the proportion of fiction written by women, and the number of female characters has also decreased. The study combines two approaches, author gender analysis and character gender analysis, and this combination in turn affects corpus composition. Underwood et al. emphasize the value of books drawn from academic libraries for their study because the significance of female characters there is greater and more transparent (Underwood, Bamman, and Lee 2018).

21.1.2 Creating customized corpora for analysis

The second type of approach to corpus building, where a corpus is created specifically for the research in question, includes the work of Weidman and O’Sullivan on the analysis of gender markers (Weidman and O’Sullivan 2017). They collected a corpus of 236 novels by 54 authors. Their analysis compares works in English from three literary periods (Victorian, modernist, and contemporary literature) in order to understand how an author’s gender affects the lexical and conceptual apparatus of the work.

21.2 Concerns with gender in corpus building for other purposes

Researchers who do not work directly in the field of gender studies nevertheless often mention this aspect in connection with the corpus they use.

Calvo Tello (2021), for instance, carries out a comparative analysis of gender balance. Using the example of the Manual de historia de la literatura española (MdLE), he observes that female authors of Spanish literature in the period 1880–1939 are much less frequently represented in this reference work than male authors: 6.5% versus 93.5%. Calvo Tello further examines the “importance” of the authors present in this particular literary history. For his own corpus building work, he identified the statistical populations of the corpus based on certain traits and used the MdLE to define populations in a given time period. The novels used for defining the total population of (canonised) novels must be mentioned in the manual and meet certain criteria. He gathered the following information about the collected novels: the author’s name, birth and death year, gender, the number of pages dedicated to the author in the manual, and links for searching for the author in digital libraries. His analysis shows that even though there are more men among the top authors, women’s works are nevertheless just as important as those written by men, and are examined just as carefully (Calvo Tello 2021).

The European Literary Text Collection (ELTeC) pays close attention to the gender ratio, which is one of its compositional criteria, in order to enable gender-based analyses. One of the conditions for selecting material for the corpus was that works written by women make up at least 10% and at most 50% of each collection. The range is so wide because in some languages and cultures, female authors are more strongly represented: for example, in English-language literature as opposed to Serbian, Slovenian or Czech literature (Schöch et al. 2021).
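A compositional criterion of this kind is straightforward to check programmatically. The sketch below, in Python, tests whether the share of female-authored novels in a collection falls within the 10% to 50% range mentioned above. The gender labels "F" and "M", the function names, and the two toy collections are assumptions for illustration; this is not ELTeC's own validation tooling.

```python
def female_share(genders):
    """Fraction of novels whose author gender is recorded as 'F'."""
    if not genders:
        raise ValueError("empty collection")
    return sum(1 for g in genders if g == "F") / len(genders)

def meets_gender_criterion(genders, low=0.10, high=0.50):
    """True if the share of female-authored novels lies within [low, high]."""
    return low <= female_share(genders) <= high

# Toy collections, invented for illustration (one label per novel).
collections = {
    "collection_x": ["F"] * 3 + ["M"] * 7,   # 3 of 10 novels by women
    "collection_y": ["F"] * 1 + ["M"] * 19,  # 1 of 20 novels by women
}

for name, genders in collections.items():
    share = female_share(genders)
    print(f"{name}: {share:.0%} female-authored, "
          f"criterion met: {meets_gender_criterion(genders)}")
```

Keeping the bounds as parameters makes it easy to apply the same check to corpora with different balance targets, for example a strictly balanced design.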

21.4 Limitations

There are some challenges in building a corpus for gender studies. First, metadata may be unavailable or incomplete. For example, while designing the corpora for ELTeC, it became clear that many library catalogues and other resources did not contain metadata about author gender and therefore did not allow for targeted searches for novels written by women (Schöch et al. 2021). This also makes it difficult to identify anonymous texts or to determine the gender of authors who may have used a pseudonym. There are also cases where the author is non-binary or transgender (e.g. Maxim February), which requires the use of external databases to verify information about the authors (Koolen 2018). Sometimes, beyond subgenre, there are also correlations between author gender and other metadata categories, for example text length, as Schöch et al. (2021) indicate in the case of the Portuguese text collection. The lack of textual material and metadata, for female authors in particular, can greatly affect the results of analysis (Jockers and Kirilloff 2017).

Koolen (2018) discusses the importance of controlling for confounding factors in gender research when working with corpora by citing examples of studies that fail to account for potential biases, such as domain bias or publication bias. The author suggests that controlling for author and text type characteristics is necessary to avoid erroneously attributing differences to gender. Furthermore, the author notes that within the text type of fictional novels, there is a variety of subgenres that each have their own characteristics, which should not be attributed to gender. Overall, a corpus should be a sample that is statistically representative, and any findings derived from experiments using corpora are only reliable to the extent that this condition is fulfilled.
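To illustrate why such controls matter, the following Python sketch compares a pooled comparison by author gender with the same comparison made within each subgenre. It is a toy example under stated assumptions, not a method taken from Koolen (2018): the record fields (`gender`, `subgenre`, `feature`), the labels, and all values are invented, and `feature` stands for any numeric textual measure.

```python
from collections import defaultdict
from statistics import mean

def mean_by_gender(records):
    """Mean feature value per author gender, pooled over all records."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["gender"]].append(rec["feature"])
    return {gender: mean(values) for gender, values in groups.items()}

def mean_by_gender_within_subgenre(records):
    """Mean feature value per author gender, computed separately per subgenre."""
    strata = defaultdict(list)
    for rec in records:
        strata[rec["subgenre"]].append(rec)
    return {sub: mean_by_gender(recs) for sub, recs in strata.items()}

def make(gender, subgenre, feature, n):
    """Build n identical toy records (hypothetical helper for this example)."""
    return [{"gender": gender, "subgenre": subgenre, "feature": feature}] * n

# Invented data: the feature depends only on subgenre, but the two genders
# are unevenly distributed across subgenres.
toy = (
    make("F", "romance", 10.0, 3) + make("F", "thriller", 2.0, 1)
    + make("M", "romance", 10.0, 1) + make("M", "thriller", 2.0, 3)
)

print("pooled:", mean_by_gender(toy))
print("within subgenre:", mean_by_gender_within_subgenre(toy))
```

In this toy data the pooled means differ by gender, while within each subgenre they are identical: the apparent gender difference is produced entirely by the uneven distribution of subgenres. Real corpora are rarely this clean, but the same stratified view helps to separate subgenre effects from gender effects.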


See works cited and further reading on Zotero.

Citation suggestion

Evgeniia Fileva (2023): “Corpus Building for Gender Analysis”. In: Survey of Methods in Computational Literary Studies (= D 3.2: Series of Five Short Survey Papers on Methodological Issues). Edited by Christof Schöch, Julia Dudar, Evgeniia Fileva. Trier: CLS INFRA. URL: https://methods.clsinfra.io/corpus-gender.html, DOI: 10.5281/zenodo.7892112.

License: Creative Commons Attribution 4.0 International (CC BY).