22  Annotation for Gender Analysis

Evgeniia Fileva (Trier)

22.1 Main issues in annotation for gender

In the articles we have found, annotation methods are illustrated most clearly by research on character gender. For research on author gender, the annotation process is unfortunately not well documented. This may be because identifying and analyzing gender in literary texts is challenging: gender markers are often not stated explicitly in the text or in bibliographic resources, especially for author gender. In contrast, characters’ actions, dialogue, and descriptions provide abundant textual material for studying gender markers. To work with this material, researchers have turned to annotation techniques, which involve labeling or marking specific linguistic features related to gender, such as pronouns, adjectives, or verbs.

Syntactic parsing is therefore a basic preprocessing step in research on gendered characters, since it reveals which linguistic features are associated with them. For example, Jockers and Kirilloff (2017) defined two classes of pronouns (male and female) as well as gendered nouns, and used the Stanford CoreNLP toolkit, in particular the dependency parser, to identify pronoun-verb pairings automatically in a corpus of novels. The output was processed to create CSV files containing the counts of verb associations with male and female pronouns. The data was then reshaped into a matrix in which each row represents a combination of source novel and pronoun gender and each column represents a verb. Raw counts were converted to percentages to account for the imbalance in the occurrence of male and female pronouns. The dependency parser, however, was not 100% accurate, and the researchers inspected the output manually to find errors in the identification of verbs. To limit the effect of such errors, the data was winnowed to include only highly frequent verbs and those that were most consistent with established patterns. The final list of 281 verbs was merged with book metadata to create a final matrix. The metadata included book information such as the file name, author’s name, year of publication, and pronoun class.
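The counting and normalization steps described above can be sketched as follows. This is a minimal illustration and not the code used by Jockers and Kirilloff: the pronoun sets, the `triples` list (standing in for dependency-parser output) and all function names are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical pronoun classes; the study's actual lists also cover gendered nouns.
MALE = {"he"}
FEMALE = {"she"}

# Hypothetical stand-in for parser output: (novel, subject token, verb lemma).
triples = [
    ("novel_a", "he", "walk"),
    ("novel_a", "he", "say"),
    ("novel_a", "he", "say"),
    ("novel_a", "He", "ride"),
    ("novel_a", "she", "say"),
    ("novel_a", "she", "weep"),
    ("novel_b", "she", "say"),
    ("novel_b", "it", "rain"),
]


def pronoun_class(token):
    """Map a subject token to 'male', 'female', or None (not a gendered pronoun)."""
    lowered = token.lower()
    if lowered in MALE:
        return "male"
    if lowered in FEMALE:
        return "female"
    return None


def count_pairs(parsed):
    """Count verbs per (novel, pronoun class) row, skipping ungendered subjects."""
    counts = defaultdict(Counter)
    for novel, subject, verb in parsed:
        gender = pronoun_class(subject)
        if gender is not None:
            counts[(novel, gender)][verb] += 1
    return counts


def to_percentages(counts):
    """Convert raw counts to row-wise percentages so rows of different size are comparable."""
    rows = {}
    for key, verb_counts in counts.items():
        total = sum(verb_counts.values())
        rows[key] = {verb: 100.0 * n / total for verb, n in verb_counts.items()}
    return rows


counts = count_pairs(triples)
matrix = to_percentages(counts)
for row, verbs in sorted(matrix.items()):
    print(row, verbs)
```

Normalizing within each row means that a verb's value expresses its share of all verbs paired with that pronoun class in that novel, which is what makes male and female rows comparable despite unequal pronoun frequencies.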

Although gender recognition is a relatively unproblematic task, Underwood, Bamman, and Lee (2018) point out an aspect that can pose some difficulty, namely character name disambiguation. In the annotation process, it is important not only to identify a character’s gender, but also to link every mention of a character to the right person, whatever name is used. For example, Elizabeth Bennet may appear as Elizabeth, Miss Bennet, etc. In their work, Underwood, Bamman, and Lee use BookNLP, a natural language processing pipeline for analyzing literary texts.

The example of Miss Bennet points to another aspect important to annotation in the context of gender studies, namely the ambiguity of proper names in a literary text. In addition to different variants of the same name, characters may be referred to by their status, profession, social role, or nickname. Moreover, a single proper name can point to more than one character: Miss Bennet may refer to any of the Bennet sisters. NLP tools resolve this ambiguity to some extent.
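The two problems described here, several variants for one character and one variant for several characters, can be illustrated with a toy sketch. It is not how BookNLP resolves names; the character list, the honorific set and the function names are hypothetical, and the matching rule (every non-honorific token of a mention must occur in the full name) is a deliberate simplification.

```python
# Hypothetical honorifics to ignore when matching a mention against full names.
HONORIFICS = {"miss", "mrs", "mr", "lady", "sir"}

# Hypothetical character list: canonical name -> gender label.
CHARACTERS = {
    "Elizabeth Bennet": "female",
    "Jane Bennet": "female",
    "Fitzwilliam Darcy": "male",
}


def candidates(mention):
    """Return all canonical names that contain every non-honorific token of the mention."""
    tokens = [t.strip(".").lower() for t in mention.split()]
    name_tokens = [t for t in tokens if t not in HONORIFICS]
    matches = []
    for full_name in CHARACTERS:
        parts = {p.lower() for p in full_name.split()}
        if name_tokens and all(t in parts for t in name_tokens):
            matches.append(full_name)
    return matches


def resolve(mentions):
    """Split mentions into uniquely resolved ones and ambiguous ones."""
    resolved, ambiguous = {}, {}
    for mention in mentions:
        found = candidates(mention)
        if len(found) == 1:
            resolved[mention] = found[0]
        elif len(found) > 1:
            ambiguous[mention] = found
    return resolved, ambiguous


resolved, ambiguous = resolve(["Elizabeth", "Miss Bennet", "Mr. Darcy", "Jane Bennet"])
print(resolved)
print(ambiguous)
```

In this sketch, "Miss Bennet" is reported as ambiguous because it matches two characters; deciding between them requires context that simple string matching cannot provide.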

Schumacher and Flüh (2020) discuss various annotation techniques for analyzing gender stereotypes and evaluations in 19th-century literature. They employ both quantitative and qualitative approaches, such as the analysis of pronoun usage and the examination of character traits and actions, to identify gendered patterns and dynamics in literary texts. As part of the _m*w_ project, Schumacher uses digital annotation, in combination with Named Entity Recognition and emotional analysis, as one of three approaches to gender role recognition. Character gender role identification involves three aspects: gender identification, character gender identity, and character actions. It is determined on the basis of personal pronouns, character qualities, and descriptions of characters’ actions. A model with preconditioned information about gender roles is run through the NER process, which results in the identification of categories such as “masculine,” “feminine,” and “gender-neutral.” The corpus is annotated according to these categories using the CATMA tool, and subcategories corresponding to the character’s social role (e.g. father, mother, child), gender identity (e.g. homosexual man), and personal qualities (e.g. narcissist) are also added. This annotation scheme facilitates a careful analysis of the text for gender stereotypes.

To extract physical appearance features from the corpus, Koolen (2018) used Lexical Semantic Queries as one of three approaches (see the chapter “Analysis for Gender”, Chapter 23, for more details). The Alpino parser was used to parse Dutch novels automatically and to extract rich linguistic information, such as part-of-speech tags (including verbs and nouns) and grammatical functions (e.g., subject or object). The output of Alpino takes the form of linguistic parse trees in XML format, which can be queried using XPath. Koolen then constructed word lists of nouns and adjectives used in physical descriptions and included stative verbs in the queries. A set of thirteen XPath queries was developed based on the exploration of two novels and manual classification.
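Querying a parse tree with XPath and a word list can be sketched as follows. This is an illustration under stated assumptions, not one of Koolen’s thirteen queries: the XML fragment is hand-written and heavily simplified in the style of Alpino output, the noun list is hypothetical, and Python’s `xml.etree.ElementTree` supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified fragment in the style of Alpino output (an assumption):
# "Zijn blauwe ogen glansden" ("His blue eyes gleamed").
SAMPLE = """
<alpino_ds>
  <node cat="top" rel="top">
    <node cat="smain" rel="--">
      <node rel="su" cat="np">
        <node rel="det" pos="det" root="zijn" word="Zijn"/>
        <node rel="mod" pos="adj" root="blauw" word="blauwe"/>
        <node rel="hd" pos="noun" root="oog" word="ogen"/>
      </node>
      <node rel="hd" pos="verb" root="glanzen" word="glansden"/>
    </node>
  </node>
</alpino_ds>
"""

# Hypothetical word list of body-part nouns (lemmas): eye, hair, hand.
BODY_NOUNS = {"oog", "haar", "hand"}


def adjective_noun_pairs(xml_text):
    """Return (adjective lemma, noun lemma) pairs for noun phrases headed by a listed noun."""
    tree_root = ET.fromstring(xml_text.strip())
    pairs = []
    # Find every noun phrase, then its head noun and adjectival modifiers.
    for noun_phrase in tree_root.findall(".//node[@cat='np']"):
        heads = noun_phrase.findall("./node[@rel='hd'][@pos='noun']")
        modifiers = noun_phrase.findall("./node[@rel='mod'][@pos='adj']")
        for head in heads:
            if head.get("root") in BODY_NOUNS:
                for modifier in modifiers:
                    pairs.append((modifier.get("root"), head.get("root")))
    return pairs


pairs = adjective_noun_pairs(SAMPLE)
print(pairs)
```

Combining a structural query (adjective modifying the head of a noun phrase) with a lexical filter (the noun list) is what restricts the matches to physical descriptions rather than to all adjective-noun pairs.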

22.2 Limitations

Overall, these studies indicate that annotation for gender is a useful method for literary analysis that enables researchers to identify and examine gender-related linguistic features in literary texts. Nevertheless, it is necessary to take into account the challenges and limitations associated with this approach. For example, annotation by humans can be influenced by the subjective interpretation of linguistic features by different annotators. Furthermore, there is a risk of oversimplifying or essentializing gender by reducing it to a binary classification of male or female (Schumacher and Flüh 2020).

To address these limitations, scholars have proposed more nuanced and context-sensitive approaches to gender annotation. For example, Underwood, Bamman, and Lee (2018) argue for a performative approach to gender, which views gender as a fluid and contextual performance rather than a fixed identity. Rybicki (2015) proposes a statistical approach to gender annotation, using multivariate analysis of word frequencies to identify gendered linguistic features in literary texts.

In conclusion, gender annotation is a valuable tool for literary analysis that can help uncover the complexities of gender representation in relation to literary quality, authorship, and historical contexts. However, it is important to approach gender annotation with a critical and nuanced perspective, taking into account the fluidity and complexity of gender identities and avoiding oversimplification or essentialization.


See works cited and further reading on Zotero.

Citation suggestion

Evgeniia Fileva (2023): “Annotation for Gender Analysis”. In: Survey of Methods in Computational Literary Studies (= D 3.2: Series of Five Short Survey Papers on Methodological Issues). Edited by Christof Schöch, Julia Dudar, Evgeniia Fileva. Trier: CLS INFRA. URL: https://methods.clsinfra.io/annotation-gender.html, DOI: 10.5281/zenodo.7892112.

License: Creative Commons Attribution 4.0 International (CC BY).