23  Analysis for Gender

Evgeniia Fileva (Trier)

23.1 Introduction

Gender studies is a very active research area within CLS. The object of study is gender, understood not primarily as biological sex but as a set of social representations shaped by the socio-cultural perceptions that have become established in a given society. In this view, gender is an important category in literature and a dimension of the social patterns of behavior rooted in a given culture. This approach can be observed in the scholarly publications we found: for example, Jan Rybicki and Sean Weidman and colleagues study gender markers, Ted Underwood focuses on how the gender of characters is identified, and Corina Koolen studies the influence of gender on authorial style.

What methodologies are used to address these issues? Koolen (2018) distinguishes two basic types of approaches: descriptive and predictive. The descriptive approach includes all techniques that serve to describe linguistic features and search for patterns in a dataset, such as tokenization, parsing, and part-of-speech (POS) tagging. Descriptive approaches are often used in tasks such as sentiment analysis, named entity recognition, and information retrieval. As Koolen points out, this approach is easier to apply and to organize. Examples of the descriptive approach include Rybicki (2015) and Weidman and O’Sullivan (2017).

The predictive approach, as its name implies, serves to make predictions about data using machine learning techniques and models. It can include text categorization, model training, and the evaluation of prediction results. This approach is generally more complex and demanding. It is demonstrated, for instance, by Schumacher and Flüh (2020), Jockers and Kirilloff (2017), and Underwood, Bamman, and Lee (2018).

However, as with stylometry, which can be applied both descriptively and predictively, many scholars combine the two approaches. Below, we outline the most significant studies on the analysis of gender in literature that we found for this survey paper.

23.2 Current Applied Practice

Weidman and O’Sullivan (2017) ask how gender affects literary style and whether male literary style can be distinguished from female literary style by means of gender markers. The authors identify distinctive words with a stylometric Zeta analysis and weighted Z-score differences. Using the stylo package for R (Eder, Rybicki, and Kestemont 2016), they created a comparative list of the words most and least frequently used in two corpora of novels, one by female and one by male authors. Two groups of markers, female (or ‘preferred’) and male (or ‘avoided’), were involved in the Zeta analysis. A cluster analysis was then performed on the literature of three historical periods (Victorian era, modernists, and contemporary authors) using Delta, which establishes text similarities based on the most frequent words in the corpus. In this analysis, authors who are considered more canonical and prestigious were clustered together, but another major distinguishing feature of the clusters was the authors’ gender. The analysis showed the presence of stereotypical distinctive features by which one can draw a stylistic distinction between female and male authors. These features include the language of place, direction, and location (e.g. “home” and “kitchen” in female authors vs. “country” and “earth” in male authors), the language of certainty (more confidence among men, less among women), preferred terminology (women focus on family, relationships, and interiority, while men focus on exteriority), and other clearly distinguishable linguistic markers. Although the trend of stereotypical markers persists in the literature of all three periods, the authors note that contemporary women writers are increasingly less confined to traditional stereotypical domains, which have changed considerably over time.
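The idea behind a Zeta-style comparison can be sketched in a few lines of Python. The sketch below is not the stylo implementation: it assumes a simplified variant in which a word's score is the share of text segments containing it in one corpus minus the corresponding share in the other. The function names, the segment size, and the two toy token lists are hypothetical and serve only as illustration.

```python
from collections import Counter

def segments(tokens, size):
    """Split a token list into consecutive segments of `size` tokens.
    A trailing segment shorter than `size` is dropped."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]

def segment_proportions(segs):
    """For each word, the share of segments in which it occurs at least once."""
    counts = Counter()
    for seg in segs:
        counts.update(set(seg))  # count each word once per segment
    return {word: c / len(segs) for word, c in counts.items()}

def zeta_scores(segs_a, segs_b):
    """Simplified Zeta: proportion in corpus A minus proportion in corpus B.
    Scores near +1 mark words 'preferred' in A, scores near -1 words 'avoided' in A."""
    prop_a = segment_proportions(segs_a)
    prop_b = segment_proportions(segs_b)
    vocab = set(prop_a) | set(prop_b)
    return {w: prop_a.get(w, 0.0) - prop_b.get(w, 0.0) for w in vocab}

# Invented toy corpora (not taken from any of the studies discussed here).
corpus_a = "she went home to the kitchen and the home was warm and the kitchen was bright".split()
corpus_b = "he left the country and the earth was cold and the country was wide".split()

zeta = zeta_scores(segments(corpus_a, 4), segments(corpus_b, 4))
for word in sorted(zeta, key=zeta.get, reverse=True)[:3]:
    print(word, round(zeta[word], 2))
```

Counting segments rather than raw occurrences means that a word used consistently across a corpus scores higher than one that is frequent in a single passage, which is why this family of measures is used to find group-level markers.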

Rybicki (2015) reaches a similar conclusion in his study of gender-based authorial attribution in English fiction, which also uses Zeta. On the basis of two corpora, one covering the 18th and 19th centuries and one the 20th and 21st centuries, he carried out a cluster analysis based on Delta distances computed from the most frequent words. Rybicki explains his preference for stylometric tools over methods such as support vector machines (SVM) by arguing that literary-linguistic research calls for a more stable method, even at some cost in accuracy.
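As an illustration of the distance measure underlying such a cluster analysis, the following minimal sketch computes a Delta in the style proposed by Burrows: relative frequencies of the most frequent words are z-scored across all texts, and the distance between two texts is the mean absolute difference of their z-scores. It is a sketch under stated assumptions, not the stylo code; the function names, the use of the sample standard deviation, and the three toy texts are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev

def relative_frequencies(tokens, vocab):
    """Relative frequency of each vocabulary word in one text."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]

def burrows_delta(texts, n_mfw):
    """texts: dict mapping a text name to its token list.
    Returns a dict mapping each pair of names to its Delta distance."""
    total = Counter()
    for tokens in texts.values():
        total.update(tokens)
    vocab = [w for w, _ in total.most_common(n_mfw)]  # most frequent words
    names = list(texts)
    freqs = {name: relative_frequencies(texts[name], vocab) for name in names}

    # z-score every word's frequency across all texts
    z = {name: [] for name in names}
    for j in range(len(vocab)):
        column = [freqs[name][j] for name in names]
        mu, sd = mean(column), stdev(column)
        for name in names:
            z[name].append((freqs[name][j] - mu) / sd if sd > 0 else 0.0)

    # Delta = mean absolute difference of z-scores
    return {
        (a, b): mean(abs(x - y) for x, y in zip(z[a], z[b]))
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

# Invented toy texts: "a" and "b" have similar word profiles, "c" differs.
toy_texts = {
    "a": "the the the and and of".split(),
    "b": "the the the and and of the".split(),
    "c": "of of of and the the".split(),
}
delta = burrows_delta(toy_texts, n_mfw=3)
for pair, value in delta.items():
    print(pair, round(value, 3))
```

The resulting pairwise distances can then be passed to any hierarchical clustering routine; texts with small mutual Delta values end up in the same branch.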

Word frequencies were also used by Underwood, Bamman, and Lee (2018) in their paper on the gender identification of characters and the words used to describe characters in the period from the late 18th century to the 1960s. The authors’ work was based on a large collection of 104,000 texts taken from the HathiTrust corpus. A manually selected collection from the Chicago Text Lab (Chicago Novel Corpus) was used for comparative analysis. The BookNLP tool was used for gender identification and showed good results, especially for the descriptions of male and female characters. A manual sampling method, based on the Publishers Weekly list, was used to determine an author’s gender. After analyzing the statistical data, Underwood and colleagues concluded that women writers devoted attention to both women and men, whereas men wrote about women much less often. This trend remained stable over the entire period studied. To examine how gender shapes character descriptions, the authors used a representation that captures different aspects of a character simultaneously: a bag-of-words approach in which adjectives, verbs, etc. represent the characters, while words denoting gender roles and personal names are excluded. By labeling some characters with grammatical gender, the model can learn which words characterize the “masculinity” or “femininity” of fictional characters based on the vocabulary associated with them. The accuracy of the model can then show whether gender is a powerful organizing structure or whether it is becoming less prevalent. Even seemingly innocuous words can tacitly predict gender. The authors came to the conclusion that there have been significant changes in the representation of gender in literature over the last 170 years. The language used to describe fictional men and women has become less sharply marked, indicating that gender roles have become more flexible. Moreover, conventional binary roles have proven to be unstable over time, with shifting characteristics and attributes associated with each gender.
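The logic of using classifier accuracy as a measure of how strongly gender structures character description can be illustrated with a small sketch. It deliberately swaps in a multinomial Naive Bayes classifier with add-one smoothing as a simple stand-in; it is not the model used by Underwood, Bamman, and Lee. All function names and the toy character vocabularies are hypothetical.

```python
import math
from collections import Counter

def train_nb(samples):
    """samples: list of (label, words) pairs, one per character.
    Returns per-label word counts, per-label character counts, and the vocabulary."""
    word_counts, char_counts, vocab = {}, Counter(), set()
    for label, words in samples:
        char_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return word_counts, char_counts, vocab

def predict_nb(model, words):
    """Return the label with the highest log-probability for a bag of words."""
    word_counts, char_counts, vocab = model
    n_chars = sum(char_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(char_counts):
        total = sum(word_counts[label].values())
        score = math.log(char_counts[label] / n_chars)  # class prior
        for w in words:
            if w in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, samples):
    """Share of characters whose labeled gender is predicted correctly."""
    return sum(predict_nb(model, words) == label for label, words in samples) / len(samples)

# Invented toy data: words associated with characters, with role words and names removed.
train = [
    ("f", ["smiled", "wept", "felt"]),
    ("f", ["felt", "sighed", "smiled"]),
    ("m", ["shouted", "rode", "grinned"]),
    ("m", ["rode", "shouted", "took"]),
]
held_out = [("f", ["wept", "smiled"]), ("m", ["rode", "took"])]

nb_model = train_nb(train)
print(accuracy(nb_model, held_out))
```

Training and evaluating such a model separately for each decade would yield one accuracy value per period; a decline over time would be read as a sign that the vocabulary used for fictional men and women is becoming less distinct.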

Unlike Underwood, who excluded descriptions of gender roles from his analysis, Schumacher and Flüh (2020) use them to train a model for a tool that identifies gender in 19th-century German-language literature. First, 12 books by female authors and 12 randomly chosen books by male authors were selected. The rest of the novels were used as a training corpus. Then all kinds of gender roles of characters (daughter, mother, husband, sexual orientation, social role, etc.) were integrated into the model. Three methods were used for the analysis: Named Entity Recognition, stereotype annotation, and emotion analysis. Three gender categories were established for the Named Entity Recognition: male, female, and gender-neutral. The test was conducted on several combinations of texts. The NER results were imported into the CATMA annotation tool and supplemented with sub-categories such as gender role names. With the help of CATMA, the authors also performed an analysis of emotions, identifying which emotions are mentioned most often in the test text. Further close reading and analysis of gender-specific emotions yielded several results, for example that female characters show fear more often and anger less often than male characters.

Jockers and Kirilloff (2017) study the differences between female and male characters based on their actions, which they call “character agency”. For this task, they use a classification method based on the nearest shrunken centroids classifier, which provides class predictions and probabilities, along with feature selection. This is especially relevant here because it allows them to identify which verbs are more and which are less important for gender identification and classification. The study found strong associations between verbs and pronoun gender in the corpus as a whole. The corpus was segmented by genre, and the model achieved accuracies in predicting pronoun gender ranging from 58% to 100%, depending on the genre. The highest accuracy was observed in the anti-Jacobin, Evangelical, national tale, Gothic, industrial, and Newgate novel genres. The study suggests that the actions of characters play a crucial role in shaping our perception and comprehension of them.
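To make the link between classification and feature selection concrete, here is a compact sketch of a nearest shrunken centroids classifier following the general formulation by Tibshirani and colleagues: class centroids are standardized, shrunk toward the overall centroid by a threshold, and features whose class centroids all collapse onto the overall centroid drop out. This is not the authors' code; the function names, the threshold value, and the toy verb counts are hypothetical, and the sketch assumes at least two classes and some within-class variation.

```python
import math
from statistics import median

def fit_shrunken_centroids(X, y, delta):
    """X: list of feature rows, y: class labels, delta: shrinkage threshold."""
    n, p = len(X), len(X[0])
    classes = sorted(set(y))
    rows = {k: [r for r, lab in zip(X, y) if lab == k] for k in classes}
    overall = [sum(r[j] for r in X) / n for j in range(p)]
    cents = {k: [sum(r[j] for r in rs) / len(rs) for j in range(p)] for k, rs in rows.items()}
    # pooled within-class standard deviation per feature
    s = [math.sqrt(sum((r[j] - cents[k][j]) ** 2 for k, rs in rows.items() for r in rs)
                   / (n - len(classes))) for j in range(p)]
    s0 = median(s)  # stabilizing constant added to every s_j
    shrunken = {}
    for k, rs in rows.items():
        m_k = math.sqrt(1 / len(rs) - 1 / n)
        shrunken[k] = []
        for j in range(p):
            scale = m_k * (s[j] + s0)
            d = (cents[k][j] - overall[j]) / scale             # standardized difference
            d_shrunk = math.copysign(max(abs(d) - delta, 0.0), d)  # soft thresholding
            shrunken[k].append(overall[j] + scale * d_shrunk)
    priors = {k: len(rs) / n for k, rs in rows.items()}
    return {"overall": overall, "centroids": shrunken,
            "scale": [sj + s0 for sj in s], "priors": priors}

def predict(model, x):
    """Assign x to the class with the smallest standardized distance to its shrunken centroid."""
    def score(k):
        dist = sum(((xj - cj) / sc) ** 2
                   for xj, cj, sc in zip(x, model["centroids"][k], model["scale"]))
        return dist - 2 * math.log(model["priors"][k])
    return min(model["centroids"], key=score)

def selected_features(model, names):
    """Features that still separate at least one class from the overall centroid."""
    return [name for j, name in enumerate(names)
            if any(c[j] != model["overall"][j] for c in model["centroids"].values())]

# Invented toy data: counts of three verbs following "she" or "he" in six text samples.
verbs = ["wept", "rode", "said"]
X = [[3, 0, 5], [4, 1, 5], [3, 1, 4], [0, 4, 5], [1, 3, 4], [1, 4, 4]]
y = ["she", "she", "she", "he", "he", "he"]

nsc_model = fit_shrunken_centroids(X, y, delta=1.0)
print(selected_features(nsc_model, verbs))
print(predict(nsc_model, [4, 0, 5]), predict(nsc_model, [0, 5, 4]))
```

In this toy setting, the weakly discriminating verb is removed by the shrinkage step while the two strongly gendered verbs survive, which mirrors how the method separates more important from less important verbs.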

In addition to descriptive methods, Koolen (2018) also uses a machine learning approach. Her research consists of two main parts. In the first, she investigates how gender affects an author’s style. Three classification experiments were conducted using LIWC, machine learning, and topic modeling, respectively. In addition to gender, variables such as country of origin and whether or not the author had won a literary award were taken into account. The LIWC experiment showed which groups of words were more likely to indicate female authors and which were more likely to indicate male authors. The machine learning experiment compared two corpora, the Riddle corpus and the Nominees corpus: a Support Vector Classifier was trained on bag-of-words features based on the relative frequencies of lemmas and then evaluated. The experiments showed that the machine learning method can determine the gender of the author of a novel with an accuracy of 83%, with male authors being classified better than female authors.
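The difference between an overall accuracy figure and the observation that one group is classified better than the other can be shown with a short evaluation sketch. The labels and predictions below are invented for illustration and are not Koolen's results; the function names are hypothetical.

```python
from collections import Counter

def overall_accuracy(gold, predicted):
    """Share of all items whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def per_class_recall(gold, predicted):
    """For each gold label, the share of its items that were predicted correctly."""
    correct, total = Counter(), Counter()
    for g, p in zip(gold, predicted):
        total[g] += 1
        correct[g] += (g == p)
    return {label: correct[label] / total[label] for label in total}

# Invented example: six novels, one female-authored novel misclassified as male.
gold = ["f", "f", "f", "m", "m", "m"]
predicted = ["f", "m", "f", "m", "m", "m"]

print(round(overall_accuracy(gold, predicted), 2))
print(per_class_recall(gold, predicted))
```

Reporting the per-class values alongside the overall figure makes visible whether errors are spread evenly or concentrated in one group of authors.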

The second part of Koolen’s study focuses on the connection between descriptions of physical appearance and gender. The first aim is to explore whether female authors really devote more attention to this aspect than male authors do. For this purpose, she extracted descriptions of appearance from a corpus of chick lit novels. Three extraction methods were used: lexical-syntactic queries, machine learning using an SVM, and a hybrid approach in which the results of the queries served as features for the SVM classifier. Koolen concludes that manually constructed queries perform better than standard machine learning for extracting information about physical appearance in novels. Both methods have strengths and weaknesses, but the automated method is not robust enough for unseen text. A manual analysis of the gathered sentences shows that descriptions of physical appearance are abundant in literary novels and not necessarily more frequent in chick lit. Author gender is connected to differences in physical appearance descriptions, with male literary authors describing appearance the most.
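A manually constructed query of this kind can be approximated, in a much reduced form, by a regular expression. The sketch below is purely lexical and does not use syntactic parses, so it is only a rough analogue of lexical-syntactic queries and not Koolen's actual queries; the body-part list, the pattern, and the function name are hypothetical.

```python
import re

# Hypothetical mini-lexicon of body-part nouns.
BODY_PARTS = ["hair", "eyes", "face", "hands", "lips"]

# Possessive pronoun, up to two intervening words, then a body-part noun.
PATTERN = re.compile(
    r"\b(his|her)\s+((?:\w+\s+){0,2}?)(" + "|".join(BODY_PARTS) + r")\b",
    re.IGNORECASE,
)

def appearance_mentions(sentence):
    """Return (pronoun, modifiers, body part) triples found in a sentence."""
    return [
        (m.group(1).lower(), m.group(2).split(), m.group(3).lower())
        for m in PATTERN.finditer(sentence)
    ]

print(appearance_mentions("Her long dark hair fell over his face."))
print(appearance_mentions("She took her coat."))
```

Such hand-written patterns are transparent and precise on the constructions they anticipate, but they miss anything outside the lexicon or the expected word order, which is the trade-off against a trained classifier.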

23.3 Limitations and Ethical Issues

Gender studies in CLS raises ethical issues that must be taken into account in the analysis. Koolen discusses this topic thoroughly in her book Reading beyond the female (2018). Gender categorization poses unavoidable problems: gender is a primary characteristic that people use for classification, and its use can lead to stereotyping and essentialism. The use of NLP and big data can exacerbate these issues, making this a pressing topic for the field. Koolen also discusses the issue of neutrality and objectivity in NLP techniques, particularly in machine learning models that can identify stylistic differences between male and female authors. The researcher’s choice to train the algorithm on gender as two groups means that the technique is not neutral and shares the problems of the descriptive method. Predictive success does not necessarily reflect the explanatory value of the gender division. Thus, NLP-based work needs to keep the same issues in mind as descriptive work.

Koolen also discusses the potential for bias in corpus-based gender research. One issue is the violation of the assumption that a corpus is a statistically representative sample. Confounding factors, such as the types of products that men and women tend to write, read, or review, can affect the validity of conclusions drawn from the dataset. Another issue is the potential for publication bias. Controlling for author and text type characteristics is therefore important in corpus-based gender research. Even within the text type of fictional novels, subgenres have their own characteristics that might be erroneously attributed to gender. A further issue raised by Koolen is that researchers often accept gender as a cause of difference without seeking supporting research beyond the chosen dataset, which can lead to bias and promote the separation of female and male authors in literary judgments. As a good example of acceptable gender-related text research, Koolen highlights Rybicki (2015), who examined whether a corpus supposedly consisting entirely of female authors might actually include works by male authors.

Limitations connected with the corpus have also been noted in some of the other works we have mentioned. For example, Weidman and O’Sullivan (2017) argue that their conclusions hold only for their dataset and leave out some aspects, such as the collective macro-level contribution of women to literature. Jockers and Kirilloff (2017) note that their results might have been different with a larger corpus and richer metadata. In addition, their results support gender stereotypes to some extent. Future research of this kind could take into account irony, a wider range of forms of character presence (e.g. first-person forms), and narrative time.


See works cited and further readings on Zotero.

Citation suggestion

Evgeniia Fileva (2023): “Analysis for Gender”. In: Survey of Methods in Computational Literary Studies (= D 3.2: Series of Five Short Survey Papers on Methodological Issues). Edited by Christof Schöch, Julia Dudar, Evgeniia Fileva. Trier: CLS INFRA. URL: https://methods.clsinfra.io/analysis-gender.html, DOI: 10.5281/zenodo.7892112.

License: Creative Commons Attribution 4.0 International (CC BY).