24  Evaluation for Gender Analysis

Evgeniia Fileva (Trier)

24.1 Introduction

The evaluation of gender analysis in CLS involves assessing the selected methods and their effectiveness for classification and prediction tasks. Despite the wide range of tools and techniques used for gender research in literature, not all of them demonstrate high efficiency. There is an active discussion in the CLS and DH community on this topic. Rybicki (2015) argues that there is no universal consensus on the optimal method and that comparative studies show only slight improvements. From this he concludes that the choice between methods matters less in literary studies than a stable methodology. The main tasks for research in this area are to identify and analyze visible gender signals and to predict gender on unseen material, for both characters and authors. The following text highlights research in which the authors evaluated their methods and/or results.

24.2 Evaluation of gender signals

One of the tools for identifying the gender of characters in literature is BookNLP, used by Underwood, Bamman, and Lee (2018). BookNLP has shown a high level of accuracy in assigning genders to characters based on their names and honorifics, with a precision of 94.7% for women and 91.3% for men. However, there were difficulties, such as recognizing the gender of first-person narrators in cases where there are not enough gender-specific references attached to the pronoun “I”. Regarding the predictability of grammatical gender from ungendered evidence, using characters from novels in HathiTrust, Underwood, Bamman, and Lee (2018) found that the accuracy of gender identification decreases from 1840 to 2007, and that this trend is consistent across different sources of data and modeling strategies.
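To make the per-class figures above concrete, the following minimal sketch shows how precision can be computed separately for each gender label. It is not BookNLP code and does not reproduce the study's data: the function name `per_class_precision` and the toy labels are assumptions made purely for illustration.

```python
from collections import Counter

def per_class_precision(gold, predicted):
    """Return {label: precision} for every label predicted at least once.

    Precision for a label = correct predictions of that label
    divided by all predictions of that label.
    """
    predicted_counts = Counter(predicted)
    correct_counts = Counter(p for g, p in zip(gold, predicted) if g == p)
    return {label: correct_counts[label] / n
            for label, n in predicted_counts.items()}

# Hypothetical toy annotations ("f" = woman, "m" = man); not from the study.
gold      = ["f", "f", "m", "m", "m", "f", "m", "f"]
predicted = ["f", "m", "m", "m", "m", "f", "m", "f"]

precision = per_class_precision(gold, predicted)
for label, value in sorted(precision.items()):
    print(f"precision[{label}] = {value:.3f}")
```

Reporting precision per class, rather than a single pooled figure, is what makes a difference between the two genders visible in the first place.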

When comparing characters created by women and by men, the models are consistently less accurate for characters created by women (by 2.5% on average). Gender differences seem to be more pronounced in stories written by men. The accuracy of models that attempt to distinguish character gender in groups of characters drawn only from books by men or only from books by women changes by around 10 percentage points across 230 years, from roughly 76% to 66%. Underwood, Bamman, and Lee point out that it is unclear whether this constitutes a dramatic or a subtle change, and that the strength of 76% accuracy is uncertain given that the model has only 54 words of evidence, on average, for each character. Biographical information about the author is hard to infer from such limited data.

Matt Jockers and Gabi Kirilloff (2017) classified pronoun gender on the basis of verb-pronoun combinations and studied external factors that affect the classifier’s ability to predict pronoun gender from verb associations. In 10-fold cross-validation, they observed an overall accuracy of 81%, with an error rate of 16% for male pronouns and 22% for female pronouns. The authors also conducted hold-out validation, which showed a 30% improvement over chance, suggesting a strong association between certain verbs and pronoun gender. The verbs “wept,” “sat,” and “felt” were associated with female pronouns, while “took,” “walked,” and “rode” were associated with male pronouns. The analysis identified the ten verbs (five male and five female) that were most useful in differentiating between male and female pronouns. The study found that female pronouns were slightly less gendered or “codified” than male pronouns, as some verbs typically associated with female pronouns were also used with male pronouns. The algorithm was also less confident in its assertions about verbs associated with female pronouns, suggesting that these verbs are generally more ambiguous in projecting a clear pronoun gender class.

The researchers then segmented the corpus into genres and found that the overall prediction accuracy was largely sustained, although it ranged from 58% to 100% depending on the genre. The highest accuracies were observed in the anti-Jacobin, Evangelical, national tale, Gothic, industrial, and Newgate novels. While author gender was not found to be a strong determiner of classification accuracy, there were differences in how male and female authors associated male and female pronouns with verbs. Male authors were more likely to create female characters that defy gender stereotypes, while female authors were more conventional when creating characters of their own gender. When author gender was unknown, the machine struggled more with predictions of female pronouns.
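The evaluation design described above, k-fold cross-validation with an overall accuracy and separate error rates per pronoun gender, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the majority-vote classifier is a deliberately simple stand-in for their model, the verb-pronoun pairs are invented toy data, and the majority-class baseline is only one possible reading of "chance".

```python
from collections import Counter, defaultdict

def train(pairs):
    """Count how often each verb co-occurs with each pronoun gender."""
    verb_counts = defaultdict(Counter)
    class_counts = Counter()
    for verb, gender in pairs:
        verb_counts[verb][gender] += 1
        class_counts[gender] += 1
    return verb_counts, class_counts

def predict(model, verb):
    """Predict the majority gender for a verb; back off to the majority class."""
    verb_counts, class_counts = model
    counts = verb_counts.get(verb) or class_counts
    return counts.most_common(1)[0][0]

def cross_validate(pairs, k=10):
    """k-fold cross-validation: overall accuracy and per-class error rates."""
    gold_totals, errors = Counter(), Counter()
    for fold in range(k):
        # Item i is held out in fold i % k; all other items are training data.
        held_out = [p for i, p in enumerate(pairs) if i % k == fold]
        training = [p for i, p in enumerate(pairs) if i % k != fold]
        model = train(training)
        for verb, gender in held_out:
            gold_totals[gender] += 1
            if predict(model, verb) != gender:
                errors[gender] += 1
    total = sum(gold_totals.values())
    accuracy = 1 - sum(errors.values()) / total
    error_rates = {g: errors[g] / n for g, n in gold_totals.items()}
    return accuracy, error_rates

# Hypothetical toy data: "took" occurs with both genders, so it is ambiguous.
pairs = [("wept", "f"), ("took", "m"), ("sat", "f"), ("rode", "m"),
         ("felt", "f"), ("walked", "m"), ("took", "f")] * 10

accuracy, error_rates = cross_validate(pairs, k=10)
# Assumed definition of "chance": always guessing the most frequent class.
majority_baseline = max(Counter(g for _, g in pairs).values()) / len(pairs)

print(f"accuracy: {accuracy:.3f}, baseline: {majority_baseline:.3f}")
print({g: round(rate, 3) for g, rate in error_rates.items()})
```

Keeping the error rates separate per class is what allows a statement such as "female pronouns are harder to predict than male pronouns"; a single accuracy figure would hide that asymmetry.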

Corina Koolen’s study (Koolen 2018) compared the performance of lexical-syntactic queries, an SVM classifier, and a hybrid method for identifying sentences that describe physical appearance. The evaluation used precision, recall, and F-measure, with an unweighted average to account for the small percentage of sentences containing physical descriptions. A macro evaluation, which averages the scores for both classes, was based on an F-score of 30% for sentences containing physical descriptions and 90% for those that did not; only the former was reported in the study, to avoid unjustly inflating the outcome.
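The effect of class imbalance on the reported score can be illustrated with a small sketch. It computes precision, recall, and F-score per class and then compares the macro average with the F-score of the rare class alone. The helper `prf` and the toy labels are assumptions for illustration; they are not Koolen's code or data.

```python
def prf(gold, predicted, label):
    """Precision, recall and F1 for one class, treating `label` as positive."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical toy data: 1 = sentence describes physical appearance (rare),
# 0 = sentence does not (frequent).
gold      = [1, 1] + [0] * 18
predicted = [1, 0, 1] + [0] * 17

_, _, f1_description = prf(gold, predicted, label=1)
_, _, f1_other = prf(gold, predicted, label=0)
macro_f1 = (f1_description + f1_other) / 2  # unweighted mean over classes

print(f"F1 (description):    {f1_description:.3f}")
print(f"F1 (no description): {f1_other:.3f}")
print(f"macro F1:            {macro_f1:.3f}")
```

In this toy case the frequent class pulls the macro average well above the score for the rare class, which is the kind of inflation that reporting only the rare-class F-score avoids.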

Several elements of the methods were fine-tuned in an attempt to improve performance: queries were adjusted and added, machine learning features were varied, and the size of the lexicon was tested. None of these changes improved the overall outcome, but cutting the original lexicon in half caused a 5% drop in performance. Koolen therefore concluded that enlarging the lexicon might be the easiest way to improve performance. The hybrid method outperforms the SVM classifier and is particularly effective in classifying chick lit compared to literary novels. Chick lit features more varied descriptions of physical appearance than literary novels do, with all degrees of comparison of the word “beautiful” appearing as a discriminating feature. Nonetheless, Koolen notes that one should be careful in interpreting machine learning features.

24.3 Conclusion

While the methods for analyzing gender in literature described here have shown varying degrees of success, they also face challenges such as ambiguous pronoun references, changing language use over time, and author gender biases. Fine-tuning individual elements may improve performance; enlarging the lexicon, for example, has been suggested as one way to improve the outcome. The fact that some studies observe a change in performance depending on literary period, with accuracy dropping over time, shows that performing such evaluations in CLS research is not only a best practice that is essential for assessing one’s methods and the degree of trustworthiness of one’s results, but can also provide insight into the history of gendered writing in its own right.


See works cited and further readings on Zotero.

Citation suggestion

Evgeniia Fileva (2023): “Evaluation for Gender Analysis”. In: Survey of Methods in Computational Literary Studies (= D 3.2: Series of Five Short Survey Papers on Methodological Issues). Edited by Christof Schöch, Julia Dudar, Evgeniia Fileva. Trier: CLS INFRA. URL: https://methods.clsinfra.io/evaluation-gender.html, DOI: 10.5281/zenodo.7892112.

License: Creative Commons Attribution 4.0 International (CC BY).