13  Data Analysis for Genre

Julia Dudar (Trier)

13.1 Introduction

This text describes different methods of literary text analysis with a focus on literary genres. It is divided into four sections according to the underlying methodology: classification, distinctive features, clustering, and genre-based corpus analysis.

13.2 Classification

This section describes studies that approach the issue of genre with a classification-based methodology. In recent years, the automatic genre-based classification of long literary texts has gained popularity among researchers. This is not surprising, as this methodology makes it possible to investigate a large number of literary texts in a short period of time. Features used for classification can be extracted with different methods: most frequent words, distinctive words, stylometric analysis, or topic modeling. Moreover, such classification can be based on different algorithms, for example Naïve Bayes, SVM, logistic regression, or k-nearest neighbours. There have been a considerable number of publications in this field, and we introduce some of them here, trying to cover the variety of approaches.
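
To make the general pipeline concrete, the following is a minimal sketch, not taken from any of the studies discussed here. It assumes relative frequencies of the most frequent words as features and a k-nearest-neighbours classifier with Euclidean distance. The function names and the toy snippets are invented for illustration; real studies work with full novels and much larger vocabularies.

```python
from collections import Counter
import math
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def most_frequent_words(texts, n):
    """Return the n most frequent words across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(n)]


def mfw_vector(text, vocabulary):
    """Relative frequency of each vocabulary word in one text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty texts
    return [counts[word] / total for word in vocabulary]


def knn_predict(train_vectors, train_labels, vector, k=1):
    """Predict a label by majority vote among the k nearest training vectors."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: math.dist(pair[0], vector),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy, invented snippets standing in for full novels (assumption).
train_texts = [
    "the ship sailed and the crew fought the storm at sea",
    "the pirates chased the ship across the sea",
    "she loved him and he loved her and they wept",
    "her heart ached and she wept for him",
]
train_labels = ["adventure", "adventure", "sentimental", "sentimental"]

vocabulary = most_frequent_words(train_texts, n=10)
train_vectors = [mfw_vector(text, vocabulary) for text in train_texts]

new_text = "the crew left the ship and crossed the sea"
predicted = knn_predict(
    train_vectors, train_labels, mfw_vector(new_text, vocabulary), k=1
)
print(predicted)
```

Swapping in another feature type or another classifier changes only `mfw_vector` or `knn_predict`; the overall structure of extracting features, fitting on labeled texts and predicting labels for new texts stays the same.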

Underwood et al. (2013) identify two challenges for the automatic classification of literary texts by genre. Both are related to heterogeneity within works of one genre. The first challenge is heterogeneity caused by changes in texts over time, as literary history spans several centuries. The second challenge is the length of books: literary texts are much longer than journal articles and are internally heterogeneous. For this reason, they need to be segmented for classification. To address these challenges, the authors introduce a multi-layered solution with trained hidden Markov models and several overlapping classifiers. For their analysis, they use a collection of 469 texts from the HathiTrust Digital Library. In their study, they focus on relatively broad categories such as prose fiction, nonfiction, drama, and nondramatic poetry.
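
The segmentation step mentioned above can be illustrated with a short sketch. It only shows how a long text might be cut into fixed-length token windows before classification. It does not reproduce the hidden Markov models or the overlapping classifiers of Underwood et al. (2013), and both the function name and the segment length are assumptions made for illustration.

```python
def segment_tokens(tokens, segment_length):
    """Split a token list into consecutive segments of segment_length tokens.

    A shorter final segment is kept so that no part of the text is discarded.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    return [
        tokens[start:start + segment_length]
        for start in range(0, len(tokens), segment_length)
    ]


# Invented example sentence standing in for a whole volume (assumption).
tokens = "a long book may mix prose verse and drama within one volume".split()
segments = segment_tokens(tokens, segment_length=5)
for segment in segments:
    print(segment)
```

Each segment can then be classified on its own, which allows different parts of one volume to receive different genre labels.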

Hettinger et al. (2015) classify a large set of German novels, experimenting with different machine learning algorithms and different types of features. In particular, they explore how different types of features affect the performance of different classifiers. Besides most common words and punctuation marks, which they treat as stylometric features, they use topic-based features extracted using an LDA algorithm and features extracted from social network graphs (character and interaction graphs). In addition, a number of classifiers are implemented and evaluated for this task, among them Naïve Bayes (NB), Fuzzy Rule Learning (Rule), Multilayer Neural Network (NN) and linear Support Vector Machine (SVM). The data set consists of almost 1700 novels either originally written in German or translated into German. From this data set, domain experts identified 32 prototype novels that belong either to the social or to the educational subgenre. After that, 100 additional novels were labeled by experts as social or educational (labeled data). The classifiers and features were evaluated on these two data sets. The authors note that the combination of topic-based (content) features and an SVM classifier yielded the best results.

The authors extended their work (Hettinger et al. 2016) by adding one further subgenre (adventure novels) and experimenting with feature engineering. As in the previous study, they used most frequent words (up to 3000), character 4-grams, topic-based features and social network features. However, they decided to concentrate on one classifier, namely SVM, as it had shown the best results in the previous analysis. They tried different combinations of features and experimented with the number of topics in the LDA model used for extracting topic-based features. The evaluation showed that classification results for adventure novels were much higher than those for social and educational novels. Network-based features show the worst performance, while other content-based features yield similarly high results.
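
As an illustration of one of these feature types, character 4-grams can be counted with a sliding window over the text. This is a generic sketch, not the authors' implementation; the function name and the example word are assumptions, and whether whitespace and punctuation are kept is a design choice that the sketch leaves open by simply keeping every character.

```python
from collections import Counter


def char_ngrams(text, n=4):
    """Count all overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Invented single-word example (assumption); real input would be a full novel.
profile = char_ngrams("abenteuer", n=4)
print(profile.most_common())
```

In practice, such counts are usually restricted to the most frequent n-grams in the corpus and normalized by text length before being passed to a classifier.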

Calvo Tello (2021), in his dissertation, performed genre classification on different levels: fiction vs. non-fiction; narrative, drama or essay; and subgenres of the novel. For the novel vs. non-novel classification, he used CORDE as training corpus, the largest existing historical corpus for Spanish, created by the Real Academia Española. He applied a grid search to find the best parameter combination for the classification of Spanish texts. He defined 20 combinations of features, including POS, tokens, lemmas, semantic annotation, most frequent tokens, and mean and standard deviation of tokens. As classifier he chose logistic regression, as it yielded the best results for this task. The author also created the Corpus of Novels of the Spanish Silver Age (CoNSSA) and used it for the subgenre classification task. He defined 11 subgenres, each represented by at least 10 novels, and applied multi-class classification using logistic regression. As the classification performance was low, he merged the 11 classes into 3 more general classes: historical, comedy and naturalist. Although the performance in the second classification was higher than in the first one, it was still very low. Calvo Tello argued that one problem with genre classification is that a particular literary work does not necessarily belong to just one subgenre; very often it is a mixture of different subgenres, which makes it unique. This led him to the idea of multi-label classification, where each novel in the corpus is labeled with a binary vector indicating the subgenres to which the novel can be assigned. The results of the multi-label classification were well above the baseline. The author also made an observation similar to that reported by Hettinger et al. (2016): some genres, such as adventure or erotic novels, were classified more accurately than social and educational novels.
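
The binary-vector labeling used in multi-label classification can be sketched as follows. The list of subgenres and the example novel are hypothetical and much smaller than the label set in CoNSSA; the sketch only shows the encoding, not the classification itself.

```python
# Hypothetical, shortened list of subgenres (assumption, not the CoNSSA label set).
SUBGENRES = ["adventure", "erotic", "historical", "social"]


def to_binary_vector(labels, subgenres=SUBGENRES):
    """Encode a set of subgenre labels as a 0/1 vector over a fixed subgenre list."""
    unknown = set(labels) - set(subgenres)
    if unknown:
        raise ValueError(f"unknown subgenres: {sorted(unknown)}")
    return [1 if subgenre in labels else 0 for subgenre in subgenres]


# An invented novel that mixes two subgenres.
vector = to_binary_vector({"adventure", "historical"})
print(vector)  # [1, 0, 1, 0]
```

With this encoding, one binary classifier per position of the vector can be trained, so that a novel may be assigned to several subgenres at once.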

13.3 Distinctive Features

This section describes studies that approach the issue of genre by identifying distinctive features in a genre-comparative manner. The main idea of this method is to extract the most distinctive or characteristic words (known as keywords) from a target corpus in comparison to a broader, more general reference corpus. The target corpus usually consists of texts of one genre or subgenre, while the comparison corpus usually consists of texts of several genres and subgenres. Distinctive words are extracted with the help of so-called keyness measures, or measures of distinctiveness. Several studies apply, analyze and compare different measures of distinctiveness (see e.g. Lijffijt et al. 2014; Paquot and Bestgen 2009), but only a few studies are dedicated to the analysis of keyness measures used for genre comparison.

Schöch et al. (2018) describe an implementation of several variants of Zeta, used for genre-comparative analysis. Comparing the three dramatic genres of comedy, tragedy and tragicomedy, they extracted distinctive words for these genres using Zeta. The aim of their paper was to reach a better understanding of the properties of Zeta as a measure for comparative analysis and to evaluate its usefulness for quantitative genre analysis. For example, the analysis with Zeta makes clear that tragicomedy and tragedy are much closer in terms of their vocabulary than tragicomedy and comedy, so that tragicomedies can best be described as a special form of tragedy, not a special form of comedy.
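
One common formulation of Zeta subtracts the proportion of comparison segments containing a word from the proportion of target segments containing it, so that scores range from -1 to 1. The sketch below assumes this formulation; it is not the implementation evaluated by Schöch et al. (2018), who examine several variants. The segments are invented sets of word types, and the function names are chosen for illustration.

```python
def document_proportion(word, segments):
    """Share of segments (sets of word types) in which the word occurs."""
    return sum(word in segment for segment in segments) / len(segments)


def zeta(word, target_segments, comparison_segments):
    """Zeta as difference of document proportions; positive means target-typical."""
    return (
        document_proportion(word, target_segments)
        - document_proportion(word, comparison_segments)
    )


# Invented toy segments, each reduced to its set of word types (assumption).
comedy = [
    {"marriage", "laugh", "father"},
    {"laugh", "servant"},
    {"marriage", "servant"},
    {"laugh", "money"},
]
tragedy = [
    {"death", "king", "father"},
    {"blood", "king"},
    {"death", "gods"},
    {"king", "laugh"},
]

vocabulary = set().union(*comedy, *tragedy)
scores = {word: zeta(word, comedy, tragedy) for word in vocabulary}
for word, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{word:10s} {score:+.2f}")
```

Because the measure counts in how many segments a word occurs rather than how often it occurs overall, a word that is very frequent in a single text contributes little, which is why Zeta is described as dispersion-based.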

Du et al. (2021) compared two dispersion-based measures of distinctiveness, namely Eta and Zeta, using a genre analysis task. In their study, they used a balanced corpus of 160 novels published in France between 1980 and 1989. Of these, 120 are lowbrow novels from three subgenres: sci-fi, crime fiction and sentimental novels. The remaining 40 are highbrow novels. The genre analysis was based on the distinctive words extracted by comparing the novels of one genre with those of the three other genres. The authors concluded that both measures are able to detect meaningful and interpretable distinctive words for one genre compared to a more general corpus of novels.

13.4 Clustering

This section describes studies that approach the issue of genre with a methodology based on clustering, that is, by creating data-driven groups of texts that are then related to genre metadata. Among the most popular clustering approaches used in the CLS community are stylometry and topic modeling.
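
One simple way to relate data-driven clusters to genre metadata is cluster purity: the share of texts that carry the majority genre label of their cluster. The sketch below is a generic illustration and is not taken from the studies discussed in this section; the cluster assignments and genre labels are invented.

```python
from collections import Counter


def cluster_purity(cluster_ids, genres):
    """Share of texts whose genre is the majority genre of their cluster."""
    by_cluster = {}
    for cluster_id, genre in zip(cluster_ids, genres):
        by_cluster.setdefault(cluster_id, Counter())[genre] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(genres)


# Invented cluster assignments and genre labels for six texts (assumption).
cluster_ids = [0, 0, 0, 1, 1, 1]
genres = ["comedy", "comedy", "tragedy", "tragedy", "tragedy", "tragedy"]
purity = cluster_purity(cluster_ids, genres)
print(round(purity, 3))
```

A purity of 1.0 would mean that every cluster contains texts of only one genre. Purity rises trivially with the number of clusters, so it is usually read alongside other measures.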

Coll Ardanuy and Sporleder (2014) clustered novels according to genre by building social networks. They collected a corpus of 238 prominent novels and represented the plot and structure of the novels as static (describing the whole novel) and dynamic (describing one chapter) co-occurrence networks. The authors then extracted a feature vector from each social network and performed clustering over the resulting vectors, contrasting groups by genre and by author. For the genre analysis, they defined the 11 most common genres.
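
The step from a co-occurrence network to a feature vector can be sketched as follows. The sketch assumes that character mentions have already been identified and grouped by text unit (for example by paragraph). The features chosen here (number of nodes, number of edges, density, average degree) are common network descriptors and not necessarily those used by Coll Ardanuy and Sporleder (2014); the character names are invented.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(units):
    """Build weighted edges from sets of characters that appear in the same unit."""
    edges = Counter()
    for characters in units:
        for pair in combinations(sorted(characters), 2):
            edges[pair] += 1
    return edges


def network_features(edges):
    """Return [nodes, edges, density, average degree] for an undirected network.

    Characters that never co-occur with anyone do not appear as nodes.
    """
    nodes = {name for edge in edges for name in edge}
    n_nodes = len(nodes)
    possible_edges = n_nodes * (n_nodes - 1) / 2
    density = len(edges) / possible_edges if possible_edges else 0.0
    degree = Counter()
    for first, second in edges:
        degree[first] += 1
        degree[second] += 1
    average_degree = sum(degree.values()) / n_nodes if n_nodes else 0.0
    return [n_nodes, len(edges), density, average_degree]


# Invented character co-occurrences per text unit (assumption).
units = [{"Anna", "Karl"}, {"Anna", "Karl", "Lena"}, {"Lena", "Otto"}]
edges = cooccurrence_network(units)
features = network_features(edges)
print(features)
```

Computing one such vector per novel yields a matrix that any standard clustering algorithm can take as input.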

Schöch (2017) uses topic modeling to explore a corpus of French Drama of the Classical Age and the Enlightenment. The main goal of his paper is to discover the semantic types of topics that topic modeling can identify in the collection of texts. Data-driven clustering of texts helps to investigate distinctive dominant topics and plot-related topic patterns in drama collections. The results of the analysis show that different subgenres have their own lists of dominant topics.
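
Given a document-topic matrix from any topic model, dominant topics per subgenre can be approximated by averaging topic weights over the texts of each subgenre and ranking the topics. This is a simplified sketch under that assumption, not the procedure used by Schöch (2017); the topic weights and subgenre labels are invented.

```python
from collections import defaultdict


def dominant_topics(doc_topics, subgenres, top_n=2):
    """Rank topics by mean weight within each subgenre and keep the top_n indices."""
    grouped = defaultdict(list)
    for weights, subgenre in zip(doc_topics, subgenres):
        grouped[subgenre].append(weights)
    result = {}
    for subgenre, rows in grouped.items():
        means = [sum(column) / len(rows) for column in zip(*rows)]
        ranked = sorted(range(len(means)), key=lambda topic: means[topic], reverse=True)
        result[subgenre] = ranked[:top_n]
    return result


# Invented topic weights for four plays and three topics (assumption).
doc_topics = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
]
subgenres = ["comedy", "comedy", "tragedy", "tragedy"]
top = dominant_topics(doc_topics, subgenres, top_n=2)
print(top)
```

The topic indices returned for each subgenre can then be mapped back to the top words of each topic for interpretation.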

13.5 Genre-based Corpus Analysis

This section describes studies that investigate literary phenomena not with a view specifically to genre, but using a genre-based or subgenre-based corpus. The phenomena analyzed are often genre-specific, however. For example, several studies focus on an analysis of different kinds of direct speech specifically in narrative texts such as novels and/or novellas (Brunner 2012, 2013; Schöch et al. 2016; Jannidis et al. 2018). Many studies perform network analysis specifically using corpora of dramatic texts (e.g. Powell 2014; Trilcke, Fischer, and Kampkaspar 2015), as the structure of interactions is often explicitly coded in dramas, in particular when the texts have been encoded according to the guidelines of the Text Encoding Initiative. Also, many studies investigating the identification or classification of metaphors focus on poetry as their preferred genre (e.g. Tsvetkov et al. 2014; Shutova 2017; Do Dinh, Wieland, and Gurevych 2018; Reinig and Rehbein 2019). Of course, studies developing methods for metric analysis focus heavily on poetry (e.g. Hammond 2013; Carvalho, Loula, and Queiroz 2016; Navarro-Colorado 2017).

13.6 Conclusion

Summarizing the above, we can conclude that there are three main methodologies used for literary text analysis with a focus on literary genres: classification, clustering and distinctive features. The scope of such analyses is usually defined by text corpora that include literary works of several subgenres of one literary genre. Depending on the underlying methodology and the particular implementation, data analysis can be based on most frequent words, content words or topic words obtained through topic modeling, as well as other features. It is important to note that feature engineering plays a crucial role in genre analysis. Besides these approaches, there are also numerous studies that investigate literary phenomena using genre-based corpora.

References

See works cited and further readings on Zotero.

Citation suggestion

Julia Dudar (2023): “Data Analysis for Genre”. In: Survey of Methods in Computational Literary Studies (= D 3.2: Series of Five Short Survey Papers on Methodological Issues). Edited by Christof Schöch, Julia Dudar, Evegniia Fileva. Trier: CLS INFRA. URL: https://methods.clsinfra.io/analysis-genre.html, DOI: 10.5281/zenodo.7892112.



License: Creative Commons Attribution 4.0 International (CC BY).