Evaluation in Genre Analysis

Dudar, Julia

doi:10.5281/zenodo.7892112

Julia Dudar (Trier)

14.1 Introduction

The issues relevant to evaluation in genre analysis vary depending on the methodological perspective adopted in a given study: classification, clustering, distinctive features or genre-based corpus analysis (see the chapter “Data Analysis for Genre”, Chapter 13). Beyond these, genre analysis poses some specific challenges for evaluation. For example, hybrid genres and partial or multiple assignments of texts can make evaluation more complicated. Nevertheless, most researchers ignore this difficulty and simply treat genres as categorical classes.

14.2 Classification

Some general remarks concerning the classification approach can be found in the chapter “General issues in Evaluation” (see Chapter 4). In classificatory approaches, and when relevant metadata is available in sufficient quality and quantity, evaluation can use standard statistical evaluation measures such as recall, precision, accuracy or F1-score. However, when it comes to genre analysis, there are some specific challenges that should be considered in the research design. The choice of the right evaluation measure is crucial; it depends on the design, size and construction of the corpus and, of course, on the classification approach.
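As an illustration of the standard measures named above, the following minimal sketch computes accuracy as well as per-class precision, recall and F1 from a list of gold labels and a list of predicted labels. The function name, the genre labels and the toy data are hypothetical and are not taken from any of the studies discussed here.

```python
from collections import Counter


def per_class_scores(gold, predicted):
    """Return accuracy, macro-averaged F1 and per-class precision/recall/F1."""
    assert len(gold) == len(predicted)
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            true_pos[g] += 1
        else:
            false_pos[p] += 1  # p was predicted but is wrong
            false_neg[g] += 1  # g was missed
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        tp, fp, fn = true_pos[label], false_pos[label], false_neg[label]
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(true_pos.values()) / len(gold)
    macro_f1 = sum(s["f1"] for s in scores.values()) / len(scores)
    return accuracy, macro_f1, scores


# Hypothetical toy data: six texts, three broad genres.
gold = ["fiction", "fiction", "fiction", "drama", "drama", "verse"]
predicted = ["fiction", "fiction", "drama", "drama", "fiction", "verse"]
accuracy, macro_f1, scores = per_class_scores(gold, predicted)
print(round(accuracy, 3), round(macro_f1, 3))
```

With imbalanced genre corpora, accuracy is dominated by the largest class, which is one reason why macro-averaged F1 is often reported alongside it.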

As mentioned in “Data Analysis for Genre” (see Chapter 13), Underwood (2014) classified a text collection from the HathiTrust Digital Library, focusing on relatively broad categories like prose fiction, nonfiction, drama, and nondramatic verse. The authors experimented with different classification algorithms, including random forests and support vector machines, as well as combinations of multiple algorithms. In the end, they chose an ensemble of regularized logistic models, each trained by contrasting one genre with all the other genres collectively. They chose this classifier not because it performed best, but because it can be trained relatively quickly compared to SVM or other classifiers. Given the high number of features and the range of different settings that had to be tested, the authors decided that it was the best option for them. For their task, SVM performed slightly better, but its implementation and training time were too long. A further advantage of regularized logistic regression is that it is highly interpretable. This characteristic allowed the researchers to discover unexpectedly important features of particular genres. Interestingly, most of these features were not content-oriented but structural (“the” and “my” for the fiction genre). For drama, the most heavily weighted features were stage directions and past-tense verbs of speech, but also structural features.

In their research on genre analysis, Hettinger et al. (2015) evaluated a variety of classifiers, including k-Nearest Neighbour (kNN), Naive Bayes (NB), Fuzzy Rule Learning (Rule), pruned and unpruned decision trees (Tree), Multilayer Neural Network (NN), and linear Support Vector Machine (SVM). The authors used a majority vote classifier (MV) as the baseline, which achieved an accuracy of 0.66 on the prototype dataset and 0.58 on the labeled dataset. According to the evaluation results, SVM performed best on both the prototype and the labeled dataset and for almost all feature types except social features. Naive Bayes also achieved high results, especially on the prototype dataset. Fuzzy Rule Learning performed worst, with results below the baseline for all feature types on the labeled dataset. Regarding feature types, topic features yielded better performance than stylometric and social features for all classifiers.
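A majority vote baseline of the kind mentioned above can be sketched in a few lines: it always predicts the most frequent class of the training data, so its accuracy equals the share of that class in the test data. The function name and the labels below are hypothetical, and the toy data does not reproduce the datasets of Hettinger et al.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Predict the most frequent training label for every test item."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)


# Hypothetical toy data with three subgenres.
train_labels = ["genre_a"] * 6 + ["genre_b"] * 3 + ["genre_c"] * 1
test_labels = ["genre_a"] * 3 + ["genre_b"] * 2
baseline_label, baseline_accuracy = majority_baseline_accuracy(
    train_labels, test_labels
)
print(baseline_label, baseline_accuracy)
```

A classifier is only informative if it clearly exceeds this value, which is why results below the baseline, as reported for Fuzzy Rule Learning, indicate that the classifier has not learned anything useful about the genres.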

14.3 Clustering

In clustering-based approaches, evaluation is a little less straightforward, but measures such as the Adjusted Rand Index or the Silhouette coefficient can be used.
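To make the first of these measures more concrete, the following minimal sketch computes the Adjusted Rand Index from the contingency table of gold labels and cluster assignments. It compares pairs of items rather than label names, so no mapping between clusters and genres is required. The function name and the toy data are hypothetical, and the sketch assumes at least two items.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(gold, clusters):
    """Chance-corrected agreement between two partitions of the same items."""
    total_pairs = comb(len(gold), 2)
    # Pairs that are together in both partitions (contingency table cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(gold, clusters)).values())
    # Pairs that are together in each partition separately.
    sum_gold = sum(comb(c, 2) for c in Counter(gold).values())
    sum_clusters = sum(comb(c, 2) for c in Counter(clusters).values())
    expected = sum_gold * sum_clusters / total_pairs
    max_index = (sum_gold + sum_clusters) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions consist of a single cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Identical partitions with different cluster names yield 1.0.
ari_perfect = adjusted_rand_index(["crime", "crime", "scifi", "scifi"], [1, 1, 0, 0])
# A partly correct clustering yields a value between 0 and 1.
ari_partial = adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1])
print(ari_perfect, round(ari_partial, 3))
```

Values close to 0 indicate agreement at chance level, and negative values are possible when the clustering agrees with the annotation less than expected by chance.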

As discussed in the chapter “Data Analysis for Genre” (see Chapter 13), Coll Ardanuy and Sporleder (2014) designed their genre analysis research around the construction of social networks. They built clusters on the basis of feature vectors extracted from these networks. According to their research design, the number of clusters was pre-defined and corresponded to the number of annotated classes (genres). Cluster evaluation was carried out with respect to the annotated data. However, the authors emphasize that this evaluation task was not trivial, as it was not always clear which labels corresponded to which clusters. Each label was assigned to the cluster that contained most of the items of the labeled class. The authors used three metrics for the evaluation: purity, entropy and F1 measure. Coll Ardanuy and Sporleder did not use feature weights, but acknowledged that doing so could be useful, as some features had a bigger impact on genre recognition than others. Despite some weaknesses in the research design (inconsistent corpus design, no special treatment of multi-labeled texts, etc.), the authors made some interesting observations. For example, they found similarities between historical, social and satirical genres: they all have a high proportion of minor or isolated nodes. In contrast, Bildungsromane and picaresque novels often have one strong key protagonist and many minor characters around him or her. For science fiction, mystery and gothic novels, a mixed point of view is characteristic.
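The label assignment and two of the three metrics can be illustrated with a minimal sketch. It uses common textbook definitions of purity and entropy, which may differ in detail from the formulas used by Coll Ardanuy and Sporleder; the function name and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def evaluate_clusters(gold, clusters):
    """Return class-to-cluster mapping, purity and weighted cluster entropy."""
    n = len(gold)
    labels_in_cluster = defaultdict(list)
    clusters_of_label = defaultdict(list)
    for label, cluster in zip(gold, clusters):
        labels_in_cluster[cluster].append(label)
        clusters_of_label[label].append(cluster)

    # Each label is mapped to the cluster holding most items of that class.
    class_to_cluster = {
        label: Counter(assigned).most_common(1)[0][0]
        for label, assigned in clusters_of_label.items()
    }

    purity = 0.0
    entropy = 0.0
    for labels in labels_in_cluster.values():
        counts = Counter(labels)
        size = len(labels)
        # Purity: share of items belonging to the dominant class per cluster.
        purity += counts.most_common(1)[0][1] / n
        # Entropy: label uncertainty per cluster, weighted by cluster size.
        cluster_entropy = -sum((c / size) * log2(c / size) for c in counts.values())
        entropy += (size / n) * cluster_entropy
    return class_to_cluster, purity, entropy


# Hypothetical toy data: six novels, two annotated genres, two clusters.
gold = ["gothic", "gothic", "gothic", "picaresque", "picaresque", "picaresque"]
clusters = [0, 0, 1, 1, 1, 1]
class_to_cluster, purity, entropy = evaluate_clusters(gold, clusters)
print(class_to_cluster, round(purity, 3), round(entropy, 3))
```

Higher purity and lower entropy indicate clusters that correspond more closely to the annotated genres. Note that the mapping can assign two labels to the same cluster, which is one way in which the correspondence between labels and clusters becomes unclear.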

As described in the chapter “Data Analysis for Genre” (see Chapter 13), Schöch (2017) discovered dominant topics in French drama by applying topic modeling algorithms. While the authors of the research mentioned above used pre-defined parameters in their analysis, Schöch decided to evaluate different topic modeling parameters and to choose the best model for his research question. For this purpose, he created 48 models based on a range of different settings, such as a varying number of topics or varying hyper-parameters. The evaluation of the models was based on a classification task in which the plays had to be classified according to their subgenre. The probabilities of each topic in each play were used as input to this task. The author used four different classifiers (Support Vector Machines, k-Nearest Neighbors, Stochastic Gradient Descent and Decision Tree), and the performance of the algorithms was evaluated in a ten-fold cross-validation setting. The classification task was solved with an accuracy between 0.70 and 0.87, with the highest results obtained by SVM. With the help of this evaluation, the author found the best parameters for his topic model, which he then applied in further experiments. The interpretation of the topic lists showed that most of the topics have a high level of coherence and helped the author to discover distinctive topics of subgenres of French Classical Drama.
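The general idea of selecting a topic model through a downstream classification task can be sketched as follows. This is a simplified illustration under several assumptions: it uses a 1-nearest-neighbour classifier instead of the four classifiers named above, contiguous folds without shuffling or stratification, five folds instead of ten because the toy data is so small, and two invented "models" whose topic probabilities are hypothetical.

```python
def k_fold_indices(n_items, k):
    """Split range(n_items) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n_items // k + (1 if i < n_items % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def nearest_neighbour_predict(train_x, train_y, x):
    """Return the label of the training vector closest to x."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    best = min(range(len(train_x)), key=lambda i: squared_distance(train_x[i], x))
    return train_y[best]


def cross_validated_accuracy(vectors, labels, k):
    """Mean accuracy over k folds; each fold serves once as test set."""
    accuracies = []
    for test_idx in k_fold_indices(len(vectors), k):
        held_out = set(test_idx)
        train_x = [v for i, v in enumerate(vectors) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        correct = sum(
            nearest_neighbour_predict(train_x, train_y, vectors[i]) == labels[i]
            for i in test_idx
        )
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


# Hypothetical data: ten plays, alternating subgenres, two topics per model.
labels = ["comedy", "tragedy"] * 5
models = {
    # Topic probabilities that separate the two subgenres well.
    "model_a": [
        [0.8 + 0.01 * i, 0.2 - 0.01 * i] if i % 2 == 0
        else [0.2 + 0.01 * i, 0.8 - 0.01 * i]
        for i in range(10)
    ],
    # Topic probabilities that carry no information about the subgenre.
    "model_b": [[0.5, 0.5] for _ in range(10)],
}
results = {
    name: cross_validated_accuracy(vectors, labels, k=5)
    for name, vectors in models.items()
}
best_model = max(results, key=results.get)
print(results, best_model)
```

The model whose topic probabilities lead to the highest cross-validated accuracy is then retained for further analysis. In practice, the folds would usually be shuffled and stratified by subgenre so that every fold contains examples of each class.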

14.4 Distinctive Features

When using approaches based on the extraction of distinctive features, evaluation is particularly hard. There is no gold standard, and it is not possible to establish one. One alternative method of evaluation is, again, to use downstream classification tasks.

Schöch et al. (2018) used classification to evaluate different parameters of distinctiveness measures. First, the authors performed a keyness analysis with two variants of Zeta on two corpora: a collection of French Classical and Enlightenment Drama and a collection of Spanish-language novels from Spain and Latin America. To evaluate different parameters (segment size, number of segments) and variants of Zeta, they classified the texts by genre (French collection) or by continent of origin (Spanish collection), using the distinctive words identified by the respective measure as features. A linear SVM was chosen as the classifier. The results showed that most Zeta variants outperformed the baseline, and that logarithmic Zeta performed better than Burrows Zeta. Segment size also influenced the performance: Burrows Zeta showed better classification results with large segments.
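The basic idea behind Zeta can be sketched as follows, assuming the common formulation in which a word's score is the proportion of target-corpus segments containing it minus the proportion of comparison-corpus segments containing it. This is a simplified illustration: it represents segments as sets of word types, does not implement the logarithmic variant, and does not reproduce the parameters studied by Schöch et al. The function names and the toy segments are hypothetical.

```python
def document_proportion(word, segments):
    """Share of segments (sets of word types) in which the word occurs."""
    return sum(1 for segment in segments if word in segment) / len(segments)


def zeta_scores(target_segments, comparison_segments):
    """Difference of document proportions for every word in the vocabulary."""
    vocabulary = set().union(*target_segments, *comparison_segments)
    return {
        word: document_proportion(word, target_segments)
        - document_proportion(word, comparison_segments)
        for word in vocabulary
    }


# Hypothetical segments from a target and a comparison collection.
target_segments = [
    {"roi", "amour", "mort"},
    {"roi", "honneur"},
    {"roi", "honneur"},
]
comparison_segments = [
    {"amour", "valet"},
    {"valet", "mariage"},
    {"amour", "mariage"},
]
scores = zeta_scores(target_segments, comparison_segments)
most_distinctive = max(scores, key=scores.get)
print(most_distinctive, scores[most_distinctive])
```

In this formulation, scores range from -1 (present in every comparison segment and in no target segment) to 1 (the reverse). The words with the highest scores can then serve as features for a downstream classifier, as described above.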

Du et al. (2021) used the same approach to evaluate nine measures of distinctiveness, including two variants of Burrows Zeta. For their evaluation, the authors used a corpus of 320 contemporary French novels. This corpus contained the same number of novels for each of four groups: three subgenres of lowbrow novels (crime fiction, sentimental novels and science fiction) and highbrow novels. As in the previous study, the authors classified novels by genre, using the top distinctive words delivered by each measure as features. However, they expanded the previous study by adding seven distinctiveness measures, classifying the novels into four classes instead of two, and testing the impact of the number of features on classification performance. The most important result of their study is that measures based on dispersion or distribution (such as Zeta or tf-idf) are better suited to distinctiveness analysis than frequency-based measures (such as the chi-squared test or the LLR test), as the former showed significantly better classification results than the latter, especially when using the smallest number of features.

14.5 Limitations

The issue of hybrid genres and multiple assignments of texts remains open, apart from some progress in this field made by José Calvo Tello in his dissertation (2021).

Hybrid genres can be seen as a mixture of different genres or as a new genre that combines features from existing genres. José Calvo Tello (2021) notes in his dissertation that one of the main challenges in evaluating the classification of hybrid genres is the lack of clear genre boundaries and the potential overlap of features. This is why it can be difficult to assign such texts to a single category. One option for dealing with this challenge may be the implementation of a probabilistic model that assigns multiple genre labels to each text. Another solution could be a hierarchical classification system, in which some genres are sub- or supergenres of other genres.
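The first option could be sketched as follows, under the assumption that a model outputs one probability per genre and that all genres above a threshold are assigned to the text. The predicted label set can then be compared with a multi-label annotation using a set-based measure such as the Jaccard index. The threshold, the function names and the probabilities are hypothetical, and the sketch does not represent the method used by Calvo Tello.

```python
def assign_labels(probabilities, threshold=0.3):
    """Assign every genre whose probability reaches the threshold."""
    labels = {genre for genre, p in probabilities.items() if p >= threshold}
    if not labels:
        # Fall back to the single most probable genre.
        labels = {max(probabilities, key=probabilities.get)}
    return labels


def jaccard(gold, predicted):
    """Overlap between annotated and predicted label sets (1.0 = identical)."""
    union = gold | predicted
    return len(gold & predicted) / len(union) if union else 1.0


# Hypothetical model output for one novel with a hybrid genre profile.
probabilities = {"historical": 0.45, "adventure": 0.40, "gothic": 0.15}
predicted = assign_labels(probabilities)
score_hybrid = jaccard({"historical", "adventure"}, predicted)
score_single = jaccard({"historical"}, predicted)
print(sorted(predicted), score_hybrid, score_single)
```

Unlike accuracy on a single categorical label, such a measure gives partial credit when only some of the annotated genres are recovered.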

14.6 Conclusions

Overall, a review of the research on this topic has shown that evaluating different approaches, such as determining the most suitable annotation workflow (machine-learning-based, manual or a combination of both) or selecting the best classifier, plays a critical role and has a significant impact on research results. While SVM is a popular choice among classifiers in general classification research, applying a classification approach to genre analysis requires thorough investigation and comparison of multiple classifiers due to the unique feature properties and the lack of clear boundaries between genres. The nature of literary genres is very complex, which means that a literary text cannot always be assigned to a single predefined genre category. For this reason, more research on hybrid genres and a thorough investigation of genre characteristics are needed before working on the analysis and evaluation of quantitative methods in genre analysis.

References

See works cited and further reading on Zotero.

Citation suggestion

Julia Dudar (2023): “Evaluation in Genre Analysis”. In: Survey of Methods in Computational Literary Studies (= D 3.2: Series of Five Short Survey Papers on Methodological Issues). Edited by Christof Schöch, Julia Dudar, Evgeniia Fileva. Trier: CLS INFRA. URL: https://methods.clsinfra.io/evaluation-genre.html, DOI: 10.5281/zenodo.7892112.

License: Creative Commons Attribution 4.0 International (CC BY).