3  Introduction to Data Analysis

Andressa Gomide (Porto) and Christof Schöch (Trier)

3.1 Introduction

Once the data has been preprocessed and annotated (see the chapter “General Issues in Preprocessing and Annotation”, Chapter 2), the linguistic and literary investigation can begin. Some common categorizations of approaches distinguish bottom-up and top-down approaches (Biber, Connor, and Upton 2007); corpus-driven and corpus-based approaches (see, e.g. Biber 2005); exploratory and focused approaches (see Partington 2009; Gries 2010; Gabrielatos 2018); and deductive and inductive approaches (Stefanowitsch 2020).

These categorizations, with some differences, distinguish two main approaches. In the exploratory approach, the research starts with a very general question and approaches the data with an open mind (Stefanowitsch 2020, 61). The focused approach is much more specifically hypothesis-driven, but it can of course also investigate hypotheses that were first raised using the exploratory approach. When conducting a focused investigation, the observations about the data are restricted to the elements in question.

Beyond querying corpora and establishing and comparing various characteristics based on frequency or dispersion, statistically more advanced methods are frequently used in CLS as well. Two such families of methods are clustering and classification. Finally, it is often necessary to first approach a given question, issue or textual phenomenon using human annotation, before one can attempt to train a machine learning algorithm to identify the phenomenon in question automatically.

Data analysis is very often an incremental cycle in which refined hypotheses are derived from previous results. For this and many other reasons, it is crucial, when performing data investigations, to document and share both the data and the procedures used to reach the final results.

3.2 Tools for corpus analysis

The choice of tools for data analysis varies according to the research question(s), the language used, the type of analysis, personal interest, the type of data annotation, etc. Researchers with no programming skills usually prefer web applications like SketchEngine (Kilgarriff et al. 2014) or NoSketchEngine, or standalone tools like AntConc (Anthony 2022), TXM (Heiden and Lavrentiev 2013) or stylo (Eder, Rybicki, and Kestemont 2016), for their usability. However, such tools have some drawbacks, such as a limited number of functionalities and limited flexibility and customizability. Users looking for customization tend to prefer suitable programming language libraries. Examples include NLTK (Natural Language Toolkit), spaCy (a Python-based library for NLP), Gensim (a Python-based library for topic modeling and word embeddings), Mallet (a Java-based library for machine learning, including topic modeling) or scikit-learn (a general-purpose machine-learning library for Python). A downside of this route is the steep learning curve, but the effort is usually worth it. Normally, the more sophisticated or powerful the tool is, the more fully it allows the user to explore the potential of the data.

3.3 Searching

A fundamental approach to dealing with textual data is searching in them.

3.3.1 Types of queries

Corpora can be queried by orthographic word(s) and their variants; by the relationship of these tokens to other tokens (context); or by a combination of the two. In many cases, orthographic searches (or word form searches) are sufficient. In addition, most corpus search systems support regular expressions, which make it possible to search for patterns. However, to search for advanced combinations of tokens and their annotations, we need a robust query system. Such systems allow us to look for (sequences of) tokens and their annotations. Some examples are the Corpus Query Processor (CQP) (S. Evert 2008), the Corpus Query Language (CQL) (Jakubíček et al. 2010), and ANNIS (Zeldes et al. 2009).

3.3.2 Precision and recall

However powerful a query processor may be and however sophisticated a query may be, it is very unlikely that the search result is a perfect representation of what we intended. Among the retrieved results, we can have true positives (relevant items) and false positives (irrelevant items). We should also consider that the search might fail to retrieve some relevant items (false negatives), while the irrelevant items it correctly leaves out are the true negatives. With data on these four cases, we can calculate the precision (the ratio of relevant retrieved items to all retrieved items) and the recall (the ratio of relevant retrieved items to all relevant items) of the results.
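The two ratios can be made concrete with a minimal sketch. It assumes that a human annotator has marked which hits are actually relevant; the function name and the hit identifiers are hypothetical and serve only as an illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved hits vs. truly relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)  # relevant items that were found
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical data: IDs returned by a query vs. IDs marked relevant by an annotator.
retrieved_hits = {1, 2, 3, 4, 5}
relevant_hits = {2, 3, 5, 8, 9, 10}

precision, recall = precision_recall(retrieved_hits, relevant_hits)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this toy case, three of the five retrieved hits are relevant (precision 0.60), and three of the six relevant items were retrieved (recall 0.50).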

3.4 Frequency

Another fundamental aspect when dealing with textual corpora in CLS is to consider the frequency of word tokens or other features.

3.4.1 Significance testing and effect-size

Frequency is the core of quantitative studies, but frequency on its own can be deceiving and unreliable, as it only gives information about the sample, not about the population that the sample is meant to represent. For this reason, statistical significance tests are frequently used to verify that a result was not obtained by chance, but that it “reflects the characteristics of the population from which the sample was drawn” (Sirkin 2006, 306). Effect-size measures can also be used to quantify the strength of the difference or relationship found in the results (see, e.g. Ellis 2010, 3–5; Gabrielatos 2018).
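The difference between significance and effect size can be illustrated with a small sketch. It assumes a Pearson chi-squared test without continuity correction on a 2x2 table and the phi coefficient as effect-size measure; both are only one choice among several, and the counts are hypothetical.

```python
import math

def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    # Phi coefficient: effect size for a 2x2 table.
    phi = math.sqrt(chi2 / n)
    return chi2, p_value, phi

# Hypothetical counts: one word vs. all other tokens in two corpora of 100,000 tokens each.
chi2, p_value, phi = chi_squared_2x2(120, 99_880, 80, 99_920)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, phi={phi:.4f}")
```

With these counts, the difference is significant at the conventional 0.05 level, while phi stays below 0.01. This is a typical situation in large corpora and one reason why both kinds of measures are reported.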

3.4.2 Contrasting

A common way of investigating the difference between two datasets or two subgroups of texts within a corpus is through the calculation of keyness, or distinctiveness. By calculating keyness we can identify keywords. The usual understanding of keyness is that a word or other feature is key when it occurs more often in a corpus or corpus section than would be expected based on a reference or comparison corpus. However, semantic definitions of keyness referring to aboutness, salience or discriminatory power have also been proposed (Schröter et al. 2021) and may motivate the use of dispersion in addition to or instead of frequency to define keyness measures. For a better understanding of different measures to calculate keyness or distinctiveness and their implications, see Du et al. (2021).
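As an illustration of the frequency-based understanding of keyness, the following sketch ranks words by the log-likelihood ratio (G2), a widely used measure. The choice of this measure, the function names and the toy corpora are assumptions made for the example only.

```python
import math
from collections import Counter

def log_likelihood(freq_target, freq_ref, size_target, size_ref):
    """Two-term log-likelihood ratio (G2) for one word in a target vs. a reference corpus."""
    total = freq_target + freq_ref
    expected_target = size_target * total / (size_target + size_ref)
    expected_ref = size_ref * total / (size_target + size_ref)
    g2 = 0.0
    for observed, expected in ((freq_target, expected_target), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

def keywords(target_tokens, ref_tokens, top_n=5):
    """Words that are relatively more frequent in the target corpus, ranked by G2."""
    target, ref = Counter(target_tokens), Counter(ref_tokens)
    n_target, n_ref = len(target_tokens), len(ref_tokens)
    scored = [
        (word, log_likelihood(freq, ref[word], n_target, n_ref))
        for word, freq in target.items()
        if freq / n_target > ref[word] / n_ref
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical toy corpora of ten tokens each.
target_tokens = "the ghost walked the dark castle and the ghost wept".split()
ref_tokens = "the clerk walked the busy street and the clerk smiled".split()

top_keywords = keywords(target_tokens, ref_tokens)
for word, score in top_keywords:
    print(f"{word}\t{score:.2f}")
```

Words shared equally by both corpora (here "the", "walked", "and") are not listed, whereas "ghost" receives the highest score.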

3.4.3 Dispersion

Besides the aforementioned tests, it is also important to know how the observed pattern is distributed across corpus units. This is normally established by applying dispersion measures. Dispersion may be defined as the degree to which occurrences of a word (or of lemmas, phrases, or annotations of any sort) are uniformly distributed across a corpus. If a word occurs much more often in one text of a corpus than in the other texts, it can be said to be unevenly dispersed. Conversely, an evenly dispersed word is expected to have a relatively constant presence across all corpus texts (Gries 2008). There are different methods and measures to calculate dispersion (see e.g. Gries 2021). A very simple, yet widely used way of reporting the distribution of frequencies in a corpus is the document frequency, which merely counts how many texts or sections in the corpus contain the searched word.
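The following sketch contrasts the document frequency with one of the dispersion measures discussed in the literature, the deviation of proportions (DP). The implementation follows the basic, unnormalized form of DP as we understand it; the function names and the three toy texts are hypothetical.

```python
def document_frequency(word, texts):
    """Number of texts in which the word occurs at least once."""
    return sum(1 for tokens in texts if word in tokens)

def deviation_of_proportions(word, texts):
    """DP: 0 means perfectly even dispersion; values near 1 mean very uneven dispersion."""
    total_tokens = sum(len(tokens) for tokens in texts)
    total_hits = sum(tokens.count(word) for tokens in texts)
    if total_hits == 0:
        raise ValueError(f"{word!r} does not occur in the corpus")
    dp = 0.0
    for tokens in texts:
        expected_share = len(tokens) / total_tokens       # share of the corpus in this text
        observed_share = tokens.count(word) / total_hits  # share of the word's hits in this text
        dp += abs(observed_share - expected_share)
    return dp / 2

# Hypothetical corpus of three equally long texts.
texts = [
    "sea storm after storm".split(),
    "the sea was calm".split(),
    "a quiet sea today".split(),
]

for word in ("sea", "storm"):
    df = document_frequency(word, texts)
    dp = deviation_of_proportions(word, texts)
    print(f"{word}: document frequency={df}, DP={dp:.2f}")
```

Here "sea" occurs once in every text and is therefore evenly dispersed, whereas both occurrences of "storm" fall into a single text.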

3.4.4 Cooccurrences and relationships

Investigating the relationship between elements (tokens, annotations) in texts is a very common practice when analysing corpora. A straightforward method is the generation of frequency lists of n-grams. N-grams are sequences of a given number of words. They are widely used in language modelling (see, e.g. Jurafsky and Martin 2023). In discourse-oriented research, these sequences are also known as lexical bundles (Biber 1999).
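A frequency list of n-grams can be produced with very little code, as the following sketch shows. It assumes simple whitespace tokenization, and the example sentence is only an illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "it was the best of times it was the worst of times".split()

# Frequency list of bigrams (n = 2), most frequent first.
bigram_counts = Counter(ngrams(tokens, 2))
for bigram, count in bigram_counts.most_common(3):
    print(" ".join(bigram), count)
```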

A commonly used technique to investigate cooccurrences for specific words is the exploration of collocations, or “the phenomenon surrounding the fact that certain words are more likely to occur in combination with other words in certain contexts” (Baker 2006). The method used to identify collocations has a great impact on the results, and the choice of the most suitable measure varies according to the intent of the research (see, e.g. St. Evert 2009).
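One of the many association measures used for this purpose is pointwise mutual information (PMI). The sketch below makes two simplifying assumptions: only directly adjacent word pairs are considered as candidate collocations, and pairs below a minimum frequency are ignored. The function name and the example data are hypothetical.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """PMI (log base 2) for adjacent word pairs occurring at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens, n_bigrams = len(tokens), len(tokens) - 1
    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue  # PMI is unreliable for rare pairs
        p_pair = count / n_bigrams
        p_first = unigrams[first] / n_tokens
        p_second = unigrams[second] / n_tokens
        # Observed probability of the pair relative to what independence would predict.
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores

tokens = "strong tea or strong coffee but powerful computer and strong tea".split()
pmi_scores = bigram_pmi(tokens)
for pair, score in pmi_scores.items():
    print(" ".join(pair), round(score, 2))
```

A positive score indicates that the pair occurs more often than expected if the two words were independent. Other measures, such as log-likelihood or t-score, would rank candidates differently.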

3.5 Supervised versus unsupervised methods

Supervised learning (of which classification is the most common form in CLS) is the machine learning term for problems in which we use computational aid to give us answers about new data, teaching the computer not just to recognize specific patterns, but also to relate such patterns to specific, previously defined categories, such as authors. Applying supervised methods requires dividing the data into training and test datasets, with the former serving as model data from which the algorithm learns the patterns and the latter being used to evaluate what has been learned. This means, of course, that the training dataset needs some kind of labels (e.g. author names if we are training for authorial signal detection, or, in more advanced cases, markup of the specific patterns that are to be learned) that indicate the distinct groups. In contrast to supervised methods, unsupervised learning (of which clustering is the most common form) does not require a division into two datasets: algorithms of this kind immediately group texts based on their specific features (e.g. words, chunks of words or characters, or more complex semantic, syntactic, or structural features).

3.5.1 Clustering

Within CLS, clustering methods are commonly used on a range of occasions. The most prominent example is probably authorship attribution, where dendrograms (tree diagrams) based on text similarity matrices, in turn based on text similarity measures, are routinely used to display and discuss results. In addition, Principal Component Analysis (PCA) is frequently used for stylometric analyses, whether focused on authorship or on other categories such as gender, literary period, or literary genre or subgenre. One of the interesting features of PCA is that the weights of features associated with each principal component can also be inspected, something that often supports interpretation. A prominent tool supporting the creation of distance-based dendrograms and frequency-based PCA plots is stylo (Eder, Rybicki, and Kestemont 2016).
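The chain from text similarity measure to dendrogram can be sketched in a few lines. The example assumes relative frequencies of a handful of function words as features, Manhattan distance as the distance measure, and single-linkage agglomerative clustering. It is not the procedure implemented in stylo; the texts, labels and word list are hypothetical.

```python
from collections import Counter
from itertools import combinations

VOCABULARY = ["the", "and", "of", "but"]  # hypothetical feature set

def relative_frequencies(text):
    tokens = text.lower().split()
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in VOCABULARY]

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def single_linkage(profiles):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    def cluster_distance(pair):
        # Distance between two clusters = distance of their two closest members.
        return min(manhattan(profiles[a], profiles[b]) for a in pair[0] for b in pair[1])

    clusters = [frozenset([name]) for name in profiles]
    merges = []
    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=cluster_distance)
        merges.append((sorted(left), sorted(right), cluster_distance((left, right))))
        clusters = [c for c in clusters if c not in (left, right)] + [left | right]
    return merges

texts = {
    "A1": "the sea and the sky and the wind",
    "A2": "the hill and the road and a gate",
    "B1": "of love but of loss but of hope",
    "B2": "of war but not of peace or fear",
}
profiles = {name: relative_frequencies(text) for name, text in texts.items()}
merges = single_linkage(profiles)
for left, right, distance in merges:
    print(left, "+", right, f"at distance {distance:.3f}")
```

The list of merges, read from first to last, corresponds to reading a dendrogram from its leaves to its root: the two A texts are joined first, then the two B texts, and only at a much larger distance the two groups.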

When we want to gain a general understanding of the main ideas in a corpus, clustering and topic modeling are frequently used techniques of data exploration. Clustering involves grouping the tokens in the corpus into meaningful groups. Two common clustering methods are the scatter browsing method and hierarchical clustering. The scatter method is more suitable for an exploratory approach (see e.g. Cutting et al. 2017; Hearst and Pedersen 1996), while hierarchical clustering allows for query-specific investigations (Feldman and Sanger 2006, 82–84).

Topic modeling is another approach that can be understood as being based on identifying clusters of words based on their co-occurrence patterns. It is often used to help researchers summarize and explore large corpora: “Topic modeling algorithms (…) analyze the words of the original texts to discover the themes that run through them [and] how those themes are connected to each other” (Blei 2012, 7). Latent Dirichlet Allocation (LDA) (also described by Blei, Ng, and Jordan 2003) is currently the most used model, but many alternatives exist.

Very briefly put, topic modeling can be understood to be a process of dimensionality reduction. Using the in-document co-occurrence of words as the key information, the often very large term-document matrix is decomposed, in a sense, into two matrices of much smaller size: a term-topic matrix and a topic-document matrix. Because there are usually far fewer topics than there are word forms, this yields not only a dimensionality reduction, but also an analytically useful summary of the data. Two prominent tools supporting topic modeling are Mallet, a Java-based command-line tool, and Gensim, a Python library.
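In probabilistic terms, this decomposition can be written down as follows. The notation is a simplified reading introduced here for illustration only: V stands for the number of word forms, D for the number of documents, K for the number of topics, and X for the term-document matrix with each document's counts normalized to proportions.

```latex
% V = vocabulary size, D = number of documents, K = number of topics (K \ll V)
P(w \mid d) \;=\; \sum_{k=1}^{K} P(w \mid k)\, P(k \mid d),
\qquad
\underbrace{X}_{V \times D} \;\approx\;
\underbrace{\Phi}_{V \times K}\;
\underbrace{\Theta}_{K \times D}
```

Each column of the term-topic matrix Φ is a probability distribution over words for one topic, and each column of the topic-document matrix Θ is a probability distribution over topics for one document.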

An interesting, recent alternative to LDA-based topic modeling is the use of large language models to discover clusters of semantically related words. One such approach is available in the Python-based tool BERTopic, which uses a combination of one of many transformer-based language models and TF-IDF calculations to derive word clusters.

3.5.2 Classification

Finally, classification is another fundamental approach that is ubiquitous in CLS and NLP. Many of the automatic methods for token-based or region-based annotations described in the chapter “General Issues in Preprocessing and Annotation” (Chapter 2) are in fact classification-based sequence labeling approaches.

Classification as a key branch of machine learning is also used in many other applications in CLS research. Again, authorship attribution (see Chapter 5) is the most prominent example, where classifiers are trained to recognize the most likely author of an unseen text based on the properties of a large number of texts of known authorship. However, this is just one example, and classification is used for any number of tasks relevant to literary studies: whether to identify the class of a given token (as in Named Entity Recognition or metaphor detection), to classify specific sequences of texts (for example in the context of direct / indirect speech and thought recognition), or to classify entire texts by any number of categories (such as author gender, subgenre, time period, canonicity status, and many more).
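The basic workflow of training on labelled texts and evaluating on held-out texts can be sketched with a very simple classifier. The example assumes a nearest-centroid approach on relative frequencies of a few function words, which is only one of many possible classifiers; the author labels, the texts and the feature list are hypothetical.

```python
from collections import Counter, defaultdict

FUNCTION_WORDS = ["the", "and", "of", "but"]  # hypothetical feature set

def profile(text):
    """Relative frequencies of the selected function words in one text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in FUNCTION_WORDS]

def train(labelled_texts):
    """Average the profiles of each author's training texts into one centroid."""
    grouped = defaultdict(list)
    for author, text in labelled_texts:
        grouped[author].append(profile(text))
    return {
        author: [sum(column) / len(column) for column in zip(*profiles)]
        for author, profiles in grouped.items()
    }

def predict(centroids, text):
    """Assign the text to the author whose centroid is closest (Manhattan distance)."""
    features = profile(text)
    return min(
        centroids,
        key=lambda author: sum(abs(a - b) for a, b in zip(centroids[author], features)),
    )

# Labelled training data and held-out test data (all hypothetical).
training_set = [
    ("Author A", "the sea and the sky and the wind"),
    ("Author A", "the hill and the road and a gate"),
    ("Author B", "of love but of loss but of hope"),
    ("Author B", "of war but not of peace or fear"),
]
test_set = [
    ("Author A", "the fox and the crow and the cheese"),
    ("Author B", "of mice but of men but of dust"),
]

centroids = train(training_set)
predictions = [predict(centroids, text) for _, text in test_set]
accuracy = sum(p == gold for p, (gold, _) in zip(predictions, test_set)) / len(test_set)
print(predictions, f"accuracy={accuracy:.2f}")
```

Because the test texts play no part in training, the accuracy obtained on them gives a first indication of how well the classifier generalizes to unseen texts. In practice, libraries such as scikit-learn provide more robust classifiers and evaluation routines.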

3.6 Conclusion

It is not possible to give appropriate space to each and every method of analysis relevant to CLS research in this introductory text. However, within the different perspectives of this survey, the different chapters devoted to several key domains of interest within CLS attempt to provide a closer survey of current practices of analysis in the field.


See works cited and further readings for this chapter on Zotero.

Citation suggestion

Andressa Gomide, Christof Schöch (2023): “Introduction to Data Analysis”. In: Survey of Methods in Computational Literary Studies (= D 3.2: Series of Five Short Survey Papers on Methodological Issues). Edited by Christof Schöch, Julia Dudar, Evegniia Fileva. Trier: CLS INFRA. URL: https://methods.clsinfra.io/analysis-intro.html, DOI: 10.5281/zenodo.7892112.

License: Creative Commons Attribution 4.0 International (CC BY).