1 Introduction

If meaning is found and created in use, and corpora are language in use, can we find meaning in corpora? The field of usage-based semantics is large and rich, so the answer to this question is clearly positive. Corpora offer an immense amount of usage data on which to carry out analyses, even if they barely scratch the surface of the amount of language that is actually produced; it is desirable and tempting to tap into this vast ocean to obtain the most detailed, the most reliable, the most thorough information. But there is a crucial bottleneck when it comes to semantic analysis: annotation is time- and energy-consuming. As long as we cannot instruct an automatic system to disambiguate each word in a corpus (as we do to tokenize and lemmatize, i.e. to identify what counts as a word and what its root is, or even to assign parts of speech or syntactic relations), semantic annotation is performed by humans. Humans are slower than computers; we get tired, we get confused, we need to eat and think of things beyond semantic annotation as well. We also disagree sometimes: what is a sense? Are these two things really the same?

Automatic disambiguation systems do exist. Word Sense Disambiguation is an important task within Natural Language Processing (nlp). The notion of task is of crucial importance here: nlp algorithms are typically concerned with concrete applications and are evaluated in terms of those applications. There exists a correct answer that the algorithm must return. This is not so directly applicable to the situation of lexicological and lexicographical research (the study of the meanings of words and their relationships), especially from a Cognitive Linguistics point of view, where hard, dichotomous answers are rare. But let’s suppose for a moment that we can reconcile both approaches, and what counts as the answer from an nlp point of view is an answer from the lexicological perspective. Then we could use automatic disambiguation procedures to do the heavy lifting of semantic annotation of our growing body of corpus data and use their results for a partial description of language, as long as we know which answer the nlp algorithm is returning or, better yet, how to ask what we want to know. Maybe tuning the algorithm for outputs that from an nlp point of view would be wrong can result in complementary answers for a richer lexicological description. Such a qualitative perspective, trying to interpret not just whether the computational model matches a target but also how or why it does (not), also requires appropriate analytical tools. One such tool represents the internal semantic structure of an item, derived from computational models, as a 2d scatterplot where instances occurring in similar contexts are shown together, forming clusters or clouds.

This dissertation is concerned with the application of distributional methods to lexicological research and their exploration by means of visual analytics. The methodology will be tested and illustrated with a set of 32 Dutch lemmas, of which concordance lines will be extracted from a corpus of newspapers. Distributional models, developed within the field of Computational Linguistics, will be introduced in Section 1.1. In Section 1.2 we will discuss their relevance in Cognitive Semantics and Section 1.3 will offer an overview of the visual analytics dimension. The study described here is part of a larger research project within the Quantitative Lexicology and Variational Linguistics research group (qlvl) at KU Leuven. A brief history of the project and how this dissertation fits in it will be offered in Section 1.4. Finally, Section 1.5 will present the structure of the dissertation.

1.1 Distributional Semantics and Computational Linguistics

Distributional semantics is a usage-based model of meaning that underlies various computational methods for semantic representation (Sahlgren 2008; Lenci 2018): it is an educational program for computers that lets them pretend they understand human languages. It relies on what is called the Distributional Hypothesis, according to which lexemes with similar meanings will have similar distributions, i.e. will occur in similar contexts. The core idea is typically attributed to Harris (1954) and Firth (1957), but exactly how enthusiastic they would be at the sight of the current implementations is disputed: Tognini-Bonelli (2001: 157) remarks that Firth would not be in favour of electronic corpora, and Geeraerts (2017) offers a comprehensive comparison between Harris’ position and current distributional semantics. The attribution issue notwithstanding, the idea that meaning can be modelled by means of distributional information is pervasive in nlp and at the core of every form of Distributional Semantics. A more important question is what we mean by meaning or semantics to begin with (Sahlgren 2006; Lenci 2008), which in this research is informed by the Cognitive Linguistics framework. Beyond the particular attention to the semantic side of distributional semantics, this dissertation sets itself apart from most mainstream computational approaches in three core aspects: its motivation, the definition of units and its reliance on context-counting models.

1.1.1 Motivation

Computational Linguistics is typically task-oriented: it aims to solve concrete challenges such as information retrieval, question answering, sentiment analysis, machine translation, etc. For that purpose, benchmarks or gold standards are developed and the models are tested against them. For example, Baroni, Dinu & Kruszewski (2014) test different kinds of models against datasets tailored to evaluate semantic relatedness, synonym detection, concept categorization, selectional preferences and analogy; see Agirre & Edmonds (2007) and Raganato, Camacho-Collados & Navigli (2017) for evaluation systems for sense disambiguation. This is understandable and appropriate in a task-oriented workflow: when it comes to output, it does not really matter how the model reached the answer, as long as it is the answer that we seek. In contrast, investigating the structure of semantic representations, i.e. the how of this process, calls for a different approach (see for example Baroni & Lenci 2011; Wielfaert et al. 2019). On the one hand, we do not assume that there is one correct answer because we do not assume that there is only one question. Beyond “Are these two words similar?” we are interested in: “Are they synonyms?” “Are they co-hyponyms?” “Are they regionally specific expressions of the same concept?” and so forth. Different models may focus on different dimensions of semantic structure and thus answer different questions. For that reason, the dataset collected for this research covers a wide range of semantic phenomena, in the hope of tuning distributional models to their identification. On the other hand, we are not confident that any of those questions has an unequivocal answer either. As Chapter 4 will show, annotators often agree on the sense of an utterance, but not always. Hence, the manual annotations will serve as a guideline for the interpretation of the models, but not as a law to judge their accuracy.

1.1.2 Units of analysis

Whereas computational models typically work at type-level and often with word forms, this dissertation focuses on token-level models with lemmas as units. Type-level modelling represents a lexical unit, such as word, as the aggregated distributional behaviour of all its occurrences, e.g. we could see that word tends to be preceded by the. Patterns can be found by accumulating and classifying contextual information from thousands if not millions of events. The profile of a type can subsequently be compared to the profiles of other types, e.g. we can see that sentence also tends to be preceded by the, while walking does not. Such a representation conflates the variation within the range of application of that item as part of one overall tendency, and is therefore not suited to study polysemy. Even if the context does contain disambiguating cues, such as “Can we have a word?” or “That word is not in the dictionary,” the type-level representation will cover both. In spite of these shortcomings, some computational approaches to modelling polysemy do try to find the patterns in the type-level representations, e.g. Koptjevskaja-Tamm & Sahlgren (2014). In contrast, the work presented here relies on token-level modelling, which represents individual instances, e.g. comparing the two occurrences of word in the examples above. This approach does originate in computational linguistics (Schütze 1998) but is far less popular than type-level approaches, which are considered the default in most introductory descriptions of distributional models (Lenci 2018; Turney & Pantel 2010; Bolognesi 2020).
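The contrast between type-level and token-level representations can be made concrete with a minimal sketch. It is only an illustration under simplifying assumptions: the three-sentence corpus, the window size and the function names (context_counts, token_vectors, type_vector) are hypothetical, and the raw first-order counts used here are much simpler than the token-level models built in this dissertation.

```python
from collections import Counter

# Toy corpus (hypothetical): the two "word" examples from the text plus one filler.
corpus = [
    "can we have a word".split(),
    "that word is not in the dictionary".split(),
    "the sentence is not in the book".split(),
]

def context_counts(tokens, position, window=2):
    """Count the words within `window` positions of tokens[position]."""
    start = max(0, position - window)
    neighbours = tokens[start:position] + tokens[position + 1:position + 1 + window]
    return Counter(neighbours)

def token_vectors(target):
    """Token level: one context profile per occurrence of `target`."""
    return [
        context_counts(sentence, i)
        for sentence in corpus
        for i, word in enumerate(sentence)
        if word == target
    ]

def type_vector(target):
    """Type level: the sum of the profiles of all occurrences of `target`."""
    total = Counter()
    for profile in token_vectors(target):
        total.update(profile)
    return total

word_tokens = token_vectors("word")  # two separate profiles, one per occurrence
word_type = type_vector("word")      # a single profile that merges both
```

In this sketch the two occurrences of word keep distinct profiles at token level, whereas the type-level profile adds them up and can no longer tell them apart.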

Apart from the distinction between modelling types or tokens, a crucial difference between this approach and many studies in computational linguistics is that the unit of analysis is the lemma instead of the word form. On the one hand, relying on word forms avoids layers of preprocessing that already incorporate a certain interpretation in terms of what counts as a word, which different forms go together and how they are classified grammatically. Sinclair (1991) also argues along these lines for the usage of word forms as lexical units in corpus linguistics. And, admittedly, different word forms of a given lemma might exhibit diverging distributional and semantic profiles. However, from a lexicological and lexicographical perspective, centring the lemma — the combination of stems and grammatical category — is the common practice. Moreover, the mismatch between word forms and lemmas — and therefore between either of them and meanings — is highly dependent on the language we describe and the words themselves. Therefore, lemmas will be the unit of analysis in this dissertation. This is not to say that the workflow depends on this decision, in the same way that it does not depend on Dutch being the language of the corpus. The methodology presented in these pages could be applied with word forms at the centre, but the degree to which the conclusions reached here would be applicable is an empirical question.

1.1.3 Context-counting and context-predicting

Currently, the most popular approach for distributional semantics relies on neural networks, i.e. context-predicting models. The methodology followed in this project relies instead on count-based or context-counting models: the values of the vectors, i.e. numerical representations of lexical units, are (relatively) directly derived from frequency counts. In contrast, the approach initiated by Mikolov et al. (2013) and which has taken over nlp, i.e. word embeddings, is a context-predicting architecture. Neural networks are trained to predict empty slots in a fragment of text: given a fixed window with a target item in the middle, cbow models are given the surrounding context in order to predict the target item, whereas skip-gram models try to predict the context based on the item in the middle. The training consists of a long sequence of trial and error: there is a right answer, i.e. the actual corpus, the algorithm starts by guessing and receives feedback, and iteratively it adapts its guessing strategy to minimise the error. The strategy consists of weights in the hidden layer of the neural network; these weights are then used to represent the target item. In other words, while a context-counting model would define the distributional profile of a word along the lines of “it tends to co-occur with chocolate and cookies but not with mycorrhiza or algorithm,” context-predicting models say, more or less, “this is how I feel/what my brain does when I see that word.” The latter is, in a sense, more in line with the core of meaning as an introspective experience that defies definitions and restrictions, although computational models are far from actually understanding language. Exploring to what degree these models approximate humans’ assessments lies in the purview of other research programmes involving psycholinguistic experiments.

Studies have been carried out to compare the performance of context-counting and context-predicting models, in terms, of course, of their accuracy with regard to popular benchmarks. Baroni, Dinu & Kruszewski (2014) found that the word2vec architecture outperformed context-counting models, much to their disappointment. In contrast, Levy, Goldberg & Dagan (2015) fine-tuned context-counting models based on the hyperparameters from word embeddings and found that performance differences were local or even insignificant.
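The difference between the cbow and skip-gram prediction tasks described above can be sketched by listing the input/output pairs that each would be trained on. This is a minimal illustration, not the word2vec implementation: the function name training_pairs, the example sentence and the window size are assumptions, and no neural network is actually trained.

```python
def training_pairs(tokens, window=2):
    """Build (input, output) pairs for the two prediction tasks.

    cbow:      input = surrounding context, output = target item
    skip-gram: input = target item, output = one context item at a time
    """
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if not context:
            continue  # a one-word text offers nothing to predict
        cbow.append((tuple(context), target))
        skipgram.extend((target, item) for item in context)
    return cbow, skipgram

# Hypothetical example sentence with a window of one word on each side.
cbow_pairs, skipgram_pairs = training_pairs(
    "she ate chocolate cookies".split(), window=1
)
```

For the target ate, the cbow task receives the context (she, chocolate) and must predict ate, while the skip-gram task receives ate and must predict she and chocolate in turn.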

When our purpose is to understand what of meaning, if anything, can be found in text data, the interpretation of context-counting models is much more transparent. We can trace the composition of the vectors to concrete frequencies and instances. As we will see in the second part of this dissertation, these supposedly more transparent models are already quite opaque, especially with the added transformation from type-level to token-level models. That said, most of the workflow described here can also be combined with context-predicting models.

The years since Mikolov et al. (2013) have seen a rapid and enthusiastic growth in the field of word embeddings and nlp, with new models continually surpassing the previous ones. One of these is bert (Devlin et al. 2019), which, in spite of its indubitable relevance to the approach proposed here, will not be explored. Bidirectional Encoder Representations from Transformers (bert) is a machine-learning technique that can represent individual instances and sentences: unlike other context-predicting models, it can be used for token-level representations. But like other context-predicting models, its output is somewhat less interpretable than that of context-counting models. It has been tested on the typical task-based benchmarks and it is so time- and resource-consuming that nlp researchers will typically use pre-trained embeddings and fine-tune them for specific tasks rather than generate them from scratch. In principle, combining a model of the bert family with the workflow described here is not impossible: as long as occurrences are represented with vectors from which we can derive pairwise distances, the rest of the analysis stays the same. However, some crucial differences remain: we do not know which elements of the context informed the models’ decision, they are based on word forms and the word forms are based on a different tokenizer. For instance, a brief test of bertje (de Vries et al. 2019), the Dutch counterpart of bert, on a section of the dataset used for this project revealed that (i) for some lemmas bertje’s answer might be closer to the human perspective, (ii) for other lemmas a deeper investigation is in order and (iii) other lemmas cannot be modelled at all because of the discrepancy in the tokenization procedure². In other words, even if combining the methodologies is possible, the actual implementation requires some planning, specific decisions and tailoring the procedure to extract as much as we can from the backstage operations in context-predicting models.

1.2 Distributional Semantics and Cognitive Semantics

As a computational approach, distributional semantics is not intrinsically linked to any particular linguistic theory. Its usage-based essence makes it a natural fit for approaches that describe the parole along with the langue (in terms of de Saussure 1971), such as Cognitive Linguistics. In the introduction to The Oxford Handbook of Cognitive Linguistics, it is described as

an approach to the analysis of natural language that originated in the late seventies and early eighties in the work of George Lakoff, Ron Langacker, and Len Talmy, and that focuses on language as an instrument for organizing, processing, and conveying information. (Geeraerts & Cuyckens 2007a: 3)

It stands in contrast to frameworks that uphold a strict separation of semantics and pragmatics, of structure and usage, of lexical knowledge and world knowledge (Geeraerts 2010a). As the introduction and composition of the Handbook show, as well as other compilations along these lines (such as Rudzka-Ostyn 1988; Kristiansen et al. 2006; Ibarretxe-Antuñano & Valenzuela 2016), the diverse field of Cognitive Linguistics is guided by a number of principles derived from this central notion of language as categorization. Among these principles, three in particular constitute the theoretical cornerstones of this study: (i) an emphasis on meaning, (ii) the notion of fuzzy and prototypical categories and (iii) a usage-based approach.

1.2.1 Everything is semantics

Understanding language as categorization and its function in the organization and communication of knowledge necessarily places the focus on meaning (Geeraerts & Cuyckens 2007b; Geeraerts 2016). From a Cognitive Linguistics perspective, all linguistic structures — not just lexical items but also syntactic patterns — are considered inherently meaningful (Langacker 2008; Lemmens 2015). Moreover, meaning in Cognitive Linguistics goes beyond traditional semantics — i.e. distinguishing linguistic from nonlinguistic features — and includes encyclopedic knowledge and pragmatics (Glynn 2010; Geeraerts 1997). While it is crucially a cognitive phenomenon involving conceptualization, it takes place in the mind of physical, embodied beings who perceive, understand, and interact with their world: meaning is embodied and neither limited to nor separated from reference (Rohrer 2007).

The centrality of semantics in Cognitive Linguistics has led to a strong body of work on meaning and on how traditional notions fit in with cognitive principles. For example, the line of work initiated in the ’80s with Lakoff & Johnson (2003) and further developed along different lines by Raymond Gibbs Jr., Gerard Steen, Zoltán Kövecses, Elena Semino and many others (see for example Gibbs & Steen 1999; Gibbs Jr. 2008; Semino 2008; Kövecses 2015) builds on understanding a traditional linguistic concept, i.e. metaphor, with the tools of Cognitive Linguistics. In these terms, metaphor refers to ways of thinking, understanding, conceptualizing, that manifest in linguistic behaviour but also permeate other areas of everyday life.

Along these lines, relationships between senses are understood as cognitive mechanisms that need not be restricted to linguistic behaviour nor to extralinguistic reference. Semantic categories such as metaphor, metonymy, specialization, homonymy and prototypicality are crucial tools to make sense of the variety of relationships between what we understand as senses. They are not unique to Cognitive Linguistics, but a framework that understands meaning as a property of any linguistic structure and as covering linguistic and extralinguistic features allows us to look for meaning in distributional models without expecting them to exhaust semantic description.

Cognitive Linguistics also incorporates the combination of a semasiological and onomasiological perspective, while previous frameworks have defined either one or the other as the only possibility (Geeraerts 2010a). A semasiological perspective, which is predominant in the research described here, starts from a form or expression and investigates its range of meanings or applications, e.g. the study of polysemy. An onomasiological perspective, on the other hand, starts from a concept and describes the forms that are used to express it, e.g. synonymy. This dissertation takes a semasiological perspective, but token-level distributional models can be used from both perspectives, as shown in De Pascale (2019).

1.2.2 Prototypicality

Among the most important notions in the Cognitive Linguistics understanding of categorization we find prototypicality and salience (Rosch 1978). Categories cannot always be described in terms of necessary and sufficient conditions; instead, they may be characterized by clusters of co-occurring properties that do not apply to all members to the same degree. They may even have fuzzy boundaries, an unclear range of application. As a property of categorization, this is a property of language, which Cognitive Linguistics embraces, incorporating a quantitative dimension to the study of meaning (Geeraerts 2010a). At this point, a quantitative perspective does not immediately require statistical methods, but refers to a shift in the understanding of what counts as meaning description. The notion of prototypicality makes it interesting, if not inevitable, to look at the uneven distribution and importance of the different features or members of a category, as is done, for example, in Geeraerts, Grondelaers & Bakema (1994) and Geeraerts (1997):

…the essence of prototype theory lies in the fact that it highlights the importance of flexibility (absence of clear demarcational boundaries) and salience (differences of structural weight) in the semantic structure of linguistic categories. (Geeraerts 2006: 74)

Given the set of meanings that a form can express, i.e. the intensional level, some of them are more salient than others. For example, given my current lifestyle, ‘device to control the cursor on a screen’ is a more salient meaning of mouse than ‘small rodent’; but, crucially, this might not be the case in other contexts, for other speakers. Given the range of application of a form or a meaning (i.e. the extensional level), some may be more typical members than others. For instance, a black, minimalist computer mouse might be more typical than a wavy, wider gaming mouse with a bright green drawing of a dragon. These situations represent intensional and extensional nonequality, respectively: some senses or members of a category are better representatives of the category than others. Both dimensions may overlap: a typical computer mouse concentrates most of the typical features of the category, regarding its functionality, size, shape and colour; conversely, a typical feature is defined by occurring frequently in the members of the category. These are two of the characteristics of prototypicality, and are complemented by intensional and extensional non-discreteness, i.e. the lack of a single set of necessary and sufficient conditions and fuzzy boundaries of the categories. As could be expected, even prototypicality is a prototypical category, as these four features need not co-exist. The relative salience of the two senses of mouse does not mean that we might find an unknown entity and be in doubt whether it is a mouse; meanwhile, discussions such as whether a tomato is a fruit might easily ensue. Geeraerts (2006: Ch. 4) offers a typology of salience phenomena as an application of prototype theory beyond the semasiological structure. For example, if from the semasiological perspective we are interested in describing how frequent (or salient) apples are as referents for the word fruit, from the onomasiological perspective we are interested in how frequently the word fruit is used to refer to apples (compared to saying apple).
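The two perspectives on the fruit example amount to dividing the same count by different totals, which a small worked sketch can show. The counts below are invented for illustration only, and the function names are hypothetical; they do not correspond to any data or code used in this dissertation.

```python
from collections import Counter

# Hypothetical naming events, stored as (word used, referent named): count.
events = Counter({
    ("fruit", "apple"): 10,
    ("fruit", "banana"): 30,
    ("apple", "apple"): 90,
})

def semasiological_salience(word, referent):
    """Of all the uses of `word`, which share refers to `referent`?"""
    word_total = sum(n for (w, _), n in events.items() if w == word)
    return events[(word, referent)] / word_total

def onomasiological_salience(word, referent):
    """Of all the times `referent` is named, which share uses `word`?"""
    referent_total = sum(n for (_, r), n in events.items() if r == referent)
    return events[(word, referent)] / referent_total

# Same numerator (10), different denominators (40 uses of fruit, 100 namings of apples).
share_of_fruit_uses = semasiological_salience("fruit", "apple")
share_of_apple_namings = onomasiological_salience("fruit", "apple")
```

With these invented counts, apples account for a quarter of the uses of fruit, while fruit accounts for only a tenth of the times an apple is named.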

The notion of (semasiological) prototypicality will be relevant for the interpretation of the modelling in Chapter 6. Until then, it also permeates the understanding of meaning that underlies this research. On the one hand, fuzzy boundaries and degrees of membership invite us to rethink the usefulness of reified senses: ambiguous examples and overlapping features are to be expected. Instead, a bottom-up procedure would rather capture configurations of features (Glynn 2014); assigning discrete senses to corpus data imposes a categorical structure that we know to be inappropriate (see also Geeraerts 1993). On the other hand, distributional models, as a quantitative approach that measures similarity between entities, are particularly well suited to such a non-discrete representation.

In this dissertation I will continue to talk about senses and I will extract discrete patterns from the non-discrete representations in terms of clusters, in order to manipulate and talk about these abstract entities, without implying that they have any ontological reality beyond the explanatory purposes. When it comes to senses, they are not considered a gold standard, a unique solution to the semasiological description of a lexical item; instead, they are guides and an operationalization of certain research questions. The clusters, on the other hand, will be generated by an algorithm that is forced to produce discrete groups but does assign its elements different degrees of membership (see Section 2.2.4). Finally, the overall approach describes tendencies, preferences, probabilities: at no level are the categories and typologies offered in this dissertation discrete and uniform. I have tried, but language resists.

1.2.3 A usage-based approach

Cognitive Linguistics presents itself as a usage-based approach and, as such, it is entirely compatible with a bottom-up, empirical, quantitative methodology such as distributional semantics. Quantitative cognitive semantics is now an established field, as shown by the contributions gathered in Gries & Stefanowitsch (2006), Glynn & Fischer (2010) and Glynn & Robinson (2014), among others. However, not all of Cognitive Linguistics (and especially Cognitive Semantics) relies on empirical methods: introspection was still the main source of information in many of the foundational sources (see for example the discussion illustrated in Geeraerts 1999). In practice, both introspection and empirical methods are required in scientific research, albeit applied to different stages or aspects of the investigation (Geeraerts 2010b). Interpretation is needed in order to formulate hypotheses that will guide the data collection and analysis and to interpret the results: the data does not speak for itself. The empirical steps, in contrast, facilitate reproducibility and falsifiability: by describing the concrete corpus, the method of collection and the quantitative methods applied to it, the study can be replicated by different researchers and the results compared. At the same time, large-scale quantitative methods such as distributional semantics delegate time-consuming or computationally expensive tasks, such as reading and comparing thousands of attestations of a word, to an automatic system that can perform them faster and more systematically than humans, leaving the researcher to dedicate their energies to the tasks that humans are best at: interpretation and creativity. That is precisely the long-term goal of this research: to offer an empirical, quantitative workflow that transforms huge amounts of data, finds relevant patterns and provides them to the linguist for interpretation and the formulation of hypotheses.

Empirical research in semantics can take different shapes: corpus-based methods, as is the case in this research, but also experimental and referential methods. As Geeraerts (2015: 242–243) argues, each of these approaches captures a different aspect of meaning, namely textual patterns, on-line processing or referential properties. Meaning, especially from the maximalist perspective taken in Cognitive Linguistics, is too complex to be fully described by any one of these methods in isolation (see also Arppe et al. 2010; Stefanowitsch 2010). As such, we do not have such high expectations of distributional semantics; part of the question is: what do these models say? Concretely, we do not expect distributional models to provide information on how we think, but on how a community speaks and categorises: “‘language as cognition’ encompasses shared and socially distributed knowledge and not just individual ideas and experiences” (Geeraerts 2016: 533). It is the pool of shared practices and knowledge that corpora offer and distributional semantics tries to model.

Moreover, despite the large corpora, the advanced quantitative techniques and the sophisticated visualization tools on which this dissertation is built, this study has its limits. It is restricted to a specific corpus, and as such to specific varieties of a specific language, to a specific genre and period in time, to written text; it is restricted to a limited set of lexical items that were investigated; it is restricted to the precise samples collected, the precise questions asked, the precise techniques used to answer them. Most importantly, I will be as thorough as possible in stating the conditions in which the research was carried out and the choices made along the way. As a result, these limits are not just warnings as to the range of applicability of the results and conclusions, but also and more importantly sources of possibilities, inspiration for similar studies facilitated by the empirical nature of the investigation.

1.3 Visual analytics

Distributional models return mathematical representations of lexical items — or, in the case of token-level models, their attestations. These mathematical representations are arrays of numbers that, in the best-case scenario, we can interpret as co-occurrence information, as an unsorted list of collocations. We need an additional step to transform these individual representations into similarities, which operationalize the Distributional Hypothesis mentioned above. However, even then, the output is a matrix with as many rows and columns as items we are comparing; depending on the magnitude of our sample and the subtlety of its structure, scanning it visually can be taxing, if not entirely in vain. So, that is not what we do.
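The step from individual vectors to a square similarity matrix can be sketched as follows. Cosine similarity is used here as an assumption, since it is a common choice for distributional vectors; the three toy vectors and the function names are likewise hypothetical and only serve the illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """One row and one column per item: cell [i][j] compares items i and j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Three hypothetical items described by co-occurrence with three context words.
vectors = [
    [2.0, 1.0, 0.0],  # item 0
    [4.0, 2.0, 0.0],  # item 1: same proportions as item 0
    [0.0, 0.0, 3.0],  # item 2: shares no context words with the others
]
matrix = similarity_matrix(vectors)
```

Even in this sketch, three items already yield nine cells; a sample of a few hundred tokens yields tens of thousands, which is why the matrix is not read directly.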

For word sense disambiguation, evaluation would normally involve a clustering algorithm, a benchmark and a measure of accuracy. The clustering algorithm would take the vectors or the similarity matrix and return clusters: groups of similar items that are different from each other. The measure of accuracy would report on the agreement between the clustering solution and the benchmark: the closer they are, the better the model. However, these measures say nothing about the qualitative differences between models, i.e. whether they misclassified the same items or how they differ from the benchmark. Even if we take the gold standard as an actual ground truth and the only correct solution — which is not the case in this study — this is not an ideal situation.
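A small sketch can illustrate why a single agreement score hides qualitative differences. The Rand index (the share of item pairs on which two groupings agree) is used here as one possible measure of agreement, which is an assumption; the benchmark labels, the two clustering solutions and the function names are invented for the example.

```python
from itertools import combinations

def disagreeing_pairs(labels_a, labels_b):
    """Item pairs grouped together in one labelling but apart in the other."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels_a)), 2)
        if (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
    }

def rand_index(labels_a, labels_b):
    """Share of item pairs on which the two labellings agree."""
    n_items = len(labels_a)
    n_pairs = n_items * (n_items - 1) // 2
    return 1 - len(disagreeing_pairs(labels_a, labels_b)) / n_pairs

# Hypothetical benchmark with two senses and two hypothetical clustering solutions.
benchmark = ["s1", "s1", "s1", "s2", "s2", "s2"]
model_a = [0, 0, 1, 1, 1, 1]  # token 2 ends up with the tokens of sense s2
model_b = [0, 0, 0, 0, 1, 1]  # token 3 ends up with the tokens of sense s1

score_a = rand_index(benchmark, model_a)
score_b = rand_index(benchmark, model_b)
errors_a = disagreeing_pairs(benchmark, model_a)
errors_b = disagreeing_pairs(benchmark, model_b)
```

In this toy case both solutions obtain the same score, yet every disagreement of the first involves token 2 and every disagreement of the second involves token 3: the score alone does not reveal that they deviate from the benchmark in different places.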

It was in response to these concerns that a visualization tool for the exploration of token-level models was envisaged (Wielfaert et al. 2019). The tool developed by Wielfaert in the context of the Nephological Semantics project takes the output from a dimensionality reduction algorithm, i.e. a procedure that tries to map distances based on multiple dimensions onto a 2d or 3d space, and surrounds its visual representation with interactive features. These additional features, tailored for the exploration of distributional models, set the tool apart from a static scatterplot, or even from a default interactive plot.

In the words of Card, Mackinlay & Shneiderman (1999: 6): “‘The purpose of visualization is insight, not pictures.’ The main goals of this insight are discovery, decision making and explanation”³. Indeed, the kind of qualitative exploration achieved through this tool would have been extremely hard without it, if not impossible. In the first place, the tool sets up a workflow that goes from the exploration of the similarity between models and the role of parameter settings, through the qualitative comparison of selected models, to the detailed exploration of individual models. It is built to facilitate a fluid exploration and interconnection between levels of analysis. The tool offers simultaneous, interconnected access to the actual output of a model (as coordinates on a 2d plane), the variation of parameter settings, semantic annotation, metadata of the corpora and frequency data on the context words. The interaction of these different aspects of distributional models in a practical visual interface makes accessible patterns and insights that would not have been found any other way.

Because of this, the visualization tool is a key component of this dissertation. It is in these scatterplots that we find the clouds: clusters of similar tokens that come together in denser areas of the (reduced) semantic space. In an actual case study involving the methodological workflow presented here, a lot of the technicalities go into generating the clouds, but a large part of the analysis involves looking at them and finding shapes: cloudspotting.

1.4 Nephological Semantics

The research presented in this dissertation is part of a larger project within the qlvl research unit, the bof C1-project (3H150305) “Nephological Semantics: using token clouds for meaning detection in variationist linguistics,” with Prof. Dr. Dirk Geeraerts as Principal Investigator. Both the Python module for the creation of the models, written by Tao Chen, and the visualization tools for their analysis, designed by Thomas Wielfaert and myself, are products of this project. Moreover, this dissertation would not be what it is without the integration of the case studies, questions and insights discussed here with other branches of the project, and without the feedback loop on ideas, tests and thoughts on the different techniques.

The main objective of the project is to develop, and understand, appropriate methods for the retrieval of semantic information from corpus data, addressing concerns that stem from a longer tradition of usage-based lexical research. Geeraerts, Grondelaers & Bakema (1994) and Geeraerts, Grondelaers & Speelman (1999) embark on comprehensive, detailed lexicological analyses of the lexical fields of clothing and football terms in Dutch. Their approach is referential: Geeraerts, Grondelaers & Bakema (1994), for instance, collect pictures and descriptions of garments from Dutch and Flemish magazines and describe each clothing item in terms of a variety of features, such as the length of the sleeve. Based on the relationship between the (configurations of) features and the items used to name the objects, they develop a model of lexical variation that takes into account prototypicality and salience in terms of semasiological, onomasiological and contextual variation. However, the manual and detailed identification of features at a large enough scale is painstaking and time-consuming, if at all feasible. In contrast, machine-readable linguistic material is available, more or less accessible and, given the right resources, processable. It will not provide the same kind of information as a referential approach, but it is more easily scalable to large amounts of data.

In the context of this project, token-level models for semasiological research are introduced by Heylen, Speelman & Geeraerts (2012) and Heylen et al. (2015). Another work package, culminating in the PhD dissertation of De Pascale (2019), applies the technique to lexical lectometric research, i.e. measuring distances between language varieties based on their naming choices for different concepts. The visualization tool, as mentioned before, is first described in Wielfaert et al. (2019). Between these works, this dissertation and further case studies taking place in the last year, the project is covering the application of token-level vector space models to semasiological, onomasiological and lectometric studies in varieties of Dutch and Mandarin, at both a synchronic and a diachronic level.

1.5 Structure of the dissertation

As a product of the Nephological Semantics project, this dissertation aims to contribute to both the development and understanding of distributional models for lexical semasiological research. It brings together the theoretical perspective on semantics from Cognitive Linguistics with computational methods and visual analytics in the hope of paving the way for future research along the same lines. With that in mind, the three chapters of the first part of this dissertation, The cloudspotter’s toolkit, will focus on the technical or methodological side of the project. Chapter 2 will describe the procedure to create clouds and the parameter settings explored, taking care to be thorough and specific about the technical decisions that resulted in the final models. Then, Chapter 3 will showcase the visualization tool designed by Thomas Wielfaert and myself as well as a ShinyApp extension that provides additional functionalities. Finally, Chapter 4 will illustrate the dataset on which the models were tested: the selection of lemmas and the questions they try to address, the collection of data and the annotation procedure.

The notion behind token-level models, i.e. that we can represent meaning differences in terms of distributional differences, and in particular the image of a scatterplot that translates these intuitions into an interpretable picture, sounds good. Alas, the reality is not as bright as we could have wished for, and the skies of distributional semantics have anything but stable weather. Hopefully, this dissertation can offer a guide for researchers who would dare to tread these waters. Therefore, the three chapters in the second part, The cloudspotter’s handbook, will discuss the results of the analyses, with an emphasis on crucial assumptions that clash with the data. First, Chapter 5 debunks the idea of a perfect cloud emerging from the ocean of the corpus. Clouds come in many different shapes, caused by different phenomena of distributional behaviour, and thus this chapter offers a classification of what we might encounter. Chapter 6 follows with a linguistic perspective on the variation of these shapes and discusses what we can or cannot find in these models. Finally, Chapter 7 shows how no set of parameter settings offers the best solution across the board, not even close. Instead, the same parameter settings may result in different shapes for different lemmas, and they have to be tailored to the specific lemma to capture the relevant semantic structure.

An enthusiastic and hopeful aspiring cloudspotter might feel discouraged by the variability, bordering on unpredictability, of these clouds. I wouldn’t blame them. However, in spite of the diversity of shapes, of semantic phenomena and of parameter settings to explore, the methodology can offer interesting insights. They are partial insights, but insights nonetheless, and once we know what to expect from clouds, we can focus on acquiring them. From that perspective, the third and final part of this dissertation, The cloudspotter’s cheatsheet, will close with a general practical guide, a summary of suggestions for further research and an overall conclusion.

Acknowledgements

2 From corpora to clouds