Representations of Troponymy and Verb Hypernymy in BERT
University of Helsinki
Faculty of Arts
Department of Digital Humanities

Master’s Thesis

BERT, have you no manners?
Representations of troponymy and verb hypernymy in BERT

Dmitry Narkevich
015093853
Master’s Programme in Linguistic Diversity and Digital Humanities
Supervisor: Marianna Apidianaki
11.05.2021

Faculty: Faculty of Arts
Degree Programme: Master’s Programme in Linguistic Diversity and Digital Humanities
Study Track: Language Technology
Author: Dmitry Narkevich
Title: BERT, have you no manners? Representations of troponymy and verb hypernymy in BERT
Level: Master’s Thesis
Month and year: May 2021
Number of pages: 43

Abstract

Hypernymy is a relationship between two words, where the hyponym carries a more specific meaning and entails a hypernym that carries a more general meaning. A particular kind of verbal hypernymy is troponymy, where troponyms are verbs that encode a particular manner or way of doing something, such as “whisper” meaning “to speak in a quiet manner”. Recently, contextualized word vectors have emerged as a powerful tool for representing the semantics of words in a given context, in contrast to earlier static embeddings where every word is represented by a single vector regardless of sense. BERT, a pre-trained language model that uses contextualized word representations, achieved state-of-the-art performance on various downstream NLP tasks such as question answering. Previous research identified knowledge of scalar adjective intensity in BERT, but not systematic knowledge of nominal hypernymy.

In this thesis, we investigate systematic knowledge of troponymy and verbal hypernymy in the base English version of BERT. We compare the similarity of vector representations for manner verbs and adverbs of interest, to see if troponymy is represented in the vector space. Then, we evaluate BERT’s predictions on cloze tasks involving troponymy and verbal hypernymy. We also attempt to train supervised models to probe vector representations for this knowledge. Lastly, we perform clustering analyses on vector representations of words in hypernymy pairs. Data on troponymy and hypernymy relationships is extracted from WordNet and HyperLex, and sentences containing instances of the relevant words are obtained from the ukWaC corpus.

We were unable to identify any systematic knowledge about troponymy and verb hypernymy in BERT. It was reasonably successful at predicting hypernyms in the masking experiments, but a general inability to go in the other direction suggests that this knowledge is not systematic. Our probing models were unsuccessful at recovering information related to hypernymy and troponymy from the representations. In contrast with previous work that finds type-level semantic information to be located in the lower layers of BERT, our cluster-based analyses suggest that the upper layers contain stronger or more accessible representations of hypernymy.
Keywords: lexical semantics, BERTology, troponymy, hypernymy, pre-trained language models, transformers
Where deposited: Helsinki University Library
Additional information: Supervisor: Marianna Apidianaki

Chapter 1: Acknowledgements

I would like to express my gratitude to my supervisor, Marianna Apidianaki, for her constant guidance, ideas, and active support. Whenever I ran into a dead end during experimentation, she always had ideas on what to try next or which direction to shift the topic in.

I am also very grateful to Aina Garí Soler for her advice, assistance, and comments throughout my thesis research. Thanks also to Thomas De Bluts for his helpful feedback on a draft of the thesis, and to Yves Scherrer for his support as seminar coordinator.

Lastly, I am grateful to my parents for their support and encouragement, and to my friend Henri Ala for setting me on this path.

Chapter 2: Introduction

Hypernymy is a relationship between two words, where the hyponym carries a more specific meaning and entails a hypernym that carries a more general meaning. For example, we can say “pizza is food”, or more explicitly “pizza is a kind of food”, demonstrating the hypernymy between the pair of words ⟨pizza, food⟩. The relationship is directional: although all pizza is food, not all food is pizza.

Verbs are slightly more nuanced than nouns. While we can easily say that “jumping is moving”, the truth value of expressions like “whispering is talking” is less obvious. While whispering does mean talking at a low volume, the verb talk implicates a volume manner of its own (a normal conversational volume), which whisper overrides. Noun hyponyms tend to specialize the hypernym, narrowing down categories (i.e., picking out a subset) without overriding any aspect of higher levels. For instance, a table is a kind of furniture, so the noun table describes a subset of the set of objects described by furniture. In contrast, hyponyms of verbs do not really describe a subset of the actions described by the hypernym, making the entailment relationship less clear.

Furthermore, the degree of certain hypernymy relations (for example ⟨human, animal⟩) depends on the context. Although humans are technically in the category of animals, animal typically denotes a non-human animal (Vulić et al., 2017). This necessitates a graded (numerical) evaluation of pairs rather than a binary hypernym/not-hypernym distinction.

A particular kind of verbal hypernymy is troponymy. Troponyms (or manner verbs) are verbs that encode a particular manner or way of doing something. For example, to stroll is to walk in a relaxed manner, so stroll is a troponym and hyponym of walk, which is itself a hypernym of stroll.

While troponymy pairs have a hyponym and a hypernym, not all pairs of verbs in a hypernymy relation demonstrate troponymy. For example, although gulp is a hyponym of swallow, gulping is not swallowing in a particular manner. There is no adverb that can complete the sentence “to gulp is to swallow ___-ly”, nor can a non-adverbial manner be used in “to gulp is to swallow by ___”.

Contextualized word vectors have emerged as a very powerful tool for encoding semantic concepts, with the pre-trained language model (PLM) BERT achieving state-of-the-art performance on downstream NLP tasks different from what it was trained on, such as question answering (Devlin et al., 2019). Unlike classical static word embeddings such as fastText (Bojanowski et al., 2017), where every token is represented by a single static vector regardless of its sense or context, BERT produces a vector for a token in a particular context.
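To make this contrast concrete, here is a minimal sketch of extracting a contextualized vector for a target verb from BERT. It assumes the HuggingFace transformers library with PyTorch and the bert-base-uncased checkpoint (the base English model studied in this thesis); the example sentences and the single-wordpiece lookup are purely illustrative and not the extraction pipeline actually used in the experiments.

```python
# Minimal sketch: contextualized vectors for one verb in two different contexts.
# Assumes: transformers + torch installed; the target word is a single wordpiece
# in BERT's vocabulary. Illustrative only, not the thesis's actual pipeline.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def verb_vector(sentence: str, target: str, layer: int = 12) -> torch.Tensor:
    """Hidden state of `target` at `layer` (0 = embedding layer, 12 = last)."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer]          # (1, seq_len, 768)
    target_id = tokenizer.convert_tokens_to_ids(target)
    position = (enc["input_ids"][0] == target_id).nonzero()[0, 0]
    return hidden[0, position]

# The same surface form receives different vectors in different contexts,
# unlike a static fastText embedding, which is fixed per word type.
v1 = verb_vector("She had to whisper the answer to him.", "whisper")
v2 = verb_vector("Rumours of a merger whisper through the markets.", "whisper")
print(torch.cosine_similarity(v1, v2, dim=0).item())
```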
Due to the somewhat “black box” nature of deep neural models like BERT, there has been interest in digging into them to understand what kind of abstract semantic knowledge they learn, and how that knowledge is represented at different layers of the model.

In this thesis, we set out to investigate systematic knowledge of troponymy in BERT. Due to insufficient labelled data, we also included in the study an exploration of BERT’s knowledge of verbal hypernymy in general, using the HyperLex dataset (Vulić et al., 2017). We performed masking experiments, compared vector representations, and trained a large number of probing models. We work exclusively with English in this study.

Soler and Apidianaki (2020) demonstrated that BERT understands the relationship between adjectives that denote the same concept with different intensities (e.g., pretty < gorgeous). We were therefore interested to see whether this understanding extends to verbs that express similar actions in different manners.

The practical applications are similar: for example, textual entailment (e.g., whispering something entails saying it), or recommendation systems (e.g., if a user says they “adore Finnish films”, the system should understand that this is the same as liking Finnish films, only to a higher degree).

Chapter 3: Related work

The emerging field that explores the inner workings of BERT and what kind of information it learns has been called BERTology (Rogers et al., 2020).

Tenney et al. (2019) used edge probing, which trains a supervised classifier on spans of token vectors to predict the relationship between them. They showed that BERT learns to perform the “classical NLP pipeline”: part-of-speech tagging in the early layers, then constituency and dependency parsing, semantic role labelling, and named entity recognition towards the later layers.

Ravichander et al. (2020) investigated noun hypernymy in BERT with a probing classifier that aimed to predict whether or not a given pair of words is in a hypernymy relationship, and examined how systematic this knowledge really is using cloze-task evaluations (a kind of query sketched in code at the end of this chapter). For instance, even if BERT can predict vehicle for “A car is a [MASK]”, for the knowledge to be considered “systematic”, it would also have to correctly predict the plural for “Cars are [MASK]”. They did not find evidence of this kind of systematic knowledge, but we took inspiration from their work when constructing feature vectors for our probing classifiers, and tried a similar architecture for our shallow neural network.

Vulić et al. (2020) investigated BERT’s knowledge of type-level lexical semantics across layers using analogy tasks, probing classifiers that predict semantic relationships between nouns, and word self-similarity analyses. They show that averaging vectors over multiple layers decreases performance when vectors from later layers are included, so we ran our initial probing experiments on vectors extracted from the embedding layer.
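The cloze-style queries discussed in this chapter can be posed to a masked language model directly. The sketch below uses the HuggingFace fill-mask pipeline (an assumed dependency); the troponymy-flavoured prompt templates are illustrative and not the exact templates used in our experiments.

```python
# Minimal sketch of a cloze-task query against BERT. Assumes transformers is
# installed; the prompts are illustrative, not the thesis's exact templates.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Hyponym -> hypernym direction: does BERT propose a verb like "speak"?
for prediction in fill_mask("To whisper is to [MASK] quietly."):
    print(prediction["token_str"], round(prediction["score"], 3))

# Reverse direction: can a troponym be recovered from the manner description?
for prediction in fill_mask("To [MASK] is to speak quietly."):
    print(prediction["token_str"], round(prediction["score"], 3))
```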
Chapter 4: Data

4.1 WordNet

WordNet is a structured database of words and their semantic relationships (Fellbaum, 1998). This includes the grouping of synonymous words into ‘synsets’, and the linking of synsets that are hypernyms of one another. Each synset typically has a definition, and sometimes a couple of example sentences, each containing one of the lexemes in the set.
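As a rough illustration of how such relations can be read out of WordNet, the sketch below uses the NLTK interface to WordNet (an assumed dependency, not necessarily the tooling used for the actual data extraction) to list the hypernyms, definitions, and example sentences of the verb senses of whisper.

```python
# Minimal sketch of querying WordNet verb hypernymy via NLTK. Assumes nltk is
# installed and the WordNet data has been fetched with nltk.download("wordnet").
from nltk.corpus import wordnet as wn

for synset in wn.synsets("whisper", pos=wn.VERB):
    print(synset.name())
    print("  definition:", synset.definition())
    print("  examples:  ", synset.examples())
    for hypernym in synset.hypernyms():
        print("  hypernym:  ", hypernym.name())
```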