BERT is to NLP what AlexNet is to CV: Can Pre-Trained Language Models Identify Analogies?

Asahi Ushio, Luis Espinosa-Anke, Steven Schockaert, Jose Camacho-Collados
Cardiff NLP, School of Computer Science and Informatics, Cardiff University, United Kingdom
{UshioA,Espinosa-AnkeL,SchockaertS1,CamachoColladosJ}@cardiff.ac.uk

Abstract

Analogies play a central role in human commonsense reasoning. The ability to recognize analogies such as "eye is to seeing what ear is to hearing", sometimes referred to as analogical proportions, shapes how we structure knowledge and understand language. Surprisingly, however, the task of identifying such analogies has not yet received much attention in the language model era. In this paper, we analyze the capabilities of transformer-based language models on this unsupervised task, using benchmarks obtained from educational settings, as well as more commonly used datasets. We find that off-the-shelf language models can identify analogies to a certain extent, but struggle with abstract and complex relations, and results are highly sensitive to model architecture and hyperparameters. Overall, the best results were obtained with GPT-2 and RoBERTa, while configurations using BERT were not able to outperform word embedding models. Our results raise important questions for future work about how, and to what extent, pre-trained language models capture knowledge about abstract semantic relations.¹

Query: word:language
Candidates: (1) paint:portrait (2) poetry:rhythm (3) note:music (4) tale:story (5) week:year

Table 1: An example analogy task from the SAT dataset. The third candidate is the answer to the query.

1 Introduction

One of the most widely discussed properties of word embeddings has been their surprising ability to model certain types of relational similarities in terms of word vector differences (Mikolov et al., 2013a; Vylomova et al., 2016; Allen and Hospedales, 2019; Ethayarajh et al., 2019). The underlying assumption is that when "a is to b what c is to d", the word vector differences b − a and d − c are expected to be similar, where we write x for the embedding of a word x. While this assumption holds for some types of syntactic relations, for semantic relations it holds to a much more limited degree than was suggested in early work (Linzen, 2016; Schluter, 2018). Moreover, the most commonly used benchmarks have focused on specific and well-defined semantic relations such as "capital of", rather than the more abstract notion of relational similarity that is often needed for solving the kind of psychometric analogy problems that can be found in IQ tests and educational settings. An example of such a problem is shown in Table 1.

Given the central role of analogy in human cognition, it is nonetheless important to understand the extent to which NLP models are able to solve these more abstract analogy problems. Besides its value as an intrinsic benchmark for lexical semantics, the ability to recognize analogies is indeed important in the contexts of human creativity (Holyoak et al., 1996), innovation (Hope et al., 2017), computational creativity (Goel, 2019) and education (Pardos and Nam, 2020).

* While the title is probably self-explanatory, this is a small note explaining it. "BERT is to NLP what AlexNet is to CV" makes an analogy between what the BERT and AlexNet models represented for Natural Language Processing (NLP) and Computer Vision (CV), respectively. They both brought a paradigm shift in how research was undertaken in their corresponding disciplines, and this is what the analogy refers to.
Analogies are also a prerequisite to build AI systems for the legal domain (Ashley, 1988; Walton, 2010) and are used in machine learning (Miclet et al., 2008; Hug et al., 2016; Hüllermeier, 2020) and for ontology alignment (Raad and Evermann, 2015), among others.

¹ Source code and data to reproduce our experimental results are available in the following repository: https://github.com/asahi417/analogy-language-model


Within NLP, however, the task of recognizing analogies has received relatively little attention. To solve such problems, Turney (2005) proposed Latent Relational Analysis (LRA), which was essentially designed as a relational counterpart to Latent Semantic Analysis (Landauer and Dumais, 1997). Somewhat surprisingly, perhaps, despite the substantial progress that word embeddings and language models (LMs) have enabled in NLP, LRA still represents the current state-of-the-art in solving abstract word analogy problems. When going beyond a purely unsupervised setting, however, GPT-3 was recently found to obtain slightly better results (Brown et al., 2020).

The aim of this paper is to analyze the ability of pre-trained LMs to recognize analogies. Our focus is on the zero-shot setting, where LMs are used without fine-tuning. To predict whether two word pairs (a, b) and (c, d) are likely to be analogical, we need a prompt, i.e. a template that is used to construct the input to the LM, and a scoring function. We extensively analyze the impact of both of these choices, as well as the differences between different LMs. When the prompt and scoring function are carefully calibrated, we find that GPT-2 can outperform LRA, standard word embeddings, as well as the published results for GPT-3 in the zero-shot setting. However, we also find that these results are highly sensitive to the choice of the prompt, as well as two hyperparameters in our scoring function, with the optimal choices not being consistent across different datasets. Moreover, using BERT leads to considerably weaker results, underperforming even standard word embeddings in all of the considered configurations. These findings suggest that while transformer-based LMs learn relational knowledge to a meaningful extent, more work is needed to understand how such knowledge is encoded, and how it can be exploited.

2 Related work

2.1 Understanding Pre-trained LMs

Since their recent dominance in standard NLP benchmarks (Peters et al., 2018a; Devlin et al., 2019; Liu et al., 2019), pre-trained language models have been extensively studied. This has mainly been done through probing tasks, which are aimed at understanding the knowledge that is implicitly captured by their parameters. After the initial focus on understanding pre-trained LSTM-based LMs (Peters et al., 2018b), attention has now shifted toward transformer-based models. The main aspects that have been studied in recent years are syntax (Goldberg, 2019; Saphra and Lopez, 2019; Hewitt and Manning, 2019; van Schijndel et al., 2019; Jawahar et al., 2019; Tenney et al., 2019b) and semantics (Ettinger, 2019; Tenney et al., 2019a). For a more complete overview of analyses of the different properties of transformer-based LMs, we refer to Rogers et al. (2021).

Despite the rise in probing analyses for LMs and the importance of analogical reasoning in human cognition, understanding the analogical capabilities of LMs remains understudied. The most similar works have focused on capturing relational knowledge from LMs (in particular the type of information available in knowledge graphs). For instance, Petroni et al. (2019) analyzed to what extent LMs could fill manually-defined templates such as "Dante was born in [MASK]". Follow-up works extended this initial approach by automatically generating templates and fine-tuning LMs on them (Bouraoui et al., 2020; Jiang et al., 2020), showing improved performance. In this paper, we focus on the analogical knowledge that is encoded in pre-trained LMs, without the extra step of fine-tuning on additional data.

2.2 Word Analogy Probing

Word analogies have been used as a standard intrinsic evaluation task for measuring the quality of word embeddings.
Mikolov et al. (2013b) showed that word embeddings, in particular Word2vec embeddings, were able to solve analogy problems by simple vector operations (e.g. king - man + woman = queen). The motivation for this task dates back to connectionism theory (Feldman and Ballard, 1982) in cognitive science. In particular, neural networks were thought to be able to model emergent concepts (Hopfield, 1982; Hinton, 1986) by learning distributed representations across an embedding space (Hinton et al., 1986), similar to the properties that word embeddings displayed in the analogy task. More recent works have proposed new mathematical theories and experiments to understand the analogical capabilities of word embeddings, attempting to understand their linear algebraic structure (Arora et al., 2016; Gittens et al., 2017; Allen and Hospedales, 2019) or explicitly studying their compositional nature (Levy and

Goldberg, 2014; Paperno and Baroni, 2016; Ethayarajh et al., 2019; Chiang et al., 2020).

However, recent works have questioned the impressive results displayed by word embeddings in this task. In many cases simple baselines excluding the input pair (or query) were competitive (Linzen, 2016). Simultaneously, some researchers have found that many relationships may not be retrieved in the embedding space by simple linear transformations (Drozd et al., 2016; Bouraoui et al., 2018), and others argued that the standard evaluation procedure has limitations (Schluter, 2018). New datasets and measures have also been introduced to address some of these issues (Gladkova et al., 2016; Fournier et al., 2020). Finally, in the context of bias detection, for which analogies have been used as a proxy (Bolukbasi et al., 2016), it has also been found that word analogies may misguide or hide the real relationships existing in the vector space (Gonen and Goldberg, 2019; Nissim et al., 2020).

As far as language models are concerned, word analogies have not been explored to the same extent as for word embeddings. Recently, Brown et al. (2020) evaluated the unsupervised capabilities of GPT-3 on the SAT analogies dataset (Turney et al., 2003), which we also include in our evaluation (see Section 3.2). However, that evaluation is limited to a single dataset (i.e., SAT) and model (i.e., GPT-3), and the general capabilities of language models were not investigated.

Despite their limitations, analogy tests remain appealing for evaluating the ability of embeddings and language models to identify abstract relationships. To mitigate the aforementioned methodological issues, in this work we rely on analogy tests from educational resources, where the task is to complete analogical proportions given only the first word pair. In contrast, word embedding models have mostly been evaluated using a predictive task, in which three of the four words are given. Moreover, the considered datasets are focused on abstract analogies, whereas the most commonly used datasets only include well-defined semantic relations such as "capital of". For completeness, however, we also show results on these standard datasets. We furthermore experiment with several simple baselines to understand possible artifacts present in the different datasets.

3 Word Analogies

In this section, we describe the word analogy formulation that is used for our experiments (Section 3.1). Subsequently, we provide an overview of the datasets used in our experiments (Section 3.2).

3.1 Task Description

We frame the analogy task in terms of analogical proportions (Prade and Richard, 2017). Given a query word pair (h_q, t_q) and a list of candidate answer pairs {(h_i, t_i)}_{i=1}^n, the goal is to find the candidate answer pair that has the most similar relation to the query pair. Table 1 shows a sample query and candidate answers drawn from one of the datasets used in our evaluation (see Section 3.2).
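To make the task format concrete, the following minimal sketch (our own illustration, not code from the accompanying repository) represents a problem as a query pair plus a list of candidate pairs, and selects the answer with an arbitrary pair-scoring function; the instance is the SAT problem from Table 1, and the dummy scorer is a placeholder for any of the scoring functions introduced later.

```python
from typing import Callable, List, Tuple

WordPair = Tuple[str, str]

def solve_analogy(query: WordPair,
                  candidates: List[WordPair],
                  score: Callable[[WordPair, WordPair], float]) -> int:
    """Return the index of the candidate whose relation to the query scores highest."""
    scores = [score(query, candidate) for candidate in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])

# The SAT-style instance from Table 1.
query = ("word", "language")
candidates = [("paint", "portrait"), ("poetry", "rhythm"), ("note", "music"),
              ("tale", "story"), ("week", "year")]

# A placeholder scorer; in practice an LM- or embedding-based relational similarity.
dummy_score = lambda q, c: float(c == ("note", "music"))
assert solve_analogy(query, candidates, dummy_score) == 2  # "note:music"
```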

3.2 Analogy Datasets

We split analogy datasets into two types, based on how the analogy problems were constructed.

3.2.1 Psychometric Analogy Tests

Word analogy tests are commonly used in assessments of linguistic and cognitive ability. For instance, in the past, such tests were included in the SAT exams, which are a US college admission test. Turney et al. (2003) collected a benchmark of 374 word analogy problems, consisting primarily of problems from these SAT tests. Aimed at college applicants, these problems are designed to be challenging for humans. A key challenge for NLP systems is that solving these problems often requires identifying fine-grained semantic differences between word pairs that belong to the same coarse-grained relation. For instance, in the case of Table 1, we could say that "a year consists of weeks" like "language consists of words", but the week-year pair is nonetheless less similar to word-language than note-music.

Another analogy benchmark was constructed by Boteanu and Chernova (2015), who used word analogy problems from an educational resource.² They used in particular UNIT 2 of the analogy problems from the educational site. These problems have the same form as those from the SAT benchmark, but rather than college applicants, they are aimed at children in grades 4 to 12 of the US school system (i.e. from age 9 onwards). In this paper, we will also include this UNIT 2 benchmark. Moreover, we have collected another benchmark from the UNIT 4 problems on the same website. These UNIT 4 problems are organised in 5 difficulty levels: high-beginning, low-intermediate, high-intermediate, low-advanced and high-advanced. The low-advanced level is stated to be at the level of the SAT tests, whereas the high-advanced level is stated to be at the level of the GRE test (which is used for admission into graduate schools).

3.2.2 Lexical Semantics Benchmarks

Since the introduction of Word2vec (Mikolov et al., 2013a), the problem of modelling analogies has been commonly used as an intrinsic benchmark for word embedding models. However, the datasets that have been used in that context are focused on well-defined and relatively coarse-grained relations. The Google analogy dataset (Mikolov et al., 2013b) has been one of the most commonly used benchmarks for intrinsic evaluation of word embeddings. This dataset contains a mix of semantic and morphological relations such as capital-of and singular-plural, respectively. However, its coverage has been shown to be limiting, and BATS (Gladkova et al., 2016) was developed in an attempt to address its main shortcomings. BATS includes a larger number of concepts and relations, which are split into four categories: lexicographic, encyclopedic, and derivational and inflectional morphology.

As pointed out above, these datasets were tailored to the evaluation of word embeddings in a predictive setting. To provide an evaluation setting which is comparable to the benchmarks obtained from human analogy tests, we constructed word analogy problems from the Google and BATS datasets, by choosing for each correct analogy pair a number of negative examples. The resulting benchmark thus follows the same format as described in Section 3.1. To obtain sufficiently challenging negative examples, for each query pair (e.g. Paris-France) we extracted three negative instances: (1) two random words from the head of the input relation type (e.g. Rome-Oslo); (2) two random words from the tail of the input relation type (e.g. Germany-Canada); (3) a random word pair from a relation type of the same high-level category as the input relation type (e.g. Argentina-peso)³ (a code sketch of this construction is given at the end of this section).

3.2.3 Unification and Statistics

Table 2 provides an overview of our datasets. The instances from each dataset are organised into groups. In the case of Google and BATS, these groups refer to the relation types (e.g. semantic or morphological in the case of Google). In the case of UNIT 2 and UNIT 4, the groups refer to the difficulty level. For the SAT dataset, we consider two groups, capturing whether the instances come from an actual SAT test or not. Finally, we randomly sample 10% of each group in each dataset to construct a validation set, and regard the remaining data as the test set.

Dataset   Data size (val / test)   No. candidates   No. groups
SAT       37 / 337                 5                2
UNIT 2    24 / 228                 5, 4, 3          9
UNIT 4    48 / 432                 5, 4, 3          5
Google    50 / 500                 4                2
BATS      199 / 1799               4                3

Table 2: High-level statistics of the analogy datasets after unification: data size, number of candidates and number of group partitions.

4 Methodology

In this section, we explain our strategy for using pretrained LMs to solve analogy problems without fine-tuning. First, in Section 4.1 we explain how each relation pair is converted into a natural sentence to be fed into the LM. In Section 4.2, we then discuss a number of scoring functions that can be used to select the most plausible answer candidate. Finally, we take advantage of the fact that analogical proportion is invariant to particular permutations, which allows for a natural extension of the proposed scoring functions (Section 4.3). Figure 1 shows a high-level overview of our methodology.

Figure 1: Solving a word analogy problem by selecting the candidate with the highest LM score.

² https://www.englishforeveryone.org/Topics/Analogies.html
³ In order to avoid adding various correct answers to the query, we avoided adding negative pairs from all country-of type relations, and from similar lexicographic relations in the BATS dataset with more than one relation type, namely antonyms, synonyms, meronyms and hyponyms.
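As a concrete illustration of the negative-example construction described in Section 3.2.2, the sketch below draws the three types of negatives for a query pair; the relation data shown are toy examples, and the sampling details are simplified assumptions rather than the exact released construction.

```python
import random

def build_negatives(relation, same_category_relation):
    """Build the three negative types for a query drawn from `relation`.

    `relation` and `same_category_relation` are lists of (head, tail) pairs,
    e.g. capital-of: [("Paris", "France"), ("Rome", "Italy"), ...].
    """
    (h1, _), (h2, _) = random.sample(relation, 2)
    (_, t1), (_, t2) = random.sample(relation, 2)
    return [
        (h1, h2),                               # two random heads, e.g. Rome-Oslo
        (t1, t2),                               # two random tails, e.g. Germany-Canada
        random.choice(same_category_relation),  # same high-level category, e.g. Argentina-peso
    ]

capital_of = [("Paris", "France"), ("Rome", "Italy"), ("Oslo", "Norway"),
              ("Berlin", "Germany"), ("Ottawa", "Canada")]
currency_of = [("Argentina", "peso"), ("Japan", "yen")]
print(build_negatives(capital_of, currency_of))
```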


Tto-as(“word”, “language”, “note”, “music”) = “word is to language as note is to music” where we use the template type to-as here. Using manually specified template types can re- Figure 2: Positive and negative permutations for a rela- sult in a sub-optimal textual representation. For tion pair (a:b)-(c:d). this reason, recent studies have proposed auto- prompting strategies, which optimize the template type on a training set (Shin et al., 2020), paraphras- on words from the two given pairs. To this end, ing (Jiang et al., 2020), additional prompt genera- we propose to use an approximation of point-wise tion model (Gao et al., 2020), and corpus-driven mutual information (PMI), based on perplexity. template mining (Bouraoui et al., 2020). How- PMI is defined as the difference between a condi- ever, none of these approaches can be applied to tional and marginal log-likelihood. In our case, we unsupervised settings. Thus, we do not explore consider the conditional likelihood of ti given hi auto-prompting methods in this work. Instead, we and the query pair (recall from Section 3.1 that will consider a number of different template types h and t represent the head and tail of a given in the experiments, and assess the sensitivity of the word pair, respectively), i.e. P (ti|hq, tq, hi), and results to the choice of template type. the marginal likelihood over hi, i.e. P (ti|hq, tq). Subsequently, the PMI-inspired scoring function is 4.2 Scoring Function defined as Perplexity. We first define perplexity, which is widely used as a sentence re-ranking metric (Chan r(ti|hi, hq, tq) = log P (ti|hi, hq, tq) et al., 2016; Gulcehre et al., 2015). Given a sen- − α · log P (ti|hq, tq) (2) tence x, for autoregressive LMs such as LSTM based models (Zaremba et al., 2014) and GPTs where α is a hyperparameter to control the effect (Radford et al., 2018, 2019; Brown et al., 2020), of the marginal likelihood. The PMI score corre- perplexity can be computed as sponds to the specific case where α = 1. However, Davison et al.(2019) found that using a hyperpa-  m  X rameter to balance the impact of the conditional and f(x) = exp − log Pauto(xj|xj−1) (1) marginal probabilities can significantly improve the j=1 results. The probabilities in (2) are estimated by assuming that the answer candidates are the only where x is tokenized as [x1...xm] and Pauto(x|x) possible word pairs that need to be considered. By is the likelihood from an autoregressive LM’s relying on this closed-world assumption, we can next token prediction. For masked LMs such estimate marginal probabilities based on perplex- as BERT (Devlin et al., 2019) and RoBERTa ity, which we found to give better results than the (Liu et al., 2019), we instead use pseudo- masking based strategy from Davison et al.(2019). perplexity, which is defined as in (1) but In particular, we estimate these probabilities as with Pmask(xj|x\j) instead of Pauto(xj|xj−1), where x = [x . . . x mask x . . . x ] and \j 1 j1 〈 〉 j+1 m f (Tt(hq, tq, hi, ti)) P (x |x ) is the pseudo-likelihood (Wang and P (ti|hq, tq, hi) = − n mask j \j P Cho, 2019) that the masked token is xj. f (Tt(hq, tq, hi, tk)) k=1 PMI. Although perplexity is well-suited to capture n P the fluency of a sentence, it may not be the best f (Tt(hq, tq, hk, ti)) choice to test the plausibility of a given analogical k=1 P (ti|hq, tq) = − n n P P proportion candidate. As an alternative, we pro- f (Tt(hq, tq, hk, tl)) pose a scoring function that focuses specifically k=1 l=1
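The following sketch shows how the perplexity of Eq. (1) and the pseudo-perplexity of masked LMs can be computed with the Huggingface transformers library; it is our own minimal illustration (smaller checkpoints than the gpt2-xl and bert-large-cased used in Section 5.1, and no batching), not the paper's exact implementation.

```python
import math
import torch
from transformers import (AutoModelForCausalLM, AutoModelForMaskedLM,
                          AutoTokenizer)

lm_tok = AutoTokenizer.from_pretrained("gpt2")       # "gpt2-xl" in the paper
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(sentence: str) -> float:
    """f(x) of Eq. (1): exponentiated average negative log-likelihood."""
    ids = lm_tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels, the model returns the mean next-token NLL as `loss`.
        loss = lm(ids, labels=ids).loss
    return math.exp(loss.item())

mlm_tok = AutoTokenizer.from_pretrained("bert-base-cased")  # "bert-large-cased" in the paper
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-cased").eval()

def pseudo_perplexity(sentence: str) -> float:
    """Pseudo-perplexity: mask each token in turn (Wang and Cho, 2019)."""
    ids = mlm_tok(sentence, return_tensors="pt").input_ids[0]
    nll = 0.0
    for j in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[j] = mlm_tok.mask_token_id
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, j]
        nll -= torch.log_softmax(logits, dim=-1)[ids[j]].item()
    return math.exp(nll / (len(ids) - 2))

print(perplexity("word is to language as note is to music"))
print(pseudo_perplexity("word is to language as note is to music"))
```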

PMI. Although perplexity is well-suited to capture the fluency of a sentence, it may not be the best choice to test the plausibility of a given analogical proportion candidate. As an alternative, we propose a scoring function that focuses specifically on words from the two given pairs. To this end, we propose to use an approximation of point-wise mutual information (PMI), based on perplexity. PMI is defined as the difference between a conditional and a marginal log-likelihood. In our case, we consider the conditional likelihood of t_i given h_i and the query pair (recall from Section 3.1 that h and t represent the head and tail of a given word pair, respectively), i.e. P(t_i | h_q, t_q, h_i), and the marginal likelihood over h_i, i.e. P(t_i | h_q, t_q). Subsequently, the PMI-inspired scoring function is defined as

    r(t_i | h_i, h_q, t_q) = log P(t_i | h_i, h_q, t_q) - α · log P(t_i | h_q, t_q)    (2)

where α is a hyperparameter to control the effect of the marginal likelihood. The PMI score corresponds to the specific case where α = 1. However, Davison et al. (2019) found that using a hyperparameter to balance the impact of the conditional and marginal probabilities can significantly improve the results. The probabilities in (2) are estimated by assuming that the answer candidates are the only possible word pairs that need to be considered. By relying on this closed-world assumption, we can estimate marginal probabilities based on perplexity, which we found to give better results than the masking-based strategy from Davison et al. (2019). In particular, we estimate these probabilities as

    P(t_i | h_q, t_q, h_i) = f(T_t(h_q, t_q, h_i, t_i)) / Σ_{k=1}^{n} f(T_t(h_q, t_q, h_i, t_k))

    P(t_i | h_q, t_q) = Σ_{k=1}^{n} f(T_t(h_q, t_q, h_k, t_i)) / Σ_{k=1}^{n} Σ_{l=1}^{n} f(T_t(h_q, t_q, h_k, t_l))

where n is the number of answer candidates for the given query. Equivalently, since PMI is symmetric, we can consider the difference between the logs of P(h_i | h_q, t_q, t_i) and P(h_i | h_q, t_q). While this leads to the same PMI value in theory, due to the way in which we approximate the probabilities, this symmetric approach will lead to a different score. We thus combine both scores with an aggregation function A_g. This aggregation function takes a list of scores and outputs an aggregated value. As an example, given a list [1, 2, 3, 4], we write A_mean([1, 2, 3, 4]) = 2.5 for the mean and A_val1([1, 2, 3, 4]) = 1 for the first element. Given such an aggregation function, we define the following PMI-based score:

    s_PMI(t_i, h_i | h_q, t_q) = A_g(r)    (3)

where we consider basic aggregation operations over the list r = [r(t_i | h_i, h_q, t_q), r(h_i | t_i, h_q, t_q)], such as the mean, max, and min value. The choice of using only one of the scores r(t_i | h_i, h_q, t_q), r(h_i | t_i, h_q, t_q) is viewed as a special case, in which the aggregation function simply returns the first or the second item.

mPPL. We also experiment with a third scoring function, which borrows ideas from both perplexity and PMI. In particular, we propose the marginal likelihood biased perplexity (mPPL), defined as

    s_mPPL(t_i, h_i | h_q, t_q) = log s_PPL(t_i, h_i | h_q, t_q) - α_t · log P(t_i | h_q, t_q) - α_h · log P(h_i | h_q, t_q)

where α_t and α_h are hyperparameters, and s_PPL is a normalized perplexity defined as

    s_PPL(t_i, h_i | h_q, t_q) = f(T_t(h_q, t_q, h_i, t_i)) / Σ_{k=1}^{n} f(T_t(h_q, t_q, h_k, t_k)).

The mPPL score extends perplexity with two bias terms. It is motivated by the insight that treating α as a hyperparameter in (2) can lead to better results than fixing α = 1. By tuning α_t and α_h, we can essentially influence to what extent answer candidates involving words that are semantically similar to the query pair should be favored.

4.3 Permutation Invariance

The formalization of analogical proportions dates back to Aristotle (Barbot et al., 2019). According to the standard axiomatic characterization, whenever we have an analogical proportion a : b :: c : d (meaning "a is to b what c is to d"), it also holds that c : d :: a : b and a : c :: b : d are analogical proportions. It follows that for any given analogical proportion a : b :: c : d there are eight permutations of the four elements a, b, c, d that form analogical proportions. These eight permutations, along with the 16 "negative permutations", are shown in Figure 2. To take advantage of the different permutations of analogical proportions, we propose the following Analogical Proportion (AP) score:

    AP(h_q, t_q, h_i, t_i) = A_{g_pos}(p) - β · A_{g_neg}(n)    (4)

    p = [s(a, b | c, d)]_{(a:b,c:d) ∈ P}
    n = [s(a, b | c, d)]_{(a:b,c:d) ∈ N}

where P and N correspond to the lists of positive and negative permutations of the candidate analogical proportion h_q : t_q :: h_i : t_i, in the order shown in Figure 2, β is a hyperparameter to control the impact of the negative permutations, and s(a, b | c, d) is a scoring function as described in Section 4.2. Here A_{g_pos} and A_{g_neg} refer to the aggregation functions that are used to combine the scores for the positive and negative permutations respectively, where these aggregation functions are defined as in Section 4.2. To solve an analogy problem, we simply choose the answer candidate that results in the highest value of AP(h_q, t_q, h_i, t_i).
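A minimal sketch of the AP score of Eq. (4) follows, assuming any pair-based scoring function s from Section 4.2. Note that the exact ordering of permutations in Figure 2 (which defines the val_k aggregators) is not reproduced here, so the enumeration below is one standard choice rather than the paper's.

```python
from itertools import permutations
from statistics import mean

def positive_permutations(a, b, c, d):
    """Eight orderings that remain valid analogical proportions
    (one standard enumeration; Figure 2 fixes a specific order)."""
    return [(a, b, c, d), (c, d, a, b), (a, c, b, d), (b, d, a, c),
            (b, a, d, c), (d, c, b, a), (c, a, d, b), (d, b, c, a)]

def negative_permutations(a, b, c, d):
    """The remaining 16 of the 24 orderings (assumes four distinct words)."""
    pos = set(positive_permutations(a, b, c, d))
    return [p for p in permutations((a, b, c, d)) if p not in pos]

def ap_score(score, hq, tq, hi, ti, beta=0.2, agg_pos=mean, agg_neg=mean):
    """Analogical Proportion score of Eq. (4) for one candidate.

    `score(w1, w2, w3, w4)` is any scoring function from Section 4.2,
    e.g. a negated (pseudo-)perplexity of the prompted sentence."""
    pos = [score(*p) for p in positive_permutations(hq, tq, hi, ti)]
    neg = [score(*p) for p in negative_permutations(hq, tq, hi, ti)]
    return agg_pos(pos) - beta * agg_neg(neg)
```

Any of s_PPL, s_PMI or s_mPPL can be plugged in as `score`; the candidate maximizing ap_score is then returned, as in Section 4.3.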

          Model     Score   Tuned   SAT    U2     U4     Google   BATS   Avg
LM        BERT      sPPL    -       32.9   32.9   34.0   80.8     61.5   48.4
                    sPPL    ✓       39.8   41.7   41.0   86.8     67.9   55.4
                    sPMI    -       27.0   32.0   31.2   74.0     59.1   44.7
                    sPMI    ✓       40.4   42.5   27.8   87.0     68.1   53.2
                    smPPL   ✓       41.8   44.7   41.2   88.8     67.9   56.9
          GPT-2     sPPL    -       35.9   41.2   44.9   80.4     63.5   53.2
                    sPPL    ✓       50.4   48.7   51.2   93.2     75.9   63.9
                    sPMI    -       34.4   44.7   43.3   62.8     62.8   49.6
                    sPMI    ✓       51.0   37.7   50.5   91.0     79.8   62.0
                    smPPL   ✓       56.7   50.9   49.5   95.2     81.2   66.7
          RoBERTa   sPPL    -       42.4   49.1   49.1   90.8     69.7   60.2
                    sPPL    ✓       53.7   57.0   55.8   93.6     80.5   68.1
                    sPMI    -       35.9   42.5   44.0   60.8     60.8   48.8
                    sPMI    ✓       51.3   49.1   38.7   92.4     77.2   61.7
                    smPPL   ✓       53.4   58.3   57.4   93.6     78.4   68.2
WE        FastText  -       -       47.8   43.0   40.7   96.6     72.0   60.0
          GloVe     -       -       47.8   46.5   39.8   96.0     68.7   59.8
          Word2vec  -       -       41.8   40.4   39.6   93.2     63.8   55.8
Base      PMI       -       -       23.3   32.9   39.1   57.4     42.7   39.1
          Random    -       -       20.0   23.6   24.2   25.0     25.0   23.6

Table 3: Accuracy results on each analogy dataset, categorized into language models (LM), word embeddings (WE), and baselines (Base). All LMs use the analogical proportion (AP) function described in Section 4.3. The default configuration for AP includes α = αh = αt = β = 0, gpos = g = val1, and t = to-as. Note that sPPL = smPPL with the default configuration. Average accuracy (Avg) across datasets is included in the last column.

5 Evaluation

In this section, we evaluate language models on the five analogy datasets presented in Section 3.

5.1 Experimental Setting

We consider three transformer-based LMs of a different nature: two masked LMs, namely BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), and GPT-2, as a prominent example of an autoregressive language model. Each pretrained model was fetched from the Huggingface transformers library (Wolf et al., 2019), from which we use bert-large-cased, roberta-large, and gpt2-xl, respectively. For parameter selection, we run grid search on β, α, αh, αt, t, g, gpos, and gneg for each model and select the configuration which achieves the best accuracy on each validation set. We experiment with the three scoring functions presented in Section 4.2, i.e., sPPL (perplexity), sPMI and smPPL. Possible values for each hyperparameter (including the selection of six prompts and an ablation test on the scoring function) and the best configurations that were found by grid search are provided in the appendix.

As baseline methods, we also consider three pre-trained word embedding models, which have been shown to provide competitive results in analogy tasks, as explained in Section 2.2: Word2vec (Mikolov et al., 2013a), GloVe (Pennington et al., 2014), and FastText (Bojanowski et al., 2017). For the word embedding models, we simply represent word pairs by taking the difference between their embeddings.⁴ We then choose the answer candidate with the highest cosine similarity to the query in terms of this vector difference. To put the results into context, we also include two simple statistical baselines. First, we report the expected random performance. Second, we use a method based on each word pair's PMI in a given corpus, and select the answer candidate with the highest PMI as the prediction. Note that the query word pair is completely ignored in this case. This PMI score is the well-known word-pair association metric introduced by Church and Hanks (1990) for lexicographic purposes (specifically, collocation extraction), which compares the probability of observing two words together with the probabilities of observing them independently (chance). The PMI scores in our experiments were computed using the English Wikipedia with a fixed window size of 10.

⁴ Vector differences have been found to be the most robust encoding method in the context of word analogies (Hakami and Bollegala, 2017).
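The vector-difference baseline can be sketched as follows (our own illustration; `vec` stands in for a pre-trained FastText, GloVe or Word2vec lookup, here replaced by random vectors so the snippet runs standalone).

```python
import numpy as np

def embedding_baseline(query, candidates, vec):
    """Choose the candidate whose difference vector is most similar to the query's."""
    def diff(pair):
        d = vec(pair[0]) - vec(pair[1])
        return d / (np.linalg.norm(d) + 1e-12)
    q = diff(query)
    sims = [float(np.dot(q, diff(c))) for c in candidates]
    return int(np.argmax(sims))

# Toy usage with random vectors standing in for pre-trained embeddings.
rng = np.random.default_rng(0)
table = {}
vec = lambda w: table.setdefault(w, rng.normal(size=300))
print(embedding_baseline(("word", "language"),
                         [("note", "music"), ("week", "year")], vec))
```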

5.2 Results

Table 3 shows our main results. As far as the comparison among LMs is concerned, RoBERTa and GPT-2 consistently outperform BERT. Among the AP variants, smPPL achieves substantially better results than sPMI or sPPL in most cases. We also observe that word embeddings perform surprisingly well, with FastText and GloVe outperforming BERT on most datasets, as well as GPT-2 and RoBERTa with default hyperparameters. FastText achieves the best overall accuracy on the Google dataset, confirming that this dataset is particularly well-suited to word embeddings (see Section 2.2).

In order to compare with published results from prior work, we carried out an additional experiment on the full SAT dataset (i.e., without splitting it into validation and test). Table 4 shows the results. GPT-3 (Brown et al., 2020) and LRA (Turney, 2005) are added for comparison. Given the variability of the results depending on the tuning procedure, we have also reported results of configurations that were tuned on the entire set, to provide an upper bound on what is possible within the proposed unsupervised setting. This result shows that even with optimal hyperparameter values, LMs barely outperform the simpler LRA model. GPT-3 similarly fails to outperform LRA in the zero-shot setting.

          Model     Score       Tuned   Accuracy
LM        BERT      sPPL        -       32.6
                    sPPL        ✓       40.4*
                    sPMI        -       26.8
                    sPMI        ✓       41.2*
                    smPPL       ✓       42.8*
          GPT-2     sPPL        -       41.4
                    sPPL        ✓       56.2*
                    sPMI        -       34.7
                    sPMI        ✓       56.8*
                    smPPL       ✓       57.8*
          RoBERTa   sPPL        -       49.6
                    sPPL        ✓       55.8*
                    sPMI        -       42.5
                    sPMI        ✓       54.0*
                    smPPL       ✓       55.8*
          GPT-3     Zero-shot   -       53.7
                    Few-shot    ✓       65.2*
          LRA       -           -       56.4
WE        FastText  -           -       49.7
          GloVe     -           -       48.9
          Word2vec  -           -       42.8
Base      PMI       -           -       23.3
          Random    -           -       20.0

Table 4: Accuracy results for the full SAT dataset. Results marked with * are not directly comparable, as they were tuned on the full data (for our models) or use training data (for GPT-3 few-shot); these results are included to provide an upper bound only. Results for GPT-3 and LRA were taken from the original papers.

6 Analysis

We now take a closer look into our results to investigate parameter sensitivity, the correlation between model performance and human difficulty levels, and possible dataset artifacts. The following analysis focuses on smPPL, as it achieved the best results among the LM-based scoring functions.

Parameter Sensitivity. We found that the optimal values of the parameters α and β are highly dependent on the dataset, while other parameters such as the template type t vary across LMs. On the other hand, as shown in Figure 3, the optimal permutations of the templates are relatively consistent, with the original ordering a : b :: c : d typically achieving the best results. The results degrade most for permutations that mix the two word pairs (e.g. a : c :: b : d). In the appendix we include an ablation study on the sensitivity and relevance of other parameters and design choices.

Figure 3: Box plot of the relative improvement on test accuracy in each dataset over all configurations of smPPL grouped by gpos. Here valk corresponds to the kth positive permutation shown in Figure 2.

Difficulty Levels. To increase our understanding of what makes an analogy problem difficult for LMs, we compare the results for each difficulty level.⁵ Recall from Section 3.2 that the U2 and U4 datasets come from educational resources and are split by difficulty level. Figure 4 shows the results of all LMs (tuned setting), FastText and the PMI baseline according to these difficulty levels. Broadly speaking, we can see that instances that are harder for humans are also harder for the considered models. The analogies in the most difficult levels are generally more abstract (e.g. witness : testimony :: generator : electricity), or contain obscure or infrequent words (e.g. grouch : cantankerous :: palace : ornate).⁶

⁵ For SAT, Google and BATS, there are no difficulty levels available, but we show the results split by high-level categories in the appendix. We also note that the number of candidates in U2 and U4 varies from three to five, so results per difficulty level are not fully comparable. However, they do reflect the actual difficulty of the educational tests.
⁶ In the appendix we include more examples of errors made by RoBERTa on easy instances.

Model     Mask   SAT    U2     U4     Google   BATS
BERT      full   41.8   44.7   41.2   88.8     67.9
          head   31.8   28.1   34.3   72.0     62.4
          tail   33.5   31.6   38.2   64.2     63.1
RoBERTa   full   53.4   58.3   57.4   93.6     78.4
          head   38.6   37.7   41.0   60.6     54.5
          tail   35.6   37.3   40.5   55.8     64.2

Table 5: Accuracy results by masking the head or tail of the candidate answers. Results in the "full" rows correspond to the full model without masking.
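The head/tail masking behind Table 5, discussed under "Hypothesis Only" below, amounts to blanking one side of every candidate pair before scoring; a minimal sketch, where the mask string is an assumption:

```python
def mask_candidates(candidates, side="head", mask_token="[MASK]"):
    """Replace the head or tail of every candidate answer pair with a mask token."""
    if side == "head":
        return [(mask_token, t) for _, t in candidates]
    return [(h, mask_token) for h, _ in candidates]

print(mask_candidates([("note", "music"), ("week", "year")], side="head"))
# [('[MASK]', 'music'), ('[MASK]', 'year')]
```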

Figure 4: Test accuracy in U2 and U4 per difficulty level. LMs use smPPL with the best configuration tuned on the corresponding validation sets.

Hypothesis Only. Recently, several researchers have found that standard NLP benchmarks, such as SNLI (Bowman et al., 2015) for language inference, contain annotation artifacts that make the task simpler for automatic models (Poliak et al., 2018; Gururangan et al., 2018). One of their most relevant findings is that models which do not even consider the premise can reach high accuracy. More generally, these issues have been found to be problematic in NLP models (Linzen, 2020) and neural networks more generally (Geirhos et al., 2020). From the results shown in Table 3, we already found that the PMI baseline achieved non-trivial performance, even outperforming BERT in a few settings and datasets. This suggests that several implausible negative examples are included in the analogy datasets. As a further exploration of such artifacts, here we analyse the analogue of a hypothesis-only baseline. In particular, for this analysis, we masked the head or tail of the candidate answer in all evaluation instances. Then, we test the masked language models with the same AP configuration and tuning on these artificially-modified datasets. As can be seen in Table 5, non-trivial performance is achieved for all datasets, which suggests that the words from the answer pair tend to be more similar to the words from the query than the words from negative examples.

7 Conclusion

In this paper, we have presented an extensive analysis of the ability of language models to identify analogies. To this end, we first compiled datasets with psychometric analogy problems from educational resources, covering a wide range of difficulty levels and topics. We also recast two standard benchmarks, the Google and BATS analogy datasets, into the same style of problems. Then, we proposed techniques to apply language models to the unsupervised task of solving these analogy problems. Our empirical results shed light on the strengths and limitations of various models. To directly answer the question posed in the title, our conclusion is that language models can identify analogies to a certain extent, but not all language models are able to achieve a meaningful improvement over word embeddings (whose limitations in analogy tasks are well documented). On the other hand, when carefully tuned, some language models are able to achieve state-of-the-art results. We emphasize that results are highly sensitive to the chosen hyperparameters (which define, among others, the scoring function and the prompt). Further research could focus on the selection of these optimal hyperparameters, including automating the search or generation of prompts, along the lines of Bouraoui et al. (2020) and Shin et al. (2020), respectively. Finally, LMs might still be able to learn to solve analogy tasks when given appropriate training data, an aspect that we leave for future work.

References

Carl Allen and Timothy Hospedales. 2019. Analogies explained: Towards understanding word embeddings. In International Conference on Machine Learning, pages 223–231.
Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2016. A latent variable model approach to PMI-based word embeddings. Transactions of the Association for Computational Linguistics, 4:385–399.
Kevin D. Ashley. 1988. Arguing by analogy in law: A case-based model. In Analogical Reasoning, pages 205–224. Springer.
Nelly Barbot, Laurent Miclet, and Henri Prade. 2019. Analogy between concepts. Artificial Intelligence, 275:487–539.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5(1):135–146.
Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, pages 4349–4357.
Adrian Boteanu and Sonia Chernova. 2015. Solving and explaining analogy questions using semantic networks. In Proceedings of the AAAI Conference on Artificial Intelligence.
Zied Bouraoui, Jose Camacho-Collados, and Steven Schockaert. 2020. Inducing relational knowledge from BERT. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7456–7463.
Zied Bouraoui, Shoaib Jameel, and Steven Schockaert. 2018. Relation induction in word embeddings revisited. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1627–1637, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Annual Conference on Neural Information Processing Systems.
William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4960–4964. IEEE.
Hsiao-Yu Chiang, Jose Camacho-Collados, and Zachary Pardos. 2020. Understanding the source of semantic regularities in word embeddings. In Proceedings of the 24th Conference on Computational Natural Language Learning, pages 119–131, Online. Association for Computational Linguistics.
Kenneth Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.
Joe Davison, Joshua Feldman, and Alexander M. Rush. 2019. Commonsense knowledge mining from pretrained models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 1173–1178.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
Aleksandr Drozd, Anna Gladkova, and Satoshi Matsuoka. 2016. Word embeddings, analogies, and machine learning: Beyond king − man + woman = queen. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3519–3530.
Kawin Ethayarajh, David Duvenaud, and Graeme Hirst. 2019. Towards understanding linear word analogies. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3253–3262.
Allyson Ettinger. 2019. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8:34–48.
Jerome A. Feldman and Dana H. Ballard. 1982. Connectionist models and their properties. Cognitive Science, 6(3):205–254.
Louis Fournier, Emmanuel Dupoux, and Ewan Dunbar. 2020. Analogies minus analogy test: measuring regularities in word embeddings. In Proceedings of the 24th Conference on Computational Natural Language Learning, pages 365–375, Online. Association for Computational Linguistics.
Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723.
Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. 2020. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673.
Alex Gittens, Dimitris Achlioptas, and Michael W. Mahoney. 2017. Skip-gram − Zipf + uniform = vector additivity. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 69–76.
Anna Gladkova, Aleksandr Drozd, and Satoshi Matsuoka. 2016. Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn't. In Proceedings of the Student Research Workshop at NAACL, pages 8–15.
Ashok Goel. 2019. Computational design, analogy, and creativity. In Computational Creativity, pages 141–158. Springer.
Yoav Goldberg. 2019. Assessing BERT's syntactic abilities. arXiv preprint arXiv:1901.05287.
Hila Gonen and Yoav Goldberg. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 609–614.
Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.
Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112, New Orleans, Louisiana. Association for Computational Linguistics.
Huda Hakami and Danushka Bollegala. 2017. Compositional approaches for representing relations between words: A comparative study. Knowledge-Based Systems, 136:172–182.
John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.
Geoffrey E. Hinton. 1986. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, volume 1, page 12. Amherst, MA.
Geoffrey E. Hinton, James L. McClelland, and David E. Rumelhart. 1986. Distributed representations. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, pages 77–109.
Keith J. Holyoak, Keith James Holyoak, and Paul Thagard. 1996. Mental Leaps: Analogy in Creative Thought. MIT Press.
Tom Hope, Joel Chan, Aniket Kittur, and Dafna Shahaf. 2017. Accelerating innovation through analogy mining. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 235–243.
John J. Hopfield. 1982. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558.
Nicolas Hug, Henri Prade, Gilles Richard, and Mathieu Serrurier. 2016. Analogical classifiers: a theoretical perspective. In Proceedings of the Twenty-second European Conference on Artificial Intelligence, pages 689–697.
Eyke Hüllermeier. 2020. Towards analogy-based explanations in machine learning. In International Conference on Modeling Decisions for Artificial Intelligence, pages 205–217.
Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy. Association for Computational Linguistics.
Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.
Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211.
Omer Levy and Yoav Goldberg. 2014. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 171–180, Ann Arbor, Michigan. Association for Computational Linguistics.
Tal Linzen. 2016. Issues in evaluating semantic spaces using word analogies. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 13–18.
Tal Linzen. 2020. How can we accelerate progress towards human-like linguistic generalization? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5210–5217, Online. Association for Computational Linguistics.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
Laurent Miclet, Sabri Bayoudh, and Arnaud Delhay. 2008. Analogical dissimilarity: definition, algorithms and two experiments in machine learning. Journal of Artificial Intelligence Research, 32:793–824.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013a. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In Proceedings of HLT-NAACL, pages 746–751.
Malvina Nissim, Rik van Noord, and Rob van der Goot. 2020. Fair is better than sensational: Man is to doctor as woman is to doctor. Computational Linguistics, 46(2):487–497.
Denis Paperno and Marco Baroni. 2016. When the whole is less than the sum of its parts: How composition affects PMI values in distributional semantic vectors. Computational Linguistics, 42(2):345–350.
Zachary A. Pardos and Andrew J. H. Nam. 2020. A university map of course knowledge. PLoS ONE, 15(9).
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509, Brussels, Belgium. Association for Computational Linguistics.
Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473.
Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180–191, New Orleans, Louisiana. Association for Computational Linguistics.
Henri Prade and Gilles Richard. 2017. Analogical proportions and analogical reasoning: an introduction. In International Conference on Case-Based Reasoning, pages 16–32. Springer.
Elie Raad and Joerg Evermann. 2015. The role of analogy in ontology alignment: A study on LISA. Cognitive Systems Research, 33:1–16.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2021. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866.
Naomi Saphra and Adam Lopez. 2019. Understanding learning dynamics of language models with SVCCA. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3257–3267, Minneapolis, Minnesota. Association for Computational Linguistics.
Marten van Schijndel, Aaron Mueller, and Tal Linzen. 2019. Quantity doesn't buy quality syntax with neural language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5831–5837, Hong Kong, China. Association for Computational Linguistics.
Natalie Schluter. 2018. The word analogy testing caveat. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 242–246.
Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online. Association for Computational Linguistics.
Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019a. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.
Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. 2019b. What do you learn from context? Probing for sentence structure in contextualized word representations. In Proceedings of the 7th International Conference on Learning Representations (ICLR).
Peter D. Turney. 2005. Measuring semantic similarity by latent relational analysis. In Proceedings of IJCAI, pages 1136–1141.
Peter D. Turney, Michael L. Littman, Jeffrey Bigham, and Victor Shnayder. 2003. Combining independent modules in lexical multiple-choice problems. In Recent Advances in Natural Language Processing III, pages 101–110.
Ekaterina Vylomova, Laura Rimell, Trevor Cohn, and Timothy Baldwin. 2016. Take and took, gaggle and goose, book and read: Evaluating the utility of vector differences for lexical relation learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1671–1682.
Douglas Walton. 2010. Similarity, precedent and argument from analogy. Artificial Intelligence and Law, 18(3):217–246.
Alex Wang and Kyunghyun Cho. 2019. BERT has a mouth, and it must speak: BERT as a Markov random field language model. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 30–36, Minneapolis, Minnesota. Association for Computational Linguistics.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.

A Experimental Details

In our grid search to find the optimal configuration for each dataset and language model, each parameter was selected within the values shown in Table 6. For the coefficients of the marginal likelihood, α, αh, and αt, we considered negative values as well, as we hypothesized that the marginal likelihood could be beneficial for LMs as a way to leverage lexical knowledge of the head and tail words.

Additionally, Table 7 shows the set of custom templates (or prompts) used in our experiments. Finally, Tables 8, 9, and 10 include the best configuration based on each validation set for sPMI, smPPL and the hypothesis-only baseline, respectively.

Parameter   Values
α           -0.4, -0.2, 0, 0.2, 0.4
αh          -0.4, -0.2, 0, 0.2, 0.4
αt          -0.4, -0.2, 0, 0.2, 0.4
β           0, 0.2, 0.4, 0.6, 0.8, 1.0
g           max, mean, min, val1, val2
gpos        max, mean, min, val1, ..., val8
gneg        max, mean, min, val1, ..., val16

Table 6: Hyperparameters and their search spaces.

Type       Template
to-as      [w1] is to [w2] as [w3] is to [w4]
to-what    [w1] is to [w2] What [w3] is to [w4]
rel-same   The relation between [w1] and [w2] is the same as the relation between [w3] and [w4].
what-to    what [w1] is to [w2], [w3] is to [w4]
she-as     She explained to him that [w1] is to [w2] as [w3] is to [w4]
as-what    As I explained earlier, what [w1] is to [w2] is essentially the same as what [w3] is to [w4].

Table 7: Custom templates used in our experiments. Each has four placeholders [w1], ..., [w4], which are filled by the words from a relation pair.
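A sketch of this grid search follows (our own illustration, showing only a subset of the grid in Table 6; `evaluate` stands in for running the AP scorer with a given configuration on a validation set).

```python
from itertools import product

# A subset of the search space in Table 6, for illustration.
GRID = {
    "alpha": [-0.4, -0.2, 0, 0.2, 0.4],
    "beta": [0, 0.2, 0.4, 0.6, 0.8, 1.0],
    "template": ["to-as", "to-what", "rel-same", "what-to", "she-as", "as-what"],
}

def grid_search(validation_set, evaluate):
    """Return the configuration with the best validation accuracy.

    `evaluate(config, data)` runs the AP scorer with the given
    hyperparameters and returns an accuracy."""
    best_config, best_acc = None, -1.0
    for values in product(*GRID.values()):
        config = dict(zip(GRID.keys(), values))
        acc = evaluate(config, validation_set)
        if acc > best_acc:
            best_config, best_acc = config, acc
    return best_config, best_acc
```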

Model     Data     g      α      gpos    gneg    β     t
BERT      SAT      val2   -0.4   val5    val12   0.4   what-to
          U2       val2   -0.4   mean    mean    0.6   what-to
          U4       val1   0.4    max     val7    1.0   rel-same
          Google   val1   -0.4   val1    val11   0.4   she-as
          BATS     val1   -0.4   val11   val1    0.4   she-as
GPT-2     SAT      val2   -0.4   val3    val1    0.6   rel-same
          U2       val2   0.0    val4    val4    0.6   rel-same
          U4       val2   -0.4   mean    mean    0.6   rel-same
          Google   val1   0.0    mean    val11   0.4   as-what
          BATS     val1   -0.4   val1    val6    0.4   rel-same
RoBERTa   SAT      min    -0.4   min     val7    0.2   as-what
          U2       min    0.4    mean    val4    0.6   what-to
          U4       val2   0.0    mean    val4    0.8   to-as
          Google   val1   -0.4   val1    val6    0.4   what-to
          BATS     max    -0.4   mean    val11   0.6   what-to

Table 8: The best configuration of the sPMI score.

Model     Data     αh     αt     gpos   gneg    β     t
BERT      SAT      -0.2   -0.4   val5   val5    0.2   what-to
          U2       0.0    -0.2   mean   mean    0.8   she-as
          U4       -0.2   0.4    val7   min     0.4   to-as
          Google   0.4    -0.2   val5   val12   0.6   she-as
          BATS     0.0    0.0    val8   min     0.4   what-to
GPT-2     SAT      -0.4   0.2    val3   val1    0.8   rel-same
          U2       -0.2   0.2    mean   mean    0.8   as-what
          U4       -0.2   0.2    mean   mean    0.8   rel-same
          Google   -0.2   -0.4   mean   mean    0.8   rel-same
          BATS     0.4    -0.4   val1   val5    0.8   rel-same
RoBERTa   SAT      0.2    0.2    val5   val11   0.2   as-what
          U2       0.4    0.4    val1   val4    0.4   what-to
          U4       0.2    0.2    val1   val1    0.4   as-what
          Google   0.2    0.2    val1   val6    0.2   what-to
          BATS     0.2    -0.2   val5   val11   0.4   what-to

Table 9: The best configuration of the smPPL score.

Mask   Model     Data     gpos   t
head   BERT      SAT      val5   to-what
                 U2       val5   to-as
                 U4       mean   to-as
                 Google   val5   she-as
                 BATS     val5   to-as
tail   BERT      SAT      val3   what-to
                 U2       val7   to-what
                 U4       val4   rel-same
                 Google   val7   as-what
                 BATS     val7   to-as
head   RoBERTa   SAT      val5   as-what
                 U2       val5   rel-same
                 U4       val7   she-as
                 Google   val5   what-to
                 BATS     val5   she-as
tail   RoBERTa   SAT      mean   what-to
                 U2       val7   rel-same
                 U4       mean   what-to
                 Google   val7   as-what
                 BATS     val7   what-to

Table 10: The best configurations for hypothesis-only scores.

B Additional Ablation Results

We show a few more complementary results to our main experiments.

B.1 Alternative Scoring Functions

As alternative scoring functions for LMs, we have tried two other scores: a PMI score based on masked token prediction (Davison et al., 2019) (Mask PMI), and the cosine similarity between the embedding differences of a relation pair, similar to what is used in word embedding models. For the embedding method, we give a prompted sentence to the LM to get the last layer's hidden state for each word in the given pair, and we take the difference between them, which we regard as the embedding vector for the pair. Finally, we pick the most similar candidate in terms of the cosine similarity with the query embedding (a code sketch follows at the end of this appendix section). Table 11 shows the test accuracy on each dataset. As one can see, the AP scores outperform the other methods by a large margin.

Model     Score       SAT    U2     U4     Google   BATS
BERT      embedding   24.0   22.4   26.6   28.2     28.3
          Mask PMI    25.2   23.3   31.5   61.2     46.2
          sPMI        40.4   42.5   27.8   87.0     68.1
          smPPL       41.8   44.7   41.2   88.8     67.9
RoBERTa   embedding   40.4   42.5   27.8   87.0     68.1
          Mask PMI    43.0   36.8   39.4   69.2     58.3
          sPMI        51.3   49.1   38.7   92.4     77.2
          smPPL       53.4   58.3   57.4   93.6     78.4

Table 11: Test accuracy tuned on each validation set.

B.2 Parameter Sensitivity: Template Type t

Figure 5 shows the box plot of the relative improvement across all datasets grouped by t. The results indicate a mild trend that certain templates tend to perform well, but no significant universal selectivity can be found across datasets.

B.3 Parameter Sensitivity: Aggregation Method gneg

Figure 6 shows the box plot of the relative improvement across all datasets grouped by gneg. Unlike gpos, shown in Figure 3, it does not give a strong signal across datasets.
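As referenced in B.1, the pair embedding can be sketched as follows (our own illustration: a smaller BERT checkpoint, a fixed "is to" prompt, and mean-pooling over sub-tokens are all assumptions).

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-cased")
enc = AutoModel.from_pretrained("bert-base-cased").eval()

def pair_vector(head: str, tail: str) -> torch.Tensor:
    """Difference of last-layer hidden states of the two words in a prompted sentence."""
    h_toks, m_toks, t_toks = tok.tokenize(head), tok.tokenize("is to"), tok.tokenize(tail)
    ids = tok.convert_tokens_to_ids([tok.cls_token] + h_toks + m_toks + t_toks + [tok.sep_token])
    with torch.no_grad():
        hidden = enc(torch.tensor([ids])).last_hidden_state[0]  # (seq_len, dim)
    # Mean-pool the sub-token states of each word, then take the difference.
    h_vec = hidden[1:1 + len(h_toks)].mean(dim=0)
    t_start = 1 + len(h_toks) + len(m_toks)
    t_vec = hidden[t_start:t_start + len(t_toks)].mean(dim=0)
    return h_vec - t_vec

def cosine(u: torch.Tensor, v: torch.Tensor) -> float:
    return torch.nn.functional.cosine_similarity(u, v, dim=0).item()

# Candidates are ranked by cosine(pair_vector(*query), pair_vector(*candidate)).
print(cosine(pair_vector("word", "language"), pair_vector("note", "music")))
```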


Figure 7: BATS (top) and Google (bottom) results split by high-level categories.

presumably chosen because water is assumed to Figure 6: Box plot of the relative improvement on be a near-synonym of food. In the third example test accuracy in each dataset over all configurations of (wrench:tool), the hypnernymy relation is confused s grouped by g . Here val corresponds to kth mPPL neg k with a meronymy relation in the selected candidate positive permutation shown in Figure2. tree:forest. In the last three examples, the model has selected answers which seem reasonable. In the B.4 Relation Types in BATS/Google fourth example, beautiful:pretty, terrible:bad and Figure7 shows the results of different language brave:valiant can all be considered to be synonym pairs. In the fifth example, vehicle:transport is models with the smPPL scoring function on the dif- ferent categories of the BATS and Google datasets. clearly the correct answer, but the pair song:sing is nonetheless relationally similar to shield:protect. C Error Analysis In the last example, we can think of being sad as an emotional state, like being sick is a health state, Table 12 shows all examples from the U2 dataset which provides some justification for the predicted of the easiest difficuly (i.e. grade 4), which were answer. On the other hand, the gold answer is based misclassified by RoBERTa, with smPPL tuned on on the argument that someone who is sick lacks the validation set. We can see a few typical issues health like someone who is scared lacks courage. with word embeddings and language models. For instance, in the first example, the model confuses the antonym pair right:wrong with synonymy. In the second example, we have that someone who is poor lacks money, while someone who is hungry lacks food. However, the selected candidate pair is hungy:water rather than hungry:food, which is

3623 Query Candidates hilarious:funny right:wrong, hard:boring, nice:crazy, great:good poor:money tired:energy, angry:emotion, hot:ice, hungry:water wrench:tool cow:milk, radio:sound, tree:forest, carrot:vegetable beautiful:pretty terrible:bad, brave:valiant, new:old, tall:skinny shield:protect computer:talk, vehicle:transport, pencil:make, song:sing sick:health sad:emotion, tall:intelligence, scared:courage, smart:energy

Table 12: Model prediction examples from RoBERTa with smPPL tuned on the validation set. Gold answers are shown in bold, while the model predictions are un- derlined.

3624