Do NLP Models Know Numbers? Probing Numeracy in Embeddings

Eric Wallace∗1, Yizhong Wang∗2, Sujian Li2, Sameer Singh3, Matt Gardner1
1Allen Institute for Artificial Intelligence, 2Peking University, 3University of California, Irvine
{ericw, mattg}@allenai.org, {yizhong, lisujian}@pku.edu.cn, [email protected]
∗Equal contribution; work done while interning at AI2.

Abstract

The ability to understand and work with numbers (numeracy) is critical for many complex reasoning tasks. Currently, most NLP models treat numbers in text in the same way as other tokens—they embed them as distributed vectors. Is this enough to capture numeracy? We begin by investigating the numerical reasoning capabilities of a state-of-the-art question answering model on the DROP dataset. We find this model excels on questions that require numerical reasoning, i.e., it already captures numeracy. To understand how this capability emerges, we probe token embedding methods (e.g., BERT, GloVe) on synthetic list maximum, number decoding, and addition tasks. A surprising degree of numeracy is naturally present in standard embeddings. For example, GloVe and word2vec accurately encode magnitude for numbers up to 1,000. Furthermore, character-level embeddings are even more precise—ELMo captures numeracy the best of all pre-trained methods—but BERT, which uses sub-word units, is less exact.

Figure 1: We train a probing model to decode a number from its embedding over a random 80% of the integers from [-500, 500], e.g., "71" → 71.0. We plot the model's predictions for all numbers from [-2000, 2000]. The model accurately decodes numbers within the training range (in blue), i.e., pre-trained embeddings like GloVe and BERT capture numeracy. However, the probe fails to extrapolate to larger numbers (in red). The Char-CNN (e) and Char-LSTM (f) are trained jointly with the probing model.

1 Introduction

Neural NLP models have become the de-facto standard tool across language understanding tasks, even solving basic reading comprehension and textual entailment datasets (Yu et al., 2018; Devlin et al., 2019). Despite this, existing models are incapable of complex forms of reasoning; in particular, we focus on the ability to reason numerically. Recent datasets such as DROP (Dua et al., 2019), EQUATE (Ravichander et al., 2019), or Mathematics Questions (Saxton et al., 2019) test numerical reasoning; they contain examples which require comparing, sorting, and adding numbers in natural language (e.g., Figure 2).

The first step in performing numerical reasoning over natural language is numeracy: the ability to understand and work with numbers in either digit or word form (Spithourakis and Riedel, 2018). For example, one must understand that the string "23" represents a bigger value than "twenty-two". Once a number's value is (perhaps implicitly) represented, reasoning algorithms can then process the text, e.g., extracting the list of field goals and computing that list's maximum (first question in Figure 2). Learning to reason numerically over paragraphs with only question-answer supervision appears daunting for end-to-end models; our work seeks to understand if and how "out-of-the-box" neural NLP models already learn this.

We begin by analyzing the state-of-the-art NAQANet model (Dua et al., 2019) for DROP—testing it on a subset of questions that evaluate numerical reasoning (Section 2). To our surprise, the model exhibits excellent numerical reasoning abilities.

Amidst reading and comprehending natural language, the model successfully computes list maximums/minimums, extracts superlative entities (argmax reasoning), and compares numerical quantities. For instance, despite NAQANet achieving only 49 F1 on the entire validation set, it scores 89 F1 on numerical comparison questions. We also stress test the model by perturbing the validation paragraphs and find one failure mode: the model struggles to extrapolate to numbers outside its training range.

We are especially intrigued by the model's ability to learn numeracy, i.e., how does the model know the value of a number given its embedding? The model uses standard embeddings (GloVe and a Char-CNN) and receives no direct supervision for number magnitude/ordering. To understand how numeracy emerges, we probe token embedding methods (e.g., BERT, GloVe) using synthetic list maximum, number decoding, and addition tasks (Section 3).

We find that all widely-used pre-trained embeddings, e.g., ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), and GloVe (Pennington et al., 2014), capture numeracy: number magnitude is present in the embeddings, even for numbers in the thousands. Among all embeddings, character-level methods exhibit stronger numeracy than word- and sub-word-level methods (e.g., ELMo excels while BERT struggles), and character-level models learned directly on the synthetic tasks are the strongest overall. Finally, we investigate why NAQANet had trouble extrapolating—was it a failure in the model or the embeddings? We repeat our probing tasks and test for model extrapolation, finding that neural models struggle to predict numbers outside the training range.

. . . JaMarcus Russell completed a 91-yard touchdown pass to rookie wide receiver Chaz Schilens. The Texans would respond with fullback Vonta Leach getting a 1-yard touchdown run, yet the Raiders would answer with kicker Sebastian Janikowski getting a 33-yard and a 21-yard field goal. Houston would tie the game in the second quarter with kicker Kris Brown getting a 53-yard and a 24-yard field goal. Oakland would take the lead in the third quarter with wide receiver Johnnie Lee Higgins catching a 29-yard touchdown pass from Russell, followed up by an 80-yard punt return for a touchdown.
Q: How many yards was the longest field goal? A: 53
Q: How long was the shortest touchdown pass? A: 29-yard
Q: Who caught the longest touchdown? A: Chaz Schilens

Figure 2: Three DROP questions that require numerical reasoning; the state-of-the-art NAQANet answers every question correctly. Plausible answer candidates to the questions are underlined and the model's predictions are shown in bold.

2 Numeracy Case Study: DROP QA

This section examines the state-of-the-art model for DROP by investigating its accuracy on questions that require numerical reasoning.

2.1 DROP Dataset

DROP is a reading comprehension dataset that tests numerical reasoning operations such as counting, sorting, and addition (Dua et al., 2019). The dataset's input-output format is a superset of SQuAD (Rajpurkar et al., 2016): the answers are paragraph spans, as well as question spans, number answers (e.g., 35), and dates (e.g., 03/01/2014). The only supervision provided is the question-answer pairs, i.e., a model must learn to reason numerically while simultaneously learning to read and comprehend.

2.2 NAQANet Model

Modeling approaches for DROP include both semantic parsing (Krishnamurthy et al., 2017) and reading comprehension (Yu et al., 2018) models. We focus on the latter, specifically on Numerically-augmented QANet (NAQANet), the current state-of-the-art model (Dua et al., 2019).[1] The model's core structure closely follows QANet (Yu et al., 2018) except that it contains four output branches, one for each of the four answer types (passage span, question span, count answer, or addition/subtraction of numbers).

[1] Result as of May 21st, 2019.

Words and numbers are represented as the concatenation of GloVe embeddings and the output of a character-level CNN. The model contains no auxiliary components for representing number magnitude or performing explicit comparisons. We refer readers to Yu et al. (2018) and Dua et al. (2019) for further details.

2.3 Comparative and Superlative Questions

We focus on questions that require numeracy for NAQANet to answer, namely Comparative and Superlative questions.[2] Comparative questions probe a model's understanding of quantities or events that are "larger", "smaller", or "longer" than others. Certain comparative questions ask about "either-or" relations (e.g., first row of Table 1), which test binary comparison. Other comparative questions require more diverse comparative reasoning, such as greater than relationships (e.g., second row of Table 1).

[2] DROP addition, subtraction, and count questions do not require numeracy for NAQANet; see Appendix A.

Superlative questions ask about the "shortest", "largest", or "biggest" quantity in a passage. When the answer type is a number, superlative questions require finding the maximum or minimum of a list (e.g., third row of Table 1). When the answer type is a span, superlative questions usually require an argmax operation, i.e., one must find the superlative action or quantity and then extract the associated entity (e.g., fourth row of Table 1). We filter the validation set to comparative and superlative questions by writing templates to match words in the question.

Question Type               Example                                                    Reasoning Required
Comparative (Binary)        Which country is a bigger exporter, Brazil or Uruguay?    Binary Comparison
Comparative (Non-binary)    Which player had a touchdown longer than 20 yards?        Greater Than
Superlative (Number)        How many yards was the shortest field goal?                List Minimum
Superlative (Span)          Who kicked the longest field goal?                         Argmax

Table 1: We focus on DROP Comparative and Superlative questions, which test NAQANet's numeracy.

2.4 Emergent Numeracy in NAQANet

NAQANet's accuracy on comparative and superlative questions is significantly higher than its average accuracy on the validation set (Table 2).[3]

[3] We have a public NAQANet demo available at https://demo.allennlp.org/reading-comprehension.

Question Type             Count   EM     F1
Human (Test Set)          9622    92.4   96.0
Full Validation           9536    46.2   49.2
  Number Answers          5842    44.3   44.4
Comparative                704    73.6   76.4
  Binary (either-or)       477    86.0   89.0
  Non-binary               227    47.6   49.8
Superlative Questions      861    64.6   67.7
  Number Answers           475    68.8   69.2
  Span Answers             380    59.7   66.3

Table 2: NAQANet achieves higher accuracy on questions that require numerical reasoning (Superlative and Comparative) than on standard validation questions. Human performance is reported from Dua et al. (2019).

NAQANet achieves 89.0 F1 on binary (either-or) comparative questions, approximately 40 F1 points higher than the average validation question and within 7 F1 points of human test performance. The model achieves a lower, but respectable, accuracy on non-binary comparisons. These questions require multiple reasoning steps, e.g., the second question in Table 1 requires (1) extracting all the touchdown distances, (2) finding the distance that is greater than twenty, and (3) selecting the player associated with the touchdown of that distance.

We divide the superlative questions into questions that have number answers and questions with span answers according to the dataset's provided answer type. NAQANet achieves nearly 70 F1 on superlative questions with number answers, i.e., it can compute list maximums and minimums. The model answers about two-thirds of superlative questions with span answers correctly (66.3 F1), i.e., it can perform argmax reasoning.

Figure 2 shows examples of superlative questions answered correctly by NAQANet. The first two questions require computing the maximum/minimum of a list: the model must recognize which digits correspond to field goals and touchdown passes, and then extract the maximum/minimum of the correct list. The third question requires argmax reasoning: the model must first compute the longest touchdown pass and then find the corresponding receiver "Chaz Schilens".

2.5 Stress Testing NAQANet's Numeracy

Just how far does the numeracy of NAQANet go? Here, we stress test the model by automatically modifying DROP validation paragraphs.

We test two phenomena: larger numbers and word-form numbers. For larger numbers, we generate a random positive integer and multiply or add that value to the numbers in each paragraph. For word forms, we replace every digit in the paragraph with its word form (e.g., "75" → "seventy-five"). Since word-form numbers are usually small in magnitude when they occur in DROP, we perform word replacements for integers in the range [0, 100]. We guarantee the ground-truth answer is still valid by only modifying NAQANet's internal representation (Appendix E).
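To make the paragraph perturbations concrete, the following sketch (ours, not the authors' released code; function names and the regex over digit spans are illustrative choices) applies the Add/Multiply and Digits-to-Words transformations to raw paragraph text. Note that the paper instead modifies NAQANet's internal number representations so the ground-truth answer stays valid (Appendix E).

```python
import random
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def to_word_form(n: int) -> str:
    """Convert an integer in [0, 100] to its English word form, e.g., 75 -> "seventy-five"."""
    if n < 20:
        return ONES[n]
    if n == 100:
        return "one hundred"
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"

def perturb_numbers(paragraph: str, low: int, high: int, mode: str = "add") -> str:
    """Add or multiply every integer in the paragraph by one random integer from [low, high]."""
    k = random.randint(low, high)
    op = (lambda x: x + k) if mode == "add" else (lambda x: x * k)
    return re.sub(r"\d+", lambda m: str(op(int(m.group()))), paragraph)

def digits_to_words(paragraph: str, low: int = 0, high: int = 100) -> str:
    """Replace integers within [low, high] by their word form, leaving other numbers untouched."""
    def repl(m):
        n = int(m.group())
        return to_word_form(n) if low <= n <= high else m.group()
    return re.sub(r"\d+", repl, paragraph)

if __name__ == "__main__":
    text = "Kris Brown kicked a 53-yard and a 24-yard field goal."
    print(perturb_numbers(text, 11, 100, mode="multiply"))
    print(digits_to_words(text))
```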

Stress Test Dataset           All Questions        Superlative
                              F1       ∆           F1       ∆
Original Validation Set       49.2     -           67.7     -
Add [1, 20]                   47.7    -1.5         64.1    -3.6
Add [21, 100]                 41.4    -7.8         40.4    -27.3
Multiply [2, 10]              41.1    -8.1         39.3    -28.4
Multiply [11, 100]            38.8    -10.4        32.0    -35.7
Digits to Words [0, 20]       45.5    -3.7         63.8    -3.9
Digits to Words [21, 100]     41.9    -7.3         46.1    -21.6

Table 3: We stress test NAQANet's numeracy by manipulating the numbers in the validation paragraphs. Add or Multiply [x, y] indicates adding or multiplying all of the numbers in the passage by a random integer in the range [x, y]. Digits to Words [x, y] converts all integers in the passage within the range [x, y] to their corresponding word form (e.g., "75" → "seventy-five").

Table 3 shows the results for different paragraph modifications. The model exhibits a tiny degradation in performance for small magnitude changes (e.g., NAQANet drops 1.5 F1 overall for Add [1,20]) but severely struggles on larger changes (e.g., NAQANet drops 35.7 F1 on superlative questions for Multiply [11,100]). Similar trends hold for word forms: the model exhibits small drops in accuracy when converting small numbers to words (3.9 F1 degradation on Digits to Words [0,20]) but fails on larger magnitude word forms (21.6 F1 drop over [21,100]). These results show that NAQANet has a strong understanding of numeracy for numbers in the training range, but the model can fail to extrapolate to other values.

2.6 Whence this behavior?

NAQANet exhibits numerical reasoning capabilities that exceed our expectations. What enables this behavior? Aside from reading and comprehending the passage/question, this kind of numerical reasoning requires two components: numeracy (i.e., representing numbers) and comparison algorithms (i.e., computing the maximum of a list).

Although the natural emergence of comparison algorithms is surprising, previous results show neural models are capable of learning to count and sort synthetic lists of scalar values when given explicit supervision (Weiss et al., 2018; Vinyals et al., 2016). NAQANet demonstrates that a model can learn comparison algorithms while simultaneously learning to read and comprehend, even with only question-answer supervision.

How, then, does NAQANet know numeracy? The source of numerical information eventually lies in the token embeddings themselves, i.e., the character-level convolutions and GloVe embeddings of the NAQANet model. Therefore, we can understand the source of numeracy by isolating and probing these embeddings.

3 Probing Numeracy of Embeddings

We use synthetic numerical tasks to probe the numeracy of token embeddings.

3.1 Probing Tasks

We consider three synthetic tasks to evaluate numeracy (Figure 3). Appendix C provides further details on training and evaluation.

List Maximum  Given a list of the embeddings for five numbers, the task is to predict the index of the maximum number. Each list consists of values of similar magnitude in order to evaluate fine-grained comparisons (see Appendix C). As in typical span selection models (Seo et al., 2017), an LSTM reads the list of token embeddings, and a weight matrix and softmax function assign a probability to each index using the model's hidden state. We use the negative log-likelihood of the maximum number as the loss function.

Decoding  The decoding task probes whether number magnitude is captured (rather than the relative ordering of numbers as in list maximum). Given a number's embedding, the task is to regress to its value, e.g., the embedding for the string "five" has a target of 5.0. We consider a linear regression model and a three-layer fully-connected network with ReLU activations. The models are trained using a mean squared error (MSE) loss.

Addition  The addition task requires number manipulation—given the embeddings of two numbers, the task is to predict their sum. Our model concatenates the two token embeddings and feeds the result through a three-layer fully-connected network with ReLU activations, trained using MSE loss. Unlike the decoding task, the model needs to capture number magnitude internally without direct label supervision.
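A minimal PyTorch sketch of the three probing heads described above, assuming the pre-trained embedder has already mapped each number token to a fixed-size vector; the layer sizes and the single-layer BiLSTM are our own illustrative choices, not the exact hyperparameters from Appendix C.

```python
import torch
import torch.nn as nn

class ListMaxProbe(nn.Module):
    """BiLSTM over five number embeddings; a linear layer scores each index."""
    def __init__(self, emb_dim: int, hidden: int = 100):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.scorer = nn.Linear(2 * hidden, 1)

    def forward(self, embs):                     # embs: (batch, 5, emb_dim)
        states, _ = self.lstm(embs)              # (batch, 5, 2 * hidden)
        return self.scorer(states).squeeze(-1)   # logits over the 5 indices

class MLPProbe(nn.Module):
    """Three-layer ReLU network used for decoding (in_dim = emb_dim) and
    addition (in_dim = 2 * emb_dim, the two embeddings concatenated)."""
    def __init__(self, in_dim: int, hidden: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)           # predicted value (or sum)

# Losses, as in the paper: negative log-likelihood of the maximum index for
# list maximum, and mean squared error for decoding and addition.
list_max_loss = nn.CrossEntropyLoss()
regression_loss = nn.MSELoss()
```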

Training and Evaluation  We focus on a numerical interpolation setting (we revisit extrapolation in Section 3.4): the model is tested on values that are within the training range. We first pick a range (we vary the range in our experiments) and randomly shuffle the integers over it. We then split 80% of the numbers into a training set and 20% into a test set. We report the mean and standard deviation across five different random shuffles for a particular range, using the exact same shuffles across all embedding methods.

Numbers are provided as integers ("75"), single-word form ("seventy-five"), floats ("75.1"), or negatives ("-75"). We consider positive numbers less than 100 for word-form numbers to avoid multiple tokens. We report the classification accuracy for the list maximum task (5 classes), and the Root Mean Squared Error (RMSE) for decoding and addition. Note that larger ranges will naturally amplify the RMSE error.

[Figure 3: probing-setup diagram. Inputs ("nine", "four", "twelve", "one", "eleven" for list maximum; "nine" for decoding; "seven", "two" for addition) pass through a pre-trained embedder, then through the probing model (BiLSTM for list maximum, MLP for decoding and addition), producing a prediction compared against the target.]

Figure 3: Our probing setup. We pass numbers through a pre-trained embedder (e.g., BERT, GloVe) and train a probing model to solve numerical tasks such as finding a list's maximum, decoding a number, or adding two numbers. If the probing model generalizes to held-out numbers, the pre-trained embeddings must contain numerical information. We provide numbers as either words (shown here), digits ("9"), floats ("9.1"), or negatives ("-9").

3.2 Embedding Methods

We evaluate various token embedding methods.

Word Vectors  We use 300-dimensional GloVe (Pennington et al., 2014) and word2vec vectors (Mikolov et al., 2018). We ensure all values are in-vocabulary for word vectors.

Contextualized Embeddings  We use ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) embeddings.[4] ELMo uses character-level convolutions of size 1–7 with max pooling. BERT represents tokens via sub-word pieces; we use lowercased BERT-base with 30k pieces.

NAQANet Embeddings  We extract the GloVe embeddings and Char-CNN from the NAQANet model trained on DROP. We also consider an ablation that removes the GloVe embeddings.

Learned Embeddings  We use a character-level CNN (Char-CNN) and a character-level LSTM (Char-LSTM). We use left character padding, which greatly improves numeracy for character-level CNNs (details in Appendix B).

Untrained Embeddings  We consider two untrained baselines. The first baseline is random token vectors, which trivially fail to generalize (there is no pattern between train and test numbers). These embeddings are useful for measuring the improvement of pre-trained embeddings. We also consider a randomly initialized and untrained Char-CNN and Char-LSTM.

Number's Value as Embedding  The final embedding method is simple: map a number's embedding directly to its value (e.g., "seventy-five" embeds to [75]). We found this strategy performs poorly for large ranges; using a base-10 logarithmic scale improves performance. We report this as Value Embedding in our results.[5]

[4] Since our inputs are numbers, not natural sentences, language models may exhibit strange behavior. We experimented with extracting the context-independent feature vector immediately following the character convolutions for ELMo but found little difference in results.

[5] We suspect the failures result from the raw values being too high in magnitude and/or variance for the model. We also experimented with normalizing the values to mean 0 and variance 1; a logarithmic scale performed better.
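As a concrete reading of the Value Embedding baseline and footnote [5], here is a minimal sketch (ours, not the authors' code). The paper only states that a base-10 logarithmic scale helps; the signed log handling of zero and negative values is our own assumption, and parsing of word-form tokens such as "seventy-five" is omitted.

```python
import math

def value_embedding(token: str, log_scale: bool = True) -> list:
    """Embed a numeric token directly as its value, e.g., "75" -> [75.0].
    With log_scale=True, use a signed base-10 log (our assumption for
    handling 0 and negatives), e.g., "75" -> [log10(1 + 75)]."""
    value = float(token)
    if log_scale:
        value = math.copysign(math.log10(1.0 + abs(value)), value)
    return [value]
```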

All pre-trained embeddings (all methods except the Char-CNN and Char-LSTM) are fixed during training. The probing models are trained on the synthetic tasks on top of these embeddings.

3.3 Results: Embeddings Capture Numeracy

We find that all pre-trained embeddings contain fine-grained information about number magnitude and order. We first focus on integers (Table 4).

Word Vectors Succeed  Both word2vec and GloVe significantly outperform the random vector baseline and are among the strongest methods overall. This is particularly surprising given the training methodology for these embeddings, e.g., that a continuous bag of words objective can teach fine-grained number magnitude.

Character-level Methods Dominate  Models which use character-level information have a clear advantage over word-level models for encoding numbers. This is reflected in our probing results: character-level CNNs are the best architecture for capturing numeracy. For example, the NAQANet model without GloVe (only using its Char-CNN) and ELMo (which uses a Char-CNN) are the strongest pre-trained methods, and a learned Char-CNN is the strongest method overall. The strength of the character-level convolutions seems to lie in the architectural prior—an untrained Char-CNN is surprisingly competitive. Similar results have been shown for images (Saxe et al., 2011): random CNNs are powerful feature extractors.

Sub-word Models Struggle  BERT struggles for large ranges (e.g., 52% accuracy on list maximum for [0,9999]). We suspect this results from sub-word pieces being a poor method to encode digits: two numbers which are similar in value can have very different sub-word divisions.

A Linear Subspace Exists  For small ranges on the decoding task (e.g., [0,99]), a linear model is competitive, i.e., a linear subspace captures number magnitude (Appendix D). For larger ranges (e.g., [0,999]), the linear model's performance degrades, especially for BERT.

Value Embedding Fails  The Value Embedding method fails for large ranges. This is surprising as the embedding directly provides a number's value; thus, the synthetic tasks should be easy to solve. However, we had difficulty training models for large ranges, even when using numerous architecture variants (e.g., tiny networks with 10 hidden units and tanh activations) and hyperparameters. Trask et al. (2018) discuss similar problems and ameliorate them using new neural architectures.

Words, Floats, and Negatives are Captured  Finally, we probe the embeddings on word-form numbers, floats, and negatives. We observe similar trends for these inputs as for integers: pre-trained models exhibit natural numeracy and learned embeddings are strong (Tables 5, 6, and 10). The ordering of the different embedding methods according to performance is also relatively consistent across the different input types. One notable exception is that BERT struggles on floats, which is likely a result of its sub-word pieces. We do not test word2vec and GloVe on floats/negatives because they are out-of-vocabulary.

3.4 Probing Models Struggle to Extrapolate

Thus far, our synthetic experiments evaluate on held-out values within the same range as the training data (i.e., numerical interpolation). In Section 2.5, we found that NAQANet struggles to extrapolate to values outside the training range. Is this an idiosyncrasy of NAQANet or is it a more general problem? We investigate this using a numerical extrapolation setting: we train models on a specific integer range and test them on values greater than the largest training number and smaller than the smallest training number.

Extrapolation for Decoding and Addition  For decoding and addition, models struggle to extrapolate. Figure 1 shows the predictions for models trained on 80% of the values from [-500,500] and tested on held-out numbers in the range [-2000, 2000] for six embedding types. The embedding methods fail to extrapolate in different ways, e.g., predictions using word2vec decrease almost monotonically as the input increases, while predictions using BERT are usually near the highest training value. Trask et al. (2018) also observe that models struggle outside the training range; they attribute this to failures in neural models themselves.

Extrapolation for List Maximum  For the list maximum task, accuracies are closer to those in the interpolation setting; however, they still fall short. Table 7 shows the accuracy for models trained on the integer range [0,150] and tested on the ranges [151,160], [151,180], and [151,200]; all methods struggle, especially the token vectors.
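To make the two evaluation protocols explicit, the sketch below (our illustration; function names and the use of Python's random module are our own choices) contrasts the interpolation split from Section 3.1 with the extrapolation split used here.

```python
import random

def interpolation_split(low: int, high: int, seed: int, train_frac: float = 0.8):
    """Shuffle the integers in [low, high]; 80% train, 20% held-out test.
    The same shuffle (seed) is reused for every embedding method."""
    rng = random.Random(seed)
    numbers = list(range(low, high + 1))
    rng.shuffle(numbers)
    cut = int(train_frac * len(numbers))
    return numbers[:cut], numbers[cut:]

def extrapolation_split(train_range, test_range):
    """Train on one integer range and test on a disjoint range of larger values,
    e.g., train on [0, 150] and test on [151, 160] as in Table 7."""
    train = list(range(train_range[0], train_range[1] + 1))
    test = list(range(test_range[0], test_range[1] + 1))
    return train, test

interp = [interpolation_split(0, 999, seed) for seed in range(5)]    # five shuffles
extrap = [extrapolation_split((0, 150), (151, hi)) for hi in (160, 180, 200)]
```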

Interpolation       List Maximum (5-classes)       Decoding (RMSE)               Addition (RMSE)
Integer Range       [0,99]  [0,999]  [0,9999]      [0,99]  [0,999]  [0,9999]     [0,99]  [0,999]  [0,9999]

Random Vectors      0.16    0.23     0.21          29.86   292.88   2882.62      42.03   410.33   4389.39
Untrained CNN       0.97    0.87     0.84          2.64    9.67     44.40        1.41    14.43    69.14
Untrained LSTM      0.70    0.66     0.55          7.61    46.5     210.34       5.11    45.69    510.19
Value Embedding     0.99    0.88     0.68          1.20    11.23    275.50       0.30    15.98    654.33

Pre-trained
  Word2Vec          0.90    0.78     0.71          2.34    18.77    333.47       0.75    21.23    210.07
  GloVe             0.90    0.78     0.72          2.23    13.77    174.21       0.80    16.51    180.31
  ELMo              0.98    0.88     0.76          2.35    13.48    62.20        0.94    15.50    45.71
  BERT              0.95    0.62     0.52          3.21    29.00    431.78       4.56    67.81    454.78

Learned
  Char-CNN          0.97    0.93     0.88          2.50    4.92     11.57        1.19    7.75     15.09
  Char-LSTM         0.98    0.92     0.76          2.55    8.65     18.33        1.21    15.11    25.37

DROP-trained
  NAQANet           0.91    0.81     0.72          2.99    14.19    62.17        1.11    11.33    90.01
  - GloVe           0.88    0.90     0.82          2.87    5.34     35.39        1.45    9.91     60.70

Table 4: Interpolation with integers (e.g., "18"). All pre-trained embedding methods (e.g., GloVe and ELMo) surprisingly capture numeracy. The probing model is trained on a randomly shuffled 80% of the Integer Range and tested on the remaining 20%. The probing model architecture and train/test splits are equivalent across all embeddings. We show the mean over 5 random shuffles (standard deviation in Appendix D).

Interpolation     List Maximum (5-classes)
Float Range       [0.0,99.9]    [0.0,999.9]
Rand. Vectors     0.18 ± 0.03   0.21 ± 0.04
ELMo              0.91 ± 0.03   0.59 ± 0.01
BERT              0.82 ± 0.05   0.51 ± 0.04
Char-CNN          0.87 ± 0.04   0.75 ± 0.03
Char-LSTM         0.81 ± 0.05   0.69 ± 0.02

Table 5: Interpolation with floats (e.g., "18.1") for list maximum. Pre-trained embeddings capture numeracy for float values. The probing model is trained on a randomly shuffled 80% of the Float Range and tested on the remaining 20%. See the text for details on selecting decimal values. We show the mean alongside the standard deviation over 5 different random shuffles.

Interpolation     List Maximum (5-classes)
Integer Range     [-50,50]
Rand. Vectors     0.23 ± 0.12
Word2Vec          0.89 ± 0.02
GloVe             0.89 ± 0.03
ELMo              0.96 ± 0.01
BERT              0.94 ± 0.02
Char-CNN          0.95 ± 0.07
Char-LSTM         0.97 ± 0.02

Table 6: Interpolation with negatives (e.g., "-18") on list maximum. Pre-trained embeddings capture numeracy for negative values.

Augmenting Data to Aid Extrapolation  Of course, in many real-world tasks it is possible to ameliorate these extrapolation failures by augmenting the training data (i.e., turn extrapolation into interpolation). Here, we apply this idea to aid in training NAQANet for DROP. For each superlative and comparative example, we duplicate the example and modify the numbers in its paragraph using the Add and Multiply techniques described in Section 2.5. Table 11 shows that this data augmentation can improve both interpolation and extrapolation, e.g., the accuracy on superlative questions with large numbers can double.

Extrapolation     List Maximum (5-classes)
Test Range        [151,160]   [151,180]   [151,200]

Rand. Vectors     0.17        0.22        0.15
Untrained CNN     0.80        0.47        0.41

Pre-trained
  Word2Vec        0.14        0.16        0.11
  GloVe           0.19        0.17        0.21
  ELMo            0.65        0.57        0.38
  BERT            0.35        0.11        0.14

Learned
  Char-CNN        0.81        0.75        0.73
  Char-LSTM       0.88        0.84        0.82

DROP
  NAQANet         0.31        0.29        0.25
  - GloVe         0.58        0.53        0.48

Table 7: Extrapolation on list maximum. The probing model is trained on the integer range [0,150] and evaluated on integers from the Test Range. The probing model struggles to extrapolate when trained on the pre-trained embeddings.

4 Discussion and Related Work

An open question is how the training process elicits numeracy for word vectors and contextualized embeddings. Understanding this, perhaps by tracing numeracy back to the training data, is a fruitful direction to explore further (cf. influence functions (Koh and Liang, 2017; Brunet et al., 2019)).

More generally, numeracy is one type of emergent knowledge. For instance, embeddings may capture the size of objects (Forbes and Choi, 2017), speed of vehicles, and many other "commonsense" phenomena (Yang et al., 2018). Vendrov et al. (2016) introduce methods to encode the order of such phenomena into embeddings for concepts such as hypernymy; our work and Yang et al. (2018) show that a relative ordering naturally emerges for certain concepts.

In concurrent work, Naik et al. (2019) also explore numeracy in word vectors. Their methodology is based on variants of nearest neighbors and cosine distance; we use neural network probing classifiers which can capture highly non-linear dependencies between embeddings. We also explore more powerful embedding methods such as ELMo, BERT, and learned embedding methods.

Probing Models  Our probes of numeracy parallel work in understanding the linguistic capabilities (literacy) of neural models (Conneau et al., 2018; Liu et al., 2019). LSTMs can remember sentence length, word order, and which words were present in a sentence (Adi et al., 2017). Khandelwal et al. (2018) show how language models leverage context, while Linzen et al. (2016) demonstrate that language models understand subject-verb agreement.

Numerical Value Prediction  Spithourakis and Riedel (2018) improve the ability of language models to predict numbers, i.e., they go beyond categorical predictions over a fixed-size vocabulary. They focus on improving models; our focus is probing embeddings. Kotnis and García-Durán (2019) predict numerical attributes in knowledge bases, e.g., they develop models that try to predict the population of Paris.

Synthetic Numerical Tasks  Similar to our synthetic numerical reasoning tasks, other work considers sorting (Graves et al., 2014), counting (Weiss et al., 2018), or decoding tasks (Trask et al., 2018). They use synthetic tasks as a testbed to prove or design better models, whereas we use synthetic tasks as a probe to understand token embeddings. In developing the Neural Arithmetic Logic Unit, Trask et al. (2018) arrive at similar conclusions regarding extrapolation: neural models have difficulty outputting numerical values outside the training range.

5 Conclusion

How much do NLP models know about numbers? By digging into a surprisingly successful model on a numerical reasoning dataset (DROP), we discover that pre-trained token representations naturally encode numeracy.

We analyze the limits of this numeracy, finding that CNNs are a particularly good prior (and likely the cause of ELMo's superior numeracy compared to BERT) and that it is difficult for neural models to extrapolate beyond the values seen during training. There are still many fruitful areas for future research, including discovering why numeracy naturally emerges in embeddings, and what other properties are similarly emergent.

Acknowledgements

We thank Mark Neumann, Suchin Gururangan, Pranav Goel, Shi Feng, Nikhil Kandpal, Dheeru Dua, the members of AllenNLP and UCI NLP, and the reviewers for their valuable feedback.

References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In ICLR.

Marc-Etienne Brunet, Colleen Alkalay-Houlihan, Ashton Anderson, and Richard Zemel. 2019. Understanding the origins of bias in word embeddings. In ICML.

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. In ACL.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In NAACL.

Maxwell Forbes and Yejin Choi. 2017. Verb physics: Relative physical knowledge of actions and objects. In ACL.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. arXiv preprint arXiv:1410.5401.

Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp nearby, fuzzy far away: How neural language models use context. In ACL.

Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. In ICML.

Bhushan Kotnis and Alberto García-Durán. 2019. Learning numerical attributes in knowledge bases. In AKBC.

Jayant Krishnamurthy, Pradeep Dasigi, and Matt Gardner. 2017. Neural semantic parsing with type constraints for semi-structured tables. In EMNLP.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. In TACL.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In NAACL.

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In LREC.

Aakanksha Naik, Abhilasha Ravichander, Carolyn Rose, and Eduard Hovy. 2019. Exploring numeracy in word embeddings. In ACL.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.

Abhilasha Ravichander, Aakanksha Naik, Carolyn Rose, and Eduard Hovy. 2019. EQUATE: A benchmark evaluation framework for quantitative reasoning in natural language inference. arXiv preprint arXiv:1901.03735.

Andrew M. Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, and Andrew Y. Ng. 2011. On random weights and unsupervised feature learning. In ICML.

David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. 2019. Analysing mathematical reasoning abilities of neural models. In ICLR.

Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In ICLR.

Georgios Spithourakis and Sebastian Riedel. 2018. Numeracy for language models: Evaluating and improving their ability to predict numbers. In ACL.

Andrew Trask, Felix Hill, Scott Reed, Jack W. Rae, Chris Dyer, and Phil Blunsom. 2018. Neural arithmetic logic units. In NeurIPS.

Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. 2016. Order-embeddings of images and language. In ICLR.

Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2016. Order matters: Sequence to sequence for sets. In ICLR.

Gail Weiss, Yoav Goldberg, and Eran Yahav. 2018. On the practical computational power of finite precision RNNs for language recognition. In ACL.

Yiben Yang, Larry Birnbaum, Ji-Ping Wang, and Doug Downey. 2018. Extracting commonsense properties from embeddings with limited human guidance. In ACL.

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. In ICLR.
