The Box is in the Pen: Evaluating Commonsense Reasoning in Neural Machine Translation

Jie He†∗, Tao Wang‡∗, Deyi Xiong†, and Qun Liu§
† College of Intelligence and Computing, Tianjin University, Tianjin, China
‡ School of Computer Science and Technology, Soochow University, Suzhou, China
§ Huawei Noah's Ark Lab, Hong Kong, China
[email protected], [email protected], [email protected], [email protected]
∗ Equal Contributions.

Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3662–3672, November 16–20, 2020. © 2020 Association for Computational Linguistics.

Abstract

Does neural machine translation yield translations that are congenial with common sense? In this paper, we present a test suite to evaluate the commonsense reasoning capability of neural machine translation. The test suite consists of three test sets, covering lexical and contextless/contextual syntactic ambiguity that requires commonsense knowledge to resolve. We manually create 1,200 triples, each of which contains a source sentence and two contrastive translations, involving 7 different common sense types. Language models pretrained on large-scale corpora, such as BERT and GPT-2, achieve a commonsense reasoning accuracy of lower than 72% on target translations of this test suite. We conduct extensive experiments on the test suite to evaluate commonsense reasoning in neural machine translation and investigate factors that affect this capability. Our experiments and analyses demonstrate that neural machine translation performs poorly on commonsense reasoning of the three ambiguity types in terms of both reasoning accuracy (≤ 60.1%) and reasoning consistency (≤ 31%). We will release our test suite as a machine translation commonsense reasoning testbed to promote future work in this direction.

1 Introduction

Sixty years ago, the pioneering machine translation researcher and linguist Bar-Hillel published his well-known argument on the non-feasibility of general-purpose fully automatic high-quality machine translation (FAHQT), due to the inevitable requirement of world knowledge to help machine translation infer correct translations for ambiguous words or linguistic structures (Bar-Hillel, 1960a). The example that Bar-Hillel uses as evidence for the need of commonsense knowledge in machine translation is "The box is in the pen", where machine translation is expected to perform reasoning on the relative sizes of "box" and "pen". Bar-Hillel also doubts that a machine, even equipped with extra-linguistic knowledge, would be able to reason with such knowledge spontaneously as human translators do (Bar-Hillel, 1960a; Macklovitch, 1995).

Modern natural language processing (NLP) has made tremendous progress, not only in building abundant resources to develop linguistic insights, but also in plenty of methodological practices. On the one hand, machine translation has been substantially advanced with large-scale parallel data and statistical models. Recent results even suggest that the quality of machine-generated translations is approaching that of professional human translators (Wu et al., 2016; Hassan et al., 2018). On the other hand, a wide variety of efforts have been conducted to examine the commonsense reasoning capability of neural models in natural language understanding, to establish commonsense reasoning challenges, or to enhance neural models in commonsense reasoning (Zhang et al., 2018; Talmor et al., 2018; Huang et al., 2019; Sap et al., 2019b).

Comparing Bar-Hillel's doubts with this recent progress on machine translation and commonsense reasoning, it is natural for us to ask: have we solved the machine translation impasse related to commonsense reasoning? In particular, are current neural machine translation models able to learn common sense? And if so, how much do they learn? Does neural machine translation acquire sufficient commonsense knowledge, and a strong enough ability in commonsense reasoning, to generate human-level high-quality translations? A methodological discussion on the feasibility of FAHQT given the recent progress is far beyond the scope of this work. Instead, we focus on empirically analyzing the capability of state-of-the-art neural machine translation models in using extra-linguistic commonsense knowledge to resolve ambiguity at different linguistic levels and select correct translations after disambiguation.

(1) 这个 人 戴 的 表 走 了 3 分钟 。
    The watch worn by this person went/walked for 3 minutes.
(2) 吃 了 游客 的 鳄鱼 。
    The crocodile who ate the tourist / Ate the tourist's crocodile.
(3) 当 地震 袭击 中国 时 , 援助 的 是 中国 。
    When the earthquake hit China, China received aid / China provided aid.

Figure 1: Examples of the lexical ambiguity (1), contextless syntactic ambiguity (2) and contextual syntactic ambiguity (3). In each pair, the translation before the slash (bold in the original figure) is correct, while the one after the slash (underlined in the original) is incorrect.

In order to achieve this goal, we manually build a machine translation commonsense reasoning test suite on Chinese-to-English translation with three types of commonsense-related ambiguities: lexical ambiguity, and contextless and contextual syntactic ambiguity (see Section 3.1 for more details). Examples are shown in Figure 1. With this test suite, we thoroughly evaluate the commonsense reasoning ability of state-of-the-art neural machine translation models, e.g., LSTM- and Transformer-based NMT (Bahdanau et al., 2015; Vaswani et al., 2017). We also analyze the commonsense reasoning capability with respect to commonsense knowledge types, sentence length, reasoning consistency and the size of training data.

To the best of our knowledge, this is the first work to understand and measure the commonsense reasoning capability of neural machine translation. The contributions of this paper can be summarized as follows:

• We build a test suite¹ to examine the ability of neural machine translation in commonsense reasoning, which provides a benchmark testbed for tracking progress in this direction.

• Based on our experiments and analyses on evaluating commonsense reasoning in NMT, we find that: 1) commonsense reasoning related to lexical ambiguity and contextual syntactic ambiguity is more difficult than contextless syntactic ambiguity; 2) although the commonsense reasoning accuracy is higher than 50%, the reasoning consistency rate is far lower than 50% (random guess).

¹ The built commonsense test suite will be publicly available at https://github.com/tjunlp-lab/CommonMT.

2 Related work

We briefly review recent efforts related to commonsense reasoning in NLP. We refer readers to Storks et al. (2019) for a thorough survey of this area.

Commonsense Datasets

According to Gunning (2018), commonsense knowledge normally consists of a general theory of how the physical world works and a basic understanding of human motives and behaviors. In recent years, a wide variety of datasets on these two kinds of commonsense knowledge have been proposed. Sap et al. (2019b) introduce Social IQA, containing 38k multiple-choice questions for probing commonsense reasoning about emotional and social interactions in people's daily life. Similarly, Event2mind and Atomic (Rashkin et al., 2018; Sap et al., 2019a) focus on inferred knowledge in the form of if-then relations to reason about people's daily-life behavior. For datasets on physical common sense, PIQA (Bisk et al., 2020), which targets commonsense phenomena in the physical world, contains 21K QA pairs. SWAG and HellaSwag (Zellers et al., 2018, 2019) are datasets on commonsense NLI, where materials from video subtitles and wikiHow articles are used to construct cloze tests. Bhagavatula et al. (2019) propose a dataset for abductive reasoning on events. The well-known Winograd Schema Challenge (WSC) test set (Levesque et al., 2012; Sakaguchi et al., 2020) focuses on commonsense problems in the form of coreference resolution. Different from these monolingual datasets, we provide a bilingual commonsense test suite for machine translation.

Commonsense Reasoning in NLP

In addition to commonsense datasets, commonsense knowledge has recently been explored in different NLP tasks. To name a few, Trinh and Le (2018), He et al. (2019) and Klein and Nabi (2019) use language models trained on huge text corpora to perform inference on the WSC dataset. Ding et al. (2019) use commonsense knowledge from Atomic (Sap et al., 2019a) and Event2mind (Rashkin et al., 2018) in downstream tasks such as script event prediction. Bi et al. (2019) exploit external commonsense knowledge from ConceptNet (Speer et al., 2016) in machine reading comprehension.

Commonsense Reasoning Evaluation

With pre-trained language models like BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019) being widely used in various NLP tasks, studies have been performed to examine the commonsense reasoning capability of pre-trained neural language models. Wang et al. (2019) and Zhou et al. (2020) propose to measure the success rate of pre-trained language models in commonsense inference by calculating LM probabilities over two sentences that differ only in commonsense concepts. Feldman et al. (2019) further explore unsupervised methods for generating commonsense knowledge using the world knowledge of pre-trained language models. Our commonsense reasoning evaluation resonates with these evaluation efforts.

… or sentence reasonability judgment (i.e., "He put a turkey into the fridge" vs. "He put an elephant into the fridge") (Wang et al., 2019), where commonsense reasoning normally happens in one language, commonsense reasoning in NMT can be done either in the encoding of the source language (i.e., encoding reasonable source representations) or in the decoding of the target language (i.e., producing reasonable target outputs). As it is difficult to detect whether reasonable senses are identified and encoded in the encoder, we check target outputs from the decoder to test the commonsense reasoning capability of NMT. This is the first rule that we follow in designing the test suite.

In the second rule for building the test suite, we manually create source sentences with ambiguity
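The contrastive protocol described above, in which a model scores a correct and an incorrect target translation and is credited when it prefers the correct one, can be sketched in a few lines. The unigram scorer below is a hypothetical stand-in, not the paper's method: in practice the score would come from GPT-2/BERT sentence probabilities or from force-decoding each candidate with the NMT model.

```python
import math
from collections import Counter


def make_unigram_scorer(corpus):
    """Build a toy unigram LM scorer over a list of whitespace-tokenized
    sentences. Illustrative only; a real evaluation would use a pretrained
    LM or an NMT decoder to score candidates."""
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    vocab = len(counts)

    def log_prob(sentence):
        # Add-one smoothing keeps unseen tokens from zeroing out a score.
        return sum(math.log((counts[tok] + 1) / (total + vocab))
                   for tok in sentence.split())

    return log_prob


def pick_translation(score, candidate_a, candidate_b):
    """Return whichever candidate translation the scorer prefers."""
    return candidate_a if score(candidate_a) >= score(candidate_b) else candidate_b


score = make_unigram_scorer([
    "the watch went fast",
    "the watch went for three minutes",
    "he walked to school",
])
best = pick_translation(score,
                        "the watch went for three minutes",   # correct
                        "the watch walked for three minutes")  # contrastive
print(best)  # prefers "went", since it is more frequent in the toy corpus
```

A triple from the test suite would count as correctly resolved when the preferred candidate is the reference translation.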
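The two headline numbers reported in the abstract, reasoning accuracy and reasoning consistency, can be computed from such contrastive decisions. The sketch below assumes one plausible reading of the metrics (accuracy counted per triple; consistency requiring both members of a pair of contrastive source sentences to be resolved correctly); the paper's precise definitions appear in its experiment section, so the helper names here are illustrative.

```python
def reasoning_accuracy(triple_correct):
    """Share of test triples where the model preferred the correct
    translation over the contrastive one."""
    return sum(triple_correct) / len(triple_correct)


def reasoning_consistency(pair_correct):
    """Share of contrastive source pairs where BOTH members were resolved
    correctly. A model that always prefers the same reading of an ambiguity
    can reach 50% accuracy while being consistent on no pair at all, which
    is why consistency is the stricter measure."""
    return sum(a and b for a, b in pair_correct) / len(pair_correct)


# Illustrative outcomes for four contrastive pairs (eight triples).
pairs = [(True, True), (True, False), (False, True), (True, True)]
triples = [hit for pair in pairs for hit in pair]
print(reasoning_accuracy(triples))   # 6 of 8 triples correct -> 0.75
print(reasoning_consistency(pairs))  # 2 of 4 pairs fully correct -> 0.5
```

This asymmetry mirrors the paper's finding that accuracy can sit above the 50% chance level while consistency stays far below it.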