The Box is in the Pen: Evaluating Commonsense Reasoning in Neural Machine Translation
Jie He†∗, Tao Wang‡∗, Deyi Xiong†, and Qun Liu§
† College of Intelligence and Computing, Tianjin University, Tianjin, China
‡ School of Computer Science and Technology, Soochow University, Suzhou, China
§ Huawei Noah's Ark Lab, Hong Kong, China
[email protected], [email protected], [email protected], [email protected]
Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3662–3672, November 16–20, 2020. © 2020 Association for Computational Linguistics

Abstract

Does neural machine translation yield translations that are congenial with common sense? In this paper, we present a test suite to evaluate the commonsense reasoning capability of neural machine translation. The test suite consists of three test sets, covering lexical and contextless/contextual syntactic ambiguity that requires commonsense knowledge to resolve. We manually create 1,200 triples, each of which contains a source sentence and two contrastive translations, involving 7 different common sense types. Language models pre-trained on large-scale corpora, such as BERT and GPT-2, achieve a commonsense reasoning accuracy of lower than 72% on target translations of this test suite. We conduct extensive experiments on the test suite to evaluate commonsense reasoning in neural machine translation and investigate factors that have an impact on this capability. Our experiments and analyses demonstrate that neural machine translation performs poorly on commonsense reasoning of the three ambiguity types in terms of both reasoning accuracy (≤ 60.1%) and reasoning consistency (≤ 31%). We will release our test suite as a machine translation commonsense reasoning testbed to promote future work in this direction.

1 Introduction

Sixty years ago, the pioneering machine translation researcher and linguist Bar-Hillel published his well-known argument on the non-feasibility of general-purpose fully automatic high-quality machine translation (FAHQT), due to the inevitable requirement of world knowledge to help machine translation infer correct translations for ambiguous words or linguistic structures (Bar-Hillel, 1960a). The example that Bar-Hillel uses as evidence for the need of commonsense knowledge in machine translation is "The box is in the pen", where machine translation is expected to perform reasoning on the relative sizes of "box" and "pen". Bar-Hillel also doubts that a machine, even equipped with extra-linguistic knowledge, would be able to reason with such knowledge spontaneously as human translators do (Bar-Hillel, 1960a; Macklovitch, 1995).

Modern natural language processing (NLP) has made tremendous progress, not only in building abundant resources to develop linguistic insights, but also in plenty of methodological practices. On the one hand, machine translation has been substantially advanced with large-scale parallel data and statistical models. Recent results even suggest that the quality of machine-generated translations is approaching that of professional human translators (Wu et al., 2016; Hassan et al., 2018). On the other hand, a wide variety of efforts have been conducted to either examine the commonsense reasoning capability of neural models in natural language understanding, establish commonsense reasoning challenges, or enhance neural models in commonsense reasoning (Zhang et al., 2018; Talmor et al., 2018; Huang et al., 2019; Sap et al., 2019b).

Comparing Bar-Hillel's doubts with the recent progress on machine translation and commonsense reasoning, it is natural to ask: have we solved the machine translation impasse related to commonsense reasoning? In particular, are current neural machine translation models able to learn common sense? And if so, how much do they learn? Does neural machine translation acquire sufficient commonsense knowledge and a strong enough commonsense reasoning ability to generate human-level high-quality translations? Methodological discussion on the feasibility of FAHQT given the recent progress is far beyond the scope of this work. Instead, we focus on empirically analyzing the capability of state-of-the-art neural machine translation models in using extra-linguistic commonsense knowledge to resolve ambiguity at different linguistic levels and select correct translations after disambiguation.

(1) 这个 人 戴 的 表 走 了 3 分钟 。
The watch worn by this person went/walked for 3 minutes.
(2) 吃 了 游客 的 鳄鱼 。
The crocodile that ate the tourist/Ate the tourist's crocodile.
(3) 当 地震 袭击 中国 时 , 援助 的 是 中国 。
When the earthquake hit China, China received aid/China provided aid.
Figure 1: Examples of lexical ambiguity (1), contextless syntactic ambiguity (2) and contextual syntactic ambiguity (3). In each example, the English translation before the slash is correct while the one after the slash is incorrect.

In order to achieve this goal, we manually build a machine translation commonsense reasoning test suite on Chinese-to-English translation with three types of commonsense-related ambiguities: lexical ambiguity, and contextless and contextual syntactic ambiguity (see Section 3.1 for more details). Examples are shown in Figure 1. With this test suite, we thoroughly evaluate the commonsense reasoning ability of state-of-the-art neural machine translation models, e.g., LSTM- and Transformer-based NMT (Bahdanau et al., 2015; Vaswani et al., 2017). We also conduct analyses on the commonsense reasoning capability according to commonsense knowledge types, sentence length, reasoning consistency and the size of training data.

To the best of our knowledge, this is the first work to understand and measure the commonsense reasoning capability of neural machine translation. The contributions of this paper can be summarized as follows:

• We build a test suite¹ to examine the ability of neural machine translation in commonsense reasoning, which provides a benchmark testbed for tracking progress in this direction.
• Based on our experiments and analyses on evaluating commonsense reasoning in NMT, we find that: 1) commonsense reasoning related to lexical ambiguity and contextual syntactic ambiguity is more difficult than contextless syntactic ambiguity; 2) although the commonsense reasoning accuracy is higher than 50%, the reasoning consistency rate is far lower than 50% (random guess).

∗ Equal contributions.
¹ The built commonsense test suite will be publicly available at https://github.com/tjunlp-lab/CommonMT.

2 Related Work

We briefly review recent efforts related to commonsense reasoning in NLP. We refer readers to Storks et al. (2019) for a thorough survey of this area.

Commonsense Datasets
According to Gunning (2018), commonsense knowledge normally consists of a general theory of how the physical world works and a basic understanding of human motives and behaviors. In recent years, a wide variety of datasets on these two kinds of commonsense knowledge have been proposed. Sap et al. (2019b) introduce Social IQA, containing 38k multiple-choice questions for probing commonsense reasoning about emotional and social interactions in people's daily life. Similarly, Event2mind and Atomic (Rashkin et al., 2018; Sap et al., 2019a) focus on inferred knowledge in if-then form to reason about people's daily behavior. For datasets on physical common sense, PIQA (Bisk et al., 2020), on commonsense phenomena in the physical world, contains 21K QA pairs. SWAG and HellaSwag (Zellers et al., 2018, 2019) are datasets on commonsense NLI, where materials from video subtitles and wikiHow articles are used to construct cloze tests. Bhagavatula et al. (2019) propose a dataset for abductive reasoning on events. The well-known Winograd Schema Challenge (WSC) test set (Levesque et al., 2012; Sakaguchi et al., 2020) focuses on commonsense problems in the form of coreference resolution. Different from these monolingual datasets, we provide a bilingual commonsense test suite for machine translation.

Commonsense Reasoning in NLP
In addition to commonsense datasets, commonsense knowledge has also been explored recently in different NLP tasks. To name a few, Trinh and Le (2018), He et al. (2019) and Klein and Nabi (2019) use language models trained on huge text corpora to perform inference on the WSC dataset. Ding et al. (2019) use commonsense knowledge in Atomic (Sap et al., 2019a) and Event2mind (Rashkin et al., 2018) for downstream tasks such as script event prediction. Bi et al. (2019) exploit external commonsense knowledge from ConceptNet (Speer et al., 2016) in machine reading comprehension.
Commonsense Reasoning Evaluation
With pre-trained language models such as BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019) being widely used in various NLP tasks, studies have been performed to examine the commonsense reasoning capability of pre-trained neural language models. Wang et al. (2019) and Zhou et al. (2020) propose to measure the success rate of pre-trained language models in commonsense inference by calculating LM probabilities, where the two sentences used to test commonsense inference differ only in commonsense concepts. Feldman et al. (2019) further explore unsupervised methods to generate commonsense knowledge using the world knowledge of pre-trained language models. Our commonsense reasoning evaluation resonates with these evaluation efforts.

Commonsense Knowledge and Reasoning in Machine Translation
Commonsense knowledge has long been acknowledged as an indispensable knowledge source for disambiguation in machine translation (Bar-Hillel, 1960b; Davis and Marcus, 2015). Knowledge-based machine translation (KBMT), one of the popular machine translation paradigms in the 1980s, lays much stress on extra-linguistic world knowledge (Nirenburg, 1989). A large ontology, constructed either manually or automatically to provide world knowledge, is one of the essential components of KBMT (Knight and Luk, 1994). As data-driven machine translation, such as statistical machine translation (SMT) and neural machine translation, became the de facto standard in machine translation, world knowledge has been less explicitly explored. Only a few studies have indirectly and partially exploited world knowledge in SMT or NMT, by incorporating linked open data resources such as DBpedia and BabelNet into SMT with modest improvements (Du et al., 2016; Srivastava et al., 2017; Moussallem et al., 2018).

3 Commonsense Reasoning Test Suite for Machine Translation

In this section, we discuss the design and construction of the test suite, including the rules and steps for building it.

3.1 Test Suite Design

Different from commonsense reasoning in the Winograd Schema Challenge (Levesque et al., 2012) or sentence reasonability judgment (i.e., "He put a turkey into the fridge" vs. "He put an elephant into the fridge") (Wang et al., 2019), where commonsense reasoning normally happens in one language, commonsense reasoning in NMT can be done either in the encoding of the source language (i.e., encoding reasonable source representations) or in the decoding of the target language (i.e., producing reasonable target outputs). As it is difficult to detect whether reasonable senses are identified and encoded in the encoder, we check target outputs from the decoder to test the commonsense reasoning capability of NMT. This is the first rule that we follow in designing the test suite.

The second rule for building the test suite is that we manually create source sentences with ambiguity that requires commonsense reasoning. Inspired by Schwartz and Gomez (2009) and Ovchinnikova (2012), we ground the commonsense reasoning test on two types of ambiguity that are common in machine translation: lexical and syntactic ambiguity (LA and SA). An example of LA is "batter" in "she put the batter in the refrigerator" (food material vs. baseball player). SA relates to structures, for instance, "I saw a man swimming on the bridge" (I was standing on the bridge vs. the man was swimming on the bridge). We further divide SA into contextless SA (e.g., Example (2) in Figure 1) and contextual SA (e.g., Example (3) in Figure 1). The former can be correctly interpreted by resorting to commonsense knowledge, while the latter cannot be interpreted uniquely if no further context is given.

The third rule that we conform to is to 1) create two contrastive source sentences for each lexical or syntactic ambiguity point, where each source sentence corresponds to one reasonable interpretation of the ambiguity point, and 2) provide two contrastive translations for each created source sentence. This is similar to other linguistic evaluation by contrastive examples in the MT literature (Avramidis et al., 2019; Bawden et al., 2018; Müller et al., 2018; Sennrich, 2017). The two contrastive translations have similar wordings: one is correct, and the other is incorrect in that it translates the ambiguous part into the translation that corresponds to the contrastive source sentence. This translation makes sense in the contrastive sentence but not in the sentence in question. Examples of contrastive source sentences and contrastive translations for each source sentence are shown in Figures 2, 3 and 4.
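The contrastive evaluation implied by these rules can be sketched as follows: each ambiguity point yields a pair of source sentences, each with a correct and a contrastive translation; a model counts as correct on a sentence if it scores the correct translation higher, and as consistent on a pair only if it is correct on both sentences of the pair. The sketch below is a minimal illustration, not the authors' released code; `score`, the placeholder data, and all names are hypothetical stand-ins for an NMT model's forced-decoding log-probability.

```python
# Minimal sketch of contrastive evaluation for a test suite of this shape.
# Each item is one ambiguity point: two contrastive source sentences,
# each paired with a correct ("ref") and an incorrect ("contrast") translation.
suite = [
    {"pair": [
        {"src": "src_1a", "ref": "ref_1a", "contrast": "con_1a"},
        {"src": "src_1b", "ref": "ref_1b", "contrast": "con_1b"},
    ]},
]

def evaluate(suite, score):
    """score(src, tgt) -> model score of tgt given src (e.g. forced-decoding
    log-probability under an NMT model); a stand-in in this sketch."""
    n_sent = n_correct = n_pair = n_consistent = 0
    for item in suite:
        correct_flags = []
        for ex in item["pair"]:
            ok = score(ex["src"], ex["ref"]) > score(ex["src"], ex["contrast"])
            correct_flags.append(ok)
            n_correct += ok
            n_sent += 1
        n_pair += 1
        # A pair is reasoning-consistent only if both sentences are correct.
        n_consistent += all(correct_flags)
    return n_correct / n_sent, n_consistent / n_pair

# Toy scorer: pretend the model prefers the reference for sentence "a" only,
# so sentence-level accuracy is 0.5 while pair-level consistency is 0.
def toy(src, tgt):
    if src.endswith("a") and tgt.startswith("ref"):
        return 1.0
    if src.endswith("b") and tgt.startswith("con"):
        return 1.0
    return 0.0

acc, cons = evaluate(suite, toy)
print(acc, cons)  # 0.5 0.0
```

This also illustrates why consistency can be far below accuracy: a model that guesses at chance on each sentence still reaches about 50% accuracy, but must be right on both sentences of a pair to be counted consistent.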
Figures 2–3 (excerpt, truncated by extraction): contrastive source sentences (z) with contrastive translations (e).
z1 主力 部队 已经 对 敌人的 建筑 展开 了 攻关 。
e1 The main force has already launched an attack on the enemy's building. / The main force has already launched a research on the enemy's building.
z1 维修 桌子 的 桌脚 。
e1 Repair the legs of the table. / The leg that repairs the table.
z2 维修 桌子 的 锤子 。
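For concreteness, one contrastive entry of the kind shown in the figures might be serialized as below. This is purely our own illustration of the triple structure (source sentence plus two contrastive translations); the field names are hypothetical and not the released data format.

```python
import json

# Hypothetical serialization of one contrastive entry from the test suite;
# field names are illustrative only, not the released format.
entry = {
    "type": "contextless_syntactic_ambiguity",
    "source": "维修 桌子 的 桌脚 。",
    "correct": "Repair the legs of the table.",
    "contrastive": "The leg that repairs the table.",
}

# Both candidate translations accompany the same source and must differ.
assert entry["correct"] != entry["contrastive"]
print(json.dumps(entry, ensure_ascii=False, indent=2))
```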