Localizing Open-Ontology QA Semantic Parsers in a Day Using Machine Translation

Mehrad Moradshahi, Giovanni Campagna, Sina J. Semnani, Silei Xu, Monica S. Lam
Computer Science Department, Stanford University, Stanford, CA, USA
{mehrad,gcampagn,sinaj,silei,lam}@cs.stanford.edu

Abstract

We propose Semantic Parser Localizer (SPL), a toolkit that leverages Neural Machine Translation (NMT) systems to localize a semantic parser for a new language. Our methodology is to (1) generate training data automatically in the target language by augmenting machine-translated datasets with local entities scraped from public websites, (2) add a few-shot boost of human-translated sentences and train a novel XLMR-LSTM semantic parser, and (3) test the model on natural utterances curated using human translators.

We assess the effectiveness of our approach by extending the current capabilities of Schema2QA, a system for English Question Answering (QA) on the open web, to 10 new languages for the restaurants and hotels domains. Our models achieve an overall test accuracy ranging between 61% and 69% for the hotels domain and between 64% and 78% for the restaurants domain, which compares favorably to the 69% and 80% obtained for the English parser trained on gold English data and a few examples from the validation set. We show our approach outperforms the previous state-of-the-art methodology by more than 30% for hotels and 40% for restaurants with localized ontologies for the subset of languages tested.

Our methodology enables any software developer to add a new language capability to a QA system for a new domain, leveraging machine translation, in less than 24 hours. Our code is released open-source.¹

1 Introduction

Localization is an important step in software or website development for reaching an international audience in their native language. Localization is usually done through professional services that can translate text strings quickly into a wide variety of languages. As conversational agents are increasingly used as the new interface, how do we localize them to other languages efficiently?

The focus of this paper is on question answering systems that use semantic parsing, where natural language is translated into a formal, executable representation (such as SQL). Semantic parsing typically requires a large amount of training data, which must be annotated by an expert familiar with both the natural language of interest and the formal language. The cost of acquiring such a dataset is prohibitively expensive.

For English, previous work has shown it is possible to bootstrap a semantic parser without a massive amount of manual annotation, by using a large, hand-curated grammar of natural language (Wang et al., 2015; Xu et al., 2020a). This approach is expensive to replicate for all languages, due to the effort and expertise required to build such a grammar. Hence, we investigate the question: Can we leverage previous work on English semantic parsers for other languages by using machine translation? And in particular, can we do so without requiring experts in each language?

The challenge is that a semantic parser localized to a new target language must understand questions using an ontology in the target language. For example, whereas a restaurant guide in New York may answer questions about restaurants near Times Square, the one in Italy should answer questions about restaurants near the "Colosseo" or "Fontana di Trevi" in Rome, in Italian. In addition, the parser must be able to generalize beyond a fixed ontology, where sentences refer to entities in the target language that are unseen during training.

We propose a methodology that leverages machine translation to localize an English semantic parser to a new target language, where the only labor required is manual translation of a few hundred annotated sentences to the target language.

¹ https://github.com/stanford-oval/SPL

Our approach, shown in Fig. 1, is to convert the English training data into training data in the target language, with all the parameter values in the questions and the logical forms substituted with local entities. Such data trains the parsers to answer questions about local entities. A small sample of the English questions from the evaluation set is translated by native speakers with no technical expertise, as a few-shot boost to the automatic training set. The test data is also manually translated to assess how our model will perform on real examples. We show that this approach can boost the accuracy on the English dataset as well, from 64.6% to 71.5% for hotels, and from 68.9% to 81.6% for restaurants.

We apply our approach on the Restaurants and Hotels datasets introduced by Xu et al. (2020a), which contain complex queries on data scraped from major websites. We demonstrate the efficiency of our methodology by creating neural semantic parsers for 10 languages: Arabic, German, Spanish, Persian, Finnish, Italian, Japanese, Polish, Turkish, and Chinese. The models can answer complex questions about hotels and restaurants in the respective languages. An example of a query is shown for each language and domain in Table 1.

[Figure 1: Data generation pipeline used to produce train and validation splits in a new language such as Italian. Given an input sentence in English and its annotation in the formal ThingTalk query language (Xu et al., 2020a), SPL generates multiple examples in the target language with localized entities. The pipeline stages are: pre-processing (detokenize punctuation marks, wrap parameters in quotation marks), neural machine translation, alignment using cross-attention weights, post-processing (normalize quotation marks, split mixed-language tokens), and parameter substitution from the target-language ontology. For example, the English query "i am looking for a ' burger ' place near ' woodland pond '" yields Italian sentences such as "sto cercando un posto da pizza vicino a lago di como", each annotated with the corresponding localized logical form.]

Our contributions include the following:

• Semantic Parser Localizer (SPL), a new methodology to localize a semantic parser for any language for which a high-quality neural machine translation (NMT) system is available. To handle an open ontology with entities in the target language, we propose machine translation with alignment, which shows the alignment of the translated language to the input language. This enables the substitution of English entities in the translated sentences with localized entities. Only a couple hundred sentences need to be translated manually; no manual annotation of sentences is necessary.

• An improved neural semantic parsing model, based on BERT-LSTM (Xu et al., 2020a) but using the XLM-R encoder. Its applicability extends beyond the multilingual semantic parsing task, as it can be deployed for any NLP task that can be framed as sequence-to-sequence. Pretrained models are available for download.

• Experimental results of SPL for answering questions on hotels and restaurants in 10 different languages. On average, across the 10 languages, SPL achieves a logical form accuracy of 66.7% for hotels and 71.5% for restaurants, which is comparable to the English parser trained with English synthetic and paraphrased data. Our method outperforms the previous state of the art and two other strong baselines by between 30% and 40%, depending on the language and domain. This result confirms the importance of training with local entities.

• To the best of our knowledge, ours is the first multilingual semantic parsing dataset with localized entities. Our dataset covers 10 linguistically different languages with a wide range of syntax. We hope that releasing our dataset will trigger further work in multilingual semantic parsing.

• SPL has been incorporated into the parser generation toolkit Schema2QA (Xu et al., 2020a), which generates QA semantic parsers that can answer complex questions over a knowledge base automatically from its schema. With the addition of SPL, developers can easily create multilingual QA agents for new domains cost-effectively.

Hotels:
| Language | Example |
| English  | I want a hotel near times square that has at least 1000 reviews. |
| Arabic   | [Arabic-script example; not recoverable from the PDF extraction] |
| German   | Ich möchte ein hotel in der nähe von marienplatz, das mindestens 1000 bewertungen hat. |
| Spanish  | Busco un hotel cerca de puerto banús que tenga al menos 1000 comentarios. |
| Farsi    | [Perso-Arabic-script example; not recoverable from the PDF extraction] |
| Finnish  | Haluan paikan helsingin tuomiokirkko läheltä hotellin, jolla on vähintään 1000 arvostelua. |
| Italian  | Voglio un hotel nei pressi di colosseo che abbia almeno 1000 recensioni. |
| Japanese | 東京スカイツリー周辺でに1000件以上のレビューがあるホテルを見せて。 |
| Polish   | Potrzebuję hotelu w pobliżu zamek w malborku, który ma co najmniej 1000 ocen. |
| Turkish  | Kapalıçarşı yakınlarında en az 1000 yoruma sahip bir otel istiyorum. |
| Chinese  | 我想在天安门广场附近找一家有至少1000条评论的酒店。 |

Restaurants:
| Language | Example |
| English  | find me a restaurant that serves burgers and is open at 14:30. |
| Arabic   | [Arabic-script example; not recoverable from the PDF extraction] |
| German   | Finden sie bitte ein restaurant mit maultaschen essen, das um 14:30 öffnet. |
| Spanish  | Busque un restaurante que sirva comida paella valenciana y abra a las 14:30. |
| Farsi    | [Perso-Arabic-script example; not recoverable from the PDF extraction] |
| Finnish  | Etsi minulle ravintola joka tarjoilee karjalanpiirakka ruokaa ja joka aukeaa kello 14:30 mennessä. |
| Italian  | Trovami un ristorante che serve cibo lasagna e apre alle 14:30. |
| Japanese | 寿司フードを提供し、14:30までに開店するレストランを見つけてください。 |
| Polish   | Znajdź restaurację, w której podaje się kotlet jedzenie i którą otwierają o 14:30. |
| Turkish  | Bana köfte yemekleri sunan ve 14:30 zamanına kadar açık olan bir restoran bul. |
| Chinese  | 帮我找一家在14:30营业并供应北京烤鸭菜的餐厅。 |

Table 1: Examples of queries that our multilingual QA system can answer, in English and 10 other languages.

2 Related Work

Multi-lingual benchmarks. Previous work has shown it is possible to ask non-experts to annotate large datasets for applications such as natural language inference (Conneau et al., 2018) and machine reading (Clark et al., 2020), which has led to large cross-lingual benchmarks (Hu et al., 2020). Their approach is not suitable for semantic parsing, because it requires experts that know both the formal language and the natural language.

Semantic Parsing. Semantic parsing is the task of converting natural language utterances into a formal representation of their meaning. Previous work on semantic parsing is abundant, with work dating back to the 70s (Woods, 1977; Zelle and Mooney, 1996; Kate et al., 2005; Berant et al., 2013). State-of-the-art methods, based on sequence-to-sequence neural networks, require large amounts of manually annotated data (Dong and Lapata, 2016; Jia and Liang, 2016). Various methods have been proposed to eliminate manually annotated data for new domains, using synthesis (Wang et al., 2015; Shah et al., 2018; Campagna et al., 2019; Xu et al., 2020a,b), transfer learning (Zhong et al., 2017; Herzig and Berant, 2018; Yu et al., 2018; Moradshahi et al., 2019), or a combination of both (Rastogi et al., 2019; Campagna et al., 2020). All these works focus mainly on the English language, and have not been applied to other languages.

Cross-lingual Transfer of Semantic Parsing. Duong et al. (2017) investigate cross-lingual transferability of knowledge from a source language to the target language by employing cross-lingual word embeddings. They evaluate their approach on the English and German splits of the NLmaps dataset (Haas and Riezler, 2016) and on a code-switching test set that combines English and German words in the same utterance. However, they found that joint training on English and German training data achieves competitive results compared to training multiple encoders and predicting logical forms using a shared decoder. This calls for better training strategies and better use of the knowledge the model can potentially learn from the dataset.

The closest work to ours is Bootstrap (Sherborne et al., 2020), which explores using public MT systems to generate training data for other languages. They try different training strategies and find that using a shared encoder and training on target language sentences and unmodified logical forms with English entities yields the best result. Their evaluation is done on the ATIS (Dahl et al., 1994) and Overnight (Wang et al., 2015) datasets, in German and Chinese.

These two benchmarks have a very small number of entities. As a result, their method is unsuitable for the open ontology setting, where the semantic parser must detect entities not seen during training. To collect real validation and test utterances, Sherborne et al. (2020) use a three-staged process to collect data from Amazon Mechanical Turkers (AMTs). They ask for three translations per English source sentence, with the hypothesis that this will collect at least one adequate translation. We found this approach to be less cost-effective than using professional translators. Since this process is done for the test data, it is important for the translations to be verified and have high quality.

3 Multi-Lingual Parser Generation

Our goal is to localize an English semantic parser for question answering that operates on an open ontology of localized entities, with no manual annotation and a limited amount of human translation.

3.1 Overview

Our methodology is applicable to any semantic parser for which an English dataset is available, and for which the logical form ensures that the parameters appear exactly in the input sentence. We note that many previous techniques can be used to obtain the initial English dataset in a new domain. Our methodology consists of the following steps:

1. Generate training data in the target language from the English training data and an ontology of localized entities, as discussed below.

2. Translate evaluation and test data in English to the target language. To ensure that our test set is realistic, so high accuracy is indicative of good performance in practice, we engage professional translators, who are native speakers of the target language. We ask them to provide the most natural written form of each sentence in their language, equivalent to how they would type their queries for a text-based virtual assistant.

3. Train a semantic parser to translate sentences in the target language to the logical form, using the generated sentences and a few-shot boost of the manually translated sentences. Our semantic parsing model is described in Section 4.2.

3.2 Training Data with Localized Entities

The overall architecture of our approach is illustrated in Fig. 1, which shows how the English query "I am looking for a burger place near Woodland Pond" is used to generate Italian training samples looking for "lasagna" in "Via Del Corso", "focaccia" in "Mercerie", and "pizza" in "Lago di Como", with the help of an open ontology of Italian entities. Each of the sentences is annotated with its appropriate entities in the native language. This example illustrates why we have to handle the parameters of the queries carefully. While "burger" is translated into "hamburger", "Woodland Pond", a place in New York, is translated into "laghetto nel bosco", which is literally a "pond in the woods"; these entities no longer match the entities in the target logical form. In general, during translation, input tokens can be modified, transliterated, omitted, or mapped to a new token in the target language. If the semantics of the generated utterance in the target language is changed, the original logical form will no longer be the correct annotation of the utterance.

After translation, we substitute the entities with localized ones, and ensure the parameters in the sentences match those in the logical form. To do so, we add a pair of pre- and post-processing steps to the translation to improve the outcome of the translation with public NMT models, based on error analysis. For example, we found that the presence or absence of punctuation marks affects translation results for Persian and Arabic more than other languages. Furthermore, for languages such as Chinese and Japanese, where there is no white space delimitation between words in the sentence, the quotation marks are sometimes omitted during translation, which makes entity tracking difficult. We post-process the sentence using regular expressions to split English parameters from Chinese tokens. For Marian models, we also wrap placeholders for numbers, times, and dates in quotation marks to ensure they are not translated either. (A simplified sketch of these pre- and post-processing steps is given at the end of Section 3.3.)

3.3 Validation and Test Data

As discussed above, a small amount of annotated data in the target language is translated by professional translators. We create sentences with localized entities by showing the translators English sentences where parameters are replaced with placeholders (numbers, dates) or wrapped in quotation marks (restaurant and hotel names, cuisine types, etc.). We ask the translators to keep the parameters intact and not translate them. The parameters are substituted later with local values in the target language.
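To make the pre- and post-processing concrete, here is a minimal Python sketch of the two steps described in Section 3.2. The function names and the exact regular expressions are illustrative assumptions, not the toolkit's actual implementation; the real pipeline derives parameter spans from the logical form annotation rather than string matching.

```python
import re

def preprocess_for_translation(sentence, parameter_values):
    # Wrap every parameter value in quotation marks so the NMT model
    # tends to keep it as one contiguous, trackable span.
    for value in parameter_values:
        sentence = sentence.replace(value, f'" {value} "')
    return sentence

def postprocess_translation(sentence):
    # Normalize quotation-mark variants produced by the NMT model.
    sentence = re.sub(r'[“”«»„]', '"', sentence)
    # Split English parameter tokens glued onto CJK characters
    # (no whitespace delimitation in Chinese/Japanese).
    sentence = re.sub(r'([\u4e00-\u9fff])([A-Za-z0-9"])', r'\1 \2', sentence)
    sentence = re.sub(r'([A-Za-z0-9"])([\u4e00-\u9fff])', r'\1 \2', sentence)
    return sentence

print(preprocess_for_translation(
    "i am looking for a burger place near woodland pond .",
    ["burger", "woodland pond"]))
# i am looking for a " burger " place near " woodland pond " .
```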

4 Model Description

This section first describes our translation models, then the neural semantic parser we train with the generated data.

4.1 Machine Translation Models

To translate our training data, we have experimented with both pretrained Marian models (Junczys-Dowmunt et al., 2018) and the Google public NMT system (Wu et al., 2016) (through the Google Translate API). Marian models have an encoder-decoder architecture similar to BART (Lewis et al., 2019) and are available in more than 500 languages and thousands of language pairs. Although Google NMT has generally higher quality than Marian models for different language pairs and is widely adopted by different systems, Marian is preferred for two reasons. First, Marian provides flexibility, as translation is controlled and can be tuned to generate different translations for the same sentence. Second, the cost of using Google NMT to extend our work to hundreds of languages is prohibitive.

4.1.1 Marian with Alignment

To find the mapping between entities in the source and the translated language, we need to (1) detect entity spans in the output sentence, and (2) align those spans with input sentence spans. We have created an alignment module, which uses the cross-attention weights between the encoder and the decoder of the Marian model to align the input and output sentences. These weights show the amount of attention given to each input token when an output token is being generated. Figure 2 shows a heatmap of cross-attention weights for an English sentence and its translation in Italian. The cross-attention score for each decoder token is calculated by doing a multi-head attention (Vaswani et al., 2017) over all encoder tokens. For the example shown in Figure 2, each attention vector corresponds to one column in the heatmap.

[Figure 2: Cross-attention weights are shown for word-pieces in the source (X axis) and target (Y axis) language. Lighter colors correspond to higher weights. The translation is different than the one in Figure 1 as we are using Marian instead of GT.]

To simplify the identification of the spans, we mark each entity in the source sentence with quotation marks, using the information in the logical form. We found empirically that the quotation marks do not change the quality of the translation. When all quotation marks are retained in the translated sentence, the spans are the contiguous tokens between quotation marks in the translated sentence. Each quotation mark in the source is aligned with the quotation mark in the target that has the highest cross-attention score between the two. If some quotation marks are not retained, however, we find the positions in the translated sentence that share the highest cross-attention score with the quotation marks surrounding each entity, to determine its span. Once spans are detected, we override target sentence spans with source sentence spans. (A simplified code sketch of this alignment is given after Section 4.1.2.)

4.1.2 Alignment with Google NMT

As we cannot access the internals of Google NMT, we localize the entities by (1) replacing parameter values in the input sentence and logical form pairs with placeholders, (2) translating the sentences, and (3) replacing the placeholders with localized entities. Substituting with placeholders tends to degrade translation quality because the actual parameters provide a better context for translation.

We experimented with other methods, such as (1) using a glossary-based approach where parameters are detected and masked during translation, and (2) replacing parameters with values from the target language before translation. Both show poorer translation quality. The former technique degrades sentence quality, as masking the entity reduces context information the internal transformer model relies upon to generate target sentences. The second approach creates mixed-language sentences, requiring NMT systems to perform code-switching. It also makes the sentences look less natural and shifts the input distribution away from what public NMTs have been trained on.
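The following is a minimal sketch of the quotation-mark alignment described in Section 4.1.1. It assumes the cross-attention weights have already been averaged over the Marian decoder's layers and heads into a single (target × source) matrix; the function name and the fallback heuristic are illustrative assumptions, not the released alignment module.

```python
import numpy as np

def align_quoted_spans(src_tokens, tgt_tokens, cross_attention):
    # cross_attention: (len(tgt_tokens), len(src_tokens)) matrix.
    src_q = [i for i, t in enumerate(src_tokens) if t == '"']
    tgt_q = [j for j, t in enumerate(tgt_tokens) if t == '"']
    if len(src_q) == len(tgt_q):
        # All quotation marks survived translation: pair them up in order.
        pairs = list(zip(src_q, tgt_q))
    else:
        # Some quotes were dropped: map each source quote to the target
        # position that attends to it most strongly.
        pairs = [(i, int(np.argmax(cross_attention[:, i]))) for i in src_q]
    # Consecutive quote pairs delimit one entity span in each sentence.
    spans = []
    for (s_open, t_open), (s_close, t_close) in zip(pairs[0::2], pairs[1::2]):
        spans.append(((s_open + 1, s_close), (t_open + 1, t_close)))
    return spans  # list of ((src_start, src_end), (tgt_start, tgt_end))
```

Once these spans are known, the target-side span can simply be overwritten with the source-side entity, which is what enables substitution with localized ontology values.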

4.2 Semantic Parsing Model

The neural semantic parser we train using our translated training data is based on the previously proposed BERT-LSTM architecture (Xu et al., 2020a), which we modify to use the XLM-R pretrained model (Conneau et al., 2019) as the encoder instead. Our model is an encoder-decoder neural network that uses the XLM-R model as an encoder and an LSTM decoder with attention and a pointer-generator (See et al., 2017). More details are provided in Appendix A. As in previous work (Xu et al., 2020a), we apply rule-based preprocessing to identify times, dates, phone numbers, etc. All other tokens are lowercased and split into subwords according to the pretrained vocabulary. The same subword preprocessing is applied to entity names that are present in the output logical form.

5 Experiments

We have implemented the full SPL methodology in the form of a tool. Developers can use the SPL tool to create a new dataset and semantic parser for their task. We evaluate our models on the Schema2QA dataset (Xu et al., 2020a), translated to other languages using our tool. We first describe our dataset and then show our tool's accuracy, both without any human-produced training data (zero-shot) and when a small amount of human-created data in the target language is available (few-shot). In our experiments, we measure the logical form exact match (em) accuracy, which considers the result to be correct only if the output matches the gold logical form token by token. We additionally measure the structure match (sm) accuracy, which measures whether the gold and predicted logical forms are identical, ignoring the parameter values. A large difference between exact and structure accuracy indicates that the parameters are poorly handled. (A minimal illustration of the two metrics is given at the end of Section 5.1.) We report results on both validation and test sets. We present the results for both the restaurants and hotels domains in this paper.

Our toolkit uses the Genie (Campagna et al., 2019) library for synthesis and data augmentation. Our models were implemented using the Huggingface (Wolf et al., 2019) and GenieNLP² libraries.

5.1 Dataset

Using our approach, we have constructed a multilingual dataset based on the previously proposed Schema2QA Restaurants and Hotels datasets (Xu et al., 2020a). These datasets contain questions over scraped Schema.org web data, expressed using the ThingTalk query language. ThingTalk is a subset of SQL in expressiveness, but it is more tailored to natural language translation. The training sets are constructed using template-based synthesis and crowdsourced paraphrasing, while the validation and test sets are crowdsourced and manually annotated by an expert.

We chose these two datasets as a starting point as they require understanding both complex questions and a large number of entities, many of which are not seen in training. Note that the parameters in the logical forms are aligned with those in the input utterance: every open-ontology parameter value must appear exactly in the utterance. Table 2 shows the comparison of our dataset with the Overnight (Wang et al., 2015) and ATIS (Dahl et al., 1994) datasets, which previous work has translated to other languages (Sherborne et al., 2020). The Schema2QA dataset is larger, has more linguistic variety, and has significantly more possible values for each property. The hotels domain contains 443 and 528 examples, and the restaurants domain contains 528 and 524 examples, in the validation and test splits, respectively.

We scrape Yelp, OpenTable, and TripAdvisor websites for localized ontologies on restaurants, and Hotels.com and TripAdvisor for hotels. To ensure that some entities are unseen in validation and test, each ontology is split into two, for (1) training and (2) validation and test. The two splits overlap between 40% and 50%, depending on the domain and language. We replace the English parameters in the translated sentences with localized entities.

We have translated the Schema2QA dataset to 10 different languages, chosen to be linguistically diverse. To translate the training set, we use Google NMT for Farsi, Japanese, and Turkish. We use Marian for the other seven languages. Marian BLEU scores for all language pairs are available online³. In our initial experiments, we found that some of the models with high reported BLEU scores, such as English to Japanese, do not produce correct translations for our dataset. Thus, we perform the following to verify each model's quality: First, we choose a subset of the English evaluation set and translate it with the corresponding Marian model. The results are then back-translated to English using Google NMT. If the meaning is not preserved for at least 90% of the samples, we use Google NMT for that language.
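To illustrate the two evaluation metrics, here is a minimal sketch of exact match and structure match, under the assumption that parameter values are delimited by quotation marks as in ThingTalk. The helper names are hypothetical; this is not the GenieNLP evaluation code.

```python
import re

def exact_match(pred: str, gold: str) -> bool:
    # em: output must match the gold logical form token by token
    return pred.split() == gold.split()

def structure_match(pred: str, gold: str) -> bool:
    # sm: compare logical forms with quoted parameter values masked out
    mask = lambda s: re.sub(r'"[^"]*"', '"VALUE"', s)
    return mask(pred).split() == mask(gold).split()

pred = '( servesCuisine =~ " pizza " )'
gold = '( servesCuisine =~ " lasagna " )'
print(exact_match(pred, gold), structure_match(pred, gold))  # False True
```

A prediction like this one, correct in structure but wrong in its parameter value, is exactly the failure mode signaled by a large em/sm gap.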

² https://github.com/stanford-oval/genienlp
³ https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/results

| Metric | Overnight (Blocks) | Overnight (Social) | ATIS | Schema2QA (Hotels) | Schema2QA (Restaurants) |
| # attributes | 10 | 15 | 16 | 18 | 25 |
| # examples | 1,305 | 2,842 | 4,433 | 363,101 | 508,101 |
| avg # unique unigrams per example | 6.82 | 8.65 | 7.75 | 12.19 | 12.16 |
| avg # unique bigrams per example | 7.44 | 8.68 | 6.99 | 11.62 | 11.57 |
| avg # unique trigrams per example | 6.70 | 7.90 | 6.03 | 10.64 | 10.59 |
| avg # properties per example | 1.94 | 1.65 | 2.56 | 2.03 | 2.06 |
| avg # values per property | ≤ 2 | ≤ 2 | ≤ 20 | ≥ 100 | ≥ 100 |

Table 2: Statistical analysis of the training sets for the Overnight, ATIS, and Schema2QA datasets. For Overnight, the two domains with the lowest reported accuracies are chosen.

We chose a subset of the validation set (75% for hotels and 72% for restaurants) to be professionally translated. We use this data to train our parser in a few-shot setting (Section 5.4.4). The full test sets for both domains are professionally translated.

5.2 BackTranslation: Translate at Test Time

As our first baseline, we train an English semantic parser on the English training set; at test time, the sentence (including its entities) is translated on-the-fly from the target language to English and passed to the semantic parser.

The experimental results are shown in Table 3. The results vary from a minimum of 9.7% for Japanese to a maximum of 34.4% for Turkish. Comparing the results to English, we observe about a 30% to 50% drop in exact match accuracy. In general, the closer the language is to English in terms of semantics and syntax, the higher the BLEU score will be using NMT. The large difference between em and sm accuracies is caused by the wrong prediction of parameter values. This is expected, since the entities translated to English no longer match the annotations containing localized entities. Note that the English parser has learned to primarily copy those parameters from the sentence.

5.3 Bootstrap: Train with Translated Data

As proposed by Sherborne et al. (2020), we create a new training set by using NMT to directly translate the English sentences into the target language; the logical forms containing English entities are left unmodified. This data is then used to train a semantic parser. The results are shown in Table 3, in the "Bootstrap" column. Overall, the performance of Bootstrap is comparable to the performance of BackTranslation, ranging from 15% on Farsi restaurants to 29% on Finnish hotels.

In a second experiment, we train a semantic parser on a dataset containing both English and translated sentences. Note that the test set is the same and contains only questions in the target language. Training with a mixture of languages has shown improvements over single-language training (Liu et al., 2020; Arivazhagan et al., 2019). This experiment (shown as Bootstrap (+English) in Table 3) achieves between 16% and 31% accuracy, outperforming BackTranslation for all 5 languages except for Turkish hotels and Turkish and Finnish restaurants.

Overall, these two experiments show that training with translated data can improve over translation at test time, although not by much. Furthermore, as we cannot identify the original parameters in the translated sentences, we cannot augment the training data with localized entities. This step is much needed for the neural model to generalize beyond the fixed set of values it has seen during training. A neural semantic parser trained with Bootstrap learns to translate (or transliterate) the entity names from the foreign language representation in the sentence to the English representation in the logical form. Hence, it cannot predict the localized entities contained in the test set, which are represented in the target language.

5.4 SPL: Semantic Parser Localizer

There are three key components in the SPL methodology: (1) translation with alignment to ensure parameters are preserved, (2) training with parameter-augmented machine-translated data, and (3) boosting accuracy by adding human-translated examples to the training data, simulating a few-shot setting. Here we describe the experiments we designed to evaluate each component separately.

5.4.1 Test Time Translation with Alignment

In this experiment, we run BackTranslation (BT) with alignment to understand its effect. We translate sentences from the foreign language to English at test time, but we use the entity aligner described in Section 4.1.1 to copy the localized entities in the foreign language to the translated sentence before feeding it into the English semantic parser. A sketch of this test-time pipeline is shown below.
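The following sketch strings the pieces together, reusing the span aligner from Section 4.1.1. Both translate_to_english (a hypothetical NMT wrapper that also returns the averaged cross-attention matrix) and english_parser are assumptions made for illustration.

```python
def backtranslate_with_alignment(sentence, translate_to_english,
                                 align_quoted_spans, english_parser):
    src_tokens = sentence.split()
    tgt_tokens, cross_attention = translate_to_english(src_tokens)
    # Overwrite each translated entity span with the original localized
    # entity, so the English parser can copy it verbatim into the logical
    # form. Iterate in reverse so earlier span indices stay valid.
    for (s0, s1), (t0, t1) in reversed(
            align_quoted_spans(src_tokens, tgt_tokens, cross_attention)):
        tgt_tokens[t0:t1] = src_tokens[s0:s1]
    return english_parser(" ".join(tgt_tokens))
```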

Hotels (each cell shows em/sm, the exact and structure match accuracy, in %):
| Language | BT Test | Bootstrap Dev | Bootstrap Test | Bootstrap (+English) Dev | Bootstrap (+English) Test |
| German   | 30.4/50.2 | 30.2/55.1 | 26.9/50.4 | 35.7/56.4 | 30.9/51.9 |
| Farsi    | 19.1/36.0 | 23.6/44.2 | 22.6/43.8 | 22.2/44.2 | 21.4/42.8 |
| Finnish  | 29.2/46.6 | 27.8/53.4 | 29.1/52.5 | 26.9/51.4 | 30.2/57.2 |
| Japanese | 15.7/25.2 | 20.1/29.4 | 19.9/27.3 | 21.0/39.3 | 20.5/39.2 |
| Turkish  | 32.0/49.4 | 18.5/43.6 | 17.2/40.9 | 22.6/62.3 | 25.6/60.2 |

Restaurants:
| Language | BT Test | Bootstrap Dev | Bootstrap Test | Bootstrap (+English) Dev | Bootstrap (+English) Test |
| German   | 26.9/51.9 | 31.3/62.9 | 25.8/60.1 | 30.3/61.0 | 27.3/56.9 |
| Farsi    | 14.5/40.8 | 20.1/28.2 | 15.1/26.2 | 22.0/34.1 | 15.5/28.1 |
| Finnish  | 24.0/55.0 | 19.9/51.9 | 18.9/54.4 | 22.3/53.8 | 20.8/53.2 |
| Japanese | 9.7/29.8  | 19.2/22.1 | 18.3/20.6 | 22.1/31.6 | 18.1/24.3 |
| Turkish  | 34.4/62.6 | 29.7/54.2 | 28.4/44.1 | 33.5/54.7 | 28.1/45.6 |

Table 3: Experiment results for the hotels (top) and restaurants (bottom) domains using the BackTranslation and Bootstrap methods as our baselines. em and sm indicate exact and structure match accuracy, respectively. We chose 5 representative languages for these experiments. Exact match accuracies for the English test set are 64.6% for hotels and 68.9% for restaurants.

The results, as shown in Table 4, improve by 25% to 40% across all languages compared to naive BT. This highlights the importance of having entities that are aligned in the sentence and the logical form, as that enables the semantic parser to copy entities from the localized ontology for correct prediction. This is evident as the exact accuracy result is close to that of structure accuracy.

5.4.2 Training with Machine Translated Data

In the next experiment, we apply the methodology in Section 3 to the English dataset to create localized training data and train one semantic parser per language. We translate a portion of the validation set using human translators and combine it with the machine-translated validation data. For all the following experiments, the model with the highest em accuracy on this set is chosen and tested on human-translated test data.

As shown in Table 4, the results obtained by this methodology outperform all the baselines. Specifically, we achieve improvements between 33% and 50% over the previous state-of-the-art result, represented by the Bootstrap approach. The neural model trained on SPL data takes advantage of entity alignment in the utterance and logical form and can copy the entities directly. The exact match accuracy ranges from 53% in Chinese to 62% in Spanish for hotels, and from 41% in Japanese to 68% in Spanish for restaurants. Compared to the accuracy of 65% and 69% for hotels and restaurants in English, respectively, we see a degradation in performance for languages that are very different from English. Languages close to English, such as Spanish, approach the performance of English.

5.4.3 Adding English Training Data

Similar to Bootstrap (+English), we also experimented with combining the original English training set with the training set generated using the SPL approach. Except for some drops (0.3%-4%) in accuracy for Spanish and Turkish restaurants and Finnish and Japanese hotels, we observe about a 1% to 10% improvement compared to when English training data is not used. As the parser is exposed to a larger vocabulary and two potentially different grammars at once, it must learn to pay more attention to sentence semantics as opposed to individual tokens. Additionally, the English training data contains human-paraphrased sentences, which are more natural compared to synthetic data, and add variety to the training set.

5.4.4 Adding a Few Human Translations to Training Data

In our final experiment, we add the portion of the validation set translated by humans to the training set generated using SPL. Since the validation size is much smaller than the training size (0.03% for hotels and 0.12% for restaurants), this is similar to a few-shot scenario where a small dataset from the test distribution is used for training.

In Table 5 we have computed BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) scores between machine-translated and human-translated validation data for hotels. One key takeaway is that machine-translated data has quite a different distribution than human-translated data, as none of the BLEU scores are higher than 0.45. Adding a few real examples can shrink this gap and yield higher accuracy on natural sentences.

Hotels (each cell shows em/sm, the exact and structure match accuracy, in %):
| Language | BT (+Align.) Test | Zero-Shot SPL Dev | Zero-Shot SPL Test | Zero-Shot SPL (+English) Dev | Zero-Shot SPL (+English) Test | Few-Shot SPL Dev | Few-Shot SPL Test |
| Arabic   | 22.7/26.1 | 51.6/59.8 | 51.3/60.8 | 53.8/61.7 | 54.9/61.7 | 60.8/67.2 | 61.4/67.0 |
| German   | 51.3/54.2 | 70.7/73.6 | 61.0/66.1 | 70.0/73.3 | 65.9/68.0 | 77.1/80.2 | 68.6/71.2 |
| Spanish  | 53.4/55.3 | 69.0/72.3 | 61.6/68.2 | 73.1/76.2 | 65.3/70.8 | 76.3/80.9 | 67.0/72.0 |
| Farsi    | 51.4/53.1 | 63.5/65.0 | 58.9/61.0 | 70.3/71.7 | 58.9/61.8 | 77.0/78.9 | 63.5/65.4 |
| Finnish  | 50.8/54.4 | 57.9/59.1 | 62.5/66.7 | 64.2/65.4 | 60.3/65.0 | 69.3/70.7 | 68.4/71.3 |
| Italian  | 53.4/56.8 | 66.9/72.6 | 60.8/66.7 | 66.4/71.2 | 64.4/69.9 | 69.8/75.8 | 65.7/71.8 |
| Japanese | 42.3/44.6 | 71.0/72.0 | 63.6/65.0 | 71.3/72.1 | 59.5/61.4 | 73.1/78.2 | 67.6/69.3 |
| Polish   | 49.8/52.3 | 58.7/62.1 | 54.9/59.3 | 60.0/63.4 | 57.6/60.6 | 67.7/71.6 | 64.8/68.4 |
| Turkish  | 55.7/59.5 | 69.0/72.5 | 60.2/69.1 | 73.0/76.9 | 64.0/73.3 | 77.8/79.6 | 69.3/74.4 |
| Chinese  | 29.2/32.4 | 55.9/60.4 | 52.8/58.1 | 54.6/59.8 | 51.1/56.1 | 56.7/67.2 | 62.9/67.4 |

Restaurants:
| Language | BT (+Align.) Test | Zero-Shot SPL Dev | Zero-Shot SPL Test | Zero-Shot SPL (+English) Dev | Zero-Shot SPL (+English) Test | Few-Shot SPL Dev | Few-Shot SPL Test |
| Arabic   | 34.6/36.1 | 66.7/69.0 | 67.0/70.0 | 60.8/63.7 | 67.7/71.6 | 75.9/77.4 | 74.6/79.1 |
| German   | 52.3/55.7 | 69.4/71.9 | 63.0/65.6 | 74.4/76.3 | 65.3/68.9 | 82.6/84.8 | 77.1/80.7 |
| Spanish  | 58.2/61.3 | 68.6/72.1 | 67.6/74.0 | 70.7/75.0 | 67.4/75.2 | 82.1/84.7 | 77.5/80.5 |
| Farsi    | 57.8/62.2 | 63.0/64.5 | 61.8/62.4 | 69.0/70.0 | 65.5/66.2 | 78.0/78.5 | 74.2/75.0 |
| Finnish  | 53.8/57.1 | 63.0/65.4 | 58.6/60.3 | 63.4/65.1 | 59.2/60.5 | 72.9/74.9 | 68.1/69.7 |
| Italian  | 56.1/59.5 | 52.1/53.3 | 48.3/50.6 | 53.3/54.6 | 52.9/55.3 | 70.3/72.0 | 69.0/70.5 |
| Japanese | 49.6/52.5 | 45.1/47.0 | 41.3/43.6 | 48.9/51.1 | 48.7/50.5 | 75.2/76.5 | 70.5/72.2 |
| Polish   | 49.6/54.0 | 50.9/52.7 | 51.5/52.7 | 55.7/60.8 | 56.5/60.7 | 65.3/66.1 | 64.3/65.1 |
| Turkish  | 57.8/61.6 | 59.6/61.3 | 57.8/60.1 | 58.7/60.3 | 56.1/58.6 | 80.3/81.3 | 74.6/76.5 |
| Chinese  | 42.8/45.5 | 56.6/58.8 | 46.2/51.1 | 64.1/65.6 | 57.3/61.6 | 69.8/72.1 | 65.3/69.7 |

Table 4: Experiment results for the hotels (top) and restaurants (bottom) domains using SPL. em and sm indicate exact and structure match accuracy, respectively. Zero-shot and few-shot exact match accuracies for the English test set are 64.6% and 71.5% for hotels, and 68.9% and 81.6% for restaurants.

As shown in Table 4, the test results are improved significantly across all the languages for both domains. This shows that a small addition of real training data improves the model performance significantly. The exact match accuracy varies across languages, with a low of 61.4% on Arabic and a high of 69.3% on Turkish for hotels, and a low of 64.3% on Polish and a high of 77.1% on Spanish for restaurants. The multilingual results compare favorably with those for English. We show that a few-shot boost of crowdsourced evaluation data in training can also improve the English semantic parser, raising its accuracy from 65% to 72% for hotels, and from 69% to 82% for restaurants. The few-shot approach is particularly helpful when the training and test data are collected using different methods; this can create a new avenue for further research on multilingual tasks.

We have performed an error analysis on the results generated by the parser. At a high level, we found the biggest challenge is in recognizing entities, in particular when entities are unseen and when the type of the entities is ambiguous. We also found translation noise would introduce confusion for implicit concepts such as "here". Translation sometimes introduces or removes these concepts from the sentence. Detailed error analysis is provided in Appendix C.

6 Conclusion

This paper presents SPL, a toolkit and methodology to extend and localize semantic parsers to a new language with higher accuracy, yet at a fraction of the cost compared to previous methods. SPL was incorporated into the Schema2QA toolkit (Xu et al., 2020a) to give it a multilingual capability.

SPL can be used by any developer to extend their QA system's current capabilities to a new language in less than 24 hours, leveraging professional services to translate the validation data and mature public NMT systems. We found our approach to be effective on a recently proposed QA semantic parsing dataset, which is significantly more challenging than other available multilingual datasets in terms of sentence complexity and ontology size. Our generated datasets are automatically annotated using logical forms containing localized entities; we require no human annotations. Our model outperforms the previous state-of-the-art methodology by between 30% and 40% depending on the domain and the language. Our new datasets and resources are released open-source⁴. Our methodology enables further investigation and creation of new benchmarks to trigger more research on this topic.

⁴ https://github.com/stanford-oval/SPL

Acknowledgements

We would like to thank Sierra Kaplan-Nelson and Max Farr for their help with error analysis. This work is supported in part by the National Science Foundation under Grant No. 1900638 and the Alfred P. Sloan Foundation under Grant No. G-2020-13938.

References

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP, pages 1533–1544.

Giovanni Campagna, Agata Foryciarz, Mehrad Moradshahi, and Monica S. Lam. 2020. Zero-shot transfer learning with synthesized data for multi-domain dialogue state tracking. arXiv preprint arXiv:2005.00891.

Giovanni Campagna, Silei Xu, Mehrad Moradshahi, Richard Socher, and Monica S. Lam. 2019. Genie: A generator of natural language semantic parsers for virtual assistant commands. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2019, pages 394–410, New York, NY, USA. ACM.

Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. arXiv preprint arXiv:2003.05002.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale.

Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. arXiv preprint arXiv:1809.05053.

Deborah A. Dahl, Madeleine Bates, Michael Brown, William Fisher, Kate Hunicke-Smith, David Pallett, Christine Pao, Alexander Rudnicky, and Elizabeth Shriberg. 1994. Expanding the scope of the ATIS task: The ATIS-3 corpus. In Proceedings of the Workshop on Human Language Technology, pages 43–48. Association for Computational Linguistics.

Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 33–43, Berlin, Germany. Association for Computational Linguistics.

Long Duong, Hadi Afshar, Dominique Estival, Glen Pink, Philip Cohen, and Mark Johnson. 2017. Multilingual semantic parsing and code-switching. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 379–389, Vancouver, Canada. Association for Computational Linguistics.

Carolin Haas and Stefan Riezler. 2016. A corpus and semantic parser for multilingual natural language querying of OpenStreetMap. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 740–750, San Diego, California. Association for Computational Linguistics.

Jonathan Herzig and Jonathan Berant. 2018. Decoupling structure and lexicon for zero-shot semantic parsing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080.

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12–22, Berlin, Germany. Association for Computational Linguistics.

Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. 2018. Marian: Fast neural machine translation in C++. In Proceedings of ACL 2018, System Demonstrations, pages 116–121, Melbourne, Australia. Association for Computational Linguistics.

Rohit J. Kate, Yuk Wah Wong, and Raymond J. Mooney. 2005. Learning to transform natural to formal languages. In AAAI, pages 1062–1068.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation.

Mehrad Moradshahi, Hamid Palangi, Monica S. Lam, Paul Smolensky, and Jianfeng Gao. 2019. HUBERT untangles BERT to improve transfer across NLP tasks.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Martin Popel and Ondřej Bojar. 2018. Training tips for the transformer model. The Prague Bulletin of Mathematical Linguistics, 110(1):43–70.

Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2019. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. arXiv preprint arXiv:1909.05855.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083.

Pararth Shah, Dilek Hakkani-Tür, Gokhan Tür, Abhinav Rastogi, Ankur Bapna, Neha Nayak, and Larry Heck. 2018. Building a conversational agent overnight with dialogue self-play. arXiv preprint arXiv:1801.04871.

Tom Sherborne, Yumo Xu, and Mirella Lapata. 2020. Bootstrapping a crosslingual semantic parser.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas, volume 200. Cambridge, MA.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1332–1342. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

William A. Woods. 1977. Lunar rocks in natural English: Explorations in natural language question answering.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Silei Xu, Giovanni Campagna, Jian Li, and Monica S. Lam. 2020a. Schema2QA: High-quality and low-cost Q&A agents for the structured web. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management.

Silei Xu, Sina J. Semnani, Giovanni Campagna, and Monica S. Lam. 2020b. AutoQA: From databases to QA semantic parsers with only synthetic training data. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2, AAAI'96, pages 1050–1055. AAAI Press.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.

A Semantic Parser Model

Our neural semantic parser is based on the BERT-LSTM model (Xu et al., 2020a), a previously proposed model that was found effective on semantic parsing. We have modified the model encoder to use XLM-R instead of a BERT or LSTM encoder. XLM-R is a Transformer-based multilingual model trained on the CommonCrawl corpus in 100 different languages. Unlike some XLM (Lample and Conneau, 2019) multilingual models, it can detect the language from the input ids without requiring additional language-specific tokens. The decoder is an LSTM decoder with attention and a pointer-generator. At each decoding step, the model decides whether to generate a token or copy one from the input context.

| Language | BLEU (%) | GLEU (%) | METEOR (%) | NIST (%) | TER (%) |
| Arabic   | 15.9 | 19.1 | 32.9 | 49.7 | 72.4 |
| German   | 37.2 | 41.8 | 61.3 | 45.3 | 45.5 |
| Spanish  | 42.0 | 46.8 | 64.9 | 43.1 | 39.2 |
| Farsi    | 20.9 | 25.9 | 44.5 | 44.4 | 63.5 |
| Finnish  | 23.7 | 27.9 | 43.0 | 34.3 | 52.7 |
| Italian  | 32.7 | 38.5 | 56.9 | 49.8 | 44.2 |
| Japanese | 44.0 | 46.6 | 67.9 | 22.7 | 53.8 |
| Polish   | 18.5 | 22.2 | 38.1 | 48.3 | 62.4 |
| Turkish  | 26.2 | 30.2 | 47.0 | 43.7 | 52.8 |
| Chinese  | 21.7 | 26.1 | 42.8 | 32.9 | 76.9 |
|          |      |      |      |      |      |
| Arabic   | 21.3 | 23.7 | 36.6 | 48.8 | 65.7 |
| German   | 35.2 | 39.6 | 57.8 | 39.1 | 48.0 |
| Spanish  | 41.8 | 46.2 | 61.2 | 38.7 | 42.6 |
| Farsi    | 23.8 | 28.3 | 46.6 | 45.9 | 59.9 |
| Finnish  | 28.0 | 33.5 | 47.4 | 34.6 | 46.6 |
| Italian  | 3.07 | 42.6 | 57.6 | 43.4 | 43.1 |
| Japanese | 44.2 | 46.4 | 68.0 | 23.1 | 58.5 |
| Polish   | 23.6 | 28.1 | 41.0 | 35.0 | 56.8 |
| Turkish  | 28.2 | 32.4 | 48.0 | 31.7 | 48.3 |
| Chinese  | 25.6 | 31.1 | 50.9 | 34.8 | 59.7 |

Table 5: Results for different similarity metrics. The results are shown for the hotels validation set.

We preprocess the input sentences by lowercasing all tokens except for entity placeholders such as TIME_0, DATE_0, etc., and splitting tokens on white space. The formal code tokens are also split on whitespace, but their casing is preserved. XLM-R uses the sentence piece model to tokenize input words into sub-word pieces. For the decoder, to be able to copy tokens from the pretrained XLM-R vocabulary, we perform the same sub-word tokenization of parameter values in the input sentence and in the formal language.

[Figure 3: Semantic parser neural model. It has a Seq2Seq architecture with an XLM-R encoder and an LSTM decoder with attention.]

The word-pieces are then numericalized using an embedding matrix and fed into a 12-layer pretrained transformer network which outputs contextual representations of each sub-word. The representations are then aggregated using a pooling layer which calculates the final representation of the input sentence:

$$H = W_{\mathrm{agg}} \, \mathrm{ReLU}(W_E \, \mathrm{mean}(h_0, h_1, \ldots, h_N))$$

where $H$ is the final sentence embedding, $W_{\mathrm{agg}}$ and $W_E$ are learnable weights, $\mathrm{ReLU}(\cdot)$ is the rectified linear unit function, and $\mathrm{mean}(\cdot)$ is the average function.⁵

The decoder uses an attention-based pointer-generator to predict the target logical form one token at a time. The tokenized code word-pieces are passed through a randomly initialized embedding layer, which will be learned from scratch. Using pretrained language models instead did not prove to be useful, as none of them are trained on formal languages. Each embedded value is then passed to an LSTM cell. The output is used to calculate the attention scores against each token representation from the encoder ($c_t$) and produce the final attention context vector ($C$). The model then produces two vocabulary distributions: one over the input sentence ($P_c(a^t)$) and one over the XLM-R sentence piece model's vocabulary ($P_v(a^t)$). A trainable scalar switch ($s$) is used to calculate the weighted sum of the two distributions. The final output is the token with the highest probability.

$$P_c(a^t \mid C, d^t) = \sum_{a^t = a^{*t}} \mathrm{softmax}(d^t C^\top)$$
$$P_v(a^t \mid C, d^t) = \mathrm{softmax}(W_o C)$$
$$P(a^t \mid C, d^t) = s^t \, P_c(a^t) + (1 - s^t) \, P_v(a^t)$$

The model is trained autoregressively using teacher forcing, with token-level cross-entropy loss:

$$\mathcal{L} = -\sum_{t=0}^{N} \sum_{a^t \in V} \mathbb{1}[a^t = a^{*t}] \, \log P(a^t \mid C, d^t)$$

Here $\mathcal{L}$ indicates the loss value, and $\mathbb{1}[\cdot]$ is the indicator function: it is 1 when the predicted token $a^t$ matches the gold answer token $a^{*t}$, and 0 otherwise.

⁵ Bias parameters are omitted for brevity.
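A minimal PyTorch sketch of one decoding step of this pointer-generator follows. It simplifies the formulation above to a single attention head and assumes the switch $s^t$ comes from a sigmoid over a scalar logit; the variable names mirror the symbols above, and the function itself is illustrative, not the GenieNLP implementation.

```python
import torch
import torch.nn.functional as F

def pointer_generator_step(d_t, C, W_o, switch_logit, src_token_ids, vocab_size):
    # d_t: decoder hidden state at step t, shape (hidden,)
    # C:   encoder token representations, shape (src_len, hidden)
    # W_o: output projection, shape (vocab_size, hidden)
    # src_token_ids: vocabulary ids of the input tokens, shape (src_len,), long
    attn = F.softmax(d_t @ C.T, dim=-1)        # attention over source tokens
    context = attn @ C                         # attention context vector
    # Copy distribution: scatter the attention mass onto the vocabulary
    # ids of the input tokens.
    p_copy = torch.zeros(vocab_size).scatter_add(0, src_token_ids, attn)
    # Generation distribution over the full subword vocabulary.
    p_gen = F.softmax(W_o @ context, dim=-1)
    s = torch.sigmoid(switch_logit)            # trainable scalar switch
    return s * p_copy + (1.0 - s) * p_gen
```

At training time this mixed distribution feeds the cross-entropy loss above; at inference, the argmax token is emitted.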

B Implementation Details

Our code implementations are in PyTorch⁶ and based on HuggingFace (Wolf et al., 2019). In all of our experiments, we used the xlmr-base model, which is trained on CommonCrawl data in 100 languages with a shared vocabulary size of 250K. The model architecture is similar to BERT and has 12 Transformer encoder layers with 12 attention heads each and a hidden layer dimension of 768. XLM-R uses a sentence-piece model to tokenize the input sentences. We used Adam (Kingma and Ba, 2014) as our optimizer with a learning rate of 1 × 10⁻⁴ and used a transformer non-linear warm-up schedule (Popel and Bojar, 2018). In all our experiments we used the same value for the hidden dimension (768), the transformer model dimension (768), the number of transformer heads (12), the size of trainable dimensions in the decoder embedding matrix (50), and the number of RNN layers for the decoder (1). These parameters were chosen from the best performing model over the English dev set for each domain.

Each model has a different number of parameters depending on the language trained on and the number of added vocabulary items from the training and validation set. However, this number does not vary much, and the average across languages is about 300M including XLM-R parameters. We batch sentences based on their token count. We set the total number of tokens to be 5K, which would be about 400 examples per batch. Our models were trained on NVIDIA V100 GPUs using the AWS platform. Single-language models were trained for 60K iterations, which takes about 6 hours. For a fair comparison, models trained jointly on English and the target language were trained for 80K iterations.

⁶ https://pytorch.org/

C Error Analysis

We present an error analysis for 5 languages (Spanish, Persian, Italian, Japanese, and Chinese) for which we have access to native speakers.

• Locations are sometimes parsed incorrectly. In many cases, the model struggles to distinguish an explicit mention of "here" from no mention at all. We suspect this is due to translation noise introducing or omitting a reference to the current location.

• In some examples, the review author's name is parsed as a location name. The copying mechanism deployed by the neural model decoder relies on the context of the sentence to identify both the type and span of the parameter values. Thus, if localization is done poorly, the model will not be able to generalize beyond a fixed ontology.

• Occasionally, the parser has difficulty distinguishing between the rating value and the number of reviews, especially if the original sentence makes no mention of stars or posts and instead uses more implicit terms like top or best.

• In some examples, the input sentence asks for information about "this restaurant" but the program uses the user's home location instead of their current location.

• There are human mistranslations where check-in time has been mislabeled as check-out time. Additionally, sentence ambiguity is exacerbated by the human translation step, for example, between a hotel's official star rating value and the customer's average rating value. In English, this kind of ambiguity is resolved by expert annotation flagging ambiguous sentences.

• Translation noise can, in some cases, change the numbers in the sentence. For example, "at least" and "more than" are equivalent in the DBTalk language, but it is possible that when the translation occurs the number is changed ("at least 4" → "more than 3").

• In morphologically-rich languages (such as Italian), the entities often are not in grammatical agreement with the rest of the sentence (e.g., a feminine article precedes a masculine entity), which confuses the model on the boundaries of the entity.
