Automatic Translation of WordNet Glosses


Automatic Translation of WordNet Glosses

Jesús Giménez and Lluís Màrquez
TALP Research Center, LSI Department
Universitat Politècnica de Catalunya
{jgimenez,lluism}@lsi.upc.edu

German Rigau
IXA Group
University of the Basque Country
[email protected]

Abstract

We approach the task of automatically translating the glosses in the English WordNet. We intend to generate preliminary material which could be used to enrich other wordnets lacking glosses. A phrase-based Statistical Machine Translation system has been built using a parallel corpus of proceedings of the European Parliament. We study how to adapt the system to the domain of dictionary definitions. First, we work with specialized language models. Second, we exploit the Multilingual Central Repository to build domain-independent translation models. Combining these two complementary techniques and properly tuning the system, a relative improvement of 64% in BLEU score is attained.

1 Introduction

In this work we study the possibility of applying Statistical Machine Translation (SMT) techniques to the glosses in the English WordNet (Fellbaum, 1998). WordNet glosses are a very useful resource. For instance, Mihalcea and Moldovan (1999) suggested an automatic method for generating sense-tagged corpora which uses WordNet glosses. Hovy et al. (2001) used WordNet glosses as external knowledge to improve their Webclopedia Question Answering (QA) system.

However, most of the wordnets in the Multilingual Central Repository (MCR) (Atserias et al., 2004) contain very few glosses. For instance, in the current version of the Spanish WordNet fewer than 10% of the synsets have a gloss. Conversely, since version 1.6 every synset in the English WordNet has a gloss. We believe that a method to rapidly obtain glosses for all wordnets in the MCR may be helpful. These glosses could serve as a starting point for a further step of revision and post-editing. Furthermore, from a conceptual point of view, the idea of enriching the MCR using the MCR itself is very attractive.

Moreover, SMT is today a very promising approach to Machine Translation (MT) for a number of reasons. The most important one in the context of this work is that it allows an MT system to be built very quickly, given only a parallel corpus representing the languages involved. Besides, SMT is fully automatic and its results are also very competitive. However, one of the main claims against SMT is that it is domain oriented. Since parameters are estimated from a parallel corpus in a specific domain, the performance of the system on a different domain is often much worse. In the absence of a parallel corpus of definitions, we built phrase-based translation models [1] on the Europarl corpus [2] (Koehn, 2003). However, the language of definitions is very specific and different from that of parliament proceedings. This is particularly harmful to the system's recall, because many unknown words will be processed.

[1] The term 'phrase' used hereafter refers to a sequence of words not necessarily syntactically motivated.
[2] European Parliament Proceedings (1996-2003) are available for 11 European languages at http://people.csail.mit.edu/people/koehn/publications/europarl/. We used a version of this corpus reviewed by the RWTH Aachen group.

In order to adapt the system to the new domain we study two separate lines. First, we use electronic dictionaries in order to build more adequate target language models. Second, we work with domain-independent word-based translation models extracted from the MCR. Other authors have previously applied information extracted from aligned wordnets. Tufis et al. (2004b) presented a method for Word Sense Disambiguation (WSD) based on parallel corpora. They utilized the aligned wordnets in BalkaNet (Tufis et al., 2004a). We suggest using these models as a complement to phrase-based models. These two proposals, together with a good tuning of the system parameters, lead to a notable improvement of results. In our experiments, we focus on translation from English into Spanish. A relative increase of 64% in BLEU measure is achieved when limiting the use of the MCR-based model to the case of unknown words.

The rest of the paper is organized as follows. In Section 2 the fundamentals of SMT are depicted. In Section 3 we describe the components of our system. Experimental work is deployed in Section 4. Improvements are detailed in Section 5. Finally, in Section 6, current limitations of our approach are discussed, and further work is outlined.

2 Statistical Machine Translation

Current state-of-the-art SMT systems are based on ideas borrowed from the Communication Theory field (Weaver, 1955). Brown et al. (1988) suggested that MT can be statistically approximated to the transmission of information through a noisy channel. Given a sentence s (the distorted signal), it is possible to approximate the sentence t (the original signal) which produced s. We need to estimate P(t|s), the probability that t is the original that a translator rendered as s. By applying Bayes' rule we decompose it:

    P(t|s) = P(s|t) * P(t) / P(s)    (1)

To obtain the string which maximizes the translation probability for s, a search in the probability space must be performed. Because the denominator is independent of t, we can ignore it for the purpose of the search:

    t* = argmax_t P(s|t) * P(t)    (2)

Equation 2 reveals the three components of an SMT system. First, a language model that estimates P(t). Second, a translation model representing P(s|t). Last, a decoder responsible for performing the search. See (Brown et al., 1993) for a detailed report on the mathematics of Machine Translation.

3 System Description

Fortunately, we can count on a number of freely available tools to build an SMT system. We utilized the SRI Language Modeling Toolkit (SRILM) (Stolcke, 2002). It supports creation and evaluation of a variety of language model types based on N-gram statistics, as well as several related tasks, such as statistical tagging and manipulation of N-best lists and word lattices.

In order to build phrase-based translation models, a phrase extraction must be performed on a word-aligned parallel corpus. We used the GIZA++ SMT Toolkit [3] (Och and Ney, 2003) to generate word alignments. We applied the phrase-extract algorithm, as described by Och (2002), on the Viterbi alignments output by GIZA++. This algorithm takes as input a word alignment matrix and outputs a set of phrase pairs that is consistent with it. A phrase pair is said to be consistent with the word alignment if all the words within the source phrase are aligned only to words within the target phrase, and vice versa.

[3] The GIZA++ SMT Toolkit may be freely downloaded at http://www.fjoch.com/GIZA++.html

Phrase pairs are scored by relative frequency (Equation 3). Let ph_s be a phrase in the source language and ph_t a phrase in the target language. We define a function count(ph_s, ph_t) which counts the number of times the phrase ph_s has been seen aligned to phrase ph_t in the training data. The conditional probability that ph_s maps into ph_t is estimated as:

    P(ph_t | ph_s) = count(ph_s, ph_t) / sum_{ph_t'} count(ph_s, ph_t')    (3)

No smoothing is performed.

For the search, we used the Pharaoh beam search decoder (Koehn, 2004). Pharaoh is an implementation of an efficient dynamic programming search algorithm with lattice generation and XML markup for external components. Performing an optimal decoding can be extremely costly because the search space is polynomial in the length of the input (Knight, 1999). For this reason, like most decoders, Pharaoh actually performs a suboptimal (beam) search by pruning the search space according to certain heuristics based on the translation cost.

4 Experiments

4.1 Experimental Setting

As sketched in Section 2, in order to build an SMT system we need to build a language model and a translation model, all in a format that is convenient for the Pharaoh decoder. We tokenized and lowercased the Europarl corpus. A set of 327,368 parallel segments of length between five and twenty was selected for training. The Spanish side consisted of 4,243,610 tokens, whereas the English side consisted of 4,197,836 tokens.

We built a trigram language model from the Spanish side of the Europarl corpus selection. Linear interpolation was applied for smoothing. We used the GIZA++ default configuration.

4.3 Evaluation Metrics

Three different evaluation metrics have been computed, namely the General Text Matching (GTM) F-measure (Melamed et al., 2003), the BLEU score (n = 4) (Papineni et al., 2001), and the NIST score (n = 4) (Lin and Hovy, 2002). These metrics have proved to correlate well with both human adequacy and fluency. They all reward n-gram matches between the candidate translation and a set of reference translations. The larger the number of reference translations, the more reliable these measures are. Unfortunately, in our case, a single reference translation is available.

BLEU has become a 'de facto' standard nowadays in MT. Therefore, we discuss our results based on the BLEU score. However, it has several deficiencies that make it impractical for error analysis (Turian et al., 2003). First, BLEU does not have a clear interpretation. Second, BLEU is not adequate to work at the segment level but only at the document level. Third, in order to punish candidate translations that are too long/short, BLEU computes a heuristically motivated word penalty factor. In contrast, the GTM F-measure has an intuitive interpretation in the context of a bitext grid. It represents the fraction of the grid covered by aligned blocks. It also, by definition, works well at the segment level and punishes translations too divergent in
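The relative-frequency estimate of Equation 3 is simple enough to sketch in a few lines. The snippet below is illustrative only (not the authors' code, and the phrase pairs are an invented toy example); it assumes phrase pairs have already been extracted from the word-aligned corpus and, mirroring the paper's setup, applies no smoothing.

```python
from collections import Counter

def score_phrase_pairs(phrase_pairs):
    """Estimate P(target_phrase | source_phrase) by relative frequency.

    `phrase_pairs` is a list of (source_phrase, target_phrase) tuples,
    one per occurrence in the word-aligned training data.
    No smoothing is performed.
    """
    pair_counts = Counter(phrase_pairs)
    # Denominator of Equation 3: total alignments seen for each source phrase.
    source_counts = Counter(src for src, _ in phrase_pairs)
    return {
        (src, tgt): count / source_counts[src]
        for (src, tgt), count in pair_counts.items()
    }

# Toy data: "la casa" seen aligned to "the house" twice, "the home" once.
pairs = [("la casa", "the house"),
         ("la casa", "the house"),
         ("la casa", "the home")]
table = score_phrase_pairs(pairs)
```

With a real phrase table, the dictionary built here would then be serialized into whatever format the decoder at hand expects.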
Recommended publications
  • Integrating Optical Character Recognition and Machine Translation of Historical Documents
    Integrating Optical Character Recognition and Machine Translation of Historical Documents
    Haithem Afli and Andy Way
    ADAPT Centre, School of Computing, Dublin City University, Dublin, Ireland
    {haithem.afli, andy.way}@adaptcentre.ie
    Abstract
    Machine Translation (MT) plays a critical role in expanding capacity in the translation industry. However, many valuable documents, including digital documents, are encoded in non-accessible formats for machine processing (e.g., Historical or Legal documents). Such documents must be passed through a process of Optical Character Recognition (OCR) to render the text suitable for MT. No matter how good the OCR is, this process introduces recognition errors, which often renders MT ineffective. In this paper, we propose a new OCR to MT framework based on adding a new OCR error correction module to enhance the overall quality of translation. Experimentation shows that our new system correction, based on the combination of Language Modeling and Translation methods, outperforms the baseline system by nearly 30% relative improvement.
    1 Introduction
    While research on improving Optical Character Recognition (OCR) algorithms is ongoing, our assessment is that Machine Translation (MT) will continue to produce unacceptable translation errors (or non-translations) based solely on the automatic output of OCR systems. The problem comes from the fact that current OCR and Machine Translation systems are commercially distinct and separate technologies. There are often mistakes in the scanned texts as the OCR system occasionally misrecognizes letters and falsely identifies scanned text, leading to misspellings and linguistic errors in the output text (Niklas, 2010). Works involved in improving translation services purchase off-the-shelf OCR technology but have limited capability to adapt the OCR processing to improve overall machine translation performance.
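The abstract above combines language modeling and translation methods for OCR error correction, but the excerpt gives no implementation detail. The following is therefore only a generic sketch of the language-modeling half of such an idea: rescoring hand-supplied candidate corrections of an OCR token by corpus frequency. The corpus, candidates, and function names are all invented for illustration.

```python
from collections import Counter

# Invented toy corpus standing in for a large monolingual corpus.
corpus = "the old manuscript was written in the old language".split()
counts = Counter(corpus)

def rescore(ocr_token, candidates):
    """Return the candidate with the highest corpus frequency.

    In a real system the candidates would come from an OCR confusion
    model, and rescoring would use an n-gram language model (and, per
    the paper, translation methods). Falls back to the OCR token itself
    if no candidate is known to the toy language model.
    """
    scored = [(counts.get(c, 0), c) for c in candidates]
    best_count, best = max(scored)
    return best if best_count > 0 else ocr_token

# 'rnanuscript' is a classic OCR confusion: 'm' misread as 'rn'.
fixed = rescore("rnanuscript", ["manuscript", "rnanuscript"])
```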
  • Champollion: a Robust Parallel Text Sentence Aligner
    Champollion: A Robust Parallel Text Sentence Aligner
    Xiaoyi Ma
    Linguistic Data Consortium, 3600 Market St. Suite 810, Philadelphia, PA 19104
    [email protected]
    Abstract
    This paper describes Champollion, a lexicon-based sentence aligner designed for robust alignment of potentially noisy parallel text. Champollion increases the robustness of the alignment by assigning greater weights to less frequent translated words. Experiments on a manually aligned Chinese-English parallel corpus show that Champollion achieves high precision and recall on noisy data. Champollion can be easily ported to new language pairs. It is freely available to the public.
    1. Introduction
    Parallel text is a very valuable resource for a number of natural language processing tasks, including machine translation (Brown et al. 1993; Vogel and Tribble 2002; Yamada and Knight 2001), cross language information retrieval, and word disambiguation. Parallel text provides the maximum utility when it is sentence aligned. The sentence alignment process maps sentences in the source text to their translation. The labor intensive and time consuming nature of manual sentence alignment makes large parallel text corpus development difficult.
    … framework to find the maximum likelihood alignment of sentences. The length based approach works remarkably well on language pairs with high length correlation, such as French and English. Its performance degrades quickly, however, when the length correlation breaks down, such as in the case of Chinese and English. Even with language pairs with high length correlation, the Gale-Church algorithm may fail at regions that contain many sentences with similar length. A number of algorithms, such as (Wu 1994), try to overcome the weaknesses of length based approaches by utilizing lexical
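For contrast with Champollion's lexicon-based scoring, the length-based approach discussed in the introduction can be sketched as a small dynamic program over alignment "beads" (1-1, 1-0, 0-1, 2-1, 1-2 sentence groupings). The cost function and penalty values below are crude stand-ins for Gale and Church's probabilistic cost, invented purely for illustration.

```python
def align_by_length(src, tgt):
    """Align two lists of sentence lengths with a toy length-based cost.

    Returns a list of (n_src, n_tgt) beads, e.g. (1, 1) for a one-to-one
    pairing. A bead's cost is the absolute difference of the character
    totals it covers plus a fixed penalty for non-1-1 beads; real
    aligners use a probabilistic cost instead.
    """
    INF = float("inf")
    PENALTY = {(1, 1): 0, (1, 0): 10, (0, 1): 10, (2, 1): 5, (1, 2): 5}
    n, m = len(src), len(tgt)
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0
    for i in range(n + 1):          # forward dynamic program
        for j in range(m + 1):
            if best[i][j] == INF:
                continue
            for (di, dj), pen in PENALTY.items():
                if i + di <= n and j + dj <= m:
                    cost = abs(sum(src[i:i + di]) - sum(tgt[j:j + dj])) + pen
                    if best[i][j] + cost < best[i + di][j + dj]:
                        best[i + di][j + dj] = best[i][j] + cost
                        back[i + di][j + dj] = (di, dj)
    beads, i, j = [], n, m          # backtrace the cheapest path
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        beads.append((di, dj))
        i, j = i - di, j - dj
    return beads[::-1]

# Two source sentences (lengths 20, 44) against three target sentences
# (21, 23, 22): the cheapest path merges the last two targets into one bead.
beads = align_by_length([20, 44], [21, 23, 22])
```

When, as the introduction notes, length correlation breaks down, costs like this one stop discriminating between paths, which is exactly the failure mode Champollion's lexical weights address.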
  • Machine Translation for Academic Purposes Grace Hui-Chin Lin
    Proceedings of the International Conference on TESOL and Translation 2009, December 2009: pp. 133-148
    Machine Translation for Academic Purposes
    Grace Hui-chin Lin, PhD, Texas A&M University, College Station; Master of Science, University of Southern California
    Paul Shih Chieh Chien, Center for General Education, Taipei Medical University
    Abstract
    Due to the globalization trend and the knowledge boost of the second millennium, multilingual translation has become a noteworthy issue. For the purpose of learning knowledge in academic fields, Machine Translation (MT) deserves notice not only academically but also practically. Translation learners should be informed about MT because it is a valuable approach applied by professional translators in diverse professional fields. For learning translating skills and finding a way to learn and teach through bilingual/multilingual translating functions in software, machine translation is an ideal approach that translation trainers, translation learners, and professional translators should be familiar with. In fact, theories of machine translation and computer assistance have been highly valued by many scholars (e.g., Hutchins, 2003; Thriveni, 2002). Based on the rendering of MIT's OpenCourseWare into Chinese that Lee, Lin and Bonk (2007) have introduced, this paper demonstrates how MT can be efficiently applied as a superior way of teaching and learning. This article predicts that translated courses utilizing MT should soon emerge in industrialized nations for residents of the global village, and it exhibits an overview of the current developmental status of MT, why MT should be fully applied for academic purposes such as translating a textbook or teaching and learning a course, and what types of software can be successfully applied.
  • Machine Translation for Language Preservation
    Machine translation for language preservation
    Steven Bird (1,2), David Chiang (3)
    (1) Department of Computing and Information Systems, University of Melbourne (2) Linguistic Data Consortium, University of Pennsylvania (3) Information Sciences Institute, University of Southern California
    [email protected], [email protected]
    ABSTRACT
    Statistical machine translation has been remarkably successful for the world's well-resourced languages, and much effort is focussed on creating and exploiting rich resources such as treebanks and wordnets. Machine translation can also support the urgent task of documenting the world's endangered languages. The primary object of statistical translation models, bilingual aligned text, closely coincides with interlinear text, the primary artefact collected in documentary linguistics. It ought to be possible to exploit this similarity in order to improve the quantity and quality of documentation for a language. Yet there are many technical and logistical problems to be addressed, starting with the problem that – for most of the languages in question – no texts or lexicons exist. In this position paper, we examine these challenges, and report on a data collection effort involving 15 endangered languages spoken in the highlands of Papua New Guinea.
    KEYWORDS: endangered languages, documentary linguistics, language resources, bilingual texts, comparative lexicons.
    Proceedings of COLING 2012: Posters, pages 125-134, COLING 2012, Mumbai, December 2012.
    1 Introduction
    Most of the world's 6800 languages are relatively unstudied, even though they are no less important for scientific investigation than major world languages. For example, before Hixkaryana (Carib, Brazil) was discovered to have object-verb-subject word order, it was assumed that this word order was not possible in a human language, and that some principle of universal grammar must exist to account for this systematic gap (Derbyshire, 1977).
  • The Web As a Parallel Corpus
    The Web as a Parallel Corpus
    Philip Resnik (University of Maryland), Noah A. Smith (Johns Hopkins University)
    Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.
    1. Introduction
    Parallel corpora—bodies of text in parallel translation, also known as bitexts—have taken on an important role in machine translation and multilingual natural language processing. They represent resources for automatic lexical acquisition (e.g., Gale and Church 1991; Melamed 1997), they provide indispensable training data for statistical translation models (e.g., Brown et al. 1990; Melamed 2000; Och and Ney 2002), and they can provide the connection between vocabularies in cross-language information retrieval (e.g., Davis and Dunning 1995; Landauer and Littman 1990; see also Oard 1997). More recently, researchers at Johns Hopkins University and the
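A content-based measure of translational equivalence of the kind mentioned in the abstract can be approximated very crudely: given a bilingual lexicon, score a candidate document pair by the fraction of source tokens that have a listed translation present in the target document. This is not STRAND's actual measure, just an illustrative sketch; the tiny lexicon and threshold-free scoring are invented.

```python
def translation_overlap(source_words, target_words, lexicon):
    """Fraction of source tokens with a lexicon translation in the target.

    `lexicon` maps a source word to a set of possible translations.
    A score near 1.0 suggests the two documents may be mutual
    translations; a real system would also handle morphology,
    word order, and lexicon coverage gaps.
    """
    target = set(target_words)
    covered = sum(
        1 for w in source_words
        if lexicon.get(w, set()) & target
    )
    return covered / len(source_words) if source_words else 0.0

# Tiny invented English-Spanish lexicon, for illustration only.
lex = {"house": {"casa"}, "white": {"blanca", "blanco"}, "the": {"la", "el"}}
score = translation_overlap(["the", "white", "house"],
                            ["la", "casa", "blanca"], lex)
```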
  • Machine Translation and Computer-Assisted Translation: a New Way of Translating? Author: Olivia Craciunescu E-Mail: [email protected]
    Machine Translation and Computer-Assisted Translation: a New Way of Translating?
    Author: Olivia Craciunescu, E-mail: [email protected]
    Author: Constanza Gerding-Salas, E-mail: [email protected]
    Author: Susan Stringer-O'Keeffe, E-mail: [email protected]
    Source: http://www.translationdirectory.com
    The Need for Translation Technology
    Advances in information technology (IT) have combined with modern communication requirements to foster translation automation. The history of the relationship between technology and translation goes back to the beginnings of the Cold War, as in the 1950s competition between the United States and the Soviet Union was so intensive at every level that thousands of documents were translated from Russian to English and vice versa. However, such high demand revealed the inefficiency of the translation process, above all in specialized areas of knowledge, increasing interest in the idea
    IT has produced a screen culture that tends to replace the print culture, with printed documents being dispensed with and information being accessed and relayed directly through computers (e-mail, databases and other stored information). These computer documents are instantly available and can be opened and processed with far greater flexibility than printed matter, with the result that the status of information itself has changed, becoming either temporary or permanent according to need. Over the last two decades we have witnessed the enormous growth of information technology with the accompanying advantages of speed, visual
  • Language Service Translation SAVER Technote
    TechNote, August 2020
    LANGUAGE TRANSLATION APPLICATIONS
    AEL Number: 09ME-07-TRAN
    The U.S. Department of Homeland Security (DHS) established the System Assessment and Validation for Emergency Responders (SAVER) Program to help emergency responders improve their procurement decisions. Located within the Science and Technology Directorate, the National Urban Security Technology Laboratory (NUSTL) manages the SAVER Program and conducts objective operational assessments of commercial equipment and systems relevant to the emergency responder community. The SAVER Program gathers and reports information about equipment that falls within the categories listed in the DHS Authorized Equipment List.
    Language translation equipment used by emergency services has evolved into commercially available, mobile device applications (apps). The wide adoption of cell phones has enabled language translation applications to be useful in emergency scenarios where first responders may interact with individuals who speak a different language. Most language translation applications can identify a spoken language and translate in real-time. Applications vary in functionality, such as the ability to translate voice or text offline, and are unique in the number of available translatable languages. These applications are useful tools to remedy communication barriers in any emergency response scenario.
    Overview
    During an emergency response, language barriers can challenge a first responder's ability to quickly assess and respond to a situation. First responders must be able to accurately communicate with persons at an emergency scene in order to provide a prompt, appropriate response. The quick translations provided by mobile apps remove the language barrier and are essential to the safety of both first responders and civilians with whom they interact. Before the adoption of smartphones, first responders used language interpreters and later language translation equipment such as hand-held language translation devices (Figure 1).
  • Terminology Extraction, Translation Tools and Comparable Corpora Helena Blancafort, Béatrice Daille, Tatiana Gornostay, Ulrich Heid, Claude Méchoulam, Serge Sharoff
    TTC: Terminology Extraction, Translation Tools and Comparable Corpora
    Helena Blancafort, Béatrice Daille, Tatiana Gornostay, Ulrich Heid, Claude Méchoulam, Serge Sharoff
    To cite this version: Helena Blancafort, Béatrice Daille, Tatiana Gornostay, Ulrich Heid, Claude Méchoulam, et al. TTC: Terminology Extraction, Translation Tools and Comparable Corpora. 14th EURALEX International Congress, Jul 2010, Leeuwarden/Ljouwert, Netherlands. pp. 263-268. hal-00819365
    HAL Id: hal-00819365, https://hal.archives-ouvertes.fr/hal-00819365. Submitted on 2 May 2013. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
    Helena Blancafort, Syllabs, Universitat Pompeu Fabra, Barcelona, Spain; Béatrice Daille, LINA, Université de Nantes, France; Tatiana Gornostay, Tilde, Latvia; Ulrich Heid, IMS, Universität Stuttgart, Germany; Claude Méchoulam, Sogitec, France; Serge Sharoff, CTS, University of Leeds, England
    The need for linguistic resources in any natural language application is undeniable. Lexicons and terminologies
  • Acknowledging the Needs of Computer-Assisted Translation Tools Users: the Human Perspective in Human-Machine Translation Annemarie Taravella and Alain O. Villeneuve
    The Journal of Specialised Translation, Issue 19 – January 2013
    Acknowledging the needs of computer-assisted translation tools users: the human perspective in human-machine translation
    AnneMarie Taravella and Alain O. Villeneuve, Université de Sherbrooke, Quebec
    ABSTRACT
    The lack of translation specialists poses a problem for the growing translation markets around the world. One of the solutions proposed for the lack of human resources is automated translation tools. In the last few decades, organisations have had the opportunity to increase their use of technological resources. However, there is no consensus on the way that technological resources should be integrated into translation service providers (TSP). The approach taken by this article is to set aside both 100% human translation and 100% machine translation (without human intervention), to examine a third, more realistic solution: interactive translation, where humans and machines co-operate. What is the human role? Based on the conceptual framework of information systems and organisational sciences, we recommend giving users, who are mainly translators for whom interactive translation tools are designed, a fundamental role in the thinking surrounding the implementation of a technological tool.
    KEYWORDS
    Interactive translation, information system, implementation, business process reengineering, organisational science.
    1. Introduction
    Globalisation and the acceleration of world trade operations have led to an impressive growth of the global translation services market. According to the EUATC (European Union of Associations of Translation Companies), the translation industry is set to grow by around five percent annually in the foreseeable future (Hager, 2008). With its two official languages, Canada is a good example of an important translation market, where highly skilled professional translators serve not only the domestic market, but international clients as well.
  • Resourcing Machine Translation with Parallel Treebanks John Tinsley
    Resourcing Machine Translation with Parallel Treebanks
    John Tinsley
    A dissertation submitted in fulfilment of the requirements for the award of Doctor of Philosophy (Ph.D.) to the Dublin City University School of Computing. Supervisor: Prof. Andy Way. December 2009.
    I hereby certify that this material, which I now submit for assessment on the programme of study leading to the award of Ph.D. is entirely my own work, that I have exercised reasonable care to ensure that the work is original, and does not to the best of my knowledge breach any law of copyright, and has not been taken from the work of others save and to the extent that such work has been cited and acknowledged within the text of my work. Signed: (Candidate) ID No.: Date:
    Contents
    Abstract vii
    Acknowledgements viii
    List of Figures ix
    List of Tables x
    1 Introduction 1
    2 Background and the Current State-of-the-Art 7
      2.1 Parallel Treebanks 7
        2.1.1 Sub-sentential Alignment 9
        2.1.2 Automatic Approaches to Tree Alignment 12
      2.2 Phrase-Based Statistical Machine Translation 14
        2.2.1 Word Alignment 17
        2.2.2 Phrase Extraction and Translation Models 18
        2.2.3 Scoring and the Log-Linear Model 22
        2.2.4 Language Modelling 25
        2.2.5 Decoding 27
      2.3 Syntax-Based Machine Translation 29
        2.3.1 Statistical Transfer-Based MT 30
        2.3.2 Data-Oriented Translation 33
        2.3.3 Other Approaches 35
      2.4 MT Evaluation
  • MULTILINGUAL CHATBOT WITH HUMAN CONVERSATIONAL ABILITY (Aradhana Bisht, Gopan Doshi, Bhavna Arora, Suvarna Pansambal)
    International Journal of Future Generation Communication and Networking, Vol. 13, No. 1s, (2020), pp. 138-146
    MULTILINGUAL CHATBOT WITH HUMAN CONVERSATIONAL ABILITY
    [1] Aradhana Bisht, [2] Gopan Doshi, [3] Bhavna Arora, [4] Suvarna Pansambal
    [1][2] Student, Dept. of Computer Engineering, [3][4] Asst. Prof., Dept. of Computer Engineering, Atharva College of Engineering, Mumbai, India
    Abstract
    Chatbot technology has become very fascinating to people around the globe because of its ability to communicate with humans. Chatbots respond to user queries and are sometimes capable of executing sundry tasks. Their implementation is easier because of the wide availability of development platforms and language libraries. Most chatbots support the English language only, and very few have the skill to communicate in multiple languages. In this paper we propose an idea to build a chatbot that can communicate in as many languages as Google Translate supports, and that will also be capable of human-like conversation. This can be done by using various technologies such as Natural Language Processing (NLP) techniques and Sequence-to-Sequence modeling with an encoder-decoder architecture [12]. We aim to build a chatbot which will act as a virtual assistant and will have the ability to hold conversations more like human-to-human than human-to-bot, and will also be able to communicate in multiple languages.
    Keywords: Chatbot, Multilingual, Communication, Human Conversational, Virtual agent, NLP, GNMT.
    1. Introduction
    A chatbot is a virtual agent for conversation, which is capable of answering user queries in the form of text or speech. In other words, a chatbot is a software application/program that can chat with a user on any topic [5].
  • Towards Controlled Counterfactual Generation for Text
    The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)
    Generate Your Counterfactuals: Towards Controlled Counterfactual Generation for Text
    Nishtha Madaan, Inkit Padhi, Naveen Panwar, Diptikalyan Saha
    1 IBM Research AI
    {nishthamadaan, naveen.panwar, diptsaha}@in.ibm.com, [email protected]
    Abstract
    Machine Learning has seen tremendous growth recently, which has led to a larger adoption of ML systems for educational assessments, credit risk, healthcare, employment, criminal justice, to name a few. Trustworthiness of ML and NLP systems is a crucial aspect and requires a guarantee that the decisions they make are fair and robust. Aligned with this, we propose a framework GYC, to generate a set of counterfactual text samples, which are crucial for testing these ML systems. Our main contributions include a) We introduce GYC, a framework to generate counterfactual samples such that the generation is plausible, diverse, goal-oriented and effective, b) We generate counterfactual samples, that can direct
    … diversity will ensure high coverage of the input space defined by the goal. In this paper, we aim to generate such counterfactual text samples which are also effective in finding test-failures (i.e. label flips for NLP classifiers). Recent years have seen a tremendous increase in the work on fairness testing (Ma, Wang, and Liu 2020; Holstein et al. 2019) which are capable of generating a large number of test-cases that capture the model's ability to misclassify or remove unwanted bias around specific protected attributes (Huang et al. 2019), (Garg et al. 2019). This is not only limited to fairness but the community has seen great interest in building robust models susceptible to adversarial changes (Goodfellow, Shlens, and Szegedy 2014; Michel et al.