Proceedings of the Joint Workshop on Exploiting Synergies Between

Total Page:16

File Type:pdf, Size:1020Kb

Proceedings of the Joint Workshop on Exploiting Synergies Between EACL 2012 Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra) at EACL-2012 Proceedings of the Workshop c 2012 The Association for Computational Linguistics ISBN 978-1-937284-19-0 Order copies of this and other ACL proceedings from: Association for Computational Linguistics (ACL) 209 N. Eighth Street Stroudsburg, PA 18360 USA Tel: +1-570-476-8006 Fax: +1-570-476-0860 [email protected] ii Introduction Welcome to the Joint EACL Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra). This two-day workshop addresses two specific but related research problems in computational linguistics. The ESIRMT event (1st day) aims at reducing the gap, both theoretical and practical, between information retrieval and machine translation research and applications. Although both fields have been already contributing to each other instrumentally, there is still too much work to be done in relation to solidly framing these two fields into a common ground of knowledge from both the procedural and paradigmatic perspectives. The HyTra event (2nd day) aims at sharing ideas among researchers developing and applying statistical, example-based, or rule-based machine translation systems and who wish to enhance their systems with elements from the other approaches. The joint workshop provides participants with the opportunity of discussing research related to technology integration and system combination strategies at both the general level of cross-language information access and the specific level of machine translation technologies. This workshop has been supported by the Seventh Framework Programme of the European Comission through the T4ME (METANET) contract (grant agreement no.: 249119), through the TTC contract (grant agreement no.: 248005), through the Marie Curie HyghTra contract and by the Spanish Ministry of Economy and Competivity through the BUCEADOR project (TEC2009-14094-C04-01) and the Juan de la Cierva fellowship program. We would like to thank all people who in one way or another helped in making this workshop a success. Our special thanks go to our plenary speakers, to the speakers of our invited project session, to our sponsors, to the participants of the panel discussion, to the members of the program committee who did an excellent job in reviewing the submitted papers, and to the EACL organizers, in particular the workshop general chairs Kristiina Jokinen and Alessandro Moschitti. Last but not least we would like to thank our authors and the participants of the workshop. The Organizers Avignon, France, April 2012 iii Organizers ESIRMT: Marta R. Costa-jussa` (Barcelona Media Innovation Center) Patrik Lambert (University of Le Mans) Rafael E. Banchs (Institute for Infocomm Research) Organizers HyTra: Reinhard Rapp (Universities of Mainz and Leeds) Bogdan Babych (University of Leeds) Kurt Eberle (Lingenio GmbH) Tony Hartley (Toyohashi University of Technology and University of Leeds) Serge Sharoff (University of Leeds) Martin Thomas (University of Leeds) Program Committee: Jordi Atserias, Yahoo! Research , Barcelona, Spain Sivaji Bandyopadhyay, Jadavpur University, Kolkata, India Nuria´ Bel, Universitat Pompeu Fabra, Barcelona, Spain Pierrette Bouillon, ISSCO/TIM/ETI, University of Geneva, Switzerland Chris Callison-Burch, Johns Hopkins University, Baltimore, USA Michael Carl, Copenhagen Business School, Denmark Oliver Culo, ICSI, University of California, Berkeley, USA Andreas Eisele, Directorate-General for Translation, European Commission, Luxembourg Marcello Federico, Fondazione Bruno Kessler, Trento, Italy Jose´ A. R. Fonollosa, Universitat Politecnica` de Catalunya, Barcelona, Spain Mikel Forcada, University of Alicante, Spain Alexander Fraser, Institute for Natural Language Processing (IMS), Stuttgart, Germany Johanna Geiß, Lingenio GmbH, Heidelberg, Germany Mireia Ginesti-Rosell, Lingenio GmbH, Heidelberg, Germany Silvia Hansen-Schirra, FTSK, University of Mainz, Germany Gareth Jones, Dublin City University, Ireland Min-Yen Kan, National University of Singapore Udo Kruschwitz, University of Essex, UK Yanjun Ma, Baidu Inc. Beijing, China Maite Melero, Barcelona Media Innovation Center, Barcelona, Spain Haizhou Li, Institute for Infocomm Research, Singapore Paul Schmidt, Institut for Applied Information Science, Saarbrucken,¨ Germany Uta Seewald-Heeg, Anhalt University of Applied Sciences, Kothen,¨ Germany Nasredine Semmar, CEA LIST, Fontenay-aux-Roses, France Wade Shen, Massachusetts Institute of Technology, Cambridge, USA Fabrizio Silvestri, Instituto de Scienza e Tecnologia del’Informazione, Pisa, Italy Harold Somers, CNGL, Dublin City University, Ireland Anders Søgaard, University of Copenhagen, Denmark Jorg¨ Tiedemann, University of Uppsala, Sweden Zygmunt Vetulani, University of Poznan, Poland v Invited Speakers: Christof Monz (University of Amsterdam) Philipp Koehn (University of Edinburgh) Speakers of the Invited Project Session: Cristina Vertan George Tambouratzis, Marina Vassiliou and Sokratis Sofianopoulos John Tinsley, Alexandru Ceausu and Jian Zhang Svetla Koeva vi Table of Contents Semantic Web based Machine Translation Bettina Harriehausen-Muhlbauer¨ and Timm Heuss . 1 Measuring Comparability of Documents in Non-Parallel Corpora for Efficient Extraction of (Semi- )Parallel Translation Equivalents Fangzhong Su and Bogdan Babych . 10 Full Machine Translation for Factoid Question Answering Cristina Espana-Bonet˜ and Pere R. Comas. .20 An Empirical Evaluation of Stop Word Removal in Statistical Machine Translation Tze Yuang Chong, Rafael Banchs and Eng Siong Chng . 30 Natural Language Descriptions of Visual Scenes Corpus Generation and Analysis Muhammad Usman Ghani Khan, Rao Muhammad Adeel Nawab and Yoshihiko Gotoh . 38 Combining EBMT, SMT, TM and IR Technologies for Quality and Scale Sandipan Dandapat, Sara Morrissey, Andy Way and Josef van Genabith . 48 Two approaches for integrating translation and retrieval in real applications Cristina Vertan . 59 PRESEMT: Pattern Recognition-based Statistically Enhanced MT George Tambouratzis, Marina Vassiliou and Sokratis Sofianopoulos . 65 PLUTO: Automated Solutions for Patent Translation John Tinsley, Alexandru Ceausu and Jian Zhang . 69 ATLAS - Human Language Technologies integrated within a Multilingual Web Content Management System SvetlaKoeva........................................................................... 72 Tree-based Hybrid Machine Translation Andreas Søeborg Kirkedal . 77 Were the clocks striking or surprising? Using WSD to improve MT performance Spelaˇ Vintar, Darja Fiserˇ and Aljosaˇ Vrsˇcaj................................................ˇ 87 Bootstrapping Method for Chunk Alignment in Phrase Based SMT Santanu Pal and Sivaji Bandyopadhyay . 93 Design of a hybrid high quality machine translation system Bogdan Babych, Kurt Eberle, Johanna Geiß, Mireia Ginest´ı-Rosell, Anthony Hartley, Reinhard Rapp, Serge Sharoff and Martin Thomas . 101 Can Machine Learning Algorithms Improve Phrase Selection in Hybrid Machine Translation? Christian Federmann . 113 Linguistically-Augmented Bulgarian-to-English Statistical Machine Translation Model Rui Wang, Petya Osenova and Kiril Simov . 119 Using Sense-labeled Discourse Connectives for Statistical Machine Translation Thomas Meyer and Andrei Popescu-Belis . 129 vii ESIRMT-HyTra Workshop Program Monday 23rd (9:00-9:30) Workshop Presentation (9:30-10:30) Invited Talk Christof Monz (10:30-11:00) Coffee break (11:00-11:30) ESIRMT Morning Session Semantic Web based Machine Translation Bettina Harriehausen-Muhlbauer¨ and Timm Heuss (11:30-12:00) Measuring Comparability of Documents in Non-Parallel Corpora for Efficient Ex- traction of (Semi-)Parallel Translation Equivalents Fangzhong Su and Bogdan Babych (12:00-12:30) Full Machine Translation for Factoid Question Answering Cristina Espana-Bonet˜ and Pere R. Comas (12:30-13:00) An Empirical Evaluation of Stop Word Removal in Statistical Machine Translation Tze Yuang Chong, Rafael Banchs and Eng Siong Chng ix Monday 23rd (continued) (13:00-15:00) Lunch break (15:00-15:30) ESIRMT Afternoon Session Natural Language Descriptions of Visual Scenes Corpus Generation and Analysis Muhammad Usman Ghani Khan, Rao Muhammad Adeel Nawab and Yoshihiko Gotoh (15:30-16:00) Combining EBMT, SMT, TM and IR Technologies for Quality and Scale Sandipan Dandapat, Sara Morrissey, Andy Way and Josef van Genabith (16:00-16:30) Coffee break (16:30-17:00) Project session Two approaches for integrating translation and retrieval in real applications Cristina Vertan (17:00-17:30) PRESEMT: Pattern Recognition-based Statistically Enhanced MT George Tambouratzis, Marina Vassiliou and Sokratis Sofianopoulos (17:30-18:00) PLUTO: Automated Solutions for Patent Translation John Tinsley, Alexandru Ceausu and Jian Zhang x Monday 23rd (continued) (18:00-18:30) ATLAS - Human Language Technologies integrated within a Multilingual Web Content Management System Svetla Koeva Tuesday 24th (9:00-9:30) HyTRA 1st Morning Session (9:30-10:00) Tree-based Hybrid Machine Translation Andreas Søeborg Kirkedal (10:00-10:30) Coffee break (10:30-11:30) Invited Talk Philipp Koehn (11:30-12:00) HyTRA 2nd Morning Session Were the clocks striking or surprising? Using WSD to improve MT performance Spelaˇ Vintar, Darja Fiserˇ and Aljosaˇ Vrsˇcajˇ (12:00-12:30) Bootstrapping Method for Chunk Alignment in Phrase Based SMT Santanu Pal and Sivaji Bandyopadhyay xi
Recommended publications
  • Final Study Report on CEF Automated Translation Value Proposition in the Context of the European LT Market/Ecosystem
    Final study report on CEF Automated Translation value proposition in the context of the European LT market/ecosystem FINAL REPORT A study prepared for the European Commission DG Communications Networks, Content & Technology by: Digital Single Market CEF AT value proposition in the context of the European LT market/ecosystem Final Study Report This study was carried out for the European Commission by Luc MEERTENS 2 Khalid CHOUKRI Stefania AGUZZI Andrejs VASILJEVS Internal identification Contract number: 2017/S 108-216374 SMART number: 2016/0103 DISCLAIMER By the European Commission, Directorate-General of Communications Networks, Content & Technology. The information and views set out in this publication are those of the author(s) and do not necessarily reflect the official opinion of the Commission. The Commission does not guarantee the accuracy of the data included in this study. Neither the Commission nor any person acting on the Commission’s behalf may be held responsible for the use which may be made of the information contained therein. ISBN 978-92-76-00783-8 doi: 10.2759/142151 © European Union, 2019. All rights reserved. Certain parts are licensed under conditions to the EU. Reproduction is authorised provided the source is acknowledged. 2 CEF AT value proposition in the context of the European LT market/ecosystem Final Study Report CONTENTS Table of figures ................................................................................................................................................ 7 List of tables ..................................................................................................................................................
    [Show full text]
  • Ruken C¸Akici
    RUKEN C¸AKICI personal information Official Ruket C¸akıcı Name Born in Turkey, 23 June 1978 email [email protected] website http://www.ceng.metu.edu.tr/˜ruken phone (H) +90 (312) 210 6968 · (M) +90 (532) 557 8035 work experience 2010- Instructor, METU METU Research and Teaching duties 1999-2010 Research Assistant, METU — Ankara METU Teaching assistantship of various courses education 2002-2008 University of Edinburgh, UK Doctor of School: School of Informatics Philosophy Thesis: Wide-Coverage Parsing for Turkish Advisors: Prof. Mark Steedman & Prof. Miles Osborne 1999-2002 Middle East Technical University Master of Science School: Computer Engineering Thesis: A Computational Interface for Syntax and Morphemic Lexicons Advisor: Prof. Cem Bozs¸ahin 1995-1999 Middle East Technical University Bachelor of Science School: Computer Engineering projects 1999-2001 AppTek/ Lernout & Hauspie Inc Language Pairing on Functional Structure: Lexical- Functional Grammar Based Machine Translation for English – Turkish. 150000USD. · Consultant developer 2007-2011 TUB¨ MEDID Turkish Discourse Treebank Project, TUBITAK 1001 program (107E156), 137183 TRY. · Researcher · (Now part of COST Action IS1312 (TextLink)) 2012-2015 Unsupervised Learning Methods for Turkish Natural Language Processing, METU BAP Project (BAP-08-11-2012-116), 30000 TRY. · Primary Investigator 2013-2015 TwiTR: Turkc¸e¨ ic¸in Sosyal Aglarda˘ Olay Bulma ve Bulunan Olaylar ic¸in Konu Tahmini (TwiTR: Event detection and Topic identification for events in social networks for Turkish language), TUBITAK 1001 program (112E275), 110750 TRY. · Researcher· (Now Part of ICT COST Action IC1203 (ENERGIC)) 2013-2016 Understanding Images and Visualizing Text: Semantic Inference and Retrieval by Integrating Computer Vision and Natural Language Processing, TUBITAK 1001 program (113E116), 318112 TRY.
    [Show full text]
  • Statistical Machine Translation from Slovenian to English
    Journal of Computing and Information Technology - CIT 15, 2007, 1, 47–59 47 doi:10.2498/cit.1000760 Statistical Machine Translation from Slovenian to English Mirjam Sepesy Maucec and Zdravko Kacic Faculty of Electrical Engineering and Computer Science, University of Maribor In this paper, we analyse three statistical models for the The historical enlargement of the EU has brought machine translation of Slovenian into English. All of many new and challenging language pairs for them are based on the IBM Model 4, but differ in the type of linguistic knowledge they use. Model 4a uses only machine translation. A lot of work has been basic linguistic units of the text, i.e., words and sentences. done on Czech [4], Polish [12], Croatian [3], In Model 4b, lemmatisation is used as a preprocessing Serbian [20] and not at last Slovenian [6]. step of the translation task. Lemmatisation also makes it possible to add a Slovenian-English dictionary as an The Czech-English machine translation system additional knowledge source. Model 4c takes advantage of the morpho-syntactic descriptions (MSD) of words. is based on dependency trees. Dependency trees In Model 4c, MSD codes replace the automatic word represent the sentence structure, as concentrated classes used in Models 4a and 4b. The models are around the verb. The presented system was out- experimentally evaluated using the IJS-ELAN parallel corpus. performed by the statistical translation system GIZA++/ISI ReWrite Decoder [16, 10, 18], Keywords: statistical machine translation, translation trained on the same corpus. model, lemmatisation, morpho-syntactic description, Slovenian language. The Polish-English MT system[12] uses an elec- tronic dictionary annotated for morphological, syntactic, and partly semantic information.
    [Show full text]
  • The Openhart 2013 Evalua on Workshop
    Welcome to the OpenHaRT 2013 Evalua8on Workshop Informaon Technology Laboratory nist.gov/itl Informaon Access Division nist.gov/itl/iad Mark Przybocki, Mul(modal Informaon Group nist.gov/itl/iad/mig August 23rd, 2013 Washington D.C. , Omni Shoreham hotel The Mul8modal Informa8on Group’s Project Areas § Speech Recogni(on § Speaker Recogni(on § Dialog Management § Human Assisted Speaker Recogni(on § Topic Detec(on and Tracking § Speaker Segmentaon § Spoken Document Retrieval § Language Recogni(on § Voice Biometrics § ANSI/NIST-ITL Standard Voice Record § Tracking (Person/Object) § Text-to-Text § Event Detec(on § Speech-to-Text § Event Recoun(ng § Speech-to-Speech § Predic(ve Video Analy(cs § Image-to-Text § Metric Development § Named En(ty Iden(ficaon § (new) Data Analy(cs § Automac Content Extrac(on 2 Defini8on: MIG’s Evalua8on Cycle Evalua'on Driven Research NIST Data NIST Researchers Performance Planning NIST Core technology Assessment development Analysis and NIST NIST Workshop 3 NIST’s MT Program’s Legacy – Past 10 Years • 27 Evaluaon Events -- tracking the state-of-the-art in performance – (4) technology types text-text speech-text speech-speech Handwri[en_Images-text – (9) languages Arabic-2, Chinese, Dari, Farsi, Hindi, Korean, Pashto, Urdu, English • 11 genres of structured and unstructured content (nwire, web, Bnews, Bconv, food, speeches, editorials, handwri(ng-2, blogs, SMS, dialogs) • 60 Evaluaon Test Sets available to MT researchers (source, references, metrics, sample system output and official results for comparison) • Over 85 research groups >400 Primary Systems Evaluated AFRL – American Univ. Cairo – Apptek – ARL – BBN – BYU – Cambridge – Chinese Acad. Sci. – 5% 3% 1% CMU – Columbia Univ. – Fujitsu Research – 11% OpenMT Google – IBM – JHU – Kansas State – KCSL – Language Weaver – Microsoh Research – Ohio TIDES State – Oxford – Qatar – Queen Mary (London) – 23% 57% TRANSTAC RWTH Aachen – SAIC - Sakhr – SRI – Stanford – Systran – UMD – USC ISI – Univ.
    [Show full text]
  • Statistical Machine Translation from English to Tuvan*
    Statistical Machine Translation from English to Tuvan* Rachel Killackey, Swarthmore College rkillac [email protected] Linguistics Senior Thesis 2013 Abstract This thesis aims to describe and analyze findings of the Tuvan Machine Translation Project, which attempts to create a functional statistical machine translation (SMT) model between English and Tuvan, a minority language spoken in southern Siberia. Though most Tuvan speakers are also fluent in Russian, easily accessible SMT technology would allow for simpler English translation without the use of Russian as an intermediary language. The English to Tuvan half of the system that I examine makes consistent morphological errors, particularly involving the absence of the accusative suffix with the basic form -ni. Along with a typological analysis of these errors, I show that the introduction of novel data that corrects for the missing accusative suffix can improve the performance of an SMT system. This result leads me to conclude that SMT can be a useful avenue for efficient translation. However, I also argue that SMT may benefit from the incorporation of some linguistic knowledge such as morphological rules in the early steps of creating a system. 1. Introduction This thesis explores the field of machine translation (MT), the use of computers in rendering one natural language into another, with a specific focus on MT between English and Tuvan, a Turkic language spoken in south central Siberia. While MT is a growing force in the translation of major languages with millions of speakers such as French, Spanish, and Russian, minority and non-dominant languages with relatively few numbers of speakers have been largely ignored.
    [Show full text]
  • Making Amharic to English Language Translator For
    Hana Demas Making Amharic to English Language Translator for iOS Helsinki Metropolia University of Applied Sciences Degree Programme In Information Technology Thesis Date 5.5.2016 2 Author(s) Hana Belete Demas Title Amharic To English Language Translator For iOS Number of Pages 54 pages + 1 appendice Date 5 May 2016 Degree Information Technology Engineering Degree Programme Information Technology Specialisation option Software Engineering Instructor(s) Petri Vesikivi The purpose of this project was to build a language translator for Amharic-English language pair, which in the beginning of the project was not supported by any of the known translation systems. The goal of this project was to make a language translator application for Amharic English language pair using swift language for iOS platform. The project has two components. The first one is the language translator application described above and the second component is an integrated Amharic custom keyboard which makes the user able to type Amharic letters which are not supported by iOS 9 system keyboard. The Amharic language has more than 250 letters and numbers and they are represented using extended keys. The project was implemented using the Swift language. At the end of the project an iOS application to translate English to Amharic and vice versa was made. The translator applications uses the translation system which was built on the Microsoft Translator Hub and accessed using Microsoft Translator API. The application can be used to translate texts from Amharic to English or vice versa. Keywords API, iOS, Custom Keyboard, Swift, Microsoft Translator Hub 3 Contents 1. Introduction ............................................................................................................... 1 2.
    [Show full text]
  • Proceedings of the 5Th Conference on Machine
    EMNLP 2020 Fifth Conference on Machine Translation Proceedings of the Conference November 19-20, 2020 Online c 2020 The Association for Computational Linguistics Order copies of this and other ACL proceedings from: Association for Computational Linguistics (ACL) 209 N. Eighth Street Stroudsburg, PA 18360 USA Tel: +1-570-476-8006 Fax: +1-570-476-0860 [email protected] ISBN 978-1-948087-81-0 ii Introduction The Fifth Conference on Machine Translation (WMT 2020) took place on Thursday, November 19 and Friday, November 20, 2020 immediately following the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020). This is the fifth time WMT has been held as a conference. The first time WMT was held as a conference was at ACL 2016 in Berlin, Germany, the second time at EMNLP 2017 in Copenhagen, Denmark, the third time at EMNLP 2018 in Brussels, Belgium, and the fourth time at ACL 2019 in Florence, Italy. Prior to being a conference, WMT was held 10 times as a workshop. WMT was held for the first time at HLT-NAACL 2006 in New York City, USA. In the following years the Workshop on Statistical Machine Translation was held at ACL 2007 in Prague, Czech Republic, ACL 2008, Columbus, Ohio, USA, EACL 2009 in Athens, Greece, ACL 2010 in Uppsala, Sweden, EMNLP 2011 in Edinburgh, Scotland, NAACL 2012 in Montreal, Canada, ACL 2013 in Sofia, Bulgaria, ACL 2014 in Baltimore, USA, EMNLP 2015 in Lisbon, Portugal. The focus of our conference is to bring together researchers from the area of machine translation and invite selected research papers to be presented at the conference.
    [Show full text]
  • Improvements in RWTH LVCSR Evaluation Systems for Polish, Portuguese, English, Urdu, and Arabic
    Improvements in RWTH LVCSR Evaluation Systems for Polish, Portuguese, English, Urdu, and Arabic M. Ali Basha Shaik1, Zoltan Tuske¨ 1, M. Ali Tahir1, Markus Nußbaum-Thom1, Ralf Schluter¨ 1, Hermann Ney1;2 1Human Language Technology and Pattern Recognition – Computer Science Department RWTH Aachen University, 52056 Aachen, Germany 2Spoken Language Processing Group, LIMSI CNRS, Paris, France f shaik, tuske, tahir, nussbaum, schlueter, ney [email protected] Abstract acoustical mismatch between the training and testing can be re- In this work, Portuguese, Polish, English, Urdu, and Arabic duced for any target language, by exploiting matched data from automatic speech recognition evaluation systems developed by other languages [8]. Alternatively, a few attempts have been the RWTH Aachen University are presented. Our LVCSR sys- made to incorporate GMM within a framework of deep neu- tems focus on various domains like broadcast news, sponta- ral networks (DNN). The joint training of GMM and shallow neous speech, and podcasts. All these systems but Urdu are bottleneck features was proposed using the sequence MMI cri- used for Euronews and Skynews evaluations as part of the EU- terion, in which the derivatives of the error function are com- Bridge project. Our previously developed LVCSR systems were puted with respect to GMM parameters and applied the chain improved using different techniques for the aforementioned lan- rule to update the GMM simultaneously with the bottleneck guages. Significant improvements are obtained using multi- features [9]. On the other hand, a different GMM-DNN integra- lingual tandem and hybrid approaches, minimum phone error tion approach was proposed in [10] by taking advantage of the training, lexical adaptation, open vocabulary long short term softmax layer, which defines a log-linear model.
    [Show full text]
  • Initial Considerations in Building a Speech-To-Speech Translation System for the Slovenian-English Language Pair
    Initial Considerations in Building a Speech-to-Speech Translation System for the Slovenian-English Language Pair J. Žganec Gros1, A. Mihelič1, M. Žganec1, F. Mihelič2, S. Dobrišek2, J. Žibert2, Š. Vin- tar2, T. Korošec2, T. Erjavec3, M. Romih4 1Alpineon R&D, Ulica Iga Grudna 15, SI-1000 Ljubljana, Slovenia 2University of Ljubljana, SI-1000 Ljubljana, Slovenia 3Jožef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia 4Amebis, Bakovnik 3, SI-1241 Kamnik, Slovenia E-mail: [email protected] Abstract. The paper presents the design concept of the VoiceTRAN Communicator that in- tegrates speech recognition, machine translation and text-to-speech synthesis using the DARPA Galaxy architecture. The aim of the project is to build a robust multimodal speech-to-speech translation communicator able to translate simple domain-specific sentences in the Slove- nian-English language pair. The project represents a joint collaboration between several Slovenian research organizations that are active in human language technologies. We pro- vide an overview of the task, describe the system architecture and individual servers. Fur- ther we describe the language resources that will be used and developed within the project. We conclude the paper with plans for evaluation of the VoiceTRAN Communicator. 1. Introduction point of the interaction. Consequently, this com- promises the flexibility and naturalness of using Automatic speech-to-speech (STS) translation the system. systems aim to facilitate communication among The VoiceTRAN Communicator is being built people who speak in different languages (Lavie within a national Slovenian research project in- et al., 1997), (Wahlster, 2000), (Lavie et al., volving 5 partners: Alpineon, the University of 2002).
    [Show full text]
  • The Sloleks Morphological Lexicon and Its Future Development Kaja Dobrovoljc, Simon Krek and Tomaž Erjavec
    Kaja Dobrovoljc, Simon Krek and Tomaž Erjavec The Sloleks Morphological Lexicon and its Future Development Kaja Dobrovoljc, Simon Krek and Tomaž Erjavec Abstract This paper presents Sloleks, the largest open-source machine-readable mor- phological lexicon of the Slovene language to date. We first briefly present its development and the formal grammar behind it, and then provide a detailed presentation of the types and structure of inflectional, derivational, grammat- ical and other included information, with a special emphasis on its formal representation within the standardized XML LMF framework. Given that Sloleks is a strong candidate to be used in the compilation of a new diction- ary of modern Slovene, both as a source of morphological information and as a background resource for the language technology tools needed to create it, the second part of the presentation explores the most important aspects of its future development, in particular the expansion of its entry list, addi- tion of pronunciation information, normative categorization of variants and a corpus-based re-evaluation of the existing inflectional paradigms. Such an extensive usage-based open-source morphological lexicon of modern Slovene with a unified system of morphological description will have a long-term use for both language technologies and for other born-digital reference works for the Slovene language. Keywords: morphological lexicon, lexicon of inflected forms, machine read- able dictionary, morphology, inflection, derivation, pronunciation, language standardisation 42 DICTIONARY OF MODERN SLOVENE: PROBLEMS AND SOLUTIONS THE SLOLEKS MORPHOLOGICAL LEXICON AND ITS FUTURE DEVELOPMENT 1 INTRODUCTION When it comes to morphologically rich languages, such as Slovene, the description of morphological paradigms of inflected parts of speech is traditionally very impor- tant.
    [Show full text]
  • Neural Speech Translation at Apptek
    Neural Speech Translation at AppTek Evgeny Matusov, Patrick Wilken, Parnia Bahar∗, Julian Schamper∗, Pavel Golik, Albert Zeyer∗, Joan Albert Silvestre-Cerda`+, Adria` Mart´ınez-Villaronga+, Hendrik Pesch, and Jan-Thorsten Peter Applications Technology (AppTek), Aachen, Germany ematusov,pwilken,pbahar,jschamper,pgolik,jsilvestre,amartinez,hpesch,jtpeter @apptek.com { } + ∗Also RWTH Aachen University, Germany Also Universitat Politecnica` de Valencia,` Spain Abstract processing, and punctuation prediction. Section 4 gives an This work describes AppTek’s speech translation pipeline overview of our ASR system. Section 5 describes the details that includes strong state-of-the-art automatic speech recog- of AppTek’s NMT system. Section 6 gives details of how nition (ASR) and neural machine translation (NMT) com- ASR confusion networks can be encoded as an input of the ponents. We show how these components can be tightly NMT system. In Section 7, we describe our direct speech coupled by encoding ASR confusion networks, as well as translation prototype. The results of our speech translation ASR-like noise adaptation, vocabulary normalization, and experiments are summarized in Section 8. implicit punctuation prediction during translation. In another experimental setup, we propose a direct speech translation 2. Related Work approach that can be scaled to translation tasks with large amounts of text-only parallel training data but a limited num- Theoretical background for tighter coupling of statistical ber of hours of recorded and human-translated speech. ASR and MT systems had been first published in [3]. In prac- tice, it was realized e. g. as statistical phrase-based transla- tion of ASR word lattices with acoustic and language model 1.
    [Show full text]
  • Initial Considerations in Building a Speech-To-Speech Translation System for the Slovenian-English Language Pair
    Initial Considerations in Building a Speech-to-Speech Translation System for the Slovenian-English Language Pair J. Žganec Gros1, A. Mihelič1, M. Žganec1, F. Mihelič2, S. Dobrišek2, J. Žibert2, Š. Vin- tar2, T. Korošec2, T. Erjavec3, M. Romih4 1Alpineon R&D, Ulica Iga Grudna 15, SI-1000 Ljubljana, Slovenia 2University of Ljubljana, SI-1000 Ljubljana, Slovenia 3Jožef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia 4Amebis, Bakovnik 3, SI-1241 Kamnik, Slovenia E-mail: [email protected] Abstract. The paper presents the design concept of the VoiceTRAN Communicator that in- tegrates speech recognition, machine translation and text-to-speech synthesis using the DARPA Galaxy architecture. The aim of the project is to build a robust multimodal speech-to-speech translation communicator able to translate simple domain-specific sentences in the Slove- nian-English language pair. The project represents a joint collaboration between several Slovenian research organizations that are active in human language technologies. We pro- vide an overview of the task, describe the system architecture and individual servers. Fur- ther we describe the language resources that will be used and developed within the project. We conclude the paper with plans for evaluation of the VoiceTRAN Communicator. 1. Introduction point of the interaction. Consequently, this com- promises the flexibility and naturalness of using Automatic speech-to-speech (STS) translation the system. systems aim to facilitate communication among The VoiceTRAN Communicator is being built people who speak in different languages (Lavie within a national Slovenian research project in- et al., 1997), (Wahlster, 2000), (Lavie et al., volving 5 partners: Alpineon, the University of 2002).
    [Show full text]