Machine Translation Summit XVI


http://www.mtsummit2017.org

Proceedings of MT Summit XVI, Vol. 1: MT Research Track
September 18-22, 2017, Nagoya, Aichi, Japan

Editors: Sadao Kurohashi (Kyoto University) and Pascale Fung (Hong Kong University of Science and Technology)

Co-hosted by:
International Association for Machine Translation (http://www.eamt.org/iamt.php)
Asia-Pacific Association for Machine Translation (http://www.aamt.info)
Graduate School of Informatics, Nagoya University (http://www.is.nagoya-u.ac.jp/index_en.html)

©2017 The Authors. These articles are licensed under a Creative Commons Attribution-NoDerivatives 3.0 license (CC-BY-ND).

Research Program Co-chairs

Sadao Kurohashi, Kyoto University
Pascale Fung, Hong Kong University of Science and Technology

Research Program Committee

Yuki Arase, Osaka University
Laurent Besacier, LIG
Houda Bouamor, Carnegie Mellon University
Hailong Cao, Harbin Institute of Technology
Michael Carl, Copenhagen Business School
Daniel Cer, Google
Boxing Chen, NRC
Colin Cherry, NRC
Chenhui Chu, Osaka University
Marta R. Costa-jussà, Universitat Politècnica de Catalunya
Steve DeNeefe, SDL Language Weaver
Markus Dreyer, Amazon
Kevin Duh, Johns Hopkins University
Andreas Eisele, DGT, European Commission
Christian Federmann, Microsoft Research
Minwei Feng, IBM Watson Group
Mikel L. Forcada, Universitat d'Alacant
George Foster, National Research Council
Isao Goto, NHK
Spence Green, Lilt Inc.
Eva Hasler, SDL
Yifan He, Bosch Research and Technology Center
Philipp Koehn, Johns Hopkins University
Roland Kuhn, National Research Council of Canada
Shankar Kumar, Google
Anoop Kunchukuttan, IIT Bombay
Jong-Hyeok Lee, Pohang University of Science & Technology
Gregor Leusch, eBay
William Lewis, Microsoft Research
Mu Li, Microsoft Research
Lemao Liu, Tencent AI Lab
Qun Liu, Dublin City University
Klaus Macherey, Google
Wolfgang Macherey, Google
Saab Mansour, Apple
Daniel Marcu, Amazon
Jonathan May, USC Information Sciences Institute
Arul Menezes, Microsoft Research
Haitao Mi, Alipay US
Graham Neubig, Carnegie Mellon University
Matt Post, Johns Hopkins University
Fatiha Sadat, UQAM
Michel Simard, NRC
Katsuhito Sudoh, Nara Institute of Science and Technology
Christoph Tillmann, IBM Research
Masao Utiyama, NICT
Taro Watanabe, Google
Andy Way, ADAPT, Dublin City University
Deyi Xiong, Soochow University
Francois Yvon, LIMSI/CNRS
Rabih Zbib, Raytheon BBN Technologies
Jiajun Zhang, Institute of Automation, Chinese Academy of Sciences
Bing Zhao, SRI International

Contents

1    Empirical Study of Dropout Scheme for Neural Machine Translation
     Xiaolin Wang, Masao Utiyama and Eiichiro Sumita
15   A Target Attention Model for Neural Machine Translation
     Hideya Mino, Andrew Finch and Eiichiro Sumita
27   Neural Pre-Translation for Hybrid Machine Translation
     Jinhua Du and Andy Way
41   Neural and Statistical Methods for Leveraging Meta-information in Machine Translation
     Shahram Khadivi, Patrick Wilken, Leonard Dahlmann and Evgeny Matusov
55   Translation Quality and Productivity: A Study on Rich Morphology Languages
     Lucia Specia, Kim Harris, Frédéric Blain, Aljoscha Burchardt, Vivien Macketanz, Inguna Skadiņa, Matteo Negri and Marco Turchi
72   The Microsoft Speech Language Translation (MSLT) Corpus for Chinese and Japanese: Conversational Test Data for Machine Translation and Speech Recognition
     Christian Federmann and William Lewis
86   Paying Attention to Multi-Word Expressions in Neural Machine Translation
     Matīss Rikters and Ondřej Bojar
96   Enabling Multi-Source Neural Machine Translation by Concatenating Source Sentences in Multiple Languages
     Raj Dabre, Fabien Cromieres and Sadao Kurohashi
108  Learning an Interactive Attention Policy for Neural Machine Translation
     Samee Ibraheem, Nicholas Altieri and John DeNero
116  A Comparative Quality Evaluation of PBSMT and NMT using Professional Translators
     Sheila Castilho, Joss Moorkens, Federico Gaspari, Rico Sennrich, Vilelmini Sosoni, Panayota Georgakopoulou, Pintu Lohar, Andy Way, Antonio Valerio Miceli Barone and Maria Gialama
132  One-parameter models for sentence-level post-editing effort estimation
     Mikel L. Forcada, Miquel Esplà-Gomis, Felipe Sánchez-Martínez and Lucia Specia
144  A Minimal Cognitive Model for Translating and Post-editing
     Moritz Schaeffer and Michael Carl
156  Fine-Tuning for Neural Machine Translation with Limited Degradation across In- and Out-of-Domain Data
     Praveen Dakwale and Christof Monz
170  Exploiting Relative Frequencies for Data Selection
     Thierry Etchegoyhen, Andoni Azpeitia and Eva Martinez Garcia
185  Low Resourced Machine Translation via Morpho-syntactic Modeling: The Case of Dialectal Arabic
     Alexander Erdmann, Nizar Habash, Dima Taji and Houda Bouamor
201  Elastic-substitution decoding for Hierarchical SMT: efficiency, richer search and double labels
     Gideon Maillette de Buy Wenniger, Khalil Sima'an and Andy Way
216  Development of a classifiers/quantifiers dictionary towards French-Japanese MT
     Mutsuko Tomokiyo, Mathieu Mangeot and Christian Boitet
227  Neural Machine Translation Model with a Large Vocabulary Selected by Branching Entropy
     Zi Long, Ryuichiro Kimura, Takehito Utsuro, Tomoharu Mitsuhashi and Mikio Yamamoto
241  Usefulness of MT output for comprehension: an analysis from the point of view of linguistic intercomprehension
     Kenneth Jordan-Núñez, Mikel L. Forcada and Esteve Clua
254  Machine Translation as an Academic Writing Aid for Medical Practitioners
     Carla Parra Escartín, Sharon O'Brien, Marie-Josée Goulet and Michel Simard
268  A Multilingual Parallel Corpus for Improving Machine Translation on Southeast Asian Languages
     Hai Long Trieu and Le Minh Nguyen
282  Exploring Hypotheses Spaces in Neural Machine Translation
     Frédéric Blain, Lucia Specia and Pranava Madhyastha
299  Confidence through Attention
     Matīss Rikters and Mark Fishel
312  Disentangling ASR and MT Errors in Speech Translation
     Ngoc-Tien Le, Benjamin Lecouteux and Laurent Besacier
324  Temporality as Seen through Translation: A Case Study on Hindi Texts
     Sabyasachi Kamila, Sukanta Sen, Mohammed Hasanuzzaman, Asif Ekbal, Andy Way and Pushpak Bhattacharyya
337  A Neural Network Transliteration Model in Low Resource Settings
     Tan Le and Fatiha Sadat

Empirical Study of Dropout Scheme for Neural Machine Translation

Xiaolin Wang, Masao Utiyama and Eiichiro Sumita
Advanced Translation Research and Development Promotion Center, National Institute of Information and Communications Technology, Japan

Abstract

Dropout has lately been recognized as an effective method to relieve over-fitting when training deep neural networks. However, there has been little work studying the optimal dropout scheme for neural machine translation (NMT). NMT models usually contain attention mechanisms and multiple recurrent layers, so applying dropout becomes a non-trivial task. This paper approaches the problem empirically through experiments in which dropout was applied to different parts of the connections using different dropout rates. The work in this paper not only leads to an improvement over an established baseline, but also provides useful heuristics about using dropout effectively to train NMT models.
These heuristics include which parts of the connections in NMT models have higher priority for dropout than others, and how to correctly enhance the effect of dropout for difficult translation tasks.

1 Introduction

Neural machine translation (NMT), a new technology that emerged from the field of deep learning, has raised the quality of automated machine translation to a significantly higher level than statistical machine translation (SMT) (Wu et al., 2016; Sennrich et al., 2017; Klein et al., 2017). State-of-the-art NMT models, like many other deep neural networks, typically contain multiple non-linear hidden layers. This makes them very expressive models, which is critical to successfully learning very complicated relationships between inputs and outputs (Devlin et al., 2014). However, as these large models have millions of parameters, they tend to over-fit during the training phase.

Dropout has recently been recognized as a very effective method to relieve over-fitting. Dropout was first proposed for feed-forward neural networks by Hinton et al. (2012), and was then successfully applied to recurrent neural networks by Pham et al. (2014) and Zaremba et al. (2014). Dropout outperforms many traditional approaches to over-fitting, including early stopping of training, weight penalties of various kinds such as L1 and L2 regularization, decaying the learning rate, and so on. Reported state-of-the-art NMT systems all adopt dropout during the training phase (Wu et al., 2016; Klein et al., 2017), but their papers have not provided many details about how dropout was applied.

The optimal way to apply dropout to NMT models is non-trivial. Strictly speaking, NMT is an application of deep neural networks, so it can directly benefit from advances in deep neural networks. However, two facts make NMT models stand out from normal deep neural networks.
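Concretely, the "inverted" dropout used in most modern toolkits zeroes each activation with probability p at training time and rescales the survivors by 1/(1-p), so that expected activations are unchanged and test-time inference needs no rescaling. The sketch below applies it with separate rates to two of the connection types an attentional NMT model exposes; the site names and rates are illustrative assumptions, not the scheme recommended by this paper.

```python
import numpy as np

def dropout(x, p, rng, train=True):
    """Inverted dropout: zero each unit with probability p during training
    and rescale survivors by 1/(1-p); at test time, return x unchanged."""
    if not train or p == 0.0:
        return x
    mask = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)

# Hypothetical per-connection rates -- the kind of choice studied here:
# different parts of the network can get different dropout strengths.
rates = {"embedding": 0.2, "recurrent": 0.3}

src_embed = np.ones((7, 32))   # embeddings of 7 source tokens, dim 32
h_prev = np.ones(32)           # previous decoder hidden state

src_embed_d = dropout(src_embed, rates["embedding"], rng)
h_prev_d = dropout(h_prev, rates["recurrent"], rng)

# Surviving units are scaled to 1/(1-p); the test-time pass is the identity.
assert dropout(h_prev, rates["recurrent"], rng, train=False) is h_prev
```

Applying a separate mask and rate at each site in this way is what makes the search over dropout schemes non-trivial: each recurrent, embedding, and attention connection is an independent design choice.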
First, NMT models contain recurrent hidden layers in order to gain the ability to operate on sequences. In contrast, deep neural networks for face recognition and phoneme recognition are feed-forward, as they take separate inputs (Povey et al., 2011; Krizhevsky et al., 2012). Second, NMT models contain a novel attention mechanism to manage a memory of an input sequence, as