Deep Learning for Web Search and Natural Language Processing

Total Page:16

File Type:pdf, Size:1020Kb

Deep Learning for Web Search and Natural Language Processing Deep Learning for Web Search and Natural Language Processing Jianfeng Gao Deep Learning Technology Center (DLTC) Microsoft Research, Redmond, USA WSDM 2015, Shanghai, China *Thank Li Deng and Xiaodong He, with whom we participated in the previous ICASSP2014 and CIKM2014 versions of this tutorial Mission of Machine (Deep) Learning Data (collected/labeled) Model (architecture) Training (algorithm) 2 Outline • The basics • Background of deep learning • A query classification problem • A single neuron model • A deep neural network (DNN) model • Potentials and problems of DNN • The breakthrough after 2006 • Deep Semantic Similarity Models (DSSM) for text processing • Recurrent Neural Networks 3 4 Scientists See Promise in Deep-Learning Programs John Markoff November 23, 2012 Rick Rashid in Tianjin, China, October, 25, 2012 Geoff Hinton The universal translator on A voice recognition program translated a speech given by Richard F. “Star Trek” comes true… Rashid, Microsoft’s top scientist, into Chinese. 5 6 Impact of deep learning in speech technology Cortana 7 8 A query classification problem • Given a search query 푞, e.g., “denver sushi downtown” • Identify its domain 푐 e.g., • Restaurant • Hotel • Nightlife • Flight • etc. • So that a search engine can tailor the interface and result to provide a richer personalized user experience 9 A single neuron model • For each domain 푐, build a binary classifier 푇 • Input: represent a query 푞 as a vector of features 푥 = [푥1, … 푥푛] • Output: 푦 = 푃 푐 푞 • 푞 is labeled 푐 is 푃 푐 푞 > 0.5 • Input feature vector, e.g., a bag of words vector • Regards words as atomic symbols: denver, sushi, downtown • Each word is represented as a one-hot vector: 0, … , 0,1,0, … , 0 푇 • Bag of words vector = sum of one-hot vectors • We may use other features, such as n-grams, phrases, (hidden) topics 10 A single neuron model 푥 Output: 푃(푐|푞) Input features 1 푦 = 휎 푧 = 1+exp(−푧) 푛 푧 = 푖=0 푤푖푥푖 • 푤: weight vector to be learned • 푧: weighted sum of input features • 휎: the logistic function • Turn a score to a probability • A sigmoid non-linearlity (activation function), essential in multi-layer/deep neural network models 11 Model training: how to assign 푤 • Training data: a set of 푥 푚 , 푦 푚 pairs 푚={1,2,…,푀} • Input 푥 푚 ∈ 푅푛 • Output 푦 푚 = {0,1} • Goal: learn function 푓: 푥 → 푦 to predict correctly on new input 푥 • Step 1: choose a function family, e.g., • neural networks, logistic regression, support vector machine, in our case 푛 푇 • 푓 푥 = 휎 푖=0 푤푖푥푖 = 휎(푤 푥) • Step 2: optimize parameters 푤 on training data, e.g., • minimize a loss function (mean square error loss) 푀 푚 • min 푚=1 퐿 푤 1 2 • where 퐿(푚) = 푓 푥 푚 − 푦 푚 2 푤 12 Training the single neuron model, 푤 • Stochastic gradient descent (SGD) algorithm • Initialize 푤 randomly 휕퐿 • Update for each training sample until convergence: 푤푛푒푤 = 푤표푙푑 − 휂 휕푤 1 • Mean square error loss: 퐿 = 휎 푤푇푥 − 푦 2 2 휕퐿 • Gradient: = 훿휎′ 푧 푥 휕푤 • 푧 = 푤푇푥 • Error: 훿 = 휎 푧 − 푦 • Derivative of sigmoid 휎′(푧) = 휎 푧 1 − 휎 푧 13 SGD vs. gradient descent • Gradient descent is a batch training algorithm • update 푤 per batch of training samples • goes in steepest descent direction • SGD is noisy descent (but faster per iteration) • Loss function contour plot (Duh 2014) 1 • 푀 휎 푤푇푥 − 푦 2 + 푤 푚=1 2 14 Multi-layer (deep) neural networks Output layer 푦표 = 휎(푤푇푦2) Vector 푤 This is exactly the single neuron model with hidden features. st 2 1 2 hidden layer 푦 = 휎(퐖2푦 ) Projection matrix 퐖2 Feature generation: project raw input st 1 1 hidden layer 푦 = 휎(퐖1푥) features (bag of words) to hidden features (topics). Projection matrix 퐖1 Input features 푥 15 Standard Machine Deep Learning Learning Process Adapted from [Duh 2014] 16 Revisit the activation function: 휎 • Assuming a L-layer neural network • 푦 = 퐖퐿휎 … 휎 퐖2휎 퐖1푥 , where 푦 is the output vector • If 휎 is a linear function, then L-layer neural network is compiled down into a single linear transform • 휎: map scores to probabilities • Useful in prediction as it transforms the neuron weighted sum into the interval [0..1] • Unnecessary for model training except in the Boltzman machine or graphical models 17 Training a two-layer neural net • Training data: a set of 푥 푚 , 푦 푚 pairs 푚={1,2,…,푀} • Input 푥 푚 ∈ 푅푛 • Output 푦 푚 = {0,1} • Goal: learn function 푓: 푥 → 푦 to predict correctly on new input 푥 • 푓 푥 = 휎 푗 푤푗 ∙ 휎( 푖 푤푖푗푥푖) • Optimize parameters 푤 on training data via 푀 푚 • minimize a loss function: min 푚=1 퐿 푤 1 2 • where 퐿(푚) = 푓 푥 푚 − 푦 푚 2 푤 18 Training neural nets: back-propagation • Stochastic gradient descent (SGD) algorithm 휕퐿 • 푤푛푒푤 = 푤표푙푑 − 휂 휕푤 휕퐿 • : sample-wise loss w.r.t. parameters 휕푤 • Need to apply the derivative chain rule correctly • 푧 = 푓 푦 • 푦 = 푔 푥 휕푧 휕푧 휕푦 • = 휕푥 휕푦 휕푥 • A detailed discussion in [Socher & Manning 2013] 19 Simple chain rule 20 [Socher & Manning 2013] Multiple paths chain rule [Socher & Manning 2013] 21 Chain rule in flow graph [Socher & Manning 2013] 22 Training neural nets: back-propagation Assume two outputs (푦1, 푦2) per input 푥, and 1 Loss per sample: 퐿 = 휎 푧 − 푦 2 푘 2 푘 푘 Forward pass: 푦푘 = 휎(푧푘), 푧푘 = 푗 푤푗푘ℎ푗 ℎ푗 = 휎(푧푗), 푧푗 = 푖 푤푖푗푥푖 Derivatives of the weights 휕퐿 휕퐿 휕푧푘 휕( 푗 푤푗푘ℎ푗) = = 훿푘 = 훿푘ℎ푗 휕푤푗푘 휕푧푘 휕푤푗푘 휕푤푗푘 휕퐿 휕퐿 휕푧푗 휕( 푖 푤푖푗푥푖) = = 훿푗 = 훿푗푥푖 휕푤푖푗 휕푧푗 휕푤푖푗 휕푤푖푗 휕퐿 ′ 훿푘 = = 휎 푧푘 − 푦푘 휎 푧푘 휕푧푘 휕퐿 휕푧푘 휕 훿푗 = 푘 = 푘 훿푘 푗 푤푗푘휎 푧푗 = 푘 훿푘푤푗푘 휎′(푧푗) 휕푧푘 휕푧푗 휕푧푗 23 Adapted from [Duh 2014] Training neural nets: back-propagation • All updates involve some scaled error from output × input feature: 휕퐿 ′ • = 훿푘ℎ푗 where 훿푘 = 휎 푧푘 − 푦푘 휎 푧푘 휕푤푗푘 휕퐿 • = 훿푗푥푖 where 훿푗 = 푘 훿푘푤푗푘 휎′(푧푗) 휕푤푖푗 • First compute 훿푘 from output layer, then 훿푗 for other layers and iterate. 훿 푘=푦1 훿푘=푦2 푤31 푤32 훿푗=ℎ3 = 훿푘=푦1푤31 + 훿푘=푦2푤32 휎′(푧푗=ℎ3) 24 Adapted from (Duh 2014) Potential of DNN This is exactly the single neuron model with hidden features. Project raw input features to hidden features (high level representation). 25 [Bengio, 2009] DNN is difficult to training • Vanishing gradient problem in backpropagation 휕퐿 휕퐿 휕푧푗 • = = 훿푗푥푖 휕푤푖푗 휕푧푗 휕푤푖푗 • 훿푗 = 푘 훿푘푤푗푘 휎′(푧푗) • 훿푗 may vanish after repeated multiplication • Scalability problem 26 Many, but NOT ALL, limitations of early DNNs have been overcome better learning algorithms and different nonlinearities. SGD can often allow the training to jump out of local optima due to the noisy gradients estimated from a small batch of samples. SGD effective for parallelizing over many machines with an asynchronous mode • Vanishing gradient problem? Try deep belief net (DBN) to initialize it – Layer-wise pre-training (Hinton et al. 2006) • Scalability problem Computational power due to the use of GPU and large-scale CPU clusters 27 DNN: (Fully-Connected) Deep Neural Networks Hinton, Deng, Yu, etc., DNN for AM in speech recognition, IEEE SPM, 2012 Geoff Hinton Li Deng First train a stack of N models each of Then compose them into Then add outputs which has one hidden layer. Each model in a single Deep Belief and train the DNN Dong Yu the stack treats the hidden variables of the Network. with backprop. previous model as data. 28 CD-DNN-HMM Dahl, Yu, Deng, and Acero, “Context-Dependent Pre- trained Deep Neural Networks for Large Vocabulary Speech Recognition,” IEEE Trans. ASLP, Jan. 2012 After no improvement for 10+ years by the research community… …MSR reduced error from ~23% to <13% (and under 7% for Rick Rashid’s S2S demo)! 29 Deep Convolutional Neural Network for Images CNN: local connections with weight sharing; pooling for translation invariance Image Output [LeCun et al., 1998] 30 A basic module of the CNN Pooling Convolution Image 31 Deep Convolutional NN for Images 2012-2014 Fully connected Fully connected earlier Fully connected Convolution/pooling SVM Convolution/pooling Pooling Convolution/pooling Histogram Oriented Grads Convolution/pooling Image Convolution/pooling Raw Image pixels 32 ImageNet 1K Competition Krizhevsky, Sutskever, Hinton, “ImageNet Classification with Deep Convolutional Neural Networks.” NIPS, Dec. 2012 Deep CNN Univ. Toronto team 33 Gartner hyper cycle graph for NN history 34 [Deng and Yu 2014] Useful Sites on Deep Learning • http://www.cs.toronto.edu/~hinton/ • http://ufldl.stanford.edu/wiki/index.php/UFLDL_Recommended_Readings • http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial (Andrew Ng’s group) • http://deeplearning.net/reading-list/ (Bengio’s group) • http://deeplearning.net/tutorial/ • http://deeplearning.net/deep-learning-research-groups-and-labs/ • Google+ Deep Learning community 35 Outline • The basics • Deep Semantic Similarity Models (DSSM) for text processing • What is DSSM • DSSM for web search ranking • DSSM for recommendation • DSSM for automatic image captioning • Recurrent Neural Networks 36 Computing Semantic Similarity • Fundamental to almost all Web search and NLP tasks, e.g., • Machine translation: similarity between sentences in different languages • Web search: similarity between queries and documents • Problems of the existing approaches • Lexical matching cannot handle language discrepancy. • Unsupervised word embedding or topic models are not optimal for the task of interest. 37 Deep Semantic Similarity Model (DSSM) [Huang et al. 2013; Gao et al. 2014a; Gao et al. 2014b; Shen et al. 2014] • Compute semantic similarity between two text strings X and Y • Map X and Y to feature vectors in a latent semantic space via deep neural net • Compute the cosine similarity between the feature vectors • Also called “Deep Structured Similarity Model” in Huang et al.
Recommended publications
  • Automatic Correction of Real-Word Errors in Spanish Clinical Texts
    sensors Article Automatic Correction of Real-Word Errors in Spanish Clinical Texts Daniel Bravo-Candel 1,Jésica López-Hernández 1, José Antonio García-Díaz 1 , Fernando Molina-Molina 2 and Francisco García-Sánchez 1,* 1 Department of Informatics and Systems, Faculty of Computer Science, Campus de Espinardo, University of Murcia, 30100 Murcia, Spain; [email protected] (D.B.-C.); [email protected] (J.L.-H.); [email protected] (J.A.G.-D.) 2 VÓCALI Sistemas Inteligentes S.L., 30100 Murcia, Spain; [email protected] * Correspondence: [email protected]; Tel.: +34-86888-8107 Abstract: Real-word errors are characterized by being actual terms in the dictionary. By providing context, real-word errors are detected. Traditional methods to detect and correct such errors are mostly based on counting the frequency of short word sequences in a corpus. Then, the probability of a word being a real-word error is computed. On the other hand, state-of-the-art approaches make use of deep learning models to learn context by extracting semantic features from text. In this work, a deep learning model were implemented for correcting real-word errors in clinical text. Specifically, a Seq2seq Neural Machine Translation Model mapped erroneous sentences to correct them. For that, different types of error were generated in correct sentences by using rules. Different Seq2seq models were trained and evaluated on two corpora: the Wikicorpus and a collection of three clinical datasets. The medicine corpus was much smaller than the Wikicorpus due to privacy issues when dealing Citation: Bravo-Candel, D.; López-Hernández, J.; García-Díaz, with patient information.
    [Show full text]
  • Unified Language Model Pre-Training for Natural
    Unified Language Model Pre-training for Natural Language Understanding and Generation Li Dong∗ Nan Yang∗ Wenhui Wang∗ Furu Wei∗† Xiaodong Liu Yu Wang Jianfeng Gao Ming Zhou Hsiao-Wuen Hon Microsoft Research {lidong1,nanya,wenwan,fuwei}@microsoft.com {xiaodl,yuwan,jfgao,mingzhou,hon}@microsoft.com Abstract This paper presents a new UNIfied pre-trained Language Model (UNILM) that can be fine-tuned for both natural language understanding and generation tasks. The model is pre-trained using three types of language modeling tasks: unidirec- tional, bidirectional, and sequence-to-sequence prediction. The unified modeling is achieved by employing a shared Transformer network and utilizing specific self-attention masks to control what context the prediction conditions on. UNILM compares favorably with BERT on the GLUE benchmark, and the SQuAD 2.0 and CoQA question answering tasks. Moreover, UNILM achieves new state-of- the-art results on five natural language generation datasets, including improving the CNN/DailyMail abstractive summarization ROUGE-L to 40.51 (2.04 absolute improvement), the Gigaword abstractive summarization ROUGE-L to 35.75 (0.86 absolute improvement), the CoQA generative question answering F1 score to 82.5 (37.1 absolute improvement), the SQuAD question generation BLEU-4 to 22.12 (3.75 absolute improvement), and the DSTC7 document-grounded dialog response generation NIST-4 to 2.67 (human performance is 2.65). The code and pre-trained models are available at https://github.com/microsoft/unilm. 1 Introduction Language model (LM) pre-training has substantially advanced the state of the art across a variety of natural language processing tasks [8, 29, 19, 31, 9, 1].
    [Show full text]
  • Analyzing Political Polarization in News Media with Natural Language Processing
    Analyzing Political Polarization in News Media with Natural Language Processing by Brian Sharber A thesis presented to the Honors College of Middle Tennessee State University in partial fulfillment of the requirements for graduation from the University Honors College Fall 2020 Analyzing Political Polarization in News Media with Natural Language Processing by Brian Sharber APPROVED: ______________________________ Dr. Salvador E. Barbosa Project Advisor Computer Science Department Dr. Medha Sarkar Computer Science Department Chair ____________________________ Dr. John R. Vile, Dean University Honors College Copyright © 2020 Brian Sharber Department of Computer Science Middle Tennessee State University; Murfreesboro, Tennessee, USA. I hereby grant to Middle Tennessee State University (MTSU) and its agents (including an institutional repository) the non-exclusive right to archive, preserve, and make accessible my thesis in whole or in part in all forms of media now and hereafter. I warrant that the thesis and the abstract are my original work and do not infringe or violate any rights of others. I agree to indemnify and hold MTSU harmless for any damage which may result from copyright infringement or similar claims brought against MTSU by third parties. I retain all ownership rights to the copyright of my thesis. I also retain the right to use in future works (such as articles or books) all or part of this thesis. The software described in this work is free software. You can redistribute it and/or modify it under the terms of the MIT License. The software is posted on GitHub under the following repository: https://github.com/briansha/NLP-Political-Polarization iii Abstract Political discourse in the United States is becoming increasingly polarized.
    [Show full text]
  • Automatic Spelling Correction Based on N-Gram Model
    International Journal of Computer Applications (0975 – 8887) Volume 182 – No. 11, August 2018 Automatic Spelling Correction based on n-Gram Model S. M. El Atawy A. Abd ElGhany Dept. of Computer Science Dept. of Computer Faculty of Specific Education Damietta University Damietta University, Egypt Egypt ABSTRACT correction. In this paper, we are designed, implemented and A spell checker is a basic requirement for any language to be evaluated an end-to-end system that performs spellchecking digitized. It is a software that detects and corrects errors in a and auto correction. particular language. This paper proposes a model to spell error This paper is organized as follows: Section 2 illustrates types detection and auto-correction that is based on n-gram of spelling error and some of related works. Section 3 technique and it is applied in error detection and correction in explains the system description. Section 4 presents the results English as a global language. The proposed model provides and evaluation. Finally, Section 5 includes conclusion and correction suggestions by selecting the most suitable future work. suggestions from a list of corrective suggestions based on lexical resources and n-gram statistics. It depends on a 2. RELATED WORKS lexicon of Microsoft words. The evaluation of the proposed Spell checking techniques have been substantially, such as model uses English standard datasets of misspelled words. error detection & correction. Two commonly approaches for Error detection, automatic error correction, and replacement error detection are dictionary lookup and n-gram analysis. are the main features of the proposed model. The results of the Most spell checkers methods described in the literature, use experiment reached approximately 93% of accuracy and acted dictionaries as a list of correct spellings that help algorithms similarly to Microsoft Word as well as outperformed both of to find target words [16].
    [Show full text]
  • N-Gram-Based Machine Translation
    N-gram-based Machine Translation ∗ JoseB.Mari´ no˜ ∗ Rafael E. Banchs ∗ Josep M. Crego ∗ Adria` de Gispert ∗ Patrik Lambert ∗ Jose´ A. R. Fonollosa ∗ Marta R. Costa-jussa` Universitat Politecnica` de Catalunya This article describes in detail an n-gram approach to statistical machine translation. This ap- proach consists of a log-linear combination of a translation model based on n-grams of bilingual units, which are referred to as tuples, along with four specific feature functions. Translation performance, which happens to be in the state of the art, is demonstrated with Spanish-to-English and English-to-Spanish translations of the European Parliament Plenary Sessions (EPPS). 1. Introduction The beginnings of statistical machine translation (SMT) can be traced back to the early fifties, closely related to the ideas from which information theory arose (Shannon and Weaver 1949) and inspired by works on cryptography (Shannon 1949, 1951) during World War II. According to this view, machine translation was conceived as the problem of finding a sentence by decoding a given “encrypted” version of it (Weaver 1955). Although the idea seemed very feasible, enthusiasm faded shortly afterward because of the computational limitations of the time (Hutchins 1986). Finally, during the nineties, two factors made it possible for SMT to become an actual and practical technology: first, significant increment in both the computational power and storage capacity of computers, and second, the availability of large volumes of bilingual data. The first SMT systems were developed in the early nineties (Brown et al. 1990, 1993). These systems were based on the so-called noisy channel approach, which models the probability of a target language sentence T given a source language sentence S as the product of a translation-model probability p(S|T), which accounts for adequacy of trans- lation contents, times a target language probability p(T), which accounts for fluency of target constructions.
    [Show full text]
  • Spell Checker in CET Designer
    Linköping University | Department of Computer Science Bachelor thesis, 16 ECTS | Datateknik 2016 | LIU-IDA/LITH-EX-G--16/069--SE Spell checker in CET Designer Rasmus Hedin Supervisor : Amir Aminifar Examiner : Zebo Peng Linköpings universitet SE–581 83 Linköping +46 13 28 10 00 , www.liu.se Upphovsrätt Detta dokument hålls tillgängligt på Internet – eller dess framtida ersättare – under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår. Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och admin- istrativ art. Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sam- manhang som är kränkande för upphovsmannenslitterära eller konstnärliga anseende eller egenart. För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/. Copyright The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circum- stances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose.
    [Show full text]
  • Automatic Extraction of Personal Events from Dialogue
    Automatic extraction of personal events from dialogue Joshua D. Eisenberg Michael Sheriff Artie, Inc. Florida International University 601 W 5th St, Los Angeles, CA 90071 Department of Creative Writing [email protected] 11101 S.W. 13 ST., Miami, FL 33199 [email protected] Abstract language. The most popular corpora annotated In this paper we introduce the problem of ex- for events all come from the domain of newswire tracting events from dialogue. Previous work (Pustejovsky et al., 2003b; Minard et al., 2016). on event extraction focused on newswire, how- Our work begins to fill that gap. We have open ever we are interested in extracting events sourced the gold-standard annotated corpus of from spoken dialogue. To ground this study, events from dialogue.1 For brevity, we will hearby we annotated dialogue transcripts from four- refer to this corpus as the Personal Events in Di- teen episodes of the podcast This American alogue Corpus (PEDC). We detailed the feature Life. This corpus contains 1,038 utterances, made up of 16,962 tokens, of which 3,664 rep- extraction pipelines, and the support vector ma- resent events. The agreement for this corpus chine (SVM) learning protocols for the automatic has a Cohen’s κ of 0.83. We have open sourced extraction of events from dialogue. Using this in- this corpus for the NLP community. With this formation, as well as the corpus we have released, corpus in hand, we trained support vector ma- anyone interested in extracting events from dia- chines (SVM) to correctly classify these phe- logue can proceed where we have left off.
    [Show full text]
  • Training Automatic Transliteration Models on Dbpedia Data
    Training Automatic Transliteration Models on DBpedia Data Velislava Todorova Kiril Simov Linguistic Modeling Department Linguistic Modeling Department IICT-BAS IICT-BAS [email protected] [email protected] Abstract provide access to all RDF statements about the original URIs. Our goal is to facilitate named en- The paper presents several transliteration tity recognition in Bulgarian texts by models and their evaluation. Their evaluation extending the coverage of DBpedia is done over 100 examples, transliterated man- (http://www.dbpedia.org/) for Bul- ually by two people, independently from each garian. For this task we have trained other. The discrepancies between the two hu- translation Moses models to transliter- man transliterations demonstrate the complex- ate foreign names to Bulgarian. The ity of the task. training sets were obtained by extract- The structure of the paper is as follows: in ing the names of all people, places section 2 we describe the problem in more de- and organizations from DBpedia and tail; in section 3 we present some related ap- its extension Airpedia (http://www. proaches; section 4 reports on the preliminary airpedia.org/). Our approach is ex- experiments; section 5 describes how the train- tendable to other languages with small ing data is extended on the basis of the results DBpedia coverage. from the preliminary models and new models are trained; section 6 provides some heuristic 1 Introduction rules for combining transliteration and trans- lation of names’ parts; section 7 describes the 1 DBpedia Linked Open Dataset provides the evaluation of the models; the last section con- wealth of Wikipedia in a formalized way via cludes the paper and presents some directions the ontological language in which DBpedia for future work.
    [Show full text]
  • Heafield, K. and A. Lavie. "Voting on N-Grams for Machine Translation
    Voting on N-grams for Machine Translation System Combination Kenneth Heafield and Alon Lavie Language Technologies Institute Carnegie Mellon University 5000 Forbes Avenue Pittsburgh, PA 15213 fheafield,[email protected] Abstract by a single weight and further current features only guide decisions at the word level. To remedy this System combination exploits differences be- situation, we propose new features that account for tween machine translation systems to form a multiword behavior, each with a separate set of sys- combined translation from several system out- tem weights. puts. Core to this process are features that reward n-gram matches between a candidate 2 Features combination and each system output. Systems differ in performance at the n-gram level de- Most combination schemes generate many hypothe- spite similar overall scores. We therefore ad- sis combinations, score them using a battery of fea- vocate a new feature formulation: for each tures, and search for a hypothesis with highest score system and each small n, a feature counts to output. Formally, the system generates hypoth- n-gram matches between the system and can- esis h, evaluates feature function f, and multiplies didate. We show post-evaluation improvement of 6.67 BLEU over the best system on NIST linear weight vector λ by feature vector f(h) to ob- T MT09 Arabic-English test data. Compared to tain score λ f(h). The score is used to rank fi- a baseline system combination scheme from nal hypotheses and to prune partial hypotheses dur- WMT 2009, we show improvement in the ing beam search. The beams contain hypotheses of range of 1 BLEU point.
    [Show full text]
  • Proceedings of the 55Th Annual Meeting of the Association For
    ComputEL-3 Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers) February 26–27, 2019 Honolulu, Hawai‘i, USA Support: c 2019 The Association for Computational Linguistics Order copies of this and other ACL proceedings from: Association for Computational Linguistics (ACL) 209 N. Eighth Street Stroudsburg, PA 18360 USA Tel: +1-570-476-8006 Fax: +1-570-476-0860 [email protected] ISBN 978-1-950737-18-5 ii Preface These proceedings contain the papers presented at the 3rd Workshop on the Use of Computational Methods in the Study of Endangered languages held in Hawai’i at Manoa,¯ February 26–27, 2019. As the name implies, this is the third workshop held on the topic—the first meeting was co-located with the ACL main conference in Baltimore, Maryland in 2014 and the second one in 2017 was co-located with the 5th International Conference on Language Documentation and Conservation (ICLDC) at the University of Hawai‘i at Manoa.¯ The workshop covers a wide range of topics relevant to the study and documentation of endangered languages, ranging from technical papers on working systems and applications, to reports on community activities with supporting computational components. The purpose of the workshop is to bring together computational researchers, documentary linguists, and people involved with community efforts of language documentation and revitalization to take part in both formal and informal exchanges on how to integrate rapidly evolving language processing methods and tools into efforts of language description, documentation, and revitalization. The organizers are pleased with the range of papers, many of which highlight the importance of interdisciplinary work and interaction between the various communities that the workshop is aimed towards.
    [Show full text]
  • Phrase-Based Attentions
    Under review as a conference paper at ICLR 2019 PHRASE-BASED ATTENTIONS Phi Xuan Nguyen, Shafiq Joty Nanyang Technological University, Singapore [email protected], [email protected] ABSTRACT Most state-of-the-art neural machine translation systems, despite being different in architectural skeletons (e.g., recurrence, convolutional), share an indispensable feature: the Attention. However, most existing attention methods are token-based and ignore the importance of phrasal alignments, the key ingredient for the suc- cess of phrase-based statistical machine translation. In this paper, we propose novel phrase-based attention methods to model n-grams of tokens as attention entities. We incorporate our phrase-based attentions into the recently proposed Transformer network, and demonstrate that our approach yields improvements of 1.3 BLEU for English-to-German and 0.5 BLEU for German-to-English transla- tion tasks on WMT newstest2014 using WMT’16 training data. 1 INTRODUCTION Neural Machine Translation (NMT) has established breakthroughs in many different translation tasks, and has quickly become the standard approach to machine translation. NMT offers a sim- ple encoder-decoder architecture that is trained end-to-end. Most NMT models (except a few like (Kaiser & Bengio, 2016) and (Huang et al., 2018)) possess attention mechanisms to perform align- ments of the target tokens to the source tokens. The attention module plays a role analogous to the word alignment model in Statistical Machine Translation or SMT (Koehn, 2010). In fact, the trans- former network introduced recently by Vaswani et al. (2017) achieves state-of-the-art performance in both speed and BLEU scores (Papineni et al., 2002) by using only attention modules.
    [Show full text]
  • EECS 498-004: Introduction to Natural Language Processing
    EECS 498-004: Introduction to Natural Language Processing Instructor: Prof. Lu Wang Computer Science and Engineering University of Michigan https://web.eecs.umich.edu/~wangluxy/ 1 Outline • Probabilistic language model and n-grams • Estimating n-gram probabilities • Language model evaluation and perplexity • Generalization and zeros • Smoothing: add-one • Interpolation, backoff, and web-scale LMs • Smoothing: Kneser-Ney Smoothing 2 [Modified from slides by Dan Jurafsky and Joyce Chai] Probabilistic Language Models • Assign a probability to a sentence 3 Probabilistic Language Models • Assign a probability to a sentence •Machine Translation: • P(high winds tonight) > P(large winds tonight) •Spell Correction • The office is about fifteen minuets from my house • P(about fifteen minutes from) > P(about fifteen minuets from) •Speech Recognition • P(I saw a van) >> P(eyes awe of an) •Text Generation in general: • Summarization, question-answering, dialogue systems … 4 Probabilistic Language Modeling • Goal: compute the probability of a sentence or sequence of words: P(W) = P(w1,w2,w3,w4,w5…wn) • Related task: probability of an upcoming word: P(w5|w1,w2,w3,w4) • A model that computes either of these: P(W) or P(wn|w1,w2…wn-1) is called a language model. • Better: the grammar • But language model (or LM) is standard 5 How to compute P(W) • How to compute this joint probability: • P(its, water, is, so, transparent, that) 6 How to compute P(W) • How to compute this joint probability: • P(its, water, is, so, transparent, that) • Intuition: let’s rely
    [Show full text]