Multilingual Distributional Lexical Similarity

Total Pages: 16

File Type: pdf, Size: 1020 KB

Multilingual Distributional Lexical Similarity

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By Kirk Baker, B.A., M.A.

The Ohio State University, 2008

Dissertation Committee: Chris Brew, Advisor; James Unger; Mike White

Approved by: Advisor, Graduate Program in Linguistics

© Copyright by Kirk Baker 2008

ABSTRACT

One of the most fundamental problems in natural language processing involves words that are not in the dictionary, or unknown words. The supply of unknown words is virtually unlimited – proper names, technical jargon, foreign borrowings, newly created words, etc. – meaning that lexical resources like dictionaries and thesauri inevitably miss important vocabulary items. However, manually creating and maintaining broad-coverage dictionaries and ontologies for natural language processing is expensive and difficult. Instead, it is desirable to learn them from distributional lexical information such as can be obtained relatively easily from unlabeled or sparsely labeled text corpora. Rule-based approaches to acquiring or augmenting repositories of lexical information typically offer a high-precision, low-recall methodology that fails to generalize to new domains or scale to very large data sets. Classification-based approaches to organizing lexical material have more promising scaling properties, but require an amount of labeled training data that is usually not available on the necessary scale.

This dissertation addresses the problem of learning an accurate and scalable lexical classifier in the absence of large amounts of hand-labeled training data. One approach to this problem involves using a rule-based system to generate large amounts of data that serve as training examples for a secondary lexical classifier.
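The rule-then-train pipeline can be sketched in miniature. Everything below is invented for illustration: the substitution rules, the word lists, and the character-bigram Naive Bayes classifier are stand-ins, not the dissertation's actual Korean adaptation rules or model.

```python
import math
from collections import Counter

# Toy stand-ins for loanword adaptation rules (hypothetical; real
# English-to-Korean adaptation involves, e.g., vowel epenthesis).
def apply_rules(word):
    """Generate a noisy pseudo-loanword from an English word via simple rules."""
    out = word.lower()
    for src, tgt in [("f", "p"), ("v", "b"), ("z", "j"), ("th", "s")]:
        out = out.replace(src, tgt)
    return out + "u" if out.endswith(("t", "d", "k", "p")) else out

def bigrams(word):
    w = f"#{word}#"                      # word-boundary markers
    return [w[i:i + 2] for i in range(len(w) - 1)]

def train(examples):
    """examples: list of (word, label). Returns per-label bigram counts."""
    counts = {"loan": Counter(), "native": Counter()}
    for word, label in examples:
        counts[label].update(bigrams(word))
    return counts

def classify(word, counts):
    """Naive Bayes over character bigrams with add-one smoothing."""
    vocab = len(set().union(*[set(c) for c in counts.values()]))
    scores = {}
    for label, c in counts.items():
        total = sum(c.values())
        scores[label] = sum(math.log((c[b] + 1) / (total + vocab))
                            for b in bigrams(word))
    return max(scores, key=scores.get)

# The rule output is noisy, but at scale it is good enough to train on.
english = ["coffee", "virus", "internet", "chocolate", "taxi"]
native = ["saram", "hangug", "mul", "jib", "bab"]
data = [(apply_rules(w), "loan") for w in english] + [(w, "native") for w in native]
model = train(data)
print(classify(apply_rules("printer"), model))
```

The point of the sketch is the division of labor: the rules never need to be accurate, only cheap to run at scale, and the classifier generalizes past their deficiencies.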
The viability of this approach is demonstrated for the task of automatically identifying English loanwords in Korean. A set of rules describing changes English words undergo when they are borrowed into Korean is used to generate training data for an etymological classification task. Although the quality of the rule-based output is low, on a sufficient scale it is reliable enough to train a classifier that is robust to the deficiencies of the original rule-based output and reaches a level of performance that has previously been obtained only with access to substantial hand-labeled training data.

The second approach to the problem of obtaining labeled training data uses the output of a statistical parser to automatically generate lexical-syntactic co-occurrence features. These features are used to partition English verbs into lexical semantic classes, producing results on a substantially larger scale than any previously reported and yielding new insights into the properties of verbs that are responsible for their lexical categorization. The work here is geared towards automatically extending the coverage of verb classification schemes such as Levin, VerbNet, and FrameNet to other verbs that occur in a large text corpus.

ACKNOWLEDGMENTS

I am indebted primarily to my dissertation advisor Chris Brew, who supported me for four years as a research assistant on his NSF grant "Hybrid methods for acquisition and tuning of lexical information". Chris introduced me to the whole idea of statistical machine learning and its applications to large-scale natural language processing. He gave me an enormous amount of freedom to explore a variety of projects as my interests took me, and I am grateful to him for all of these things. I am grateful to James Unger for generously lending his time to wide-ranging discussions of the ideas in this dissertation and for giving me a bunch of additional ideas for things to try with Japanese word processing.
I am grateful to Mike White for carefully reading several drafts of my dissertation, each time offering feedback which crucially improved both the ideas contained in the dissertation and their presentation. His questions and comments substantially improved the overall quality of my dissertation and were essential to its final form.

I am grateful to my colleagues at OSU. Hiroko Morioka contributed substantially to my understanding of statistical modeling and to the formulation of many of the ideas in the dissertation. Eunjong Kong answered a bunch of my questions about English loanwords in Korean and helped massively with revising the presentation of the material in Chapter 2 for other venues. I am grateful to Jianguo Li for lots of discussion about automatic English verb classification, and for sharing scripts and data.

VITA

1998 ........ B.A., Linguistics, University of North Carolina at Chapel Hill
2001 ........ M.A., Linguistics, University of North Carolina at Chapel Hill
2003–2007 ... Research Assistant, The Ohio State University
2008 ........ Presidential Fellow, The Ohio State University

PUBLICATIONS

1. Kirk Baker and Chris Brew (2008). Statistical identification of English loanwords in Korean using automatically generated training data. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC). Marrakech, Morocco.

FIELDS OF STUDY

Major Field: Linguistics
Specialization: Computational Linguistics

TABLE OF CONTENTS

Abstract
Acknowledgments
Vita
List of Figures
List of Tables

1 Introduction
  1.1 Overview
  1.2 General Methodology
    1.2.1 Loanword Identification
    1.2.2 Distributional Verb Similarity
  1.3 Structure of Dissertation and Summary of Contributions

2 Descriptive Analysis of English Loanwords in Korean
  2.1 Construction of the Data Set
    2.1.1 Romanization
    2.1.2 Phonemic Representation
      2.1.2.1 Source of Pronunciations
      2.1.2.2 Standardizing Pronunciations
    2.1.3 Alignments
  2.2 Analysis of English Loanwords in Korean
  2.3 Conclusion

3 English-to-Korean Transliteration
  3.1 Overview
  3.2 Previous Research on English-to-Korean Transliteration
    3.2.1 Grapheme-Based English-to-Korean Transliteration Models
      3.2.1.1 Lee and Choi (1998); Lee (1999)
      3.2.1.2 Kang and Choi (2000a,b)
      3.2.1.3 Kang and Kim (2000)
    3.2.2 Phoneme-Based English-to-Korean Transliteration Models
      3.2.2.1 Lee (1999); Kang (2001)
      3.2.2.2 Jung, Hong, and Paek (2000)
    3.2.3 Ortho-phonemic English-to-Korean Transliteration Models
      3.2.3.1 Oh and Choi (2002)
      3.2.3.2 Oh and Choi (2005); Oh, Choi, and Isahara (2006)
    3.2.4 Summary of Previous Research
  3.3 Experiments on English-to-Korean Transliteration
    3.3.1 Experiment One
      3.3.1.1 Purpose
      3.3.1.2 Description of the Transliteration Model
      3.3.1.3 Experimental Setup
      3.3.1.4 Results and Discussion
    3.3.2 Experiment Two
      3.3.2.1 Purpose
      3.3.2.2 Description of the Model
      3.3.2.3 Experimental Setup
      3.3.2.4 Results and Discussion
    3.3.3 Experiment Three
      3.3.3.1 Purpose
      3.3.3.2 Description of the Model
      3.3.3.3 Experimental Setup
      3.3.3.4 Results and Discussion
    3.3.4 Error Analysis
    3.3.5 Conclusion

4 Automatically Identifying English Loanwords in Korean
  4.1 Overview
  4.2 Previous Research
  4.3 Current Approach
    4.3.1 Bayesian Multinomial Logistic Regression
    4.3.2 Naive Bayes
  4.4 Experiments on Identifying English Loanwords in Korean
    4.4.1 Experiment One
      4.4.1.1 Purpose
      4.4.1.2 Experimental Setup
      4.4.1.3 Results
    4.4.2 Experiment Two
      4.4.2.1 Purpose
      4.4.2.2 Experimental Setup
      4.4.2.3 Results
    4.4.3 Experiment Three
      4.4.3.1 Purpose
      4.4.3.2 Experimental Setup
      4.4.3.3 Results
  4.5 Conclusion

5 Distributional Verb Similarity
  5.1 Overview
  5.2 Previous Work
    5.2.1 Schulte im Walde (2000)
    5.2.2 Merlo and Stevenson (2001)
    5.2.3 Korhonen, Krymolowski, and Marx (2003)
    5.2.4 Li and Brew (2008)
  5.3 Current Approach
    5.3.1 Scope and Nature of Verb Classifications
    5.3.2 Nature of the Evaluation of Verb Classifications
    5.3.3 Relation to Other Lexical Acquisition Tasks
    5.3.4 Advantages of the Current Approach
  5.4 Components of Distributional Verb Similarity
    5.4.1 Representation of Lexical Context
    5.4.2 Bag-of-Words Context Models
    5.4.3 Grammatical Relations Context Models
  5.5 Evaluation
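The grammatical-relations context models listed under Chapter 5 can be sketched in miniature: represent each verb by counts of parser-extracted (relation, headword) pairs and compare verbs by cosine similarity. The tuples below are invented examples, not the dissertation's data.

```python
import math
from collections import Counter

# Invented parser output: (verb, grammatical relation, head word) tuples.
parsed_tuples = [
    ("eat", "dobj", "apple"), ("eat", "dobj", "bread"), ("eat", "subj", "man"),
    ("devour", "dobj", "apple"), ("devour", "dobj", "bread"),
    ("devour", "dobj", "meat"), ("devour", "subj", "dog"),
    ("sleep", "subj", "man"), ("sleep", "pp_on", "bed"), ("sleep", "subj", "dog"),
]

def vectors(tuples):
    """Build one co-occurrence count vector per verb."""
    vecs = {}
    for verb, rel, word in tuples:
        vecs.setdefault(verb, Counter())[(rel, word)] += 1
    return vecs

def cosine(u, v):
    def norm(w):
        return math.sqrt(sum(c * c for c in w.values()))
    dot = sum(u[k] * v[k] for k in u)
    return dot / (norm(u) * norm(v))

vecs = vectors(parsed_tuples)
# Verbs sharing argument heads ("eat"/"devour") should score higher
# than verbs with mostly disjoint contexts ("eat"/"sleep").
print(cosine(vecs["eat"], vecs["devour"]), cosine(vecs["eat"], vecs["sleep"]))
```

At scale, the same construction runs over millions of parsed sentences, and the resulting similarities feed clustering or nearest-neighbor classification into Levin-style verb classes.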
Recommended publications
  • ProQuest Dissertations
    Automated learning of protein involvement in pathogenesis using integrated queries. Eithon Cadag. A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, University of Washington, 2009. Program Authorized to Offer Degree: Department of Medical Education and Biomedical Informatics.

    UMI Number: 3394276. All rights reserved. INFORMATION TO ALL USERS: The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion. UMI Dissertation Publishing, UMI 3394276. Copyright 2010 by ProQuest LLC. All rights reserved. This edition of the work is protected against unauthorized copying under Title 17, United States Code. ProQuest LLC, 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106-1346.

    University of Washington Graduate School. This is to certify that I have examined this copy of a doctoral dissertation by Eithon Cadag and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the final examining committee have been made. Chair of the Supervisory Committee: Peter Tarczy-Hornoch. In presenting this dissertation in partial fulfillment of the requirements for the doctoral degree at the University of Washington, I agree that the Library shall make its copies freely available for inspection. I further agree that extensive copying of this dissertation is allowable only for scholarly purposes, consistent with "fair use" as prescribed in the U.S.
  • Automatic Correction of Real-Word Errors in Spanish Clinical Texts
    sensors (Article). Automatic Correction of Real-Word Errors in Spanish Clinical Texts. Daniel Bravo-Candel 1, Jésica López-Hernández 1, José Antonio García-Díaz 1, Fernando Molina-Molina 2 and Francisco García-Sánchez 1,*. 1 Department of Informatics and Systems, Faculty of Computer Science, Campus de Espinardo, University of Murcia, 30100 Murcia, Spain; [email protected] (D.B.-C.); [email protected] (J.L.-H.); [email protected] (J.A.G.-D.). 2 VÓCALI Sistemas Inteligentes S.L., 30100 Murcia, Spain; [email protected]. * Correspondence: [email protected]; Tel.: +34-86888-8107.

    Abstract: Real-word errors are characterized by being actual terms in the dictionary. By providing context, real-word errors are detected. Traditional methods to detect and correct such errors are mostly based on counting the frequency of short word sequences in a corpus. Then, the probability of a word being a real-word error is computed. On the other hand, state-of-the-art approaches make use of deep learning models to learn context by extracting semantic features from text. In this work, a deep learning model was implemented for correcting real-word errors in clinical text. Specifically, a Seq2seq Neural Machine Translation Model mapped erroneous sentences to correct them. For that, different types of error were generated in correct sentences by using rules. Different Seq2seq models were trained and evaluated on two corpora: the Wikicorpus and a collection of three clinical datasets. The medicine corpus was much smaller than the Wikicorpus due to privacy issues when dealing with patient information.
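The rule-based error-generation step described in the abstract can be sketched as follows; the confusion pairs are invented English examples, not the paper's actual Spanish clinical rules.

```python
import random

# Hypothetical confusion rules: each maps a real word to another real word,
# mimicking how real-word errors are injected into correct sentences to
# create (erroneous, correct) training pairs for a Seq2seq corrector.
CONFUSIONS = {"there": "their", "accept": "except", "dose": "doze", "form": "from"}

def corrupt(sentence, rng):
    """Return an (erroneous, correct) training pair."""
    tokens = sentence.split()
    candidates = [i for i, t in enumerate(tokens) if t.lower() in CONFUSIONS]
    if candidates:                      # corrupt one eligible token
        i = rng.choice(candidates)
        tokens[i] = CONFUSIONS[tokens[i].lower()]
    return " ".join(tokens), sentence

rng = random.Random(0)
pair = corrupt("increase the dose if symptoms persist", rng)
print(pair)
```

Because the "error" side is derived mechanically from clean text, arbitrarily large parallel corpora can be generated without hand annotation.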
  • Unified Language Model Pre-Training for Natural Language Understanding and Generation
    Unified Language Model Pre-training for Natural Language Understanding and Generation. Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon. Microsoft Research. {lidong1,nanya,wenwan,fuwei}@microsoft.com, {xiaodl,yuwan,jfgao,mingzhou,hon}@microsoft.com

    Abstract: This paper presents a new UNIfied pre-trained Language Model (UNILM) that can be fine-tuned for both natural language understanding and generation tasks. The model is pre-trained using three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. The unified modeling is achieved by employing a shared Transformer network and utilizing specific self-attention masks to control what context the prediction conditions on. UNILM compares favorably with BERT on the GLUE benchmark, and the SQuAD 2.0 and CoQA question answering tasks. Moreover, UNILM achieves new state-of-the-art results on five natural language generation datasets, including improving the CNN/DailyMail abstractive summarization ROUGE-L to 40.51 (2.04 absolute improvement), the Gigaword abstractive summarization ROUGE-L to 35.75 (0.86 absolute improvement), the CoQA generative question answering F1 score to 82.5 (37.1 absolute improvement), the SQuAD question generation BLEU-4 to 22.12 (3.75 absolute improvement), and the DSTC7 document-grounded dialog response generation NIST-4 to 2.67 (human performance is 2.65). The code and pre-trained models are available at https://github.com/microsoft/unilm.

    1 Introduction. Language model (LM) pre-training has substantially advanced the state of the art across a variety of natural language processing tasks [8, 29, 19, 31, 9, 1].
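The mask mechanism described in the abstract can be sketched as 0/1 matrices (an illustrative layout, not UNILM's actual implementation): entry (i, j) is 1 when position i may attend to position j.

```python
def unidirectional_mask(n):
    # Left-to-right LM: each position sees itself and everything to its left.
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    # BERT-style masked LM: every position sees every other position.
    return [[1] * n for _ in range(n)]

def seq2seq_mask(n_src, n_tgt):
    # Source tokens attend bidirectionally within the source; target tokens
    # attend to the full source plus the already-generated target prefix.
    n = n_src + n_tgt
    mask = []
    for i in range(n):
        if i < n_src:
            row = [1] * n_src + [0] * n_tgt
        else:
            row = [1] * n_src + [1 if j <= i else 0 for j in range(n_src, n)]
        mask.append(row)
    return mask

print(seq2seq_mask(2, 2))
```

Since all three objectives differ only in the mask, a single shared Transformer can be pre-trained on all of them, which is the unification the paper describes.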
  • Analyzing Political Polarization in News Media with Natural Language Processing
    Analyzing Political Polarization in News Media with Natural Language Processing, by Brian Sharber. A thesis presented to the Honors College of Middle Tennessee State University in partial fulfillment of the requirements for graduation from the University Honors College, Fall 2020. APPROVED: Dr. Salvador E. Barbosa, Project Advisor, Computer Science Department; Dr. Medha Sarkar, Computer Science Department Chair; Dr. John R. Vile, Dean, University Honors College.

    Copyright © 2020 Brian Sharber. Department of Computer Science, Middle Tennessee State University; Murfreesboro, Tennessee, USA. I hereby grant to Middle Tennessee State University (MTSU) and its agents (including an institutional repository) the non-exclusive right to archive, preserve, and make accessible my thesis in whole or in part in all forms of media now and hereafter. I warrant that the thesis and the abstract are my original work and do not infringe or violate any rights of others. I agree to indemnify and hold MTSU harmless for any damage which may result from copyright infringement or similar claims brought against MTSU by third parties. I retain all ownership rights to the copyright of my thesis. I also retain the right to use in future works (such as articles or books) all or part of this thesis. The software described in this work is free software. You can redistribute it and/or modify it under the terms of the MIT License. The software is posted on GitHub under the following repository: https://github.com/briansha/NLP-Political-Polarization

    Abstract: Political discourse in the United States is becoming increasingly polarized.
  • Automatic Spelling Correction Based on N-Gram Model
    International Journal of Computer Applications (0975 – 8887), Volume 182, No. 11, August 2018. Automatic Spelling Correction based on n-Gram Model. S. M. El Atawy, Dept. of Computer Science, Faculty of Specific Education, Damietta University, Egypt; A. Abd ElGhany, Dept. of Computer, Damietta University, Egypt.

    ABSTRACT: A spell checker is a basic requirement for any language to be digitized. It is software that detects and corrects errors in a particular language. This paper proposes a model for spelling error detection and auto-correction that is based on the n-gram technique, applied to error detection and correction in English as a global language. The proposed model provides correction suggestions by selecting the most suitable suggestions from a list of corrective suggestions based on lexical resources and n-gram statistics. It depends on a lexicon of Microsoft words. The evaluation of the proposed model uses English standard datasets of misspelled words. Error detection, automatic error correction, and replacement are the main features of the proposed model. The results of the experiment reached approximately 93% accuracy and acted similarly to Microsoft Word as well as outperforming both of … In this paper, we designed, implemented and evaluated an end-to-end system that performs spellchecking and auto-correction. This paper is organized as follows: Section 2 illustrates types of spelling errors and some related works. Section 3 explains the system description. Section 4 presents the results and evaluation. Finally, Section 5 includes conclusion and future work.

    2. RELATED WORKS: Spell checking techniques have been studied substantially, such as error detection and correction. Two common approaches for error detection are dictionary lookup and n-gram analysis. Most spell checker methods described in the literature use dictionaries as a list of correct spellings that help algorithms to find target words [16].
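The dictionary-lookup-plus-n-gram pipeline described in the excerpt can be sketched with a toy lexicon and bigram table; these are invented stand-ins for the paper's actual lexicon and statistics.

```python
from difflib import get_close_matches

# Toy resources (hypothetical): a dictionary of correct spellings and
# bigram frequencies used to rank candidate corrections in context.
LEXICON = {"the", "cat", "sat", "mat", "on", "car", "cart"}
BIGRAMS = {("the", "cat"): 10, ("the", "car"): 4, ("the", "cart"): 1}

def correct(tokens):
    out = []
    for prev, tok in zip(["<s>"] + tokens, tokens):
        if tok in LEXICON:              # detection: dictionary lookup
            out.append(tok)
            continue
        # Candidate generation: spellings close to the misspelled token.
        cands = get_close_matches(tok, LEXICON, n=3, cutoff=0.6)
        # Ranking: prefer the candidate that best fits the bigram context.
        best = max(cands, key=lambda c: BIGRAMS.get((prev, c), 0), default=tok)
        out.append(best)
    return out

print(correct(["the", "cta"]))
```

The structure mirrors the paper's description: dictionary lookup flags the error, and n-gram statistics pick the replacement from the suggestion list.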
  • N-Gram-Based Machine Translation
    N-gram-based Machine Translation. José B. Mariño, Rafael E. Banchs, Josep M. Crego, Adrià de Gispert, Patrik Lambert, José A. R. Fonollosa, Marta R. Costa-jussà. Universitat Politècnica de Catalunya.

    This article describes in detail an n-gram approach to statistical machine translation. This approach consists of a log-linear combination of a translation model based on n-grams of bilingual units, which are referred to as tuples, along with four specific feature functions. Translation performance, which happens to be in the state of the art, is demonstrated with Spanish-to-English and English-to-Spanish translations of the European Parliament Plenary Sessions (EPPS).

    1. Introduction. The beginnings of statistical machine translation (SMT) can be traced back to the early fifties, closely related to the ideas from which information theory arose (Shannon and Weaver 1949) and inspired by works on cryptography (Shannon 1949, 1951) during World War II. According to this view, machine translation was conceived as the problem of finding a sentence by decoding a given "encrypted" version of it (Weaver 1955). Although the idea seemed very feasible, enthusiasm faded shortly afterward because of the computational limitations of the time (Hutchins 1986). Finally, during the nineties, two factors made it possible for SMT to become an actual and practical technology: first, significant increases in both the computational power and storage capacity of computers, and second, the availability of large volumes of bilingual data. The first SMT systems were developed in the early nineties (Brown et al. 1990, 1993). These systems were based on the so-called noisy channel approach, which models the probability of a target language sentence T given a source language sentence S as the product of a translation-model probability p(S|T), which accounts for adequacy of translation contents, times a target language probability p(T), which accounts for fluency of target constructions.
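The noisy-channel formulation described in the excerpt can be written compactly as the standard decoding rule, restated here for reference:

```latex
\hat{T} \;=\; \arg\max_{T} \, p(T \mid S)
        \;=\; \arg\max_{T} \, p(S \mid T)\, p(T),
```

where, by Bayes' rule, $p(S \mid T)$ is the translation model (adequacy of translation contents), $p(T)$ is the target language model (fluency of target constructions), and $p(S)$ is constant over $T$ and drops out of the maximization.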
  • A Notation and Definitions
    A Notation and Definitions

    All notation used in this work is "standard". I have opted for simple notation, which, of course, results in a one-to-many map of notation to object classes. Within a given context, however, the overloaded notation is generally unambiguous. I have endeavored to use notation consistently. This appendix is not intended to be a comprehensive listing of definitions. The Index, beginning on page 519, is a more reliable set of pointers to definitions, except for symbols that are not words.

    A.1 General Notation. Uppercase italic Latin and Greek letters, such as A, B, E, Λ, etc., are generally used to represent either matrices or random variables. Random variables are usually denoted by letters nearer the end of the Latin alphabet, such as X, Y, and Z, and by the Greek letter E. Parameters in models (that is, unobservables in the models), whether or not they are considered to be random variables, are generally represented by lowercase Greek letters. Uppercase Latin and Greek letters are also used to represent cumulative distribution functions. Also, uppercase Latin letters are used to denote sets. Lowercase Latin and Greek letters are used to represent ordinary scalar or vector variables and functions. No distinction in the notation is made between scalars and vectors; thus, β may represent a vector and βi may represent the ith element of the vector β. In another context, however, β may represent a scalar. All vectors are considered to be column vectors, although we may write a vector as x = (x1, x2, ..., xn). Transposition of a vector or a matrix is denoted by the superscript "T".
  • Ryan Tibshirani
    Ryan Tibshirani, Carnegie Mellon University, http://www.stat.cmu.edu/~ryantibs/, Depts. of Statistics and Machine Learning, [email protected], 229B Baker Hall, 412.268.1884, Pittsburgh, PA 15213.

    Academic Positions: Associate Professor (Tenured), Depts. of Statistics and Machine Learning, Carnegie Mellon University, July 2016 – present. Joint Appointment, Dept. of Machine Learning, Carnegie Mellon University, Oct 2013 – present. Assistant Professor, Dept. of Statistics, Carnegie Mellon University, Aug 2011 – June 2016.

    Education: Ph.D. in Statistics, Stanford University, Sept 2007 – Aug 2011. Thesis: "The Solution Path of the Generalized Lasso". Advisor: Jonathan Taylor. B.S. in Mathematics, Stanford University, Sept 2003 – June 2007. Minor in Computer Science.

    Grants, Awards, Honors: "Theoretical Foundations of Deep Learning", Department of Defense (DoD) Multidisciplinary University Research Initiative (MURI) grant, May 2020 – Apr 2025 (co-PI, Rich Baraniuk is PI). "Delphi Influenza Forecasting Center of Excellence", Centers for Disease Control and Prevention (CDC) grant no. U01IP001121, Sept 2019 – Aug 2024, total award amount $3,000,000 (co-PI, Roni Rosenfeld is PI). "Improved Nowcasting via Adaptive Boosting of Highly Variable Biosurveillance Data Sources", Defense Threat Reduction Agency (DTRA) grant no. HDTRA1-18-C-0008, Nov 2017 – May 2020, total award amount $1,016,057 (co-PI, Roni Rosenfeld is PI). Teaching Innovation Award from Carnegie Mellon University, Apr 2017. "Locally Adaptive Nonparametric Estimation for the Modern Age – New Insights, Extensions, and Inference Tools", National Science Foundation (NSF) Division of Mathematical Sciences (DMS) CAREER grant no. 1554123, July 2016 – June 2021, total award amount $400,000 (PI). "Graph Trend Filtering for Recommender Systems", Adobe Digital Marketing Research Awards, Sept 2014 – Sept 2015, total award amount $50,000 (co-PI, Alex Smola is PI). "Advancing Theory and Computation in Statistical Learning Problems", National Science Foundation (NSF) Division of Mathematical Sciences (DMS) grant no.
  • High-Dimensional and Causal Inference by Simon James Sweeney Walter
    High-dimensional and causal inference, by Simon James Sweeney Walter. A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Statistics in the Graduate Division of the University of California, Berkeley. Committee in charge: Professor Bin Yu, Co-chair; Professor Jasjeet Sekhon, Co-chair; Professor Peter Bickel; Assistant Professor Avi Feller. Fall 2019.

    Abstract: High-dimensional and causal inference are topics at the forefront of statistical research. This thesis is a unified treatment of three contributions to these literatures. The first two contributions are to the theoretical statistical literature; the third puts the techniques of causal inference into practice in policy evaluation. In Chapter 2, we suggest that a broadly applicable remedy for the failure of Efron's bootstrap in high dimensions is to modify the bootstrap so that data vectors are broken into blocks and the blocks are resampled independently of one another. Cross-validation can be used effectively to choose the optimal block length. We show both theoretically and in numerical studies that this method restores consistency and has superior predictive performance when used in combination with Breiman's bagging procedure. This chapter is joint work with Peter Hall and Hugh Miller. In Chapter 3, we investigate regression adjustment for the modified outcome (RAMO). An equivalent procedure is given in Rubin and van der Laan [2007] and then in Luedtke and van der Laan [2016]; philosophically similar ideas appear to originate in Miller [1976].
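The blocked bootstrap summarized in the abstract can be sketched as follows; the block length and data are illustrative, and this is a reading of the description above, not the authors' code.

```python
import random

def block_bootstrap(data, block_len, rng):
    """data: list of equal-length observation vectors. Returns one resample
    in which each coordinate block is drawn independently of the others."""
    n, p = len(data), len(data[0])
    starts = range(0, p, block_len)
    resample = []
    for _ in range(n):
        vec = []
        for s in starts:
            donor = data[rng.randrange(n)]   # independent draw per block
            vec.extend(donor[s:s + block_len])
        resample.append(vec)
    return resample

rng = random.Random(0)
data = [[i * 10 + j for j in range(6)] for i in range(4)]  # 4 obs, dim 6
boot = block_bootstrap(data, block_len=2, rng=rng)
```

Each resampled vector stitches together blocks from possibly different original observations, which is what breaks the high-dimensional dependence that defeats the ordinary bootstrap; the block length would be chosen by cross-validation, as the abstract notes.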
  • Spell Checker in CET Designer
    Linköping University, Department of Computer Science. Bachelor thesis, 16 ECTS | Datateknik. 2016 | LIU-IDA/LITH-EX-G--16/069--SE. Spell Checker in CET Designer. Rasmus Hedin. Supervisor: Amir Aminifar. Examiner: Zebo Peng. Linköpings universitet, SE–581 83 Linköping, +46 13 28 10 00, www.liu.se

    Upphovsrätt (translated from Swedish): This document is made available on the Internet, or its future replacement, for a period of 25 years from the date of publication, provided that no exceptional circumstances arise. Access to the document implies permission for anyone to read, download, and print single copies for personal use, and to use it unchanged for non-commercial research and for teaching. Transfer of copyright at a later date cannot revoke this permission. All other use of the document requires the copyright holder's consent. To guarantee authenticity, security, and accessibility, solutions of a technical and administrative nature are in place. The author's moral rights include the right to be named as the author, to the extent required by good practice, when the document is used as described above, as well as protection against the document being altered or presented in a form or context that is offensive to the author's literary or artistic reputation or integrity. For additional information about Linköping University Electronic Press, see the publisher's website http://www.ep.liu.se/.

    Copyright: The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes.
  • Automatic Extraction of Personal Events from Dialogue
    Automatic extraction of personal events from dialogue. Joshua D. Eisenberg, Artie, Inc., 601 W 5th St, Los Angeles, CA 90071, [email protected]. Michael Sheriff, Florida International University, Department of Creative Writing, 11101 S.W. 13 ST., Miami, FL 33199, [email protected].

    Abstract: In this paper we introduce the problem of extracting events from dialogue. Previous work on event extraction focused on newswire; however, we are interested in extracting events from spoken dialogue. To ground this study, we annotated dialogue transcripts from fourteen episodes of the podcast This American Life. This corpus contains 1,038 utterances, made up of 16,962 tokens, of which 3,664 represent events. The agreement for this corpus has a Cohen's κ of 0.83. We have open sourced this corpus for the NLP community. With this corpus in hand, we trained support vector machines (SVM) to correctly classify these phenomena.

    … language. The most popular corpora annotated for events all come from the domain of newswire (Pustejovsky et al., 2003b; Minard et al., 2016). Our work begins to fill that gap. We have open sourced the gold-standard annotated corpus of events from dialogue. For brevity, we will hereby refer to this corpus as the Personal Events in Dialogue Corpus (PEDC). We detailed the feature extraction pipelines and the support vector machine (SVM) learning protocols for the automatic extraction of events from dialogue. Using this information, as well as the corpus we have released, anyone interested in extracting events from dialogue can proceed where we have left off.
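The Cohen's κ figure quoted above is a standard chance-corrected agreement statistic. A minimal sketch of its computation, with toy annotator labels rather than the PEDC data:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_e = sum(ca[l] * cb[l] for l in labels) / (n * n)   # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Toy example: two annotators marking tokens as event/other.
ann1 = ["event", "event", "other", "other", "event", "other"]
ann2 = ["event", "other", "other", "other", "event", "other"]
print(round(cohens_kappa(ann1, ann2), 3))  # → 0.667
```

A κ of 0.83, as reported for the corpus, indicates agreement well above what label frequencies alone would produce.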
  • Deep Learning for Web Search and Natural Language Processing
    Deep Learning for Web Search and Natural Language Processing. Jianfeng Gao, Deep Learning Technology Center (DLTC), Microsoft Research, Redmond, USA. WSDM 2015, Shanghai, China. *Thanks to Li Deng and Xiaodong He, with whom we participated in the previous ICASSP 2014 and CIKM 2014 versions of this tutorial.

    Mission of Machine (Deep) Learning: Data (collected/labeled), Model (architecture), Training (algorithm).

    Outline: The basics: background of deep learning; a query classification problem; a single neuron model; a deep neural network (DNN) model; potentials and problems of DNN; the breakthrough after 2006. Deep Semantic Similarity Models (DSSM) for text processing. Recurrent Neural Networks.

    "Scientists See Promise in Deep-Learning Programs", John Markoff, November 23, 2012. Rick Rashid in Tianjin, China, October 25, 2012. Geoff Hinton. The universal translator of "Star Trek" comes true: a voice recognition program translated a speech given by Richard F. Rashid, Microsoft's top scientist, into Chinese. Impact of deep learning in speech technology: Cortana.

    A query classification problem: given a search query q, e.g., "denver sushi downtown", identify its domain c, e.g., Restaurant, Hotel, Nightlife, Flight, etc., so that a search engine can tailor the interface and results to provide a richer, personalized user experience.

    A single neuron model: for each domain c, build a binary classifier. Input: represent a query q as a vector of features x = [x1, ..., xn]^T. Output: y = P(c|q). The query q is labeled c if P(c|q) > 0.5. Input feature vector, ...
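The single-neuron classifier sketched in the slides can be written out as follows; the vocabulary, weights, and bias are hand-picked stand-ins for illustration, not trained values.

```python
import math

# Hypothetical bag-of-words features for a query (a stand-in for the
# feature extraction the slides leave unspecified).
VOCAB = ["sushi", "downtown", "hotel", "flight", "denver"]

def features(query):
    toks = query.lower().split()
    return [1.0 if w in toks else 0.0 for w in VOCAB]

def neuron(x, weights, bias):
    """Logistic neuron: y = P(c | q) = sigmoid(w . x + b)."""
    z = sum(wi * xi for wi, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hand-set weights for the "Restaurant" domain (illustrative only).
w_restaurant = [2.5, 1.0, -2.0, -2.0, 0.2]
y = neuron(features("denver sushi downtown"), w_restaurant, bias=-1.5)
print(y > 0.5)   # label the query "Restaurant" iff P(c|q) > 0.5
```

One such binary neuron per domain, with the 0.5 threshold from the slides, gives the per-domain classifiers the tutorial describes before stacking them into a deep network.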