Representation Learning for Information Extraction

by Ehsan Amjadian

A thesis submitted to the Faculty of Graduate and Postdoctoral Affairs in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Cognitive Science

Carleton University
Ottawa, Ontario

©2019 Ehsan Amjadian

Abstract

Distributed representations, predominantly acquired via neural networks, have been applied to natural language processing tasks including speech recognition and machine translation with a success comparable to sophisticated state-of-the-art algorithms. The present thesis offers an investigation of the application of such representations to information extraction. Specifically, I explore the suitability of applying shallow distributed representations to the automatic terminology extraction task, as well as the bridging reference resolution task.

I create a dataset as a gold standard for automatic term extraction in the mathematical education domain, and I carefully assess the performance of the existing terminology extraction methods on this dataset. I then introduce a novel method for automatic terminology extraction for one-word terms and evaluate its performance in various terminological domains. The introduced algorithm leverages the distributed representation of words from the local and global perspectives to encode syntactic, semantic, association, and frequency information at the same time. Furthermore, this novel algorithm can be trained with a minimal number of data points. I show that the algorithm is robust to the change of domain, and that information can be transferred from one technical domain to another, leveraging what we call anchor words with consistent semantics shared between the domains.

As for the bridging reference resolution task, I build a dataset on the letter portion of the Open American National Corpus and compare the performance of a preliminary method against a majority class baseline.
Acknowledgements

The great sacrifices and contributions made by others enriched my journey over the past years. Words cannot do justice in capturing their help and sacrifices; however, this is an attempt to acknowledge them.

I would like to thank my brother Amir Amjadian and my mother Dr. Khadijeh Eshaghi for their support and sacrifice that first and foremost made this endeavor possible. I could travel, as a result, to the other side of the world in pursuit of knowledge and to contribute to the scientific community. Amir and Khadijeh, without your great help and sacrifice this journey would have never started. Thank you!

I would like to thank my dear wife, Mahsa Raeisi Ardali, for her unconditional support during the writing stage of this dissertation. Her devotion and encouragement strengthened my steps forward.

Immeasurable gratitude goes to my supervisors: Professor Raj Singh and Professor Diana Inkpen. Dr. Singh took me under his wing and made Ottawa feel like home. My eyes were opened and my mind freed by our discussions, his clarity in thoughts, reasoning and writing, his vast experience in research, his absolute command of formal and computational semantics and pragmatics, and above all, his invaluable mentorship and friendship. Dr. Inkpen noticed my passion for natural language processing, machine learning, and deep learning, and nurtured them with her depth and breadth of knowledge in the field of artificial intelligence and computer science, turning thoughts into ideas, and ideas into experiments. I could not have asked for better supervision and mentorship.

I would like to thank my thesis committee members, Professor Xiaodan Zhu and Professor Robert West, for the many inspiring discussions and conversations in NLP, information extraction, word embeddings, and high dimensional data structures as well as their great feedback for the present thesis.
Great thanks go to Professor Patrick Drouin for his pioneering work in automatic terminology extraction as well as his invaluable comments on the thesis. I was fortunate to be one of the countless individuals who have been inspired by his work.

Many thanks go to Professor Christopher Cox for his invaluable feedback on the thesis, which resulted from a close assessment of the ideas and experiments in the document and led to their further refinement.

I could not have wished for better lab mates and friends than Roxana Barbu and Prasadith Kirinde Gamaarachchige. Prasadith brought an ocean of cutting-edge skills in software engineering and web development to the lab, in addition to his radiating serenity. Roxana made many extended hours of research seem normal and pleasant by her diligence and teamwork, even though we worked on different projects. Thank you both for being such great friends and for all the thought-provoking conversations.

Great thanks go to my friend and colleague Professor Muhammad Rizwan Abid. We had many inspiring conversations from the very beginning of my journey, many of which resulted in great academic work. Roxana and Muhammad both kindly proofread the present document and made many great suggestions that led to its improvement.

A significant portion of the work in automatic terminology extraction comes from the years of collaboration with Professor T. Sima Paribakht and Professor Farahnaz Faez as well as their contributions before I joined the project. Working with them was an absolute honor and pleasure. I benefited much from their advice and insights.

Great thanks go to Liane Dubreuil for the warm welcome to the department and for removing any administrative obstacle from my path.

The present work benefited much from great efforts by Jeol Baylis and Christopher Genovesi constructing the bridging reference corpus.
I would like to thank Amir Gharavi for all the detailed mathematical discussions as well as being a great friend over the past years.

Contents

Abstract
Acknowledgements

1 Introduction
  1.1 The Tasks to Tackle
  1.2 Automatic Terminology Extraction (ATE)
  1.3 Bridging Reference Resolution
  1.4 Objectives
  1.5 Contributions
    1.5.1 List of Main Contributions of the Present Dissertation
  1.6 Research Questions
  1.7 Publications
  1.8 Thesis Structure
  1.9 Summary

2 Background
  2.1 Automatic Terminology Extraction
    2.1.1 Traditional ATE
    2.1.2 Machine Learning ATE
    2.1.3 Distributed ATE
  2.2 Transfer Learning
  2.3 Domain Adaptation
  2.4 Statistical Modeling of Language
    2.4.1 Introduction
    2.4.2 N-Gram Models
    2.4.3 Vector Space Models
    2.4.4 Word Embeddings
  2.5 Summary

3 Automatic Terminology Extraction Evaluation: A Gold Standard for the Mathematics Education Domain
  3.1 Introduction
  3.2 Related Work
  3.3 Term Extraction Methods and Tools
    3.3.1 AntConc
    3.3.2 Topia
    3.3.3 TermoStat
  3.4 Sketch Engine
  3.5 Corpus
  3.6 Term Evaluator
  3.7 The Annotation Process
  3.8 Results and Analysis
  3.9 Summary

4 Distributed Specificity
  4.1 Introduction
  4.2 Related Work
  4.3 Corpus
  4.4 Methodology
    4.4.1 Specificity Vectors
    4.4.2 Filtering Approach
    4.4.3 Direct Approach
  4.5 Annotation
  4.6 Experiments and Results
  4.7 Summary

5 Cross-Domain Distributed Automatic Terminology Extraction
  5.1 Quantitative Experiments
  5.2 Qualitative Assessments
  5.3 Domain Adaptation for ATE
    5.3.1 Introduction
    5.3.2 Dataset
    5.3.3 Methods
    5.3.4 Experiments and Results
  5.4 Summary

6 Distributed Bridging Reference Resolution
  6.1 Introduction
  6.2 Related Work
  6.3 Data
  6.4 Methodology
  6.5 Experiments & Results
  6.6 Summary

7 Conclusion and Future Research
  7.1 Summary
  7.2 Future Work

Appendices

A Full List of Results in Qualitative Assessments
  A.1 Microbiology
  A.2 Optometry
