Machine Learning of Antonyms in English and Arabic Corpora

Luluh Basim M. Aldhubayi School of Computing University of Leeds

Submitted in accordance with the requirements for the degree of Doctor of Philosophy

January 2019

The candidate confirms that the work submitted is her own, except where work which has formed part of jointly-authored publications has been included. The contribution of the candidate and the other authors to this work has been explicitly indicated below. The candidate confirms that appropriate credit has been given within the thesis where reference has been made to the work of others. Parts of the work presented in Chapter 7 have been published in the following jointly-authored article:

Al-Yahya, M., Al-Malak, S., Aldhubayi, L. (2016). Ontological Enrichment: The Badea System For Semi-Automated Extraction Of Antonymy Relations From Arabic Language Corpora. Malaysian Journal of Computer Science, 29(1).

This copy has been supplied on the understanding that it is copyright material and that no quotation from the thesis may be published without proper acknowledgement.

The right of Luluh Basim M. Aldhubayi to be identified as Author of this work has been asserted by her in accordance with the Copyright, Designs and Patents Act 1988.

© 2019 The University of Leeds and Luluh Basim M. Aldhubayi

To my favourite antonym pair, Amal & Ahmed

Acknowledgements

First and foremost, all praise and thanks to Allah The Almighty for His graces and guidance. Peace and Blessings of Allah be upon the Prophet Muhammad.

I would like to thank my supervisors, Prof. Eric Atwell and Dr Brandon Bennett, for the guidance, encouragement and advice they have provided throughout my PhD journey.

I would like to thank my parents Basim and Anwar for their support and love. I am grateful to my husband Suliman and our children Amal and Ahmed for their love, support and patience during these years. I am also thankful to my sisters Asma and Hind and my brothers Mohammad, Yousuf, Saleh and Ahmed for their love and care. I am also grateful to King Saud University for funding the scholarship and supporting me financially.

Last but not least, I would like to thank my friends in Leeds for accepting nothing less than excellence from me. Amirah Alharbi, thank you for the interesting discussions and ideas about our projects. Thanks to Aicha Al-Faraj, Areej Alghamdi and Huda Alshanbary for your care and support throughout the writing of this thesis and my life in general.

Abstract

Identifying lexical semantic relations in text has been a long-standing dream of artificial intelligence and the target of many researchers' attention over the past years. This thesis addresses the problem of identifying antonymy relations, such as (hot/cold), by an automatic method. This work presents three key steps in capturing antonymous word pairs: extracting word pair examples from a textual corpus, representing antonymy in a pair vector space model, and using a machine learning classifier to predict the antonymy relation.

Researchers have found that discriminating antonymy from synonymy is a non-trivial task. Both relations show similar semantic distributions, as they are found in similar contexts. This issue affects many similarity-based applications by displaying opposite words instead of synonyms. Moreover, both traditional and modern vector space models, such as Bag-of-Words and Word Embeddings, show poor discrimination between antonymous and synonymous words. Therefore, this work proposes an antonymy pair vector representation based on symmetric classified patterns extracted from a corpus. We are also motivated by extracting novel antonymy and opposite relations between word pairs. This research aims to capture and predict antonym pairs generated from a textual corpus, to make computers able to understand and capture the opposition relation in text.

Our research proposes the Antonymy Classifier, which combines two approaches: the pattern-based approach and a machine learning classifier. We use the pattern-based approach to extract word pairs and patterns. We also propose using distant supervision learning to label the extracted pairs automatically. Distant supervision uses an external knowledge base (the Open Multilingual WordNet) to generate positive and negative antonymy instances. It also extracts every sentence from a corpus which shows both canonical antonym pairs such as (national/international) and non-canonical antonym or opposite pairs such as (internal/international) that might provide statistical evidence for an antonymy relation.

In addition, this work presents a pattern classifier model which automatically extracts and classifies antonymy patterns by computing the average co-occurrence association between positive (antonymy) and negative (non-antonymy) instances in the training set. Some of these patterns, such as (between X and Y, both X and Y, from X to Y), were found in related linguistic studies on manual pattern extraction and analysis. We also found some novel textual patterns that are highly associated with antonym pairs, such as (however X or Y, what is X and what is Y) and more.

This work also reports experiments in extracting and predicting antonymy on the English BNC and SkELL corpora and the Arabic ArTenTen corpus. The overall outcomes showed a positive prediction improvement in distinguishing antonym pairs compared to previous attempts. We also present new antonym pairs that are not found in the English and Arabic WordNets. The antonymy classifier model uses a machine learning algorithm to extract and classify novel adjectival and noun antonymous pairs such as (verbal/visual), (input/output), (life/death) and (material/spiritual). Therefore, the work presented in this research is a promising method for better extraction and classification of antonym pairs and patterns in a corpus.

Abbreviations

ACC Accuracy
API Application Programming Interface
ArTenTen Arabic TenTen corpus
AWN Arabic WordNet
BNC British National Corpus
CART Classification and Regression Tree
DT Decision Tree
GUI Graphical User Interface
ILI Interlingual Index
KBs Knowledge Bases
KNN K-Nearest Neighbour
LSA Latent Semantic Analysis
LSP Lexico-Syntactic Pattern
MCL Markov Cluster Algorithm
NB Naive Bayes
NLP Natural Language Processing
NLTK Natural Language Toolkit
NN Neural Network
OMW Open Multilingual WordNet
P Precision
PCA Principal Component Analysis
PMI Pointwise Mutual Information
PMI-IR Pointwise Mutual Information and Information Retrieval
PoS Part-of-Speech
PWN Princeton WordNet of English
R Recall
SkELL Sketch Engine for Language Learning corpus
SMO Sequential Minimal Optimization
SPs Symmetric Patterns
SVD Singular Value Decomposition
SVM Support Vector Machine
synset synonym set
TF-IDF Term Frequency/Inverse Document Frequency
YAGO Yet Another Great Ontology

Symbols and Typographical Conventions

(X/Y) word pair symbol, e.g. (hot/cold)

(ITALIC) represents a pattern, e.g. (from X to Y)

CAPITALS refer to a noun Attribute or Hypernym, e.g. TEMPERATURE

“double quotation” refers to a cited sentence

BOLD indicates an important value or result

Contents

1 Introduction 21 1.1 Research Motivation ...... 23 1.2 Why is Antonymy Extraction and Prediction Important? . . . 23 1.3 Research Objectives ...... 24 1.4 Research Contributions ...... 25 1.5 Thesis Structure ...... 26

2 Introducing Antonymy 30 2.1 Defining Antonymy ...... 31 2.1.1 Antonym Gradability ...... 32 2.1.2 Antonym Canonicity and Directionality ...... 33 2.1.3 The Componential Analysis of Antonyms ...... 33 2.1.4 Context-Dependent Opposites ...... 35 2.2 Opposition Categories ...... 36 2.3 Opposition Levels ...... 38 2.4 Antonymy Features ...... 41 2.4.1 Antonymy Association and Co-occurrence ...... 41 2.4.2 Antonymy and Substitutability ...... 42 2.4.3 Antonymy Patterns ...... 43 2.4.4 The Semantic Scale of Antonym Pairs ...... 45 2.5 Antonymy in the WordNet ...... 46 2.5.1 The Princeton WordNet of English ...... 48 2.5.2 The Open Multilingual WordNet ...... 49 2.5.3 WordNet Relation ...... 49


2.6 Antonymy Applications ...... 52 2.7 Summary ...... 53

3 Antonymy Extraction Methods 56 3.1 Vector Space Models ...... 56 3.1.1 Bag-of-Words ...... 57 3.1.2 Word Embeddings ...... 58 3.2 Relation Extraction Approaches ...... 60 3.2.1 The Distributional Approach ...... 61 3.2.2 Pattern-Based Extraction Approach ...... 65 3.2.3 Machine Learning based Method ...... 75 3.2.4 Deep Learning for Semantic Relations Extraction . . . 79 3.2.5 Other Extraction Approaches ...... 80 3.3 Discussion ...... 81 3.4 Summary ...... 83

4 Semantic Representation of Antonymy 84 4.1 Pair Vector Representation Model ...... 85 4.2 Feature Analysis ...... 89 4.2.1 Feature Extraction ...... 89 4.2.2 Feature Selection ...... 92 4.3 Research Framework ...... 93 4.4 Implementation Tools: Sketch Engine ...... 96 4.5 Dataset Construction and Labelling ...... 96 4.5.1 Automatic Word Pair Extraction ...... 98 4.5.2 Pair Extraction Output ...... 99 4.6 Distant Supervision Learning ...... 102 4.6.1 Labelling Algorithm 1 ...... 105 4.6.2 The Semantic Labelling of Word Pairs ...... 107 4.6.3 N-Degree Opposition Labelling ...... 112 4.6.4 Evaluating Labelling Results ...... 116 4.6.5 Discussion ...... 122 4.7 Summary ...... 125


5 Antonymy Pattern Extraction and Selection 126 5.1 Automatic Pattern Extraction Approach ...... 128 5.2 Pattern Selection ...... 131 5.2.1 Pattern Selection Strategies ...... 132 5.3 Antonymy Pattern Classifier ...... 141 5.3.1 Classifier Output ...... 143 5.4 Discussion ...... 150 5.4.1 Discovering Novel Noun Attribute Patterns ...... 151 5.5 Summary ...... 155

6 Antonymy Relation Prediction 156 6.1 Function Approximation ...... 157 6.2 Machine Learning Model Selection and Parameter ...... 157 6.2.1 Naive Bayes ...... 158 6.2.2 K-Nearest Neighbour ...... 158 6.2.3 Decision Tree ...... 159 6.2.4 Neural Network ...... 159 6.2.5 Vector Normalization Scale Selection ...... 159 6.2.6 Feature Vector Dimension Reduction and Evaluation . 161 6.3 Classification Results on BNC and SkELL ...... 164 6.3.1 New Antonymy Pairs ...... 167 6.3.2 Theoretical Classification of the Classified Antonymy Pairs ...... 168 6.3.3 Challenges and Limitations: ...... 173 6.4 Model Evaluation on Related Test Sets ...... 174 6.4.1 Test sets ...... 174 6.4.2 Implementation and Evaluation Parameters ...... 176 6.4.3 Evaluation Results ...... 177 6.4.4 Discussion ...... 179 6.5 Summary ...... 180


7 Model Application: Antonymy in Arabic Corpora 181 7.1 Arabic WordNet ...... 182 7.2 Issues in Arabic WordNet ...... 185 7.3 Antonymy in Arabic Corpora: Initial Experiment ...... 187 7.4 Antonymy in ArTenTen Corpus ...... 189 7.4.1 Dataset Construction ...... 189 7.4.2 Labelling Limitations ...... 191 7.4.3 Dataset Construction Output ...... 193 7.4.4 Pattern Induction and Selection ...... 193 7.4.5 Pattern Classifier ...... 194 7.4.6 Antonymy Classifier Model Evaluation ...... 195 7.5 Arabic and English Antonymy in Parallel Corpus ...... 199 7.6 Discussion ...... 203 7.7 Summary ...... 204

8 Conclusion and Future Work 205 8.1 Conclusion ...... 205 8.2 Future work ...... 210

References 224

A Tables 225 A.1 Pattern Classifier Results (BNC) ...... 225 A.2 Pattern Classifier Results (SkELL corpus) ...... 228 A.3 A comparison between Arabic corpora available in Sketch Engine ...... 229

List of Figures

2.1 The relation between direct and indirect antonymous pairs . . 33 2.2 Examples of antonymy relations ...... 38 2.3 The lexical level of opposition relationships ...... 39 2.4 Examples of opposite pairs ...... 40 2.5 Pattern generation for the word pairs hot/cold in SkELL corpus in Sketch Engine ...... 45 2.6 Examples of gradable antonym pairs and their semantic concept 46 2.7 The query processing to match WordNet’s base form . . . . . 48

3.1 Detecting antonymy relation in English TenTen corpus via Word Embedding approach (Word Embeddings Viewer by Sketch Engine) ...... 60 3.2 Detecting Arabic antonymy relation in Arabic TenTen via Word Embedding approach, (Word Embeddings Viewer by Sketch Engine) ...... 60 3.3 The semantic distribution of the word hot in SkELL corpus . . 64 3.4 An example of the collocation differences between the pair (hot/cold) in word sketch difference tool by Sketch Engine . . 66

4.1 The vector space model of the word pairs and the symmetric patterns ...... 87 4.2 t-SNE visualisation of BNC corpus pair vectors in patterns hyperspace ...... 88 4.3 The overall process of antonymy extraction and classification model ...... 95


4.4 The proposed modelling hypothesis of Mohammad et al. (2008) ...... 108 4.5 Example of the Contrasting hypothesis by (Mohammad, Dorr & Hirst, 2008) ...... 111 4.6 The 1st degree of opposition ...... 113 4.7 The 2nd degree of opposition ...... 114

5.1 Pattern frequency over adjectival pairs in SkELL and BNC corpus ...... 139 5.2 The representation of pair vector and pattern vector ...... 143 5.3 Pattern log scores classification ...... 144 5.4 The logarithm value of 750 extracted pattern from the BNC corpus ...... 149

6.1 A comparison between the models’ performance on feature size selection ...... 163 6.2 A constructed DT model on SkELL word pairs ...... 166 6.3 Accuracy comparison in identifying antonyms in four classifiers ...... 167

7.1 Arabic pairs in Sketch Engine showing antonym and synonym relations ...... 191 7.2 Word Sketch screen output for the word rich ...... 202

List of Tables

2.1 Examples of negated gradable and non-gradable words . . . . 32 2.2 The semantic componential analysis of the words: woman, lady, man, boy and girl ...... 34 2.3 The co-occurrence hypothesis on adjective, noun and verb antonymous pairs ...... 42 2.4 English WordNet statistics ...... 49 2.5 Statistics of antonymy relation on English and Arabic WordNet 50 2.6 Examples of frequent antonym pairs in SkELL corpus and their encoded relation in PWN 3.1 (NE: the relation is Not Encoded in PWN 3.1) ...... 51 2.7 A summary of the relation between antonymy terms used in this thesis and other similar antonymy terms used in previous work...... 55

3.1 The similarity scores LogDice in Sketch Engine Automatic Thesaurus tool ...... 63 3.2 Jones antonym patterns categories ...... 68 3.3 Davies opposite patterns categories ...... 70 3.4 The proposed affix patterns to generate opposites by Mohammad et al. (2008) ...... 72 3.5 The evaluation of Lin et al. (2003) patterns on the top 1000 pairs obtained from SkELL corpus ...... 73 3.6 Arabic negation patterns to generate opposites (NEG=negation unit such as not in English) ...... 74 3.7 The performance of lexico-syntactic pattern (both X and Y) . 75


4.1 Example of pair vector representation ...... 88 4.2 Features Examples ...... 92 4.3 The top five PoS concordance of the adjectival pair (hot/cold) 98 4.4 The top five PoS concordance of the noun pair (man/woman) 99 4.5 The top five PoS concordance of the verb pair (increase/decrease) ...... 99 4.6 Examples of extracted adjectival pairs from BNC and SkELL corpus ...... 100 4.7 Examples of extracted noun pairs from BNC and SkELL corpus ...... 101 4.8 Examples of extracted verb pairs from BNC and SkELL corpus ...... 102 4.9 A comparison between the relation terminology used in Mohammad et al. (2008) and in our work ...... 109 4.10 Examples of our word pair set following the Degree of Contrasting Hypothesis ...... 112 4.11 Labelling evaluation on LZQZ dataset (NA=no relation available in the WordNet) ...... 117 4.12 Labelling evaluation on TURN dataset (NA=no relation available in the WordNet) ...... 117 4.13 Sample of labelled pairs taken from LZQZ dataset. (actual rel= actual relation labelled by Lin et al. (2003), Labelled rel= the labelled relation by our labelling algorithm) ...... 118 4.14 The average precision scores of WordNet labelling on LZQZ and TURN sets ...... 120 4.15 Examples of labelling output on adjectival pairs extracted from BNC and SkELL ...... 121 4.16 Training set including Positive and Negative instances ...... 121 4.17 Testing set including instances of unknown relations in the WN ...... 122

5.1 The evaluation of Lin et al. (2003) patterns on the top 1000 pairs obtained from the SkELL corpus ...... 127 5.2 WordNet relation evaluation on the BNC pattern pairs . . . . 135


5.3 The relation between pattern frequency and the precision and recall scores of antonymy relation obtained from SkELL corpus 139 5.4 A comparison between the precision values of the extracted pattern from the BNC and the SkELL corpus ...... 140 5.5 Examples of positive pattern log scores ...... 146 5.6 Examples of negative pattern log scores ...... 147 5.7 Examples of neutral pattern log scores ...... 148 5.8 Examples of new extracted antonym patterns ...... 150 5.9 Examples of different Part-of-speech patterns ...... 151 5.10 Examples of the extracted attribute-antonym patterns from SkELL corpus ...... 153 5.11 Attribute candidates for the pair (high/low) ...... 154 5.12 Attribute candidates for the pair (male/female) ...... 154 5.13 Attribute candidates for the pair (early/late) ...... 154

6.1 A comparison between norms and their average accuracy based on four classification algorithms (KNN: K-Nearest Neighbour, DT: Decision Tree, NB: Naive Bayes, NN: Neural Network) ...... 161 6.2 A comparison between the mean accuracy scores of three feature selection models in NB and NN classifiers ...... 162 6.3 The mean and standard deviation of some patterns extracted from SkELL corpus ...... 165 6.4 Classifiers Accuracy on adjectival pairs obtained from the BNC and the SkELL corpus ...... 167 6.5 Results of the extracted and classified antonyms in the BNC and SkELL corpus ...... 168 6.6 Examples of classified word pairs from the SkELL corpus . . . 170 6.7 Examples of new antonym pairs extracted from the BNC which do not exist in the WordNet (NA: Not Available in the WordNet) ...... 171


6.8 Examples of new antonym pairs extracted from the BNC which do not exist in the WordNet (NA: Not Available in the WordNet) ...... 172 6.9 Examples of Turney (2011) test set ...... 175 6.10 Examples of Lin et al. (2003) test set ...... 176 6.11 LZQZ Test set Evaluation ...... 178 6.12 TURN Test set Evaluation ...... 179

7.1 Arabic WordNet versions ...... 183 7.2 Antonym coverage in Arabic WordNet versions ...... 183 7.3 Arabic antonymy patterns used in BADEA ontology ...... 188 7.4 The labelled pairs output ...... 193 7.5 Examples of positive patterns extracted from ArTenTen corpus 196 7.6 Evaluation on Arabic pairs classification ...... 197 7.7 Examples of positive antonym pairs ...... 198 7.8 A similar English and Arabic antonymy pattern obtained from the OPUS2 parallel corpus ...... 200 7.9 Matching antonym pairs from English and Arabic concordance generated by parallel concordance ...... 201 7.10 Translation Button Arabic outputs for the word pair (hot/cold) ...... 203

A.1 Pattern classifier output on BNC word pairs ...... 225 A.2 Pattern classifier output on BNC word pairs ...... 226 A.3 Pattern classifier output on SkELL word pairs ...... 228

Chapter 1

Introduction

Antonymy has become the target of many researchers' attention and investigation over the past years (Lobanova et al., 2010; Mohammad et al., 2008; Turney, 2008). Researchers in the field of computing and artificial intelligence are motivated to make computers able to understand human language by understanding semantics and lexical relations such as antonymy, opposition, incompatibility and other semantically opposite relations. Linguists, in turn, seek to determine a better definition of the antonymy and opposition relations (Jones, 2003; Justeson & Katz, 1991; Leech, 1974; Murphy, 2006). Their work also aims to detect the occurrence of antonymy in contexts and textual corpora and to answer questions such as: what are the differences between the contrast, oppositeness and antonymy relationships? Do opposition and antonymy appear in certain frames, contexts or patterns in the corpus?

The literature shows different understandings of the definition of the antonymy relation (Jackson, 1988; Justeson & Katz, 1991; Leech, 1974; Lehrer & Lehrer, 1982), but the overall consensus defines antonymy as the relation of lexical opposition of meaning between certain words, for example (hot/cold), (big/small) and (upper/lower). The relation of opposition, by contrast, has a wider sense: it describes the semantic opposition between word concepts or meanings and is not restricted to certain words, for example (warm/cold), (big/tiny) and (upper/bottom). It is fundamental to understand the relations of antonymy and opposition, as they are reflected in classification performance through the examples that are given to the classifier model.

Since the 1970s, several studies on Vector Space Models, such as Salton (1971), Mikolov et al. (2013) and Turney & Pantel (2010), have proposed using numerical vectors to represent the distribution of word co-occurrence frequency for the purpose of capturing a similarity relation between words or documents. Vector Space Models use distance metrics such as cosine and logDice similarity to compute the semantic distance between word vectors and capture similar words based on their similarity scores. However, using the word vector space model to model antonymy has several issues. Studies on semantic representation (Santus et al., 2014a,b; Schwartz et al., 2015) concluded that current Vector Space Models (e.g. Bag-of-Words and Word Embeddings) lack the ability to distinguish between antonymy and synonymy. This is because antonymous and synonymous words show similar semantic distributions, as they are found in similar contexts. This limitation has the greatest impact on similarity-based applications, which end up showing both opposite and similar results.

Overall, previous studies on antonymy computation aim to solve two main problems. The first is the problem of automating antonymy extraction and identification for the purpose of lexical resource creation and augmentation. An example of this is the work by Lobanova et al. (2010), which aimed to enrich the Dutch WordNet with novel antonym pairs extracted from Dutch corpora. The second category of antonymy computation studies aims to construct a computational model to distinguish antonymy for a specific Natural Language Processing task. For example, similarity-based applications suffer from poor discrimination between the synonymy and antonymy relations. Other applications, such as plagiarism detection systems, use the semantic relations between words to obtain better matching between similar sentences. In short, both problems of antonymy computation suffer from several performance limitations and demand a better antonymy representation.
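To make the failure mode concrete, the following minimal sketch computes cosine similarity over toy co-occurrence vectors; the counts are invented for illustration and are not taken from any corpus used in this thesis. An antonym pair and a near-synonym pair both receive high scores, which is exactly the discrimination problem described above.

import numpy as np

def cosine(u, v):
    """Cosine similarity between two co-occurrence vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy co-occurrence counts over three context words (illustrative numbers):
#                  weather  water  oven
hot = np.array([120.0, 40.0, 95.0])
cold = np.array([110.0, 55.0, 5.0])
warm = np.array([100.0, 35.0, 60.0])

print(round(cosine(hot, cold), 3))  # antonyms: still a high score (~0.82)
print(round(cosine(hot, warm), 3))  # near-synonyms: also a high score (~0.99)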


1.1 Research Motivation

Given the previous issues and limitations in capturing antonymy, this work proposes a pair vector representation model to improve antonymy extraction. The proposed model represents the co-occurrence frequency between a word pair and a set of classified patterns extracted from the corpus, such as both X and Y and however X or Y. In addition, the proposed model relies on distant supervision, which uses an external knowledge base, the Open Multilingual WordNet (OMW), to generate positive and negative antonymy instances. It also extracts every sentence from the corpus which shows either positive or negative pair occurrences that might provide statistical evidence to identify antonymy relations.

The proposed model combines three areas: the pattern-based approach, a vector space representation, and a machine learning classifier. We use a pattern-based approach to extract frequently co-occurring pairs of the same syntactic class, such as (adjective/adjective). We also use the same approach to extract the textual patterns surrounding the extracted pairs. We then use a vector space model to represent the distributional relations between word pairs and the extracted patterns. Finally, we feed the pair vectors into a supervised machine learning model to build the antonymy classifier, which automatically extracts and classifies antonym pairs obtained from the corpus.
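A minimal sketch of the pair vector idea follows. The pattern templates and sentences here are illustrative placeholders, not the classified pattern set produced later in the thesis; each cell of the resulting vector counts how often the pair fills a pattern's slots, in either order.

from collections import Counter

# Illustrative pattern templates; {x} and {y} are the pair slots.
PATTERNS = ["both {x} and {y}", "from {x} to {y}", "either {x} or {y}"]

def pair_vector(x, y, sentences):
    """Count how often the pair (x, y) fills each pattern template,
    in either slot order, over a list of sentences."""
    counts = Counter()
    for p in PATTERNS:
        for a, b in ((x, y), (y, x)):
            phrase = p.format(x=a, y=b)
            counts[p] += sum(s.count(phrase) for s in sentences)
    return [counts[p] for p in PATTERNS]

sents = ["serve it both hot and cold", "the tap runs from cold to hot"]
print(pair_vector("hot", "cold", sents))  # [1, 1, 0]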

1.2 Why is Antonymy Extraction and Prediction Important?

Being able to capture and recognize antonym pairs among other semantically related pairs is useful for many applications. Besides the effectiveness of using antonymy in language teaching (Murphy, 2006), extending the antonymy examples in lexical resources such as WordNet will impact the performance of Natural Language Processing applications. The English WordNet (Fellbaum, 1995) is the largest open-access lexical database of English that has been created manually. The aim of WordNet

construction is to provide a useful resource for computational linguistics and natural language processing. WordNet has been criticised for its limited treatment of antonymy from two perspectives. First, in terms of type of relation, WordNet presents lexical and semantic relations to link between words and concepts respectively. The only relation that describes oppositeness between words in WordNet is antonymy. Other opposite-related relations, such as opposite/contrasting and incompatibility, are not yet considered in the WordNet structure (Lobanova et al., 2010). Second, WordNet has a coverage limitation in terms of the number of encoded antonymy examples. Only 1834 examples of adjectival antonymy are encoded in WordNet, yet it is believed that the language has many more antonymy examples to be represented. For example, antonym pairs such as (input/output), (life/death) and (hardware/software) are not encoded in WordNet (PWN 3.0). Moreover, Mohammad et al. (2008) found that the current WordNet antonymy examples failed to answer the Graduate Record Examinations (GRE) antonymy questions: only 10% of the GRE antonym pairs were found in the English WordNet (PWN). Therefore, extending the number of antonymy relation examples, as well as adding more types of opposition relations to infer non-canonical antonymy (e.g. (internal/international)), is needed today.

1.3 Research Objectives

This research aims to accomplish the following objectives:

• to present a vector representation modelling dedicated to recognising the relation of antonymy between word pairs.

• to combine a pattern-based approach and a machine learning algorithm in extracting and identifying antonym pairs in a corpus.

• to minimize human intervention in capturing and classifying antonymy word pairs.


• to extract and predict novel antonym pairs that are not encoded yet in WordNet (OMW).

• to utilize the corpus query tool Sketch Engine (Kilgarriff et al., 2004) in extracting word pairs and finding antonymy in English and Arabic corpora.

1.4 Research Contributions

This thesis contributes in the following areas:

1. Finding novel antonym pairs: This research contributes to finding antonymy examples in English and Arabic corpora. To the best of our knowledge, there are no previous attempts at finding antonym pairs occurring in English corpora such as the British National Corpus (BNC) (Leech, 1992) and the SkELL corpus (Baisa & Suchomel, 2014). Chapter 6 shows the classifier output as a list of classified antonym and opposite word pairs found in the chosen corpus. The proposed model is able to find novel antonym pairs that are not found in the OMW, such as (input/output), (ancient/modern) and (gay/straight). The analysis of the SkELL corpus yielded 360 word pairs classified as antonyms; among them, 137 word pairs are new and not represented in the OMW. The BNC yielded 188 pairs classified as antonyms, of which 71 antonym pairs are not represented in the OMW. The ArTenTen corpus (Jakubíček et al., 2013) yielded 325 word pairs, of which 52 pairs were classified as antonyms and only 5 antonymy examples are represented in the OMW.

2. Discovering new antonymy patterns: Chapter 5 demonstrates the research model for extracting and classifying textual patterns hosting antonym and opposite pairs. The pattern model succeeds in extracting textual patterns that have been


previously discovered by a manual extraction approach (Jones, 2003). In addition, it contributes to extracting novel antonymy patterns which have not been discovered in previous studies such as (however X or Y, minimizing X and maximizing Y, what is X and what is Y?).

3. Improving vector representation of antonyms: Traditional word vector representation models fail to distinguish antonymous from synonymous word pairs (Schwartz et al., 2015). In addition, Turney (2011) proposed a pair-pattern vector representation to model the semantic relations in word pairs using a static number of patterns. In this research, we improve the pair-pattern vector representation by using an unsupervised classifier to evaluate every extracted pattern according to its association with antonym and non-antonym instances (a sketch of this scoring follows this list).

4. Language-adaptive antonymy model: The proposed supervised Antonymy Classifier is a language-adaptive model and can be applied both to high-resourced languages such as English and to low-resourced languages such as Arabic. The model contributes to enriching lexical resources in low-resourced languages with antonymy examples that are captured from an unlabelled corpus under the distant supervision of the Open Multilingual WordNet.
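The following sketch shows one plausible reading of the pattern scoring mentioned in contribution 3: each pattern receives a smoothed log-ratio of its co-occurrence counts with labelled antonym pairs versus non-antonym pairs. The function and variable names are illustrative, not the thesis implementation.

import math

def pattern_log_score(pattern_counts, pos_pairs, neg_pairs):
    """Smoothed log-ratio of a pattern's co-occurrence with positive
    (antonym) pairs versus negative (non-antonym) pairs.
    pattern_counts maps a (word1, word2) pair to the number of times it
    co-occurs with the pattern in the corpus."""
    pos = sum(pattern_counts.get(p, 0) for p in pos_pairs) + 1
    neg = sum(pattern_counts.get(p, 0) for p in neg_pairs) + 1
    return math.log(pos / neg)

counts = {("hot", "cold"): 9, ("big", "large"): 1}
score = pattern_log_score(counts, [("hot", "cold")], [("big", "large")])
print(score > 0)  # True: the pattern is antonym-associated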

1.5 Thesis Structure

In addition to this current chapter (introduction), this thesis consists of seven further chapters:

• Chapter 2: Introducing Antonymy Chapter 2 has two main parts. The first part defines antonymy and presents a discussion on the theoretical definitions and classifications of antonymy in the literature. We discuss antonymy from different linguists’ theories including semantic components analysis, antonymy


gradability, antonymy directionality and contextual opposition. In addition, the chapter presents a summary of the contrasting/opposition levels based on two relation perspectives: directionality and semantic gradability. The second part of the chapter illustrates antonymy in the English WordNet and discusses the relation representation and coverage. It also presents applications of antonymy in Natural Language Processing.

• Chapter 3: Antonymy Extraction Methods Chapter 3 presents a review of two areas: first, it reviews the vector representation models and discusses their limitations in distinguishing between antonymy and synonymy. Second, it reviews the relation extraction approaches and discusses their limitations in capturing antonymy. It also illustrates the performance of the distribution-based approach in recognizing the antonymy relation in Sketch Engine, including the Automatic Thesaurus tool and the Sketch Difference tool.

• Chapter 4: Semantic Representation of Antonymy Chapter 4 has two main parts: first, it presents the overall process of the proposed model framework and describes the proposed pair vector representation and the feature analysis and selection. Then, it demonstrates the research dataset construction and labelling procedures, showing the pair extraction and labelling strategy on word pairs. We also present the N-degree Opposition labelling algorithm, which aims to label non-canonical antonym pairs found in the corpus, such as (internal/international), by using the antonymy relations between the words' synonyms as encoded in the WordNet.

• Chapter 5: Antonymy Pattern Extraction and Selection Chapter 5 shows the proposed model for extracting and selecting patterns to distinguish antonymy relations. We also investigate and compare different pattern selection strategies such as pattern frequency, pattern Pointwise Mutual Information (PMI) and precision and recall


scores. This chapter presents a robust automatic extraction method for distinguishing good patterns, to obtain better antonymy prediction over massive pattern sets.

• Chapter 6: Antonymy Relation Prediction Chapter 6 presents three parts: first, it describes the underlying assumptions of classification algorithms and why these algorithms might meet the research problem requirements. It also presents the process and criteria for selecting a classification algorithm. We use the classification outputs from two English corpora, the BNC and the SkELL corpus, to decide on the most accurate algorithm. Then, it presents the evaluation on related antonymy test sets which are mainly designed for the purpose of distinguishing between English antonymy and synonymy, such as the TURN set (Turney, 2008) and the LZQZ set (Lin et al., 2003). We compare our classification accuracy results to previous attempts (Lin et al., 2003; Mohammad et al., 2008; Turney, 2008) as well as to the WordNet baseline. The third part presents the model outcomes in generating new or novel antonym pairs extracted from the BNC and the SkELL corpus.

• Chapter 7: Model Application: Antonymy in Arabic Corpora Chapter 7 shows the application of the antonymy classifier to the Arabic corpus ArTenTen. It addresses several issues in the Arabic WordNet and suggests solutions to improve the antonymy representation. For example, it uses a lemmatizer tool from MADAMIRA (Habash et al., 2009) and a diacritics remover to match extracted pairs from the corpus against WordNet's antonym pairs. It then shows that the majority of antonymy examples in the Open Multilingual WordNet are non-canonical due to poor relation representation; thus, the training set labelling was affected by WordNet's poor classification of antonyms. In addition, the chapter evaluates and compares several models for extracting Arabic antonymy, including the antonymy classifier, parallel corpus extraction and bilingual Word Sketch translation.


• Chapter 8: Conclusion and Future Work This chapter draws conclusions regarding all the experiments conducted in this research and presents a summary of the output, with a discussion of the limitations and challenges encountered during the research. It also suggests ideas and research areas for future work.

Chapter 2

Introducing Antonymy

The relation of antonymy has received much attention from different research perspectives, including computational linguistics (Alyahya et al., 2016; Batita & Zrigui, 2017; Lobanova et al., 2010; Mohammad et al., 2008; Nguyen et al., 2017; Turney, 2008) and lexical semantics (Davies, 2012; Jones, 2003; Leech, 1974; Murphy & Jones, 2008). Antonymy plays a significant role in language understanding and utterance interpretation. One simple way to learn new vocabulary in a language is to learn its opposites; children, for example, use opposites in conversation even before they have mastered the meanings of the predicates involved (Murphy & Jones, 2008). However, antonymy has a very broad meaning and has been described in the literature with different terms, such as antonymy, opposition, contrast, incompatibility and contradiction, all describing the contrast between binary words (Jones, 2003; Lobanova et al., 2010; Mohammad et al., 2008). Therefore, a primary concern of learning antonymy is to understand the relation's nature, definitions and characteristics.

This chapter discusses three main points related to antonymy: first, it defines the antonymy relation and presents a discussion of the theoretical definitions and classifications of antonymy in the literature. The discussion covers different linguistic theories, including semantic componential analysis, antonym gradability, antonym directionality and context-dependent opposites. Second, this chapter presents the semantic features that distinguish antonymy from other semantic relations, including

the co-occurrence of antonym pairs, the semantic distribution of antonymy, antonymy patterns and the semantic scale of antonym pairs. Third, the chapter illustrates the situation of antonymy in WordNet versions, including the English Princeton WordNet (PWN 3.1) and the Open Multilingual WordNet (OMW 3.0), and shows the relation representation and the number of encoded examples of antonymy. Finally, the chapter reviews some applications of antonymy in the Natural Language Processing area.

2.1 Defining Antonymy

Antonymy is the lexical opposition relationship between two words (Murphy, 2003). Thus, the definition of antonymy refers to the canonical relation of contrast, such as (hot/cold), (national/international) and (male/female). By contrast, the opposition/contrasting relationship occurs between semantically contrasting terms and refers to a non-canonical relation, such as (hot/cool), (internal/international) and (girl/man).

The literature uses different terminology to describe opposition relationships. For example, Lyons (1978) and Lehrer & Lehrer (1982) used antonymy to describe the opposition relationship which occurs between binary gradable adjective pairs such as (long/short) and (hot/cold). Lyons's definition of antonymy excluded non-gradable antonymy such as (female/male) and (true/false). In contrast, Jackson (1988) and Jones (2003) used antonymy in its broader meaning of opposition, which covers both gradable and non-gradable pairs such as (hot/cold) and (female/male). Murphy (2006) and Davies (2012) used the term canonical antonymy to describe a binary direct antonymy relationship such as (hot/cold) and (love/hate), and non-canonical antonymy to describe a binary indirect antonymy relationship such as (hot/freezing) and (like/hate).

To sum up, a few antonym definitions have emerged, each formulating the relation differently. We summarise these definitions and discuss the corresponding antonymy theories in the following sections.


2.1.1 Antonym Gradability

Antonym gradability refers to the binary contrasting relationship between either gradable pairs such as (hot/cold) or non-gradable pairs such as (female/male).

There have been several definitions of antonymy from the gradability perspective. For example, Lyons (1978) and Cruse (1986) defined antonymy as a binary opposition relationship which only occurs between gradable adjectival pairs. In contrast, Jackson (1988) and Jones (2003) used antonymy in its broader sense and considered both gradable and non-gradable opposite word pairs of any syntactic category, including nouns, verbs, adverbs and adjectives, as examples of antonymy.

Miller et al. (1990) illustrated a morphological feature of gradable word pairs, especially adjectival pairs: gradable adjectives can be modified by adverbs, as in very hot, quite expensive, extremely happy and rather low, but non-gradable pairs such as (woman/man) cannot be modified by adverbs.

Paradis & Willners (2006) proposed the boundedness hypothesis on adjective antonym pairs, which states that negated bounded (non-gradable) adjectives evoke the interpretation of their antonyms: not dead always implies alive. Thus, the pair (dead/not alive) can be considered as synonyms. This semantic interpretation cannot be applied to unbounded gradable adjectives: not hot, for example, does not necessarily imply cold. More examples are shown in table 2.1.

Table 2.1: Examples of negated gradable and non-gradable words

Non-gradable (Bounded)    Gradable (Unbounded)
not TRUE = FALSE          not BIG ≠ SMALL
not FALSE = TRUE          not SMALL ≠ BIG
not FEMALE = MALE         not HOT ≠ COLD
not MALE = FEMALE         not COLD ≠ HOT


2.1.2 Antonym Canonicity and Directionality

Antonym canonicity describes the directionality of the antonym relationship, which can be direct, involving an opposition relation at the word level, or indirect, involving an opposition relation at the concept or semantic level.

Lyons (1978) defined the canonical antonym relationship as an opposition relationship between certain binary lexemes. The relation considers certain words in a certain dimension only and does not include a word's synonyms as part of the relation. For example, as figure 2.1 shows, among the two groups (hot, hottish, warming, fervent) and (cold, chilly, icy), only (hot/cold) is considered to be direct or canonical antonymy, and the rest are semantically opposite. However, pairing words across the two groups will yield indirect or non-canonical antonymy examples such as (warming/cold) and (hot/icy). This also applies to other direct or canonical antonymy examples such as (happy/sad), (big/small) and (up/down).

Figure 2.1: The relation between direct and indirect antonymous pairs

2.1.3 The Componential Analysis of Antonyms

Part of language semantics is the componential analysis of semantically related pairs by Leech (1974) and Kempson (1977). Their analysis involves decomposing the semantic meaning of a word into several minimal semantic

components or variables. For example, the variable HUMAN is found in both words woman and man, as well as other components such as GENDER and AGE; what makes woman different from man is that the GENDER variable has the value MALE in man and FEMALE in woman. Leech's analysis aims to discover other semantically related words that have similar semantic components or variables. Each variable can be assigned either a TRUE or a FALSE value, where TRUE implies the feature is present in the word and FALSE implies it is absent. An example to illustrate the definition is the words man, woman, boy and girl, whose semantic components are shown in table 2.2.

Table 2.2: The semantic componential analysis of the words: woman, lady, man, boy and girl

Word    HUMAN   ADULT   MALE    FEMALE
woman   TRUE    TRUE    FALSE   TRUE
lady    TRUE    TRUE    FALSE   TRUE
man     TRUE    TRUE    TRUE    FALSE
boy     TRUE    FALSE   TRUE    FALSE
girl    TRUE    FALSE   FALSE   TRUE

The above componential analysis shows that a woman and a man have identical values in many variables, including HUMAN and ADULT, and have conflicting values in MALE and FEMALE; therefore they are considered to be opposite terms. A boy and a girl also have similar values in all variables except MALE and FEMALE. Also, a lady and a woman have identical values in all their semantic variables and therefore are considered to be synonyms. Describing the relation of synonymy, such as in (lady/woman), in terms of componential variables can be reasonable and satisfying to some extent. However, describing opposition and antonymy in the above representation is quite complex in terms of distinguishing antonyms from synonyms. For example, the antonym of the word woman can be man in terms of the conflicting values of the variables (MALE) and (FEMALE). The same thing

holds between woman and girl, which can be antonyms in terms of the (ADULT) variable. Although componential analysis offers an attractive definition of semantic relations, applying it to every word, especially gradable adjectives, is complicated. Lehrer & Lehrer (1982) criticised componential analysis when investigating the semantic scales of gradable words: the value of a semantic variable TEMPERATURE in hot, lukewarm and warm, for example, is hard to specify. To resolve this issue, Leech suggested specifying one variable when representing an antonymy relation; for example, the antonym of woman in terms of the (MALE) variable is man. In addition, Leech found that the term antonymy does not fit this relation description and proposed using the term incompatibility to represent every candidate pair that has similar variable values except for one variable, such as (woman/man), (woman/girl), (man/boy) and (boy/girl).
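Leech's incompatibility condition (identical component values except for exactly one) is simple enough to state as code. The sketch below encodes table 2.2, collapsing the redundant MALE/FEMALE columns into a single MALE component for brevity; it is an illustration of the definition, not part of the thesis pipeline.

# Words as sets of boolean semantic components (after table 2.2;
# MALE/FEMALE collapsed into a single MALE component).
FEATURES = {
    "woman": {"HUMAN": True, "ADULT": True, "MALE": False},
    "lady": {"HUMAN": True, "ADULT": True, "MALE": False},
    "man": {"HUMAN": True, "ADULT": True, "MALE": True},
    "boy": {"HUMAN": True, "ADULT": False, "MALE": True},
    "girl": {"HUMAN": True, "ADULT": False, "MALE": False},
}

def differing(w1, w2):
    """Components on which the two words disagree."""
    return [k for k in FEATURES[w1] if FEATURES[w1][k] != FEATURES[w2][k]]

def incompatible(w1, w2):
    """Leech's incompatibility: identical values except exactly one component."""
    return len(differing(w1, w2)) == 1

print(incompatible("woman", "man"))   # True: differ only in MALE
print(incompatible("woman", "lady"))  # False: no difference, so synonyms
print(incompatible("woman", "boy"))   # False: differ in ADULT and MALE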

2.1.4 Context-Dependent Opposites

Context can change the semantic relation between synonymous pairs so that they are presented as opposites. For example, the synonymous pairs (love/like) and (good/great) can appear as opposite pairs in contexts such as I don't like you, but I love you and I'm not feeling good but great.

The studies by Jones (2003) and Davies (2012) showed that some contrastive constructions can trigger opposition between non-opposite words. For instance, Jones (2003) proposed the Ancillary Antonymy patterns, which host an antonym pair and trigger another non-opposite pair in the same sentence. For example, the sentence (I love to cook, but I hate doing the dishes) shows the antonym pair (love/hate) and also triggers the non-opposite pair (cook/doing the dishes) to be recognized as opposite. Davies (2012) showed that there are no specific patterns for Ancillary Antonymy, but the relations can be recognized from the context. Davies's work also investigated frames in English texts to look for novel English opposite pairs, outlining some captured frames in which a non-opposite word pair functions as opposite. Hence, the studies by Jones (2003) and


Davies (2012) concluded that context can change the semantic relation between binary non-opposite words so that they are shown as opposite.

2.2 Opposition Categories

In this section, we present other terms and types of antonymy and opposition relations that have been used in previous studies to describe contrasting words. Leech (1974) and Jones (2003) explained and classified opposition relationships, according to their theoretical differences, into the following categories:

1. The Binary Taxonomy The opposition relationships between non-gradable pairs such as (true/false), (dead/alive) and (male/female) are examples of a binary scale or binary taxonomy (see figure a in 2.2). Both opposite terms represent the boundaries of a semantic scale or dimension. Leech (1974) named this relation Binary Taxonomy, while Jones (2003) called it a Non-gradable antonymy relationship and Jackson (1988) a Complementary relationship.

2. The Polar Opposition Jones (2003) described gradable antonym pairs as pairs whose members lie on a continuous spectrum. For example, (long/short) indicate the boundaries of the LENGTH scale, (rich/poor) of the WEALTH scale and (heavy/light) of the WEIGHT scale (see figure b in 2.2). Leech (1974) referred to this class as polarity. Leech also defined the middle ground point (NORM) to describe the relativity of an object to both opposite poles, as in A is heavy but B is heavier than A; therefore, B is closer than A to the heavy end of the WEIGHT scale.

3. Multiple Taxonomy or Multiple Incompatibility Jones (2003) considered a set such as (summer/winter/autumn/spring) and (gas/liquid/solid) as an incompatible opposition. He describes the


relation between the terms (gas/liquid/solid) as a system of three non-gradable members, and (summer/winter/autumn/spring) as a system of four non-gradable members. Other examples, such as number digits (one/two/three) and weekdays (Monday/Tuesday/Wednesday), are also a type of incompatible relation (see figure c in 2.2).

4. Relation or Reciprocal opposition The relationship in Relation or Reciprocal opposition is described as a contrast in the relation's direction. Examples of this class are (north/south), (west/east), (pull/push) and (landlord/tenant) (see figure d in 2.2). Leech (1974) referred to this type as Relation or Converses, and Jones (2003) named it Reciprocal. Moreover, this type of opposition is known to be an asymmetric rather than a symmetric relation. For example, the parental relationship is true in one direction, as in (X is parent of Y), but false in the opposite direction (Y is parent of X). Therefore, the Relation/Reciprocal opposition involves a syntactic construction to show the direction or flow of the relation between the involved objects. However, this description does not hold for every relation; marriage, for example, is symmetric: (X is married to Y) implies (Y is married to X).


Figure 2.2: Examples of antonymy relations

2.3 Opposition Levels

The previous sections discussed the antonymy relation from different linguistic perspectives and reviewed the opposition terminologies and categories from different angles. In this work, we use the term antonymy to describe the relation between two words that belong to the same semantic dimension and whose meanings are intuitively recognized as opposites. We also use the term opposite to describe two words that belong to the same semantic dimension and are known not to be antonyms but are semantically opposite. Therefore, we can describe the term opposition on a scale which starts with antonymy, the narrowest opposition relationship, and ends with contradiction, the widest opposition relation, which occurs at the phrase level (see figure 2.3).


Figure 2.3: The lexical level of opposition relationships

1. Antonymy Relationship Antonymy is the lexical opposition between binary words. It is the most constrained definition of opposition which only occurs for certain words and for certain meanings.

2. Contrasting/Opposition Relationship The second relation level covers a wider range of opposition than antonymy. Mohammad et al. (2008) defined the Contrasting relationship as every incompatible binary word pair that has a dimension of


oppositeness. We define the Contrasting relationship as the semantic opposition relationship between the synonyms of canonical antonymous word pairs. This class describes the non-canonical, indirect antonymy relation. As figure 2.4 shows, the opposite relationship (indirect antonymy) occurs between pairs such as (fiery/acold) and (hottish/icy).

Figure 2.4: Examples of opposite pairs

3. Incompatible Relationship The third level of opposition is the incompatibility relationship, which occurs between members of a word pair such that their meanings are never opposite, compatible or similar. For example, the pair (head/toes) describes the upper and lower parts of the human body, and head and toes are not compatible parts because they are positioned in opposite directions. Lobanova et al. (2010) described the incompatible relation as a new definition of the antonym relation. Incompatible word pairs occur in contrasting contexts and behave like antonymous pairs, as in the sentence {she still looks amazing from head to toes}.

4. Contradiction Contradiction is a semantic relation that occurs between sentences and phrases. Harabagiu et al. (2006) and Li et al. (2017) defined the relation as holding between sentences that are unlikely to be true simultaneously. Contradictory sentences such as {The man is denying an interview} and {The man is granting an interview} cannot be true at the same time.


2.4 Antonymy Features

Murphy (2006) described antonymy as both a paradigmatic and a syntagmatic relation. The paradigmatic relationship involves one word that can be substituted with its opposite word, while the syntagmatic relationship involves both members of an antonym pair that are likely to be found together in the same context. The work by Murphy (2006), and similar work by Jones (2003) and Davies (2012), concluded that antonymy and opposition relationships constitute a particular type of pattern.

In this section, we present four observed features of the relation of antonymy: (1) members of an antonymous pair tend to co-occur in the sentence, (2) members of an antonymous pair tend not to substitute for each other in long contexts, (3) members of an antonymous pair tend to co-occur in certain constructive patterns, and (4) an antonymous pair has a single semantic dimension.

2.4.1 Antonymy Association and Co-occurrence

One key feature of antonymy is that both members of an antonymous pair tend to co-occur in the same context. Charles & Miller (1989) introduced the co-occurrence hypothesis, which states that adjectival antonymous pairs are associated in such a way that both words tend to co-occur in the same contexts at least once. Murphy & Andrew (1993) explained the phenomenon of antonym pair association from a different perspective, as conveying a sense of rhetoric.

The study by Fellbaum (1995) searched the one million word Brown corpus (Leech, 1992) for opposed verb and noun pairs to test the application of the co-occurrence hypothesis. She concluded that Charles & Miller's hypothesis can be extended beyond adjectival opposed pairs to semantically opposed noun and verb pairs.


2.4.1.1 Antonymy Co-occurrence in SkELL Corpus

In this section, we test the co-occurrence hypothesis on a larger corpus, SkELL (Baisa & Suchomel, 2014), and compare the co-occurrence frequencies of members of adjective, noun and verb antonym pairs. The actual number of sentences in the SkELL corpus is 57,166,899. Table 2.3 shows examples of the captured co-occurrence frequencies and their relation to a pair's Part-of-Speech tags.

The statistics in table 2.3 show that both noun/noun and verb/verb antonym pairs tend to co-occur at lower frequencies than adjectival antonym pairs. Therefore, we should take the pair's syntactic class into consideration, as noun and verb pairs generate occurrence distributions different from those of adjectives.

Table 2.3: The co-occurrence hypothesis on adjective, noun and verb antonymous pairs

W1          W2           Syntactic Category   f(W1)    f(W2)     f(W1,W2)
temporary   permanent    Adjective            37,171   48,212    2,586
wet         dry          Adjective            19,613   40,830    2,311
decrease    increase     Verb                 11,685   87,613    1,704
join        leave        Verb                 75,718   129,964   897
joy         sorrow       Noun                 23,099   5,157     440
stability   instability  Noun                 25,759   7,173     129
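Given the sentence-level counts in table 2.3 and the corpus size above, the association strength of a pair can be quantified with Pointwise Mutual Information (PMI), one of the association measures used elsewhere in this thesis. A minimal sketch, assuming sentence-level co-occurrence counts as in the table:

import math

N = 57_166_899  # number of sentences in the SkELL corpus (see above)

def pmi(f1, f2, f12, n=N):
    """Sentence-level Pointwise Mutual Information of a word pair."""
    return math.log2((f12 / n) / ((f1 / n) * (f2 / n)))

# Counts taken from table 2.3:
print(round(pmi(37_171, 48_212, 2_586), 2))  # temporary/permanent (adjectives)
print(round(pmi(23_099, 5_157, 440), 2))     # joy/sorrow (nouns)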

2.4.2 Antonymy and Substitutability

A core assumption about antonymous word pairs is that they do not substitute for each other in context, especially in long sentences. This is known as the Substitutability Hypothesis of Charles & Miller (1989), which states that two adjectives are learned as antonyms because one word can substitute for its opposite in most contexts. Charles & Miller tested the hypothesis by extracting sentences containing the adjectival pairs (public/private) and (strong/weak) from the one million word English Brown corpus. Their experiment involved removing the adjective and asking participants to


fill in the missing adjective with the most appropriate word from the above-mentioned adjectival pairs. The results showed that only one word was used by the participants and not its opposite. For example, the word coffee is likely to be described as strong and not by its opposite weak. Another example is the word chocolate, which can be described as hot chocolate, but hot is never substituted with cold in this context. On the other hand, Justeson & Katz (1991) studied the relation between the co-occurrence hypothesis and the Substitutability hypothesis on the Brown corpus and concluded that members of antonym pairs can substitute for each other in short patterns as long as the pairs frequently co-occur together in the context. For example, the antonymous lemma pair (man/woman) co-occurs 9,595 times within sentence boundaries in the BNC corpus, and its members can therefore only substitute for each other in short textual patterns such as (both for man and woman) and (both for woman and man). In short, a high co-occurrence frequency of the members of a word pair is not a sufficient threshold for antonymy; semantically related pairs such as (breakfast/lunch) and (health/safety) also have high co-occurrence frequencies in the corpus. Therefore, combining the Substitutability hypothesis and the co-occurrence hypothesis with the pattern-based method, by capturing only co-occurring members of word pairs in certain contrastive patterns, can precisely capture antonymous word pairs.

2.4.3 Antonymy Patterns

The work by Jones (2003) shows that antonymous word pairs tend to co-occur in certain patterns, which makes identifying the relation of antonymy less problematic (see figure 2.5). Jones used a 280 million-word newspaper corpus to investigate sentences containing members of canonical antonym pairs. He concluded that canonical antonym pairs co-occur in certain contrastive patterns. Jones extracted the most frequent patterns in the corpus and classified them into eight categories of antonymy patterns. His study produced

a number of textual patterns, extracted from his corpus, which host canonical antonym pairs and can be used to generate more examples of antonymy. In addition, the extended distributional hypothesis of Lin & Pantel (2001) states that patterns which co-occur with similar word pairs tend to have similar meanings. Following Lin and Pantel is the latent relation hypothesis of Turney & Littman (2003), which states that pairs of words that co-occur in similar patterns tend to have similar semantic relations. For example, both word pairs (hot/cold) and (high/low) co-occur with the pattern (from X to Y), and therefore the members of each word pair stand in a similar semantic relationship. In fact, using patterns to distinguish antonyms is a popular approach: the work of Lin et al. (2003), Turney (2008) and Lobanova et al. (2010) used a predefined set of patterns to capture antonyms. However, this approach suffers from limitations; for example, a pattern such as (both X and Y) is unable to distinguish between antonyms such as (simple/complex) and other similar word pairs such as (interesting/entertaining), (enjoyable/rewarding) and (sad/pointless). Further details about the pattern-based approach and its limitations are discussed in chapter 5.
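As a concrete illustration of both the approach and its limitation, the sketch below instantiates a single Jones-style pattern as a regular expression. This is a simplification: the thesis pipeline works on PoS-tagged corpus data rather than raw word matching.

import re

# One contrastive pattern from Jones (2003), approximated as a regex;
# real extraction would restrict X and Y to the same PoS class.
BOTH_X_AND_Y = re.compile(r"\bboth (\w+) and (\w+)\b", re.IGNORECASE)

def candidate_pairs(sentences):
    """Yield (X, Y) fillers of the pattern 'both X and Y'."""
    for s in sentences:
        for x, y in BOTH_X_AND_Y.findall(s):
            yield x.lower(), y.lower()

sents = ["The dish is served both hot and cold.",
         "A film both interesting and entertaining."]
print(list(candidate_pairs(sents)))
# [('hot', 'cold'), ('interesting', 'entertaining')]: the second pair shows
# why the pattern alone cannot separate antonyms from merely similar words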


Figure 2.5: Pattern generation for the word pair (hot/cold) in the SkELL corpus in Sketch Engine

2.4.4 The Semantic Scale of Antonym Pairs

As mentioned in section 2.1.3, Leech (1974) analysed the semantic components of non-gradable antonym pairs and found that the members of an antonymous pair share most of their semantic variables but differ in one. Therefore, the members of an antonym pair lie on a single semantic scale, and each member represents an end point of that scale. For example, rich and poor represent the end points of the WEALTH scale. The previous studies on semantic components consider non-gradable opposite pairs such as (woman/man) and (boy/girl) as antonyms. However, gradable antonym pairs also tend to have a semantic scale; for example, the gradable words large and small represent the end points of the SIZE scale, and more examples are shown in figure 2.6.


Figure 2.6: Examples of gradable antonym pairs and their semantic concepts

2.5 Antonymy in the WordNet

In order to examine the kinds of relations we would like to learn, we consider the WordNet database, a structured knowledge resource for English, Arabic and many other languages. In this work, we use WordNet as the source of antonymy and other lexical semantic relations found in the corpus, and we use distant supervision to label corpus word pairs with an antonymy or non-antonymy relation. This section illustrates WordNet's representation, components and statistics on antonymy coverage. WordNet is a semantic graph database specifically designed to be a machine-readable resource for Natural Language Processing and Semantic Web applications. WordNet represents every word as a graph of interlinked semantic relations; rather than merely presenting word descriptions as a dictionary does, it presents the hierarchical semantic relations of a word in a linked semantic graph. WordNet was first developed in the Cognitive Science Laboratory of Princeton University in 1985 (Fellbaum, 1995; Miller et al., 1990). The

project aimed to model English language knowledge in a cognitive graphical representation. The knowledge encoding was carried out manually by a team of lexicographers, linguists and cognitive scientists under the direction of psychology professor George Armitage Miller, followed by Christiane Fellbaum. In the following sections, we present a description of two English WordNet projects: the Princeton WordNet of English (PWN) and the Open Multilingual WordNet (OMW). We describe and compare their components, relation representation and coverage in terms of the number of antonymy relations and synsets. Overall, WordNet's representation has three basic components: word form, synset and relation, as described in the following points:

1. Word Form: WordNet represents only the base forms of words in its graph. It does not represent the inflected forms in which a word can appear in context. For example, the words happy, happier and happiest all belong to the word form happy. For the practical purpose of mapping inflected or derived surface forms to WordNet base forms, any associated query interface therefore requires a lemmatizer. For instance, the Natural Language Toolkit (Loper & Bird, 2002) provides version 3.0 of PWN and lemmatizes a query to match the base forms represented in PWN (see figure 2.7 and the sketch below).
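As a minimal illustration of this lemmatization step (assuming a local Python installation of NLTK with the wordnet data downloaded; the loop is ours, not part of NLTK), the following sketch maps inflected adjective forms to the base form stored in PWN:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Map inflected surface forms to the WordNet base form; pos="a"
# tells the lemmatizer to treat each word as an adjective.
for form in ["happy", "happier", "happiest"]:
    print(form, "->", lemmatizer.lemmatize(form, pos="a"))
# All three forms map to the base form "happy".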

2. Synset: WordNet groups roughly synonymous simplex words (e.g. eat) and collocations (e.g. eat in) into a synset, which stands for synonym set. The synsets of a word show all similar words sharing a distinct concept, one group per concept. The major semantic relations in WordNet hold between synsets, i.e. at the synset level.

3. Semantic and Lexical Relations: Relations in WordNet operate either on the synset level, when they are called semantic relations, or on the word level, when they are


Figure 2.7: The query processing to match WordNet's base forms

called lexical relations. A semantic relation holds between word meanings, such as the relations of hyponymy, hypernymy and meronymy. A lexical relation holds on the word level, between specific lemmas rather than between whole synsets. For instance, the hypernym/hyponym relation between the synsets vehicle and car is semantic and covers all synset members as a group-to-group relation, while the opposite words big and small stand in a lexical, word-to-word relation.
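The distinction can be queried directly through the NLTK interface to PWN; in the following sketch (synset names and printed outputs are indicative of PWN 3.0 as shipped with NLTK), the hypernym query operates on a synset while the antonym query operates on a lemma:

from nltk.corpus import wordnet as wn

# Semantic relation: hypernymy holds between whole synsets.
car = wn.synset("car.n.01")
print(car.hypernyms())       # e.g. [Synset('motor_vehicle.n.01')]

# Lexical relation: antonymy holds between individual lemmas.
big = wn.synset("big.a.01").lemmas()[0]
print(big.antonyms())        # e.g. [Lemma('small.a.01.small')]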

2.5.1 The Princeton WordNet of English

The first released WordNet was the Princeton WordNet of English (PWN), designed at Princeton University. PWN went through several updates and coverage extensions; the latest online

version, PWN 3.1, was released in June 2011. PWN 3.1 contains over 117,659 synsets and 206,941 word-sense pairs that have been manually encoded and organized into taxonomic hierarchies. The syntactic categories of nouns and verbs are structured into hierarchies, or taxonomies, based on the hypernym/hyponym relation between synsets. Adjectives in WordNet have a different relational structure and use a cluster representation based on antonymous pairs or triplets. An adjective cluster has one or two main head synsets and one or more satellite synsets: a head synset represents the main concept, and satellite synsets represent similar synsets. For instance, the synset that describes the concept wet has an antonym pointer to the dry synset, and it also has a similar (satellite) synset which shares a close concept, such as dewy.

2.5.2 The Open Multilingual WordNet

In 2013, the Global WordNet Association released the Open Multilingual WordNet (OMW) (Bond & Foster, 2013), which includes 150 different languages linked to the English WordNet PWN 3.0. The OMW maps different open-source WordNets to PWN 3.0 in order to maximize compatibility across languages. However, the Open Multilingual WordNet has experienced several problems in unifying the different morphological representations of languages such as Hebrew and Arabic.

Table 2.4: English WordNet statistics

WordNet       Synsets    Words      Senses
PWN 3.0       117,659    155,287    206,941
OMW-English   117,659    148,730    206,978

2.5.3 WordNet Relations

This section demonstrates other semantic and lexical relations involving adjectival and noun words in the English WordNet. Relations in WordNet take two forms: semantic relations between synsets, and lexical relations between specific words. The only relation that links words at

the word level is antonymy. Other relations, such as holonymy, hypernymy, hyponymy and meronymy, are defined at the semantic level and link groups of semantically similar words, i.e. synsets. The following sections explain the definitions of these relations and the links between them.

2.5.3.1 Antonymy Relation

A consequence of the synset structure in WordNet is that two types of antonym relation are formed: the direct antonym relation, which holds between head synsets such as (dry/wet), and the indirect antonym relation, which holds between a head synset and satellite synsets such as (dry/dewy). However, the satellite synset representation has been removed from the Multilingual WordNet project, and only head synsets can represent the antonym relation in adjective clusters. In addition, PWN 3.1 encodes 3,628 examples of the antonym relation, including 1,834 adjective antonyms, 992 noun antonyms and 477 verb antonyms (see table 2.5). Although this number of antonymy examples is not small, some frequent antonym pairs in, for example, the SkELL corpus have not been encoded. Table 2.6 shows some pair examples extracted from the SkELL corpus and their nearest semantic relations in WordNet. This suggests that WordNet is an incomplete knowledge resource, which may affect Natural Language Processing applications that rely on WordNet semantic relations.
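A lookup of this kind underlies the distant supervision labelling used later in this thesis; the sketch below (the helper name is ours, and NLTK ships PWN 3.0 rather than 3.1, so coverage may differ slightly) checks whether a candidate corpus pair is encoded as an antonym pair:

from nltk.corpus import wordnet as wn

def encoded_as_antonyms(word1, word2, pos):
    """Return True if WordNet encodes (word1/word2) as antonyms."""
    for synset in wn.synsets(word1, pos=pos):
        for lemma in synset.lemmas():
            if word2 in {ant.name() for ant in lemma.antonyms()}:
                return True
    return False

print(encoded_as_antonyms("wet", "dry", wn.ADJ))           # True
print(encoded_as_antonyms("temporal", "spatial", wn.ADJ))  # False: not encoded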

Table 2.5: Statistics of the antonymy relation in English and Arabic WordNets

WordNet      Adjective pairs   Verb pairs   Noun pairs   Total
PWN 3.1      1,834             477          992          3,628
AWN 2.0.1    0                 0            14           14
AWN 2.1.0    69                148          141          361
OMW-Arabic   192               14,230       672          15,094


Table 2.6: Examples of frequent antonym pairs in the SkELL corpus and their encoded relation in PWN 3.1 (NE: the relation is Not Encoded in PWN 3.1)

Pair in SkELL              PoS         Co-occurrence Freq   Pair in PWN
(life/death)               noun        3644                 (birth/death)
(ancient/modern)           adjective   892                  (ancient/young)
(domestic/international)   adjective   1854                 (domestic/foreign)
(temporal/spatial)         adjective   872                  (temporal/NE)

2.5.3.2 Antonymy Attributes in WordNet

As explained in section 2.4.4, antonym pairs tend to pertain to a certain coarse concept or salience. WordNet represents this through the Attribute relation, which is associated with adjectival antonym pairs. For instance, the pair (long/short) has the salient concepts DURATION and LENGTH. The Attribute relation can be defined as a semantic relation between a descriptive adjective and a noun concept; for example, the adjectives small, large and medium describe the concept SIZE. The Attribute relation resembles the hypernym/hyponym relation, as both can be semantically expressed as an is-a relation, e.g. (small is a type of SIZE) and (a car is a type of VEHICLE). However, the two relations differ in their taxonomic representation: the hypernym/hyponym relation describes the hierarchical concepts above a word, while the Attribute relation describes a word's features without a hierarchical representation. For example, the word car has the hypernym VEHICLE, but a car can be described from different perspectives such as its type, colour and size.
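In NLTK's PWN interface this relation is exposed through the attributes() method, which links adjective synsets to the noun concept they describe and vice versa; a small sketch (outputs indicative of PWN 3.0):

from nltk.corpus import wordnet as wn

# From a descriptive adjective to the scale it describes...
print(wn.synset("heavy.a.01").attributes())  # e.g. [Synset('weight.n.01')]

# ...and from a noun concept to the adjectives that describe it.
print(wn.synset("size.n.01").attributes())   # e.g. synsets for large, small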

2.5.3.3 Other semantic relations

• Hypernymy/Hyponymy: a semantic relation which holds between a parent synset and its child synsets, sometimes referred to as the is-a relation. It is the basic


relation of WordNet, creating its taxonomic structure. For example, the relation between the human and woman synsets is that a woman is a human, and human is the parent synset of woman.

• Meronymy/Holonymy: the part-to-whole and whole-to-part relationships. WordNet defines a holonym as a word that names the whole of which a given word is a part; holonymy is the inverse relation of meronymy. For example, foot is the holonym of heel, and heel is a meronym of foot. This relation holds between noun synsets. (A minimal query sketch follows.)
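A minimal query sketch for these relations in NLTK (sense numbers and printed outputs are indicative of PWN 3.0):

from nltk.corpus import wordnet as wn

foot = wn.synset("foot.n.01")       # the body-part sense
parts = foot.part_meronyms()        # parts of the foot, e.g. heel, sole, toe
print(parts)
print(parts[0].part_holonyms())     # back to [Synset('foot.n.01')]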

2.6 Antonymy Applications

The antonymy relations show associations between words of a language; for example, vocabulary can be acquired by recalling a word's opposite. In addition, one way to teach language and extend a speaker's vocabulary, especially at preschool age, is to learn word opposites. The study by Murphy & Jones (2008) investigated the relation between language acquisition and antonymy use by English-speaking preschool children. Murphy and her colleagues found that children use antonymy in their daily conversation to describe reverse meanings or deny actions, as in I am not grumpy; I am happy. Therefore, one possible application of antonymy in language acquisition is to teach new vocabulary through antonym word pairs. Another application of antonymy is to use extracted antonym pairs to extend lexical resources such as WordNet and semantic ontologies. For example, the Arabic WordNet has very low coverage in the number of synsets and semantic relations, especially antonymy relations (see table 2.5). For instance, the second version of the Arabic WordNet (Rodríguez et al., 2008) shows only 17 examples of direct antonymy. In 2013, the Open Multilingual WordNet released the latest version of the Arabic WordNet (Bond & Foster, 2013). Bond and his team at Nanyang Technological University in Singapore extended the Arabic WordNet by

translating every English synset into Arabic via Machine Translation. The output linked Arabic synsets to English synsets by assigning matching synset ids based on a constructed similarity model. However, their approach created a massive number of possible Arabic candidates for every English synset in WordNet. It also uses a different antonymy representation from the English WordNet PWN 3.1, representing antonymy relations at the synset level instead of the word level. In fact, this caused the creation of over 14,000 examples of opposite pairs with very low precision. Moreover, the Arabic antonymy examples have not yet been validated by native speakers or linguists. Our evaluation in section 7.2 concluded that Arabic antonymy relations need to be re-represented at the word level to give accurate antonymy pair examples. Since extending and representing antonymy in WordNet matters to many Natural Language Processing applications, one possible application of the antonyms extracted and classified from corpora is to add them to the Arabic WordNet and replace the incorrect antonym pairs.

2.7 Summary

As we discussed in this chapter, the literature uses different terms to refer to the opposition relationship between word pairs. Table 2.7 brings together equivalent, competing and complementary terminology and definitions, with examples. In this thesis, we use the term antonymy to describe the relation between two words that belong to the same semantic dimension but whose meanings are intuitively recognized as opposite. We take a broad sense of antonymy which includes gradable adjectives (e.g. (hot/cold)), non-gradable adjectives (e.g. (national/international)), and any direct opposite relationship between two words belonging to the same syntactic category, including adjectives, nouns and verbs. Moreover, we use the term opposite to describe two words that belong to the same semantic dimension and are known not to be antonyms but are semantically opposed. Opposite describes the contrasting relation between indirect opposite words such as (hot/cool), (internal/international) and (woman/boy). In the following

sections, we describe in detail the theoretical definitions of gradability and directionality. This chapter presented the different aspects of the antonymy relation in the literature. It defined antonymy and discussed the theoretical definitions and classifications of antonymy. It examined antonymy in different linguistic theories, including semantic component analysis, antonymy gradability, antonymy directionality and contextual opposition. In addition, this chapter presented a summary of the contrasting/opposition levels based on two relation perspectives: directionality and semantic gradability. It demonstrated some linguistic features of antonymy, such as the tendency of antonymous words to co-occur frequently in contrastive constructions. Then, the chapter discussed applications of antonymy in terms of the available antonymy resources such as WordNet, showing the current antonymy representation and coverage in the English WordNet as well as in the Open Multilingual WordNet. This chapter also concluded that a corpus is a good resource for antonym pairs and that patterns are useful for extracting antonymy. In addition, it showed that although the English WordNet has antonymy examples, some good examples of antonymy are not encoded; for instance, some antonym pairs extracted from the English SkELL corpus, such as (spatial/temporal), are not encoded in the latest version of WordNet, PWN 3.1.


Table 2.7: A summary of the relation between the antonymy terms used in this thesis and similar antonymy terms used in previous work

Term: antonymy
Definition: refers to a binary relation of contrast including gradable and non-gradable canonical antonyms
Examples: (hot/cold), (woman/man)
Similar terms: true antonymy, good antonymy, direct antonymy, canonical antonyms
Used by: Jones (2003), Turney et al. (2008) and Lobanova et al. (2010) use this term to refer to direct antonyms; Mohammad et al. (2008) use the term antonym

Term: opposite
Definition: refers to a binary relation of contrast including gradable and non-gradable (non-canonical) indirect antonyms
Examples: (woman/boy), (freezing/fiery)
Similar terms: antonyms, incompatible, contrast, indirect antonyms, non-canonical antonymy
Used by: Jones (2003), Davies (2012) and Lobanova et al. (2010) use this term to refer to opposites; Mohammad et al. (2008) use the term contrasting

Chapter 3

Antonymy Extraction Methods

The field of Information and Relation Extraction aims to automatically extract textual entities based on statistical observations or textual evidence obtained from a corpus. These observations can be modelled in several computational models, such as the vector representation model. The vector space model takes every instance in the training set and represents it as a sequence of numerical values called a vector; each value in the vector corresponds to a single feature and occupies a single dimension. Recently, different vector representation models have been proposed to tackle the issue of modelling semantic relations. In addition, many studies use different approaches to capture semantic relations via corpus-based or machine learning methods (Bach & Badaskar, 2007), and some use a hybrid approach which combines them, for example using a corpus to obtain relation features that are provided to a machine learning model. In this chapter, we review vector space models and relation extraction approaches and present their advantages and limitations in capturing antonymy.

3.1 Vector Space Models

A vector space model is a computational model which represents words as numerical sequences, i.e. vectors in an n-dimensional space. Since the

early 1970s, several studies have proposed different approaches to representing documents and words in the vector space model (Salton, 1971). The underlying assumption of the vector space model follows the distributional hypotheses of Harris (1954) and Firth (1957), expressed in the following quotes:

“Words that occur in similar contexts tend to have similar meanings”

Harris (1954)

“You shall know a word by the company it keeps”

Firth (1957)

The vector representation takes every word in the document collection (the vocabulary set) and stores the co-occurrence frequency between every pair of words in the set. For any sizeable collection and vocabulary, this generates a huge, sparse, high-dimensional vector space; word pairs that never co-occur in the given documents cause the problem of data sparseness. In addition, the raw vector space model conveys very little information about semantic relations between vocabulary words.
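A minimal sketch of this kind of co-occurrence counting over toy data (the sentence-level window is one possible choice; real systems vary the window size):

from collections import Counter
from itertools import combinations

sentences = [
    ["hot", "and", "cold", "water"],
    ["from", "hot", "to", "cold"],
]

# Count how often each unordered word pair co-occurs in a sentence.
cooc = Counter()
for sent in sentences:
    for w1, w2 in combinations(sorted(set(sent)), 2):
        cooc[(w1, w2)] += 1

print(cooc[("cold", "hot")])  # 2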

3.1.1 Bag-of-Words

Building on the vector space model, the Bag-of-Words (BOW) approach aims to mitigate its limitations. The model represents the co-occurrence frequency between vocabulary words and normalizes the raw frequencies using scales such as Term Frequency / Inverse Document Frequency (TF-IDF), which maps raw frequencies into a fixed range [0,1]. In addition, dimensionality reduction techniques such as Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) have been proposed to reduce the huge number of dimensions to a manageable size and to control outliers (Turney & Pantel, 2010).
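For illustration, the following sketch builds TF-IDF vectors and reduces their dimensionality with truncated SVD using scikit-learn (the library choice and the tiny document collection are ours for the example, not part of the reviewed work):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the water is hot",
    "the water is cold",
    "hot and cold drinks",
]

# TF-IDF weighting of a small document collection.
tfidf = TfidfVectorizer().fit_transform(docs)

# Reduce the sparse TF-IDF space to two latent dimensions.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)
print(reduced.shape)  # (3, 2)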


Yih et al. (2012) proposed a polarity vector representation for modelling antonymy and synonymy pairs, which places antonymous and synonymous vectors on opposite sides of the vector space. The model uses vector cosine similarity to compute the distance between two words: a cosine score close to +1 indicates a synonymy relation, while a score close to −1 indicates an antonymy relation between the words.
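Cosine similarity itself is simple to compute; in the toy sketch below the two-dimensional polarity vectors are invented for illustration and are not taken from Yih et al. (2012):

import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

hot  = np.array([ 0.9,  0.4])
cold = np.array([-0.9, -0.4])   # toy polarity: exactly opposite direction
warm = np.array([ 0.8,  0.5])

print(cosine(hot, cold))  # -1.0 -> antonym-like
print(cosine(hot, warm))  # ~0.99 -> synonym-like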

3.1.2 Word Embeddings

The Word Embeddings approach was designed to tackle the BOW model's problems in representing syntactic and semantic relations between words. The approach uses a neural network to carry out feature extraction and feature selection within its layers, and it represents every word in the corpus in a low-dimensional vector space, considering only words occurring in similar contexts within a selected window size, such as three or five words. The aim of Word Embeddings is to capture semantic relations between word vectors through an unsupervised approach over a very large unlabelled corpus. Despite the good performance of Word Embeddings in text classification, the approach captures word associations such as (car/wheel) but fails to distinguish between similar and opposite words (Schwartz et al., 2015; Yih et al., 2012). Schwartz et al. (2015) evaluated this limitation and found that the approach is unable to distinguish between antonymy and synonymy relations. They aimed to improve word similarity using Word Embeddings and the cosine similarity model: their proposed model used symmetric patterns such as (X and Y, Y and X) to identify similar words and distinguish them from opposite words. The model represents every word as a single vector of vocabulary size N, in which each value corresponds to the total co-occurrence frequency over a set of manually extracted patterns. Schwartz used a set of nine manually extracted

symmetric patterns to build feature vectors and capture word similarity. These patterns are (X and Y), (X or Y), (X and the Y), (X or the Y), (X as well as Y), (X or a Y), (X rather than Y), (X not Y) and (X and one Y). Moreover, the model used the incompatible patterns of Lin et al. (2003), (from X to Y) and (either X or Y), to filter antonymous words out of a word's similarity list. However, Lin's patterns can only capture high-frequency canonical antonym pairs and fail to capture non-canonical or low-frequency antonym pairs. To investigate this problem, we use the Word Embeddings viewer by Sketch Engine (https://embeddings.sketchengine.co.uk) to evaluate antonymy relations in the English TenTen corpus (Jakubíček et al., 2013) and compare their similarity scores to those of synonymy pairs. Sketch Engine uses the cosine similarity between two vectors to measure their similarity distance; a high similarity score indicates similar vectors. Figures 3.1 and 3.2 show that an antonym appears with the highest similarity to the input word. For example, the similarity score between the embedding vectors of the pair (increase/decrease) is 0.91, which indicates that the words are rated as very similar.

The same test on the corresponding Arabic antonymy pair showed that both words are rated as very similar (0.80 similarity score).
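The same effect can be reproduced offline with any pre-trained embedding model; a sketch using gensim's downloader (the model name is one of gensim's published pre-trained sets, and the exact scores will differ from Sketch Engine's):

import gensim.downloader as api

# Load a small pre-trained GloVe model (downloaded on first use).
model = api.load("glove-wiki-gigaword-50")

# Antonyms score about as high as near-synonyms.
print(model.similarity("increase", "decrease"))  # high
print(model.similarity("hot", "cold"))           # high
print(model.similarity("hot", "warm"))           # also high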


Figure 3.1: Detecting the antonymy relation in the English TenTen corpus via the Word Embedding approach (Word Embeddings Viewer by Sketch Engine)

Figure 3.2: Detecting the Arabic antonymy relation in the Arabic TenTen corpus via the Word Embedding approach (Word Embeddings Viewer by Sketch Engine)

3.2 Relation Extraction Approaches

There have been several attempts to automate the discovery of different lexical semantic relations, such as hypernymy/hyponymy, meronymy

and antonymy, for several languages including English, Arabic and Dutch (Aldhubayi & Alyahya, 2014; Alyahya et al., 2016; Lin et al., 2003; Mohammad et al., 2008; Ono et al., 2015; Santus et al., 2014b; Turney, 2008). In this section, we describe approaches to extracting and identifying semantic relations in general and antonymy in particular. Sections 3.2.1-3.2.4 cover four approaches to relation extraction: the distributional method, the pattern-based method, machine learning and deep learning.

3.2.1 The Distributional Approach

A common way to find synonymy pairs in a collection of text is to use a distributional approach, which extracts words as near-synonyms if they occur in similar contexts, i.e. share a similar context distribution (Harris, 1954). One major problem with the distributional approach, however, is that it also finds antonyms and other semantically opposite words. Mohammad et al. (2008) used the fact that antonymous pairs are distributionally similar to filter out non-antonymous pairs generated by an English thesaurus. Their study proposed a distributional model to measure the association between opposite words by capturing the words' co-occurrence frequencies and computing their Pointwise Mutual Information (PMI). Mohammad and his colleagues aimed not only to capture antonymy but also to compute the degree of contrast and find the most suitable pairs among a set of contrasting candidates. The degree of contrast is related to the opposition level between word pairs: the relation can be either canonical or non-canonical antonymy, as described in chapter 2.
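PMI compares a pair's observed co-occurrence with what independence would predict; a minimal sketch from raw corpus counts (the counts are invented, whereas the cited work derives them from large corpora):

import math

def pmi(pair_count, count_x, count_y, total):
    """Pointwise Mutual Information from raw corpus counts."""
    p_xy = pair_count / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Toy counts: the pair co-occurs far more often than chance predicts.
print(pmi(pair_count=50, count_x=1000, count_y=800, total=1_000_000))  # ~6.0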

The work by Santus et al. (2014b) is based on the assumption that antonymy has a semantic scale, as described in section 2.4.4. Santus and his colleagues proposed an unsupervised model to discriminate antonymy from synonymy using Distributional Semantic Models (DSMs). Their DSM model assumes that opposite words are similar in their semantic components but differ in only one; the model therefore analyses the descriptive contexts of the target words and estimates their extent and salience, expecting one dimension to differ, i.e. the two words to occur in at least one distinct context. For example, the words dwarf and giant are the extremes of the salient dimension size. The model extracts the top N relevant contexts which contain either one of the words (dwarf/giant) or other words describing the salient dimension, such as big, small, large and little. It then finds the intersection between the extracted contexts: a larger intersection between the words' contexts indicates that they are similar, while a narrower intersection indicates that the words are antonyms.

3.2.1.1 Antonymy in Sketch Engine Automatic Thesaurus

Another example of a distributional model is the Automatic Thesaurus tool in Sketch Engine. The tool automatically generates similar words if they occur in similar textual distributions over the whole corpus, and we conducted an experiment using it to examine how well a distributional model finds opposite pairs. The tool captures the occurrence frequencies of every word in the corpus based on certain predefined grammatical relation patterns; an example of such a pattern is the relation between a noun and an adjective (adjective modifies noun), as in hot drinks. For our experiment, we use a set of antonymous pairs from different syntactic classes including verbs, nouns and adjectives. Sketch Engine uses the logDice metric to compute the similarity between two BOW vectors, where a vector represents the co-occurrence relation between word pairs in the corpus. logDice (Rychlý & Kilgarriff, 2007) computes the logarithm of the Dice coefficient of the co-occurrence frequencies of the triple of word X, grammatical pattern R and word Y, as in equation 3.1, where |X,R,Y| represents the total number of co-occurrences between X and Y in the selected pattern R, |X,R,∗| represents the total number of co-occurrences between word X and any other word in the corpus sharing the relation pattern R, and


|∗,R,Y| represents the corresponding total number of co-occurrences between word Y and any other word in the corpus sharing the relation pattern R.

\mathrm{logDice}(X,Y) = 14 + \log_2 \frac{2\,|X,R,Y|}{|X,R,\ast| + |\ast,R,Y|} \qquad (3.1)

Table 3.1 shows the Sketch Engine results for generating similar words, with the logDice scores of antonymous pairs from different syntactic classes including nouns, verbs and adjectives. Some antonymous pairs have very high similarity scores while others have low or zero scores. For example, the adjective pair (hot/cold) has the highest similarity score in the SkELL corpus, as figure 3.3 shows, while the verb pair (like/dislike) has a zero similarity score among the top 200 similar words: like and dislike appear in different contexts and modify different nouns, so the similarity score between them is very low. In addition, comparing antonymous adjective and noun pairs shows that adjective pairs have higher similarity scores. Therefore, using distributional models to capture antonyms, or to distinguish antonyms from synonyms, is not practical for every syntactic class or every word pair.

Table 3.1: The logDice similarity scores in the Sketch Engine Automatic Thesaurus tool

X           Y             POS         logDice Score   Rank
temporary   permanent     Adjective   0.582           1
wet         dry           Adjective   0.502           1
decrease    increase      Verb        0.668           1
like        dislike       Verb        None            None
join        leave         Verb        0.584           3
joy         sorrow        Noun        0.329           22
stability   instability   Noun        0.287           17
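Equation 3.1 is straightforward to implement from raw triple counts; a sketch with invented frequencies:

import math

def log_dice(f_xry, f_xr_any, f_any_ry):
    """logDice (Rychly & Kilgarriff, 2007) from triple frequencies.

    f_xry:    frequency of (X, R, Y)
    f_xr_any: frequency of (X, R, *)
    f_any_ry: frequency of (*, R, Y)
    """
    return 14 + math.log2(2 * f_xry / (f_xr_any + f_any_ry))

# Toy counts for a triple such as (hot, modifies, drink).
print(log_dice(f_xry=120, f_xr_any=5000, f_any_ry=3000))  # ~8.9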


Figure 3.3: The semantic distribution of the word hot in the SkELL corpus

3.2.1.2 Antonymy in Sketch Engine Difference Tool

The Word Sketch Difference tool in Sketch Engine compares two words by analysing their syntactic collocations. Sketch Engine adopts the definition of syntax-based collocations by Seretan (2011), which defines a collocation as a sequence of syntactic tags such as (verb adjective noun). Pearce (2008) used the Word Sketch Difference tool to study the distributional behaviour of the two lemmas man and woman: he examined their representation in the BNC by analysing their syntactic collocations and other grammatical patterns, and the results showed the different functional distributions of the target lemmas and the relation between context and gender stereotypes. The Word Sketch Difference tool summarises the co-occurrence frequencies of two words and visualises their distributional similarities and differences.

The tool condenses what would otherwise require reading thousands of concordance lines, displaying shared collocations in white and dissimilarities in red and green, the latter two colours referring to either member of the pair. We use the Word Sketch Difference tool to provide a visual representation of several predefined grammatical relations, such as (words with property (hot/cold)), for the adjectival antonymous pair (hot/cold) in the SkELL corpus. Figure 3.4 summarises the comparison between collocation patterns and shows that the majority are in white, meaning they are common to both words, i.e. there is a large intersection between the target words' contexts. The figure shows three main blocks, each referring to a specific grammatical relation: (words with property (hot/cold)), (modifiers of (hot/cold)) and (nouns modified by (hot/cold)). The first column in each block gives the collocated word, while the second and third give the collocation co-occurrence frequencies for hot and cold respectively. The figure shows that both words are used as properties of the words climate, weather and temperature; for example, the corpus contains concordances with hot and climate or cold and climate, such as the climate is normally hot in summer and in the winter, the climate is rigorously cold. Both words also modify the same nouns, as in hot water, cold water, hot air, cold air. The comparison in figure 3.4 shows that the two words share many common collocations, and we can therefore conclude that they tend to have similar semantic distributions.

3.2.2 Pattern-Based Extraction Approach

Section 2.4.3 showed that antonymous pairs tend to co-occur in certain contrastive patterns. Therefore, one way to generate antonymy pairs is to use textual or lexico-syntactic patterns. The pattern-based method is a well-known corpus-based approach: it aims to find a sequence of words, of variable length, containing both target words of a pair. A pattern can take different forms, such as a sequence of words, syntactic tags, or dependency


Figure 3.4: An example of the collocation differences between the pair (hot/cold) in the Word Sketch Difference tool by Sketch Engine

relation paths. There is no typical pattern form or length; a good pattern structure should capture the target relation as it occurs frequently in the selected corpus. The earliest work on the pattern-based extraction method was by Hearst (1992), who proposed that certain lexico-syntactic patterns can automatically identify the hypernym/hyponym (is-a) relation between noun phrases. Her work proposed a set of manually extracted textual patterns that can be used to generate examples of hypernym/hyponym noun pairs. However, automating pattern acquisition can save time and reduce the cost of finding textual patterns and pairs, and the pattern-based approach can be extended to other semantic relations and other languages. Subsequently, several studies have followed Hearst's

assumption, adapting her approach to extract other semantic relations including antonymy.
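A minimal sketch of this style of pattern-based pair extraction over raw text, using a regular expression for the coordinated pattern (both X and Y) (the single-word \w+ wildcards are a simplifying assumption):

import re

PATTERN = re.compile(r"\bboth (\w+) and (\w+)\b", re.IGNORECASE)

text = ("The method works for both hot and cold climates, "
        "and it serves both public and private institutions.")

# Each match yields a candidate (X, Y) pair for later classification.
pairs = PATTERN.findall(text)
print(pairs)  # [('hot', 'cold'), ('public', 'private')]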

3.2.2.1 Jones Antonymy Patterns

The study by Jones (2003) is dedicated to classifying the functions of antonymy on a large scale. He investigated 56 pairs of adjectival antonyms in a 280-million-word newspaper corpus. His investigation involved analysing 3,000 English sentences containing adjectival antonym pairs and classifying their discourse according to the grammatical relationship between the antonymous pairs and the meaning they convey. Jones's study presents a taxonomy of eight antonymy classes, as shown in table 3.2. The major antonymy categories, which together comprise 77.1% of the dataset, were ancillary antonymy and coordinated antonymy. The other classes, including comparative antonymy, distinguished antonymy, transitional antonymy, negated antonymy and extreme antonymy, are minor categories, as they occur at much lower frequencies in his corpus. The last category is idiomatic antonymy, such as (easy come, easy go), which occurs at low frequency and with certain types of antonyms. Jones proposed ancillary antonymy as a new class of antonymy which occurs when a contrasting word pair triggers an opposite relation between other co-occurring words in the sentence. For example, the sentence (I love eating but I hate cooking) contains the opposite pair (love/hate) and raises a semantic opposition between the pair (eating/cooking): the context places the words eating and cooking in an opposite relation in that sentence.


Table 3.2: Jones's antonym pattern categories

Pattern Classification   Example                                   Corpus Rate
Ancillary antonymy       I love travelling but I hate arriving     38.7%
Coordinated antonymy     both X and Y; either X or Y;              38.4%
                         neither X nor Y; X as well as Y;
                         whether X or Y
Comparative antonymy     more X than Y; X is more ADJ than Y;      6.8%
                         X rather than Y
Distinguished antonymy   the difference between X and Y;           5.4%
                         separating X and Y; a gap between X and Y
Transitional antonymy    from X to Y; turning X into Y;            3.0%
                         X gives way to Y
Negated antonymy         X not Y; X instead of Y;                  2.1%
                         X as opposed to Y
Extreme antonymy         the very X and the very Y;                1.3%
                         either too X or too Y;
                         deeply X and deeply Y
Idiomatic antonymy       easy come, easy go;                       0.8%
                         penny wise and pound foolish

3.2.2.2 Davies Opposite Patterns

Following Jones's approach, the work by Davies (2012) identifies contrastive structures that trigger opposition. Davies manually extracts contrastive frames between non-opposite words or collocations. He concludes that some particular contrastive structures can make non-opposite word pairs semantically opposite, at least in some contexts. For

example, out of context there is no opposition between (power/freedom), but sentence (A) triggers an opposite relation between the pair, as it contains a comparative construction (X more than Y) which creates this relation.

(A) If anything is more pernicious than the abuse of power, it is the abuse of freedom by those in power, and racing has been an unwitting victim of such a monstrous assault in recent weeks - from within.

Davies (2012), page 556

The opposition taxonomy by Davies (2012) categorizes the extracted patterns into eight classes (see table 3.3). Some of the pattern classes also appear in Jones's antonymy taxonomy, such as the negated, transitional and comparative patterns, but they have a different function in Davies's study: they trigger opposite pairs rather than host antonyms. In addition, Davies's taxonomy contains some new pattern classes, such as the explicit pattern, as well as classes found in Jones's taxonomy under different names; for example, Jones's coordinated pattern class corresponds to Davies's concessive patterns. Overall, both Jones and Davies aim to extract textual evidence of the presence of antonymy or opposition relations, but Jones's purpose is to classify English contrastive patterns which host antonymy, while Davies classifies English contrastive patterns which trigger opposition relationships in context.


Table 3.3: Davies's opposite pattern categories

Pattern Classification    Example
Negated Opposition        X not Y; not X, Y
Transitional Opposition   X turns into Y; X becomes Y
Comparative Opposition    more X than Y; X is more A than Y
Replacive Opposition      X rather than Y; X instead of Y; X in place of Y
Concessive Opposition     X but Y; despite X, Y; while X, Y; although X, Y; X, yet Y
Explicit Opposition       X contrasted with Y; X opposed to Y; the difference between X and Y; X against Y
Parallelism               no specific frame
Binarized option          whether X or Y; either X or Y

3.2.2.3 Antonymy Pattern-Based Models

The previous studies manually investigated and classified contrastive structures in both English and Arabic. Other studies have proposed computational models to extract contrastive patterns in order to generate examples of antonymy relations for different purposes. For instance, the work by Lobanova et al. (2010) was inspired by the pattern-based approach of Hearst (1992); the extraction process follows a bootstrapping learning method to find the best antonym patterns in a Dutch corpus. Her

experiments concluded that the majority of the extracted antonym pairs are not listed in the Dutch WordNet or any other online Dutch dictionary. On the other hand, the word pairs generated by the antonymy patterns in Lobanova et al. (2010) belong to several semantic relations, including antonyms, co-hyponyms, incompatibles and synonyms; she therefore used five native speakers to manually evaluate the extracted pairs and remove non-antonym pairs from the final output. Another study, among the best-known extraction experiments, is Turney (2008), which proposes a unified algorithm to distinguish four semantic relations, namely analogy, synonymy, antonymy and association, within a supervised machine learning model. The results show an accuracy of 75% in distinguishing antonyms from synonyms, but due to the limited size of the research corpus, the performance on the other relations was below expectation. Turney's (2008) experiment shows that the major influence on relation classification performance is corpus size: a large corpus can generate sufficient examples of the target relations. Wang et al. (2010) adopted an automatic pattern extraction model that aimed to distinguish antonymous from synonymous verb pairs. Their approach aimed to maximize antonymy and synonymy recall, but they concluded that distinguishing antonymous from synonymous verb pairs is complicated, as both show similar context distributions over the extracted patterns. Moreover, Batita & Zrigui (2017) use a pattern-based approach to enrich antonymy relations in the Arabic WordNet; their approach uses the logDice metric to distinguish between antonym and non-antonym pairs generated by a set of patterns. However, they found that logDice is an insufficient measure, as antonymy and synonymy yield similar logDice values. The work by Mohammad et al. (2008) proposes a computational model to capture contrasting words and measure their degree of contrast. The work investigates three extraction methodologies: affix-pattern-based, lexicon-based and thesaurus-based extraction. It aims to map the antonym relation to the theoretical classification of antonymy, such as the

relations of antonymy, opposition and incompatibility, by measuring the distance between thesaurus categories.

Table 3.4: The affix patterns proposed by Mohammad et al. (2008) to generate opposites

Pattern        Example
X-antiX        social-antisocial
X-disX         able-disable
X-imX          possible-impossible
X-inX          adequate-inadequate
X-malX         adroit-maladroit
X-misX         understand-misunderstand
X-nonX         standard-nonstandard
X-unX          biased-unbiased
lX-illX        legal-illegal
rX-irrX        regular-irregular
imX-exX        implicit-explicit
inX-exX        internal-external
upX-downX      uphill-downhill
overX-underX   overdone-underdone
Xless-Xful     careless-careful

Lin et al. (2003) aim to distinguish synonymy from other semantically related pairs, including antonymy, using the two textual patterns (from X to Y) and (either X or Y). Lin and his colleagues propose that if a word pair such as (hot/cold) appears in one of these incompatibility patterns, then the pair is unlikely to stand in a synonymy relationship. We evaluated the patterns of Lin et al. (2003) by capturing their frequencies and computing their classification precision and recall scores in the English SkELL corpus. As table 3.5 shows, we evaluated the 1,000 most frequent extracted pairs and found that only 72% of the high-frequency word pairs generated by the pattern (either X or Y), and 30% of those generated by the pattern (from X to Y), are antonyms. Therefore, we conclude that

using Lin's patterns to distinguish synonyms from antonyms does not generalize to every example of antonymy. The pair extraction results showed that Lin's patterns can work with frequent antonym pairs such as (hot/cold), but low-frequency word pairs such as (real/potential) cannot be distinguished by them. Therefore, extracting more antonymy patterns, capturing both high- and low-frequency word pairs, is vital for better classification accuracy.

Table 3.5: The evaluation of the patterns of Lin et al. (2003) on the top 1,000 pairs obtained from the SkELL corpus

Pattern         Occurrence Frequency   Antonymy Precision   Antonymy Recall
either X or Y   10,687                 0.72                 0.61
from X to Y     17,779                 0.30                 0.35

3.2.2.4 Pattern Types

Previous studies have used four types of patterns to extract semantic relations from text: textual patterns, affix patterns, lexico-syntactic patterns and dependency patterns, described in detail in the following points:

1. Textual Patterns: A textual pattern is a sequence of words with empty wildcards at the pair's positions. An example is (from X to Y), which places no syntactic restrictions on the extracted pairs: the pattern can retrieve any pairs (e.g. adjective, verb or noun pairs) matching the textual sequence. An advantage of not specifying the PoS is that such patterns can be used in corpora whose PoS tagging is inaccurate or missing. However, textual patterns can also extract cross-categorical pairs, i.e. pairs whose members belong to different PoS categories, as in (from hot to colder). According to Fellbaum (1995), although there is an opposition relation between cross-categorical pairs such as (hot/colder), they cannot be considered examples of canonical antonymy. Therefore,


specifying PoS tags avoids the appearance of cross-categorical antonym pairs. Lexical resources such as WordNet represent only antonym pairs with an identical syntactic category.

2. Affix Patterns: One of the extraction methods proposed by Mohammad et al. (2008) uses affix patterns to extract pairs of English opposites; the authors used 15 affix patterns to find English opposite candidates, as shown in table 3.4. In addition, Lyons (1978) states that affix patterns which generate pairs of opposites are observed in most languages. However, affix patterns that generate antonyms are limited in Arabic: the negative word is a complete syntactic unit placed directly before the word it negates. Table 3.6 shows a set of manually extracted negation patterns that can be used to find negated pairs, but not antonyms.

Table 3.6: Arabic negation patterns used to generate opposites (NEG = a negation unit such as not in English; each row uses a different Arabic negation unit, but the Arabic script did not survive text extraction, so only the English glosses are reproduced here)

Negation Pattern   English Example
NEG + X            logical - NEG logical
NEG + X            limited - NEG limited
NEG + X            short - NEG short
NEG + X            right - NEG right

3. Lexico-Syntactic Patterns: A lexico-syntactic pattern consists of a combination of tokens and PoS tags for the pair. For example, the textual pattern (from X to Y) can be translated into the equivalent lexico-syntactic pattern (from ADJ to ADJ). Jones (2003) proposes patterns that capture adjectival antonyms only; their performance is good at harvesting adjective pairs, producing higher co-occurrence frequencies


compared to other syntactic classes. We tested a set of antonym pairs of adjectives, nouns and verbs: table 3.7 shows that replacing the adjective wildcards with noun or verb wildcards in the pattern (both X and Y) produces low co-occurrence frequencies compared to the adjective performance, which suggests that there are patterns performing better for verbs and nouns that have not been extracted yet.

Table 3.7: The performance of the lexico-syntactic pattern (both X and Y) (J = adjective, N = noun, V = verb)

Relation   Pair                    PoS   Co-occurrence Frequency
Antonym    temporary-permanent     J     39
Antonym    wet-dry                 J     61
Antonym    stability-instability   N     1
Antonym    joy-sorrow              N     8
Antonym    join-leave              V     1
Antonym    decrease-increase       V     11

4. Dependency Patterns: A dependency pattern is a sequence of dependency paths showing the syntactic relationships between a pattern's tokens. Snow et al. (2005) use dependency patterns to harvest hypernym/hyponym pairs from the English Wikipedia corpus; their approach extracted over 64,000 textual patterns and parsed them into sets of dependency paths using the MINIPAR parser. In addition, Lobanova & Bouma (2010) compared the performance of dependency patterns, lexico-syntactic patterns and textual patterns and found that dependency patterns outperform the other pattern types in harvesting antonyms.

3.2.3 Machine Learning-Based Methods

Previous research has focused on using machine learning algorithms to learn semantic relations from text. These approaches can be classified in terms of

the labelling of the training set as supervised, unsupervised or semi-supervised.

3.2.3.1 Supervised Learning Method

In a supervised machine learning approach, the training set must be labelled with one or more target relations, either as a binary class (negative/positive) or with multi-class labels; the problem can thus be described as a binary or multi-class classification task. The classifier is trained on a set of negative and positive examples of specific semantic relations, using extracted features such as bigrams, dependency paths and relations. This approach offers high extraction performance but requires a labelled set of positive and negative instances. For example, Zelenko et al. (2003) propose a supervised learning approach to extract and classify named entities; their model combines kernel methods with SVM classifiers (Joachims, 1998). In terms of recognising antonymy, the work by Turney (2011) uses 8,000 textual patterns to classify four semantic relations: synonymy, antonymy, analogy and association. Generating these 8,000 patterns involved a morphological analyzer and generator to capture textual patterns covering all possible morphological variations (hot, hotter, hottest, etc.); by contrast, Sketch Engine provides lemmatized corpora, which allows all word forms derived from a lemma to be retrieved directly. Turney's model generated feature vectors containing the logarithm of each pattern-pair co-occurrence frequency, and the vectors were normalized to control outlier frequencies. The normalized feature vectors were then provided to a standard supervised learning algorithm, Sequential Minimal Optimization (SMO), in a model called PairClass. The classification attains a mean accuracy of 75% in distinguishing synonymous and antonymous pairs.
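In outline, this family of supervised pair classifiers reduces to fitting a standard classifier on pattern-frequency feature vectors; a toy sketch with scikit-learn, where a linear SVM stands in for SMO and all features and labels are invented:

import numpy as np
from sklearn.svm import LinearSVC

# Rows: word pairs; columns: log(1 + co-occurrence frequency) with each
# pattern, e.g. [both X and Y, either X or Y, from X to Y].
X = np.log1p(np.array([
    [39.0, 25.0, 60.0],   # hot/cold   -> antonym
    [61.0, 30.0, 42.0],   # wet/dry    -> antonym
    [ 2.0,  0.0,  1.0],   # big/large  -> synonym
    [ 1.0,  1.0,  0.0],   # happy/glad -> synonym
]))
y = np.array([1, 1, 0, 0])  # 1 = antonym, 0 = non-antonym

clf = LinearSVC(C=1.0).fit(X, y)
print(clf.predict(np.log1p(np.array([[45.0, 20.0, 50.0]]))))  # likely [1]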


3.2.3.2 Unsupervised Learning Method

The unsupervised machine learning method can be described as a clustering problem: a clustering method takes an unlabelled training set and groups similar examples into clusters based on shared features. An example is the TextRunner model (Yates et al., 2007), which extracts noun tuples from a corpus of 9 million Web pages in order to identify hypernym/hyponym relations based on certain textual patterns, achieving 80% precision. Santus et al. (2014a,b) proposed an unsupervised model to discriminate antonyms from synonyms using distributional observations of semantically salient contexts. Their model estimates the extent and salience for every input word; for example, the word hot has salient words such as warm and boiling. The model extracts the co-occurrence frequency of every two words and then clusters the co-occurrence observations into a group of antonyms and a group of synonyms. The model achieved 73% precision in discriminating antonyms from synonyms, since synonyms share a significantly higher number of salient contexts than antonyms do. Overall, although the unsupervised method needs no labels, a major limitation is that it requires very large amounts of antonymy and non-antonymy examples to extract sufficient observations of the target relations. Moreover, unsupervised capture of antonymy relies on corpus distribution observations between pairs of words, which again fail to distinguish antonymy from synonymy. Capturing a word's salient contexts can give better performance in distinguishing antonymy from synonymy (Santus et al., 2014a), but extending this approach to distinguish antonymy from other relations, such as hypernymy/hyponymy and incompatibility, has not yet been examined.


3.2.3.3 Semi-Supervised Learning Method

Semi-supervised and bootstrapping approaches require only small sets of seed instances, or a few hand-crafted patterns for specific relations, to start the extraction process. An example is the semantic taxonomy induction proposed by Snow et al. (2005): a novel algorithm based on distant supervision (Mintz et al., 2009) that optimizes the sense disambiguation of extracted hyponym and hypernym instances. The method resolves the problem of correctly assigning word senses using a probabilistic framework; the results show a successful extension of the English WordNet 2.1 with 10,000 new synsets, at a precision of 84%. The BREDS tool developed by Batista et al. (2015) presents a novel bootstrapping algorithm built on word embeddings; the algorithm recognises named entities using word embedding patterns. In comparison to word embeddings, the TF-IDF representation has limitations in relating synonymous expressions of a relation, such as X IS CO-FOUNDER OF Y and Y IS FOUNDED BY X. Although stemming strategies can help retrieve root variations, standard bootstrapping methods cannot resolve such similarity problems, e.g. the occurrence of a synonymous phrasing as in X IS THE CREATOR OF Y. The bootstrapping-plus-word-embedding approach shows better precision and recall than TF-IDF (up to 98%), despite the occurrence of some incorrect relation patterns and instances. Aldhubayi & Alyahya (2014) examined a bootstrapping approach to generating antonymy patterns: the model uses a set of Arabic antonymy seeds to extract patterns from several corpora in a semi-supervised manner. Although the extraction was intended to be automatic, the bootstrapping process suffered from a serious drawback: over several iterations of extracting both antonymy seeds and patterns, the output yielded many incorrect Arabic antonymy seeds, which badly affected the results. Aldhubayi and Alyahya suggest using a pattern selection approach to control the semantic

drift caused by incorrect patterns generated from non-antonymy seeds. Moreover, the work by Ono et al. (2015) captured antonym relations and addressed the limitations of the pattern-based approach by using the pairs' contexts as learning features. The technique relies on prediction rather than counting co-occurrence frequencies to detect the semantic relation; distributional semantic models that use frequency counting fail to capture low-frequency antonym pairs.

3.2.4 Deep Learning for Semantic Relation Extraction

In recent years, deep learning has become the most successful approach to modelling Artificial Intelligence problems (LeCun et al., 2015), showing promising results in vision and in image classification and prediction. It requires a huge number of examples to be fed into a cascade of multiple layers of neural perceptrons: a deep model is composed of a large number of neural network layers, corresponding to different levels of hidden abstraction, connected to one input layer and one output layer. Using deep learning on natural language text requires representing every instance or relation pair with Word Embeddings (Goldberg & Levy, 2014; Mikolov et al., 2013); the output of a Word Embedding layer is a sequence of numerical vectors which is fed into a conventional neural network model. In addition, deep learning can take several supervision forms, such as supervised learning on a labelled dataset or unsupervised learning on an unlabelled dataset. There are many studies applying deep learning to the relation extraction problem, but few to antonymy extraction (Nguyen & Grishman, 2015; Socher et al., 2012; Xu et al., 2015). The main limitation of applying deep learning to Relation Extraction is that it requires examples showing the target relation in its contexts, together with features representing that relation. Moreover, distant supervision solves the problem of labelling dataset examples by using an external

knowledge base to label examples obtained from a corpus or any other textual resource. The knowledge base may be insufficient for labelling, which limits labelling performance compared to manual labelling (Mintz et al., 2009). For example, WordNet generates thousands of false negative antonymy examples, e.g. (hardware/software) and (input/output): because these pairs do not exist in WordNet, they are labelled as non-antonyms. Overall, applying deep learning to a Relation Extraction problem such as antonymy requires enough examples of antonym pairs and their contexts, extracted from a textual resource (e.g. a corpus). Failing to provide a neural network with sufficient data results in low classification performance. According to Dean et al. (2012), the training set size constrains the size of the neural network and thus the classification performance, and Banko & Brill (2001) demonstrated a relationship between data scale and model performance which indicates that machine learning algorithms perform better with large amounts of training examples. In short, the deep learning architecture aims to build a large neural network that extracts and abstracts features from the data without explicit feature extraction or feature selection steps.

3.2.5 Other Extraction Approaches

A dictionary-based relation extraction approach depends on a language dictionary. An example of this approach is OntoArab-Maker by Benaissa et al. (2015). The tool presents a new method of extracting Arabic synonymous verbs using the Arabic monolingual dictionary Almaany1. The constructed ontology follows the WordNet representation by grouping related or similar verbs into synsets. The extraction was semi-automatic and required validation by linguists. The procedure involved generating a graph of verbs using dictionary entries. The generated graph shows several retrieved verb synsets (paths) for each given verb. The graph path selection was determined by the Markov Cluster Algorithm MCL (Dongen, 2000).

1http://www.almaany.com/


The model yielded verb synsets, 58% of which contained only related verbs, although some irrelevant verbs also appeared in the results. In addition, a hybrid method which merges two or more relation extraction methods can also produce better extraction and overcome the limitations of the bootstrapping method. Sun & Grishman (2010) demonstrate a novel bootstrapping method for extracting NEs from a corpus based on two merged approaches: a semi-supervised and an unsupervised approach. The proposed algorithm uses distributional similarities to cluster similar patterns into groups which are used in the following extraction iterations. The procedure is applied to every pattern extracted from the corpus. The results show that the hybrid algorithm outperforms the standard bootstrapping method in terms of controlling the semantic drift of the generated patterns and increasing accuracy. Another hybrid approach was adopted by Lahbib et al. (2013), which consists of two extraction approaches: a co-occurrence-based and a syntactic-dependency-based approach. The main goal is to extend the coverage of the signature sets in the Arabic WordNet. A signature set is a group of related or similar words that are associated in a text. A semantic analyser was used to measure the similarities between the terms extracted from the corpus and the signature sets encoded in the WordNet.

3.3 Discussion

We presented a review of several representation models, including the vector space, BOW and Word Embedding representations. The review also shed light on their limitations in capturing antonymy. For example, the Word Embedding model cannot discriminate antonyms from synonyms, as they share similar distributions. Thus, a new vector space model is needed to capture antonymy. Schwartz et al. (2015) proposed combining hand-crafted patterns with a Word Embeddings model to obtain better antonymy extraction. In addition, we reviewed the current approaches in Relation Extraction, including distributional-based, pattern-based and machine learning approaches.

Every approach has its pros and cons in identifying the target semantic relation. However, the underlying assumption of these approaches is to use corpus observations, such as word occurrences or probabilities, to represent the relation between words. In terms of antonymy identification, we conclude that capturing the distributional frequency between word pairs fails to distinguish antonyms and opposites from merely similar words. Thus, an improved antonymy vector representation is needed to gain better prediction performance. We also presented an extensive description of using patterns to discriminate antonymy relations. The studies by Jones (2003) and Davies (2012) both discovered patterns that either host or trigger opposition relationships between English word pairs. In contrast, some proposed approaches used automatic pattern generation to identify their target relation. In this research, we compare manually and automatically generated patterns in terms of their impact on prediction performance. Another addressed limitation is corpus size. According to Turney (2011), corpus size has a major impact on Relation Extraction performance, and using a very large corpus can reduce this limitation by generating a sufficient number of relation examples. In fact, generating the training set from the corpus itself, and avoiding external training sets whose pairs might not occur in the selected corpus, can also resolve this issue. Therefore, in this work we aim to use a corpus as the source of our training set and to overcome the limitation of previous approaches that generate highly sparse vectors. We also discussed applying the deep learning approach to antonymy extraction, which requires a large number of antonymy examples. Failing to obtain enough examples will dramatically impact the performance of the deep learning approach. According to Dean et al. (2012), a small training set is not enough to extract functional and abstracted features. In contrast, classical machine learning algorithms work effectively on small training sets, and thus suit the problem of antonymy detection, as we were not able to extract a massive number of antonymy examples from our corpus at this stage. In short, previous Relation Extraction models aim to distinguish between antonyms and synonyms via extracted textual patterns.

However, constructing a computational model that is able to distinguish between antonyms and other non-antonym pairs, including semantically related pairs such as (sun/earth) and pairs that are not semantically related such as (wet/cold), has not yet been experimented with. Therefore, in this research we focus on discovering antonymy patterns that are able to distinguish between antonymy and non-antonymy pairs across different semantic relations, such as meronymy and co-hyponymy, and not only synonymy.

3.4 Summary

The chapter presented an overview of vector space models and relation extraction approaches and their limitations in capturing antonymy. It also discussed the performance of the Sketch Engine tools, including the Automatic Thesaurus and Sketch Difference tools, in recognizing antonymy relations. In short, these approaches are robust in distinguishing synonymy and hypernymy/hyponymy. However, capturing antonymy requires more functional features, such as the contextual frames of pairs of words.

Chapter 4

Semantic Representation of Antonymy

Since the 1970s, several studies have proposed different approaches to representing words in the vector space model. The traditional vector space model uses vectors to represent a word's co-occurrence with other words in the corpus as numerical sequences drawn from statistical observations. The underlying assumption of the vector space model follows the distributional hypotheses of Harris (1954) and Firth (1957), which state that similar words tend to appear in similar contexts. An example of a vector space model application is found in search engines and document retrieval. A search engine represents the similarity between a query and the indexed documents by encoding the distribution of their keywords (e.g., via TF-IDF values) in a vector space model, and then uses vector comparison operations (e.g., the cosine measure) to retrieve the documents most similar or relevant to the input query's keywords.
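To make this retrieval mechanism concrete, the sketch below implements the TF-IDF-plus-cosine pipeline just described using scikit-learn; the documents and query are invented for illustration and are not taken from the research corpora.

```python
# Minimal sketch of TF-IDF document retrieval with cosine similarity,
# illustrating the vector space model described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "hot and cold water supply systems",
    "the history of the British monarchy",
    "heating and cooling of buildings",
]
query = ["hot water heating"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)   # one TF-IDF vector per document
query_vector = vectorizer.transform(query)          # encode the query in the same space

# Rank documents by cosine similarity to the query vector.
scores = cosine_similarity(query_vector, doc_vectors)[0]
for doc, score in sorted(zip(documents, scores), key=lambda p: -p[1]):
    print(f"{score:.3f}  {doc}")
```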

Despite the good performance of vector space models, there are two key problems related to capturing antonymy and opposite relations. First, vector space models, including BOW and Word Embeddings, cannot distinguish between similar words and opposite words because they share similar word occurrence distributions; therefore, similarity measurements such as cosine place both opposite and synonym vectors at very similar distance scores (Schwartz et al., 2015). Second, there is no vector space model dedicated to representing opposite words. The current vector space models are designed for the purpose of capturing synonymy and similar words, not opposite words. In order to address these problems, we propose using a set of automatically extracted and classified lexico-syntactic patterns and representing their co-occurrence with words in a pair vector representation model. Our proposed vector space model combines the pattern-based approach and the vector space representation. In contrast to previous word vector space models, which represent a word's co-occurrence with other words in the vocabulary set, we represent every pair of words with a set of patterns extracted from the corpus. The aim of the proposed representation is to improve the capture of antonymy by using a set of extracted and classified symmetric lexico-syntactic patterns (such as both X and Y and both Y and X, where the pair (X/Y) has identical PoS tags such as (adjective/adjective)) obtained from the same corpus. Our proposed model has three main advantages. First, it has no corpus size limitation: we apply the model to the BNC, which has over 96 million tokens, and to SkELL, which has over 1 billion tokens, and obtain positive output from both corpora. Second, the model is language-adaptive: we have been able to apply the model to English and Arabic corpora, and the results for both languages were competitive with previous models. Third, we use the same corpus to derive word pairs and patterns and show that our model is able to capture novel patterns and antonym pairs in an automatic way. The structure of this chapter is as follows: we first present the overall model framework for capturing antonymy, and then we describe the construction of the research dataset used to compose our training and testing sets. In the last part we explain our hypothesis for labelling positive and negative pair instances and conclude by evaluating the labelled training set.

4.1 Pair Vector Representation Model

Turney et al. (2003) proposed the phrase vector representation, which takes a textual corpus and a set of word pairs as input.

The model extracts the shortest patterns between the words X and Y by using two textual sources: a corpus and the words' glossary definitions in WordNet. A pattern links together a pair of words in text, such as (X and Y), (X or Y) and (X in Y). It also uses WordNet's synsets to discover a word's variations, such as (Korea, Republic of Korea, South Korea), to expand the number of extracted patterns. Pattern selection is based on occurrence frequency, and the highest-frequency patterns are selected. Turney and his colleagues selected the most frequent 8,000 patterns, which generated high-dimensional vector representations, and achieved an accuracy of 24.0% in identifying antonyms. The novelty of Turney's approach is that it represents the relation between a pair of words and a set of automatically extracted patterns. In contrast to previous work on using patterns to generate relation examples, such as Hearst (1992) and Lobanova et al. (2010), Turney's approach uses automatically extracted patterns to build a distributed vector representation to discover the relation between pair vectors. However, selecting patterns based on their occurrence frequency can generate noisy and very general patterns that capture both antonym and non-antonym pairs. Thus, in this thesis, we propose using a pattern classifier which classifies every extracted pattern into one of three classes: positive, which entails that the pattern is associated with antonyms; negative, which entails that the pattern is associated with non-antonyms; or neutral, which entails that the pattern is equally associated with antonyms and non-antonyms. The generated vector representation uses fewer patterns in comparison to the 8,000 patterns in Turney (2006). Reducing the number of patterns yields a low-dimensional vector representation and also better classification performance, as described in chapter 6. The pair vector space model represents every co-occurring word pair (X/Y) extracted from the corpus as a vector of normalized numerical values. Every value $v_i$ corresponds to the co-occurrence frequency between the word pair and one of a set of symmetric patterns that are highly precise in capturing antonyms, as in the representation example in figure 4.1 and the pair vector representation in equation 4.1.


Figure 4.1: The vector space model of the word pairs and the symmetric patterns

\vec{v}(x, y) = [v_0, v_1, v_2, v_3, v_4, \ldots, v_i, \ldots, v_N] \qquad (4.1)

We aim to deliver a dense dimensional space $\mathbb{R}_p^{|P| \times i}$, where $|P|$ is the size of the selected feature set. Each column in $\vec{v}(x, y)$ corresponds to the co-occurrence value of the nth pattern, and every row corresponds to a single word pair (X/Y). The vector space has n-dimensional axes, and each dimension belongs to a single symmetric pattern. A symmetric pattern is a pattern in which X and Y can substitute for each other; for example, (from north to south) and (from south to north) show this symmetric feature. We use PCA to obtain a better view of the pattern hyperspace by plotting positive (+ve) and negative (−ve) word pairs on the reduced pattern dimensions. Every point in figure 4.2 represents the coordinates of a word pair in the pattern hyperspace.


Figure 4.2: t-SNE visualisation of BNC corpus pair vectors in patterns hy- perspace

Representing every word pair in the training set generates a matrix of the training set size. An example of the generated matrix is shown in table 4.1, with rows corresponding to the word pairs and columns corresponding to the selected patterns. The first row in the matrix, for example, represents the pair vector for the adjective pair (active/inactive), which has a joint frequency of 13 with the symmetric patterns (between X and Y) and (between Y and X). In practice these vectors are normalized to control frequency outliers within a fixed range (e.g. between 0 and 1).

Table 4.1: Example of pair vector representation

Pair (X, Y, PoS)                 between X and Y /    both X and Y /    Class
                                 between Y and X      both Y and X
(active, inactive, adjective)    13                   1                 Positive
(cheap, expensive, adjective)    2                    5                 Positive
(good, bad, adjective)           294                  563               Positive
(command, control, adjective)    2                    12                Negative
(great, good, adjective)         7                    43                Negative
(input, output, noun)            9                    73                Unknown
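As an illustration of how such a matrix can be assembled, the sketch below builds the pair-by-pattern matrix of table 4.1 and applies one plausible normalisation (column-wise min-max scaling into [0, 1]); the exact normalisation used in the model is described in chapter 5, so this particular scaling choice is an assumption made for the example.

```python
# Minimal sketch of the pair vector representation in table 4.1: each word
# pair becomes a vector of symmetric pattern co-occurrence counts, min-max
# normalised into [0, 1]. Raw counts are taken from table 4.1.
import numpy as np

patterns = ["between X and Y / between Y and X", "both X and Y / both Y and X"]
pair_counts = {
    ("active", "inactive"): [13, 1],
    ("cheap", "expensive"): [2, 5],
    ("good", "bad"):        [294, 563],
    ("command", "control"): [2, 12],
}

matrix = np.array(list(pair_counts.values()), dtype=float)

# Column-wise min-max normalisation controls frequency outliers such as
# the very frequent pair (good/bad).
mins, maxs = matrix.min(axis=0), matrix.max(axis=0)
normalised = (matrix - mins) / (maxs - mins)

for pair, row in zip(pair_counts, normalised):
    print(pair, np.round(row, 3))
```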


The proposed pair vector representation aims to deliver a short, dense vector with low sparsity. To do this we select features that show a higher association with antonym pairs and whose co-occurrence values are mostly non-zero. A known limitation of the pattern-based approach is that using predefined patterns can lead to sparse values, as these patterns may have low or zero occurrence in the corpus. Therefore, we generate the vector features from the same corpus to ensure sufficient pattern occurrence, as well as to reduce sparse vectors, which deliver little useful information.

4.2 Feature Analysis

A primary concern of our research is to evaluate the pattern-based approach in generating highly precise patterns to distinguish antonymy. This evaluation takes two key points into consideration: first, the quality of the extracted patterns in determining antonymy; second, the number of extracted patterns used to represent the antonymy relation. In this work, we use the term feature to refer to a selected pattern in the vector space model. To achieve the best representation of antonymy we used two processes: Feature Extraction and Feature Selection. We define Feature Extraction as the process of generating either textual or statistical evidence corresponding to antonymy relations. Feature Selection is the process of finding the most effective feature subset for predicting antonymy. This chapter briefly discusses and analyses the different types of features for predicting semantic relations that can be found in the corpus. The following chapter gives more details about the pattern extraction and selection strategies which belong to the feature set construction.

4.2.1 Feature Extraction

Any corpus can provide useful information on semantic relations. For example, corpus data can be transformed into useful features for training a machine learning model. These data can take several forms:

string features, such as n-grams and textual patterns generated from the corpus, or numeric features, such as frequency counts for certain pairs or patterns. Features can also be extracted either from a single word, for example its PoS tag and occurrence frequency, or from the relation between the target words and sentences in the corpus. The following points summarise our findings on the possible extracted features, and table 4.2 shows some examples of these features:

• PoS tags
Part-of-speech tags can provide useful information in terms of preserving identical-PoS words and avoiding the presence of cross-categorical semantic relations. Fellbaum (1995) defines cross-categorical relationships as any semantic or lexical relation that occurs between words belonging to different PoS classes. We use this feature to construct our dataset by extracting only identical-PoS pairs such as (noun/noun), (adjective/adjective) or (verb/verb). We also use PoS tags in representing lexico-syntactic patterns, i.e. patterns that only capture words belonging to certain PoS categories. Sketch Engine corpora are PoS-tagged. However, the accuracy of PoS tagging in Arabic corpora varies. For example, Alosaimy (2018) compares several PoS taggers, such as MADA (Farber et al., 2008) and MADAMIRA, which generate PoS tags that vary in the number and type of tags, may differ in the tagging decision for the same or a similar tag, and may differ in tagging accuracy according to the training dataset and algorithm involved.

• Lemmatization
Sketch Engine allows specifying query attributes when querying the corpus. An attribute is the lexical unit used to match the concordance: it can be a word, lemma, tag or lempos (a combined lemma and PoS tag). Using the lemma attribute instead of the word attribute returns the different forms of a word.


For example, querying the lemma happy returns the words happy, happier and happiest. This helps to find the possible word forms in the corpus that might participate in the antonymy relation. In Arabic, only two small corpora in the Sketch Engine have been lemmatized: the Quran and KSUCCA (see table A.3). Vast Arabic corpora such as ArTenTen are not lemmatized. To overcome this issue we used the MADAMIRA lemmatizer to extract lemmata.

• Contrastive Patterns
As mentioned in section 2.4, antonyms tend to occur in certain frequent contrastive patterns in the corpus. However, using a set of predefined patterns, such as Jones' antonym patterns and Davies' opposite patterns, is not functionally sufficient, as corpora can generate variable occurrence frequencies for each pattern. Instead, generating antonym patterns from the corpus is more effective and represents the actual patterns surrounding antonyms.

• Counting symmetric co-occurrence
A simple way to capture related features between two entities is to count their co-occurrence frequency in a document. For example, the BOW model is based on the Distributional Hypothesis, which predicts that words are similar in meaning if they frequently co-occur in similar contexts. BOW simply vectorizes every word in the corpus as a feature column and represents co-occurrence by either a Boolean value (such as 1 for appearance and 0 for absence) or raw or normalised frequencies. The output produces a huge number of vectors, as it captures the relations between all words in the corpus. Another example is the n-grams model, which resolves the problem of high-dimensional vectors by representing two consecutive words in a bi-gram model. In contrast, our model uses the co-occurrence approach in two ways. First, we capture all associated pairs in the corpus that co-occur at least twice, which gives us plenty of pairs that might belong to antonymy or other semantic relations. Second, we count the symmetric co-occurrence frequency between every pair and pattern in our dataset.


Each pair of words is then represented as a vector. Symmetric counting means we count the co-occurrence of both words in every pattern in symmetric positions, i.e. (X/Y) and (Y/X); a small sketch of this counting follows table 4.2.

Table 4.2: Features Examples

Feature                               Example
Symmetrical co-occurrence frequency   f(hot, cold) = 12,034
                                      f(both hot and cold, both cold and hot) = 147
Identical PoS tags                    .. ADJ and ADJ temperature ..
Lexico-syntactic patterns             .. between hot,ADJ and cold,ADJ temperature ..
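The following minimal sketch illustrates the symmetric counting described in the last bullet above; the sentences and the pattern template are invented for illustration.

```python
# Minimal sketch of symmetric co-occurrence counting for a pair (X/Y)
# inside one pattern template: the template is filled with (X, Y) and
# with (Y, X), and both counts are summed.
import re
from itertools import permutations

sentences = [
    "the tap runs both hot and cold water",
    "serve the dish both cold and hot",
    "both young and old visitors are welcome",
]

def symmetric_count(x, y, template, sentences):
    """Count occurrences of the template filled with (x, y) and (y, x)."""
    total = 0
    for a, b in permutations((x, y)):  # both symmetric positions
        pattern = re.compile(re.escape(template.format(X=a, Y=b)))
        total += sum(len(pattern.findall(s)) for s in sentences)
    return total

print(symmetric_count("hot", "cold", "both {X} and {Y}", sentences))  # -> 2
```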

4.2.2 Feature Selection

We proposed a feature set corresponding to the normalised co-occurrence representation between every lexico-syntactic pattern and every pair in the dataset. However, this combination yields a massive-dimensional feature space, and therefore feature selection is needed. Feature selection has many advantages for the model: it reduces the processing time and storage consumption and increases model efficiency. Exploring a large space of patterns and evaluating the most effective pattern subset for predicting antonymy has gone through several experiments. This study aims to examine every extracted feature (i.e. pattern) individually by evaluating its performance in predicting antonymy. For each pattern, we constructed a simple unsupervised classifier which evaluates the pattern's association over positive and negative instances. As equation 4.2 shows, the pattern classifier assigns every input pattern $p_i$ to one of three classes: positive, i.e. the pattern is more likely to co-occur with antonym pairs; negative, i.e. the pattern is more likely to co-occur with non-antonym pairs; or neutral, which means the pattern is equally associated with positive and negative pairs and therefore gives useless evidence. More information on the classification computation is given in chapter 5.


p_i =
\begin{cases}
\text{Positive} & \text{if } f(p_i) > 0 \\
\text{Neutral}  & \text{if } f(p_i) = 0 \\
\text{Negative} & \text{if } f(p_i) < 0
\end{cases}
\qquad (4.2)
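As an illustration, the sketch below implements the three-way decision of equation 4.2. The score f(p) is taken here to be the mean co-occurrence with positive pairs minus the mean co-occurrence with negative pairs; the exact association measure is defined in chapter 5, so this particular definition is an assumption made for the example.

```python
# Minimal sketch of the three-way pattern classifier in equation 4.2.
from statistics import mean

def classify_pattern(pos_counts, neg_counts):
    """pos_counts/neg_counts: the pattern's co-occurrence frequencies with
    the positive (antonym) and negative (non-antonym) training pairs."""
    score = mean(pos_counts) - mean(neg_counts)   # an assumed form of f(p)
    if score > 0:
        return "positive"   # pattern favours antonym pairs
    if score < 0:
        return "negative"   # pattern favours non-antonym pairs
    return "neutral"        # equally associated: useless evidence

# e.g. a pattern like 'between X and Y' co-occurring mostly with antonyms:
print(classify_pattern(pos_counts=[13, 2, 294], neg_counts=[2, 7, 0]))  # positive
```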

4.3 Research Framework

Our approach to antonymy extraction and classification has five main phases, which have been implemented on English and Arabic corpora. Figure 4.3 shows the overall process of these phases; we briefly describe each phase and link it to its related chapter in the following points:

1. Pair Extraction and Labelling
We begin with Pair Extraction, which involves extracting the word pair set from the chosen corpus. Pair Labelling involves tagging the extracted pairs with positive and negative classes; a pair that fits neither class is labelled as an unknown relation. The generated dataset represents the research problem as a set of positive and negative training and testing pair instances. Chapter 4 describes phase one and the labelling process.

2. Pattern Extraction
This phase automatically generates textual patterns from the same chosen corpus. The extraction process uses every positive and negative pair instance to generate patterns, as explained in chapter 5. The output of this phase is a massive number of extracted patterns that need further analysis and selection.

3. Pattern Classification
This phase aims to choose the most effective patterns found in the corpus for identifying antonymy. We use an unsupervised pattern classifier which takes every pattern extracted in the previous phase and computes its average association between positive and negative instances. The classifier assigns a class to every pattern.


A pattern can be positive (i.e. it generates more positive than negative instances), negative (i.e. it generates more negative than positive instances) or neutral (i.e. it generates similar numbers of positive and negative instances). More details about this phase are found in chapter 5.

4. Pair Vector Representation
We present a novel symmetric-pattern-based pair vector representation model. Every value in the vector represents the normalized frequency between a pair and a positively classified pattern. Details of the pair vector representation are presented in chapter 5.

5. Machine Learning Model
We use the proposed antonymy pair vector model as input to the Antonymy Classifier, a supervised machine learning classifier. Chapter 6 presents the process of algorithm selection and evaluation.


Figure 4.3: The overall process of the antonymy extraction and classification model

4.4 Implementation Tools: Sketch Engine

During the last decades there have been several contributions in creating corpus analysis tools and platforms to ease the analysis process for linguists and language learners. Kilgarriff and Rychlý founded the advanced and widely deployed corpus analysis system Sketch Engine, which enables users to access corpora in more than 90 languages and implements several language analysis functions such as the Word Sketch, Sketch Difference and Automatic Thesaurus tools. In addition, Sketch Engine offers its tools and corpora through programming interfaces to help researchers in the field of Natural Language Processing conduct their experiments programmatically. In this work, we use Sketch Engine to conduct the extraction of word pairs and patterns. We use the Sketch Engine Application Programming Interface (API) to implement the extraction process programmatically. To do so, we use the Corpus Query Language (CQL) to implement more advanced and complex corpus queries based on regular expressions.
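As an illustration of the kind of CQL queries involved, the sketch below builds two query strings of the sort we submit through the API. The tag values ("JJ.*" for adjectives) assume an English TreeBank-style tagset and would need adjusting to the tagset of the chosen corpus; the API call itself, and the exact queries used in the experiments, are not shown here.

```python
# Illustrative CQL query construction for pair and pattern extraction.
# CQL matches token attributes with regular expressions inside brackets,
# e.g. [tag="JJ.*"] matches any adjective token.

def conjunction_query(pos_tag: str) -> str:
    """CQL for two same-PoS words joined by 'and', 'or' or 'nor'."""
    return f'[tag="{pos_tag}"] [word="and|or|nor"] [tag="{pos_tag}"]'

def symmetric_pattern_query(x: str, y: str) -> str:
    """CQL matching 'both X and Y' with the pair in either order.
    (A real query would also exclude matches where both slots are
    filled by the same word.)"""
    return f'[word="both"] [lemma="{x}|{y}"] [word="and"] [lemma="{x}|{y}"]'

print(conjunction_query("JJ.*"))                 # adjective pairs
print(symmetric_pattern_query("hot", "cold"))    # symmetric antonym pattern
```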

4.5 Dataset Construction and Labelling

A key challenge in achieving the best classification results with machine learning models is using the right dataset. A dataset used in a supervised machine learning model must be labelled and must describe the research problem as a set of positive and negative examples of the target relation. In our work, we constructed the research dataset from the chosen corpus automatically. The first step of the construction generates word pairs that show high co-occurrence frequency in the chosen corpus. The second step automatically labels every extracted word pair as either a positive relation (i.e. antonymy), a negative relation (i.e. some other semantic relation exists), or an unknown relation (i.e. there is a semantic relation between the words but it is not yet known in the WordNet). There are several advantages to constructing the research dataset from the corpus.


First, the datasets used in the previous studies by Lin et al. (2003), Turney (2011) and Turney (2008) aimed to improve word similarity and the capture of synonymy by distinguishing antonymy relations. Their datasets contain examples of synonymy and antonymy and do not consider other semantically related relations that also appear in contrastive patterns, such as opposites, incompatibles, co-hyponymy and meronymy relations, or unrelated words. Therefore, in this work we ensure that the training set has a diversity of all possible positive and negative relations. Negative unrelated word pairs such as (hot/short) and (big/bright) are also extracted by a pattern-based approach to obtain sufficient negative examples that co-occur in similar contrastive patterns. The second advantage of our dataset construction is that extracting frequently co-occurring word pairs reduces vector sparsity. Our work reduces the rows of the vector hyperspace by only examining the frequent word pairs, i.e. word pairs found together in sentences and co-occurring at least twice. The corpus distribution of semantically related word pairs shows higher co-occurrence frequency compared to non-semantically related words. For example, the word external in the SkELL corpus has a strong co-occurrence relation with a positive (antonymy) partner (external/internal) and a negative (non-antonymy) partner (external/visible). In addition, many WordNet antonym pairs are not found in the selected corpus due to its size limitation. For instance, the antonymy pair (preventive/permissive) found in the Princeton WordNet (PWN v.3.1) never occurs in the SkELL English corpus. Thus, using an external antonym pair set generated from the WordNet or constructed manually as a training set could lead to sparseness problems and provide useless information. The third advantage of constructing the dataset from the corpus is that the corpus generates new antonym examples which have not yet been encoded in the WordNet. For example, some word pairs are already present in the WordNet: the words (input/output) are linked by the sister (coordinate) relationship, which groups synsets that share the same hypernym but hides many other relationships, including antonymy.

In fact, finding new antonyms has many applications, such as extending WordNet relation coverage with new antonyms or using antonym pairs to generate examination materials (e.g. for the Graduate Record Examination, GRE) based on antonymy questions.

4.5.1 Automatic Word Pair Extraction

To extract word pairs from the chosen corpus, we combine the co-occurrence hypothesis and the pattern-based approach to find all possible distributionally similar word pairs in the corpus. The question is: where do antonym pairs occur in the corpus? Jones (2003) showed that antonym pairs are usually conjoined by and and or, as in the coordinated patterns (X and Y) and (X or Y). In addition, we use a set of antonymy seeds from different syntactic groups, including an adjectival pair (hot/cold), a noun pair (woman/man) and a verbal pair (increase/decrease). The experiment aims to find the most frequent PoS patterns to be used in generating antonym pairs. We used the SkELL corpus in this experiment and analysed the concordance lines for every query pair. We used the Frequency tool in the Sketch Engine to find the most frequent PoS tags in the concordance lines. The results in tables 4.3, 4.4 and 4.5 show that most antonym pairs in the different syntactic classes are symmetrically conjoined by one conjunction, where "," represents the PoS tag for a comma. For example, the coordinated patterns (X and Y, Y and X, X or Y, Y or X, X nor Y and Y nor X) are instances of the PoS pattern (adjective conjunction adjective).

Table 4.3: The top five PoS concordance of the adjectival pair (hot/cold)

PoS pattern                                       Frequency   Frequency per million
adjective conjunction adjective                   1,584       1.29
adjective conjunction adverb adjective            116         0.09
adjective noun conjunction adjective              82          0.07
adjective , adjective                             82          0.07
adjective noun preposition determiner adjective   67          0.05
adjective noun conjunction adjective              40          0.03
Concordance size                                  4,008


Table 4.4: The top five PoS concordance of the noun pair (man/woman)

PoS pattern                                  Frequency   Frequency per million
noun conjunction noun                        2,800       2.28
noun conjunction determiner noun             1,666       1.36
noun , noun                                  421         0.34
noun conjunction numeral noun                386         0.31
noun conjunction determiner adjective noun   193         0.16
Concordance size                             14,080

Table 4.5: The top five PoS concordance of the verb pair (increase/decrease)

PoS pattern                       Frequency   Frequency per million
verb conjunction verb             563         0.46
verb , verb                       30          0.02
verb conjunction verb             22          0.02
verb noun conjunction verb        15          0.01
verb noun noun conjunction verb   12          < 0.01
Concordance size                  1,463

4.5.2 Pair Extraction Output

We used one lexico-syntactic pattern to generate 3,011 word pairs from SkELL and 838 pairs from the BNC. The extracted pairs are unique and their minimum co-occurrence frequency is ten. The extracted pairs show different semantic relations across different PoS classes, as shown in tables 4.6, 4.7 and 4.8.
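The sketch below illustrates this extraction step on a toy PoS-tagged corpus: same-PoS words joined by a conjunction are collected and counted symmetrically, then filtered by a minimum frequency. The tagged sentences and the threshold of 2 are illustrative (the real extraction uses a minimum co-occurrence of ten on the full corpora).

```python
# Minimal sketch of conjunction-pattern pair extraction from a tagged corpus.
from collections import Counter

tagged_sentences = [
    [("hot", "ADJ"), ("and", "CONJ"), ("cold", "ADJ"), ("water", "NOUN")],
    [("cold", "ADJ"), ("or", "CONJ"), ("hot", "ADJ"), ("meals", "NOUN")],
    [("men", "NOUN"), ("and", "CONJ"), ("women", "NOUN")],
]

def extract_pairs(sentences, min_freq=2):
    counts = Counter()
    for sent in sentences:
        for (w1, t1), (c, tc), (w2, t2) in zip(sent, sent[1:], sent[2:]):
            # identical-PoS words joined by a conjunction
            if tc == "CONJ" and t1 == t2 and w1 != w2:
                # symmetric counting: (X, Y) and (Y, X) share one key
                counts[tuple(sorted((w1, w2))) + (t1,)] += 1
    return {pair: f for pair, f in counts.items() if f >= min_freq}

print(extract_pairs(tagged_sentences))  # {('cold', 'hot', 'ADJ'): 2}
```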


Table 4.6: Examples of extracted adjectival pairs from the BNC and SkELL corpora

X            Y                 PoS
skilled      unskilled         adjective
moral        political         adjective
educational  cultural          adjective
inner        outer             adjective
hard         soft              adjective
clear        straightforward   adjective
strange      wonderful         adjective
unable       unwilling         adjective
proximal     distal            adjective
economic     strategic         adjective
political    constitutional    adjective
nice         easy              adjective
light        fluffy            adjective
many         varied            adjective
small        more              adjective
warm         cosy              adjective
physical     emotional         adjective
black        white             adjective
open         close             adjective
present      future            adjective
hot          dry               adjective
personal     spiritual         adjective
positive     negative          adjective
left         right             adjective
singular     plural            adjective
third        fourth            adjective
sixth        seventh           adjective
new          different         adjective


Table 4.7: Examples of extracted noun pairs from the BNC and SkELL corpora

X             Y             PoS
nature        scope         noun
face          neck          noun
computing     methodology   noun
rat           mouse         noun
design        technology    noun
heaven        earth         noun
injury        damage        noun
holist        individualist noun
science       engineering   noun
television    video         noun
doctor        dentist       noun
water         electricity   noun
equipment     software      noun
import        export        noun
practice      procedure     noun
accident      emergency     noun
visitor       passage       noun
customer      supplier      noun
gas           water         noun
milk          honey         noun
ache          pain          noun
doctor        patient       noun
judgment      order         noun
radio         television    noun
discussion    debate        noun
museum        gallery       noun
heart         soul          noun
exploration   production    noun


Table 4.8: Examples of extracted verb pairs from the BNC and SkELL corpora

X           Y          PoS
reside      carry      verb
exclude     limit      verb
clean       repair     verb
examine     discuss    verb
maintain    improve    verb
die         rise       verb
reduce      eliminate  verb
stand       talk       verb
sound       smell      verb
watch       listen     verb
collapse    die        verb
go          lay        verb
take        drive      verb
identify    measure    verb
come        make       verb
understand  evaluate   verb
fall        break      verb
own         control    verb
construct   adapt      verb
arrest      release    verb
produce     direct     verb
turn        come       verb
confirm     deny       verb

4.6 Distant Supervision Learning

In the previous section we demonstrated our extraction model, which used the conjunction pattern (X conjunction Y) to generate frequent word pair candidates from an unlabelled corpus. The extraction yields thousands of word pairs, most of whose semantic relations are not encoded in PWN 3.1. Thus, we need to label each pair as either an example of antonymy (positive) or not an example of antonymy (negative).


There are many approaches to labelling an unlabelled dataset, such as the manual approach, which uses native speakers to label a small dataset. In addition, Nghiem & Ananiadou (2018) presented an active and proactive learning approach which makes the labelling process easier and reduces human effort. Their proposed model is a web-based system for creating annotated data which uses a machine learning model to label a dataset and then uses native speakers to evaluate the model's labelling in an iterative way. However, manually labelling a large number of word pairs is a long and intense process that costs time and human effort. Besides this, there is a chance of human error, which leads participants to label the same pair differently (Lobanova & Bouma, 2010). For example, people usually use their personal intuition and knowledge to judge antonymy relations. Lobanova et al. (2010) used five native speakers to label antonyms; the results showed that the same dataset was labelled differently by the five participants, and therefore Lobanova and her colleagues used a voting strategy to select the most likely correct antonymy label. One alternative solution to this problem is distant supervision learning, used by Mintz et al. (2009), which uses an unlabelled corpus and an external knowledge base of the target relation to extract relation features and create the training dataset. The approach of Mintz et al. (2009) avoids intense manual labelling and proposes strategies for extracting relation features from an unlabelled corpus. The intuition of distant supervision is that every sentence in the corpus which mentions a word pair that participates in the target relation in a knowledge base would likely express evidence for the target relation. Distant supervision does not require a labelled corpus, but it must be provided with a knowledge base that has plenty of instances of the target relations. The main goal of distant supervision is to extract a large number of sentences (whether good or bad) that can be used as features in a supervised classifier.

Therefore, distant supervision is not restricted to a small dataset of relation instances but can be used on a large, unlabelled, domain-independent corpus, as it is supervised by a knowledge base. In fact, extracting all sentences from a corpus that feature both members of an antonymy pair could lead to the extraction of a very large number of potentially noisy sentences. Some extracted sentences which mention an antonymy pair may not contextually represent the target relation. On the contrary, they may express unrelated semantic relations even though both antonymy pair members are found together. For example, the sentence (The new advertisements showed a growing friendship between a young boy and an old man) shows that the word pair (new/old) has no semantic opposition relation in this context, although the words are known to be antonyms, while the pair (old/young) functions as antonymous adjectives in the context. Therefore, though the sentence has two antonymous pairs, it expresses the relation of antonymy differently in the context. One suggested solution is to reduce the context window size to capture patterns such as (between a young boy and an old man), which shows the right contextual relationship between the antonymous pairs as well as reducing the sense ambiguity between the word pairs. Therefore, we propose that some short textual patterns help capture word pairs at their closest distance and also reduce the chance of capturing the wrong contrastive context. Our model uses lexico-syntactic patterns instead of whole sentences, as these enable better relation prediction. One way to reduce the amount of noisy features in an antonymy prediction model (i.e. sentences generating bad evidence for the relation) is to reduce the sentence to a smaller window of words or patterns. Distant supervision relies on knowledge bases which can take different forms, but which must contain enough examples of instances of the target relations. An example of such a KB is YAGO (Suchanek et al., 2007), which contains Named Entities linked to their semantic relationships, such as hasGivenName or isLocatedIn. Our model uses the Open Multilingual WordNet, a large semantic graph which links several thousand relations in 150 languages, to provide a unified resource for distant supervision across different languages, especially Arabic and English.

For each pair of entities that appears in some WordNet relation, we find all sentences containing those entities in an unlabelled corpus and extract their textual patterns to be used as features in an antonymy prediction model.
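A minimal sketch of this harvesting step follows: every corpus sentence mentioning both members of a knowledge-base pair is collected as a candidate source of patterns. The sentences are invented for illustration, and the simple whole-word match stands in for the corpus query machinery used in practice.

```python
# Minimal sketch of distant-supervision sentence harvesting: keep every
# sentence that mentions both members of a knowledge-base pair.
def harvest(pair, sentences):
    x, y = pair
    return [s for s in sentences if x in s.split() and y in s.split()]

sentences = [
    "the contrast between internal and international flights",
    "internal affairs dominated the news",
    "an international and internal review followed",
]

# Returns the first and third sentences; the second lacks one member.
print(harvest(("internal", "international"), sentences))
```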

4.6.1 Labelling Algorithm 1

Our labelling algorithm, explained in algorithm 1, considers a word pair a positive pair instance if an antonymy relation links the words in WordNet. In addition, we label an instance as negative if the word pair has some WordNet relation that is not antonymy. Algorithm 1 labels positive and negative pair instances in the following steps:

1. Collect pairs of nouns, verbs or adjectives that show high co-occurrence frequency in the selected corpus (a co-occurrence frequency of over 100 in large corpora).

2. Using OMW, if the pair (X/Y) has a direct antonymy relation, then label it as a Positive example.

3. If the encoded relation between (X/Y) in OMW is neither antonym nor opposite, then label it as a Negative example.

4. If there is no semantic relation between (X/Y) in OMW, then label it as an Unknown example.

5. Extract every sentence in the selected corpus that mentions either a positive or negative pair to build the dataset.


Algorithm 1 Labelling Positive, Negative and Unknown Relation Examples

Input: UnlabelledPairs_n = {(x_0, y_0, pos_0), (x_1, y_1, pos_1), ..., (x_n, y_n, pos_n)}
Input: WordNetRelations_J = {(x_0, y_0, pos_0, relation_0), ..., (x_j, y_j, pos_j, relation_j)}
Output: LabelledPairs_n = {(x_0, y_0, pos_0, relation_0), ..., (x_n, y_n, pos_n, relation_n)}

1:  for every (x_i, y_i) ∈ UnlabelledPairs_n do
2:      compute CoOccurrenceFrequency(x_i, y_i)
3:      if Frequency(x_i, y_i) > τ then
4:          for every relation ∈ WordNetRelations_J do
5:              if relation(x_i, y_i) = Antonym then
6:                  Positive(x_i, y_i)
7:              end if
8:              if relation(x_i, y_i) ≠ Antonym then
9:                  if relation ∈ WordNetRelations_J then
10:                     Negative(x_i, y_i)
11:                 end if
12:             end if
13:         end for
14:         if relation(x_i, y_i) ∉ WordNetRelations_J then
15:             Unknown(x_i, y_i)
16:         end if
17:     end if
18: end for
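A minimal sketch of the labelling decision in Algorithm 1 is given below, using the NLTK interface to the Princeton WordNet in place of the OMW (it assumes nltk is installed and its 'wordnet' data downloaded). The set of non-antonymy relations checked here (shared synset, hypernymy/hyponymy, shared direct hypernym) is a simplified subset of the relations the full algorithm consults.

```python
# Sketch of the Algorithm 1 labelling step over NLTK's WordNet.
# Requires: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def is_antonym(x, y, pos):
    """True if WordNet encodes a direct antonymy link between x and y."""
    return any(ant.name() == y
               for lem in wn.lemmas(x, pos=pos)
               for ant in lem.antonyms())

def has_other_relation(x, y, pos):
    """True if x and y share a synset, a hypernym/hyponym link,
    or a direct hypernym (co-hyponymy) -- a simplified relation check."""
    for a in wn.synsets(x, pos=pos):
        for b in wn.synsets(y, pos=pos):
            if a == b:                                   # synonyms
                return True
            if b in a.hypernyms() or b in a.hyponyms():  # hyper/hyponymy
                return True
            if set(a.hypernyms()) & set(b.hypernyms()):  # co-hyponyms
                return True
    return False

def label(x, y, pos):
    if is_antonym(x, y, pos) or is_antonym(y, x, pos):
        return "positive"
    if has_other_relation(x, y, pos):
        return "negative"
    return "unknown"

print(label("hot", "cold", wn.ADJ))         # -> 'positive'
print(label("doctor", "dentist", wn.NOUN))  # -> 'negative' (co-hyponyms)
print(label("input", "output", wn.NOUN))    # expected 'negative': related in
                                            # WordNet but antonymy not encoded
```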

In fact, labelling negative pair instances with algorithm 1 was challenging, as many antonym pairs are missing from OMW. Thus algorithm 1 fails to distinguish between true negative examples of antonymy, such as (long/slow), and false negative examples, such as (customer/supplier), as both are represented in OMW with relations that are not antonymy. Therefore, we treat the problem of false negative pairs as part of the research problem and use these examples as a new way to discover novel antonym pairs that are not represented in OMW.


4.6.2 The Semantic Labelling of Word Pairs

As mentioned earlier, we labelled every extracted pair that has an antonymy relation in the OMW as positive. Moreover, we found that labelling negative pairs is vital for constructing a binary relation classifier. Thus, we labelled every pair instance that already has a semantic relationship in the WordNet but is not an instance of antonymy as negative (see algorithm 1). Based on the labelled positive and negative examples, the constructed classifier model is considered fine-grained: it is only able to classify canonical antonymy. This is because the labelled training set contains very limited examples of canonical antonymy and does not present examples of other types of antonymy, such as non-canonical antonymy or opposite pairs. Therefore, a fine-grained model cannot capture non-canonical or opposite pairs. This section describes our automatic approach to labelling opposite pairs (non-canonical antonymy). As we showed in section 4.5.1, the coordinated patterns used generate many examples of opposite pairs, such as (internal/international) and (bright/dark). However, due to the limited labelling of the opposite relation in the WordNet, we have to construct our own labelling approach to feed the classifier with opposite examples.

4.6.2.1 Labelling Opposite Examples

The word pair (internal/international) co-occurs in 797 sentences in the SkELL corpus. However, WordNet (OMW) does not represent their semantic relation. A native speaker would intuitively agree that, although the word (internal) is an antonym of (external) and (international) is an antonym of (national), there is a clear semantic opposition between internal and international and between national and external. Thus, we need to define an algorithm to generate opposite relation labels for such word pairs. Labelling pairs with opposite relationships can be done through degrees of opposition. For example, the pair (hot/cold) has a stronger opposition relationship than the pairs (hot/cool) and (warm/cold).

To tackle the issue of labelling opposition relations missing from the WordNet, we adopted the Contrasting hypothesis proposed by Mohammad et al. (2008), which is shown in figure 4.4 and defines the opposition relationship as follows:

“Contrasting hypothesis: if a pair of words, A and B, are contrasting, then there is a pair of opposites, C and D, such that A and C are strongly related and B and D are strongly related”

(Mohammad, Dorr & Hirst, 2008)

Figure 4.4: The proposed modelling hypothesis of Mohammad et al. (2008)

As mentioned in section 2.7, the terminology in Mohammad et al. (2008) differs from the terms used in this work. Mohammad and his colleagues use the term opposite to describe the antonymy relation occurring between gradable adjectives only, while we use antonymy to describe the opposition relationship between both gradable and non-gradable words. They also use the terms opposite and contrasting for what in our work are called antonym and opposite respectively. In table 4.9, we compare the terminology used in this work with that used in Mohammad's work.


Table 4.9: A comparison between the relation terminology used in Mohammad et al. (2008) and in our work

antonymy
  Mohammad's definition: refers to gradable adjectives only
  Mohammad's examples: (hot/cold), (large/small)
  Our definition: refers to a binary canonical relation of contrast including gradable and non-gradable antonyms
  Our examples: (hot/cold), (woman/man)

opposite
  Mohammad's definition: refers to word pairs that have a strong binary incompatibility relation with each other and/or are saliently different across a dimension of meaning
  Mohammad's examples: (hot/cold), (woman/man)
  Our definition: refers to a binary indirect (non-canonical) relation of contrast including gradable and non-gradable antonyms
  Our examples: (cold/warm), (woman/boy)

contrasting
  Mohammad's definition: word pairs that have some non-zero degree of binary incompatibility and/or have some non-zero difference across a dimension of meaning; thus, all opposites are contrasting, but not all contrasting pairs are opposites
  Mohammad's examples: (warm/cold), (tropical/freezing)
  Our definition: we define the contrasting relation as the antonymy relation
  Our examples: (cold/warm), (woman/boy)

The contrasting relationship describes the semantic opposition between incompatible word pairs (e.g. daylight/darkness). Therefore, every opposition relation in Mohammad et al.'s (2008) terms is contrasting, but not every contrasting relation is an opposition.

For example, daylight and darkness are contrasting (what we call opposite) because of the following relations:

• there is an opposite relation between day and night (what we call antonymy).

• there is a strong relation between daylight and day.

• there is a strong relation between darkness and night.

• Therefore, the relation between (daylight/darkness) is contrasting (what we call opposite).

In addition, Mohammad et al. (2008) presented an automatic approach to identifying contrasting pairs among a contrasting set, which uses the co-occurrence hypothesis to compute the degree of opposition. As described previously, the co-occurrence hypothesis states that antonyms tend to co-occur more often than by chance. A set of contrasting words generates a large number of possible opposite pairs based on their distributions. For instance, among the contrasting set (top, bottom, down, low) there are antonyms such as (top/bottom) and opposite pairs such as (top/low) and (top/down). The Degree of Contrasting hypothesis determines the degree of opposition between word pairs by comparing the co-occurrence frequency of every word pair in the contrasting set to select the strongest opposite pairs.

“Degree of contrasting: If a pair of words, A and B, are contrasting, then their degree of contrast is proportional to their tendency to co-occur in a large corpus”

(Mohammad, Dorr & Hirst, 2008)

The Degree of Contrasting hypothesis uses corpus statistics to classify every word pair in the contrasting set by capturing their co-occurrence frequency.

If a word pair has a higher co-occurrence frequency than the other candidates, then its degree of opposition is high. Applying the Degree of Contrasting hypothesis to word pairs such as (top/low) and (top/down) shows that (top/down) co-occurs more than (top/low) in their research corpus. Therefore, based on their corpus outcomes, they gave a higher opposition degree to (top/down), as it co-occurred more often than (top/low). However, changing the corpus can change the opposition degree: for example, the SkELL corpus shows the pair frequency(top/low) = 2,081 to be more frequent than frequency(top/down) = 229. We applied the proposed Contrasting hypothesis to labelling opposite word pairs in our dataset. The results show that some non-opposite word pairs were labelled as opposite because the words' synsets have an antonymy relationship between them in WordNet PWN 3.1. For example, although the relation between the word pair (wet/cold) is not opposition, applying the Contrasting hypothesis generates an opposite relation based on the following encoded relations, which are visually illustrated in figure 4.5.

Figure 4.5: Example of the Contrasting hypothesis by (Mohammad, Dorr & Hirst, 2008)

• there is a strong relation between wet and fresh,


• there is a strong relation between cold and stale,

• there is an opposite (what we call antonymy) relation between stale and fresh,

• Therefore, the relation between (wet/cold) is contrasting (what we call opposite).

Table 4.10 shows some examples of inferred opposite relationships along with their semantically related word pairs. It also shows examples of correct and incorrect inferred relations.

Table 4.10: Examples of our word pair set following the Degree of Contrasting hypothesis

Word pair      Actual label   Labelled as   WN relations
(long/thin)    not opposite   opposite      similar to(long/stretch), similar to(thin/shrink), antonym(stretch/shrink)
(easy/quick)   not opposite   opposite      similar to(easy/unhurried), similar to(quick/hurried), antonym(hurried/unhurried)
(cold/warm)    opposite       opposite      similar to(cold/stale), similar to(warm/fresh), antonym(fresh/stale)

4.6.3 N-Degree Opposition Labelling

We improved the Contrasting hypothesis to match the target opposite relation. We added one more condition to the Degree of Contrasting hypothesis: an antonymy relation must occur between the pair (A/B) and between the pair (C/D) of the similar pairs (A/C) and (B/D). This adjustment yields more precise and correct labelling of opposite pairs than relying on a single antonymy relation to infer opposite relations. We define our opposite hypothesis as follows:


“If there are two antonym pairs, (A/B), and (C/D), where A and C are semantically related, and B and D are semantically related, then there are two pairs (A/D) and (C/B) that are semantically opposite.”

As shown in figure 4.4, the pairs (internal/international) and (national/external) are examples of the opposite relation: there are two antonym pairs, (internal/external) and (national/international), and there is a similarity relation between national and internal and between external and international. Therefore, the pairs (internal/international) and (national/external) are semantically opposite.

Figure 4.6: The 1st degree of opposition

In addition, we found that the opposition degree can be computed by finding the path distance between the words' synsets. For example, the pairs (high/low) and (up/down) in figure 4.6 are examples of the antonymy relation, and there is a relatedness relation between high and up and between low and down; therefore, the relations between high and down and between low and up are opposite. We define the degree of these opposition relations as the 1st degree because they occur on the first level of the word synsets, and therefore their path distance is very close.


Figure 4.7: The 2nd degree of opposition

The opposite relationship can also be extended to the second level of synsets to capture the opposition relations between high and descending and between low and ascending. As figure 4.7 shows, adding a third antonym pair (ascending/descending) to the previous example generates a relation of the 2nd degree of opposition. This is because there is a similarity relation between ascending and up and between descending and down. Therefore, the relations (high/descending) and (low/ascending) are of the 2nd degree of opposition.

• 1st Degree of Opposition As shown in figure 4.6, there are opposite relations in the pairs (A/D) and (B/C) as there are two antonym pairs (A/B) and (C/D), where A and C are similar and B and D are similar.


• 2nd Degree of Opposition As shown in figure 4.7, there are opposite relations in the pairs (A/F) and (B/E) as there are three antonym pairs (A/B), (C/D) and (E/F), where A is similar to C and C is similar to E, and B is similar to D and D is similar to F.

• nth Degree of Opposition
As shown in algorithm 2, if there are n + 1 antonym pairs (X/Y), where each member of an antonym pair (Xi/Yi) has a similarity relationship with the corresponding member of the antonym pair (Xi+1/Yi+1), then there are two pairs which have the nth degree of opposition relationship. For example, to capture opposite relations of the 4th degree, n = 4 and the required number of antonym pairs is n + 1 = 5. Therefore, if there are similarity relationships between the members of the five antonym pairs, then we can discover two opposite pairs of the 4th degree. Moreover, if n = 0, there is no opposition relationship between the pair's members (X/Y).


Algorithm 2 Computing the Degree n of Opposition

Input: UnknownPairs = {(A/B), (C/D), (A/C), (B/D), (E/F), (C/E), (D/F), (A/D), (B/C), (B/E), (A/F)}
Output: LabelledPairs = {Opposite(A/D, n), Opposite(B/C, n), Opposite(B/E, n), Opposite(A/F, n)}, n ∈ ℕ

1:  if Antonym(A/B) and Antonym(C/D) ∈ WordNet then
2:      if Similar(A/C) and Similar(B/D) ∈ WordNet then
3:          Opposite(A/D, n = 1), Opposite(B/C, n = 1)
4:      end if
5:  end if
6:  if Antonym(A/B) and Antonym(C/D) and Antonym(E/F) ∈ WordNet then
7:      if Similar(A/C) and Similar(B/D) and Similar(C/E) and Similar(D/F) ∈ WordNet then
8:          Opposite(B/E, n = 2), Opposite(A/F, n = 2)
9:      end if
10: end if

4.6.4 Evaluating Labelling Results

To ensure that our proposed labelling algorithms perform accurately and generate the correct relation labels, we tested labelling algorithms 1 and 2 on the antonymy datasets TURN and LZQZ, which both have labelled semantic relations including antonymy and synonymy. Turney (2008) compiled the TURN dataset, which consists of 136 word pairs, including 89 antonym pairs and 47 synonym pairs, from different websites for learners of English as a second language. A similar dataset is LZQZ, which was compiled by Lin et al. (2003) and used to distinguish between synonyms and antonyms. The LZQZ set consists of 80 pairs of synonyms and 80 pairs of antonyms from the Webster's Collegiate Thesaurus.

Both datasets were constructed for the purpose of distinguishing synonymy from antonymy. However, neither of these datasets has labelled the indirect antonymy relation (what we call the opposite relation), and therefore indirect antonym pairs that are not represented in the WordNet are labelled as opposite, as explained in algorithm 2. For example, according to WordNet, the antonym of break is repair, but its troponym damage also indicates an opposite (indirect antonym) of repair. Therefore, we label the pair (repair/damage) as an opposite pair. Tables 4.11 and 4.12 show the labelling results. The column headers in both tables are the actual labels in the dataset, and the row headers are the relations predicted by our proposed labelling algorithms.

Table 4.11: Labelling evaluation on the LZQZ dataset (NA = no relation available in the WordNet)

LZQZ/WN   synonym   antonym   opposite   cohyponym   NA   total
synonym   42        0         0          12          26   80
antonym   3         48        10         7           12   80

Table 4.12: Labelling evaluation on the TURN dataset (NA = no relation available in the WordNet)

TURN/WN   synonym   antonym   opposite   cohyponym   NA   total
synonym   24        0         1          4           18   47
antonym   3         10        23         0           53   89


Table 4.13: Sample of labelled pairs taken from the LZQZ dataset (Actual rel = actual relation labelled by Lin et al. (2003); Labelled rel = the relation assigned by our labelling algorithm)

X                 Y                Actual rel   Labelled rel
follower          leader           antonym      antonym
guilt             innocence        antonym      antonym
abundance         scarcity         antonym      antonym
hate              love             antonym      antonym
lie               truth            antonym      NA
loss              profit           antonym      opposite
strength          weakness         antonym      antonym
damage            repair           antonym      opposite
defeat            victory          antonym      antonym
degeneration      regeneration     antonym      NA
play              work             antonym      cohyponym
joy               sorrow           antonym      antonym
majority          minority         antonym      antonym
convenience       inconvenience    antonym      antonym
overstatement     understatement   antonym      antonym
back              front            antonym      antonym
ability           inability        antonym      antonym
day               night            antonym      antonym
future            past             antonym      antonym
admiration        hatred           antonym      NA
depth             height           antonym      cohyponym
disrespect        respect          antonym      antonym
ignorance         knowledge        antonym      NA
inside            outside          antonym      antonym
certainty         uncertainty      antonym      antonym
continuation      termination      antonym      opposite
efficiency        inefficiency     antonym      antonym
apathy            enthusiasm       antonym      cohyponym
dissatisfaction   satisfaction     antonym      antonym
equality          inequality       antonym      antonym
hatred            love             antonym      antonym
nadir             zenith           antonym      antonym
adversary         ally             antonym      opposite
inferiority       superiority      antonym      antonym
optimist          pessimist        antonym      antonym
impotence         potency          antonym      antonym
ancestor          descendant       antonym      antonym
exodus            influx           antonym      NA

As shown in table 4.13, many examples have been labelled as co-hyponyms rather than antonyms. In fact, this is one limitation of our labelling algorithm: it cannot distinguish between co-hyponyms that are actually antonyms and co-hyponyms that are actually synonyms. Previous studies such as Lobanova & Bouma (2010) considered every co-hyponym a type of opposite, because the two members of a co-hyponym pair are incompatible with each other. For example, the pair (yesterday/today) is co-hyponym to the hypernym (times) and can be considered semantically opposite because yesterday is not today. However, although every antonym pair is also a co-hyponym pair, generalizing this fact to all co-hyponyms can lead to incorrect relations. For example, there is no semantic opposition between the co-hyponym pair (discomfort/soreness). We evaluate the labelling accuracy by computing the precision (equation 4.3) and recall (equation 4.4) of the antonymy and opposite relations. Precision measures labelling accuracy, i.e. whether the WordNet label matches the pair's actual label, computed as the fraction of correctly WordNet-labelled pairs in the whole dataset. Recall measures how many pairs WordNet can retrieve; if WordNet is unable to find a pair with the labelled relation, the recall is reduced.

\[ \text{Precision} = \frac{|\text{Retrieved Pairs} \cap \text{Relevant Pairs}|}{|\text{Retrieved Pairs}|} \tag{4.3} \]

\[ \text{Recall} = \frac{|\text{Retrieved Pairs} \cap \text{Relevant Pairs}|}{|\text{Relevant Pairs}|} \tag{4.4} \]

As shown in table 4.14, the precision and recall results show that WordNet correctly labelled 52% of synonym examples and 60% of antonym examples in the LZQZ dataset. It also correctly labelled 51% of synonym examples but only 11% of antonym examples in the TURN dataset. Computing the precision of the antonymy and opposite relations together increases the precision to 72.5% on LZQZ and 37% on TURN. In addition, the average precision of both datasets in recognising antonymy via WordNet is 54.7%.


We also computed the recall values for both datasets and found that over 23% of the LZQZ pairs and 80% of the TURN pairs have no semantic relations in the WordNet. This gives us the insight that, although the English WordNet has a large number of synsets compared to other languages, many words exist in the WordNet while their semantic relations are missing. Therefore, we suggest linking the existing WordNet synsets and lemmas with semantic relations rather than extending its size and adding more taxonomic synsets.

Table 4.14: The average precision scores of WordNet labelling on the LZQZ and TURN sets

Dataset  Relation  Precision
LZQZ     antonymy  60%
TURN     antonymy  11%
average            35%

We applied the construction and labelling algorithms to two English corpora of different sizes: the BNC and the SkELL. The labelling findings in table 4.15 show that the most frequent semantic relations between conjunction pairs are antonymy, synonymy/similar-to and co-hyponymy. They also show that other semantic relations, such as meronymy, attribute, hypernymy and hyponymy, rarely appear in conjunction patterns.


Table 4.15: Examples of labelling output on adjectival pairs extracted from BNC and SkELL

Corpus                              BNC         SkELL
Corpus word size                    96,134,547  1,041,772,774
No. of unique extracted word pairs  2,841       2,669
PoS                                 ADJECTIVE   ADJECTIVE
Antonym                             246         267
Synonym                             65          63
Similar to                          82          75
Meronym                             0           0
Co-hyponym                          192         208
Attribute                           3           3
Hypernym                            0           0
Hyponym                             0           0
Unknown                             2,252       2,053

In addition, table 4.16 shows some examples of labelled relations in the training set, and table 4.17 shows other examples of unknown pairs in the WordNet.

Table 4.16: Training set including positive and negative instances

Pairs               PoS   Labelled Relation
clear/distinct      ADJ   Negative
free/independent    ADJ   Negative
education/training  Noun  Negative
damage/harm         Noun  Negative
federal/unitary     ADJ   Positive (antonymy)
forward/backward    ADJ   Positive (antonymy)
clergy/laity        Noun  Positive (antonymy)
question/answer     Noun  Positive (antonymy)
single/dual         Noun  Positive (opposite)
gold/silver         Noun  Positive (opposite)
red/white           ADJ   Positive (opposite)


Table 4.17: Testing set including instances of unknown relations in the WN

Pairs           PoS   Relation in WordNet
input/output    NN    Unknown
life/death      NN    Unknown
sale/lease      ADJ   Unknown
buying/selling  Noun  Unknown
man/wife        Noun  Unknown

4.6.5 Discussion

Distant supervision learning combines the advantages of both supervised and unsupervised learning methods. It automatically constructs a labelled training set from either a large or a small corpus, with supervision from an external knowledge base. Our algorithm constructs a training set with a balanced amount of positive and negative examples found in the chosen corpus. This construction aims to avoid the underfitting and overfitting problems that occur when a machine learning model is fed insufficient training examples. Further, the constructed training set is free of domain dependence, which allows the model to be applied to different corpora without degrading prediction performance.
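As a concrete illustration of this construction step, the sketch below (our condensed reading, not the thesis code) labels one extracted pair by querying WordNet through NLTK, falling back to the unknown label when no relation is found:

from nltk.corpus import wordnet as wn

def label_pair(x, y):
    """Distant-supervision label for an extracted pair (x/y): a sketch."""
    sx, sy = wn.synsets(x), wn.synsets(y)
    if not sx or not sy:
        return 'unknown'                        # words missing from WordNet
    if set(sx) & set(sy):
        return 'synonym'                        # the words share a synset
    for s in sx:
        for lemma in s.lemmas():
            if y in {a.name() for a in lemma.antonyms()}:
                return 'antonym'                # direct (lexical) antonymy
        if set(s.similar_tos()) & set(sy):
            return 'similar to'
    hyper_x = {h for s in sx for h in s.hypernyms()}
    hyper_y = {h for s in sy for h in s.hypernyms()}
    if hyper_x & hyper_y:
        return 'co-hyponym'                     # share a direct hypernym
    return 'unknown'

print(label_pair('hot', 'cold'))    # antonym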

4.6.5.1 WordNet Limitations

Labelling antonym and non-antonym pairs using the WordNet relations raised several issues, described in the following points:

• Limitations in the Natural Language Toolkit (NLTK) model
We used WordNet programmatically to label the semantic relation of every pair extracted from the corpus. This required the machine-readable version of the WordNet database. Several machine-readable versions of WordNet exist; we use the version available in the Natural Language Toolkit (NLTK). This version belongs to the Open Multilingual WordNet (OMW), which links Arabic and many other languages to PWN 3.0. However, OMW only represents the direct antonymy relation, which takes one word as argument and returns its antonymous word. Although Fellbaum (1995) stated that the indirect antonymy relation is represented in the WordNet, the NLTK only exposes the direct antonymy relation. Unfortunately, the lack of indirect antonymy relations affected the labelling outcomes: only direct antonymy examples were labelled positive, while indirect antonyms were labelled as negative.

• False negative examples
Another issue encountered in the English WordNet is the occurrence of false negative pair instances, which directly affects the correctness of the labels in the training set. In fact, the English WordNet represents only 1,999 adjectival antonym pairs, and many antonym pairs that are frequent in English corpora are missing from it. This is because the English WordNet is an incomplete resource, and many new and contemporary antonym pairs occurring in English text, such as (hardware/software) and (vegan/vegetarian), have not been encoded yet. Mohammad et al. (2008) stated that the English WordNet is missing 90% of the English antonym pairs found in the Graduate Record Examination (GRE) tests. To solve this issue, we labelled word pairs that have no semantic relations in the WordNet with the unknown label. We found some novel antonym pairs that are missing from the WordNet, and we aim to classify them with our antonymy classifier.

• Missing opposite relation
WordNet represents antonymy as a lexical relation which holds between certain words and not between their synsets. Thus, the coverage of direct and indirect antonymy (opposite) in the WordNet is low. For example, among the opposite pairs (international/national), (international/internal) and (international/domestic), only the pair (international/national) is represented as an example of antonymy in the WordNet.


As a result, using the WordNet's opposition examples in Natural Language Processing applications is restricted to direct antonym examples only. Therefore, extending the opposition relations to cover both direct and indirect antonymy is vital for achieving better performance.

• Unclassified co-hyponym examples
Co-hyponymy is the relation between two words that are linked to the same hypernym concept; for example, car and van are co-hyponyms of the node vehicle, and happy and sad are co-hyponyms of emotions. Technically, every synonym and antonym pair is also co-hyponym to some hypernym concept. For example, the relation between the opposite words today, tomorrow and yesterday is one of co-hyponymy, because they share the same parent hypernym day. Likewise, the synonymous words riot and anarchy are co-hyponyms of the hypernym disorder. The challenge is how to distinguish between antonymous co-hyponym pairs and synonymous co-hyponym pairs. To address this, we treat the labelled co-hyponym examples that belong to neither synonymy nor antonymy as new unclassified pairs, and use the antonymy classifier to classify their relation in order to find new and novel antonym pairs.

In short, WordNet was constructed to help humans consult a dictionary in a more intuitive and informative way than conventional paper dictionaries or machine-readable dictionaries. It has also been constructed to serve Natural Language Processing applications that are based on semantic relations, such as contradiction detection and plagiarism detection. However, the current representation of antonymy in the English WordNet does not meet application requirements for representing opposition relations. The only available representation is the lexical antonymy relation, which gives only a narrow opposition relation between certain words. Expanding antonymy to other semantically opposite relations could better serve Natural Language Processing applications. Therefore, we recommend improving the representation of antonymy in the English WordNet by updating the antonymy examples and expanding antonymy to other semantically opposite relations.

4.7 Summary

This chapter discussed two main points. First, it briefly demonstrated the overall processes of the proposed research model for predicting antonymy relations. Second, it showed the process of constructing the model's training set and discussed the challenges of extracting word pair examples from the corpus. We then explained the labelling process and its limitations in capturing opposite relations, and presented the N-Degree opposite labelling, which aims to label opposite words found in a chain of n synsets.

Chapter 5

Antonymy Pattern Extraction and Selection

Since 1992, several studies on Relation Extraction have used a pattern-based approach to capture semantic relations including antonymy, synonymy, hypernymy/hyponymy and analogy (Hearst, 1992; Lin et al., 2003; Schwartz et al., 2015; Snow et al., 2005; Turney, 2008). These studies concluded that patterns provide the most useful features for identifying the target relations. In contrast, the Distributional Hypothesis argues that words also provide useful information about the target relation, because a word is known by the company it keeps. However, several studies have found that using the Distributional Hypothesis to capture antonymy leads to extracting antonymy, synonymy and other semantically related pairs, whereas patterns serve as a good feature for distinguishing antonymy (Mohammad et al., 2008; Schwartz et al., 2015). Thus, the pattern-based approach outperforms the distributional approach in capturing antonymy relations.

In addition, previous studies using the pattern-based approach either relied on a relatively small number of seed pairs to harvest textual patterns hosting the target relation pairs, or relied on a set of handcrafted patterns to generate instances of relations. For example, Hearst (1992) used a small set of manually extracted patterns to automatically harvest instances of noun hypernymy and hyponymy relations. Another application of the pattern-based approach is to distinguish certain relations via pattern co-occurrence, as in the studies by Schwartz et al. (2015), Turney (2008) and Lin et al. (2003), which used a set of textual patterns to distinguish synonymy from antonymy for the purpose of improving word similarity. Moreover, some related studies proposed sets of antonym patterns, such as the studies by Jones (2003) and Davies (2012), which aimed at capturing and analysing contrastive patterns that either host antonymy or trigger opposition relations. In contrast, the study by Lin et al. (2003) stated that the incompatible patterns (from X to Y) and (either X or Y) are enough to distinguish between synonymy and antonymy relations; Lin and his colleagues argued that synonym pairs never co-occur with such patterns. A successful application of Lin's patterns was by Schwartz et al. (2015), who used them to improve the performance of word similarity.

Table 5.1: The evaluation of Lin et al. (2003) patterns on the top 1000 pairs obtained from the SkELL corpus

Pattern        Occurrence Frequency  Antonymy Precision  Antonymy Recall
either X or Y  10,687                72%                 61%
from X to Y    17,779                30%                 35%

However, despite the overwhelming success of Lin's patterns in distinguishing synonymy, two patterns alone are not enough to capture or identify antonymy. According to Jones (2003), a transitional antonymy pattern such as (from X to Y) showed very limited occurrence in his corpus: only 3% of the 3,000 extracted sentences were classified as a transitional pattern. The pattern (from X to Y) involves a movement from one concept or situation to the opposite one, such as (from north to south) or (from life to death). In addition, as shown in table 5.1, we extracted Lin's patterns from the SkELL corpus and found that, among the one-billion-word corpus, there are 17,779 occurrences of the pattern (from X to Y) and 10,687 occurrences of the pattern (either X or Y). The precision of capturing antonymy among the pairs extracted by the pattern (from X to Y) was only 30%, while the pattern (either X or Y) had a precision of 72%. Thus, Lin's patterns are not comprehensive with respect to every type of antonymy and are only capable of capturing high-frequency pairs; they cannot be used to capture low-frequency pairs or non-canonical antonymy pairs found in the corpus.

This chapter proposes an automatic pattern-based method using the Sketch Engine tools to learn patterns which host either a pair of canonical antonyms or opposites from the SkELL and the BNC. We also review previous attempts at selecting and evaluating patterns. Then, we propose the Pattern Classifier, an unsupervised approach to classify every extracted pattern based on its association with positive and negative antonym instances. At the end of this chapter, we discuss the pattern extraction and selection findings.

5.1 Automatic Pattern Extraction Approach

Previous work on pattern extraction adopts manual and automatic approaches to extract and select reliable patterns from the corpus. The manual extraction approach requires human knowledge to find and select patterns in the corpus; although it consumes time and effort, it extracts more effective patterns. The extraction approach can also be automatic, taking labelled seeds or example patterns and extracting similar patterns. For example, the bootstrapping approach uses a small number of seeds to extract patterns over several iterations; although it saves costs, it suffers from accuracy issues. Lobanova et al. (2010), for instance, used a bootstrapping approach to automatically extract antonym patterns. However, the generated patterns suffered from semantic drift problems, which occur when an iteration cycle is given a wrong word pair. Another extraction approach was proposed by Davidov & Rappoport (2006), who used an unannotated corpus to extract symmetric patterns composed of four-word sequences of meta-patterns. They used frequent words in the given corpus to extract all possible meta-patterns and select only the symmetric ones. Their pattern selection is based on an unsupervised graph algorithm which clusters similar extracted patterns into several groups.

In addition, several works on the pattern-based approach used large corpora (e.g. over 100 million words) to obtain sufficient examples of patterns and pairs. For instance, the work by Hasegawa et al. (2004), Agichtein & Gravano (2000) and Snow et al. (2005) used a large corpus to investigate relation patterns. However, this does not entail that small corpora cannot be used to harvest antonymy or other semantic relations. In this chapter we use both a small corpus (the BNC) and a large corpus (SkELL) to examine antonymy extraction and compare the results. In short, previous work on automatic pattern extraction either used manually labelled seeds or high-frequency words to harvest patterns from the corpus. Pattern extraction can be automatic, but finding the most effective patterns for distinguishing the target relation is problematic. In this work, we propose using the Sketch Engine tool to automatically extract both textual and lexico-syntactic patterns. For extraction, we used the Sketch Engine API to automatically run several queries in the Corpus Query Language (CQL). The pattern extraction proceeds through the following steps:

1. For every pair (X/Y) in the training set, run the following query:

CQL1: [word="[A-Za-z].*"] 1:[lemma="X|Y" & tag="JJ"]
      [word="[A-Za-z].*"]{0,5} 2:[lemma="X|Y" & tag="JJ"]
      & 1.lemma != 2.lemma & 1.tag = 2.tag within <s/>

The query in CQL1 ensures that the extracted patterns are symmetric, meaning the two words' positions can be exchanged in the pattern. It captures non-identical lemmas and ensures that only identical PoS tags are extracted. In addition, the maximum distance between the pair of words is zero to five tokens, and the captured pattern lies inside sentence boundaries (<s>).


2. Every pair (X/Y) generates up to 10,000 patterns via the Sketch Engine concordance tool. The total pattern set size therefore equals the pair set size N × 10,000 patterns. For example, if we have 50 antonymy pair examples, the pattern set size will be 500,000 patterns.

3. To reduce the massive pattern set to a manageable size, we remove redundant patterns by converting textual patterns into lexico-syntactic patterns, i.e. we replace the X and Y token positions with wildcards of their PoS tags. For example: (both private and public → both JJ and JJ).

4. A further reduction of the pattern set is made by transforming specific patterns into more general patterns which can retrieve other similar patterns. For instance, the pattern {between old and new} has been generalised to retrieve the following patterns: {in between old and new}, {between the old and the new}, {a difference between old and new}. To do so, we use the following CQL2, which allows up to two tokens at the beginning of the pattern and in between the pattern's tokens.

CQL2: []{0,2} [word="between"] []{0,2} 1:[lemma="X|Y" & tag="JJ"]
      []{0,5} [word="and"] []{0,5} 2:[lemma="X|Y" & tag="JJ"]
      & 1.lemma != 2.lemma within <s/>

5. Repeat the same procedure for every positive and negative pair in the training set (a sketch of one iteration of this querying loop follows the list).
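The sketch below illustrates steps 1 and 2 for one training pair through the Sketch Engine concordance API. The endpoint, corpus name and parameter names are assumptions made for illustration only; consult the Sketch Engine API documentation for the exact interface.

import requests

# Assumed endpoint for the sketch; not taken from the thesis.
API = 'https://api.sketchengine.eu/bonito/run.cgi/concordance'

def query_pair(x, y, corpus='preloaded/bnc2', api_key='YOUR_KEY'):
    """Run CQL1 for one training pair (x/y) and return concordance hits."""
    cql = (f'[word="[A-Za-z].*"] 1:[lemma="{x}|{y}" & tag="JJ"] '
           f'[word="[A-Za-z].*"]{{0,5}} 2:[lemma="{x}|{y}" & tag="JJ"] '
           '& 1.lemma != 2.lemma & 1.tag = 2.tag within <s/>')
    params = {'corpname': corpus, 'q': 'q' + cql, 'pagesize': 10000,
              'format': 'json', 'api_key': api_key}
    return requests.get(API, params=params).json()

hits = query_pair('hot', 'cold')    # up to 10,000 concordance lines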

The above demonstration shows that an automatic procedure can be used to extract antonym patterns from the corpus. The extraction yields over 10,000 unique patterns which co-occurred more than once with the provided example seeds. We also analysed these patterns to determine whether a pattern can be used as a relation feature, as described in section 4.1 in chapter 4. In addition, the pattern extraction approach was able to re-discover the antonym patterns found in Jones (2003) and Davies (2012), and it also captured novel patterns not previously reported, such as however X or Y, maximising X and minimising Y and more. The following section discusses the advantages of pattern selection and its strategies, and proposes the Pattern Classifier model to select the best-performing patterns for predicting antonymy.

5.2 Pattern Selection

The construction of the antonymy vector space model involves representing every pair in the dataset (both positive and negative pairs) as a sequence of co-occurrence values between the pair and each pattern p_i in the extracted pattern set (see equation 5.1). The pattern set constructed in the previous step consists of thousands of patterns surrounding the dataset pairs; we removed the redundant patterns and kept only unique ones. Even so, using all of them as features would lead to massive vector dimensions, and thus we need to reduce the number of dimensions by selecting the patterns most effective at identifying antonymy.

\[ \vec{v}(x, y) = [v_0, v_1, v_2, v_3, v_4, \dots, v_i, \dots, v_N] \tag{5.1} \]

One major advantage of reducing the vector dimensions is avoiding data sparseness, i.e. zero co-occurrence values in the vectors. In addition, comparing low- and high-dimensional vectors in terms of processing time and storage space, Turney & Pantel (2010) stated that low dimensionality saves processing and reduces storage costs. Also, some machine learning algorithms, such as K-Nearest Neighbour, work much better with small vector space dimensions.


Moreover, the co-occurrence values tend to span very different ranges: some patterns have values over 100 while others have only ones and zeros. This biases learning accuracy by making predictions anticipate either high or low values; normalizing these values to the range 0 to 1 leads to better prediction. Therefore, we apply two steps to our vector representation: first, pattern selection, and second, value normalization to keep feature values within a fixed range. The initial vector dimension equals the pattern set size: if we extract 10,000 patterns from a corpus, we obtain vectors of size N = 10,000. Thus, we aim to reduce the feature dimension N by selecting the features/patterns that best represent the relation. Pattern selection is the process of selecting a subset of feature/pattern columns. Its purpose is to remove rare patterns that generate low or zero co-occurrence observations, and to identify the patterns most useful for distinguishing the target relation. The target output of pattern selection is a low-dimensional, dense vector space model which provides useful information about the semantic relation.
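For the normalization step, the usual min-max rescaling maps each feature column into [0, 1]. A minimal sketch with scikit-learn, using illustrative values:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two pair vectors over three pattern features with very different ranges.
vectors = np.array([[120.0, 3.0, 0.0],
                    [ 15.0, 1.0, 7.0]])
scaled = MinMaxScaler().fit_transform(vectors)   # column-wise to [0, 1]
print(scaled)    # [[1. 1. 0.]
                 #  [0. 0. 1.]]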

5.2.1 Pattern Selection Strategies

Intuitively, a good or reliable pattern is one that mostly generates instances of the target relation with high precision. However, measuring this precision is a non-trivial task, as it involves judging the relation of the generated pairs and then computing the fraction of correct to incorrect instances among the pairs extracted by the pattern. Hence, pattern selection favours patterns that are strongly associated with correct relation instances and also have a good occurrence frequency in the corpus. To date, several pattern-based studies have investigated different approaches to selecting patterns. The literature uses different terms for this process, such as computing a pattern's reliability, confidence, or score in general. There are also different proposed models for computing the pattern score, such as using precision and recall scores to nominate the most accurate pattern, or computing a pattern's Pointwise Mutual Information (PMI), and other models, as described in the following sections.

5.2.1.1 Selection based on Patterns Precision and Recall

One common method to evaluate extraction accuracy, especially in IR applications, is to compute precision and recall scores. Precision measures the positive value of retrieval accuracy: the number of relevant instances among the retrieved instances. Recall measures the sensitivity of the model: the number of relevant instances retrieved over the total number of relevant instances. The precision in equation 5.2 and recall in equation 5.3 count the number of true positive (TP) instances (relevant instances among the retrieved instances) and compare it to the number of false positive (FP) instances (non-relevant among the retrieved instances) and false negative (FN) instances (relevant among the non-retrieved instances).

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{5.2} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{5.3} \]

Similarly, in our case, every extracted pattern p_i retrieves word pairs (x_i/y_i) that might be related to the target relation. However, computing precision and recall over patterns involves counting the correct and incorrect instances among the pairs extracted with every pattern p_i. This counting requires labelling the extracted instances as either positive or negative examples of the target relation. One way to label the semantic relation between extracted word pairs is to use WordNet relations and examples. But since the WordNet is an incomplete knowledge base, the chance of missing correct instances is high. This produces many false negative instances and thus affects the evaluation's precision and recall scores. Another, simpler way to evaluate patterns for selection purposes is to use their occurrence frequencies, as the following section explains.


5.2.1.2 Selection based on Pattern Frequency

A common way of selecting patterns is to select the most frequent patterns in the corpus that surround the relation instances. Some previous studies proposed selecting only the top N most frequent patterns after sorting the pattern frequencies in descending order. Other studies proposed setting a static threshold τ and selecting only patterns whose occurrence frequency exceeds τ. An example of frequency-based selection is the work by Lobanova et al. (2010) and Wang et al. (2010), which mainly observed how often and in what context a pattern occurs in the corpus. Their approach computes a score S_i (equation 5.4) by estimating the probability of a pattern p_i co-occurring with relation instances; a pattern is selected only if its score exceeds a static threshold τ, defined as an arbitrary number (Lobanova et al. (2010) set τ = 0.02 and discarded patterns scoring below it).

\[ S_i = \frac{\sum_j f(\text{pattern}_i, \text{pair}_j)}{f(\text{pattern}_i)} > \tau \tag{5.4} \]

Applying this frequency-based selection model to the symmetric pattern (both X and Y), we found that the pattern occurs 65,603 times in the SkELL corpus but co-occurs only 68 times with the adjectival pair (hot/cold). The result is a very small pattern reliability score, due to the large size of the corpus.

\[ S_{\text{both hot and cold}} = \frac{68}{65{,}603} = 0.001 \tag{5.5} \]

Another example of frequency-based selection is by Ravichandran & Hovy (2002), who simply proposed using the raw pattern occurrence frequency and selecting only patterns whose frequency exceeds a threshold (τ = 100), as shown in equation 5.6.

\[ S_i = f(\text{pattern}_i) > \tau \tag{5.6} \]

A different selection approach by Turney (2008) and Schwartz et al. (2015) avoided a static threshold τ: they counted the number of times a pattern p_i co-occurred with the input relation pairs, sorted the results in descending order, and selected only the top k patterns (k = 20).
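The two frequency-based rules reduce to a few lines of code; the sketch below (our illustration) reproduces the (both hot and cold) computation in equation 5.5 and the raw-frequency test in equation 5.6:

def lobanova_score(pair_cooc_freqs, pattern_freq, tau=0.02):
    """Equation 5.4: share of a pattern's occurrences spent on known
    relation pairs; the pattern is selected only if the score exceeds tau."""
    score = sum(pair_cooc_freqs) / pattern_freq
    return score, score > tau

def raw_frequency_select(pattern_freq, tau=100):
    """Equation 5.6: keep a pattern whose raw corpus frequency exceeds tau."""
    return pattern_freq > tau

score, selected = lobanova_score([68], 65_603)    # equation 5.5
print(f'S = {score:.3f}, selected: {selected}')   # S = 0.001, selected: False
print(raw_frequency_select(65_603))               # True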

A question arises about the relation between a pattern's reliability in distinguishing antonymy and its occurrence frequency. To investigate it, we applied the frequency-based method to our constructed dataset (details in chapter 4) and summarize the results in table 5.2. The table shows examples of the most frequent patterns in the BNC alongside their occurrence frequencies and the WordNet evaluation of their extracted adjectival pairs. The column Ant refers to the number of a pattern's pairs that have an antonymy relation in the WordNet; Non-ant refers to the number of a pattern's pairs found in the WordNet but with other semantic relations (synonymy, meronymy, holonymy, co-hyponymy and hypernymy/hyponymy); and the Unknown column gives the number of pairs whose semantic relations are not represented in the WordNet.

Table 5.2: WordNet relation evaluation on the BNC pattern pairs

Pattern              Occurrence Frequency  Ant   Non-ant  Unknown  Ant Precision
both ADJ and ADJ     3644                  1053  127      2235     28.90%
between ADJ and ADJ  2053                  810   99       1022     39.45%
either ADJ or ADJ    674                   204   38       406      30.27%
neither ADJ nor ADJ  353                   78    17       249      22.10%
with ADJ or ADJ      1138                  137   126      848      12.04%
from ADJ to ADJ      588                   196   69       298      33.33%
from ADJ and ADJ     483                   66    49       356      13.66%

Our frequency-based experiment shows that patterns with high frequency tend to generate both antonym and non-antonym pairs (such as synonymy, hypernymy/hyponymy and co-hyponymy). Although antonym pairs tend to show higher frequency than non-antonym pairs, some antonym pairs have low frequency because of the limited corpus size. The experiment also shows that some antonym and non-antonym pairs have low and similar frequency distribution ranges: the antonym pair (great/small) and the synonym pair (incapable/unwilling) both co-occurred 5 times in the BNC with the pattern (either X or Y). Thus, using a static threshold to capture the antonymy relation could lead to capturing a non-antonym pair with high frequency, or to missing an antonym pair with low frequency. Therefore, pattern frequency alone is an insufficient selection criterion, and many frequent patterns in the corpus entail neither high precision nor high recall.

In addition, table 5.2 shows that the pattern (both X and Y) occurred with adjectival pairs over 3,640 times in the BNC corpus, but only around 29% of those pairs were found as antonymy examples in the WordNet (OMW). In contrast, the pattern (either X or Y) occurred at a lower frequency with adjectival pairs in the BNC but has a similar antonymy percentage (30%). Thus, setting a threshold on pattern frequency, such as eliminating patterns with frequency below 1,000, would lead to missing some good patterns and good antonym pairs.

5.2.1.3 Selection based on Pointwise Mutual Information (PMI)

From information theory and statistics, the Pointwise Mutual Information (equation 5.7) of two objects X and Y measures the discrepancy between the probability of their coincidence given their joint distribution and given their individual distributions (Cover & Thomas, 1991).

\[ \text{PMI}(X, Y) = \log \frac{P(X, Y)}{P(X)\,P(Y)} \tag{5.7} \]

Applying PMI to pattern selection involves measuring the association between the individual occurrences of every pattern p_i and every pair (x/y) and their co-occurrence frequency. Equation 5.8 computes the PMI score for the pattern p_i and the pair (x_i, y_i) by calculating the logarithm of the ratio of their occurrences in a text corpus.

\[ \text{pmi}(\text{pattern}_i, \text{pair}_i) = \log \frac{|x, \text{pattern}_i, y|}{|x, *, y| \times |*, \text{pattern}_i, *|} \tag{5.8} \]
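A direct transcription of equation 5.8; the three counts below are placeholders, whereas a live run would take them from corpus queries:

import math

def pattern_pmi(pair_with_pattern, pair_any_pattern, pattern_any_pair):
    """pmi(pattern_i, pair_i) per equation 5.8: the pair-with-pattern
    count over the product of the pair's and the pattern's own counts."""
    return math.log(pair_with_pattern /
                    (pair_any_pattern * pattern_any_pair))

print(pattern_pmi(68, 230, 65_603))    # placeholder counts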


An example of using PMI for pattern selection is the work by Pantel & Pennacchiotti (2006), who proposed Espresso: a minimally supervised bootstrapping algorithm to harvest semantic relations. Pattern selection is a compulsory step before moving to the next iteration of their bootstrap model. The Espresso selection approach is based on the reliability value r_π (equation 5.9), which scores every pair and pattern by computing the average strength of the PMI association (equation 5.8) between the pair (x/y) and the pattern p_i. The computed scores are used to select the final patterns for the next bootstrapping iteration. However, the reliability score is based on the previous iteration and is therefore vulnerable to iteration semantic drift: any extracted pattern that generates an unrelated relation pair will badly affect pattern extraction in the next iteration.

\[ r_\iota(i) = \frac{\sum_{p \in P'} \frac{\text{pmi}(i, p)}{\max_{\text{pmi}}} \cdot r_\pi(p)}{|P'|} \tag{5.9} \]

5.2.1.4 Discussion

We have presented three different approaches to selecting patterns. The first computes a precision value for every pattern to measure its accuracy in predicting antonymy. This computation suffered from several issues due to the limited labelled examples. As shown earlier, WordNet was used to label the extracted pairs as either antonym or non-antonym examples. However, many true positive instances have not yet been represented as antonyms in the WordNet, such as (input/output) and (death/life), so the coverage of antonym examples in the WordNet needs extending. This limitation affected the precision computation and generated very low scores, with precision outputs ranging from 0.9% to 2%. However, when we re-computed the precision scores (equation 5.2) over only the top 100 frequent pairs that a pattern p_i generates, the precision scores rose to values ranging from 50% to 90%.

In addition, computing a pattern's recall was challenging, as it involves counting the number of false negative instances extracted by the pattern. Because the WordNet cannot be used to discover unrepresented antonymy examples, the number of false negatives can only be obtained by manual inspection. We solved this issue by using the set of positive examples in the antonym training set as a gold standard to automatically compute the recall score (see equation 5.3). The purpose of computing recall is to measure the pattern's performance in capturing a known set of positive examples that already exist in the corpus. Overall, table 5.3 summarizes the relation between pattern occurrence frequencies and precision and recall values for a few examples of extracted patterns. The table illustrates an important observation: some patterns have high occurrence frequencies but generate low precision and recall values. For instance, the pattern (with X and Y) occurred 55,946 times in the SkELL corpus but has a precision of only 0.28. Therefore, there is no relation between high pattern occurrence frequency and high precision (see figure 5.1). In addition, table 5.4 compares the pattern precision obtained from the BNC and the SkELL corpora; the findings show that corpus size affects the precision output. The precision of the pattern (both X and Y) shows a big gap between the two corpora (74% in SkELL and 29% in the BNC). The corpus sizes differ considerably: the SkELL corpus contains over one billion words while the BNC contains over 96 million words. In short, we concluded that precision and recall are not a reliable measure for pattern selection.


Table 5.3: The relation between pattern frequency and the precision and recall scores of the antonymy relation, obtained from the SkELL corpus

Pattern          Occurrence frequency  Precision  Recall
both X and Y     65,603                0.74       0.71
with X and Y     55,946                0.28       0.54
between X and Y  42,893                0.60       0.63
from X to Y      25,290                0.57       0.52
from X and Y     17,779                0.30       0.35
either X or Y    10,687                0.72       0.61
be X or Y        7,072                 0.67       0.47
X and not Y      5,220                 0.17       0.17
neither X nor Y  4,517                 0.52       0.36
X but not Y      3,383                 0.10       0.08
more X than Y    2,998                 0.01       NS
whether X or Y   2,866                 0.89       0.41

Figure 5.1: Pattern frequency over adjectival pairs in SkELL and BNC corpus


Table 5.4: A comparison between the precision values of the extracted patterns from the BNC and the SkELL corpus

Pattern          SkELL  BNC
both X and Y     0.74   0.29
between X and Y  0.60   0.39
either X or Y    0.72   0.30
neither X nor Y  0.52   0.22
with X or Y      0.28   0.12
from X to Y      0.57   0.33
from X and Y     0.30   0.14

In conclusion, the previously proposed selection methods made different attempts to select reliable patterns by measuring pattern performance in predicting positive instances. Although these approaches evaluated well at distinguishing their target relations, applying them to antonymy prediction revealed many limitations. This is because antonym patterns generate both antonymy and other semantic relations, including co-hyponymy, synonymy and many other unknown relations not covered by the WordNet (OMW). However, the association between an extracted pattern and both positive and negative instances has not yet been examined as an alternative selection strategy for antonym patterns. Using both positive and negative instances to measure patterns can show a pattern's true reliability and effectiveness in predicting positive pairs under the influence of negative pairs. Ultimately, we choose the patterns that generate the highest accuracy in predicting the correct instance relation. We propose a new selection approach, inspired by Turney (2002), which measures a pattern's confidence score in terms of predicting negative and positive antonymy relations and is mainly presented to solve the false negative problem, as described in the following section.


5.3 Antonymy Pattern Classifier

Another method to select reliable patterns is to use an unsupervised classifier model to classify each pattern as either positive or negative. Snow et al. (2005) proposed using a binary classifier on every extracted dependency pattern to test whether the pattern produces useful information on unseen hypernym/hyponym noun pairs. The pattern classifier in equation 5.10 classifies an input pattern as positive if at least one positive instance co-occurs with the pattern (f(p_i) = 1), or as negative if it is unable to recognize any instance (f(p_i) = 0).

\[ f(p_i) = \begin{cases} 1 & \text{if } p_i > 0 \\ 0 & \text{otherwise} \end{cases} \tag{5.10} \]

Moreover, we found that the problem of pattern evaluation and selection is similar to the problem of computing the semantic orientation of a review towards positive or negative polarity. Turney (2002) proposed using an unsupervised classifier to determine a review's orientation by measuring its association with a set of reference words. The reference words can come from a positive set such as (good, brilliant, fantastic, etc.) or a negative set such as (bad, ugly, poor, etc.). The input to the classifier is the review vector and the output is either a recommended (thumbs up) or not recommended (thumbs down) review. The classification function uses the PMI-IR measure in equation 5.11, which calculates the semantic orientation of a phrase as the mutual information between the given phrase and positive words such as excellent, minus the mutual information between the given phrase and negative words such as poor. The classification output is recommended if the average semantic orientation of the review's phrases is greater than zero, and not recommended if it is less than zero. The reported classification accuracy ranges from 84% on automobile reviews to 66% on movie reviews.

\[ \text{PMI-IR}(\text{phrase}) = \log_2 \frac{\text{hits}(\text{phrase NEAR } good) \cdot \text{hits}(bad)}{\text{hits}(\text{phrase NEAR } bad) \cdot \text{hits}(good)} \tag{5.11} \]
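Turney's orientation test is easy to reproduce once the four hit counts are available; the counts below are invented for illustration:

import math

def pmi_ir(hits_near_good, hits_near_bad, hits_good, hits_bad):
    """Semantic orientation of a phrase per equation 5.11."""
    return math.log2((hits_near_good * hits_bad) /
                     (hits_near_bad * hits_good))

score = pmi_ir(120, 30, 5000, 4000)
print(score > 0)    # True -> recommended ('thumbs up')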


Similarly, computing the semantic orientation of every pattern towards positive or negative pairs allows us to identify patterns that are more likely to associate with negative pairs than positive ones, and vice versa. In this section, we propose an unsupervised pattern classifier that classifies every extracted pattern p_i as a positive, neutral or negative pattern. The classifier uses equation 5.12 to compute the average occurrence association between positive instances such as (bad/good), (new/old) and (big/small), and negative instances such as (clear/distinct), (great/good) and (right/proper), obtained from the training set. To do so, we represent every pattern p_i in the pattern set as a vector of positive and negative co-occurrence binary hits between the pair set's instances and pattern p_i.

\[ f(p_i) = \log_2 \frac{\sum f(p_i, \text{positive})}{\sum f(p_i, \text{negative})} \tag{5.12} \]

P f(eitherXorY, positive) 50 f(eitherXorY ) = log = log = 0.698 2 P f(eitherXorY, negative) 2 10 (5.13)

142 5.3 Antonymy Pattern Classifier

Figure 5.2: The representation of pair vector and pattern vector

The input to the pattern classifier is a pair vector; a sequence of co- occurrence hits between a certain pair (xn, yn) and every pattern pi in the pattern set Pi. To obtain the pattern feature vector V (pi), we transpose ~ every pair vector V (xn, yn) by converting a single pattern column to a row as represented in equation 5.14. Every row in figure 5.2 represents the pat- tern co-occurrence hits with all pair instances including the positive and the negative relations.

~ T ~ V (x, y) = V (pi) (5.14)

5.3.1 Classifier Output

The output of equation 5.12 represents the pattern’s orientation on the y axis as shown in figures 5.3 and 5.4. The output of the pattern’s log orientation can be classified into three classes as shown in equation 5.15 to identify

143 5.3 Antonymy Pattern Classifier antonymy relation as either: positive, negative and neutral patterns as in the following:

Figure 5.3: Pattern log scores classification

positive


• Positive Antonymy Patterns A positive pattern is one shows greater association with positive pairs than negative pairs. For example, the pattern (between X and Y) co- occurred with 24 positive pairs and 5 negative pairs such as (between good and bad) and (between warm and hot). The calculation of the pattern score in equation 5.16 shows a positive value 0.681 which is greater than zero and indicates that the pattern might provide useful information on antonymy.

144 5.3 Antonymy Pattern Classifier

24 f(BetweenXandY ) = log = 0.681 (5.16) 2 5 • Neutral Antonymy Patterns A neutral pattern is a pattern which shows equal association with posi- tive and negative instances. The class neutral indicates that the pattern can not distinguish between antonymy and non-antonymy pairs and therefore would not provide a useful evidence to capture antonymy re- lations. For example, the pattern (not as X and Y) co-occurred with two positive pairs and two negative pair such as (not as black and white) and (not as big and heavy) and the calculation of the pattern orientation score shows the value of 0 as shown in equation 5.17

2 f(notasXandY ) = log = 0 (5.17) 2 2 • Negative Antonymy Patterns A negative pattern is a pattern which shows greater association with negative pairs than positive pairs. The computation of the pattern scores in equation 5.18 gives always a negative values < 0. For instance, the pattern (a very X and Y) has been associated with four negative pairs and one positive pair such as (a very black and white) and (a very simple and easy), (a very young and inexperienced). The calculation of the pattern score gives a negative output.

1 f(averyXandY ) = log = −0.2 (5.18) 2 4 A known limitation in a logarithm function is the inability to compute negative values. However, since the co-occurrence hits are either zero or greater and never present in negative values, then we found the logarithm function is ideal to solve our problem and fits the classification model.

145 5.3 Antonymy Pattern Classifier

Table 5.5: Examples of positive pattern log scores

Pattern                      Positive Hits  Negative Hits  Pos/Neg  Log2(Pos/Neg)
were X or Y                  25             0              468      8.87
X as well as Y               15             0              160      7.32
X rather than Y              13             0              154      7.27
for both X and Y             12             0              117      6.87
that both X and Y            28             1              116      6.86
both the X and the Y         43             2              112.44   6.81
from both X and Y            11             0              108      6.75
whether X or Y               62             3              106.31   6.73
combination of X and Y       11             0              96       6.58
differences between X and Y  11             0              96       6.58
more X than Y                48             2              92.56    6.53
between the X and Y          53             3              81       6.34
either X or Y                40             3              68.33    6.09
range of X and Y             8              0              63       5.98
through X and Y              24             1              62.5     5.97
mixture of X and Y           9              0              60       5.91
between X and Y              211            12             57.82    5.85
from the X to the Y          27             2              49.78    5.64
difference between X and Y   17             1              49.5     5.63
both X and Y                 183            13             47.79    5.58
distinction between X and Y  13             1              42       5.39
from X to Y                  113            9              38       5.25
from X or Y                  35             5              38       5.25
within X and Y               13             1              35       5.13
through the X and Y          5              0              30       4.91
contrast X and Y             13             1              28       4.81
either the X or Y            9              1              22.5     4.49
distinctive X and Y          10             0              22       4.46
worth more X than Y          6              0              21       4.39
the X division Y             4              0              20       4.32
struggle between X and Y     4              0              20       4.32
conflict between X and Y     4              0              20       4.32


Table 5.6: Examples of negative pattern log scores

Pattern                Positive Hits  Negative Hits  Pos/Neg  Log2(Pos/Neg)
to be X and Y          2              12             0.17     -2.58
a very X and Y         1              4              0.25     -2
was very X and Y       1              4              0.25     -2
was not X and Y        1              3              0.33     -1.58
it was X and Y         5              11             0.45     -1.14
the X value of Y       1              2              0.5      -1
of X parts of the Y    1              2              0.5      -1
among the X and Y      1              2              0.5      -1
comes X with Y         1              2              0.5      -1
centres of X and Y     1              2              0.5      -1
sales of X and Y       1              2              0.5      -1
collection of X and Y  1              2              0.5      -1
be X to Y              6              11             0.55     -0.87
stark X and Y          2              3              0.67     -0.58
the X as or Y          2              3              0.67     -0.58
room was X and Y       2              3              0.67     -0.58
only X or Y            2              3              0.67     -0.58
home X and Y           2              3              0.67     -0.58
will be X to Y         2              3              0.67     -0.58
that are X and Y       5              7              0.71     -0.49
must be X and Y        3              4              0.75     -0.42
was X and Y            21             27             0.78     -0.36
looked X and Y         4              5              0.8      -0.32
been X and Y           7              8              0.88     -0.19
that X and Y           23             24             0.96     -0.06
and X and Y            41             42             0.98     -0.03


Table 5.7: Examples of neutral pattern log scores

Pattern               Positive Hits  Negative Hits  Pos/Neg  Log2(Pos/Neg)
a X and a Y           24             24             1        0
it X and Y            22             22             1        0
small X and Y         4              4              1        0
be in X and Y         4              4              1        0
an X and a Y          4              4              1        0
too X and Y           4              4              1        0
with X or Y           3              3              1        0
old X and Y           3              3              1        0
large X and Y         3              3              1        0
where X and Y         3              3              1        0
is an X or Y          3              3              1        0
place X and Y         3              3              1        0
down in X and Y       2              2              1        0
little X and Y        2              2              1        0
of their X and Y      2              2              1        0
wider X and Y         2              2              1        0
the X question in Y   2              2              1        0
not as X and Y        2              2              1        0
a X deal of Y         2              2              1        0
would be X and Y      2              2              1        0
bedrooms are X and Y  2              2              1        0
cost X and Y          2              2              1        0
is no X or Y          2              2              1        0
should be X to Y      2              2              1        0


Figure 5.4: The logarithm values of 750 extracted patterns from the BNC corpus

5.4 Discussion

The extraction and classification above present a useful strategy for selecting reliable patterns. We have shown that computing the logarithm of the ratio of positive to negative co-occurrence hits provides an automatic classification of a pattern's performance in capturing antonymy. Interestingly, many patterns discovered by Jones (2003) and Davies (2012) are among the positively classified antonym patterns.

In addition, the automatic pattern extraction and classification is capable of identifying novel patterns with high association with antonymy instances which had not been discovered in previous studies. For example, the sentence the line between what is private and what is public online is getting increasingly blurry shows the new pattern (what is X and what is Y), which retrieves over 250 examples of antonymy such as (what is slow and what is fast), (what is audible and what is visible) and (what is real and what is fake). Table 5.8 shows some examples of the newly discovered antonym patterns, which are worth further analysis and investigation.

Table 5.8: Examples of new extracted antonym patterns

Pattern                     Example
however X or Y              however large or small; however much or little; however good or bad
what is X and what is Y     what is right and what is wrong; what is public and what is private; what is new and what is old
as sure as X follows Y      as sure as day follows night
to turn X into Y            to turn ordinary to extraordinary; to turn negatives to positives; to turn darkness to light
a division between X and Y  a division between rich and poor; a division between middle and working-class; a division between inside and outside


Surprisingly, we found that adjectives, nouns and verbs generate different antonym patterns. Previous studies on antonym patterns used only adjectives to extract them. However, extracting antonym patterns based on PoS provides a more accurate indication of relation occurrence. Examples of adjective, verb and noun patterns are listed in table 5.9. Thus, specifying the PoS will increase the extraction performance.

Table 5.9: Examples of different part-of-speech patterns

PoS        Pattern                  Example
Adjective  between X and Y          between black and white
           both X and Y             both male and female
Noun       what is X and what is Y  what is fact and what is myth
           a matter of X and Y      a matter of life or death
Verb       should X and Y           should buy and sell
           who X and Y              who lived and died

In addition, the previous experiments showed that corpus size has a major impact on pattern extraction and selection. Our experiments showed that a pattern's precision computed in the BNC differs from that computed in the SkELL corpus because of the difference in corpus size. Moreover, we found that the BNC and SkELL corpora generate different patterns with varying co-occurrence frequencies. Therefore, generalizing one set of extracted patterns for use on any corpus could lead to poor relation identification. The advantage of our approach over previous work is that we did not limit the range of possible patterns to predefined antonym patterns from earlier studies (Jones, 2003); our approach automatically finds the most reliable antonym patterns in any given corpus.

5.4.1 Discovering Novel Noun Attribute Patterns

The componential analysis theory of Leech (1974) states that an antonym pair shares a certain semantic scale, each member of the pair representing a boundary or shade of that scale. WordNet encodes the semantic scale as the noun attribute relation, which links a noun synset to adjectival antonymous lemma pairs. Interestingly, we found some novel patterns for discovering antonymy attributes. For example, we extracted patterns showing that both words (short/long) frequently co-occur with the noun LENGTH in the same context. The relation between short, long and LENGTH is represented in the WordNet as follows: the relation between (short/long) is antonymy, the word short has attribute LENGTH, and the word long has attribute LENGTH.

Likewise, the pair (male/female) co-occurred several times with the word GENDER, which suggests a novel antonymy-attribute relation representation model. Returning to the WordNet description of the attribute relation, the relation between adjectival antonym pairs and their noun attribute is very similar to the relation between a noun hypernym and its hyponym. Thus, extracting noun attributes could identify examples of antonymy instances as well as representing them in the WordNet's semantic graph.

In table 5.10, we show examples of antonymy and attribute patterns obtained from the SkELL corpus. We used these patterns to generate noun attributes for other extracted antonym pairs automatically: we fed our pattern extraction model with antonym seed examples followed by an empty noun wildcard, so that the approach suggests noun candidates for an antonym pair. For example, table 5.11 shows that the adjectival pair (high/low) co-occurred with varying frequency with nouns such as PRESSURE, LEVEL, TIDE, WATER, RISK, GRADE and FREQUENCY. As the WordNet represents these nouns in hypernym paths, we analysed the paths and found common ones: the nouns LEVEL and GRADE both share the hypernym concepts POSITION and STATUS. Another example, in table 5.12, is the antonym pair (female/male), which co-occurred with the noun candidates SEX, GENDER, SEXUALITY, LITERACY, VOICE, CHARACTER and PAIR; only SEX, GENDER and SEXUALITY share the same hypernym, BODILY PROPERTY. The last table, 5.13, shows the pair (early/late), which co-occurred with many noun candidates that are hyponyms of the concept TIME PERIOD.


In conclusion, it is possible to use a pattern-based approach to extract candidate noun attribute relations for certain antonym word pairs. Nevertheless, the pattern-based extraction cannot nominate the best candidates, as the candidates' co-occurrence frequencies have similar distribution ranges. The pattern-based approach does successfully extract some candidates that are already found in the WordNet, such as the attribute SIZE for the antonym pair (big/small), COLOUR for the pair (black/white) and GENDER for the pair (male/female). In contrast, some antonym pairs, such as (right/wrong), show unrelated co-occurring nouns. Thus, the pattern-based approach is expected to work less effectively with some antonym pairs and therefore to generate incorrect attribute candidates.

Table 5.10: Examples of the extracted attribute-antonym patterns from the SkELL corpus

Attribute Pattern     Example
a XZ and a YZ         a small size and a large size
a XZ or YZ            a small size or large size
both XZ and YZ        both high level and low level
any Z, X or Y         any color, black or white
of X and|or YZ        of domestic or foreign trade
of XZ and|or YZ       of good quality or bad quality
in Z from X to Y      in age from young to old
in X or YZ            in female and male sex hormones
from XZ to YZ         from low grade to high grade
the X or|and YZ of    the horizontal or vertical orientation of
the ZX or Y           the value true or false
either X or Y Z       either short or long length
X and YZ              a horizontal and vertical orientation
between XZ and YZ     between good quality and bad quality


Table 5.11: Attribute candidates for the pair (high/low)

Extracted Attrib  Co-occ Freq  Inherited Hypernym Concept
pressure          14           physical phenomenon → natural
level             10           rank → status, position
tide              8            change → happening, occurrence
water             7            fluid → matter
risk              6            danger → causal agent, cause
grade             3            rank → status, position
frequency         3            rate → magnitude, quantitative relation

Table 5.12: Attribute candidates for the pair (male/female)

Extracted Attrib  Co-occ Freq  Inherited Hypernym Concept
sex               47           bodily property → property
gender            21           bodily property → property
sexuality         20           bodily property → property
literacy          20           skill → ability
voice             9            sound → sound property
character         6            portrayal → acting
pair              5            two → digit → figure

Table 5.13: Attribute candidates for the pair (early/late)

Extracted Attrib  Co-occ Freq  Inherited Hypernym Concept
season            22           time period → fundamental quantity
evening           15           day, daytime → time period
summer            10           season, time of year → time period
spring            9            season, time of year → time period
onset             7            beginning → happening, occurrence
autumn            7            season, time of year → time period
mortality         6            impermanence → duration, length
afternoon         2            day, daytime → time period
fall              2            season, time of year → time period


5.5 Summary

This chapter presented our approach to extracting and selecting patterns to distinguish antonymy relations between word pairs. We also investigated and compared the current pattern selection approaches, such as pattern frequency, pattern PMI, and precision and recall scores. The chapter illustrated the Pattern Classifier, which we proposed to solve pattern selection issues and provide a robust automatic extraction method that gives better antonymy prediction over massive pattern sets.

Chapter 6

Antonymy Relation Prediction

This chapter presents the formulation of antonym relation prediction as a binary classification task. A word pair extracted from the selected corpus is assigned to one of two classes: the positive class, which indicates that an antonymy/opposition relation exists between the words, or the negative class, which indicates no antonymy or opposition relation between the words. We feed the constructed word pair vectors into several supervised machine learning models and compare their classification accuracy in order to find the most accurate algorithm. We consider several linear and non-linear supervised classifiers, including Naive Bayes, K-Nearest Neighbour, Neural Network and Decision Trees. In addition, this chapter discusses the impact of the selected feature set size and the selected normalization on the model dataset.

This chapter also presents the evaluation of the antonymy classifier in two parts. The first part shows the classifier performance over 3849 word pairs extracted from the BNC and SkELL corpora. It also shows examples of new antonymy pairs that are not yet found in WordNet. The second part shows the evaluation on related antonymy test sets which were mainly designed for distinguishing between English antonymy and synonymy, including the TURN set (Turney, 2008) and the LZQZ set (Lin et al., 2003). We compare our classification accuracy results to the previous attempts (Lin et al., 2003; Turney, 2008) as well as the WordNet baseline.


6.1 Function Approximation

The feature space generated in chapter 5 is a high-dimensional space whose features are not correlated. Correlation between relation features xi and xj would imply that a linear function f(xi, xj) could be formulated to predict the class value y. In our feature space, the features are independent because the co-occurrence value of a feature is the individual observation of a pattern pi with a word pair extracted from the corpus.

Machine learning algorithms can be either linear or non-linear. Linear regression or linear classification functions try to find the best function y = f(x) that generates the best hyperplane separation between positive and negative instances in the training set. A linear function is also parametric, which means that there is a set of parameters or weights w0, w1, ..., wn that needs to be set up (equation 6.1). In contrast, non-linear algorithms deal with features in high-dimensional space where the features are not correlated, and they are typically non-parametric.

$y = w_0 \cdot x_0 + w_1 \cdot x_1 + \dots + w_n \cdot x_n$  (6.1)

6.2 Machine Learning Model Selection and Parameter

In this section, we examined different classifiers for predicting antonymy via Scikit-learn machine learning in Python (Pedregosa et al., 2011). We split the extracted and labelled dataset from chapter 4 into training and test sets. In addition, we applied 10-fold cross validation on the training set, which holds out one fold for evaluation and fits the model on the remaining nine folds in an iterative way. The aim of this part is to select the best classifier based on its accuracy and its robustness on the test sets.
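The comparison loop can be sketched in a few lines of scikit-learn code. This is an illustrative sketch only: X is assumed to hold the constructed pair vectors (rows are word pairs, columns are pattern co-occurrence values) and y the binary antonymy labels from chapter 4.

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# X: pair vectors, y: 1 = antonymy, 0 = non-antonymy (assumed to exist already)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    'NB': GaussianNB(),
    'KNN': KNeighborsClassifier(),
    'DT': DecisionTreeClassifier(),
    'NN': MLPClassifier(max_iter=1000),
}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=10)  # 10-fold cross validation
    print(f'{name}: mean accuracy {scores.mean():.2f}, std {scores.std():.2f}')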


6.2.1 Naive Bayes

The Naive Bayes (NB) model applies Bayes' theorem under the naive assumption that the features are independent. It predicts the relation class based on the prior probability and the likelihood of every feature in the feature vector. The pattern selection in section 5.3 was based on the outcomes of the unsupervised classification of every pattern pi, so the selection of patterns varies with each dataset. For example, from the BNC we were able to extract over 13,000 patterns that can distinguish between antonymy and non-antonymy pairs with variable performance; some patterns identify non-antonymy examples better than antonymy examples. For instance, the pattern (X, but not Y) showed a higher co-occurrence frequency with non-antonymy pairs, as in (I'm good, but not great). We used the Gaussian model to represent the normal distribution of each pattern and its conditional probability to predict either the positive or the negative class. The process involves computing the mean µ and standard deviation

σ for every pattern pi, for its positive (see equation 6.2) and negative (see equation 6.3) likelihoods. The following formulas compute the likelihood of a pattern capturing a positive (antonymy) or a negative (non-antonymy) relation.

$P(positive \mid p_i) = \frac{P(positive)\,P(p_i \mid positive)}{P(p_i)}$  (6.2)

$P(negative \mid p_i) = \frac{P(negative)\,P(p_i \mid negative)}{P(p_i)}$  (6.3)
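In practice, equations 6.2 and 6.3 amount to scoring an observed pattern frequency against a per-class Gaussian density; the denominator P(p_i) cancels when the two posteriors are compared. A minimal sketch of this computation (the frequency arrays are hypothetical illustrations, not data from our experiments):

import numpy as np
from scipy.stats import norm

# Normalized co-occurrence frequencies of one pattern p_i, split by class
pos_freqs = np.array([0.4, 0.9, 0.6, 0.8])   # antonymy pairs (hypothetical)
neg_freqs = np.array([0.0, 0.1, 0.0, 0.05])  # non-antonymy pairs (hypothetical)

def likelihood(x, freqs):
    """Gaussian likelihood P(p_i | class), from the class mean and std. dev."""
    return norm.pdf(x, loc=freqs.mean(), scale=freqs.std() + 1e-9)

x = 0.7  # observed frequency for an unseen pair
# Equal priors (0.5/0.5), as in our balanced training set; P(p_i) cancels out
post_pos = 0.5 * likelihood(x, pos_freqs)
post_neg = 0.5 * likelihood(x, neg_freqs)
print('positive' if post_pos > post_neg else 'negative')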

6.2.2 K-Nearest Neighbour

The classification in K-Nearest Neighbour (KNN) is based on a non-linear function. One major advantage of applying KNN is that it makes no assumptions about the data distribution, and it is therefore considered a non-parametric algorithm. However, KNN needs the value of the parameter K to be determined. The value of K can generate different classification outcomes, such that the performance at K = 3 differs from the

performance at K = 11. Determining the K value is one of the disadvantages of the KNN algorithm, because the value of K determines the classification results in terms of the closest instances. For instance, if K = 1, the input vector is assigned to the class of the single closest vector, while if K = 10, the input vector is assigned to the majority class of the closest ten vectors.
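A common way around this K-selection problem is simply to cross-validate over a small grid of K values, reusing the X_train/y_train split from the sketch above:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

for k in (1, 3, 5, 7, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    acc = cross_val_score(knn, X_train, y_train, cv=10).mean()
    print(f'K={k}: mean accuracy {acc:.2f}')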

6.2.3 Decision Tree

Decision Tree (DT) is one of the most popular classification algorithms; it represents every feature as a tree node. Tree pruning is an important process for achieving the best decision tree, one that represents the target problem with minimal feature nodes. The process requires computing the Information Gain for every node (feature) and computing the importance of every feature to the whole dataset. The output of the pruning process is a dense, small decision tree that has fewer features and better performance in relation classification. We applied the Classification and Regression Tree algorithm, a binary Decision Tree that splits every variable into two values based on a static threshold.

6.2.4 Neural Network

Recently, neural network (NN) algorithms have shown promising classification output in relation extraction tasks. A NN simulates the brain's nervous system in processing information and solving problems. We used a multilayer perceptron, which is composed of a large number of nodes connected via hidden layers, where each input node represents a feature or dimension in the vector representation. A NN has many advantages, such as maintaining good classification performance over complex non-linear relationships between dependent and independent variables.

6.2.5 Vector Normalization Scale Selection

Implementing a classification algorithm requires the vector input to be normalized. Normalization aims to avoid distribution bias towards one class and

to control frequency outliers. For example, some patterns in our work generate very high frequency with some pairs but very low frequency with other pairs from the same class. This is a result of corpus size limitation, or of inadequate examples of the extracted word pair in the corpus.

Three normalizations are of particular interest. They generate different vector scales, but all measure the vector magnitude in order to rescale vector values to the range [0,1]. Normalization also makes the classification more feasible given the vector input, as it ensures that the classification function does not have massive variance.

The output of a normalization is that every value xi in pair vector V (x, y) is recalculated as in equation 6.4:

$x_i = \frac{x_i}{norm(x)}$  (6.4)

The following points explain the differences between the normalizations:

• L1 Normalization:

The L1 norm in equation 6.5 is the sum of the absolute magnitudes |x_i| of all vector values. One advantage of the L1 norm is that it maintains robustness by not over-reacting to outliers or extreme values in the dataset.

$L1: Z = \lVert x \rVert_1 = \sum_{i=1}^{n} |x_i|$  (6.5)

• L2 Normalization:
The L2 norm uses the Euclidean distance to measure the vector magnitude. The equation in 6.6 is the square root of the sum of the squared vector values x_i. The advantage of L2 normalization is that it maintains stability in the vector space, which means changing the number of selected features would not have a great influence on the prediction performance.

$L2: Z = \lVert x \rVert_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$  (6.6)


• MinMax Normalization: The MinMax norm in equation 6.7 uses the maximum and minimum

vector values to rescale vector values xi.

$MinMax: x_i = \frac{x_i - x_{min}}{x_{max} - x_{min}}$  (6.7)
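All three scalings are available in scikit-learn. The following sketch (V is a hypothetical matrix of raw pair vectors) also shows the practical difference: normalize works row-wise on each pair vector, while MinMaxScaler rescales each feature column to [0, 1].

import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

V = np.array([[3.0, 0.0, 12.0],
              [1.0, 5.0,  2.0]])      # hypothetical raw co-occurrence vectors

V_l1 = normalize(V, norm='l1')          # each row divided by the sum of |x_i|
V_l2 = normalize(V, norm='l2')          # each row divided by its Euclidean length
V_mm = MinMaxScaler().fit_transform(V)  # each column rescaled to [0, 1]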

6.2.5.1 Normalization Selection

We evaluated the previous norms in terms of the average accuracy of four classification algorithms: KNN, DT, NB and NN. Table 6.1 shows that changing the feature vector scale affects the classification performance and accuracy: the MinMax norm outperforms the other norms in average classification accuracy over the four models. The following section examines the prediction accuracy of the four classification algorithms based on MinMax normalization.

Table 6.1: A comparison between norms and their average accuracy based on four classification algorithms (KNN: K-Nearest Neighbour, DT: Decision Tree, NB: Naive Bayes, NN: Neural Network)

Norm Scale   KNN    DT     NB     NN     Average
MinMax       0.66   0.73   0.80   0.76   0.74
L1           0.68   0.70   0.83   0.55   0.69
L2           0.68   0.72   0.83   0.68   0.73

6.2.6 Feature Vector Dimension Reduction and Eval- uation

In chapter 5, we reviewed different strategies for pattern selection. We then proposed using an unsupervised classifier model to classify every extracted pattern into one of three classes based on its association with positive and negative instances in the training set. As explained in section 5.3.1, the classification output for every extracted pattern is either positive, negative or neutral. However, the classifier generates a massive amount

of positive patterns, and therefore the vector dimension size needs further reduction. This section aims to reduce the vector dimensions by selecting a subset of patterns that are effective in predicting antonymy. We experimented with three feature subset selections (a selection sketch in code follows the list below): the first selects the top K patterns classified as positive, the second selects all positive patterns, and the third selects both positive and negative patterns. In fact, some negative patterns provide useful information for predicting antonymy, such as the pattern (X but not Y), which co-occurred with synonymy more than antonymy pairs and can therefore be used as evidence of a non-antonymy relation.

• Model 1: Selecting the top K positive patterns: The classified positive patterns are arranged in descending order of their logarithm scores. The selection in this model is fixed at the top K positive patterns (K = 200). Negative and neutral patterns are not selected.

• Model 2: Selecting all positive patterns: All positive patterns are selected and every generated pair vector is represented by positive patterns only. Negative and neutral patterns are not selected.

• Model 3: Selecting all positive and negative patterns: In this case we select every positive and negative extracted pattern. Neutral patterns are not selected in this vector representation because they do not provide useful information.
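The model 1 cut reduces to a one-line selection over the classified patterns (a sketch; pattern_scores is a hypothetical dictionary mapping each positively classified pattern to its logarithm score from the Pattern Classifier):

K = 200
top_patterns = sorted(pattern_scores, key=pattern_scores.get, reverse=True)[:K]
# The pair vectors are then restricted to these K pattern columns only.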

Table 6.2: A comparison between the mean accuracy scores of three feature selection models in NB and NN classifiers

       Model 1   Model 2   Model 3
NB     0.73      0.73      0.57
NN     0.72      0.69      0.61


Table 6.2 shows the model accuracy for two classifiers: NB and NN. The comparison shows that model 1, which selects the top K positive patterns (K = 200), achieved the highest accuracy (see figure 6.1). It also shows that the classification accuracy when representing positive patterns only is over 70%, while it dropped to 61% or below when negative patterns were included in the feature vectors. Therefore, we choose the top 200 positive patterns to represent the feature vectors.

Figure 6.1: A comparison between the models’ performance on feature size selection

We apply the four classification models presented above to two constructed datasets: the first extracted from the BNC, and the second extracted from the SkELL corpus.


6.3 Classification Results on BNC and SkELL

This section presents the evaluation of two supervised classifiers: NB and NN. We used two datasets of word pairs belonging to identical syntactic classes: (adjective/adjective), (noun/noun) and (verb/verb). The first dataset was extracted from the SkELL corpus and has 3011 word pairs. The second dataset was extracted from the BNC and has 838 word pairs. These word pairs are unique and non-redundant; for example, the extraction keeps only one of the pairs (input/output) and (output/input).

In addition, we use two metrics to evaluate the prediction performance over the test sets: the mean accuracy and the standard deviation. Mean accuracy is the average of the accuracies generated by cross validation with ten-fold partitioning, where each partition is composed of 10% of the training set. Model validation generates an accuracy value for every partition, and thus the result is ten different prediction accuracy values. Mean accuracy is the average of these ten values, while the standard deviation expresses the amount of variation between the ten fold accuracies. A good standard deviation shows low variation between accuracy values, which indicates that they lie in a similar range. The model with the highest mean accuracy and lowest standard deviation is selected. Note that there is no previous work on extracting English antonymy pairs from the BNC and the SkELL corpus to compare with.

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$  (6.8)

The numerator of equation 6.8 is composed of the correctly classified positive (TP) and negative (TN) instances. The denominator represents the whole classified dataset, composed of both correctly classified positive (TP) and negative (TN) instances, and incorrectly classified positive (FP) and negative (FN) instances.

The NB classifier applied to the training set achieved a classification accuracy of 73% on average. In theory, the Gaussian NB

algorithm assumes that the frequency distribution over patterns (features) follows the normal (Gaussian) distribution. The normal distribution of the training data is vital for computing the mean and standard deviation values for every feature in the feature set. The pair vectors showed that many features generate more low than high pair co-occurrence frequencies. Although the vector values are normalized, the data distribution of some features is skewed towards zero. However, even though some features are not normally distributed, NB is less sensitive to this because the training set has balanced examples of the positive and negative classes. Table 6.3 shows examples of mean and standard deviation values of patterns, computed over the normalized co-occurrence frequencies of positive and negative instances. In general, a pattern's mean is higher for positive instances than for negative instances, and therefore NB is capable of distinguishing positive and negative word pairs.

Table 6.3: The mean and standard deviation of some patterns extracted from the SkELL corpus

Pattern                  Feature     Negative (0.5)   Positive (0.5)
(of x versus x)          mean        0                0.625
                         std. dev.   0.1667           0.9404
(in a x but x)           mean        0                0.225
                         std. dev.   0.1667           0.5238
(significantly x at x)   mean        0                0.2667
                         std. dev.   0.2222           0.8537
(is x compared to x)     mean        0.14             0.56
                         std. dev.   0.6859           1.4277
(a x rather than a x)    mean        0.0964           1.1893
                         std. dev.   0.3386           2.0609

In addition, we applied a DT algorithm to the antonymy classification problem. The DT has many advantages for classifying high-dimensional vectors: it removes features or patterns that provide little useful information, and it reduces the size of the classifier model, increasing the

processing speed and reducing the required storage space. The DT uses a set of conditions to check whether a co-occurrence frequency value is above or below a computed threshold. The results in figure 6.2 show that only two patterns were used as features, and the remaining 200 features were removed as they showed less information gain towards positive and negative instances. Although the two selected patterns work accurately with the training set examples, generalizing these two patterns to unseen examples is not enough to capture the co-occurrence frequency, and therefore reduces the prediction accuracy.
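This pruning effect can be inspected directly on the fitted tree. A sketch, under the same X_train/y_train assumptions as before (pattern_names is a hypothetical list naming the 200 selected pattern features):

from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(criterion='entropy')  # split on information gain
dt.fit(X_train, y_train)

# Features with non-zero importance are the ones the tree actually uses;
# in our runs only two patterns survived (compare figure 6.2).
used = [(name, imp)
        for name, imp in zip(pattern_names, dt.feature_importances_)
        if imp > 0]
print(used)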

Figure 6.2: A constructed DT model on SkELL word pairs

Overall, we found that the NB and NN classifiers have the highest accuracy in classifying antonymy. As shown in table 6.4, comparing the mean accuracy and standard deviation of the four selected algorithms on the BNC, the best model was NN with a mean accuracy of 72% and a standard deviation of 8%. In contrast, the best classifier model on the SkELL corpus was NB with a mean accuracy of 82% and a standard deviation of 12%. On average, NN and NB have accuracies of 72% and 73% respectively (see figure 6.3).


Table 6.4: Classifier accuracy on adjectival pairs obtained from the BNC and the SkELL corpus

Corpus    Statistics           NB     DT     NN     KNN
BNC       Mean Accuracy        0.66   0.70   0.72   0.69
          Standard Deviation   0.12   0.10   0.08   0.10
SkELL     Mean Accuracy        0.82   0.68   0.72   0.62
          Standard Deviation   0.12   0.13   0.12   0.18
Average   Mean Accuracy        0.73   0.69   0.72   0.65
          Standard Deviation   0.12   0.12   0.10   0.14

Figure 6.3: Accuracy comparison in identifying antonyms in four classifiers

6.3.1 New Antonymy Pairs

Getting the right prediction on unseen pairs is the ultimate goal of this type of machine learning model. During model creation, we split our dataset into 10 folds for cross validation, which shuffles the dataset randomly and splits it into ten smaller pair sets. The process randomly holds out one set as the test set, and the remaining sets are used for training and fitting

the model. The evaluation of the created model, as shown in the previous section, is based on the unseen test set (table 6.4). We then applied the antonymy classifier to unseen pair examples obtained from the BNC and the SkELL corpus. As discussed in chapter 4, some extracted word pairs such as (verbal/non-verbal) and (input/output) do not exist in the WordNet and were thus labelled as unrelated, and some opposite pairs such as (sweet/savory) and (monthly/quarterly) were labelled as co-hyponyms by the WordNet. We classified these examples via the NB model. The results showed that 360 unique examples obtained from the SkELL corpus were classified as antonymy, and 188 pair examples obtained from the BNC were classified as antonymy.

Table 6.5: Results of the extracted and classified antonyms in the BNC and SkELL corpus

Corpus   Extracted Pairs   Classified Ant.   New Ant.   WordNet Ant.
SkELL    3011              360               137        223
BNC      838               188               71         117
Total    3849              548               208        340

6.3.2 Theoretical Classification of the Classified Antonymy Pairs

The opposition classifications in chapter 2 presented useful guidance for deciding whether a word pair holds an opposite or non-opposite relationship. In this chapter, we showed that our proposed antonymy classifier is able to extract examples of novel English opposite pairs that are not found in the English WordNet. Although these pairs show semantic opposition, judging their relationship was a non-trivial task. Therefore, we use the opposition classifications in chapter 2 to justify the predicted class. We investigated the types of classified antonymy extracted by our Antonymy Classifier model against the theoretical classification of antonymy discussed in the literature. We found that the

classification of the extracted antonymy pairs is not limited to canonical and non-canonical antonymy but goes beyond this to include incompatible and context-dependent pairs of opposites. In particular, we found some interesting examples belonging to the following classifications:

• Gradable antonymy: such as (good/evil), (short-term/long-term).

• Non-gradable antonymy: such as (dead/wounded), (non-verbal/verbal).

• Multiple incompatibles: such as (monthly/quarterly), (weekly/biweekly).

• Binary taxonomy: such as (serial/parallel), (manual/automatic).

• Context-dependent opposite: such as (smarter/harder), (wear/tear).


Table 6.6: Examples of classified word pairs from the SkELL corpus

Opposite pairs                 Opposite pairs
(bright/dark)                  (full/half)
(fresh/frozen)                 (gay/bisexual)
(civil/military)               (general/special)
(classical/modern)             (gentle/firm)
(conscious/subconscious)       (good/poor)
(corresponding/different)      (great/small)
(current/past)                 (happy/sad)
(current/potential)            (historical/contemporary)
(domestic/international)       (inbound/outbound)
(due/unexpected)               (intimate/distant)
(due/excessive)                (lesbian/bisexual)
(early/mid)                    (less/many)
(elementary/secondary)         (lower/last)
(equal/different)              (manual/automated)
(ethnic/religious)             (medieval/modern)
(existing/future)              (mid/late)
(mild/profound)                (physiological/psychological)
(mild/severe)                  (possible/difficult)
(minor/important)              (possible/unlikely)
(minor/significant)            (present/former)
(moderate/large)               (primary/general)
(more/few)                     (prior/subsequent)
(movable/immovable)            (quick/easy)
(narrow/deep)                  (real/potential)
(national/global)              (real/simulated)
(natural/synthetic)            (real/imaginary)
(natural/man-made)             (real/fictional)
(near/distant)                 (real/fake)
(new/used)                     (real/fictitious)
(next/previous)                (same/opposite)
(open/private)                 (sensitive/unclassified)


Table 6.7: Examples of new antonym pairs extracted from the BNC which do not exist in the WordNet (NA: Not Available in the WordNet)

New antonyms                    PoS   WordNet Antonymy Pairs
(dead/wounded)                  Adj   (dead/alive), (wounded/uninjured)
(deaf/blind)                    Adj   (NA/NA), (NA/NA)
(deaf/dumb)                     Adj   (NA/NA), (NA/NA)
(weekly/monthly)                Adj   (weekly/aperiodic), (monthly/aperiodic)
(monthly/quarterly)             Adj   (monthly/aperiodic), (NA/NA)
(ancient/modern)                Adj   (ancient/present), (modern/nonmodern)
(active/contemplative)          Adj   (active/inactive), (contemplative/thoughtless)
(full/partial)                  Adj   (full/empty), (partial/complete)
(visual/verbal)                 Adj   (visual/invisible), (verbal/uncommunicative)
(human/material)                Adj   (human/nonhuman), (material/unworldly)
(parliamentary/presidential)    Adj   (parliamentary/undemocratic), (presidential/unpresidential)
(physical/spiritual)            Adj   (physical/immaterial), (spiritual/corporeal)
(spatial/temporal)              Adj   (spatial/nonspatial), (temporal/NA)
(historical/geographical)       Adj   (historical/ahistorical), (geographical/magnetic)
(verbal/non-verbal)             Adj   (verbal/uncommunicative), (non-verbal/uncommunicative)


Table 6.8: Examples of new antonym pairs extracted from the BNC which do not exist in the WordNet (NA: Not Available in the WordNet)

New antonyms                  PoS   WordNet Antonymy Pairs
(local/international)         Adj   (local/national), (international/national)
(syntactic/semantic)          Adj   (syntactic/NA), (semantic/NA)
(theoretical/methodological)  Adj   (theoretical/applied), (methodological/NA)
(advanced/non-advanced)       Adj   (advanced/early), (non-advanced/NA)
(moderate/severe)             Adj   (moderate/immoderate), (severe/mild)
(current/future)              Adj   (current/noncurrent), (future/present)
(literate/non-literate)       Adj   (literate/illiterate), (non-literate/civilized)
(serial/parallel)             Adj   (serial/disordered), (parallel/perpendicular)
(static/dynamic)              Adj   (static/moving), (dynamic/undynamic)
(local/global)                Adj   (local/national), (global/national)
(manual/non-manual)           Adj   (manual/automatic), (non-manual/NA)
(mental/spiritual)            Adj   (mental/physical), (spiritual/corporeal)
(semantic/pragmatic)          Adj   (semantic/NA), (pragmatic/impractical)
(academic/non-academic)       Adj   (academic/unscholarly), (non-academic/NA)
(traditional/modern)          Adj   (traditional/nontraditional), (modern/nonmodern)
(local/remote)                Adj   (local/national), (remote/close)
(personal/professional)       Adj   (personal/impersonal), (professional/nonprofessional)
(local/foreign)               Adj   (local/national), (foreign/domestic)


6.3.3 Challenges and Limitations

Although the performance of the antonymy classifier on the BNC and the SkELL corpus was positive, we identified several limitations of this work, discussed in the following points:

• Corpus size limitation: In this work, corpus size has the main impact on the classification output. As table 6.5 shows, only 548 high-frequency pairs (14%) out of the 3849 extracted pairs were classified as antonyms. The remaining 3301 pairs were not classified due to insufficient examples of the pairs in the corpus. This is caused by the absence of word co-occurrence tendency resulting from the corpus size limitation. Unfortunately, this work is unable to predict the relation of word pairs that do not have sufficient sentences and patterns in the corpus. Therefore, some opposite word pairs with low co-occurrence frequency are classified as negative.

• Validation issues: Another limitation of this work is the validation process for the classified antonyms. Part of the classified word pairs were found in the WordNet, which gives a positive validation of the classifier. WordNet confirmed 340 examples as antonyms, but 208 examples were new and not yet encoded in the WordNet. A sample of the unvalidated antonym pairs is presented in table 6.6. We manually validated these 208 examples and found that many are correct, such as (bright/dark) and (classical/modern), while some results are incorrect, such as (quick/easy) and (due/unexpected). In tables 6.7 and 6.8, we compare the extracted antonyms with the actual antonyms of their members in the WordNet. For example, the pair (ancient/modern) is not found in the WordNet, but its members already have antonymy pairs such as (ancient/present) and (modern/non-modern). We also used our labelling algorithms


explained in chapter 4 to label these pairs. For example, the labelling algorithms can infer that (ancient/modern) is opposite if there is a synonym relation between the words (ancient/non-modern) and between (present/modern). However, we still encountered incorrect pairs such as (deaf/blind) and (quick/easy). Validating the extracted antonyms is vital, especially when using them to extend the WordNet coverage. However, we stopped at this point and leave antonymy validation for future work. We also suggest improving our proposed labelling algorithms to obtain faster validation results, for example by using a crowd-sourcing system or an active learning approach to validate a large number of antonym examples.

6.4 Model Evaluation on Related Test Sets

In this section, we compare our model's performance on classifying antonymy against two related test sets. We present an accuracy comparison with previous related work on distinguishing the antonymy relation, for which published results are available in Turney (2011) and Lin et al. (2003). In addition, we compare our results to the baseline of antonymy and synonymy relations in WordNet.

6.4.1 Test sets

This section describes the related test sets as follows:

• TURN set by Turney (2008): The first dataset, TURN¹, consists of 89 examples of antonym pairs and 47 examples of synonym pairs. Table 6.9 shows some examples of the noun, adjective and verb pairs, which were extracted from several websites for teaching English as a second language (ESL).

¹ http://saifmohammad.com/WebDocs/TURN.txt


• LZQZ set by Lin et al. (2003): The second dataset, LZQZ¹, consists of 80 noun antonymy pairs and 80 noun synonymy pairs extracted from the Webster Collegiate Thesaurus. LZQZ is presented as a set of questions about synonymy or antonymy relations, as shown in table 6.10.

Table 6.9: Examples of the Turney (2011) test set

word pair              class
(degenerate/improve)   Antonymy
(big/small)            Antonymy
(fantastic/awful)      Antonymy
(dry/wet)              Antonymy
(sad/happy)            Antonymy
(shed/keep)            Antonymy
(terrific/terrible)    Antonymy
(dead/alive)           Antonymy
(wonderful/terrible)   Antonymy
(wonderful/terrific)   Synonymy
(sad/miserable)        Synonymy
(expensive/costly)     Synonymy
(big/large)            Synonymy
(genuine/proven)       Synonymy
(huge/vast)            Synonymy
(fantastic/wonderful)  Synonymy
(glad/happy)           Synonymy

¹ http://saifmohammad.com/WebDocs/LZQZ.txt


Table 6.10: Examples of the Lin et al. (2003) test set

word pair                  class
(ability/inability)        Antonymy
(abundance/scarcity)       Antonymy
(admiration/hatred)        Antonymy
(adult/child)              Antonymy
(advantage/disadvantage)   Antonymy
(joy/sorrow)               Antonymy
(backlog/stockpile)        Synonymy
(band/group)               Synonymy
(bean/noodle)              Synonymy
(blockade/roadblock)       Synonymy
(blueprint/strategy)       Synonymy
(college/prison)           Synonymy

6.4.2 Implementation and Evaluation Parameters

This section describes the two related models for distinguishing between antonymy and synonymy:

• The PairClass Model: Turney (2008) proposed PairClass, a supervised classification model based on the SVM algorithm. It aimed to distinguish between several semantic relations, including antonymy and synonymy. The PairClass model uses a set of automatically extracted patterns, selected based on their association with their source pairs. The number of selected patterns in the PairClass model is fixed at 8,000 patterns.

• The Incompatible Patterns: The main contribution of Lin et al. (2003) is to identify synonyms among distributionally similar words, which include antonyms and other semantically-related relations. Their work proposed using two textual


patterns, (from X to Y) and (either X or Y), which are called Patterns of Incompatibility. Their hypothesis states: if a pair of words (x/y) appears in one of these patterns, then they are more likely to be semantically incompatible and not synonyms. Lin et al. (2003) used equation 6.9 to compute the probability of a pair (x/y) belonging to the synonym or antonym category. The equation computes the average of the co-occurrence hits between a pair (x/y) and the selected patterns p_i, where Σ|x, *, y| is the total number of document hits returned by the AltaVista search engine for the query (X NEAR Y), and Σ|x, p_i, y| is the total number of document hits returned by AltaVista for the queries (either X or Y) and (from X to Y); c is a small constant (c = 0.0001) that avoids division by zero when the total co-occurrence hits between the pair and the patterns is zero.

$Score(x, y) = \frac{\sum |x, *, y|}{\sum |x, p_i, y| + c}$  (6.9)

In addition, we use four metrics to evaluate the prediction performance over the test sets: Precision in equation 5.2, Recall in equation 5.3, F-measure in equation 6.10 and Accuracy in equation 6.8.

$F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall}$  (6.10)
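All four metrics are provided by scikit-learn, so the evaluation step reduces to a few calls (the label lists here are hypothetical placeholders):

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]  # gold labels (hypothetical)
y_pred = [1, 0, 0, 1, 1, 1]  # classifier output (hypothetical)

print('P   =', precision_score(y_true, y_pred))
print('R   =', recall_score(y_true, y_pred))
print('F   =', f1_score(y_true, y_pred))
print('ACC =', accuracy_score(y_true, y_pred))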

6.4.3 Evaluation Results

Table 6.11 shows the accuracy results of the evaluation on the LZQZ test set and compares them with the WordNet baseline. The table shows that WordNet correctly labelled 51% of the pair set as antonymy or synonymy. The model by Lin et al. (2003) showed a precision of 0.55, a recall of 0.61, an F-measure of 0.75 and an accuracy of 0.58, while the model by Turney (2011) showed a precision of 0.55, a recall of 0.52, an F-measure of 0.53 and an accuracy of 0.52.


We also compare the results to our proposed antonymy classifier. As explained in this chapter, NB and NN showed similar classification performance, so we compare the previous results to both classifiers. Table 6.11 shows that the NN achieved a precision of 0.68, a recall of 0.72, an F-measure of 0.66 and an accuracy of 0.68, which is the best result among the other models and the baseline.

Table 6.11: LZQZ test set evaluation

                           P      R      F      ACC
Baseline:
a. WordNet                 0.51   0.51   0.51   0.51
Related work:
a. Lin et al. (2003)       0.55   0.61   0.75   0.58
b. Turney (2011)           0.55   0.52   0.53   0.52
Our Antonymy Classifier:
a. NB                      0.54   0.58   0.51   0.67
b. NN                      0.68   0.72   0.66   0.68

Table 6.12 shows the accuracy results of the evaluation on the TURN test set. WordNet correctly labelled 37% of the pair set as antonymy or synonymy. The model by Lin et al. (2003) showed a precision of 0.43, a recall of 0.23, an F-measure of 0.30 and an accuracy of 0.35, while the model by Turney (2011) showed a precision of 0.12, a recall of 0.40, an F-measure of 0.19 and an accuracy of 0.25, which are lower than Lin's model.

We also compare the results to our proposed antonymy classifier. As the classification results of NB and NN were similar, we compare the previous results to both classifiers. Table 6.12 shows that the NN achieved a precision of 0.53, a recall of 0.51, an F-measure of 0.45 and an accuracy of 0.63, which is also the best result among the other models and the baseline.

In addition, we tested the statistical significance of both datasets using the Chi-Square statistic. The tests show that TURN has a Chi-Square value of 72.0671 and a p value of 0.00001, which is below the threshold of 0.05

and therefore the result is significant. The LZQZ dataset has a Chi-Square value of 14.7334 and a p value of 0.005287, which is also below the threshold, and therefore the result is likewise significant.

Table 6.12: TURN test set evaluation

                           P      R      F      ACC
Baseline:
a. WordNet                 0.37   0.37   0.37   0.37
Related work:
a. Lin et al. (2003)       0.43   0.23   0.30   0.35
b. Turney (2011)           0.12   0.40   0.19   0.25
Our Antonymy Classifier:
a. NB                      0.55   0.54   0.53   0.59
b. NN                      0.53   0.51   0.45   0.63

6.4.4 Discussion

We applied two classification algorithms, NB and NN, to the two test sets, and the results showed that the NN model outperforms the WordNet baseline as well as the previous models in distinguishing antonymy. However, we identified one issue in applying the model: the co-occurrence frequency limitation caused by the corpus size.

To overcome the co-occurrence issue, we used different English corpora, including the SkELL corpus and the vast English TenTen, to extract textual patterns showing both words of a pair inside the sentence boundaries. However, although the SkELL corpus has over a billion words and English TenTen has over 10 billion words, many word pair examples in both test sets, such as (joy/sorrow) and (partition/comprehension), generated zero or very low co-occurrence (fewer than 50 hits). Thus, we could not obtain sufficient patterns for the tested pairs. As a result, the generated vector space was highly sparse and provided little useful information about the relation. In future work, we suggest using the Web as a corpus, for example via a search engine such as Google or AltaVista, to extract patterns from the Web.


6.5 Summary

This chapter described the research experiments on selecting and evaluating a machine learning model to identify antonymy relations in a textual corpus. The model creation process examined several classification models, including the KNN, NB, NN and DT models. We showed that NB and NN outperform the other models in identifying antonyms in the BNC and the SkELL corpus.

Chapter 7

Model Application: Antonymy in Arabic Corpora

This chapter presents two extraction experiments and discusses the prediction results of our antonymy classifier in capturing Arabic antonymy patterns and antonym pairs. Both experiments use the Sketch Engine corpus query tool, which provides a range of Arabic corpora of different sizes and annotations, as shown in table A.3. The first experiment aims to extract antonym pairs found in Arabic corpora, while the second aims to bridge the gap between English and Arabic antonymy by using parallel corpora, linking the extracted Arabic and English antonymy patterns and pairs, which has many applications such as extending antonymy coverage in the WordNet.

Our proposed antonymy extraction model is language-adaptive, requiring only adapting the input text and specifying the language and tagging-set parameters. We use the Arabic corpus from the TenTen family by Belinkov et al. (2013), which consists of a sample of over 115 million words tagged by the MADA tool. The purpose of this experiment is to test the performance of our proposed antonym classifier model in capturing Arabic antonymy. As we showed in the previous chapters, the model uses lexico-syntactic patterns to generate word pairs of identical PoS tags. In addition, the model uses the OMW to label the extracted pairs. The OMW covers 150 languages, including Arabic, whose words are linked to the English words.


7.1 Arabic WordNet

The first version of the Arabic WordNet (AWN 1.0) was designed by a team of specialists at the University of Manchester (Black et al., 2006; Elkateb et al., 2006a,b). The design of the Arabic WordNet was based on the PWN synset representation. The construction approach used the Interlingual Index resource to link between synsets in English and Arabic. In addition, the Suggested Upper Merged Ontology was used to extend over 1000 hierarchical concepts by translating them to Arabic. These translated synsets were established manually by a team of Arabic linguists and translation specialists.

Following the first AWN release, the Arabic named entities version (AWN 2.1) added entities such as locations, persons and organizations to the AWN. These extensions were made by establishing correspondences between Arabic Wikipages that show named entities in PWN and English Wikipages. The work reported challenges in finding counterpart Arabic Wikipages for the extracted English Wikipages. The results extended the AWN with 3,854 Arabic NEs (Alkhalifa & Rodríguez, 2009).

Another extension project, by Abouenour et al. (2013), aimed to enhance the performance of an Arabic question/answering system by extending the coverage of NEs, verbs and nouns in the Arabic WordNet. The extension is based on the YAGO ontology, which contains over three million named entities that were translated to Arabic using the Google Translate API. Moreover, verbs in the English VerbNet were translated to Arabic using sense disambiguation assignment. In addition, 459 hyponym instances were correctly extracted and used to extend the AWN. The hyponym extraction was accomplished using a pattern-based approach. However, the patterns used in the extraction did not involve the bootstrapping method, and the experiment stopped at only 20 hits in the Yahoo search engine API. Synset extensions added only 5.2% to the AWN synsets, as shown in table 7.1. The last AWN extension is still considered a low-coverage resource, as it removes Arabic synsets that have no English synset counterparts. In terms of AWN usability, their Arabic question/answer system reported better precision

following the extension, achieving a precision of 26.76% (Abouenour et al., 2008).

Table 7.1: Arabic WordNet versions

Version                                                 Synsets   Words
AWN 1.0 (Elkateb et al., 2006a)                         9,698     21,813
AWN 2.1 (Alkhalifa & Rodríguez, 2009)                   11,269    23,841
OMW 3.0 (Abouenour et al., 2010; Bond & Foster, 2013)   9,916     17,785

As shown in table 7.1, the three versions contain over 9,698 Arabic synsets and 17,785 words. The AWN synset representation in versions 1.0 and 2.1 is based on PWN 2.0, which has four main components: an Item is the synset, a Word is the word sense, a Form is the entity and root, and a Link is the relation between items.

Table 7.2 shows a comparison between the number and type of antonymy examples in the three Arabic WordNet versions. The first and second versions were manually constructed and encoded, containing only 14 antonym pairs in the first version (1.0) and 361 near-antonym pairs (indirect antonyms) in the second version (2.1). In contrast, the last version of the Arabic WordNet (OMW 3.0¹) contains English antonyms translated to Arabic using a machine translation model, yielding over 15,094 examples of indirect antonym pairs.

Table 7.2: Antonym coverage in Arabic WordNet versions

Version                                 Antonyms   Near Antonyms
AWN 1.0 (Elkateb et al., 2006a)         14         0
AWN 2.1 (Alkhalifa & Rodríguez, 2009)   0          361
OMW 3.0 (Bond & Foster, 2013)           0          15,094

Overall, the statistics for AWN 1.0 show low coverage in terms of the number of words, synsets and semantic relations. This low coverage is due to the manual approach used in constructing the first version of the AWN.

¹ http://compling.hss.ntu.edu.sg/omw


The second version of the AWN, however, adopted a semi-supervised approach to semantic relation construction. The coverage of Named Entities was extended by translating English NEs to Arabic. Furthermore, the coverage of hypernym/hyponym relations and entities was extended by a pattern-based approach. The extracted entities and concepts were reviewed by expert Arabic linguists before the release of version 2.1. The last update, 3.0, was released under the OMW project, and it still suffers from low precision and low coverage of words and relations. As stated on the OMW website, only 47% of the actual Arabic WordNet has been released. OMW rejected Arabic synsets that have no corresponding synsets in PWN, i.e. 53% of the WordNet is not mapped to English synsets. These unmapped synsets include religious and Arabic cultural concepts.

The Arabic WordNet has many applications in Natural Language Processing, and the literature shows a steady increase in its use in Arabic language applications. Examples include query expansion, where Arabic query words are expanded by adding synonyms; studies by Brini et al. (2009) and Abouenour et al. (2013) reported a notable improvement from including WordNet synsets with the query keywords. Another application is improving semantic indexation in information retrieval, such as replacing traditional document indexing with the hypernym concepts of the document keywords (Abderrahim et al., 2013). Detecting plagiarised documents can also be improved by exploiting WordNet synsets, for example by substituting synonym terms and calculating the similarity distance between two terms (Menai, 2012). Last but not least, WordNet can improve Arabic sentiment analysis: Mahyoub et al. (2014) exploited the Arabic WordNet as a resource of lexical and semantic terms and relationships, using relations such as near synonyms, near antonyms and related to. However, the Arabic WordNet has many issues in terms of relation representation and synset coverage which negatively impact natural language applications, and thus further investigation and improvement is needed.


7.2 Issues in Arabic WordNet

We identified some issues in the last version of the Arabic WordNet, published as OMW 3.0. These issues are found in the version available in the NLTK corpus package (Python 3.0), named wordnet.

• Script diacritics: The Arabic words available in the OMW are presented with diacritics. Using diacritics mitigates the problem of polysemy by making a word's sense more obvious. However, the user will not get a result for a query unless the diacritics are provided with the entered Arabic keyword, and knowing the right diacritics is difficult even for native speakers. We therefore removed the diacritics from the Arabic synsets to achieve better matching.

• Antonymy Representation: The Arabic antonymy representation in the OMW is completely different from the English one. OMW shows antonymy at the synset-to-synset level, an n-to-n word relation, instead of word-to-word relations. This causes unreliable antonymy relations, as it shows all synset members as antonyms and creates a massive number of possible antonym word pairs. We found that there are over 15,000 Arabic antonym word pairs linked to the most similar English lemmas. For example, the lemma small has been assigned to two Arabic words (øQêm.×, ‡J ¯X) and given the same synset id (offset number); more examples are below:


English   offset      Arabic           English   offset      Arabic
large     01382086    Q J.»            small     01391351    ø Qêm.×, ‡J ¯X
hot       01247240    á kAƒ            cold      01251128    XPAK., QKA ¯

Bond & Foster (2013) translated English words to Arabic and computed their similarity based on a set of proposed factors, and only Arabic synsets with high similarity scores are presented. Despite the high similarity scores, there are many Arabic word suggestions of which none is a true or canonical antonym. The true canonical antonym of the word (large) is Q ª“ and of the word (cold) is PAg, but these were not presented among the suggestions, as shown in the following example.

English   offset      Arabic           English   offset      Arabic
large     01382086    Q J.»            small     01391351    Q ª“
hot       01247240    á kAƒ, PAg       cold      01251128    XPAK., QKA ¯

In addition, the proposed translation model (Bond & Foster, 2013) failed to represent antonym pairs whose members belong to different senses (polysemy); for example, lie in the antonym pair (sit/lie) has been translated

to H. Y», which means not telling the truth.

English   offset      Arabic           English   offset      Arabic
sit       01543123    Y®ªK@ / Êg.      lie       01547001    H. Y» / ‡ÊJƒ@

• Query words must be lemmatized: In contrast to the PWN, which shows the simplex form of an English word input, the NLTK does not lemmatize a query to match the base forms represented in the AWN. Therefore, a user must enter the exact word lemma to get the target synset. For example, a word with a determiner such as (Z@Q Ë@, rich) will get no matching results, but entering the


lemmatized word Z@QK will fetch the word's synset.

We re-represent antonymy in the Arabic WordNet to solve the above issues. The new representation of antonymy is a word-to-word relation. In addition, we used two tools, the MADAMIRA Arabic lemmatizer and a diacritics remover, to extract Arabic lemmas and provide a diacritics-free form of Arabic words, as in the following examples:

1  large  01382086  Adj  Q J.»     small  01391351  Adj  ø Qêm.×
2  large  01382086  Adj  Q J.»     small  01391351  Adj  ‡J ¯X
3  hot    01247240  Adj  á kAƒ     cold   01251128  Adj  XPAK.
4  hot    01247240  Adj  á kAƒ     cold   01251128  Adj  QKA ¯

1  sit  01543123  verb  Y®ªK@     lie  01547001  verb  H. Y»
2  sit  01543123  verb  Êg.       lie  01547001  verb  ‡ÊJƒ@
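The diacritics removal step mentioned above can be implemented with a short regular expression over the Arabic combining-mark range (a sketch; the name strip_diacritics is ours):

import re

# Arabic diacritics (tashkeel): tanween, fatha, damma, kasra, shadda, sukun,
# plus the superscript alef (U+0670).
DIACRITICS = re.compile(r'[\u064B-\u065F\u0670]')

def strip_diacritics(text):
    """Remove Arabic short-vowel marks so WordNet entries match corpus words."""
    return DIACRITICS.sub('', text)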

7.3 Antonymy in Arabic Corpora: Initial Experiment

Our first experiment on Arabic antonymy extraction used a pattern-based approach to generate antonym pairs from Arabic text to extend the BADEA ontology (Alyahya et al., 2016). We used a small set of Arabic antonym seeds to generate textual patterns co-occurring with antonyms. The experiment used three Arabic corpora to extract antonym pairs: the King Saud University Corpus of Classical Arabic by Alrabiah et al. (2014), composed of 50 million tokens; the one-million-word Corpus of Contemporary Arabic by Al-Sulaiti & Atwell (2006); and the King Abdulaziz City for Science and Technology Arabic Corpus by Al-Thubaity et al. (2013).

The selection of extracted patterns was done manually, which is time-consuming. We concluded from the experiment output that Arabic patterns can generate antonym word pairs, but they can also generate other semantic

relations such as synonymy and co-hyponymy. Therefore, we need either a manual approach to remove non-antonym pairs or a classifier to distinguish between antonym and non-antonym pairs. Table 7.3 shows examples of the extracted and selected patterns with a reliability over 20%. Every pattern in this table was used to generate Arabic antonym pairs from the three Arabic corpora mentioned earlier. The total number of generated antonym word pairs is 1,789 examples.

Table 7.3: Arabic antonymy patterns used in BADEA ontology # Extracted Patterns # Extracted Patterns 1 2  á« úæîDË@ ð . QÓB@  ú¯ Ð@ € ú¯

3  ð@ € ÐñJ Ë@ 4  èQºK ð € I. m'

5  ð@ € ñm' 6  ð@ € YªK. 7 8  ð@ àA¿ €  ú ¯ Bð € ú ¯ 9 10   ú¯ ð € ú¯  áÓ € h. Qm'

11  Qk B@ ð € AÒëYg@ 12  áÓ € ¬QªK

13  á« ð € á« 14  ð@ € áÓ 15  16  YªK. € ‡k  AÓ@ ð € AÓ@ 17 18  ð@ € Yg  áÓ Q g €

The extracted pairs were evaluated by two native speakers, and the reported performance showed two precision scores. The first precision was extremely low, at 0.8%, when using patterns selected by the occurrence selection strategy to generate antonym examples, as explained in section 5.2.1.2. The second reported precision, over 27%, was achieved using the 18 selected patterns in table 7.3. In addition, many extracted patterns were classified as idiomatic patterns: although they captured high-frequency word pairs, they could only recall one example of antonymy, as in the following example.


 á« ú æîDË@ ð . QÓB@

QºJÖÏ@ á« ú æîDË@ ð ¬ðQªÖÏAK. QÓB@

7.4 Antonymy in ArTenTen Corpus

In this section, we demonstrate the application of the proposed antonymy extraction and classification model on an Arabic corpus. We began by automatically constructing the experiment's dataset from the ArTenTen corpus, which involves generating Arabic word pairs from the text. The second step is to automatically label every extracted word pair as either an example of a positive relation (i.e. antonymy), a negative relation (i.e. another semantic relation exists) or an unknown relation (i.e. there is a semantic relation between the words but it is not known in the WordNet). Then, we automatically extract and classify patterns associated with positive and negative pairs, and use the positive patterns as relation features in constructing pair vectors. The following sections explain these processes in detail.

7.4.1 Dataset Construction

Our observations in chapter 4 showed that antonym pairs tend to co-occur in contrastive patterns; the majority of antonym pairs co-occurred in conjunction patterns. Therefore, we use a single conjunction pattern (X [conj] Y) to extract word pairs from ArTenTen. In addition, we only extracted pairs belonging to the same PoS, such as (noun/noun), (adjective/adjective) or (verb/verb). The pattern (X [conj] Y) generates several Arabic conjunction patterns; we select the most frequent of these,


such as (X ð Y, X ð@ Y, X úÍ@ Y). Figure 7.1 shows some examples of extracted pairs obtained from Sketch Engine.

The constructed training set has a diversity of positive and negative examples representing antonymy and other semantically-related pairs that occur in coordinated patterns. The training set is automatically constructed from the ArTenTen corpus. As we explained in chapter 4, the training set represents word pairs with high co-occurrence frequency which are also found in the WordNet as either antonymy or non-antonymy. We aimed to avoid the vector sparseness problem of pair vectors that generate zero or very low co-occurrence frequency. For instance, the near-antonym pair (Q J.»/øQêm.×) is already encoded in the OMW but does not occur in the ten-billion-word ArTenTen corpus. In addition, one of the research aims is to find new antonyms that have not yet been encoded in the WordNet. We found that constructing a dataset from the corpus yielded new antonym pairs that are not found in the Arabic WordNet, such as the pair (PAg hot / XPAK. cold).
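As a simplified sketch of the (X [conj] Y) extraction (the tag names and the conjunction list are assumptions standing in for the MADA tagset and the conjunctions selected above):

# tagged_sentence: a list of (word, tag) tuples for one corpus sentence
CONJ_WORDS = {'و', 'أو', 'إلى'}  # and / or / to (assumed conjunction set)

def extract_pairs(tagged_sentence):
    """Return same-PoS word pairs joined by a selected conjunction."""
    pairs = []
    for (w1, t1), (c, _), (w2, t2) in zip(tagged_sentence,
                                          tagged_sentence[1:],
                                          tagged_sentence[2:]):
        if c in CONJ_WORDS and t1 == t2 and t1 in {'NOUN', 'ADJ', 'VERB'}:
            pairs.append((w1, w2))
    return pairs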


Figure 7.1: Arabic pairs in Sketch Engine showing antonym and synonym relations

7.4.2 Labelling Limitations

Our labelling algorithms described in chapter 4 consider a word pair to be a positive instance if an antonymy relation links the words in the Arabic WordNet. To do so, we use the Arabic antonymy relations available in the OMW. However, we came across several issues in labelling the pairs and suggest some solutions, as described in the following points:

• Antonymy Representation in the Arabic WordNet: The representation of antonymy relations in the OMW differs from the PWN. The Arabic WordNet shows antonymy relations between Arabic synsets instead of between lemmas (word-to-word relations). Thus, the encoded antonymy examples are considered non-canonical and show semantic opposition relations more than lexical opposition relations. We also aim to label negative instances when the word pair does not have an


antonymy relation in the WordNet. But labelling negative instances is also challenging, as there are many missing canonical antonym pairs, or false negative instances, in the Arabic WordNet, such as PAg and XPAK.. Therefore, we treat these missing pairs as the research problem and use them as new unseen instances to discover novel antonym pairs that are not represented in the WordNet.

• Missing Arabic Lemmatization: Unfortunately, neither the words in the Arabic WordNet nor those extracted from the ArTenTen are lemmatized. Therefore, it is expected that the extracted Arabic words will not match the Arabic words available in the Arabic WordNet. The advantage of lemmatization is to provide the basic, unified word form, which can be used as a representative word for the synset. For example, we extracted different Arabic pairs from the ArTenTen corpus which all have the same lemma (hot, á kA ƒ) but have different word forms, such as éJ kA ‚Ë@, éJ kA ƒ, á kA ƒ, á kA ‚Ë@. To solve this issue, we apply the MADAMIRA tagging tool (Pasha et al., 2014) to the ArTenTen corpus and the Arabic WordNet to match the lemmas of the extracted pair words.

• Arabic Diacritics: The Arabic words in the WordNet are represented with diacritics, while the script of the ArTenTen corpus in Sketch Engine has no diacritics. We solve this issue by removing the diacritics from the words in the WordNet, at the risk of polysemy issues: a word can have multiple meanings, each represented with different diacritics. Another suggested solution is to automatically label ArTenTen words with diacritics. However, this solution can generate incorrect diacritics


and an incorrect meaning. For example, the word h. ðP has multiple meanings, such as (be a couple, pair, partner or husband) or (to join in pairs, to double, to geminate or to give in marriage). Thus, adding incorrect diacritics leads to issues in representing the right sense.

7.4.3 Dataset Construction Output

We extracted 3,000 unique word pairs from the ArTenTen corpus and used the OMW to label the semantic relation of the extracted word pairs. However, the labelling result was poor: it yielded only a few automatically labelled examples of antonym and non-antonym pairs (see table 7.4). To continue our experiment, we had to manually label more examples of positive and negative pairs. In total, we labelled 50 examples for the positive class and 50 examples for the negative class.

Table 7.4: The labelled pairs output (English glosses of the Arabic pairs)

W1           W2           OMW relation
liquid       gas          antonyms
vertical     horizontal   antonyms
white        black        antonyms
right        wrong        antonyms
wrong        right        antonyms
important    essential    non-antonyms
potential    possible     non-antonyms
true         realistic    non-antonyms
atomic       nuclear      non-antonyms
many         multiple     non-antonyms

7.4.4 Pattern Induction and Selection

This section shows the application of the automatic pattern-based method using Sketch Engine tools to generate patterns hosting opposite pairs in corpus texts.

The previous study on Arabic antonymy extraction by Alyahya et al. (2016) used a pattern-based approach to generate antonym pairs. The reported antonymy extraction precision ranged between 0.8% and 27% per pattern. This performance of the Arabic antonymy patterns was very low and required a human evaluator to judge the generated pair output. In our work, we use the training seeds to generate patterns from the ArTenTen corpus automatically. The details of our extraction approach are explained in chapter 5. The extraction yields over 4,500 unique patterns which co-occur more than twice with the provided example seeds (see the sketch below). However, using all these patterns as features creates a high-dimensional sparse vector, as the extracted patterns show widely varying co-occurrence frequencies. Therefore, we use the Pattern Classifier presented in chapter 5 to select the most effective patterns.
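The following Python sketch illustrates, under simplifying assumptions, the kind of pattern induction described above: for every sentence in which a seed pair co-occurs within a short window, the intervening words are kept as a candidate pattern with the pair replaced by the X and Y slots, and candidates seen more than twice are retained. Tokenisation, window size and thresholds are illustrative choices, not the exact procedure of chapter 5.

    from collections import Counter

    def induce_patterns(sentences, seed_pairs, max_gap=4, min_freq=3):
        counts = Counter()
        for sentence in sentences:
            tokens = sentence.split()
            for x, y in seed_pairs:
                if x in tokens and y in tokens:
                    i, j = tokens.index(x), tokens.index(y)
                    if 0 < j - i <= max_gap:       # keep short contexts only
                        pattern = ' '.join(['X'] + tokens[i + 1:j] + ['Y'])
                        counts[pattern] += 1
        # keep patterns that co-occur with the seeds more than twice
        return {p: c for p, c in counts.items() if c >= min_freq}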

7.4.5 Pattern Classifier

Intuitively, a good or reliable pattern is one that mostly generates instances of the target relation with high accuracy. However, measuring this precision is a non-trivial task, as it involves judging the relation of every generated instance and then computing the ratio of correct to incorrect instances among the pairs extracted by the pattern. Pattern selection favours patterns that are mostly associated with correct relation instances and have a good occurrence frequency in the corpus. The Pattern Classifier proposed in chapter 5 computes the semantic orientation of every pattern towards an association with either positive or negative pairs. It aims to filter out patterns that are more likely to associate with negative pairs than positive ones. The pattern classifier takes every extracted pattern pi as input and classifies it as either a positive, neutral or negative pattern. The classifier uses equations 5.12 and 5.15 to compute the average occurrence association between positive instances and negative instances. To do so, we represent every pattern pi in the pattern set

as a vector of positive and negative co-occurrence binary hits between the pair set instances and pattern pi. Table 7.5 shows the most positive patterns, those with the highest log association scores with the positive pair examples in the training set. In addition, comparing the automatic approach to pattern selection with the manual approach used in Alyahya et al. (2016) shows that the automatic approach generated more reliable patterns. Many examples of the manually selected patterns in table 7.3 were also found automatically among the selected patterns in table 7.5. Therefore, we conclude from our results that using the Pattern Classifier to automatically extract and select patterns provides an accurate approach which is low-cost compared to the time and effort spent in the manual approach.
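As a rough illustration of the pattern orientation computation (not the exact formulation of equations 5.12 and 5.15), the score below compares the smoothed rate of binary hits against positive pairs with the rate against negative pairs; the add-one smoothing and the log-ratio form are assumptions.

    import math

    def classify_pattern(positive_hits, negative_hits, n_positive, n_negative):
        # Binary hits: how many positive / negative training pairs co-occur
        # with the pattern at least once in the corpus.
        score = math.log2(((positive_hits + 1) / n_positive) /
                          ((negative_hits + 1) / n_negative))
        if score > 0:
            return 'positive', score
        if score < 0:
            return 'negative', score
        return 'neutral', score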

7.4.6 Antonymy Classifier Model Evaluation

We have extracted 4,500 unique patterns from the ArTenTen corpus, but only used the top 200 patterns as the selected relation features (explained in chapter 6). Table 7.5 shows examples of the extracted patterns and their log association scores with positive and negative instances.


Table 7.5: Examples of positive patterns extracted from the ArTenTen corpus, listed with their log association scores (the Arabic patterns are not reproduced here; scores range from 2.50 down to 0.33)

We evaluated several machine learning algorithms in chapter 6 and concluded that NB and NN outperform the other algorithms. In this section we apply NB and NN to the extracted and labelled Arabic dataset, using the same implementation parameters as in the English experiment in chapter 6. We use MinMax scaling for feature vector normalization and 10-fold cross-validation for evaluation. Table 7.6 shows the outputs of the classifiers compared to the previously reported results for the Arabic WordNet baseline and Alyahya et al. (2016).

Table 7.6: Evaluation on Arabic pairs classification

                              P      R      F      ACC
Baseline:
  a. WordNet                  0.004  0.004  0.004  0.004
Related work:
  a. Alyahya et al. (2016)    0.27   -      -      -
Our Antonymy Classifier:
  a. NB                       0.52   0.54   0.34   0.54
  b. NN                       0.37   0.49   0.41   0.53

Table 7.6 shows a very low accuracy of 0.004 when using the WordNet relations to label the extracted pairs. The labelling results show that only ten examples out of 3,000 pairs were labelled as positive or negative by the Arabic WordNet in the OMW. Therefore, the accuracy of the Arabic WordNet labelling in capturing canonical antonymy is very low despite the large number of Arabic antonym examples it encodes. In addition, Alyahya et al. (2016) evaluated their generated pairs and reported a precision of 0.27 based on a manually selected pattern set. In contrast, our experiment uses an automatic approach to selecting patterns and achieved an accuracy of 0.54 using the NB algorithm. In total, we classified and validated 52 examples of antonymy captured from the ArTenTen corpus, as shown in table 7.7.
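The evaluation setup used above (MinMax scaling, 10-fold cross-validation, NB and NN classifiers) can be reproduced in scikit-learn roughly as follows; the random feature matrix stands in for the real pair-by-pattern vectors, and the classifier hyperparameters are assumptions.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_validate

    # Placeholder data: 100 labelled pairs x 200 selected pattern features.
    rng = np.random.default_rng(0)
    X = rng.poisson(1.0, size=(100, 200)).astype(float)
    y = rng.integers(0, 2, size=100)

    for name, clf in [('NB', GaussianNB()), ('NN', MLPClassifier(max_iter=500))]:
        pipeline = make_pipeline(MinMaxScaler(), clf)
        scores = cross_validate(pipeline, X, y, cv=10,
                                scoring=('precision', 'recall', 'f1', 'accuracy'))
        print(name, {k: round(v.mean(), 2)
                     for k, v in scores.items() if k.startswith('test_')})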


Table 7.7: Examples of positive antonym pairs (English glosses of the Arabic pairs)

(small/big)                  (western/eastern)
(oppressor/oppressed)        (implicit/explicit)
(suitable/offensive)         (good/bad)
(small/medium)               (physical/psychological)
(good/bad)                   (formal/informal)
(top/bottom)                 (imported/domestic)
(physically/spiritually)     (few/many)
(industrial/agricultural)    (old/new)
(temporarily/permanent)      (public/private)
(quantitative/qualitative)   (government/private)
(physical/human)             (treatment/protection)
(physically/mentally)        (implicit/explicit)
(foreign/resident)           (inside/outside)
(liberal/moderate)           (foreign/national)
(individual/group)           (positive/negative)
(connected/disconnected)     (verbal/written)
(wired/wireless)             (tight/loose)
(vegan/non-vegetarian)       (past/present)
(gold/silver)                (leader/follower)
(national/international)     (former/later)
(boiled/grilled)             (opposed/supporter)
(religious/secular)          (confidential/private)


7.5 Arabic and English Antonymy in Parallel Corpus

We aim to bring the Arabic and English languages together by aligning English and Arabic antonymy patterns and pairs. This alignment could be used to extend antonymy relations in multilingual resources, for example the OMW. AlHedayani (2016) studied antonymy pattern classifications in English and Arabic and compared them to the patterns found in Jones (2003) and Davies (2012). She also used the Arabic TenTen corpus to extract Arabic patterns hosting canonical antonymy across different parts of speech, and concluded that antonymy behaves similarly in both languages, with only minor differences in antonymy functions. In this section, we investigate using a parallel corpus to extract antonyms from English and Arabic concordances. A parallel corpus is a bilingual corpus which presents text in two languages in parallel alignment; more generally, it can cover many languages and align similar paragraphs across them. Hence, capturing antonymy patterns and pairs in English and Arabic simultaneously provides a better bilingual alignment. In addition, a parallel corpus can overcome the shortcomings in generating antonyms in low-resource languages, such as Arabic, by using English patterns. A parallel concordance in Sketch Engine can compare different languages and suggest translations. We aim to use this tool to find antonymy patterns and pairs in English and Arabic, to compare them, and to see how pattern behaviour differs between English and Arabic text. To do this, we used eight English antonymy patterns extracted from the SkELL corpus (explained in chapter 5) to search for Arabic antonymy patterns and pairs. Sketch Engine finds sentences in English and Arabic which share a similar antonymy pattern. For example, the pattern (whether X or Y ) has been


matched in the Arabic concordance by equivalent Arabic constructions of the form 'whether ... or'. Table 7.8 shows examples of matching Arabic patterns.

Table 7.8: Similar English and Arabic antonymy patterns obtained from the OPUS2 parallel corpus (the Arabic equivalents are not reproduced here)

English Pattern
both X and Y
whether X or Y
either X or Y
between X and Y
from X to Y
X but not Y
X and not Y
from X and Y

In addition, we use English patterns to extract Arabic antonymy. We use the same search strategy, but conducted manually in order to evaluate the matching accuracy. Using the single pattern (whether X or Y), the captured Arabic concordance contained 1,441 matching lines, and the captured English pairs are equivalent to the aligned Arabic pairs. We show some extracted examples in table 7.9, which indicates promising results. One limitation of this approach is that parallel concordance results are displayed and aligned at paragraph level. Sketch Engine highlights the patterns in both languages, but to get results at pattern level we had to extract the patterns manually (a sketch of automating the English side of this search is shown below).
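The manual search described above can be partly automated on the English side of the alignment, as in the sketch below: a regular expression finds (whether X or Y) instances and yields the candidate pair together with the aligned Arabic segment for inspection. The aligned-segment data format is an assumption.

    import re

    WHETHER_X_OR_Y = re.compile(r'\bwhether (\w+) or (\w+)\b', re.IGNORECASE)

    def extract_aligned_pairs(aligned_segments):
        # aligned_segments: iterable of (english_text, arabic_text) tuples
        for english, arabic in aligned_segments:
            for x, y in WHETHER_X_OR_Y.findall(english):
                yield (x, y), arabic       # keep the Arabic side for inspection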


Table 7.9: Matching antonym pairs from the English and Arabic concordances generated by the parallel concordance (the Arabic equivalents are not reproduced here)

English Pattern
whether true or false
whether governmental or non-governmental
whether local or foreign
whether permanent or temporary
whether corporeal or incorporeal
whether local or foreign
whether ordinary or special
whether regional or global
whether religious or racial
whether theocratic or secular

Another similar tool in Sketch Engine is the Bilingual Word Sketch. The tool is corpus-based and automatically displays a summary of the grammatical behaviour and collocations of the input word. It uses parallel corpora and bilingual dictionaries to show word equivalences in two languages. Word Sketch extracts a word's contexts from a parallel corpus that match a predefined set of grammatical relations in one language, and the results show every word's collocations obtained from the corpus. The most important feature is the Translation Button, which expands the analysis on parallel and comparable corpora[1] to other chosen languages. The output of the Translation Button suggests up to ten candidates for the input word obtained from corpus-based translation. For example, the word hot has been translated to ten Arabic words, as shown in table 7.10. [1] A comparable corpus differs from a parallel corpus in that it consists of texts from the same domain in several languages. For example, Wikipedia is a comparable corpus as it provides the same information in different languages.


Figure 7.2: Word Sketch screen output for the word rich

Studies have been conducted on using the Word Sketch Translation Button in language translation, such as Baisa et al. (2014) and Kilgarriff (2013), but none has evaluated Arabic translation via Word Sketch. We consider our simple evaluation positive and promising for future work. The majority of the translation candidates suggested by the tool are valid, but they are attached with determiners (e.g., the definite article al-) and are not lemmatized, as shown in table 7.10. Another problem we encountered is choosing the right antonym pair: table 7.10 shows a list of unclassified canonical antonym, non-canonical antonym, and non-antonym pairs. Therefore, we need to classify these candidates into their theoretical classes as described in chapter 2. We stopped at this point and suggest this topic for future research.


Table 7.10: Translation Button Arabic outputs for the word pair (hot/cold): ten Arabic translation candidates for each of cold and hot, several of which are inflected or attached with the determiner al- (the Arabic script is not reproduced here)

7.6 Discussion

We conducted three experiments on extracting Arabic antonymy. First, we applied our model for antonym extraction and classification to the Arabic ArTenTen corpus. Then, we experimented with the Arabic and English parallel concordances in Sketch Engine to extract equivalent Arabic and English antonym candidates. The last experiment used the Bilingual Word Sketch, which suggests some valid translation candidates for English input. Although the proposed antonymy model outperformed the previous attempts at antonym extraction with an accuracy over 50%, the model performance was affected by the poor representation and coverage of Arabic antonyms in the OMW. The automatic training set labelling failed to generate enough examples of positive and negative instances, and we therefore used manual labelling to get enough examples. In contrast, the parallel concordance output was positive and showed matches between English and Arabic antonym pairs and patterns. The limitation of this approach is that a parallel concordance displays its results at paragraph level, and thus we had to manually extract pattern and pair candidates. The last tool was the Bilingual Word Sketch, which displays translation candidates based on parallel corpus pattern analysis. The tool suggests some valid but unlemmatized Arabic translations. Therefore, a selection strategy is needed to choose the right translation and remove wrong candidates.


7.7 Summary

In this chapter, we addressed several issues in the Arabic WordNet and suggested solutions to improve its antonymy representation. We used a lemmatizer tool (MADAMIRA) and a diacritics remover to match pairs extracted from the corpus against WordNet pairs. We showed that the majority of antonymy examples in the OMW are non-canonical due to the poor relation representation; thus, the training set labelling was affected by the poor level of antonymy representation in the Arabic WordNet. In addition, we evaluated several approaches for extracting Arabic antonymy, including the antonymy classifier, parallel corpus extraction and bilingual Word Sketch translation. Although the proposed antonymy classifier yielded a prediction accuracy over 50%, the implementation had several limitations due to insufficient training instances. We then investigated two further approaches, the parallel concordance and Word Sketch, which both use a parallel corpus to generate antonym equivalents in English and Arabic.

Chapter 8

Conclusion and Future Work

8.1 Conclusion

This thesis aimed to capture antonym word pairs from English and Arabic corpora. It presented a novel pair vector representation model based on automatically extracted and classified symmetric antonymy patterns. The generated pair vectors were combined with a supervised machine learning algorithm to classify word pairs as examples of either an antonymy or a non-antonymy relation. This work also illustrated the inability of standard vector space representations to model antonymy. Current vector space models such as BOW and Word Embeddings are designed for the purpose of capturing word similarity and synonymy; to date, these models cannot distinguish between synonyms and antonyms. The underlying hypothesis in vector space models is the semantic distribution of words, represented as the co-occurrence frequency between every pair of words in the corpus. To overcome the issue of capturing antonymy, we proposed a pair vector space model which represents word pairs instead of single words. It also represents co-occurrence frequency, but between a pair and a set of patterns (see the sketch below). Our results showed that combining pair vector representation with Naive Bayes or Neural Network models improved antonymy prediction and obtained an average accuracy of 80% on the classification tasks conducted on the BNC and SkELL corpora, and an average of 60% on the TURN and LZQZ test sets.
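A minimal sketch of the pair vector space: each row is one candidate word pair, each column one selected pattern, and each cell the pair-pattern co-occurrence frequency in the corpus. The cooccurrence lookup is an assumed interface, not a function from the thesis.

    import numpy as np

    def build_pair_vectors(pairs, patterns, cooccurrence):
        # cooccurrence(pair, pattern) -> corpus frequency of the pair
        # occurring inside the pattern's X and Y slots.
        X = np.zeros((len(pairs), len(patterns)))
        for i, pair in enumerate(pairs):
            for j, pattern in enumerate(patterns):
                X[i, j] = cooccurrence(pair, pattern)
        return X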

We contributed to capturing novel antonym pairs from the English corpora BNC and SkELL and the Arabic corpus ArTenTen. The proposed model is able to find novel antonym pairs that are not found in the Open Multilingual WordNet, such as (input/output) and (ancient/modern) (see tables 6.6, 6.7 and 6.8). The model showed positive results on classifying corpus word pairs. The SkELL corpus yielded 3,011 word pairs, of which our antonymy classifier predicted 360 as antonyms; among these, 137 word pairs are new and not represented in the WordNet. The BNC corpus yielded 838 word pairs, of which 188 were classified as antonyms; 71 of these antonym pairs are new and not represented in the WordNet. In addition, the ArTenTen corpus generated 3,000 Arabic word pairs; only 52 pairs were classified as antonyms, and five of these antonym examples are represented in the WordNet (see table 7.7). We addressed in chapter 4 the problem of labelling opposite relations in the training set. We found that WordNet labels many examples of antonym word pairs such as (international/national) but fails to label opposite word pairs such as (internal/international). Chapter 4 illustrated our opposite hypothesis, which states that if there are two antonymy pairs (A/B) and (C/D), where A and C are semantically related and B and D are semantically related, then the two pairs (A/D) and (C/B) are semantically opposite. Applying the opposite hypothesis produced many correctly labelled opposite pairs such as (international/domestic) and (national/global), along with a few incorrectly labelled opposite pairs such as (subject/international). The main limitation we encountered in labelling opposite word pairs was the absence of a gold standard dataset of opposite examples against which to measure and compare the labelling accuracy.
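The opposite hypothesis can be sketched directly as a cross-combination over known antonym pairs; the semantic relatedness test is an assumed external predicate (for example, WordNet synonymy or distributional similarity).

    def propose_opposites(antonym_pairs, related):
        # For antonym pairs (a, b) and (c, d): if a and c are related and
        # b and d are related, propose (a, d) and (c, b) as opposites.
        proposals = set()
        for a, b in antonym_pairs:
            for c, d in antonym_pairs:
                if (a, b) != (c, d) and related(a, c) and related(b, d):
                    proposals.add((a, d))
                    proposals.add((c, b))
        return proposals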

Moreover, this work contributed to discovering new antonymy patterns. Chapter 5 demonstrated the research model for extracting and classifying textual patterns hosting antonymy and opposite pairs. The pattern model extracted textual patterns that had previously been discovered by a manual extraction approach (Jones, 2003). In addition, it extracted novel antonymy patterns which had not been discovered in previous studies, such as (however X or Y, minimizing X and maximizing Y, what is X and what is Y?). In chapter 5, we also presented our proposed pattern classifier, which classifies every extracted pattern into one of three classes based on the pattern's association with positive and negative word pairs. A classified pattern can be positive, showing more association with positive instances; negative, showing more association with negative word pairs; or neutral, showing similar association with both. The output reveals that part of the extracted and selected patterns had been discovered by Jones (2003) and Davies (2012) and reported in their studies, while another part of the selected patterns is new and not yet discovered by previous work. We also contributed a language-adaptive antonymy model. The proposed supervised Antonymy Classifier can be applied to both high-resourced languages such as English and low-resourced languages such as Arabic. The model contributes to enriching lexical resources in low-resourced languages with antonymy examples captured from an unlabelled corpus under the distant supervision of the Open Multilingual WordNet. In addition, we presented several experiments on extracting Arabic antonyms in chapter 7. We used the vast ArTenTen corpus, but the implementation had several limitations. First, the representation of antonymy relations in the Arabic WordNet differs from the English WordNet, which meant that only a small number of word pairs could be labelled. Despite the large number of antonym pairs available in the NLTK Arabic WordNet, only a few examples could be used. This had a major impact on our model because we aimed to adhere to an automatic approach to labelling the extracted word pairs; thus, we manually labelled a small number of positive and negative examples to carry on with the model implementation. Although the proposed antonymy model outperformed the previous attempts at antonymy extraction with an accuracy over 50%, the model performance was affected by the poor representation and coverage of Arabic antonymy in the Open Multilingual WordNet.

The automatic training set labelling failed to generate enough examples of positive and negative instances, and we therefore used manual labelling to get enough examples. In contrast, the parallel concordance output was positive and showed matches between English and Arabic antonym pairs and patterns. The limitation of this approach is that a parallel concordance displays its results at paragraph level, and thus we had to manually extract pattern and pair candidates. Overall, we found that a pattern-pair vector space representation improved the prediction of the antonymy relation. However, this work faced several challenges and limitations, summarised in the following points:

• WordNet Limitations This work relied on the WordNet's lexical semantic relations. WordNet has many issues in representing antonymy and opposite relations in both its English and Arabic versions. We can summarise these issues in two points: first, the representation of antonymy in the WordNet affected the labelling output, as the WordNet does not represent non-canonical antonymy and opposite relations. Second, a large number of antonymous word pairs extracted from the corpus are not found in the WordNet. Therefore, the limitations in opposite relations or links and in antonymy examples have strongly affected the performance of our antonymy classifier.

• Insufficient co-occurrence of word pairs The model is based on co-occurring word pairs, and if two antonymous or opposite words do not frequently occur together in the corpus sentences, then the chance of extracting useful features and patterns to classify is very low. In this work, corpus size has the main impact on the classification output. As table 6.5 shows, only 548 high-frequency pairs (14%) out of 3,849 extracted pairs were classified as antonyms. The remaining


3,301 pairs could not be classified due to insufficient instances of the pairs in the corpus. This is caused by the absence of word co-occurrence evidence, itself a consequence of the corpus size limitation. Unfortunately, this work is unable to predict the relation of word pairs that do not have sufficient sentences and patterns in the corpus; therefore, some opposite word pairs with low co-occurrence frequency are classified as negative.

• Antonymy Evaluation issues Another limitation of this work is the validation process for the classified antonyms. Part of the classified word pairs were found in the WordNet, which gives a positive validation of the classifier. The WordNet validated 340 instances as antonyms, but 208 instances were new and not yet encoded in the WordNet. A sample of these not-yet-validated antonym pairs is presented in table 6.6. We manually validated these 208 instances and found that many are correct, such as (bright/dark) and (classical/modern), while some results are incorrect, such as (quick/easy) and (due/unexpected). In tables 6.7 and 6.8, we compared the classified antonyms with their near-antonyms in the WordNet. For example, the pair (ancient/modern) was not found in the WordNet, but its members already have antonym pairs such as (ancient/present) and (modern/non-modern). We also used our labelling method explained in chapter 4 to label these pairs. For example, the labelling method can infer that (ancient/modern) is opposite if there is a synonym relation between the words (ancient/non-modern) and between (present/modern). However, we still encountered incorrect pairs such as (deaf/blind), (quick/easy) and more. It is important to validate the classification output, especially when using it to extend lexical resources such as the WordNet. However, we stopped at this point and leave the work of antonymy validation for the future. We suggest, for example, improving our proposed labelling


heuristic to get better validation results. The validation could also be done automatically through crowdsourcing systems, for example a social game that makes the players choose the right opposite words.

8.2 Future work

One possible area for future work is to use multilingual parallel corpora to extract antonyms from different languages simultaneously. Generating antonym candidates from different languages is useful for extending multilingual resources such as the Open Multilingual WordNet. Another future topic on antonyms is to capture ancillary antonymy occurring within sentences. The studies by Jones (2003) and Davies (2012) showed that some contrastive constructions can host canonical antonyms or trigger opposition between non-opposite words. Jones (2003) proposed the ancillary antonymy pattern, which hosts an antonym pair and triggers another, non-opposite pair in the same sentence. For instance, the sentence (I love to cook, but I hate doing the dishes) hosts the antonym pair (love/hate) and triggers the non-opposite pair (cook/doing the dishes). Building a computational model to detect these types of opposition has many applications in Natural Language Processing, such as detecting contradiction between phrases. Finding trending antonym pairs used in social text also has many applications in sentiment analysis. For example, extracting antonym pairs used socially or obtained from Twitter would generate novel pairs such as (brexiteer/remainer), (vegan/vegetarian) and (vegetarian/non-vegetarian) that do not exist in the WordNet. One more possible focus for future work is to enhance search engines by making them aware of opposition and negation relations between query tokens. Current Information Retrieval models match a query's keywords against a document's keywords (Baeza-Yates et al., 2011; Page et al., 1999), and today's search engines are not aware of negators such as (not/without) among the query keywords. For example,

consider the query (hotel without swimming pool): Google fetches websites ranked by relevance. The top three matched websites are titled:

• The 10 Best United Kingdom Hotels with Infinity Pools of 2019 ..

• Top 10 Liverpool Hotels with a Swimming Pool 2018..

• hotels with swimming pools that will dazzle you..

Another example is the query (worst stand-up specials on Netflix) in Bing, which retrieves a list of ranked websites containing opposite terms, such as the following titles:

• 50 Best Stand-up Comedy Specials on Netflix (April 2018)..

• Netflix: the 10 best stand-up specials — Metro News..

• The 25 Best Stand-Up Comedy Specials on Netflix Instant ...

• The best stand-up comedy specials on Netflix ...

In fact, we did not find related research or studies on the influence of antonymy and negation on retrieving relevant documents in a search engine. It is probably a hard problem for search engines to integrate a query antonym-detection system that works across different languages. But one suggestion to maintain good matching and partly solve the problem is to apply negation detection: if a negator such as not precedes a query token X, as in (not X), then find its opposite Y (a toy sketch is shown below). Therefore, we suggest this topic for other PhD students and researchers in the field of Natural Language Processing to work on.
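A toy sketch of the suggested negation handling at query time: if a negator precedes a token with a known opposite, the (negator, token) bigram is rewritten to that opposite. The antonym lookup table is an assumed input, for example one produced by an antonymy classifier.

    NEGATORS = {'not', 'without', 'no'}

    def rewrite_negated_query(tokens, antonym_of):
        rewritten, skip_next = [], False
        for i, token in enumerate(tokens):
            if skip_next:
                skip_next = False
                continue
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            if token in NEGATORS and nxt in antonym_of:
                rewritten.append(antonym_of[nxt])   # replace 'not X' by opposite Y
                skip_next = True
            else:
                rewritten.append(token)
        return rewritten

    # e.g. rewrite_negated_query(['hotel', 'not', 'expensive'], {'expensive': 'cheap'})
    # returns ['hotel', 'cheap']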

References

Abbas, N., Aldhubayi, L., Al-Khalifa, H., Alqassem, Z., Atwell, E., Dukes, K., Sawalha, M. & Sharaf, A.B.M. (2013). Unifying linguistic annotations and ontologies for the Arabic Quran. In Proceedings of WACL-2 Second Workshop on Arabic Corpus Linguistics, The University of Leeds.

Abderrahim, M.A., Abderrahim, M.E.A. & Chikh, M.A. (2013). Using Arabic WordNet for semantic indexation in information retrieval system. International Journal of Computer Science Issues (IJCSI), 10. arXiv preprint arXiv:1306.2499.

Abouenour, L., Bouzoubaa, K. & Rosso, P. (2008). Improving Q/A using Arabic WordNet. In Proc. The 2008 International Arab Conference on Information Technology (ACIT’2008), Tunisia, December, 16–18.

Abouenour, L., Bouzoubaa, K. & Rosso, P. (2010). Using the Yago ontology as a resource for the enrichment of Named Entities in Arabic WordNet. In Proceedings of The Seventh International Conference on Language Resources and Evaluation (LREC 2010) Workshop on Language Resources and Human Language Technology for Semitic Languages, 27–31.

Abouenour, L., Bouzoubaa, K. & Rosso, P. (2013). On the evaluation and improvement of Arabic WordNet coverage and usability. Language resources and evaluation, 47, 891–917.

Agichtein, E. & Gravano, L. (2000). Snowball: extracting relations from large plain-text collections. In Proceedings of the fifth ACM conference on Digital libraries, 85–94, ACM.


Al-Sulaiti, L. & Atwell, E.S. (2006). The design of a corpus of contemporary Arabic. International Journal of Corpus Linguistics, 11, 135–171.

Al-Thubaity, A., Khan, M., Al-Mazrua, M. & Al-Mousa, M. (2013). New language resources for Arabic: corpus containing more than two million words and a corpus processing tool. In International Conference on Asian Language Processing (IALP), 2013, 67–70, IEEE.

Aldhubayi, L. & Alyahya, M. (2014). Automated Arabic antonym extraction using a corpus analysis tool. Journal of Theoretical & Applied Information Technology, 70.

Alfaifi, A., Atwell, E. & Hedaya, I. (2014). Arabic learner corpus (ALC) v2: a new written and spoken corpus of Arabic learners. In Proceedings of Learner Corpus Studies in Asia and the World 2014, vol. 2, 77–89, Kobe International Communication Center.

AlHedayani, R. (2016). Antonymy in Modern Standard Arabic. Ph.D. thesis, University of Sussex.

Alkhalifa, M. & Rodríguez, H. (2009). Automatically extending NE coverage of Arabic WordNet using Wikipedia. In Proc. of the 3rd International Conference on Arabic Language Processing CITALA2009, Rabat, Morocco, 23–30.

Alosaimy, A.M.S. (2018). Ensemble Morphosyntactic Analyser for Classical Arabic. Ph.D. thesis, University of Leeds.

Alrabiah, M., Al-Salman, A., Atwell, E. & Alhelewh, N. (2014). KSUCCA: a key to exploring Arabic historical linguistics. International Journal of Computational Linguistics (IJCL), 5, 27–36.

Alyahya, M., Almalak, S. & Aldhubayi, L. (2016). Ontological Lexicon Enrichment: The Badea System For Semi-Automated Extraction Of Antonymy Relations From Arabic Language Corpora. Malaysian Journal of Computer Science, 29, 56–73.


Bach, N. & Badaskar, S. (2007). A review of relation extraction. Literature review for Language and Statistics II, 2.

Baeza-Yates, R., Ribeiro, B.d.A.N. et al. (2011). Modern information retrieval. New York: ACM Press; Harlow, England: Addison-Wesley.

Baisa, V. & Suchomel, V. (2014). SkELL: Web interface for English language learning. In Eighth Workshop on Recent Advances in Slavonic Natural Language Processing, 63–70.

Baisa, V., Jakubíček, M., Kilgarriff, A., Kovář, V. & Rychlý, P. (2014). Bilingual word sketches: the translate button. In Proceedings of the 16th EURALEX International Congress, 505–513.

Banko, M. & Brill, E. (2001). Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th annual meeting of the Association for Computational Linguistics, 26–33, Association for Computational Linguistics.

Batista, D.S., Martins, B. & Silva, M.J. (2015). Semi-supervised bootstrapping of relationship extractors with distributional semantics. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 499–504.

Batita, M.A. & Zrigui, M. (2017). The enrichment of Arabic WordNet antonym relations. In International Conference on Computational Linguistics and Intelligent Text Processing, 342–353, Springer.

Belinkov, Y., Habash, N., Kilgarriff, A., Ordan, N., Roth, R. & Suchomel, V. (2013). ArTenTen: a new, vast corpus for Arabic. Proceedings of WACL, 20.

Benaissa, B.e., Bouchiha, D., Zouaoui, A. & Doumi, N. (2015). Building ontology from texts. Procedia Computer Science, 73, 7–15.


Black, W., Elkateb, S., Rodriguez, H., Alkhalifa, M., Vossen, P., Pease, A. & Fellbaum, C. (2006). Introducing the Arabic WordNet project. In Proceedings of the third international WordNet conference, 295– 300.

Bond, F. & Foster, R. (2013). Linking and extending an open multilingual WordNet. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 1352–1362.

Brini, W., Ellouze, M., Trigui, O., Mesfar, S., Belguith, L.H. & Rosso, P. (2009). Factoid and definitional Arabic question answering system. Post-Proc. NOOJ-2009, Tozeur, Tunisia, June, 8–10.

Charles, W.G. & Miller, G.A. (1989). Contexts of antonymous adjectives. Applied Psycholinguistics, 10, 357–375.

Cover, T.M. & Thomas, J.A. (1991). Entropy, relative entropy and mu- tual information. Elements of information theory, Chapter 2, 1–55.

Cruse, D.A. (1986). Lexical semantics. Cambridge University Press, United Kingdom.

Davidov, D. & Rappoport, A. (2006). Efficient unsupervised discovery of word categories using symmetric patterns and high frequency words. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, 297–304, Association for Computational Linguistics.

Davies, M. (2012). Oppositions and ideology in news discourse. A&C Black, London.

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Le, Q.V., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., & Ng, A.Y. (2012). Large scale distributed deep networks. In Advances in neural information processing systems, 1223–1231.


Dongen, S. (2000). A cluster algorithm for graphs. Technical Report INS-R0010, National Research Institute for Mathematics and Computer Science (CWI), Amsterdam.

Elkateb, S., Black, W., Rodríguez, H., Alkhalifa, M., Vossen, P., Pease, A. & Fellbaum, C. (2006a). Building a WordNet for Arabic. In Proceedings of The fifth international conference on Language Resources and Evaluation (LREC 2006), 22–28.

Elkateb, S., Black, W., Vossen, P., Farwell, D., Rodríguez, H., Pease, A. & Alkhalifa, M. (2006b). Arabic WordNet and the challenges of Arabic. In Proceedings of Arabic NLP/MT Conference, London, United Kingdom.

Farber, B., Freitag, D., Habash, N. & Rambow, O. (2008). Improving NER in Arabic using a morphological tagger. In LREC.

Fellbaum, C. (1995). Co-occurrence and antonymy. International journal of lexicography, 8, 281–303.

Firth, J.R. (1957). A synopsis of linguistic theory, 1930-1955. Studies in linguistic analysis.

Goldberg, Y. & Levy, O. (2014). word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722 .

Habash, N., Rambow, O. & Roth, R. (2009). MADA+TOKAN: A toolkit for Arabic tokenization, diacritization, morphological disambiguation, PoS tagging, stemming and lemmatization. In Proceedings of the 2nd international conference on Arabic language resources and tools (MEDAR), Cairo, Egypt, vol. 41, 62.

Harabagiu, S., Hickl, A. & Lacatusu, F. (2006). Negation, contrast and contradiction in text processing. In Association for the Advancement of Artificial Intelligence, vol. 6, 755–762.

Harris, Z.S. (1954). Distributional structure. Word, 10, 146–162.


Hasegawa, T., Sekine, S. & Grishman, R. (2004). Discovering relations among named entities from large corpora. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, 415–423, Association for Computational Linguistics.

Hearst, M.A. (1992). Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on Computational linguistics-Volume 2 , 539–545, Association for Computational Linguistics.

Jackson, M.T. (1988). Analysis of tongue positions: Language-specific and cross-linguistic models. The Journal of the Acoustical Society of America, 84, 124–143.

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P. & Suchomel, V. (2013). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL, 125–127.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In European conference on machine learning, 137–142, Springer.

Jones, S. (2003). Antonymy: a corpus-based perspective. Routledge, United Kingdom.

Justeson, J.S. & Katz, S.M. (1991). Co-occurrences of antonymous adjectives and their contexts. Computational Linguistics, 17, 1–19.

Kempson, R.M. (1977). Semantic theory. Cambridge University Press, United Kingdom.

Kilgarriff, A. (2013). Terminology finding, parallel corpora and bilingual word sketches in the Sketch Engine. In Proceedings of ASLIB 35th Translating and the Computer Conference, 28–29.

Kilgarriff, A., Rychly, P., Smrz, P. & Tugwell, D. (2004). The Sketch Engine. Information Technology Research Institute Technical Report Series, 105, 105–116.


Lahbib, W., Bounhas, I., Elayeb, B., Evrard, F. & Slimani, Y. (2013). A Hybrid Approach for Arabic Semantic Relation Extraction. In FLAIRS Conference, 315–320.

LeCun, Y., Bengio, Y. & Hinton, G. (2015). Deep learning. Nature, 521, 436–444.

Leech, G. (1974). Semantics. A Pelican original, Penguin.

Leech, G.N. (1992). 100 million words of English: the British National Corpus (BNC).

Lehrer, A. & Lehrer, K. (1982). Antonymy. Linguistics and philosophy, 5, 483–501.

Li, L., Qin, B. & Liu, T. (2017). Contradiction detection with contradiction-specific word embedding. Algorithms, 10.

Lin, D. & Pantel, P. (2001). DIRT–discovery of inference rules from text. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, 323–328, ACM.

Lin, D., Zhao, S., Qin, L. & Zhou, M. (2003). Identifying synonyms among distributionally similar words. In International Joint Conference on Artificial Intelligence IJCAI , vol. 3, 1492–1493.

Lobanova, A. & Bouma, G. (2010). Using a treebank for finding opposites. Ninth International Workshop on Treebanks and Linguistic Theories, 139–150.

Lobanova, A., Van der Kleij, T. & Spenader, J. (2010). Defining antonymy: A corpus-based study of opposites by lexico-syntactic patterns. International Journal of Lexicography, 23, 19–53.

Loper, E. & Bird, S. (2002). NLTK: the Natural Language ToolKit. arXiv preprint cs/0205028 .

Lyons, J. (1978). Semantics. Philosophy, 53, 421–423.


Mahyoub, F.H., Siddiqui, M.A. & Dahab, M.Y. (2014). Building an Arabic sentiment lexicon using semi-supervised learning. Journal of King Saud University-Computer and Information Sciences, 26, 417–424.

Menai, M.E.B. (2012). Detection of plagiarism in Arabic documents. International Journal of Information Technology and Computer Science, 10, 80–89.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, 3111–3119.

Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D. & Miller, K.J. (1990). Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3, 235–244.

Minocha, A., Reddy, S. & Kilgarriff, A. (2014). Feed Corpus: an ever growing up-to-date corpus.

Mintz, M., Bills, S., Snow, R. & Jurafsky, D. (2009). Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, 1003–1011, Association for Computational Linguistics.

Mohammad, S., Dorr, B. & Hirst, G. (2008). Computing word-pair antonymy. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 982–991, Association for Computational Linguistics.

Murphy, G.L. & Andrew, J.M. (1993). The conceptual basis of antonymy and synonymy in adjectives. Journal of memory and language, 32, 301–319.


Murphy, M.L. (2003). Semantic relations and the lexicon: Antonymy, synonymy and other paradigms. Cambridge University Press.

Murphy, M.L. (2006). Antonyms as lexical constructions: or, why paradigmatic construction is not an oxymoron. Constructions, 1–37.

Murphy, M.L. & Jones, S. (2008). Antonyms in children's and child-directed speech. First Language, 28, 403–430.

Nghiem, M.Q. & Ananiadou, S. (2018). APLenty: annotation tool for creating high-quality datasets using active and proactive learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 108–113.

Nguyen, K.A., Schulte im Walde, S. & Vu, N.T. (2017). Distinguishing antonyms and synonyms in a pattern-based neural network. arXiv preprint arXiv:1701.02962.

Nguyen, T.H. & Grishman, R. (2015). Relation extraction: Perspective from convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, 39–48.

Ono, M., Miwa, M. & Sasaki, Y. (2015). Word Embedding-based Antonym Detection using Thesauri and Distributional Information. In Annual Conference of the North American Chapter of the Association for Computational Linguistics HLT-NAACL, 984–989.

Page, L., Brin, S., Motwani, R. & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the Web. Tech. rep., Stanford InfoLab.

Pantel, P. & Pennacchiotti, M. (2006). Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, 113–120, Association for Computational Linguistics.


Paradis, C. & Willners, C. (2006). Antonymy and negation, the boundedness hypothesis. Journal of Pragmatics, 38, 1051–1080.

Pasha, A., Al-Badrashiny, M., Diab, M.T., El Kholy, A., Eskander, R., Habash, N., Pooleery, M., Rambow, O. & Roth, R. (2014). MADAMIRA: A Fast, Comprehensive Tool for Morphological Analysis and Disambiguation of Arabic. In The International Conference on Language Resources and Evaluation LREC, vol. 14, 1094–1101.

Pearce, M. (2008). Investigating the collocational behaviour of MAN and WOMAN in the BNC using Sketch Engine. Corpora, 3, 1–29.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V. et al. (2011). Scikit-learn: Machine learning in Python. Journal of machine learning research, 12, 2825–2830.

Ravichandran, D. & Hovy, E. (2002). Learning surface text patterns for a question answering system. In Proceedings of the 40th annual meeting on Association for Computational Linguistics, 41–47, Association for Computational Linguistics.

Rodríguez, H., Farwell, D., Farreres, J., Bertran, M., Alkhalifa, M., Martí, M.A., Black, W., Elkateb, S., Kirk, J., Pease, A., Vossen, P. & Fellbaum, C. (2008). Arabic WordNet: Current state and future extensions. In Proceedings of The Fourth Global WordNet Conference, Szeged, Hungary, 387–405.

Rychlý, P. & Kilgarriff, A. (2007). An efficient algorithm for building a distributional thesaurus (and other Sketch Engine developments). In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, 41–44, Association for Computational Linguistics.

Salton, G. (1971). The SMART retrieval system—Experiments in automatic document processing. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.


Santus, E., Lu, Q., Huang, C.R. et al. (2014a). Taking antonymy mask off in vector space. In Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing, 135–144.

Santus, E., Lu, Q., Lenci, A. & Huang, C. (2014b). Unsupervised antonym-synonym discrimination in vector space.

Schwartz, R., Reichart, R. & Rappoport, A. (2015). Symmetric pattern based word embeddings for improved word similarity prediction. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, 258–267.

Seretan, V. (2011). Syntax-based collocation extraction, vol. 44. Springer Science & Business Media, Netherlands.

Sharoff, S. (2009). Arabic Internet Corpora (Querying Arabic Corpora).

Snow, R., Jurafsky, D. & Ng, A.Y. (2005). Learning syntactic patterns for automatic hypernym discovery. In Advances in neural information processing systems, 1297–1304.

Socher, R., Huval, B., Manning, C.D. & Ng, A.Y. (2012). Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, 1201–1211, Association for Computational Linguistics.

Suchanek, F.M., Kasneci, G. & Weikum, G. (2007). Yago: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web, 697–706, ACM.

Sun, A. & Grishman, R. (2010). Semi-supervised semantic pattern discovery with guidance from unsupervised pattern clusters. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, 1194–1202, Association for Computational Linguistics.


Tiedemann, J. (2012). Parallel data, tools and interfaces in OPUS. In Lrec, vol. 2012, 2214–2218.

Turney, P.D. (2002). Thumbs up or Thumbs down?: semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th annual meeting on Association for Computational Linguistics, 417–424, Association for Computational Linguistics.

Turney, P.D. (2006). Similarity of semantic relations. Computational Linguistics, 32, 379–416.

Turney, P.D. (2008). A uniform approach to analogies, synonyms, antonyms, and associations. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1, 905–912, Association for Computational Linguistics.

Turney, P.D. (2011). Analogy perception applied to seven tests of word comprehension. Journal of Experimental & Theoretical Artificial Intelligence, 23, 343–362.

Turney, P.D. & Littman, M.L. (2003). Measuring praise and criticism: Inference of semantic orientation from association. ACM Transactions on Information Systems (TOIS), 21, 315–346.

Turney, P.D. & Pantel, P. (2010). From frequency to meaning: Vector Space Models of semantics. Journal of artificial intelligence research, 37, 141–188.

Turney, P.D., Littman, M.L., Bigham, J. & Shnayder, V. (2003). Combining independent modules to solve multiple-choice synonym and analogy problems. arXiv preprint cs/0309035 .

Wang, W., Thomas, C., Sheth, A. & Chan, V. (2010). Pattern-based synonym and antonym extraction. In Proceedings of the 48th Annual Southeast Regional Conference, 64–68, ACM.


Xu, K., Feng, Y., Huang, S. & Zhao, D. (2015). Semantic Relation Classification via Convolutional Neural Networks with Simple Negative Sampling. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 536–540, Association for Computational Linguistics, Lisbon, Portugal.

Yates, A., Cafarella, M., Banko, M., Etzioni, O., Broadhead, M. & Soderland, S. (2007). TextRunner: open Information Extraction on the Web. In Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, 25–26, Association for Computational Linguistics.

Yih, W.t., Zweig, G. & Platt, J.C. (2012). Polarity inducing latent semantic analysis. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 1212–1222, Association for Computational Linguistics.

Zelenko, D., Aone, C. & Richardella, A. (2003). Kernel methods for relation extraction. Journal of machine learning research, 3, 1083–1106.

Appendix A

Tables

A.1 Pattern Classifier Results (BNC)

Table A.1: Pattern classifier output on BNC word pairs

Extracted Pattern               Association score   Class
were X or Y                     4.17                positive
of both X and Y                 3.86                positive
X rather than Y                 3.46                positive
X as well as Y                  3.32                positive
any X or Y                      3.17                positive
for both X and Y                3.17                positive
from both X and Y               3.17                positive
combination of X and Y          3                   positive
differences between X and Y     3                   positive
range of X and Y                2.81                positive
whether X or Y                  2.75                positive


Table A.2: Pattern classifier output on BNC word pairs

Extracted Pattern               Association score   Class
either X or Y                   2.74                positive
from X or Y                     2.66                positive
between the X and Y             2.58                positive
mixture of X and Y              2.58                positive
distinction between X and Y     2.58                positive
no X or Y                       2.52                positive
more X than Y                   2.5                 positive
bring X and Y                   -1                  negative
showing a X and Y               -1                  negative
will be X to the Y              -1                  negative
including a X Y                 -1                  negative
freely X to the Y               -1                  negative
perfectly X and Y               -1                  negative
have a X spread across the Y    -1                  negative
a X spread across the Y         -1                  negative
while a X or Y                  -1                  negative
emphasise the X and Y           -1                  negative
centres of X and Y              -1.58               negative



A.2 Pattern Classifier Results (SkELL corpus)

Table A.3: Pattern classifier output on SkELL word pairs

Extracted Pattern               Association score   Class
of X versus Y                   4.25                positive
in a X but Y                    3.32                positive
corresponding X and Y           3.32                positive
necessarily X or Y              3                   positive
the X principles to Y           2.81                positive
very X or very Y                2.58                positive
development of X and Y          2.58                positive
scores X or Y                   2.58                positive
in the X rather than Y          2.58                positive
alternately X and Y             2.58                positive
between X versus Y              2.58                positive
affects X and Y                 2.58                positive
more X and less Y               2.46                positive
of both X and Y                 2.32                positive
simultaneously X and Y          2.32                positive
potential X or Y                2.32                positive
what is X or Y                  -1                  negative
in the X way that Y             -1                  negative
the X rights as any Y           -1                  negative
the X category as Y             -1                  negative
working with X and Y            -1                  negative
questions of X and Y            -1                  negative
the many X and Y                -1                  negative
the X privileges as Y           -1                  negative
the X way as for Y              -1                  negative
along with X and Y              -1                  negative
their own X and Y               -1                  negative
the X term or the Y             -1                  negative

A.3 A comparison between Arabic corpora available in Sketch Engine

Arabic Learner Corpus (ALC): 362,712 words; written and spoken Arabic; lemma: no; PoS: yes; Alfaifi et al. (2014)

Arabic Web: 150,282,522 words; MSA and dialect texts; lemma: no; PoS: yes; Sharoff (2009)

Arabic Web 2012: 7,475,624,779 words; crawled by SpiderLing in January 2012; lemma: no; PoS: yes; Belinkov et al. (2013)

KSUCCA: 46,705,577 words; Classical written Arabic; lemma: yes; PoS: yes; Alrabiah et al. (2014)

OPUS2: 300,000,057 words; parallel with 40 languages; lemma: no; PoS: yes; Tiedemann (2012)

Quran: 128,243 words; unvowelled Arabic; lemma: yes; PoS: yes; Abbas et al. (2013)

Quran: 128,241 words; vowelled Arabic; lemma: yes; PoS: yes; Abbas et al. (2013)

Time Stamped JSI web corpus: 1,743,895,700 words; news articles in the period 2014–2018; lemma: no; PoS: yes; Minocha et al. (2014)
