
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

Choosing the most reasonable split of a compound word using Wikipedia

YVONNE LE

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Choosing the most reasonable split of a compound word using Wikipedia

YVONNE LE ([email protected])

Master’s Thesis at CSC
Supervisor: Olov Engwall ([email protected])
Examiner: Viggo Kann ([email protected])

Abstract

The purpose of this master’s thesis is to make use of the category taxonomy of Wikipedia to determine the most reasonable split from the suggestions generated by an independent compound word splitter. The articles a word was found in can be seen as a group of contexts the word can occur in and also as different representations of the word, i.e. an article is a representation of the word. Instead of only analysing the data of each single article, the intention is to find more data for each representation/context to perform an analysis on. The idea is to expand each article representing one context by including related articles in the same category. Two perceptions of a ”reasonable split” were studied. The first case was a split consisting of only two parts and the second case a split into an unlimited number of parts. The approach is well suited for choosing the correct split out of several suggestions but unsuitable for identifying compound words: it would more often than not decide not to split a compound word, and it is very dependent on the compound words appearing in Wikipedia.

Referat

Choosing the most reasonable split of a compound word with the help of Wikipedia

The purpose of this degree project is to choose the most reasonable split of a compound word using Wikipedia’s category taxonomy. Suggestions for different splits are generated by an independent, pre-existing algorithm. The articles a word occurs in can be seen as a group of contexts the word can appear in and as different representations of the word. The intention is to find more data for each representation/context to perform an analysis on, instead of only analysing the article the word was found in. The idea to be tested is to expand each article representing a context by including related articles in the same category. Two different views of ”reasonable splits” were studied: in the first case compound words were split into exactly two parts, in the second case into an unrestricted number of parts. The method turned out to excel at choosing the correct split whenever it made an attempt. A major drawback was that it often chose not to split compounds even though it should have. The method is highly dependent on the compounds appearing in Wikipedia.

Contents

1 Introduction 1
  1.1 Research Question ...... 1
  1.2 Objective ...... 1
  1.3 Delimitations ...... 2

2 Background 3
  2.1 Early history and progress ...... 3
  2.2 Application ...... 3
    2.2.1 Machine Translation ...... 3
    2.2.2 Information Retrieval ...... 4
  2.3 Linguistics ...... 5
    2.3.1 Semantic classification ...... 5
      2.3.1.1 Copulative ...... 5
      2.3.1.2 Determinative ...... 5
      2.3.1.3 Endocentric ...... 5
      2.3.1.4 Exocentric ...... 6
    2.3.2 How much should be split? ...... 6
    2.3.3 Where to split? ...... 6
    2.3.4 What is a reasonable split? ...... 7
  2.4 Compound splitting methods ...... 7
    2.4.1 Statistical methods ...... 7
      2.4.1.1 Component frequencies ...... 8
      2.4.1.2 n-grams ...... 8
      2.4.1.3 Part of speech combinations ...... 9
      2.4.1.4 Parallel corpus ...... 9
    2.4.2 Linguistic methods ...... 9
      2.4.2.1 Semantic vector space ...... 9

3 Resources 11
  3.1 Findwise’s compound splitter ...... 11
  3.2 Search platforms ...... 12
    3.2.1 Elastic ...... 12
    3.2.2 Solr ...... 12
  3.3 Wikipedia ...... 13
  3.4 Choice ...... 14

4 Methodology 15
  4.1 Concept ...... 15
  4.2 Architecture ...... 16
  4.3 Implementation ...... 17
    4.3.1 Split suggestions retrieval ...... 17
    4.3.2 Articles retrieval ...... 17
    4.3.3 Computation ...... 17
  4.4 Evaluation metrics ...... 17
    4.4.1 Terms ...... 18
    4.4.2 Metrics ...... 18
  4.5 Test data ...... 18
    4.5.1 Den stora svenska ordlistan (DSSO) ...... 18
    4.5.2 Svenska akademiens ordlista (SAOL) ...... 19
    4.5.3 Annotators ...... 19
  4.6 Data sets ...... 19
    4.6.1 Data set 1 ...... 20
    4.6.2 Data set 2 ...... 20
    4.6.3 Data set 3 ...... 20
    4.6.4 Data set 4 ...... 20

5 Experiment 21
  5.1 Language analyzers ...... 21
  5.2 Redo the expansion ...... 21
  5.3 No stemming as base index ...... 21
  5.4 Merge words ...... 22

6 Results 23
  6.1 Case 1 ...... 23
    6.1.1 No stemming ...... 23
    6.1.2 Swedish light ...... 24
    6.1.3 Swedish ...... 24
    6.1.4 Comparison against baseline ...... 24
  6.2 Case 2 ...... 25
    6.2.1 No stemming ...... 26
    6.2.2 Swedish light ...... 26
    6.2.3 Swedish ...... 26
    6.2.4 Comparison against baseline ...... 27

7 Discussion 29
  7.1 Test case 1 ...... 29
    7.1.1 Optimizations ...... 29
    7.1.2 Comparison to baseline ...... 29
  7.2 Test case 2 ...... 30
    7.2.1 Optimization ...... 30
    7.2.2 Comparison to baseline ...... 30
  7.3 Strengths ...... 30
  7.4 Weaknesses ...... 31
  7.5 Further work ...... 31

8 Conclusion 33

Bibliography 35

Chapter 1

Introduction

Swedish is a compounding language, allowing you to form a new word by joining a theoretically unlimited number of words together. The words are joined without blank spaces and are sometimes connected with joining morphemes. Being able to split compound words is a powerful resource in natural language processing, e.g. to expand queries in information retrieval or in statistical machine translation. Depending on the application, one may want to split into a different number of parts. For statistical machine translation one may want to do an aggressive splitting of ”bordtennisspelare” (table tennis player) into ”bord tennis spelare” (table tennis player), enabling one-by-one translation of all the constituents. However, for information retrieval one may be satisfied with ”bordtennis spelare”, because querying ”bordtennis” (table tennis) is more relevant than querying ”bord” (table). How can one find the most reasonable split? Previous research has studied a window of words appearing in the same document [23]; this study is a continuation, studying an expanded context using the free encyclopedia Wikipedia. Wikipedia has a taxonomy system in which every article belongs to at least one category [27]. The intention is to group together other articles in the same category and study the new, expanded context with more data instead of only the data in a single article.

1.1 Research Question

Is it possible to choose the most reasonable split using an expanded context generated by Wikipedia’s category taxonomy?

1.2 Objective

Findwise has a compound word splitter for the Swedish language that generates possible splits. The purpose of this master’s thesis is to make use of Wikipedia’s category taxonomy to determine the most reasonable split among the suggestions. The method will be tested on two perceptions of a reasonable split. The goal is to study the method and its performance from these two points of view.


1.3 Delimitations

This method will only be tested for the Swedish language and the time complexity will not be taken into account. The method will only work for compound words which appear in the Swedish Wikipedia, and it will not be able to split words which Findwise’s compound word splitter fails to split.

Chapter 2

Background

This chapter will present the history, state-of-the-art and application of automatic compound splitting. It will also present the relevant theory.

2.1 Early history and progress

Research in compound analysis has been around for a long time in several languages. Fujisaki et al. [11] experimented with applying a probabilistic grammar to Japanese noun compounds; Rackow et al. [21] used a monolingual corpus to find appropriate English translations of German noun compounds, and Lauer [14] used a statistical approach in his publication. Numerous studies have since been carried out over the years, studying statistical, linguistic and other approaches, and even different combinations of them. The state of the art studies the semantic vector space using word embeddings. Both Dima and Hinrichs [6] as well as Soricut and Och [25] have studied this approach with positive results.

2.2 Application

As mentioned in the introduction, compound splitting is a powerful technique in natural language processing. Two important use cases are machine translation and information retrieval. This section will explain the details.

2.2.1 Machine Translation

Machine translation is a field which investigates the use of computers to automate translation from one language to another [12]. The methods are heavily dependent on data of varying types, e.g. corpora or rules. Compound splitting can improve translations by providing more information about a word for a more accurate analysis. Furthermore, it provides the system with an additional chance to analyse compound words which do not occur in the bilingual corpora, dictionary or other text sources used.


Statistical machine translation (SMT) is a method which analyses bilingual text corpora to build statistical models that transform text from the source language to the target language. What it can translate is therefore limited by the text corpora: it would never be able to translate a word which did not appear in them. The same holds for many other methods, such as direct translation using a dictionary. However, compound splitting would be able to break an unknown compound word into smaller known constituents. Presume the Swedish compound word ”favoritmånad” (favourite month) did not occur in the corpora but ”favorit” (favourite) and ”månad” (month) did. The translation system would fail to translate the compound as a whole, but it would be able to acquire a translation by performing compound splitting and then finding the proper translation for ”favorit” (favourite) and ”månad” (month). Compound splitting is not only useful when attempting to translate a word, it is also useful in the pre-processing work. Applying compound splitting to the corpora before analysing them would give more data to work with and let the system learn common patterns. The corpora with compound splitting can be a complement to the original corpora.
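The fallback described above can be illustrated with a small lookup procedure. This is a minimal sketch under assumed data: the bilingual lexicon and the naive splitter below are toy placeholders, not the components of any real SMT system.

```python
# Sketch: translate an out-of-vocabulary compound by splitting it into known constituents.
# The lexicon and the splitter are toy placeholders (assumptions), not a real SMT system.

lexicon = {"favorit": "favourite", "månad": "month"}  # bilingual lexicon learned from corpora

def two_part_splits(word, vocabulary):
    """Return all two-part splits of `word` whose parts are both known words."""
    return [(word[:i], word[i:]) for i in range(1, len(word))
            if word[:i] in vocabulary and word[i:] in vocabulary]

def translate(word):
    if word in lexicon:                                   # known word: translate directly
        return lexicon[word]
    for first, last in two_part_splits(word, lexicon):    # fall back to the constituents
        return f"{lexicon[first]} {lexicon[last]}"
    return word                                           # give up: copy the source word

print(translate("favoritmånad"))  # -> "favourite month"
```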

2.2.2 Information Retrieval

Query expansion is a process carried out to expand the original query to get improved retrieval performance in information retrieval. Query expansion can be used to get both better precision and better recall. Better precision means better accuracy of relevant documents in the retrieved result. Better recall means increasing the total number of relevant retrieved documents.

In some texts the same concept may be referred to using different words, especially in languages with compounding. An example is the Swedish word ”vinterskor” (winter shoes) and the Swedish phrase ”skor man har på vintern” (shoes you use during winter). They both cover the same concept and are both relevant documents to retrieve. By only using the search query ”vinterskor” we would only find documents explicitly containing the word ”vinterskor” and not the other phrase. However, by splitting the compound word, expanding the query with the constituents and weighting them, the other relevant documents would be found and retrieved as well.

The previous example used compound splitting on the query, but it can also be used on the documents. If we were to query ”skor” (shoes) we would be interested in all kinds of shoes, but only documents containing the word ”skor” explicitly would be retrieved. That would exclude all the documents describing different types of shoes such as ”vintersko” (winter shoe) or ”fotbollssko” (football shoe). By using compound splitting on the documents we would be able to capture them as well.

Compound splitting is thus valuable in both machine translation and information retrieval to get improved results, and it has a positive impact on applications relying on accurate results. It is difficult to discuss its impact from an ethical or sustainability standpoint because it depends on the application that uses it.
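The query-side expansion described above can be sketched in a few lines. The helper split_compound and the weight value are assumptions used purely for illustration; they do not correspond to a specific search engine's API or weighting scheme.

```python
# Sketch: expand a query with the constituents of its compound words at a lower weight.
# `split_compound` is an assumed stand-in for an external compound splitter.

def split_compound(word):
    # placeholder: a real splitter would return the constituents of the compound
    return {"vinterskor": ["vinter", "skor"]}.get(word, [])

def expand_query(query, constituent_weight=0.5):
    """Return (term, weight) pairs: original terms at weight 1.0, constituents lower."""
    expanded = []
    for term in query.split():
        expanded.append((term, 1.0))
        for part in split_compound(term):
            expanded.append((part, constituent_weight))
    return expanded

print(expand_query("vinterskor"))
# -> [('vinterskor', 1.0), ('vinter', 0.5), ('skor', 0.5)]
```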


2.3 Linguistics

Compound words in Swedish are words that consist of two or more independent words [4]. As mentioned in the introduction, words can be joined by joining morphemes such as an ”s”, an extra vowel, letter replacement or letter removal [9]. This section will present an overview of the linguistic theory and explain why compound word splitting is a difficult challenge. It will end with the two definitions of a ”reasonable split” which will be used in this project.

2.3.1 Semantic classification Swedish compound words can be classified into the following four types:

• Copulative

• Determinative

• Endocentric

• Exocentric

A compound word can belong to several types. This section will explain these categories to deepen the understanding of different types of compound words.

2.3.1.1 Copulative

Copulative words are words whose parts are equally weighted [2]. The constituents can usually be shuffled without losing the meaning. Examples of copulative words are colour combinations, e.g. ”blåröd” in Swedish, which is ”blue red” in English. Shuffling the constituents into ”rödblå” (red blue) does not change the meaning.

2.3.1.2 Determinative

Determinative words are words for which the first part determines the second part [17]. An example is the Swedish word ”matsked” (tablespoon), where ”mat” (food) is an attribute of ”sked” (spoon).

2.3.1.3 Endocentric

All endocentric words are determinative. Endocentric words have a head, which is the dominant part of the compound word. The head holds the basic meaning of the whole word, and the whole compound word inherits its inflectional properties from it [3]. An endocentric word is also a hyponym of its head. ”Blackboard” is an example of an English endocentric word where ”board” is the head and ”blackboard” is a hyponym of ”board”. ”Vattensäng” (water bed) is an example of a Swedish endocentric word where ”säng” (bed) is the head and ”vattensäng” is a hyponym of ”säng”.

2.3.1.4 Exocentric

Exocentric words, also known as bahuvrihi words, are words that do not have a head which can represent the whole compound word [3]. Splitting these words results in words which do not provide any semantic context to the original word. ”Skinhead” is an example of an English exocentric word: neither ”skin” nor ”head” can represent the original word alone. ”Fågelskrämma” (scarecrow) is a Swedish example: neither ”fågel” (bird) nor ”skrämma” (scare) can represent the original word alone.

2.3.2 How much should be split?

All compound words can be divided into (at least) two parts, the first element and the last element [19], in Swedish called ”förled” and ”efterled” [4] [1]. A characteristic of compound words is that the first element and the last element can be standalone words, and they can even be compound words themselves. An example is the Swedish compound word ”bordtennisspelare” (table tennis player), which consists of the first element ”bordtennis” (table tennis) and the last element ”spelare” (player). ”Bordtennis” is the first element of that word, but it is itself a compound word, made up of ”bord” (table) and ”tennis” (tennis).

Splitting a word which should not be split is called over splitting. It is a challenge connected with ambiguity: a lot of Swedish words can be interpreted as compound words although they are also non-compound words. All Swedish compound words which are joined together correctly are technically correct, no matter how long they are or what words they are made of. An example is the Swedish word ”sparris” (asparagus). Splitting it into ”sparr is” (spar ice) is technically not wrong, but it is unreasonable since it is not common. One challenge in compound splitting is therefore to know when it is appropriate to stop splitting. In this thesis, technically correct splits will not automatically be evaluated as correct; they must be deemed reasonable by the annotators or an open source word list. More about the evaluation and the definition of reasonable will be explained later in the report.

2.3.3 Where to split?

Ambiguity is a difficult challenge in Swedish compound splitting and is connected with the problem of over splitting mentioned in the previous section. An example is the Swedish word ”matris”. It can be interpreted both as the non-compound word ”matris” (matrix) and as the compound word ”mat ris” (food rice). Another example of ambiguity is the Swedish word ”kamplåtar”. It can be interpreted as a compound word in two ways, either ”kamp låtar” (battle songs) or ”kam plåtar” (comb plates). The absence of context adds an extra layer of difficulty to determining whether a word is a compound word and, if so, which words it is composed of.

2.3.4 What is a reasonable split?

The conclusion is that it is difficult to define ”reasonable”; it depends on context and application. However, a takeaway from this section is that compound words can always be divided into two constituents. This will therefore be one of the cases the algorithm will be tested on: a correct split consisting of two constituents. The second case is to test the algorithm’s performance from the standpoint that compound words can be split into more than two parts. This would be the case if the constituents themselves are compound words. An example which illustrates what would be acceptable in the two cases can be made with the Swedish word ”bordtennisspelare” (table tennis player). The first definition would accept ”bordtennis spelare” (table tennis and player) as a correct split and the second definition would accept ”bord tennis spelare” (table and tennis and player) as a correct split.

Reasonable is defined as the constituents making sense when put together and not changing the meaning of the original compound word (unless it is an exocentric word). The algorithm should not over split ambiguous constituents into smaller parts which change the meaning of the word, e.g. splitting ”tennis” (tennis) into ”tenn is” (tin ice) when splitting ”bordtennis” (table tennis). The second case will be evaluated against splits of the same words made by annotators.

2.4 Compound splitting methods

There are different approaches to compound splitting. The two main approaches are statistical methods and linguistic methods. A combination of them is also quite common to maximise the result. The two approaches will be explained in this section, followed by related work which used them.

2.4.1 Statistical methods

Statistical methods depend on the corpora from which the statistics used in compound splitting are computed. Such a method is therefore more adaptable than a linguistic method: a linguistic method is bound to one language because of the unique language traits used, whereas a statistical method can be applied to other languages by changing the corpora. A disadvantage is that it requires a large and versatile amount of data to cover all the language traits, since the linguistic aspect is ignored.


2.4.1.1 Component frequencies

A frequency based metric is described in Koehn and Knight [13] where the following formula is used to pick the most probable split:

\[
\operatorname*{argmax}_{S} \left( \prod_{p_i \in S} \operatorname{count}(p_i) \right)^{\frac{1}{n}} \tag{2.1}
\]

where:
S = a split suggestion from the set of split suggestions
p_i = part i of the split suggestion
n = number of constituents in the split suggestion

Given the counts of words in the corpus, the split S with the highest geometric mean of the word frequencies of its parts p_i is picked; n is the number of parts in a split. If a word as a whole occurs more frequently than its constituents, the word is kept as a whole. E.g., the English compound word ”wardrobe” would have the split S = ”ward” + ”robe” as one of the split suggestions. The remaining variables would be p_1 = ”ward”, p_2 = ”robe” and n = 2 because the split is made up of two constituents.

A drawback of this approach is that prepositions and determiners will always occur more frequently in the corpus than other words. This means that splits containing prepositions will get a higher score. To exclude these mistakes, knowledge of the part of speech of words is used, and constituents which are not nouns, adverbs, adjectives or verbs are not split off.

Sjöbergh and Kann [23] presented a frequency based metric as well. This approach focused on studying the context the compound word appeared in, namely the 50 words appearing before and the 50 words appearing after the compound word. Constituents occurring closer to the compound word got a higher score. Similar to the approach described by Koehn and Knight [13] mentioned earlier, the split with the highest geometric mean was chosen. The drawbacks of this method were that constituents would rarely appear close to the compound word, and that short constituents, which usually are common words, would appear more often by chance.
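Equation 2.1 is straightforward to implement. The sketch below is illustrative only: the frequency table and the candidate splits are made-up stand-ins, not Koehn and Knight's actual data.

```python
# Sketch of equation 2.1: pick the split with the highest geometric mean of part frequencies.
# `counts` is an assumed word-frequency table computed from a corpus (toy numbers).

counts = {"wardrobe": 120, "ward": 40, "robe": 15}

def geometric_mean_score(split, counts):
    """Geometric mean of the corpus counts of the parts (0 if any part is unseen)."""
    product = 1
    for part in split:
        product *= counts.get(part, 0)
    return product ** (1.0 / len(split))

def best_split(suggestions, counts):
    return max(suggestions, key=lambda split: geometric_mean_score(split, counts))

suggestions = [("wardrobe",), ("ward", "robe")]
print(best_split(suggestions, counts))  # -> ('wardrobe',): the whole word is more frequent
```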

2.4.1.2 n-grams

Sjöbergh and Kann [23] also presented a method of compound splitting that analyses n-grams. They had a list of compound heads and tails, and the frequencies of all character 4-grams in the compound heads and tails (not overlapping a head/tail border) were counted. The frequency data indicates which 4-grams are most common and where a word should therefore not be split. To get a better guess, the frequencies of all character 4-grams containing a suggested split point were added up. This was done for every split suggestion. The suggestion with the lowest sum was then taken to be the split with the highest probability.
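The scoring of a candidate split point can be sketched as below. The 4-gram frequency table is assumed to have been counted from compound heads and tails beforehand; the numbers here are invented for illustration.

```python
# Sketch of the 4-gram heuristic: sum the frequencies of the character 4-grams that cross
# a candidate split point and prefer the split point with the LOWEST sum.
# `fourgram_freq` is an assumed table counted from compound heads and tails (toy numbers).

fourgram_freq = {"ordt": 1, "rdte": 1, "dten": 2, "tenn": 60}

def split_point_score(word, pos, freq):
    """Sum the frequencies of every 4-gram overlapping the boundary word[:pos] | word[pos:]."""
    total = 0
    for start in range(pos - 3, pos):
        if start >= 0 and start + 4 <= len(word):
            total += freq.get(word[start:start + 4], 0)
    return total

def best_split_point(word, candidate_positions, freq):
    return min(candidate_positions, key=lambda pos: split_point_score(word, pos, freq))

# The boundary after "bord" (position 4) beats the boundary after "bordt" (position 5),
# because the common 4-gram "tenn" should not be broken apart.
print(best_split_point("bordtennis", [4, 5], fourgram_freq))  # -> 4
```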


2.4.1.3 Part of speech combinations

Some part of speech combinations are more common than others: more than 25% of compounds are noun-noun combinations, and very few are pronoun-pronoun combinations. By analysing which part of speech combinations were the most common, Sjöbergh and Kann [23] developed a method calculating the probability of such combinations. The algorithm looks at the words in a suggested split and determines their parts of speech. It then selects the split whose part of speech combination has the highest probability.
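A small sketch of this idea follows. The part-of-speech lookup and the combination probabilities below are invented placeholders; Sjöbergh and Kann derived theirs from tagged data.

```python
# Sketch: score split suggestions by the probability of their part-of-speech combination.
# Both tables are assumed toy data; real values would come from a POS-tagged corpus.

pos_of = {"spel": "noun", "spela": "verb", "konsol": "noun"}
combination_prob = {("noun", "noun"): 0.25, ("verb", "noun"): 0.05}

def combination_score(split):
    tags = tuple(pos_of.get(part, "unknown") for part in split)
    return combination_prob.get(tags, 0.0)

suggestions = [("spel", "konsol"), ("spela", "konsol")]
print(max(suggestions, key=combination_score))  # -> ('spel', 'konsol'): noun-noun is most likely
```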

2.4.1.4 Parallel corpus

Koehn and Knight [13] present an approach using a parallel corpus. This works because English compound words are mostly written as separate words joined by spaces, as opposed to Swedish compound words (e.g. ”ishockeyspelare” and ”ice hockey player”). The approach requires a translation lexicon, and the easiest way to obtain one is to learn it from a parallel corpus. The approach translates the constituents of every split; if one of the (translated) splits can be found in the translation lexicon, it is considered a correct split.

Since some words can have different translations depending on the context, this can cause problems. An example mentioned in the paper is the word ”grundrechte” (basic rights). This word should have the split ”grund rechte”. However, since ”grund” usually translates into ”reason” or ”foundation”, the approach will look for ”reason rights” in the translation lexicon. This does not exist, and the correct split ”grund rechte” will therefore be neglected. This can be accounted for by building a second translation lexicon and joining it with the first one. The first translation lexicon was obtained by learning from a parallel corpus without splitting the German compound words; the second one, which complements the first, is obtained by learning from a parallel corpus with the German compound words split using a frequency based approach. The resulting translation lexicon would then learn that ”grund” ↔ ”basic”.

2.4.2 Linguistic methods

A linguistic approach relies on a lexical database, and linguistic knowledge is also applied. As opposed to statistical methods, the data for linguistic methods is enriched with additional information such as part of speech tags or information about different frequencies.

2.4.2.1 Semantic vector space

Daiber, Quiroz, Wechsler and Frank [5] analyse compounds by exploiting regularities in the semantic vector space. They exploit the fact that linguistic items with similar distributions have similar meanings, and their work is based on analogies such as ”king is to man what queen is to woman”. The paper only focuses on endocentric compounds, i.e. compounds whose head describes the basic meaning of the whole word. The head and the whole compound word will therefore be close to each other in the semantic space, and this can be used to make correct splits. This worked well and outperformed a commonly used compound splitter on a gold standard task [20].

Chapter 3

Resources

Apart from the programming itself, several resources were used to reach the project objective, mainly the following:

• Findwise’s compound splitter: To generate split suggestions our algorithm should choose from

• Search platform: To store and fetch articles

• Wikipedia: The articles which will be stored in the search platform and which the algorithm will review

This chapter will present the resources used in this project together with other alternatives. A summary can be found at the end of the chapter stating which resources were chosen.

3.1 Findwise’s compound splitter

Findwise’s compound splitter returns a list of possible splits of a compound word. It is based on a generated text file containing the following data for each word:

• Word in its baseform

• Conjugations not ending a compound word

• Conjugations ending a compound word

The algorithm iterates through a graph built on the information in the text file to return all the possible splits. The compound splitter is also programmed to determine not only the positions to split, but also which part of speech the constituents belong to. Therefore, ambiguous cases can generate multiple suggestions. E.g. ”spelkonsol” (gaming console) will generate both ”spel konsol” (game console) and ”spela konsol” (gaming console) because the word ”spel” is ambiguous and can be both a noun and a verb.


The splitter was evaluated on splitting the compound words in the Den stora svenska ordlistan (DSSO) dictionary. The compound words in the dictionary are specifically marked, and information about the first element and the last element of each compound word is also provided, i.e. the correct split into two constituents, which is our first case to evaluate. Findwise’s compound splitter was tested on splitting words into two constituents. The accuracy was calculated as

\[
\text{Accuracy} = \frac{\text{Correct split}}{\text{Correct split} + \text{Wrong faulty split}} \tag{3.1}
\]

A splitting attempt was considered a ”correct split” if the correct split could be found in the top x suggestions of the returned, sorted list of possible splits. The results were the following (a small evaluation sketch follows the list below):

• The first split suggestion is a correct split - Accuracy: 0.74

• Consider a hit in the top 2 splittings as a correct split - Accuracy: 0.90

• Consider a hit in the top 3 splittings as a correct split - Accuracy: 0.92

• Consider a hit in any splitting as a correct split - Accuracy: 0.93
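The top-x evaluation can be reproduced with a few lines of code. This is a generic sketch; the gold splits and ranked suggestion lists below are placeholders, not the actual DSSO evaluation data.

```python
# Sketch: accuracy where a hit anywhere in the top-k suggestions counts as a correct split.
# `results` pairs each word's gold split with the splitter's ranked suggestions (toy data).

results = [
    (("bord", "tennis"), [("bord", "tennis"), ("bordten", "nis")]),
    (("mat", "sked"),    [("mats", "ked"), ("mat", "sked")]),
]

def top_k_accuracy(results, k):
    hits = sum(1 for gold, ranked in results if gold in ranked[:k])
    return hits / len(results)

for k in (1, 2):
    print(k, top_k_accuracy(results, k))  # -> "1 0.5" then "2 1.0"
```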

3.2 Search platforms

A search platform will be used to store and fetch the articles according to the algorithm presented in this project. The most important fields to be indexed are the article id, category and content for querying purposes. Two alternatives will be presented in this section.

3.2.1 Elastic

Elastic was created in 2012 and is an open-source search platform based on Lucene [7]. Elastic supports aggregations and multitenancy, and has better support for complex and analytical queries compared to Solr.

3.2.2 Solr

Solr was created in 2004 and is also an open-source search platform based on Lucene [16] [10]. Compared to Elastic, which is relatively new, Solr is more mature and stable.


3.3 Wikipedia

Wikipedia was used as the data to study the contexts the words appear in. It is a free encyclopedia created collaboratively by anyone who uses it, and it has grown to become one of the world’s largest encyclopedias since its creation in 2001. The English Wikipedia has as of today (2016-01-21) 5,059,991 articles and is the largest one; the second largest is the Swedish Wikipedia with a total of 2,662,793 articles [28].

Wikipedia has a couple of ways to browse articles and group similar ones. One way is using categories. All articles in Wikipedia belong to at least one category, and categories are intended to group together pages on similar topics [27]. By clicking on different categories one can find similar articles related to the topic from different points of view. E.g. the English article ”Table tennis” has the categories ”Table tennis” and ”Racquet sports” amongst others [30]. They are both related to table tennis but put the topic into different contexts. In the first context table tennis is viewed as the main topic of the category and you will find articles about the equipment used, different play styles and so on. In the other one table tennis is viewed as one type of racquet sport and you will find articles about other types of racquet sports. This can be interpreted as each Wikipedia article being connected to several contexts, where each context takes the form of a group of related articles from a particular point of view. This project will utilise this type of grouping for the context expansion.

Another way of grouping articles is considering all internal links appearing in the same article as a group. Internal links are links to other articles with the aim of allowing readers to deepen their understanding of a topic [31]. Internal links are therefore in some way relevant to the topic since they appear in the main article. However, a drawback is that they all get the same weight, e.g. an internal link about the ”sponge” material in the article about table tennis gets the same relevancy as one about different ”tennis grips”, although the latter may be more specific and relevant to the topic.

Backlinks can also be considered as a group of articles. As mentioned earlier, all Wikipedia articles have internal links linking to other Wikipedia articles. Similarly, they also have a collection of incoming links (links from other Wikipedia articles which link to them), called backlinks. Although all the articles in the backlinks refer to the same article, they may not have discussed the topic in the same context. E.g. the English article ”Table tennis” has the articles ”North Korea”, ”Table (furniture)” and ”Tennis” among its backlinks, and these represent different contexts [29]. The context interpretation discussed in the paragraph about categories is also applicable here, where each backlink is a context. The difference is that a single article is a context instead of a group of articles.

Some articles also have a section with related articles presented in a ”see also” section. E.g. a town can have neighbouring towns in its ”see also” section. It is more common in certain types of articles, and which articles have this section is very inconsistent.

Another possible encyclopedia which could be used is the Swedish Nationalencyklopedin.

It is stated on its website that it has over 200,000 articles [15]. By performing a search on the website, a more exact number can be found: 313,146 articles in total. Many articles have a simplified version called a ”simple” article; the original is called a ”long” article. The simple articles are written with a simple and casual language in mind, without abbreviations, but cover the same topics. The number of articles covering a unique topic is therefore less than 313,146, since there are some duplicates. 72,925 of the articles on Nationalencyklopedin are classified as ”simple” articles and 240,221 as ”long” articles. The articles in Nationalencyklopedin do not belong to any categories but have a list of links to related pages. Nationalencyklopedin also states that the articles cover seven main topics. However, it is not possible to browse through the main topics, and there is no visible information about which main topic an article belongs to. The articles on Nationalencyklopedin are, in contrast to Wikipedia, reviewed by experts. Therefore, although the amount of data is a lot smaller compared to Wikipedia, the data itself is more reliable. The advantage of using Nationalencyklopedin over Wikipedia is thus the quality of the data; the disadvantage is the quantity. Since this project uses a statistical approach, the quantity is more important than the quality of the data.

3.4 Choice

Elastic was chosen as the search platform because of its better support for analytical queries, which could potentially be of use. Wikipedia was chosen because it was the initial intention of the project and because the large amount of data is crucial for a statistical approach.

Chapter 4

Methodology

This chapter gives an overview of the system and what its main parts do. The idea behind the algorithm is explained first, before the details are presented. This is followed by a presentation of the evaluation metrics and the test data.

4.1 Concept

Instead of analysing only the article the compound word occurred in, the intention is to find a larger context for analysis. The articles the word was found in can be seen as representations of the word, i.e. an article is a representation of the word. The idea is to expand each article representing it by including related articles which can be found in the categories it belongs to. An illustration of the expansion process can be seen in figure 4.1; it describes how the representation changes.

Figure 4.1. Illustration of the data retrieval process

1. Get articles: ”String” –> articles.


2. Get categories: article –> categories

3. Get articles: categories –> articles

In the end, each representation of the word, originally represented by one single article, is now a representation made up of a bundle of several articles. The next step is to get the split suggestions from the splitter and count the occurrences of the constituents in the bundles. All constituents occurring in the same bundle indicates that the split suggestion is good enough.

4.2 Architecture

The architecture consists of three parts: Findwise’s compound splitter, a class which communicates with the Elastic server, and a main class which sends data requests to the other components and does the calculation. Figure 4.2 shows a simplified view of it.

Figure 4.2. Simplified overview of system architecture


4.3 Implementation

4.3.1 Split suggestions retrieval

The compound word to be split is sent to the compound splitter and the retrieved results are filtered. The filter removes all suggestions with more than two parts for the first case, and suggestions with only one part (the whole word) for the second case.

4.3.2 Articles retrieval

A single class manages all the requests to the Elastic server. There are two main requests:

Get categories of the base articles: Search for a maximum of n articles which contain the compound word and return all of those articles’ categories. No analysis of the categories’ relevancy is done; the categories are taken in the order in which they are presented by Wikipedia, which is alphabetical. All the categories originating from the same article are bundled together.

Get articles of the categories: Search for and return at most m articles belonging to each category in every category bundle.
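Both requests can be expressed against Elastic's standard REST search endpoint. The sketch below is a minimal illustration using plain HTTP; the index name and the field names (content, categories) are assumptions about how the Wikipedia dump might be indexed, not the project's actual mapping.

```python
# Sketch of the two retrieval requests against an Elastic index of Wikipedia articles.
# The index name and field names are assumptions; adjust them to the actual mapping.
import requests

SEARCH_URL = "http://localhost:9200/wikipedia/_search"

def base_article_categories(word, n=15):
    """Fetch up to n articles containing the word; return one category bundle per article."""
    body = {"size": n, "query": {"match": {"content": word}}}
    hits = requests.post(SEARCH_URL, json=body).json()["hits"]["hits"]
    return [hit["_source"].get("categories", []) for hit in hits]

def articles_in_category(category, m=20):
    """Fetch up to m articles belonging to a category; return their text content."""
    body = {"size": m, "query": {"match": {"categories": category}}}
    hits = requests.post(SEARCH_URL, json=body).json()["hits"]["hits"]
    return [hit["_source"]["content"] for hit in hits]
```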

4.3.3 Computation

A list of integers, one per split suggestion, is initialized first. Every element in the list represents one split suggestion, and the integer is the number of bundles in which all constituents of the split could be found. For each bundle of articles, check whether all parts of a split suggestion can be found in it; if so, increment the counter. When all bundles are done, do the same with the next split suggestion. Lastly, the split suggestion with the highest count in the list is chosen as the best split. This means its constituents could be found together in the largest number of contexts, which implies that the constituents have a strong connection with the original compound word and that it is therefore reasonable to split it this way. If nothing was found, the algorithm will not choose any split suggestion at all and returns the word as a whole.
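The scoring step can be summarised in a short function. This is a simplified sketch of the logic described above rather than the project's actual implementation; here a bundle is simply the concatenated text of the articles expanded from one base article, and constituents are matched naively as substrings.

```python
# Sketch of the scoring step: a split suggestion earns one point per bundle (expanded context)
# in which ALL of its constituents occur; with no evidence at all, the word is kept whole.

def choose_split(word, suggestions, bundles):
    """suggestions: list of tuples of constituents; bundles: list of text blobs."""
    scores = [0] * len(suggestions)
    for text in bundles:
        for i, split in enumerate(suggestions):
            if all(part in text for part in split):   # naive substring matching for brevity
                scores[i] += 1
    if not scores or max(scores) == 0:
        return (word,)                                 # no evidence: keep the word whole
    return suggestions[scores.index(max(scores))]

bundles = ["bordtennis spelas med racket och boll av två spelare",
           "tennis är en racketsport som spelas på en bana"]
print(choose_split("bordtennisspelare",
                   [("bordtennis", "spelare"), ("bord", "tennisspelare")],
                   bundles))                           # -> ('bordtennis', 'spelare')
```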

4.4 Evaluation metrics

The same metrics and definitions used in the paper written by Koehn and Knight [13] will be used for the evaluation. It will first cover the terms used in the metrics and then the metrics themselves.


4.4.1 Terms

• Correct split: Words that should be split and were split correctly.

• Correct non: Words that should not be split and were not.

• Wrong not: Words that should be split but were not.

• Wrong faulty split: Words that should be split, were split, but wrongly (either too much or too little)

• Wrong split: Words that should not be split, but were.

4.4.2 Metrics

\[
\text{Precision} = \frac{\text{Correct split}}{\text{Correct split} + \text{Wrong faulty split}} \tag{4.1}
\]

Precision measures the proportion of correct splits among all attempted splits.

\[
\text{Recall} = \frac{\text{Correct split}}{\text{Correct split} + \text{Wrong faulty split} + \text{Wrong not split}} \tag{4.2}
\]

Recall measures the proportion of correct splits among all the words that should be split.

\[
\text{Accuracy} = \frac{\text{Correct}}{\text{Correct} + \text{Wrong}} \tag{4.3}
\]

Accuracy measures the proportion of correct decisions (correct splits and correct non-splits).
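For completeness, the three metrics can be computed directly from the five counts defined above. A minimal sketch, assuming the counts have already been tallied:

```python
# Sketch: precision, recall and accuracy (equations 4.1-4.3) from the five outcome counts.

def metrics(correct_split, correct_non, wrong_non, wrong_faulty_split, wrong_split):
    precision = correct_split / (correct_split + wrong_faulty_split)
    recall = correct_split / (correct_split + wrong_faulty_split + wrong_non)
    correct = correct_split + correct_non
    wrong = wrong_non + wrong_faulty_split + wrong_split
    accuracy = correct / (correct + wrong)
    return precision, recall, accuracy

# Example: the "Default" row of table 6.1 gives roughly (0.89, 0.62, 0.63).
print(metrics(618, 195, 305, 77, 92))
```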

4.5 Test data

The test cases are a combination of data sets from different sources. The sources used were DSSO, Svenska akademiens ordlista (SAOL) and data from annotators. This section covers the sources the data was collected from and the test cases used. The algorithm is dependent on Findwise’s compound word splitter generating splits. Since this is an evaluation of the algorithm and not of Findwise’s splitter, all words which the splitter could not split were filtered away, and the words tested are only those which Findwise’s splitter managed to split.

4.5.1 Den stora svenska ordlistan (DSSO)

DSSO was used to collect compound words. It is a word list created collaboratively by its users [18] [26]. The same file of DSSO as used by Findwise in their evaluation was used in this project. The file consists of a total of around 17,000 marked compound words together with the correct split into two parts.


4.5.2 Svenska akademiens ordlista (SAOL)

SAOL was used to collect non-compound words. This was done manually, by going through every page of the word list and randomly adding words that were not marked as compound words. A couple of words were chosen from every page if possible, in an attempt to include words starting with every letter of the alphabet and with a letter combination distribution similar to the Swedish language the dictionary represents. 515 non-compound words in total were collected through this method.

4.5.3 Annotators

The third source was two subsets of the compound words from DSSO, annotated by annotators. Each subset consists of 500 words, and each subset was annotated by 5 different annotators. The annotation was carried out by handing the annotators a list of 500 words with the instruction to split the words as much as possible while the meaning of the resulting ”group of words” still remains the same as that of the original compound word, according to the annotators themselves. The key word should not be split. E.g. ”långdistansförhållande” (long distance relationship) should be split aggressively into ”lång distans förhållande” (long distance relationship), whereas ”bordtennisbord” (table tennis table) should only be split into ”bordtennis bord” (’table tennis’ table), since splitting the latter into three constituents would not retain the original key word ”bordtennis”. The annotators were free to put 0, 1 or more marks where they thought the word should be split. The two groups of 5 annotators each were handed two different word lists; the first one consists of compound words which occur in Wikipedia and the second one of words that do not. For the evaluation phase, a split was marked as correct if at least 2 annotators had agreed on it.

4.6 Data sets

This section presents the data sets used. A Wikipedia filter was applied to some of the data before the words were shuffled and randomly selected into a data set. This filter removed all words which could not be found in any Wikipedia article. It was only applied to some test cases, in order to be able to evaluate the algorithm’s maximum potential. Data sets 1 and 2 were used to evaluate the first case and data sets 3 and 4 the second case. The first case is a split into two constituents and the second case is a split into as many constituents as possible while still retaining the meaning of the word. All the words used in data sets 1 and 2 were checked with Findwise’s splitter so that it could generate at least one split suggestion of two parts.


4.6.1 Data set 1

Data set 1 contains 1000 already annotated compound words from DSSO and 287 non-compound words from SAOL. This data set only contains words that occur in Wikipedia.

4.6.2 Data set 2

Data set 2 contains 1000 already annotated compound words and 300 non-compound words.

4.6.3 Data set 3

Data set 3 contains 500 compound words annotated by 5 annotators and 287 non-compound words from SAOL. This data set only contains words that occur in Wikipedia.

4.6.4 Data set 4

Data set 4 contains 500 compound words annotated by 5 annotators and 300 non-compound words from SAOL.

Chapter 5

Experiment

This chapter covers the experiments performed, which were applied to some of the test cases.

5.1 Language analyzers

The purpose of testing different language analyzers was to compare the impact they had. Elastic’s Swedish language analyzer based on the Snowball stemmer [8] [24], the light stemmer created by Jacques Savoy [22] and no stemming were tested.

5.2 Redo the expansion

There is a high risk that the compound word will not be found in Wikipedia, especially for longer and more unusual words. Some words have split suggestions which have at least one word in common. This can be exploited by redoing the search on that dominant word. In this way, articles that are relevant but do not contain the whole word may be found, so that at least some context is obtained.
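Finding the dominant word amounts to intersecting the constituent sets of all suggestions. A minimal sketch of that step, using made-up suggestions:

```python
# Sketch: find a constituent shared by all split suggestions, so the Wikipedia search can be
# redone on that word when the full compound is not found in any article.

def dominant_word(suggestions):
    """Return a constituent common to every suggestion, or None if there is none."""
    common = set(suggestions[0])
    for split in suggestions[1:]:
        common &= set(split)
    return max(common, key=len) if common else None   # prefer the longest shared part

print(dominant_word([("favorit", "månad"), ("favorit", "må", "nad")]))  # -> 'favorit'
```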

5.3 No stemming as base index

The task of a stemmer is to reduce a word to its root form, e.g. ”player” and ”plays” both have the root form ”play”. The first 15 base articles which will be expanded are crucial for finding relevant data to base the compound splitting on. Querying the root form returns more articles but also more irrelevant articles. E.g. querying the root form ”play” instead of ”player” would also return articles that do not mention the word ”player”, increasing the recall but decreasing the precision. Since we are only using 15 base articles, the precision is more important than the recall. Using no stemming at all to retrieve the first 15 base articles was therefore tested, in an attempt to increase the number of relevant articles to expand the context on.


5.4 Merge words

Because the external compound splitter is programmed to determine the part of speech as well, it will generate different representations of the same split suggestion, as mentioned earlier. E.g. the word ”lekplats” (playground) has the split suggestions ”lek plats” (play ground) and ”leka plats” (to play ground) because of ambiguity: ”lek” can mean both the noun ”play” and the verb ”to play”. However, instead of choosing one randomly when there is a tie between two suggestions, we choose the one which matches the original word exactly when the constituents are merged together. In this case, ”lek plats” would be chosen because it is identical to the original word when the constituents are put together.
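The tie-breaking rule can be expressed compactly: among equally scored suggestions, prefer the one whose concatenated constituents reproduce the original word. A small illustrative sketch:

```python
# Sketch of the "merge" tie-break: among top-scoring suggestions, prefer the one whose
# constituents, when concatenated, equal the original compound word.

def break_tie(word, tied_suggestions):
    for split in tied_suggestions:
        if "".join(split) == word:
            return split
    return tied_suggestions[0]          # no exact match: fall back to the first suggestion

print(break_tie("lekplats", [("leka", "plats"), ("lek", "plats")]))  # -> ('lek', 'plats')
```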

Chapter 6

Results

This chapter presents the results from the experiments and is divided into two parts. The first part covers the first case and the second part the second case. The terms and metrics were explained in section 4.4.

6.1 Case 1

The algorithm is only allowed to split the compound word into two constituents or not at all. The first row shows the result of the default algorithm without any optimization. The second and last rows show the results of the experiment when two optimization options each were enabled.

• First row - Default algorithm.

• Second row - Stemming is initially disabled as described in 5.3 and the algorithm is allowed to perform an additional attempt on a new query as described in 5.2.

• Last row - The algorithm is allowed to perform an additional attempt on a new query as described in 5.2 and should choose the split suggestion which is closest to the original compound word when the constituents are put together as described in 5.4.

6.1.1 No stemming

Experiment on the index without stemming. In this case, the optimization of initially disabling stemming (second row) is cancelled out because no stemming is used on the index at all. Data set 1 was tested.


                     Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
                     split    non      non    split         split
Default              618      195      305    77            92     0.89       0.62    0.63
Base index and Redo  660      176      197    143           111    0.82       0.66    0.65
Redo and Merge       671      176      197    132           111    0.84       0.67    0.66

Table 6.1. The results for data set 1 tested on the algorithm without a stemmer for case 1.

6.1.2 Swedish light

Experiment on the index with stemming using the Swedish light analyzer created by Jacques Savoy [22]. This stemmer does not stem as aggressively as Elastic’s Swedish stemmer. Data set 1 was tested.

                     Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
                     split    non      non    split         split
Default              473      187      385    142           100    0.77       0.47    0.51
Base Index and Redo  619      120      124    247           167    0.71       0.63    0.58
Redo and Merge       701      120      134    165           167    0.81       0.70    0.63

Table 6.2. The results for data set 1 tested on the algorithm with the Swedish light language analyzer for case 1.

6.1.3 Swedish

Experiment on the index with stemming using the Swedish stemmer based on the Snowball stemmer [8] [24]. This stemmer uses a more aggressive approach when stemming, resulting in better recall but worse precision. More (relevant and irrelevant) data is collected with this stemmer. Data set 1 was tested.

                     Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
                     split    non      non    split         split
Default              448      210      433    119           166    0.79       0.45    0.51
Base Index and Redo  634      127      124    242           160    0.72       0.69    0.63
Redo and Merge       713      127      124    163           160    0.81       0.71    0.62

Table 6.3. The results for data set 1 tested on the algorithm with Elastic’s Swedish language analyzer for case 1.

6.1.4 Comparison against baseline

Comparison of the best combination of optimizations (or lack thereof) for each stemming option against the top suggestion from Findwise’s compound splitter. Data set 2 was tested.


             Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
             split    non      non    split         split
No stemming  530      189      331    139           111    0.80       0.53    0.55
Light        584      133      273    143           167    0.80       0.58    0.55
Swedish      596      140      266    138           160    0.81       0.59    0.57
Findwise     778      0        0      222           300    0.78       0.78    0.59

Table 6.4. The results for data set 2 tested on Findwise’s compound word splitter versus this project’s method.

The largest impact of an optimization can be seen in table 6.2, where the number of correct splits increased from 473 to 619, and in table 6.3, where the number of wrong nons decreased from 433 to 124. However, this optimization also decreased the number of correct nons. The precision was high in all the cases, meaning that the algorithm often chose the right split when it had to choose. However, recall and accuracy were lower, ranging from 0.51 to 0.71. This was caused by the high number of wrong nons and wrong splits.

6.2 Case 2

The second part covers the second case, where there is no limit to the number of parts. The first row shows the result of the default algorithm without any optimization. The second and last rows show the results of the experiment when optimization options were used.

• First row - Default algorithm.

• Second row - Stemming is initially disabled as described in 5.3.

• Last row - The algorithm is allowed to perform an additional attempt on a new query as described in 5.2 and 5.4.

25 CHAPTER 6. RESULTS

6.2.1 No stemming

Experiment on the index without stemming. In this case, the optimization of initially disabling stemming (second row) is cancelled out because no stemming is used on the index at all. Data set 3 was tested.

            Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
            split    non      non    split         split
Default     344      193      131    25            94     0.93       0.68    0.62
Base Index  344      193      131    25            94     0.93       0.68    0.62
Redo        365      175      101    34            112    0.91       0.73    0.67

Table 6.5. The results for data set 3 tested on the algorithm with no stemmer for case 2.

6.2.2 Swedish light

Experiment on the index with stemming using the Swedish light analyzer created by Jacques Savoy [22]. This stemmer does not stem as aggressively as Elastic’s Swedish stemmer. Data set 3 was tested.

            Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
            split    non      non    split         split
Default     347      184      115    38            103    0.90       0.69    0.67
Base Index  362      125      101    37            162    0.91       0.72    0.62
Redo        374      118      77     49            169    0.89       0.75    0.63

Table 6.6. The results for data set 3 tested on the algorithm with the Swedish light language analyzer for case 2.

6.2.3 Swedish

Experiment on the index with stemming using the Swedish stemmer based on the Snowball stemmer [8] [24]. This stemmer uses a more aggressive approach when stemming, resulting in better recall but worse precision. More (relevant and irrelevant) data is collected with this stemmer. Data set 3 was tested.

            Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
            split    non      non    split         split
Default     366      307      87     47            80     0.89       0.73    0.73
Base Index  371      143      95     34            144    0.92       0.74    0.65
Redo        377      123      73     50            164    0.88       0.75    0.64

Table 6.7. The results for data set 3 tested on the algorithm with Elastic’s Swedish language analyzer for case 2.


6.2.4 Comparison against baseline

Comparison of the best optimization option (or lack thereof) for each stemming option against the top suggestion from Sjöbergh and Kann’s compound splitter. Data set 4 was tested.

               Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
               split    non      non    split         split
No stemming    182      195      307    11            105    0.94       0.36    0.47
Light          184      155      287    29            145    0.86       0.37    0.60
Swedish        187      158      292    21            142    0.90       0.37    0.43
Sjöbergh Kann  441      290      32     27            10     0.94       0.88    0.91

Table 6.8. The results for data set 4 tested on the algorithm developed by Sjöbergh and Kann for case 2 [23].


The number of wrong nons in test case 2 was significantly lower by default compared to test case 1. This had a large impact on the precision, which landed at around 0.90 for all cases.

Similarly to test case 1, applying a base index and redo increased the number of correct splits. However, the impact on wrong nons was not as large as in test case 1, probably because of the low number even before applying anything, as mentioned in the previous paragraph.

The precision landed at around 0.90 for all cases, whereas recall and accuracy ranged between 0.62 and 0.75. The algorithm is good at choosing the correct split but less good at identifying compound words. More often than not it would rather not split a word because the conditions were too weak, i.e. some but not all of the constituents could be found in any bundle of articles.

The developed algorithm had slightly lower precision than Sjöbergh and Kann’s, but the recall and accuracy were much lower. Sjöbergh and Kann’s recall was 0.88 while this project’s algorithm landed at around 0.37. Sjöbergh and Kann’s accuracy was 0.91 while this project’s algorithm ranged between 0.43 and 0.60.

Chapter 7

Discussion

7.1 Test case 1

7.1.1 Optimizations

Having a non-stemmed base index and redoing the retrieval for compound words that were not found had a large positive impact on the number of correct splits and wrong nons for test case 1. This is an effect of a higher precision of relevant articles to expand and of the increased quantity of retrieved articles and data. Redoing the retrieval on a substring increased the chance of finding some relevant articles instead of doing no analysis at all and wrongly marking the word as a non-compound word. However, this optimization also decreased the number of correct nons. Although the algorithm got better at correctly splitting more compound words, it also got worse at judging when not to split. The increased amount of data improved the amount of relevant data but also the amount of less relevant data, which got the same weight.

The number of wrong faulty splits decreased when applying ”merge”. Many suggestions were split at the same position, but the constituents were different forms of the same words and had the same score because of the stemming. An example is the suggestions for the compound word ”lekplats” (playground), which included ”leka plats” (to play ground) and ”lek plats” (play ground). The first suggestion would be chosen without merge; with merge, ”lek plats” would be chosen instead, which is the correct one. Applying ”merge” increased the chance of choosing noun-noun words, which are also statistically more common.

7.1.2 Comparison to baseline

No filtering out of words which could not be found in Wikipedia was done when comparing the algorithm against the baseline. Findwise’s default compound word splitter was used as the baseline for test case 1.

Findwise’s compound word splitter beat the algorithm in all metrics but precision. This is partly because the splitter assumes that all words are compound words

and should be split; it therefore does not judge whether a word should be split or not. This results in the splitter having 0 correct nons and 0 wrong nons. The number of compound words is much higher than the number of non-compound words, the ratio being approximately 3:1, which makes the metrics misleading and partial to a splitter that attempts to split all the words.

In conclusion, Findwise’s default splitter was better at splitting based on the quantity of correctly split compound words, but the algorithm developed in this thesis had a higher precision and chose the correct split more often. If both splitters were to split the same document, Findwise’s compound splitter would return more correctly split words but also more incorrect splits. The thesis’ splitter would return fewer correct splits but also fewer incorrect splits. The document split by this thesis’ algorithm would therefore be more understandable, although not as many compound words would be split.

7.2 Test case 2

7.2.1 Optimization

The results on test case 2 were slightly better than on test case 1. However, that was also because the criterion for a correct split was much looser: there could be several correct suggestions (all suggestions which at least two annotators considered correct were judged as correct), hence there was a higher chance of scoring a correct one. This explains the low number of wrong nons.

One major difference from test case 1 was that although the number of correct splits increased after applying redo, the precision decreased. A common factor was that the number of wrong splits also increased. So, although redoing the search on a substring of the original word increased the quantity of correct splits, the algorithm also got worse at identifying words that should not be split. As mentioned in test case 1, this is due to the increased amount of data and the fact that all data (both very relevant and less relevant) gets the same weight.

7.2.2 Comparison to baseline

The performance of this thesis’ splitter was worse compared to Sjöbergh and Kann’s. Since this data set did include words that do not appear in Wikipedia, a lot of words were returned as a whole, which partially caused the low recall and accuracy. Only around 40% of the compound words were correctly identified as compound words. The number of wrong faulty splits was otherwise low, and the precision was equal to or slightly worse than Sjöbergh and Kann’s.

7.3 Strengths

The algorithm had a high precision on the test cases and is good at choosing the correct split from a set of suggestions, thanks to the possibility of analysing context. Even though many suggestions are technically correct, it was able to choose the most common interpretation and therefore the most reasonable one.


7.4 Weaknesses

Since the algorithm performed well on the task of choosing the correct split (shown by the high precision), the major bottleneck was that it was bad at identifying compound words. This can be seen by comparing the number of wrong nons with the number of wrong faulty splits: the number of wrong nons was always higher. The most extreme case is the last comparison against the baseline for test case 2, where there were 307 wrong nons against 11 wrong faulty splits using the "No stemming" index.
A large reason for the high number of wrong nons was that the algorithm marks a word as non-compound if the compound word is not found in Wikipedia. Many of the compound words were not found in Wikipedia, and this decision affected the result negatively. Trying to split all words, even those not found in Wikipedia, would increase the number of correct splits but at the same time increase the number of wrong splits (splitting non-compound words).

7.5 Further work

To counter the issue of not being able to find relevant contexts/articles, further work should focus on studying methods to capture such contexts. This would decrease the number of wrong nons and improve the low recall and accuracy, since many of the errors arose from not finding any article containing the word. Methods worth studying to find more and better contexts include improving the data source, using synonyms, and studying different substrings of the word that can capture relevant context.
Improving the data source can be done by switching to or adding other encyclopedias, e.g. Nationalencyklopedin, which was mentioned in the background. Another suggestion is to change the way the context is expanded. In this project the order of the base articles and categories was left unchanged from the retrieval and was therefore alphabetical. An improvement would be to first sort them based on some criterion, e.g. popularity or size. This would probably increase the relevancy of the data and give the algorithm more accurate data to base the compound splitting decision on. Expanding by backlinks instead of categories is also worth trying. Since many compound words are rare, especially long ones, studying different substrings of the compound word in combination with synonyms could increase the number of found articles for rare words.
Further work on improving the precision can be done by giving words different weights. As mentioned earlier, all words get the same weight as long as they are found in the same context/category. A word appearing in the same document as the whole compound word is therefore given the same weight as a word appearing in a different document but in the same category. To improve the analysis, words found closer to the original compound word should be weighted more heavily. Implementing stop words is also a method worth trying.
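A minimal sketch of the proximity-based weighting proposed above. The representation of the expanded context as (word, source) pairs and the concrete weight values are illustrative assumptions, not the thesis' implementation:

    def weighted_counts(context_words, same_document_weight=2.0, same_category_weight=1.0):
        # context_words is assumed to be a list of (word, source) pairs, where
        # source is "article" for words taken from the article the compound word
        # itself was found in, and "category" for words from related articles in
        # the same category. Words closer to the original compound word count for
        # more, instead of every word in the expanded context getting equal weight.
        counts = {}
        for word, source in context_words:
            weight = same_document_weight if source == "article" else same_category_weight
            counts[word] = counts.get(word, 0.0) + weight
        return counts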

Chapter 8

Conclusion

The method developed in this thesis yielded a high precision but low recall and accuracy. It chose the correct split most of the time but had difficulties separating compound words from non-compound words.
A major limitation is that it is highly dependent on the compound words appearing in Wikipedia, which has a large impact on the result when comparing to other methods. The approach is good at identifying the correct split when the compound word is a common word; otherwise there are other algorithms that perform better for other use cases.


Bibliography

[1] Carina Ahlin. Spargris – Är det en gris som man sparar till jul? 2016. url: https://gupea.ub.gu.se/bitstream/2077/29653/1/gupea_2077_29653_1.pdf (visited on 01/24/2016).
[2] Johan van der Auwera and Ekkehard König. The Germanic languages. Routledge, 1994.
[3] Réka Benczes. Creative compounding in English. John Benjamins Pub. Co., 2006, pp. 8–9.
[4] Sven Eriksson and Carl-Johan Markstedt. Svenska Impulser 2 - Språkets byggstenar. 2016. url: http://sanomautbildning.se/upload/SvenskaImpulser2_sid455_474.pdf (visited on 01/24/2016).
[5] Joachim Daiber et al. “Splitting Compounds by Semantic Analogy”. In: arXiv preprint arXiv:1509.04473 (2015).
[6] Corina Dima and Erhard Hinrichs. “Automatic Noun Compound Interpretation using Deep Neural Networks and Word Embeddings”. In: IWCS 2015 (2015), p. 173.
[7] Elastic. Elastic | We’re About Data. 2015. url: https://www.elastic.co/about (visited on 12/22/2015).
[8] ElasticSearch. Stemmer Token Filter. 2016. url: https://www.elastic.co/guide/en/elasticsearch/reference/2.0/analysis-stemmer-tokenfilter.html (visited on 12/20/2016).
[9] Claes-Christian Elert. Ordbildning. 2016. url: https://en.wikipedia.org/wiki/Compound_(linguistics) (visited on 01/23/2016).
[10] The Apache Software Foundation. APACHE SOLR™ 5.4.1. 2016. url: http://lucene.apache.org/solr/ (visited on 01/15/2016).
[11] Tetsu Fujisaki et al. “A probabilistic parsing method for sentence disambiguation”. In: Current Issues in Parsing Technology. Springer, 1991, pp. 139–152.
[12] Daniel Jurafsky and James H. Martin. Speech and Language Processing. Pearson Education, 2009, p. 895. isbn: 9780135041963.


[13] Philipp Koehn and Kevin Knight. “Empirical methods for compound splitting”. In: Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1. Association for Computational Linguistics, 2003, pp. 187–193.
[14] Mark Lauer. “Designing statistical language learners: Experiments on noun compounds”. In: arXiv preprint cmp-lg/9609008 (1996).
[15] Nationalencyklopedin. Uppslagsverket. 2015. url: http://www.ne.se/info/tj%C3%A4nster/uppslagsverket (visited on 12/18/2015).
[16] M. Nayrolles. Mastering Apache Solr: A practical guide to get to grips with Apache Solr. Inkstall Solutions LLC, 2014. isbn: 9788192784502. url: https://books.google.se/books?id=HSWeAwAAQBAJ.
[17] Lena Öhrman. “Felaktigt särskrivna sammansättningar”. In: C-uppsats i datorlingvistik, Institutionen för lingvistik, Stockholms universitet (1998).
[18] Apache OpenOffice. Förbättra vår ordlista. 2016. url: https://www.openoffice.org/sv/dsso.html (visited on 01/26/2016).
[19] Norstedts ordböcker. Ord.se. 2016. url: http://www.ord.se/oversattning/engelska/?s=element&l=ENGSVE (visited on 12/20/2016).
[20] Maja Popović, Daniel Stein, and Hermann Ney. “Statistical machine translation of German compound words”. In: Advances in Natural Language Processing (2006), pp. 616–624.
[21] Ulrike Rackow, Ido Dagan, and Ulrike Schwall. “Automatic translation of noun compounds”. In: Proceedings of the 14th conference on Computational linguistics - Volume 4. Association for Computational Linguistics, 1992, pp. 1249–1253.
[22] Jacques Savoy. “Report on CLEF-2001 experiments: Effective combined query-translation approach”. In: Evaluation of Cross-Language Information Retrieval Systems. Springer Berlin Heidelberg, 2002, pp. 27–43.
[23] Jonas Sjöbergh and Viggo Kann. “Vad kan statistik avslöja om svenska sammansättningar”. In: Språk och Stil 16 (2006), pp. 199–214.
[24] Snowball. Swedish Stemming Algorithm. 2016. url: http://snowball.tartarus.org/algorithms/swedish/stemmer.html (visited on 12/20/2016).
[25] Radu Soricut and Franz Och. “Unsupervised Morphology Induction Using Word Embeddings”. In: Proc. NAACL. 2015.
[26] Språkteknologi.se. Den stora svenska ordlistan, svensk rättstavningslista. 2016. url: http://sprakteknologi.se/resurser/sprakresurser/den-stora-svenska-ordlistan-svensk-raettstavningslista-foer-open-office (visited on 01/26/2016).
[27] sv.wikipedia.org. Kategorier. 2016. url: https://sv.wikipedia.org/wiki/Special:Kategorier (visited on 11/26/2015).


[28] Wikipedia. List of Wikipedias. 2016. url: https://en.wikipedia.org/wiki/List_of_Wikipedias (visited on 01/21/2016).
[29] Wikipedia. Pages that link to "Table tennis". 2016. url: https://en.wikipedia.org/wiki/Special:WhatLinksHere/Table_tennis (visited on 01/10/2016).
[30] Wikipedia. Table tennis. 2016. url: https://en.wikipedia.org/wiki/Table_tennis (visited on 01/10/2016).
[31] Wikipedia. Wikipedia:Manual of Style/Linking. 2016. url: https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Linking (visited on 01/10/2016).
