
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

Choosing the most reasonable split of a compound word using Wikipedia

YVONNE LE

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Choosing the most reasonable split of a compound word using Wikipedia

YVONNE LE ([email protected])

Master’s Thesis at CSC
Supervisor: Olov Engwall ([email protected])
Examiner: Viggo Kann ([email protected])

Abstract

The purpose of this master’s thesis is to make use of the category taxonomy of Wikipedia to determine the most reasonable split from the suggestions generated by an independent compound word splitter. The articles a word was found in can be seen as a group of contexts the word can occur in and also as different representations of the word, i.e. an article is a representation of the word. Instead of only analysing the data of each single article, the intention is to find more data for each representation/context to perform an analysis on. The idea is to expand each article representing one context by including related articles in the same category. Two perceptions of a ”reasonable split” were studied. The first case was a split consisting of only two parts and the second case a split into an unlimited number of parts. The approach is well suited for choosing the correct split out of several suggestions but unsuitable for identifying compound words: it would more often than not decide not to split a compound word, and it is very dependent on the compound words appearing in Wikipedia.

Referat

Choosing the most reasonable split of a compound word with the help of Wikipedia

The purpose of this degree project is to choose the most reasonable split of a compound word using Wikipedia’s category taxonomy. Suggestions for different splits are generated by an independent, pre-existing algorithm. The articles a word occurs in can be seen as a group of contexts the word can appear in and as different representations of the word. The intention is to find more data for each representation/context to perform an analysis on, instead of only analysing the article the word was found in. The idea to be tested is to expand each article representing a context by including related articles in the same category. Two different views of ”reasonable splits” were studied: in the first case compound words were split into exactly two parts, in the second case into an unrestricted number of parts. The method turned out to excel at choosing the correct split whenever it made an attempt. A major drawback was that it often chose not to split compounds even though it should have. The method is highly dependent on the compounds appearing in Wikipedia.

Contents

1 Introduction 1
  1.1 Research Question ...... 1
  1.2 Objective ...... 1
  1.3 Delimitations ...... 2

2 Background 3
  2.1 Early history and progress ...... 3
  2.2 Application ...... 3
    2.2.1 Machine Translation ...... 3
    2.2.2 Information Retrieval ...... 4
  2.3 Linguistics ...... 5
    2.3.1 Semantic classification ...... 5
      2.3.1.1 Copulative ...... 5
      2.3.1.2 Determinative ...... 5
      2.3.1.3 Endocentric ...... 5
      2.3.1.4 Exocentric ...... 6
    2.3.2 How much should be split? ...... 6
    2.3.3 Where to split? ...... 6
    2.3.4 What is a reasonable split? ...... 7
  2.4 Compound splitting methods ...... 7
    2.4.1 Statistical methods ...... 7
      2.4.1.1 Component frequencies ...... 8
      2.4.1.2 n-grams ...... 8
      2.4.1.3 Part of speech combinations ...... 9
      2.4.1.4 Parallel corpus ...... 9
    2.4.2 Linguistic methods ...... 9
      2.4.2.1 Semantic vector space ...... 9

3 Resources 11
  3.1 Findwise’s compound splitter ...... 11
  3.2 Search platforms ...... 12
    3.2.1 Elastic ...... 12
    3.2.2 Solr ...... 12
  3.3 Wikipedia ...... 13
  3.4 Choice ...... 14

4 Methodology 15
  4.1 Concept ...... 15
  4.2 Architecture ...... 16
  4.3 Implementation ...... 17
    4.3.1 Split suggestions retrieval ...... 17
    4.3.2 Articles retrieval ...... 17
    4.3.3 Computation ...... 17
  4.4 Evaluation metrics ...... 17
    4.4.1 Terms ...... 18
    4.4.2 Metrics ...... 18
  4.5 Test data ...... 18
    4.5.1 Den stora svenska ordlistan (DSSO) ...... 18
    4.5.2 Svenska akademiens ordlista (SAOL) ...... 19
    4.5.3 Annotators ...... 19
  4.6 Data sets ...... 19
    4.6.1 Data set 1 ...... 20
    4.6.2 Data set 2 ...... 20
    4.6.3 Data set 3 ...... 20
    4.6.4 Data set 4 ...... 20

5 Experiment 21
  5.1 Language analyzers ...... 21
  5.2 Redo the expansion ...... 21
  5.3 No stemming as base index ...... 21
  5.4 Merge words ...... 22

6 Results 23
  6.1 Case 1 ...... 23
    6.1.1 No stemming ...... 23
    6.1.2 Swedish light ...... 24
    6.1.3 Swedish ...... 24
    6.1.4 Comparison against baseline ...... 24
  6.2 Case 2 ...... 25
    6.2.1 No stemming ...... 26
    6.2.2 Swedish light ...... 26
    6.2.3 Swedish ...... 26
    6.2.4 Comparison against baseline ...... 27

7 Discussion 29
  7.1 Test case 1 ...... 29
    7.1.1 Optimizations ...... 29
    7.1.2 Comparison to baseline ...... 29
  7.2 Test case 2 ...... 30
    7.2.1 Optimization ...... 30
    7.2.2 Comparison to baseline ...... 30
  7.3 Strengths ...... 30
  7.4 Weaknesses ...... 31
  7.5 Further work ...... 31

8 Conclusion 33

Bibliography 35

Chapter 1

Introduction

Swedish is a compounding language, allowing you to form a new word by joining a theoretically unlimited number of words together. The words are joined without blank spaces and are sometimes connected with joining morphemes. Being able to split compound words is a powerful resource in natural language processing, e.g. to expand queries in information retrieval or in statistical machine translation. Depending on the application, one may want to split into a different number of parts. For statistical machine translation one may want to do an aggressive splitting of ”bordtennisspelare” (table tennis player) into ”bord tennis spelare” (table tennis player), enabling one-by-one translation of all the constituents. However, for information retrieval one may be satisfied with ”bordtennis spelare”, because querying ”bordtennis” (table tennis) is more relevant than querying ”bord” (table). How can one find the most reasonable split? Previous research has studied a window of words appearing in the same document [23]; this study is a continuation, studying an expanded context using the free encyclopedia Wikipedia. Wikipedia has a taxonomy system in which every article belongs to at least one category [27]. The intention is to group together other articles in the same category and study the new, expanded context with more data instead of only the data in a single article.

1.1 Research Question

Is it possible to choose the most reasonable split using an expanded context generated by Wikipedia’s category taxonomy?

1.2 Objective

Findwise has a compound word splitter for the Swedish language that generates possible splits. The purpose of this master’s thesis is to make use of Wikipedia’s category taxonomy to determine the most reasonable split among the suggestions. The method will be tested on two perceptions of a reasonable split. The goal is to study the method and its performance from these two points of view.


1.3 Delimitations

This method will only be tested for the Swedish language and the time complexity will not be taken into account. The method will only work for compound words which appear in the Swedish Wikipedia, and it will not be able to split words which Findwise’s compound word splitter fails to split.

Chapter 2

Background

This chapter will present the history, state-of-the-art and application of automatic compound splitting. It will also present the relevant theory.

2.1 Early history and progress

Research in compound analysis has been around for a long time in several languages. Fujisaki et al. [11] experimented with applying a probabilistic grammar to Japanese noun compounds; Rackow et al. [21] used a monolingual corpus to find appropriate English translations of German noun compounds, and Lauer [14] used a statistical approach in his publication. Numerous studies have since been carried out over the years, studying statistical, linguistic and other approaches, and even different combinations of them. The state of the art studies the semantic vector space using word embeddings. Both Dima and Hinrichs [6] as well as Soricut and Och [25] have studied this approach with positive results.

2.2 Application

As mentioned in the introduction, compound splitting is a powerful technique in natural language processing. Two important use cases are machine translation and information retrieval. This section will explain the details.

2.2.1 Machine Translation

Machine translation is a field which investigates the use of computers to automate translation from one language to another [12]. The methods are heavily dependent on data of varying types, e.g. corpora or rules. Compound splitting can improve translations by providing more information about a word for a more accurate analysis. Furthermore, it provides the system with an additional chance to analyse compound words which do not occur in the bilingual corpora, dictionary or other text sources used.


Statistical machine translation (SMT) is a method which analyses bilingual text corpora to build statistical models that transform text from the source language to the target language. What it can translate is therefore limited by the text corpora: it would never be able to translate a word which did not appear in them. The same holds for many other methods, such as direct translation using a dictionary. However, compound splitting would be able to break an unknown compound word into smaller known constituents. Presume the Swedish compound word ”favoritmånad” (favourite month) did not occur in the corpora but ”favorit” (favourite) and ”månad” (month) did. The translation system would fail to translate the compound as a whole, but it would be able to acquire a translation by performing compound splitting and then finding the proper translation for ”favorit” (favourite) and ”månad” (month). Compound splitting is not only useful when attempting to translate a word, it is also useful in the pre-processing work. Applying compound splitting to the corpora before analysing them would give more data to work with and let the system learn common patterns. The corpora with compound splitting can be a complement to the original corpora.
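The fallback described above can be illustrated with a small lookup procedure. This is a minimal sketch under assumed data: the bilingual lexicon and the naive splitter below are toy placeholders, not the components of any real SMT system.

```python
# Sketch: translate an out-of-vocabulary compound by splitting it into known constituents.
# The lexicon and the splitter are toy placeholders (assumptions), not a real SMT system.

lexicon = {"favorit": "favourite", "månad": "month"}  # bilingual lexicon learned from corpora

def two_part_splits(word, vocabulary):
    """Return all two-part splits of `word` whose parts are both known words."""
    return [(word[:i], word[i:]) for i in range(1, len(word))
            if word[:i] in vocabulary and word[i:] in vocabulary]

def translate(word):
    if word in lexicon:                                   # known word: translate directly
        return lexicon[word]
    for first, last in two_part_splits(word, lexicon):    # fall back to the constituents
        return f"{lexicon[first]} {lexicon[last]}"
    return word                                           # give up: copy the source word

print(translate("favoritmånad"))  # -> "favourite month"
```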

2.2.2 Information Retrieval

Query expansion is a process carried out to expand the original query to get improved retrieval performance in information retrieval. Query expansion can be used to get both better precision and better recall. Better precision means better accuracy of relevant documents in the retrieved result. Better recall means increasing the total number of relevant retrieved documents.

In some texts the same concept may be referred to using different words, especially in languages with compounding. An example is the Swedish word ”vinterskor” (winter shoes) and the Swedish phrase ”skor man har på vintern” (shoes you use during winter). They both cover the same concept and are both relevant documents to retrieve. By only using the search query ”vinterskor” we would only find documents explicitly containing the word ”vinterskor” and not the other phrase. However, by splitting the compound word, expanding the query with the constituents and weighting them, the other relevant documents would be found and retrieved as well.

The previous example used compound splitting on the query, but it can also be used on the documents. If we were to query ”skor” (shoes) we would be interested in all kinds of shoes, but only documents containing the word ”skor” explicitly would be retrieved. That would exclude all the documents describing different types of shoes such as ”vintersko” (winter shoe) or ”fotbollssko” (football shoe). By using compound splitting on the documents we would be able to capture them as well.

Compound splitting is thus valuable in both machine translation and information retrieval to get improved results, and it has a positive impact on applications relying on accurate results. It is difficult to discuss its impact from an ethical or sustainability standpoint because it depends on the application that uses it.
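The query-side expansion described above can be sketched in a few lines. The helper split_compound and the weight value are assumptions used purely for illustration; they do not correspond to a specific search engine's API or weighting scheme.

```python
# Sketch: expand a query with the constituents of its compound words at a lower weight.
# `split_compound` is an assumed stand-in for an external compound splitter.

def split_compound(word):
    # placeholder: a real splitter would return the constituents of the compound
    return {"vinterskor": ["vinter", "skor"]}.get(word, [])

def expand_query(query, constituent_weight=0.5):
    """Return (term, weight) pairs: original terms at weight 1.0, constituents lower."""
    expanded = []
    for term in query.split():
        expanded.append((term, 1.0))
        for part in split_compound(term):
            expanded.append((part, constituent_weight))
    return expanded

print(expand_query("vinterskor"))
# -> [('vinterskor', 1.0), ('vinter', 0.5), ('skor', 0.5)]
```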


2.3 Linguistics

Compound words in Swedish are words that consist of two or more independent words [4]. As mentioned in the introduction, words can be joined by joining morphemes such as an ”s”, an extra vowel, letter replacement or letter removal [9]. This section will present an overview of the linguistic theory and explain why compound word splitting is a difficult challenge. It will end with the two definitions of a ”reasonable split” which will be used in this project.

2.3.1 Semantic classification Swedish compound words can be classified into the following four types:

• Copulative

• Determinative

• Endocentric

• Exocentric

A compound word can belong to several types. This section will explain these categories to deepen the understanding of different types of compound words.

2.3.1.1 Copulative

Copulative words are words whose parts are equally weighted [2]. The constituents can usually be shuffled without losing the meaning. Examples of copulative words are colour combinations, e.g. ”blåröd” in Swedish, which is ”blue red” in English. Shuffling the constituents into ”rödblå” (red blue) does not change the meaning.

2.3.1.2 Determinative

Determinative words are words for which the first part determines the second part [17]. An example is the Swedish word ”matsked” (tablespoon), where ”mat” (food) is an attribute of ”sked” (spoon).

2.3.1.3 Endocentric

All endocentric words are determinative. Endocentric words have a head, which is the dominant part of the compound word. The head holds the basic meaning of the whole word, and the whole compound word inherits its inflectional properties from it [3]. An endocentric word is also a hyponym of its head. ”Blackboard” is an example of an English endocentric word where ”board” is the head and ”blackboard” is a hyponym of ”board”. ”Vattensäng” (water bed) is an example of a Swedish endocentric word where ”säng” (bed) is the head and ”vattensäng” is a hyponym of ”säng”.

2.3.1.4 Exocentric

Exocentric words, also known as bahuvrihi words, are words that do not have a head which can represent the whole compound word [3]. Splitting these words results in words which do not provide any semantic context to the original word. ”Skinhead” is an example of an English exocentric word: neither ”skin” nor ”head” can represent the original word alone. ”Fågelskrämma” (scarecrow) is a Swedish example: neither ”fågel” (bird) nor ”skrämma” (scare) can represent the original word alone.

2.3.2 How much should be split?

All compound words can be divided into (at least) two parts, the first element and the last element [19], in Swedish called ”förled” and ”efterled” [4] [1]. A characteristic of compound words is that the first element and the last element can be standalone words, and they can even be compound words themselves. An example is the Swedish compound word ”bordtennisspelare” (table tennis player), which consists of the first element ”bordtennis” (table tennis) and the last element ”spelare” (player). ”Bordtennis” is the first element of that word, but it is itself a compound word, made up of ”bord” (table) and ”tennis” (tennis).

Splitting a word which should not be split is called over splitting. It is a challenge connected with ambiguity: a lot of Swedish words can be interpreted as compound words although they are also non-compound words. All Swedish compound words which are joined together correctly are technically correct, no matter how long they are or what words they are made of. An example is the Swedish word ”sparris” (asparagus). Splitting it into ”sparr is” (spar ice) is technically not wrong, but it is unreasonable since it is not common. One challenge in compound splitting is therefore to know when it is appropriate to stop splitting. In this thesis, technically correct splits will not automatically be evaluated as correct; they must be deemed reasonable by the annotators or an open source word list. More about the evaluation and the definition of reasonable will be explained later in the report.

2.3.3 Where to split?

Ambiguity is a difficult challenge in Swedish compound splitting and is connected with the problem of over splitting mentioned in the previous section. An example is the Swedish word ”matris”. It can be interpreted both as the non-compound word ”matris” (matrix) and as the compound word ”mat ris” (food rice). Another example of ambiguity is the Swedish word ”kamplåtar”. It can be interpreted as a compound word in two ways, either ”kamp låtar” (battle songs) or ”kam plåtar” (comb plates). The absence of context adds an extra layer of difficulty to determining whether a word is a compound word and, if so, which words it is composed of.

2.3.4 What is a reasonable split?

The conclusion is that it is difficult to define ”reasonable”; it depends on context and application. However, a takeaway from this section is that compound words can always be divided into two constituents. This will therefore be one of the cases the algorithm will be tested on: a correct split consisting of two constituents. The second case is to test the algorithm’s performance from the standpoint that compound words can be split into more than two parts. This would be the case if the constituents themselves are compound words. An example which illustrates what would be acceptable in the two cases can be made with the Swedish word ”bordtennisspelare” (table tennis player). The first definition would accept ”bordtennis spelare” (table tennis and player) as a correct split and the second definition would accept ”bord tennis spelare” (table and tennis and player) as a correct split.

Reasonable is defined as the constituents making sense when put together and not changing the meaning of the original compound word (unless it is an exocentric word). The algorithm should not over split ambiguous constituents into smaller parts which change the meaning of the word, e.g. splitting ”tennis” (tennis) into ”tenn is” (tin ice) when splitting ”bordtennis” (table tennis). The second case will be evaluated against splits of the same words made by annotators.

2.4 Compound splitting methods

There are different approaches to compound splitting. The two main approaches are statistical methods and linguistic methods. A combination of them is also quite common to maximise the result. The two approaches will be explained in this section, followed by related work which used them.

2.4.1 Statistical methods

Statistical methods depend on the corpora from which the statistics used in compound splitting are computed. Such a method is therefore more adaptable than a linguistic method: a linguistic method is bound to one language because of the unique language traits used, whereas a statistical method can be applied to other languages by changing the corpora. A disadvantage is that it requires a large and versatile amount of data to cover all the language traits, since the linguistic aspect is ignored.


2.4.1.1 Component frequencies

A frequency based metric is described in Koehn and Knight [13] where the following formula is used to pick the most probable split:

\[
\operatorname*{argmax}_{S} \left( \prod_{p_i \in S} \operatorname{count}(p_i) \right)^{\frac{1}{n}} \tag{2.1}
\]

where:
S = a split suggestion from the set of split suggestions
p_i = part i of the split suggestion
n = number of constituents in the split suggestion

Given the counts of words in the corpus, the split S with the highest geometric mean of the word frequencies of its parts p_i is picked; n is the number of parts in a split. If a word as a whole occurs more frequently than its constituents, the word is kept as a whole. E.g., the English compound word ”wardrobe” would have the split S = ”ward” + ”robe” as one of the split suggestions. The remaining variables would be p_1 = ”ward”, p_2 = ”robe” and n = 2 because the split is made up of two constituents.

A drawback of this approach is that prepositions and determiners will always occur more frequently in the corpus than other words. This means that splits containing prepositions will get a higher score. To exclude these mistakes, knowledge of the part of speech of words is used, and constituents which are not nouns, adverbs, adjectives or verbs are not split off.

Sjöbergh and Kann [23] presented a frequency based metric as well. This approach focused on studying the context the compound word appeared in, namely the 50 words appearing before and the 50 words appearing after the compound word. Constituents occurring closer to the compound word got a higher score. Similar to the approach described by Koehn and Knight [13] mentioned earlier, the split with the highest geometric mean was chosen. The drawbacks of this method were that constituents would rarely appear close to the compound word, and that short constituents, which usually are common words, would appear more often by chance.
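Equation 2.1 is straightforward to implement. The sketch below is illustrative only: the frequency table and the candidate splits are made-up stand-ins, not Koehn and Knight's actual data.

```python
# Sketch of equation 2.1: pick the split with the highest geometric mean of part frequencies.
# `counts` is an assumed word-frequency table computed from a corpus (toy numbers).

counts = {"wardrobe": 120, "ward": 40, "robe": 15}

def geometric_mean_score(split, counts):
    """Geometric mean of the corpus counts of the parts (0 if any part is unseen)."""
    product = 1
    for part in split:
        product *= counts.get(part, 0)
    return product ** (1.0 / len(split))

def best_split(suggestions, counts):
    return max(suggestions, key=lambda split: geometric_mean_score(split, counts))

suggestions = [("wardrobe",), ("ward", "robe")]
print(best_split(suggestions, counts))  # -> ('wardrobe',): the whole word is more frequent
```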

2.4.1.2 n-grams

Sjöbergh and Kann [23] also presented a method of compound splitting that analyses n-grams. They had a list of compound heads and tails, and the frequencies of all character 4-grams in the compound heads and tails (not overlapping a head/tail border) were counted. The frequency data indicates which 4-grams are most common and where a word should therefore not be split. To get a better guess, the frequencies of all character 4-grams containing a suggested split point were added up. This was done for every split suggestion. The suggestion with the lowest sum was then taken to be the split with the highest probability.
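The scoring of a candidate split point can be sketched as below. The 4-gram frequency table is assumed to have been counted from compound heads and tails beforehand; the numbers here are invented for illustration.

```python
# Sketch of the 4-gram heuristic: sum the frequencies of the character 4-grams that cross
# a candidate split point and prefer the split point with the LOWEST sum.
# `fourgram_freq` is an assumed table counted from compound heads and tails (toy numbers).

fourgram_freq = {"ordt": 1, "rdte": 1, "dten": 2, "tenn": 60}

def split_point_score(word, pos, freq):
    """Sum the frequencies of every 4-gram overlapping the boundary word[:pos] | word[pos:]."""
    total = 0
    for start in range(pos - 3, pos):
        if start >= 0 and start + 4 <= len(word):
            total += freq.get(word[start:start + 4], 0)
    return total

def best_split_point(word, candidate_positions, freq):
    return min(candidate_positions, key=lambda pos: split_point_score(word, pos, freq))

# The boundary after "bord" (position 4) beats the boundary after "bordt" (position 5),
# because the common 4-gram "tenn" should not be broken apart.
print(best_split_point("bordtennis", [4, 5], fourgram_freq))  # -> 4
```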


2.4.1.3 Part of speech combinations

Some part of speech combinations are more common than others: more than 25% of compounds are noun-noun combinations, and very few are pronoun-pronoun combinations. By analysing which part of speech combinations were the most common, Sjöbergh and Kann [23] developed a method calculating the probability of such combinations. The algorithm looks at the words in a suggested split and determines their parts of speech. It then selects the split whose part of speech combination has the highest probability.
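A small sketch of this idea follows. The part-of-speech lookup and the combination probabilities below are invented placeholders; Sjöbergh and Kann derived theirs from tagged data.

```python
# Sketch: score split suggestions by the probability of their part-of-speech combination.
# Both tables are assumed toy data; real values would come from a POS-tagged corpus.

pos_of = {"spel": "noun", "spela": "verb", "konsol": "noun"}
combination_prob = {("noun", "noun"): 0.25, ("verb", "noun"): 0.05}

def combination_score(split):
    tags = tuple(pos_of.get(part, "unknown") for part in split)
    return combination_prob.get(tags, 0.0)

suggestions = [("spel", "konsol"), ("spela", "konsol")]
print(max(suggestions, key=combination_score))  # -> ('spel', 'konsol'): noun-noun is most likely
```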

2.4.1.4 Parallel corpus

Koehn and Knight [13] present an approach using a parallel corpus. This works because English compound words are mostly written as separate words joined by spaces, as opposed to Swedish compound words (e.g. ”ishockeyspelare” and ”ice hockey player”). The approach requires a translation lexicon, and the easiest way to obtain one is to learn it from a parallel corpus. The approach translates the constituents of every split; if one of the (translated) splits can be found in the translation lexicon, it is considered a correct split.

Since some words can have different translations depending on the context, this can cause problems. An example mentioned in the paper is the word ”grundrechte” (basic rights). This word should have the split ”grund rechte”. However, since ”grund” usually translates into ”reason” or ”foundation”, the approach will look for ”reason rights” in the translation lexicon. This does not exist, and the correct split ”grund rechte” will therefore be neglected. This can be accounted for by building a second translation lexicon and joining it with the first one. The first translation lexicon was obtained by learning from a parallel corpus without splitting the German compound words; the second one, which complements the first, is obtained by learning from a parallel corpus with the German compound words split using a frequency based approach. The resulting translation lexicon would then learn that ”grund” ↔ ”basic”.

2.4.2 Linguistic methods

A linguistic approach relies on a lexical database, and linguistic knowledge is also applied. As opposed to statistical methods, the data for linguistic methods is enriched with additional information such as part of speech tags or information about different frequencies.

2.4.2.1 Semantic vector space

Daiber, Quiroz, Wechsler and Frank [5] analyse compounds by exploiting regularities in the semantic vector space. They exploit the fact that linguistic items with similar distributions have similar meanings, and their work is based on analogies such as ”king is to man what queen is to woman”. The paper only focuses on endocentric compounds, i.e. compounds whose head describes the basic meaning of the whole word. The head and the whole compound word will therefore be close to each other in the semantic space, and this can be used to make correct splits. This worked well and outperformed a commonly used compound splitter on a gold standard task [20].

Chapter 3

Resources

Apart from the programming itself, several resources were used to reach the project objective, mainly the following:

• Findwise’s compound splitter: To generate split suggestions our algorithm should choose from

• Search platform: To store and fetch articles

• Wikipedia: The articles which will be stored in the search platform and which the algorithm will review

This chapter will present the resources used in this project together with other alternatives. A summary can be found at the end of the chapter stating which resources were chosen.

3.1 Findwise’s compound splitter

Findwise’s compound splitter returns a list of possible splits of a compound word. It is based on a generated text file containing the following data for each word:

• Word in its baseform

• Conjugations not ending a compound word

• Conjugations ending a compound word

The algorithm iterates through a graph built on the information in the text file to return all the possible splits. The compound splitter is also programmed to determine not only the positions to split, but also which part of speech the constituents belong to. Therefore, ambiguous cases can generate multiple suggestions. E.g. ”spelkonsol” (gaming console) will generate both ”spel konsol” (game console) and ”spela konsol” (gaming console) because the word ”spel” is ambiguous and can be both a noun and a verb.


The splitter was evaluated on splitting the compound words in the Den stora svenska ordlistan (DSSO) dictionary. The compound words in the dictionary are specifically marked, and information about the first element and the last element of each compound word is also provided, i.e. the correct split into two constituents, which is our first case to evaluate. Findwise’s compound splitter was tested on splitting words into two constituents. The accuracy was calculated as

\[
\text{Accuracy} = \frac{\text{Correct split}}{\text{Correct split} + \text{Wrong faulty split}} \tag{3.1}
\]

A splitting attempt was considered a ”correct split” if the correct split could be found in the top x suggestions of the returned, sorted list of possible splits. The results were the following (a small evaluation sketch follows the list below):

• The first split suggestion is a correct split - Accuracy: 0.74

• Consider a hit in the top 2 splittings as a correct split - Accuracy: 0.90

• Consider a hit in the top 3 splittings as a correct split - Accuracy: 0.92

• Consider a hit in any splitting as a correct split - Accuracy: 0.93
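The top-x evaluation can be reproduced with a few lines of code. This is a generic sketch; the gold splits and ranked suggestion lists below are placeholders, not the actual DSSO evaluation data.

```python
# Sketch: accuracy where a hit anywhere in the top-k suggestions counts as a correct split.
# `results` pairs each word's gold split with the splitter's ranked suggestions (toy data).

results = [
    (("bord", "tennis"), [("bord", "tennis"), ("bordten", "nis")]),
    (("mat", "sked"),    [("mats", "ked"), ("mat", "sked")]),
]

def top_k_accuracy(results, k):
    hits = sum(1 for gold, ranked in results if gold in ranked[:k])
    return hits / len(results)

for k in (1, 2):
    print(k, top_k_accuracy(results, k))  # -> "1 0.5" then "2 1.0"
```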

3.2 Search platforms

A search platform will be used to store and fetch the articles according to the algorithm presented in this project. The most important fields to be indexed are the article id, category and content for querying purposes. Two alternatives will be presented in this section.

3.2.1 Elastic

Elastic was created in 2012 and is an open-source search platform based on Lucene [7]. Elastic supports aggregations and multitenancy, and has better support for complex and analytical queries compared to Solr.

3.2.2 Solr

Solr was created in 2004 and is also an open-source search platform based on Lucene [16] [10]. Compared to Elastic, which is relatively new, Solr is more mature and stable.


3.3 Wikipedia

Wikipedia was used as the data to study the contexts the words appear in. It is a free encyclopedia created collaboratively by anyone who uses it, and it has grown to become one of the world’s largest encyclopedias since its creation in 2001. The English Wikipedia has as of today (2016-01-21) 5,059,991 articles and is the largest one; the second largest is the Swedish Wikipedia with a total of 2,662,793 articles [28].

Wikipedia has a couple of ways to browse articles and group similar ones. One way is using categories. All articles in Wikipedia belong to at least one category, and categories are intended to group together pages on similar topics [27]. By clicking on different categories one can find similar articles related to the topic from different points of view. E.g. the English article ”Table tennis” has the categories ”Table tennis” and ”Racquet sports” amongst others [30]. They are both related to table tennis but put the topic into different contexts. In the first context table tennis is viewed as the main topic of the category and you will find articles about the equipment used, different play styles and so on. In the other one table tennis is viewed as one type of racquet sport and you will find articles about other types of racquet sports. This can be interpreted as each Wikipedia article being connected to several contexts, where each context takes the form of a group of related articles from a particular point of view. This project will utilise this type of grouping for the context expansion.

Another way of grouping articles is considering all internal links appearing in the same article as a group. Internal links are links to other articles with the aim of allowing readers to deepen their understanding of a topic [31]. Internal links are therefore in some way relevant to the topic since they appear in the main article. However, a drawback is that they all get the same weight, e.g. an internal link about the ”sponge” material in the article about table tennis gets the same relevancy as one about different ”tennis grips”, although the latter may be more specific and relevant to the topic.

Backlinks can also be considered as a group of articles. As mentioned earlier, all Wikipedia articles have internal links linking to other Wikipedia articles. Similarly, they also have a collection of incoming links (links from other Wikipedia articles which link to them), called backlinks. Although all the articles in the backlinks refer to the same article, they may not have discussed the topic in the same context. E.g. the English article ”Table tennis” has the articles ”North Korea”, ”Table (furniture)” and ”Tennis” among its backlinks, and these represent different contexts [29]. The context interpretation discussed in the paragraph about categories is also applicable here, where each backlink is a context. The difference is that a single article is a context instead of a group of articles.

Some articles also have a section with related articles presented in a ”see also” section. E.g. a town can have neighbouring towns in its ”see also” section. It is more common in certain types of articles, and which articles have this section is very inconsistent.

Another possible encyclopedia which could be used is the Swedish Nationalencyklopedin.

It is stated on its website that it has over 200,000 articles [15]. By performing a search on the website, a more exact number can be found: 313,146 articles in total. Many articles have a simplified version called a ”simple” article; the original is called a ”long” article. The simple articles are written with a simple and casual language in mind, without abbreviations, but cover the same topics. The number of articles covering a unique topic is therefore less than 313,146, since there are some duplicates. 72,925 of the articles on Nationalencyklopedin are classified as ”simple” articles and 240,221 as ”long” articles. The articles in Nationalencyklopedin do not belong to any categories but have a list of links to related pages. Nationalencyklopedin also states that the articles cover seven main topics. However, it is not possible to browse through the main topics, and there is no visible information about which main topic an article belongs to. The articles on Nationalencyklopedin are, in contrast to Wikipedia, reviewed by experts. Therefore, although the amount of data is a lot smaller compared to Wikipedia, the data itself is more reliable. The advantage of using Nationalencyklopedin over Wikipedia is thus the quality of the data; the disadvantage is the quantity. Since this project uses a statistical approach, the quantity is more important than the quality of the data.

3.4 Choice

Elastic was chosen as the search platform because of its better support for analytical queries, which could potentially be of use. Wikipedia was chosen because it was the initial intention of the project and because the large amount of data is crucial for a statistical approach.

Chapter 4

Methodology

This chapter gives an overview of the system and what its main parts do. The idea behind the algorithm is explained first, before the details are presented. This is followed by a presentation of the evaluation metrics and the test data.

4.1 Concept

Instead of analysing only the article the compound word occurred in, the intention is to find a larger context for analysis. The articles the word was found in can be seen as representations of the word, i.e. an article is a representation of the word. The idea is to expand each article representing it by including related articles which can be found in the categories it belongs to. An illustration of the expansion process can be seen in figure 4.1; it describes how the representation changes.

Figure 4.1. Illustration of the data retrieval process

1. Get articles: ”String” –> articles.


2. Get categories: article –> categories

3. Get articles: categories –> articles

In the end, each representation of the word, originally represented by one single article, is now a representation made up of a bundle of several articles. The next step is to get the split suggestions from the splitter and count the occurrences of the constituents in the bundles. All constituents occurring in the same bundle indicates that the split suggestion is good enough.

4.2 Architecture

The architecture consists of three parts: Findwise’s compound splitter, a class which communicates with the Elastic server, and a main class which sends data requests to the other components and does the calculation. Figure 4.2 shows a simplified view of it.

Figure 4.2. Simplified overview of system architecture


4.3 Implementation

4.3.1 Split suggestions retrieval

The compound word to be split is sent to the compound splitter and the retrieved results are filtered. The filter removes all suggestions with more than two parts for the first case, and suggestions with only one part (the whole word) for the second case.

4.3.2 Articles retrieval

A single class manages all the requests to the Elastic server. There are two main requests:

Get categories of the base articles: Search for a maximum of n articles which contain the compound word and return all of those articles’ categories. No analysis of the categories’ relevancy is done; the categories are taken in the order in which they are presented by Wikipedia, which is alphabetical. All the categories originating from the same article are bundled together.

Get articles of the categories: Search for and return at most m articles belonging to each category in every category bundle.
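Both requests can be expressed against Elastic's standard REST search endpoint. The sketch below is a minimal illustration using plain HTTP; the index name and the field names (content, categories) are assumptions about how the Wikipedia dump might be indexed, not the project's actual mapping.

```python
# Sketch of the two retrieval requests against an Elastic index of Wikipedia articles.
# The index name and field names are assumptions; adjust them to the actual mapping.
import requests

SEARCH_URL = "http://localhost:9200/wikipedia/_search"

def base_article_categories(word, n=15):
    """Fetch up to n articles containing the word; return one category bundle per article."""
    body = {"size": n, "query": {"match": {"content": word}}}
    hits = requests.post(SEARCH_URL, json=body).json()["hits"]["hits"]
    return [hit["_source"].get("categories", []) for hit in hits]

def articles_in_category(category, m=20):
    """Fetch up to m articles belonging to a category; return their text content."""
    body = {"size": m, "query": {"match": {"categories": category}}}
    hits = requests.post(SEARCH_URL, json=body).json()["hits"]["hits"]
    return [hit["_source"]["content"] for hit in hits]
```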

4.3.3 Computation

A list of integers, one per split suggestion, is initialized first. Every element in the list represents one split suggestion, and the integer is the number of bundles in which all constituents of the split could be found. For each bundle of articles, check whether all parts of a split suggestion can be found in it; if so, increment the counter. When all bundles are done, do the same with the next split suggestion. Lastly, the split suggestion with the highest count in the list is chosen as the best split. This means its constituents could be found together in the largest number of contexts, which implies that the constituents have a strong connection with the original compound word and that it is therefore reasonable to split it this way. If nothing was found, the algorithm will not choose any split suggestion at all and returns the word as a whole.
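The scoring step can be summarised in a short function. This is a simplified sketch of the logic described above rather than the project's actual implementation; here a bundle is simply the concatenated text of the articles expanded from one base article, and constituents are matched naively as substrings.

```python
# Sketch of the scoring step: a split suggestion earns one point per bundle (expanded context)
# in which ALL of its constituents occur; with no evidence at all, the word is kept whole.

def choose_split(word, suggestions, bundles):
    """suggestions: list of tuples of constituents; bundles: list of text blobs."""
    scores = [0] * len(suggestions)
    for text in bundles:
        for i, split in enumerate(suggestions):
            if all(part in text for part in split):   # naive substring matching for brevity
                scores[i] += 1
    if not scores or max(scores) == 0:
        return (word,)                                 # no evidence: keep the word whole
    return suggestions[scores.index(max(scores))]

bundles = ["bordtennis spelas med racket och boll av två spelare",
           "tennis är en racketsport som spelas på en bana"]
print(choose_split("bordtennisspelare",
                   [("bordtennis", "spelare"), ("bord", "tennisspelare")],
                   bundles))                           # -> ('bordtennis', 'spelare')
```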

4.4 Evaluation metrics

The same metrics and definitions used in the paper written by Koehn and Knight [13] will be used for the evaluation. It will first cover the terms used in the metrics and then the metrics themselves.


4.4.1 Terms

• Correct split: Words that should be split and were split correctly.

• Correct non: Words that should not be split and were not.

• Wrong not: Words that should be split but were not.

• Wrong faulty split: Words that should be split, were split, but wrongly (either too much or too little)

• Wrong split: Words that should not be split, but were.

4.4.2 Metrics

\[
\text{Precision} = \frac{\text{Correct split}}{\text{Correct split} + \text{Wrong faulty split}} \tag{4.1}
\]

Precision measures the proportion of correct splits among all attempted splits.

\[
\text{Recall} = \frac{\text{Correct split}}{\text{Correct split} + \text{Wrong faulty split} + \text{Wrong not split}} \tag{4.2}
\]

Recall measures the proportion of correct splits among all the words that should be split.

\[
\text{Accuracy} = \frac{\text{Correct}}{\text{Correct} + \text{Wrong}} \tag{4.3}
\]

Accuracy measures the proportion of correct decisions (correct splits and correct non-splits).
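For completeness, the three metrics can be computed directly from the five counts defined above. A minimal sketch, assuming the counts have already been tallied:

```python
# Sketch: precision, recall and accuracy (equations 4.1-4.3) from the five outcome counts.

def metrics(correct_split, correct_non, wrong_non, wrong_faulty_split, wrong_split):
    precision = correct_split / (correct_split + wrong_faulty_split)
    recall = correct_split / (correct_split + wrong_faulty_split + wrong_non)
    correct = correct_split + correct_non
    wrong = wrong_non + wrong_faulty_split + wrong_split
    accuracy = correct / (correct + wrong)
    return precision, recall, accuracy

# Example: the "Default" row of table 6.1 gives roughly (0.89, 0.62, 0.63).
print(metrics(618, 195, 305, 77, 92))
```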

4.5 Test data

The test cases are a combination of data sets from different sources. The sources used were DSSO, Svenska akademiens ordlista (SAOL) and data from annotators. This section covers the sources the data was collected from and the test cases used. The algorithm is dependent on Findwise’s compound word splitter generating splits. Since this is an evaluation of the algorithm and not of Findwise’s splitter, all words which the splitter could not split were filtered away, and the words tested are only those which Findwise’s splitter managed to split.

4.5.1 Den stora svenska ordlistan (DSSO)

DSSO was used to collect compound words. It is a word list created collaboratively by its users [18] [26]. The same file of DSSO as used by Findwise in their evaluation was used in this project. The file consists of a total of around 17,000 marked compound words together with the correct split into two parts.


4.5.2 Svenska akademiens ordlista (SAOL)

SAOL was used to collect non-compound words. This was done manually, by going through every page of the word list and randomly adding words that were not marked as compound words. A couple of words were chosen from every page if possible, in an attempt to include words starting with every letter of the alphabet and with a letter combination distribution similar to the Swedish language the dictionary represents. 515 non-compound words in total were collected through this method.

4.5.3 Annotators

The third source was two subsets of the compound words from DSSO, annotated by annotators. Each subset consists of 500 words, and each subset was annotated by 5 different annotators. The annotation was carried out by handing the annotators a list of 500 words with the instruction to split the words as much as possible while the meaning of the resulting ”group of words” still remains the same as that of the original compound word, according to the annotators themselves. The key word should not be split. E.g. ”långdistansförhållande” (long distance relationship) should be split aggressively into ”lång distans förhållande” (long distance relationship), whereas ”bordtennisbord” (table tennis table) should only be split into ”bordtennis bord” (’table tennis’ table), since splitting the latter into three constituents would not retain the original key word ”bordtennis”. The annotators were free to put 0, 1 or more marks where they thought the word should be split. The two groups of 5 annotators each were handed two different word lists; the first one consists of compound words which occur in Wikipedia and the second one of words that do not. For the evaluation phase, a split was marked as correct if at least 2 annotators had agreed on it.

4.6 Data sets

This section presents the data sets used. A Wikipedia filter was applied to some of the data before the words were shuffled and randomly selected into a data set. This filter removed all words which could not be found in any Wikipedia article. It was only applied to some test cases, in order to be able to evaluate the algorithm’s maximum potential. Data sets 1 and 2 were used to evaluate the first case and data sets 3 and 4 the second case. The first case is a split into two constituents and the second case is a split into as many constituents as possible while still retaining the meaning of the word. All the words used in data sets 1 and 2 were checked with Findwise’s splitter so that it could generate at least one split suggestion of two parts.


4.6.1 Data set 1

Data set 1 contains 1000 already annotated compound words from DSSO and 287 non-compound words from SAOL. This data set only contains words that occur in Wikipedia.

4.6.2 Data set 2

Data set 2 contains 1000 already annotated compound words and 300 non-compound words.

4.6.3 Data set 3

Data set 3 contains 500 compound words annotated by 5 annotators and 287 non-compound words from SAOL. This data set only contains words that occur in Wikipedia.

4.6.4 Data set 4

Data set 4 contains 500 compound words annotated by 5 annotators and 300 non-compound words from SAOL.

Chapter 5

Experiment

This chapter covers the experiments performed, which were applied to some of the test cases.

5.1 Language analyzers

The purpose of testing different language analyzers was to compare the impact they had. Elastic’s Swedish language analyzer based on the Snowball stemmer [8] [24], the light stemmer created by Jacques Savoy [22] and no stemming were tested.

5.2 Redo the expansion

There is a high risk that the compound word will not be found in Wikipedia, especially for longer and more unusual words. Some words have split suggestions which have at least one word in common. This can be exploited by redoing the search on that dominant word. In this way, articles that are relevant but do not contain the whole word may be found, so that at least some context is obtained.
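Finding the dominant word amounts to intersecting the constituent sets of all suggestions. A minimal sketch of that step, using made-up suggestions:

```python
# Sketch: find a constituent shared by all split suggestions, so the Wikipedia search can be
# redone on that word when the full compound is not found in any article.

def dominant_word(suggestions):
    """Return a constituent common to every suggestion, or None if there is none."""
    common = set(suggestions[0])
    for split in suggestions[1:]:
        common &= set(split)
    return max(common, key=len) if common else None   # prefer the longest shared part

print(dominant_word([("favorit", "månad"), ("favorit", "må", "nad")]))  # -> 'favorit'
```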

5.3 No stemming as base index

The task of a stemmer is to reduce a word to its root form, e.g. ”player” and ”plays” both have the root form ”play”. The first 15 base articles which will be expanded are crucial for finding relevant data to base the compound splitting on. Querying the root form returns more articles but also more irrelevant articles. E.g. querying the root form ”play” instead of ”player” would also return articles that do not mention the word ”player”, increasing the recall but decreasing the precision. Since we are only using 15 base articles, the precision is more important than the recall. Using no stemming at all to retrieve the first 15 base articles was therefore tested, in an attempt to increase the number of relevant articles to expand the context on.


5.4 Merge words

Because the external compound splitter is programmed to determine the part of speech as well, it will generate different representations of the same split suggestion, as mentioned earlier. E.g. the word ”lekplats” (playground) has the split suggestions ”lek plats” (play ground) and ”leka plats” (to play ground) because of ambiguity: ”lek” can mean both the noun ”play” and the verb ”to play”. However, instead of choosing one randomly when there is a tie between two suggestions, we choose the one which matches the original word exactly when the constituents are merged together. In this case, ”lek plats” would be chosen because it is identical to the original word when the constituents are put together.
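The tie-breaking rule can be expressed compactly: among equally scored suggestions, prefer the one whose concatenated constituents reproduce the original word. A small illustrative sketch:

```python
# Sketch of the "merge" tie-break: among top-scoring suggestions, prefer the one whose
# constituents, when concatenated, equal the original compound word.

def break_tie(word, tied_suggestions):
    for split in tied_suggestions:
        if "".join(split) == word:
            return split
    return tied_suggestions[0]          # no exact match: fall back to the first suggestion

print(break_tie("lekplats", [("leka", "plats"), ("lek", "plats")]))  # -> ('lek', 'plats')
```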

Chapter 6

Results

This chapter presents the results from the experiments and is divided into two parts. The first part covers the first case and the second part the second case. The terms and metrics were explained in section 4.4.

6.1 Case 1

The algorithm is only allowed to split the compound word into two constituents or not at all. The first row shows the result of the default algorithm without any optimization. The second and last rows show the results of the experiment when two optimization options each were enabled.

• First row - Default algorithm.

• Second row - Stemming is initially disabled as described in 5.3 and the algorithm is allowed to perform an additional attempt on a new query as described in 5.2.

• Last row - The algorithm is allowed to perform an additional attempt on a new query as described in 5.2 and should choose the split suggestion which is closest to the original compound word when the constituents are put together as described in 5.4.

6.1.1 No stemming

Experiment on the index without stemming. In this case, the optimization of initially disabling stemming (second row) is cancelled out because no stemming is used on the index at all. Data set 1 was tested.


                     Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
                     split    non      non    split         split
Default              618      195      305    77            92     0.89       0.62    0.63
Base index and Redo  660      176      197    143           111    0.82       0.66    0.65
Redo and Merge       671      176      197    132           111    0.84       0.67    0.66

Table 6.1. The results for data set 1 tested on the algorithm without a stemmer for case 1.

6.1.2 Swedish light

Experiment on the index with stemming using the Swedish light analyzer created by Jacques Savoy [22]. This stemmer does not stem as aggressively as Elastic’s Swedish stemmer. Data set 1 was tested.

                     Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
                     split    non      non    split         split
Default              473      187      385    142           100    0.77       0.47    0.51
Base Index and Redo  619      120      124    247           167    0.71       0.63    0.58
Redo and Merge       701      120      134    165           167    0.81       0.70    0.63

Table 6.2. The results for data set 1 tested on the algorithm with the Swedish light language analyzer for case 1.

6.1.3 Swedish

Experiment on the index with stemming using the Swedish stemmer based on the Snowball stemmer [8] [24]. This stemmer uses a more aggressive approach when stemming, resulting in better recall but worse precision. More (relevant and irrelevant) data is collected with this stemmer. Data set 1 was tested.

                     Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
                     split    non      non    split         split
Default              448      210      433    119           166    0.79       0.45    0.51
Base Index and Redo  634      127      124    242           160    0.72       0.69    0.63
Redo and Merge       713      127      124    163           160    0.81       0.71    0.62

Table 6.3. The results for data set 1 tested on the algorithm with Elastic’s Swedish language analyzer for case 1.

6.1.4 Comparison against baseline

Comparison of the best combination of optimizations (or lack thereof) for each stemming option against the top suggestion from Findwise’s compound splitter. Data set 2 was tested.


             Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
             split    non      non    split         split
No stemming  530      189      331    139           111    0.80       0.53    0.55
Light        584      133      273    143           167    0.80       0.58    0.55
Swedish      596      140      266    138           160    0.81       0.59    0.57
Findwise     778      0        0      222           300    0.78       0.78    0.59

Table 6.4. The results for data set 2 tested on Findwise’s compound word splitter versus this project’s method.

The largest impact of an optimization can be seen in table 6.2, where the number of correct splits increased from 473 to 619, and in table 6.3, where the number of wrong nons decreased from 433 to 124. However, this optimization also decreased the number of correct nons. The precision was high in all the cases, meaning that the algorithm often chose the right split when it had to choose. However, recall and accuracy were lower, ranging from 0.51 to 0.71. This was caused by the high number of wrong nons and wrong splits.

6.2 Case 2

The second part covers the second case, where there is no limit to the number of parts. The first row shows the result of the default algorithm without any optimization. The second and last rows show the results of the experiment when optimization options were used.

• First row - Default algorithm.

• Second row - Stemming is initially disabled as described in 5.3.

• Last row - The algorithm is allowed to perform an additional attempt on a new query as described in 5.2 and 5.4.

25 CHAPTER 6. RESULTS

6.2.1 No stemming

Experiment on the index without stemming. In this case, the optimization of initially disabling stemming (second row) is cancelled out because no stemming is used on the index at all. Data set 3 was tested.

            Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
            split    non      non    split         split
Default     344      193      131    25            94     0.93       0.68    0.62
Base Index  344      193      131    25            94     0.93       0.68    0.62
Redo        365      175      101    34            112    0.91       0.73    0.67

Table 6.5. The results for data set 3 tested on the algorithm with no stemmer for case 2.

6.2.2 Swedish light

Experiment on the index with stemming using the Swedish light analyzer created by Jacques Savoy [22]. This stemmer does not stem as aggressively as Elastic’s Swedish stemmer. Data set 3 was tested.

            Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
            split    non      non    split         split
Default     347      184      115    38            103    0.90       0.69    0.67
Base Index  362      125      101    37            162    0.91       0.72    0.62
Redo        374      118      77     49            169    0.89       0.75    0.63

Table 6.6. The results for data set 3 tested on the algorithm with the Swedish light language analyzer for case 2.

6.2.3 Swedish

Experiment on the index with stemming using the Swedish stemmer based on the Snowball stemmer [8] [24]. This stemmer uses a more aggressive approach when stemming, resulting in better recall but worse precision. More (relevant and irrelevant) data is collected with this stemmer. Data set 3 was tested.

            Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
            split    non      non    split         split
Default     366      307      87     47            80     0.89       0.73    0.73
Base Index  371      143      95     34            144    0.92       0.74    0.65
Redo        377      123      73     50            164    0.88       0.75    0.64

Table 6.7. The results for data set 3 tested on the algorithm with Elastic’s Swedish language analyzer for case 2.


6.2.4 Comparison against baseline

Comparison of the best optimization option (or lack thereof) for each stemming option against the top suggestion from Sjöbergh and Kann’s compound splitter. Data set 4 was tested.

               Correct  Correct  Wrong  Wrong faulty  Wrong  Precision  Recall  Accuracy
               split    non      non    split         split
No stemming    182      195      307    11            105    0.94       0.36    0.47
Light          184      155      287    29            145    0.86       0.37    0.60
Swedish        187      158      292    21            142    0.90       0.37    0.43
Sjöbergh Kann  441      290      32     27            10     0.94       0.88    0.91

Table 6.8. The results for data set 4 tested on the algorithm developed by Sjöbergh and Kann for case 2 [23].


The number of wrong nons in test case 2 was significantly lower by default compared to test case 1. This had a large impact on the precision, which landed at around 0.90 for all cases.

Similarly to test case 1, applying a base index and redo increased the number of correct splits. However, the impact on wrong nons was not as large as in test case 1, probably because of the low number even before applying anything, as mentioned in the previous paragraph.

The precision landed at around 0.90 for all cases, whereas recall and accuracy ranged between 0.62 and 0.75. The algorithm is good at choosing the correct split but less good at identifying compound words. More often than not it would rather not split a word because the conditions were too weak, i.e. some but not all of the constituents could be found in any bundle of articles.

The developed algorithm had slightly lower precision than Sjöbergh and Kann’s, but the recall and accuracy were much lower. Sjöbergh and Kann’s recall was 0.88 while this project’s algorithm landed at around 0.37. Sjöbergh and Kann’s accuracy was 0.91 while this project’s algorithm ranged between 0.43 and 0.60.

Chapter 7

Discussion

7.1 Test case 1

7.1.1 Optimizations

Having a non-stemmed base index and redoing the retrieval for compound words that were not found had a large positive impact on the number of correct splits and wrong nons for test case 1. This is an effect of a higher precision of relevant articles to expand and of the increased quantity of retrieved articles and data. Redoing the retrieval on a substring increased the chance of finding some relevant articles instead of doing no analysis at all and wrongly marking the word as a non-compound word. However, this optimization also decreased the number of correct nons. Although the algorithm got better at correctly splitting more compound words, it also got worse at judging when not to split. The increased amount of data improved the amount of relevant data but also the amount of less relevant data, which got the same weight.

The number of wrong faulty splits decreased when applying ”merge”. Many suggestions were split at the same position, but the constituents were different forms of the same words and had the same score because of the stemming. An example is the suggestions for the compound word ”lekplats” (playground), which included ”leka plats” (to play ground) and ”lek plats” (play ground). The first suggestion would be chosen without merge; with merge, ”lek plats” would be chosen instead, which is the correct one. Applying ”merge” increased the chance of choosing noun-noun words, which are also statistically more common.

7.1.2 Comparison to baseline

No filtering out of words which could not be found in Wikipedia was done when comparing the algorithm against the baseline. Findwise’s default compound word splitter was used as the baseline for test case 1.

Findwise’s compound word splitter beat the algorithm in all metrics but precision. This is partly because the splitter assumes that all words are compound words

and should be split; it therefore does not judge whether a word should be split or not. This results in the splitter having 0 correct nons and 0 wrong nons. The number of compound words is much higher than the number of non-compound words, the ratio being approximately 3:1, which makes the metrics misleading and partial to a splitter that attempts to split all the words.

In conclusion, Findwise’s default splitter was better at splitting based on the quantity of correctly split compound words, but the algorithm developed in this thesis had a higher precision and chose the correct split more often. If both splitters were to split the same document, Findwise’s compound splitter would return more correctly split words but also more incorrect splits. The thesis’ splitter would return fewer correct splits but also fewer incorrect splits. The document split by this thesis’ algorithm would therefore be more understandable, although not as many compound words would be split.

7.2 Test case 2

7.2.1 Optimization

The results on test case 2 were slightly better than on test case 1. However, that was also because the criterion for a correct split was much looser: there could be several correct suggestions (all suggestions which at least two annotators considered correct were judged as correct), hence there was a higher chance of scoring a correct one. This explains the low number of wrong nons.

One major difference from test case 1 was that although the number of correct splits increased after applying redo, the precision decreased. A common factor was that the number of wrong splits also increased. So, although redoing the search on a substring of the original word increased the quantity of correct splits, the algorithm also got worse at identifying words that should not be split. As mentioned in test case 1, this is due to the increased amount of data and the fact that all data (both very relevant and less relevant) gets the same weight.

7.2.2 Comparison to baseline

The performance of this thesis’ splitter was worse compared to Sjöbergh and Kann’s. Since this data set did include words that do not appear in Wikipedia, a lot of words were returned as a whole, which partially caused the low recall and accuracy. Only around 40% of the compound words were correctly identified as compound words. The number of wrong faulty splits was otherwise low, and the precision was equal to or slightly worse than Sjöbergh and Kann’s.

7.3 Strengths

The algorithm had a high precision on the test cases and is good at choosing the correct split from a set of suggestions, thanks to the possibility of analysing context. Even though many suggestions are technically correct, it was able to choose the most common interpretation and therefore the most reasonable one.


7.4 Weaknesses

Since the algorithm performed well on the task of choosing the correct split (shown by the high precision), the major bottleneck was that it was bad at identifying compound words. This can be seen by comparing the number of wrong nons with the number of wrong faulty splits: the number of wrong nons was always higher. The most extreme case is the last comparison against the baseline for test case 2, where there were 307 wrong nons against 11 wrong faulty splits using the "No stemming" index.
A large reason for the high number of wrong nons was that the algorithm marks a word as non-compound if the compound word is not found in Wikipedia. Many of the compound words were not found in Wikipedia, and this decision affected the result negatively. Trying to split all words, even those not found in Wikipedia, would increase the number of correct splits but at the same time increase the number of wrong splits (splitting non-compound words).

7.5 Further work

To counter the issue of not being able to find relevant contexts/articles, further work should focus on studying methods to capture such contexts. This would decrease the number of wrong nons and improve the low recall and accuracy, since many of the errors arose from not finding any article containing the word. Methods worth studying to find more and better contexts include improving the data source, using synonyms, and studying different substrings of the word that can capture relevant context.
Improving the data source can be done by switching to or adding other encyclopedias, e.g. Nationalencyklopedin, which was mentioned in the background. Another suggestion is to change the way the context is expanded. In this project the order of the base articles and categories was left unchanged from the retrieval and was therefore alphabetical. An improvement would be to first sort them based on some criterion, e.g. popularity or size. This would probably increase the relevancy of the data and give the algorithm more accurate data to base the compound splitting decision on. Expanding by backlinks instead of categories is also worth trying. Since many compound words are rare, especially long ones, studying different substrings of the compound word in combination with synonyms could increase the number of found articles for rare words.
Further work on improving the precision can be done by giving words different weights. As mentioned earlier, all words get the same weight as long as they are found in the same context/category. A word appearing in the same document as the whole compound word is therefore given the same weight as a word appearing in a different document but in the same category. To improve the analysis, words found closer to the original compound word should be weighted more heavily. Implementing stop words is also a method worth trying.
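A minimal sketch of the proximity-based weighting proposed above. The representation of the expanded context as (word, source) pairs and the concrete weight values are illustrative assumptions, not the thesis' implementation:

    def weighted_counts(context_words, same_document_weight=2.0, same_category_weight=1.0):
        # context_words is assumed to be a list of (word, source) pairs, where
        # source is "article" for words taken from the article the compound word
        # itself was found in, and "category" for words from related articles in
        # the same category. Words closer to the original compound word count for
        # more, instead of every word in the expanded context getting equal weight.
        counts = {}
        for word, source in context_words:
            weight = same_document_weight if source == "article" else same_category_weight
            counts[word] = counts.get(word, 0.0) + weight
        return counts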

Chapter 8

Conclusion

The method developed in this thesis yielded a high precision but low recall and accuracy. It chose the correct split most of the time but had difficulties separating compound words from non-compound words.
A major limitation is that it is highly dependent on the compound words appearing in Wikipedia, which has a large impact on the result when comparing to other methods. The approach is good at identifying the correct split when the compound word is a common word; otherwise there are other algorithms that perform better for other use cases.


Bibliography

[1] Carina Ahlin. Spargris – Är det en gris som man sparar till jul? 2016. url: https://gupea.ub.gu.se/bitstream/2077/29653/1/gupea_2077_29653_1.pdf (visited on 01/24/2016).
[2] Johan van der Auwera and Ekkehard König. The Germanic languages. Routledge, 1994.
[3] Réka Benczes. Creative compounding in English. John Benjamins Pub. Co., 2006, pp. 8–9.
[4] Sven Eriksson and Carl-Johan Markstedt. Svenska Impulser 2 - Språkets byggstenar. 2016. url: http://sanomautbildning.se/upload/SvenskaImpulser2_sid455_474.pdf (visited on 01/24/2016).
[5] Joachim Daiber et al. “Splitting Compounds by Semantic Analogy”. In: arXiv preprint arXiv:1509.04473 (2015).
[6] Corina Dima and Erhard Hinrichs. “Automatic Noun Compound Interpretation using Deep Neural Networks and Word Embeddings”. In: IWCS 2015 (2015), p. 173.
[7] Elastic. Elastic | We’re About Data. 2015. url: https://www.elastic.co/about (visited on 12/22/2015).
[8] ElasticSearch. Stemmer Token Filter. 2016. url: https://www.elastic.co/guide/en/elasticsearch/reference/2.0/analysis-stemmer-tokenfilter.html (visited on 12/20/2016).
[9] Claes-Christian Elert. Ordbildning. 2016. url: https://en.wikipedia.org/wiki/Compound_(linguistics) (visited on 01/23/2016).
[10] The Apache Software Foundation. APACHE SOLR™ 5.4.1. 2016. url: http://lucene.apache.org/solr/ (visited on 01/15/2016).
[11] Tetsu Fujisaki et al. “A probabilistic parsing method for sentence disambiguation”. In: Current Issues in Parsing Technology. Springer, 1991, pp. 139–152.
[12] Daniel Jurafsky and James H. Martin. Speech and Language Processing. Pearson Education, 2009, p. 895. isbn: 9780135041963.


[13] Philipp Koehn and Kevin Knight. “Empirical methods for compound splitting”. In: Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1. Association for Computational Linguistics, 2003, pp. 187–193.
[14] Mark Lauer. “Designing statistical language learners: Experiments on noun compounds”. In: arXiv preprint cmp-lg/9609008 (1996).
[15] Nationalencyklopedin. Uppslagsverket. 2015. url: http://www.ne.se/info/tj%C3%A4nster/uppslagsverket (visited on 12/18/2015).
[16] M. Nayrolles. Mastering Apache Solr: A practical guide to get to grips with Apache Solr. Inkstall Solutions LLC, 2014. isbn: 9788192784502. url: https://books.google.se/books?id=HSWeAwAAQBAJ.
[17] Lena Öhrman. “Felaktigt särskrivna sammansättningar”. In: C-uppsats i datorlingvistik, Institutionen för lingvistik, Stockholms universitet (1998).
[18] Apache OpenOffice. Förbättra vår ordlista. 2016. url: https://www.openoffice.org/sv/dsso.html (visited on 01/26/2016).
[19] Norstedts ordböcker. Ord.se. 2016. url: http://www.ord.se/oversattning/engelska/?s=element&l=ENGSVE (visited on 12/20/2016).
[20] Maja Popović, Daniel Stein, and Hermann Ney. “Statistical machine translation of German compound words”. In: Advances in Natural Language Processing (2006), pp. 616–624.
[21] Ulrike Rackow, Ido Dagan, and Ulrike Schwall. “Automatic translation of noun compounds”. In: Proceedings of the 14th conference on Computational linguistics - Volume 4. Association for Computational Linguistics, 1992, pp. 1249–1253.
[22] Jacques Savoy. “Report on CLEF-2001 experiments: Effective combined query-translation approach”. In: Evaluation of Cross-Language Information Retrieval Systems. Springer Berlin Heidelberg, 2002, pp. 27–43.
[23] Jonas Sjöbergh and Viggo Kann. “Vad kan statistik avslöja om svenska sammansättningar”. In: Språk och Stil 16 (2006), pp. 199–214.
[24] Snowball. Swedish Stemming Algorithm. 2016. url: http://snowball.tartarus.org/algorithms/swedish/stemmer.html (visited on 12/20/2016).
[25] Radu Soricut and Franz Och. “Unsupervised Morphology Induction Using Word Embeddings”. In: Proc. NAACL. 2015.
[26] Språkteknologi.se. Den stora svenska ordlistan, svensk rättstavningslista. 2016. url: http://sprakteknologi.se/resurser/sprakresurser/den-stora-svenska-ordlistan-svensk-raettstavningslista-foer-open-office (visited on 01/26/2016).
[27] sv.wikipedia.org. Kategorier. 2016. url: https://sv.wikipedia.org/wiki/Special:Kategorier (visited on 11/26/2015).


[28] Wikipedia. List of Wikipedias. 2016. url: https://en.wikipedia.org/wiki/List_of_Wikipedias (visited on 01/21/2016).
[29] Wikipedia. Pages that link to "Table tennis". 2016. url: https://en.wikipedia.org/wiki/Special:WhatLinksHere/Table_tennis (visited on 01/10/2016).
[30] Wikipedia. Table tennis. 2016. url: https://en.wikipedia.org/wiki/Table_tennis (visited on 01/10/2016).
[31] Wikipedia. Wikipedia:Manual of Style/Linking. 2016. url: https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Linking (visited on 01/10/2016).
