VYTAUTAS MAGNUS UNIVERSITY

Vidas DAUDARAVIČIUS

COLLOCATION SEGMENTATION FOR TEXT CHUNKING

Doctoral dissertation
Physical Sciences, Informatics (09P)

Kaunas, 2012

UDK 004
Da-371

This doctoral dissertation was written at Vytautas Magnus University in 2008–2012.

Research supervisor: doc. dr. Minija Tamošiūnaitė (Vytautas Magnus University, Physical Sciences, Informatics – 09 P)

ISBN 978-9955-12-844-1

VYTAUTO DIDŽIOJO UNIVERSITETAS

Vidas DAUDARAVIČIUS

TEKSTO SKAIDYMAS PASTOVIŲJŲ JUNGINIŲ SEGMENTAIS

Daktaro disertacija
Fiziniai mokslai, Informatika (09P)

Kaunas, 2012

Disertacija rengta 2008–2012 metais Vytauto Didžiojo universitete

Mokslinis vadovas: doc. dr. Minija Tamošiūnaitė (Vytauto Didžiojo universitetas, fiziniai mokslai, informatika – 09 P)

In the beginning was the Word (John 1:1)

Summary

Segmentation is a widely used paradigm in natural language processing. There are many methods of segmentation for text processing, including topic segmentation, sentence segmentation, morpheme segmentation, phoneme segmentation, and Chinese text segmentation. Rule-based, statistical and hybrid methods are employed to perform the segmentation. This dissertation introduces a new type of segmentation – collocation segmentation – and a new method to perform it, and applies them to three different text processing tasks:

– Lexicography. Collocation segmentation makes possible the use of large corpora to evaluate the usage and importance of terminology over time. It highlights important methodological changes in the history of research and allows actual research trends throughout history to be rediscovered. Such an analysis is applied to the ACL Anthology Reference Corpus of scientific papers published during the last 50 years in this research area.

– Text categorization. Text categorization results can be improved using collocation segmentation. The study shows that collocation segmentation, without any other language resources, achieves better results than the widely used n-gram techniques together with POS (Part-of-Speech) processing tools. The categorization precision of EU legislation using EuroVoc was improved from 48.30±0.48% to 54.67±0.78%, from 51.41±0.56% to 59.15±0.48%, and from 54.05±0.39% to 58.87±0.36% for the English, Lithuanian and Finnish languages, respectively.

– Statistical Machine Translation. The preprocessing of data with collocation segmentation and the subsequent integration of these segments into a Statistical Machine Translation system improves the translation results. Diverse word combinability measures influence the final collocation segmentation differently and, thus, the translation results.

The new collocation segmentation method is simple, efficient and applicable to language processing for diverse applications.

Commonly Used Terms and Acronyms

NLP – Natural Language Processing

MI – Mutual Information

MT – Machine Translation

TF–IDF – Term Frequency–Inverse Document Frequency

SMT – Statistical Machine Translation

BLEU – Bilingual Evaluation Understudy

ACL – Association for Computational Linguistics

HMM – Hidden Markov Model

Contents

Summary

Commonly Used Terms and Acronyms

Introduction

2 Segmentation in NLP
   2.1 Segmentation types
      2.1.1 Discourse segmentation
      2.1.2 Topic Segmentation
      2.1.3 Paragraph segmentation
      2.1.4 Chinese Text Segmentation
      2.1.5 Word Segmentation and tokenization

3 Collocation Segmentation
   3.1 Collocation in NLP
   3.2 The Algorithm of Collocation Segmentation
   3.3 The combinability of words
   3.4 Setting segment boundaries with a threshold
   3.5 Setting segment boundaries with the Average Minimum Law
   3.6 Segmentation examples

4 Applying Collocation Segmentation
   4.1 Automatic multilingual annotation of EU legislation with EuroVoc descriptors
      4.1.1 Related work
      4.1.2 Conceptual Thesaurus
      4.1.3 EuroVoc descriptor assignment
      4.1.4 Evaluation of the experiment results
      4.1.5 Conclusions
   4.2 Integration into Statistical Machine Translation
      4.2.1 The IWSLT 2010 Evaluation Campaign BTEC task
      4.2.2 Phrase-based baseline system
      4.2.3 Introducing collocation segmentation into a phrase-based system
      4.2.4 Experiments
      4.2.5 Official Evaluation results
      4.2.6 Conclusion
   4.3 Application to the ACL Anthology Reference Corpus
      4.3.1 The ACL Anthology Reference Corpus
      4.3.2 Collocation segments from the ACL ARC
      4.3.3 Conclusions

Conclusions

Bibliography

Introduction

“The (single) word is not privileged in terms of meaning. [...] Lexical items can be single words, compounds, multiword units, phrases, and even idioms” [Teubert, 2005]. Collocation is a well-known linguistic phenomenon which has a long history of research and use [Daudaravičius, 2012a]. Surprisingly, the term collocation itself has no standard definition, and this dissertation does not attempt to analyze the theoretical background of what a collocation is. There is a fundamental divide between those researchers who believe in collocation as a serious theoretical concept –

– “The meaning of a collocation is not a straightforward composition of the meanings of its parts. For example, the meaning of red tape is completely different from the meaning of its components.” [Wermter and Hahn, 2004]

– “Collocation is an expression of two or more words that corresponds to some conventional way of saying things. Sometimes, the notion of collocation is defined in terms of syntax (by possible part-of-speech patterns) or in terms of semantics (requiring to exhibit non-compositional meaning).” [Franz and Milch, 2002]

– and those who regard it as a convenient descriptive label without theoretical status.

– “Collocation is the occurrence of two or more words within a short space of each other in a text.” [Frantzi and Ananiadou, 1996]

– “In its most general sense, a collocation is a habitual or lexicalised word combination.” [Weeds et al., 2004]

– “A collocation is a recurrent sequence of words that co-occur together more than expected by chance in a given domain.” [Gil and Dias, 2003]

Linguists study collocations from a lexical point of view. Both Melčuk [Melc’uk and Polguere, 1987] and the more recent computational linguists [Pecina, 2005] have recognized the need to define collocation phenomena from both the statistical and syntactical points of view. The natural language processing community treats collocations as a key issue for tasks like natural language analysis (syntactic and semantic structure analysis) and generation (translation), as well as real-life applications such as machine translation and information retrieval. The general context of the term collocation is highly related to these concepts:

– word – the base unit of a language. Words are not simple phenomena. In general, a word is a partial or full representation of some concept in the world. While Indo-European languages write words using strings of individual characters, in Asian languages there is no clear definition of what a word is.


– token – an atomic element, usually a word, in the natural language processing field. Tokenization is the process of splitting a text into a sequence of tokens. Simple words are easy to recognize, but there are words with no clear boundary. For instance, the token don’t can be tokenized as a single word | don’t |, as two words | don | ’t | or | do | n’t |, or as three words | don | ’ | t |. There are no clear guidelines for how to tokenize a text into atomic units; usually this decision is left to the engineers themselves. A recent study [He and Kayaalp, 2006] shows that freely available tokenizers produce largely distinct results (eleven out of thirteen tested were distinct).

– n-gram – a contiguous sequence of n words from a given text. Word n-grams are widely used in natural language modeling using HMM. They are also used to produce multidimensional vector spaces for text classification, machine translation, information retrieval, etc.
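For illustration, contiguous word n-grams of the kind described above can be extracted in a few lines of Python (a generic sketch, not tied to any particular toolkit):

```python
def ngrams(tokens, n):
    """Return the list of contiguous n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```

Such n-gram lists are the raw material both for HMM-style language models and for the vector-space text representations mentioned above.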

Collocation phenomena are simple to understand, yet hard to employ in real tasks. Little work has been done to simplify the use and application of collocations over the last few decades. [Smadja, 1991] introduced a general three-step method of collocation processing that achieved 80% precision and 94% recall in collocation extraction. The system was used to compile a collocation dictionary of about 2,000 collocations – a lexicographic task. In general, the proposed algorithm allows collocations of two or more words in length to be extracted efficiently. While the results showed great success and a breakthrough in collocation extraction, this collocation extraction technique has never been applied to other tasks. The main obstacle to applying this approach was the complexity of the algorithm and the computational resources required. Even nowadays, applications that could make use of such a system are barely possible in practice. Thus, for no good reason, collocations “were killed”, and no new methods were introduced for a long time. Meanwhile, n-grams became the main strategy in natural language processing tasks: they are easy to use, simple to implement, and theoretically grounded. Collocations continued to be studied by enthusiasts. Beginning in the 1990s, computers became powerful enough to apply statistical methods (n-grams), but linguistically grounded, rule-based methods (collocations) remained too complex for the available resources. Thus, n-grams replaced collocations as the method of choice in NLP. The importance of this paradigm shift is raised in the most recent study on collocations [Seretan, 2011]. The n-gram approach lacks the main property of language – the expression of meaning, a descriptive property. How long should an n-gram be to capture meaning? The main approach is to keep the n-grams as long as possible without exhausting the technical resources or exceeding the time limit.
This approach is practical, but poor in linguistic understanding. In practice, unigrams are applied when speed is required, and higher-order n-grams are used for higher quality results. The main practical issues for applying collocations are how far a collocation should be expanded, and what length it should be. Higher-order n-grams require large text corpora to extract reliable statistics. Such corpora exist for widely used languages such as English, Chinese, French, and German. The main issue here becomes less-resourced languages. While most studies are for the English language, there are many other languages to be processed. It is naive to believe that the methods applied to English can easily be applied to any other language. There are not many multilingual resources for comparative language processing. The largest multilingual corpus available is the JRC-Acquis Multilingual Parallel Corpus, compiled by the Joint Research Centre of the European Commission. Until more corpora become available, there will not be many language processing tools, like parsers, for less-resourced languages. The last issue in moving from n-grams to collocations is: can n-grams of different lengths be comparable? One recent approach is based on extracting bigram, trigram, quadgram, and five-gram collocations independently. However, while the main goal here is to recognize meaningful multiword collocations, there is no space left for single-word units, i.e., the status of single-word units (unigrams) is not clear, and unigrams are often omitted without justification. In this dissertation I propose a new definition of the collocation: A collocation is the longest non-overlapping combination of words in a text in which consecutive word pairs occur more often than by chance.
This definition extends the most common traditional definition of collocation (as a recurrent sequence of words that co-occur together more than expected by chance) with the requirement that the collocation processing technique be text-based, rather than n-gram filtering-based. In principle, this follows sentence segmentation practice: while it is hard to define what a sentence is, it is simple to recognize the start and end of a sentence without needing to analyze it first. My dissertation aims to remedy the gap presented above by proposing a new type of segmentation. The object of my dissertation is word collocations, which combine sequential words into meaningful (e.g., machine translation) or functional (e.g., by the way) lexical units in human languages. The main goal of my dissertation is to introduce collocation segmentation, a collocation processing method that is: of low algorithmic complexity; not limited by collocation length; language-independent (requiring only a plain corpus); efficient; and applicable to many NLP tasks. The main tasks that must be undertaken in order to accomplish this goal are:

– to summarize segmentation types, thereby presenting the main issues in defining segmentation units (text, discourse, topic, paragraph, sentence, word, etc.);

– to define an algorithmic procedure for collocation extraction, as well as to develop a general framework for collocation segmentation, including the selection of measures, features and thresholds that allow the collocation extraction system to be implemented;

– to find the optimal threshold for collocability, which measures the strength of the relation between two words and defines the necessity to combine sequential words into one collocation, thus achieving the best text classification results;

– to study the influence of collocation segmentation on text classification for diverse languages (English, Lithuanian and Finnish), and to compare the results against unigram and bigram feature-based text classification;

– to study the influence of collocation segmentation, as compared to the standard unigram approach, on Statistical Machine Translation, by applying it to alignment;

– to apply collocation segmentation to a study of the main trends of research methods via the use of terminology in published articles.
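The proposed definition of collocation lends itself to a compact sketch: using pointwise mutual information as one possible measure of whether consecutive word pairs "occur more often than by chance", and a hand-set threshold of zero, a text can be segmented greedily from left to right. The measures and boundary rules actually used are developed in Chapter 3; the toy corpus counts below are illustrative assumptions, not real statistics.

```python
import math

def pmi(pair_count, left_count, right_count, total):
    """Pointwise mutual information of a consecutive word pair;
    unseen pairs get minus infinity (never merged)."""
    if pair_count == 0:
        return float("-inf")
    return math.log2((pair_count * total) / (left_count * right_count))

def segment(tokens, pair_counts, word_counts, total, threshold=0.0):
    """Greedy left-to-right segmentation: extend the current segment while
    the combinability of each consecutive word pair exceeds the threshold."""
    segments, current = [], [tokens[0]]
    for prev, word in zip(tokens, tokens[1:]):
        score = pmi(pair_counts.get((prev, word), 0),
                    word_counts[prev], word_counts[word], total)
        if score > threshold:
            current.append(word)       # pair combines: same segment
        else:
            segments.append(current)   # boundary: start a new segment
            current = [word]
    segments.append(current)
    return segments

# Toy corpus statistics (illustrative assumptions):
counts = {"statistical": 10, "machine": 10, "translation": 10, "improves": 10}
pairs = {("machine", "translation"): 9}
print(segment(["statistical", "machine", "translation", "improves"], pairs, counts, 1000))
# [['statistical'], ['machine', 'translation'], ['improves']]
```

Note that single words remain legitimate segments here, which is exactly the property the definition demands of unigrams.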

The novelty and practical significance of the dissertation:

– A new, language-independent, and efficient collocation processing method is introduced, one which is applicable to many NLP tasks, enables a step back from the currently most widely adopted n-gram approach, and has acceptable computational complexity.

– A new approach for collocation processing is investigated, using a text as a curve of combinability values between adjacent words in the text. The curve allows a new technique for collocation processing to be applied, based on observing the changing combinability values between two words. This approach avoids the need to define the threshold manually, which is obligatory in traditional collocation processing.

– A new framework for applying collocations to text classification is investigated. The similarities in classification performance in different languages (including English, Lithuanian and Finnish) show that collocation segmentation improves classification results and is applicable to diverse languages. Collocation segmentation does indeed produce the best results when compared to unigrams and bigrams.

– A new framework for applying collocations to Statistical Machine Translation is investigated. The analysis shows that collocation segmentation is especially beneficial to the Statistical Ma- chine Translation system in the parallel text alignment step.

– A new framework for applying collocations to terminology processing in large corpora is pro- posed and investigated. This study has shown that collocation segmentation helps in the extrac- tion of significant terms, and in the analysis of term yearly significance.
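The combinability-curve idea listed above can be illustrated with a short sketch. Treating the text as a sequence of combinability scores (one per consecutive word pair), a boundary is placed wherever a score dips below the average of its two neighbours. This is a simplified reading of the local-minimum idea behind the Average Minimum Law developed in Chapter 3, not its exact formulation, and the curve values are invented for illustration:

```python
def dip_boundaries(curve):
    """Indices of pair positions whose combinability score is lower than
    the average of the two neighbouring scores (interior positions only)."""
    return [i for i in range(1, len(curve) - 1)
            if curve[i] < (curve[i - 1] + curve[i + 1]) / 2]

# A dip at position 2 of this illustrative curve marks a segment boundary.
print(dip_boundaries([5.0, 4.0, 1.0, 4.5, 5.0]))  # [2]
```

Because boundaries are read off the shape of the curve rather than compared against a fixed cut-off, no global threshold has to be chosen by hand.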

The structure of the dissertation. Chapter 2 presents an overview of various units and their main identification techniques. The discussion shows that unit identification is often based on setting boundaries. I apply this practice to the collocation segmentation method, which I introduce in Chapter 3. Then, in Chapter 4, I apply collocation segmentation and study three diverse tasks to show the value of the newly proposed method. It must be mentioned that the integration of collocations into the

Statistical Machine Translation system was done in cooperation with the Barcelona Media Innovation Center. Finally, the main conclusions are listed at the end of the dissertation.

Chapter 2

Segmentation in NLP

There are many different terms related to segmentation that are used in NLP: discourse segmentation, topic segmentation, text segmentation, Chinese text segmentation, word segmentation, etc. Segmentation is performed by detecting boundaries, which may also be of several different types: syllable boundaries, sentence boundaries, clause boundaries, phrase boundaries, prosodic boundaries, morpheme boundaries, paragraph boundaries, word boundaries, constituent boundaries, topic boundaries, etc. Segmentation is used mainly as an initial or intermediate step for many tasks, such as classification, feature selection, deep structure analysis, machine translation, and so on. Statistical, rule-based or hybrid methods are employed to perform this task. In this chapter, I summarize the different types of segmentation that are used in text analysis for various language processing levels (e.g., pragmatics, semantics, syntax, morphology, phonology).

2.1 Segmentation types

2.1.1 Discourse segmentation

Computational discourse analysis is based on Rhetorical Structure Theory (RST). RST was originally developed for studies of computer-based text generation. Coherent texts consist of minimal units which are recursively linked to each other through rhetorical relations. Rhetorical relations are also known, in other theories, as coherence or discourse relations. The discourse unit is typically a clause, though this varies from researcher to researcher depending on the level of granularity needed. Therefore, discourse segmentation identifies related, non-overlapping text spans which form new discourse units. RST is intended to describe texts, rather than the processes of creating or reading and understanding them. An example of discourse segments and their analysis is presented in Figure 2.1. This example shows that it is very difficult to formally define what a discourse unit is, because these units are wide, long, and their boundaries are not strict. An advanced study on clause identification can be found in [Sang and Dejean, 2001]. More advanced overviews and studies on RST appear in [Mann and Thompson, 1988, Taboada and Mann, 2006, Levow, 2004, Passonneau and Litman, 1997, Hernault et al., 2010, Tofiloski et al., 2009]. An Internet site is devoted to RST at http://www.sfu.ca/rst/. The manually annotated RST Discourse Treebank [Carlson et al., 2002] is the main source for the training and testing of tools for discourse analysis. Discourse segmentation is very important for the deeper understanding of a text structure to be processed. Such structures provide other important non-lexical features of texts.


Figure 2.1: An example of discourse segmentation and references based on RST analysis.

2.1.2 Topic Segmentation

Topic analysis examines a text’s topic structure to identify which topics are included in the text and how these topics change. The analysis consists of two main tasks: topic identification and text segmentation (based on topic changes). Topic segmentation is often referred to as “text segmentation”, as in [Li and Yamanishi, 2000]. The unit of a topic is a piece of text, which is very often referred to as a paragraph or a sequence of paragraphs. The method of topic segmentation relies mostly on detecting boundaries as changes of topic. Topic analysis is extremely useful in a variety of text processing applications. For example, it can be used in the automatic indexing of texts for purposes of information retrieval, text summarization, or the discovery of story boundaries in a stream of text or audio recordings [Matveeva and Levow, 2007, Arguello and Rose, 2006]. The goals are to detect the main topics and subtopics of a text, and to identify where those subtopics lie within the text. The segmentation method (e.g., TextTiling [Hearst, 1997]) generally segments a text into blocks (paragraphs) in accordance with topic changes within the text, but on its own it does not identify (or label) the topics discussed in each of the blocks. That is, given a sequence of (written or spoken) words, the aim of topic segmentation is to find the boundaries where topics change. Work on topic segmentation is generally based on two broad classes of cues. On the one hand, one can exploit the fact that topics are correlated with topical content-word usage, and that global shifts in word usage are indicative of changes in topic. Quite independently, discourse cues, or linguistic devices such as discourse markers, cue phrases, syntactic constructions, and prosodic signals are employed by speakers (or writers) as generic indicators of beginnings or endings of topical segments.
In automatic topic segmentation systems, word usage cues are often captured by statistical language modeling and information retrieval techniques. Discourse cues, on the other hand, are typically modeled with rule-based approaches or classifiers derived by machine learning techniques (such as decision trees). An advanced introduction and study on topic segmentation can be found in [Tur et al., 2001].
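The content-word cue can be illustrated with a TextTiling-style score: represent adjacent text blocks as word-frequency vectors and look for gaps where their cosine similarity drops, since a low score suggests a topic boundary. This is a deliberately simplified sketch; real systems add stop-word removal, smoothing and depth scoring:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def gap_scores(blocks):
    """Similarity between each pair of adjacent text blocks; low scores
    suggest a topic boundary at that gap."""
    bags = [Counter(block.split()) for block in blocks]
    return [cosine(bags[i], bags[i + 1]) for i in range(len(bags) - 1)]

blocks = ["the cat sat on the mat",
          "the cat chased the mouse",
          "stocks fell on market news"]
print(gap_scores(blocks))  # first gap scores high, second gap near zero
```

On this toy input the vocabulary shift between the second and third blocks produces the low-similarity gap, which is exactly where a topic boundary would be placed.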

2.1.3 Paragraph segmentation

Paragraph segmentation has a number of useful applications, including: multi-document summarization of web documents; producing the layout for transcripts provided by speech recognizers and optical character recognition systems; and determining the layout of documents generated for output devices with different screen sizes [Filippova and Strube, 2006]. Though related to the task of topic segmentation, paragraph segmentation has not yet been thoroughly investigated. Paragraphs are considered a stylistic phenomenon, and there is no clear opinion on what a paragraph is as a unit. Paragraph structure is arbitrary, and cannot be determined based solely on the properties of the text. Psycholinguistic studies report that humans agree on placing boundaries between paragraphs, and that these boundaries are informative [Stark, 1988]. In contrast to topic segmentation, paragraph segmentation has the advantage that large amounts of annotated data are readily available for supervised learning. [Sporleder and Lapata, 2004, 2006] mainly focused on superficial and easily obtainable surface features like punctuation, quotation marks, distance, and words in the sentence. Linguistically motivated features that can be computed automatically provide better paragraph segmentation [Filippova and Strube, 2006] and significantly outperform surface features. The task of topic segmentation is closely related to the task of paragraph segmentation. If there is a topic boundary, it is very likely that it coincides with a paragraph boundary. However, the reverse is not true, as one topic can extend over several paragraphs. Thus, topic boundaries may be used as high-precision, low-recall predictors for paragraph boundaries.
Still, there is an important difference: while work on topic segmentation mainly depends on content words [Hearst, 1997] and the relations between them, which are computed using lexical chains [Galley et al., 2003], paragraph segmentation (as a stylistic phenomenon) may depend just as much on function words. Hence, paragraph segmentation is a task which crosses the traditional border between content and style.

2.1.4 Chinese Text Segmentation

Words are the basic linguistic units of natural language processing. Word extraction is important in lexicon extraction, speech recognition, syntactic parsing, semantic interpretation, and so on. Thus, the identification of lexical words and/or the delimitation of words in running texts is a prerequisite of NLP. Words and tokens are the generic units in almost all linguistic theories and language-processing systems, including Japanese [Yosiyuki et al., 1994] and Korean [Ho et al., 1994]. The identification of words in natural language is nontrivial. Chinese texts are character-based, not word-based. Each Chinese character stands for one phonological syllable and, in most cases, represents a morpheme. This presents a problem, as less than 10% of the word types (and less than 50% of the tokens in a text) in Chinese are composed of a single character [Che, 1993]. However, Chinese texts (and texts in some other Asian languages, such as Japanese) do not have delimiters (e.g., spaces) to mark the boundaries of meaningful words. Even in English texts, some phrases consist of several words (e.g., look for, take out, hold on, first of all). Moreover, single Chinese characters often carry different meanings. This ambiguity can be resolved by combining characters to form a word. Chinese words are n-grams of characters. According to the Frequency Dictionary of Modern Chinese (Beijing Language Institute, 1986), among the 9,000 most frequent Chinese words, 26.7% are unigrams, 69.8% are bigrams, 2.7% are trigrams, 0.007% are quadgrams, and 0.002% are five-grams. There are generally two ways to identify words in a text [Huang et al., 1996]. One is the deductive strategy, whereby words are identified through the segmentation of running texts. The other is the inductive strategy, which identifies words using morpho-lexical rules of composition.
In Chinese text segmentation there are three basic approaches [Sproat et al., 1996]: rule-based, statistical, and hybrid. The rule-based approach identifies words by applying prior knowledge or morpho-lexical rules governing the derivation of new words. The statistical approach identifies words based on the distribution of their components in a large corpus. [Sproat and Shih, 1990] developed a statistical method that utilizes the mutual information shared between two characters. The limitation of the method is that it can only deal with bigrams. [Chien, 1997] develops a PAT tree-based method that extracts significant words by observing the mutual information of two overlapping patterns with a significance function. [Zhang et al., 2000] propose the application of a statistical method that is based on context dependence and mutual information. In most cases, term frequency is the main feature of word recognition. [Chen and Bai, 1998] propose a corpus-based learning approach that learns grammatical rules and automatically evaluates them. [Teahan et al., 2000] propose a compression-based algorithm for Chinese text segmentation. There is no commonly accepted standard for evaluating the performance of word extraction methods, and it is very hard to decide whether a word is meaningful or not [Sproat et al., 1996]. As it is difficult to find all of the words in an original corpus that would be found meaningful by a Chinese person, it is very hard to count recall in the traditional way, that is, as the number of meaningful words extracted divided by the number of all meaningful words in the original data. On the other hand, it is also impossible to compute traditional precision and traditional recall by comparing hand-segmented sample sentences with those segmented automatically. For example, the sentence 发展中国家用电器换取外汇 may be segmented as:

发展 中国 家用电器 换取 外汇
develop / China / household-appliance / exchange / foreign currency
“to develop China’s household-appliance industry to exchange for foreign currency.”

发展中国家 用 电器 换取 外汇
developing country / use / appliance / exchange / foreign currency
“developing countries use appliances to exchange foreign currency.”

Both segmentation results are correct from the lexical point of view, and there is no need to resolve the ambiguity. The ambiguous words in the example can be glossed as:

发 – issue, dispatch, send out, emit

展 – open, unfold; stretch, extend

发展 – development; growth; to develop; to grow; to expand

中 – central

国 – nation, country, nation-state

中国 – China

家 – house, home, residence; family

用 – use, employ, apply, operate; use

家用 – home-use

电 – electricity, electric; lightning

器 – receptacle, vessel; instrument

电器 – (electrical) appliance; device

家用电器 – major appliance

换 – change, exchange; substitute

取 – take, receive, obtain; select

换取 – give something and get something in return

外 – out, outside, external; foreign

汇 – concourse; flow together, gather

外汇 – foreign (currency) exchange
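The mutual-information approach mentioned above ([Sproat and Shih, 1990]) scores a pair of adjacent characters by how much more often they co-occur than chance would predict; pairs scoring above a threshold are grouped into a word. A minimal numeric sketch follows, where the counts for the pair 中国 are invented for illustration:

```python
import math

def char_mi(pair_count, c1_count, c2_count, total):
    """Pointwise mutual information between two adjacent characters."""
    return math.log2((pair_count * total) / (c1_count * c2_count))

# Toy counts: the pair 中国 co-occurs far more often than chance, so its
# MI is high and the two characters would be grouped into one word.
mi_zhongguo = char_mi(pair_count=400, c1_count=800, c2_count=600, total=100000)
print(round(mi_zhongguo, 2))  # 6.38
```

A character pair whose joint count matched the chance expectation would score near zero and be left as two separate units.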

2.1.5 Word Segmentation and tokenization

Tokenization is the process of breaking a stream of text up into words or symbols, called tokens. The list of tokens then becomes the input for further processing, such as parsing or POS tagging. Tokenization is useful both in linguistics (where it is a form of text segmentation) and in computer science, where it forms part of lexical analysis. Typically, tokenization occurs at the word level. However, it is sometimes difficult to define what is meant by a “word”. Often a tokenizer relies on simple heuristics, for example:

– All contiguous strings of alphabetic characters are part of one token; likewise with numbers.

– Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters.

– Punctuation and whitespace may or may not be included in the resulting list of tokens.
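The heuristics listed above amount to a one-line regular-expression tokenizer (a generic sketch; production tokenizers add many special cases for abbreviations, numbers, URLs and the like):

```python
import re

def tokenize(text):
    """Split into runs of alphanumerics, or single punctuation marks,
    discarding whitespace."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)

print(tokenize("New York-based"))  # ['New', 'York', '-', 'based']
print(tokenize("don't"))           # ['don', "'", 't']
```

Even this tiny example already exhibits the ambiguities discussed below: the hyphen and the apostrophe are each cut into separate tokens, which is only one of several defensible choices.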

This approach is fairly straightforward. A classic example is “New York-based”, which a tokenizer may break at the space even though the better break is at the hyphen. Another common example is “don’t”, which a tokenizer may break as | do | n’t |, | don | ’ | t |, or | don | ’t |. Tokenization is particularly difficult for languages such as Chinese, which have no word boundaries. A comparison of various freely available tokenizers is made in [He and Kayaalp, 2006]. The study exploits thirteen different tokenizers to tokenize simple and complex text samples. The results show that eleven of these tokenizers produced distinct results, and that only two were the same as others in the study.

Chapter 3

Collocation Segmentation

3.1 Collocation in NLP

Collocation is a well-known linguistic phenomenon which has a long history of research and use. The reader is invited to refer to the recent study on collocation extraction [Seretan, 2011]. In this section I study the usage of terms related to collocations in the large and complex ACL Anthology Reference Corpus [1] to make a historical overview. Until 1986, the most significant terms in the ACL ARC were related to formal/rule-based methods [Daudaravičius, 2012a]. Starting in 1987, terms related to statistical methods became more important (e.g., language model, similarity measure, text classification). The list of terms related to the term collocation could be prohibitively lengthy, and could include many aspects of what a collocation is and how it is used. For simplicity’s sake, a short list of related terms (including word, collocation, multiword, token, unigram, bigram, trigram, collocation extraction and segmentation) was compiled. Table 3.1 shows when these terms were introduced in the ACL ARC: some terms were introduced early on, others more recently. The term collocation was introduced nearly fifty years ago and has been in use ever since. This is not unexpected, as collocation phenomena were already being studied by the ancient Greeks [Seretan, 2011]. Table 3.2 presents the first use of these terms, showing that the terms segmentation, collocation and multiword are related to the similar concept of gathering consecutive words together into one unit. While the term collocation has been used for many years, the first attempt to define what a collocation is could be related to the time period when statistics first began to be used heavily in linguistics. Until that time, collocation was used mostly in the sense of an expression produced by

¹ http://acl-arc.comp.nus.edu.sg/

Term                     Count    Documents   Introduced in
word                     218813   7725        1965
segmentation             11458    1413        1965
collocation              6046     786         1965
multiword                1944     650         1969
token                    3841     760         1973
trigram                  3841     760         1973/87
bigram                   5812     995         1988
unigram                  2223     507         1989
collocation extraction   214      57          1992

Table 3.1: Term usage in ACL ARC


word
[Culik, 1965]: 3. Translation "word by word". "Of the same simplicity and uniqueness is the decomposition of the sentence S in its single words w1, w2, ..., wk separated by interspaces, so that it is possible to write s = (w1 w2 ... wk) like at the text." A word is the result of a sentence decomposition.

segmentation
[Sakai, 1965]: The statement "x is transformed to y" is a generalization of the original fact, and this generalization is not always true. The text should be checked before a transformational rule is applied to it. Some separate steps for this purpose will save the machine time. (1) A text to be parsed must consist of segments specified by the rule. The correct segmentation can be done by finding the tree structure of the text. Therefore, the concatenation rules must be prepared so as to account for the structure of any acceptable string.

collocation
[Tosh, 1965]: We shall include features such as lexical collocation (agent-action agreement) and transformations of semantic equivalence in a systematic description of a higher order which presupposes a morpho-syntactic description for each language [8, pp. 66-71]. The following analogy might be drawn: just as strings of alphabetic and other characters are taken as a body of data to be parsed and classified by a phrase structure grammar, we may regard the string of rule numbers generated from a phrase structure analysis as a string of symbols to be parsed and classified in a still higher order grammar [11; 13, pp. 67-83], for which there is as yet no universally accepted nomenclature.

multi-word
[Yang, 1969]: When title indices and catalogs, subject indices and catalogs, business telephone directories, scientific and technical dictionaries, lexicons and idiom-and-phrase dictionaries, and other descriptive multi-word information are desired, the first character of each nontrivial word may be selected in the original word sequence to form a keyword. For example, the rather lengthy title of this paper may have a keyword as SADSIRS. Several known information systems are named exactly in this manner such as SIR (Raphael’s Semantic Information Retrieval), SADSAM (Lindsay’s Sentence Appraiser and Diagrammer and Semantic Analyzing Machine), BIRS (Vinsonhaler’s Basic Indexing and Retrieval System), and CGC (Klein and Simmons’ Computational Grammar Coder).

token
[Beebe, 1973]: The type/token ratio is calculated by dividing the number of discrete entries by the total number of syntagms in the row.

trigram
[Knowles, 1973]: sort of phoneme triples (trigrams), giving list of clusters and third-order information-theoretic values.
[D’Orta et al., 1987]: Such a model it called trigram language model. It is based on a very simple idea and, for this reason, its statistics can be built very easily only counting all the sequences of three consecutive words present in the corpus. On the other band, its predictive power is very high.

bigram
[van Berkelt and Smedt, 1988]: Bigrams are in general too short to contain any useful identifying information while tetragrams and larger n-grams are already close to average word length.
[Church and Gale, 1989]: Our goal is to develop a methodology for extending an n-gram model to an (n+l)-gram model. We regard the model for unigrams as completely fixed before beginning to study bigrams.

unigram
The same as bigram for [Church and Gale, 1989].

collocation extraction
[McKeown et al., 1992]: Added syntactic parser to Xtract, a collocation extraction system, to further filter collocations produced, eliminating those that are not consistently used in the same syntactic relation.

Table 3.2: Introductions of terms in ACL ARC.

Figure 3.1: The collocation segmentation of the sentence “a collocation is a recurrent and conventional fixed expression of words that holds syntactic and semantic relations” [Xue et al., 2006].

a particular syntactic rule. The first definition of collocation in the ACL ARC is found in [Cumming, 1986]: “By ‘collocation’ I mean lexical restrictions (restrictions which are not predictable from the syntactic or semantic properties of the items) on the modifiers of an item; for example, you can say answer the door but not answer the window. The phenomenon which I’ve called collocation is of particular interest in the context of a paper on the lexicon in text generation because this particular type of idiom is something which a generator needs to know about, while a parser may not.” It is not the purpose of this study to provide a definition of the term collocation, because at the moment there is no definition that everybody would agree upon. The introduction of unigrams, bigrams and trigrams in the 1980s had a big influence on the use of collocations in practice. N-grams, as a substitute for collocations, started being used intensively and in many applications. On the other hand, n-grams lack generalization capabilities, and recent research tends to combine n-grams, syntax and semantics [Pecina, 2005].

3.2 The Algorithm of Collocation Segmentation

Collocation segmentation is a new type of segmentation whose goal is to detect fixed word sequences and to segment a text into word sequences called collocation segments. A sequence is understood here as one or more items. Thus, a collocation segment is a sequence of one or more consecutive words with high inter-combinability relations. A collocation segment can be of any length (even a single word), and the length is not defined in advance. This definition differs from other collocation definitions, which are usually based on n-gram lists [Tjong Kim Sang and Buchholz, 2000, Choueka, 1988, Smadja, 1993]. Collocation segmentation is related to collocation extraction using syntactic rules [Lin, 1998]. The syntax-based approach allows the extraction of collocations that are easier to describe, and the process of collocation extraction is well-controlled. On the other hand, the syntax-based approach is not easily applied to languages with fewer resources. Collocation segmentation is based on a discrete signal of combinability values between consecutive words, and on boundaries that are used to chunk the sequence of words.

The main differences separating collocation segmentation from other methods are: (1) collocation segmentation does not break up nested collocations – it takes the longest one possible in a given context, while the n-gram list-based approach cannot detect whether a collocation is nested in another one (e.g., machine translation system); (2) collocation segmentation is able to process long collocations quickly, with the complexity of a bigram list size, while the n-gram list-based approach is usually limited to three-word collocations and has high processing complexity.

There are many word combinability measures, such as Mutual Information (MI), T-score, Log-Likelihood, etc. Detailed overviews of combinability measures can be found in [Pecina, 2010, Pecina and Schlesinger, 2006], and any of these measures can be applied to collocation segmentation. MI and Dice scores have similar distributions of values [Daudaravičius and Marcinkevičienė, 2004], but the Dice score always lies between zero and one, while the range of an MI score depends on the size of the corpus. Thus, the Dice score is preferable. This score is used, for instance, in the collocation compiler XTract [Smadja, 1993] and in the lexicon extraction system Champollion [Smadja et al., 1996]. Dice is defined as follows:

Dice(xi−1, xi) = 2 · f(xi−1, xi) / (f(xi−1) + f(xi))

where f(xi−1, xi) is the number of co-occurrences of xi−1 and xi, and f(xi−1) and f(xi) are the numbers of occurrences of xi−1 and xi in the training corpus. If xi−1 and xi tend to occur in conjunction, their Dice score will be high. The Dice score is sensitive to low-frequency word pairs: if two consecutive words are each used only once and appear together, there is a good chance that these two words are highly related and form some new concept, e.g., a proper name.
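As an illustration, the Dice score can be computed directly from unigram and bigram counts. The following sketch (the function name is mine, not part of the dissertation’s implementation) treats a token list as its own training corpus:

```python
from collections import Counter

def bigram_dice(tokens):
    """Compute Dice(x_{i-1}, x_i) for every adjacent token pair:
    Dice = 2 * f(x_{i-1}, x_i) / (f(x_{i-1}) + f(x_i)),
    where the f() counts are taken from the token list itself."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        (a, b): 2 * f_ab / (unigrams[a] + unigrams[b])
        for (a, b), f_ab in bigrams.items()
    }

tokens = "the cat sat on the mat and the cat slept".split()
scores = bigram_dice(tokens)
print(scores[("the", "cat")])    # 2*2 / (3+2) = 0.8
print(scores[("cat", "slept")])  # 2*1 / (2+1) ≈ 0.67
```

Note how the pair ("cat", "slept"), which occurs only once, still receives a fairly high score; this is the sensitivity to low-frequency pairs mentioned above.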

The new method employs unigram and bigram frequencies to evaluate the combinability between two words and to build a combinability curve. A text is seen as a changing curve of Dice values between adjacent words (see Figure 3.1). This approach is new. The curve of combinability values is used to detect the boundaries of collocation segments, which can be done using a threshold or by following certain rules, as described in the following sections. A collocation segment is a unit which is used as a single token in n-gram language models.

3.3 The combinability of words

In this dissertation I use the following six metrics:

1. Mutual Information (MI):

   MI(wi, wi+1) = log( N · f(wi, wi+1) / (f(wi) · f(wi+1)) )

2. Dice:

   dice(wi, wi+1) = 2 · f(wi, wi+1) / (f(wi) + f(wi+1))

3. T-score:

   tscore(wi, wi+1) = ( f(wi, wi+1) − f(wi) · f(wi+1) / N ) / √f(wi, wi+1)

4. Gravity-Counts (GC) [Daudaravičius and Marcinkevičienė, 2004]:

   gc(wi, wi+1) = log( f(wi) · f(wi, wi+1) / n(wi) ) + log( f(wi+1) · f(wi, wi+1) / n′(wi+1) )

5. Chi-squared (Chi2):

   chi2(wi, wi+1) = ( N / (f(wi) · f(wi+1)) )
                  · ( N · f(wi, wi+1) − f(wi) · f(wi+1) ) / ( N − f(wi) + f(wi, wi+1) )
                  · ( N · f(wi, wi+1) − f(wi) · f(wi+1) ) / ( N − f(wi+1) + f(wi, wi+1) )

6. Log-likelihood²:

   likelihood_Dunning(wi, wi+1) = 0,      if f(wi) = f(wi, wi+1) = 1
                                  0,      if f(wi+1) = f(wi, wi+1) = 1
                                  L · R,  otherwise

² Log-likelihood evaluates the associativity strength from wi to wi+1 only. The log-likelihood formula is modified here to be symmetrical (the strength from wi to wi+1 multiplied by the strength from wi+1 to wi). The original formula contains the L part only.

where

   L = 2 · ( (f(wi) − f(wi, wi+1)) · log( (f(wi) − f(wi, wi+1)) / (N − f(wi, wi+1)) )
           − f(wi) · log( f(wi) / N )
           + f(wi, wi+1) · log( f(wi, wi+1) / f(wi) ) )

   R = 2 · ( (f(wi+1) − f(wi, wi+1)) · log( (f(wi+1) − f(wi, wi+1)) / (N − f(wi, wi+1)) )
           − f(wi+1) · log( f(wi+1) / N )
           + f(wi, wi+1) · log( f(wi, wi+1) / f(wi+1) ) )

and where f(wi, wi+1) is the number of co-occurrences of wi and wi+1; f(wi) and f(wi+1) are the numbers of occurrences of wi and wi+1 in the training corpus; N is the size of the corpus; n(wi) is the number of distinct words that appear to the right of wi; and n′(wi) is the number of distinct words that appear to the left of wi.
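To make the definitions concrete, here is a minimal sketch of three of the measures as plain functions of the counts defined above (the function names and toy counts are mine, and the natural logarithm is an assumption of this sketch):

```python
import math

def mi(f_xy, f_x, f_y, n):
    """Mutual Information: log(N * f(x,y) / (f(x) * f(y)))."""
    return math.log(n * f_xy / (f_x * f_y))

def tscore(f_xy, f_x, f_y, n):
    """T-score: (f(x,y) - f(x)*f(y)/N) / sqrt(f(x,y))."""
    return (f_xy - f_x * f_y / n) / math.sqrt(f_xy)

def gravity_counts(f_xy, f_x, f_y, n_right_x, n_left_y):
    """Gravity Counts: log(f(x)*f(x,y)/n(x)) + log(f(y)*f(x,y)/n'(y)),
    with n(x) the number of distinct right neighbours of x and
    n'(y) the number of distinct left neighbours of y."""
    return (math.log(f_x * f_xy / n_right_x)
            + math.log(f_y * f_xy / n_left_y))

# toy counts: a pair seen 10 times in a corpus of N = 1000 tokens
print(round(mi(10, 20, 30, 1000), 3))      # ≈ 2.813
print(round(tscore(10, 20, 30, 1000), 3))  # ≈ 2.973
```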

3.4 Setting segment boundaries with a threshold

A boundary can be set between two adjacent words in a text when the combinability value is lower than a certain threshold. I use a dynamic threshold which defines the range between the minimum and the average combinability values of a sentence. Zero equals the minimum combinability value and 100 equals the average value of the sentence. Thus, the threshold value is expressed as a percentage between the minimum and the average combinability values. If the threshold is set to zero, then no threshold filtering is used and no collocation segment boundaries are set using the threshold. This approach allows comparable settings to be applied to different combinability measures. The threshold value for each paragraph or sequence of combinability values between adjacent tokens w is calculated as follows:

threshold(w) = avg(w) − (1 − TH) · (avg(w) − min(w))

where avg(w) is the average of the combinability values, TH is the threshold level expressed as a value between one (the average value) and zero (the minimum value), and min(w) is the minimum of the combinability values of the word sequence. If a paragraph contains only two words, then the boundary between these two tokens depends heavily on the combinability values between the beginning of the sentence and the first token, and between the second token and the end of the paragraph. Such dynamic assignment is important for producing similar threshold definition conditions for different combinability measures. Different combinability measures have different scales of values, making it difficult to set thresholds manually. The main purpose of using a threshold is to keep only “strongly” connected tokens. On the other hand, it is possible to set the threshold to the maximum of the combinability values. This would make no words combine into more than single-word segments, i.e., collocation segmentation would be equivalent to simple tokenization. In general, the threshold makes it possible to move from only single-word segments to whole-sentence segments by changing the threshold from the minimum to the maximum value of the sentence. There is no reason to use the maximum value threshold, but this helps to understand how the threshold can be used. [Daudaravičius and Marcinkevičienė, 2004] uses a global constant threshold, which produces very long collocation segments that resemble the clichés used in legal documents and are hardly related to collocations. A dynamic threshold reduces the problem of very long segments. Figure 3.1 shows an example of the application of such a threshold.
In the example [Xue et al., 2006], when the threshold is 50%, the segmentation is as follows: | a | collocation | is a | recurrent | and | conventional | fixed | expression | of words that | holds | syntactic | and | semantic relations | . |. To reduce the problem of long segments even more, the Average Minimum Law can also be used, as described in the following section.
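The threshold rule above can be sketched as follows (a simplified illustration; the function names and the toy Dice values are mine, and sentence-boundary tokens are omitted):

```python
def dynamic_threshold(values, th):
    """threshold = avg - (1 - TH) * (avg - min): TH = 0.0 gives the
    minimum of the sentence's combinability values, TH = 1.0 the average."""
    avg = sum(values) / len(values)
    return avg - (1 - th) * (avg - min(values))

def segment_by_threshold(tokens, dice_values, th=0.5):
    """Set a boundary wherever the combinability of two adjacent tokens
    falls below the dynamic threshold. dice_values[i] is the score of
    the pair (tokens[i], tokens[i+1])."""
    cut = dynamic_threshold(dice_values, th)
    segments, current = [], [tokens[0]]
    for token, value in zip(tokens[1:], dice_values):
        if value < cut:
            segments.append(current)
            current = [token]
        else:
            current.append(token)
    segments.append(current)
    return segments

tokens = ["machine", "translation", "is", "hard"]
dice_values = [0.9, 0.1, 0.2]  # hypothetical combinability scores
print(segment_by_threshold(tokens, dice_values, th=0.5))
# [['machine', 'translation'], ['is'], ['hard']]
```

With TH = 0.5 the cut-off lies halfway between the minimum (0.1) and the average (0.4) of the values, so both weak links are broken while “machine translation” stays one segment.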

3.5 Setting segment boundaries with the Average Minimum Law

[Daudaravičius, 2010] introduces the Average Minimum Law (AML) for setting the boundaries of collocation segmentation. AML is a simple rule which is applied to three adjacent combinability values and is expressed as follows:

boundary(xi−2, xi−1) = True,   if Dice(xi−2, xi−1) < ( Dice(xi−3, xi−2) + Dice(xi−1, xi) ) / 2
                       False,  otherwise

A boundary is set between two adjacent words in a text when the Dice value is lower than the average of the preceding and following Dice values. In addition, I use artificial tokens, such as a sequence beginning and a sequence ending, in order to apply AML to the first and last two words. These artificial tokens are used to calculate the combinability between the beginning of the sequence and the

first word, and between the last word and the end of the sequence, as shown in Figure 3.1. AML can be used together with the threshold or alone. The recent study [Daudaravičius, 2012b] shows that AML is able to produce a segmentation that gives the best text categorization results, while the threshold degrades them. On the other hand, AML can produce collocation segments even when the combinability values between two adjacent words are very low (see Figure 3.1). AML corresponds to the positive or negative values of the second-order derivative, if the combinability points are treated as intermediate values of some continuous signal. The second-order derivative expresses acceleration. This means that AML deals with “the speed of tokens” that follow each other. This language feature has not been explored and is a topic for future work.
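A minimal sketch of AML over a list of combinability values follows (the function names are mine; for simplicity, the artificial beginning/end tokens are modelled by padding with 0.0 rather than by real combinability values, which is an assumption of this sketch):

```python
def aml_boundaries(dice_values):
    """Average Minimum Law: a boundary falls between two adjacent words
    when their combinability value is lower than the mean of the two
    neighbouring values. dice_values[i] scores the pair (w_i, w_{i+1})."""
    padded = [0.0] + list(dice_values) + [0.0]  # stand-in for begin/end tokens
    return [
        padded[i] < (padded[i - 1] + padded[i + 1]) / 2
        for i in range(1, len(padded) - 1)
    ]

def segment_aml(tokens, dice_values):
    """Chunk tokens at the boundaries found by AML."""
    segments, current = [], [tokens[0]]
    for token, cut in zip(tokens[1:], aml_boundaries(dice_values)):
        if cut:
            segments.append(current)
            current = [token]
        else:
            current.append(token)
    segments.append(current)
    return segments

tokens = ["a", "machine", "translation", "system"]
dice_values = [0.05, 0.6, 0.5]  # hypothetical combinability scores
print(segment_aml(tokens, dice_values))
# [['a'], ['machine', 'translation', 'system']]
```

Note that no global threshold is involved: the nested collocation “machine translation” is not split off from “machine translation system”, because each internal link is strong relative to its neighbours.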

3.6 Segmentation examples

The proposed segmentation method is new and differs from other widely used statistical methods for collocation extraction [Tjong Kim Sang and Buchholz, 2000]. For instance, the general method used by Choueka [Choueka, 1988] is the following: for each length n (1 ≤ n ≤ 6), produce all word sequences of length n and sort them by frequency; impose a threshold frequency of 14. Xtract is designed to extract significant bigrams, and then expands bigrams to n-grams [Smadja, 1993]. Lin [Lin, 1998] extends the collocation extraction methods with syntactic dependency triples. These collocation extraction methods are performed on a dictionary level: the result of the process is a dictionary of collocations. Our collocation segmentation, in contrast, is performed within a text itself, and the result of the process is a segmented text (see Figure 3.1). The segmented text may later be used to create a dictionary of collocations; such a dictionary accepts all collocation segments. The main difference from Choueka’s and Smadja’s methods is that our proposed method accepts all collocations and performs no significance tests for them. On the other hand, the disadvantage of our method is that the segments do not always conform to correct grammatical and lexical phrases. The linguistic nature of the segments in most cases is: noun phrases (all relevant information), verb phrases (are carried out), prepositional phrases (by mutual agreement), noun or verb phrases with a preposition at the end (additional information on, carried out by), and combinations of a preposition and an article (in the). It is also important to notice that the collocation segmentation of the same translated text was similar in different languages, even where the word or phrase order differs. The same sentence in nineteen different languages, from the document jrc2005E0143 in the AC corpus, was segmented, and the results appear in Figure 3.2.
This collocation segmentation study will be extended in the near future to evaluate the conformity of the proposed collocation segmentation method to phrase-based segmentation using parsers.
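For contrast with the segmentation approach, the dictionary-level extraction described above (Choueka’s frequency method) can be sketched as follows (a simplified illustration; the function name is mine, while the length cap of 6 and the frequency threshold of 14 follow the description in the text):

```python
from collections import Counter

def choueka_ngrams(tokens, max_n=6, min_freq=14):
    """Dictionary-level collocation extraction in the style of Choueka:
    collect every word sequence of length 1..max_n and keep those whose
    corpus frequency reaches min_freq."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {ngram: f for ngram, f in counts.items() if f >= min_freq}

# toy corpus: the phrase repeats often enough to pass the threshold
tokens = "machine translation system".split() * 20
result = choueka_ngrams(tokens, max_n=3, min_freq=14)
print(result[("machine", "translation", "system")])  # 20
```

The output of this process is a dictionary of frequent n-grams, not a segmented text, which is exactly the difference discussed above.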


Figure 3.2: Segmentation of the same sentence in nineteen languages using AML only.

Chapter 4

Applying Collocation Segmentation

4.1 Automatic multilingual annotation of EU legislation with EuroVoc descriptors

Multilingual documentation in our Information Society has become a common phenomenon in many official institutions, private companies and on the Internet. One way to categorize these large amounts of textual information is through keyword assignment. Keyword assignment is the identification of appropriate keywords from the controlled vocabulary of a reference list (a thesaurus). Controlled vocabulary keywords, which are usually referred to as descriptors, are therefore not necessarily present explicitly in a text. Within this particular context, some monolingual and cross-lingual indexing systems have been developed [Névéol et al., 2005, Pouliquen et al., 2003, Civera and Juan, 2006, Civera Saiz, 2008] using multilingual thesauri such as EuroVoc [eurovoc]. Automatic document annotation with a controlled conceptual thesaurus is useful for establishing precise links between similar documents. Automatic multilingual document annotation requires an understanding of the differences among the languages and of the features that are useful for the classification. This study presents a language-independent document annotation system based on features derived from collocation segmentation. In this study I take three diverse languages: English, Lithuanian, and Finnish. These languages differ in inflection and agglutination. Finnish is highly inflected and agglutinated, Lithuanian is highly inflected but does not form compounds, and English is the least inflected and agglutinated of the three. The goals of this study are:

– to compare different methods for building feature vectors,

– to understand how the results of automatic annotation depend on the document size, and

– to evaluate the collocation segmentation threshold to build feature vectors for document anno- tation with EuroVoc descriptors.

Using the manually tagged multilingual corpus Acquis Communautaire 3.0 (AC) [Steinberger et al., 2006] and all of the descriptors found therein, I attained improvements in keyword assignment precision of 5–8 percentage points (from baselines of 48–54%) over the three diverse languages tested (English, Lithuanian and Finnish). I found a high correlation between the precision of automatic assignment and document length. The results of this study show that different collocation segmentation thresholds introduce different features that can be used for various purposes. A zero threshold (meaning no

threshold) allows the best text classification results to be achieved. This application-oriented technique shows whether collocations acquired by collocation segmentation are useful in document categorization tasks.

4.1.1 Related work

The EuroVoc thesaurus and all the legal documents indexed with its categories have been widely used in natural language processing tasks. [Steinberger et al., 2006] created one of the most widely used multilingual aligned parallel corpora, the JRC-Acquis, from manually labeled documents in 22 languages. Automatic EuroVoc indexing is a multi-label classification task, successfully applied and discussed abundantly in the literature (e.g., [Manning et al., 2008]). Here, I focus on EuroVoc-related work. [Mencía and Fürnkranz, 2010] applied three different multi-label classification algorithms to the Eur-Lex database of the European Union, indexed with EuroVoc, to address the problem of memory storage for classifiers. A keyword assignment system uses a training corpus of manually indexed documents to produce, for each descriptor, a feature vector of words whose presence in a text indicates that the descriptor may be appropriate for this text. Generally, feature vectors are associated with single words [Pouliquen et al., 2003] or with bigrams or trigrams [Civera and Juan, 2006]. A frequency approach [Névéol et al., 2005, Weiss et al., 2004] can be used to create multiword feature vectors from a corpus. This approach takes frequently occurring word combinations: the most straightforward implementation simply examines all combinations up to the maximum n-gram length. Clearly, the number of features grows significantly when longer multiword features are considered, and this approach is not effective for low-frequency words [Smadja, 1993]. In my work, I am interested in investigating the ability of collocation segmentation to improve classification results with a minimal increase in computational and memory costs for (non-)inflected and (non-)compounded languages such as English, Lithuanian and Finnish. My approach is based on the work proposed in [Pouliquen et al., 2003].
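The feature-vector approach described above can be sketched as follows (a deliberately simplified illustration: the names are mine, and raw co-occurrence counts stand in for the log-likelihood-weighted vectors used in the cited systems):

```python
from collections import Counter, defaultdict

def build_descriptor_vectors(training_docs):
    """training_docs: (tokens, descriptors) pairs, where tokens may be
    words or collocation segments. Returns a feature-frequency vector
    (a Counter) for each descriptor."""
    vectors = defaultdict(Counter)
    for tokens, descriptors in training_docs:
        counts = Counter(tokens)
        for descriptor in descriptors:
            vectors[descriptor].update(counts)
    return vectors

def rank_descriptors(tokens, vectors, k=5):
    """Score each descriptor by the overlap between its feature vector
    and the document's token counts; return the k best descriptors."""
    counts = Counter(tokens)
    scores = {
        d: sum(counts[t] * f for t, f in vec.items())
        for d, vec in vectors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# tiny hypothetical training set; "tax_on_income" mimics a collocation segment
docs = [
    (["tax_on_income", "interest", "payments"], ["tax system", "interest"]),
    (["savings", "income", "interest"], ["savings", "interest"]),
]
vectors = build_descriptor_vectors(docs)
print(rank_descriptors(["interest", "payments"], vectors, k=2))
# ['interest', 'tax system']
```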

4.1.2 Conceptual Thesaurus

A conceptual thesaurus (CT) is a list of descriptors that are relatively abstract, conceptual terms. An example of a CT is EuroVoc [eurovoc], whose descriptors describe the main concepts of a wide variety of subject fields by using high-level descriptive terms. This thesaurus contains more than 6,000 descriptors. EuroVoc covers diverse fields such as politics, law, finance, social issues, science, transportation, the environment, geography, organizations, etc. The thesaurus has been translated into the majority of official EU languages. EuroVoc descriptors are defined precisely, using scope notes, so that each descriptor has exactly one translation into each language. Most official EU documents

ID  Document length   Number of documents    Max. descriptors   Avg. descriptors
    From    To        En    Lt    Fi         En   Lt   Fi       En    Lt    Fi
1   1       100       11    37    18         11   9    9        5.64  4.68  4.78
2   101     250       790   1409  1613       18   18   18       5.64  5.64  5.51
3   251     500       4970  5852  6679       20   20   20       5.21  5.14  5.13
4   501     1000      7065  6263  5758       17   15   15       5.24  5.37  5.43
5   1001    2500      3853  3522  3452       24   24   24       5.68  5.65  5.69
6   2501    ...       3684  3152  2897       20   20   20       5.94  5.98  6.00

Table 4.1: Corpus data by document length.

have been manually classified into subject domain classes using the EuroVoc thesaurus. For instance, consider the EU document with CELEX¹ number 22005X0604(01):

Information relating to the entry into force of the Agreement between the European Community and the Principality of Monaco providing for measures equivalent to those laid down in Council Directive 2003/48/EC on taxation of savings income in the form of interest payments (2005/C 137/01) The Agreement between the European Community and the Principality of Monaco providing for measures equivalent to those laid down in Council Directive 2003/48/EC on taxation of savings income in the form of interest payments will enter into force on 1 July 2005, the procedures provided for in Article 16 of the Agreement having been completed on 31 May 2005.

This document contains the following manually assigned descriptors: EC agreement, interest, Monaco, savings, tax on income, tax system.

4.1.3 EuroVoc descriptor assignment

Data Set

Experiments were carried out on the Acquis Communautaire (AC) corpus [Steinberger et al., 2006]. This large text collection contains documents selected from European Union (EU) legislation in all the official EU languages. Most of these documents have been manually classified according to the EuroVoc thesaurus [eurovoc]. In this study I take only those documents that contain assigned EuroVoc descriptors. The AC documents are XML-marked; the body and signature parts of the documents were used in the study, while the Annexes were excluded. The total number of distinct descriptors in the English corpus is 3,822, while in Lithuanian and Finnish it is 3,812 in both. These numbers correspond to the numbers of classes used during the classification experiments.

¹ The CELEX document number is the identification number of an EU legislation/law document; it provides direct access to a specific document.

The corpus was split into a training corpus and a test corpus. For each language, 95% of the documents were randomly selected for the training corpus, while 5% (about 550 documents) were used for the test corpus. There is no overlap between the training corpus and the test corpus. The documents in the training and test corpora for different languages were not the same, as the AC corpus is not fully parallel. Thus, the results for different languages can only be compared with caution. Nevertheless, the test corpus in each language is the same for experiments with different features such as unigrams, bigrams, and collocation segments. The number of descriptors varies slightly, depending on the length of the document (see Table 4.1):

– The maximum number of descriptors for a very short document is half that of longer documents.

– The maximum number of descriptors is four times the average number of descriptors.

– The average number of descriptors for documents between 251 and 1,000 words in length is lower than that of documents of other lengths.

This dependency of the average number of descriptors can be related to the different types of documents found in the AC corpus. Different types of documents are annotated with a different amount of the context necessary to understand what the document is about. The reason for the higher number of descriptors for short documents is that the descriptors may be used as additional information for query systems. It was not required that any descriptor occur in more than one document. The average number of descriptors per document in the test corpus was 5.4; the majority of documents had four, five, or six descriptors, while only a few had more than ten or fewer than three. The ranges of document lengths were set to six intervals, as shown in Table 4.1. Table 4.1 shows that most of the documents are in the range of 251–500 words in length, and that there is a slight shift in the distribution of document length by language. The median document length in English is 583 words, in Lithuanian 435 words, and in Finnish 396 words. This shift of the median shows that document length is influenced by language features such as inflection and agglutination; inflections and compounds have the strongest influence on the reduction of document length. Before training and testing of the classifier, the AC corpus underwent basic preprocessing that consisted of lowercasing and tokenization. I decided not to apply any language-dependent preprocessing, such as lemmatization, since this study is intended to work in a multilingual environment. Table 4.2 shows all collocation segments containing the substring language that were produced by collocation segmentation with a threshold level of zero. We can see that most of the segments are two words in length, and that many of them correspond to noun phrases.
Some segments are not grammatical and belong to other phrases. For instance, and norwegian languages is the end of an enumeration of languages, and is clearly not grammatical.

Segment f(x) Segment f(x) Segment f(x)

[Table 4.2 in the original lists several hundred collocation segments containing the substring "language", each with its frequency f(x), in a three-column layout that is not reliably recoverable from this text. The most frequent segments are:]

Segment                    f(x)
languages                   661
official languages          539
and norwegian languages     380
official language           343
authentic language          134
language versions           116
language learning            83
swedish languages            77
original language            76
language courses             74
english language             58
language version             52
working languages            46
language skills              43
... (frequencies continue down to 1)

Table 4.2: Collocation segments containing the substring "language".

Experiment setup

Collocation segmentation was performed six times on each language corpus using different thresholds: 0 (seg0), 10 (seg10), 20 (seg20), 30 (seg30), 40 (seg40) and 50 (seg50). I chose to use these relatively low segmentation threshold levels because higher thresholds quickly reduce collocation segments to nearly single-word segments (see Table 4.3).
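The thresholding step can be illustrated with a minimal sketch. It assumes, purely for illustration, that a segment boundary is placed wherever the Dice combinability of an adjacent word pair falls below the given percentage of the highest combinability in the sentence; the dissertation's exact boundary-setting procedure may differ in detail:

```python
from collections import Counter

def dice(c_xy: int, c_x: int, c_y: int) -> float:
    """Dice combinability of an adjacent word pair."""
    return 2.0 * c_xy / (c_x + c_y) if (c_x + c_y) else 0.0

def segment(tokens, unigrams: Counter, bigrams: Counter, threshold_pct: float):
    """Place a segment boundary wherever the combinability of an adjacent
    pair drops below threshold_pct percent of the sentence maximum
    (and wherever the pair was never observed at all)."""
    if len(tokens) < 2:
        return [tokens]
    scores = [dice(bigrams[(a, b)], unigrams[a], unigrams[b])
              for a, b in zip(tokens, tokens[1:])]
    cutoff = max(scores) * threshold_pct / 100.0
    segments, current = [], [tokens[0]]
    for word, score in zip(tokens[1:], scores):
        if score == 0.0 or score < cutoff:
            segments.append(current)
            current = []
        current.append(word)
    segments.append(current)
    return segments
```

With threshold 0 only unseen pairs break a segment, while threshold 50 already splits off weakly attached words, which matches the observation that higher thresholds push segments toward single words.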

The collocation segmentation seg0 increases the dictionary size at least three times. This increase is bigger in English and Lithuanian and smaller in Finnish. The dictionary size of the collocation segments in English is 0.7 million, in Lithuanian 1.4 million, and in Finnish 1.7 million. The number of unigram types in highly inflected languages is at least double that of almost uninflected languages. For instance, the ratio of the unigram dictionary sizes for Finnish and English is 2.58 to 1. Segmentation increased the dictionary size of each language considerably, though this increase was smaller in highly inflected languages and bigger in non-inflected languages. Thus, dictionaries of segments become more comparable in size than dictionaries of words.

The performance of collocation segmentation with various thresholds was compared to unigram and bigram features, which are widely used for similar tasks. All descriptors and all words or segments were taken into consideration in order to preserve the real environment and to observe the real possibilities of automatic EuroVoc descriptor assignment. All words were kept as is, thereby avoiding the use of linguistic resources that are difficult to prepare, ambiguous, and domain-dependent. For each experiment I also produced feature vectors, both for each test document and for each descriptor, the latter on the basis of one large meta-text (a concatenation of all texts indexed with a particular descriptor).
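The vocabulary-size comparison can be reproduced directly from the Table 4.3 figures (a small sanity-check sketch; the dictionary names are mine):

```python
# Vocabulary sizes from Table 4.3
unigram = {"en": 208_583, "lt": 345_292, "fi": 536_022}
seg0 = {"en": 771_974, "lt": 1_431_953, "fi": 1_739_846}

# Highly inflected Finnish vs. almost uninflected English, unigram level
uni_ratio = unigram["fi"] / unigram["en"]    # roughly 2.6 to 1
seg0_ratio = seg0["fi"] / seg0["en"]         # segmentation narrows the gap

# Growth factor introduced by seg0: smallest for Finnish, largest for Lithuanian
growth = {lang: seg0[lang] / unigram[lang] for lang in unigram}
```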

Stop-words and TF-IDF

The use of stop-word lists improves results [Mohamed et al., 2012]. To tackle this problem I decided not to use any manually made stop-word lists for removing common words, and instead made a modification to the widely used TF-IDF measure:

TF-IDF(x) = TF(x) · ln( N / D(x) )

                  En          Lt          Fi
Unigram           208,583     345,292     536,022
Bigram          1,501,113   2,493,255   2,631,438
Segmentation 0    771,974   1,431,953   1,739,846
Segmentation 10   416,110     756,527   1,076,087
Segmentation 20   349,798     622,684     916,751
Segmentation 30   319,073     559,966     834,986
Segmentation 40   300,823     522,260     784,052
Segmentation 50   288,481     497,153     748,907

Table 4.3: Vocabulary size

Stop-word list                                               List size   Overlapping items
The default MySQL list of full-text stopwords [2]                543          188
A comprehensive list of words ignored by search engines [3]      635          196
SMART stopword list [4]                                          571          203
A stopword list provided with a nice balance between
  coverage and size [5]                                          429          205
JRC EuroVoc Indexer - JEX [6]                                   1586          527

Table 4.4: The overlap of filtered words (718 items) used in building descriptor profiles with several freely available stop-word lists on the Internet.

This TF-IDF modification is the normalized TF-IDF with a confidence adjustment that is defined as follows:

NCTFIDF(x) = ( TF(x) · |Doc|_AVG / |Doc|_x ) · ln( (N - D(x) + 1) / (D(x) + 1) )

where TF(x) is the raw frequency of token x in the document or descriptor profile, |Doc|_AVG is the average document length in the corpus or the sum of the frequencies of items in the descriptor profile, |Doc|_x is the length of the inspected document or the sum of the frequencies of its descriptor profile items, N is the total number of documents in the corpus or the total number of descriptors, and D(x) is the number of documents or profiles in which the token x occurs. The idea of this modification is to change the item order, putting frequent and common tokens at the end of the list and giving them negative values. The range of TF-IDF is zero and up, whereas this modification assigns negative values to items that occur in more than half of the descriptor profiles or documents. For instance, items such as greater, help, progress, analysis, except, unless, regions, said, clearly, latter, pay, equal, major, throughout, and developments are used in slightly more than half of the descriptor profiles. The NCTFIDF values of these items are negative and close to zero, while their TF-IDF values would be high. Table 4.5 shows a comparison of TF-IDF and NCTFIDF values for the unigrams and collocation segments of a sample document. In my experiments I removed all items with negative values. For instance, 718 items occur in more than half of the descriptor profiles; they were removed as stop-words from the English data. This worked as stop-word filtering. Table 4.4 shows that there is a significant overlap between the list of removed items and several stop-word lists freely available on the Internet, as well as the tuned stop-word list of the JRC JEX system for English. This approach is flexible and allows a stop-word list to be built in any language. For instance, the JEX system includes only ninety-two stop-words for Finnish.

[2] http://dev.mysql.com/doc/refman/5.5/en/fulltext-stopwords.html
[3] http://www.webconfs.com/stop-words.php
[4] http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
[5] http://www.lextek.com/manuals/onix/stopwords1.html
[6] http://langtech.jrc.it/Eurovoc.html
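The NCTFIDF weighting defined above translates directly into code (a sketch; the function name is my own):

```python
import math

def nctfidf(tf: float, doc_len: float, avg_doc_len: float,
            df: int, n_docs: int) -> float:
    """NCTFIDF(x): length-normalize the raw frequency TF(x) by
    |Doc|_AVG / |Doc|_x, then apply the shifted IDF term
    ln((N - D(x) + 1) / (D(x) + 1)), which becomes negative as soon
    as x occurs in more than half of the documents or profiles."""
    return (tf * avg_doc_len / doc_len) * math.log((n_docs - df + 1) / (df + 1))
```

Tokens appearing in more than half of the documents receive negative weights and can simply be dropped, which is exactly the automatic stop-word filtering described above.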

Descriptor assignment

Once feature vectors exist for all descriptors, the descriptors can be assigned to new texts by calculating the similarity between the feature vector of a text and the feature vector of a descriptor. I use the cosine measure [Salton, 1989] for calculating the similarity between the descriptor and the document:

cos(x, y) = Σ x_i · y_i / ( √(Σ x_i²) · √(Σ y_i²) )

For each document in the test corpus I calculate the cosine similarity to each descriptor profile, then sort the descriptors by cosine similarity and take the exact number of descriptors that were manually assigned. This is similar to kNN classification with k equal to one neighbor. For instance, if fourteen descriptors were manually assigned, I take the fourteen descriptors with the highest cosine similarities and evaluate the accuracy of the automatic annotation by the intersection of both sets. Table 4.8 provides an example of automatically assigned descriptors. [Steinberger et al., 2012] assigns six descriptors to each document, on the assumption that the average number of manually assigned descriptors is 5.6. I decided to assign the same number of descriptors as had been assigned manually, on the assumption that all manually assigned descriptors are important and the system should approximate human behavior. For instance, here is the document from the 251–500 word range of the AC corpus which has the highest number of manually assigned descriptors:

Protocol to the Agreement with Switzerland on the free movement of persons *** European Parliament legislative resolution on the proposal for a Council decision on the conclusion, on behalf of the European Community and its Member States, of a Protocol to the Agreement between the European Community and its Member States, of the one part, and the Swiss Confederation, of the other, on the free movement of persons, regarding the participation, as contracting parties, of the Czech Republic, the Republic of Estonia, the Republic of Cyprus, the Republic of Latvia, the Republic of Lithuania, the Republic of Hungary, the Republic of Malta, the Republic of Poland, the Republic of Slovenia and the Slovak Republic, pursuant to their accession to the European Union (12585/2004 — COM(2004)0596 — C6-0247/2004 — 2004/0201(AVC))

The full version of this document may be found at http://eur-lex.europa.eu/Notice.do?val=423395. This document contains twenty manually assigned descriptors:

2901 ratification of an agreement
1633 free movement of persons
4324 Switzerland
12 accession to the European Union
2850 protocol to an agreement

[Table 4.5 lists, side by side for the unigram and collocation-segment representations of one sample document, each token or segment with its raw frequency TF(x), its TF-IDF value and its NCTFIDF value; the interleaved columns of the original layout are not reliably recoverable here. Content words and segments such as "fibers", "taiwan" and "futura polymers" receive large positive NCTFIDF values, while function words and punctuation such as "of", "the" and "," receive negative ones.]

Table 4.5: Frequency, TF-IDF and NCTFIDF values of a sample document. The average document length is 2,084 words, or 1,577 collocation segments. The total number of documents is 20,373.

5420 accession to an agreement
1634 free movement of workers
2543 Poland
1255 Hungary
5859 Slovakia
5898 Slovenia
5989 Cyprus
1774 Malta
5619 Estonia
5709 Lithuania
5706 Latvia
5860 Czech Republic
2814 agricultural real estate
5093 acquisition of property
3464 secondary residence

At the moment, my system is not capable of assigning enough descriptors to fully describe the document as humans do. In the future the system will be extended to be able to assign the full number of descriptors.
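The ranking-and-cutoff assignment described in this subsection can be sketched as follows (the sparse dict vectors and function names are my own illustrative choices):

```python
import math

def cosine(x: dict, y: dict) -> float:
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(v * y.get(k, 0.0) for k, v in x.items())
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def assign_descriptors(doc_vec: dict, profiles: dict, k: int):
    """Rank descriptor profiles by cosine similarity to the document and
    keep the top k, where k is the number of manually assigned
    descriptors for this document (kNN-like, k = 1 neighbor per slot)."""
    ranked = sorted(profiles, key=lambda d: cosine(doc_vec, profiles[d]),
                    reverse=True)
    return ranked[:k]
```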

4.1.4 Evaluation of the experiment results

The results were evaluated in terms of precision. I took the same number of automatically assigned descriptors as had been assigned manually, making precision and recall equal. A confidence interval was evaluated using a confidence level of 0.95. The same experiment was repeated ten times with different non-overlapping test and training corpora. This approach was applied due to limited computational resources. The results, found in Table 4.6, show that collocation segmentation with no threshold applied outperforms the other methods, such as unigrams, bigrams or collocation segmentation with a threshold applied. Thus, using a threshold to set the boundaries of collocation segments degrades classification precision in general.

[Steinberger et al., 2012] shows that, in general, there are no significant differences among the results of descriptor assignment in different languages. However, the results can be influenced by differences in average document length and the number of stop-words used for experiments. In the Finnish language only ninety-two stop-words were used, compared to 1,972 in English. [Mohamed et al., 2012] shows that stop-word lists have a high impact on classification results. Also, the average document length in Finnish is twice that of English: 756 and 309 words respectively.
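Because the number of automatically assigned descriptors equals the number of manually assigned ones, precision and recall coincide. A sketch of the evaluation follows, with the confidence interval computed by a normal approximation over the repeated runs (an assumption on my part, since the exact interval formula is not spelled out here):

```python
import math
import statistics

def precision(auto: set, manual: set) -> float:
    """With |auto| == |manual|, precision equals recall."""
    assert len(auto) == len(manual)
    return len(auto & manual) / len(manual) if manual else 0.0

def confidence_interval(run_precisions, z: float = 1.96):
    """Mean precision over repeated runs and the 95% half-width,
    using the normal approximation on the standard error of the mean."""
    mean = statistics.mean(run_precisions)
    sem = statistics.stdev(run_precisions) / math.sqrt(len(run_precisions))
    return mean, z * sem
```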
My study shows that in all the languages I have tested, precision was 3% higher for documents of the 251–500 word length than for documents of the 501–1,000 word length. Therefore, in [Steinberger et al., 2012] we could expect the precision in Finnish to be higher by at least 2% than in English. Clearly there is a significant difference among languages: the same experimental design can show different results for different languages. Also, collocation segmentation leads to similar results for Finnish and Lithuanian. In addition, the results show that I gained more improvement with collocation segments in less agglutinative languages.

[Table 4.6 reports descriptor assignment precision for English, Lithuanian and Finnish, broken down by document length (0–100, 101–250, 251–500, 501–1,000, 1,001–2,500, 2,500 and more words) and by feature set (Uni, Bi, Seg 0 to Seg 50); the full layout is not recoverable from this text. Over all document lengths, the headline figures are:]

              Uni             Seg 0
English       48.30 ±0.48%    54.67 ±0.78%
Lithuanian    51.41 ±0.56%    59.15 ±0.48%
Finnish       54.05 ±0.39%    58.87 ±0.36%

Table 4.6: Descriptor assignment precision.

[Mohamed et al., 2012] exploits English, Estonian, Czech and French language resources (e.g., a lemmatizer and POS tagger). The study uses the same English data that I used in my study. For instance, [Mohamed et al., 2012] shows improvement in the same classification task that I ran for English, from 49.63±0.22% to 50.35±0.29%.
Bigram and trigram features were not used in that study. There is one difference in my experimental design as compared to [Mohamed et al., 2012]: the number of assigned descriptors for each tested document. I assigned the same number of descriptors as had been assigned manually, while [Mohamed et al., 2012] assigns a fixed number of descriptors. The motivation for this is explained above. I improved the English results from 48.30±0.48% to 54.67±0.78% using only collocation segments; language resources such as a lemmatizer or POS tagger were not used. This result therefore shows that collocation segmentation has a higher impact on classification results than the language-dependent resources.

The time required to assign descriptors depends directly on vocabulary size. Table 4.3 shows the vocabulary size for the different features. For instance, our developed system takes about twenty seconds to calculate the distance between a document and all descriptors when unigrams are used. For bigrams it takes about sixty-five seconds, and about thirty seconds for collocation segments with a zero threshold. The results show that collocation segmentation improves the classification results more than bigrams do, while the annotation time is cut at least in half.

Document size

There is a high correlation between document length and the classification results. For documents of up to 250 words in length, bigram features give the worst classification results, though they outperform unigram features for larger documents. On average, the best results are achieved using collocation segmentation without any threshold applied. For Finnish, bigrams work slightly less well than unigrams. This degradation could be related to the size of the features each method captures. The Finnish dictionary of bigrams is very large (see Table 4.3), and it may be that such a lengthy dictionary both fails to produce good features for classification and makes the system difficult to train.

Even though seg0 outperforms the other methods, it is difficult to compare the assignment precision of the various methods. In Table 4.8 we see an example of the descriptors assigned to the document jrc22005D0095-en, and the order in which the descriptors were ranked. The length of this document is 3,661 tokens, so we may expect high precision. But we can clearly see that the True descriptors

        English   Lithuanian   Finnish
uni     3.7       3.7          3.4
bi      3.3       3.2          2.8
seg0    2.2       1.9          1.7
seg10   0.8       0.6          0.6
seg20   0.4       0.4          0.2
seg30   0.2       0.3          0.1
seg40   0.2       0.4          0.3
seg50   0.5       0.6          0.4

Table 4.7: Percentage of correct and unique descriptor assignments discovered by different methods.

Rank   unigram   bigram   seg0   seg10   seg20   seg30   seg40   seg50
1      4081      4081     4081   4081    4081    4081    4081    4081
2      3409      1509     1509   3409    3409    3409    3409    3409
3      2409      6269     6269   6269    1509    6269    6269    6269
4      2723      2084     3409   1509    6269    1509    1509    1509
5      2068      3409     4708   2723    2723    2723    2723    2723
6      6269      4708     2084   2409    2068    2068    2068    2068
7      13        2756     2723   4708    2409    2409    2409    2409
8      720       6100     191    2068    4708    4708    4708    4708
9      191       2723     2409   2084    2084    2084    191     191
10     867       5618     6100   191     191     191     720     720

Table 4.8: The order of the automatically assigned descriptors (as numbers) for the test document jrc22005D0095-en using different features, ranked by cosine similarity. Grey cells in the original mark correct assignments.

are not the same in the different methods. Unigram, bigram, and collocation segmentation each recognize different subsets of the True descriptors. All together, these methods recognize seven different True descriptors. We would expect the best method to recognize True descriptors that are not recognized by the other methods. Surprisingly, in the given example this appears to be the case for the worst method, namely the unigram feature vector. Table 4.7 shows the percentage of True assignments made by only one method (i.e., other methods did not recognize them as True descriptors), which I call unique True assignments. All methods are capable of introducing unique True assignments. Precision of the unigram method is the worst, but it introduces the most unique True assignments. The bigram and seg0 methods also introduce comparable amounts of new unique True assignments. All methods learn different features and make different language models that are difficult to compare. This study shows that collocation segmentation introduces new features that are not seen in unigram and bigram feature sets. Also, the seg0 and seg50 collocation segmentation thresholds lead to different results, and are thus key features in information retrieval and information extraction tasks.
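The unique True assignments of Table 4.7 can be computed as follows (a sketch with my own function name):

```python
def unique_true_assignments(per_method_true: dict) -> dict:
    """per_method_true maps a method name to the set of True descriptors
    it recovered; a unique True assignment is one no other method found."""
    unique = {}
    for method, found in per_method_true.items():
        others = set().union(*(s for m, s in per_method_true.items()
                               if m != method))
        unique[method] = found - others
    return unique
```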

[Pouliquen et al., 2003] evaluates the results of manual descriptor assignment. The evaluators working on the English texts judged 74% of the previously manually assigned descriptors as good. This inter-annotator agreement is comparable to the classification results of collocation segmentation for long documents, which are 64% for English and 72% for Finnish. It is still difficult to judge the precision of short documents. Table 4.6 shows the correlation between document length and precision: the longer the document, the higher the precision we achieve. This may also correlate with inter-annotator agreement: the shorter the document, the more diverse the assigned descriptors. Manual descriptor assignment appears to be more accurate for longer documents. However, this is not discussed in [Pouliquen et al., 2003], so we cannot compare the results to decide whether document size correlates with inter-annotator agreement.

Short document annotation

The results in Table 4.6 show that short document annotation poses many problems, and that high-precision results are difficult to achieve. This can easily be seen in the following short document (jrcC2006#121#34-en) from the test corpus:

Order of the Court of First Instance of 21 March 2006 — Holcim (France) v Commission (Case T-86/03) [1] (2006/C 121/34) Language of the case: French The President of the Second Chamber has ordered that the case be removed from the register. [1] OJ C 112, 10.5.2003.

This document had seven manually assigned descriptors:

5837 action for annulment of an EC decision

215 inter-company cooperation

1549 fine

3263 redemption

1476 interest

5993 cement

4038 EC Commission

None of the automatic assignment methods captured even one descriptor correctly. This example shows that short documents do not contain enough information for the machine to catch good features for classification. Wider context understanding is required.

4.1.5 Conclusions

In this study I found that short documents of up to 100 words in length are difficult to annotate, while longer ones can be annotated with acceptable precision. However, the different collocation segmentation thresholds introduce different features that can be used for different purposes. Zero threshold (meaning no threshold applied) allows the best text classification results to be achieved, while a high threshold (of more than 50%) may be important for other NLP tasks. Unigram, bigram and collocation segmentation extract different features, and this should be taken into consideration. I have presented the influence of collocation segmentation on the assignment of descriptors to texts. The assignment performance was assessed on the multilingual AC corpus. The main conclusions can be stated from the results presented:

– Collocation segmentation increases the size of the dictionary. This increase is smaller for highly inflected languages, and bigger for non-inflected languages. Thus, the dictionaries of segments for different languages are more comparable than dictionaries of words.

– The deviation of precision is lower, language by language, when segmentation is performed. Collocation segmentation does indeed produce the best results.

– Its similar performance in the different languages (including Finnish and English) shows that collocation segmentation is language-independent, and that the method can be applied to further languages.

4.2 Integration into Statistical Machine Translation

One of the main problems in the statistical machine translation approach is how to segment a bilingual corpus in order to build the most appropriate translation dictionary. Standard phrase-based SMT systems first align the parallel corpus at the word level using IBM probabilities, and then use standard constraints to extract the final translation units [Och and Ney, 2004]. Variations of this type of segmentation can be found in [Mariño et al., 2006, Casacuberta, 2001, García Varea et al., 2005]. Other approaches integrate phrase segmentation and alignment; one example is [Zhang et al., 2003], where the point-wise mutual information between the source and target words is used to identify aligned phrase pairs. In [Nevado and Casacuberta, 2004], a greedy algorithm computes recursive alignments from a bilingual parallel corpus.

In this section, I describe my participation in the IWSLT 2010 MT evaluation campaign. Together with the Universitat Politècnica de Catalunya (UPC) and the Barcelona Media Innovation Center (BMIC), I have built a standard phrase-based SMT system enriched with novel segmentations. These novel segmentations are computed using statistical measures such as Log-likelihood, T-score,

Chi-squared, Dice, Mutual Information or Gravity-Counts. Adding novel segmentations to an SMT system enriches the translation dictionary and/or smooths the existing translation probabilities. We participated in the French-to-English BTEC task, described below. Our primary and contrastive systems were two standard phrase-based SMT systems enriched with different novel segmentations. We proposed to combine the standard phrase-based segmentation [Och and Ney, 2004] with a complementary bilingual segmentation learned from a statistical collocation segmentation technique. This statistical collocation segmentation uses measures such as the Dice score to estimate segments of words. The benefits of this procedure are twofold: (1) it extracts new translation units and (2) it smooths the probabilities of existing translation units.

The translation results allow the measures to be divided into three groups. First, Log-likelihood, Chi-squared and T-score tend to combine high-frequency words into collocation segments that are very short. They improve the SMT system by adding new translation units. Second, Mutual Information and Dice tend to combine low-frequency words into collocation segments that are short. They improve the SMT system by smoothing the translation units. And third, Gravity-Counts tend to combine high- and low-frequency words into collocation segments that are long. However, in this case, the SMT system is not improved. Thus, the road-map for improving the translation system is to introduce novel phrases containing either low-frequency or high-frequency words. This is a difficult method of improving translation quality. The experimental results were reported in the French-to-English IWSLT 2010 automatic evaluation, where our system was ranked third out of nine systems. My main contributions to the study of the developed system are:

– applying collocation segmentation to the data using different combinability measures;

– manual analysis of the translated examples and translation results.
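Two of the combinability measures named above illustrate why the groups behave differently. These are the standard textbook formulations; it is an assumption on my part that the system used exactly these forms:

```python
import math

def pmi(c_xy: int, c_x: int, c_y: int, n: int) -> float:
    """Pointwise mutual information: rewards pairs that co-occur far
    more often than chance, so it favors low-frequency words."""
    return math.log2((c_xy * n) / (c_x * c_y))

def t_score(c_xy: int, c_x: int, c_y: int, n: int) -> float:
    """T-score: the co-occurrence surplus scaled by its spread, so it
    favors high-frequency pairs with large absolute counts."""
    return (c_xy - c_x * c_y / n) / math.sqrt(c_xy)
```

A rare but strongly associated pair outranks a frequent pair under PMI, while the ordering reverses under the T-score, which matches the low-frequency vs. high-frequency grouping described above.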

4.2.1 The IWSLT 2010 Evaluation Campaign BTEC task

The International Workshop on Spoken Language Translation (IWSLT) is a yearly, open evaluation campaign for spoken language translation. IWSLT's evaluations are not competition-oriented; the goal is to foster cooperative work and scientific exchange. In this respect, IWSLT proposes challenging research tasks and an open experimental infrastructure for the scientific community working on spoken and written language translation. In 2010, a standard BTEC task was provided for the translation of Arabic, Turkish and French spoken language into English. The BTEC task is carried out using the Basic Travel Expression Corpus, a multilingual corpus containing tourism-related sentences similar to those usually found in phrase books for tourists going abroad.

BTEC Task

A BTEC translation task, focusing on frequently used utterances in the domain of travel conversations, was provided for the translation of Arabic (A), French (F) and Turkish (T) spoken language texts into English (E). In total, twenty research groups took part in at least one of the three BTEC translation tasks, submitting twelve primary runs for Arabic–English, nine primary runs for French–English (see Table 4.9), and eight primary runs for Turkish–English. We took part in the French–English (BT_FE) translation task.

System    | Research Group                                                                            | MT System Description                                                                                                                                  | Type
dsic-upv  | Universidad CEU-Cardenal Herrera & Politecnica de Valencia (Spain)                        | N-gram-based Machine Translation enhanced with Neural Networks for the French-English BTEC IWSLT’10 task [Zamora-Martinez et al., 2010]                | NBSMT (7)
inesc-id  | Instituto de Engenharia de Sistemas e Computadores Investigacao e Desenvolvimento (Portugal) | The INESC-ID Machine Translation System for the IWSLT 2010 [Ling et al., 2010]                                                                      | PBSMT (8)
kit       | Karlsruhe Institute of Technology, interACT (Germany)                                     | The KIT Translation system for IWSLT 2010 [Niehues et al., 2010]                                                                                       | PBSMT
mit       | MIT Lincoln Laboratory (USA)                                                              | The MIT/LL-AFRL IWSLT-2010 MT System [Shen et al., 2010]                                                                                               | PBSMT
nict      | National Institute of Information and Communications Technology (Japan)                   | The NICT Translation System for IWSLT 2010 [Goh et al., 2010]                                                                                          | Hybrid
qmul      | Queen Mary, University of London (United Kingdom)                                         | The QMUL System Description for IWSLT 2010 [Yahyaei and Monz, 2010]                                                                                    | PBSMT
tottori   | Tottori University (Japan)                                                                | Statistical Pattern-Based Machine Translation with Statistical French-English Machine Translation [Murakami et al., 2010]                              | Hybrid
upc       | Universitat Politècnica de Catalunya (Spain) & Vytautas Magnus University (Lithuania)     | UPC-BMIC-VDU system description for the IWSLT 2010: testing several collocation segmentations in a phrase-based SMT system [Henríquez et al., 2010]    | PBSMT
uva-isca  | University of Amsterdam, Intelligent Systems Lab (Netherlands)                            | The UvA System Description for IWSLT 2010 [Martzoukos and Monz, 2010]                                                                                  | PBSMT

Table 4.9: Participants in the BT_FE task.

In addition to the MT outputs provided by the participants, the organizers used an online MT server to translate the test-set data. The online system (online) represents a state–of–the–art, general–domain MT system that differs from the participating MT systems in two aspects: (1) its language resources are not limited to the supplied corpora, and (2) its parameters are not optimized using in–domain data. Its purpose is to investigate the applicability of a baseline system with unlimited language resources to the spoken language translation tasks investigated by the IWSLT evaluation campaign. The translation input condition of the BTEC task consisted of correct recognition results. The target language for the BTEC task was English. The monolingual and bilingual language resources that were allowed for training the translation engines for the primary runs were limited to the supplied corpus. All other BTEC language resources, besides the ones for the given language pair, were treated as additional language resources.
The evaluation specifications for the BTEC task were defined as case-sensitive with punctuation marks (case+punc). Tokenization scripts were applied automatically to all run submissions prior to evaluation. In addition, automatic evaluation scores were also calculated for case-insensitive (lowercase only) MT outputs with punctuation marks removed (no case+no punc).

7 NBSMT – N-gram-based SMT
8 PBSMT – Phrase-based SMT

4.2.2 Phrase–based baseline system

The basic idea of phrase–based translation is to segment the given source sentence into units (hereafter called phrases), translate each one individually, and finally compose the target sentence from these phrase translations. Basically, a bilingual phrase is a pair of m source words and n target words. For extraction from a bilingual, word–aligned training corpus, two additional constraints are considered:

1. words are consecutive, and

2. words are consistent with the word alignment matrix.

Given the collected phrase pairs, the phrase translation probability distribution is commonly estimated by relative frequency in both directions. The translation model is combined together with the following six additional feature models: the target language model, the word and phrase bonuses, the source–to–target and target–to–source lexicon models, and the reordering model. These models are optimally weighted for decoding.
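The two extraction constraints and the relative-frequency estimation described above can be sketched as follows. This is a simplified illustration (it omits the extension over unaligned boundary words used in full phrase extraction, and estimates only one direction); all function names are hypothetical.

```python
from collections import defaultdict

def extract_phrases(src, tgt, alignment, max_len=4):
    """Extract phrase pairs consistent with a word alignment.

    A pair (src[i:j], tgt[k:l]) is kept when both spans are consecutive
    (constraint 1) and every alignment link touching src[i:j] lands inside
    tgt[k:l] and vice versa (constraint 2).
    """
    pairs = []
    for i in range(len(src)):
        for j in range(i + 1, min(i + max_len, len(src)) + 1):
            # target positions linked to the source span
            linked = [t for s, t in alignment if i <= s < j]
            if not linked:
                continue
            k, l = min(linked), max(linked) + 1
            # reject if the target span links back outside the source span
            if any(not (i <= s < j) for s, t in alignment if k <= t < l):
                continue
            if l - k <= max_len:
                pairs.append((tuple(src[i:j]), tuple(tgt[k:l])))
    return pairs

def relative_frequencies(all_pairs):
    """Estimate p(target | source) by relative frequency over collected pairs."""
    count, marginal = defaultdict(int), defaultdict(int)
    for s, t in all_pairs:
        count[(s, t)] += 1
        marginal[s] += 1
    return {(s, t): c / marginal[s] for (s, t), c in count.items()}
```

In a real system the same estimation is repeated in the target-to-source direction, and the extra feature models listed above are combined log-linearly during decoding.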

4.2.3 Introducing collocation segmentation into a phrase–based system

In order to build an augmented phrase table using collocation segmentation, as described in Section 3.2, I segmented each language of the bilingual corpus independently using AML and a threshold level of 5% (almost no threshold). Then, using the collocation segments as words, we aligned the corpus and extracted phrases from it. Once the phrases were extracted, the segments of each phrase were split back into words to find standard phrases. Finally, we used the union of these phrases and the phrases extracted from the baseline system to compute the final phrase table. A diagram of the whole procedure can be seen in Figure 4.1.

Measure     Collocation segmentation example
chi2        they were listening to his speech with open ears .
likelihood  they were listening to his speech with open ears .
dice        they_were_listening to his_speech with open_ears .
MI          they_were_listening_to his_speech_with open_ears .
t-score     they_were listening_to his_speech_with open ears .
GC          they_were listening_to his_speech_with_open ears_.

chi2        ils écoutaient attentivement son discours .
likelihood  ils écoutaient attentivement son discours .
dice        ils_écoutaient_attentivement son_discours .
MI          ils_écoutaient_attentivement son_discours .
t-score     ils écoutaient attentivement son discours .
GC          ils écoutaient_attentivement son_discours .

chi2        could_you show me what places are worth seeing near here , please ?
likelihood  could_you show me what places are worth seeing near here ,_please ?
dice        could_you show_me what places are worth_seeing near_here ,_please ?
MI          could_you show_me what_places are_worth_seeing near_here , please ?
t-score     could_you show_me what_places are worth_seeing near_here ,_please ?
GC          could_you show_me what_places are_worth_seeing near_here ,_please ?

chi2        pourriez-vous me montrer les endroits qui valent la peine d’ être vus dans le voisinage , s’_il_vous_plaît ?
likelihood  pourriez-vous me montrer les endroits qui valent la peine d’ être vus dans le voisinage , s’ il vous plaît ?
dice        pourriez-vous_me_montrer les_endroits qui valent la peine d’_être vus dans_le voisinage ,_s’_il vous_plaît ?
MI          pourriez-vous_me_montrer les_endroits qui_valent la peine_d’ être_vus dans le_voisinage , s’_il vous_plaît ?
t-score     pourriez-vous_me montrer les_endroits qui valent la peine d’_être vus dans_le_voisinage ,_s’_il vous_plaît ?
GC          pourriez-vous_me_montrer les_endroits qui valent la peine d’_être vus dans_le_voisinage ,_s’ il_vous plaît_?

Table 4.10: Segmentation examples.

The objective of this integration is to add new phrases to the translation table and enhance the relative frequency of the phrases that were extracted by both methods (hereinafter, collocation segmentation both). For the analysis, we differentiate between new phrases (marked with ∗∗ in Figure 4.1) and phrases that already existed in the baseline segmentation. Then we may integrate either the baseline segmentation and the new phrases (collocation segmentation new phrases) or the baseline segmentation and the existing phrases (collocation segmentation smooth).
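The three integration strategies can be sketched as a merge of phrase-pair counts. This is a hypothetical minimal sketch over toy count dictionaries; the actual systems re-estimate full Moses phrase tables with all feature scores.

```python
def combine_phrase_tables(baseline, colloc, mode="both"):
    """Combine baseline and collocation-derived phrase pair counts.

    baseline, colloc: dicts mapping (src_phrase, tgt_phrase) -> count.
    mode:
      "new_phrases" - keep baseline counts, add only pairs unseen in baseline
      "smooth"      - keep baseline pairs, add colloc counts for pairs
                      already present (re-estimating their frequency)
      "both"        - union of counts from both extractions
    Translation probabilities are then re-estimated from the merged counts.
    """
    merged = dict(baseline)
    for pair, n in colloc.items():
        is_new = pair not in baseline
        if mode == "both" or (mode == "new_phrases" and is_new) \
                or (mode == "smooth" and not is_new):
            merged[pair] = merged.get(pair, 0) + n
    return merged
```

The "smooth" mode only changes the relative frequencies of phrases the baseline already knows, while "new_phrases" only widens coverage, which is exactly the distinction analyzed in the experiments below.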

4.2.4 Experiments

We participated in the French–to–English BTEC task [Takezawa et al., 2002]. We built our baseline system using the online Moses SMT system with the standard configuration, available on the Internet at http://www.statmt.org/moses/.


Figure 4.1: Example of the expansion of the phrase table using collocation segmentation (in this particular case, likelihood). New phrases added by the collocation-based system are marked ∗∗. The most interesting new phrases are in bold. A compilation of both extractions is used to compute the final phrase table.

Data

We used the BTEC corpus provided in the evaluation, without any out–of–domain additional corpora. The BTEC task was carried out using the Basic Travel Expression Corpus (BTEC), a multilingual speech corpus containing tourism–related sentences similar to those that are usually found in phrase-books for tourists going abroad. All participants were supplied with a training corpus of 20,000 sentence pairs. In addition, the test–sets of the previous IWSLT evaluation campaigns were also provided to the participants, and could be used to improve the performance of the MT system on the respective translation tasks. The corpus supplied for the BTEC task was in true case and contained punctuation marks. The corpus statistics are summarized in Table 4.11. The model weights were tuned with the development corpus named 1 (sixteen reference

BTEC data       lang   sent     avg. len   word tokens   word types
train (text)    F      19,972   9.5        189,665       10,735
train (text)    E      19,972   9.1        182,627       8,344
dev (text)      F      1,512    7.5        11,409        2,244
dev (ref)       E      24,192   8.1        196,806       4,660
test09 (text)   F      469      7.8        3,642         976
test09 (ref)    E      3,283    8.4        27,507        1,739
test10 (text)   F      464      7.7        3,582         1,004
test10 (ref)    E      3,248    8.4        27,183        1,580

Table 4.11: The supplied BTEC_FE corpus.

translations), which allowed us to decide whether the system performance needed to be improved. All sixteen references from the development corpus named 2 were added to the language model during tuning. The weights obtained in the optimization were also used for the evaluation test. When translating the official test sets, we concatenated the training, development and test sets (see Table 4.12) and used the concatenation as training data. Because the three development sets included sixteen source sides and sixteen references for each sentence, we paired them one-to-one according to their IDs, thus building a larger parallel corpus prior to concatenation with the training corpus. This full devset is also mentioned in Table 4.12. Finally, we also added all references from the three development sets to the language–model corpus. To build and tune the translation systems, we lowercased and tokenized all data using the standard Moses tools. The final translation output was recased and detokenized following a standard Moses procedure as well. A detailed explanation of this preprocessing and postprocessing can be found in parts III and VII of the Moses Step–by–Step Guide 9.
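The one-to-one pairing of development sources and references by sentence ID can be sketched as follows. This is a hypothetical helper, assuming the sixteen source variants and sixteen references are stored as lists keyed by sentence ID.

```python
def pair_devset(sources, references):
    """Pair the i-th source variant with the i-th reference for each sentence ID.

    sources, references: dicts mapping sentence_id -> list of alternative
    strings (sixteen each in the BTEC devsets).  Returns a parallel corpus
    of (source, reference) pairs, one pair per variant per ID, ready to be
    concatenated with the training corpus.
    """
    corpus = []
    for sid in sorted(sources):
        for src, ref in zip(sources[sid], references[sid]):
            corpus.append((src, ref))
    return corpus
```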

Automatic translation results

The internal experimentation used different segmentations to enrich the phrase table. As described in Section 4.2.3, we proposed to study the effects of adding new translation units (new phrases), smoothing the translation units from the standard phrase-based system (smooth), and both adding and smoothing (both). Table 4.13 shows the results on the internal test set. We can divide the statistical measures into three groups:

– Dice and Mutual Information

– Chi–squared, Log–likelihood and T–Score

– Gravity–Counts

9 http://www.statmt.org/moses_steps.html

                                    Fr       En
Training          Num sentences     19,972
                  Words             189k     182k
                  Vocabulary        10.7k    8.3k
Development       Num sentences     506
(devset1)         Words             3.8k
                  Vocabulary        1077
Test              Num sentences     506
(devset3)         Words             3.8k
                  Vocabulary        1103
Evaluation        Num sentences     464
(2010 testset)    Words             3.6k
                  Vocabulary        1005
Evaluation        Num sentences     469
(2009 testset)    Words             3.6k
                  Vocabulary        977
devset2           Num sentences     8000
(included in LM   Words             55k
during tuning)    Vocabulary        3.8k
All devsets       Num sentences     24,192
(pairing 16       Words             171k     166k
sources with      Vocabulary        10.6k    7.6k
16 refs)

Table 4.12: Corpus statistics.

This division is based on the fact that the first group outperforms the baseline system when adding new phrases, the second group outperforms the baseline system when smoothing the existing phrases, and the third does not achieve the baseline results.

Initially, we did not expect any improvement with Chi–squared or Log–likelihood segmentation. For instance, Log–likelihood segmentation introduced very short lists of collocations (105 in English and 94 in French):

do you = 1299 / could you = 1237 / it ’s = 1163 / ’d like = 1013 / would you = 963 / is the = 957 / ’d like to = 901 / how much = 642 / is it = 631 / , please = 621 / want to = 605 / to the = 589 / is there = 582 / do you have = 570 / tell me = 539 / in the = 511 / give me = 432 / like to = 378 / can you = 196 / you have = 196 / on the = 163 / to get = 158 / would like to = 77 / what ’s = 76 / ’s the = 72 / this is = 71 / at the = 52 / that ’s = 47 / would like = 42 / thank you = 33 / would you have = 27 / for me = 26 / of the = 26 / want to get = 17 / it ’s the = 14 / this is the = 13 / what ’s the = 12 / are you = 10 / could you have = 7 / i ’d like = 7 / can ’t = 6 / i ’d like to = 6 / what time = 6 / that ’s the = 5 / want to go = 5 / where is = 5 / i ’d = 4 / to go = 4 / going to = 3 / i ’m = 3 / would you like = 3 / how many = 2 / like to get = 2 / no , = 2 / the number = 2 / where is the = 2 / you tell me = 2 / ’re not = 1 / ’s that ? i = 1 / ’ve got = 1 / , but = 1 / . what = 1 / Good morning = 1 / a nice = 1 / an account = 1 / an open = 1 / can you have = 1 / can you tell = 1 / could you tell = 1 / fifty minutes = 1 / five six seven = 1 / for the = 1 / get the = 1 / go to = 1 / have to = 1 / if you = 1 / in another = 1 / is this = 1 / it ’s too = 1 / let ’s = 1 / let me = 1 / like to go = 1 / like to the = 1 / long time = 1 / name is = 1 / no see = 1 / of cigarettes = 1 / that ’s a = 1 / that is = 1 / the party = 1 / there ’s no = 1 / this number = 1 / this train = 1 / those are = 1 / to go to = 1 / to leave = 1 / to take = 1 / two three four = 1 / we have = 1 / with us = 1 / would like to go = 1 / would you mind = 1 / would you tell = 1 / you can = 1 / you tell = 1

vous plaît = 3547 / s’ il = 3036 / c’ est = 2285 / j’ ai = 1586 / est-ce que = 1534 / je voudrais = 1395 / je suis = 887 / j’ aimerais = 822 / je peux = 783 / à l’ = 754 / de la = 704 / , s’ il = 603 / est-ce que vous = 593 / pourriez-vous me = 541 / quelque chose = 522 / est le = 508 / je ne = 488 / que vous = 400 / quelle heure = 347 / je veux = 336 / qu’ il = 275 / est-ce que je = 273 / je vais = 242 / que je = 240 / à la = 229 / se trouve = 170 / je n’ = 118 / je vous = 111 / de l’ = 104 / est-ce qu’ = 92 / au japon = 73 / c’ est le = 71 / vous avez = 58 / que je veux = 35 / combien de = 34 / il vous = 30 / pourriez-vous me dire = 29 / que je ne = 28 / me dire = 24 / est-ce qu’ il = 21 / n’ est = 18 / que je suis = 18 / que je vais = 16 / que vous avez = 12 / que je peux = 10 / je n’ ai = 7 / que je n’ = 6 / que je voudrais = 6 / que je vous = 6 / quel est le = 6 / une chambre = 5 / qu’ il vous = 4 / un peu = 4 / dans le = 3 / il vous plaît = 3 / je vous prie = 3 / n’ est pas = 3 / quel est = 3 / de temps = 2 / est la = 2 / l’ hôtel = 2 / où se trouve = 2 / pouvez-vous me = 2 / , merci = 1 / , s’ il vous = 1 / . merci = 1 / belle boutique = 1 / bonjour . = 1 / c’ est un = 1 / ce serait = 1 / dans cette = 1 / de le faire = 1 / depuis le temps = 1 / elle est = 1 / l’ occasion = 1 / n’ ai = 1 / non , je ne sais = 1 / nous avons = 1 / pas d’ = 1 / pas de = 1 / pour le = 1 / pouvez-vous me dire = 1 / qu’ il est = 1 / que c’ est = 1 / que ce film = 1 / que j’ = 1 / s’ est pas = 1 / s’ il vous = 1 / tout simplement = 1 / trop lourd = 1 / vous plaît . = 1 / à l’ hôtel = 1 / ça fait = 1 / ça pèse = 1

A further analysis of the segmentation phrase lists shows that Chi–squared, Log–likelihood and T–score segmentations tend to keep high-frequency words together, resulting in very short phrases. The opposite is true of Mutual Information and Dice, which keep low-frequency words together; however, the resulting phrases remain relatively short on average. Gravity–Counts keeps both low- and high-frequency words together, leading to longer phrases. Thus, Mutual Information and Dice are good at capturing collocations with low-frequency words. This is acceptable in many situations from a lexical point of view. T–score, Log–likelihood and Chi–squared are good at capturing collocations with high-frequency words. Gravity–Counts takes care of both low- and high-frequency words. In order to achieve the best translation quality, either low-frequency or high-frequency words should be selected. This outcome is acceptable at least for small corpora. The main results are:

1. It is important to keep very frequent words together before alignment. This allows good new translations to be introduced.

2. The low-frequency words allow corrections to be made to the translation probabilities, but do not introduce good new translation phrases.

3. It is very difficult to introduce good new entries while smoothing the probabilities. Either smoothing or the introduction of new translation phrases will improve the translation results, but not both at the same time.

System                                  Internal
baseline                                60.88
+dice smooth                            61.21
+dice new phrases                       60.23
+dice both                              60.28
+mi smooth                              60.93
+mi new phrases                         59.79
+mi both                                60.10
+chi2 smooth                            60.55
+chi2 new phrases                       61.09
+chi2 both                              61.11
+likelihood smooth                      60.97
+likelihood new phrases                 61.23
+likelihood both                        60.61
+t-score smooth                         60.79
+t-score new phrases                    61.19
+t-score both                           61.08
+gc smooth                              60.58
+gc new phrases                         60.47
+gc both                                60.49
+dice smooth +likelihood new phrases    61.11

Table 4.13: Translation results in terms of BLEU for the internal test set. Cases that outperform the baseline system are in bold.

The last row of Table 4.13 shows one further both system, in which smooth phrases were extracted with the Dice collocation measure and new phrases with the Log-likelihood strategy. Even though this system outperformed the baseline on the internal test, it did not reach the levels of “+dice smooth” or “+likelihood new phrases”.

Table 4.14 shows the results of the primary and contrastive systems presented in the evaluation, as compared to the baseline system. Our primary system was “+likelihood new phrases” and the contrastive system was “+dice smooth”. Table 4.14 shows that the results of these systems were equally good when using the 2010 test data. However, the contrastive system’s results were weak when using the test data from 2009. In the future we plan to investigate the cause(s) of this discrepancy.

4.2.5 Official Evaluation results

This section briefly presents the official results of the human and automatic evaluations of the translations, as reported in [Paul et al., 2010].

System                     2009    2010
baseline                   60.93   52.61
+likelihood new phrases    62.00   53.27
+dice smooth               60.13   53.24

Table 4.14: Translation results in terms of BLEU for the evaluation sets. Cases that outperform the baseline system are in bold.

Official BT_FE human assessment evaluation results

Human assessments of the overall translation quality of each MT system were carried out with respect to the Fluency and Adequacy of the translation. Fluency indicates how the evaluation segment sounds to a native speaker of the target language. For Adequacy, the evaluator was presented with the source language input as well as a “gold standard” translation, and had to judge how much of the information in the original text was expressed in the translation [White et al., 1994]. Fluency and Adequacy were judged according to the grades listed in Table 4.15. Due to their high evaluation costs, the Fluency and Adequacy assessments were limited to the top-ranked MT systems in each translation task, according to the ranking evaluation results. In addition, the translation results of each translation task were pooled, i.e., in cases of identical translations of the same source sentence by multiple engines, the pooled translation was graded only once, and the respective rank was assigned to all MT engines with the same output. The results of the manual evaluation appear in Tables 4.16 and 4.17. Systems were ranked on two scales, Ranking (best = 1.0, worst = 0.0) and NormRank (best = 5.0, worst = 1.0), where

– the Ranking scores are the average numbers of times that a system was judged better than any other system, and

– the NormRank scores are normalized on a per–judge basis using the method of [Blatz et al., 2004].

Fluency                    Adequacy
5  Flawless English        5  All Information
4  Good English            4  Most Information
3  Non-native English      3  Much Information
2  Disfluent English       2  Little Information
1  Incomprehensible        1  None

Table 4.15: Human assessment.

BT_FE
System     Fluency   Adequacy
dsic-upv   3.91      4.05
online     4.02      4.30

Table 4.16: IWSLT’10 test–set translation. Human Assessment of the best system.

MT         Ranking      MT         NormRank
online     0.4114       online     3.24
tottori    0.3482       dsic-upv   3.13
kit        0.3256       kit        3.13
dsic-upv   0.3248       tottori    3.11
mit        0.3135       mit        3.09
inesc-id   0.3069       inesc-id   3.08
upc        0.3057       upc        3.08
nict       0.3046       nict       3.03
qmul       0.2794       qmul       2.94
uva-isca   0.1437       uva-isca   2.19

Table 4.17: Ranking of systems by human assessment.

BT_FE automatic evaluation results

Automatic evaluation of run submissions was carried out using the standard automatic evaluation metrics listed in [Paul et al., 2010]. In order to decide whether the translation output (at the document level) of any given MT engine was significantly better than any other, the IWSLT’2010 organizers used the bootStrap method, which (1) performs a random sampling with replacement from the evaluation data set, (2) calculates the respective evaluation metric score of each engine for the sampled test sentences and the difference between the two MT system scores, (3) repeats the sampling/scoring step iteratively, and (4) applies the Student’s t-test at a significance level of 95% confidence to test whether the score differences are significant [Zhang et al., 2004]. In the IWSLT’2010 evaluations, 2,000 iterations were used to analyze the results of automatic evaluation.
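The bootStrap procedure described above can be sketched as follows. This is a simplified variant that reports the fraction of resamples in which system A outscores system B; the organizers additionally applied Student's t-test to the resampled score differences, which this sketch omits. Function and parameter names are hypothetical.

```python
import random

def bootstrap_diff(scores_a, scores_b, metric, iterations=2000, seed=0):
    """Paired bootstrap resampling for comparing two MT systems.

    scores_a / scores_b: per-sentence statistics for systems A and B on the
    same test set; metric: a function mapping a list of such statistics to a
    corpus-level score.  Each iteration (1) samples sentence indices with
    replacement, (2) scores both systems on the sample, and (3) records which
    system wins.  Values near 1.0 (or 0.0) suggest a significant difference.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(iterations):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        if metric([scores_a[i] for i in idx]) > metric([scores_b[i] for i in idx]):
            wins += 1
    return wins / iterations
```

For corpus-level metrics such as BLEU, the per-sentence statistics would be n-gram match counts rather than sentence scores, with `metric` aggregating them before computing the final value.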

– “case+punc” evaluation: case-sensitive, with punctuations tokenized.

– “no case+no punc” evaluation: case-insensitive, with punctuations removed.

– Only a subset of the sentence IDs used for the human assessments were used to calculate the automatic scores of each MT output.

– The mean score and 95% confidence intervals were calculated for each MT output according to the bootStrap method [Zhang et al., 2004].

IWSLT10 testset, “case+punc” evaluation
System     BLEU    METEOR  WER    PER    TER    GTM    NIST   z-avg
mit        52.69   78.43   32.36  27.90  27.72  77.58  8.148  1.174
dsic-upv   50.56   77.55   32.97  29.09  27.83  75.59  7.777  0.856
nict       50.52   77.31   33.41  28.85  28.24  75.79  7.844  0.852
upc        50.46   77.56   33.69  29.26  28.39  75.13  7.814  0.803
tottori    49.30   77.45   35.08  30.24  30.09  76.00  7.963  0.721
kit        48.59   77.33   35.33  30.25  30.36  76.01  7.896  0.666
inesc-id   49.38   77.01   34.97  29.75  29.53  75.34  7.756  0.660
qmul       50.46   76.19   35.08  29.85  29.28  74.95  7.657  0.608
online     46.93   76.01   37.13  32.45  32.14  75.16  7.777  0.332
uva-isca   30.13   67.28   50.80  45.02  42.52  64.28  6.118  -2.387

IWSLT10 testset, “no case+no punc” evaluation
System     z-avg   BLEU    METEOR  WER    PER    TER    GTM    NIST
mit        1.232   51.27   75.36   36.69  31.11  31.16  74.20  8.201
dsic-upv   0.842   48.76   74.52   37.74  32.34  31.35  72.42  7.784
nict       0.833   48.84   74.00   38.21  32.21  31.73  72.45  7.871
upc        0.784   48.28   74.47   38.29  32.42  31.70  71.78  7.831
tottori    0.728   47.09   74.23   39.81  33.33  33.82  72.72  8.053
kit        0.647   46.12   74.37   40.51  33.89  34.07  72.96  7.967
inesc-id   0.694   47.75   74.02   39.56  32.47  33.16  72.13  7.845
qmul       0.512   48.46   72.70   39.83  33.47  32.71  71.41  7.574
online     0.314   44.11   72.89   42.13  35.56  35.93  72.19  7.861
uva-isca   -2.302  31.00   62.97   54.55  46.71  48.29  63.28  6.621

IWSLT09 testset, “case+punc” evaluation
System     BLEU    METEOR  WER    PER    TER    GTM    NIST   z-avg
mit        63.53   82.95   25.90  23.54  21.91  82.09  9.003  1.162
dsic-upv   63.52   81.95   26.69  24.19  22.16  80.40  8.710  0.921
upc        61.90   81.89   27.31  23.95  22.92  79.84  8.738  0.834
nict       61.41   81.39   27.73  24.70  23.08  80.28  8.741  0.781
qmul       61.70   81.65   28.37  24.76  23.53  80.38  8.683  0.753
tottori    58.86   81.71   29.35  25.72  24.78  80.13  8.826  0.627
online     60.97   80.94   29.95  26.86  25.19  80.22  8.813  0.568
inesc-id   59.44   80.73   28.63  25.86  24.01  79.12  8.575  0.528
kit        58.79   81.77   29.65  26.43  25.07  79.34  8.691  0.522
uva-isca   38.51   72.36   44.84  40.22  36.68  69.17  6.987  -2.411

IWSLT09 testset, “no case+no punc” evaluation
System     z-avg   BLEU    METEOR  WER    PER    TER    GTM    NIST
mit        1.190   61.82   80.01   29.54  26.43  24.94  80.23  9.298
dsic-upv   0.948   62.23   78.99   30.04  26.82  24.99  78.44  8.972
qmul       0.793   60.41   78.79   31.99  27.40  26.55  78.53  8.982
nict       0.786   59.58   78.42   31.31  27.43  26.06  78.25  9.016
upc        0.775   59.48   79.10   31.66  27.47  25.97  77.64  9.013
tottori    0.599   56.74   78.45   33.54  28.64  28.20  77.97  9.169
online     0.606   58.96   77.97   33.83  29.78  28.80  78.63  9.194
inesc-id   0.502   57.52   77.50   32.65  28.78  27.38  77.21  8.896
kit        0.440   56.12   78.71   34.46  29.92  28.67  77.31  9.011
uva-isca   -2.354  39.79   68.06   49.22  42.39  42.16  68.81  7.605

Table 4.18: Automatic evaluation.

– Z-avg is the average system score over the z-transformed automatic evaluation metric scores; it achieves the highest rank correlation with the human Ranking.

– MT systems are ordered according to the z-avg.

– Except for the NIST metric, all automatic evaluation metric scores are given as percentages.

The results of the automatic evaluation are shown in Table 4.18.

Manual analysis

We chose 100 random sentences from the evaluation set and compared the performance of the baseline system against the dice-smooth and likelihood-new-phrases approaches. We observed that the new approaches are equivalent or superior to the baseline. The main improvements are due to:

1. Better selection of translation units, which implies better semantic preservation. For example: My main matter is right (baseline translation) vs. My main matter is law (dice–smooth and likelihood–new–phrases).

2. Better grammatical preservation. For example: Can I bring a drink (baseline translation) vs. May I bring you a drink? (dice–smooth and likelihood–new–phrases).

3. Better word order. For example: How was on the paquebot life? (baseline translation) vs. How was life on the paquebot? (dice–smooth and likelihood–new–phrases).

In the manual analysis we see that likelihood-new-phrases performs an indirect smoothing, because adding new phrases affects the existing ones, at least when hypotheses compete in decoding. Additionally, likelihood–new–phrases is able to reduce the number of unknown words. For example: you should méfier of these people (baseline translation) vs. You should be careful of these people (likelihood–new–phrases).

4.2.6 Conclusion

We performed an experiment to determine whether collocation segmentation benefits from smoothing the existing baseline phrases or from introducing new phrases. We can conclude that segmentation with Dice or Mutual Information allows the existing baseline phrases to be smoothed, while segmentation with Chi–squared, Log–likelihood or T–score allows new phrases to be introduced.

4.3 Application to the ACL Anthology Reference Corpus

The rise of vast on-line collections of scholarly papers has made it possible to develop a computational history of any given field of science. Methods from natural language processing and other areas of computer science can naturally be applied to study the ways a field and its ideas have developed and expanded [Au Yeung and Jatowt, 2011, Gerrish and Blei, 2010, Aris et al., 2009]. One particular direction in computational history has been the use of topic models to analyze the rise and fall of research topics to study the progress of scientific fields, both in general [Griffiths and Steyvers, 2004] and in the ACL Anthology [Hall et al., 2008]. A more people–centered view of computational history [Anderson et al., 2012] examines the trajectories of individual authors across research topics in the field of computational linguistics. My recent study concentrates on the usage of terminology across the last fifty years. The ACL–2012 Special Workshop on Rediscovering 50 Years of Discoveries [Banchs, 2012] was organized to take a brief pause to review the past and project the future, to ensure that the current legacy will endure the indifference of time, and that future generations will be able to build securely on the foundations laid by the many pioneers of this discipline, in which language is clearly showing its reluctance to being tamed by mathematics. The main objective of the workshop was to gather contributions about the history, the evolution and the future of research in computational linguistics. The organizers especially encouraged the submission of research projects that applied natural language processing techniques to the ACL Anthology Reference Corpus (ACL ARC), which is publicly available from the ACL ARC project website 10. The main goal of this workshop was to investigate the efficacy of these techniques for computational history in general.
A second goal was to use the ACL Anthology Reference Corpus [Bird et al., 2008] to answer specific questions about the history of computational linguistics. What is the path that the ACL has taken throughout its fifty-year history? What roles did various research topics play in the ACL’s development? What have been the pivotal turning points? Terminology extraction techniques are mainly based on the extraction of multiword units using n-grams and syntactic phrase patterns [Seretan, 2011]. The main challenges of this task are: (1) the length of the terms – one-word, two-word, or longer terms; (2) the use of linguistic processing tools such as parsers; (3) language coverage and vocabulary/terminology changes; (4) the uneven distribution of data across years. Therefore, I applied collocation segmentation to study the main trends of research methods in the ACL community through the use of terminology. The main advantages of this approach are, first, the ability to automatically detect the length of a term, and second, the multilingualism of the system, because no linguistic processing (POS tagging or parsing) is required. To evaluate the ability of collocation segmentation to handle different aspects of collocations,

10 http://acl-arc.comp.nus.edu.sg/

Length   Unique segments   Segment count   Word count   Corpus coverage
1        289,277           31,427,570      31,427,570   60.58%
2        222,252           8,594,745       17,189,490   33.13%
3        72,699            994,393         2,983,179    5.75%
4        12,669            66,552          266,208      0.51%
5        1,075             2,839           14,195       0.03%
6        57                141             846          0.00%
7        3                 7               49           0.00%
Total    598,032           41,086,247      51,881,537   100%

Table 4.19: The distribution of collocation segments.

I extracted the most significant collocation segments in the ACL Anthology Reference Corpus. In addition, based on a ranking similar to TF–IDF, I extracted terms that are related to different phenomena of natural language analysis and processing. The distribution of these terms in the ACL ARC helps to understand the main breakthroughs of different research areas across the years. On the other hand, I did not intend to make a thorough study of the methods used in the ACL ARC, as such a task is complex and prohibitively extensive.
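A TF–IDF-style ranking of this kind can be sketched as follows. This is a generic illustration under my own assumptions; the CTF-IDF variant actually used in the dissertation may weight segment frequencies differently, and the function name is hypothetical.

```python
import math
from collections import Counter

def tfidf_rank(documents, top_n=10):
    """Rank terms by a TF-IDF-style score over a collection of documents.

    documents: list of token lists (here, collocation segments).  The score
    of term t is tf(t) * log(N / df(t)), where tf(t) is the total frequency
    of t, N the number of documents, and df(t) the number of documents
    containing t.  High-scoring terms are frequent overall but concentrated
    in a subset of documents, i.e. topical rather than grammatical.
    """
    n_docs = len(documents)
    tf, df = Counter(), Counter()
    for doc in documents:
        tf.update(doc)
        df.update(set(doc))          # each document counts once for df
    scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A segment such as "machine translation" scores high because it is frequent yet restricted to MT papers, while a segment such as "in terms of" occurs everywhere and its IDF factor pushes it down.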

4.3.1 The ACL Anthology Reference Corpus

This study uses the ACL ARC version 20090501. The first step was to clean and preprocess the corpus. First of all, files that were unsuitable for the analysis were removed. These were texts containing characters with no clear word boundaries, i.e., wherein each character was separated from the next by whitespace. This problem is related to the extraction of text from files in the .pdf format, and is hard to solve. Each file in the ACL ARC represents a single printed page. The file name encodes the document ID and page number, e.g., the file name C04-1001_0007.txt is made up of four parts: C is the publication type, (20)04 is the year, 1001 is the document ID, and 0007 is the page number. The next step was to compile files of the same paper into a single document. Also, headers and footers that appear on each document page were removed, though they were not always easily recognized and, therefore, some of them remained. A few simple rules were then applied to remove line breaks, thus keeping each paragraph on a single line. Finally, documents that were smaller than 1 kB were also removed. The final corpus comprised 8,581 files with a total of 51,881,537 tokens.
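The page-merging step described above can be sketched in a few lines. This is an illustrative sketch, not the actual preprocessing script; only the file-naming scheme comes from the corpus, while the regular expression and function name are assumptions:

```python
import re
from collections import defaultdict

# File names follow <type><yy>-<docid>_<page>.txt, e.g. C04-1001_0007.txt.
FILENAME = re.compile(r"^([A-Z])(\d{2})-(\d+)_(\d{4})\.txt$")

def group_pages(filenames):
    """Group per-page files into documents keyed by (type, year, doc id)."""
    docs = defaultdict(list)
    for name in filenames:
        m = FILENAME.match(name)
        if not m:
            continue  # skip files that do not follow the naming scheme
        pub_type, year, doc_id, page = m.groups()
        docs[(pub_type, year, doc_id)].append((int(page), name))
    # Order the pages of each document by page number.
    return {key: [name for _, name in sorted(pages)]
            for key, pages in docs.items()}
```

The pages of each document can then be concatenated, and documents smaller than 1 kB dropped, as described above.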

4.3.2 Collocation segments from the ACL ARC

Prior to collocation segmentation, the ACL ARC was preprocessed with lowercasing and tokenization. No taggers or parsers were used, and all punctuation was kept. Collocation segmentation is performed on each line separately; a line usually corresponds to a paragraph.

2-word segment          CTF-IDF    3-word segment               CTF-IDF
machine translation     10777      in terms of                  4099
speech recognition      10524      total number of              3926
training data           10401      th international conference  3649
language model          10188      is used to                   3614
named entity            9006       one or more                  3449
error rate              8280       a set of                     3439
test set                8083       note that the                3346
maximum entropy         7570       it is not                    3320
sense disambiguation    7546       is that the                  3287
training set            7515       associated with the          3211
noun phrase             7509       large number of              3189
our system              7352       there is a                   3189
                        7346       support vector machines      3111
information retrieval   7338       are used to                  3109
the user                7198       extracted from the           3054
word segmentation       7194       with the same                3030
machine learning        7128       so that the                  3008
parse tree              6987       for a given                  2915
knowledge base          6792       it is a                      2909
information extraction  6675       fact that the                2876

4-word segment                        CTF-IDF    5-word segment                    CTF-IDF
if there is a                         1690       will not be able to               255
human language technology conference  1174       only if there is a                212
is defined as the                     1064       would not be able to              207
is used as the                        836        may not be able to                169
human language technology workshop    681        a list of all the                 94
could be used to                      654        will also be able to              43
has not yet been                      514        lexical information from a large  30
may be used to                        508        should not be able to             23
so that it can                        480        so that it can also               23
our results show that                 476        so that it would not              23
would you like to                     469        was used for this task            23
as well as an                         420        indicate that a sentence is       17
these results show that               388        a list of words or                16
might be able to                      379        because it can also be            16
it can also be                        346        before or after the predicate     16
have not yet been                     327        but it can also be                16
not be able to                        323        has not yet been performed        16
are shown in table                    320        if the system has a               16
is that it can                        311        is defined as an object           16
if there is an                        305        is given by an expression         16

Table 4.20: Top 20 segments for segment lengths of two to five words.

Previous applications (Sections 4.1 and 4.2) showed that the combinability threshold does not improve results and can be omitted during collocation segmentation; therefore, only the Average Minimum Law is used. Table 4.19 presents the distribution of segments by length, i.e., by the number of words. The length of collocation segments varies from one to seven words. The ACL ARC contains 345,455 distinct tokens. After segmentation, the segment list contained 598,032 entries, almost double the length of the single-word list. The bigram list contains 4,484,358 entries, more than ten times the size of the word list and seven times that of the collocation segment list. About 40% of the corpus consists of collocation segments of two or more words, showing the amount of fixed language present therein.
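A distribution like the one in Table 4.19 can be recomputed from a segmented corpus with a short script. The sketch below assumes one segmented paragraph per line with segments delimited by '|'; the delimiter and function name are assumptions for illustration:

```python
from collections import Counter, defaultdict

def length_distribution(segmented_lines):
    """Tally collocation segments by their length in words.

    Returns {length: (unique_segments, segment_count, word_count)},
    i.e. the first three numeric columns of Table 4.19.
    """
    counts = Counter()
    unique = defaultdict(set)
    for line in segmented_lines:
        for seg in (s.strip() for s in line.split("|") if s.strip()):
            n = len(seg.split())
            counts[n] += 1
            unique[n].add(seg)
    return {n: (len(unique[n]), counts[n], n * counts[n]) for n in counts}
```

Corpus coverage is then each length's word count divided by the total word count.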
The longest collocation segment, which contains seven tokens (punctuation is included as a separate token), is | described | in | section | 2 | . | 2 | , |. This shows that collocation segmentation with a threshold of 50% and the AML tends to converge to one-, two-, or three-word segments. Despite that, the list of collocation segments is much shorter than the list of bigrams, and shorter still than that of trigrams. After segmentation, it was of interest to find the most significant segments used in the ACL ARC. For this purpose I used the NCTFIDF described in Section 4.1.3, but without frequency normalization; the resulting measure, CTFIDF, is defined as follows:

CTFIDF(x) = TF(x) · ln( (N − D(x) + 1) / (D(x) + 1) )
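A direct transcription of the CTFIDF definition above, where TF(x) is the term frequency, D(x) the number of documents containing x, and N the total number of documents (the function and parameter names are mine):

```python
import math

def ctf_idf(tf, df, n_docs):
    """CTF-IDF: TF(x) * ln((N - D(x) + 1) / (D(x) + 1))."""
    return tf * math.log((n_docs - df + 1) / (df + 1))
```

Terms occurring in nearly all documents get a negative score, so frequent but unspecific segments rank low.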

Table 4.20 presents the top twenty collocation segments for two-, three-, four-, and five-word segments of items that contain alphabetic characters only. The term machine translation is the most significant in CTFIDF terms. This short list contains many of the main methods and datasets used in daily computational linguistics research, such as: error rate , test set , maximum entropy , training set , parse tree , unknown words , word alignment , Penn Treebank , language models , mutual information , translation model , etc. These terms show that computational linguistics has its own terminology, methods and tools to research many topics. Finally, the seventy-six terms of two or more words in length with the highest CTFIDF values were selected. The goal was to try to find how significant terms were used in the ACL ARC each year. The main part of the ACL ARC was compiled using papers published after 1995, so for each selected term, the average NCTFIDF value of each document for each year was calculated as follows:

NCTFIDF_AVG(term_y) = ( Σ_d NCTFIDF(term_{d,y}) ) / |term_{d,y}|

This approach allows the usage of terms throughout the history of the ACL to be analyzed, and reduces the influence of the unbalanced amount of published papers.
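The per-year averaging can be sketched as follows; the input shape (one (year, score) pair per document containing the term) and the names are assumptions:

```python
from collections import defaultdict

def yearly_average(doc_scores):
    """Average a term's per-document NCTFIDF scores within each year."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for year, score in doc_scores:
        totals[year] += score
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in totals}
```

Averaging within a year, rather than summing, keeps years with many published papers from dominating a term's profile.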

Only those terms whose NCTFIDF_AVG in any year was higher than 20 were kept. For instance, the term machine translation had to be removed, as it was not significant throughout all the years.

Figure 4.2: Yearly significance of several selected terms.

Each term was ranked by the year in which its NCTFIDF_AVG value peaked. The ranked terms are shown in Table 4.22. For instance, the peak of the NCTFIDF_AVG of the term statistical parsing occurred in 1990, of the term language model in 1987, and of the term BLEU score in 2006. The results (see Table 4.22) show the main research trends and time periods of the ACL community.
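The ranking step can be sketched as follows (a sketch with assumed names; ties between equal scores are broken toward the earlier year, which is one possible convention):

```python
def rank_by_peak_year(term_series):
    """Order terms by the year in which their NCTFIDF_AVG series peaks.

    term_series maps each term to a {year: average score} dictionary.
    """
    def peak_year(term):
        series = term_series[term]
        # Highest score wins; on equal scores, prefer the earlier year.
        return max(series, key=lambda year: (series[year], -year))
    return sorted(term_series, key=lambda term: (peak_year(term), term))
```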

Most of the terms with NCTFIDF_AVG peaks prior to 1986 are related to formal/rule-based methods. Beginning in 1987, terms related to statistical methods, for instance language model, similarity measure, and text classification, became more important. The year 1990 stands out as a kind of watershed. In this year, the terms Penn Treebank, Mutual Information, statistical parsing, bilingual corpus, and dependency tree became the most important terms, showing that newly released language resources were supporting many new research areas in computational linguistics. Despite the fact that Penn Treebank was only significant temporarily, the corpus is still used by researchers today. The most recent important terms are BLEU score and semantic role labeling. Figure 4.2 shows the yearly significance of the terms dependency tree, word segmentation, spoken language, semantic role labeling, and cfg. Each curve shows the trend of the term's "activity". The most noticeable trends are:

– continuous usage with ups and downs (e.g., dependency tree, cfg),

– waterfall (e.g., spoken language), and

– uphill (e.g., word segmentation, semantic role labeling).

Term trend         Terms
Past topic         annotation scheme, Brown corpus, lexical rules, maximum entropy,
                   multi-word, mutual information, parsing algorithm, Penn tree bank,
                   reference resolution, semantic representation, similarity measure,
                   source language, speech recognition, spoken language, statistical
                   parsing, target language, text classification, text generation,
                   tree adjoining grammars
Decreasing topic   bilingual corpus, language model, spontaneous speech,
                   text categorization, text summarization
Continuous topic   anaphora resolution, coreference resolution, dependency tree,
                   edit distance, feature selection, lexical entry, logical form,
                   named entity, naive bayes, POS tagging, search engine,
                   speech synthesis, spelling correction, target word,
                   translation model, trigram model, word senses
Peak topic         word alignment, word segmentation
Fresh & hot topic  BLEU score, semantic role labeling

Table 4.21: The distribution of terms by the trend of the yearly significance.

The selected terms can be classified according to these trends and their years of appearance:

– past topic – a group of terms which have not been significant for several years

– decreasing topic – a group of terms that are still used, but whose significance is dropping

– continuous topic – a group of terms with a long usage history that are still used

– peak topic – a group of terms that have appeared recently, whose usage is uphill and which were not significant in the beginning

– fresh and hot topic – a group of terms that have appeared recently, whose usage is a sharp uphill

Table 4.21 shows how the selected terms can be classified. This distribution shows the main topics that the researchers in the particular community are or have been working on.
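A rough heuristic for binning a yearly series into these classes might look as follows. This is purely illustrative: the thresholds, the five-year window, and the halving rule are made-up parameters, not the criteria used in the study:

```python
def classify_trend(series, last_year=2006, recent=5, threshold=20.0):
    """Bin a {year: NCTFIDF_AVG} series into one of the trend classes."""
    active = sorted(year for year, value in series.items() if value >= threshold)
    if not active:
        return "insignificant"
    cutoff = last_year - recent
    if active[-1] < cutoff:
        return "past topic"            # no significant year recently
    if active[0] >= cutoff:
        return "fresh & hot topic"     # significant only recently
    peak = max(series, key=series.get)
    if peak >= cutoff:
        return "peak topic"            # long history, recent peak
    if series.get(last_year, 0.0) < series[peak] / 2:
        return "decreasing topic"      # well below its old peak
    return "continuous topic"
```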

4.3.3 Conclusions

This study has shown that collocation segmentation can help extract terms from large and complex corpora, which speeds up research and simplifies the study of ACL history. The results show that the most significant terms prior to 1986 are related to formal/rule-based research methods. Beginning in 1987, terms related to statistical methods (e.g., language model, similarity measure, text classification) become more important. In 1990, a major turning point appears, when the terms Penn Treebank, Mutual Information, statistical parsing, bilingual corpus, and dependency tree become the most important, showing that research into new areas of computational linguistics is supported by the publication of new language resources. The Penn Treebank, which was only significant temporarily, is still used today. The most recent terms are BLEU score and semantic role labeling. While machine translation as a term is significant throughout the ACL ARC, it is not significant in any particular time period. This shows that some terms can be significant globally, but insignificant at a local level.

Table 4.22: Selected terms and their yearly importance in terms of NCTFIDF_AVG (a term-by-year matrix covering 1965–2006).

Conclusions

My aim in writing this dissertation was to provide a simple and efficient method for applying collocations to many NLP tasks. I have presented collocation segmentation, which allows the production of text units that may be used successfully in many terminology and lexicon extraction tasks, the building of efficient feature lists for text classification tasks, and the improvement of bilingual alignment for training Statistical Machine Translation systems. The main outcomes of my dissertation are as follows:

1. A new collocation processing method was presented, one which is fully automatic and applicable to various NLP tasks, e.g., machine translation, text classification, lexicography, etc.

2. The speed and complexity of the classification task is directly proportional to the size of the dictionary used. Collocation segmentation achieves classification results that are equal to or better than those achieved using bigrams, while the processing time is close to that of unigrams. This is possible because each step of the collocation segmentation method reduces to a lookup in the dictionary of bigrams used for measuring the combinability of two consecutive words, so the running time is linear in the length of the text. This makes collocation segmentation highly applicable in practice.

3. For the classification task, collocation segmentation outperforms other methods such as unigrams and bigrams. The precision of the categorization of EU legislation using EuroVoc was improved from 48.30±0.48% to 54.67±0.78%, from 51.41±0.56% to 59.15±0.48%, and from 54.05±0.39% to 58.87±0.36% for the English, Lithuanian and Finnish languages respectively.

4. The best results in the classification task are achieved using collocation segmentation without any threshold applied. This contradicts the traditional view of collocation extraction techniques, in which the threshold is the main criterion for filtering "good collocations". The collocation segmentation method thus opens new research areas in collocation processing.

5. Collocation segmentation, in general, is language-independent, and the method can be applied to many languages, especially less-resourced ones. Nowadays it is simple to acquire raw corpora that provide sufficient data for the performance of collocation segmentation.

6. Collocation segmentation is capable of handling meaningful units and can help in terminology studies.
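As an illustration of the complexity claim in outcome 2, a single segmentation pass can be sketched as follows. This is a simplified sketch, not the dissertation's exact procedure: Dice is used here as the combinability measure for illustration, no threshold is applied, and a simple form of the Average Minimum Law places a boundary wherever the combinability between two consecutive words falls below the average of its neighbouring values.

```python
def dice(count_xy, count_x, count_y):
    """Dice combinability of the bigram (x, y), from corpus counts."""
    return 2.0 * count_xy / (count_x + count_y)

def segment(words, uni, bi):
    """One left-to-right pass over the text: each step is a single
    lookup in the bigram dictionary, so the pass is linear in the
    number of words."""
    if len(words) < 2:
        return [words] if words else []
    comb = [dice(bi.get((a, b), 0), uni.get(a, 1), uni.get(b, 1))
            for a, b in zip(words, words[1:])]
    segments, current = [], [words[0]]
    for i in range(1, len(words)):
        # Average Minimum Law: boundary where combinability dips below
        # the average of the neighbouring values (at the ends, the value
        # at the boundary itself stands in for the missing neighbour).
        left = comb[i - 2] if i >= 2 else comb[i - 1]
        right = comb[i] if i < len(comb) else comb[i - 1]
        if comb[i - 1] < (left + right) / 2.0:
            segments.append(current)
            current = []
        current.append(words[i])
    segments.append(current)
    return segments
```

With counts in which machine and translation co-occur strongly, the pass keeps them in one segment and splits off the surrounding words.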

Bibliography

Some distributional properties of Mandarin Chinese: A study based on the Academia Sinica corpus. In Proceedings of PACFoCoL I (1993): Pacific Asia Conference on Formal and Computational Linguistics, pages 81–95. The Computational Linguistics Society of R.O.C., 1993.

Ashton Anderson, Dan Jurafsky, and Daniel A. McFarland. Towards a computational history of the ACL: 1980–2008. In Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries, pages 13–21, Jeju Island, Korea, July 2012. Association for Computational Linguistics.

Jaime Arguello and Carolyn Rose. Topic-segmentation of dialogue. In Proceedings of the Analyzing Conversations in Text and Speech , pages 42–49, New York City, New York, June 2006. Association for Computational Linguistics.

Aleks Aris, Ben Shneiderman, Vahed Qazvinian, and Dragomir Radev. Visual overviews for discovering key papers and influences across research fronts. J. Am. Soc. Inf. Sci. Technol., 60(11):2219–2228, November 2009. ISSN 1532-2882.

Ching-man Au Yeung and Adam Jatowt. Studying how the past is remembered: towards computational history through large scale text mining. In Proceedings of the 20th ACM international conference on Information and knowledge management, CIKM '11, pages 1231–1240, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0717-8.

Rafael E. Banchs, editor. Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries . Association for Computational Linguistics, Jeju Island, Korea, July 2012.

Ralph D. Beebe. The frequency distribution of english syntagms. In Proceedings of the International Conference on Computational Linguistics, COLING , 1973.

Steven Bird, Robert Dale, Bonnie J. Dorr, Bryan R. Gibson, Mark Joseph, Min-Yen Kan, Dongwon Lee, Brett Powley, Dragomir R. Radev, and Yee Fan Tan. The ACL Anthology Reference Corpus: A reference dataset for bibliographic research in computational linguistics. In LREC. European Language Resources Association, 2008.

John Blatz, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. Confidence estimation for machine translation. In Proceedings of the 20th international conference on Computational Linguistics , COLING ’04, Stroudsburg, PA, USA, 2004. Association for Computational Linguistics.

Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. Rst discourse treebank, ldc2002t07. Lin- guistic Data Consortium, 2002.


F. Casacuberta. Finite-state transducers for speech input translation. In Proc. IEEE ASRU , Madonna di Campiglio, Italy, 2001.

Keh-Jiann Chen and Ming-Hong Bai. Unknown Word Detection for Chinese by a Corpus-based Learning Method. International Journal of Computational linguistics and Chinese Language Pro- cessing, 3(1):27–44, 1998.

Lee-Feng Chien. Pat-tree-based keyword extraction for chinese information retrieval. In Proceedings of the 20th annual international ACM SIGIR conference on Research and development in informa- tion retrieval , SIGIR ’97, pages 50–58, New York, NY, USA, 1997. ACM. ISBN 0-89791-836-3.

Yaacov Choueka. Looking for needles in a haystack or locating interesting collocational expressions in large textual databases. In Proceedings of the RIAO Conference on User-Oriented Content-Based Text and Image Handling , pages 609–624, March 1988.

Kenneth W. Church and William A. Gale. Enhanced good-turing and cat.cal: Two new methods for estimating probabilities of english bigrams (abbreviated version). In Speech and Natural Language: Proceedings of a Workshop Held at Cape Cod , pages 15–18, Massachusetts, 1989.

J. Civera and A. Juan. Bilingual Machine-Aided Indexing. In Proceedings of the 5th international conference on Language Resources and Evaluation , LREC’2006, pages 1302–1305, May 2006.

Jorge Civera Saiz. Novel statistical approaches to text classification, machine translation and computer-assisted translation. PhD thesis, Universidad Politécnica de Valencia, Valencia (Spain), June 2008. Advisors: A. Juan and F. Casacuberta.

Karel Culik. Machine translation and connectedness between phrases. In International Conference on Computational Linguistics, COLING , 1965.

Susanna Cumming. The lexicon in text generation. In Strategic Computing - Natural Language Workshop: Proceedings of a Workshop Held at Marina del Rey , 1986.

Vidas Daudaravičius. The influence of collocation segmentation and top 10 items to keyword assign- ment performance. In Alexander F. Gelbukh, editor, CICLing , volume 6008 of Lecture Notes in Computer Science , pages 648–660. Springer, 2010. ISBN 978-3-642-12115-9.

Vidas Daudaravičius. Applying collocation segmentation to the ACL Anthology Reference Corpus. In Proceedings of Rediscovering 50 years of discoveries Workshop, pages 10–18, Jeju, Rep. of Korea, July 2012a. Association for Computational Linguistics.

Vidas Daudaravičius. Automatic multilingual annotation of EU legislation with EuroVoc descriptors. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC'2012), pages 2142–2147, 2012b.

Vidas Daudaravičius and Rūta Marcinkevičienė. Gravity counts for the boundaries of collocations. International Journal of Corpus Linguistics, 9(2):321–348, 2004.

Paolo D'Orta, Marco Ferretti, Alessandro Martelli, and Stefano Scarci. An automatic speech recognition system for the Italian language. In Third Conference of the European Chapter of the Association for Computational Linguistics, 1987.

Thesaurus EUROVOC. Publications Office, Luxembourg, 2009.

Katja Filippova and Michael Strube. Using linguistically motivated features for paragraph boundary identification. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 267–274, Sydney, Australia, July 2006. Association for Computational Linguistics.

Katerina T. Frantzi and Sophia Ananiadou. Extracting nested collocations. In COLING 1996: The 16th International Conference on Computational Linguistics , 1996.

Alexander Franz and Brian Milch. Searching the web by voice. In COLING 2002: The 17th Interna- tional Conference on Computational Linguistics , 2002.

Michel Galley, Kathleen R. McKeown, Eric Fosler-Lussier, and Hongyan Jing. Discourse segmentation of multi-party conversation. In Erhard Hinrichs and Dan Roth, editors, Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL–03), pages 562–569, Sapporo, Japan, July 2003.

Ismael García Varea, Daniel Ortiz, Francisco Nevado, Pedro A. Gómez, and Francisco Casacuberta. Automatic segmentation of bilingual corpora: A comparison of different techniques. In Jorge S. Marques, Nicolás Pérez de la Blanca, and Pedro Pina, editors, Iberian Conference on Pattern Recognition and Image Analysis , volume 3523 of Lecture Notes in Computer Science , pages 93–97. Springer Berlin / Heidelberg, Estoril (Portugal), June 2005.

Sean Gerrish and David M. Blei. A language–based approach to measuring scholarly impact. In Jo- hannes Fürnkranz and Thorsten Joachims, editors, ICML , pages 375–382. Omnipress, 2010. ISBN 978-1-60558-907-7.

Alexandre Gil and Gaël Dias. Using masks, suffix array-based data structures and multidimensional arrays to compute positional ngram statistics from corpora. In Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment , pages 25–32, Sapporo, Japan, July 2003. Association for Computational Linguistics.

Chooi-Ling Goh, Taro Watanabe, Michael Paul, Andrew Finch, and Eiichiro Sumita. The NICT Translation System for IWSLT 2010. In Marcello Federico, Ian Lane, Michael Paul, and François

Yvon, editors, Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT) , pages 139–146, 2010.

T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(1):5228–5235, April 2004.

David Hall, Daniel Jurafsky, and Christopher D. Manning. Studying the history of ideas using topic models. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 363–371, Honolulu, Hawaii, October 2008. Association for Computational Linguistics.

Ying He and Mehmet Kayaalp. A comparison of 13 tokenizers on medline. Technical Report LHNCBC-TR-2006–003, The Lister Hill National Center for Biomedical Communications, 2006.

Marti A. Hearst. Texttiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23:33–64, 1997.

Carlos Henríquez, Marta R. Costa-Jussá, Vidas Daudaravičius, Rafael E. Banchs, and José Mariño. UPC-BMIC-VDU system description for the IWSLT 2010: testing several collocation segmentations in a phrase-based SMT system. In Marcello Federico, Ian Lane, Michael Paul, and François Yvon, editors, Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT), pages 189–195, 2010.

Hugo Hernault, Danushka Bollegala, and Mitsuru Ishizuka. A sequential model for discourse segmentation. In Alexander F. Gelbukh, editor, CICLing, volume 6008 of Lecture Notes in Computer Science, pages 315–326. Springer, 2010. ISBN 978-3-642-12115-9.

Bo-Hyun Yun, Ho Lee, and Hae-Chang Rim. Analysis of Korean nouns using statistical information. In Proceedings of the 22nd Korea Information Science Society Spring Conference, pages 925–928, 1994.

Chu-Ren Huang, Benjamin K. Tsou, and Keh-Jiann Chen. Readings in Chinese natural language processing. University of California, Berkeley, 1996.

F. Knowles. The quantitative syntagmatic analysis of the russian and polish phonological systems. In Computational And Mathematical Linguistics: Proceedings of the International Conference on Computational Linguistics, COLING , 1973.

Gina-Anne Levow. Prosodic cues to discourse segment boundaries in human-computer dialogue. In Michael Strube and Candy Sidner, editors, Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue, pages 93–96, Cambridge, Massachusetts, USA, April 30 - May 1 2004. Association for Computational Linguistics.

Hang Li and Kenji Yamanishi. Topic analysis using a finite mixture model. In 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora , pages 35–44, Hong Kong, China, October 2000. Association for Computational Linguistics.

D. Lin. Extracting collocations from text corpora. In First Workshop on Computational Terminology , Montreal, 1998.

Wang Ling, Tiago Luís, Joao Graça, Luísa Coheur, and Isabel Trancoso. The INESC-ID Machine Translation System for the IWSLT 2010. In Marcello Federico, Ian Lane, Michael Paul, and François Yvon, editors, Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT) , pages 81–84, 2010.

William C. Mann and Sandra A. Thompson. Rhetorical structure theory: Toward a functional theory of text organization. Text , 8(3):243–281, 1988.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schtze. Introduction to Informa- tion Retrieval . Cambridge University Press, New York, NY, USA, 2008. ISBN 0521865719, 9780521865715.

J. B. Mariño, R. E. Banchs, J. M. Crego, A. de Gispert, P. Lambert, J. A.R. Fonollosa, and M. R. Costa-jussà. N-gram based machine translation. Computational linguistics, 32(4):527–549, 2006.

Spyros Martzoukos and Christof Monz. The UvA System Description for IWSLT 2010. In Mar- cello Federico, Ian Lane, Michael Paul, and François Yvon, editors, Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT) , pages 205–208, 2010.

Irina Matveeva and Gina-Anne Levow. Topic segmentation with hybrid document indexing. In Pro- ceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) , pages 351–359, Prague, Czech Re- public, June 2007. Association for Computational Linguistics.

Kathleen McKeown, Diane Litman, and Rebecca Passonneau. Extracting constraints on word usage from large text corpora. In Speech and Natural Language: Proceedings of a Workshop Held at Harriman, pages 23–26, New York, 1992.

Igor A. Melc’uk and Alain Polguere. A formal lexicon in the meaning-text theory or (how to do lexica with words). Computational Linguistics, 13(3–4):261–275, July–December 1987. ISSN 0362-613X.

Eneldo Loza Mencía and Johannes Fürnkranz. Efficient multilabel classification algorithms for large-scale problems in the legal domain. In Semantic Processing of Legal Texts, pages 192–215, 2010.

Ebrahim Mohamed, Maud Ehrmann, Marco Turchi, and Ralf Steinberger. Multi-label EuroVoc classification for Eastern and Southern EU Languages. Multilingual processing in Eastern and Southern EU languages – Low-resourced technologies and translation. Cambridge Scholars Publishing, Cambridge, UK, 2012.

Jin’ichi Murakami, Takuya Nishimura, and Masato Tokuhisa. Statistical Pattern-Based Machine Translation with Statistical French-English Machine Translation. In Marcello Federico, Ian Lane, Michael Paul, and François Yvon, editors, Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT) , pages 175–182, 2010.

F. Nevado and F. Casacuberta. Bilingual corpora segmentation using bilingual recursive alignments. In Proceedings of the 3rd Jornadas en Tecnología del Habla, Valencia, 2004.

Jan Niehues, Mohammed Mediani, Teresa Herrmann, Michael Heck, Christian Herff, and Alex Waibel. The KIT Translation system for IWSLT 2010. In Marcello Federico, Ian Lane, Michael Paul, and François Yvon, editors, Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT) , pages 93–98, 2010.

A. Névéol, J. G. Mork, A. R. Aronson, and S. J. Darmoni. Evaluation of french and english mesh indexing systems with a parallel corpus. In Proceedings of the AMIA Symposium , pages 565–569, 2005.

F.J. Och and H. Ney. The alignment template approach to statistical machine translation. Computa- tional linguistics, 30(4):417–449, December 2004.

Rebecca J. Passonneau and Diane J. Litman. Discourse segmentation by human and automated means. Computational Linguistics, 23:103–139, 1997.

Michael Paul, Marcello Federico, and Sebastian Stücker. Overview of the IWSLT 2010 Evaluation Campaign. In Marcello Federico, Ian Lane, Michael Paul, and François Yvon, editors, Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT) , pages 3–27, 2010.

P. Pecina and P. Schlesinger. Combining association measures for collocation extraction. In Proceed- ings of the COLING/ACL on Main conference poster sessions , pages 651–658, Morristown, NJ, USA, 2006. Association for Computational Linguistics.

Pavel Pecina. An extensive empirical study of collocation extraction methods. In Proceedings of the ACL Student Research Workshop , pages 13–18, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.

Pavel Pecina. Lexical association measures and collocation extraction. Language Resources and Evaluation, 44(1-2):137–158, 2010.

Bruno Pouliquen, Ralf Steinberger, and Camelia Ignat. Automatic annotation of multilingual text col- lections with a conceptual thesaurus. In Proceedings of the Workshop Ontologies and Information Extraction at (EUROLAN’2003) , pages 9–28, 2003.

Itiroo Sakai. Some mathematical aspects on syntactic discription. In International Conference on Computational Linguistics, COLING , 1965.

Gerard Salton. Automatic text processing: the transformation, analysis, and retrieval of information by computer . Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1989. ISBN 0-201-12227-8.

Erik F. Tjong Kim Sang and Hervé Déjean. Introduction to the CoNLL-2001 shared task: clause identification. In Proceedings of the ACL 2001 Workshop on Computational Natural Language Learning, Toulouse, France, July 2001. Association for Computational Linguistics.

Violeta Seretan. Syntax-Based Collocation Extraction, volume 44 of Text, Speech and Language Technology. Springer, 2011. ISBN 978-94-007-0133-5.

Wade Shen, Tim Anderson, Ray Slyh, and A. Ryan Aminzadeh. The MIT/LL-AFRL IWSLT-2010 MT System. In Marcello Federico, Ian Lane, Michael Paul, and François Yvon, editors, Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT), pages 127–134, 2010.

Frank Smadja. Retrieving collocations from text: Xtract. Computational Linguistics, 19:143–177, 1993.

Frank Smadja, Vasileios Hatzivassiloglou, and Kathleen R. McKeown. Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22:1–38, 1996.

Frank A. Smadja. From n-grams to collocations: An evaluation of Xtract. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pages 279–284, Berkeley, California, USA, June 1991. Association for Computational Linguistics.

Caroline Sporleder and Mirella Lapata. Automatic paragraph identification: A study across languages and domains. In Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 72–79, Barcelona, Spain, July 2004. Association for Computational Linguistics.

Caroline Sporleder and Mirella Lapata. Broad coverage paragraph segmentation across languages and domains. ACM Transactions on Speech and Language Processing, 3(2):1–35, 2006.

Richard Sproat, William Gale, Chilin Shih, and Nancy Chang. A stochastic finite-state word-segmentation algorithm for Chinese. Computational Linguistics, 22:377–404, 1996.

Richard W. Sproat and Chilin Shih. A statistical method for finding word boundaries in Chinese text. Computer Processing of Chinese and Oriental Languages, 4(4):336–351, 1990.

Heather A. Stark. What do paragraph markings do? Discourse Processes, 11(3):275–303, 1988.

Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, and Dániel Varga. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation, LREC'2006, pages 2142–2147, Genoa, Italy, May 2006.

Ralf Steinberger, Mohamed Ebrahim, and Marco Turchi. JRC EuroVoc Indexer JEX: a freely available multi-label categorisation tool. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC'2012), pages 2142–2147, 2012.

Maite Taboada and William C. Mann. Rhetorical structure theory: looking back and moving ahead. Discourse Studies, 8:423–459, 2006.

T. Takezawa, E. Sumita, F. Sugaya, H. Yamamoto, and S. Yamamoto. Toward a broad-coverage bilingual corpus for speech translation of travel conversations in the real world. In Proceedings of LREC-2002: Third International Conference on Language Resources and Evaluation, pages 147–152, Las Palmas, Spain, May 2002.

W. J. Teahan, Rodger McNab, Yingying Wen, and Ian H. Witten. A compression-based algorithm for Chinese word segmentation. Computational Linguistics, 26(3):375–393, September 2000. ISSN 0891-2017.

Wolfgang Teubert. My version of corpus linguistics. International Journal of Corpus Linguistics, 10(1):1–13, 2005.

Erik F. Tjong Kim Sang and Sabine Buchholz. Introduction to the CoNLL-2000 shared task: chunking. In Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning - Volume 7, CoNLL '00, pages 127–132, Stroudsburg, PA, USA, 2000. Association for Computational Linguistics.

Milan Tofiloski, Julian Brooke, and Maite Taboada. A syntactic and lexical-based discourse segmenter. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 77–80, Suntec, Singapore, August 2009. Association for Computational Linguistics.

L. W. Tosh. Data preparation for syntactic translation. In International Conference on Computational Linguistics, COLING, 1965.

Gokhan Tur, Andreas Stolcke, Dilek Hakkani-Tur, and Elizabeth Shriberg. Integrating prosodic and lexical cues for automatic topic segmentation. Computational Linguistics, 27:31–57, 2001.

Brigitte van Berkel and Koenraad De Smedt. Triphone analysis: A combined method for the correction of orthographical and typographical errors. In Proceedings of the Second Conference on Applied Natural Language Processing, pages 77–83, Austin, Texas, USA, February 1988. Association for Computational Linguistics.

Julie Weeds, David Weir, and Diana McCarthy. Characterising measures of lexical distributional similarity. In Proceedings of Coling 2004, pages 1015–1021, Geneva, Switzerland, Aug 23–Aug 27 2004. COLING.

Sholom Weiss, Nitin Indurkhya, Tong Zhang, and Fred Damerau. Text Mining: Predictive Methods for Analyzing Unstructured Information. Springer-Verlag, 2004. ISBN 0387954333.

Joachim Wermter and Udo Hahn. Collocation extraction based on modifiability statistics. In Proceedings of Coling 2004, pages 980–986, Geneva, Switzerland, Aug 23–Aug 27 2004. COLING.

J. White, T. O’Connell, and F. O’Mara. The ARPA MT evaluation methodologies: evolution, lessons, and future approaches. In Proceedings of the First Conference of the Association for Machine Translation in the Americas, pages 193–205, 1994.

Nianwen Xue, Jinying Chen, and Martha Palmer. Aligning features with sense distinction dimensions. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 921–928, Sydney, Australia, July 2006. Association for Computational Linguistics.

Sirvan Yahyaei and Christof Monz. The QMUL System Description for IWSLT 2010. In Marcello Federico, Ian Lane, Michael Paul, and François Yvon, editors, Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT), pages 157–162, 2010.

Shou-Chuan Yang. A search algorithm and data structure for an efficient information system. In International Conference on Computational Linguistics, COLING , 1969.

Kobayasi Yosiyuki, Tokunaga Takenobu, and Tanaka Hozumi. Analysis of Japanese compound nouns using collocational information. In Proceedings of the 14th Conference on Computational Linguistics (COLING94), pages 865–869, 1994.

Francisco Zamora-Martinez, Maria José Castro-Bleda, and Holger Schwenk. N-gram-based Machine Translation enhanced with Neural Networks for the French-English BTEC-IWSLT’10 task. In Marcello Federico, Ian Lane, Michael Paul, and François Yvon, editors, Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT), pages 45–52, 2010.

Jian Zhang, Jianfeng Gao, and Ming Zhou. Extraction of Chinese compound words: an experimental study on a very large corpus. In Proceedings of the Second Workshop on Chinese Language Processing: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 12, CLPW ’00, pages 132–139, Stroudsburg, PA, USA, 2000. Association for Computational Linguistics.

Y. Zhang, S. Vogel, and A. Waibel. Integrated phrase segmentation and alignment algorithm for statistical machine translation. In International Conference on Natural Language Processing and Knowledge Engineering, pages 567–573, 2003.

Ying Zhang, Stephan Vogel, and Alex Waibel. Interpreting BLEU/NIST scores: How much improvement do we need to have a better system? In LREC. European Language Resources Association, 2004.

Vidas DAUDARAVIČIUS

Doctoral Dissertation

Published and printed by Vytautas Magnus University Press (S. Daukanto g. 27, LT-44249 Kaunas). Order No. K12-178. Print run: 15 copies. 2012 12 20. Distributed free of charge.