Part-Of-Speech Tagging: a Machine Learning Approach Based on Decision Trees

Part-of-speech Tagging: A Machine Learning Approach based on Decision Trees Memòria presentada al Departament de Llenguatges i Sistemes Informàtics de la Universitat Politècnica de Catalunya per a optar al grau de Doctor en Informàtica Lluís Marquez i Villodre sota la direcció del doctor Horacio Rodríguez Hontoria Barcelona, Maig de 1999 Abstract The study and application of general Machine Learning (ML) algorithms to the classical ambiguity problems in the area of Natural Language Processing (NLP) is a currently very active area of research. This trend is sometimes called Natural Language Learning. Within this framework, the present work explores the application of a concrete machine-learning technique, namely decision-tree induction, to a very basic NLP problem, namely part-of-speech disambiguation (POS tagging). Its main contributions fall in the NLP field, while topics appearing are addressed from the artificial intelligence perspective, rather from a linguistic point of view. A relevant property of the system we propose is the clear separation between the acquisition of the language model and its application within a concrete disambiguation algorithm, with the aim of constructing two components which are as independent as possible. Such an approach has many advantages. For instance, the language models obtained can be easily adapted into previously existing tagging formalisms; the two modules can be improved and extended separately; etc. As a first step, we have experimentally proven that decision trees (DT) provide a flexible (by allowing a rich feature representation), efficient and compact way for acquiring, representing and accessing the information about POS ambiguities. In addition to that, DTs provide proper estimations of conditional probabilities for tags and words in their particular contexts. Additional machine learning techniques, based on the combination of classifiers, have been applied to address some particular weaknesses of our tree-based approach, and to further improve the accuracy in the most difficult cases. As a second step, the acquired models have been used to construct simple, accurate and effective taggers, based on diiferent paradigms. In particular, we present three different taggers that include the tree-based models: RTT, STT, and RELAX, which have shown different properties regarding speed, flexibility, accuracy, etc. The idea is that the particular user needs and environment will define which is the most appropriate tagger in each situation. Although we have observed slight differences, the accuracy results for the three taggers, tested on the WSJ test bench corpus, are uniformly very high, and, if not better, they are at least as good as those of a number of current taggers based on automatic acquisition (a qualitative comparison with the most relevant current work is also reported. Additionally, our approach has been adapted to annotate a general Spanish corpus, with the particular limitation of learning from small training sets. A new technique, based on tagger combination and bootstrapping, has been proposed to address this problem and to improve accuracy. Experimental results showed that very high accuracy is possible for Spanish tagging, with a relatively low manual effort. Additionally, the success in this real application has confirmed the validity ABSTRACT of our approach, and the validity of the previously presented portability argument in favour of automatically acquired taggers. Lluís Marquez i Villodre TALP Research Center Departament de Llenguatges i Sistemes Informàtics Universitat Politècnica de Catalunya Jordi Girona Salgado, 1-3. E08034 Barcelona. Catalonia. E-mail: Iluism01si.upc.es URL: http://www.Isi.upc.es/~lluism Agraïments Voldria fer arribar el meu sincer agraïment a un bon nombre de persones que m'han ajudat directa o indirectament a desenvolupar la meva tasca com a recercaire durant els darrers anys i, finalment, a donar forma a aquest treball. Ara ve una llista, oi? En primer lloc i molt especialment al meu director, l'Horaci Rodríguez, que apart de ser el savi més modest que conec, és, sens dubte, el director de tesi que jo recomanaria al meu millor amic. Voldria donar les gràcies també especialment a en Lluís Padró, amb qui he realitzat la major part de recerca d'aquest treball. Val a dir que ell i en German Rigau han estat uns companys d'una vàlua humana i científica incomparables i de fet han exercit de "segons directors" de tesi (si és que aquesta figura existeix). També, a tota la resta de companys del Grup de Recerca en Processament del Llenguatge Natural del departament de LSI i de la Universitat de Barcelona, amb qui hem creat (i seguim creant) un bon equip de treball: Alicia Ageno, Jordi Atserias, Núria Castell, Neus Català, Irene Castellón, Salvador Climent, Gerard Escudero, Toni Martí, Mariona Taulé, Jordi Turmo,... i a tot els que em deixo. També a en Josep Carmona i a en Josep Montolio (els Joseps) que han estat ajudant- me en els darrers experiments. En general, a tots els companys del departament i especialment a en Rafel Cases, Ton Sales i Martí Vergés, de qui he après una pila de coses no necessàriament relacionades amb el tema de la tesi. També voldria destacar a Itziar Aduriz i a tot el grup de processament del lleguatge (IXA taldea) de la Euskal Herriko Unibersitatea pel boníssim acolli- ment que em van dispensar durant la meva estada a Donostia l'any 1998 (bereziki goikoei!) També cal mencionar el laboratori de càlcul i la secretaria del nostre departament que funcionen de manera modèlica i estan formats per persones magnífiques. Finalment, voldria fer constar que durant el période final de realització de la tesi he gaudit d'una descàrrega docent atorgada pel departament de LSI, que m'ha permès, sens dubte, acabar aquest treball una miqueta abans. La revisió de l'anglès en les parts més atepeïdes de la tesi l'ha fet també un bon amic, David Owen. Tota la resta és exclusivament culpa (i mèrit) d'un servidor. Acknowledgements I would like to thank two anonymous referees for helpful and encouraging com- ments on a previous extended abstract of this document. 6 AGRAÏMENTS The research reported in this dissertation has been developed inside the framework of several research projects, funded by the following institutions: The Spanish Research Department (CICYT'sITEM project, ref: TIC96-1243-C03-02); The EU Commission (ACQUILEX II, ESPRIT-BRA 7315 and EuroWordNet, LE4003) and The Catalan Research Department (CIRIT's consolidated research group 1997SGR 00051 and CREL project). Contents Abstract 3 Agraïments 5 List of Figures 11 List of Tables 13 Chapter 1. Introduction 15 1. Setting 16 1.1. The Ambiguity Problem 16 1.2. The Part-of-speech Tagging Problem 18 2. A Particular Approach to Tagging 20 3. Contributions 21 3.1. On the Application of Decision Trees 21 3.2. On the Combination of Classifiers 22 3.3. On the Evaluation and Comparison of Taggers 22 3.4. On the Survey 22 3.5. From a Practical Perspective 22 4. Overview of the Thesis 23 Chapter 2. State of the Art 25 1. Corpus-based Linguistics 25 1.1. Corpora Compilation 27 1.2. Existing Corpora 28 2. Part-of-speech Tagging 29 2.1. Linguistic Taggers 29 2.2. Statistical Taggers 30 2.3. Machine-Learning-based Taggers 30 2.4. Current Research 31 2.5. Related Issues 33 2.6. Acknowledgments 37 3. Application of Machine-learning Techniques to NLP 37 3.1. Stochastic Machine Learning Approaches 37 3.2. Symbolic Machine Learning Approaches 40 3.3. Subsymbolic Machine Learning Approaches 42 3.4. Others 43 3.5. A Reversed Summary 43 3.6. Acknowledgments 45 4. A Machine-learning Oriented Review of Decision Trees 46 4.1. Supervised Learning for Classification 47 8 CONTENTS 4.2. Decision Trees 47 4.3. Decision-tree Induction 48 4.4. Acknowledgements 51 Chapter 3. Tagging-oriented Language Modelling Using Decision Trees 53 1. Setting 53 1.1. Ambiguity Classes 53 1.2. Context Modelling 55 2. Automatic Acquisition 57 2.1. Basic Algorithm 57 2.2. Particular Implementation 58 2.3. An Example 62 3. Model Evaluation 63 3.1. The Wall Street Journal Annotated Corpus 63 3.2. Domain of Evaluation and Methodology 65 3.3. Results 67 4. Extending the Basic Model 70 4.1. Dealing with Unknown Words 70 4.2. Enriching Features 74 4.3. Dealing with Sparseness 77 4.4. Pending Work 81 Chapter 4. Tagging with the Acquired Decision Trees 83 1. RTT: a Reductionistic Tree-based Tagger 83 1.1. Evaluation 85 2. STT: A Statistical Tree-based Tagger 89 2.1. Statistical and HMM-based Tagging 89 2.2. The STT Tagger 90 3. Comparison and Discussion 92 3.1. Identifying Problems 92 3.2. Adding n-grams to the STT 93 4. Using the Tree-model in a Flexible Tagger 94 4.1. RELAX: A Relaxation-labelling based Tagger 94 4.2. Incorporating Decision Trees into RELAX 96 4.3. Using Small Training Sets 99 Chapter 5. Spanish Part-of-speech Tagging 101 1. Tagging the LExEsp Copus 101 1.1. The MACO+ Morphological Analyzer 102 1.2. Adapting the English POS Taggers 104 2. Improving Accuracy by Combining different Taggers 105 2.1. Motivation 105 2.2. Bootstrapping algorithm 106 2.3. Applying and Evaluating the Bootstrapping Algorithm 108 2.4. Best Tagger 113 3. Conclusions and Further Work 113 Chapter 6. Ensembles of Classifiers 117 1. Introduction 117 1.1. Ensembles of Homogeneous Classifiers 118 CONTENTS 9 1.2. Constructing Ensembles of Heterogeneous Classifiers 119 1.3. Applying Ensembles of Classifiers 120 2. Improving POS Tagging by using Ensembles of Classifiers 121 2.1. Setting 121 2.2. Baseline Results 123 2.3. Ensembles of Decision Trees 125 2.4. Constructing and Evaluating Ensembles 127 2.5.

Part-Of-Speech Tagging: a Machine Learning Approach Based on Decision Trees

Classifiers: a Typology of Noun Categorization Edward J

The “Person” Category in the Zamuco Languages. a Diachronic Perspective

Double Classifiers in Navajo Verbs *

1 Noun Classes and Classifiers, Semantics of Alexandra Y

Nominal Aspect in the Romance Infinitives

Automatic Acquisition of Subcategorization Frames from Untagged Text

Basic English Grammar Module Unit 1B: the Noun Group

8. Classifiers

From a Classifier Language to a Mass-Count Language: What Can Historical Data Show Us?

Plurality in a Classifier Language Author(S): Yen-Hui Audrey Li Source: Journal of East Asian Linguistics, Vol

153 Natasha Abner (University of Michigan)

Appositive Possession in Ainu and Around the Pacific