Semi-Supervised Dependency Parsing Using Lexical Affinities

Semi-supervised Dependency Parsing using Lexical Affinities Seyed Abolghasem Mirroshandel†,? Alexis Nasr† Joseph Le Roux †Laboratoire d’Informatique Fondamentale de Marseille- CNRS - UMR 7279 Universite´ Aix-Marseille, Marseille, France LIPN, Universite´ Paris Nord & CNRS,Villetaneuse, France ?Computer Engineering Department, Sharif university of Technology, Tehran, Iran ([email protected], [email protected], [email protected]) Abstract is, as mentioned above, a better modeling of bilexical dependencies and the second is a method to adapt Treebanks are not large enough to reliably a parser to new domains. model precise lexical phenomena. This de- The paper is organized as follows. Section 2 re- ficiency provokes attachment errors in the views some work on the same topic and highlights parsers trained on such data. We propose their differences with ours. In section 3, we describe in this paper to compute lexical affinities, the parser that we use in our experiments and give on large corpora, for specific lexico-syntactic configurations that are hard to disambiguate a detailed description of the frequent attachment er- and introduce the new information in a parser. rors. Section 4 describes how lexical affinities be- Experiments on the French Treebank showed tween lemmas are calculated and their impact is then a relative decrease of the error rate of 7.1% La- evaluated with respect to the attachment errors made beled Accuracy Score yielding the best pars- by the parser. Section 5 describes three ways to in- ing results on this treebank. tegrate the lexical affinities in the parser and reports the results obtained with the three methods. 1 Introduction 2 Previous Work Probabilistic parsers are usually trained on treebanks Coping with lexical sparsity of treebanks using raw composed of few thousands sentences. While this corpora has been an active direction of research for amount of data seems reasonable for learning syn- many years. tactic phenomena and, to some extent, very frequent One simple and effective way to tackle this prob- lexical phenomena involving closed parts of speech lem is to put together words that share, in a large (POS), it proves inadequate when modeling lexical raw corpus, similar linear contexts, into word clus- dependencies between open POS, such as nouns, ters. The word occurrences of the training treebank verbs and adjectives. This fact was first recognized are then replaced by their cluster identifier and a new by (Bikel, 2004) who showed that bilexical depen- parser is trained on the transformed treebank. Us- dencies were barely used in Michael Collins’ parser. ing such techniques (Koo et al., 2008) report signi- The work reported in this paper aims at a better ficative improvement on the Penn Treebank (Marcus modeling of such phenomena by using a raw corpus et al., 1993) and so do (Candito and Seddah, 2010; that is several orders of magnitude larger than the Candito and Crabbe,´ 2009) on the French Treebank treebank used for training the parser. The raw cor- (Abeille´ et al., 2003). pus is first parsed and the computed lexical affinities Another series of papers (Volk, 2001; Nakov between lemmas, in specific lexico-syntactic config- and Hearst, 2005; Pitler et al., 2010; Zhou et al., urations, are then injected back in the parser. Two 2011) directly model word co-occurrences. Co- outcomes are expected from this procedure, the first occurrences of pairs of words are first collected in a 777 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 777–785, Jeju, Republic of Korea, 8-14 July 2012. c 2012 Association for Computational Linguistics raw corpus or internet n-grams. Based on the counts they consider the first best parse and compute all po- produced, lexical affinity scores are computed. The tential alternative attachments, that may not actually detection of pairs of words co-occurrences is gen- occur in the n-best forests. erally very simple, it is either based on the direct adjacency of the words in the string or their co- 3 The Parser occurrence in a window of a few words. (Bansal and Klein, 2011; Nakov and Hearst, 2005) rely on The parser used in this work is the second order the same sort of techniques but use more sophisti- graph based parser (McDonald et al., 2005; Kubler¨ cated patterns, based on simple paraphrase rules, for et al., 2009) implementation of (Bohnet, 2010). The identifying co-occurrences. parser was trained on the French Treebank (Abeille´ Our work departs from those approaches by the et al., 2003) which was transformed into dependency fact that we do not extract the lexical information trees by (Candito et al., 2009). The size of the tree- directly on a raw corpus, but we first parse it and bank and its decomposition into train, development then extract the co-occurrences on the parse trees, and test sets is represented in table 1. based on some predetermined lexico-syntactic patterns. The first reason for this choice is that the lin- nb of sentences nb of words guistic phenomena that we are interested in, such as FTB TRAIN 9 881 278 083 as PP attachment, coordination, verb subject and ob- FTB DEV 1 239 36 508 ject can range over long distances, beyond what is FTB TEST 1 235 36 340 generally taken into account when working on lim- Table 1: Size and decomposition of the French Treebank ited windows. The second reason for this choice was to show that the performances that the NLP commu- The part of speech tagging was performed with nity has reached on parsing, combined with the use the MELT tagger (Denis and Sagot, 2010) and lem- of confidence measures allow to use parsers to ex- matized with the MACAON tool suite (Nasr et al., tract accurate lexico-syntactic information, beyond 2011). The parser gave state of the art results for what can be found in limited annotated corpora. parsing of French, reported in table 2. Our work can also be compared with self training approaches to parsing (McClosky et al., 2006; pred. POS tags gold POS tags Suzuki et al., 2009; Steedman et al., 2003; Sagae punct no punct punct no punct and Tsujii, 2007) where a parser is first trained on LAS 88.02 90.24 88.88 91.12 a treebank and then used to parse a large raw cor- UAS 90.02 92.50 90.71 93.20 pus. The parses produced are then added to the ini- tial treebank and a new parser is trained. The main Table 2: Labeled and unlabeled accuracy score for auto- difference between these approaches and ours is that matically predicted and gold POS tags with and without we do not directly add the output of the parser to the taking into account punctuation on FTB TEST. training corpus, but extract precise lexical information that is then re-injected in the parser. In the self Figure 1 shows the distribution of the 100 most training approach, (Chen et al., 2009) is quite close common error types made by the parser. In this to our work: instead of adding new parses to the tree- figure, x axis shows the error types and y axis bank, the occurrence of simple interesting subtrees shows the error ratio of the related error type number of errors of the specific type are detected in the parses and introduced as new fea- ( total number of errors ). We define an error tures in the parser. type by the POS tag of the governor and the POS The way we introduce lexical affinity measures in tag of the dependent. The figure presents a typical the parser, in 5.1, shares some ideas with (Anguiano Zipfian distribution with a low number of frequent and Candito, 2011), who modify some attachments error types and a large number of unfrequent error in the parser output, based on lexical information. types. The shape of the curve shows that concen- The main difference is that we only take attachments trating on some specific frequent errors in order to that appear in an n-best parse list into account, while increase the parser accuracy is a good strategy. 778 0.14 dependency freq. acc. contrib. name 0.12 N→N 1.50 72.23 2.91 V → a` 0.88 69.11 2.53 VaN 0.1 V—suj → N 3.43 93.03 2.53 SBJ 0.08 N → CC 0.77 69.78 2.05 NcN 0.06 N → de 3.70 92.07 2.05 NdeN error ratio V → de 0.66 74.68 1.62 VdeN 0.04 V—obj → N 2.74 90.43 1.60 OBJ 0.02 V → en 0.66 81.20 1.24 0 V → pour 0.46 67.78 1.10 10 20 30 40 50 60 70 80 90 100 Error Type N → ADJ 6.18 96.60 0.96 ADJ N → a` 0.29 70.64 0.72 NaN Figure 1: Distribution of the types of errors N → pour 0.12 38.64 0.67 N → en 0.15 47.69 0.57 Table 3 gives a finer description of the most com- Table 3: The 13 most common error types mon types of error made by the parser. Here we define more precise patterns for errors, where some lexical values are specified (for prepositions) and, in 4 Creating the Lexical Resource some cases, the nature of the dependency is taken The lexical resource is a collection of tuples into account. Every line of the table corresponds to hC, g, d, si where C is a lexico-syntactic configu- one type of error.

Semi-Supervised Dependency Parsing Using Lexical Affinities

Lexical Resource Reconciliation in the Xerox Linguistic Environment

Wordnet As an Ontology for Generation Valerio Basile

Modeling and Encoding Traditional Wordlists for Machine Applications

TEI and the Documentation of Mixtepec-Mixtec Jack Bowers

Unification of Multiple Treebanks and Testing Them with Statistical Parser with Support of Large Corpus As a Lexical Resource

Instructions for ACL-2010 Proceedings

The Interplay Between Lexical Resources and Natural Language Processing

Exploring Complex Vowels As Phrase Break Correlates in a Corpus of English Speech with Proposel, a Prosody and POS English Lexicon

Methods for Efficient Ontology Lexicalization for Non-Indo

Maurice Gross' Grammar Lexicon and Natural Language Processing

Using Relevant Domains Resource for Word Sense Disambiguation

Using a Lexical Semantic Network for the Ontology Building Nadia Bebeshina-Clairet, Sylvie Despres, Mathieu Lafourcade