
Unsupervised Dependency Parsing, a new PCFG

Marie Arcadias, Guillaume Cleuziou, Edmond Lassalle, Christel Vrain

LIFO, Université d'Orléans / Orange Labs, Lannion

Rapport no RR-2014-03

Marie Arcadias, Guillaume Cleuziou, Edmond Lassalle, Christel Vrain
Orange Labs, 2 avenue Marzin, 22300 Lannion, FRANCE ([email protected])
LIFO, Université d'Orléans, Rue Léonard de Vinci, 45067 Orléans cedex 2, FRANCE ([email protected])

Abstract

Dependency learning aims at building a model that allows transforming textual sentences into trees representing a syntactical hierarchy between the words of the sentence. We present an intermediate model between full syntactic parsing of a sentence and bags of words. It is based on a very light probabilistic context-free grammar, allowing to express dependencies between the words of a sentence. Our model can be tuned a little depending on the language. Experimentally, we were able to surpass the scores of the DMV reference on attested benchmarks for five out of ten languages, such as English, Portuguese or Japanese. We give the first results on French corpora. Learning is very fast and parsing is almost instantaneous.

1 Introduction and state of the art

The dependency structure (DS) of a sentence shows a syntactic hierarchy between the words, allowing then to infer semantic information. Among other applications, dependency structures are used in language modeling (Chelba et al., 1997), textual entailment (Haghighi et al., 2005), question answering (Wang et al., 2007), information extraction (Culotta and Sorensen, 2004), lexical ontology induction (Snow et al., 2004) and machine translation (Quirk et al., 2005).

The DS of a sentence (cf. Figure 1) is a tree, the nodes of which are labelled by the words of the sentence. One of the words is defined as the root of the tree (most of the time, the main verb). Then subtrees, covering contiguous parts of the sentence, are attached to the root. In other words, a dependency tree is made of directed relations between a syntactically strong word (called head) and a weaker word (called dependent). The dependency model is an interesting compromise between full syntactic analysis and a representation as a "bag of words".

A large amount of manually annotated examples is necessary for supervised dependency learning. It is a very long and tedious task, and it requires deep linguistic knowledge. Furthermore, it has to be done anew for each kind of text to analyze. This explains why the amount of annotated text is poor compared to the abundance of different types of text available on the web. In this paper, we suggest an unsupervised approach demanding only a shallow knowledge of the language and of the type of the text. The framework is therefore Unsupervised Dependency Learning (UDL).

[Figure 1: Dependency tree given by the treebank for the sentence "Cathryn Rice could hardly believe her eyes." The root is MD/could_3, with NN/Rice_2 attached on its left and RB/hardly_4, VB/believe_5 and PU/._8 on its right; NN/Cathryn_1 depends on NN/Rice_2, NN/eyes_7 on VB/believe_5, and PRP/her_6 on NN/eyes_7.]

The Penn Treebank, an American tagged corpus of newspaper articles, offers a dependency version, giving the DS of each sentence. Klein and Manning (2004) were the first to obtain significant results in UDL. They got better scores, on sentences of under 10 words, than the basic attachment of each word to its next right neighbor. They called their model Dependency Model of Valence (DMV).

Sentences are coded as sequences of part-of-speech (POS) tags and are used as inputs for the learning and the parsing algorithms. DMV is a generative model based on the valence of the POS, i.e. their ability to generate children (i.e. dependents), their number and type of POS. The root of the sentence is first probabilistically chosen. Then, this root recursively generates its children among the other words of the sentence, and the subtree of each child is built, depending on their POS and relative position (left or right). The estimation of probabilities includes the type of preferred dependencies (verb over noun rather than noun over verb, for example). Starting with initial probabilities tuned manually based on linguistic knowledge, an expectation-maximization step learns the probabilities of the model.

This is a rich and interesting model, but the parameter initialization is a full and complex problem in itself. It demands both technical innovation from a machine learning expert and a strong linguistic background from an expert of the syntax of the studied language.

2 Learning a probabilistic context free grammar

The originality of our contribution is the choice of a simple context-free grammar which can express dependencies between the words of a sentence. Our approach is then decomposed into two parts: learning this probabilistic context-free grammar (PCFG) by the Inside-Outside algorithm (Lari and Young, 1990), then parsing based on the learned PCFG, using a probabilistic version of the CYK algorithm (Jurafsky and Martin, 2009). Finally, formal trees are transformed into dependency trees. For the definition of formal grammars, like PCFG, we suggest the reading of Jurafsky and Martin (2009).

Inside-Outside is a generative model that can be considered as an extension of hidden Markov models (HMM). Whereas HMM are limited to learning regular grammars, Inside-Outside can deal with context-free grammars. While HMM use calculations on subsequences before and after a position t to obtain the probabilities of the derivation rules of the grammar, the Inside-Outside algorithm calculates them from subsequences inside and outside two positions t1 and t2. The probabilistic version of CYK chooses the most probable parse among all possible analyses.

DGdg formalism. As already written, the originality of our contribution is the choice of a simple context-free grammar which can express dependencies between the words of a sentence. For example, in the sentence "Cathryn Rice could hardly believe her eyes.", "could" is a dominant to which "Rice" is attached to the left, and "hardly", "believe" and the full stop are attached to the right. The dependency tree is represented in Figure 1. Our model classifies each word (represented by its POS tag) as a dominant or a dominated item beside its neighbors. Then, to parse a sentence, the model combines, thanks to intermediate symbols, the groups of words until each word finds a position in the dependency tree, as we can see in Figure 2.

[Figure 2: Parse of the sentence by the context-free grammar DGdg.]

To do that, we consider 5 non-terminal symbols (nt): the start symbol S, two symbols G and D representing respectively left and right dominants, and two symbols g and d for left and right dominated items. The terminals represent the POS tags; they can differ depending on the language and on the tagger. Here are the universal tags used by McDonald et al. (2013) (e.g. DET for determiner): Σ = {ADJ, ADP, ADV, CONJ, DET, NOUN, NUM, PRON, PRT, PUNC, VERB, X}.

The production rules are in Chomsky normal form (this is compulsory for Inside-Outside and CYK). Our constraints are:

• The uppercase non-terminal dominates the lowercase non-terminal it is associated with by a production rule. E.g. G → G d means that a left dominant splits into another left dominant and a right dominated symbol.

• A left non-terminal g (respectively G) is associated to the left with D (resp. d): nt → G d or nt → g D.

The meaning we impose on the non-terminals forbids many rules, and thus limits the size of the grammar, while keeping its dependency state of mind. The first type of rule in Chomsky normal form (nt → nt nt) builds the internal construction of the sentences. We call them structure rules. The second type of rule in Chomsky normal form (nt → terminal) expresses whether a POS can (or cannot) dominate its right or left neighbors. For example, in English, we will forbid the rules nt → DET for all nt ≠ g, because a determiner is always dominated by the next noun on its right.

The variants. Depending on the structure of the studied language, the structure rules may not fit the deep split of the sentence. The grammar we presented before is called 4bin because it contains, in addition to the start symbol, 4 non-terminals (D, G, d and g), and the structure rules are written in a binary way, according to Chomsky normal form.

The meaning of these 4 non-terminals attests the fundamental difference between POS which would dominate from the left, those which would dominate from the right, and those which would be dominated from the right or the left. For English, we can see cases where it is useful. But sometimes, the difference can be irrelevant. For example, in the noun phrase "the last president elect", the two adjectives "last" and "elect" are both dominated by the noun "president", and in the same way. Therefore, we consider a 3bin version with only 3 non-terminals (in addition to S). We maintain in this variant the fact that g and d are left and right, but keep only one uppercase dominant symbol, called N (for neutral), which carries no side information.

Because of Chomsky normal form, the splits of the sentences are binary. However, translating ternary rules into a binary form allows us to use ternary structures suggested by sentences such as: subject (left), verb (middle), object (right). Following this idea, keeping the directed dominants D and G but also allowing a centered domination, we use again the neutral symbol N in the variants 5ter and 5ter+. These last versions differ because 5ter forbids a recursive use of N while 5ter+ accepts it, leading to more complex structures. Table 1 sums up the differences between the variants.

Variant   Main difference                         Structure rules                          Non-terminals
4bin      First grammar                           nt → G d, nt → g D                       S, D, G, d, g
3bin      The dominant has no side meaning        nt → N d, nt → g N                       S, N, d, g
4ter      Ternary rules are allowed               nt → G d, nt → g D, nt → g CAP d         S, D, G, d, g
5ter      4ter + centered dominant non-terminal N nt → G d, nt → g D, nt(≠ N) → g CAP d    S, D, N, G, d, g
5ter+     5ter + recursive rules for N            nt → G d, nt → g D, nt → g CAP d         S, D, N, G, d, g

Table 1: The variants of the context-free grammar DGdg (CAP represents a non-terminal in capital letters: G, D or N).

Tuning phase. All UDL models are tuned according to the language of the corpus. For our model, it consists in selecting only the rules of type nt → terminal which are linguistically relevant. We already gave an example illustrating the selection of such rules for the determiner. In the following experiments, we tune the models by observing, for each language, some trees given as a reference in the dependency treebanks.

3 Experiments and results

The French Treebank (Abeillé and Barrier, 2004) gives the constituent structures (noun phrases, verb phrases...) as well as the syntactic functions (subject, object...) of many sentences from the newspaper Le Monde. From 2009 on, this treebank was converted into dependency trees (Candito et al., 2010). We compare the trees learned by our models to those given as a reference in the Candito et al. (2010) treebank. We compute the Unlabeled Attachment Score (UAS), which gives the rate of correct dependencies (punctuation excluded).

The scores are quite different according to the variant used. We obtained for 3bin: 29.8%, for 4ter: 42.1%, for 5ter: 37.4% and for 5ter+: 42.2%; in order to assess the quality of these scores, we randomly generated trees and measured their UAS, obtaining only 14.2%. We notice that the two variants allowing ternary recursive rules (4ter and 5ter+), with a central group of words dominating one group on each side, give almost identical scores, much higher than those of the other variants. This would imply that the underlying structure of these journalistic sentences, quite elaborate, is better captured by more complex models. As far as we know, we are the first to apply UDL to French.

CONLL 2006 + FTB   3bin     4bin     4ter     5ter     5ter+
Bulgarian          17.7%    23.9%    22.9%    22.8%    23.6%
Danish             26.7%    20.5%    14.0%    13.9%    13.6%
Dutch              29.7%    34.6%    30.3%    27.0%    27.0%
English            15.3%    29.0%    26.4%    38.1%    39.0%
French             29.8%    32.9%    42.1%    37.4%    42.2%
German             20.9%    33.1%    20.9%    31.5%    31.7%
Japanese           31.2%    32.4%    32.2%    33.4%    64.7%
Portuguese         30.0%    54.0%    42.0%    37.8%    34.1%
Slovene            12.2%    21.5%    23.2%    21.6%    21.8%
Spanish            20.8%    39.2%    39.2%    30.0%    40.2%
Swedish            21.2%    18.9%    21.6%    21.8%    21.8%

Table 3: Scores for all DGdg methods.
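To make the DGdg formalism concrete, here is a minimal sketch of a probabilistic CYK parser over a 4bin-style grammar. It is an illustration only, not the authors' implementation: the rule probabilities and the small set of lexical rules below are invented for the example (in the paper they are learned by Inside-Outside and selected in the tuning phase).

```python
from collections import defaultdict

# Toy 4bin-style DGdg grammar: S (start), G/D (left/right dominants),
# g/d (left/right dominated). Structure rules follow the paper's
# constraints (nt -> G d or nt -> g D); probabilities are illustrative.
NONTERMS = ["S", "G", "D", "g", "d"]
structure_rules = {}  # (lhs, rhs1, rhs2) -> probability
for nt in NONTERMS:
    structure_rules[(nt, "G", "d")] = 0.5
    structure_rules[(nt, "g", "D")] = 0.5

# Lexical rules nt -> POS: e.g. a determiner may only be a left
# dominated item (g -> DET), as in the tuning example of the paper.
lexical_rules = {
    ("g", "DET"): 1.0,
    ("g", "NOUN"): 0.3, ("d", "NOUN"): 0.3, ("D", "NOUN"): 0.4,
    ("G", "VERB"): 0.6, ("D", "VERB"): 0.4,
}

def cyk(tags):
    """Probabilistic CYK: return (prob, tree) of the best S-rooted parse."""
    n = len(tags)
    best = defaultdict(dict)  # best[(i, j)][nt] = (prob, tree)
    for i, tag in enumerate(tags):
        for (nt, t), p in lexical_rules.items():
            if t == tag and p > best[(i, i + 1)].get(nt, (0, None))[0]:
                best[(i, i + 1)][nt] = (p, (nt, tag))
    for span in range(2, n + 1):              # widen spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):         # try every split point
                for (lhs, r1, r2), p in structure_rules.items():
                    if r1 in best[(i, k)] and r2 in best[(k, j)]:
                        p1, t1 = best[(i, k)][r1]
                        p2, t2 = best[(k, j)][r2]
                        q = p * p1 * p2
                        if q > best[(i, j)].get(lhs, (0, None))[0]:
                            best[(i, j)][lhs] = (q, (lhs, t1, t2))
    return best[(0, n)].get("S", (0.0, None))

prob, tree = cyk(["DET", "NOUN", "VERB", "NOUN"])
print(prob, tree)
```

In the returned tree, each binary node pairs an uppercase (dominant) child with a lowercase (dominated) child, which is exactly the information needed to read off a dependency tree afterwards.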

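The UAS figures reported in these experiments can be computed with a few lines of code. This is a generic sketch of the metric (predicted head indices compared to gold head indices, punctuation excluded); the head-list encoding below is an assumption for the example, not the paper's data format.

```python
def uas(gold_heads, pred_heads, is_punct):
    """Unlabeled Attachment Score: fraction of non-punctuation tokens
    whose predicted head index matches the gold one (0 = root)."""
    correct = total = 0
    for g, p, punct in zip(gold_heads, pred_heads, is_punct):
        if punct:            # punctuation is excluded, as in the paper
            continue
        total += 1
        correct += (g == p)
    return correct / total if total else 0.0

# "Cathryn Rice could hardly believe her eyes ." with gold heads as
# 1-based indices (0 = root): could_3 is the root, as in Figure 1.
gold = [2, 3, 0, 3, 3, 7, 5, 3]
pred = [2, 3, 0, 5, 3, 7, 5, 3]   # one wrong attachment for "hardly"
punct = [False] * 7 + [True]
print(round(uas(gold, pred, punct), 3))  # prints 0.857 (6 of 7 correct)
```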
On the other hand, this task has been widely studied for English, and for other languages since the conference CONLL 2006 (Buchholz and Marsi, 2006). We compare our model to the DMV reference. Table 2 summarizes the results, as well as the variant which achieves the best score. As we can see in Table 3, results can be very different depending on the variant, showing that the choice of a variant must be made wisely according to the language and the type of text. These results show that for some languages, as for instance Bulgarian, all our methods badly model the shape of the corpus sentences (between 17.7% and 23.9% for Bulgarian). On the other hand, for some other languages, such as Japanese, one of our variants (here 5ter+) strongly outperforms all the others.

Table 2 gives the best UAS compared to the soft-EM reference given in (Spitkovsky et al., 2011). The dependency treebanks come, for English, from (Marcus et al., 1993); for French, from (Candito et al., 2010; Abeillé and Barrier, 2004); and for the other languages from CONLL 2006, that is, Bulgarian: (Simov et al., 2002), Danish: (Kromann, 2003), Dutch: (Van der Beek et al., 2002), German: (Brants et al., 2002), Japanese: (Hinrichs et al., 2000), Portuguese: (Afonso et al., 2002), Slovene: (Džeroski et al., 2006), Spanish: (Civit and Martí, 2004) and Swedish: (Nilsson et al., 2005). The references are obviously the same for Table 3. The treebanks are provided already split into test and training sets of sentences.

CONLL 2006 + FTB   Random   DMV soft-EM   DGdg    Variant used   Time for    Nb of corpus   Nb of distinct
                            score                                 learning    words          categories
Bulgarian          16.1%    39.1%         23.9%   4bin           23 min      190 217        12
Danish             14.7%    43.5%         26.7%   3bin           35 min      94 386         10
Dutch              14.8%    21.3%         34.6%   4bin           12 min      195 069        13
English            13.4%    38.1%         39.0%   5ter+          99 min      937 545        23
French             14.2%    no ref.       42.2%   5ter+          320 min     278 083        15
German             13.1%    33.3%         33.1%   4bin           196 min     699 331        52
Japanese           20.7%    56.6%         64.7%   5ter+          19 min      151 461        21
Portuguese         15.3%    37.9%         54.0%   4bin           46 min      206 490        16
Slovene            13.7%    30.8%         23.2%   4ter           16 min      28 750         12
Spanish            13.3%    33.3%         40.2%   5ter+          35 min      89 334         15
Swedish            14.8%    41.8%         21.8%   5ter, 5ter+    80 min      191 467        15

Table 2: Best scores.

4 Discussion and conclusion

The learning times depend strongly on the volume of the data and weakly on the number of syntactic categories. The small number of structure rules of the grammar leads to a reasonable learning time, even very fast for small corpora. Once the grammar is learned, parsing is almost instantaneous (a few seconds for thousands of sentences). This shows the flexibility and the speed of our model. This is why we can say that it is portable and efficient. Some complementary tests show that we can obtain better scores with more fine-grained categories, even though the learning time is then a bit longer.

To improve our model, we think about integrating lexical information, to be able to make a difference between two sequences with the same POS tags which should nevertheless have different dependency trees.

Acknowledgments

We thank Mark Johnson for his codes of Inside-Outside and CYK, and Marie Candito for delivering her dependency treebank as well as for her advice.

References

[Abeillé and Barrier 2004] Anne Abeillé and Nicolas Barrier. 2004. Enriching a French treebank. In Proc. of LREC 2004, Lisbon.

[Afonso et al. 2002] Susana Afonso, Eckhard Bick, Renato Haber, and Diana Santos. 2002. Floresta sintá(c)tica: a treebank for Portuguese. In LREC.

[Brants et al. 2002] Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith. 2002. The TIGER treebank. In Proceedings of the Workshop on Treebanks and Linguistic Theories, pages 24-41.

[Buchholz and Marsi 2006] Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning.

[Candito et al. 2010] Marie Candito, Benoît Crabbé, and Pascal Denis. 2010. Statistical French dependency parsing: treebank conversion and first results. In LREC 2010, pages 1840-1847.

[Chelba et al. 1997] Ciprian Chelba, David Engle, Frederick Jelinek, Victor Jimenez, Sanjeev Khudanpur, Lidia Mangu, Harry Printz, Eric Ristad, Ronald Rosenfeld, and Andreas Stolcke. 1997. Structure and performance of a dependency language model. In EUROSPEECH.

[Civit and Martí 2004] Montserrat Civit and M. Antònia Martí. 2004. Building Cast3LB: a Spanish treebank. Research on Language and Computation, 2(4):549-574.

[Culotta and Sorensen 2004] Aron Culotta and Jeffrey Sorensen. 2004. Dependency tree kernels for relation extraction. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 423.

[Džeroski et al. 2006] Sašo Džeroski, Tomaž Erjavec, Nina Ledinek, Petr Pajas, Zdeněk Žabokrtský, and Andreja Žele. 2006. Towards a Slovene dependency treebank. In Proc. of the Fifth Intern. Conf. on Language Resources and Evaluation (LREC).

[Haghighi et al. 2005] Aria D. Haghighi, Andrew Y. Ng, and Christopher D. Manning. 2005. Robust textual inference via graph matching. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 387-394, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Hinrichs et al. 2000] Erhard W. Hinrichs, Julia Bartels, Yasuhiro Kawata, Valia Kordoni, and Heike Telljohann. 2000. The VERBMOBIL treebanks. In KONVENS, pages 107-112.

[Jurafsky and Martin 2009] Dan Jurafsky and James H. Martin. 2009. Speech and Language Processing. Prentice Hall.

[Klein and Manning 2004] Dan Klein and Christopher D. Manning. 2004. Corpus-based induction of syntactic structure: models of dependency and constituency. In Proceedings of the 42nd Annual Meeting on ACL, page 478.

[Kromann 2003] Matthias Trautner Kromann. 2003. The Danish dependency treebank and the DTAG treebank tool. In Proceedings of the Second Workshop on Treebanks and Linguistic Theories (TLT), page 217.

[Lari and Young 1990] K. Lari and S. J. Young. 1990. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech & Language, 4(1):35-56, January.

[Marcus et al. 1993] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313-330.

[McDonald et al. 2013] Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, and Oscar Täckström. 2013. Universal dependency annotation for multilingual parsing. In Proceedings of ACL.

[Nilsson et al. 2005] Jens Nilsson, Johan Hall, and Joakim Nivre. 2005. MAMBA meets TIGER: reconstructing a treebank from antiquity.

[Quirk et al. 2005] Chris Quirk, Arul Menezes, and Colin Cherry. 2005. Dependency treelet translation: syntactically informed phrasal SMT. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 271-279.

[Simov et al. 2002] Kiril Simov, Petya Osenova, Milena Slavcheva, Sia Kolkovska, Elisaveta Balabanova, Dimitar Doikoff, Krassimira Ivanova, Alexander Simov, and Milen Kouylekov. 2002. Building a linguistically interpreted corpus of Bulgarian: the BulTreeBank. In LREC.

[Snow et al. 2004] Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2004. Learning syntactic patterns for automatic hypernym discovery. In NIPS.

[Spitkovsky et al. 2011] Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. 2011. Lateen EM: unsupervised training with multiple objectives, applied to dependency grammar induction. In EMNLP, pages 1269-1280.

[Van der Beek et al. 2002] Leonoor Van der Beek, Gosse Bouma, Rob Malouf, and Gertjan Van Noord. 2002. The Alpino dependency treebank. Language and Computers, 45(1):8-22.

[Wang et al. 2007] Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. 2007. What is the Jeopardy model? A quasi-synchronous grammar for QA. In EMNLP-CoNLL, volume 7, pages 22-32.