
Unsupervised Dependency Parsing, a new PCFG

Marie Arcadias, Guillaume Cleuziou, Edmond Lassalle, Christel Vrain

LIFO, Université d'Orléans / Orange Labs, Lannion

Rapport no RR-2014-03

Marie Arcadias, Guillaume Cleuziou, Edmond Lassalle, Christel Vrain
Orange Labs, 2 avenue Marzin, 22300 Lannion, FRANCE ([email protected])
LIFO, Université d'Orléans, Rue Léonard de Vinci, 45067 Orléans cedex 2, FRANCE ([email protected])

Abstract

Dependency learning aims at building a model that allows transforming textual sentences into trees representing a syntactical hierarchy between the words of the sentence. We present an intermediate model between full syntactic parsing of a sentence and bags of words. It is based on a very light probabilistic context-free grammar, allowing to express dependencies between the words of a sentence. Our model can be tuned a little depending on the language. Experimentally, we were able to surpass the scores of the DMV reference on attested benchmarks for five out of ten languages, such as English, Portuguese or Japanese. We give the first results on French corpora. Learning is very fast and parsing is almost instantaneous.

1 Introduction and state of the art

The dependency structure (DS) of a sentence shows a syntactic hierarchy between the words, allowing then to infer semantic information. Among other applications, dependency structures are used in language modeling (Chelba et al., 1997), textual entailment (Haghighi et al., 2005), question answering (Wang et al., 2007), information extraction (Culotta and Sorensen, 2004), lexical ontology induction (Snow et al., 2004) and machine translation (Quirk et al., 2005).

The DS of a sentence (cf. Figure 1) is a tree, the nodes of which are labelled by the words of the sentence. One of the words is defined as the root of the tree (most of the time, the main verb). Then subtrees, covering contiguous parts of the sentence, are attached to the root. In other words, a dependency tree is made of directed relations between a syntactically strong word (called head) and a weaker word (called dependent). The dependency model is an interesting compromise between full syntactic analysis and a representation as a "bag of words".

A large amount of manually annotated examples is necessary for supervised dependency learning. It is a very long and tedious task, and it requires deep linguistic knowledge. Furthermore, it has to be done anew for each kind of text to analyze. This explains why the amount of annotated text is poor compared to the abundance of different types of text available on the web. In this paper, we suggest an unsupervised approach demanding only a shallow knowledge of the language and of the type of the text. The framework is therefore Unsupervised Dependency Learning (UDL).

[Figure 1: Dependency tree given by the treebank for the sentence "Cathryn Rice could hardly believe her eyes." The root is MD/could_3, with NN/Rice_2 attached on its left and RB/hardly_4, VB/believe_5 and PU/._8 on its right; NN/Cathryn_1 depends on NN/Rice_2, NN/eyes_7 on VB/believe_5, and PRP/her_6 on NN/eyes_7.]

The Penn Treebank, an American tagged corpus of newspaper articles, offers a dependency version, giving the DS of each sentence. Klein and Manning (2004) were the first to obtain significant results in UDL. They got better scores, on sentences of under 10 words, than the basic attachment of each word to its next right neighbor. They called their model Dependency Model of Valence (DMV).

Sentences are coded as sequences of part-of-speech (POS) tags and are used as inputs for the learning and the parsing algorithms. DMV is a generative model based on the valence of the POS, i.e. their ability to generate children (i.e. dependents), their number and type of POS. The root of the sentence is first probabilistically chosen. Then, this root recursively generates its children among the other words of the sentence, and the subtree of each child is built, depending on their POS and relative position (left or right). The estimation of probabilities includes the type of preferred dependencies (verb over noun rather than noun over verb, for example). Starting with initial probabilities tuned manually based on linguistic knowledge, an expectation-maximization step learns the probabilities of the model.

This is a rich and interesting model, but the parameter initialization is a full and complex problem in itself. It demands both technical innovation from a machine learning expert and a strong linguistic background from an expert of the syntax of the studied language.

2 Learning a probabilistic context free grammar

The originality of our contribution is the choice of a simple context-free grammar which can express dependencies between the words of a sentence. Our approach is then decomposed into two parts: learning this probabilistic context-free grammar (PCFG) by the Inside-Outside algorithm (Lari and Young, 1990), then parsing based on the learned PCFG, using a probabilistic version of the CYK algorithm (Jurafsky and Martin, 2009). Finally, formal trees are transformed into dependency trees. For the definition of formal grammars, like PCFG, we suggest the reading of Jurafsky and Martin (2009).

Inside-Outside is a generative model that can be considered as an extension of hidden Markov models (HMM). Whereas HMM are limited to learning regular grammars, Inside-Outside can deal with context-free grammars. While HMM use calculations on subsequences before and after a position t to obtain the probabilities of the derivation rules of the grammar, the Inside-Outside algorithm calculates them from subsequences inside and outside two positions t1 and t2. The probabilistic version of CYK chooses the most probable parse among all possible analyses.

DGdg formalism. As already written, the originality of our contribution is the choice of a simple context-free grammar which can express dependencies between the words of a sentence. For example, in the sentence "Cathryn Rice could hardly believe her eyes.", "could" is a dominant to which "Rice" is attached to the left, and "hardly", "believe" and the full stop are attached to the right. The dependency tree is represented in Figure 1. Our model classifies each word (represented by its POS tag) as a dominant or a dominated item beside its neighbors. Then, to parse a sentence, the model combines, thanks to intermediate symbols, the groups of words until each word finds a position in the dependency tree, as we can see in Figure 2.

[Figure 2: Parse of the sentence by the context-free grammar DGdg.]

To do that, we consider 5 non-terminal symbols (nt): the start symbol S, two symbols G and D representing respectively left and right dominants, and two symbols g and d for left and right dominated items. The terminals represent the POS tags; they can differ depending on the language and on the tagger. Here are the universal tags used by McDonald et al. (2013) (e.g. DET for determiner): Σ = {ADJ, ADP, ADV, CONJ, DET, NOUN, NUM, PRON, PRT, PUNC, VERB, X}.

The production rules are in Chomsky normal form (this is compulsory for Inside-Outside and CYK). Our constraints are:

• The uppercase non-terminal dominates the lowercase non-terminal it is associated with by a production rule. E.g. G → G d means that a left dominant splits into another left dominant and a right dominated symbol.

• A left non-terminal g (respectively G) is associated to the left with D (resp. d): nt → G d or nt → g D.

The meaning we impose on the non-terminals forbids many rules, and thus limits the size of the grammar, while keeping its dependency state of mind. The first type of rule in Chomsky normal form (nt → nt nt) builds the internal construction of the sentences. We call them structure rules. The second type of rule in Chomsky normal form (nt → terminal) expresses whether a POS can (or cannot) dominate its right or left neighbors. For example, in English, we will forbid the rules nt → DET for all nt ≠ g, because a determiner is always dominated by the next noun on its right.

The variants. Depending on the structure of the studied language, the structure rules may not fit the deep split of the sentence. The grammar we presented before is called 4bin because it contains, in addition to the start symbol, 4 non-terminals (D, G, d and g), and the structure rules are written in a binary way, according to Chomsky normal form.

The meaning of these 4 non-terminals attests the fundamental difference between POS which would dominate from the left, those which would dominate from the right, and those which would be dominated from the right or the left. For English, we can see cases where it is useful. But sometimes, the difference can be irrelevant. For example, in the noun phrase "the last president elect", the two adjectives "last" and "elect" are both dominated by the noun "president", and in the same way. Therefore, we consider a 3bin version with only 3 non-terminals (in addition to S). We maintain in this variant the fact that g and d are left and right, but keep only one uppercase dominant symbol, called N (for neutral), which carries no side information.

Because of Chomsky normal form, the splits of the sentences are binary. However, translating ternary rules into a binary form allows us to use ternary structures suggested by sentences such as: subject (left), verb (middle), object (right). Following this idea, keeping the directed dominants D and G but also allowing a centered domination, we use again the neutral symbol N in the variants 5ter and 5ter+. These last versions differ because 5ter forbids a recursive use of N while 5ter+ accepts it, leading to more complex structures. Table 1 sums up the differences between the variants.

Variant   Main difference                         Structure rules                          Non-terminals
4bin      First grammar                           nt → G d, nt → g D                       S, D, G, d, g
3bin      The dominant has no side meaning        nt → N d, nt → g N                       S, N, d, g
4ter      Ternary rules are allowed               nt → G d, nt → g D, nt → g CAP d         S, D, G, d, g
5ter      4ter + centered dominant non-terminal N nt → G d, nt → g D, nt(≠ N) → g CAP d    S, D, N, G, d, g
5ter+     5ter + recursive rules for N            nt → G d, nt → g D, nt → g CAP d         S, D, N, G, d, g

Table 1: The variants of the context-free grammar DGdg (CAP represents a non-terminal in capital letters: G, D or N).

Tuning phase. All UDL models are tuned according to the language of the corpus. For our model, it consists in selecting only the rules of type nt → terminal which are linguistically relevant. We already gave an example illustrating the selection of such rules for the determiner. In the following experiments, we tune the models by observing, for each language, some trees given as a reference in the dependency treebanks.

3 Experiments and results

The French Treebank (Abeillé and Barrier, 2004) gives the constituent structures (noun phrases, verb phrases...) as well as the syntactic functions (subject, object...) of many sentences from the newspaper Le Monde. From 2009 on, this treebank was converted into dependency trees (Candito et al., 2010). We compare the trees learned by our models to those given as a reference in the Candito et al. (2010) treebank. We compute the Unlabeled Attachment Score (UAS), which gives the rate of correct dependencies (punctuation excluded).

The scores are quite different according to the variant used. We obtained for 3bin: 29.8%, for 4ter: 42.1%, for 5ter: 37.4% and for 5ter+: 42.2%; in order to assess the quality of these scores, we randomly generated trees and measured their UAS, obtaining only 14.2%. We notice that the two variants allowing ternary recursive rules (4ter and 5ter+), with a central group of words dominating one group on each side, give almost identical scores, much higher than those of the other variants. This would imply that the underlying structure of these journalistic sentences, quite elaborate, is better captured by more complex models. As far as we know, we are the first to apply UDL to French.

CONLL 2006 + FTB   3bin     4bin     4ter     5ter     5ter+
Bulgarian          17.7%    23.9%    22.9%    22.8%    23.6%
Danish             26.7%    20.5%    14.0%    13.9%    13.6%
Dutch              29.7%    34.6%    30.3%    27.0%    27.0%
English            15.3%    29.0%    26.4%    38.1%    39.0%
French             29.8%    32.9%    42.1%    37.4%    42.2%
German             20.9%    33.1%    20.9%    31.5%    31.7%
Japanese           31.2%    32.4%    32.2%    33.4%    64.7%
Portuguese         30.0%    54.0%    42.0%    37.8%    34.1%
Slovene            12.2%    21.5%    23.2%    21.6%    21.8%
Spanish            20.8%    39.2%    39.2%    30.0%    40.2%
Swedish            21.2%    18.9%    21.6%    21.8%    21.8%

Table 3: Scores for all DGdg methods.
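To make the DGdg formalism concrete, here is a minimal sketch of a probabilistic CYK parser over a 4bin-style grammar. It is an illustration only, not the authors' implementation: the rule probabilities and the small set of lexical rules below are invented for the example (in the paper they are learned by Inside-Outside and selected in the tuning phase).

```python
from collections import defaultdict

# Toy 4bin-style DGdg grammar: S (start), G/D (left/right dominants),
# g/d (left/right dominated). Structure rules follow the paper's
# constraints (nt -> G d or nt -> g D); probabilities are illustrative.
NONTERMS = ["S", "G", "D", "g", "d"]
structure_rules = {}  # (lhs, rhs1, rhs2) -> probability
for nt in NONTERMS:
    structure_rules[(nt, "G", "d")] = 0.5
    structure_rules[(nt, "g", "D")] = 0.5

# Lexical rules nt -> POS: e.g. a determiner may only be a left
# dominated item (g -> DET), as in the tuning example of the paper.
lexical_rules = {
    ("g", "DET"): 1.0,
    ("g", "NOUN"): 0.3, ("d", "NOUN"): 0.3, ("D", "NOUN"): 0.4,
    ("G", "VERB"): 0.6, ("D", "VERB"): 0.4,
}

def cyk(tags):
    """Probabilistic CYK: return (prob, tree) of the best S-rooted parse."""
    n = len(tags)
    best = defaultdict(dict)  # best[(i, j)][nt] = (prob, tree)
    for i, tag in enumerate(tags):
        for (nt, t), p in lexical_rules.items():
            if t == tag and p > best[(i, i + 1)].get(nt, (0, None))[0]:
                best[(i, i + 1)][nt] = (p, (nt, tag))
    for span in range(2, n + 1):              # widen spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):         # try every split point
                for (lhs, r1, r2), p in structure_rules.items():
                    if r1 in best[(i, k)] and r2 in best[(k, j)]:
                        p1, t1 = best[(i, k)][r1]
                        p2, t2 = best[(k, j)][r2]
                        q = p * p1 * p2
                        if q > best[(i, j)].get(lhs, (0, None))[0]:
                            best[(i, j)][lhs] = (q, (lhs, t1, t2))
    return best[(0, n)].get("S", (0.0, None))

prob, tree = cyk(["DET", "NOUN", "VERB", "NOUN"])
print(prob, tree)
```

In the returned tree, each binary node pairs an uppercase (dominant) child with a lowercase (dominated) child, which is exactly the information needed to read off a dependency tree afterwards.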

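The UAS figures reported in these experiments can be computed with a few lines of code. This is a generic sketch of the metric (predicted head indices compared to gold head indices, punctuation excluded); the head-list encoding below is an assumption for the example, not the paper's data format.

```python
def uas(gold_heads, pred_heads, is_punct):
    """Unlabeled Attachment Score: fraction of non-punctuation tokens
    whose predicted head index matches the gold one (0 = root)."""
    correct = total = 0
    for g, p, punct in zip(gold_heads, pred_heads, is_punct):
        if punct:            # punctuation is excluded, as in the paper
            continue
        total += 1
        correct += (g == p)
    return correct / total if total else 0.0

# "Cathryn Rice could hardly believe her eyes ." with gold heads as
# 1-based indices (0 = root): could_3 is the root, as in Figure 1.
gold = [2, 3, 0, 3, 3, 7, 5, 3]
pred = [2, 3, 0, 5, 3, 7, 5, 3]   # one wrong attachment for "hardly"
punct = [False] * 7 + [True]
print(round(uas(gold, pred, punct), 3))  # prints 0.857 (6 of 7 correct)
```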
On the other hand, this task has been widely studied for English, and for other languages since the conference CONLL 2006 (Buchholz and Marsi, 2006). We compare our model to the DMV reference. Table 2 summarizes the results, as well as the variant which achieves the best score. As we can see in Table 3, results can be very different depending on the variant, showing that the choice of a variant must be made wisely according to the language and the type of text. These results show that for some languages, as for instance Bulgarian, all our methods badly model the shape of the corpus sentences (between 17.7% and 23.9% for Bulgarian). On the other hand, for some other languages, such as Japanese, one of our variants (here 5ter+) strongly outperforms all the others.

Table 2 gives the best UAS compared to the soft-EM reference given in (Spitkovsky et al., 2011). The dependency treebanks come, for English, from (Marcus et al., 1993); for French, from (Candito et al., 2010; Abeillé and Barrier, 2004); and for the other languages from CONLL 2006, that is, Bulgarian: (Simov et al., 2002), Danish: (Kromann, 2003), Dutch: (Van der Beek et al., 2002), German: (Brants et al., 2002), Japanese: (Hinrichs et al., 2000), Portuguese: (Afonso et al., 2002), Slovene: (Džeroski et al., 2006), Spanish: (Civit and Martí, 2004) and Swedish: (Nilsson et al., 2005). The references are obviously the same for Table 3. The treebanks are provided already split into test and training sets of sentences.

CONLL 2006 + FTB   Random   DMV soft-EM   DGdg    Variant used   Time for    Nb of corpus   Nb of distinct
                            score                                 learning    words          categories
Bulgarian          16.1%    39.1%         23.9%   4bin           23 min      190 217        12
Danish             14.7%    43.5%         26.7%   3bin           35 min      94 386         10
Dutch              14.8%    21.3%         34.6%   4bin           12 min      195 069        13
English            13.4%    38.1%         39.0%   5ter+          99 min      937 545        23
French             14.2%    no ref.       42.2%   5ter+          320 min     278 083        15
German             13.1%    33.3%         33.1%   4bin           196 min     699 331        52
Japanese           20.7%    56.6%         64.7%   5ter+          19 min      151 461        21
Portuguese         15.3%    37.9%         54.0%   4bin           46 min      206 490        16
Slovene            13.7%    30.8%         23.2%   4ter           16 min      28 750         12
Spanish            13.3%    33.3%         40.2%   5ter+          35 min      89 334         15
Swedish            14.8%    41.8%         21.8%   5ter, 5ter+    80 min      191 467        15

Table 2: Best scores.

4 Discussion and conclusion

The learning times depend strongly on the volume of the data and weakly on the number of syntactic categories. The small number of structure rules of the grammar leads to a reasonable learning time, even very fast for small corpora. Once the grammar is learned, parsing is almost instantaneous (a few seconds for thousands of sentences). This shows the flexibility and the speed of our model. This is why we can say that it is portable and efficient. Some complementary tests show that we can obtain better scores with more fine-grained categories, even though the learning time is then a bit longer.

To improve our model, we think about integrating lexical information, to be able to make a difference between two sequences with the same POS tags which should nevertheless have different dependency trees.

Acknowledgments

We thank Mark Johnson for his codes of Inside-Outside and CYK, and Marie Candito for delivering her dependency treebank as well as for her advice.

References

[Abeillé and Barrier 2004] Anne Abeillé and Nicolas Barrier. 2004. Enriching a French treebank. In Proc. of LREC 2004, Lisbon.

[Afonso et al. 2002] Susana Afonso, Eckhard Bick, Renato Haber, and Diana Santos. 2002. Floresta sintá(c)tica: a treebank for Portuguese. In LREC.

[Brants et al. 2002] Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith. 2002. The TIGER treebank. In Proceedings of the Workshop on Treebanks and Linguistic Theories, pages 24-41.

[Buchholz and Marsi 2006] Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning.

[Candito et al. 2010] Marie Candito, Benoît Crabbé, and Pascal Denis. 2010. Statistical French dependency parsing: treebank conversion and first results. In LREC 2010, pages 1840-1847.

[Chelba et al. 1997] Ciprian Chelba, David Engle, Frederick Jelinek, Victor Jimenez, Sanjeev Khudanpur, Lidia Mangu, Harry Printz, Eric Ristad, Ronald Rosenfeld, and Andreas Stolcke. 1997. Structure and performance of a dependency language model. In EUROSPEECH.

[Civit and Martí 2004] Montserrat Civit and M. Antònia Martí. 2004. Building Cast3LB: a Spanish treebank. Research on Language and Computation, 2(4):549-574.

[Culotta and Sorensen 2004] Aron Culotta and Jeffrey Sorensen. 2004. Dependency tree kernels for relation extraction. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 423.

[Džeroski et al. 2006] Sašo Džeroski, Tomaž Erjavec, Nina Ledinek, Petr Pajas, Zdeněk Žabokrtský, and Andreja Žele. 2006. Towards a Slovene dependency treebank. In Proc. of the Fifth Intern. Conf. on Language Resources and Evaluation (LREC).

[Haghighi et al. 2005] Aria D. Haghighi, Andrew Y. Ng, and Christopher D. Manning. 2005. Robust textual inference via graph matching. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 387-394, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Hinrichs et al. 2000] Erhard W. Hinrichs, Julia Bartels, Yasuhiro Kawata, Valia Kordoni, and Heike Telljohann. 2000. The VERBMOBIL treebanks. In KONVENS, pages 107-112.

[Jurafsky and Martin 2009] Dan Jurafsky and James H. Martin. 2009. Speech and Language Processing. Prentice Hall.

[Klein and Manning 2004] Dan Klein and Christopher D. Manning. 2004. Corpus-based induction of syntactic structure: models of dependency and constituency. In Proceedings of the 42nd Annual Meeting on ACL, page 478.

[Kromann 2003] Matthias Trautner Kromann. 2003. The Danish dependency treebank and the DTAG treebank tool. In Proceedings of the Second Workshop on Treebanks and Linguistic Theories (TLT), page 217.

[Lari and Young 1990] K. Lari and S. J. Young. 1990. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech & Language, 4(1):35-56, January.

[Marcus et al. 1993] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313-330.

[McDonald et al. 2013] Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, and Oscar Täckström. 2013. Universal dependency annotation for multilingual parsing. In Proceedings of ACL.

[Nilsson et al. 2005] Jens Nilsson, Johan Hall, and Joakim Nivre. 2005. MAMBA meets TIGER: reconstructing a treebank from antiquity.

[Quirk et al. 2005] Chris Quirk, Arul Menezes, and Colin Cherry. 2005. Dependency treelet translation: syntactically informed phrasal SMT. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 271-279.

[Simov et al. 2002] Kiril Simov, Petya Osenova, Milena Slavcheva, Sia Kolkovska, Elisaveta Balabanova, Dimitar Doikoff, Krassimira Ivanova, Alexander Simov, and Milen Kouylekov. 2002. Building a linguistically interpreted corpus of Bulgarian: the BulTreeBank. In LREC.

[Snow et al. 2004] Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2004. Learning syntactic patterns for automatic hypernym discovery. In NIPS.

[Spitkovsky et al. 2011] Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. 2011. Lateen EM: unsupervised training with multiple objectives, applied to dependency grammar induction. In EMNLP, pages 1269-1280.

[Van der Beek et al. 2002] Leonoor Van der Beek, Gosse Bouma, Rob Malouf, and Gertjan Van Noord. 2002. The Alpino dependency treebank. Language and Computers, 45(1):8-22.

[Wang et al. 2007] Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. 2007. What is the Jeopardy model? A quasi-synchronous grammar for QA. In EMNLP-CoNLL, volume 7, pages 22-32.