Enhancing FreeLing Rule-Based Dependency Grammars with Subcategorization Frames
Marina Lloberes, U. de Barcelona, Barcelona, Spain
Irene Castellón, U. de Barcelona, Barcelona, Spain
Lluís Padró, U. Politècnica de Catalunya, Barcelona, Spain

Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), pages 201–210, Uppsala, Sweden, August 24–26 2015.

Abstract

Despite the recent advances in parsing, significant efforts are still needed to improve the performance of current parsers, for instance in argument/adjunct recognition. There is evidence that verb subcategorization frames can contribute to parser accuracy, but a number of issues remain open. The main aim of this paper is to show how subcategorization frames acquired from a syntactically annotated corpus and organized into fine-grained classes can improve the performance of two rule-based dependency grammars.

1 Introduction

Statistical parsers and rule-based parsers have advanced over recent years. However, significant efforts are required to increase the performance of current parsers (Klein and Manning, 2003; Nivre et al., 2006; Ballesteros and Nivre, 2012; Marimon et al., 2014).

One of the linguistic phenomena which parsers often fail to handle correctly is the argument/adjunct distinction (Carroll et al., 1998). For this reason, the main goal of this paper is to compare empirically the accuracy of rule-based dependency grammars that work exclusively with syntactic rules against grammars whose rules are enriched with subcategorization frames.

A number of studies show that subcategorization frames can contribute to improving parser performance (Carroll et al., 1998; Zeman, 2002; Mirroshandel et al., 2013). These studies are mainly concerned with the integration of subcategorization information into statistical parsers.

The list of studies about rule-based parsers integrating subcategorization information is also extensive (Lin, 1998; Alsina et al., 2002; Bick, 2006; Calvo and Gelbukh, 2011). However, they do not explicitly relate the improvements in parser performance to the addition of subcategorization.

This paper analyses in detail how subcategorization frames acquired from an annotated corpus and distributed among fine-grained classes increase accuracy in rule-based dependency grammars.

The framework used is that of the FreeLing Dependency Grammars (FDGs) for Spanish and Catalan, using enriched lexical-syntactic information about the argument structure of the verb. FreeLing (Padró and Stanilovsky, 2012) is an open-source library of multilingual Natural Language Processing (NLP) tools that provide linguistic analysis for written texts. The FDGs are the core of the FreeLing dependency parser, the Txala Parser (Atserias et al., 2005).

The remainder of this paper is organized as follows. Section 2 contains an overview of previous work related to this research. Section 3 presents the rule-based dependency parser used and the Spanish and Catalan grammars. Section 4 describes the strategy followed initially to integrate subcategorization into the grammars and how this information has been redesigned. Section 5 focuses on the evaluation and the analysis of several experiments testing versions of the grammars that include or discard subcategorization frames. Finally, the main conclusions and the further research goals arising from the results of the experiments are presented in Section 6.

2 Related Work

There has been extensive research on parser development, and most approaches can be classified as statistical or rule-based. In the former, a statistical model learnt from annotated or unannotated texts is applied to build the syntactic tree (Klein and Manning, 2003; Collins and Koo, 2005; Nivre et al., 2006; Ballesteros and Nivre, 2012), whereas the latter uses hand-built grammars to guide the parser in the construction of the tree (Sleator and Temperley, 1991; Järvinen and Tapanainen, 1998; Lin, 1998).

Concerning the languages this study is based on, some research on Spanish has been performed from the perspective of Constraint Grammar (Bick, 2006), Unification Grammar (Ferrández and Moreno, 2000), Head-Driven Phrase Structure Grammar (Marimon et al., 2014), and Dependency Grammar for statistical parsing, both supervised (Carreras et al., 2006) and semi-supervised (Calvo and Gelbukh, 2011). For Catalan, a rule-based parser based on Constraint Grammar (Alsina et al., 2002) and a statistical dependency parser (Carreras, 2007) are available.

Despite the huge achievements in the area of parsing, argument/adjunct recognition remains a linguistic problem on which parsers show low accuracy and on which there is no general consensus in Theoretical Linguistics (Tesnière, 1959; Chomsky, 1965). This problem relates to the notion of subcategorization, that is, the definition of the type and the number of arguments of a syntactic head.

The acquisition of subcategorization frames from corpora is one of the strategies for integrating information about the argument structure into a parser. Depending on the level of language analysis of the annotated corpus, two main strategies are used in automatic acquisition.

If the acquisition is performed over a morphosyntactically annotated text, the subcategorization frames are inferred by applying statistical techniques to the morphosyntactically annotated data (Brent, 1993; Manning, 1993; Korhonen et al., 2003).

Alternatively, acquisition can be performed with syntactically annotated texts (Sarkar and Zeman, 2000; O'Donovan et al., 2005; Aparicio et al., 2008). In this case, subcategorization acquisition can be performed straightforwardly because the information about the argument structure is available in the corpus. Therefore, this approach generally focuses on the methods for classifying subcategorization frames.

The final classification in a lexicon of frames is a computational resource for several NLP tools. In the framework which this research focuses on, the integration of the acquired subcategorization is oriented towards helping to build the syntactic tree when the parser has incomplete information to make a decision (Carroll et al., 1998).

Depending on the characteristics of the parser, subcategorization assists in this task in different ways. In parsers that produce the whole set of possible syntactic analyses of a particular sentence, subcategorization information can be used to assign a probability to every possible syntactic tree and to rank them (Carroll et al., 1998; Zeman, 2002; Mirroshandel et al., 2013).

Alternatively, subcategorization may help to restrict the application of certain rules: when the parser detects the subcategorization frame in the input sentence, it labels the syntactic tree according to the frame and discards any other possible analysis (Lin, 1998; Calvo and Gelbukh, 2011).

3 Dependency Parsing in FreeLing

The rule-based dependency grammars presented in this article are the core of the Txala Parser (Atserias et al., 2005), the NLP module in charge of Dependency Parsing in the FreeLing library (Padró and Stanilovsky, 2012).¹

¹ http://nlp.cs.upc.edu/freeling/

FreeLing is an open-source project that has been developed for more than ten years. It is a complete NLP pipeline built on a chain of modules that provide a general and robust linguistic analysis. Among the available tools, FreeLing offers sentence recognition, tokenization, named entity recognition, tagging, chunking, dependency parsing, word sense disambiguation, and coreference resolution.

3.1 Txala Parser

The Txala Parser is one of the dependency parsing modules available in FreeLing. It is a rule-based, non-projective and multilingual dependency parser that provides robust syntactic analysis in three steps.

Txala receives the partial syntactic trees produced by the chunker (Civit, 2003) as input. Firstly, the head-child relations are identified using a set of heuristic rules that iteratively decide whether two adjacent trees must be merged, and in which way, until there is only one tree left. Secondly, the resulting tree is converted into syntactic dependencies according to Mel'čuk (1988). Finally, each dependency arc of the tree is labelled with a syntactic function tag.

    911 !grup-verb $$ - (sn,subord{^PR}) top_left RELABEL -

Figure 1: Linking rule for relative clauses

                      Rules
    Language    Total   Linking   Labelling
    English     2961    2239      722
    Spanish     4042    3310      732
    Catalan     2879    2099      780
    Galician     178      87       91
    Asturian    4438    3842      596

Table 1: Sizes of the FDGs

    grup-verb dobj d.label=grup-sp p.class=trans d.side=right d.lemma=a|al d:sn.tonto=Human d:sn.tonto!=Building|Place

Figure 2: Labelling rule for human direct objects

3.2 FreeLing Dependency Grammars

The current version of FreeLing includes rule-based dependency grammars for English, Spanish, Catalan, Galician and Asturian (see Table 1 for a brief overview of their sizes). In this paper, the Spanish and Catalan dependency grammars are described.

The FDGs follow the linguistic basis of syntactic dependencies (Tesnière, 1959; Mel'čuk, 1988). However, we propose a different analysis for prepositional phrases (preposition-headed), subordinate

Net Semantic File, WordNet Synonyms and Hypernyms and other semantic features predefined by the user).

In the rule illustrated in Figure 2, the direct object label (dobj) is assigned to the link between a verbal head (grup-verb) and a prepositional