Branch and Bound Algorithm for Dependency Parsing with Non-local Features

Xian Qian and Yang Liu
Computer Science Department
The University of Texas at Dallas
{qx,yangl}@hlt.utdallas.edu

Abstract

Graph based dependency parsing is inefficient when handling non-local features due to the high computational complexity of inference. In this paper, we propose an exact and efficient decoding algorithm based on the Branch and Bound (B&B) framework, in which non-local features are bounded by a linear combination of local features. Dynamic programming is used to search for the upper bound. Experiments are conducted on the English PTB and Chinese CTB datasets. We achieved competitive Unlabeled Attachment Scores (UAS) when no additional resources are available: 93.17% for English and 87.25% for Chinese. Parsing speed is 177 words per second for English and 97 words per second for Chinese. Our algorithm is general and can be adapted to non-projective dependency parsing or other graphical models.

1 Introduction

For graph based projective dependency parsing, dynamic programming (DP) is popular for decoding due to its efficiency when handling local features. It performs cubic time parsing for arc-factored models (Eisner, 1996; McDonald et al., 2005a) and biquadratic time for higher order models with richer sibling and grandchild features (Carreras, 2007; Koo and Collins, 2010). However, for models with general non-local features, DP is inefficient.

There have been numerous studies on global inference for general higher order parsing. One popular approach is reranking (Collins, 2000; Charniak and Johnson, 2005; Hall, 2007). It typically has two steps: a low level classifier generates the top k hypotheses using local features, then a high level classifier reranks these candidates using global features. Since the reranking quality is bounded by the oracle performance of the candidates, some work has combined the candidate generation and reranking steps using cube pruning (Huang, 2008; Zhang and McDonald, 2012) to achieve higher oracle performance. These methods parse a sentence in bottom up order and keep the top k derivations for each span using k best parsing (Huang and Chiang, 2005). After merging two spans, non-local features are used to rerank the top k combinations. This approach is very efficient and flexible in handling various non-local features. The disadvantage is that it tends to compute non-local features as early as possible so that the decoder can utilize that information at internal spans; hence it may miss long historical features such as long dependency chains.

Smith and Eisner modeled dependency parsing using Markov Random Fields (MRFs) with global constraints and applied loopy belief propagation (LBP) for approximate learning and inference (Smith and Eisner, 2008). Similar work was done for Combinatorial Categorial Grammar (CCG) parsing (Auli and Lopez, 2011). They used posterior marginal beliefs for inference to satisfy the tree constraint: for each factor, only legal messages (those satisfying the global constraints) are considered in the partition function.

A similar line of research investigated the use of integer linear programming (ILP) based parsing (Riedel and Clarke, 2006; Martins et al., 2009). This


Transactions of the Association for Computational Linguistics, 1 (2013) 37–48. Action Editor: Ryan McDonald. Submitted 11/2012; Revised 2/2013; Published 3/2013. © 2013 Association for Computational Linguistics.

method is very expressive. It can handle arbitrary non-local features determined or bounded by linear inequalities of local features. For local models, LP is less efficient than DP. The reason is that DP works on a small number of dimensions in each recursion, while for LP, the popular solvers need to solve an m dimensional linear system in each iteration (Nocedal and Wright, 2006), where m is the number of constraints, which is quadratic in sentence length for projective dependency parsing (Martins et al., 2009).

Dual Decomposition (DD) (Rush et al., 2010; Koo et al., 2010) is a special case of Lagrangian relaxation. It relies on standard decoding algorithms as oracle solvers for sub-problems, together with a simple method for forcing agreement between the different oracles. This method does not need to consider the tree constraint explicitly, as it resorts to dynamic programming, which guarantees its satisfaction. It works well if the sub-problems can be well defined, especially for joint learning tasks. However, for the task of dependency parsing, using various non-local features may result in many overlapping sub-problems, hence it may take a long time to reach a consensus (Martins et al., 2011).

In this paper, we propose a novel Branch and Bound (B&B) algorithm for efficient parsing with various non-local features. B&B (Land and Doig, 1960) is generally used for combinatorial optimization problems such as ILP. The difference between our method and ILP is that the sub-problem in ILP is a relaxed LP, which requires a numerical solution, while ours bounds the non-local features by a linear combination of local features and uses DP both for decoding and for calculating the upper bound of the objective function. An exact solution is achieved if the bound is tight. Though in the worst case the time complexity is exponential in sentence length, the method is efficient in practice, especially when a pruning strategy is adopted.

Experiments are conducted on the English Penn Tree Bank (PTB) and Chinese Tree Bank 5 (CTB5) with the standard train/develop/test split. We achieved 93.17% Unlabeled Attachment Score (UAS) for English at a speed of 177 words per second and 87.25% for Chinese at a speed of 97 words per second.

2 Graph Based Parsing

2.1 Problem Definition

Given a sentence x = x_1, x_2, ..., x_n, where x_i is the i-th word of the sentence, dependency parsing assigns exactly one head word to each word, so that dependencies from head words to modifiers form a tree. The root of the tree is a special symbol denoted by x_0, which has exactly one modifier. In this paper, we focus on unlabeled projective dependency parsing, but our algorithm can be adapted for labeled or non-projective dependency parsing (McDonald et al., 2005b).

The inference problem is to search for the optimal parse tree y*

    y* = arg max_{y ∈ Y(x)} ϕ(x, y)

where Y(x) is the set of all candidate parse trees of sentence x, and ϕ(x, y) is a given score function, which is usually decomposed into small parts

    ϕ(x, y) = Σ_{c ⊆ y} ϕ_c(x)    (1)

where c is a subset of edges, called a factor. For example, in the all grandchild model (Koo and Collins, 2010), the score function can be represented as

    ϕ(x, y) = Σ_{e_hm ∈ y} ϕ_{e_hm}(x) + Σ_{e_gh, e_hm ∈ y} ϕ_{e_gh, e_hm}(x)

where the first term is the sum of the scores of all edges x_h → x_m, and the second term is the sum of the scores of all edge chains x_g → x_h → x_m.

In discriminative models, the score of a parse tree y is the weighted sum of the fired feature functions, which can be represented by the sum of the factors

    ϕ(x, y) = w^T f(x, y) = Σ_{c ⊆ y} w^T f(x, c) = Σ_{c ⊆ y} ϕ_c(x)

where f(x, c) is the feature vector that depends on c. For example, we could define a feature for the grandchild factor c = {e_gh, e_hm}

    f(x, c) = 1  if x_g = would ∧ x_h = be ∧ x_m = happy ∧ c is selected
    f(x, c) = 0  otherwise

2.2 Dynamic Programming for Local Models

In first order models, all factors c in Eq(1) contain a single edge. The optimal parse tree can be derived by DP with running time O(n^3) (Eisner, 1996). The algorithm has two types of structures: complete spans, which consist of a headword and its descendants on one side, and incomplete spans, which consist of a dependency and the region between the head and the modifier. It starts from single word spans, and merges the spans in bottom up order.

For second order models, the score function ϕ(x, y) adds the scores of siblings (adjacent edges with a common head) and grandchildren

    ϕ(x, y) = Σ_{e_hm ∈ y} ϕ_{e_hm}(x) + Σ_{e_gh, e_hm ∈ y} ϕ_{e_gh, e_hm}(x) + Σ_{e_hm, e_hs ∈ y} ϕ_{e_hm, e_hs}(x)

There are two versions of second order models, used respectively by Carreras (2007) and Koo et al. (2010). The difference is that Carreras' only considers the outermost grandchildren, while Koo and Collins' allows all grandchild features. Both models permit O(n^4) running time.

Third-order models score edge triples such as three adjacent sibling modifiers, or grand-siblings that score a word, its modifier and its adjacent grandchildren, and the inference complexity is O(n^4) (Koo and Collins, 2010).

In this paper, we call all the factors/features that can be handled by DP local factors/features.

3 The Proposed Method

3.1 Basic Idea

For general high order models with non-local features, we propose to use a Branch and Bound (B&B) algorithm to search for the optimal parse tree. A B&B algorithm has two steps: branching and bounding. The branching step recursively splits the search space Y(x) into two disjoint subspaces Y(x) = Y_1 ∪ Y_2 by fixing the assignment of one edge. For each subspace Y_i, the bounding step calculates an upper bound of the optimal parse tree score in the subspace: UB_{Y_i} ≥ max_{y ∈ Y_i} ϕ(x, y). If this bound is no more than some already obtained parse tree score, UB_{Y_i} ≤ ϕ(x, y′), then no parse tree in subspace Y_i is better than y′, and Y_i can be pruned safely.

The efficiency of B&B depends on the branching strategy and the upper bound computation. For example, Sun et al. (2012) used B&B for MRFs, where they proposed two branching strategies and a novel data structure for efficient upper bound computation. Klenner and Ailloud (2009) proposed a variation of the Balas algorithm (Balas, 1965) for coreference resolution, where candidate branching variables are sorted by their weights.

Our bounding strategy is to find an upper bound for the score of each non-local factor c containing multiple edges. The bound is the sum of new scores of the edges in the factor plus a constant

    ϕ_c(x) ≤ Σ_{e ∈ c} ψ_e(x) + α_c

Based on the new scores {ψ_e(x)} and constants {α_c}, we define the new score of a parse tree y

    ψ(x, y) = Σ_{c ⊆ y} ( Σ_{e ∈ c} ψ_e(x) + α_c )

Then we have

    ψ(x, y) ≥ ϕ(x, y),  ∀ y ∈ Y(x)

The advantage of such a bound is that it is a sum of new edge scores. Hence its optimal tree max_{y ∈ Y(x)} ψ(x, y) can be found by DP, and it upper bounds max_{y ∈ Y(x)} ϕ(x, y), since ψ(x, y) ≥ ϕ(x, y) for any y ∈ Y(x).

3.2 The Upper Bound Function

In this section, we derive the upper bound function ψ(x, y) described above. To simplify notation, we drop x throughout the rest of the paper. Let z_c be a binary variable indicating whether factor c is selected in the parse tree. We reformulate the score function in Eq(1) as

    ϕ(y) ≡ ϕ(z) = Σ_c ϕ_c z_c    (2)
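As an aside to Section 2.2, the first order DP it describes can be sketched as follows. This is a minimal arc-factored Eisner-style decoder that returns only the best tree score (back-pointers and feature extraction are omitted); the dense matrix `score[h][m]` and the function name are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of first order (arc-factored) Eisner parsing: O(n^3) time.
# score[h][m] is the (assumed) score of the arc x_h -> x_m; index 0 is the root.
NEG = float("-inf")

def eisner_max_score(score):
    n = len(score) - 1  # number of real words
    # C = complete spans, I = incomplete spans; direction 0 = "<-", 1 = "->"
    C = [[[NEG, NEG] for _ in range(n + 1)] for _ in range(n + 1)]
    I = [[[NEG, NEG] for _ in range(n + 1)] for _ in range(n + 1)]
    for s in range(n + 1):
        C[s][s][0] = C[s][s][1] = 0.0
    for k in range(1, n + 1):              # span length, bottom up
        for s in range(0, n + 1 - k):
            t = s + k
            # incomplete spans: add the arc between the two end points
            best = max(C[s][q][1] + C[q + 1][t][0] for q in range(s, t))
            I[s][t][0] = best + score[t][s]    # t is the head (left arc)
            I[s][t][1] = best + score[s][t]    # s is the head (right arc)
            # complete spans: attach a finished sub-span below the arc
            C[s][t][0] = max(C[s][q][0] + I[q][t][0] for q in range(s, t))
            C[s][t][1] = max(I[s][q][1] + C[q][t][1] for q in range(s + 1, t + 1))
    return C[0][n][1]

# toy example: root + 2 words; the best tree is 0 -> 1 -> 2 with score 1 + 5 = 6
score = [[NEG, 1.0, 1.0],
         [NEG, NEG, 5.0],
         [NEG, 2.0, NEG]]
print(eisner_max_score(score))  # -> 6.0
```

A full parser would additionally store back-pointers in the two charts to recover the tree, exactly as in the complete/incomplete span description above.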

Correspondingly, the tree constraint is replaced by z ∈ Z. Then the parsing task is

    z* = arg max_{z ∈ Z} Σ_c ϕ_c z_c    (3)

Notice that, for any z_c, we have

    z_c = min_{e ∈ c} z_e

which means that factor c appears in the parse tree if and only if all of its edges {e | e ∈ c} are selected in the tree. Here z_e is short for z_{{e}} for simplicity.

Our bounding method is based on the following simple fact: for a set {a_1, a_2, ..., a_r} (a_j denotes the j-th element), its minimum is

    min_j {a_j} = min_{p ∈ ∆} Σ_j p_j a_j    (4)

where ∆ is the probability simplex

    ∆ = { p | p_j ≥ 0, Σ_j p_j = 1 }

We discuss the bound for ϕ_c z_c in two cases: ϕ_c ≥ 0 and ϕ_c < 0.

If ϕ_c ≥ 0, we have

    ϕ_c z_c = ϕ_c min_{e ∈ c} z_e
            = ϕ_c min_{p_c ∈ ∆} Σ_{e ∈ c} p_c^e z_e
            = min_{p_c ∈ ∆} Σ_{e ∈ c} ϕ_c p_c^e z_e

The second equation comes from Eq(4). For simplicity, let

    g_c(p_c, z) = Σ_{e ∈ c} ϕ_c p_c^e z_e

with domain dom g_c = { p_c ∈ ∆; z_e ∈ {0, 1}, ∀ e ∈ c }. Then we have

    ϕ_c z_c = min_{p_c} g_c(p_c, z)    (5)

If ϕ_c < 0, we have two upper bounds. One is commonly used in ILP when all the variables are binary

    a* = min_{j=1,...,r} {a_j}  ⇔  a* ≤ a_j ∀j,  a* ≥ Σ_j a_j − (r − 1)

According to the last inequality, we have the upper bound for negative scored factors

    ϕ_c z_c ≤ ϕ_c ( Σ_{e ∈ c} z_e − (r_c − 1) )    (6)

where r_c is the number of edges in c. For simplicity, we use the notation

    σ_c(z) = ϕ_c ( Σ_{e ∈ c} z_e − (r_c − 1) )

The other upper bound, when ϕ_c < 0, is

    ϕ_c z_c ≤ 0    (7)

Notice that, for any parse tree, one of the two upper bounds must be tight: Eq(6) is tight if c appears in the parse tree (z_c = 1); otherwise Eq(7) is tight. Therefore

    ϕ_c z_c = min{ σ_c(z), 0 }

Let

    h_c(p_c, z) = p_c^1 σ_c(z) + p_c^2 · 0

with dom h_c = { p_c ∈ ∆; z_e ∈ {0, 1}, ∀ e ∈ c }. According to Eq(4), we have

    ϕ_c z_c = min_{p_c} h_c(p_c, z)    (8)

Let

    ψ(p, z) = Σ_{c : ϕ_c ≥ 0} g_c(p_c, z) + Σ_{c : ϕ_c < 0} h_c(p_c, z)

Minimizing ψ with respect to p, we have

    min_p ψ(p, z) = min_p ( Σ_{c : ϕ_c ≥ 0} g_c(p_c, z) + Σ_{c : ϕ_c < 0} h_c(p_c, z) )
                  = Σ_{c : ϕ_c ≥ 0} min_{p_c} g_c(p_c, z) + Σ_{c : ϕ_c < 0} min_{p_c} h_c(p_c, z)
                  = Σ_{c : ϕ_c ≥ 0} ϕ_c z_c + Σ_{c : ϕ_c < 0} ϕ_c z_c
                  = ϕ(z)

The second equation holds since, for any two factors c and c′, g_c (or h_c) and g_{c′} (or h_{c′}) are separable. The third equation comes from Eq(5) and Eq(8). Based on this, we have the following proposition:
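Because one of the two bounds Eq(6) and Eq(7) is always tight, the identity ϕ_c z_c = min{σ_c(z), 0} can be checked exhaustively over binary edge variables. The factor size and weight below are arbitrary choices made for the check:

```python
# Sketch: exhaustively check phi_c * z_c == min(sigma_c(z), 0) for a negative
# weighted factor, where z_c = min_e z_e and sigma_c(z) = phi_c * (sum(z) - (r - 1)).
from itertools import product

phi_c, r = -2.5, 3                            # arbitrary negative weight, 3 edges
for z in product((0, 1), repeat=r):
    z_c = min(z)                              # factor fires iff all its edges fire
    sigma = phi_c * (sum(z) - (r - 1))
    assert phi_c * z_c == min(sigma, 0.0)     # one of the two bounds is tight
print("identity holds for all", 2 ** r, "assignments")
```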

Proposition 1. For any p with p_c ∈ ∆, and any z ∈ Z, ψ(p, z) ≥ ϕ(z).

Therefore, ψ(p, z) is an upper bound function of ϕ(z). Furthermore, for fixed p, ψ(p, z) is a linear function of the z_e (see Eq(5) and Eq(8)): the variables z_c of large factors are eliminated. Hence z′ = arg max_{z ∈ Z} ψ(p, z) can be found by DP. Because

    ψ(p, z′) ≥ ψ(p, z*) ≥ ϕ(z*) ≥ ϕ(z′)

after obtaining z′ we get both an upper bound and a lower bound of ϕ(z*): ψ(p, z′) and ϕ(z′).

The upper bound is expected to be as tight as possible. Using the min-max inequality, we get

    max_{z ∈ Z} ϕ(z) = max_{z ∈ Z} min_p ψ(p, z) ≤ min_p max_{z ∈ Z} ψ(p, z)

so min_p max_{z ∈ Z} ψ(p, z) provides the tightest upper bound of ϕ(z*).

Since ψ is not differentiable w.r.t. p, projected sub-gradient (Calamai and Moré, 1987; Rush et al., 2010) is used to search for the saddle point. More specifically, in each iteration, we first fix p and search z using DP, then we fix z and update p by

    p_new = P_∆( p + α ∂ψ/∂p )

where α > 0 is the step size, and P_∆(q) denotes the projection of q onto the probability simplex ∆. In this paper, we use the Euclidean projection

    P_∆(q) = arg min_{p ∈ ∆} ∥p − q∥_2

which can be computed efficiently by sorting (Duchi et al., 2008).

Figure 1: A part of the B&B tree (ϕ and ψ are short for ϕ(z′) and ψ(p′, z′) respectively). For each node, some edges of the parse tree are fixed. All parse trees that satisfy the fixed edges compose a subset S ⊆ Z. A min-max problem is solved to get the upper bound and lower bound of the optimal parse tree over S. Once the upper bound ψ is less than LB, the node is removed safely.

3.3 Branch and Bound Based Parsing

As discussed in Section 3.1, the B&B recursive procedure yields a binary tree structure called the Branch and Bound tree. Each node of the B&B tree has some fixed z_e, specifying some must-select edges and must-remove edges. The root of the B&B tree has no constraints, so it can produce all possible parse trees, including z*. Each node has two children: one adds the constraint z_e = 1 for a free edge e, and the other fixes z_e = 0. We can explore the search space {z | z_e ∈ {0, 1}} by traversing the B&B tree in breadth first order.

Let S ⊆ Z be the subspace of parse trees satisfying the constraints, i.e., in the branch of the node. For each node in the B&B tree, we solve

    p′, z′ = arg min_p max_{z ∈ S} ψ(p, z)

to get the upper bound and lower bound of the best parse tree in S. A global lower bound LB is maintained, which is the maximum of all obtained lower bounds. If the upper bound of the current node is lower than the global lower bound, the node can be pruned from the B&B tree safely. An example is shown in Figure 1.

When the upper bound is not tight (ψ > LB), we need to choose a good branching variable to generate the child nodes. Let G(z′) = ψ(p′, z′) − ϕ(z′) denote the gap between the upper bound and the lower bound. This gap is actually the accumulation of the gaps of all factors c. Let G_c be the gap of c

    G_c = g_c(p_c′, z′) − ϕ_c z_c′   if ϕ_c ≥ 0
    G_c = h_c(p_c′, z′) − ϕ_c z_c′   if ϕ_c < 0
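Before turning to the branching heuristic: the Euclidean projection onto ∆ used in the p update can be sketched with the sorting method of Duchi et al. (2008). `project_simplex` is an illustrative name, and this plain-Python version is a sketch rather than the authors' code:

```python
# Sketch of Euclidean projection onto the probability simplex by sorting,
# following the O(m log m) method of Duchi et al. (2008).
def project_simplex(q):
    u = sorted(q, reverse=True)
    css, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        css += uj
        t = (css - 1.0) / j
        if uj - t > 0:          # largest j with a positive component survives
            theta = t
    return [max(qi - theta, 0.0) for qi in q]

print(project_simplex([0.5, 0.5]))   # already on the simplex -> [0.5, 0.5]
print(project_simplex([1.0, 2.0]))   # -> [0.0, 1.0]
```

The returned vector is non-negative and sums to one, so each `p_c` stays a valid distribution after every sub-gradient step.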

We choose the branching variable heuristically: for each edge e, we define its gap as the sum of the gaps of the factors that contain it

    G^e = Σ_{c : e ∈ c} G_c

The edge with the maximum gap is selected as the branching variable.

Suppose there are N nodes on a level of the B&B tree; correspondingly, we get N branching variables, among which we choose the one with the highest lower bound, as it likely reaches the optimal value faster.

Algorithm 1 Branch and Bound based parsing
Require: {ϕ_c}
Ensure: Optimal parse tree z*
    Solve p*, z* = arg max_{p, z} π(p, z)
    Initialize S = {Z}, LB = π(p*, z*)
    while S ≠ ∅ do
        Set S′ = ∅  {nodes that survive pruning}
        foreach S ∈ S
            Solve min_p max_{z ∈ S} ψ(p, z) to get LB_S, UB_S
            LB = max{LB, LB_S}, update z*
        foreach S ∈ S, add S to S′ if UB_S > LB
        Select a branching variable z_e
        Clear S = ∅
        foreach S ∈ S′
            Add S_1 = {z | z ∈ S, z_e = 1} to S
            Add S_2 = {z | z ∈ S, z_e = 0} to S
    end while

3.4 Lower Bound Initialization

A large lower bound is critical for efficient pruning. In this section, we discuss an alternative way to initialize the lower bound LB. We apply a similar trick to get a lower bound function of ϕ(z). Similar to Eq(8), for ϕ_c ≥ 0, we have

    ϕ_c z_c = max{ ϕ_c ( Σ_{e ∈ c} z_e − (r_c − 1) ), 0 } = max{ σ_c(z), 0 }

Using the fact that

    max_j {a_j} = max_{p ∈ ∆} Σ_j p_j a_j

we have

    ϕ_c z_c = max_{p_c ∈ ∆} ( p_c^1 σ_c(z) + p_c^2 · 0 ) = max_{p_c} h_c(p_c, z)

For ϕ_c < 0, we have

    ϕ_c z_c = max_{e ∈ c} { ϕ_c z_e } = max_{p_c ∈ ∆} Σ_{e ∈ c} p_c^e ϕ_c z_e = max_{p_c} g_c(p_c, z)

Putting the two cases together, we get the lower bound function

    π(p, z) = Σ_{c : ϕ_c ≥ 0} h_c(p_c, z) + Σ_{c : ϕ_c < 0} g_c(p_c, z)

For any p with p_c ∈ ∆ and any z ∈ Z

    π(p, z) ≤ ϕ(z)

π(p, z) is not concave; however, we can alternately optimize z and p to get a good approximation, which provides a lower bound for ϕ(z*).

3.5 Summary

We summarize our B&B algorithm in Algorithm 1. It is worth pointing out that, so far, the above description has assumed that the backbone DP uses a first order model; however, the backbone DP can also be the second or third order version. The difference is that, for a higher order DP, higher order factors such as adjacent siblings and grandchildren are handled directly as local factors.

In the worst case, all the edges are selected for branching, and the complexity grows exponentially in sentence length. However, in practice it is quite efficient, as we will show in the next section.

4 Experiments

4.1 Experimental Settings

The datasets we used are the English Penn Tree Bank (PTB) and Chinese Tree Bank 5.0 (CTB5). We use the standard train/develop/test split as described in Table 1. We extracted dependencies using Joakim Nivre's Penn2Malt tool with standard head rules: Yamada and Matsumoto's (Yamada and Matsumoto, 2003)
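Algorithm 1 can be sketched as the following breadth first loop. Here `solve` and `branch` are hypothetical stand-ins for the min-max saddle-point solver and the branching-variable selection described above; the toy objective at the bottom exists only to make the sketch runnable (its greedy completion is already optimal, so the loop prunes everything at the root).

```python
# Sketch of the Algorithm 1 driver over subspaces with some variables fixed.
# solve(node) must return (upper_bound, lower_bound, solution) for the subspace;
# branch(node, solution) must return a child pair, or None when nothing is free.
def branch_and_bound(root, solve, branch):
    best, LB = None, float("-inf")
    frontier = [root]                    # one level of the B&B tree at a time
    while frontier:
        scored = []
        for node in frontier:
            ub, lb, sol = solve(node)
            if lb > LB:                  # keep the global lower bound / best tree
                LB, best = lb, sol
            scored.append((node, ub, sol))
        frontier = []
        for node, ub, sol in scored:
            if ub <= LB:                 # bound beaten (or tight): prune the node
                continue
            kids = branch(node, sol)
            if kids:
                frontier.extend(kids)
    return best, LB

# toy usage: maximize w . z over binary z (nodes are dicts of fixed variables)
w = [2.0, -1.0, 3.0]

def solve(fixed):
    free = [i for i in range(len(w)) if i not in fixed]
    base = sum(w[i] * v for i, v in fixed.items())
    sol = {**fixed, **{i: (1 if w[i] > 0 else 0) for i in free}}
    val = base + sum(max(w[i], 0.0) for i in free)
    return val, val, sol                 # greedy completion is optimal: UB == LB

def branch(fixed, sol):
    free = [i for i in range(len(w)) if i not in fixed]
    if not free:
        return None
    i = free[0]
    return [{**fixed, i: 1}, {**fixed, i: 0}]

best, LB = branch_and_bound({}, solve, branch)
print(best, LB)   # -> {0: 1, 1: 0, 2: 1} 5.0
```

In the parser, `solve` runs the projected sub-gradient iterations (each calling the backbone DP), and `branch` fixes the maximum-gap edge z_e to 1 and 0.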

          Train                  Develop              Test
    PTB   sec. 2-21              sec. 22              sec. 23
    CTB5  sec. 001-815,          sec. 886-931,        sec. 816-885,
          1001-1136              1148-1151            1137-1147

Table 1: Data split in our experiment

for English, and Zhang and Clark's (Zhang and Clark, 2008) for Chinese. Unlabeled attachment score (UAS) is used to evaluate parsing quality.[1] The B&B parser is implemented in C++. All the experiments are conducted on an Intel Core i5-2500 CPU 3.30GHz.

[1] For English, we follow Koo and Collins (2010) and ignore any word whose gold-standard POS tag is one of { `` '' : , . }. For Chinese, we ignore any word whose POS tag is PU.

4.2 Baseline: DP Based Second Order Parser

We use the dynamic programming based second order parser (Carreras, 2007) as the baseline. The averaged structured perceptron (Collins, 2002) is used for parameter estimation. We determine the number of iterations on the validation set; it is 6 for both corpora.

For English, we train the POS tagger using a linear chain perceptron on the training set, and predict POS tags for the development and test data. The parser is trained using the automatic POS tags generated by 10 fold cross validation. For Chinese, we use the gold standard POS tags.

We use five types of features: unigram features, bigram features, in-between features, adjacent sibling features and outermost grand-child features. The first three types of features were first introduced by McDonald et al. (2005a), and the last two types are used by Carreras (2007). All the features are concatenations of surrounding words, lower cased words (English only), word length (Chinese only), prefixes and suffixes of words (Chinese only), POS tags, coarse POS tags derived from the POS tags using a simple mapping table, the distance between head and modifier, and the direction of the edge. For English, we used 674 feature templates to generate large amounts of features, and finally got 86.7M non-zero weighted features after training. The baseline parser got 92.81% UAS on the testing set. For Chinese, we used 858 feature templates, and finally got 71.5M non-zero weighted features after training. The baseline parser got 86.89% UAS on the testing set.

4.3 B&B Based Parser with Non-local Features

We use the baseline parser as the backbone of our B&B parser. We tried the following types of non-local features:

• All grand-child features. Notice that this feature can be handled by Koo's second order model (Koo and Collins, 2010) directly.

• All great grand-child features.

• All sibling features: all the pairs of edges with a common head. An example is shown in Figure 2.

• All tri-sibling features: all the 3-tuples of edges with a common head.

• Comb features: for any word with more than 3 consecutive modifiers, the set of all the edges from the word to the modifiers forms a comb.[2]

• Hand crafted features: we perform cross validation on the training data using the baseline parser, and design features that may correct the most common errors. We designed 13 hand-crafted features for English in total. One example is shown in Figure 3. For Chinese, we did not add any hand-crafted features, as the errors in the cross validation result vary a lot, and we did not find general patterns to fix them.

[2] In fact, our algorithm can deal with non-consecutive modifiers; however, in such cases, factor detection (detecting regular expressions like x1 .* x2 .* ...) requires the longest common subsequence algorithm (LCS), which is time-consuming if many comb features are generated. Similar problems arise for sub-tree features, which may contain many non-consecutive words.

4.4 Implementation Details

To speed up the solution of the min-max subproblem, for each node in the B&B tree, we initialize p with the optimal solution of its parent node: since the child node fixes only one additional edge, its optimal point is likely to be close to its parent's. For the root node of the B&B tree, we initialize p_c^e = 1/r_c for factors with non-negative weights and p_c^1 = 0 for
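The averaged structured perceptron of Section 4.2 follows the standard update of Collins (2002). The sketch below is generic: the `feat` and `decode` callables are hypothetical placeholders, not the parser's actual feature extraction or DP decoder, and the tiny usage example exists only to exercise the loop.

```python
# Sketch of averaged structured perceptron training (Collins, 2002).
# decode(w, x) returns the best structure under weights w; feat(x, y) returns
# a sparse feature map {index: value}. Both are assumed/hypothetical here.
def train_averaged_perceptron(data, feat, decode, dim, iters=6):
    w = [0.0] * dim
    acc = [0.0] * dim           # running sum of weight vectors for averaging
    t = 0
    for _ in range(iters):
        for x, gold in data:
            pred = decode(w, x)
            if pred != gold:    # additive update toward the gold structure
                for i, v in feat(x, gold).items():
                    w[i] += v
                for i, v in feat(x, pred).items():
                    w[i] -= v
            for i in range(dim):
                acc[i] += w[i]
            t += 1
    return [a / t for a in acc]  # averaged weights

# toy usage: two classes with indicator features; the gold class is 1
feat = lambda x, y: {y: 1.0}
decode = lambda w, x: 0 if w[0] >= w[1] else 1
w_avg = train_averaged_perceptron([(None, 1)], feat, decode, dim=2)
print(w_avg)  # -> [-1.0, 1.0]
```

Averaging over all intermediate weight vectors is what makes the final model stable; a real implementation would accumulate lazily rather than touching every dimension per example.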

43 System PTB CTB c 0 h c 1 c 2 c 3 Our baseline 92.81 86.89 B&B +all grand-child 92.97 87.02 +all great grand-child 92.78 86.77 +all sibling 93.00 87.05

c 0 h c 1 c 0 h c 2 c 0 h c 3 +all tri-sibling 92.79 86.81 +comb 92.86 86.91 secondorder higherorder +hand craft 92.89 N/A +all grand-child + all sibling + com- 93.17 87.25 h c 1 c 2 h c 2 c 3 h c 1 c 3 b + hand craft 3rd order re-impl. 93.03 87.07 Figure 2: An example of all sibling features. Top: TurboParser (reported) 92.62 N/A a sub-tree; Bottom: extracted sibling features. Ex- TurboParser (our run) 92.82 86.05 isting higher order DP systems can not handle the Koo and Collins (2010) 93.04 N/A siblings on both sides of head. Zhang and McDonald (2012) 93.06 86.87 Zhang and Nivre (2011) 92.90 86.00 System integration Bohnet and Kuhn (2012) 93.39 87.5 Systems using additional resources regulationoccursthroughinaction,ratherthanthrough... Suzuki et al. (2009) 93.79 N/A Koo et al. (2008) 93.5 N/A Figure 3: An example of hand-craft feature: for the Chen et al. (2012) 92.76 N/A word sequence A ... rather than A, where A is a preposition, the first A is the head of than, than is Table 2: Comparison between our system and the- the head of rather and the second A. state-of-art systems.

negative weighted factors. The step size α is initialized with 1 / max_{c : ϕ_c ≠ 0} |ϕ_c|, as the vector p is bounded in a unit box. α is updated using the same strategy as Rush et al. (2010). Two stopping criteria are used. One is 0 ≤ ψ_old − ψ_new ≤ ϵ, where ϵ > 0 is a given precision.[3] The other checks whether the bound is tight: UB = LB. Because all features are boolean (in fact, they can be integer), their weights are integers during each perceptron update, hence the scores of parse trees are discrete. The minimal gap between different scores is 1/(N × T) after averaging, where N is the number of training samples and T is the iteration number of perceptron training. Therefore the upper bound can be tightened as UB = ⌊UB · NT⌋ / (NT).

[3] We use ϵ = 10^−8 in our implementation.

During testing, we use the pre-pruning method of Martins et al. (2009) for both datasets to balance parsing quality and speed. This method uses a simple classifier to select the top k candidate heads for each word and excludes the other heads from the search space. In our experiment, we set k = 10.

4.5 Main Result

Experimental results are listed in Table 2. For comparison, we also include results of representative state-of-the-art systems. For the third order parser, we re-implemented Model 1 (Koo and Collins, 2010), and removed the longest sentence in the CTB dataset, which contains 240 words, due to the O(n^4) space complexity.[4] For ILP based parsing, we used TurboParser[5], a speed-optimized parser toolkit. We trained full models (which use all grandchild features, all sibling features and head bigram features (Martins et al., 2011)) for both datasets using its default settings. We also list the performance reported in its documentation on the English corpus.

[4] In fact, Koo's algorithm requires only O(n^3) space. Our implementation is O(n^4) because we store the feature vectors for fast training.
[5] http://www.ark.cs.cmu.edu/TurboParser/

One observation is that the all-sibling features are the most helpful for our parser, as some good sibling features can not be encoded in a DP based parser. For example, a matched pair of parentheses are always siblings, but their head may lie between them.

Another observation is that all great grandchild features and all tri-sibling features slightly hurt the performance, so we excluded them from the final system.

When no additional resource is available, our parser achieved competitive performance: 93.17% Unlabeled Attachment Score (UAS) for English at a speed of 177 words per second and 87.25% for Chinese at a speed of 97 words per second. Higher UAS is reported by joint tagging and parsing (Bohnet and Nivre, 2012) or system integration (Bohnet and Kuhn, 2012), which benefits from both transition based parsing and graph based parsing. Previous work shows that a combination of the two parsing techniques can learn to overcome the shortcomings of each non-integrated system (Nivre and McDonald, 2008; Zhang and Clark, 2008). System combination will be an interesting topic for our future research. The highest reported performance on the English corpus is 93.79%, obtained by semi-supervised learning with a large amount of unlabeled data (Suzuki et al., 2009).

                                   PTB             CTB
    System                         UAS     w/s     UAS     w/s
    Ours (no prune)                93.18   52      87.28   73
    Ours (k = 20)                  93.17   105     87.28   76
    Ours (k = 10)                  93.17   177     87.25   97
    Ours (k = 5)                   93.10   264     86.94   108
    Ours (k = 3)                   92.68   493     85.76   128
    TurboParser (full)             92.82   402     86.05   192
    TurboParser (standard)         92.68   638     85.80   283
    TurboParser (basic)            90.97   4670    82.28   2736
    Zhang and McDonald (2012)†     93.06   220     86.87   N/A

Table 3: Trade off between parsing accuracy (UAS) and speed (words per second) with different pre-pruning settings. k denotes the number of candidate heads of each word preserved for B&B parsing. †Their speed is not directly comparable as they perform labeled parsing without pruning.

4.6 Tradeoff Between Accuracy and Speed

In this section, we study the trade off between accuracy and speed using different pre-pruning setups. In Table 3, we show the parsing accuracy and inference time in the testing stage with different numbers k of candidate heads in the pruning step. We can see that, on the English dataset, when k ≥ 10, our parser gains a 2−3 times speedup without losing much parsing accuracy. There is a further increase of the speed with smaller k, at the cost of some accuracy. Compared with TurboParser, our parser is less efficient but more accurate. Zhang and McDonald (2012) is a state-of-the-art system which adopts cube pruning for efficient parsing. Notice that they did not use pruning, which seems to increase parsing speed with little hit in accuracy. Moreover, they did labeled parsing, which also makes their speed not directly comparable.

For each node of the B&B tree, our parsing algorithm uses the projected sub-gradient method to find the saddle point, which requires a number of calls to a DP; hence the efficiency of Algorithm 1 is mainly determined by the number of DP calls. Figure 4 and Figure 5 show the averaged parsing time and the number of calls to DP relative to the sentence length with different pruning settings. Parsing time grows smoothly when sentence length ≤ 40. There is some fluctuation for the long sentences. This is because there are very few sentences of a specific long length (usually 1 or 2 sentences), and the statistics are not stable or meaningful for such small samples.

Without pruning, there are in total 132,161 calls to parse the 2,416 English sentences, that is, 54.7 calls per sentence on average. For Chinese, there are 84,645 calls for 1,910 sentences, i.e., 44.3 calls per sentence on average.

Figure 4: Averaged parsing time (seconds) relative to sentence length with different pruning settings on (a) the PTB corpus and (b) the CTB corpus. k denotes the number of candidate heads of each word in the pruning step.

Figure 5: Averaged number of calls to DP relative to sentence length with different pruning settings on (a) the PTB corpus and (b) the CTB corpus. k denotes the number of candidate heads of each word in the pruning step.

5 Discussion

5.1 Polynomial Non-local Factors

Our bounding strategy can handle a family of non-local factors that can be expressed as a polynomial function of local factors. To see this, suppose

    z_c = Σ_i α_i Π_{e ∈ E_i} z_e

For each i, we introduce a new variable z_{E_i} = min_{e ∈ E_i} z_e. Because z_e is binary, z_{E_i} = Π_{e ∈ E_i} z_e. In this way, we replace z_c by several z_{E_i} that can be handled by our bounding strategy.

We give two examples of such polynomial non-local factors. The first is the OR of local factors: z_c = max{z_e, z_e′}, which can be expressed as z_c = z_e + z_e′ − z_e z_e′. The second is the factor of the valency feature (Martins et al., 2009). Let the binary variable v_ik indicate whether word i has k modifiers. Given {z_e} for the edges with head i, the values {v_ik | k = 1, ..., n − 1} can be solved from

    Σ_k k^j v_ik = ( Σ_e z_e )^j    0 ≤ j ≤ n − 1

The left side of each equation is a linear function of the v_ik; the right side is a polynomial function of the z_e. Hence v_ik can be expressed as a polynomial function of the z_e.

5.2 k Best Parsing

Though our B&B algorithm is able to capture a variety of non-local features, it is still difficult to handle many kinds of features, such as the depth of the parse tree. Hence, a reranking approach may be useful for incorporating such information, where k parse trees are generated first and a second pass model then reranks these candidates based on more global or non-local features. In addition, k best parsing may be needed in many applications to use parse information, and especially to utilize information from multiple candidates to optimize task-specific performance. We have not conducted any experiments for k best parsing, hence we only discuss the algorithm. According to Proposition 1, we have

Proposition 2. Given p and a subset S ⊆ Z, let z^k denote the k-th best solution of max_{z ∈ S} ψ(p, z). If a parse tree z′ ∈ S satisfies ϕ(z′) ≥ ψ(p, z^k), then z′ is one of the k best parse trees in subset S.

Proof. Since z^k is the k-th best solution of ψ(p, z), for any z^j with j > k, we have ψ(p, z^k) ≥ ψ(p, z^j) ≥ ϕ(z^j). Since the size of the set {z^j | j > k} is |S| − k, there are at least |S| − k parse trees whose scores ϕ(z^j) are no more than ψ(p, z^k). Because ϕ(z′) ≥ ψ(p, z^k), z′ is at least the k-th best parse tree in subset S.

Therefore, we can search for the k best parse trees in this way: for each sub-problem, we use DP to derive the k best parse trees. For each parse tree z, if ϕ(z) ≥ ψ(p, z^k), then z is selected into the k best set. The algorithm terminates when the k-th bound is tight.
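The OR rewriting used above relies on an identity that holds only because the edge variables are binary; it can be checked exhaustively:

```python
# Sketch: check the OR factor identity z_c = max(z_e, z_e') = z_e + z_e' - z_e * z_e'
# over binary edge variables, the polynomial rewriting used in Section 5.1.
from itertools import product

for ze, zf in product((0, 1), repeat=2):
    assert max(ze, zf) == ze + zf - ze * zf
print("OR identity holds on all binary assignments")
```

The same pattern justifies z_{E_i} = Π_{e ∈ E_i} z_e: for 0/1 variables, the product of a set equals its minimum.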

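The valency encoding discussed above, ∑_k k^j v_ik = (∑_e z_e)^j for 0 ≤ j ≤ n − 1, can be sanity-checked numerically. The sketch below only illustrates the identity, not the ILP itself; for simplicity it lets k range over 0 .. n − 1, and the function names are made up.

```python
def valency_indicators(z, n):
    """One-hot valency: v[k] = 1 iff word i has exactly k modifiers."""
    m = sum(z)  # number of modifier edges headed by word i
    return [1 if k == m else 0 for k in range(n)]

def moment_constraints_hold(z, n):
    """Check sum_k k**j * v[k] == (sum_e z_e)**j for all j = 0..n-1."""
    v = valency_indicators(z, n)
    m = sum(z)
    return all(sum(k ** j * v[k] for k in range(n)) == m ** j
               for j in range(n))
```

For z = [1, 0, 1, 1] and n = 5 the word has three modifiers, v = [0, 0, 0, 1, 0], and all five moment constraints hold. Since the n constraints form an invertible (Vandermonde) system in the v[k], they determine the indicators uniquely, which is why each v_ik can be rewritten as a polynomial in the z_e.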
6 Conclusion

In this paper we proposed a new parsing algorithm based on a Branch and Bound framework. The motivation is to use dynamic programming to search for the bound. Experimental results on the PTB and CTB5 datasets show that our method is competitive in terms of both performance and efficiency. Our method can be adapted to non-projective dependency parsing, as well as to the k best MST algorithm (Hall, 2007) to find the k best candidates.

Acknowledgments

We'd like to thank Hao Zhang, Andre Martins and Zhenghua Li for their helpful discussions. We also thank Ryan McDonald and three anonymous reviewers for their valuable comments. This work is partly supported by DARPA under Contract Nos. HR0011-12-C-0016 and FA8750-13-2-0041. Any opinions expressed in this material are those of the authors and do not necessarily reflect the views of DARPA.

References

Michael Auli and Adam Lopez. 2011. A comparison of loopy belief propagation and dual decomposition for integrated CCG supertagging and parsing. In Proc. of ACL-HLT.

Egon Balas. 1965. An additive algorithm for solving linear programs with zero-one variables. Operations Research, 39(4).

Bernd Bohnet and Jonas Kuhn. 2012. The best of both worlds – a graph-based completion model for transition-based parsers. In Proc. of EACL.

Bernd Bohnet and Joakim Nivre. 2012. A transition-based system for joint part-of-speech tagging and labeled non-projective dependency parsing. In Proc. of EMNLP-CoNLL.

Paul Calamai and Jorge Moré. 1987. Projected gradient methods for linearly constrained problems. Mathematical Programming, 39(1).

Xavier Carreras. 2007. Experiments with a higher-order projective dependency parser. In Proc. of EMNLP-CoNLL.

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and maxent discriminative reranking. In Proc. of ACL.

Wenliang Chen, Min Zhang, and Haizhou Li. 2012. Utilizing dependency language models for graph-based dependency parsing models. In Proc. of ACL.

Michael Collins. 2000. Discriminative reranking for natural language parsing. In Proc. of ICML.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. of EMNLP.

John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. 2008. Efficient projections onto the l1-ball for learning in high dimensions. In Proc. of ICML.

Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: an exploration. In Proc. of COLING.

Keith Hall. 2007. K-best spanning tree parsing. In Proc. of ACL.

Liang Huang and David Chiang. 2005. Better k-best parsing. In Proc. of IWPT.

Liang Huang. 2008. Forest reranking: Discriminative parsing with non-local features. In Proc. of ACL-HLT.

Manfred Klenner and Étienne Ailloud. 2009. Optimization in coreference resolution is not needed: A nearly-optimal algorithm with intensional constraints. In Proc. of EACL.

Terry Koo and Michael Collins. 2010. Efficient third-order dependency parsers. In Proc. of ACL.

Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. In Proc. of ACL-HLT.

Terry Koo, Alexander M. Rush, Michael Collins, Tommi Jaakkola, and David Sontag. 2010. Dual decomposition for parsing with non-projective head automata. In Proc. of EMNLP.

Ailsa H. Land and Alison G. Doig. 1960. An automatic method of solving discrete programming problems. Econometrica, 28(3):497–520.

Andre Martins, Noah Smith, and Eric Xing. 2009. Concise integer linear programming formulations for dependency parsing. In Proc. of ACL.

Andre Martins, Noah Smith, Mario Figueiredo, and Pedro Aguiar. 2011. Dual decomposition with many overlapping components. In Proc. of EMNLP.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005a. Online large-margin training of dependency parsers. In Proc. of ACL.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajic. 2005b. Non-projective dependency parsing using spanning tree algorithms. In Proc. of HLT-EMNLP.

Joakim Nivre and Ryan McDonald. 2008. Integrating graph-based and transition-based dependency parsers. In Proc. of ACL-HLT.

Jorge Nocedal and Stephen J. Wright. 2006. Numerical Optimization. Springer, 2nd edition.

Sebastian Riedel and James Clarke. 2006. Incremental integer linear programming for non-projective dependency parsing. In Proc. of EMNLP.

Alexander M. Rush, David Sontag, Michael Collins, and Tommi Jaakkola. 2010. On dual decomposition and linear programming relaxations for natural language processing. In Proc. of EMNLP.

David Smith and Jason Eisner. 2008. Dependency parsing by belief propagation. In Proc. of EMNLP.

Min Sun, Murali Telaprolu, Honglak Lee, and Silvio Savarese. 2012. Efficient and exact MAP-MRF inference using branch and bound. In Proc. of AISTATS.

Jun Suzuki, Hideki Isozaki, Xavier Carreras, and Michael Collins. 2009. An empirical study of semi-supervised structured conditional models for dependency parsing. In Proc. of EMNLP.

Hiroyasu Yamada and Yuji Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proc. of IWPT.

Yue Zhang and Stephen Clark. 2008. A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing. In Proc. of EMNLP.

Hao Zhang and Ryan McDonald. 2012. Generalized higher-order dependency parsing with cube pruning. In Proc. of EMNLP.

Yue Zhang and Joakim Nivre. 2011. Transition-based dependency parsing with rich non-local features. In Proc. of ACL-HLT.
