2D Trie for Fast Parsing
Xian Qian, Qi Zhang, Xuanjing Huang, Lide Wu
Institute of Media Computing
School of Computer Science, Fudan University
{xianqian, qz, xjhuang, ldwu}@fudan.edu.cn

Abstract

In practical applications, decoding speed is very important. Modern structured learning techniques adopt template-based methods to extract millions of features. Complicated templates bring about abundant features, which lead to higher accuracy but more feature extraction time. We propose the Two Dimensional Trie (2D Trie), a novel, efficient feature indexing structure which takes advantage of the relationship between templates: the feature strings generated by a template are prefixes of the features from its extended templates. We apply our technique to Maximum Spanning Tree dependency parsing. Experimental results on the Chinese Tree Bank corpus show that our 2D Trie is about 5 times faster than the traditional Trie structure, making parsing speed 4.3 times faster.

[Figure 1: Flow chart of dependency parsing. Feature generation (template: p0.word + p0.pos; feature: lucky/ADJ) is followed by feature retrieval (index: 3228~3233), then lattice building, inference, etc. produce the parse tree. p0.word, p0.pos denote the word and POS tag of the parent node respectively. Indexes correspond to the features conjoined with dependency types, e.g., lucky/ADJ/OBJ, lucky/ADJ/NMOD, etc.]

1 Introduction

In practical applications, decoding speed is very important. Modern structured learning techniques adopt template-based methods to generate millions of features, as in shallow parsing (Sha and Pereira, 2003), named entity recognition (Kazama and Torisawa), dependency parsing (McDonald et al., 2005), etc.

The problem arises when the number of templates increases: more features are generated, making the extraction step time consuming. This holds especially for maximum spanning tree (MST) dependency parsing, since feature extraction requires quadratic time even with a first order model. According to Bohnet's report (Bohnet, 2009), fast feature extraction, besides a fast parsing algorithm, is important for parsing and training speed. He takes 3 measures for a 40X speedup, despite using the same inference algorithm. One important measure is to store the feature vectors in a file to skip feature extraction, which would otherwise be the bottleneck.

We now quickly review the feature extraction stage of structured learning. Typically, it consists of 2 steps. First, features represented by strings are generated using templates. Then a feature indexing structure searches feature indexes to get the corresponding feature weights. Figure 1 shows the flow chart of MST parsing, where p0.word, p0.pos denote the word and POS tag of the parent node respectively.
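To make the two steps concrete, the following is a minimal sketch of template-based extraction over a plain hash index, the kind of baseline profiled below. The Node class, the encoding of templates as tuples of unit functions, and all index values are our illustrative assumptions, not MSTParser's actual code.

```python
# Minimal sketch of the two-step feature extraction described above.
# Names and toy index values are illustrative, not taken from MSTParser.

class Node:
    def __init__(self, word, pos):
        self.word, self.pos = word, pos

# A template is a set of template units (section 2.1 formalizes this);
# here each unit is a function from a (parent, child) pair to a string.
T1 = (lambda p, c: p.word,)                      # p0.word
T2 = (lambda p, c: p.word, lambda p, c: p.pos)   # p0.word + p0.pos

def extract(templates, feature_index, parent, child):
    indexes = []
    for units in templates:
        s = "/".join(u(parent, child) for u in units)  # step 1: generation
        idx = feature_index.get(s)                     # step 2: retrieval
        if idx is not None:
            indexes.append(idx)
    return indexes

index = {"lucky": 17, "lucky/ADJ": 3228}  # toy feature index
print(extract([T1, T2], index, Node("lucky", "ADJ"), Node("day", "NN")))
# -> [17, 3228]
```

For a first order MST model this loop runs for every candidate (parent, child) pair, which is where the quadratic extraction cost comes from.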
We conduct a simple experiment to investigate the decoding time of MSTParser, a state-of-the-art Java implementation of dependency parsing (http://sourceforge.net/projects/mstparser). The Chinese Tree Bank 6 (CTB6) corpus (Palmer and Xue, 2009) with the standard train/development/test split is used for evaluation. Experimental results are shown in Table 1.

Step   Feature Generation   Feature Retrieval   Other   Total
Time   300.27               61.66               59.48   421.41

Table 1: Time spent on each step (seconds) of MSTParser on the CTB6 standard test data (2660 sentences). Details of the hardware and corpus are described in section 5.

The observation is that the time spent on inference is trivial compared with feature extraction. Thus, speeding up feature extraction is critical, especially when a large template set is used for high accuracy.

General indexing structures such as Hash and Trie do not consider the relationships between templates, therefore they cannot speed up feature generation, and they are not completely efficient for searching feature indexes. For example, the feature string s1 generated by template "p0.word" is a prefix of the feature s2 from template "p0.word + c0.word" (the word pair of parent and child), hence the index of s1 could be used for searching s2. Furthermore, if s1 is not in the feature set, then s2 must be absent, and its generation can be skipped.

We propose the Two Dimensional Trie (2D Trie), a novel, efficient feature indexing structure which takes advantage of the relationship between templates. We apply our technique to Maximum Spanning Tree dependency parsing. Experimental results on the CTB6 corpus show that our 2D Trie is about 5 times faster than the traditional Trie structure, making parsing speed 4.3 times faster.

The paper is structured as follows: in section 2, we describe the template tree, which represents the relationship between templates; in section 3, we describe our new 2D Trie structure; in section 4, we analyze the complexity of the proposed method and of general string indexing structures for parsing; experimental results are shown in section 5; we conclude the work in section 6.

2 Template tree

2.1 Formulation of template

A template is a set of manually designed template units: T = {t1, ..., tm}. For convenience, we use another formulation: T = t1 + ... + tm. All template units appearing in this paper are described in Table 2; most of them are widely used. For example, "T = p0.word + c0.word |l" denotes the word pair of parent and child nodes, conjoined with their distance.

Unit       Meaning
p-i / pi   the i-th node to the left/right of the parent node
c-i / ci   the i-th node to the left/right of the child node
r-i / ri   the i-th node to the left/right of the root node
n.word     word of node n
n.pos      POS tag of node n
n.length   word length of node n
|l         conjoin the current feature with the linear distance between child node and parent node
|d         conjoin the current feature with the direction of the dependency (left/right)

Table 2: Template units appearing in this paper.

2.2 Template tree

In the rest of the paper, for simplicity, let si be a feature string generated by template Ti. We define the relationship between templates: T1 is an ancestor of T2 if and only if T1 ⊂ T2, and T2 is called a descendant of T1. Recall that feature string s1 is then a prefix of feature s2. Suppose T3 ⊂ T1 ⊂ T2; obviously, the most efficient way to look up the indexes of s1, s2, s3 is to search s3 first, then use its index id3 to search s1, and finally use id1 to search s2. Hence the relationship between T2 and T3 can be neglected.

Therefore we define the direct ancestor of T1: T2 is a direct ancestor of T1 if T2 ⊂ T1 and there is no template T' such that T2 ⊂ T' ⊂ T1. Correspondingly, T1 is called a direct descendant of T2.

A template graph G = (V, E) is a directed graph that represents the relationship between templates, where V = {T1, ..., Tn} is the template set and E = {e1, ..., eN} is the edge set. An edge from Ti to Tj exists if and only if Ti is a direct ancestor of Tj. For templates having no ancestor, we add an empty template as their common direct ancestor, which is also the root of the graph.
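As a concrete illustration of these definitions, here is a small sketch that computes direct ancestors and builds the edge set of a template graph. Modeling templates as frozensets of unit names is our own encoding, not the paper's.

```python
# Sketch of template graph construction; templates are frozensets of unit
# names (an illustrative encoding). "<" on frozensets tests proper subset.

def direct_ancestors(t, templates):
    # T2 is a direct ancestor of T1 if T2 is a proper subset of T1 and no
    # other template lies strictly between them.
    proper = [a for a in templates if a < t]
    return [a for a in proper if not any(a < b < t for b in proper)]

T1 = frozenset({"p0.word"})
T2 = frozenset({"p0.pos"})
T3 = frozenset({"p0.word", "p0.pos"})
templates = [T1, T2, T3]

root = frozenset()  # empty template added as common direct ancestor
edges = [(a, t) for t in templates
         for a in (direct_ancestors(t, templates) or [root])]
# edges: root->T1, root->T2, T1->T3, T2->T3, i.e. T3 has two direct
# ancestors, matching the left graph of Figure 2 discussed below.
```

The template tree introduced below is then obtained as the breadth-first tree of this graph.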
The left part of Figure 2 shows a template graph for templates T1 = p0.word, T2 = p0.pos, T3 = p0.word + p0.pos. In this example, T3 has 2 direct ancestors, but in fact s3 has only one prefix, which depends on the order of template units in the generation step. If s3 = s1 + s2, then its prefix is s1; otherwise its prefix is s2. In this paper, we simply use the breadth-first tree of the graph for disambiguation, which is called the template tree. The only direct ancestor T1 of T2 in the tree is called the father of T2, and T2 is a child of T1. The right part of Figure 2 shows the corresponding template tree, where each vertex saves the subset of template units that do not belong to its father.

[Figure 2: Left graph shows the template graph for T1 = p0.word, T2 = p0.pos, T3 = p0.word + p0.pos. Right graph shows the corresponding template tree, where each vertex saves the subset of template units that do not belong to its father.]

2.3 Virtual vertex

Consider the template tree in the left part of Figure 3: the red vertex and the blue vertex are partially overlapped, and their intersection is p0.word. If the string s from template T = p0.word is absent from the feature set, then both nodes can be neglected. For efficient pruning of candidate templates, each vertex in the template tree is restricted to have exactly one template unit (except the root). Another important reason for such a restriction will be given in the next section.

[Figure 3: Templates that are partially overlapped: Tred ∩ Tblue = p0.word; virtual vertexes shown in dashed circles are created to extract the common unit.]

3 2D Trie

3.1 Single template case

A Trie stores strings over a fixed alphabet; in our case, feature strings are stored over several alphabets, such as the word list, POS tag list, etc., which are extracted from the training corpus.

To illustrate 2D Trie clearly, we first consider a simple case, where only one template is used.

[Figure 4: 2D Trie for a single template; the alphabets at level 1 and level 2 are the word set and POS tag set respectively.]
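To ground the single-template case pictured in Figure 4, here is a minimal sketch of a trie whose levels range over different alphabets (words at level 1, POS tags at level 2), with each edge consuming a whole symbol rather than a single character. The node layout is our assumption, not the paper's exact data structure.

```python
# Minimal sketch of a single-template 2D Trie lookup: one trie level per
# template unit, each level over its own alphabet (words, then POS tags).

class TrieNode:
    __slots__ = ("children", "index")
    def __init__(self):
        self.children = {}   # symbol from this level's alphabet -> child node
        self.index = None    # feature index if a feature string ends here

def insert(root, symbols, idx):
    node = root
    for sym in symbols:      # level k consumes the k-th unit's symbol
        node = node.children.setdefault(sym, TrieNode())
    node.index = idx

def lookup(root, symbols):
    node = root
    for sym in symbols:
        node = node.children.get(sym)
        if node is None:     # prefix absent: extended templates can be skipped
            return None
    return node.index

# Template p0.word + p0.pos: level 1 symbols are words, level 2 are POS tags.
root = TrieNode()
insert(root, ("lucky", "ADJ"), 3228)   # toy index, echoing Figure 1
print(lookup(root, ("lucky", "ADJ")))  # 3228
print(lookup(root, ("happy", "ADJ")))  # None
```

Because a node here is reached after consuming exactly the units of some template prefix, its stored index can seed lookups for descendant templates, which is the reuse that the template tree makes systematic.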