Formal Grammars for Linguistic Treebank Queries

Mark Dras, Centre for Language Technology, Macquarie University
Steve Cassidy, Centre for Language Technology, Macquarie University

Abstract

There has been recent interest in looking at what is required for a tree query language for linguistic corpora. One approach is to start from existing formal machinery, such as tree grammars and automata, to see what kind of machine is an appropriate underlying one for the query language. The goal of the paper is then to examine what is an appropriate machine for a linguistic tree query language, with a view to future work defining a query language based on it. In this paper we review work relating XPath to regular tree grammars, and as the paper's first contribution show how regular tree grammars can also be a basis for extensions proposed for XPath for common linguistic corpus querying. As the paper's second contribution we demonstrate that, on the other hand, regular tree grammars cannot describe a number of structures of interest; we then show that, instead, a slightly more powerful machine is appropriate, and indicate how linguistic tree query languages might be augmented to include this extra power.

1 Introduction

There has been recent interest in looking at what is required for a query language for annotated linguistic corpora (Lai and Bird, 2004). These corpora are used in a range of areas of natural language processing (NLP), such as parsing and machine translation, where they form the basis of training data for statistical methods; and also in linguistics, where they are used to extract examples of particular phenomena for analysis and testing of hypotheses. As noted by Lai and Bird, the prototypical hierarchical linguistic annotation is the syntax tree, and consequently the type of query language that is of interest is a tree query language.

One approach to deciding what is required in a tree query language, taken by Lai and Bird (2004), is to examine and compare a range of existing ones: for example, TGrep2 (Rohde, 2001), TIGERSearch (König and Lezius, 2001), or Emu (Cassidy and Harrington, 2001). One of their goals is to understand better the formal properties required of query languages.

It has been noted in a number of places that treebank querying is in essence a specification of a tree pattern, which matches against the desired trees in the treebank corpus (Abiteboul, 1997). These tree patterns can be described by existing formal machinery such as tree grammars and automata. An approach complementary to the one mentioned above is thus to examine the extent to which this formal machinery is adequate for describing tree patterns relevant for the sorts of queries of interest in NLP or linguistics, and then to link this to a query language. A reason for being interested in this link between tree query languages and formal machinery is that standard results are available for the latter. For example, an algorithm for recognition exists that is linear in the number of nodes in the tree and the size of the automaton; it is decidable whether the set of matches will be empty; and so on (Comon et al., 1997). Further, there is the promise of the availability of standard tools and efficient techniques that could be used by a tree query language. This is the case for formal machines over strings (for example, the library of finite-state string transducers of Mohri (1997)), although not yet for trees.

A number of researchers have looked at this complementary approach. One alternative is to design from scratch a tree query language derived from a tree grammar or automaton; this is taken by, for example, Chidlovskii (2000). Another is to relate an existing query language to a formal machine: Murata et al. (2000) present a taxonomy of XML schema languages using formal language theory.

In this paper, we follow the second of these alternatives. Existing work, mostly focussed on XML, has looked only at regular tree grammars and automata for modelling query languages; we examine the extent to which this is the case for linguistic treebanks, and what other machinery might be appropriate for a linguistic tree query language. We argue that while regular tree grammars might be satisfactory for querying a broad range of phenomena, not all queries over trees representing natural language can be based on a regular tree grammar. This is a structural analogue of the work of Shieber (1985), which showed that natural language as a string language cannot be generated by a context-free grammar, which corresponds at the tree level to a regular tree grammar.

In Section 2 we give the definition of regular tree grammars, along with an approach used to relate them to XPath. In Section 3 we look at extensions to XPath that Cassidy (2003) argues are necessary for linguistic corpus querying, and show that these can be captured by regular tree grammars. In Section 4, however, we present some examples from Dutch from the work of Bresnan et al. (1982) to show that not all desired queries can be represented by regular tree grammars, and examine the question of what strong generative capacity is necessary in a tree grammar for representing natural language. Section 5 gives the definition of a more powerful grammar, the context-free tree grammar, and shows how this can describe queries related to the Dutch instance, along with what properties a query language based on these might have. Finally, Section 6 concludes.

2 Regular Tree Grammars

Regular tree grammars (RTGs) in essence generate those sets of trees whose paths are defined by regular grammars. Comon et al. (1997) provide an introduction to necessary concepts in tree grammars, along with well-known results such as that the string languages yielded by RTG trees are the context-free languages. In their treatment they divide tree representations into two types: those for ranked trees (that is, where each symbol has a fixed number of children, with this number constituting the rank of the symbol), and those for unranked trees. XML schema languages typically use unranked trees, so we adopt, slightly modified, the definitions of these from Brüggemann-Klein et al. (2001) and Murata et al. (2000).

A regular tree grammar is a 4-tuple $G = (\Sigma, N, P, S)$ such that

• $\Sigma$ is a finite set of terminal symbols;

• $N$ is a finite set of nonterminal symbols;

• $P$ is a finite set of productions of the form $X \to a(R)$, where $X \in N$, $a \in \Sigma$, and $R$ is a regular expression over $N \cup \Sigma$; and

• $S$ is the start symbol, $S \in N$.

Derivation $\Rightarrow$ (with transitive closure $\Rightarrow^*$) is defined in the usual way, with nonterminal symbols rewritten by means of production rules, starting from the start symbol $S$. $\epsilon$ is the conventional null terminal symbol. The set of trees generated by $G$ is $L(G)$.

Example 2.1
Let $G_1 = (\Sigma, N, P, S)$ such that $\Sigma = \{a, b\}$, $N = \{S, X\}$, and $P = \{S \to a(aSa),\ S \to a(bXb),\ X \to b(bXb),\ X \to b\}$. A sample derivation is given in Figure 1. $L(G_1)$ is thus the set of ternary-branching trees over the symbols a and b, where all nodes to a certain depth D are a nodes, and all below are b nodes.

[Figure 1: Derivation for RTG $G_1$]

[Figure 2: Tree matching a//b[*/i]/g, and corresponding tree pattern]

To relate this to a query language, we now review the approach presented in Wood (2003) to relate these regular tree grammars to XPath (XPath, 1999). XPath is a language for selecting nodes from XML document trees, and is thus an important part of XSLT and XQuery. Expressions in XPath in themselves can be seen as simple queries over trees.

An XPath expression is a mapping from a node (the context node) to the set of all nodes reachable by the specified path. A path expression is written as a series of steps where each step defines the axis used to reach new nodes and a node test used to restrict the set of nodes reached along the axis. Axes include child, descendant, following, attribute and self. Node tests consist of two parts: a restriction on the element name and an optional predicate expression. Other features, such as built-in functions, are also allowed. Notationally, a null axis stands for the child axis, // the descendant axis, the wild card * [...]

Researchers have been interested in the properties of different fragments of XPath, denoted $XP^{\{[\,]\}}$, $XP^{\{[\,],*\}}$ and $XP^{\{[\,],*,//\}}$, depending on which XPath constructs are included in the fragment. The work of Wood himself is on the fragment $XP^{\{[\,],*,//\}}$ and the question of whether the containment problem (whether one query is subsumed by another) under a Document Type Definition can be decided in the complexity class PTIME. To demonstrate that this is the case he conceives of XPath queries as tree patterns (Abiteboul, 1997; Deutsch et al., 1999), which can be described by regular tree grammars. For example, the XPath query of Example 2.2 could be pictured as the tree pattern on the right in Figure 2, where the dotted line indicates non-immediate dominance.
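To make the tree-pattern view of a query concrete, the following is a minimal sketch in Python, not the construction used by Wood (2003). It encodes the query a//b[*/i]/g from the caption of Figure 2 as a pattern whose edges are labelled either "child" (immediate dominance) or "descendant" (non-immediate dominance), and tests whether a tree embeds that pattern at its root. The names Tree, Pattern, embeds and QUERY are hypothetical, introduced only for this illustration, and the two sample trees are invented.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Tree:
    """An unranked, labelled tree (hypothetical helper type)."""
    label: str
    children: List["Tree"] = field(default_factory=list)


@dataclass
class Pattern:
    """A tree pattern node; label "*" is the wildcard.

    Each edge is (kind, subpattern), where kind is "child" for immediate
    dominance or "descendant" for non-immediate dominance (the // axis).
    """
    label: str
    edges: List[Tuple[str, "Pattern"]] = field(default_factory=list)


def proper_descendants(node: Tree) -> Iterator[Tree]:
    """Yield every node strictly below `node`, in document order."""
    for child in node.children:
        yield child
        yield from proper_descendants(child)


def embeds(pattern: Pattern, node: Tree) -> bool:
    """Return True if `pattern` can be embedded with its root at `node`."""
    if pattern.label != "*" and pattern.label != node.label:
        return False
    for kind, sub in pattern.edges:
        candidates = node.children if kind == "child" else proper_descendants(node)
        if not any(embeds(sub, c) for c in candidates):
            return False
    return True


# Assumed encoding of a//b[*/i]/g: an a node dominating (at any depth) a b node
# that has some child with an i child, and that also has a g child.
QUERY = Pattern("a", [("descendant", Pattern("b", [
    ("child", Pattern("*", [("child", Pattern("i"))])),
    ("child", Pattern("g")),
]))])

# Invented sample trees: the first contains the pattern, the second does not,
# because i is a child rather than a grandchild of b.
matching = Tree("a", [Tree("c", [Tree("b", [Tree("d", [Tree("i")]), Tree("g")])])])
non_matching = Tree("a", [Tree("b", [Tree("i"), Tree("g")])])

print(embeds(QUERY, matching), embeds(QUERY, non_matching))
```

The sketch only decides whether a match exists; it does not return the selected g nodes, and it makes no attempt at the linear-time recognition that tree automata allow.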
