Query Automata for Nested Words

P. Madhusudan and Mahesh Viswanathan
University of Illinois at Urbana-Champaign, Urbana, IL, USA
{madhu, vmahesh}@cs.uiuc.edu

Abstract

We study visibly pushdown automata (VPA) models for expressing and evaluating queries, expressed using MSO formulas, on words with a nesting structure (like XML documents). We define a query VPA model, which is a 2-way deterministic VPA that can mark positions in a document, and show that it is expressively equivalent to unary monadic queries. This surprising result parallels a classic result for queries on regular word languages. We also compare our model to query models on unranked trees.

We then consider the algorithmic problem of evaluating, in one pass, the set of all positions satisfying a query in a streaming nested word. We present an algorithm that answers any fixed unary monadic query on a streaming document using, at any point, at most space O(d + I·log n), where d is the depth of the document at that point and I is the number of potential answers to the query in the word processed thus far. This algorithm uses space close to the minimal space any streaming algorithm would need, and generalizes to answering n-ary queries.

1 Introduction

The general purpose markup language XML (eXtensible Markup Language) represents hierarchically structured data, and has become the standard for data exchange, particularly on the web [27, 1]. An XML document is a text-based linear encoding of tree-structured data. Markups in the text in terms of open- and close-tags are used to create a bracketing structure on the document that captures the hierarchy of information in the tree. For example, an XML text of the form:

<movie> <title> My neighbor Totoro </title> <director> Hayao Miyazaki </director> </movie>

encodes a movie title and its director; the director tag is a child node of the movie-tag (since it lies within the <movie>...</movie> fragment), indicating that Miyazaki is the director of this particular movie. The XML document is hence better viewed as a linear structure with a nesting relation:

[Figure: the movie document drawn as a linear sequence of tags and text, with nesting edges joining each open-tag to its matching close-tag.]

The study of XML has naturally concentrated on its representation; for example, document types are represented using the parse-trees generated by SDTDs, XML query languages like XPath have modalities like "parent" and "child" that refer to the tree edges, and tree automata (over unranked trees) have been used to model and solve decision problems for XML [26, 21, 13, 14, 19].

Nested word automata are a recent invention: automata that work directly on a linearly structured document with nesting relations as described above [2, 3]. Nested word automata were discovered in the context of formal verification for software, where program executions are viewed as nested words, with the nesting relation relating calls to procedures with their corresponding returns.

The nested word automaton model works left to right on an XML document, much like a finite automaton on words, but the automaton at a close-tag is allowed to change state by looking at the previous state as well as the state at the open-tag associated with the close-tag. One way to implement a nested word automaton on a word is by allowing the automaton access to a stack: the automaton is allowed to store its state at an open-tag and recover the state at the corresponding close-tag. This leads to the definition of visibly pushdown automata for XML: a pushdown automaton that pushes onto the stack at open-tags and pops from the stack on close-tags. The class of languages accepted by nested word automata is exactly the same as that accepted by visibly pushdown automata, and we call this the class of visibly pushdown languages.

The visibly pushdown automaton model is an appealing formalism for studying and building algorithms for XML. In earlier work [18], we have shown that document type definitions are captured by special classes of visibly pushdown automata, and have used this characterization to build streaming algorithms for type-checking and pre-order (and

post-order) typing XML documents. Visibly pushdown automata are particularly effective in building streaming algorithms, as the model naturally captures reading a nested word left-to-right in a streaming fashion, unlike standard tree-automata models that work on the tree represented by an XML document.

In this paper, we study the power of visibly pushdown automata in expressing and answering queries on XML documents. Our first contribution is an automaton model for defining queries, called query visibly pushdown automata (query VPA). A query VPA is a two-way visibly pushdown automaton that can mark positions of a document. When moving right, a query VPA pushes onto the stack when reading open-tags and pops from the stack when reading close-tags, like a visibly pushdown automaton. However, when moving left it does the opposite: it pushes onto the stack when reading close-tags and pops from the stack when reading open-tags. Hence it preserves the invariant that the height of the stack when at a particular position in the document is precisely the number of unmatched calls before that position (which is the number of unmatched returns after that position).

We show that query VPAs have the right expressive power for defining queries by showing that a query is expressible as a unary formula in monadic second-order logic (MSO) if and only if it is implemented by a query VPA. Unary MSO queries form perhaps the largest class of logical queries one can hope to answer by automatic means, and obviously include practical query languages such as XPath. We find it remarkable that the simple definition of two-way VPAs exactly captures all monadic queries. This result parallels a classic result in the theory of automata on finite words, which equates the power of unary monadic queries on (non-nested) words to that of two-way automata on words [10, 22].

While query VPAs are a simple, clean model for expressing queries, they do not help answering queries efficiently. They in fact represent one extreme end of the space-time-streaming spectrum: they do not process streaming documents and they are not time-efficient as they reread the document several times, but they are incredibly space-efficient, utilizing only space O(d), where d is the depth of the document (with no dependence at all on the length of the document!).

We then turn to streaming algorithms and present our second main result: a streaming algorithm that answers any unary monadic query and has efficient time and space data complexity. As far as we know, this is the first time it has been shown that any unary MSO query can be efficiently evaluated on streaming XML.

The algorithm, for any fixed unary monadic query, reads the document in a streaming fashion, and outputs the set of all positions that satisfy the query. Note that this is different from the filtering problem that outputs just whether there is some position that satisfies the query, which is easy to implement using space O(d), where d is the depth of the document.

When a streaming querying algorithm has read a prefix u of an input document, several positions in u may partially satisfy the query requirement. These potential answers to the query have to be stored, as the rest of the document will determine whether they are valid answers to the query. Consequently, we cannot hope for a streaming algorithm that takes constant space (or space O(d)). However, a position that has already satisfied the query requirements can be immediately output (to an output tape) and need not be kept in main memory. Formally, let us call a position i ≤ |u| an interesting position if there is some extension v of u such that i is an answer to the query on uv, but there is another extension v′ of u such that i is not an answer to the query on uv′. Summarizing the above argument, a streaming algorithm must keep track of all the interesting positions at u, which would take space (without resort to compression) O(I·log n), where I is the number of interesting positions at u and n = |u|.

The streaming algorithm we present uses only the space argued above as being necessary: after reading u, it uses space O(d + I·log n) for any fixed query (accounting for I pointers into the document), where d is the depth of the document. The algorithm is based on a simulation of a visibly pushdown automaton representing the query on an input document, with care taken to prune away positions that become uninteresting. The notion of basing the space complexity of the algorithm on the set of interesting positions is the algorithmic analog of the right congruence for finite automata on words (Myhill-Nerode theorem). We believe that this is an interesting criterion of complexity for streaming query processors, rather than the simplistic linear dependence on the entire length of the document commonly reported (e.g., [8]).

The query complexity of our algorithm is of course non-elementary for queries expressed in MSO, and is not a fair appraisal of the algorithm. However, if the query is represented by a non-deterministic visibly pushdown automaton with m states, we show the query complexity to be polynomial in m. More precisely, the algorithm, after reading u, uses space O(m⁴·log m·d + m⁴·I·log n), where d is the depth of the document at the point u, n = |u|, and I is the number of interesting positions at u.

We believe that the streaming algorithm we present here is a very efficient algorithm that holds promise of being useful in practice. Several other schemes for answering XPath queries on streaming XML documents also use some of the aspects of our algorithm: they store partially matched XPath patterns succinctly using a stack.

The states of the VPA accepting the query form the natural analog of partial matches in our context, and generalize tree-patterns to general MSO queries.

The paper is structured as follows. We define XML documents, queries and monadic queries, visibly pushdown automata, and known connections between logic and automata in Section 2. Section 3 presents the query VPA model and shows that unary monadic queries are exactly as powerful as query VPAs. Section 4 presents the streaming algorithm for answering unary queries, first by building a filtering automaton to answer the query emptiness problem, and later using a careful simulation of this automaton to yield an efficient query processor. Section 5 concludes with some observations and future directions.

Related Work:

Theoretical query models for XML have been studied using tree models. The most closely related work to our result on query VPAs is that of the query automaton model on unranked trees [22]. Query automata (more precisely, S2DTAu) on unranked trees select positions in a tree and are exactly as powerful as unary monadic queries on trees [22] (which is the same as that on nested words). This model works like a two-way automaton on unranked trees, and has parallel copies of automata that go up and down the tree, with automata processing children of a node synchronizing when going up to their parent. However, they are complex due to a special class of transitions (called stay transitions) that rewrite, using regular relations, the states labeling the children of a node. Further, there is a semantic restriction that requires that the children of any node be processed by a stay transition at most once.

We believe that query VPA are significantly simpler and a different model than S2DTAu, and, as we show in Appendix D, it is not easy to convert between the two models (in either direction). Note that we do not have any semantic restriction of the kind imposed on stay transitions in our setting, and this actually seems an important difference between the two models.

In [12], selecting tree-automata are defined, which are simpler than query automata and can return positions that satisfy any MSO query. However, these automata are non-deterministic in nature, and thus fundamentally different from our model.

Several streaming query engines have been proposed for evaluating queries written in XPath, an XML standard for querying. Most query engines however do not handle the whole of XPath, but are instead fine-tuned to efficiently handle practically useful fragments. XSQ [24], SPEX [23] and XSM [19] evaluate XPath queries efficiently using pushdown transducers that are hierarchically arranged and have access to a buffer. XSM does not support certain axes of traversals, while SPEX and XSQ process XPath expressions where predicates are restricted in that they cannot contain axes traversals. TwigM [8] evaluates XPath queries restricted to child and descendant axes, and uses stacks to efficiently store patterns, achieving polynomial time and space data and query complexities. See also the XPath processors XAOS [6], the stream processor in [15], and the BEA/XQRL processors [11] that handle more aspects of XPath/XQuery. The logical core of XPath is contained in the first-order logic over nested words, and hence the results in this paper present querying algorithms for a much more powerful class of queries; however, our algorithm is mainly optimized to have efficient data complexity.

Papers closely related to our work are [4, 5, 25]. These papers provide streaming algorithms for querying XML with close to optimal space for fragments of XPath. Our results generalize these algorithms to the class of all queries expressed in MSO. In [7], the authors consider efficient algorithms that answer any regular query early on streaming XML. In [20], evaluation of queries that select positions based on properties of the document that occur before the position is considered; such queries can be answered in space O(d).

2 Preliminaries

XML documents will be modeled as words over an alphabet where a single letter of the alphabet encodes an open/close tag. We ignore the actual data contained in the document in this exposition, but all results extend to the setting where the textual data contained between tags is taken into account.

Let Σ be a fixed finite alphabet of "open tags", and let Σ̄ = {c̄ | c ∈ Σ} be the corresponding alphabet of "close tags". Let Σ̂ = Σ ∪ Σ̄. A well-matched word is any word generated by the grammar: W → c W c̄, W → W W, W → ε, where we have a rule W → c W c̄ for every c ∈ Σ. The set of all well-matched words over Σ̂ will be denoted by WM(Σ̂). An XML document is a well-matched word over Σ̂.

Nested words, monadic second-order logic

[Figure: the nested structure of the word c a c c̄ ā ā a c̄ c̄ c̄, drawn with solid edges for the linear order and dotted edges for the nesting relation.]

A well-matched word w ∈ WM(Σ̂) can be seen as a nested structure: a linear labeled structure with nesting edges. For example, the structure corresponding to the word c a c c̄ ā ā a c̄ c̄ c̄ is shown on the right. The linear skeleton (denoted by solid edges) encodes the word and the nesting edges (denoted by dotted edges) relate open-tags with their matching close-tags.
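The nesting relation can be recovered in a single left-to-right pass with a stack. The following Python sketch is our own illustration (not code from the paper): close-tags are modeled as ('close', c), the function name nesting_edges is hypothetical, and the example word is a small well-matched word chosen for illustration.

    def close(c):
        # We model the close-tag c-bar of an open-tag c as ('close', c).
        return ("close", c)

    def nesting_edges(word):
        """Return the nesting relation of a well-matched word as a set of pairs
        (i, j) of 1-indexed positions, pairing each open-tag with its matching
        close-tag, or None if the word is not well-matched."""
        stack, edges = [], set()
        for j, letter in enumerate(word, start=1):
            if isinstance(letter, tuple) and letter[0] == "close":
                if not stack or stack[-1][1] != letter[1]:
                    return None                    # unmatched or mismatched close-tag
                i, _ = stack.pop()
                edges.add((i, j))
            else:
                stack.append((j, letter))          # remember the open-tag's position
        return edges if not stack else None        # leftover open-tags: not well-matched

    # A small well-matched word over Sigma = {a, c}:
    w = ["c", "a", close("a"), "c", close("c"), close("c")]
    print(nesting_edges(w))                        # {(2, 3), (4, 5), (1, 6)} in some order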

We skip the formal definition, but denote the nested structure associated with a word w as nw(w) = ({1, ..., |w|}, {Qa}a∈Σ̂, ≤, ν), where the universe is the set of positions in w, each Qa is a unary predicate that is true at the positions labeled a, the ≤ relation encodes the linear order of the word, and ν is a binary relation encoding the nesting edges.

Monadic second-order logic (MSOν) over nested structures is defined in the standard manner, with interpreted relations ≤ and ν. Formally, fix a countable set of first-order variables FV and a countable set of monadic second-order (set) variables SV. Then the syntax of MSO formulas over Σ̂-labeled nested structures is defined as:

ϕ ::= x ∈ X | Qa(x) | x ≤ y | ν(x, y) | ϕ ∨ ϕ | ¬ϕ | ∃x(ϕ) | ∃X(ϕ)

where x, y ∈ FV and X ∈ SV.

Automata on nested words

There are two definitions of automata on nested words which are roughly equivalent: nested word automata and visibly pushdown automata. In this paper, we prefer the latter formalism. Intuitively, a visibly pushdown automaton is a pushdown automaton that reads a nested word left to right, and pushes a symbol onto the stack when reading an open-tag and pops a symbol from the stack when reading a close-tag. Note that a symbol pushed at an open-tag is popped at the matching close-tag. Formally,

Definition 1 (VPA). A visibly pushdown automaton (VPA) over (Σ, Σ̄) is a tuple A = (Q, q0, Γ, δ, F), where Q is a finite set of states, q0 ∈ Q is the initial state, F ⊆ Q is the set of final states, Γ is a finite stack alphabet, and δ = ⟨δ^open, δ^close⟩ is the transition relation, where:

• δ^open ⊆ ((Q × Σ) × (Q × Γ));
• δ^close ⊆ ((Q × Σ̄ × Γ) × Q).

A transition (q, c, q′, γ) ∈ δ^open (denoted q −c/γ→ q′) is a push-transition, where the automaton reading c changes state from q to q′, pushing γ onto the stack. Similarly, a transition (q, c̄, γ, q′) ∈ δ^close (denoted q −c̄/γ→ q′) is a pop-transition, allowing the automaton, when in state q reading c̄ with γ on the top of the stack, to pop γ off the stack and change state to q′. A configuration of a VPA A is a pair (q, s) ∈ Q × Γ*. If a ∈ Σ̂, we say that (q1, s1) −a→A (q2, s2) if one of the following conditions is true:

• a = c ∈ Σ, s2 = γ·s1 and (q1, c, q2, γ) ∈ δ^open, or
• a = c̄ ∈ Σ̄, s1 = γ·s2 and (q1, c̄, γ, q2) ∈ δ^close.

Note that the height of the stack after reading a prefix u of a well-matched word w is precisely the number of unmatched calls in u. We extend the definition of −→A to words over Σ̂* in the natural manner. The language L(A) accepted by VPA A is the set of words w ∈ Σ̂* such that (q0, ε) −w→A (q, ε) for some q ∈ F. One important observation about VPAs, made in [2], is that deterministic VPAs are as expressive as non-deterministic VPAs (defined above). Finally, a language L of well-matched words is called a visibly pushdown language (VPL) if there is some VPA A such that L = L(A).

Monadic queries and automata

A (unary) query is a function f : WM(Σ̂) → 2^N such that for every w ∈ WM(Σ̂), f(w) ⊆ [|w|]. In other words, a query is a function that maps any well-matched word to a set of positions in the word.

A unary monadic query is a formula ϕ(x0) in MSOν that has precisely one free variable, the first-order variable x0. Such a formula defines a query fϕ: for any word w, fϕ(w) is the set of positions i such that the nested structure corresponding to w satisfies ϕ(x0) when x0 is interpreted to be the i'th position. We will say a query f is expressible in MSOν if there is a unary monadic query ϕ(x0) such that f = fϕ. We will consider unary monadic queries as the standard way to specify queries on XML documents in this paper (note that core fragments of query languages like XPath are expressible as unary monadic queries).

Any query f over Σ̂-labeled nested structures can be encoded as a language of well-matched words over a modified alphabet. If Σ̂ = (Σ, Σ̄), then let Σ̂′ = (Σ′, Σ̄′) where Σ′ = Σ ∪ (Σ × {∗}). A starred-word is a well-matched word of the form u(a, ∗)v where u, v ∈ Σ̂*, i.e. it is a well-matched word over Σ̂ where precisely one letter has been annotated with a ∗.

A query f then corresponds to a set of starred-words: L∗(f) = {a1 a2 ... ai−1 (ai, ∗) ai+1 ... an | i ∈ f(a1 ... ai−1 ai ai+1 ... an) and each aj ∈ Σ̂}. Intuitively, L∗(f) contains the set of all words w where a single position of w is marked with a ∗, and this position is an answer to the query f on w. We refer to L∗(f) as the starred-language of f. It is easy to see that the above is a 1-1 correspondence between unary queries and starred-languages. From results on visibly pushdown automata, in particular the equivalence of MSOν formulas and visibly pushdown automata [2], the following theorem follows:

Theorem 1. A query f is expressible in MSOν iff L∗(f) is a visibly pushdown language.

Hence we can view unary monadic queries as simply visibly pushdown starred-languages, which will help in many proofs in this paper.
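As a concrete reading of Definition 1, here is a minimal Python sketch (ours, not from the paper) that runs a nondeterministic VPA by tracking the set of reachable configurations; the class name VPA and the toy single-state automaton at the end are hypothetical.

    class VPA:
        """A visibly pushdown automaton (Definition 1): it pushes on open-tags and
        pops on close-tags.  Close-tags are written ('close', c)."""
        def __init__(self, q0, final, delta_open, delta_close):
            self.q0 = q0                    # initial state
            self.final = final              # set of final states
            self.delta_open = delta_open    # (q, c)        -> set of (q', gamma)
            self.delta_close = delta_close  # (q, c, gamma) -> set of q'

        def accepts(self, word):
            # A configuration is (state, stack); we track all reachable ones.
            configs = {(self.q0, ())}
            for letter in word:
                new = set()
                for q, stack in configs:
                    if isinstance(letter, tuple) and letter[0] == "close":
                        if stack:                          # pop gamma on a close-tag
                            *rest, gamma = stack
                            for q2 in self.delta_close.get((q, letter[1], gamma), ()):
                                new.add((q2, tuple(rest)))
                    else:                                  # push gamma on an open-tag
                        for q2, gamma in self.delta_open.get((q, letter), ()):
                            new.add((q2, stack + (gamma,)))
                configs = new
            return any(q in self.final and s == () for q, s in configs)

    # A hypothetical one-state VPA accepting every well-matched word over {a, c}:
    A = VPA(q0=0, final={0},
            delta_open={(0, c): {(0, c)} for c in ("a", "c")},
            delta_close={(0, c, c): {0} for c in ("a", "c")})
    print(A.accepts(["c", "a", ("close", "a"), ("close", "c")]))   # True
    print(A.accepts(["c", ("close", "a")]))                        # False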

We can now state the main results of this paper:

• We define the automaton model of query VPA for querying XML documents, which is a two-way visibly pushdown automaton that answers unary queries by marking positions in an input word. We show that a unary query is expressible in MSOν iff it is computed by some query VPA.

• For any unary monadic query, we show there is an algorithm that works in one pass on a streaming XML document, and outputs the set of positions that form answers to the query. This algorithm works by simulating a visibly pushdown automaton and keeps track of pointers into the streaming document. We show that the space complexity of the algorithm is optimal.

3 Query VPA

The goal of this section is to define an automaton model for querying XML documents called a query VPA. A query VPA is a pushdown automaton that can move both left and right over the input string and store information in a stack. The crucial property that ensures tractability of this model is that the stack height of such a machine is pre-determined at any position in the word. More precisely, any query VPA P working over a well-matched input w has the property that, for any partition of w into two strings u and v (i.e., w = uv), the stack height of P at the interface of u and v is the same as the number of unmatched open-tags in u (which is the same as the number of unmatched close-tags in v). In order to ensure this invariant, we define the two-way VPA as one that pushes on open-tags and pops on close-tags while moving right, but pushes on close-tags and pops on open-tags while moving left. Finally, in addition to the ability to move both left and right over the input, the query VPA can mark some positions by entering special marking states; intuitively, the positions in a word that are marked will be answers to the unary query that the automaton computes.

We will now define query VPA formally. We will assume that there is a left-end-marker ▷ and a right-end-marker ◁ for the input to ensure that the automaton doesn't fall off its ends; clearly, ▷, ◁ ∉ Σ̂.

Definition 2 (Query VPA). A query VPA (QVPA) over (Σ, Σ̄) is a tuple P = (Q, q0, Γ, δ, Q∗, S, C), where Q is a finite set of states, q0 ∈ Q is the initial state, Q∗ ⊆ Q is a set of marking states, Γ is a finite stack alphabet, (S, C) is a partition of Q × Σ̂, and δ = ⟨δ^open, δ^close, δ^chng⟩ is the transition relation, where:

• δ^open_right : S ∩ (Q × Σ) → (Q × Γ)
• δ^open_left : (S ∩ (Q × Σ)) × Γ → Q
• δ^close_right : (S ∩ (Q × Σ̄)) × Γ → Q
• δ^close_left : S ∩ (Q × Σ̄) → (Q × Γ)
• δ^chng : C ∪ (Q × {▷, ◁}) → Q

The δ^open_right and δ^close_left functions encode push-transitions of the automaton reading an open-tag and moving right, and reading a close-tag and moving left, respectively. The δ^open_left and δ^close_right functions encode pop-transitions when the automaton reads an open-tag and moves left, and reads a close-tag and moves right, respectively. On the other hand, the δ^chng function encodes transitions where the automaton changes the direction of its head movement. Observe that we force the automaton to change direction whenever it reads either ▷ or ◁. Note that the query VPA has no final states. Finally, the definition above describes a deterministic model, which is what we will focus on in this paper. The non-deterministic version of the above automaton can also be defined, but it does not increase the expressive power.

We will now define the execution of a query VPA on a word x = ▷w◁. A configuration of the query VPA is a tuple ⟨p, d, q, s⟩, where p ∈ [|x|] is the position in the input currently being scanned, d ∈ {left, right} is the direction in which the head is currently moving, q ∈ Q is the current state, and s ∈ Γ* is the current stack contents. The initial configuration is ⟨1, right, q0, ε⟩, i.e., initially the automaton is reading the leftmost symbol of w (not ▷), it is moving right, in the initial state, with an empty stack. A run is a sequence c0, c1, ..., cn, where c0 is the initial configuration, and for each i, if ci = ⟨pi, di, qi, si⟩ and ci+1 = ⟨pi+1, di+1, qi+1, si+1⟩, then one of the following holds:

• If (qi, x[pi]) ∈ S ∩ (Q × Σ) and di = right, then di+1 = di, pi+1 = pi + 1, qi+1 = q and si+1 = γ·si, where δ^open_right(qi, x[pi]) = (q, γ).

• If (qi, x[pi]) ∈ S ∩ (Q × Σ̄) and di = right, then di+1 = di, pi+1 = pi + 1, qi+1 = q and si = γ·si+1, where δ^close_right(qi, x[pi], γ) = q.

• If (qi, x[pi]) ∈ S ∩ (Q × Σ) and di = left, then di+1 = di, pi+1 = pi − 1, qi+1 = q and si = γ·si+1, where δ^open_left(qi, x[pi], γ) = q.

• If (qi, x[pi]) ∈ S ∩ (Q × Σ̄) and di = left, then di+1 = di, pi+1 = pi − 1, qi+1 = q and si+1 = γ·si, where δ^close_left(qi, x[pi]) = (q, γ).

• If (qi, x[pi]) ∈ C, then si = si+1 and qi+1 = δ^chng(qi, x[pi]). To define the new position and direction, there are two cases to consider. If di = right then di+1 = left and pi+1 = pi − 1. On the other hand, if di = left then di+1 = right and pi+1 = pi + 1.
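To see the stack-height invariant of a query VPA concretely, the following self-contained Python check (our own sketch; the function names are hypothetical) applies the direction-dependent push/pop discipline of Definition 2 while sweeping right and then left over a well-matched word, and verifies that the height always equals the number of unmatched open-tags to the left of the head.

    def is_close(letter):
        return isinstance(letter, tuple) and letter[0] == "close"

    def unmatched_opens(word, p):
        # Number of unmatched open-tags in the prefix word[0:p].
        return sum(-1 if is_close(l) else 1 for l in word[:p])

    def qvpa_stack_step(stack, letter, direction):
        """Stack discipline of a query VPA: moving right it pushes on open-tags and
        pops on close-tags; moving left it does the opposite."""
        push = (direction == "right") != is_close(letter)
        return stack + [letter] if push else stack[:-1]

    def check_invariant(word):
        stack = []
        for p, letter in enumerate(word):                    # sweep left-to-right
            stack = qvpa_stack_step(stack, letter, "right")
            assert len(stack) == unmatched_opens(word, p + 1)
        for p in range(len(word) - 1, -1, -1):               # sweep back right-to-left
            stack = qvpa_stack_step(stack, word[p], "left")
            assert len(stack) == unmatched_opens(word, p)

    check_invariant(["c", "a", ("close", "a"), "c", ("close", "c"), ("close", "c")])
    print("stack-height invariant holds on both sweeps")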

5 Observe that the way the run is defined, the stack height (q, c) whenever it reads an open-tag c in state q. Details in any configuration is determined by the word w. More of this construction can be found in [2]. Next, given any pop precisely, in any configuration c = hp, d, q, si of the run VPA B = (Q, q0, Γ, δ, F ), there is a VPA B recog- with p ∈ {1,... |w|}, if d = right then |s| is equal to the nizing the same language, which remembers the symbol number of unmatched open-tags in the word x[1] ··· x[p−1] last popped since the last unmatched open-tag in its con- (which is the same as the number of unmatched close-tags trol state. We can construct this: Bpop = (Q × (Γ ∪ 0 in x[p] ··· x[|w|]). On the other hand, if d = left then |s| {⊥}), (q0, ⊥), Γ, δ ,F × (Γ ∪ {⊥})), where the new tran- is equal to the number of unmatched open-tags in the word c/γ1 0 c/γ1 sitions are as follows. If q −−−→B q then (q, γ) −−−→Bpop x[1] ··· x[p]. When scanning the left-end-marker (p = 0) 0 c/γ1 0 c/γ1 0 or the right-end-marker (p = |x| − 1), the stack height is (q , ⊥). If q −−−→B q then (q, γ) −−−→ (q , γ1). always 0. For the rest of this proof, let us fix A to be the deter- Finally, the query VPA P is said to mark a position j in a ministic VPA with a canonical stack alphabet recognizing pop well-matched word w, where 1 ≤ j ≤ |w|, if there is some L∗(f), and A to be the (deterministic) version of A that run c0 . . . cn of P on the input BwC such that for some i, remembers the last popped symbol in the control state. We ci = (j, d, q, s) where q ∈ Q∗. The query implemented by will now describe the query VPA P for f. Let us fix the the query VPA P is the function fP , where fP (w) is the set input to be BwC, where w = a1a2 ··· an. of all positions marked by P when executed on BwC. P will proceed by checking for each i, i starting from We now prove the main theorem of this section: the set n and decremented in each phase till it becomes 1, whether of queries implemented by query VPA is precisely the set position i is an answer to the query. To do this, it must check of unary monadic queries. if A accepts the starred word where the star is in position i. P will achieve this by maintaining an invariant, which we Theorem 2. A query f is expressible in MSOν if and only if describe below. there is query VPA P such that fP = f.

Proof. The Invariant. Let w = a . . . a , and consider a posi- Translating monadic queries to query VPA: 1 n tion i in w. Let w = uaiv where u = a1 . . . ai−1 and v = a . . . a . Let f be a monadic unary query. From Theorem 1, i+1 n we know that there is a deterministic VPA A such that Recall that the suffix from position i + 1, v, can be uniquely written as wkckwk−1ck−1 ··· w1c1w0, where for L∗(f) = L(A). This suggests a very simple algorithm that will mark all the answers to query f by repeatedly simulat- each j, wj is a well-matched word, and ck,... c1 are the ing A on the document w. First the algorithm will simulate unmatched close-tags in v. the VPA A on the word w, assuming that the starred po- In phase i, the query VPA will navigate to position i with sition is the rightmost symbol of w; the algorithm marks stack σ such that (a) its control has information of a pair position |w| − 1 of w only if A accepts. Then the algorithm (q, γ) such that this state with stack σ is the configuration simulates A assuming that the starred position is |w| − 2, reached by Apop on reading the prefix u, and (b) its control and so on, each time marking a position if the run of A has the set B of states of A that is the precise set of states q0 on the appropriate starred word is accepting. A na¨ıve im- such that A accepts v from the configuration (q0, σ). Hence plementation of this algorithm will require maintaining the the automaton has a summary of the way A would have pro- starred position, as well as the current position in the word cessed the unstarred prefix up to (but not including) position that the simulation of A is reading, and the ability to update i and a summary of how the unstarred suffix from position these things. It is unclear how this additional information i + 1 would be processed. Under these circumstances, the can be maintained by a query VPA that is constrained to query VPA can very easily decide whether position i is an update its stack according to whether it is reading an open- answer to the query — if A on reading (ai, ∗) can go from tag or a close-tag. The crux of the proof of this direction is state q to some state in B, then position i must be marked. demonstrating that this can indeed be accomplished. While Technically, in order to ensure that this invariant can be we draw on ideas used in a similar proof for queries on reg- maintained, the query VPA needs to maintain more infor- ular word languages (see [22] for a recent exposition), the mation: a set of pairs of states of A, S, that summarizes how construction is more involved due to the presence of a stack. A would behave on the first well-matched word in v (i.e. Before giving more details about the construction, we wk), a stack symbol StackSym that accounts for the differ- will give two technical constructions involving VPAs. First ence in stack heights, and several components of these in 0 given any VPA B = (Q, q0, Γ, δ, F ) there is a VPA B with the stack to maintain the computation. We skip these tech- a canonical stack alphabet that recognizes the same lan- nical details here; the Appendix has a formal description of 0 0 guage; the VPA B = (Q, q0,Q × Σ, δ ,F ) which pushes the invariant and the mechanism to maintain it.
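The B^pop transformation used in the proof above can be phrased directly over transition tables. The Python sketch below is our own rendering, under the simplifying assumption that the given VPA is deterministic (each transition dict maps a key to a single successor); BOTTOM plays the role of ⊥.

    BOTTOM = None   # the bottom marker: nothing popped since the last unmatched open-tag

    def make_pop_vpa(states, q0, final, delta_open, delta_close):
        """Build B^pop from a deterministic VPA B.
        delta_open:  (q, c)        -> (q', gamma)
        delta_close: (q, c, gamma) -> q'
        The new control state (q, last) additionally records the stack symbol last
        popped since the last unmatched open-tag (BOTTOM if none)."""
        gammas = {g for (_, g) in delta_open.values()} | {BOTTOM}
        pop_open, pop_close = {}, {}
        for (q, c), (q2, gamma1) in delta_open.items():
            for last in gammas:                 # a push resets the memory to BOTTOM
                pop_open[((q, last), c)] = ((q2, BOTTOM), gamma1)
        for (q, c, gamma1), q2 in delta_close.items():
            for last in gammas:                 # a pop of gamma1 records gamma1
                pop_close[((q, last), c, gamma1)] = (q2, gamma1)
        new_states = {(q, last) for q in states for last in gammas}
        new_final = {(q, last) for q in final for last in gammas}
        return new_states, (q0, BOTTOM), new_final, pop_open, pop_close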

6 Marking position i. If ai ∈ Σ then the query VPA P will the state of A, when it read the open-tag that matches ai−1. (ai,∗)/(q,(ai,∗)) 0 0 These two observations ensure that we never pop symbols mark position i iff q −−−−−−−−−−→A q where q ∈ B. 0 out of the stack. The details are as follows. (ai,∗)/γ 0 Similarly, if ai ∈ Σ then P marks i iff q −−−−−−→A q with 0 q ∈ B. ai−1 ∈ Σ: Simulate backwards until (in the worst case) the rightmost unmatched open-tag symbol in the word a1 . . . ai−2, and then simulate forward to determine the Maintaining the Invariant. Initially the query VPA P 0 0 will simulate the VPA Apop on the word w from left to right. state (q , γ ) as described above. Doing this will allow it to obtain the invariant for position pop ai−1 ∈ Σ: γ is symbol that is popped by A when it n. So what is left is to describe how the invariant for po- reads ai−1. So γ encodes the state of A when the sition i − 1 can be obtained from the invariant for position matching open-tag aj to ai−1 was read. So simulate i. Determining the components of the invariant, except for backwards until a is encountered and then simulate pop j the new control state of A , are easy and follow from the forwards. definitions; the details are deferred to the appendix. Com- puting the new state of Apop at position i − 1 is interesting, This completes the description of the query VPA. and we describe this below. Translating query VPA to monadic queries: Determining the state of Apop . Recall that we know the pop We now show the converse direction: For any query VPA state of A after reading a1 ··· ai−1, which is (q, γ). We need to compute the state (q0, γ0) of Apop after reading A, there is an MSOν formula ϕ(x) that defines the same query that A does. a1 ··· ai−2. The general idea is as follows. The query pop Let A be a query VPA. We translate the query A de- VPA P will simulate A backwards on symbol ai−1, i.e., ai−1 fines into an MSOν formula by translation through sev- it will compute Prev = {p | p −−−→ pop (q, γ). If A eral intermediate stages: (a) we first construct a two-way |Prev| = 1 then we are done. On the other hand, suppose (non-marking) VPA that accepts the starred-language of the |Prev| = k > 1. In this case, P will continue simulating query, (b) using a standard mapping of nested words into Apop backwards on symbols a , a and so on, while i−2 i−3 binary trees called stack trees [2], we build a pushdown maintaining for each state p ∈ Prev the set of states of tree walking automaton that accepts the trees correspond- Apop that reach p. If at some position j the sets associated ing to the starred language of the query, (c) using the fact with all states p ∈ Prev become empty except one, or j = 1 that tree-walking pushdown automata define precisely reg- (we reach the beginning of the word), then we know which ular tree languages [16], which are equivalent to MSO on state p ∈ Prev is state of Apop after reading a ··· a — 1 i−2 trees, we build a unary MSO formula on trees that captures it is either the unique state whose set is non-empty or it is the query, and finally we translate this MSO formula on state whose set includes the initial state. However P now trees to a unary MSO formula on nested words. Due to needs to get back to position i − 1. This is done by ob- ν interests of space we delegate further details of this proof to serving the following. We know at position j + 1 at least 2 the Appendix. different threads of the backward simulation are still alive. 
Position i − 1 is the unique position where these 2 threads converge if we now simulate Apop forwards. The idea just 4 An algorithm for querying streaming XML outlined works for queries on regular word languages, but does not work for query VPA due to one problem. If we In this section, we exhibit an efficient algorithm for an- go too far back in the backwards simulation (like the begin- swering any unary monadic query on streaming XML doc- ning of the word), we will lose all the information stored uments. A streaming algorithm processing XML is one in the stack. Therefore, we must ensure that the backward which reads a document, left to right, letter-by-letter, and simulation does not result in the stack height being lower cannot look back at any part of its input (unless it stores it), than what it is supposed after reading a1 ··· ai−2. To do and must output the positions of a document that satisfy the this we use the special properties of the VPA Apop . Ob- query. serve that if we go backwards on an unmatched open-tag Our optimized streaming algorithm will be presented in (in a1 ··· ai−1), the state of the VPA A at the time the open- stages to clearly outline its development. First we will con- tag was read is on the stack. Thus, the state of Apop after sider the problem of filtering XML documents. The prob- reading the open-tag is uniquely known. Next if ai−1 ∈ Σ lem of filtering requires one to determine if there is some is a matched close-tag, then since we keep track of the last position in a streaming XML document that satisfies a unary symbol popped after reading ai−1, we know the symbol that monadic query. Note that in filtering, we do not insist was popped when ai−1 was read, which allows us to know that the automaton report which exact position(s) satisfy the

7 query. We will build a deterministic VPA to solve the filter- stack after reading u will have height k. Any execution of ing problem. Next we will present a (unoptimized) querying the non-deterministic VPA Af on u will look like algorithm based on the deterministic VPA for filtering. Fi- nally we will outline improvements that optimize the space w0 c1/γ1 w1 c2/γ2 wk q0 −−→ r1 −−−→ r2 −−→ r3 −−−→· · · −−→ r2k+1 requirements of the streaming algorithm. Hence, the deterministic simulation keeps track of sets 4.1 Filtering of possible summaries. A summary is a pair of states 0 0 (r, r ) such that Af goes from r to r on reading a well- 0 A monadic query ϕ(x0) implementing a query f can be matched word. Thus, a summary (r, r ) captures all non- translated to a visibly pushdown language of starred words deterministic runs that start at r (after the last unmatched 0 L∗(f) (see Theorem 1 above). Let Af be a VPA that accepts open-tag) and end in r . L∗(f). In fact, let us assume that Af accepts all prefixes u For the machine Af there 4 kinds of relevant summaries: 0 in L∗(f) such that every extension of u to a well-matched (q, q )-summary that correspond to never having seen a ∗, word is in L∗(f). The filtering automaton we want to build (q, p)-summary that correspond to having seen a ∗ in the must accept the set of words w ∈ WM (Σ)b such that there current level, (p, p0)-summary that correspond to having is some way to annotate one letter of w with a ∗ so that the seen a ∗ at some previous level, and (p, p0)-summary that resulting starred word is accepted by Af . correspond to having seen a ∗ at the last unmatched open- The we build will hence keep tag. In the last case, (when we see a ∗ at the last unmatched 0 track of various runs of Af for each possibly annotation open-tag) we will denote by (p, p , ∗). The deterministic 0 0 of w with a ∗. Intuitively, this is easy: we build a non- machine Bf = (S, s0,S × Σ, δ ,F ) will track all possible deterministic VPA that guesses a position of w to be anno- summaries. tated with a ∗, and simulate the VPA Af on this annotated We defer the formal definition of the construction to the word. We however go through the details of the construc- Appendix. tion of this deterministic VPA as it will be the basis of our We augment the above automaton so that it accepts only algorithm. when the stack is empty Since the set of final states is the set of states that have some summary that takes the automaton Lemma 1. Let ϕ(x ) be a unary monadic query defining 0 from the initial state to a final state, it follows that Bf ac- a query f. Then we can effectively compute a deterministic cepts L∗(f). Note that Bf is deterministic and has at most 2 VPA that accepts the language of words w such that f(w) 6= O(2n ) states, where n is the number of states in A . ∅. f

Proof. Our construction heavily depends on the deter- 4.2 Querying Algorithm minization construction for VPAs [2] (we encourage the reader to see that construction in order to understand the Our first streaming algorithm simply simulates the ma- more intricate construction below.) chine Bf , constructed in Lemma 1, on the document First, we construct Af , a nondeterministic VPA accept- and at the same time keeps track of ∗-ed positions. In ing L∗(f). Without loss of generality we assume that Af more detail, the algorithm will associate with every (q, p)- keeps track in its state whether it has seen a ∗-ed letter or summary (and (p, p0, ∗)-summary) the set positions(q, p) not, and if it has seen, it will have no move on a ∗-ed let- (or positions(p, p0, ∗)) of positions x since the last un- ter (if Af does not have this form, it can be transformed, matched open-tag (or the position of the last unmatched at most doubling the number of states, to gain this form). open-tag), such that the word with ∗ at position x has the 0 Thus, let Af = (Q ∪ P, q0, Γ, δ, F ), where the states have summary (q, p) (or (p, p , ∗)). This set of positions will been partitioned into sets Q and P — Q is the set of states be updated as the summaries are updated. The answers that encode the fact that a ∗-ed letter has not been seen, to a query will be all the positions associated with sum- and P is the set of states reached if a ∗-ed letter has been maries of the form (q0, p) where p is a final state of Af . read. Thus, q0 ∈ Q and F ⊆ P . In what follows, we The detailed algorithm is shown in Figure 1 (ignore the line will adopt the notation convention to refer to states in Q by “prune(Stack)” for now). Notice that this algorithm does 0 0 q, q , q1,..., the states in P by p, p , p1 ..., and the states in not explicitly construct the machine Bf . Hence its space 0 Q ∪ P by r, r , r1,.... requirement is not exponential in Af . The total space used The idea behind the deterministic construction is as is O(m2 · d · log m + m2 · n · log n), where m is the size of follows. Recall that any string u that is a prefix Af , d is the depth of the document being read, and n is the of a well-matched word w can be uniquely written as length of the document. The space analysis is as follows. 2 w0c1w1c2 ··· ckwk, where c1, . . . ck are the unmatched The stack has height d, with m summaries in each level 2 open-tags in u and wis are well-matched words; thus, the taking space O(m · d · log m) space. Each ∗-ed position

8 appears only in summaries corresponding to its level. More- Proof. See Appendix. over, there can be at most n positions, each taking log n bits, giving as an additional space of O(m2 · n · log n). Process- Our optimized algorithm is essentially the one in Fig- 2 ing each symbol only take O(m ) time, giving us a total ure 1 under the provision that the automaton Af it simu- 2 running time of O(m · n). lates is the automaton ValidPref (Af ). In the rest of the section we will prove its optimality. We start by an impor- 4.3 Optimizing the Algorithm tant definition that allows us to characterize the algorithms optimality. We will now present improvements to the basic algo- Definition 3. Consider a document w = a1a2 ··· an.A rithm which will slightly increase the per-tag processing position i is said to be interesting at time k (i ≤ k) for query time, while provably ensuring that the only positions kept f if there is some string v such that uv ∈ L∗(f), where track of at anytime are those that could potentially be an- u = a1a2 ··· ai−1(ai, ∗) ··· ak, and there is some string swers to the query. We will start by first describing the 0 0 v such that uv 6∈ L∗(f). In other words, there is some procedure “prune” (in Figure 1) that prunes the stack after extension of the document after k tags such that position every symbol is read. i would be an answer to query f, though i may not be an answer for all extensions. The set of all interesting positions f Pruning the Stack. As the streaming algorithm simulates at k in w with respect to f will be denoted by Ik (w). the deterministic VPA Bf , the runs on some of the ∗-ed positions will die. However, the basic algorithm will not Lemma 3. Consider the streaming algorithm that performs necessarily remove these positions from the stack; the al- a deterministic simulation of the automaton ValidPref (Af ) gorithm will remove a position i, whose run has died, only and prunes the stack after reading every tag. If after read- 0 when it comes back to the stack level where there is a sum- ing the string u, there is some summary (q, p) (or (p, p , ∗)) 0 mary recording position i. The procedure prune in the al- such that i ∈ positions(q, p) (or i ∈ positions(p, p , ∗)) gorithm in Figure 1 ensures that we remove all positions as then i is interesting at that point. soon as their runs end. Here is the idea. Let the current Proof. Consider the word v which is u with a star at po- state be the set of summaries s and the top of the stack be sition i, and let w c w ··· c w its unique decomposition (s , a). Consider a summary (r , r ) ∈ s . Let N(r , a) 0 1 1 k k 1 1 2 1 2 into unmatched open-tags and well-matched words. Let po- denote all the states r reached from r on reading the sym- 2 sition i appear in the well-matched word w (the same ideas bol a. Now if there are no summaries of the form (r, r0) ∈ s, j can be applied when i is one of the unmatched open-tags). where r ∈ N(r , a) then we know that all runs that had a 2 Since (q, p) was not pruned, it means there is a run of the summary (r , r ) in the previous level have ended, and so 1 2 w0 c1 wj wk can be removed from the top of the stack (as well as all the form q0 −−→ q1 −→· · · q −−→ p ··· −−→ pn Further since the algorithm is simulating ValidPref (A ), it follows that positions associated with (r1, r2)). After removing all un- f 0 0 necessary summaries from the top of the stack, we can look there is some v such that vv ∈ L∗(f). Since Af accepts at the next level and so on. 
This is what procedure prune all prefixes u of L∗(f) such that all extensions of u are in does. The total time taken by prune is proportional to the L∗(f), by definition i is interesting. size of the stack which is at most O(m2d), where m is the size of automata Af . Analysis of the Algorithm. Pruning increases the per- 2 Pruning the stack after every symbol is processed en- tag processing time to O(m1d), where m1 is the size of sures (proved later in this section) that starred positions i, ValidPref (Af ). However, the space requirements are prov- all of whose Af runs have died are no longer stored. How- f ably less. Let i be the maximum over all k of |Ik (w)|. ever, the resulting algorithm may still store unnecessary po- 2 The total space required is O(m1d log m1) (for stack) plus sitions. Consider a position i such that the word with ∗ at 2 O(im1 log n) (to store the positions). We know from i has a run in A , but that run can never be extended to 2 f Lemma 2 that m1 is at most m , which gives us the bounds an accepting run. Such positions will still be maintained in terms of Af . by our algorithm. We now present a technical construction involving VPAs that allows us to address this issue. 5 Conclusions Lemma 2. For any VPA A, there is VPA ValidPref (A) such that L(A) = L(ValidPref (A)) and for any prefix u An interesting future direction will be to analyze our of a well-matched word, ValidPref (A) has some execution algorithm’s query complexity for particular fragments of on u if and only if there exists v such that uv ∈ L(A). If m XPath, and compare them with the best known upper 2 is the size of A then ValidPref (A) has size at most m . bounds in the literature.

9 Querying XML documents is often followed by out- [12] M. Frick, M. Grohe, and C. Koch. Query evaluation on com- putting parts of the document related to the query positions. pressed trees (extended abstract). In LICS, pages 188–. IEEE The algorithm can be extended to answer n-ary queries, and Computer Society, 2003. will help in this regard. We plan to explore XML trans- [13] T. J. Green, G. Miklau, M. Onizuka, and D. Suciu. Pro- ducers based on VPAs that produce XML documents from cessing xml streams with deterministic automata. In ICDT ’03: Proceedings of the 9th International Conference on input streams. Database Theory, pages 173–189, London, UK, 2003. There have been some recent results in minimizing visi- Springer-Verlag. bly pushdown automata [17, 9]. Though not efficiently min- [14] A. K. Gupta and D. Suciu. Stream processing of XPath imizable in general, VPAs are minimizable when some part queries with predicates. In SIGMOD, pages 419–430. ACM of its modular structure is fixed in advance (for example, if Press, 2003. we decide to process each tag of document using separate [15] V. Josifovski, M. Fontoura, and A. Barta. Querying xml modules). We plan to revisit the problem of filtering XML streams. The VLDB Journal, 14(2):197–210, 2005. data against a large set of queries, and explore whether the [16] T. Kamimura and G. Slutzki. Parallel and two-way automata minimization results help in constructing faster algorithms. on directed ordered acyclic graphs. Information and Con- trol, 49(1):10–51, 1981. In conclusion, we believe that visibly pushdown au- [17] V. Kumar, P. Madhusudan, and M. Viswanathan. Mini- tomata are a simple elegant model for processing XML doc- mization, learning, and conformance testing of boolean pro- uments, especially in streaming applications. We believe grams. In C. Baier and H. Hermanns, editors, CONCUR, the insights gained from their theory will result in design- volume 4137 of Lecture Notes in Computer Science, pages ing fast algorithms for processing XML. In this vein, an ef- 203–217. Springer, 2006. ficient implementation of the algorithm presented here for [18] V. Kumar, P. Madhusudan, and M. Viswanathan. Visi- querying streaming XML would be interesting. bly pushdown automata for streaming XML. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 1053–1062. ACM, 2007. References [19] B. Ludascher,¨ P. Mukhopadhyay, and Y. Papakonstantinou. A transducer-based XML query processor. In VLDB, pages [1] S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: 227–238, 2002. From Relations to Semistructured Data and XML. Morgan [20] A. Neumann and H. Seidl. Locating matches of tree pat- Kaufmann, 1999. terns in forests. In Proceedings of the 18th Conference on [2] R. Alur and P. Madhusudan. Visibly pushdown languages. Foundations of Software Technology and Theoretical Com- In STOC, pages 202–211. ACM Press, 2004. puter Science, pages 134–145, London, UK, 1998. Springer- Verlag. [3] R. Alur and P. Madhusudan. Adding nesting structure to [21] F. Neven. Automata, logic, and XML. In CSL, pages 2–26, words. In DLT, LNCS 4036, pages 1–13, 2006. London, UK, 2002. Springer-Verlag. [4] Z. Bar-Yossef, M. Fontoura, and V. Josifovski. On the mem- [22] F. Neven and T. Schwentick. Query automata. In PODS ory requirements of xpath evaluation over xml streams. In ’99: Proceedings of the eighteenth ACM SIGMOD-SIGACT- A. Deutsch, editor, PODS, pages 177–188. ACM, 2004. SIGART symposium on Principles of database systems, [5] Z. 
Bar-Yossef, M. Fontoura, and V. Josifovski. Buffering in pages 205–214, New York, NY, USA, 1999. ACM Press. query evaluation over xml streams. In C. Li, editor, PODS, [23] D. Olteanu, T. Kiesling, and F. Bry. An evaluation of regular pages 216–227. ACM, 2005. path expressions with qualifiers against xml streams. ICDE, [6] C. Barton, P. Charles, D. Goyal, M. Raghavachari, and 00:702, 2003. M. Fontoura. Streaming xpath processing with forward and [24] F. Peng and S. S. Chawathe. Xpath queries on streaming backward axes. ICDE, 00:455, 2003. data. In SIGMOD ’03, pages 431–442. ACM Press, 2003. [7] A. Berlea. Online Evaluation of Regular Tree Queries. [25] P. Ramanan. Evaluating an XPath Query on a Streaming Nordic Journal of Computing, 13(4):1–26, 2006. XML Document. In Proceedings of the International Con- [8] Y. Chen, S. B. Davidson, and Y. Zheng. An efficient xpath ference on Management of Data, pages 41–52, 2005. query processor for xml streams. In ICDE ’06, page 79. [26] V. Vianu. XML: From practice to theory. In SBBD, pages IEEE Computer Society, 2006. 11–25, 2003. [9] P. Chervet and I. Walukiewicz. Minimizing variants of vis- [27] World Wide Web Consortium. Extended Markup Language ibly pushdown automata. In L. Kucera and A. Kucera, ed- (XML). http://www.w3.org. itors, MFCS, volume 4708 of Lecture Notes in Computer Science, pages 135–146. Springer, 2007. [10] J. Engelfriet and H. J. Hoogeboom. Mso definable string transductions and two-way finite-state transducers. ACM Trans. Comput. Logic, 2(2):216–254, 2001. [11] J. C. Freytag, P. C. Lockemann, S. Abiteboul, M. J. Carey, P. G. Selinger, and A. Heuer, editors. VLDB 2003. Morgan Kaufmann, 2003.

Appendix

Given: Af = (Q ∪ P, q0, Γ, δ, F ), a VPA accepting L∗(f)

Initial State: State = {(q0, q0)}, and Stack = ε

On reading a symbol a ∈ Σ̂:
  State′ = ∅
  case a = c ∈ Σ:
    push (State, c) onto Stack
    for each (r1, r2) ∈ State, γ ∈ Γ such that r2 −c/γ→ r′:
      State′ = State′ ∪ {(r′, r′)}
    for each (q1, q2) ∈ State, γ ∈ Γ such that q2 −(c,∗)/γ→ p:
      State′ = State′ ∪ {(p, p, ∗)}
      positions(p, p, ∗) = {current position}
  case a = c̄ ∈ Σ̄:
    pop (State1, c′) from Stack; if c′ ≠ c, stop processing and reject
    for each (q, q1) ∈ State1, (q2, q3) ∈ State, γ ∈ Γ such that q1 −c/γ→ q2 and q3 −c̄/γ→ q′:
      State′ = State′ ∪ {(q, q′)}
    for each (p, p1) ∈ State1, (p2, p3) ∈ State, γ ∈ Γ such that p1 −c/γ→ p2 and p3 −c̄/γ→ p′:
      State′ = State′ ∪ {(p, p′)}
    for each (p, p1, ∗) ∈ State1, (p2, p3) ∈ State, γ ∈ Γ such that p1 −c/γ→ p2 and p3 −c̄/γ→ p′:
      State′ = State′ ∪ {(p, p′, ∗)}
      positions(p, p′, ∗) = positions(p, p1, ∗)
    for each (q, p1) ∈ State1, (p2, p3) ∈ State, γ ∈ Γ such that p1 −c/γ→ p2 and p3 −c̄/γ→ p:
      State′ = State′ ∪ {(q, p)}
      positions(q, p) = positions(q, p) ∪ positions(q, p1)
    for each (q, q1) ∈ State1, (q2, p3) ∈ State, γ ∈ Γ such that q1 −c/γ→ q2 and p3 −c̄/γ→ p:
      State′ = State′ ∪ {(q, p)}
      positions(q, p) = positions(q, p) ∪ positions(q2, p3)
    for each (q, q1) ∈ State1, (p2, p3, ∗) ∈ State, γ ∈ Γ such that q1 −(c,∗)/γ→ p2 and p3 −c̄/γ→ p:
      State′ = State′ ∪ {(q, p)}
      positions(q, p) = positions(q, p) ∪ positions(p2, p3, ∗)
    for each (q, q1) ∈ State1, (q2, q3) ∈ State, γ ∈ Γ such that q1 −c/γ→ q2 and q3 −(c̄,∗)/γ→ p:
      State′ = State′ ∪ {(q, p)}
      positions(q, p) = positions(q, p) ∪ {current position}
  State = State′
  prune(Stack)

On reading the entire document, for each (q0, p) ∈ State with p ∈ F, output positions(q0, p)

Figure 1. Streaming Algorithm for Querying

11 wk A Translating Monadic Queries to query • S = {(q1, q2) | (q1, ²) −−→A (q2, ²)} 0 VPA • Bi = {q | A accepts wici ··· w0 from 0 configuration (q , hqi, cii · · · hq1, c1i)}, and wi The Invariant. Just before deciding whether to mark po- • Si = {(q1, q2) | (q1, ²) −→A (q2, ²)} sition i or not, P will be in configuration hp, d, m, σi where p = i. Observe that depending on the direction d and on Maintaining the Invariant. Initially the query VPA P whether ai is an open-tag or a close-tag, the stack height will simulate the VPA Apop on the word w from left to in this configuration will either be k or k − 1 (if ai ∈ Σ) right. Doing this will allow it to obtain the invariant for or k or k + 1 (if ai ∈ Σ). The direction d will be picked position n. So what is left is to describe how the invariant to be so that the height is the minimum of the two possi- for position i − 1 can be obtained from the invariant for po- bilities for symbol ai. The control state m will be of the sition i. Recall that the configuration of P at position i is form h(q, γ), StackSym,B,Si. To explain the intuition be- hi, d, m, σi with m = h(q, γ), StackSym,B,Si and that the hind the invariant, let us for now ignore the height differ- word ai+1 ··· an can be written as wkck ··· c1w0. Let the ence between the words a1 ··· ai−1 and ai+1 ··· an; the in- 0 0 0 0 0 word aiai+1 ··· an be uniquely written as w`c` ··· w1c1w0 formation about the height difference will be captured by where w0s are well matched words, and c0 s are the un- StackSym precisely. The main idea is to ensure that (q, γ) i i pop matched close-tags. Depending on the symbol ai, ` will along with the stack σ is the configuration reached by A either be k − 1 or k + 1. Let the configuration satisfying on reading the word a1 . . . ai−1, while the set B consists of 0 0 0 0 the invariant at position i − 1 be hi − 1, d , m , σ i, where all states q such that the word ai+1 ··· an is accepted from 0 0 0 0 0 0 0 m = h(q , γ ), StackSym ,B ,S i. It is very easy to en- the configuration (q , σ) by the VPA A. Under these cir- sure that the direction d0 is correct at position i − 1. The cumstances, the query VPA can very easily decide whether computation of the new (q0, γ0) will require P to traverse position i is an answer to the query — if A on reading (ai, ∗) the word back and forth, and we will describe how this is can go from state q to some state in B, then position i must obtained in the next paragraph. We will now describe the be marked. In order to ensure that this invariant can be up- new values of all the remaining components in the configu- dated as position i changes, the query VPA keeps compo- ration. There are 4 cases to consider. nent S in control state m, and some additional information 0 in the stack σ. Case ai, ai−1 ∈ Σ: In this case ` = k − 1 with w` = 0 0 The formal invariant maintained by the query VPA when aiwkckwk−1, ci = ci and wi = wi. Let StackSym = 0 0 0 at position i is: (Bk−1,Sk−1) and StackSym = (Bk−2,Sk−2). We know that |σ| = k − 1 and |σ0| = k − 2 in this case. 0 Case ai ∈ Σ: d will be right, StackSym (in control state σ will be obtained from σ by just popping the top m) will be (Bk−1,Sk−1) and of the stack symbol hqk−1, ck−1,Bk−2,Sk−2i of σ. 0 0 0 σ = hqk−1, ck−1,Bk−2,Sk−2ihqk−2, ck−2,Bk−3,Sk−3i Bk−2 and Sk−2 in StackSym are nothing but Bk−2 0 0 · · · hq1, c1,B0,S0i, where and Sk−2 that are popped from σ. 
Finally B and S are defined as follows: 0 ai/γ1 0 0 0 • ((q, γ), hqk−1, ck−1i · · · hq1, c1i) is configuration S = {(q1, q2) | q1 −−−→A q1, (q1, q2) ∈ S, pop reached by A on reading a1 ··· ai−1 0 ck/γ1 q2 −−−→A q3 and (q3, q2) ∈ Sk−1} 0 0 • B = {q | A accepts wkck ··· w0 from 0 0 ck−1/(q ,(ai−1,∗)) 0 B = {q1 | (q1, q2) ∈ S , q2 −−−−−−−−−−−−→A q3 configuration(q , hq, (ai, ∗)ihqk−1, ck−1i · · · hq1, c1i)} and q3 ∈ Bk−2} • S = {(q , q ) | (q , ²) −−→wk (q , ²)} 1 2 1 A 2 Above, q0 is figured out by back and forth traversal that 0 • Bi = {q | A accepts wici ··· w0 from is described later. configuration (q0, hq , c i · · · hq , c i)}, and i i 1 1 Case a ∈ Σ, a ∈ Σ: In this case ` = k − 1 with w0 = w i i−1 ` • S = {(q , q ) | (q , ²) −→i (q , ²)} 0 0 i 1 2 1 A 2 aiwkckwk−1, ci = ci and wi = wi. Let StackSym = 0 (Bk−1,Sk−1). We know that |σ| = |σ | = k − 1 in Case ai ∈ Σ: d will be left, StackSym (in control state m) 0 0 0 this case and σ will be same as σ. StackSym will will be a symbol γ , and σ = hqk, ck,Bk−1,Sk−1i be symbol γ, the stack symbol in the state of Apop at hqk−1, ck−1,Bk−2,Sk−2i · · · hq1, c1,B0,S0i, where position i. Finally B0 and S0 are defined as follows: ai/γ1 0 S0 = {(q , q ) | q −−−→ q0 , (q0 , q0 ) ∈ S, • ((q, γ), γ hqk, cki · · · hq1, c1i) is configuration 1 2 1 A 1 1 2 pop 0 ck/γ1 reached by A on reading a1 ··· ai−1 q2 −−−→A q3 and (q3, q2) ∈ Sk−1} 0 0 ai/γ1 0 0 0 0 ck/γ1 • B = {q | A accepts wkck ··· w0 from B = {q1 | q1 −−−→A q1, (q1, q2) ∈ S, q2 −−−→A q3 0 configuration (q , hqk, cki · · · hq1, c1i)} and q3 ∈ Bk−1}

Case $a_i, a_{i-1} \in \overline{\Sigma}$: In this case $\ell = k+1$ with $w'_\ell = \epsilon$ and $c'_\ell = a_i$, and $c'_j = c_j$, $w'_j = w_j$ for $j \le k$. Let $\mathit{StackSym} = \gamma_1$. The new $\mathit{StackSym}'$ will be $\gamma$, where $(q,\gamma)$ is the state reached by $A^{pop}$ on $a_1 \cdots a_{i-1}$. Next, $|\sigma| = k$ and $|\sigma'| = k+1$; the new additional symbol to be pushed onto the stack is $(\gamma_1, B, S)$. Finally, $S' = \{(q_1, q_1) \mid q_1 \in Q\}$ and

$B' = \{q_1 \mid q_1 \xrightarrow{a_i/\gamma}_A q_2$ and $q_2 \in B\}$.

Case $a_i \in \overline{\Sigma}$, $a_{i-1} \in \Sigma$: In this case $\ell = k+1$ with $w'_\ell = \epsilon$ and $c'_\ell = a_i$, and $c'_j = c_j$, $w'_j = w_j$ for $j \le k$. Let $\mathit{StackSym} = \gamma_1$. In this case $|\sigma| = |\sigma'| = k$ and so $\sigma' = \sigma$. The new $\mathit{StackSym}'$ will be $(B_k, S_k)$, where $B_k = B$ and $S_k = S$. Finally, $S' = \{(q_1, q_1) \mid q_1 \in Q\}$ and

$B' = \{q_1 \mid q_1 \xrightarrow{a_i/(q', (a_{i-1},\ast))}_A q_2$ and $q_2 \in B\}$,

where $q'$ is the new state computed by traversing the word back and forth (described next).
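To illustrate how mechanical these updates are, the following Python sketch renders the computation of $S'$ in the case where $a_i$ and $a_{i-1}$ are both open-tags, mirroring the displayed formula. The encoding of $A$'s transitions as sets of tuples and the function name are our own assumptions; $B'$ would be computed analogously from $S'$, the popped $B_{k-2}$, and the state $q'$ obtained by the back-and-forth traversal.

```python
from typing import Set, Tuple

State, Sym, StackSym = str, str, str

def update_S_both_open(
    S: Set[Tuple[State, State]],          # summary for the block w_k
    S_km1: Set[Tuple[State, State]],      # S_{k-1}, popped from the query VPA's stack
    a_i: Sym,                             # the open-tag at position i
    c_k: Sym,                             # the close-tag matching a_i
    push_trans: Set[Tuple[State, Sym, StackSym, State]],  # (q, open-tag, pushed gamma, q')
    pop_trans: Set[Tuple[State, Sym, StackSym, State]],   # (q, close-tag, popped gamma, q')
) -> Set[Tuple[State, State]]:
    """S' = {(q1, q2) | q1 --a_i, push g1--> q1', (q1', q2') in S,
                        q2' --c_k, pop g1--> q3, (q3, q2) in S_{k-1}},
    with the same stack symbol g1 pushed at a_i and popped at c_k."""
    S_new = set()
    for (q1, sym_open, g1, q1p) in push_trans:
        if sym_open != a_i:
            continue
        for (x, q2p) in S:
            if x != q1p:
                continue
            for (y, sym_close, g2, q3) in pop_trans:
                if y != q2p or sym_close != c_k or g2 != g1:
                    continue
                for (z, q2) in S_km1:
                    if z == q3:
                        S_new.add((q1, q2))
    return S_new
```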
B Translating query VPA to monadic queries

We now show the converse direction: for any query VPA $A$, there is an MSO formula $\varphi(x)$ which defines the same query that $A$ does.

Let $A$ be a query VPA. We will translate the query that $A$ defines into an MSO formula through several intermediate stages.

Let $f$ be the query defined by the query VPA $A$. We first construct a two-way (non-marking) VPA $B$ that accepts the starred-language of $f$: $B$ accepts a word $w$ with a $\ast$ in position $i$ if and only if $i \in f(w)$. Constructing $B$ is easy. $B$ simulates $A$ on a word $w$ with a $\ast$ in position $i$ and accepts the word if $A$ reaches position $i$ in a marking state. $B$ also ensures, in a first run over the word, that the word has a unique position that is marked with a $\ast$. The language accepted by $B$ is $L_\ast(f)$, the starred-language of $f$.

Any nested word $w$ can be represented as a tree called a stack tree. A stack tree is a $\hat{\Sigma}$-labeled binary tree that has one node for every position in $w$, and the node corresponding to position $i$ is labeled by $w[i]$. The stack tree corresponding to a word $w$ is defined inductively as follows: (a) if $w = \epsilon$, then the stack tree of $w$ is the empty tree, and (b) if $w = c\, w_1\, \bar{c}\, w_2$, then the stack tree corresponding to $w$ has its root labeled $c$, has the stack tree corresponding to $w_1$ rooted at its left child, and has its right child labeled $\bar{c}$; this right child has no left child, but has a right child at which the stack tree corresponding to $w_2$ is rooted.
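The inductive definition above translates directly into code. The following Python sketch builds the stack tree of a well-matched word; the tag encoding (pairs ('open', c) / ('close', c)) and the rendering of close-tag labels as '/c' are our own choices, not the paper's notation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Tag = Tuple[str, str]   # ('open', c) or ('close', c)

@dataclass
class Node:
    label: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def stack_tree(w: List[Tag]) -> Optional[Node]:
    """Stack tree of a well-matched word w = c w1 c-bar w2 (empty word -> empty tree)."""
    if not w:
        return None
    kind, c = w[0]
    assert kind == "open", "a non-empty well-matched word must start with an open-tag"
    depth = 0
    for j, (k, _) in enumerate(w):      # find the close-tag matching w[0]
        depth += 1 if k == "open" else -1
        if depth == 0:
            break
    w1, w2 = w[1:j], w[j + 1:]
    close_node = Node(label="/" + c, left=None, right=stack_tree(w2))
    return Node(label=c, left=stack_tree(w1), right=close_node)

# Example: <a><b></b></a><c></c>
word = [("open", "a"), ("open", "b"), ("close", "b"), ("close", "a"),
        ("open", "c"), ("close", "c")]
root = stack_tree(word)
assert root.label == "a" and root.left.label == "b" and root.right.right.label == "c"
```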

We now show that the set of stack trees corresponding to the starred words accepted by $B$ can be accepted using a pushdown tree-walking automaton [16]. A pushdown tree-walking automaton works on a tree by starting at the root and walking up and down the tree, and it has access to a stack onto which it always pushes a symbol when going down an edge and from which it pops when coming up an edge. Note that the height of the stack when at a node of the tree is hence always the depth of the node from the root. From $B$, we can build a tree-walking automaton $C$ that reads the tree corresponding to a starred word and simulates $B$ on it. $C$ can navigate the tree and effectively simulate moving left or right on the word. When $B$ moves right reading a call symbol, $C$ moves to the left child of the call and pushes the symbol that $B$ pushes onto its stack. When $B$ moves right to read the corresponding return, $C$ goes up from the left subtree to this call, pops the symbol from the stack, and uses it to simulate the move on the return. The backward moves on the word that $B$ makes can also be simulated: for example, when $B$ reads a return and moves left, $C$ goes to the corresponding node in the left subtree of the call node corresponding to the return, and when doing so pushes the appropriate symbol that $B$ pushed onto the stack. When $C$ moves down from an internal or return node, or from a call node to the right, it pushes a dummy symbol onto the stack. In summary, whenever $B$ is at a position $i$ with stack $\gamma_1 \cdots \gamma_k$, $C$ would be reading the node corresponding to $i$ in the tree, and its stack would have $\gamma_1 \cdots \gamma_k$ when restricted to non-dummy symbols.

It is known that pushdown tree-walking automata precisely accept regular tree languages. Hence we can construct an MSO formula on trees that is true on precisely the trees that correspond to starred words accepted by $B$. This MSO formula can be translated to an MSO formula $\psi$ on nested words, which is true on precisely the set of starred nested words that $B$ accepts. Assuming $x$ is not a variable in $\psi$, we replace every atomic formula of the form $Q_{(a,\ast)}(y)$ (the atomic formula checking whether position $y$ is labeled $a$ and is starred) by the formula $x = y \wedge Q_a(y)$, to get a formula $\varphi(x)$ with a free variable $x$. Intuitively, we replace every check the formula does for a starred label by a check as to whether that position is $x$. It is easy to see then that the formula $\varphi(x)$ is an MSO formula on $\hat{\Sigma}$-labeled (unstarred) nested words which precisely defines the query defined by $B$, and hence the query defined by the original query VPA $A$. This concludes the proof.

C Querying algorithm: Details

Construction of deterministic VPA in Lemma 1

The determinized automaton is formally defined as follows.

1. $S = 2^{Q\times Q} \cup 2^{Q\times P} \cup 2^{P\times P} \cup 2^{P\times P\times\{\ast\}}$

2. $s_0 = \{(q_0, q_0)\}$

3. $\delta'^{open}$ is as follows: on reading $a \in \Sigma$ in state $s_1$, we push the pair $(s_1, a)$ and go to the state $s_2$ that is the union of the sets

(a) $\{(r', r') \mid \exists (r_1, r_2) \in s_1,\ \gamma \in \Gamma$ s.t. $r_2 \xrightarrow{a/\gamma}_{A_f} r'\}$

(b) $\{(p, p, \ast) \mid \exists (q_1, q_2) \in s_1,\ \gamma \in \Gamma$ s.t. $q_2 \xrightarrow{(a,\ast)/\gamma}_{A_f} p\}$.

4. $\delta'^{close}$ is as follows: on reading $\bar{a} \in \overline{\Sigma}$ in state $s_1$ with $(s_2, a)$ on top of the stack, we go to the state $s_3$ that is the union of the following sets.

(a) $\{(q, q') \mid \exists (q, q_1) \in s_2,\ (q_2, q_3) \in s_1,\ \gamma \in \Gamma$ s.t. $q_1 \xrightarrow{a/\gamma}_{A_f} q_2$ and $q_3 \xrightarrow{\bar{a}/\gamma}_{A_f} q'\}$

(b) $\{(p, p') \mid \exists (p, p_1) \in s_2,\ (p_2, p_3) \in s_1,\ \gamma \in \Gamma$ s.t. $p_1 \xrightarrow{a/\gamma}_{A_f} p_2$ and $p_3 \xrightarrow{\bar{a}/\gamma}_{A_f} p'\}$

(c) $\{(p, p', \ast) \mid \exists (p, p_1, \ast) \in s_2,\ (p_2, p_3) \in s_1,\ \gamma \in \Gamma$ s.t. $p_1 \xrightarrow{a/\gamma}_{A_f} p_2$ and $p_3 \xrightarrow{\bar{a}/\gamma}_{A_f} p'\}$

(d) $\{(q, p) \mid \exists (q, p_1) \in s_2,\ (p_2, p_3) \in s_1,\ \gamma \in \Gamma$ s.t. $p_1 \xrightarrow{a/\gamma}_{A_f} p_2$ and $p_3 \xrightarrow{\bar{a}/\gamma}_{A_f} p\}$

(e) $\{(q, p) \mid \exists (q, q_1) \in s_2,\ (q_2, p_3) \in s_1,\ \gamma \in \Gamma$ s.t. $q_1 \xrightarrow{a/\gamma}_{A_f} q_2$ and $p_3 \xrightarrow{\bar{a}/\gamma}_{A_f} p\}$

(f) $\{(q, p) \mid \exists (q, q_1) \in s_2,\ (p_2, p_3, \ast) \in s_1,\ \gamma \in \Gamma$ s.t. $q_1 \xrightarrow{(a,\ast)/\gamma}_{A_f} p_2$ and $p_3 \xrightarrow{\bar{a}/\gamma}_{A_f} p\}$

(g) $\{(q, p) \mid \exists (q, q_1) \in s_2,\ (q_2, q_3) \in s_1,\ \gamma \in \Gamma$ s.t. $q_1 \xrightarrow{a/\gamma}_{A_f} q_2$ and $q_3 \xrightarrow{(\bar{a},\ast)/\gamma}_{A_f} p\}$.

5. $F' = \{s \mid \exists (q_0, p) \in s$ s.t. $p \in F\}$.
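As an illustration, here is a minimal Python sketch of the open- and close-tag updates above, restricted to the plain $Q \times Q$ summary component; the starred and $P$-components follow the same pattern. The encoding of $A_f$'s transitions as sets of tuples and the function names are our own assumptions.

```python
from typing import FrozenSet, Set, Tuple

State, Sym, Gamma = str, str, str
Summary = FrozenSet[Tuple[State, State]]

def det_open(s1: Summary, a: Sym,
             push_trans: Set[Tuple[State, Sym, Gamma, State]]
             ) -> Tuple[Summary, Tuple[Summary, Sym]]:
    """On an open-tag a: push (s1, a) and start a fresh summary with identity pairs
    on every state reachable by some push transition on a."""
    s2 = frozenset((r_new, r_new)
                   for (_r1, r2) in s1
                   for (q, sym, _g, r_new) in push_trans
                   if q == r2 and sym == a)
    return s2, (s1, a)

def det_close(s1: Summary, popped: Tuple[Summary, Sym], a_close: Sym,
              push_trans: Set[Tuple[State, Sym, Gamma, State]],
              pop_trans: Set[Tuple[State, Sym, Gamma, State]]) -> Summary:
    """On a close-tag: combine the summary s2 saved before the matching open-tag with
    the summary s1 of the enclosed well-matched block, matching the stack symbol."""
    s2, a_open = popped
    s3 = set()
    for (q, q1) in s2:
        for (q2, q3) in s1:
            for (p1, sym_o, g, p2) in push_trans:
                if (p1, sym_o, p2) != (q1, a_open, q2):
                    continue
                for (p3, sym_c, g2, q_prime) in pop_trans:
                    if p3 == q3 and sym_c == a_close and g2 == g:
                        s3.add((q, q_prime))
    return frozenset(s3)
```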

Proof of Lemma 2

Lemma 2. For any VPA $A$, there is a VPA $\mathit{ValidPref}(A)$ such that $L(A) = L(\mathit{ValidPref}(A))$ and, for any prefix $u$ of a well-matched word, $\mathit{ValidPref}(A)$ has some execution on $u$ if and only if there exists $v$ such that $uv \in L(A)$. If $m$ is the size of $A$, then $\mathit{ValidPref}(A)$ has size at most $m^2$.

Proof. It is well known that for any PDA $P$, there is an NFA $\mathit{Config}(P)$ such that $\mathit{Config}(P)$ accepts a string $\gamma_0 \gamma_1 \cdots \gamma_k q$ iff $P$ cannot reach an accepting state from the configuration $(q, \gamma_k \gamma_{k-1} \cdots \gamma_0)$ ($\gamma_0$ is the bottom of the stack). In addition, $\mathit{Config}(P)$ has the same number of states as $P$, and can be constructed in cubic time.

The idea behind $\mathit{ValidPref}(A)$ can now be explained easily. $\mathit{ValidPref}(A)$ will simultaneously simulate both $A$ and $\mathit{Config}(A)$ as it is reading the document, ensuring that at any point the state of $\mathit{Config}(A)$ is the state reached on reading the current stack contents of $A$. This can be achieved as follows. When $A$ pushes a symbol, $\mathit{ValidPref}(A)$ will simulate $\mathit{Config}(A)$ on the symbol that $A$ pushes, and at the same time push onto its stack both the symbol that $A$ pushes and the old state of $\mathit{Config}(A)$. When $A$ pops, $\mathit{ValidPref}(A)$ will pop the state of $\mathit{Config}(A)$ from the stack and continue the simulation of $\mathit{Config}(A)$ from the popped state. If at any point the simulation of $A$ reaches a control state $q$ such that $\mathit{Config}(A)$ accepts $q$ from the current state, then $\mathit{ValidPref}(A)$ will terminate that run. The formal definition of $\mathit{ValidPref}(A)$ based on these ideas is skipped in the interests of space.
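A hedged Python sketch of this product construction follows: the transition function of $\mathit{Config}(A)$ and the test for it accepting the current control state are taken as parameters, and a run is killed (an empty successor set) as soon as that test succeeds. All names are our own; the actual $\mathit{ValidPref}(A)$ is a nondeterministic VPA and is only rendered schematically here.

```python
from typing import Callable, FrozenSet, Set, Tuple

StateA, StateC, Gamma = str, str, str
Prod = Tuple[StateA, StateC]              # (state of A, state of Config(A))

def push_step(prod: Prod, gamma: Gamma, qA_next: StateA,
              config_delta: Callable[[StateC, Gamma], FrozenSet[StateC]],
              dead: Callable[[StateC, StateA], bool]
              ) -> Set[Tuple[Prod, Tuple[Gamma, StateC]]]:
    """A pushes gamma and moves to qA_next: advance Config(A) on gamma, remember the old
    Config-state on the stack, and drop runs from which A can no longer accept.
    dead(qC, qA) stands for: Config(A) accepts qA from its NFA state qC."""
    _qA, qC = prod
    result = set()
    for qC_next in config_delta(qC, gamma):
        if not dead(qC_next, qA_next):
            result.add(((qA_next, qC_next), (gamma, qC)))   # second component is pushed
    return result

def pop_step(prod: Prod, popped: Tuple[Gamma, StateC], qA_next: StateA,
             dead: Callable[[StateC, StateA], bool]) -> Set[Prod]:
    """A pops: restore the Config(A)-state saved at the matching push and re-check viability."""
    _gamma, qC_saved = popped
    return set() if dead(qC_saved, qA_next) else {(qA_next, qC_saved)}
```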
D Comparison with query automata on unranked trees

Neven and Schwentick [22] define an automaton model on trees, called generalized deterministic two-way tree automata (G2DTA), as follows. Consider a partition of $Q \times \Sigma$ into two sets $U$ and $D$, where $Q$ is the set of (finitely many) states of the automaton and $\Sigma$ is the alphabet labeling the input tree. Further, let $U_{up}$ and $U_{stay}$ be a partition of the set $U^\ast$. G2DTA are defined to be tree automata with three types of transitions.

Down transitions $\delta_\downarrow : D \times \mathbb{N} \to Q^\ast$ such that for each $i$, $\delta_\downarrow(q, a, i)$ (for $(q, a) \in D$) is a string of length $i$. Thus, when the automaton is in state $q$ at a vertex $v$ labeled $a$ with $i$ children, in the next step the automaton could go down in the tree, and the states at $v$'s children are given by the string $\delta_\downarrow(q, a, i)$. It is assumed that for each $(q, a) \in D$ the language $\{\delta_\downarrow(q, a, i) \mid i \in \mathbb{N}\}$ is regular.

Up transitions $\delta_\uparrow : U_{up} \to Q$ such that for each $q \in Q$, $\{w \in U_{up} \mid \delta_\uparrow(w) = q\}$ is regular. When the automaton is at each of the children of a vertex $v$ such that the string formed by the state-symbol pairs at the children is $w \in U_{up}$, the automaton moves up to the vertex $v$ in state $\delta_\uparrow(w)$.

Stay transitions $\delta_- : U_{stay} \to Q^\ast$, a length-preserving function that is computed by a two-way finite-state transducer. When the automaton is at each of the children of a vertex $v$ such that the string formed by the state-symbol pairs at the children is $w \in U_{stay}$, the automaton simultaneously changes the state at each of the children of $v$ to that given by the string $\delta_-(w)$.

In addition, a G2DTA is equipped with a selection function that allows it to select certain vertices of an unranked tree as answers to a query. We skip the formal definition of these automata, and the reader is referred to [22].
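To fix intentions, here is a schematic Python rendering (our own encoding, not Neven and Schwentick's) of the three transition types and the selection function; the regularity and two-way-transducer side conditions are only recorded as comments.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State, Label = str, str
Pair = Tuple[State, Label]           # a state-symbol pair at a child

@dataclass
class G2DTATransitions:
    # delta_down(q, a, i) -> string of i states; {delta_down(q, a, i) | i} must be regular
    delta_down: Callable[[State, Label, int], List[State]]
    # delta_up(w) -> single state at the parent; each preimage {w | delta_up(w) = q} must be regular
    delta_up: Callable[[List[Pair]], State]
    # delta_stay(w) -> string of the same length, computed by a two-way transducer
    delta_stay: Callable[[List[Pair]], List[State]]
    # selection function marking answer vertices
    select: Callable[[State, Label], bool]
```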

Through a couple of facts, Neven and Schwentick highlight the subtle role stay transitions play. First they show that if we consider two-way tree automata models without stay transitions, then these are not as expressive as MSO queries on unranked trees. Next they show that G2DTA as defined above are much more expressive than MSO: in fact, G2DTA are equivalent in expressive power to linear-space bounded Turing machines.

Thus, in order to achieve MSO expressiveness, Neven and Schwentick define a restricted automaton model called strong two-way deterministic tree automata (S2DTA), which are G2DTA with the restriction that, on any input tree, at most one stay transition is executed at the children of each node. This is a non-trivial semantic restriction, and it can be shown to be EXPTIME-complete to check.

Our proof that query VPA are exactly equi-expressive to unary MSO queries, though it uses similar techniques, does not seem directly obtainable from Neven and Schwentick's result. We will now outline the challenges in constructing a direct translation between query VPA and S2DTA. However, before doing so, we first define how an unranked tree can be associated with any well-matched word. Define a simple well-matched word to be any word of the form $c\, w\, \bar{c}$, where $c \in \Sigma$ and $w$ is any well-matched word. Given a simple well-matched word $w = c\, w_1 w_2 \cdots w_k\, \bar{c}$, where each $w_i$ is again a simple well-matched word, $\mathit{tree}(w)$ is a tree whose root is labeled by $(c, \bar{c})$ and has $k$ children, where the $i$-th child is $\mathit{tree}(w_i)$, defined inductively.
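The following Python sketch builds $\mathit{tree}(w)$ for a simple well-matched word, using the same tag encoding as the stack-tree sketch earlier; the rendering of the root label $(c, \bar{c})$ as (c, '/c') is our own choice.

```python
from dataclasses import dataclass
from typing import List, Tuple

Tag = Tuple[str, str]   # ('open', c) or ('close', c)

@dataclass
class UTree:
    label: Tuple[str, str]          # (c, '/c')
    children: List["UTree"]

def tree(w: List[Tag]) -> UTree:
    """tree(w) for a simple well-matched word w = c w_1 ... w_k c-bar."""
    assert w and w[0][0] == "open" and w[-1][0] == "close"
    c = w[0][1]
    body, children, depth, start = w[1:-1], [], 0, 0
    for j, (kind, _) in enumerate(body):
        depth += 1 if kind == "open" else -1
        if depth == 0:                       # a top-level simple sub-word w_i just closed
            children.append(tree(body[start:j + 1]))
            start = j + 1
    return UTree(label=(c, "/" + c), children=children)

# Example: <a><b></b><c><d></d></c></a>
w = [("open", "a"), ("open", "b"), ("close", "b"),
     ("open", "c"), ("open", "d"), ("close", "d"), ("close", "c"), ("close", "a")]
t = tree(w)
assert t.label == ("a", "/a") and [ch.label[0] for ch in t.children] == ["b", "c"]
```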

Query VPA to S2DTA

Let us first consider the challenges in translating a query VPA to an S2DTA. Consider the execution on an input word $w$, and let us for simplicity assume that $\mathit{tree}(w)$ is such that every node (other than the leaves) has at least two children. Suppose the query VPA $A$ is in configuration $\langle p, d, q, s\rangle$, and let $v$ be the vertex in $\mathit{tree}(w)$ corresponding to position $p$. One natural way for the S2DTA $T$ to encode the state $q$ and the stack $s = \gamma_1 \cdots \gamma_k$ is as follows. The state of $T$ at $v$ would be $q$, and the stack itself would be stored as the "state" of $T$ at a particular sibling of each of the vertices along the unique path from the root to $v$; note that our tree encoding will ensure that $v$ is at depth $k$, and so this is well defined.

Now let us consider maintaining this invariant as the query VPA takes various transitions. If $A$ moves right while reading an open-tag, then that corresponds to making a down transition in the S2DTA $T$. However, consider moving right while reading a close-tag. If the close-tag $c$ corresponds to a vertex that is the rightmost child of its parent, then such a right move corresponds to moving up in the tree. On the other hand, if the vertex corresponding to $c$, say $v$, is not the rightmost child, then in the S2DTA we are required to "move" to $v$'s right sibling. There is no way to accomplish this by an up and then a down transition (which is why Neven and Schwentick needed stay transitions). The only way to achieve this is using a stay transition. However, then the translation from $A$ to $T$ will result in a G2DTA that is not an S2DTA, as there is no a priori way to bound the number of right moves over non-rightmost close-tags!

S2DTA to Query VPA

Let us now consider the other direction of the translation. First observe that the S2DTA, being a tree automaton, allows an "inherent" non-determinism: at a configuration, the automaton can either choose to move down from a vertex or execute an up/stay transition at a set of vertices, if such a transition is enabled. On the other hand, a query VPA has a very deterministic left-right and right-left way to process a word. Thus, for a translation from S2DTA to query VPA to work, we will need to prove that we can always restrict our attention to runs of the S2DTA where the left-most enabled transition is always taken first. In a configuration $c$ of the S2DTA, a left-most enabled transition is either a down transition at a vertex $v$ such that all vertices to the left of $v$ in the configuration are waiting for an up/stay transition, or an up/stay transition at $v$ such that all of $v$'s siblings are also ready to execute the up/stay transition and no vertex to the left of $v$ in the configuration has an enabled transition.

Let us assume for now that such a proposition, allowing us to consider only such restricted runs of the S2DTA, can be proved. Let us consider the various transitions of the S2DTA and see how they can be simulated on the query VPA. Suppose the left-most enabled transition is a down transition $\delta_\downarrow(q, a, i)$ at $v$. The VPA cannot simultaneously go to all the children of $v$, but rather only "moves" to the leftmost subtree of $v$. Now if the observation about left-most enabled transitions holds, then in the query VPA we can process the left-most subtree until we need to execute an up/stay transition, then move to the next sibling, and so on. The first challenge to overcome is how and where we can store the string $\delta_\downarrow(q, a, i)$ so that, after the left-most child is processed, we can proceed with the original down transition at the next sibling. The key idea is this. Note that the string $\delta_\downarrow(q, a, i)$ is computed by a one-way finite automaton (on words), say $A_{\delta_\downarrow}$. Therefore, we can push the state of $A_{\delta_\downarrow}$ after it produces the first symbol (which is the state at the left-most child of $v$) onto the stack, process the left-most subtree, and use the pushed state later on to generate the next symbol of $\delta_\downarrow(q, a, i)$, and so on. We can do a similar trick for the up transitions. Once again, observe that since $\{w \in U_{up} \mid \delta_\uparrow(w) = q\}$ is regular, we can assume we have one-way automata $A_q$ for all these languages. So when we need to execute an up transition at $v$, after processing the subtree rooted at $v$, we first push the states of all these $A_q$ automata before processing the subtree rooted at the next sibling of $v$. We continue in this fashion until all the siblings of $v$ are ready to execute the up transition. At this point the states of the automata $A_q$ that we have pushed can be used to determine the state $q$ at the parent of $v$ after the up transition is executed, which corresponds to moving right in the word.

Assuming the ideas outlined in the previous two paragraphs can be carried out technically, we run into the main road-block that does not seem to have any solution, namely, stay transitions. A stay transition $\delta_- : U_{stay} \to Q^\ast$ is computed by a two-way transducer, which means that we need to make multiple passes over the state-symbol pairs at each sibling in order to determine the new states at each node. This requires storing the state-symbol string on a tape with two-way access, and there is no such storage available to the query VPA. Our trick of using the stack to compute dynamically, as in the case of down and up transitions, no longer works here.
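As a closing illustration of the bookkeeping for down (and, symmetrically, up) transitions described above, here is a schematic Python sketch: gen_step stands for one step of the one-way automaton $A_{\delta_\downarrow}$ that produces the next child state, and process_subtree for the recursive processing of a subtree. Both are placeholders rather than the paper's notation, and the list named stack models the query VPA's stack discipline only abstractly.

```python
from typing import Callable, List, Tuple

DFAState, Q = str, str

def simulate_down(children: List[object],
                  gen_start: DFAState,
                  gen_step: Callable[[DFAState], Tuple[Q, DFAState]],
                  process_subtree: Callable[[object, Q], None]) -> None:
    """Process v's children left to right, regenerating delta_down(q, a, i) one symbol
    at a time: only the generator's current state is kept on the (simulated) stack
    while a subtree is being processed."""
    stack: List[DFAState] = [gen_start]
    for child in children:
        gen_state = stack.pop()                    # resume the generator A_{delta_down}
        child_state, gen_state = gen_step(gen_state)
        stack.append(gen_state)                    # remember it across the subtree traversal
        process_subtree(child, child_state)
    stack.pop()                                    # discard after the last child
```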
