Constant-Delay Enumeration Algorithms for Document Spanners Over Nested Documents Martin Muñoz PUC & IMFD Chile Cristian Riveros PUC & IMFD Chile


Abstract Some of the most relevant document schemas used online, such as XML and JSON, have a nested format. In recent years, the task of extracting data from large nested documents has become especially relevant. We model queries of this kind as Visibly Pushdown Transducers (VPT), a structure that extends visibly pushdown automata with outputs. Since processing a string through a VPT can generate a huge number of outputs, we are interested in the task of enumerating them one after another as efficiently as possible. This paper describes an algorithm that enumerates these elements with output-linear delay after preprocessing the string in a single pass. We show applications of this result on recursive document spanners over nested documents and show how our algorithm can be adapted to enumerate the outputs in this context.

2012 ACM Subject Classification Theory of computation → Database theory.

Keywords and phrases Persistent data structures, Query evaluation, Enumeration algorithms.

1 Introduction

A constant-delay enumeration algorithm is an efficient solution to an enumeration problem: given an instance of the problem, the algorithm performs a preprocessing phase to build some indices and then continues with an enumeration phase in which it retrieves the outputs one by one, taking constant time between consecutive outputs. These algorithms provide a strong guarantee of efficiency since a user knows that, after the preprocessing phase, they will access the outputs as if they had already been computed. For these reasons, constant-delay algorithms have attracted researchers' attention, leading to sophisticated solutions for several query evaluation problems. Starting with Durand and Grandjean's work [19], researchers have found constant-delay algorithms for various subclasses of conjunctive queries [11, 15, 12], FO queries over sparse structures [19, 30, 34], MSO queries over words and trees [10, 6], and document spanners [23, 7], among others [35].

arXiv:2010.06037v1 [cs.DB] 12 Oct 2020

One could also develop constant-delay enumeration algorithms for query evaluation over nested documents. Indeed, some of the most relevant document schemas used online, such as XML and JSON, have a nested format. We can model the structure of these documents with the theory of nested words, where we divide the alphabet between open and close tags and encode the data in a sequence of well-nested symbols. Many formalisms for processing nested documents can be understood with visibly pushdown automata (VPA) [4], an automata model with excellent algorithmic properties [5]. Moreover, people have recently studied its natural extension to transducers [22], called visibly pushdown transducers (VPT), a model for processing nested documents and producing outputs.
The main advantage of these models compared to other models of nested structures (e.g., automata over trees) is that they allow processing the input naturally in a streaming fashion, reducing the resources needed to fulfill the task.

This paper presents a constant-delay enumeration algorithm for evaluating VPT over nested words. Specifically, we study the combined complexity of evaluating a VPT $T$ over an input word $w$ and consider the notion of output-linear delay [23], a refinement of the idea of constant-delay where the delay between outputs depends only on the size of two consecutive outputs, namely, it is constant with respect to $T$ and $w$. We provide an output-linear delay enumeration algorithm that takes preprocessing time $|T|^3 \cdot |w|$, namely, linear in the size of the input word.

Why do we need a constant-delay algorithm for nested words? It is well-known that there is a close connection between VPAs and tree automata. Indeed, we can encode every well-nested word as a tree. Furthermore, there is a one-to-one correspondence between nested word languages accepted by VPAs and languages accepted by tree automata. Therefore, one could argue that, to get a constant-delay enumeration algorithm for VPT, one can convert the nested word to a tree and the VPT to a tree automaton (with some output policy), and then use the machinery of [10, 6, 8] to solve this problem. It is undoubtedly true that we can derive the existence of an enumeration algorithm for VPT by following this approach. However, our algorithm makes two contributions to the solution of this problem:

1. The first contribution is that our algorithm has a one-pass preprocessing phase, namely, the input is read letter by letter in a single pass (see Section 2 for a formalization). In contrast, the algorithms in [10, 6, 8] require storing the entire input in memory, making several passes over the data, and computing indices for the final enumeration. Our algorithm makes a single pass and, for each letter, updates its data structure driven by the VPT transition table. This characteristic is very appealing, for example, for processing XML or JSON documents in a streaming fashion.

2. The second contribution of our algorithm is that it solves an open exposition problem. As Timothy Chow said [16]: “Solving an open exposition problem means explaining a mathematical subject in a way that renders it totally perspicuous”.
Bringing this concept to algorithms, this means providing a simple algorithm whose instructions are evident, in a way that a software developer can understand and implement. One can argue that the algorithms presented in [10, 6, 8] are sophisticated, in the sense that they require several steps to preprocess the input. Further, these algorithms are given as a sequence of mathematical procedures without providing the final code. Instead, we base our algorithm on a fully-persistent [18] data structure called an Enumerable Compact Set (ECS), and an algorithm that mimics the abstract machine's execution (see Algorithm 1). This algorithm is given by a sequence of low-level instructions, and we believe that any skilled software developer can implement it. Moreover, the exposition allows us to think about further optimizations and to better understand constant-delay algorithms over tree structures.

Towards the end of this work, we apply our results to study efficient enumeration algorithms in information extraction. More specifically, we use our techniques to derive an output-linear delay algorithm for evaluating visibly pushdown extraction grammars, a subclass of extraction grammars [32]. This solution is the first practical algorithm (i.e., with linear preprocessing) to evaluate a subclass of recursive document spanners efficiently [33].

Related work. Constant-delay algorithms have been studied for several classes of query languages and structures [35], as we already discussed. The approach of [10, 6] is the closest to this work (see the discussion above). Algorithms for processing nested documents (e.g., XML) with one-pass [26] or constant-delay [14] have been studied previously; however, they address different problems that are not directly related to this work. In [21, 3], people have studied the evaluation of VPT in a streaming fashion, but none of them looked at enumeration problems. Very recently, people have considered extensions of document spanners with recursion [33, 32].
In these works, the question of investigating nested documents is not addressed. To the best of our knowledge, this is the first work to consider the evaluation of visibly pushdown transducers over nested documents with constant-delay enumeration.

2 Preliminaries

Nested words. As usual, given a set $S$ we denote by $S^*$ the set of all finite words with symbols in $S$, where $\varepsilon \in S^*$ represents the empty word of length 0. We will work over a structured alphabet $\Sigma = (\Sigma^{<}, \Sigma^{>}, \Sigma^{|})$ comprised of three disjoint sets $\Sigma^{<}$, $\Sigma^{>}$, and $\Sigma^{|}$ that contain open, close, and neutral symbols respectively (in [4, 22] these sets are named call, return, and local, respectively). Furthermore, we will call a symbol in $\Sigma^{<}$, $\Sigma^{>}$, or $\Sigma^{|}$ an “open”, a “close”, or a “letter”, and we will denote them as $\texttt{<}a$, $a\texttt{>}$, and $a$, respectively. Instead, we will use $s$ to denote any symbol in $\Sigma^{<}$, $\Sigma^{>}$, or $\Sigma^{|}$. The set of well-nested words over $\Sigma$ (or just nested words), denoted as $\Sigma^{<*>}$, is defined as the closure of the following rules: (1) $\Sigma^{|} \cup \{\varepsilon\} \subseteq \Sigma^{<*>}$, (2) if $w_1, w_2 \in \Sigma^{<*>} \setminus \{\varepsilon\}$ then $w_1 \cdot w_2 \in \Sigma^{<*>}$, and (3) if $w \in \Sigma^{<*>}$, $\texttt{<}a \in \Sigma^{<}$, and $b\texttt{>} \in \Sigma^{>}$, then $\texttt{<}a \cdot w \cdot b\texttt{>} \in \Sigma^{<*>}$.

In [4], Alur et al. consider a slightly more general set of nested words, where there could be close symbols at the beginning of the word that are never opened and open symbols at the end that are never closed. We can extend our setting to support this generalization at the cost of considering these border cases. For the sake of simplicity, we restrict our work to nested words without loss of generality.

Visibly Pushdown Languages. A visibly pushdown automaton [4] (VPA) is a tuple $A = (Q, \Sigma, \Gamma, \Delta, I, F)$ where $Q$ is a finite set of states, $\Sigma = (\Sigma^{<}, \Sigma^{>}, \Sigma^{|})$ is the input alphabet, $\Gamma$ is the stack alphabet, $\Delta \subseteq (Q \times \Sigma^{<} \times Q \times \Gamma) \cup (Q \times \Sigma^{>} \times \Gamma \times Q) \cup (Q \times \Sigma^{|} \times Q)$ is the transition relation, $I \subseteq Q$ is a set of initial states, and $F \subseteq Q$ is a set of final states. A transition $(q, \texttt{<}a, q', x)$ is a push-transition where $x$ is pushed onto the stack and the current state changes from $q$ to $q'$. A transition $(q, a\texttt{>}, x, q')$ is a pop-transition where $x$ is read from the top of the stack and popped, and the current state changes from $q$ to $q'$. Lastly, we say that $(q, a, q')$ is a neutral transition if $a \in \Sigma^{|}$, where there is no stack operation. A stack is a finite sequence $\sigma$ over $\Gamma$ where the top of the stack is the first symbol of $\sigma$.
For a nested word $w = s_1 \cdots s_n$ in $\Sigma^{<*>}$, a run of $A$ on $w$ is a sequence $\rho = (q_1, \sigma_1) \xrightarrow{s_1} \cdots \xrightarrow{s_n} (q_{n+1}, \sigma_{n+1})$, where each $q_i \in Q$, $\sigma_i \in \Gamma^*$, $q_1 \in I$, $\sigma_1 = \varepsilon$, and for every $i \in [1, n]$ the following holds: (1) if $s_i \in \Sigma^{<}$, then there is $x \in \Gamma$ such that $(q_i, s_i, q_{i+1}, x) \in \Delta$ and $\sigma_{i+1} = x\sigma_i$, (2) if $s_i \in \Sigma^{>}$, then there is $x \in \Gamma$ such that $(q_i, s_i, x, q_{i+1}) \in \Delta$ and $\sigma_i = x\sigma_{i+1}$, and (3) if $s_i \in \Sigma^{|}$, then $(q_i, s_i, q_{i+1}) \in \Delta$ and $\sigma_{i+1} = \sigma_i$. A run $\rho$ like above is accepting if $q_{n+1} \in F$. A nested word $w \in \Sigma^{<*>}$ is accepted by a VPA $A$ if there is an accepting run of $A$ on $w$. The language $L(A)$ is the set of nested words accepted by $A$. Note that on a nested word $w$, if $\rho$ is an accepting run of $A$ on $w$, then $\sigma_{n+1} = \varepsilon$. A set of nested words $L \subseteq \Sigma^{<*>}$ is called a visibly pushdown language if there exists a VPA $A$ such that $L = L(A)$.

A VPA $A = (Q, I, \Gamma, \delta, F)$ is said to be deterministic if $|I| = 1$ and $\delta$ is a function in $(Q \times \Sigma^{<} \to Q \times \Gamma) \cup (Q \times \Sigma^{>} \times \Gamma \to Q) \cup (Q \times \Sigma^{|} \to Q)$. We also say that $A$ is unambiguous if, for every $w \in L(A)$, there exists exactly one accepting run of $A$ on $w$. In [4], it is shown that for every VPA there exists an equivalent deterministic VPA of at most exponential size.

Enumeration with one-pass preprocessing and output-linear delay. As is common in the enumeration algorithms literature [10, 17, 35], for our computational model we use Random Access Machines (RAM) with uniform cost measure, and addition and subtraction as basic operations [1]. We assume that a RAM has read-only input registers where the machine places the input, read-write work registers where it does the computation, and write-only output registers where it gives the output (i.e., the enumeration of the results). As our gold standard for efficiency we consider the notion of output-linear delay defined in [23]. This notion is a refinement of the definition of constant-delay [35] or linear-delay [17] enumeration that better fits our purpose.
We also add the restriction of making one pass over the input. For this, we adopt the setting of relations to represent enumeration problems [29, 9] and separate the input into two components, query and data, in order to formalize the notion of one pass over the data.

Let $\Omega$ be an alphabet. An enumeration problem is a relation $R \subseteq (\Omega^* \times \Omega^*) \times \Omega^*$. Here, for each pair $((q, x), y) \in R$ we view $(q, x)$ as the input of the problem and $y$ as a possible output for $(q, x)$. Furthermore, we call $q$ the query and $x$ the data. For an instance $(q, x)$ we define the set $\llbracket q \rrbracket_R(x) = \{y \mid ((q, x), y) \in R\}$ of all outputs of evaluating $q$ over $x$. As is standard in this framework [29], we assume that $R$ is a p-relation, which means that there exists a polynomial $p$ such that, for every $y \in \llbracket q \rrbracket_R(x)$, it holds that $|y| \leq p(|q| + |x|)$.

To formalize the notion of one-pass algorithm, we also need to restrict the access to the data $x$ of an instance $(q, x)$. Specifically, we assume the existence of a method yield[x] such that, if $x = a_1 \ldots a_n$, then the first call of yield[x] returns $a_1$, the $(i+1)$-th call retrieves $a_{i+1}$ after $i$ calls to yield[x], and the $(n+1)$-th call outputs EOF, which is a special symbol that marks the end of $x$. Note that the yield method does not give the length of the input in advance and this is known only after the last call to yield.

We say that $E$ is an enumeration algorithm with one-pass preprocessing for $R$ if $E$ runs in two phases such that for every input $(q, x) \in \Omega^* \times \Omega^*$:

The first phase, called the preprocessing phase, receives $q$ as input and accesses $x$ through the method yield[x]. This phase does not produce output but may prepare data structures for use in the next phase.

The second phase, called the enumeration phase, occurs immediately after the last call to yield[x] outputs EOF. During this phase the algorithm: (1) writes $\# y_1 \# y_2 \# \cdots \# y_m \#$ to the output registers, where $\#$ is a distinct separator symbol not contained in $\Omega$ and $y_1, y_2, \ldots, y_m$ is an enumeration (without repetitions) of the set $\llbracket q \rrbracket_R(x)$, (2) writes the first $\#$ as soon as the enumeration phase starts, and (3) stops immediately after writing the last $\#$.

The purpose of separating $E$'s operation into a preprocessing and an enumeration phase is to be able to make an output-sensitive analysis of $E$'s complexity. We say that $E$ has update-time $f : \mathbb{N} \to \mathbb{N}$ if the number of instructions that $E$ executes between two consecutive calls to yield[x] on an input $(q, x)$ is at most $O(f(|q|))$. In particular, the total time of the preprocessing phase is at most $O(f(|q|) \cdot |x|)$. Here we assume that the RAM can store any element of $\Omega$ in a single register and that it can operate on each register in constant time.

For the enumeration phase, we measure the delay between two outputs as follows. For an input $x \in \Omega^*$, let $\# y_1 \# y_2 \# \cdots \# y_m \#$ be the output of the algorithm during the enumeration phase. Furthermore, let $\mathrm{time}_i(x)$ be the time in the enumeration phase when the algorithm writes the $i$-th $\#$ when running on $x$, for $i \leq m + 1$. Define $\mathrm{delay}_i(x) = \mathrm{time}_{i+1}(x) - \mathrm{time}_i(x)$ for $i \leq m$. Then we say that $E$ has output-linear delay if there exists a constant $k$ such that for every input $x \in \Omega^*$ it holds that $\mathrm{delay}_i(x) \leq k \cdot |y_i|$ for every $i \leq m$, that is, the number of instructions executed by $E$ between the time that the $i$-th and the $(i+1)$-th $\#$ are written is linear in the size of $y_i$.

Given an enumeration problem $R$, we say that $R$ can be solved with one-pass preprocessing, update-time $f$, and output-linear delay if there exists such an algorithm as above. Notice that the enumeration algorithm defined above is a formal refinement of the algorithmic notions used in the literature of dynamic query evaluation (see [13, 28, 27]). Indeed, given that an enumeration algorithm with one-pass preprocessing does not know when the last call to yield will arrive, the algorithm must be prepared to produce all the outputs at any moment in time.
In other words, this makes these algorithms suitable for a streaming evaluation setting [28, 27], although we present them here in an offline setting.
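The definitions of this section can be made concrete with a small Python sketch. It is an illustration only and not the algorithm of this paper: it reads the word through a hypothetical `make_yield` helper that mimics yield[x], and it decides acceptance of a nondeterministic VPA by tracking every reachable configuration (state, stack), which may take exponential space. The names `EOF`, `make_yield`, `accepts`, the dictionary layout, and the example automaton `MATCHING` are assumptions introduced for the example.

```python
EOF = object()  # stand-in for the special end-of-input symbol


def make_yield(word):
    """Mimic yield[x]: each call returns the next symbol, then EOF."""
    it = iter(word)
    return lambda: next(it, EOF)


def accepts(vpa, next_symbol):
    """One-pass acceptance test for a nondeterministic VPA.

    vpa is a dict with keys: opens, closes (sets of symbols),
    push = {(q, s, q2, x)}, pop = {(q, s, x, q2)}, neutral = {(q, s, q2)},
    initial, final (sets of states). Stacks are tuples, top first.
    """
    configs = {(q, ()) for q in vpa["initial"]}
    while True:
        s = next_symbol()
        if s is EOF:
            break
        step = set()
        for q, stack in configs:
            if s in vpa["opens"]:
                for (p, a, q2, x) in vpa["push"]:
                    if (p, a) == (q, s):
                        step.add((q2, (x,) + stack))
            elif s in vpa["closes"]:
                for (p, a, x, q2) in vpa["pop"]:
                    if (p, a) == (q, s) and stack and stack[0] == x:
                        step.add((q2, stack[1:]))
            else:
                for (p, a, q2) in vpa["neutral"]:
                    if (p, a) == (q, s):
                        step.add((q2, stack))
        configs = step
    # On well-nested words every run ends with an empty stack; requiring it
    # here also rejects inputs with unmatched open symbols.
    return any(q in vpa["final"] and not stack for q, stack in configs)


# Example: tags must match, i.e. "<a" is closed by "a>" and "<b" by "b>".
MATCHING = {
    "opens": {"<a", "<b"},
    "closes": {"a>", "b>"},
    "push": {("q", "<a", "q", "A"), ("q", "<b", "q", "B")},
    "pop": {("q", "a>", "A", "q"), ("q", "b>", "B", "q")},
    "neutral": {("q", "c", "q")},
    "initial": {"q"},
    "final": {"q"},
}

print(accepts(MATCHING, make_yield(["<a", "c", "<b", "b>", "a>"])))  # True
print(accepts(MATCHING, make_yield(["<a", "b>"])))                   # False
```

The loop never asks for the length of the input and touches each symbol once, which is the access pattern that the yield method enforces.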

3 Visibly pushdown transducers and main result

In this section, we present the definition of visibly pushdown transducers [22], which extend visibly pushdown automata to produce output. After the setting is formalized, we state the main result of the paper.

A visibly pushdown transducer (VPT) is a tuple $T = (Q, \Sigma, \Gamma, \Omega, \Delta, I, F)$ where $Q$, $\Sigma$, $\Gamma$, $I$, and $F$ are the same as for VPA, $\Omega$ is the output alphabet with $\varepsilon \notin \Omega$, and $\Delta \subseteq (Q \times \Sigma^{<} \times (\Omega \cup \{\varepsilon\}) \times Q \times \Gamma) \cup (Q \times \Sigma^{>} \times (\Omega \cup \{\varepsilon\}) \times \Gamma \times Q) \cup (Q \times \Sigma^{|} \times (\Omega \cup \{\varepsilon\}) \times Q)$ is the transition relation. As usual for transducers, a symbol $s \in \Sigma^{<} \cup \Sigma^{>} \cup \Sigma^{|}$ is an input symbol that the machine reads and $o \in \Omega \cup \{\varepsilon\}$ is a symbol that the machine prints on an output tape. Furthermore, $\varepsilon$ represents that no symbol is printed for that transition. A run $\rho$ of $T$ over a nested word $w = s_1 s_2 \cdots s_n \in \Sigma^{<*>}$ and output sequence $\mu = o_1 o_2 \cdots o_n \in (\Omega \cup \{\varepsilon\})^*$ is a sequence of the form $\rho = (q_1, \sigma_1) \xrightarrow{s_1/o_1} \cdots \xrightarrow{s_n/o_n} (q_{n+1}, \sigma_{n+1})$ where $q_i \in Q$, $\sigma_i \in \Gamma^*$, $q_1 \in I$, $\sigma_1 = \varepsilon$, and for every $i \in [1, n]$ the following holds: (1) if $s_i \in \Sigma^{<}$, then $(q_i, s_i, o_i, q_{i+1}, x) \in \Delta$ for some $x \in \Gamma$ and $\sigma_{i+1} = x\sigma_i$, (2) if $s_i \in \Sigma^{>}$, then $(q_i, s_i, o_i, x, q_{i+1}) \in \Delta$ for some $x \in \Gamma$ and $\sigma_i = x\sigma_{i+1}$, and (3) if $s_i \in \Sigma^{|}$, then $(q_i, s_i, o_i, q_{i+1}) \in \Delta$ and $\sigma_{i+1} = \sigma_i$. We say that the run is accepting if $q_{n+1} \in F$. We call a pair $(q_i, \sigma_i)$ a configuration of $\rho$. Finally, the output of an accepting run $\rho$ is defined as $\mathrm{out}(\rho) = \mathrm{out}(o_1, 1) \cdot \ldots \cdot \mathrm{out}(o_n, n)$, where $\mathrm{out}(o, i) = \varepsilon$ when $o = \varepsilon$ and $\mathrm{out}(o, i) = (o, i)$ otherwise. Note that in $\mu = o_1 \cdots o_n$ we use $\varepsilon$ as a symbol, and in $\mathrm{out}(\rho)$ we use $\varepsilon$ as the empty string. Given a VPT $T$ and a nested word $w \in \Sigma^{<*>}$, we define the set $\llbracket T \rrbracket(w)$ of all outputs of $T$ over $w$ as $\llbracket T \rrbracket(w) = \{\mathrm{out}(\rho) \mid \rho \text{ is an accepting run of } T \text{ over } w\}$.

Strictly speaking, our definition of VPT is not the same as the one studied in [22]. In our definition of VPT each output element is a tuple $(o, i)$ where $o$ is the symbol and $i$ is the output position, whereas for a standard VPT [22] an output element is just the symbol $o$. Although our algorithm could be extended to standard VPTs, for application purposes it is better to have a richer output such as the one presented here (see Section 6).
In this paper, we say that a VPT $T = (Q, \Sigma, \Gamma, \Omega, \Delta, I, F)$ is input/output deterministic (I/O-deterministic for short) if $|I| = 1$ and $\Delta$ is a partial function of the form $\Delta : (Q \times \Sigma^{<} \times \Omega \to Q \times \Gamma) \cup (Q \times \Sigma^{>} \times \Omega \times \Gamma \to Q) \cup (Q \times \Sigma^{|} \times \Omega \to Q)$. On the other hand, we say that $T$ is input/output unambiguous (I/O-unambiguous for short) if for every nested word $w \in \Sigma^{<*>}$ and every $\mu \in \llbracket T \rrbracket(w)$ there is exactly one accepting run $\rho$ of $T$ over $w$ such that $\mu = \mathrm{out}(\rho)$. Notice that an I/O-deterministic VPT is also I/O-unambiguous. The definition of I/O-deterministic is in line with the notion of I/O-deterministic variable automata of [23], and I/O-unambiguous is a generalization of this idea that is enough for the purpose of our enumeration algorithm. Actually, one can easily show that for every VPT $T$ there exists an equivalent I/O-deterministic VPT and, therefore, an equivalent I/O-unambiguous VPT.
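To illustrate the semantics of a VPT, the following Python sketch computes the set of outputs by exhaustively exploring all runs. It is a brute-force reference under stated assumptions, not the enumeration algorithm of this paper: it holds the whole word in memory and may take exponential time. The function `vpt_outputs`, the dictionary layout, the use of `None` for an ε output, and the example transducer `PICK_ONE` are hypothetical names chosen for the example.

```python
def vpt_outputs(vpt, word):
    """Return the set of outputs of all accepting runs of vpt over word.

    Transitions in vpt["delta"] are tuples whose shape depends on the symbol:
      push (q, s, o, q2, x), pop (q, s, o, x, q2), neutral (q, s, o, q2),
    where o is an output symbol or None (standing for ε).
    Each output is a tuple of pairs (symbol, position), positions from 1.
    """
    results = set()

    def explore(i, q, stack, out):
        if i == len(word):
            if q in vpt["final"]:
                results.add(out)
            return
        s, pos = word[i], i + 1
        for t in vpt["delta"]:
            if (t[0], t[1]) != (q, s):
                continue
            o = t[2]
            new_out = out if o is None else out + ((o, pos),)
            if s in vpt["opens"]:
                _, _, _, q2, x = t
                explore(i + 1, q2, (x,) + stack, new_out)
            elif s in vpt["closes"]:
                _, _, _, x, q2 = t
                if stack and stack[0] == x:
                    explore(i + 1, q2, stack[1:], new_out)
            else:
                _, _, _, q2 = t
                explore(i + 1, q2, stack, new_out)

    for q0 in vpt["initial"]:
        explore(0, q0, (), ())
    return results


# Example: mark exactly one neutral letter "c" with the output symbol "x".
PICK_ONE = {
    "opens": {"<a"},
    "closes": {"a>"},
    "delta": {
        ("q0", "<a", None, "q0", "A"), ("q1", "<a", None, "q1", "A"),
        ("q0", "a>", None, "A", "q0"), ("q1", "a>", None, "A", "q1"),
        ("q0", "c", None, "q0"), ("q0", "c", "x", "q1"), ("q1", "c", None, "q1"),
    },
    "initial": {"q0"},
    "final": {"q1"},
}

print(sorted(vpt_outputs(PICK_ONE, ["<a", "c", "c", "a>"])))
# [(('x', 2),), (('x', 3),)]
```

In this example each output is produced by exactly one accepting run, so `PICK_ONE` is I/O-unambiguous in the sense defined above.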

Lemma 1. For every visibly pushdown transducer $T$ there exists an I/O-deterministic visibly pushdown transducer $T'$ such that $\llbracket T \rrbracket(w) = \llbracket T' \rrbracket(w)$ for every $w \in \Sigma^{<*>}$.

In this paper, we are interested in the following problem for VPT. Let $C$ be a class of VPT (e.g. I/O-deterministic VPT).

Problem: EnumVPT[C] Input: a VPT T ∈ C and a nested word w ∈ Σ<*>. Output: Enumerate ⟦T⟧(w).

The main result of the paper is that for the class of I/O-unambiguous VPT, this enumeration problem can be solved in one pass and with strong guarantees of efficiency.

Theorem 2. The enumeration problem EnumVPT for the class of I/O-unambiguous VPT can be solved with one-pass preprocessing, update-time $|Q|^2|\Delta|$, and output-linear delay. For the general class of VPT, this problem can be solved with one-pass preprocessing, update-time $2^{|Q|^2|\Delta|}$, and output-linear delay.

The result for the class of all VPT is a consequence of Lemma 1 and the enumeration algorithm for I/O-unambiguous VPT. In the next sections we present this algorithm, starting by defining a general data structure that we use to store the outputs.

4 Enumerable compact sets: a data structure for output-linear delay

This section presents a data structure called Enumerable Compact Set (ECS), which is at the heart of our enumeration algorithm for VPT. This data structure is strongly inspired by the work in [6, 7]. Indeed, ECS can be considered a refinement of the d-DNNF circuits used in [6] or of the set circuits used in [7]. The main difference between ECS and the knowledge-compilation approach is that we see a “circuit” as a data structure and treat it as such. With this, we can avoid using the heavy notation of circuits, applying special operations to convert them, or computing indices to enumerate the output. Instead, we use ECS to simplify the presentation and apply all these reductions and optimizations at once, improving the conceptual understanding of the structure. Some of the conceptual contributions of this approach are to understand that we need a succinct data structure to store the outputs, that we need to retrieve these outputs with output-linear delay, and that we need the data structure to be fully-persistent [18], that is, it always preserves the previous version of itself when it is modified. In fact, this last property is crucial for our preprocessing algorithm to manage different runs that share part of the same output. In the following, we present ECS step by step, to use it later in the next section.

Let $\Sigma$ be a (possibly infinite) alphabet. We define an Enumerable Compact Set (ECS) as a tuple $D = (\Sigma, V, I, \ell, r, \lambda)$ such that $V$ is a finite set of nodes, $I \subseteq V$ is the set of inner nodes, $\ell : I \to V$ and $r : I \to V$ are the left and right functions, and $\lambda : V \to \Sigma \cup \{\cup, \odot\}$ is a label function such that $\lambda(v) \in \{\cup, \odot\}$ if, and only if, $v \in I$. Further, we assume that the directed graph $(V, \{(v, \ell(v)), (v, r(v)) \mid v \in I\})$ is acyclic. We call the nodes in $I$ inner nodes and the nodes in $V \setminus I$ leaves. Furthermore, for $v \in I$ we say that $v$ is a product node if $\lambda(v) = \odot$, and a union node if $\lambda(v) = \cup$. We define the size of $D$ as $|D| = |V|$.
For each node $v$ in $D$, we associate a set of words $L_D(v)$ recursively as follows: (1) $L_D(v) = \{a\}$ whenever $\lambda(v) = a \in \Sigma$, (2) $L_D(v) = L_D(\ell(v)) \cup L_D(r(v))$ whenever $\lambda(v) = \cup$, and (3) $L_D(v) = L_D(\ell(v)) \cdot L_D(r(v))$ whenever $\lambda(v) = \odot$, where $L_1 \cdot L_2 = \{w_1 \cdot w_2 \mid w_1 \in L_1 \text{ and } w_2 \in L_2\}$.

Note that $|L_D(v)|$ can be of exponential size with respect to $|D|$. For this reason we say that $D$ is a compact representation of the set $L_D(v)$ for any $v \in V$. Even though the represented set may be huge, the goal is to enumerate all its elements efficiently. In other words, we consider the following problem:

Problem: Enum-ECS Input: An ECS $D = (\Sigma, V, I, \ell, r, \lambda)$ and $v \in V$.

Output: Enumerate the set $L_D(v)$ without repetitions.

and we want to solve Enum-ECS with output-linear delay. To reach this goal we need to impose two additional restrictions on $D$. The first restriction is to guarantee that $D$ is not ambiguous, namely, for each $w \in L_D(v)$ there is at most one way to retrieve $w$ from $D$. Formally, we say that $D$ is unambiguous if $D$ satisfies the following two properties: (1) for every union node $v$ it holds that $L_D(\ell(v))$ and $L_D(r(v))$ are disjoint, and (2) for every product node $v$ and for every $w \in L_D(v)$, there exists a unique way to decompose $w = w_1 \cdot w_2$ such that $w_1 \in L_D(\ell(v))$ and $w_2 \in L_D(r(v))$. Then, if $D$ is unambiguous, we can guarantee that by enumerating elements of $L_D(v)$ there will be no duplicates, given that there is no way of producing the same element in two different ways.

The second restriction to solve Enum-ECS with output-linear delay is to guarantee that, for each node $v$, there exists an output or, more specifically, a symbol of an output close to $v$. This is not always the case for an ECS. For example, consider a balanced tree of union nodes where all the outputs are at the leaves. Then one has to traverse a logarithmic number of nodes from the root to reach the first output. For this reason, we define the notion of k-bounded ECS. Given an ECS $D$, define the (left) output-depth of a node $v \in V$, denoted by $\mathrm{odepth}_D(v)$, recursively as follows: $\mathrm{odepth}_D(v) = 0$ whenever $\lambda(v) \in \Sigma$ or $\lambda(v) = \odot$, and $\mathrm{odepth}_D(v) = \mathrm{odepth}_D(\ell(v)) + 1$ whenever $\lambda(v) = \cup$. Then, for a fixed $k \in \mathbb{N}$ we say that $D$ is k-bounded if $\mathrm{odepth}_D(v) \leq k$ for all $v \in V$.

Given the definition of output-depth, we say that $v$ is an output node of $D$ if $v$ is a leaf or a product node. Note that if $D$ only has output nodes then it is 0-bounded, and one can easily check that $L_D(v)$ can be enumerated with output-linear delay. Indeed, for a fixed $k$ the same happens with every unambiguous and k-bounded ECS.

Proposition 3. Fix $k \in \mathbb{N}$. Let $D = (\Sigma, V, I, \ell, r, \lambda)$ be an unambiguous and k-bounded ECS. Then the set $L_D(v)$ can be enumerated with output-linear delay for any $v \in V$.
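The definitions of $L_D(v)$ and of output-depth can be sketched in Python as follows. This is an illustration of the semantics only: the recursive generator `words` enumerates $L_D(v)$ without repetitions on an unambiguous ECS, but it makes no attempt to meet the output-linear delay bound of Proposition 3. The class `Node` and the functions `words` and `odepth` are hypothetical names, and the set of nodes $V$ is left implicit as the nodes reachable from a given node.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Tuple


@dataclass(frozen=True)
class Node:
    kind: str                       # "leaf", "union", or "prod"
    symbol: Optional[str] = None    # label in the alphabet, leaves only
    left: Optional["Node"] = None   # left child, inner nodes only
    right: Optional["Node"] = None  # right child, inner nodes only


def words(v: Node) -> Iterator[Tuple[str, ...]]:
    """Enumerate the set of words associated with node v."""
    if v.kind == "leaf":
        yield (v.symbol,)
    elif v.kind == "union":
        yield from words(v.left)
        yield from words(v.right)
    else:  # product node: concatenate every left word with every right word
        for w1 in words(v.left):
            for w2 in words(v.right):
                yield w1 + w2


def odepth(v: Node) -> int:
    """(Left) output-depth: number of union nodes before an output node."""
    depth = 0
    while v.kind == "union":
        v, depth = v.left, depth + 1
    return depth


a, b = Node("leaf", "a"), Node("leaf", "b")
u = Node("union", left=a, right=b)     # represents {a, b}
p = Node("prod", left=u, right=u)      # represents {aa, ab, ba, bb}; u is shared

print(["".join(w) for w in words(p)])  # ['aa', 'ab', 'ba', 'bb']
print(odepth(u), odepth(p))            # 1 0
```

Sharing the node `u` is what makes the representation compact: a chain of $n$ product nodes over `u` uses a linear number of nodes but represents $2^n$ words.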

It is important to notice that the enumeration algorithm of the previous proposition does not require any preprocessing over $D$, and the main idea is to perform some sort of DFS traversal over the nodes. Given that $D$ is unambiguous, we know that this procedure will not enumerate any output twice. Moreover, given that $D$ is k-bounded, we also know that within a fixed number of steps it will find something to output, which is necessary to bound the delay. Therefore, from now on we assume that all ECS are unambiguous and k-bounded for some fixed $k$.

The next step is to provide a set of operations that allow extending an ECS $D$ while maintaining k-boundedness. Furthermore, we require these operations to be fully-persistent [18] in order to always keep the previous versions of the data structure. To satisfy the last requirement, the strategy consists of extending $D$ to $D'$ for each operation, by always adding new nodes and leaving the previous nodes untouched. Then $L_{D'}(v) = L_D(v)$ for each node $v \in V$ and the structure will be fully-persistent. More precisely, fix an ECS $D = (\Sigma, V, I, \ell, r, \lambda)$. Then for any $a \in \Sigma$ and $v_1, \ldots, v_4 \in V$, we define the operations:

$(D', v') := \mathsf{add}(D, a)$    $(D', v') := \mathsf{prod}(D, v_1, v_2)$    $(D', v') := \mathsf{union}(D, v_3, v_4)$

such that $D' = (\Sigma, V', I', \ell', r', \lambda')$ is an extension of $D$ (i.e. $\mathrm{obj} \subseteq \mathrm{obj}'$ for every $\mathrm{obj} \in \{V, I, \ell, r, \lambda\}$) and $v' \in V' \setminus V$ is a fresh node such that $L_{D'}(v') = \{a\}$, $L_{D'}(v') = L_D(v_1) \cdot L_D(v_2)$, and $L_{D'}(v') = L_D(v_3) \cup L_D(v_4)$, respectively. Here we assume that union and prod respect properties (1) and (2) of an unambiguous ECS, that is, $L_D(v_3)$ and $L_D(v_4)$ are disjoint and, for every $w \in L_D(v_1) \cdot L_D(v_2)$, there exists a unique way to decompose $w = w_1 \cdot w_2$ such that $w_1 \in L_D(v_1)$ and $w_2 \in L_D(v_2)$.

Next, we show how to implement each operation. In fact, the cases of add and prod are straightforward. For $(D', v') := \mathsf{add}(D, a)$ define $V' := V \cup \{v'\}$, $I' := I$, and $\lambda'(v') := a$. One can easily check that $L_{D'}(v') = \{a\}$ as expected. For $(D', v') := \mathsf{prod}(D, v_1, v_2)$ we proceed in a similar way: define $V' := V \cup \{v'\}$, $I' := I \cup \{v'\}$, $\ell'(v') := v_1$, $r'(v') := v_2$, and $\lambda'(v') := \odot$. Then $L_{D'}(v') = L_D(v_1) \cdot L_D(v_2)$. Furthermore, one can check that each operation takes constant time, $D'$ is a valid ECS (i.e. unambiguous and k-bounded), and the operations are fully-persistent (i.e. the previous version $D$ is available).

To define the union, we need to be a bit more careful. For a node $v \in V$, we say that $v$ is safe if (1) $\mathrm{odepth}_D(v) \leq 1$, and (2) if $\mathrm{odepth}_D(v) = 1$, then $\mathrm{odepth}_D(r(v)) \leq 1$. In other words, $v$ is safe if $v$ is an output node, or its left child is an output node and its right child is either an output node or has output-depth 1. Note that, by definition, leaves and product nodes are safe nodes and, thus, the add and prod operations always produce safe nodes. The trick then is to show that, if $v_3$ and $v_4$ are safe nodes, then we can implement $(D', v') := \mathsf{union}(D, v_3, v_4)$ and produce a safe node $v'$. For this, define $(D', v') := \mathsf{union}(D, v_3, v_4)$ as follows. If $v_3$ or $v_4$ is an output node, then $V' := V \cup \{v'\}$, $I' := I \cup \{v'\}$, and $\lambda'(v') := \cup$. Moreover, if $v_3$ is the output node, then $\ell'(v') := v_3$ and $r'(v') := v_4$. Otherwise, we connect $\ell'(v') := v_4$ and $r'(v') := v_3$. If $v_3$ and $v_4$ are not output nodes (i.e. both are union nodes), then $V' := V \cup \{v', v'', v^*\}$, $I' := I \cup \{v', v'', v^*\}$, $\ell'(v') := \ell(v_3)$, $r'(v') := v''$, and $\lambda'(v') := \cup$; $\ell'(v'') := \ell(v_4)$, $r'(v'') := v^*$, and $\lambda'(v'') := \cup$; $\ell'(v^*) := r(v_3)$, $r'(v^*) := r(v_4)$, and $\lambda'(v^*) := \cup$.

Figure 1 Gadget for union(D, v3, v4) over the nodes v′, v″, v*, v3, v4 and the children ℓ(v3), r(v3), ℓ(v4), r(v4). Nodes v′, v″, v*, v3 and v4 are labeled as ∪. Dashed and solid lines denote the mappings in ℓ′ and r′ respectively.

This gadget¹ is depicted in Figure 1. Interestingly, this construction has several properties. First, one can easily check that $L_{D'}(v') = L_D(v_3) \cup L_D(v_4)$, and then the semantics is well-defined. Second, union can be computed in constant time in $|D|$ given that we only need to add three fresh nodes (i.e. $v'$, $v''$, and $v^*$), and the operation is fully-persistent given that we connect them to previous nodes without modifying $D$. Furthermore, the produced node $v'$ is safe in $D'$, although nodes $v''$ and $v^*$ are not necessarily safe. Finally, $D'$ is 2-bounded whenever $D$ is 2-bounded. This is straightforward to see for the first case, when $v_3$ or $v_4$ is an output node.
For the second case (i.e. the gadget of Figure 1), we have to notice that $v_3$ and $v_4$ are safe, therefore $\ell(v_3)$ and $\ell(v_4)$ are output nodes, and then $\mathrm{odepth}_{D'}(v') = \mathrm{odepth}_{D'}(v'') = 1$. Further, given that $v_3$ and $v_4$ are safe, we know that $\mathrm{odepth}_D(r(v_3)) \leq 1$ and $\mathrm{odepth}_D(r(v_4)) \leq 1$, so $\mathrm{odepth}_{D'}(v^*) \leq 2$. Given that the output-depths of all fresh nodes in $D'$ are bounded by 2 and $D$ is 2-bounded, we conclude that $D'$ is 2-bounded as well.

By the previous discussion, if we start with an ECS $D$ which is 2-bounded (or empty) and we apply the add, prod, and union operators between safe nodes (which also produce safe nodes), then the result is 2-bounded as well. Finally, by Proposition 3, the result can be enumerated with output-linear delay.
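The three operations described above can be sketched in Python with immutable nodes, so that persistence comes for free: an operation only creates fresh nodes that point to existing ones. This is a minimal sketch under assumptions, not the authors' code: the ECS $D$ is implicit (it is the set of nodes created so far), and the names `Node`, `add`, `prod`, `union`, `is_safe`, and `words` are hypothetical. The `union` function follows the two cases of the construction, including the three-node gadget.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    kind: str                       # "leaf", "union", or "prod"
    symbol: Optional[str] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def odepth(v):
    depth = 0
    while v.kind == "union":
        v, depth = v.left, depth + 1
    return depth


def is_output(v):
    return v.kind != "union"


def is_safe(v):
    d = odepth(v)
    return d == 0 or (d == 1 and odepth(v.right) <= 1)


def add(a):
    return Node("leaf", symbol=a)


def prod(v1, v2):
    return Node("prod", left=v1, right=v2)


def union(v3, v4):
    """Union of two safe nodes; returns a safe node and modifies nothing."""
    if not (is_safe(v3) and is_safe(v4)):
        raise ValueError("union expects safe nodes")
    if is_output(v3):
        return Node("union", left=v3, right=v4)
    if is_output(v4):
        return Node("union", left=v4, right=v3)
    # Both are union nodes: build the gadget with three fresh nodes.
    v_star = Node("union", left=v3.right, right=v4.right)
    v_second = Node("union", left=v4.left, right=v_star)
    return Node("union", left=v3.left, right=v_second)


def words(v):
    """Set of words represented by v (symbols are strings)."""
    if v.kind == "leaf":
        return {v.symbol}
    if v.kind == "union":
        return words(v.left) | words(v.right)
    return {x + y for x in words(v.left) for y in words(v.right)}


u1 = union(add("a"), add("b"))
u2 = union(add("c"), add("d"))
u = union(u1, u2)                      # both arguments are union nodes

print(sorted(words(u)), is_safe(u))    # ['a', 'b', 'c', 'd'] True
print(sorted(words(u1)))               # ['a', 'b']  (u1 is unchanged)
```

Each operation allocates at most three nodes, which matches the constant-time claim, and the check in `is_safe` mirrors the two conditions in the definition of a safe node.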

Theorem 4. The operations add, prod, and union take constant time and are fully-persistent. Furthermore, if we start from an empty ECS $D$ and apply add, prod, and union over safe nodes, the partial results $(D', v')$ satisfy that $v'$ is always a safe node and the set $L_{D'}(v)$ can be enumerated with output-linear delay for every node $v$.

1 This trick is used in [6] for computing an index over a circuit and it is known as Louis' trick.

It is important to remark that restricting these operations only over safe nodes is a mild condition. Given that we will usually start from an empty ECS and apply these operations over previously returned nodes, the whole algorithm will always use safe nodes during its computation, satisfying the conditions of Theorem 4. For technical reasons, our algorithm of the next section needs a slight extension of ECS by allowing leaves that produce the empty string . Let ε ∈~ Σ be a symbol representing the empty string (i.e.w ⋅ε = ε⋅w = w). We define an enumerable compact set with ε (called ε-ECS) as a tuple D = (Σ, V, I, `, r, λ) defined identically to an ECS except that λ ∶ V → Σ ∪ {∪, ⊙, ε} and λ(v) ∈ {∪, ⊙} if, and only, if v ∈ I. Also, if λ(v) = ε, then L (v) = {}. The unambiguous and k-boundedness restrictions imposed on ECS also apply to this version. However, to D support the prod and union operations in constant time and to maintain the k-boundedness invariant, we need to slightly extend the notion of safe nodes (called -safe) and the gadgets for prod and union. Given space restrictions, we show these extensions in the appendix and just state here the main result, that will be used in the next section. I Theorem 5. The operations add, prod, and union over ε-ECS take constant time and are fully-persistent. Furthermore, if we start from an empty ε-ECS D and apply add, prod, and union over -safe nodes, the partial results (D , v ) satisfies that v is always an -safe node and the set L ′ (v) can be enumerated with output-linear′ ′ delay for′ every node v.

5 Evaluating visibly pushdown transducers with output-linear delay

The goal of this section is to describe an algorithm that takes an I/O-unambiguous VPT T plus a well-nested word w, and enumerates the set ⟦T⟧(w) with output-linear delay after a one-pass preprocessing phase. For this, we divide the presentation of the algorithm into two parts. The first part explains the determinization of a VPA, which is instrumental in understanding our preprocessing algorithm. The second part gives the algorithm and proves its correctness.

For the sake of simplification, in this section we present the algorithm and definitions without neutral letters, that is, the structured alphabet is $\Sigma = (\Sigma^{<}, \Sigma^{>})$. Indeed, a neutral symbol a can be represented as a pair of an open symbol followed by a close symbol and, therefore, it is straightforward to extend the techniques of this section to consider neutral symbols. Given this assumption, from now on we use a for denoting any symbol in $\Sigma^{<} \cup \Sigma^{>}$.

Determinization of visibly pushdown automata. A significant result in Alur and Madhusudan’s paper [4], which introduced VPA, is that one can always determinize them. We provide here an alternative proof of this result that requires a somewhat more direct construction. This determinization process is behind our preprocessing algorithm and serves to give some crucial notions of how it works. We start by providing the determinization construction, introducing some useful notation, and then giving some intuition.

Consider the following determinization of a non-deterministic VPA. For a VPA $A = (Q, \Sigma, \Gamma, \Delta, I, F)$, we define a deterministic VPA $A^{det} = (Q^{det}, q_0^{det}, \Gamma^{det}, \delta^{det}, F^{det})$ with state set $Q^{det} = 2^{Q \times Q}$ and stack symbol set $\Gamma^{det} = 2^{Q \times \Gamma \times Q}$. The initial state is $q_0^{det} = \{(q, q) \mid q \in I\}$ and the set of final states is $F^{det} = \{S \in Q^{det} \mid S \cap (I \times F) \neq \emptyset\}$. Finally, the transition function $\delta^{det}$ is defined as follows:

if $a \in \Sigma^{<}$, then […]

if $a{>} \in \Sigma^{>}$, then $\delta^{det}(S, T, a{>}) = S'$ where:

$$S' = \{(p, q) \mid \exists p', q', x.\ (p, x, p') \in T \wedge (p', q') \in S \wedge (q', x, a{>}, q) \in \Delta\}$$
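The close-symbol rule of the determinization can be read as a join of three relations, and a short Python sketch may help to see it. The function names `det_open` and `det_close` are hypothetical. The close rule follows the formula for S′ stated in the text, with close transitions written as tuples (q′, x, a, q). The open rule is NOT taken from the text: it is an assumption inferred from the later description of OpenStep, and the tuple layout (p′, a, q, x) for open transitions is likewise assumed.

```python
def det_open(S, a, delta_open):
    """ASSUMED open-symbol rule (inferred from the OpenStep description).

    S is a set of state pairs (p, p1); delta_open is a set of tuples
    (p1, a, q, x), meaning: from p1, read open symbol a, go to q, push x.
    Returns the new state S_new and the pushed stack symbol T.
    """
    S_new, T = set(), set()
    for (p, p1) in S:
        for (src, sym, q, x) in delta_open:
            if src == p1 and sym == a:
                S_new.add((q, q))   # a fresh level starts at q
                T.add((p, x, q))    # remember how the level below was left
    return S_new, T


def det_close(S, T, a, delta_close):
    """Close-symbol rule as stated in the text.

    T is the popped stack symbol, a set of triples (p, x, p1);
    delta_close is a set of tuples (q1, x, a, q).
    """
    return {
        (p, q)
        for (p, x, p1) in T
        for (p2, q1) in S
        if p2 == p1
        for (q2, y, sym, q) in delta_close
        if q2 == q1 and y == x and sym == a
    }


# Tiny example: state 0 reads "<", pushes "x", and moves to state 1;
# state 1 reads ">", pops "x", and moves back to state 0.
S0 = {(0, 0)}
S1, T1 = det_open(S0, "<", {(0, "<", 1, "x")})
S2 = det_close(S1, T1, ">", {(1, "x", ">", 0)})
```

In this example the pair (0, 0) in `S2` records that some run over the whole current level starts and ends in state 0, which is the information the deterministic automaton needs to continue.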

Figure 2 Left: An example run of some VPA A at step k. Right: Illustration of two nondeterministic runs for some VPA A, as considered in the determinization process. (Labels appearing in the figure: Open, Close, (p, x, p′) ∈ T_k, (p′, q′) ∈ S_k, currlevel(k), lowerlevel(k).)

To understand the purpose of this construction, we first need to introduce some notation. Fix a well-nested word $w = a_1 a_2 \cdots a_n$. A span s of w is a pair [i, j⟩ of natural numbers i and j with 1 ≤ i ≤ j ≤ n + 1. We denote by w[i, j⟩ the subword $a_i \cdots a_{j-1}$ of w and, when i = j, we assume that w[i, j⟩ = ε. Intuitively, spans index w with intermediate positions, like

$${}_{1}\, a_1\, {}_{2}\, a_2\, {}_{3} \cdots {}_{n}\, a_n\, {}_{n+1}$$

where position i is between symbols $a_{i-1}$ and $a_i$. Then [i, j⟩ represents an interval {i, …, j} that captures the subword $a_i \cdots a_{j-1}$.

Now, we say that a span [i, j⟩ of w is well-nested if w[i, j⟩ is well-nested. Note that ε is well-nested, so [i, i⟩ is a well-nested span for every i. For a position k ∈ [1, n + 1], we define the current-level span of k, currlevel(k), as the well-nested span [j, k⟩ such that j = min{j′ ∣ [j′, k⟩ is well-nested}. Note that [k, k⟩ is always well-nested, so currlevel(k) is well defined. We also identify the lower-level span of k, lowerlevel(k), defined as lowerlevel(k) = currlevel(j − 1) = [i, j − 1⟩ whenever currlevel(k) = [j, k⟩ and j > 1. In contrast to currlevel(k), lowerlevel(k) is not always defined, given that it is “one level below” currlevel(k) and this level may not exist. More concretely, for currlevel(k) = [j, k⟩ and lowerlevel(k) = [i, j − 1⟩, these spans look as follows:

$$a_1\, a_2 \cdots a_{i-1}\ \overbrace{a_i \cdots a_{j-2}}^{\mathrm{lowerlevel}(k)}\ a_{j-1}\ \overbrace{a_j \cdots a_{k-1}}^{\mathrm{currlevel}(k)}\ \overset{\downarrow}{a_k} \cdots a_n$$

[…] (see Figure 2 (right) for a graphical description).
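The spans currlevel(k) and lowerlevel(k) defined above can be computed with a single left-to-right scan and a stack of level start positions. The sketch below is one possible way to do this, not a procedure taken from the paper: the function name `level_spans` and the predicate `is_open` are hypothetical, positions are 1-based as in the text, and a span [i, j⟩ is returned as the Python pair `(i, j)`.

```python
def level_spans(word, is_open, k):
    """Return (currlevel(k), lowerlevel(k)) for a position k in [1, n + 1].

    word is assumed to be well-nested; lowerlevel(k) is None when undefined.
    """
    starts = [1]  # start position of every level that is currently open
    for pos in range(1, k):            # read letters a_1 ... a_{k-1}
        if is_open(word[pos - 1]):
            starts.append(pos + 1)     # a new level begins right after a_pos
        else:
            starts.pop()               # the current level is closed
    j = starts[-1]
    curr = (j, k)
    # lowerlevel(k) = currlevel(j - 1) exists only when j > 1.
    lower = (starts[-2], j - 1) if j > 1 else None
    return curr, lower


w = "(()())"
spans_at_5 = level_spans(w, lambda c: c == "(", 5)
```

For the word `(()())` and k = 5, the scan has just read the second inner open symbol, so the current level is the empty span [5, 5⟩ and the level below is [2, 4⟩, which covers the first inner pair.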

Algorithm 1 The preprocessing phase of the enumeration algorithm for EnumVPT given an I/O-unambiguous VPT T = (Q, Σ, Γ, Ω, ∆, I, F) and a well-nested word w.

 1: procedure Preprocessing(T, w)
      […]
 9:       CloseStep(a, k)
10:     k ← k + 1
11:   v_out ← ∅
12:   for each p ∈ I, q ∈ F s.t. S_{p,q} ≠ ∅ do
13:     (D, v_out) ← union(D, v_out, S_{p,q})
14:   return (D, v_out)
15:
16: procedure CloseStep(a>, k)
17:   S′ ← ∅
18:   for p, p′ ∈ Q and (q′, a>, o, x, q) ∈ ∆ do
19:     if S_{p′,q′} ≠ ∅ and T_{p,x,p′} ≠ ∅ then
20:       (D, v) ← prod(D, T_{p,x,p′}, S_{p′,q′})
21:       (D, v) ← IfProd(D, v, o, k)
22:       (D, v) ← union(D, v, S′_{p,q})
23:       S′_{p,q} ← v
24:   T ← pop(T)
25:   S ← S′
26:   return
27: procedure OpenStep(<a, k)
      […]
34:       v ← S_{p,p′}
35:       (D, v) ← IfProd(D, v, o, k)
36:       (D, v) ← union(D, v, T_{p,x,q})
37:       T_{p,x,q} ← v
38:   S ← S′
39:   return
40:
41:
42: procedure IfProd(D, v, o, k)
43:   if o ≠ ε then
44:     (D′, v′) ← add(D, (o, k))
45:     (D′, v′) ← prod(D′, v, v′)
46:   else
47:     (D′, v′) ← (D, v)
48:   return (D′, v′)

Indeed, the most important consequence of these two invariants is that a tuple $(q_j, q_k) \in S_k$ represents the interval of some run over w[j, k⟩ with currlevel(k) = [j, k⟩, and a tuple $(q_i, x, q_j) \in T_k$ represents the interval of some run over w[i, j − 1⟩ with lowerlevel(k) = [i, j − 1⟩, i.e., the level below. In other words, the configuration $(S_k, \tau_k)$ of $A^{det}$ forms a succinct representation of all the non-deterministic runs of A. This is the starting point of our preprocessing algorithm, which we discuss next.

The preprocessing algorithm. In Algorithm 1 we present the preprocessing phase of our enumeration algorithm for solving EnumVPT. The main procedure is Preprocessing, which receives as input an I/O-unambiguous VPT T = (Q, Σ, Γ, Ω, ∆, I, F) and a well-nested word w, and computes the set of outputs ⟦T⟧(w). More specifically, this procedure constructs an ε-ECS D and a vertex $v_{out}$ such that $\mathcal{L}_D(v_{out})$ = ⟦T⟧(w). After the Preprocessing procedure is done, we can enumerate $\mathcal{L}_D(v_{out})$ with output-linear delay by applying Theorem 5.

Towards this goal, in Algorithm 1 we make use of the following data structures. First of all, we use an ε-ECS D = (Σ, V, I, ℓ, r, λ), nodes v ∈ V, and the functions add, union, and prod over D and v (see Section 4). For the sake of simplification, we overload the notation of these operators slightly so that if v = ∅, then union(D, v, v′) = union(D, v′, v) = prod(D, v, v′) = prod(D, v′, v) = (D, v′). We use a hash table S which indexes nodes v in D by pairs of states (p, q) ∈ Q × Q. We denote the elements of S as “(p, q) ∶ v” where (p, q) is the index and v is the content. Furthermore, we write $S_{p,q}$ to access the node v. We also use a stack T that stores hash tables: each element is a hash table which indexes vertices v in D by triples (p, x, q) ∈ Q × Γ × Q. We assume that T has the standard stack methods push and pop where, if $T = t_k \cdots t_1$, then $push(T, t) = t\, t_k \cdots t_1$ and $pop(T) = t_{k-1} \cdots t_1$. As for S, we use the notation $T_{p,x,q}$ to access the nodes in the topmost hash table in T (i.e., T is a stack of hash tables). We assume that accessing a non-assigned index in these hash tables returns the empty set. All variables representing these objects (i.e., the transducer T, D, S, and the stack T) are defined globally in Algorithm 1 and can be accessed by any of the procedures. Finally, given that we use the RAM model, each operation over any hash table or the stack takes constant time.

Preprocessing builds the ε-ECS D incrementally, reading w one letter at a time by calling yield[w] and keeping a counter k for the position of the current letter. For every k ∈ [1, n + 1] the main procedure builds the k-th iteration of table S and stack T, which we denote by $S^k$ and $T^k$ respectively. We consider the initial S and T as the first iteration, defined as $S^1 = \{(q, q) : v_\varepsilon \mid q \in I\}$ and $T^1 = \emptyset$ (i.e., the empty stack), where $v_\varepsilon$ is a node in D such that $\mathcal{L}_D(v_\varepsilon) = \{\varepsilon\}$ (lines 3-4). In the k-th iteration, depending on whether the current letter is an open symbol or a close symbol, the OpenStep or CloseStep procedure is called, updating $S^{k-1}$ and $T^{k-1}$ to $S^k$ and $T^k$, respectively. More specifically, Preprocessing adds nodes to D such that the nodes in $S^k$ represent the runs over w[j, k⟩ where currlevel(k) = [j, k⟩, and the nodes in the topmost table in $T^k$ represent the runs over w[i, j − 1⟩ where lowerlevel(k) = [i, j − 1⟩. Moreover, for a given pair (p, q), the node $S^k_{p,q}$ represents all runs over w[j, k⟩ with currlevel(k) = [j, k⟩ that start on p and end on q.
For a given triple (p, x, q), the node $T^k_{p,x,q}$ represents all runs over w[i, j − 1⟩ with lowerlevel(k) = [i, j − 1⟩ that start on p and end on q right after pushing x onto the stack. Here, the intuition gained in the determinization of VPA is crucial. Indeed, table $S^k$ and stack $T^k$ mirror the configuration $(S_k, \tau_k)$ of $A^{det}$ (recall invariants (a) and (b) above).

Before formalizing these notions, we describe in more detail what the procedures OpenStep and CloseStep do. Recall that the operation add(D, a) simply creates a node in D labeled as a; the operation prod(D, v_1, v_2) returns a pair (D′, v′) such that $\mathcal{L}_{D'}(v') = \mathcal{L}_D(v_1) \cdot \mathcal{L}_D(v_2)$; and the operation union(D, v_3, v_4) returns a pair (D′, v′) such that $\mathcal{L}_{D'}(v') = \mathcal{L}_D(v_3) \cup \mathcal{L}_D(v_4)$. To improve the presentation of the algorithm, we include a simple procedure called IfProd (lines 42-48). This procedure receives a node v, an output symbol o, and a position k, and computes (D′, v′) such that $\mathcal{L}_{D'}(v') = \mathcal{L}_D(v) \cdot \{(o, k)\}$ if o ≠ ε, and $\mathcal{L}_{D'}(v') = \mathcal{L}_D(v)$ otherwise.

In OpenStep, $S^k$ is created (i.e., S′), and an empty table is pushed onto $T^{k-1}$ to form $T^k$ (line 28). Then, all nodes in $S^{k-1}$ (i.e., S) are checked to see if the runs they represent can be extended with a transition in ∆ (lines 29-30). If this is the case (line 31 onwards), a node $v_\varepsilon$ with the ε-output is added to $S^k$ to start a new level (lines 31-33). Then, if the transition has a non-empty output, the node $S^{k-1}_{p,p'}$ is connected with a new label node to form the node v (lines 34-35). This node is stored in $T^k_{p,x,q}$, or united with the node that was already present there (lines 36-37).

In CloseStep, $S^k$ is initialized as empty (line 17). Then the procedure looks for all of the valid ways to join a node in $T^{k-1}$, a node in $S^{k-1}$, and a transition in ∆ to form a new node in $S^k$. More precisely, it looks for quadruples (p, x, p′, q′) for which $T^{k-1}_{p,x,p'}$ and $S^{k-1}_{p',q'}$ are defined, and there is a close transition that starts on q′ and reads x (lines 18-19). These nodes are joined and, if applicable, connected with a new label node (lines 20-21), and the result is stored in $S^k_{p,q}$ or united with the node that was already present there (lines 22-23). Finally, the top of the stack T is popped after all quadruples (p, x, p′, q′) are checked.

As already mentioned, in each step the construction of D follows the ideas of the determinization of a visibly pushdown automaton. As such, Figure 2 also helps to illustrate how the table $S^k$ and the top of the stack $T^k$ are constructed.

The way the table $S^k$ and the stack $T^k$ are constructed is formalized in the following result. Recall that a run of T over a well-nested word $w = a_1 \cdots a_n$ is a sequence of the form $\rho = (q_0, \sigma_0) \xrightarrow{a_1/o_1} \cdots \xrightarrow{a_n/o_n} (q_n, \sigma_n)$. Given a span [i, j⟩, define a subrun of ρ as a subsequence $\rho[i, j\rangle = (q_i, \sigma_i) \xrightarrow{a_i/o_i} \cdots \xrightarrow{a_{j-1}/o_{j-1}} (q_j, \sigma_j)$. We also extend the function out to receive a subrun ρ[i, j⟩ in the following way: $out(\rho[i, j\rangle) = out(o_i, i) \cdot \ldots \cdot out(o_{j-1}, j-1)$. Finally, define Runs(T, w) as the set of all runs of T over w.
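The behaviour of OpenStep, CloseStep, and IfProd described above can be simulated with explicit sets in place of ε-ECS nodes. The following Python sketch is a simplification under several assumptions, all of them hypothetical and not part of Algorithm 1: every "node" is a plain set of output sequences (so nothing here is compact or gives output-linear delay), `EPS` marks an empty output symbol, open transitions are tuples (p′, a, o, q, x), close transitions are tuples (q′, a, o, x, q) as in line 18, and the main loop and the body of the open step are reconstructed from the prose only.

```python
EPS = None  # hypothetical marker for an empty output symbol


def if_prod(lang, o, k):
    # Append the pair (o, k) to every sequence unless the output is empty.
    if o is EPS:
        return lang
    return {seq + ((o, k),) for seq in lang}


def open_step(S, T, a, k, delta_open):
    S_new, top = {}, {}
    for (p, p1), lang in S.items():
        for (src, sym, o, q, x) in delta_open:
            if src == p1 and sym == a:
                S_new[(q, q)] = {()}            # start a new level at q
                v = if_prod(lang, o, k)
                top[(p, x, q)] = top.get((p, x, q), set()) | v
    T.append(top)                               # push the new table
    return S_new


def close_step(S, T, a, k, delta_close):
    S_new = {}
    top = T.pop()                               # table of the level below
    for (p, x, p1), t_lang in top.items():
        for (p2, q1), s_lang in S.items():
            if p2 != p1:
                continue
            for (src, sym, o, y, q) in delta_close:
                if src == q1 and sym == a and y == x:
                    v = {t + s for t in t_lang for s in s_lang}  # prod
                    v = if_prod(v, o, k)
                    S_new[(p, q)] = S_new.get((p, q), set()) | v  # union
    return S_new


def preprocess(delta_open, delta_close, initial, final, word, open_symbols):
    S = {(q, q): {()} for q in initial}         # each entry holds only the empty sequence
    T = []                                      # empty stack of tables
    for k, a in enumerate(word, start=1):
        if a in open_symbols:
            S = open_step(S, T, a, k, delta_open)
        else:
            S = close_step(S, T, a, k, delta_close)
    out = set()
    for p in initial:
        for q in final:
            out |= S.get((p, q), set())
    return out


# One-state transducer: each "<" either outputs "A" (pushing x) or nothing (pushing y).
delta_open = {(0, "<", "A", 0, "x"), (0, "<", EPS, 0, "y")}
delta_close = {(0, ">", EPS, "x", 0), (0, ">", EPS, "y", 0)}
result = preprocess(delta_open, delta_close, {0}, {0}, "<>", {"<"})
```

On the word `<>` the result holds two sequences, the empty one and the one that marks position 1 with `A`. Replacing the explicit sets by ε-ECS nodes and the set operations by add, prod, and union is what makes each step constant time in the algorithm of the paper.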

Lemma 6. Let T be a VPT and $w = a_1 \cdots a_n$ be a well-nested word. While running the procedure Preprocessing of Algorithm 1, for every k ∈ [1, n + 1], every pair of states p, q, and every stack symbol x, the following hold:
1. $\mathcal{L}_D(S^k_{p,q})$ contains exactly the sequences out(ρ[j, k⟩) such that ρ ∈ Runs(T, w[1, k⟩), currlevel(k) = [j, k⟩, and ρ[j, k⟩ starts on p and ends on q.
2. If lowerlevel(k) is defined, then $\mathcal{L}_D(T^k_{p,x,q})$ contains exactly the sequences out(ρ[i, j⟩) such that ρ ∈ Runs(T, w[1, j⟩), lowerlevel(k) = [i, j − 1⟩, and ρ[i, j⟩ starts on p, ends on q, and the last symbol pushed onto the stack is x.