Constant-delay enumeration algorithms for document spanners over nested documents Martin Muñoz PUC & IMFD Chile Cristian Riveros PUC & IMFD Chile

Abstract Some of the most relevant document schemas used online, such as XML and JSON, have a nested format. In recent years, the task of extracting data from large nested documents has become especially relevant. We model queries of this kind as Visibly Pushdown Transducers (VPT), a structure that extends visibly pushdown automata with outputs. Since processing a string through a VPT can generate a huge number of outputs, we are interested in the task of enumerating them one after another as efficiently as possible. This paper describes an algorithm that enumerates these elements with output-linear delay after preprocessing the string in a single pass. We show applications of this result on recursive document spanners over nested documents and show how our algorithm can be adapted to enumerate the outputs in this context.

2012 ACM Subject Classification Theory of computation → Database theory.

Keywords and phrases Persistent data structures, Query evaluation, Enumeration algorithms.

1 Introduction

arXiv:2010.06037v1 [cs.DB] 12 Oct 2020

A constant-delay enumeration algorithm is an efficient solution to an enumeration problem: given an instance of the problem, the algorithm performs a preprocessing phase to build some indices, and then continues with an enumeration phase where it retrieves each output, one by one, taking constant delay between consecutive outputs. These algorithms provide a strong guarantee of efficiency since a user knows that, after the preprocessing phase, the outputs can be accessed as if they had already been computed. For these reasons, constant-delay algorithms have attracted researchers’ attention, leading to sophisticated solutions for several query evaluation problems. Starting with Durand and Grandjean’s work [19], researchers have found constant-delay algorithms for various subclasses of conjunctive queries [11, 15, 12], FO queries over sparse structures [19, 30, 34], MSO queries over words and trees [10, 6], and document spanners [23, 7], among others [35].

One could also develop constant-delay enumeration algorithms for query evaluation over nested documents. Indeed, some of the most relevant document schemas used online, such as XML and JSON, have a nested format. We can model the structure of these documents with the theory of nested words, where we divide the alphabet between open and close tags and encode the data in a sequence of well-nested symbols. Many formalisms for processing nested documents can be understood with visibly pushdown automata (VPA) [4], an automata model with excellent algorithmic properties [5]. Moreover, people have recently studied its natural extension to transducers [22], called visibly pushdown transducers (VPT), a model for processing nested documents and producing outputs.
The main advantage of these models compared to other models of nested structures (e.g., automata over trees) is that they naturally allow processing the input in a streaming fashion, reducing the resources needed to fulfill the task.

This paper presents a constant-delay enumeration algorithm for evaluating VPT over nested words. Specifically, we study the combined complexity of evaluating a VPT $\mathcal{T}$ over an input word $w$ and consider the notion of output-linear delay [23], a refinement of the idea of constant-delay where the delay between outputs depends only on the size of two consecutive outputs, namely, it is constant with respect to $\mathcal{T}$ and $w$. We provide an output-linear delay enumeration algorithm that takes preprocessing time $|\mathcal{T}|^3 \cdot |w|$, namely, linear in the size of the input word.

Why do we need a constant-delay algorithm for nested words? It is well-known that there is a close connection between VPAs and tree automata. Indeed, we can encode every well-nested word as a tree. Furthermore, there is a one-to-one correspondence between nested word languages accepted by VPAs and languages accepted by tree automata. Therefore, one could argue that, to get a constant-delay enumeration algorithm for VPT, one can convert the nested word to a tree and the VPT to a tree automaton (with some output policy), and then use the machinery of [10, 6, 8] to solve this problem. It is undoubtedly true that we can derive the existence of an enumeration algorithm for VPT by following this approach. However, our algorithm makes two contributions to the solution of this problem:

1. The first contribution is that our algorithm makes a one-pass preprocessing phase, namely, the input is read letter by letter in a single pass (see Section 2 for a formalization). Instead, the algorithms in [10, 6, 8] require storing the entire input in memory in order to make several passes over the data and compute indices for the final enumeration. Our algorithm makes a single pass, and for each letter it updates its data structure driven by the VPT transition table. This characteristic of the algorithm is very appealing, for example, to process XML or JSON documents in a streaming fashion.

2. The second contribution of our algorithm is that it solves an open exposition problem. As Timothy Chow said [16]: “Solving an open exposition problem means explaining a mathematical subject in a way that renders it totally perspicuous”.
Bringing this concept to algorithms, this means providing a simple algorithm whose instructions are evident, in a way that a software developer can understand and implement. One can argue that the algorithms presented in [10, 6, 8] are sophisticated, in the sense that they require several steps to preprocess the input. Further, these algorithms are given as a sequence of mathematical procedures without providing the final code. Instead, we base our algorithm on a fully-persistent [18] data structure called an Enumerable Compact Set (ECS), and an algorithm that mimics the abstract machine’s execution (see Algorithm 1). This algorithm is given by a sequence of low-level instructions, and we believe that any skilled software developer can implement it. Moreover, the exposition allows us to think about further optimizations and to better understand constant-delay algorithms over tree structures.

Towards the end of this work, we apply our results to study efficient enumeration algorithms in information extraction. More specifically, we use our techniques to derive an output-linear delay algorithm for evaluating visibly pushdown extraction grammars, a subclass of extraction grammars [32]. This solution is the first practical algorithm (i.e., linear preprocessing) to evaluate a subclass of recursive document spanners efficiently [33].

Related work. Constant-delay algorithms have been studied for several classes of query languages and structures [35], as we already discussed. The approach of [10, 6] is the closest to this work (see the discussion above). Algorithms for processing nested documents (e.g., XML) with one-pass [26] or constant-delay [14] have been studied previously; however, they address different problems that are not directly related to this work. In [21, 3], people have studied the evaluation of VPT in a streaming fashion, but none of them looked at enumeration problems. Very recently, people have considered extensions of document spanners with recursion [33, 32]. In these works, the question of investigating nested documents is not addressed. To the best of our knowledge, this is the first work to consider the evaluation of visibly pushdown transducers over nested documents with constant-delay enumeration.

2 Preliminaries

Nested words. As usual, given a set $S$ we denote by $S^*$ the set of all finite words with symbols in $S$, where $\varepsilon \in S^*$ represents the empty word of length 0. We will work over a structured alphabet $\Sigma = (\Sigma^{<}, \Sigma^{>}, \Sigma^{|})$ comprised of three disjoint sets $\Sigma^{<}$, $\Sigma^{>}$, and $\Sigma^{|}$ that contain open, close, and neutral symbols respectively (in [4, 22] these sets are named call, return, and local, respectively). Furthermore, we will call a symbol in $\Sigma^{<}$, $\Sigma^{>}$ or $\Sigma^{|}$ an “open”, a “close”, or a “letter”, and we will denote them as $\texttt{<}a$, $a\texttt{>}$, and $a$, respectively. Instead, we will use $s$ to denote any symbol in $\Sigma^{<}$, $\Sigma^{>}$, or $\Sigma^{|}$. The set of well-nested words over $\Sigma$ (or just nested words), denoted as $\Sigma^{<*>}$, is defined as the closure of the following rules: (1) $\Sigma^{|} \cup \{\varepsilon\} \subseteq \Sigma^{<*>}$, (2) if $w_1, w_2 \in \Sigma^{<*>} \setminus \{\varepsilon\}$ then $w_1 \cdot w_2 \in \Sigma^{<*>}$, and (3) if $w \in \Sigma^{<*>}$, $\texttt{<}a \in \Sigma^{<}$ and $b\texttt{>} \in \Sigma^{>}$, then $\texttt{<}a \cdot w \cdot b\texttt{>} \in \Sigma^{<*>}$. In [4], Alur et al. consider a slightly more general set of nested words, where there could be close symbols at the beginning of the word that are never opened and open symbols at the end that are never closed. We can extend our setting to support this generalization at the cost of considering these border cases. For the sake of simplicity, we restrict our work to nested words without loss of generality.

Visibly Pushdown Languages. A visibly pushdown automaton [4] (VPA) is a tuple $\mathcal{A} = (Q, \Sigma, \Gamma, \Delta, I, F)$ where $Q$ is a finite set of states, $\Sigma = (\Sigma^{<}, \Sigma^{>}, \Sigma^{|})$ is the input alphabet, $\Gamma$ is the stack alphabet, $\Delta \subseteq (Q \times \Sigma^{<} \times Q \times \Gamma) \cup (Q \times \Sigma^{>} \times \Gamma \times Q) \cup (Q \times \Sigma^{|} \times Q)$ is the transition relation, $I \subseteq Q$ is a set of initial states, and $F \subseteq Q$ is a set of final states. A transition $(q, \texttt{<}a, q', x)$ is a push-transition where $x$ is pushed onto the stack and the current state changes from $q$ to $q'$. A transition $(q, a\texttt{>}, x, q')$ is a pop-transition where $x$ is read from the top of the stack and popped, and the current state changes from $q$ to $q'$. Lastly, we say that $(q, a, q')$ is a neutral transition if $a \in \Sigma^{|}$, where there is no stack operation.

A stack is a finite sequence $\sigma$ over $\Gamma$ where the top of the stack is the first symbol of $\sigma$. For a nested word $w = s_1 \cdots s_n$ in $\Sigma^{<*>}$, a run of $\mathcal{A}$ on $w$ is a sequence $\rho = (q_1, \sigma_1) \xrightarrow{s_1} \cdots \xrightarrow{s_n} (q_{n+1}, \sigma_{n+1})$, where each $q_i \in Q$, $\sigma_i \in \Gamma^*$, $q_1 \in I$, $\sigma_1 = \varepsilon$, and for every $i \in [1, n]$ the following holds: (1) if $s_i \in \Sigma^{<}$, then there is $x \in \Gamma$ such that $(q_i, s_i, q_{i+1}, x) \in \Delta$ and $\sigma_{i+1} = x\sigma_i$, (2) if $s_i \in \Sigma^{>}$, then there is $x \in \Gamma$ such that $(q_i, s_i, x, q_{i+1}) \in \Delta$ and $\sigma_i = x\sigma_{i+1}$, and (3) if $s_i \in \Sigma^{|}$, then $(q_i, s_i, q_{i+1}) \in \Delta$ and $\sigma_{i+1} = \sigma_i$. A run $\rho$ as above is accepting if $q_{n+1} \in F$. A nested word $w \in \Sigma^{<*>}$ is accepted by a VPA $\mathcal{A}$ if there is an accepting run of $\mathcal{A}$ on $w$. The language $L(\mathcal{A})$ is the set of nested words accepted by $\mathcal{A}$. Note that on a nested word $w$, if $\rho$ is an accepting run of $\mathcal{A}$ on $w$, then $\sigma_{n+1} = \varepsilon$. A set of nested words $L \subseteq \Sigma^{<*>}$ is called a visibly pushdown language if there exists a VPA $\mathcal{A}$ such that $L = L(\mathcal{A})$.

A VPA $\mathcal{A} = (Q, I, \Gamma, \delta, F)$ is said to be deterministic if $|I| = 1$ and $\delta$ is a partial function of the form $(Q \times \Sigma^{<} \to Q \times \Gamma) \cup (Q \times \Sigma^{>} \times \Gamma \to Q) \cup (Q \times \Sigma^{|} \to Q)$. We also say that $\mathcal{A}$ is unambiguous if, for every $w \in L(\mathcal{A})$, there exists exactly one accepting run of $\mathcal{A}$ on $w$. In [4], it is shown that for every VPA there exists an equivalent deterministic VPA of at most exponential size.

Enumeration with one-pass preprocessing and output-linear delay. As is common in the enumeration algorithms literature [10, 17, 35], for our computational model we use Random Access Machines (RAM) with uniform cost measure, and addition and subtraction as basic operations [1]. We assume that a RAM has read-only input registers where the machine places the input, read-write work registers where it does the computation, and write-only output registers where it gives the output (i.e., the enumeration of the results). As our gold standard for efficiency we consider the notion of output-linear delay defined in [23]. This notion is a refinement of the definition of constant-delay [35] or linear-delay [17] enumeration that better fits our purpose. We also add the restriction of making one pass over the input.
For this, we adopt the setting of relations to represent enumeration problems [29, 9] and separate the input into two components, query and data, in order to formalize the notion of one-pass over the data. Let $\Omega$ be an alphabet. An enumeration problem is a relation $R \subseteq (\Omega^* \times \Omega^*) \times \Omega^*$. Here, for each pair $((q, x), y) \in R$ we view $(q, x)$ as the input of the problem and $y$ as a possible output for $(q, x)$. Furthermore, we call $q$ the query and $x$ the data. For an instance $(q, x)$ we define the set $\llbracket q \rrbracket_R(x) = \{y \mid ((q, x), y) \in R\}$ of all outputs of evaluating $q$ over $x$. As is standard in this framework [29], we assume that $R$ is a p-relation, which means that there exists a polynomial $p$ such that, for every $y \in \llbracket q \rrbracket_R(x)$, it holds that $|y| \leq p(|q| + |x|)$.

To formalize the notion of one-pass algorithm, we also need to restrict the access to the data $x$ of an instance $(q, x)$. Specifically, we assume the existence of a method yield[x] such that, if $x = a_1 \ldots a_n$, then the first call of yield[x] returns $a_1$, the $(i+1)$-th call returns $a_{i+1}$, and the $(n+1)$-th call outputs EOF, which is a special symbol that marks the end of $x$. Note that the yield method does not give the length of the input in advance; the length is known only after the last call to yield.

We say that $\mathcal{E}$ is an enumeration algorithm with one-pass preprocessing for $R$ if $\mathcal{E}$ runs in two phases such that for every input $(q, x) \in \Omega^* \times \Omega^*$:

The first phase, called the preprocessing phase, receives $q$ as input and accesses $x$ through the method yield[x]. This phase does not produce output but may prepare data structures for use in the next phase.

The second phase, called the enumeration phase, occurs immediately after the last call to yield[x] outputs EOF. During this phase the algorithm: (1) writes $\#y_1\#y_2\#\cdots\#y_m\#$ to the output registers, where $\#$ is a distinct separator symbol not contained in $\Omega$ and $y_1, y_2, \ldots, y_m$ is an enumeration (without repetitions) of the set $\llbracket q \rrbracket_R(x)$, (2) writes the first $\#$ as soon as the enumeration phase starts, and (3) stops immediately after writing the last $\#$.

The purpose of separating $\mathcal{E}$’s operation into a preprocessing and an enumeration phase is to be able to make an output-sensitive analysis of $\mathcal{E}$’s complexity. We say that $\mathcal{E}$ has update-time $f\colon \mathbb{N} \to \mathbb{N}$ if the number of instructions that $\mathcal{E}$ executes between two consecutive calls to yield[x] on an input $(q, x)$ is at most $O(f(|q|))$. In particular, the total time of the preprocessing phase is at most $O(f(|q|) \cdot |x|)$. Here we assume that the RAM can store any element of $\Omega$ in a single register and that it can operate on each register in constant time. For the enumeration phase, we measure the delay between two outputs as follows. For an input $x \in \Omega^*$, let $\#y_1\#y_2\#\cdots\#y_m\#$ be the output of the algorithm during the enumeration phase. Furthermore, let $\mathrm{time}_i(x)$ be the time in the enumeration phase when the algorithm writes the $i$-th $\#$ when running on $x$, for $i \leq m + 1$. Define $\mathrm{delay}_i(x) = \mathrm{time}_{i+1}(x) - \mathrm{time}_i(x)$ for $i \leq m$. Then we say that $\mathcal{E}$ has output-linear delay if there exists a constant $k$ such that for every input $x \in \Omega^*$ it holds that $\mathrm{delay}_i(x) \leq k \cdot |y_i|$ for every $i \leq m$, that is, the number of instructions executed by $\mathcal{E}$ between the time that the $i$-th and the $(i+1)$-th $\#$ are written is linear in the size of $y_i$.

Given an enumeration problem $R$, we say that $R$ can be solved with one-pass preprocessing, update-time $f$, and output-linear delay if there exists such an algorithm as above. Notice that the enumeration algorithm defined above is a formal refinement of the algorithmic notions used in the literature of dynamic query evaluation (see [13, 28, 27]). Indeed, given that an enumeration algorithm with one-pass preprocessing does not know when the last call to yield will arrive, the algorithm must be prepared to produce all the outputs at any moment in time. In other words, this makes these algorithms suitable for a streaming evaluation setting [28, 27], although we present them here in an offline setting.
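The two-phase protocol above can be pictured with a small driver. The sketch below is only an illustration under stated assumptions: `make_yield`, `run_one_pass`, and the toy problem (report, as decimal strings, the positions where a one-symbol query occurs in the data) are hypothetical names and choices made for this example, not the algorithm of this paper. It only shows the access pattern: the data is consumed through a yield-style call that never reveals its length in advance, and the output is written as #y1#...#ym#.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

EOF = object()  # sentinel standing in for the special EOF symbol


def make_yield(x: Iterable[str]) -> Callable[[], object]:
    """Wrap the data x so it can only be read one symbol per call, then EOF."""
    it = iter(x)

    def yield_x() -> object:
        return next(it, EOF)

    return yield_x


def run_one_pass(q, yield_x, init, update, enumerate_outputs) -> str:
    """Generic two-phase driver (hypothetical): preprocessing, then enumeration."""
    # Preprocessing phase: one call to update between consecutive yield calls.
    state = init(q)
    while True:
        a = yield_x()
        if a is EOF:
            break
        state = update(q, state, a)
    # Enumeration phase: write '#', then each output followed by '#'.
    written: List[str] = ["#"]
    for y in enumerate_outputs(q, state):
        written.append(y)
        written.append("#")
    return "".join(written)


# Toy instance (an assumption for illustration): q is a single symbol and the
# outputs are the 0-based positions of q in x.
def init(q: str) -> Tuple[int, List[int]]:
    return (0, [])


def update(q: str, state: Tuple[int, List[int]], a: str) -> Tuple[int, List[int]]:
    pos, hits = state
    if a == q:
        hits.append(pos)
    return (pos + 1, hits)


def enumerate_outputs(q: str, state: Tuple[int, List[int]]) -> Iterator[str]:
    for p in state[1]:
        yield str(p)


result = run_one_pass("a", make_yield("abca"), init, update, enumerate_outputs)
```

In this toy instance each call to `update` does a constant amount of work, which mirrors the role of the update-time bound, and the enumeration phase never touches the data again.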

3 Visibly pushdown transducers and main result

In this section, we present the definition of visibly pushdown transducers [22], which extend visibly pushdown automata to produce output. After the setting is formalized, we state the main result of the paper.

A visibly pushdown transducer (VPT) is a tuple $\mathcal{T} = (Q, \Sigma, \Gamma, \Omega, \Delta, I, F)$ where $Q$, $\Sigma$, $\Gamma$, $I$, and $F$ are the same as for VPA, $\Omega$ is the output alphabet with $\varepsilon \notin \Omega$, and

$$\Delta \subseteq (Q \times \Sigma^{<} \times (\Omega \cup \{\varepsilon\}) \times Q \times \Gamma) \cup (Q \times \Sigma^{>} \times (\Omega \cup \{\varepsilon\}) \times \Gamma \times Q) \cup (Q \times \Sigma^{|} \times (\Omega \cup \{\varepsilon\}) \times Q)$$

is the transition relation. As usual for transducers, a symbol $s \in \Sigma^{<} \cup \Sigma^{>} \cup \Sigma^{|}$ is an input symbol that the machine reads and $o \in \Omega \cup \{\varepsilon\}$ is a symbol that the machine prints on an output tape. Furthermore, $\varepsilon$ represents that no symbol is printed for that transition. A run $\rho$ of $\mathcal{T}$ over a nested word $w = s_1 s_2 \cdots s_n \in \Sigma^{<*>}$ and output sequence $\mu = o_1 o_2 \cdots o_n \in (\Omega \cup \{\varepsilon\})^*$ is a sequence of the form $\rho = (q_1, \sigma_1) \xrightarrow{s_1/o_1} \cdots \xrightarrow{s_n/o_n} (q_{n+1}, \sigma_{n+1})$ where $q_i \in Q$, $\sigma_i \in \Gamma^*$, $q_1 \in I$, $\sigma_1 = \varepsilon$ and for every $i \in [1, n]$ the following holds: (1) if $s_i \in \Sigma^{<}$, then $(q_i, s_i, o_i, q_{i+1}, x) \in \Delta$ for some $x \in \Gamma$ and $\sigma_{i+1} = x\sigma_i$, (2) if $s_i \in \Sigma^{>}$, then $(q_i, s_i, o_i, x, q_{i+1}) \in \Delta$ for some $x \in \Gamma$ and $\sigma_i = x\sigma_{i+1}$, and (3) if $s_i \in \Sigma^{|}$, then $(q_i, s_i, o_i, q_{i+1}) \in \Delta$ and $\sigma_{i+1} = \sigma_i$. We say that the run is accepting if $q_{n+1} \in F$. We call a pair $(q_i, \sigma_i)$ a configuration of $\rho$. Finally, the output of an accepting run $\rho$ is defined as $\mathrm{out}(\rho) = \mathrm{out}(o_1, 1) \cdot \ldots \cdot \mathrm{out}(o_n, n)$, where $\mathrm{out}(o, i) = \varepsilon$ when $o = \varepsilon$ and $(o, i)$ otherwise. Note that in $\mu = o_1 \cdots o_n$ we use $\varepsilon$ as a symbol, and in $\mathrm{out}(\rho)$ we use $\varepsilon$ as the empty string. Given a VPT $\mathcal{T}$ and a nested word $w \in \Sigma^{<*>}$, we define the set $\llbracket \mathcal{T} \rrbracket(w)$ of all outputs of $\mathcal{T}$ over $w$ as $\llbracket \mathcal{T} \rrbracket(w) = \{\mathrm{out}(\rho) \mid \rho \text{ is an accepting run of } \mathcal{T} \text{ over } w\}$.

Strictly speaking, our definition of VPT is not the same as the one studied in [22]. In our definition of VPT each output element is a tuple $(o, i)$ where $o$ is the symbol and $i$ is the output position, whereas for a standard VPT [22] an output element is just the symbol $o$. Although our algorithm could be extended to standard VPTs, for application purposes it is better to have a richer output like the one presented here (see Section 6).

In this paper, we say that a VPT $\mathcal{T} = (Q, \Sigma, \Gamma, \Omega, \Delta, I, F)$ is input/output deterministic (I/O-deterministic for short) if $|I| = 1$ and $\Delta$ is a partial function of the form $\Delta\colon (Q \times \Sigma^{<} \times \Omega \to Q \times \Gamma) \cup (Q \times \Sigma^{>} \times \Omega \times \Gamma \to Q) \cup (Q \times \Sigma^{|} \times \Omega \to Q)$. On the other hand, we say that $\mathcal{T}$ is input/output unambiguous (I/O-unambiguous for short) if for every nested word $w \in \Sigma^{<*>}$ and every $\mu \in \llbracket \mathcal{T} \rrbracket(w)$ there is exactly one accepting run $\rho$ of $\mathcal{T}$ over $w$ such that $\mu = \mathrm{out}(\rho)$. Notice that an I/O-deterministic VPT is also I/O-unambiguous. The definition of I/O-deterministic is in line with the notion of I/O-deterministic variable automata of [23], and I/O-unambiguous is a generalization of this idea that is enough for the purpose of our enumeration algorithm. Actually, one can easily show that for every VPT $\mathcal{T}$ there exists an equivalent I/O-deterministic VPT and, therefore, an equivalent I/O-unambiguous VPT.
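To make the run and output semantics concrete, the following sketch computes the set of outputs of a VPT over a nested word by exhaustively exploring all runs. It is a brute-force illustration only: it can take exponential time and is not the enumeration algorithm developed in this paper. The dictionary layout (`"push"`, `"pop"`, `"neutral"`, `"I"`, `"F"`), the helper names, the use of `None` for an ε output, and the small example transducer are all assumptions made for this example.

```python
def emit(out, o, pos):
    """Extend the output with (o, pos), unless the transition prints epsilon (None)."""
    return out if o is None else out + ((o, pos),)


def outputs(vpt, w):
    """Set of out(rho) over all accepting runs rho of vpt on the nested word w."""
    results = set()

    def explore(i, q, stack, out):
        if i == len(w):
            if q in vpt["F"]:  # accepting run
                results.add(out)
            return
        s, pos = w[i], i + 1  # output positions are 1-based, as in out(o, i)
        for (p, a, o, p2, x) in vpt["push"]:      # (q, <a, o, q', x)
            if (p, a) == (q, s):
                explore(i + 1, p2, (x,) + stack, emit(out, o, pos))
        for (p, a, o, x, p2) in vpt["pop"]:       # (q, a>, o, x, q')
            if (p, a) == (q, s) and stack and stack[0] == x:
                explore(i + 1, p2, stack[1:], emit(out, o, pos))
        for (p, a, o, p2) in vpt["neutral"]:      # (q, a, o, q')
            if (p, a) == (q, s):
                explore(i + 1, p2, stack, emit(out, o, pos))

    for q0 in vpt["I"]:
        explore(0, q0, (), ())
    return results


# Hypothetical one-state transducer: every neutral letter "b" is either
# marked with the output symbol "x" or skipped.
T = {
    "I": {0},
    "F": {0},
    "push": {(0, "<a", None, 0, "X")},
    "pop": {(0, "a>", None, "X", 0)},
    "neutral": {(0, "b", "x", 0), (0, "b", None, 0)},
}
w = ["<a", "b", "b", "a>"]
result = outputs(T, w)
```

On this input there are four accepting runs with pairwise different outputs, one for each subset of the positions 2 and 3, so this example transducer behaves as an I/O-unambiguous one on `w`.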

Lemma 1. For every visibly pushdown transducer $\mathcal{T}$ there exists an I/O-deterministic visibly pushdown transducer $\mathcal{T}'$ such that $\llbracket \mathcal{T} \rrbracket(w) = \llbracket \mathcal{T}' \rrbracket(w)$ for every $w \in \Sigma^{<*>}$.

In this paper, we are interested in the following problem for VPT. Let $\mathcal{C}$ be a class of VPT (e.g. I/O-deterministic VPT).

Problem: EnumVPT[$\mathcal{C}$]
Input: a VPT $\mathcal{T} \in \mathcal{C}$ and a nested word $w \in \Sigma^{<*>}$.
Output: Enumerate $\llbracket \mathcal{T} \rrbracket(w)$.

The main result of the paper is that for the class of I/O-unambiguous VPT, this enumeration problem can be solved in one-pass and with strong guarantees of efficiency.

Theorem 2. The enumeration problem EnumVPT for the class of I/O-unambiguous VPT can be solved with one-pass preprocessing, update-time $|Q|^2|\Delta|$, and output-linear delay. For the general class of VPT, this problem can be solved with one-pass preprocessing, update-time $2^{|Q|^2|\Delta|}$, and output-linear delay.

The result for the class of all VPT is a consequence of Lemma 1 and the enumeration algorithm for I/O-unambiguous VPT. In the next sections we present this algorithm, starting by defining a general data structure that we use to store the outputs.

4 Enumerable compact sets: a data structure for output-linear delay

This section presents a data structure called Enumerable Compact Set (ECS), which is at the heart of our enumeration algorithm for VPT. This data structure is strongly inspired by the work in [6, 7]. Indeed, ECS can be considered a refinement of the d-DNNF circuits used in [6] or of the set circuits used in [7]. The main difference between ECS and the knowledge-compilation approach is that we see a “circuit” as a data structure and treat it as such. With this, we can avoid using the heavy notation of circuits, applying special operations to convert them, or computing indices to enumerate the output. Instead, we use ECS to simplify the presentation and apply all these reductions and optimizations at once, improving the conceptual understanding of the structure. Some of the conceptual contributions of this approach are to understand that we need a succinct data structure to store the outputs, we need to retrieve these outputs with output-linear delay, and we need the data structure to be fully-persistent [18], that is, it always preserves the previous version of itself when it is modified. In fact, this last property is crucial for our preprocessing algorithm to manage different runs that share part of the same output. In the following, we present ECS step by step in order to use it in the next section.

Let $\Sigma$ be a (possibly infinite) alphabet. We define an Enumerable Compact Set (ECS) as a tuple $\mathcal{D} = (\Sigma, V, I, \ell, r, \lambda)$ such that $V$ is a finite set of nodes, $I \subseteq V$ is the set of inner nodes, $\ell\colon I \to V$ and $r\colon I \to V$ are the left and right functions, and $\lambda\colon V \to \Sigma \cup \{\cup, \odot\}$ is a label function such that $\lambda(v) \in \{\cup, \odot\}$ if, and only if, $v \in I$. Further, we assume that the directed graph $(V, \{(v, \ell(v)), (v, r(v)) \mid v \in I\})$ is acyclic. We call the nodes in $I$ inner nodes and the nodes in $V \setminus I$ leaves. Furthermore, for $v \in I$ we say that $v$ is a product node if $\lambda(v) = \odot$, and a union node if $\lambda(v) = \cup$. We define the size of $\mathcal{D}$ as $|\mathcal{D}| = |V|$.

For each node $v$ in $\mathcal{D}$, we associate a set of words $\mathcal{L}_{\mathcal{D}}(v)$ recursively as follows: (1) $\mathcal{L}_{\mathcal{D}}(v) = \{a\}$ whenever $\lambda(v) = a \in \Sigma$, (2) $\mathcal{L}_{\mathcal{D}}(v) = \mathcal{L}_{\mathcal{D}}(\ell(v)) \cup \mathcal{L}_{\mathcal{D}}(r(v))$ whenever $\lambda(v) = \cup$, and (3) $\mathcal{L}_{\mathcal{D}}(v) = \mathcal{L}_{\mathcal{D}}(\ell(v)) \cdot \mathcal{L}_{\mathcal{D}}(r(v))$ whenever $\lambda(v) = \odot$, where $L_1 \cdot L_2 = \{w_1 \cdot w_2 \mid w_1 \in L_1 \text{ and } w_2 \in L_2\}$.

Note that $|\mathcal{L}_{\mathcal{D}}(v)|$ can be of exponential size with respect to $|\mathcal{D}|$. For this reason we say that $\mathcal{D}$ is a compact representation of the set $\mathcal{L}_{\mathcal{D}}(v)$ for any $v \in V$. Although the represented set may be huge, the goal is to enumerate all its elements efficiently. In other words, we consider the following problem:

Problem: Enum-ECS
Input: An ECS $\mathcal{D} = (\Sigma, V, I, \ell, r, \lambda)$ and $v \in V$.
Output: Enumerate the set $\mathcal{L}_{\mathcal{D}}(v)$ without repetitions.

and we want to solve Enum-ECS with output-linear delay. To reach this goal we need to impose two additional restrictions on $\mathcal{D}$. The first restriction is to guarantee that $\mathcal{D}$ is not ambiguous, namely, for each $w \in \mathcal{L}_{\mathcal{D}}(v)$ there is at most one way to retrieve $w$ from $\mathcal{D}$. Formally, we say that $\mathcal{D}$ is unambiguous if $\mathcal{D}$ satisfies the following two properties: (1) for every union node $v$ it holds that $\mathcal{L}_{\mathcal{D}}(\ell(v))$ and $\mathcal{L}_{\mathcal{D}}(r(v))$ are disjoint, and (2) for every product node $v$ and for every $w \in \mathcal{L}_{\mathcal{D}}(v)$, there exists a unique way to decompose $w = w_1 \cdot w_2$ such that $w_1 \in \mathcal{L}_{\mathcal{D}}(\ell(v))$ and $w_2 \in \mathcal{L}_{\mathcal{D}}(r(v))$. Then, if $\mathcal{D}$ is unambiguous, we can guarantee that by enumerating elements of $\mathcal{L}_{\mathcal{D}}(v)$ there will be no duplicates, given that there is no way of producing the same element in two different ways.

The second restriction to solve Enum-ECS with output-linear delay is to guarantee that, for each node $v$, there exists an output or, more specifically, a symbol of an output close to $v$. This is not always the case for an ECS. For example, consider a balanced tree of union nodes where all the outputs are at the leaves. Then one has to traverse a logarithmic number of nodes from the root to reach the first output. For this reason, we define the notion of $k$-bounded ECS. Given an ECS $\mathcal{D}$, define the (left) output-depth of a node $v \in V$, denoted by $\mathrm{odepth}_{\mathcal{D}}(v)$, recursively as follows: $\mathrm{odepth}_{\mathcal{D}}(v) = 0$ whenever $\lambda(v) \in \Sigma$ or $\lambda(v) = \odot$, and $\mathrm{odepth}_{\mathcal{D}}(v) = \mathrm{odepth}_{\mathcal{D}}(\ell(v)) + 1$ whenever $\lambda(v) = \cup$. Then, for a fixed $k \in \mathbb{N}$ we say that $\mathcal{D}$ is $k$-bounded if $\mathrm{odepth}_{\mathcal{D}}(v) \leq k$ for all $v \in V$.

Given the definition of output-depth, we say that $v$ is an output node of $\mathcal{D}$ if $v$ is a leaf or a product node. Note that if $\mathcal{D}$ only has output nodes then it is 0-bounded, and one can easily check that $\mathcal{L}_{\mathcal{D}}(v)$ can be enumerated with output-linear delay. Indeed, for a fixed $k$ the same happens with every unambiguous and $k$-bounded ECS.

Proposition 3. Fix $k \in \mathbb{N}$. Let $\mathcal{D} = (\Sigma, V, I, \ell, r, \lambda)$ be an unambiguous and $k$-bounded ECS. Then the set $\mathcal{L}_{\mathcal{D}}(v)$ can be enumerated with output-linear delay for any $v \in V$.
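As an informal illustration of the definitions of this section, the sketch below represents ECS nodes as immutable Python objects and computes the set associated with a node together with its (left) output-depth. The class `Node`, its fields, and the functions `words` and `odepth` are names assumed for this example. The recursive generator follows the three semantic rules directly; it is meant to show what is enumerated, and it makes no claim to achieve the delay bound of Proposition 3.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Tuple


@dataclass(frozen=True, eq=False)  # immutable nodes, compared by identity
class Node:
    kind: str                       # "leaf", "union" or "prod"
    symbol: Optional[str] = None    # label of a leaf
    left: Optional["Node"] = None   # l(v) for inner nodes
    right: Optional["Node"] = None  # r(v) for inner nodes


def words(v: Node) -> Iterator[Tuple[str, ...]]:
    """Enumerate the set of words of node v following rules (1) to (3)."""
    if v.kind == "leaf":
        yield (v.symbol,)
    elif v.kind == "union":
        yield from words(v.left)
        yield from words(v.right)
    else:  # product node: concatenate every left word with every right word
        for w1 in words(v.left):
            for w2 in words(v.right):
                yield w1 + w2


def odepth(v: Node) -> int:
    """Left output-depth: number of union nodes met while following left edges."""
    depth = 0
    while v.kind == "union":
        v = v.left
        depth += 1
    return depth


# Small example: u represents {a, b}; p represents {ac, bc}; the node u is
# shared, so q represents the four words aa, ab, ba, bb with five nodes.
a = Node("leaf", symbol="a")
b = Node("leaf", symbol="b")
c = Node("leaf", symbol="c")
u = Node("union", left=a, right=b)
p = Node("prod", left=u, right=c)
q = Node("prod", left=u, right=u)
```

Sharing the node `u` inside `q` is the source of compactness: the number of represented words can grow multiplicatively while the number of nodes grows additively.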

It is important to notice that the enumeration algorithm of the previous proposition does not require any preprocessing over $\mathcal{D}$; the main idea is to perform a DFS-like traversal over the nodes. Given that $\mathcal{D}$ is unambiguous, we know that this procedure will not enumerate any output twice. Moreover, given that $\mathcal{D}$ is $k$-bounded, we also know that within a fixed number of steps it will find something to output, which is necessary to bound the delay. Therefore, from now on we assume that all ECS are unambiguous and $k$-bounded for some fixed $k$.

The next step is to provide a set of operations that allow extending an ECS $\mathcal{D}$ while maintaining $k$-boundedness. Furthermore, we require these operations to be fully-persistent [18] in order to always keep the previous versions of the data structure. To satisfy the last requirement, the strategy consists in extending $\mathcal{D}$ to $\mathcal{D}'$ for each operation, by always adding new nodes and leaving the previous nodes untouched. Then $\mathcal{L}_{\mathcal{D}'}(v) = \mathcal{L}_{\mathcal{D}}(v)$ for each node $v \in V$ and the structure will be fully-persistent. More precisely, fix an ECS $\mathcal{D} = (\Sigma, V, I, \ell, r, \lambda)$. Then for any $a \in \Sigma$ and $v_1, \ldots, v_4 \in V$, we define the operations:

$$(\mathcal{D}', v') := \mathsf{add}(\mathcal{D}, a) \qquad (\mathcal{D}', v') := \mathsf{prod}(\mathcal{D}, v_1, v_2) \qquad (\mathcal{D}', v') := \mathsf{union}(\mathcal{D}, v_3, v_4)$$

such that $\mathcal{D}' = (\Sigma, V', I', \ell', r', \lambda')$ is an extension of $\mathcal{D}$ (i.e. $\mathrm{obj} \subseteq \mathrm{obj}'$ for every $\mathrm{obj} \in \{V, I, \ell, r, \lambda\}$) and $v' \in V' \setminus V$ is a fresh node such that $\mathcal{L}_{\mathcal{D}'}(v') = \{a\}$, $\mathcal{L}_{\mathcal{D}'}(v') = \mathcal{L}_{\mathcal{D}}(v_1) \cdot \mathcal{L}_{\mathcal{D}}(v_2)$, and $\mathcal{L}_{\mathcal{D}'}(v') = \mathcal{L}_{\mathcal{D}}(v_3) \cup \mathcal{L}_{\mathcal{D}}(v_4)$, respectively. Here we assume that union and prod respect properties (1) and (2) of an unambiguous ECS, that is, $\mathcal{L}_{\mathcal{D}}(v_3)$ and $\mathcal{L}_{\mathcal{D}}(v_4)$ are disjoint and, for every $w \in \mathcal{L}_{\mathcal{D}}(v_1) \cdot \mathcal{L}_{\mathcal{D}}(v_2)$, there exists a unique way to decompose $w = w_1 \cdot w_2$ such that $w_1 \in \mathcal{L}_{\mathcal{D}}(v_1)$ and $w_2 \in \mathcal{L}_{\mathcal{D}}(v_2)$.

Next, we show how to implement each operation. The cases of add and prod are straightforward. For $(\mathcal{D}', v') := \mathsf{add}(\mathcal{D}, a)$ define $V' := V \cup \{v'\}$, $I' := I$, and $\lambda'(v') := a$. One can easily check that $\mathcal{L}_{\mathcal{D}'}(v') = \{a\}$ as expected. For $(\mathcal{D}', v') := \mathsf{prod}(\mathcal{D}, v_1, v_2)$ we proceed in a similar way: define $V' := V \cup \{v'\}$, $I' := I \cup \{v'\}$, $\ell'(v') := v_1$, $r'(v') := v_2$, and $\lambda'(v') := \odot$. Then $\mathcal{L}_{\mathcal{D}'}(v') = \mathcal{L}_{\mathcal{D}}(v_1) \cdot \mathcal{L}_{\mathcal{D}}(v_2)$. Furthermore, one can check that each operation takes constant time, $\mathcal{D}'$ is a valid ECS (i.e. unambiguous and $k$-bounded), and the operations are fully-persistent (i.e. the previous version $\mathcal{D}$ is available).

[Figure 1: fresh nodes $v'$, $v''$, $v^*$ drawn above $v_3$ and $v_4$ and their children $\ell(v_3)$, $r(v_3)$, $\ell(v_4)$, $r(v_4)$.]

Figure 1 Gadget for $\mathsf{union}(\mathcal{D}, v_3, v_4)$. Nodes $v'$, $v''$, $v^*$, $v_3$ and $v_4$ are labeled as $\cup$. Dashed and solid lines denote the mappings in $\ell'$ and $r'$ respectively.

To define the union, we need to be a bit more careful. For a node $v \in V$, we say that $v$ is safe if (1) $\mathrm{odepth}_{\mathcal{D}}(v) \leq 1$, and (2) if $\mathrm{odepth}_{\mathcal{D}}(v) = 1$, then $\mathrm{odepth}_{\mathcal{D}}(r(v)) \leq 1$. In other words, $v$ is safe if $v$ is an output node, or its left child is an output node and its right child is either an output node or has output depth 1. Note that, by definition, leaves and product nodes are safe nodes and, thus, the add and prod operations always produce safe nodes. The trick then is to show that, if $v_3$ and $v_4$ are safe nodes, then we can implement $(\mathcal{D}', v') := \mathsf{union}(\mathcal{D}, v_3, v_4)$ and produce a safe node $v'$. For this, define $(\mathcal{D}', v') := \mathsf{union}(\mathcal{D}, v_3, v_4)$ as follows. If $v_3$ or $v_4$ is an output node, then $V' := V \cup \{v'\}$, $I' := I \cup \{v'\}$, and $\lambda'(v') := \cup$. Moreover, if $v_3$ is the output node, then $\ell'(v') := v_3$ and $r'(v') := v_4$. Otherwise, we connect $\ell'(v') := v_4$ and $r'(v') := v_3$. If $v_3$ and $v_4$ are not output nodes (i.e. both are union nodes), then $V' := V \cup \{v', v'', v^*\}$, $I' := I \cup \{v', v'', v^*\}$, $\ell'(v') := \ell(v_3)$, $r'(v') := v''$, and $\lambda'(v') := \cup$; $\ell'(v'') := \ell(v_4)$, $r'(v'') := v^*$, and $\lambda'(v'') := \cup$; $\ell'(v^*) := r(v_3)$, $r'(v^*) := r(v_4)$, and $\lambda'(v^*) := \cup$.

This gadget¹ is depicted in Figure 1. Interestingly, this construction has several properties. First, one can easily check that $\mathcal{L}_{\mathcal{D}'}(v') = \mathcal{L}_{\mathcal{D}}(v_3) \cup \mathcal{L}_{\mathcal{D}}(v_4)$ and then the semantics is well-defined. Second, union can be computed in constant time, independently of $|\mathcal{D}|$, given that we only need to add three fresh nodes (i.e. $v'$, $v''$, and $v^*$), and the operation is fully-persistent given that we connect them to previous nodes without modifying $\mathcal{D}$. Furthermore, the produced node $v'$ is safe in $\mathcal{D}'$, although nodes $v''$ and $v^*$ are not necessarily safe. Finally, $\mathcal{D}'$ is 2-bounded whenever $\mathcal{D}$ is 2-bounded. This is straightforward to see for the first case, when $v_3$ or $v_4$ is an output node. For the second case (i.e. the gadget of Figure 1), we have to notice that $v_3$ and $v_4$ are safe, therefore $\ell(v_3)$ and $\ell(v_4)$ are output nodes, and then $\mathrm{odepth}_{\mathcal{D}'}(v') = \mathrm{odepth}_{\mathcal{D}'}(v'') = 1$. Further, given that $v_3$ and $v_4$ are safe, we know that $\mathrm{odepth}_{\mathcal{D}}(r(v_3)) \leq 1$ and $\mathrm{odepth}_{\mathcal{D}}(r(v_4)) \leq 1$, so $\mathrm{odepth}_{\mathcal{D}'}(v^*) \leq 2$. Given that the output depths of all fresh nodes in $\mathcal{D}'$ are bounded by 2 and $\mathcal{D}$ is 2-bounded, we conclude that $\mathcal{D}'$ is 2-bounded as well.

By the previous discussion, if we start with an ECS $\mathcal{D}$ which is 2-bounded (or empty) and we apply the add, prod and union operators between safe nodes (which also produce safe nodes), then the result is 2-bounded as well. Finally, by Proposition 3, the result can be enumerated with output-linear delay.
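The three operations and the union gadget can be sketched with immutable nodes, as below. This is a minimal sketch under assumptions: the ECS $\mathcal{D}$ is represented implicitly as the set of nodes reachable from the nodes created so far, the operations return only the fresh node $v'$, and all Python names (`Node`, `add`, `prod`, `union`, `is_safe`, `reachable`, `words`) are chosen for this example. Since nodes are never modified, every earlier node keeps denoting the same set, which is the persistence property discussed above. The function `words` only checks the semantics and does not implement an output-linear delay traversal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True, eq=False)  # immutable, identity-based nodes
class Node:
    kind: str                       # "leaf", "union" or "prod"
    symbol: object = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def words(v):
    """Reference semantics of a node (not an output-linear delay traversal)."""
    if v.kind == "leaf":
        yield (v.symbol,)
    elif v.kind == "union":
        yield from words(v.left)
        yield from words(v.right)
    else:
        for w1 in words(v.left):
            for w2 in words(v.right):
                yield w1 + w2


def odepth(v):
    depth = 0
    while v.kind == "union":
        v, depth = v.left, depth + 1
    return depth


def is_output(v):
    return v.kind != "union"        # leaves and product nodes


def is_safe(v):
    return odepth(v) <= 1 and (odepth(v) == 0 or odepth(v.right) <= 1)


def add(a):
    return Node("leaf", symbol=a)


def prod(v1, v2):
    return Node("prod", left=v1, right=v2)


def union(v3, v4):
    """Union of two safe nodes; the caller guarantees their sets are disjoint."""
    if is_output(v3):
        return Node("union", left=v3, right=v4)
    if is_output(v4):
        return Node("union", left=v4, right=v3)
    # Both are union nodes: build the three-node gadget v', v'', v*.
    v_star = Node("union", left=v3.right, right=v4.right)
    v_second = Node("union", left=v4.left, right=v_star)
    return Node("union", left=v3.left, right=v_second)


def reachable(v):
    """All nodes reachable from v (helper used only to inspect output depths)."""
    seen, todo = set(), [v]
    while todo:
        n = todo.pop()
        if n not in seen:
            seen.add(n)
            if n.kind != "leaf":
                todo.extend((n.left, n.right))
    return seen


# Build a union over 16 distinct leaves by pairwise merging, which exercises
# the gadget at every level above the first.
layer = [add(i) for i in range(16)]
first_pair = union(layer[0], layer[1])
while len(layer) > 1:
    layer = [union(layer[i], layer[i + 1]) for i in range(0, len(layer), 2)]
root = layer[0]
```

In this example the root is safe and no reachable node has output depth above 2, in line with the argument above, while `first_pair` still denotes the set it denoted when it was created.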

Theorem 4. The operations add, prod, and union take constant time and are fully-persistent. Furthermore, if we start from an empty ECS $\mathcal{D}$ and apply add, prod, and union over safe nodes, the partial results $(\mathcal{D}', v')$ satisfy that $v'$ is always a safe node and the set $\mathcal{L}_{\mathcal{D}'}(v)$ can be enumerated with output-linear delay for every node $v$.

1 This trick is used in [6] for computing an index over a circuit and it is known as Louis’ trick.

It is important to remark that restricting these operations to safe nodes is a mild condition. Given that we will usually start from an empty ECS and apply these operations over previously returned nodes, the whole algorithm will always use safe nodes during its computation, satisfying the conditions of Theorem 4.

For technical reasons, our algorithm of the next section needs a slight extension of ECS that allows leaves that produce the empty string $\varepsilon$. Let $\varepsilon \notin \Sigma$ be a symbol representing the empty string (i.e. $w \cdot \varepsilon = \varepsilon \cdot w = w$). We define an enumerable compact set with $\varepsilon$ (called $\varepsilon$-ECS) as a tuple $\mathcal{D} = (\Sigma, V, I, \ell, r, \lambda)$ defined identically to an ECS except that $\lambda\colon V \to \Sigma \cup \{\cup, \odot, \varepsilon\}$ and $\lambda(v) \in \{\cup, \odot\}$ if, and only if, $v \in I$. Also, if $\lambda(v) = \varepsilon$, then $\mathcal{L}_{\mathcal{D}}(v) = \{\varepsilon\}$. The unambiguity and $k$-boundedness restrictions imposed on ECS also apply to this version. However, to support the prod and union operations in constant time and to maintain the $k$-boundedness invariant, we need to slightly extend the notion of safe nodes (called $\varepsilon$-safe) and the gadgets for prod and union. Given space restrictions, we show these extensions in the appendix and just state here the main result, which will be used in the next section.

Theorem 5. The operations add, prod, and union over $\varepsilon$-ECS take constant time and are fully-persistent. Furthermore, if we start from an empty $\varepsilon$-ECS $\mathcal{D}$ and apply add, prod, and union over $\varepsilon$-safe nodes, the partial results $(\mathcal{D}', v')$ satisfy that $v'$ is always an $\varepsilon$-safe node and the set $\mathcal{L}_{\mathcal{D}'}(v)$ can be enumerated with output-linear delay for every node $v$.

5 Evaluating visibly pushdown transducers with output-linear delay

The goal of this section is to describe an algorithm that takes an I/O-unambiguous VPT 𝒯 plus a well-nested word w, and enumerates the set ⟦𝒯⟧(w) with output-linear delay after a one-pass preprocessing phase. For this, we divide the presentation of the algorithm into two parts. The first part explains the determinization of a VPA, which is instrumental in understanding our preprocessing algorithm. The second part gives the algorithm and proves its correctness.

For the sake of simplification, in this section we present the algorithm and definitions without neutral letters, that is, the structured alphabet is Σ = (Σ<, Σ>). Indeed, a neutral symbol a can be represented as a pair consisting of an open symbol immediately followed by a close symbol and, therefore, it is straightforward to extend the techniques of this section to consider neutral symbols. Given this assumption, from now on we use a for denoting any symbol in Σ< ∪ Σ>.

Determinization of visibly pushdown automata. A significant result in Alur and Madhusudan's paper [4], which introduced VPA, is that one can always determinize them. We provide here an alternative proof of this result that requires a somewhat more direct construction. This determinization process is behind our preprocessing algorithm and serves to give some crucial notions of how it works. We start by providing the determinization construction, introducing some useful notation, and then giving some intuition.

Consider the following determinization of a non-deterministic VPA. Given a VPA A = (Q, Σ, Γ, ∆, I, F), we define a deterministic VPA A^det = (Q^det, q_0^det, Γ^det, δ^det, F^det) with state set Q^det = 2^{Q×Q} and stack symbol set Γ^det = 2^{Q×Γ×Q}. The initial state is q_0^det = {(q, q) ∣ q ∈ I} and the set of final states is F^det = {S ∈ Q^det ∣ S ∩ (I × F) ≠ ∅}. Finally, the transition function δ^det is defined as follows:

if <a ∈ Σ<, then δ^det(S, <a) = (S′, T) where:

$$S' = \{(q, q) \mid \exists p, p', x.\ (p, p') \in S \wedge (p', {<}a, x, q) \in \Delta\}, \qquad T = \{(p, x, q) \mid \exists p'.\ (p, p') \in S \wedge (p', {<}a, x, q) \in \Delta\}$$

if a> ∈ Σ>, then δ^det(S, T, a>) = S′ where:

$$S' = \{(p, q) \mid \exists p', q', x.\ (p, x, p') \in T \wedge (p', q') \in S \wedge (q', x, a{>}, q) \in \Delta\}$$

[Figure 2: diagram with labels Open, Close, currlevel(k), lowerlevel(k), (p, x, p′) ∈ T_k, and (p′, q′) ∈ S_k.]

Figure 2 Left: An example run of some VPA A at step k. Right: Illustration of two nondeterministic runs for some VPA A, as considered in the determinization process.
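The determinization step can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the close step follows the formula for close symbols in the text, while the open step is inferred from the way OpenStep is described for Algorithm 1 (start a fresh level at the target state and record on the stack how the level below was left). The function names and the tuple layouts `(p', a, x, q)` for open transitions and `(q', x, a, q)` for close transitions are assumptions made for this sketch, and the input word is assumed to be well-nested.

```python
def open_step(S, a, delta_open):
    """Assumed open case. S is a set of pairs (p, q);
    delta_open holds tuples (p', a, x, q)."""
    S_new, T = set(), set()
    for (p, p1) in S:
        for (src, letter, x, q) in delta_open:
            if src == p1 and letter == a:
                S_new.add((q, q))  # a fresh level starts (and currently ends) at q
                T.add((p, x, q))   # the level below ran from p and pushed x to reach q
    return frozenset(S_new), frozenset(T)


def close_step(S, T, a, delta_close):
    """Close case from the text. delta_close holds tuples (q', x, a, q)."""
    S_new = set()
    for (p, x, p1) in T:
        for (p2, q1) in S:
            if p2 != p1:
                continue
            for (src, y, letter, q) in delta_close:
                if src == q1 and y == x and letter == a:
                    S_new.add((p, q))
    return frozenset(S_new)


def det_accepts(word, open_letters, delta_open, delta_close, I, F):
    """Run the determinized automaton on a well-nested word."""
    S = frozenset((q, q) for q in I)   # initial state {(q, q) | q in I}
    stack = []
    for a in word:
        if a in open_letters:
            S, T = open_step(S, a, delta_open)
            stack.append(T)
        else:
            S = close_step(S, stack.pop(), a, delta_close)
    # final states are those intersecting I x F
    return any(p in I and q in F for (p, q) in S)
```

As a usage example, with a single state 0, open transitions pushing x on "(" and y on "[", and close transitions popping x on ")" and y on "]", the word "([])()" is accepted, whereas "(]" is rejected because the popped stack symbol does not match.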

To understand the purpose of this construction, we first need to introduce some notation. Fix a well-nested word w = a_1 a_2 ⋯ a_n. A span s of w is a pair [i, j⟩ of natural numbers i and j with 1 ≤ i ≤ j ≤ n + 1. We denote by w[i, j⟩ the subword a_i ⋯ a_{j−1} of w and, when i = j, we assume that w[i, j⟩ = ε. Intuitively, spans index w by the intermediate positions 1, 2, …, n + 1, where position i lies between the symbols a_{i−1} and a_i. Then [i, j⟩ represents an interval {i, …, j} that captures the subword a_i ⋯ a_{j−1}.

Now, we say that a span [i, j⟩ of w is well-nested if w[i, j⟩ is well-nested. Note that ε is well-nested, so [i, i⟩ is a well-nested span for every i. For a position k ∈ [1, n + 1], we define the current-level span of k, currlevel(k), as the well-nested span [j, k⟩ such that j = min{j′ ∣ [j′, k⟩ is well-nested}. Since [k, k⟩ is always well-nested, currlevel(k) is well defined. We also define the lower-level span of k, lowerlevel(k), as lowerlevel(k) = currlevel(j − 1) = [i, j − 1⟩ whenever currlevel(k) = [j, k⟩ and j > 1. In contrast to currlevel(k), lowerlevel(k) is not always defined, since it is "one level below" currlevel(k) and this level may not exist. More concretely, for currlevel(k) = [j, k⟩ and lowerlevel(k) = [i, j − 1⟩, these spans look as follows:

$$a_1\, a_2 \cdots \overbrace{a_i \cdots a_{j-2}}^{\mathrm{lowerlevel}(k)}\ a_{j-1}\ \overbrace{a_j \cdots a_{k-1}}^{\mathrm{currlevel}(k)}\ \overset{\downarrow}{a_k} \cdots$$

… (see Figure 2 (right) for a graphical description).
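The two spans can be computed directly from their definitions. The sketch below is only meant to make the definitions concrete; it is not part of the algorithm, which never computes these spans explicitly. The function names mirror the text, but the signatures (a word given as a sequence of letters, a set `opens` of open symbols, 1-based positions, spans as pairs `(j, k)`) are assumptions of this sketch, as is the absence of neutral letters.

```python
def currlevel(word, opens, k):
    """Return (j, k) such that [j, k> is the longest well-nested span ending at k.

    Positions are 1-based; word[pos - 1] is the letter a_pos.
    """
    depth = 0  # number of close symbols still waiting for their open symbol
    j = k      # [k, k> is always well-nested
    for pos in range(k - 1, 0, -1):  # scan a_{k-1}, a_{k-2}, ..., a_1
        if word[pos - 1] in opens:
            if depth == 0:
                break              # unmatched open symbol: the level starts after it
            depth -= 1
        else:
            depth += 1
        if depth == 0:
            j = pos                # w[pos, k> is well-nested
    return (j, k)


def lowerlevel(word, opens, k):
    """Return currlevel(j - 1) where currlevel(k) = [j, k>, or None if j = 1."""
    j, _ = currlevel(word, opens, k)
    if j == 1:
        return None
    return currlevel(word, opens, j - 1)
```

For the word `(()())` and position k = 5, the letter a_4 is an unmatched open symbol, so currlevel(5) = [5, 5⟩ and lowerlevel(5) = currlevel(4) = [2, 4⟩.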

Algorithm 1 The preprocessing phase of the enumeration algorithm for EnumVPT given an I/O-unambiguous VPT 𝒯 = (Q, Σ, Γ, Ω, ∆, I, F) and a well-nested word w.

 1: procedure Preprocessing(𝒯, w)
 2-8:   …
 9:       CloseStep(a, k)
10:     k ← k + 1
11:   v_out ← ∅
12:   for each p ∈ I, q ∈ F s.t. S_{p,q} ≠ ∅ do
13:     (D, v_out) ← union(D, v_out, S_{p,q})
14:   return (D, v_out)
15:
16: procedure CloseStep(a>, k)
17:   S′ ← ∅
18:   for p, p′ ∈ Q and (q′, a>, o, x, q) ∈ ∆ do
19:     if S_{p′,q′} ≠ ∅ and T_{p,x,p′} ≠ ∅ then
20:       (D, v) ← prod(D, T_{p,x,p′}, S_{p′,q′})
21:       (D, v) ← IfProd(D, v, o, k)
22:       (D, v) ← union(D, v, S′_{p,q})
23:       S′_{p,q} ← v
24:   T ← pop(T)
25:   S ← S′
26:   return
27: procedure OpenStep(<a, k)
28-33:  …
34:       v ← S_{p,p′}
35:       (D, v) ← IfProd(D, v, o, k)
36:       (D, v) ← union(D, v, T_{p,x,q})
37:       T_{p,x,q} ← v
38:   S ← S′
39:   return
40:
41:
42: procedure IfProd(D, v, o, k)
43:   if o ≠ ε then
44:     (D′, v′) ← add(D, (o, k))
45:     (D′, v′) ← prod(D′, v, v′)
46:   else
47:     (D′, v′) ← (D, v)
48:   return (D′, v′)

Indeed, the most important consequence of these two invariants is that a tuple (q_j, q_k) ∈ S_k represents the interval of some run over w[j, k⟩ with currlevel(k) = [j, k⟩, and a tuple (q_i, x, q_j) ∈ T_k represents the interval of some run over w[i, j − 1⟩ with lowerlevel(k) = [i, j − 1⟩, i.e., the level below. In other words, the configuration (S_k, τ_k) of A^det forms a succinct representation of all the non-deterministic runs of A. This is the starting point of our preprocessing algorithm, which we discuss next.

The preprocessing algorithm. In Algorithm 1 we present the preprocessing phase of our enumeration algorithm for solving EnumVPT. The main procedure is Preprocessing, which receives as input an I/O-unambiguous VPT 𝒯 = (Q, Σ, Γ, Ω, ∆, I, F) and a well-nested word w, and computes the set of outputs ⟦𝒯⟧(w). More specifically, this procedure constructs an ε-ECS D and a vertex v_out such that L_D(v_out) = ⟦𝒯⟧(w). After the Preprocessing procedure is done, we can enumerate L_D(v_out) with output-linear delay by applying Theorem 5.

Towards this goal, in Algorithm 1 we make use of the following data structures. First of all, we use an ε-ECS D = (Σ, V, I, ℓ, r, λ), nodes v ∈ V, and the functions add, union, and prod over D and v (see Section 4). For the sake of simplification, we overload the notation of these operators slightly so that if v = ∅, then union(D, v, v′) = union(D, v′, v) = prod(D, v, v′) = prod(D, v′, v) = (D, v′). We use a hash table S which indexes nodes v in D by pairs of states (p, q) ∈ Q × Q. We denote the elements of S as "(p, q) ∶ v" where (p, q) is the index and v is the content. Furthermore, we write S_{p,q} to access the node v. We also use a stack T that stores hash tables: each element is a hash table which indexes vertices v in D by triples (p, x, q) ∈ Q × Γ × Q. We assume that T has the standard stack methods push and pop where, if T = t_k ⋯ t_1, then push(T, t) = t t_k ⋯ t_1 and pop(T) = t_{k−1} ⋯ t_1. Similarly to S, we use the notation T_{p,x,q} to access the nodes in the topmost hash table of T (recall that T is a stack of hash tables). We assume that accessing a non-assigned index in these hash tables returns the empty set. All variables representing these objects (i.e., 𝒯, D, S, and T) are defined globally in Algorithm 1 and can be accessed by any of the procedures. Finally, given that we use the RAM model, each operation over the hash tables or the stack takes constant time.

Preprocessing builds the ε-ECS D incrementally, reading w one letter at a time by calling yield[w] and keeping a counter k for the position of the current letter. For every k ∈ [1, n + 1] the main procedure builds the k-th iteration of table S and stack T, which we denote by S^k and T^k, respectively. We consider the initial S and T as the first iteration, defined as S^1 = {(q, q) ∶ v_ε ∣ q ∈ I} and T^1 = ∅ (i.e., the empty stack), where v_ε is a node in D such that L_D(v_ε) = {ε} (lines 3-4). In the k-th iteration, depending on whether the current letter is an open symbol or a close symbol, the OpenStep or CloseStep procedure is called, updating S^{k−1} and T^{k−1} to S^k and T^k, respectively. More specifically, Preprocessing adds nodes to D such that the nodes in S^k represent the runs over w[j, k⟩ where currlevel(k) = [j, k⟩, and the nodes in the topmost table of T^k represent the runs over w[i, j − 1⟩ where lowerlevel(k) = [i, j − 1⟩. Moreover, for a given pair (p, q), the node S^k_{p,q} represents all runs over w[j, k⟩ with currlevel(k) = [j, k⟩ that start on p and end on q.
For a given triple (p, x, q), the node T^k_{p,x,q} represents all runs over w[i, j − 1⟩ with lowerlevel(k) = [i, j − 1⟩ that start on p and end on q right after pushing x onto the stack. Here, the intuition gained from the determinization of VPA is crucial. Indeed, table S^k and stack T^k mirror the configuration (S_k, τ_k) of A^det (recall invariants (a) and (b) above).

Before formalizing these notions, we describe in more detail what the procedures OpenStep and CloseStep do. Recall that the operation add(D, a) simply creates a node in D labeled as a; the operation prod(D, v_1, v_2) returns a pair (D′, v′) such that L_{D′}(v′) = L_D(v_1) ⋅ L_D(v_2); and the operation union(D, v_3, v_4) returns a pair (D′, v′) such that L_{D′}(v′) = L_D(v_3) ∪ L_D(v_4). To improve the presentation of the algorithm, we include a simple procedure called IfProd (lines 42-48). Basically, this procedure receives a node v, an output symbol o, and a position k, and computes (D′, v′) such that L_{D′}(v′) = L_D(v) ⋅ {(o, k)} if o ≠ ε, and L_{D′}(v′) = L_D(v) otherwise.

In OpenStep, S^k is created (i.e., S′), and an empty table is pushed onto T^{k−1} to form T^k (line 28). Then, all nodes in S^{k−1} (i.e., S) are checked to see if the runs they represent can be extended with a transition in ∆ (lines 29-30). If this is the case (lines 31 onwards), a node v_ε with the ε-output is added to S^k to start a new level (lines 31-33). Then, if the transition has a non-empty output, the node S^{k−1}_{p,p′} is connected with a new label node to form the node v (lines 34-35). This node is stored in T^k_{p,x,q}, or united with the node that was already present there (lines 36-37).

In CloseStep, S^k is initialized as empty (line 17). Then the procedure looks for all the valid ways to join a node in T^{k−1}, a node in S^{k−1}, and a transition in ∆ to form a new node in S^k. More precisely, it looks for quadruples (p, x, p′, q′) for which T^{k−1}_{p,x,p′} and S^{k−1}_{p′,q′} are defined, and there is a close transition that starts on q′ and reads x (lines 18-19). These nodes are joined and, when the transition has a non-empty output, connected with a new label node (lines 20-21); the result is stored in S^k_{p,q} or united with the node that was already present there (lines 22-23). Finally, the top of the stack T is popped after all quadruples (p, x, p′, q′) are checked.

As already mentioned, in each step the construction of D follows the ideas of the determinization of a visibly pushdown automaton. As such, Figure 2 also helps to illustrate how the table S^k and the top of the stack T^k are constructed.

The way the table S^k and the stack T^k are constructed is formalized in the following result. Recall that a run of 𝒯 over a well-nested word w = a_1 ⋯ a_n is a sequence of the form $\rho = (q_0, \sigma_0) \xrightarrow{a_1/o_1} \cdots \xrightarrow{a_n/o_n} (q_n, \sigma_n)$. Given a span [i, j⟩, define a subrun of ρ as a subsequence $\rho[i, j\rangle = (q_i, \sigma_i) \xrightarrow{a_i/o_i} \cdots \xrightarrow{a_{j-1}/o_{j-1}} (q_j, \sigma_j)$. We also extend the function out to receive a subrun ρ[i, j⟩ in the following way: out(ρ[i, j⟩) = out(o_i, i) ⋅ … ⋅ out(o_{j−1}, j − 1). Finally, define Runs(𝒯, w) as the set of all runs of 𝒯 over w.
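The preprocessing phase described above can be sketched in Python as follows. This is a simplified illustration, not the algorithm of the paper: the ε-ECS is replaced by plain nested tuples with a naive recursive enumeration (so neither the constant-time gadgets nor the output-linear delay are reproduced), and the body of `open_step` is written from the prose description of OpenStep. All function names and the transition layout `(source, letter, output, stack_symbol, target)`, with `None` standing for the empty output, are assumptions of this sketch.

```python
EPS = ("eps",)  # leaf whose set contains only the empty output sequence


def leaf(sym):
    return ("leaf", sym)


def prod(v1, v2):
    return ("prod", v1, v2)


def union(v1, v2):
    # overloaded as in the text: an empty (None) operand is ignored
    if v1 is None:
        return v2
    if v2 is None:
        return v1
    return ("union", v1, v2)


def if_prod(v, o, k):
    # append the output (o, k) only when the transition produces one
    return v if o is None else prod(v, leaf((o, k)))


def open_step(S, a, k, delta_open):
    S_new, top = {}, {}
    for (p, p1), node in S.items():
        for (src, letter, o, x, q) in delta_open:
            if src == p1 and letter == a:
                S_new[(q, q)] = EPS                       # start a new level at q
                v = if_prod(node, o, k)
                top[(p, x, q)] = union(v, top.get((p, x, q)))
    return S_new, top


def close_step(S, top, a, k, delta_close):
    S_new = {}
    for (p, x, p1), t_node in top.items():
        for (p2, q1), s_node in S.items():
            if p2 != p1:
                continue
            for (src, letter, o, y, q) in delta_close:
                if src == q1 and letter == a and y == x:
                    v = if_prod(prod(t_node, s_node), o, k)
                    S_new[(p, q)] = union(v, S_new.get((p, q)))
    return S_new


def preprocessing(word, open_letters, delta_open, delta_close, I, F):
    S = {(q, q): EPS for q in I}  # first iteration of table S
    T = []                        # stack of hash tables, initially empty
    for k, a in enumerate(word, start=1):
        if a in open_letters:
            S, top = open_step(S, a, k, delta_open)
            T.append(top)
        else:
            S = close_step(S, T.pop(), a, k, delta_close)
    v_out = None
    for (p, q), v in S.items():   # union over accepting pairs
        if p in I and q in F:
            v_out = union(v_out, v)
    return v_out


def outputs(v):
    """Naive enumeration of the set represented by node v."""
    if v is None:
        return
    if v[0] == "eps":
        yield ()
    elif v[0] == "leaf":
        yield (v[1],)
    elif v[0] == "union":
        yield from outputs(v[1])
        yield from outputs(v[2])
    else:
        for x in outputs(v[1]):
            for y in outputs(v[2]):
                yield x + y
```

As a usage example, take a one-state transducer that, on each open parenthesis, either emits the label "L" or emits nothing, and emits nothing on close parentheses. On the word `()()` the sketch yields four output sequences, one for each subset of the open positions 1 and 3.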

Lemma 6. Let 𝒯 be a VPT and w = a_1 ⋯ a_n be a well-nested word. While running the procedure Preprocessing of Algorithm 1, for every k ∈ [1, n + 1], every pair of states p, q, and every stack symbol x, the following hold:
1. L_D(S^k_{p,q}) contains exactly the sequences out(ρ[j, k⟩) such that ρ ∈ Runs(𝒯, w[1, k⟩), currlevel(k) = [j, k⟩, and ρ[j, k⟩ starts on p and ends on q.
2. If lowerlevel(k) is defined, then L_D(T^k_{p,x,q}) contains exactly the sequences out(ρ[i, j⟩) such that ρ ∈ Runs(𝒯, w[1, j⟩), lowerlevel(k) = [i, j − 1⟩, and ρ[i, j⟩ starts on p, ends on q, and the last symbol pushed onto the stack is x.

Since w is well-nested, currlevel(|w| + 1) = [1, |w| + 1⟩, and so the lemma implies that the nodes in S^{|w|+1} represent all runs of 𝒯 over w. By taking the union of all pairs in S^{|w|+1} that represent accepting runs, as is done in lines 11-13, we can conclude the following result².

Theorem 7. Given a VPT 𝒯 and a word w, Preprocessing(𝒯, w) makes one pass over w and returns a pair (D, v_out) such that L_D(v_out) = ⟦𝒯⟧(w).

At this point we address the fact that D needs to be unambiguous in order to enumerate all the outputs from (D, v_out) without repetitions. This is guaranteed essentially by the fact that 𝒯 is I/O-unambiguous as well. Indeed, the previous result holds even if 𝒯 is not I/O-unambiguous. The next result guarantees that the output can be enumerated efficiently.

Lemma 8. Let 𝒯 be an unambiguous VPT and w be a well-nested word. While running the procedure Preprocessing of Algorithm 1, the ε-ECS D is unambiguous at every step.

The complexity of this algorithm can be easily deduced from the fact that the ε-ECS operations we use take constant time (Theorem 5). For a VPT 𝒯 = (Q, Σ, Γ, Ω, ∆, I, F), in each call to OpenStep, lines 31-37 perform a constant number of instructions, and they are visited at most |Q||∆| times. In each call to CloseStep, lines 20-23 perform a constant number of instructions, and they are visited at most |Q|²|∆| times. Combined with Theorem 7, Lemma 8, and Theorem 5, this shows our main result.
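The counting argument can be written out as follows. This is only a sketch of the bookkeeping: the per-letter bounds are the ones stated in the text, while the last line, which sums the per-letter bound over the whole word, is a consequence we spell out here and is not stated in this form above.

```latex
\begin{align*}
\text{OpenStep, per letter:}\quad
  & \#\{(p, t) \mid p \in Q,\ t \in \Delta\} \le |Q| \cdot |\Delta| \\
\text{CloseStep, per letter:}\quad
  & \#\{(p, p', t) \mid p, p' \in Q,\ t \in \Delta\} \le |Q|^2 \cdot |\Delta| \\
\text{update time:}\quad
  & O\big(\max(|Q||\Delta|,\ |Q|^2|\Delta|)\big) = O(|Q|^2 |\Delta|) \\
\text{whole preprocessing:}\quad
  & \sum_{k=1}^{|w|} O(|Q|^2 |\Delta|) = O(|Q|^2 |\Delta| \cdot |w|)
\end{align*}
```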

Corollary 9. Let 𝒯 be an I/O-unambiguous VPT and w be a nested word. Then ⟦𝒯⟧(w) can be enumerated with one-pass preprocessing, update time O(|Q|²|∆|), and output-linear delay.

6 Application: document spanners and extraction grammars

This section presents an application of our enumeration algorithm to the evaluation of recursive spanners [33]. Practical formalisms to define document spanners for information extraction with recursion were proposed only very recently. In [32], the author suggests using extraction grammars to specify document spanners, which is the natural extension of regular spanners to a controlled form of recursion. Furthermore, the author gives an enumeration algorithm for unambiguous functional extraction grammars that outputs the results with constant delay after a preprocessing phase that takes cubic time in the document. By restricting to the class of visibly pushdown extraction grammars, we obtain an enumeration algorithm with one-pass preprocessing and output-linear delay, that is, an enumeration algorithm that needs a single pass over the document and takes linear time. We proceed by recalling the framework of document spanners and extraction grammars to define the class of visibly pushdown extraction grammars and state the main algorithmic result.

2 By the definition of an enumeration algorithm with one-pass preprocessing, the preprocessing phase should end right after the EOF is read. However, one can assume that lines 11-13 are executed at the end of each iteration, without affecting the asymptotic performance of the algorithm.

We start by recalling the basics of document spanners [20]. Fix an alphabet Σ and a set of variables Vars such that Σ ∩ Vars = ∅. A document d over Σ is a word in Σ*. A span s of a document d is a pair [i, j⟩ of natural numbers i and j with 1 ≤ i ≤ j ≤ |d| + 1. Intuitively, a span represents a substring of d by identifying its starting and ending positions. We denote by Spans(d) the set of all possible spans of d. Let X ⊆ Vars be a finite set of variables. An (X, d)-mapping µ ∶ X → Spans(d) assigns variables in X to spans of d. An (X, d)-relation is a finite set of (X, d)-mappings. Then a document spanner P (or just spanner) is a function associated with a finite set X of variables that maps documents d into (X, d)-relations.

We use the framework of extraction grammars, recently proposed in [32], to specify document spanners. For X ⊆ Vars, let C_X = { {x, }x ∣ x ∈ X } be the set of captures of X where, intuitively, {x denotes the opening of x, and }x its closing.
An extraction context-free grammar, or extraction grammar for short, is a tuple G = (X, V, Σ, S, P) such that X ⊆ Vars, V is a finite set of non-terminal symbols, Σ is the alphabet of terminal symbols with Σ ∩ V = ∅, S ∈ V is the start symbol, and P ⊆ V × (V ∪ Σ ∪ C_X)* is a finite relation. In the literature, the elements of V are also referred to as "variables", but we call them non-terminals to distinguish V from Vars. Each pair (A, α) ∈ P is called a production and we write it as A → α. The set of productions P defines the (left) derivation relation ⇒_G ⊆ (V ∪ Σ ∪ C_X)* × (V ∪ Σ ∪ C_X)* such that wAβ ⇒_G wαβ iff w ∈ (Σ ∪ C_X)*, A ∈ V, α, β ∈ (V ∪ Σ ∪ C_X)*, and A → α ∈ P. We denote by ⇒*_G the reflexive and transitive closure of ⇒_G. Then the language defined by G is L(G) = {w ∈ (Σ ∪ C_X)* ∣ S ⇒*_G w}. A word w ∈ L(G) is called a ref-word produced by G.

In order to define a spanner from G, we need to interpret ref-words as mappings [24]. Formally, a ref-word r = a_1 … a_n ∈ (Σ ∪ C_X)* is called valid for X if, for every x ∈ X, there exists exactly one position i with a_i = {x and exactly one position j with a_j = }x, such that i < j. In other words, a valid ref-word defines a correct match of open and close captures. Moreover, each x ∈ X induces a unique factorization of r of the form r = r^p_x ⋅ {x ⋅ r_x ⋅ }x ⋅ r^s_x. This factorization defines an (X, d)-mapping as follows. Let plain ∶ (Σ ∪ C_X)* → Σ* be the morphism that removes the captures from ref-words, namely, plain(a) = a when a ∈ Σ and plain(c) = ε when c ∈ C_X. Furthermore, let r be a valid ref-word for X, d be a document, and assume that plain(r) = d. Then we define the (X, d)-mapping µ^r such that µ^r(x) = [i, j⟩ iff r = r^p_x ⋅ {x ⋅ r_x ⋅ }x ⋅ r^s_x, i = |plain(r^p_x)| + 1, and j = i + |plain(r_x)|. Finally, the spanner ⟦G⟧ associated to an extraction grammar G is defined over any document d ∈ Σ* as follows:

$$\llbracket G \rrbracket(d) = \{\, \mu^r \mid r \in L(G),\ r \text{ is valid for } X, \text{ and } \mathrm{plain}(r) = d \,\}$$
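The interpretation of a ref-word as a mapping can be sketched in a few lines of Python. The encoding is an assumption made for illustration: a ref-word is a list whose items are either one-character strings (letters of Σ) or tuples `("open", x)` and `("close", x)` standing for the captures {x and }x; the function names `plain`, `is_valid`, and `to_mapping` are hypothetical, and spans [i, j⟩ are returned as pairs `(i, j)`.

```python
def plain(r):
    """Remove the captures from a ref-word, keeping only letters of Σ."""
    return "".join(t for t in r if isinstance(t, str))


def is_valid(r, X):
    """Check that every variable is opened exactly once and then closed exactly once."""
    for x in X:
        opens = [i for i, t in enumerate(r) if t == ("open", x)]
        closes = [i for i, t in enumerate(r) if t == ("close", x)]
        if len(opens) != 1 or len(closes) != 1 or opens[0] >= closes[0]:
            return False
    return True


def to_mapping(r, X):
    """Return the mapping of a valid ref-word r as a dict from variables to spans."""
    mu = {}
    for x in X:
        io = r.index(("open", x))
        ic = r.index(("close", x))
        i = len(plain(r[:io])) + 1            # |plain(r^p_x)| + 1
        j = i + len(plain(r[io + 1:ic]))      # i + |plain(r_x)|
        mu[x] = (i, j)
    return mu
```

For example, the ref-word a {x b c }x d has plain text abcd and assigns to x the span [2, 4⟩, which captures the substring bc.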

There are two classes of extraction grammars that are relevant for our discussion. The first class is that of functional extraction grammars. An extraction grammar G is functional if every r ∈ L(G) is valid for X. In [32] it was shown that for any extraction grammar G there exists an equivalent functional grammar G′ (i.e., ⟦G⟧ = ⟦G′⟧). Non-functional grammars are problematic given that, even for regular spanners, their decision problems easily become intractable [31, 25]. For this reason, we restrict to functional extraction grammars without loss of expressive power. The second class is that of unambiguous extraction grammars. An extraction grammar G is unambiguous if for every r ∈ L(G) there exists exactly one path from S to r in the graph ((V ∪ Σ ∪ C_X)*, ⇒_G). In other words, there exists exactly one left-most derivation.

We now consider a sub-class of extraction grammars for nested words. Let Σ = (Σ<, Σ>, Σ|) be a structured alphabet. A visibly pushdown extraction grammar (VPEG) is a functional extraction grammar G = (X, V, Σ, S, P) in which Σ = (Σ<, Σ>, Σ|) is a structured alphabet, and all the productions in P are of one of the following forms: (1) A → ε; (2) A → aB such that a ∈ Σ| ∪ C_X and B ∈ V; (3) A → <a B b> C such that <a ∈ Σ<, b> ∈ Σ>, and B, C ∈ V. Intuitively, rules A → aB allow producing arbitrary sequences of neutral symbols, whereas rules A → <a B b> C force the word to be well-nested.

Visibly pushdown extraction grammars are a subclass of extraction grammars that works for well-nested documents. In fact, the reader can notice that the visibly pushdown restriction for extraction grammars is the counterpart of visibly pushdown grammars³ introduced in [4]. Therefore, one could expect that VPEGs are less expressive than extraction grammars. Despite this, we can use Theorem 2 to give an efficient one-pass enumeration algorithm for evaluating VPEGs.

Theorem 10. Fix a set of variables X. The enumeration problem of, given a visibly pushdown extraction grammar G = (X, V, Σ, S, P) and a document d, enumerating all (X, d)-mappings of ⟦G⟧(d) can be solved with one-pass preprocessing, update time 2^{|G|³}, and output-linear delay. Furthermore, if G is restricted to be unambiguous, then the problem can be solved with update time |G|³.

This result is obtained by constructing an extraction pushdown automaton [32] from G and reducing it to a visibly pushdown transducer. Note that, although the update time of the algorithm is exponential in the size of the grammar, in terms of data complexity the update time is constant. Furthermore, for the special case of unambiguous grammars, the update time is even polynomial. Unambiguous grammars are very common in parsing tasks [2] and, thus, this restriction could be useful in practice.

7 Concluding remarks

This paper presented a one-pass enumeration algorithm with output-linear delay to evaluate VPT over nested words. We believe that the data structure and algorithm presentation are more manageable than previous algorithms over tree structures and leave more room to implement and optimize the evaluation task.

One possible direction is to improve the complexity of the update time. The update time of our algorithm is O(|Q|²|∆|), that is, cubic in the number of states and transitions; however, it may be possible to derive an algorithm with a better complexity by using a different determinization construction. Another direction is to find a one-pass enumeration algorithm with polynomial update time for the case of non-deterministic VPT. In [7], the authors provided a polynomial-time algorithm for non-deterministic word transducers (called vset automata), and they extended this result to automata over trees in [8]. We believe that one could develop these techniques to provide an algorithm for the non-deterministic case. However, it is not yet clear how to extend ECS to handle ambiguity naturally.

3 The definition of visibly pushdown grammars in [4] is slightly more complicated given that they consider nested words that are not necessarily well-nested (see the discussion in Section 2).

References

1 Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, 1974.
2 Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers, principles, techniques. Addison-Wesley, 7(8):9, 1986.
3 Rajeev Alur, Dana Fisman, Konstantinos Mamouras, Mukund Raghothaman, and Caleb Stanford. Streamable regular transductions. Theor. Comput. Sci., 807:15–41, 2020.
4 Rajeev Alur and P. Madhusudan. Visibly pushdown languages. In STOC, pages 202–211, 2004.
5 Rajeev Alur and Parthasarathy Madhusudan. Adding nesting structure to words. Journal of the ACM (JACM), 56(3):1–43, 2009.
6 Antoine Amarilli, Pierre Bourhis, Louis Jachiet, and Stefan Mengel. A circuit-based approach to efficient enumeration. In ICALP, pages 111:1–111:15, 2017.
7 Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Constant-delay enumeration for nondeterministic document spanners. In ICDT, pages 22:1–22:19, 2019.
8 Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Enumeration on trees with tractable combined complexity and efficient updates. In PODS, pages 89–103, 2019.
9 Marcelo Arenas, Luis Alberto Croquevielle, Rajesh Jayaram, and Cristian Riveros. Efficient logspace classes for enumeration, counting, and uniform generation. In PODS, pages 59–73, 2019.
10 Guillaume Bagan. MSO queries on tree decomposable structures are computable with linear delay. In CSL, pages 167–181, 2006.
11 Guillaume Bagan, Arnaud Durand, and Etienne Grandjean. On acyclic conjunctive queries and constant delay enumeration. In CSL, pages 208–222, 2007.
12 Christoph Berkholz, Fabian Gerhardt, and Nicole Schweikardt. Constant delay enumeration for conjunctive queries: a tutorial. ACM SIGLOG News, 7(1):4–33, 2020.
13 Christoph Berkholz, Jens Keppeler, and Nicole Schweikardt. Answering conjunctive queries under updates. In PODS, pages 303–318, 2017.
14 Mikolaj Bojanczyk and Pawel Parys. Xpath evaluation in linear time. J. ACM, 58(4):17:1–17:33, 2011.
15 Nofar Carmeli, Shai Zeevi, Christoph Berkholz, Benny Kimelfeld, and Nicole Schweikardt. Answering (unions of) conjunctive queries using random access and random-order enumeration. In PODS, pages 393–409, 2020.
16 Timothy Y. Chow. A beginner's guide to forcing. Communicating mathematics, 479:25–40, 2009.
17 Bruno Courcelle. Linear delay enumeration and monadic second-order logic. Discrete Applied Mathematics, 157(12):2675–2700, 2009.
18 James R. Driscoll, Neil Sarnak, Daniel Dominic Sleator, and Robert Endre Tarjan. Making data structures persistent. In STOC, pages 109–121, 1986.
19 Arnaud Durand and Etienne Grandjean. First-order queries on structures of bounded degree are computable with constant delay. ACM Trans. Comput. Log., 8(4):21, 2007.
20 Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Document spanners: A formal approach to information extraction. J. ACM, 62(2):12:1–12:51, 2015.
21 Emmanuel Filiot, Olivier Gauwin, Pierre-Alain Reynier, and Frédéric Servais. Streamability of nested word transductions. LMCS, 15(2), 2019.
22 Emmanuel Filiot, Jean-François Raskin, Pierre-Alain Reynier, Frédéric Servais, and Jean-Marc Talbot. Visibly pushdown transducers. JCSS, 97:147–181, 2018.
23 Fernando Florenzano, Cristian Riveros, Martín Ugarte, Stijn Vansummeren, and Domagoj Vrgoc. Efficient enumeration algorithms for regular document spanners. TODS, 45(1):3:1–3:42, 2020.
24 Dominik D. Freydenberger. A logic for document spanners. Theory Comput. Syst., 63(7):1679–1754, 2019.
25 Dominik D. Freydenberger, Benny Kimelfeld, and Liat Peterfreund. Joining extractions of regular expressions. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, TX, USA, June 10-15, 2018, pages 137–149, 2018.
26 Olivier Gauwin, Joachim Niehren, and Sophie Tison. Bounded delay and concurrency for earliest query answering. In LATA, pages 350–361, 2009.
27 Alejandro Grez and Cristian Riveros. Towards streaming evaluation of queries with correlation in complex event processing. In ICDT, pages 14:1–14:17, 2020.
28 Alejandro Grez, Cristian Riveros, and Martín Ugarte. A formal framework for complex event processing. In ICDT, pages 5:1–5:18, 2019.
29 Mark R. Jerrum, Leslie G. Valiant, and Vijay V. Vazirani. Random generation of combinatorial structures from a uniform distribution. Theoretical computer science, 43:169–188, 1986.
30 Wojciech Kazana and Luc Segoufin. First-order query evaluation on structures of bounded degree. Log. Methods Comput. Sci., 7(2), 2011.
31 Francisco Maturana, Cristian Riveros, and Domagoj Vrgoc. Document spanners for extracting incomplete information: Expressiveness and complexity. In PODS, pages 125–136, 2018.
32 Liat Peterfreund. Grammars for document spanners. CoRR, abs/2003.06880, 2020. URL: https://arxiv.org/abs/2003.06880, arXiv:2003.06880.
33 Liat Peterfreund, Balder ten Cate, Ronald Fagin, and Benny Kimelfeld. Recursive programs for document spanners. In ICDT, pages 13:1–13:18, 2019.
34 Nicole Schweikardt, Luc Segoufin, and Alexandre Vigny. Enumeration for FO queries over nowhere dense graphs. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, TX, USA, June 10-15, 2018, pages 151–163, 2018.
35 Luc Segoufin. Enumerating with constant delay the answers to a query. In ICDT, pages 10–20, 2013.

A Proofs from Section 3 A.1 Proof of Lemma 1 Let T = (Q, Σ, Γ, Ω, ∆,I,F ). We will construct an input-output deterministic VPAwO T = det Q Q Q Γ Q (Q , Σ, Γ , Ω, δ , qI ,F ) as follows. Let Q = 2 and Γ = 2 . Let SI = {(q, q)S q ∈ ′I} and′ let F′ = {S S(p, q)′∈ S for some p ∈ I and′ q ∈× F }. Let′δ be× defined× as follows: < For For a> ∈ Σ and o` ∈ Ω, δ(S, a>, o`,T ) = S where, if T ⊆ Q × Γ × Q, then: ′ S = {(p, q)S(p , q ) ∈ S and (p, x, p ) ∈ T and (q , a>, o`, x, q ) ∈ ∆ ′ ′ ′ ′ for some′ p , q ′′∈ Q, x ∈ Γ}, ′ ′ | For a ∈ Σ and o` ∈ Ω, δ(S, a) = S where: ′ S = {(q, q )S(q, q ) ∈ S and (q , a, o`, q ) ∈ ∆ for some q ∈ Q}. ′ ′′ ′ ′ ′′ ′ One can immediately check that this automaton is input-output determinstic since the transition relation is modelled as a partial function. We will prove that T and T are equivalent by induction on well-nested words. To aid our proof, we will introduce a couple′ of ideas. First, we extend the definition of a run to include sequences that start on an arbitary configuration. Also, given a run

$$\rho = (q_1, \sigma_1) \xrightarrow{s_1/o_1} (q_2, \sigma_2) \xrightarrow{s_2/o_2} \cdots \xrightarrow{s_n/o_n} (q_{n+1}, \sigma_{n+1}),$$

and a span [i, j⟩, define a subrun of ρ as the subsequence

$$\rho[i, j\rangle = (q_i, \sigma_i) \xrightarrow{s_i/o_i} (q_{i+1}, \sigma_{i+1}) \xrightarrow{s_{i+1}/o_{i+1}} \cdots \xrightarrow{s_{j-1}/o_{j-1}} (q_j, \sigma_j).$$

In this proof, we only consider subruns such that w[i, j⟩ = s_i s_{i+1} ⋯ s_{j−1} is a well-nested word. A second definition we will use is that of a VPT with arbitrary initial states. Formally, let q ∈ Q. We define 𝒯_q as the VPT that simulates 𝒯 by starting on the configuration (q, ε). Note that for a run $\rho = (q_1, \sigma_1) \xrightarrow{s_1/o_1} \cdots \xrightarrow{s_n/o_n} (q_{n+1}, \sigma_{n+1})$ of 𝒯 over w = s_1 ⋯ s_n and a well-nested span [i, j⟩, the subrun ρ[i, j⟩ is one of the runs of 𝒯_{q_i} over w[i, j⟩ modulo σ_i, which is present in all of the stacks in ρ[i, j⟩ as a common suffix. We shall prove first that ⟦𝒯⟧(w) ⊆ ⟦𝒯′⟧(w) for every well-nested word w. This is done with the aid of the following result:

B Claim 11. For a well-nested word w, output sequence µ, states p, q ⊆ Q, and a set S that contains (p, q), if there is a run of T q over w and µ such that its last state is q , the (only) ′ run of T S over w and µ ends in a state S which contains (p, q ). ′ ′ ′ Proof. We will prove the claim by induction on w. If w = ε, the proof is trivial since q = q . If w = a ∈ Σ| the proof follows straightforwardly from the construction of δ. ′ If w, v ∈ Σ<*>, and µ, κ ∈ O , let p, q ∈ Q, let S be a set that contains (p, q), and let ρ be a run of T q over wv and µκ, which∗ ends in a state q . Our goal is to prove that the run ρ of ′ w′ T S over wv and µκ ends in a state that contains (p, q ). Let n = SwS, m = SvS, and let q be ′ ′ M. Muñoz and C. Riveros 19 the last state of the subrun ρ[1, n + 1⟩. Consider as well ρ[n + 1, n + m + 1⟩, which is a run of T qw over v and κ that ends in q . From the hypothesis two conditions follow: (1) In the run ′ w of T S over w and µ the last state S contains (p, q ), and (2) in the run of T S′ over v and κ the last′ state contains (p, q ). It can′ be seen that ρ is the concatenation of′ these two runs, so this proves the claim. ′ ′ <*> < > If w ∈ Σ , ∈ Σ , µ ∈ O , and o`1, o`2 ∈ O, let p, q ∈ Q, let S be a set that contains (p, q), and let ρ be a run of T q over <∗awb> and o`1µo`2. Let n = SwS, and let q, q2, . . . , qn 2, qn 3 be the states of ρ in order. Our goal is to prove that the run ρ of T over and S + + o`1µo`2 ends in a state that contains (p, qn 3). Let (q2, x) be the second′ configuration′ of ρ. This implies that (q, , o` , x, q ) ∈ ∆. Let S and T be such 1 2 + n 2 2 n 3 that δ(S,

Since (q , q ) ∈ S , from the hypothesis it follows that the run of T ′ over w and µ ends in a 2 2 + S state S that contains′ (q2, qn 2). This run starts on the configuration′ (S , ε) and ends in (S , ε),′′ so a run on the same automaton that starts on (S ,T ) and reads the′ same symbols + will′′ end in (S ,T ), which is the case for the subrun ρ [2, n′+ 2⟩. Therefore, the construction of δ implies that′′ (p, qn 3) is contained in the last state′ of ρ , which proves the claim. J ′ Let now w be a well-nested+ word and µ be an output sequence such that T accepts (w, µ). Let ρ be an accepting run of T over (w, µ) which starts on a state p ∈ I and ends in a state q ∈ F . Note that T also accepts (w, µ). Note that T = T , and since (p, p) ∈ S the claim p SI I implies that the run of T over (w, µ) ends in a state which′ contains (p, q), and so this run is accepting. This proves that T (w) ⊆ T (w). To prove that T (w) ⊆ JT K(w) weJ use′K a similar result: J ′K J K B Claim 12. For a well-nested word w, output sequence µ, states q, p, q ⊆ Q, and a set S ′ that contains (p, q), if the run of T S over w and µ ends on a state S that contains (p, q ), then there is a run of T q over w and′ µ such that its last state is q . ′ ′ ′ Proof. We will prove the claim by induction on w. If w = ε, the proof is trivial since q = q . If w = a ∈ Σ| the proof follows straightforwardly from the construction of δ. ′ If w, v ∈ Σ<*>, and µ, κ ∈ O , let p, q, q ∈ Q, let S be a set that contains (p, q), and let ρ ∗ ′ be the run of T S over wv and µκ, which ends in a state S that contains (p, q ). Our goal is to prove that′ there is a run ρ of T q over wv and µκ such′ that its last state′ is q . Let n = SwS, m = SvS, and let Sw be the′ last state of the subrun ρ[1, n + 1⟩. Consider as′ well

ρ[n + 1, n + m + 1⟩, which is a run of T Sw over v and κ that ends in S . From the construction of δ, it is clear that if a non-empty′ state S follows from S in a run′ of T , then S is not w w empty. Let (p, q ) ∈ S . From the hypothesis′ two conditions follow: (1) There′ is a run ρ1 of w T q over w and µ such that its last state is q (2) There is a run ρ2 of T qw over v and κ such that its last state is q . We then construct ρ by concatenating ρ1 and ρ2 which ends in q , and this proves the claim.′ ′ ′ <*> < > If w ∈ Σ , ∈ Σ , µ ∈ O , and o`1, o`2 ∈ O, let p, q, q ∈ Q, let S be a set

∗ ` ′ ` that contains (p, q), and let ρ be the run of T S over and o1µo2. Let n = SwS, let S, S2,...,Sn 2,Sn 3 be the states of ρ in order,′ and suppose there is a pair (p, q ) ∈ Sn 3. Our goal is to prove that there is a run ρ of T over and o` µo` that ends in′ q . Let + + q 1 2 + (S2,T ) be the second configuration of ρ. From′ the construction of δ, there exist q2, qn ′2 ∈ Q and x ∈ Γ such that (q , b>, o` , x, q ) ∈ ∆, (p, x, q ) ∈ T and (q , q ) ∈ S . Since w n 2 2 n 3 2 2 n 2 n 2 + is well-nested, this T could only have been pushed after reading

ρ[2, n + 2⟩, which is a run of T over w and µ that ends in S modulo the common stack S2 n 2 suffix T . We now have that (q′ , q ) ∈ S and (q , q ) ∈ S , and so, from the hypothesis 2 2 2 2 n 2 n 2 + it follows that there is a run ρ of T over w and µ such that its last state is q . In q2 + + n 2 a similar fashion as in the previous′′ claim, we modify the run slightly to obtain one that + starts and ends on the stack x. This new run can be easily extended with the transitions (q, , o`2, x, qn 3) ∈ ∆, and as a result, we obtain a run ρ of T q that fulfils the conditions of the claim. ′ + + J Let now w be a well nested word and let µ be an output sequence such that T accepts ′ (w, µ). Since T = T SI and the run of T over (w, µ) ends in a state S ∈ F , we have that S contains an element′ (p, q) such that p ∈ I′ and q ∈ F . Moreover, (p, p) ∈ SI .′ From the prevous claim, it follows that there is an accepting run of T p over (w, µ) such that its last state is q. Therefore, T accepts (w, µ). This proves that T (w) ⊆ T (w). J K J K We conclude that T (w) = T (w) for every′ well-nested word w. J J K J ′K B Proofs from Section 4 B.1 Proof of Proposition 3 Let D = (Σ, V, I, `, r, λ) be a k-bounded ECS and v ∈ V . We will show that the set L (v) can be enumerated with output-linear delay. To show that this is possible we use a data D structure we call an output tree. This is a dynamic binary tree T which appends itself to an ECS D. We define it as follows: If v is a leaf node in D, then v is an output tree of D. If T is an output tree and v is a union node, then T = v(T ) is an output tree of D. If T1 and T2 are output trees and v is a product node, then′ v(T1,T2) is an output tree of D. In either case, we say that T is rooted in v, and we notate it by root(T ) = v. For an output tree T we define the functions childT , lchildT and rchildT as follows: If v(T ) is a subtree of T , then childT (v) = T . If v(T1,T2) is a subtree of T , then lchildT (v) = T1 and′ rchildT (v) = T2. 
These functions are not defined in any other case.

Definition 13. Let $D = (\Sigma, V, I, \ell, r, \lambda)$ be an ECS. An output tree $T$ of $D$ is full if for each node $v$ in $T$ the following hold: If $v$ is a union node in $D$, then $\mathrm{child}_T(v)$ is rooted either in $\ell(v)$ or in $r(v)$. If $v$ is a product node in $D$, then $\mathrm{lchild}_T(v)$ is rooted in $\ell(v)$ and $\mathrm{rchild}_T(v)$ is rooted in $r(v)$.

We define the function $\mathrm{print}(T)$ as follows: If $T = v$ where $v$ is a leaf node, then $\mathrm{print}(T) = \lambda(v)$. If $T = v(T')$ then $\mathrm{print}(T) = \mathrm{print}(T')$. If $T = v(T_1, T_2)$ then $\mathrm{print}(T) = \mathrm{print}(T_1) \cdot \mathrm{print}(T_2)$.

Lemma 14. Let $D$ be an ECS and let $v$ be a node in $D$. For a full output tree $T$ of $D$ rooted in $v$ it holds that $\mathrm{print}(T) \in \mathcal{L}_D(v)$.

Proof. We prove this by induction on the size of $T$. The case $T = v$ where $v$ is a leaf node is trivial. If $T = v(T')$, then $v$ is a union node, so the proof follows since $\mathrm{print}(T)$ is equal to $\mathrm{print}(T')$, which is either in $\mathcal{L}_D(\ell(v))$ or in $\mathcal{L}_D(r(v))$, and therefore in $\mathcal{L}_D(v)$. If $T = v(T_1, T_2)$ then $v$ is a product node. We have that $\mathrm{print}(T_1) \in \mathcal{L}_D(\ell(v))$ and $\mathrm{print}(T_2) \in \mathcal{L}_D(r(v))$, from which it follows that $\mathrm{print}(T) \in \mathcal{L}_D(v)$.

Lemma 15. Let $D$ be an unambiguous ECS and let $v$ be a node in $D$. For each $\mu \in \mathcal{L}_D(v)$ there exists exactly one full output tree $T_\mu$ of $D$ rooted in $v$ such that $\mathrm{print}(T_\mu) = \mu$.

Proof. Let $\mathrm{reach}_D(v)$ be the number of nodes reachable from $v$ in $D$, including itself. We prove the lemma by induction on $\mathrm{reach}_D(v)$. If $\mathrm{reach}_D(v) = 1$, then $v$ is a leaf node and the proof follows directly since the only output tree rooted in $v$ is $v$ itself. Assume that the lemma holds for every node $v'$ such that $\mathrm{reach}_D(v') < s$. Let $v$ be a node such that $\mathrm{reach}_D(v) = s$ and let $\mu \in \mathcal{L}_D(v)$.

If $v$ is a union node, suppose without loss of generality that $\mu \in \mathcal{L}_D(\ell(v))$. Note that since $D$ is unambiguous we have that $\mu \notin \mathcal{L}_D(r(v))$. If $T_\mu = v(T')$ and $T'$ was rooted in $r(v)$, Lemma 14 would imply that $\mathrm{print}(T_\mu) = \mathrm{print}(T') \in \mathcal{L}_D(r(v))$, which leads to a contradiction. Therefore, $T_\mu$ can only be of the form $v(T')$ where $T'$ is rooted in $\ell(v)$. From our hypothesis, there exists only one full output tree $T'_\mu$ rooted in $\ell(v)$ such that $\mathrm{print}(T'_\mu) = \mu$, so the proof follows from taking $T_\mu = v(T'_\mu)$.

If $v$ is a product node, note that any full output tree $T$ rooted in $v$ is of the form $v(T_1, T_2)$, where $T_1$ and $T_2$ are rooted in $\ell(v)$ and $r(v)$ respectively. Since $D$ is unambiguous, there exists only one pair of strings $\mu_1$ and $\mu_2$ such that $\mu = \mu_1 \cdot \mu_2$, $\mu_1 \in \mathcal{L}_D(\ell(v))$ and $\mu_2 \in \mathcal{L}_D(r(v))$. Let $T_{\mu_1}$ and $T_{\mu_2}$ be the only full output trees rooted in $\ell(v)$ and $r(v)$ respectively for which the hypothesis holds. The proof follows by taking $T_\mu = v(T_{\mu_1}, T_{\mu_2})$.

For an ECS $D$ and node $v$ we define a total order over the full output trees rooted in $v$ recursively: If $v$ is a leaf node there exists only one tree rooted in $v$ so the order is trivial.

If $v$ is a union node then let $T_1 = v(T_1')$ and $T_2 = v(T_2')$ be full output trees. We have that $T_1 < T_2$ if and only if $\mathrm{root}(T_1') = \ell(v)$ and $\mathrm{root}(T_2') = r(v)$, or $T_1' < T_2'$. If $v$ is a product node then let $T = v(T_1, T_2)$ and $T' = v(T_1', T_2')$. We have that $T < T'$ if and only if $T_1 < T_1'$, or $T_1 = T_1'$ and $T_2 < T_2'$.

For an ECS $D$ and an output tree $T$ on $D$ we define the operation $\mathrm{tilt}(T)$ as follows: If $T = v$, then $\mathrm{tilt}(T) = v$. If $T = v(T')$ where $\mathrm{root}(T') = \ell(v)$, then $\mathrm{tilt}(T) = v(\mathrm{tilt}(T'))$. If $T = v(T')$ where $\mathrm{root}(T') = r(v)$, then $\mathrm{tilt}(T) = \mathrm{tilt}(T')$. If $T = v(T_1, T_2)$, then $\mathrm{tilt}(T) = v(\mathrm{tilt}(T_1), \mathrm{tilt}(T_2))$. Intuitively, this operation bypasses any union node in $T$ whose child is a right child in $D$.
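The tilt operation can be illustrated with a small Python sketch. The encoding is an assumption made only for this example and is not taken from the paper: ECS nodes are instances of a hypothetical `Node` class, and an output tree is a tuple `(v,)` for a leaf, `(v, child)` for a union node, and `(v, left, right)` for a product node.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical ECS node: kind is "leaf", "union" or "prod"."""
    kind: str
    label: str = ""
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def tilt(t):
    """Bypass every union node whose child is rooted in its right child."""
    v = t[0]
    if len(t) == 1:                      # leaf: T = v
        return t
    if len(t) == 3:                      # product: T = v(T1, T2)
        return (v, tilt(t[1]), tilt(t[2]))
    child = t[1]                         # union: T = v(T')
    if child[0] is v.left:               # root(T') = l(v): keep v
        return (v, tilt(child))
    return tilt(child)                   # root(T') = r(v): drop v


def print_tree(t):
    """Concatenate the leaf labels from left to right."""
    if len(t) == 1:
        return t[0].label
    if len(t) == 3:
        return print_tree(t[1]) + print_tree(t[2])
    return print_tree(t[1])
```

In this encoding a full tree whose union nodes all point to left children is returned unchanged, while a union node over a right child disappears from the result; the printed string is the same in both cases.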

Definition 16. For an ECS $D$, an output tree $T$ of $D$ is left-tilted if it can be obtained as $T = \mathrm{tilt}(T')$ where $T'$ is a full output tree.

Two left-tilted output trees can be seen in Figure 3. The first tree in the figure is also full. Note that since the root could be a union node whose child is a right child, the root of $\mathrm{tilt}(T)$ could be a different node than the root of $T$. We also notice the following result.

Lemma 17. Let $D$ be an ECS with a node $v$. The first tree $T$ in the ordered sequence of full output trees rooted in $v$ is also left-tilted. In other words, $\mathrm{tilt}(T) = T$.

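Proof. We define the operation $\mathrm{build}(v)$ as follows. If $v$ is a leaf node, then $\mathrm{build}(v) = v$. If $v$ is a union node then $\mathrm{build}(v) = v(\mathrm{build}(\ell(v)))$. If $v$ is a product node then $\mathrm{build}(v) = v(\mathrm{build}(\ell(v)), \mathrm{build}(r(v)))$. Let $T = \mathrm{build}(v)$ and let $T'$ be a different full output tree rooted in $v$. A straightforward induction shows that $T < T'$. Since every union node in $T$ has a left child as its child, tilt leaves $T$ unchanged.

Lemma 18. Let $D$ be an ECS with an output tree $T$. We have that $\mathrm{print}(\mathrm{tilt}(T)) = \mathrm{print}(T)$.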

Proof. The proof follows by a straightforward induction on the tree.

We are ready to discuss the enumeration algorithm. Our algorithm receives an unambiguous k-bounded ECS $D$ along with one of its nodes $v$ and prints the elements in $\mathcal{L}_D(v)$ one by one. It does so by generating the sequence of left-tilted output trees $\mathrm{tilt}(T_1), \ldots, \mathrm{tilt}(T_m)$, where $T_1 < \cdots < T_m$ is the complete sequence of full output trees rooted in $v$. After generating each tree $T$, the procedure outputs the string $\mathrm{print}(T)$, which can be done with a depth-first traversal of the tree. The procedure is detailed in Algorithm 2.

Algorithm 2 Enumeration of the set L_D(v) for an ECS D and a node v.
 1: procedure Enumerate(D, v)
 2:     T ← BuildTree(D, v)
 3:     Output #
 4:     while T ≠ ∅ do
 5:         Output print(T)
 6:         Output #
 7:         T ← NextTree(D, T)
 8: procedure BuildTree(D, v)
 9:     if λ(v) ∈ Σ then
10:         return v
11:     else if λ(v) = ⊙ then
12:         T1 ← BuildTree(D, ℓ(v))
13:         T2 ← BuildTree(D, r(v))
14:         return v(T1, T2)
15:     else if λ(v) = ∪ then
16:         T ← BuildTree(D, ℓ(v))
17:         return v(T)
18: procedure NextTree(D, T)
19:     if T = v then
20:         return ∅
21:     else if T = v(T1, T2) then
22:         T2 ← NextTree(D, T2)
23:         if T2 is empty then
24:             T1 ← NextTree(D, T1)
25:             if T1 is empty then
26:                 return ∅
27:             T2 ← BuildTree(D, r(v))
28:         return T
29:     else if T = v(T′) then
30:         T′ ← NextTree(D, T′)
31:         if T′ = ∅ then
32:             T ← BuildTree(D, r(v))
33:         return T

Figure 3 (drawing not reproduced; it shows the same ECS twice, with nodes labeled ∪, ⊙, $a_1$, $a_2$ and $a_3$). An example iteration of an output tree. The underlying ECS $D$ is represented by solid edges, and the output tree by curved dashed lines. The next tree would be the single node $v$ for which $\lambda(v) = a_3$.

The procedure BuildTree builds a completely embedded output tree rooted in u. The procedure NextTree receives a tree rooted in u and recursively builds the next tree in the sequence tilt(T1),..., tilt(Tm) for which T1 < ⋯ < Tm is the sequence of full output trees rooted in u. We can deduce the following from Lemma 17:

Corollary 19. Let $D$ be an ECS and let $v$ be one of its nodes. BuildTree(D, v) builds a full output tree $T$ that is the first in the ordered sequence of full output trees rooted in $v$.
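The three procedures of Algorithm 2 can be sketched in Python as follows. This is an illustration under assumptions of our own rather than the authors' implementation: nodes are instances of a hypothetical `Node` class, trees are immutable tuples (`(v,)` for a leaf, `(v, child)` for a union node, `(v, left, right)` for a product node), `None` plays the role of ∅, and the separator symbol # is omitted. When the child of a union node is exhausted, the sketch returns the first tree rooted in the right child, which is how left-tilted trees bypass the union node.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical ECS node: kind is "leaf", "union" or "prod"."""
    kind: str
    label: str = ""
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(v):
    """First full output tree rooted in v (always descend to the left)."""
    if v.kind == "leaf":
        return (v,)
    if v.kind == "prod":
        return (v, build_tree(v.left), build_tree(v.right))
    return (v, build_tree(v.left))


def next_tree(t):
    """Next left-tilted tree after t, or None when t was the last one."""
    v = t[0]
    if len(t) == 1:                          # leaf: nothing comes after it
        return None
    if len(t) == 3:                          # product node
        t2 = next_tree(t[2])
        if t2 is not None:                   # advance the right factor
            return (v, t[1], t2)
        t1 = next_tree(t[1])                 # otherwise advance the left one
        if t1 is None:
            return None
        return (v, t1, build_tree(v.right))  # and restart the right factor
    child = next_tree(t[1])                  # union node with a left child
    if child is not None:
        return (v, child)
    return build_tree(v.right)               # left side exhausted: bypass v


def print_tree(t):
    """Concatenate the leaf labels from left to right."""
    if len(t) == 1:
        return t[0].label
    if len(t) == 3:
        return print_tree(t[1]) + print_tree(t[2])
    return print_tree(t[1])


def enumerate_outputs(v) -> Iterator[str]:
    """Yield the elements of L(v) one by one."""
    t = build_tree(v)
    while t is not None:
        yield print_tree(t)
        t = next_tree(t)
```

For an ECS representing (a ∪ b) ⊙ (c ∪ d), the generator first exhausts the right factor for a fixed left factor and only then advances the left factor, matching the order defined on full output trees.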

We prove the correctness of the algorithm in the following results.

Lemma 20. Let $D$ be an ECS and let $v$ be one of its nodes. Let $T_1 < \ldots < T_m$ be the sequence of full output trees rooted in $v$. If the procedure NextTree receives $(D, \mathrm{tilt}(T_i))$ it returns $\mathrm{tilt}(T_{i+1})$, or ∅ if $i = m$.

Proof. We prove this by induction on $\mathrm{reach}_D(v)$. If $v$ is a leaf node, the sequence consists only of the tree $v$, so the proof follows directly. Assume it holds for nodes $v'$ such that $\mathrm{reach}_D(v') < s$ and let $v$ be such that $\mathrm{reach}_D(v) = s$. If $v$ is a union node, notice that there exists an $e$ such that the sequence of full output trees rooted in $v$ is $T_1 < \ldots < T_e < T_{e+1} < \ldots < T_m$ where $T_e = v(T_e')$ and $T_{e+1} = v(T_{e+1}')$, and $\mathrm{root}(T_e') = \ell(v)$ and $\mathrm{root}(T_{e+1}') = r(v)$. If $i < e$ or $i > e$, then the proof follows by induction. Otherwise, if $i = e$, note that the procedure BuildTree(D, r(v)) builds the first full output tree rooted in $r(v)$, which is $T_{e+1}'$, and is equal to $\mathrm{tilt}(T_{e+1})$. If $v$ is a product node the proof follows straightforwardly by induction over the algorithm.

From the previous results, correctness of the algorithm follows:

Claim 21. Enumerate receives an ECS $D$ and one of its nodes $v$ and outputs all of the elements in $\mathcal{L}_D(v)$ one by one without repetition.

Proof. Let $T_1 < \cdots < T_m$ be the sequence of full output trees rooted in $v$. The algorithm starts by generating $T_1 = \mathrm{tilt}(T_1)$, as shown by Lemma 17 and Corollary 19. Then on each step $i$, the algorithm takes $T$ as $\mathrm{tilt}(T_i)$ and transforms it into $\mathrm{tilt}(T_{i+1})$, as proven by Lemma 20. In each step, an element in $\mathcal{L}_D(v)$ is given as output, as proven by Lemma 14 and Lemma 18. Moreover, the sequence $T_1 < \cdots < T_m$ allows the set $\mathcal{L}_D(v)$ to be produced exhaustively and without repetitions, as proven by Lemma 15.

The following results ensure that each tree in the sequence can be generated efficiently.

Lemma 22. Let $D$ be an ECS, let $v$ be one of its nodes, and let $T_1 < \cdots < T_m$ be the sequence of full output trees rooted in $v$. If the procedure NextTree receives $(D, T)$ it returns the tree $T'$ in at most $c(|T| + |T'|)$ time, for some constant $c$.

Proof. We choose $c$ as a bound on the number of steps that are taken in NextTree without taking recursion into account, that is, the time that it takes to run lines 19-33 without calls. A first observation is that BuildTree builds a tree $T$ in time at most $c|T|$, since each call to BuildTree takes less than $c$ steps, and exactly one call to BuildTree is done per node in $T$. We prove the lemma by induction on the tree. If $v$ is a leaf node, then the proof is trivial.

If $v$ is a product node, let $T = v(T_1, T_2)$, and let $T'$ be the output of NextTree, so that $T' = v(T_1', T_2')$ or $T' = \emptyset$. If the call in line 22 returns a nonempty tree, then the procedure takes time $c + c(|T_2| + |T_2'|)$. Otherwise, line 22 takes time $c|T_2|$. Then, if the call in line 24 returns a nonempty tree, it takes time $c(|T_1| + |T_1'|)$, and then the call in line 27 takes time $c|T_2'|$; otherwise, it takes time $c|T_1|$. In each of the routes where $T'$ is not empty, the execution time is bounded by $c(|T_1| + |T_2| + |T_1'| + |T_2'| + 1) \le c(|T| + |T'|)$, and if $T' = \emptyset$, it is bounded by $c(|T_1| + |T_2| + 1) = c|T|$, which proves the statement.

If $v$ is a union node, let $T = v(T')$ and let $T_{out}$ be the output of NextTree. If the call in line 30 returns a nonempty tree, it takes time $c(|T'| + |T_{out}'|)$, where $T_{out} = v(T_{out}')$, and the procedure takes total time $c + c(|T'| + |T_{out}'|) \le c(|T| + |T_{out}|)$, which proves the statement. Otherwise, the call in line 30 takes time $c|T'|$, and then line 32 takes time $c|T_{out}|$, which adds to a total time $c + c(|T'| + |T_{out}|) = c(|T| + |T_{out}|)$, which also proves the statement.

Lemma 23. Let $D$ be a k-bounded ECS and $T$ be a left-tilted output tree in $D$. The size of $T$ is at most $2k|\mathrm{print}(T)|$.

Proof. Note that $|\mathrm{print}(T)|$ is equal to the number of leaves in $T$. Since $T$ is left-tilted, for each union node $v$ in $T$ we have that $\mathrm{child}_T(v)$ is rooted in $\ell(v)$. We also have that $D$ is k-bounded, so there are at most $k$ nodes between each pair of product nodes in $T$. We know that a binary tree with $e$ leaves has $2e - 1$ nodes and $2e - 2$ edges. Therefore, if we replace each edge by $k - 1$ nodes we obtain a tree whose size is an upper bound for the size of $T$, and the proof follows.
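The counting step at the end of this proof can be written out explicitly. The following is a short worked calculation, assuming $e = |\mathrm{print}(T)|$ leaves and $k \ge 1$:

```latex
\begin{align*}
|T| &\le \underbrace{(2e - 1)}_{\text{nodes of the binary tree}}
      + \underbrace{(k - 1)(2e - 2)}_{\text{nodes added along the edges}} \\
    &= 2ke - 2k + 1 \\
    &\le 2ke = 2k\,|\mathrm{print}(T)|.
\end{align*}
```

The last inequality only uses $2k \ge 1$.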

From these lemmas we obtain a result that ensures nearly output-linear delay.

Claim 24. Let $D$ be a k-bounded ECS and let $v$ be a node in $D$. For some sequence $\mu_1, \ldots, \mu_m$ that contains exactly the elements in $\mathcal{L}_D(v)$ without repetition, Enumerate can produce each element $\mu_i$ for $i \in [2, m]$ with delay $c(|\mu_{i-1}| + |\mu_i|)$, and $\mu_1$ with delay $c|\mu_1|$, where $c$ is a constant.

Proof. The sequence in question is the one given by the total order $T_1 < \cdots < T_m$ of full output trees rooted in $v$, for which $\mu_i = \mathrm{print}(T_i)$. Let $c'$ be the constant in Lemma 22 and let $d$ be a constant such that $\mathrm{print}(T)$ can be produced in time $d|T|$. We have that BuildTree can build a tree $T$ in time $c'|T|$. In Lemma 22 it is shown that the first tree $T_1$ in the sequence can be generated in time $c'|T_1|$, and in Lemma 23 we show that $|T_1| \le 2k|\mu_1|$. From this, it follows that $\mu_1$ can be produced in time $2k(c' + d)|\mu_1|$. For each $i \in [2, m]$, Lemma 22 shows that $T_i$ can be generated in time $c'(|T_{i-1}| + |T_i|)$. We can bound this number by $2kc'(|\mu_{i-1}| + |\mu_i|)$ using Lemma 23. Printing the output takes time $d|T_i|$, so the total time is $2kc'(|\mu_{i-1}| + |\mu_i|) + 2kd|\mu_i|$, which is bounded by $2k(c' + d)(|\mu_{i-1}| + |\mu_i|)$. We conclude the proof by taking $c = 2k(c' + d)$.

We optimize this result to obtain the desired statement.

Proposition 25 (Proposition 3). Fix $k \in \mathbb{N}$. Let $D$ be an unambiguous and k-bounded ECS. Then the set $\mathcal{L}_D(v)$ can be enumerated with output-linear delay for any node $v$ in $D$.

Proof. Let $c'$ be the constant from Claim 24. We have that the elements in $\mathcal{L}_D(v)$ can be enumerated with delay proportional to the size of each output, plus the size of the previous output. Moreover, the first output can be produced in time proportional to its size. Let $\mu_1, \ldots, \mu_m$ be the elements in $\mathcal{L}_D(v)$ in the order that the algorithm from Claim 24 produces them. We use this algorithm as an oracle to retrieve each output $\mu_i$ in order. After each output is obtained, it is printed on an auxiliary tape instead of the output tape. Then the oracle is called to obtain the following output $\mu_{i+1}$. Once $c'|\mu_i|$ steps have been taken in the oracle, $\mu_i$ is printed to the output tape. After this, the call to obtain $\mu_{i+1}$ is resumed. This output is then obtained $c'(|\mu_i| + |\mu_{i+1}|)$ steps after the oracle was called, but only $c'|\mu_{i+1}|$ steps after $\mu_i$ was printed. From this, we have that for each $i \in [1, m]$, printing $\mu_i$ takes time $(2c' + d)|\mu_i|$, where $d$ arises from the modifications to the algorithm. By taking the constant $c = 2c' + d$ we obtain output-linear delay.
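The effect of delaying each output can be checked with a short calculation. This is only a sketch under simplifying assumptions that are not in the original text: the bookkeeping cost $d$ is ignored, $t_i$ denotes the (hypothetical) time at which the oracle finishes computing $\mu_i$, and $e_i$ denotes the time at which $\mu_i$ reaches the output tape.

```latex
\begin{align*}
t_1 &= c'|\mu_1|, &
t_i &= t_{i-1} + c'\bigl(|\mu_{i-1}| + |\mu_i|\bigr) \quad (i \ge 2), \\
e_i &= t_i + c'|\mu_i|, &
e_1 &= 2c'|\mu_1|, \\
e_i - e_{i-1} &= (t_i - t_{i-1}) + c'|\mu_i| - c'|\mu_{i-1}|
               = 2c'|\mu_i| \quad (i \ge 2).
\end{align*}
```

The term $c'|\mu_{i-1}|$ that made the delay of Claim 24 depend on the previous output cancels, so the gap before $\mu_i$ depends only on $|\mu_i|$. For the last output there is no further oracle call, so the algorithm can simply wait $c'|\mu_m|$ steps or print it at once.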

B.2 Proof of Theorem 4

The construction of the operators and the reasoning why each partial result $(D', v')$ is 2-bounded is stated in the paper. By adding the condition that $D'$ is unambiguous we can deduce that $\mathcal{L}_{D'}(v')$ can be enumerated with output-linear delay using Proposition 3.

B.3 Proof of Theorem 5

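Let $D = (\Sigma, V, I, \ell, r, \lambda)$ be an ε-ECS. For a node $v \in V$ we say that $v$ is ε-safe if (1) $v$ is safe, and (2) one of the following applies to $v$:
$\varepsilon \notin \mathcal{L}_D(v)$.
$\lambda(v) = \varepsilon$.
$\lambda(v) = \cup$, $\ell(v) = \varepsilon$ (that is, $\ell(v)$ is a leaf labeled ε) and $\varepsilon \notin \mathcal{L}_D(r(v))$.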

Figure 4 (drawing not reproduced; it shows three gadgets rooted in the union nodes $v'_a$, $v'_b$ and $v'_c$, built over $v_1$, $v_2$, $r(v_1)$, $r(v_2)$ and ε-leaves). Gadgets for prod as defined for an ε-ECS. Nodes $v'_a$, $v'_b$ and $v'_c$ correspond to $v'$ as defined for cases (a), (b) and (c) respectively.

This last case produces the most involved constructions and will often be referred to as Case 3.

Let $D_v$ be the ε-ECS induced by the nodes that are reachable from $v$. Formally, let $V_v$ be this set of nodes. Then $D_v = (\Sigma, V_v, I_v, \ell_v, r_v, \lambda_v)$ where $I_v = I \cap V_v$, and $\ell_v$, $r_v$ and $\lambda_v$ are the functions $\ell$, $r$ and $\lambda$ induced by $V_v$. It is straightforward to check that $\mathcal{L}_{D_v}(v) = \mathcal{L}_D(v)$. A further condition that we require for a node $v$ to be ε-safe is that (1) if $\varepsilon \notin \mathcal{L}_D(v)$, then $D_v$ does not contain any leaf node $v'$ such that $\lambda(v') = \varepsilon$, and (2) if $v$ is in Case 3, then $D_{r(v)}$ does not contain any leaf node $v'$ such that $\lambda(v') = \varepsilon$. In other words, they are both regular ECS.

We define the operations add, prod and union over $D$ to return a pair $(D', v')$ such that $D' = (\Sigma, V', I', \ell', r', \lambda')$ as follows:

For $(D', v') := \mathrm{add}(D, a)$ we define $V' := V \cup \{v'\}$, $I' := I$, and $\lambda'(v') = a$.

Assume $v_1$ and $v_2$ are ε-safe. Further, assume that for every word $w \in \mathcal{L}_D(v_1) \cdot \mathcal{L}_D(v_2)$ there exists only one pair of non-empty words $w_1$ and $w_2$ such that $w_1 \in \mathcal{L}_D(v_1)$, $w_2 \in \mathcal{L}_D(v_2)$ and $w = w_1 w_2$. Since both $v_1$ and $v_2$ may fall in one of three cases, we define $(D', v') := \mathrm{prod}(D, v_1, v_2)$ by separating into nine cases, of which the first six are straightforward:
If $\varepsilon \notin \mathcal{L}_D(v_1)$ and $\varepsilon \notin \mathcal{L}_D(v_2)$, we use the construction given for a regular ECS.
If $\varepsilon \notin \mathcal{L}_D(v_1)$ and $\lambda(v_2) = \varepsilon$, we define $v' = v_1$ and $D' = D$.
If $\lambda(v_1) = \varepsilon$ and $\varepsilon \notin \mathcal{L}_D(v_2)$, we define $v' = v_2$ and $D' = D$.
If $\lambda(v_1) = \varepsilon$ and $\lambda(v_2) = \varepsilon$, we define $v' = v_1$ and $D' = D$.
If $\lambda(v_1) = \varepsilon$ and $v_2$ is in Case 3, we define $v' = v_2$ and $D' = D$.
If $v_1$ is in Case 3 and $\lambda(v_2) = \varepsilon$, we define $v' = v_1$ and $D' = D$.
The other three cases are more involved and they are presented graphically in Figure 4. Formally, they are defined as follows:

(a) If $\varepsilon \notin \mathcal{L}_D(v_1)$ and $v_2$ is in Case 3, then $V' = V \cup \{v', v''\}$, $I' = I \cup \{v', v''\}$, $\ell'(v') = v_1$, $r'(v') = v''$, $\ell'(v'') = v_1$, $r'(v'') = r(v_2)$, $\lambda'(v') = \cup$ and $\lambda'(v'') = \odot$.
(b) If $v_1$ is in Case 3 and $\varepsilon \notin \mathcal{L}_D(v_2)$, then $V' = V \cup \{v', v''\}$, $I' = I \cup \{v', v''\}$, $\ell'(v') = v_2$, $r'(v') = v''$, $\ell'(v'') = r(v_1)$, $r'(v'') = v_2$, $\lambda'(v') = \cup$ and $\lambda'(v'') = \odot$.
(c) If both $v_1$ and $v_2$ are in Case 3, then $V' = V \cup \{v', v^2, v^3, v^4, v^*\}$, $I' = I \cup \{v', v^2, v^3, v^4\}$, $\ell'(v') = v^*$, $r'(v') = v^2$, $\ell'(v^2) = r(v_1)$, $r'(v^2) = v^3$, $\ell'(v^3) = v^4$, $r'(v^3) = r(v_2)$, $\ell'(v^4) = r(v_1)$, $r'(v^4) = r(v_2)$, $\lambda'(v') = \lambda'(v^2) = \lambda'(v^3) = \cup$, $\lambda'(v^4) = \odot$ and $\lambda'(v^*) = \varepsilon$.

Assume $v_1$ and $v_2$ are ε-safe nodes. Further, assume that $\mathcal{L}_D(v_1) \setminus \{\varepsilon\}$ and $\mathcal{L}_D(v_2) \setminus \{\varepsilon\}$ are disjoint. We define $(D', v') := \mathrm{union}(D, v_1, v_2)$ as follows:
If $\varepsilon \notin \mathcal{L}_D(v_1)$ and $\varepsilon \notin \mathcal{L}_D(v_2)$, we use the construction given for a regular ECS.
If $\varepsilon \notin \mathcal{L}_D(v_1)$ and $\lambda(v_2) = \varepsilon$, we define $V' = V \cup \{v'\}$, $I' = I \cup \{v'\}$ and $\lambda'(v') = \cup$. We connect $\ell'(v') = v_2$ and $r'(v') = v_1$.

If $\varepsilon \notin \mathcal{L}_D(v_1)$ and $v_2$ is in Case 3, let $(D'', v'') = \mathrm{union}(D, v_1, r(v_2))$ as defined for a regular ECS. We define $V' = V'' \cup \{v'\}$, $I' = I'' \cup \{v'\}$ and $\lambda'(v') = \cup$, where $\lambda'$ is an extension of $\lambda''$. We connect $\ell'(v') = \ell(v_2)$ and $r'(v') = v''$.
If $\lambda(v_1) = \varepsilon$ and $\varepsilon \notin \mathcal{L}_D(v_2)$, we define $V' = V \cup \{v'\}$, $I' = I \cup \{v'\}$ and $\lambda'(v') = \cup$. We connect $\ell'(v') = v_1$ and $r'(v') = v_2$.
If $\lambda(v_1) = \varepsilon$ and $\lambda(v_2) = \varepsilon$, we define $D' = D$ and $v' = v_1$.
If $\lambda(v_1) = \varepsilon$ and $v_2$ is in Case 3, we define $D' = D$ and $v' = v_2$.
If $v_1$ is in Case 3 and $\varepsilon \notin \mathcal{L}_D(v_2)$, let $(D'', v'') = \mathrm{union}(D, r(v_1), v_2)$ as defined for a regular ECS. We define $V' = V'' \cup \{v'\}$, $I' = I'' \cup \{v'\}$ and $\lambda'(v') = \cup$, where $\lambda'$ is an extension of $\lambda''$. We connect $\ell'(v') = \ell(v_1)$ and $r'(v') = v''$.
If $v_1$ is in Case 3 and $\lambda(v_2) = \varepsilon$, we define $D' = D$ and $v' = v_1$.
If both $v_1$ and $v_2$ are in Case 3, we define $(D', v') = \mathrm{union}(D, r(v_1), v_2)$ as defined in a previous case.

Whenever $D''$ is mentioned it is assumed to be equal to $(\Sigma, V'', I'', \ell'', r'', \lambda'')$. It is straightforward to check that each operation behaves as expected. That is, if $(D', v') = \mathrm{add}(D, a)$, then $\mathcal{L}_{D'}(v') = \{a\}$; if $(D', v') = \mathrm{prod}(D, v_1, v_2)$, then $\mathcal{L}_{D'}(v') = \mathcal{L}_D(v_1) \cdot \mathcal{L}_D(v_2)$; and if $(D', v') = \mathrm{union}(D, v_1, v_2)$, then $\mathcal{L}_{D'}(v') = \mathcal{L}_D(v_1) \cup \mathcal{L}_D(v_2)$. Moreover, if both $v_1$ and $v_2$ are ε-safe, then the resulting node $v'$ is ε-safe as well for each operation.

Note that each operation falls into a fixed number of cases which can be checked exhaustively, and each construction has a fixed size, so they take constant time. Furthermore, each operation is fully persistent.

Finally, let $(D', v')$ be a partial result obtained from applying the operations add, prod and union such that $D'$ is unambiguous. Since $v'$ is ε-safe, it falls in one of the three cases mentioned at the beginning of the proof. If $\varepsilon \notin \mathcal{L}_{D'}(v')$, then $D'_{v'}$ is a regular ECS, and we can enumerate $\mathcal{L}_{D'}(v')$ with the algorithm from Proposition 3. If $\lambda'(v') = \varepsilon$, then we can trivially enumerate the set $\mathcal{L}_{D'}(v') = \{\varepsilon\}$ in constant time. If $v'$ is in Case 3, then we enumerate $\mathcal{L}_{D'}(v')$ by using the algorithm from Proposition 3 to enumerate the set $\mathcal{L}_{D'}(r'(v'))$ after printing ε as an output. It is straightforward to check that Enumerate only traverses through nodes in $D_v$ when fed the input $(D, v)$, and will therefore enumerate $\mathcal{L}_{D'}(r'(v'))$ with output-linear delay.
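The nine-case analysis for prod can be sketched in Python. Everything in this block is an assumption made for illustration: nodes are immutable values of a hypothetical `Node` class (so building new nodes never modifies old ones, which mimics persistence), the empty string stands for ε, inputs are assumed to be ε-safe, and `lang` is a brute-force helper used only to check the result on small examples.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical ECS node: kind is "leaf", "union" or "prod"."""
    kind: str
    symbol: Optional[str] = None     # leaf label; "" stands for epsilon
    left: Optional["Node"] = None
    right: Optional["Node"] = None


EPS = Node("leaf", "")


def leaf(a):
    return Node("leaf", a)


def union(l, r):
    return Node("union", None, l, r)


def product(l, r):
    return Node("prod", None, l, r)


def case_of(v):
    """Classify an epsilon-safe node into one of its three cases."""
    if v.kind == "leaf" and v.symbol == "":
        return "eps"                 # lambda(v) = epsilon
    if v.kind == "union" and v.left == EPS:
        return "case3"               # epsilon-leaf on the left, rest on the right
    return "noeps"                   # epsilon is not in L(v)


def prod(v1, v2):
    c1, c2 = case_of(v1), case_of(v2)
    if c1 == "noeps" and c2 == "noeps":
        return product(v1, v2)       # regular construction
    if c2 == "eps":
        return v1                    # multiplying by epsilon on the right
    if c1 == "eps":
        return v2                    # multiplying by epsilon on the left
    if c1 == "noeps":                # case (a): v2 is in Case 3
        return union(v1, product(v1, v2.right))
    if c2 == "noeps":                # case (b): v1 is in Case 3
        return union(v2, product(v1.right, v2))
    both = product(v1.right, v2.right)   # case (c): both are in Case 3
    return union(EPS, union(v1.right, union(both, v2.right)))


def lang(v):
    """Brute-force language of a node, for checking small examples only."""
    if v.kind == "leaf":
        return {v.symbol}
    if v.kind == "prod":
        return {x + y for x in lang(v.left) for y in lang(v.right)}
    return lang(v.left) | lang(v.right)
```

Note that the result of case (c) is again in Case 3, with the ε-leaf as the left child of the root, which is what keeps the resulting node ε-safe.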

C Proofs from Section 5

C.1 Proof of Lemma 6

We will prove the lemma by induction on $k$. The case $k = 0$ is trivial since $\mathrm{currlevel}(0) = [0, 0\rangle$, $S^0_{p,q}$ is empty and $\mathrm{lowerlevel}(0)$ is not defined. We assume that statements 1 and 2 of the lemma are true for $k - 1$ and below.

If $a_k \in \Sigma^{<}$, the algorithm proceeds into OpenStep to build $S^k$ and $T^k$. Statement 1 can be proved trivially since $\mathrm{currlevel}(k) = [k, k\rangle$, similarly as for the base case. For statement 2 let $\mathrm{lowerlevel}(k) = [i, k - 1\rangle$, and consider a run $\rho \in \mathrm{Runs}(\mathcal{T}, w)$ such that $\rho[i, k\rangle$ starts on $p$ and ends on $q$ for some $p$, $q$ and $x$, and let $p'$ be its second-to-last state. Since $a_k$ is an open symbol, the string $a_{i+1} \cdots a_{k-1}$ is well-nested, so it holds that $\mathrm{currlevel}(k - 1) = [i, k - 1\rangle$. Therefore, from our hypothesis it holds that $\mathcal{L}_D(S^{k-1}_{p,p'})$ contains $\mathrm{out}(\rho[i, k - 1\rangle)$, and so $\mathrm{out}(\rho[i, k\rangle)$ is included in $\mathcal{L}_D(T^k_{p,x,q})$ at some iteration of line 37. To show that every element in $\mathcal{L}_D(T^k_{p,x,q})$ corresponds to some run $\rho \in \mathrm{Runs}(\mathcal{T}, w)$, we note that the only step that modifies $T^k_{p,x,q}$ is line 37, which is reached only when a valid subrun from $i$ to $k$ can be constructed.

If $a_k \in \Sigma^{>}$, the algorithm proceeds into CloseStep to build $S^k$ and $T^k$. Let $\mathrm{currlevel}(k) = [j, k\rangle$. In this case, statement 2 can be deduced directly from the hypothesis since $j < k$ and the table on the top of $T^k$ is the same as $T^j$. To prove statement 1 notice that since $a_k$ is a close symbol it holds that $\mathrm{currlevel}(k - 1) = [j', k - 1\rangle$ and $\mathrm{lowerlevel}(k - 1) = [j, j' - 1\rangle$ for some $j'$. Consider a run $\rho \in \mathrm{Runs}(\mathcal{T}, w)$ such that $\rho[j, k\rangle$ starts on $p$, ends on $q$, and the last symbol pushed onto the stack is $x$. This run can be subdivided in three subruns: from $p$ to $p'$, from $p'$ to $q'$, and a transition from $q'$ to $q$, as illustrated in Figure 2 (Right). The first two subruns correspond to $\rho[j, j + 1\rangle$ and $\rho[j', k - 1\rangle$, for which $\mathrm{out}(\rho[j, j + 1\rangle) \in \mathcal{L}_D(T^{k-1}_{p,x,p'})$ and $\mathrm{out}(\rho[j', k - 1\rangle) \in \mathcal{L}_D(S^{k-1}_{p',q'})$. Therefore, $\mathrm{out}(\rho[j, k\rangle) \in \mathcal{L}_D(S^k_{p,q})$ at some iteration of line 23. To show that every element in $\mathcal{L}_D(S^k_{p,q})$ corresponds to some run $\rho \in \mathrm{Runs}(\mathcal{T}, w)$, note that the only line at which $S^k_{p,q}$ is modified is line 23, which is reached only when a valid run from $j$ to $k$ has been constructed.

C.2 Proof of Theorem 7

This theorem is a straightforward consequence of Lemma 6.

C.3 Proof of Lemma 8

Proof. For the sake of simplification, assume that $\mathcal{T}$ is trimmed to be I/O-unambiguous on subruns as well. Formally, we extend the condition so that for every nested word $w$, span $[i, j\rangle$ and $\mu \in \Omega^*$ there exists only one run $\rho \in \mathrm{Runs}(\mathcal{T}, w)$ such that $\mu = \mathrm{out}(\rho[i, j\rangle)$. Towards a contradiction, we assume that $D$ is not unambiguous. Therefore, at least one of these conditions must hold: (1) there is some union node $v$ in $D$ for which $\mathcal{L}_D(\ell(v))$ and $\mathcal{L}_D(r(v))$ are not disjoint, or (2) there is some product node $v$ for which there are at least two ways to decompose some $\mu \in \mathcal{L}_D(v)$ in non-empty strings $\mu_1$ and $\mu_2$ such that $\mu = \mu_1 \cdot \mu_2$, $\mu_1 \in \mathcal{L}_D(\ell(v))$ and $\mu_2 \in \mathcal{L}_D(r(v))$.

Assume the first condition is true, let $v$ be a union node that satisfies it, and let $k$ be the step in which it was added to $D$. If this node was added on OpenStep, then the node $v$ represents a subset of the subruns defined in condition 1 of Lemma 6. Consider two different iterations of lines 36-37 on step $i$ where two nodes $v$ and $v'$ were united for which there is an element $\mu \in \mathcal{L}_D(v) \cap \mathcal{L}_D(v')$. Since these nodes were assigned to $T_{p,x,q}$ on different iterations, the states $p'$ that were being considered must have been different. Therefore, if $\mathrm{lowerlevel}(i) = [i, j\rangle$, then $\mu = \mathrm{out}(\rho[i, k\rangle) = \mathrm{out}(\rho'[i, k\rangle)$ for two runs $\rho$ and $\rho'$ where the $(k - 1)$-th state is different. This violates the condition that $\mathcal{T}$ is unambiguous. If this node was added on CloseStep, we can follow an analogous argument. Note that union nodes created on a prod operation are unambiguous by construction (see Theorem 5).

Assume now that the second condition is true, let $v$ be a node for which the condition holds, and let $k$ be the step where it was created. We note that this node could not have been created in OpenStep since the only step that creates product nodes is step 35, where $v_\lambda$ has the label $(o, k)$, and $S_{p,p'}$ is connected to nodes that were created in a previous step, so all of the elements $\mu \in \mathcal{L}(S_{p,p'})$ only contain pairs $(o, j)$ where $j < k$. We can follow a similar argument to prove that this node could not have been created in step 21 of CloseStep.

We now have that $v$ was created in step 20 of CloseStep, and therefore $\ell(v) = T^{k-1}_{p,x,p'}$ and $r(v) = S^{k-1}_{p',q'}$ unless either of these indices were empty. However, that is not possible since we assumed that the step where $v$ was created was $k$, and if either were empty, no node would have been created. Now let $\mu \in \mathcal{L}(v)$ be such that there exist strings $\mu_1, \mu_1' \in \mathcal{L}(T^{k-1}_{p,x,p'})$ and $\mu_2, \mu_2' \in \mathcal{L}(S^{k-1}_{p',q'})$ such that $\mu = \mu_1 \mu_2 = \mu_1' \mu_2'$ and $\mu_1 \ne \mu_1'$. Without loss of generality, let $\mu''$ be the non-empty suffix of $\mu_1'$ such that $\mu_1 \mu'' = \mu_1'$. Here we reach a contradiction since $\mu''$ is a prefix of $\mu_2$ and thus it must contain a pair $(o, j)$ such that $j \in \mathrm{lowerlevel}(k)$ and $j \in \mathrm{currlevel}(k)$, which is not possible.

The fact that all nodes in $D$ are ε-safe carries over easily from Theorem 5.

D Proofs from Section 6

D.1 Proof of Theorem 10

To link the model of visibly pushdown extraction grammars and visibly pushdown automata we define another class of automata based on the ideas in [32]. We say that $A$ is an extraction visibly pushdown automaton (EVPA) if $A = (X, Q, \Sigma, \Gamma, \Delta, I, F)$ where $X$ is a set of variables, $Q$ is a set of states, $\Sigma = (\Sigma^{<}, \Sigma^{>}, \Sigma^{|})$ is a visibly pushdown alphabet, $\Gamma$ is a stack alphabet, $\Delta \subseteq (Q \times \Sigma^{<} \times Q \times \Gamma) \cup (Q \times \Sigma^{>} \times \Gamma \times Q) \cup (Q \times (\Sigma^{|} \cup C_X) \times Q)$, $I$ is a set of initial states, and $F$ is a set of final states. Note that this is a simple extension of VPA where neutral transitions are allowed to read neutral symbols or captures in $C_X$. We define the runs as in VPA except that the input of an EVPA is a ref-word $w \in (\Sigma \cup C_X)^*$, and we say that $w \in \mathcal{L}(A)$ if and only if there is an accepting run of $A$ on $w$. Furthermore, $A$ is unambiguous if for every ref-word $w$ there exists at most one accepting run of $A$ over $w$. It is straightforward to see that this is a direct counterpart to visibly pushdown extraction grammars. Therefore, we can use the ideas in [4] to obtain a one-to-one conversion from one to another.

Claim 26. For a given VPEG $G$ there exists an EVPA $A_G$ such that $\mathcal{L}(G) = \mathcal{L}(A_G)$. Moreover, $A_G$ is unambiguous iff $G$ is unambiguous, and $A_G$ can be constructed in time $|G|$.

Proof. Let $G = (X, V, \Sigma, S, P)$ be a VPEG. We construct an EVPA $A_G = (X, Q, \Sigma, \Gamma, \Delta, I, F)$ such that $\mathcal{L}(G) = \mathcal{L}(A_G)$ using an almost identical construction to the one in Theorem 6 of [4]. The only differences arise in that our structure is defined for well-nested words, so it can be slightly simplified, and in the case where a production is of the form $X \to aY$, for which we add the possibility that $a \in C_X$. This construction provides one transition in $\Delta$ per production in $P$, and in some cases it needs to check if a variable is nullable. Checking if a single variable is nullable is costly, but by a constant number of traversals in $P$ it is possible to check which variables in $X$ are nullable or not, which can be done before building $\Delta$. Therefore, this construction can be done in time $|P|$. Finally, $A_G$ is unambiguous if and only if $G$ is unambiguous, which is another consequence of Theorem 6 of [4].

Here we define the spanner $[\![A]\!]$ for a given EVPA $A$ identically to the definition for an extraction grammar. Note that from the proof it also follows that if $G$ is functional, then $A_G$ is functional as well. For the next part of the proof assume that $A_G$ is unambiguous. We will show that for an EVPA $A_G$ and document $d$, the set $[\![A_G]\!](d)$ can be enumerated with output-linear delay and update-time $|A_G|^3$. Towards this goal, we start with an unambiguous $A_G = (X, Q, \Sigma, \Gamma, \Delta, I, F)$, convert it into a VPT $\mathcal{T}_G$ with output symbol set $2^{C_X}$, and use our algorithm to enumerate the set $[\![\mathcal{T}_G]\!](d')$ where $d' = d\#$, using a dummy symbol #. Each element $w \in [\![\mathcal{T}_G]\!](d')$ can then be converted into a mapping $\mu \in [\![G]\!](d)$ after it is given as output in time $|\mu|$.

Let $\mathcal{T}_G = (Q', \Sigma', \Gamma, \Omega, \Delta', I, F')$ where $Q' = Q \cup \{q_f\}$, $\Sigma' = (\Sigma^{<}, \Sigma^{>}, \Sigma^{|}_{\#})$ such that $\Sigma^{|}_{\#} = \Sigma^{|} \cup \{\#\}$, $\Omega = 2^{C_X} \cup \{\varepsilon\}$ and $F' = \{q_f\}$. To define $\Delta'$ we introduce a merge operation on a path over $A_G$.
This is′ defined for any non-empty′ sequence of trans- itions t = (p1, v1, q1)(p2, v2, p2)⋯(pm, vm, qm) ∈ ∆ such that vi ∈ CX for i ∈ [1, m], and qi = pi 1 ∈ [i, m − 1]. If these conditions hold, we say∗ that t is a v-path ending in pm. Let

+ M. Muñoz and C. Riveros 29

$t$ be such a v-path and let $S = \{v_1, \ldots, v_m\}$. For a transition $(p, a{>}, q, x) \in \Delta$ such that $p = q_m$, we define $\operatorname{merge}(t, (p, a{>}, q, x)) := (p_1, a{>}, S, q, x)$. For $a \in \Sigma^{|}$ and a transition $(p, a, q)$ such that $p = q_m$, we define $\operatorname{merge}(t, (p, a, q)) := (p_1, a, S, q)$. We now define $\Delta'$ as follows:

\begin{align*}
\Delta' = {} & \{(p, a{>}, \varepsilon, q, x) \mid (p, a{>}, q, x) \in \Delta\} \;\cup \\
& \{\operatorname{merge}(t, (p, a{>}, q, x)) \mid \text{there is a v-path } t \in \Delta^* \text{ ending in } p \text{ and } (p, a{>}, q, x) \in \Delta\} \;\cup \\
& \{(p, a, \varepsilon, q) \mid (p, a, q) \in \Delta\} \;\cup \\
& \{\operatorname{merge}(t, (p, a, q)) \mid \text{there is a v-path } t \in \Delta^* \text{ ending in } p \text{ and } (p, a, q) \in \Delta\} \;\cup \\
& \{\operatorname{merge}(t, (p, \#, q_f)) \mid \text{there is a v-path } t \in \Delta^* \text{ ending in } p \text{ and } p \in F\}.
\end{align*}

Since $\mathcal{A}_G$ is unambiguous, the transitions in $\Delta$ define a DAG over $Q$, from which we deduce that $\Delta'$ is well-defined. By the definition of merge it is straightforward to check that every accepting path in $\mathcal{A}_G$ is preserved in $\mathcal{T}_G$, in the sense that if $r \in L(\mathcal{A}_G)$ then there exists an accepting path of $\mathcal{T}_G$ over $(\operatorname{plain}(r)\#, w)$, where $w$ is a sequence of elements in $2^{C_X} \cup \{\varepsilon\}$ built from the captures present in $r$.

To show that accepting pairs of $\mathcal{T}_G$ correspond to a valid counterpart in $\mathcal{A}_G$, let $(d, w)$ be an input/output pair that is accepted by $\mathcal{T}_G$. Note that $d = d'\#$ by our definition of $\Delta'$. It can be seen that for every accepting path of $\mathcal{T}_G$ over $(d, w)$ there exists at least one ref-word $r$ built from $d$ and $w$. However, any two such ref-words can differ only in the order of the elements inside each group of contiguous captures, and these are associated to the same position in $\mu^{r}$. From this, it follows that for each accepting pair $(d, w)$ there exists only one mapping $\mu \in \llbracket \mathcal{A}_G \rrbracket(d')$ that can be built from $(d, w)$.

The size of $\Delta'$ is bounded by the number of valid v-paths that can exist in $\mathcal{A}_G$. Recall that $\mathcal{A}_G$ is functional, and thus every v-path in $\mathcal{A}_G$ contains at most one instance of each element in $C_X$. From this it follows that the size of $\mathcal{T}_G$ is bounded by $|\Delta| \, |2^{C_X}|$. Furthermore, since the transitions in $\Delta$ form a DAG over $Q$, each of these v-paths can be found by a single traversal over $\mathcal{A}_G$, so building $\mathcal{T}_G$ takes time $|\Delta|$.

By using the algorithm detailed in Section 5 we can enumerate the set $\llbracket \mathcal{T}_G \rrbracket(d)$ with one-pass preprocessing, update time $O(|\mathcal{T}_G|^3)$, and output-linear delay.
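The construction of $\Delta'$ from $\Delta$ can be illustrated with a minimal Python sketch. The tuple encodings, the function names `v_paths_into` and `build_delta_prime`, and the use of `None` for $\varepsilon$ are assumptions made for illustration only and do not come from the paper. The sketch assumes, as in the proof, that the capture transitions form a DAG, so the backward traversal terminates.

```python
from collections import defaultdict

EPS = None   # assumed encoding of the empty output (epsilon)
HASH = "#"   # dummy end-of-document symbol

def v_paths_into(cap_in, state):
    """Yield (p1, S) for every non-empty v-path that ends in `state`,
    where p1 is its first state and S the set of captures along it.
    Assumes the capture transitions form a DAG."""
    stack = [(state, frozenset())]
    while stack:
        cur, seen = stack.pop()
        for (p, v) in cap_in[cur]:
            ext = seen | {v}
            yield p, ext
            stack.append((p, ext))

def build_delta_prime(captures, stack_transitions, neutrals, finals, q_f="q_f"):
    """captures: (p, v, q); stack_transitions: (p, a, q, x);
    neutrals: (p, a, q); finals: final states of the automaton."""
    cap_in = defaultdict(list)          # incoming capture edges per state
    for (p, v, q) in captures:
        cap_in[q].append((p, v))
    delta = set()
    for (p, a, q, x) in stack_transitions:
        delta.add((p, a, EPS, q, x))                 # plain copy, output epsilon
        for p1, S in v_paths_into(cap_in, p):
            delta.add((p1, a, S, q, x))              # merge(t, (p, a, q, x))
    for (p, a, q) in neutrals:
        delta.add((p, a, EPS, q))
        for p1, S in v_paths_into(cap_in, p):
            delta.add((p1, a, S, q))                 # merge(t, (p, a, q))
    for p in finals:
        for p1, S in v_paths_into(cap_in, p):
            delta.add((p1, HASH, S, q_f))            # merge(t, (p, #, q_f))
    return delta
```

Storing the result in a set means that two v-paths with the same first state and the same set of captures yield a single merged transition, which matches the observation that ref-words differing only in the order of contiguous captures describe the same mapping.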
A more fine-grained analysis of the algorithm, however, shows that the update time is bounded by $|Q'|^2 |\Delta'| \in O(|Q|^2 \, |\Delta| \, |2^{C_X}|)$. We modify the enumeration algorithm slightly so that for each output $w \in \llbracket \mathcal{T}_G \rrbracket(d)$ we build the expected output in $\llbracket G \rrbracket(d)$. We do this by checking $w$ symbol by symbol and building a mapping $\mu \in \llbracket G \rrbracket(d)$, which can be done in time $|\mu|$. As the set $X$ is fixed, it follows that this enumeration can be done with update-time $O(|G|^3)$ and output-linear delay.

Finally, we address the case where $G$ is an arbitrary VPEG. We deal with this case by determinizing the EVPA constructed in Claim 26, which can be done in time $2^{|\mathcal{A}_G|}$. From here, we can follow the reasoning given for the unambiguous case to prove the statement.
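The symbol-by-symbol conversion of an output $w$ into a mapping $\mu$ might look like the following Python sketch. The encoding is hypothetical: each capture is a pair `(variable, "open")` or `(variable, "close")`, `None` stands for $\varepsilon$, and a capture read at the $i$-th output symbol is assigned position $i$ (1-based). The paper's exact span convention may differ, and the function name `decode_mapping` is not from the source.

```python
def decode_mapping(w):
    """Turn a sequence of capture sets (or None for epsilon) into a
    dict mapping each variable to an assumed (start, end) pair."""
    starts, ends = {}, {}
    for i, out in enumerate(w, start=1):
        if out is None:          # epsilon output: nothing to record
            continue
        for (var, kind) in out:
            if kind == "open":
                starts[var] = i
            else:
                ends[var] = i
    # Assumes every opened variable is also closed (a valid output).
    return {var: (starts[var], ends[var]) for var in starts}

w = [frozenset({("x", "open")}), None,
     frozenset({("x", "close"), ("y", "open")}),
     frozenset({("y", "close")})]
mapping = decode_mapping(w)
```

Since the order of captures inside one set is irrelevant, the sketch never needs to sort or compare captures that share a position.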