Enumerating with Constant Delay the Answers to a Query
Luc Segoufin

INRIA and ENS Cachan
http://pages.saclay.inria.fr/luc.segoufin/

To cite this version: Luc Segoufin. Enumerating with constant delay the answers to a query. ICDT 2013 - 16th International Conference on Database Theory, Mar 2013, Genoa, Italy. pp.10-20, 10.1145/2448496.2448498. hal-00907085

HAL Id: hal-00907085
https://hal.inria.fr/hal-00907085
Submitted on 20 Nov 2013

(c) 2012 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the national government of France. As such, the government of France retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. EDBT/ICDT '13, March 18 - 22 2013, Genoa, Italy. Copyright 2012 ACM 978-1-4503-1598-2/13/03. $15.00

ABSTRACT
We survey recent results about enumerating with constant delay the answers to a query over a database. More precisely, we focus on the case when enumeration can be achieved with a preprocessing running in time linear in the size of the database, followed by an enumeration process outputting the answers one by one with constant time between any two consecutive outputs. We survey classes of databases and classes of queries for which this is possible. We also mention related problems such as computing the number of answers or sampling the set of answers.

Categories and Subject Descriptors
F.4.1 [Mathematical Logic and Formal Languages]: Mathematical Logic; F.1.3 [Computation by Abstract Devices]: Complexity Measures and Classes

General Terms
Logic, algorithmic, databases

1. INTRODUCTION
The evaluation of queries is a central problem in database management systems. Given a query q and a database D, the evaluation of q over D consists in computing the set q(D) of all answers to q on D. A simple observation shows that the set q(D) may be huge, even larger than the database itself, as it can have a number of elements of the form n^l, where n is the size of the database and l the arity of the query. Computing it entirely can therefore require too many of the available resources.

There are many solutions to overcome this problem. For instance one could imagine that a small subset of q(D) can be quickly computed and that this subset will be enough for the user's needs. Typically one could imagine computing the top-ℓ most relevant answers relative to some ranking function, or providing a sample of q(D) relative to some distribution. One could also imagine computing only the number of solutions |q(D)| or providing an efficient test for whether a given tuple belongs to q(D) or not.

In this paper we will focus on another scenario consisting in enumerating q(D) with constant delay. In a nutshell, this means that there is a two-phase algorithm working as follows: a preprocessing phase that works in time linear in the size of the database, followed by an enumeration phase outputting one by one all the elements of q(D) with a constant delay between any two consecutive outputs.

We stress that our study is from the theoretical point of view. Although most of the algorithms we will mention here are linear in the size of the database, the constant factors are often very big, making any practical implementation difficult. However we believe the index structures designed for making the algorithms work are interesting and, with extra assumptions, could possibly be turned into something more practical.

The existence of a constant delay enumeration algorithm imposes drastic constraints. In particular, the first answer is output after a time linear in the size of the database and, once the enumeration starts, a new answer is output regularly at a speed independent of the size of the database. Altogether, the set q(D) is entirely computed in time f(q)(n+|q(D)|) for some function f depending only on q and not on D.

Without any assumptions on the database or on the query, most queries cannot be enumerated with constant delay. For instance we will see that the simple acyclic conjunctive query computing the pairs of nodes at distance 2 in a graph cannot be enumerated with constant delay without drastic consequences for the asymptotic complexity of the Boolean Matrix Multiplication problem.

The first and main part of the paper is devoted to the study of constant delay algorithms. It is organized with one section per query language. In a second part of the paper we will discuss variations and related problems.

We will start with conjunctive queries, actually acyclic conjunctive queries, in Section 3. We will see that we can characterize precisely those acyclic conjunctive queries that can be enumerated with constant delay.

We will then move on to first-order queries in Section 4. In this case we need to restrict the class of databases. We will see that constant delay algorithms can be obtained over databases with bounded degree, bounded treewidth or, more generally, "bounded expansion".

In Section 6 we will see that, in the bounded treewidth case, one can even enumerate monadic second-order queries with constant delay.

The last case permitting constant delay enumeration that we will mention in Section 5 is XPath over XML documents.

There are many variations of constant delay enumeration that have been considered. For instance one could require that the answers are enumerated relative to a specific order. It turns out that the existence of a constant delay enumeration algorithm is very sensitive to the order of outputs. We will also discuss linear or even polynomial delays in Section 7.

We will also discuss the problem of counting the number of solutions and the problem of testing whether a given tuple belongs to the answer set or not. It is not clear a priori how these two problems are related to constant delay enumeration. However, it turns out that in the scenarios where constant delay enumeration can be achieved, one can often also count the number of solutions in time linear in the size of the database and, after linear time preprocessing on the database, one can test in constant time whether a given tuple is part of the answer set. We will survey those results in Section 7.

Finally Section 7 concludes with the very few known results about sampling the answer set.

This survey is by no means exhaustive. It is only intended to survey the major theoretical results concerning database querying and enumeration. Hopefully it will convince the reader that this is an important subject for research that still contains many interesting and challenging open problems.

Enumeration in general, and constant delay enumeration in particular, is a well-identified subfield of algorithmics, and many nontrivial enumeration algorithms exist for problems over graphs (like enumerating all spanning trees, all con-

[...] takes as input a database of a given signature σ and returns a relation of a fixed arity, the arity of the query. A query is a sentence if its arity is 0. The query is then either true or false on D and defines a property of D. A query is unary if its arity is 1. If q is a query and ā is in the image of q on D, then we write D |= q(ā). Finally we set q(D) = {ā | D |= q(ā)}. Note that the size of q(D) may be exponential in the arity of q.

A query language is a class of queries. Typically it is defined as a logical formalism such as CQ (for conjunctive queries), FO (for first-order queries), MSO (for monadic second-order queries), XPath and so on.

Given a query language L, the model checking problem of L is the computational problem of testing, given a sentence q ∈ L and a database D, whether D |= q or not. The database D is often restricted to a class C of structures. In this case we speak of the model checking problem of L over C.

Given a query language L, the query answering problem of L is the computational problem of computing q(D), given a query q ∈ L and a database D. As for model checking, it is often restricted to a class C of structures. In this case we speak of the query answering problem of L over C.

The definitions of the query answering and model checking problems given above correspond to their combined complexity. When the query is no longer part of the input, we speak of their data complexity.

2.2 Model of computation
As we will deal with linear time, it is important to make our model of computation precise. We use Random Access Machines (RAM) with addition and uniform cost measure as a model of computation, cf. [2]. Here are the features of this model that are the most relevant for this paper. On an input of size n, a RAM has a certain number of registers, each of them containing log n bits, and the machine can modify the content of these registers as a Turing Machine would do, but can also use these registers for accessing directly the memory cell pointed to by the register or for performing numerical operations (typically additions) between them.
