Pushing Data-Induced Predicates Through Joins in Big-Data Clusters

Total Pages: 16

File Type: pdf, Size: 1020 KB

Pushing Data-Induced Predicates Through Joins in Big-Data Clusters

Srikanth Kandula, Laurel Orr, Surajit Chaudhuri
Microsoft

ABSTRACT

Using data statistics, we convert predicates on a table into data-induced predicates (diPs) that apply on the joining tables. Doing so substantially speeds up multi-relation queries because the benefits of predicate pushdown can now apply beyond just the tables that have predicates. We use diPs to skip data exclusively during query optimization; i.e., diPs lead to better plans and have no overhead during query execution. We study how to apply diPs for complex query expressions and how the usefulness of diPs varies with the data statistics used to construct diPs and the data distributions. Our results show that building diPs using zone-maps, which are already maintained in today's clusters, leads to sizable data-skipping gains. Using a new (slightly larger) statistic, 50% of the queries in the TPC-H, TPC-DS and JoinOrder benchmarks can skip at least 33% of the query input. Consequently, the median query in a production big-data cluster finishes roughly 2× faster.

PVLDB Reference Format:
Srikanth Kandula, Laurel Orr, Surajit Chaudhuri. Pushing Data-Induced Predicates Through Joins in Big-Data Clusters. PVLDB, 12(3): xxxx-yyyy, 2019.
DOI: https://doi.org/10.14778/3368289.3368292

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-nc-nd/4.0/). Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 12, No. 3. ISSN 2150-8097.

[Figure 1: Example illustrating the creation and use of a data-induced predicate, which only uses the join columns and is a necessary condition of the true predicate, i.e., σ ⇒ d_partkey.]

1. INTRODUCTION

In this paper, we seek to extend the benefits of predicate pushdown beyond just the tables that have predicates. Consider the following fragment of TPC-H query #17 [20]:

SELECT SUM(l_extendedprice) FROM lineitem
JOIN part ON l_partkey = p_partkey
WHERE p_brand = ':1' AND p_container = ':2'

The lineitem table is much larger than the part table, but because the query predicate uses columns that are only available in part, predicate pushdown cannot speed up the scan of lineitem. However, it is easy to see that scanning the entire lineitem table is wasteful if only a small number of its rows will join with the rows from part that satisfy the predicate on part.

If only the predicate were on the column used in the join condition, partkey, then a variety of techniques would become applicable (e.g., algebraic equivalence [3], magic set rewriting [0, 72] or value-based pruning [81]), but predicates over join columns are rare¹, and these techniques do not apply when the predicates use columns that do not exist in the joining tables.

¹ Over all the queries in TPC-H [26] and TPC-DS [2], there are zero predicates on join columns, perhaps because join columns tend to be opaque, system-generated identifiers.

Some systems implement a form of sideways information passing over joins [21, 68] during query execution. For example, they may build a bloom filter over the values of the join column partkey in the rows that satisfy the predicate on the part table and use this bloom filter to skip rows from the lineitem table. Unfortunately, this technique only applies during query execution, does not easily extend to general joins, and has high overheads, especially during parallel execution on large datasets, because constructing the bloom filter becomes a scheduling barrier that delays the scan of lineitem until the bloom filter has been built.
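For concreteness, here is a minimal Python sketch of this execution-time technique. It is our illustration, not any system's actual API: the row dictionaries, column names and helper functions are hypothetical, modeled on the TPC-H fragment above, and the tiny bloom filter stands in for a production-grade one.

import hashlib

class BloomFilter:
    """A tiny bloom filter: m bits, k hash functions derived from SHA-1."""
    def __init__(self, m=1 << 16, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, value):
        for seed in range(self.k):
            digest = hashlib.sha1(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def may_contain(self, value):
        # No false negatives; false positives are possible.
        return all((self.bits >> pos) & 1 for pos in self._positions(value))

# Build phase: scan part, keep rows satisfying the predicate, insert
# their join keys. In a parallel plan this build is the scheduling
# barrier the text describes.
def build_filter(part_rows, brand, container):
    bf = BloomFilter()
    for row in part_rows:
        if row["p_brand"] == brand and row["p_container"] == container:
            bf.add(row["p_partkey"])
    return bf

# Probe phase: runs during query execution, over every row of lineitem.
def probe_lineitem(lineitem_rows, bf):
    return [r for r in lineitem_rows if bf.may_contain(r["l_partkey"])]

The probe cannot start until the build completes, and it still touches every lineitem row; diPs instead do their filtering on statistics at plan-generation time.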
We seek a method that can convert predicates on a table into data-skipping opportunities on joining tables even if the predicate columns are absent in the other tables. Moreover, we seek a method that applies exclusively during query plan generation, in order to limit overheads during query execution. Finally, we are interested in a method that is easy to maintain, applies to a broad class of queries and makes minimalist assumptions.

Our target scenario is big-data systems, e.g., SCOPE [], Spark [37, 8], Hive [80], F1 [76] or Pig [67] clusters that run SQL-like queries over large datasets; recent reports estimate over a million servers in such clusters [1].

Big-data systems already maintain data statistics, such as the maximum and minimum value of each column, at different granularities of the input; details are in Table 1. In the rest of this paper, for simplicity, we will call this the zone-map statistic, and we use the word partition to denote the granularity at which statistics are maintained.

Table 1: Data statistics maintained by several systems.

Scheme                            | Statistic                                           | Granularity
ZoneMaps [1], Spark [18]          | max and min value per column                        | zone, file
Exadata [6]                       | max, min, and null-present or null count per column | table region
Vertica [], ORC [2], Parquet [16] | max, min, and null-present or null count per column | stripe, row-group
Brighthouse [77]                  | histograms, character maps per column               | data pack

Using data statistics, we offer a method that converts predicates on a table into data-skipping opportunities on the joining tables at query optimization time. The method, an example of which is shown in Figure 1, begins by using data statistics to eliminate partitions on tables that have predicates. This step is already implemented in some systems [7, 18, 81]. Next, using the data statistics of the partitions that satisfy the local predicates, we construct a new predicate which captures all of the join-column values contained in those partitions. This new data-induced predicate (diP) is a necessary condition of the actual predicate (i.e., σ ⇒ d) because there may be false positives; i.e., in the partitions that are included in the diP, not all of the rows may satisfy σ.
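To make the workflow concrete, the following Python sketch walks through these steps on hypothetical zone-map metadata. The partition boundaries, column names and helper functions are made up for illustration; the paper's actual statistics and optimizer plumbing differ. Note that the sketch touches only per-partition statistics, never rows.

# Hypothetical per-partition zone-maps: a (min, max) pair per column.
part_zone_maps = [
    {"p_brand": ("Brand#11", "Brand#23"), "p_partkey": (1, 1000)},
    {"p_brand": ("Brand#31", "Brand#45"), "p_partkey": (2500, 3200)},
    {"p_brand": ("Brand#51", "Brand#55"), "p_partkey": (9000, 9999)},
]
lineitem_zone_maps = [
    {"l_partkey": (1, 700)},
    {"l_partkey": (2800, 3100)},
    {"l_partkey": (4000, 8000)},
]

def overlaps(zone, lo, hi):
    zlo, zhi = zone
    return zlo <= hi and lo <= zhi

# Step 1: eliminate partitions of the predicate table whose zone-maps
# cannot satisfy the local predicate (an equality is the range [v, v]).
def prune(zone_maps, col, lo, hi):
    return [z for z in zone_maps if overlaps(z[col], lo, hi)]

# Step 2: the diP is the set of join-column ranges of the surviving
# partitions; it is a necessary condition of the predicate, so it may
# admit false positives but never loses a joining row.
def build_dip(survivors, join_col):
    return [z[join_col] for z in survivors]

# Step 3: apply the diP to the joining table's zone-maps; partitions
# whose join-column range misses every diP range are skipped.
def apply_dip(zone_maps, join_col, dip):
    return [z for z in zone_maps
            if any(overlaps(z[join_col], lo, hi) for lo, hi in dip)]

survivors = prune(part_zone_maps, "p_brand", "Brand#32", "Brand#32")
dip = build_dip(survivors, "p_partkey")                 # [(2500, 3200)]
kept = apply_dip(lineitem_zone_maps, "l_partkey", dip)
# kept == [{"l_partkey": (2800, 3100)}]: two of three partitions skipped

In the emitted plan, lineitem is then replaced by the surviving partition subset; it is this input-size reduction that enables follow-on rewrites such as the broadcast joins discussed below.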
However, the diP can apply over the joining table because it only uses the join column; in the case of Figure 1, the diP constructed on the part table can be applied on the partition statistics of lineitem to eliminate partitions. All of these steps happen during query optimization; our QO effectively replaces each table with a partition subset of that table, and the reduction in input size often triggers other plan changes (e.g., using broadcast joins, which eliminate a partition-shuffle [7]), leading to more efficient query plans.

[Figure 3: A workflow which shows changes in red; using partition statistics, our query optimizer computes data-induced predicates and outputs plans that read less input.]

If the above method is implemented using zone-maps, which are maintained by many systems already, the only overhead is an increase in query optimization time, which we show is small in §. For queries with joins, we show that data-induced predicates offer comparable query performance at much lower cost relative to materializing denormalized join views [6] or using join indexes [23]. The fundamental reason is that these techniques use auxiliary data structures which are costly to maintain; yet, their benefits are limited to narrow classes of queries (e.g., queries that match views, have specific join conditions, or very selective predicates) [3]. Data-induced predicates, we will show, are useful more broadly.

We also note that the construction and use of data-induced predicates is decoupled from how the datasets are laid out. Prior work identifies useful data layouts, for example, co-partitioning tables on their join columns [1, 62] or clustering rows that satisfy the same predicates.

We further describe a new statistic, the range-set, that improves query performance over zone-maps at the cost of a small increase in space. We also discuss how to maintain statistics when datasets evolve. In more detail, the rest of this paper has these contributions.

[Figure 2: Illustrating the need to move diPs past other operations (left). On a 3-way join when all tables have predicates (middle), the optimal schedule only requires three (parallel) steps (right).]

● Using diPs for complex queries leads to novel challenges.
– Consider TPC-H query #17 [20] in Figure 2 (left), which has a nested sub-query on the lineitem table. Creating diPs for only the fragment considered in Figure 1 still reads the entire lineitem table for the nested group-by. To alleviate this, we use new QO transformation rules to move diPs; in this case, shown with dotted arrows, the diP is pulled above a join, moved sideways to a different join input and then pushed below a group-by, thereby ensuring that a shared scan of lineitem suffices.
– When multiple joining tables have predicates, a second challenge arises. Consider the 3-way join in Figure 2 (middle), where all tables have local predicates. The figure shows four diPs: one per table and per join condition. If applying these diPs eliminates partitions on some joining table, then the diPs that were previously constructed on that table are no longer up-to-date. Re-creating diPs whenever partition subsets change will increase data skipping, but doing so naïvely can construct excessively many diPs, which increases query optimization time. We present an optimal schedule for tree-like join graphs which converges to a fixed point, and hence achieves all possible data skipping while computing the fewest number of diPs. Star and snowflake schemas result in tree-like join graphs. We discuss how to derive diPs for general join graphs within a cost-based query optimizer in §3.
● We show how different data statistics can be used to compute diPs in § and discuss why range-sets represent a good trade-off between space usage and potential for data skipping; a simplified sketch follows below.
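To illustrate that last trade-off, here is a hedged sketch of a range-set-like statistic. The gap-based construction is our simplification for illustration, not necessarily the paper's exact algorithm: it keeps at most k value ranges per partition instead of one, splitting at the widest gaps, so with k = 1 it degenerates to a zone-map.

def build_range_set(values, k=4):
    """Summarize values with at most k (lo, hi) ranges by splitting at
    the k-1 widest gaps between consecutive distinct values."""
    vals = sorted(set(values))
    if not vals:
        return []
    gaps = sorted(range(1, len(vals)),
                  key=lambda i: vals[i] - vals[i - 1], reverse=True)
    cuts = sorted(gaps[:max(k - 1, 0)])
    ranges, start = [], 0
    for cut in cuts + [len(vals)]:
        ranges.append((vals[start], vals[cut - 1]))
        start = cut
    return ranges

# A zone-map summarizes {10..19} and {9000..9009} as the single range
# (10, 9009), which overlaps almost any probe; two ranges expose the gap.
vals = list(range(10, 20)) + list(range(9000, 9010))
print(build_range_set(vals, k=1))  # [(10, 9009)]
print(build_range_set(vals, k=2))  # [(10, 19), (9000, 9009)]

The extra ranges cost a few more stored values per partition, which is the small space increase weighed against the additional data skipping.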