Adaptive Schema Databases ∗
William Spoth^b, Bahareh Sadat Arab^i, Eric S. Chan^o, Dieter Gawlick^o, Adel Ghoneimy^o, Boris Glavic^i, Beda Hammerschmidt^o, Oliver Kennedy^b, Seokki Lee^i, Zhen Hua Liu^o, Xing Niu^i, Ying Yang^b

b: University at Buffalo   i: Illinois Inst. Tech.   o: Oracle
{wmspoth|okennedy|yyang25}@buffalo.edu   {barab|slee195|xniu7}@hawk.iit.edu   [email protected]
{eric.s.chan|dieter.gawlick|adel.ghoneimy|beda.hammerschmidt|zhen.liu}@oracle.com

∗ Authors listed in alphabetical order.

ABSTRACT

The rigid schemas of classical relational databases help users in specifying queries and inform the storage organization of data. However, the advantages of schemas come at a high upfront cost through schema and ETL process design. In this work, we propose a new paradigm where the database system takes a more active role in schema development and data integration. We refer to this approach as adaptive schema databases (ASDs). An ASD ingests semi-structured or unstructured data directly using a pluggable combination of extraction and data integration techniques. Over time it discovers and adapts schemas for the ingested data using information provided by data integration and information extraction techniques, as well as from queries and user feedback. In contrast to relational databases, ASDs maintain multiple schema workspaces that represent individualized views over the data, which are fine-tuned to the needs of a particular user or group of users. A novel aspect of ASDs is that probabilistic database techniques are used to encode ambiguity in automatically generated data extraction workflows and in generated schemas. ASDs can provide users with context-dependent feedback on the quality of a schema, both in terms of its ability to satisfy a user's queries and the quality of the resulting answers. We outline our vision for ASDs, and present a proof of concept implementation as part of the Mimir probabilistic data curation system.

This article is published under a Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits distribution and reproduction in any medium as well as allowing derivative works, provided that you attribute the original work to the author(s) and CIDR 2017. 8th Biennial Conference on Innovative Data Systems Research (CIDR '17). January 8-11, 2017, Chaminade, California, USA.

1. INTRODUCTION

Classical relational systems rely on schema-on-load, requiring analysts to design a schema upfront before posing any queries. The schema of a relational database serves both a navigational purpose (it exposes the structure of data for querying) and an organizational purpose (it informs the storage layout of data). If raw data is available in unstructured or semi-structured form, then an ETL (i.e., Extract, Transform, and Load) process needs to be designed to translate the input data into relational form. Thus, classical relational systems require a lot of upfront investment. This makes them unattractive when upfront costs cannot be amortized, such as in workloads with rapidly evolving data or where individual elements of a schema are queried infrequently. Furthermore, in settings like data exploration, schema design simply takes too long to be practical.

Schema-on-query is an alternative approach, popularized by NoSQL and Big Data systems, that avoids the upfront investment in schema design by performing data extraction and integration at query time. Using this approach to query semi-structured and unstructured data, we have to perform data integration tasks such as natural language processing (NLP), entity resolution, and schema matching on a per-query basis. Although it allows data to be queried immediately, this approach sacrifices the navigational and performance benefits of a schema. Furthermore, schema-on-query incentivizes task-specific curation efforts, leading to a proliferation of individualized, lower-quality copies of data and to reduced productivity.

One significant benefit of schema-on-query is that queries often only access a subset of all available data. Thus, to answer a specific query, it may be sufficient to limit integration and extraction to only the relevant parts of the data. Furthermore, there may be multiple "correct" relational representations of semi-structured data, and what constitutes a correct schema may be highly application-dependent. This implies that imposing a single flat relational schema will lead to schemas that are the lowest common denominator of the entire workload and not well-suited for any of the workload's queries. Consider a dataset with tweets and re-tweets. Some queries over a tweet relation may want to consider re-tweets as tweets, while others may prefer to ignore them.

In this work, we propose adaptive schema databases (ASDs), a new paradigm that addresses the shortcomings of both the classical relational and the Big Data approaches mentioned above. ASDs enjoy the navigational and organizational benefits of a schema without incurring the upfront investment in schema and ETL process development. This is achieved by automating schema inference, information extraction, and integration to reduce the load on the user. Furthermore, instead of enforcing one global schema, ASDs build and adapt idiosyncratic schemas that are specialized to users' needs.
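The tweet/re-tweet example above can be made concrete: the same raw data admits two equally valid relational views of a tweet relation. The following toy sketch (all names are illustrative, not part of any system described in this paper) shows why neither view is "the" correct schema:

```python
# Two equally valid relational views over the same raw tweet data.
# Names (raw_tweets, is_retweet, ...) are illustrative only.

raw_tweets = [
    {"id": 1, "text": "ASDs look promising", "is_retweet": False},
    {"id": 2, "text": "RT: ASDs look promising", "is_retweet": True},
    {"id": 3, "text": "schema-on-query has costs", "is_retweet": False},
]

def tweets_view_a(data):
    """Schema A: re-tweets count as tweets."""
    return [{"id": t["id"], "text": t["text"]} for t in data]

def tweets_view_b(data):
    """Schema B: re-tweets are excluded from the tweet relation."""
    return [{"id": t["id"], "text": t["text"]}
            for t in data if not t["is_retweet"]]

# A reach/engagement analysis may want view A; an authorship analysis
# may want view B -- a single flat schema cannot serve both well.
print(len(tweets_view_a(raw_tweets)))  # 3
print(len(tweets_view_b(raw_tweets)))  # 2
```

A single global schema would have to commit to one of these views (or awkwardly accommodate both), which is exactly the lowest-common-denominator problem described above.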
We propose the probabilistic framework shown in Figure 1 as a reference architecture for ASDs.

[Figure 1: Overview of an ASD system. Unstructured and semi-structured data (e.g., JSON) are processed by extraction workflows into extraction schema candidates; schema matching links the candidates to multiple schema workspaces, which evolve through queries and feedback.]

When unstructured or semi-structured data are loaded into an ASD, this framework applies a sequence of data extraction and integration components, which we refer to as an extraction workflow, to compute possible relational schemas for the data. Any existing technique for schema extraction or information integration can be used, as long as it can expose ambiguity in a probabilistic form. For example, an entity resolution algorithm might identify two possible instances representing the same entity. Classically, the algorithm would include heuristics that resolve this uncertainty and allow it to produce a single deterministic output. In contrast, our approach requires that extraction workflow stages produce non-deterministic, probabilistic outputs instead of using heuristics. The final result of such an extraction workflow is a set of candidate schemas and a probability distribution describing the likelihood of each of these schemas.
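A minimal sketch of what such a probabilistic extraction result could look like follows; the encoding (a map from candidate schemas to probabilities, with illustrative names) is our own, not the representation used by any specific system:

```python
# Illustrative encoding of an extraction workflow's output: a set of
# candidate relational schemas, each with an associated probability.
from dataclasses import dataclass

@dataclass(frozen=True)
class CandidateSchema:
    name: str
    relations: tuple  # pairs of (relation_name, (attribute, ...))

# Two ways to shred the same tweet data, with likelihoods attached
# rather than one heuristic "winner".
candidates = {
    CandidateSchema("with_retweets",
                    (("tweet", ("id", "text", "retweet_of")),)): 0.6,
    CandidateSchema("split",
                    (("tweet", ("id", "text")),
                     ("retweet", ("id", "orig_id")))): 0.4,
}

assert abs(sum(candidates.values()) - 1.0) < 1e-9  # a distribution
best = max(candidates, key=candidates.get)          # most likely schema
print(best.name)  # with_retweets
```

The point of keeping the whole distribution, rather than only the most likely schema, is that later feedback can shift probability mass without re-running extraction.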
In ASDs, users create schema workspaces that represent individual views over the schema candidates created by the extraction workflow. The schema of a workspace is created incrementally, based on the queries asked by a user of the workspace. Outputs from the extraction workflow are dynamically imported into the workspace as they are used, or users may suggest new relations and attributes not readily available in the database. In the latter case, the ASD will apply schema matching and other data integration methods to determine how the new schema elements relate to the elements in the candidate schemas, and attempt to synthesize new relations or attributes to match. Similar to extraction workflows, the result of this step is probabilistic. Based on these probabilities and on feedback provided by users through queries, ASDs can incrementally modify the extraction workflow and schema workspaces to correct errors, to improve their quality, to adapt to changing requirements, and to evolve schemas based on updates to the input datasets. The use of feedback is made possible by our previous work on probabilistic curation operators [35] and provenance [2].

Concretely, we make the following contributions:

• We introduce the concept of adaptive schema databases (ASDs), whose schemas are probabilistic relational views over semi-structured or unstructured input datasets.

• We illustrate how ASDs communicate potential sources of error and low-quality data, and how this communication enables analysts to provide feedback.

• We present a proof of concept implementation of ASDs based on the Mimir [29] data curation system.

• We demonstrate through experiments that the instrumentation required to embed information extraction into an ASD has minimal overhead.

2. EXTRACTION AND DISCOVERY

An ASD allows users to pose relational queries over the content of semi-structured and unstructured datasets. We call the steps taken to transform an input dataset into relational form an extraction workflow. For example, one possible extraction workflow is to first employ natural language processing (NLP) to extract semi-structured data (e.g., RDF triples) from an unstructured input, and then shred the semi-structured data into a relational form. The user can then ask queries against the resultant relational dataset.

Such a workflow frequently relies on heuristics to create seemingly deterministic outputs, obscuring the possibility that the heuristics may choose incorrectly. In an ASD, one or more modular information extraction components instead produce a set of possible ways to shred the raw data, with associated probabilities. This is achieved by exposing the ambiguity arising in the components of an extraction workflow. Any NLP, information retrieval, or data integration algorithm may be used as an information extraction component, as long as the ambiguity in its heuristic choices can be exposed. The set of schema candidates is then used to seed the development of schemas individualized for a particular purpose and/or user. The ASD's goal is to figure out which of these candidates is the correct one for the analyst's current requirements, to communicate any potential sources of error, and to adapt itself as those requirements change.

Extraction Schema Candidates. When a collection of unstructured or semi-structured datasets D is loaded into an ASD, information extraction and integration techniques are automatically applied to extract relational content and compute candidate schemas for the extracted information. The choice of techniques is based on the input data type (JSON, CSV, natural language text, etc.). We associate with this data a schema candidate set Cext = (Sext, Pext), where Sext is a set of candidate schemas and
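To make the schema-candidate step concrete, here is a toy shredder for a small JSON collection D; it is a deliberately simplistic stand-in (an actual ASD would use far richer extraction and integration techniques) in which each distinct key set observed in D becomes a candidate flat schema in Sext, weighted in Pext by how often it occurs:

```python
import json
from collections import Counter

# Toy stand-in for extraction over a JSON collection D: each distinct
# key set observed becomes a candidate flat schema; frequencies give
# a simple probability distribution over the candidates.
D = [
    '{"id": 1, "text": "hello"}',
    '{"id": 2, "text": "hi", "retweet_of": 1}',
    '{"id": 3, "text": "hey"}',
]

records = [json.loads(doc) for doc in D]
counts = Counter(tuple(sorted(r.keys())) for r in records)
total = sum(counts.values())

S_ext = list(counts.keys())                        # candidate schemas
P_ext = {s: c / total for s, c in counts.items()}  # their distribution

for schema in S_ext:
    print(schema, P_ext[schema])
# ('id', 'text') has weight 2/3; ('id', 'retweet_of', 'text') has 1/3
```

Even this crude heuristic surfaces the ambiguity the section describes: the collection supports both a narrow tweet schema and a wider one with a retweet attribute, and the ASD keeps both alive as weighted candidates instead of committing early.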