Adaptive Schema Databases

William Spoth (b), Bahareh Sadat Arab (i), Eric S. Chan (o), Dieter Gawlick (o), Adel Ghoneimy (o), Boris Glavic (i), Beda Hammerschmidt (o), Oliver Kennedy (b), Seokki Lee (i), Zhen Hua Liu (o), Xing Niu (i), Ying Yang (b)

b: Univ. at Buffalo    i: Illinois Inst. Tech.    o: Oracle

{wmspoth|okennedy|yyang25}@buffalo.edu
{barab|slee195|xniu7}@hawk.iit.edu
[email protected]
{eric.s.chan|dieter.gawlick|adel.ghoneimy|beda.hammerschmidt|zhen.liu}@oracle.com

∗Authors listed in alphabetical order.

ABSTRACT
The rigid schemas of classical relational databases help users in specifying queries and inform the storage organization of data. However, the advantages of schemas come at a high upfront cost through schema and ETL process design. In this work, we propose a new paradigm where the system takes a more active role in schema development and data integration. We refer to this approach as adaptive schema databases (ASDs). An ASD ingests semi-structured or unstructured data directly using a pluggable combination of extraction and data integration techniques. Over time it discovers and adapts schemas for the ingested data using information provided by queries and user feedback. In contrast to relational databases, ASDs maintain multiple schema workspaces that represent individualized views over the data, fine-tuned to the needs of a particular user or group of users. A novel aspect of ASDs is that probabilistic database techniques are used to encode ambiguity in automatically generated data extraction workflows and in generated schemas. ASDs can provide users with context-dependent feedback on the quality of a schema, both in terms of its ability to satisfy a user's queries and the quality of the resulting answers. We outline our vision for ASDs, and present a proof-of-concept implementation as part of the Mimir probabilistic data curation system.

1. INTRODUCTION
Classical relational systems rely on schema-on-load, requiring analysts to design a schema upfront before posing any queries. The schema of a relational database serves both a navigational purpose (it exposes the structure of the data for querying) and an organizational purpose (it informs the storage layout of the data). If raw data is available in unstructured or semi-structured form, an ETL (i.e., Extract, Transform, and Load) process needs to be designed to translate the input data into relational form. Thus, classical relational systems require a large upfront investment. This makes them unattractive when upfront costs cannot be amortized, such as in workloads with rapidly evolving data or where individual elements of a schema are queried infrequently. Furthermore, in settings like data exploration, schema design simply takes too long to be practical.

Schema-on-query is an alternative approach, popularized by NoSQL and Big Data systems, that avoids the upfront investment in schema design by performing data extraction and integration at query time. For semi-structured and unstructured data, data integration tasks such as natural language processing (NLP), entity resolution, and schema matching have to be performed on a per-query basis. This enables data to be queried immediately, but sacrifices the navigational and performance benefits of a schema, and frequently leads to reduced data quality (e.g., many task-specific versions of a dataset) and lost productivity (increased work being done at query time).

One significant benefit of schema-on-query is that queries often only access a subset of all available data. Thus, to answer a specific query, it may be sufficient to limit integration and extraction to only the relevant parts of the data. Furthermore, there may be multiple "correct" relational representations of semi-structured data, and what constitutes a correct schema may be highly application-dependent. This implies that imposing a single flat relational schema will lead to schemas that are the lowest common denominator of the entire workload and not well-suited for any of the workload's queries. Consider a dataset with tweets and re-tweets: some queries may want to consider re-tweets as tweets, while others may prefer to ignore them.

In this work, we propose adaptive schema databases (ASDs), a new paradigm that addresses the shortcomings of both the classical relational and Big Data approaches mentioned above. ASDs enjoy the navigational and organizational benefits of a schema without incurring the upfront investment in schema and ETL process development. This is achieved by automating schema inference, information extraction, and integration to reduce the load on the user. Instead of enforcing one global schema, ASDs build and adapt idiosyncratic schemas that are specialized to users' needs.

We propose the probabilistic framework shown in Figure 1 as a reference architecture for ASDs. When unstructured or semi-structured data are loaded into an ASD, this framework applies a sequence of data extraction and integration components, which we refer to as an extraction workflow, to compute possible relational schemas for this data. Any existing technique for schema extraction or information integration can be used, as long as it can expose the ambiguity inherent in these tasks in a probabilistic form. For example, an entity resolution algorithm might identify two possible instances representing the same entity. Classically, the algorithm would include heuristics that resolve this uncertainty and allow it to produce a single deterministic output. In contrast, our approach requires that extraction workflow stages produce non-deterministic, probabilistic outputs instead of using heuristics.

[Figure 1: Overview of an ASD system. Unstructured and semi-structured data (e.g., JSON) pass through extraction workflows that produce extraction schema candidates; schema matching connects these candidates to schema workspaces, which evolve through queries and feedback.]
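The requirement that workflow stages expose their ambiguity probabilistically can be sketched as follows. This is an illustrative toy stage, not part of any ASD or Mimir API; the function, record format, and probabilities are all invented for the example:

```python
# Illustrative sketch (not an actual ASD/Mimir interface): an extraction-
# workflow stage that, instead of resolving ambiguity with a heuristic,
# returns every alternative it considered together with a probability.

def resolve_entities(records):
    """Toy entity-resolution stage: decide whether two records with
    similar names denote the same entity. A classical implementation
    would pick one answer; here we expose both alternatives."""
    a, b = records
    similarity = 0.8 if a["name"].lower() == b["name"].lower() else 0.1
    return [
        ({"entities": [a]}, similarity),           # same entity: merge
        ({"entities": [a, b]}, 1.0 - similarity),  # distinct entities
    ]

alternatives = resolve_entities([{"name": "Alice"}, {"name": "alice"}])
# Downstream stages keep all alternatives; a "best guess" is simply the
# most probable one.
best, p = max(alternatives, key=lambda ap: ap[1])
```

Because every stage returns alternatives rather than a single answer, the workflow as a whole yields a distribution over possible extractions rather than one deterministic output.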
The final result of such an extraction workflow is a set of candidate schemas and a probability distribution describing the likelihood of each of these schemas. In ASDs, users create schema workspaces that represent individual views over the schema candidates created by the extraction workflow. The schema of a workspace is created incrementally, based on queries asked by a user of the workspace. Outputs from the extraction workflow are dynamically imported into the workspace as they are used, or users may suggest new relations and attributes not readily available to the database. In the latter case, the ASD will apply schema matching to determine how the new schema elements relate to the elements in the candidate schemas. As with extraction workflows, this schema matching consists of candidate matches with associated probabilities. Based on these probabilities and feedback provided by users through queries, ASDs can incrementally modify the extraction workflow and schema workspaces to correct errors, to improve their quality, to adapt to changing requirements, and to evolve schemas based on updates to input datasets. The use of feedback is made possible by our previous work on probabilistic curation operators [19] and provenance [1]. By modelling schemas as views over a non-relational input dataset, we decouple data representation from content. Thus, we gain flexibility in storage organization: for a given schema we may choose not to materialize anything, we may fully materialize the schema, or we may materialize selectively based on access patterns.

Concretely, this paper makes the following contributions:
• We introduce our vision of ASDs, which enable access to unstructured and semi-structured data through personalized relational schemas.
• We show how ASDs leverage information extraction and data integration to automatically infer and adapt schemas based on evidence provided by these components, by queries, and through user feedback.
• We show how ASDs enable adaptive, task-specific "personalized schemas" through schema workspaces.
• We illustrate how ASDs communicate potential sources of error and low-quality data, and how this communication enables analysts to provide feedback.
• We present a proof-of-concept implementation of ASDs based on the Mimir [14] data curation system.
• We demonstrate through experiments that the instrumentation required to embed information extraction into an ASD has minimal overhead.

2. EXTRACTION AND DISCOVERY
An ASD allows users to pose relational queries over the content of semi-structured and unstructured datasets. We call the steps taken to transform an input dataset into relational form an extraction workflow. For example, one possible extraction workflow is to first apply NLP to extract semi-structured data (e.g., RDF triples) from an unstructured input, and then shred the semi-structured data into a relational form. The user can then ask queries against the resultant relational dataset. Such a workflow frequently relies on heuristics to create seemingly deterministic outputs, obscuring the possibility that the heuristics may choose incorrectly. In an ASD, one or more modular information extraction components instead produce a set of possible ways to shred the raw data, with associated probabilities. This is achieved by exposing the ambiguity arising in the components of an extraction workflow. Any NLP, information retrieval, or data integration algorithm may be used as an information extraction component, as long as the ambiguity in its heuristic choices can be exposed. The set of schema candidates is then used to seed the development of schemas individualized for a particular purpose and/or user. The ASD's goal is to figure out which of these candidates is the correct one for the analyst's current requirements, to communicate any potential sources of error, and to adapt itself as those requirements change.

Extraction Schema Candidates. When a collection of unstructured or semi-structured datasets D is loaded into an ASD, information extraction and integration techniques are automatically applied to extract relational content and compute candidate schemas for the extracted information. The choice of techniques is based on the input data type (JSON, CSV, natural language text, etc.). We associate with this data a schema candidate set Cext = (Sext, Pext), where Sext is a set of candidate schemas and Pext is a probability distribution over this set. We use Smax, referred to as the best guess schema, to denote arg max_{S ∈ Sext} Pext(S), i.e., the most likely schema from the set of candidate schemas. Similar data models have been studied extensively in probabilistic databases, allowing us to adapt existing work on probabilistic query processing while still supporting a variety of information extraction, natural language processing, and data integration techniques.

Example 1. As a running example throughout the paper, consider a JSON document (a fragment is shown below) that stores a college's enrollment. Assume that for every graduate student we store a name and degree, but only for some students is there a record of the number of credits achieved so far. For undergraduates we only store a name, although several undergraduates were accidentally stored with a degree. A semi-structured to relational mapper may extract schema candidates as shown in Figure 2.

{"grad": {"students": [
    {"name": "Alice", "deg": "PhD", "credits": "10"},
    {"name": "Bob",   "deg": "MS"}, ...]},
 "undergrad": {"students": [
    {"name": "Carol"}, {"name": "Dave", "deg": "U"}, ...]}}
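The selection of a best guess schema from a candidate set can be sketched as follows, using the candidate probabilities of the running example in Figure 2. The dictionary encoding of schemas is our own simplification, invented for illustration; an actual ASD stores far richer extraction metadata:

```python
# Sketch of best-guess schema selection over Cext = (Sext, Pext), using the
# candidate probabilities shown in Figure 2. The schema encoding (relation
# name -> attribute list) is invented purely for this illustration.

Sext = {
    "a": {"Student": ["name"]},
    "b": {"Student": ["name", "deg"]},
    "c": {"Undergrad": ["name"], "Grad": ["name"]},
    "d": {"Undergrad": ["name", "deg"], "Grad": ["name", "deg", "credits"]},
}
Pext = {"a": 0.19, "b": 0.27, "c": 0.22, "d": 0.32}

def best_guess(candidates, prob):
    # Smax = arg max over S in Sext of Pext(S)
    return max(candidates, key=lambda s: prob[s])

def best_compatible(candidates, prob, required):
    """arg max over candidates compatible with a query: the schema must
    contain every relation and attribute the query mentions."""
    ok = [s for s in candidates
          if all(rel in candidates[s] and set(atts) <= set(candidates[s][rel])
                 for rel, atts in required.items())]
    return max(ok, key=lambda s: prob[s]) if ok else None

smax = best_guess(Sext, Pext)                        # candidate (d)
# For `SELECT name FROM Student`, Smax is incompatible; the most probable
# compatible candidate is (b), so Smax need not win for a given query.
for_student = best_compatible(Sext, Pext, {"Student": ["name"]})
```

The second function anticipates the compatibility-restricted arg max discussed below: the globally most likely schema and the most likely schema containing the query's relations can differ.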
Querying Extracted Data. We would like to expose to users an initial schema that allows D to be queried (i.e., the best guess schema Smax), while at the same time acknowledging that this schema may be inappropriate for the analyst, incorrect for her current task, or simply outright wrong. Manifestations of extraction errors appear in three forms: (1) a query incompatible with Smax, (2) an update with data that violates Smax, or (3) an extraction error resulting in the wrong data being presented to the user. The first two errors are overt and thus easy for the database to quickly detect. In both cases, the primary challenge is to help the user first determine whether the operation was correct, and if necessary, to repair the schema accordingly. Here, the distribution Pext serves as a metric for schema suggestions. Given a query (resp., update or insert) Q, the goal is to compute arg max_{S ∈ Sext ∧ S ⊨ Q} Pext(S), where S ⊨ Q denotes compatibility between schema and query, i.e., the schema contains the relations and attributes postulated by the query. While Smax has the highest probability of all schema candidates in Sext, that does not imply that it has the highest probability with respect to the schema elements mentioned in the query. Thus, we use Smax as a generic best guess to enable the user to express queries at first, but then adapt the best schema over time. Note that the personalized schemas we introduce in the next section even allow queries to postulate new relations and attributes.

Detecting extraction errors of the third kind is harder, and typically only possible once issues with the returned query result are discovered. Such errors are most often detected as a result of inconsistencies observed while the analyst explores her data. Thus, the goal of an ASD is to make the process of detecting and repairing extraction errors as seamless as possible. Our approach is based on pay-as-you-go or on-demand approaches to curation [11, 14, 18], and is the focus of Sections 4 and 6 below.

(a) Student        (b) Student          (c) Undergrad   Grad
    Name               Name  Deg            Name        Name
    Alice              Alice PhD            Carol       Alice
    Bob                Bob   MS             Dave        Bob
    Carol              Carol (null)         P = 0.22
    Dave               Dave  U
    P = 0.19           P = 0.27

(d) Undergrad          Grad
    Name  Deg          Name  Deg  Credits
    Carol (null)       Alice PhD  10
    Dave  U            Bob   MS   (null)
    P = 0.32

Figure 2: Extracted Schema Candidate Set and Data

3. ADAPTIVE, PERSONALIZED SCHEMAS
An ASD maintains a set of schema workspaces W = {W1, ..., Wn}. Each workspace Wi has an associated mapped context Ci = (Si, Mi, Pi), where Si is a schema, Mi is a set of possible schema matchings [2], each between the elements of Si and one S ∈ Sext, and Pi assigns probabilities to these matches. Users may maintain their own personal schema workspace or share workspaces within a group of users that have common interests (e.g., a business analyst workspace for the sales data of a company). In the future, we plan to provide version-control-style features for the schemas of workspaces, including access to data through past schema versions and importing of schema elements from one workspace into another. Recent work on schema evolution [4] and schema versioning demonstrates that it is possible to maintain multiple versions of schemas in parallel, where data is only stored according to one of these schemas. Specifically, we plan to extend our own work [15] on flexible versioning of data.

Importing Schema Elements. Initially, the schema of a workspace is empty. When posing a query over an extracted dataset, the user can refer to elements from schema Smax, schema Si, or new relations and attributes that occur in neither. References of the first two types are resolved by applying the extraction workflow to compute the instances for these schema elements (and potentially mapping the data based on the matches in Mi). If a query references schema elements from Smax, then these schema elements are added to the current workspace schema, and one-to-one matches with probability 1 between these elements in Smax and Si are added to the matching Mi.

Probabilistic Semantics of ASD Queries. Note that ASD queries are inherently probabilistic, as the result varies depending on the distribution of possible extractions. However, we do not have to overwhelm the user with full probabilistic query semantics. Instead, we apply the approach from [14] to return a deterministic best guess result based on Si, and expose uncertainty through user interface cues and through human-readable on-demand explanations.

Example 2. Continuing with our running example, assume a user operating in workspace W1 would like to retrieve all the names of students from the enrollment JSON document. One option the user can take is to query the relations exposed by the best guess schema Smax. For instance, one way to express this query over the schema in Figure 2 is:

SELECT name FROM Undergrad
UNION
SELECT name FROM Grad

To process this query, the ASD would run the extraction workflow to create the relational content of the Undergrad and Grad relations (it would be sufficient to create the projections of these relations on name only). The query is then evaluated over these extracted data. As a side effect, by accessing these schema elements the user declares interest in them, and they are added to the schema workspace. Note that only accessed attributes are added to the workspace. If the workspace's schema was empty before, the resulting schema would be S1 = {Undergrad(name), Grad(name)}. Additionally, in M1 these elements are matched with their counterparts in Smax. If afterwards the user retrieves the degree of a graduate student, then deg would be added to Grad(name).

Declaring New Schema Elements. So far we have only discussed the case where a query refers to existing schema elements (either in the user schema or the extracted schema). If a query uses schema elements that are so far unknown, then this is interpreted as a request by the user to add these schema elements to the schema workspace. It is the responsibility of the ASD to determine how schema elements in the extracted schema are related to these new elements. Any existing schema matching (and mapping discovery) approach could be used for this purpose. For instance, we could complement schema matching with schema mapping discovery [3, 17] to establish more expressive relationships between schema elements. Based on such matches we can then rewrite the user's query to access only relations from the extracted schema using views (a common technique from virtual data integration [8]), or materialize its content using data exchange techniques [7]. Again, we take a probabilistic approach by storing all possible matches with associated probabilities and choosing the matches with the highest probability for the given query.

Example 3. Assume that a user would like to find the names of students without having to figure out which relations in Smax store student information. The user may ask:

SELECT name FROM Student

Since relation Student occurs in neither S1 nor Smax, the ASD would run a schema matcher to determine which elements from Smax match with Student and its attribute name, for instance by probabilistically combining the name attributes of Grad and Undergrad as in the query from Example 2.

In the example above, three Student(name) relations could reasonably be extracted from the dataset: one with just graduate students, one with just undergraduates, and one with both. Although it may be possible to heuristically select one of the available extraction options, it is virtually impossible for a single heuristic to cover all use cases. Instead, ASDs use heuristics only as a starting point for schema definitions. Thus, an ASD decouples its information extraction heuristic from the space of possible extractions that could result from it. In the next section, we present how these uncertain heuristic choices can be validated or corrected as needed in a pay-as-you-go manner [11, 14]. Note that we can use any existing schema matching algorithm for the matching Mi, as long as it can be modified to expose probabilities. As we have demonstrated in previous work, this assumption is reasonable: Mimir [14] already supports a simple probabilistic schema matching operator.

4. EXPLANATIONS AND FEEDBACK
Allowing multiple schemas to co-exist simultaneously opens up opportunities for ambiguity to enter into an analyst's interaction with the database. To minimize confusion, it is critical that the analyst be given insight into how the ASD is presently interpreting the data. Our approach to communicating the ASD's decision process leverages our previous work on explaining results of probabilistic queries and data curation operators, as developed in Mimir [14], and our previous provenance frameworks for database queries [1, 15]. We discuss the details of these systems along with our proof of concept implementation in Section 6. An ASD must be able to: (1) warn the analyst when ambiguity could impact her interaction, (2) explain the ambiguity, (3) evaluate the magnitude of the ambiguity's potential impact, and (4) assist the analyst in resolving the ambiguity. In this section, we explore how an ASD can achieve each of these goals in the context of three forms of interaction between the ASD and the outside world: schema, data, and update interactions.

Schema Interactions. Schema interactions take place between the analyst and the ASD as she composes queries and explores the relations available in her present workspace. Recall that referencing a relation that exists in neither the workspace nor the extraction schema Smax does not necessarily constitute a problem, since this triggers the ASD to add the relation to the workspace and figure out which relations in Sext it could be matched with. However, it may be the case that no feasible match can be found. This either means that the user is asking for data that is simply not present in the dataset D, or that errors in the extraction workflow or matching caused the ASD to miss the correct match. To explain the failure, we may provide the user with a summary of why matching with Sext failed, e.g., that there are no relations with similar names in Sext.

Data Interactions. Data interactions happen when the ASD produces query results. Here, ambiguity typically cannot be detected by static analysis. For example, the query in Example 3 can have three distinct responses, depending on which relations from the extraction schema are matched against the workspace Student relation. At this level, unintrusive interface cues [14] are critical for alerting the analyst to the possibility that the results she is seeing may not be what she had in mind. The Mimir system uses a form of attribute provenance [14] to track the effects of sources of ambiguity on the output of queries. In addition to flagging potentially ambiguous query result cells and rows (e.g., attributes computed based on a schema match that is uncertain), Mimir allows users to explore the effects of ambiguity through both human-readable explanations of their causes and statistical precision measures like standard deviation and confidence. Linking results to sources of ambiguity also makes it easier for the analyst to provide feedback that resolves the ambiguity. In Section 6 we show how we leverage Mimir to streamline data interactions in our prototype ASD.

Update Interactions. Finally, update interactions take place when the ASD receives new data or changes to existing data. In comparison to the other two cases, there may not be an analyst directly involved in an update interaction, e.g., a nightly bulk data import. Thus, the ASD must be able to communicate the ambiguity arising from update interactions to analysts indirectly. The main problem with updates is that the extraction schema candidates Sext may get out of sync with the data they describe. However, rather than outright blocking an insertion or update that does not conform with the extraction schema, the ASD will represent schema mismatches as missing values when the data is accessed through the out-of-sync schema. Alternatively, we can attempt to resolve data errors with a probabilistic repair. The ASD can also adjust the information extractor to adapt the schema candidate set Cext and its probability distribution. However, this change could invalidate the matches of an existing workspace. For instance, consider an extracted schema that contains a relation Student(name, credits). If subsequent updates to the dataset insert many students without credits, then eventually the relation Student(name, credits) should be replaced with Student(name). If a workspace schema contains an attribute matched to Student.credits, then this attribute is no longer matched with any attribute from the extraction schema. When explaining missing values to the user, we plan to highlight their cause: (1) the data does not match the schema, or (2) a workspace attribute is orphaned (it is no longer matched to any attribute in Cext).

User Feedback. We envision letting the user provide various kinds of feedback about errors in query results, such as marking sets of invalid attribute values or unexpected rows. Based on provenance, we can then back-propagate this information and interpret it as feedback on matches and extraction workflow decisions. Determining a precise method for computing the best fixes based on such information is an interesting avenue for future work.

5. ADAPTIVE ORGANIZATION
The schemas discovered by ASDs can be used for performance tuning by caching relations that are used frequently and are stable (i.e., their schema and content are not modified frequently). Furthermore, we can cache the outputs of data extraction, or projections over this output, to benefit multiple workspace schema elements. This leads to interesting optimization problems because of the sharing of elements. The difference from other automated approaches for tuning physical design is that the sharing of such elements by workspace schemas is encoded in their matchings. It remains to be seen whether this information can be exploited to devise caching strategies that are fine-tuned for ASDs. Note that in contrast to relational databases, which materialize data according to a schema, this caching is optional for ASDs.
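As an illustration of the kind of policy Section 5 envisions (the section deliberately leaves concrete strategies open, so the rule, signal names, and thresholds below are entirely hypothetical), a cache-admission decision might weigh access frequency against schema stability:

```python
# Hypothetical cache-admission rule for extracted relations. The ASD paper
# leaves the concrete caching policy open; this sketch, including its
# thresholds and inputs, is invented purely for illustration.

def should_cache(accesses, schema_changes, window=100):
    """Cache a relation that is read often but is stable, i.e., whose
    schema is rarely invalidated by extractor adaptation."""
    if accesses == 0:
        return False
    churn = schema_changes / accesses   # schema instability per access
    return accesses >= window // 10 and churn < 0.05

# A frequently read, stable relation qualifies; a churning one does not.
decision = should_cache(accesses=50, schema_changes=1)
```

A real policy would additionally account for the sharing of cached extraction outputs across workspaces, which is exactly the optimization problem the section identifies as open.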
6. PROOF OF CONCEPT
We now outline a proof of concept ASD implementation, leveraging the Mimir [14, 18] system for its provenance and feedback harvesting capabilities. For this preliminary implementation, we chose to implement a normalizing relational data extractor for JSON data recently proposed by DiScala and Abadi [5]. This extractor uses heuristics based on functional dependencies (FDs) between the objects' attributes to create a narrow, normalized relational schema for the input data.

6.1 Deterministic Extraction
The DiScala extractor [5] runs on collections of JSON objects with a shared schema. The first step in the extraction process is to create an FD graph from the objects. The objects are first flattened, discarding nesting structure and decomposing nested arrays, resulting in a single, very wide and very sparse relation. The extractor creates an FD graph for this relation using a fuzzy FD detection algorithm [9], originally proposed by Ilyas et al., that keeps it resilient to minor data errors. Any subtree of this graph can serve as a relation, with the root of the tree being the relation's key. At this point, the original, deterministic DiScala extractor heuristically selects one set of relations upfront.

In a second pass, the extractor attempts to establish correlations between the domains of pairs of attributes. Relations with keys drawn from overlapping domains become candidates for being merged together. Once two potentially overlapping relations are discovered, the DiScala extractor uses constraints given by FDs to identify potential mappings between the relation attributes.

6.2 Non-Deterministic Extraction
The DiScala extractor makes three heuristic decisions: (1) which relations to define from the FD graph, (2) which relations to merge together, and (3) how to combine the schemas of merged relations. Errors in the first of these heuristics appear in the visible schema and, as discussed in Section 2, are easy to detect and resolve. The remaining heuristics can cause data errors that are not visible until the user begins to issue queries. As a result, the primary focus of our proof of concept is on the latter two heuristics.

Concretely, our prototype ASD generalizes the DiScala extractor by providing: (1) provenance services, allowing users to quickly assess the impact of the extractor's heuristics on query results, (2) sensitivity analysis, allowing users to evaluate the magnitude of that impact, and (3) easy feedback, allowing users to easily repair mistakes in the extractor's heuristics. We leverage a modular data curation system called Mimir [14, 18] that encodes heuristic decisions through a generalization of C-Tables [10]. A relation in Mimir may include placeholders, or variables, that stand in for heuristic choices. During normal query evaluation, variables are replaced according to the heuristic. However, the terms themselves allow the use of program analysis as a form of provenance, making it possible to communicate the presence and magnitude of heuristic ambiguity in query results, and also allowing users to easily override heuristic choices.

Example 4. Consider a relation R(A, B). To remap R into a new relation R'(C), Mimir uses a query of the form:

SELECT CASE {C} WHEN 'A' THEN A
                WHEN 'B' THEN B
                ELSE NULL END AS C
FROM R

where {C} denotes a variable that records which of A or B maps to C. The resulting C-Table includes a labeled null for each output that can take the values of R.A, R.B, or NULL, depending on how the model assigns variable values. Consider this example instance:

R   A  B        R'  C
    1  2            {C}='A' ? 1 : ({C}='B' ? 2 : NULL)
    3  4            {C}='A' ? 3 : ({C}='B' ? 4 : NULL)

Conceptually, every value of R'.C is expressed as a deferred computation (a future). A valuation for {C} allows R' to be resolved into a classical relation. Conversely, program analysis on the future links each value of R'.C, as well as any derived values, back to the choice of how to populate C.

Heuristic data transformations, or lenses, in Mimir consist of two components. First, the lens defines a view query that serves as a proxy for the transformed data, with variables standing in for deferred heuristic choices. Second, a model component abstractly encodes a joint probability distribution for each variable through two operations: (1) compute the most likely value of the variable, and (2) draw a random sample from the distribution. Additionally, the model is required to provide a human-readable explanation of the ambiguity that the variable captures.

For this proof of concept we adopt an interaction model where the ASD dynamically synthesizes workspace relations by extending one extracted relation (the primary) with data from extracted relations (the secondaries) containing similar data. Figure 3 illustrates the structure of one such view query: the primary and all possible secondaries are initialized by a projection on the source data. A single-variable selection predicate (i.e., {relation?}) reduces the set of secondaries included in the view. Second, a schema matching projection, as illustrated in Example 4, adapts the schema of each secondary to that of the primary.

Our adapted DiScala extractor interfaces with this structure by providing models for relation- and schema-matching, respectively. We refer the reader to [5] for the details of how the extractor computes relation and attribute pairing strength. The best-guess operations for both models use the native DiScala selection heuristics, while samples are generated and weighted according to the pairing strength computed by the extractor. By embedding the DiScala extractor as a lens, we are able to leverage Mimir's program analysis capabilities to provide feedback. When a lens is queried, Mimir highlights result cells and rows that are ambiguous.

On request, Mimir constructs human-readable explanations of why highlighted cells and rows are ambiguous, as well as statistical metrics that capture the magnitude of the potential result error. Crucially, this added functionality requires only lightweight instrumentation and compile-time query rewrites.

[Figure 3: Structure of an extracted relation's merged view. A synthesized workspace relation is the union (⨄) of a primary (relation 1: 'base tweet') and candidate secondaries (relation 2: 'reply'; relation 3: 'retweet'). Each relation is initialized by a projection on the source JSON object (π id, pic, ...; π reply_id, reply_pic, ...; π retweet_id, retweet_pic, ...); single-variable selection predicates (σ{reply?}, σ{retweet?}) decide which secondaries are included; and schema matching projections (e.g., π id: reply_id, pic: CASE {pic} WHEN ...) remap each secondary to the primary's schema.]

6.3 Evaluation
Mimir and the prototype ASD are implemented in Scala 2.10.5, running on the 64-bit Java HotSpot VM v1.8-b132. Mimir was used with SQLite as a backend. Tests were run single-threaded on a 12x2.5 GHz Intel Xeon running Ubuntu 16.04.1 LTS. The primary goal of our experiments is to evaluate the overhead of our provenance-instrumented implementation of the DiScala extractor compared to the behavior of the classical extractor. We used a dataset, TwitterS, consisting of 100 thousand rows and the 100 most common fields taken from Twitter's Firehose. Figure 4 shows the performance of the extractor. We show the pre-processing time, the time necessary to compile and evaluate SELECT * FROM workspace_relation, and the time for the same query against a table containing the output of the classical DiScala extractor. Note that both queries return the same set of results, but the former is instrumented, allowing it to provide provenance and feedback capabilities. Runtime for the instrumented extractor is a factor of 2 larger than the original, but is dominated by fixed compile-time costs.

                         Time
Dataset      Precompute        ASD           Classic
TwitterS     223.18s (1.53)    1.56s (0.19)  0.77s (0.04)

Figure 4: Overhead of the provenance-aware extractor (5-trial average, with one standard deviation in parentheses)

7. RELATED WORK
... work of probabilistic schema inference with user feedback and provenance tracking.

8. CONCLUSION
We present our vision for adaptive schema databases (ASDs), a technique for querying unstructured and semi-structured data by iteratively inferring and refining personalized schemas that govern access to the data. We discuss how ASDs can be realized using probabilistic query processing techniques and by incorporating extraction and integration into the DBMS. By means of these techniques we can develop systems that enjoy the navigational and organizational benefits of schemas without requiring the high upfront cost of manual schema design, information retrieval, and data integration.

9. REFERENCES
[1] B. Arab, D. Gawlick, V. Krishnaswamy, V. Radhakrishnan, and B. Glavic. Reenactment for read-committed snapshot isolation. In CIKM, 2016.
[2] P. A. Bernstein, J. Madhavan, and E. Rahm. Generic schema matching, ten years later. PVLDB, 4(11):695–701, 2011.
[3] M. J. Cafarella, D. Suciu, and O. Etzioni. Navigating extracted data with schema discovery. In WebDB, 2007.
[4] C. Curino, H. J. Moon, A. Deutsch, and C. Zaniolo. Automating the database schema evolution process. VLDB Journal, 22(1):73–98, 2013.
[5] M. DiScala and D. J. Abadi. Automatic generation of normalized relational schemas from nested key-value data. In SIGMOD, 2016.
[6] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. TKDE, 19(1):1–16, 2007.
[7] R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data exchange: Semantics and query answering. TCS, 336(1):89–124, 2005.
[8] A. Y. Halevy. Answering queries using views: A survey. VLDBJ, 10(4):270–294, 2001.
[9] I. F. Ilyas, V. Markl, P. Haas, P. Brown, and A. Aboulnaga. CORDS: Automatic discovery of correlations and soft functional dependencies. In SIGMOD, 2004.
[10] T. Imieliński and W. Lipski, Jr. Incomplete information in relational databases. J. ACM, 31(4):761–791, 1984.
[11] S. R. Jeffery, M. J. Franklin, and A. Y. Halevy. Pay-as-you-go user feedback for dataspace systems. In SIGMOD, 2008.
[12] Z. H. Liu, B. Hammerschmidt, D. McMahon, Y. Liu, and H. J. Chang. Closing the functional and performance gap between SQL and NoSQL.
In ICMD, 2016. [13] A. Maedche and S. Staab. Learning ontologies for the Information extraction and data integration have been semantic web. In ICSW, 2001. studied intensively. Several papers study schema match- [14] A. Nandi, Y. Yang, O. Kennedy, B. Glavic, R. Fehling, ing [2], entity resolution [6,16] and mapping XML or JSON Z. H. Liu, and D. Gawlick. Mimir: Bringing ctables into data into relational form [5]. Furthermore, there is exten- practice. Technical report, The ArXiv, 2016. sive work on discovering schemas [3, 17] and ontologies [13] [15] X. Niu, B. Arab, D. Gawlick, Z. H. Liu, V. Krishnaswamy, from collections of data. We build upon this comprehensive O. Kennedy, and B. Glavic. Provenance-aware versioned body of work and leverage such techniques in ASDs. With dataworkspaces. In TaPP, 2016. JSON as a simplified semi-structured data model, data can [16] J. Wang, T. Kraska, M. J. Franklin, and J. Feng. Crowder: Crowdsourcing entity resolution. PVLDB, 5(11):1483–1494, be stored, indexed and queried without upfront schema defi- 2012. nition. Liu et al. [12] have introduced the idea that a logical [17] K. Wang and H. Liu. Schema discovery for semistructured schema can be automatically derived from JSON data and data. In KDD, volume 97, pages 271–274, 1997. be used to materialize query-friendly adaptive in-memory [18] Y. Yang. On-demand query result cleaning. In VLDB PhD structures. Curino et al. [4] study multiple schema versions Workshop, 2014. can co-exist virtually in the context of schema evolution. [19] Y. Yang, N. Meneghetti, R. Fehling, Z. H. Liu, and This paper takes these ideas one step further by introducing O. Kennedy. Lenses: An on-demand approach to etl. the notion of adaptive schema databases based on flexible VLDB, 8(12):1578–1589, 2015. schema data management and establishes a practical frame-
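To make the structure of Figure 3 concrete, the following is a minimal sketch of the merged-view idea: a nested tweet object is split into per-entity relations ('base tweet', 'reply', 'retweet') by a selection plus a renaming projection, and the workspace relation is their union. This is an illustrative simplification, not the actual DiScala extractor or Mimir code; the field names (`in_reply_to`, `retweeted_status`, `pic`) are hypothetical stand-ins for Twitter's real schema, and the conditional CASE logic on `pic` is omitted.

```python
# Sketch of Figure 3: extract three relations from nested tweet objects
# via selection + renaming projection, then union them into one
# synthesized workspace relation. Field names are hypothetical.

def project(row, renames):
    """Renaming projection: keep only the listed fields, renamed."""
    return {new: row.get(old) for old, new in renames.items()}

def extract_workspace(tweets):
    workspace = []
    for t in tweets:
        # relation 1: 'base tweet' (no selection predicate)
        workspace.append(project(t, {"id": "id", "pic": "pic"}))
        # relation 2: 'reply' -- selection sigma{reply?}
        if t.get("in_reply_to") is not None:
            workspace.append(project(t, {"id": "reply_id", "pic": "reply_pic"}))
        # relation 3: 'retweet' -- selection sigma{retweet?}
        if t.get("retweeted_status") is not None:
            workspace.append(project(t, {"id": "retweet_id", "pic": "retweet_pic"}))
    return workspace

tweets = [
    {"id": 1, "pic": "a.png"},
    {"id": 2, "pic": "b.png", "in_reply_to": 1},
]
print(extract_workspace(tweets))
```

A reply tweet contributes two rows to the workspace relation: one through the 'base tweet' projection and one through the 'reply' projection, mirroring how the merged view unions overlapping extractions of the same source object.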
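The lens behavior described in Section 6.2 (highlighting ambiguous cells and explaining them on request) can be illustrated with a toy annotation scheme. This is not Mimir's actual API; the class and function names below are invented for illustration, and real lenses attach such annotations via compile-time query rewrites rather than wrapper objects.

```python
# Toy illustration (hypothetical, not Mimir's API): query results carry
# per-cell ambiguity flags, and explanations are built only on request.

class AnnotatedCell:
    """A result cell plus an ambiguity flag and an optional reason."""
    def __init__(self, value, ambiguous=False, reason=None):
        self.value = value
        self.ambiguous = ambiguous
        self.reason = reason

def explain(row):
    """On-request: collect human-readable reasons for ambiguous cells."""
    return [cell.reason for cell in row if cell.ambiguous]

row = [
    AnnotatedCell(42),
    AnnotatedCell("jpg", ambiguous=True,
                  reason="field 'pic' matched two candidate schemas"),
]
print([cell.value for cell in row])  # plain values, as a normal query returns
print(explain(row))                  # explanations only for flagged cells
```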