Dataspaces: Progress and Prospects

Michael J. Franklin UC Berkeley & Truviso

BNCOD July 7, 2009

M. Franklin BNCOD 2009 7 July 2009

Dataspace: The Final Frontier?


Outline

• Dataspaces – some history
• Dataspaces – what are they, really?
• Some emerging examples
• Example technologies
• What’s missing?
• What’s next?

The SIGMOD Credo

Codd made relations, all else is the work of man.

Leopold Kronecker (paraphrased by Raghu Ramakrishnan?)

The Politics of Dataspaces
• Roots: CIDR 2005 Conference
  – “Gloom and Doom” panel
  – David DeWitt’s call for a unifying goal
  – Juxtaposed with lots of great work across the web, new devices, scalable computing, …

An Aside: The cycle of DB Angst

Are we polishing a “round ball”?

Did we “miss the boat” on something cool?

Dataspaces: Timeline
• CIDR 2005 (January)
• A small group started looking for commonality and a “grand challenge”
• We put a name on it.
• Ran an early draft by an impromptu group of advisors at SIGMOD 2005 (June 05).
• Wrote it up for SIGMOD Record (Dec 05) [Franklin, Halevy, Maier]
• Kept working on pretty much what we were already doing!

What’s in a name?

Dataspaces – what are they?

Dataspaces
• Inclusive
  – Deal with all the data of interest, in whatever form
• Co-existence, not Integration
  – No integrated schema, no single warehouse, no ownership required
• Pay-as-you-go
  – Keyword search is bare minimum.
  – More function and increased consistency as you add work.

M. Franklin BNCOD 2009 7 July 2009 Compare to

A quintessential schema-first Mediated Schema

approach. Semantic mappings

wrapper wrapper wrapper wrapper wrapper

Courtesy of Alon Halevy

Structured

A “Modern” View of Data Management

The Structure Spectrum

Structured (schema-first) | Semi-Structured (schema-later) | Unstructured (schema-never)

Relational XML Plain Text Tagged Media Formatted Text/Media Messages

Whither Structured Data?
• Conventional Wisdom: only 20% of data is structured.
• Decreasing due to:
  – Consumer applications
  – Enterprise search
  – Media applications

But Structure Matters!

Structure enables computers to help users manipulate and maintain the data.

[Figure: Functionality (vertical axis) plotted against Time (and cost) (horizontal axis), with curves for Dataspaces (pay-as-you-go), Structured (schema-first), and Unstructured (schema-less).]

An Alternative View

[Figure: systems placed on two axes, each running from Weak to Strong, one of them labeled Administrative Control. Systems shown: Web Search, Virtual Organization, Federated DBMS, Desktop Search, DBMS.]

Some Interesting Points on the Structure Spectrum

Web-scale Structured Data
• Database views in the Deep Web, accessed through HTML forms on the Web
• HTML tables extracted from the Web
• Relations generated by information extraction from web pages

For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..

The Future of Analytics
• Analytics traditionally a key DB use case
  – Need to understand data to manipulate it
• “Barbarians at the Gate”
  – Procedural cloud-based approaches gaining interest
  – Scalability for massive data sets
  – But, we’ve seen this movie before!

The View From the Clouds
• “Pig Latin” [Olston et al. SIGMOD 08]
  – Why have a schema?
    1) Transactional (referential?) Consistency
    2) Fast point lookups through indexes
    3) Curation for future (other) users
  – Flexible, optional, nested data model
  – Data remains in files (no admin)
• “Column Family” models of BigTable, HBase, Cassandra, CouchDB, …
• “Schema on Read”? == Errors on Read?

Other Examples
Personal Information Management (iMemex), Question answering, Scientific Collaboration

Outline

• Dataspaces – some history
• Dataspaces – what are they, really?
• Some emerging examples
• Example technologies
• What’s missing?
• What’s next?

DataSpace Technology

• Probabilistic
• Schema Matching
• Judicious use of User Input
• Approx. Query Answering
• Uncertainty Management
• Data Model Learning
• Provenance and Annotation
• Structured + Unstructured Search

Roomba: Soliciting User Feedback*
• A “web 2.0” spin on Reference Reconciliation.
  – Inspired by “ESP Game” for image labeling by Von Ahn & Dabbish; “MOBS” architecture by Doan et al.
• Use automated techniques to generate candidate matches.
• Ask users to confirm.
• Problem: which matches are most important?

* “Soliciting User Feedback in a Dataspace System”, Shawn Jeffery, Michael Franklin, Alon Halevy; SIGMOD 2008.

Roomba Overview

• Based on Value of Perfect Information (VPI) (see Russell and Norvig)
• Choose matches that provide largest increase in dataspace utility.

• Must consider: Query Workload, # Records per Term, and Confidence of Matches.
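The VPI idea can be illustrated with a minimal sketch. It assumes a toy utility model that is not taken from the Roomba paper: accepting a correct match gains `query_weight * num_records`, accepting a wrong match loses the same amount, and rejecting a match is worth zero. The class `CandidateMatch` and all function names are hypothetical. Under these assumptions, the value of asking a user about a match is the expected utility with the answer known minus the expected utility of the best guess without it.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CandidateMatch:
    """A hypothetical candidate match produced by an automated matcher."""
    name: str
    query_weight: float   # how often the query workload touches this term
    num_records: int      # records affected if the match is applied
    confidence: float     # matcher's probability that the match is correct


def _gain(m: CandidateMatch) -> float:
    # Toy assumption: benefit of a correct match scales with workload and records.
    return m.query_weight * m.num_records


def utility_without_feedback(m: CandidateMatch) -> float:
    # Best guess without asking: accept only if the expected payoff is positive.
    accept = m.confidence * _gain(m) - (1.0 - m.confidence) * _gain(m)
    return max(accept, 0.0)


def utility_with_feedback(m: CandidateMatch) -> float:
    # With a perfect answer: accept when correct (gain), reject when wrong (zero).
    return m.confidence * _gain(m)


def value_of_perfect_information(m: CandidateMatch) -> float:
    return utility_with_feedback(m) - utility_without_feedback(m)


def rank_by_vpi(matches: List[CandidateMatch]) -> List[CandidateMatch]:
    # Ask users about the matches whose confirmation helps the most first.
    return sorted(matches, key=value_of_perfect_information, reverse=True)


if __name__ == "__main__":
    candidates = [
        CandidateMatch("popular, uncertain", 1.0, 100, 0.5),
        CandidateMatch("popular, near-certain", 1.0, 100, 0.95),
        CandidateMatch("rare, uncertain", 0.5, 40, 0.6),
    ]
    for m in rank_by_vpi(candidates):
        print(m.name, round(value_of_perfect_information(m), 3))
```

In this toy model the score reduces to the gain times min(p, 1 - p), so a heavily queried term with many records and an uncertain match ranks ahead of a match the system is already nearly sure about. This mirrors the three factors listed above.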

Roomba: Sample Result

[Figure: sample result with curves labeled Perfect Knowledge and VPI-Based Ordering.]

Data Integration at Web-scale

• A typical data integration solution is impractical for web-scale data
  – Too many domains of interest (Web data is about everything)
  – Huge number of sources for each domain
  – Designing a Mediated Schema is infeasible
  – Data sources are dirty, incomplete, and lack metadata

• Solution: a Data Integration Solution that is
  – Automated
  – Best Effort
  – Pay-as-you-go

“Functional Dependency Generation and Applications in Pay-as-you-go Data Integration Systems”, Wang, Dong, Das Sarma, Franklin, Halevy; WebDB 2009.

Probabilistic Functional Dependencies (pFDs)
• Idea: use probabilistic Functional Dependencies to guide automated approaches
  – Normalize mediated schemas
  – Identify low quality data sources

• Definition of a probabilistic FD (pFD): $X \xrightarrow{p} A$, where $p$ is the likelihood that the FD holds in general

• “Learn” pFDs by counting data and schema instances
  – Note: this will get you a bad grade in your database course.
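As a rough illustration of learning by counting, the sketch below estimates the probability of a single-attribute dependency from the rows of one table. It uses one plausible counting scheme, chosen here as an assumption and not taken from the WebDB 2009 paper: for each value of the left-hand attribute, count the rows that agree with its most frequent right-hand value, then divide by the total number of rows. The function name `pfd_probability` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Mapping


def pfd_probability(rows: Iterable[Mapping[str, Hashable]], lhs: str, rhs: str) -> float:
    """Estimate how strongly `lhs` determines `rhs` in one table by counting."""
    groups: Dict[Hashable, Counter] = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1

    total = sum(sum(counts.values()) for counts in groups.values())
    if total == 0:
        return 0.0  # an empty table gives no evidence either way

    # Rows that agree with the majority right-hand value for their left-hand value.
    agreeing = sum(max(counts.values()) for counts in groups.values())
    return agreeing / total


if __name__ == "__main__":
    # Hypothetical dirty table: one row misspells the city for a zip code.
    table = [
        {"zip": "94720", "city": "Berkeley"},
        {"zip": "94720", "city": "Berkeley"},
        {"zip": "94720", "city": "Berkley"},
        {"zip": "10001", "city": "New York"},
    ]
    print(pfd_probability(table, "zip", "city"))
    print(pfd_probability(table, "city", "zip"))
```

A score below 1.0 on a dependency that is expected to hold is the kind of signal that could flag a low quality source, while scores gathered across many tables could be combined into a single likelihood. How to combine them is left open here.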

• Related work
  – TANE, CORDS
  – Conditional Functional Dependencies

Results for pFD Generation Algorithms on “Web Tables”

Fidelity of generated FDs with confidence 0.8, measured against “gold standard” FDs

Normalizing a Mediated Schema

• Generating the minimal pFD-set
  – Prune low-probability pFDs
  – Prune pFDs that can be generated by transitivity

[Figure: pFD graph over the attributes issn, author, authors, author(s), title, journal title, journal, subject, and subjects, with edge probabilities between 0.9 and 0.97.]

• Avoid over-splitting

[Figure: pFD graph over the attributes conference, meeting, colloquium, zip, city, and address, with edge probabilities 0.9, 0.95, and 1.0.]
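The two pruning steps can be sketched as follows, under simplifying assumptions that are not from the source: every pFD has a single attribute on each side, the probability threshold `min_prob` is arbitrary, and a pFD counts as derivable by transitivity whenever some chain of other retained pFDs connects the same two attributes (the probability of the chain itself is ignored). The function names and the sample pFDs are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

PFD = Tuple[str, str]  # (determinant attribute, dependent attribute)


def _reachable(src: str, dst: str, edges: Iterable[PFD]) -> bool:
    """Depth-first search: can dst be reached from src using the given pFDs?"""
    adjacency: Dict[str, List[str]] = {}
    for x, a in edges:
        adjacency.setdefault(x, []).append(a)
    stack, seen = [src], {src}
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, []):
            if nxt == dst:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def minimal_pfd_set(pfds: Dict[PFD, float], min_prob: float = 0.8) -> Dict[PFD, float]:
    # Step 1: prune low-probability pFDs.
    kept = {fd: p for fd, p in pfds.items() if p >= min_prob}

    # Step 2: prune pFDs implied by a chain of the remaining ones.
    # Weakest pFDs are tried first so stronger ones are preferred as the basis.
    for fd in sorted(kept, key=lambda f: (kept[f], f)):
        x, a = fd
        others = [e for e in kept if e != fd]
        if _reachable(x, a, others):
            del kept[fd]
    return kept


if __name__ == "__main__":
    example = {
        ("issn", "journal"): 0.95,
        ("journal", "subject"): 0.97,
        ("issn", "subject"): 0.90,   # implied by the two pFDs above
        ("title", "author"): 0.40,   # below the threshold
    }
    print(minimal_pfd_set(example))
```

Removing pFDs one at a time, and rechecking reachability against what is still kept, avoids the case where two mutually redundant pFDs are both dropped.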

Results for Schema Normalization

PayGo Quality Metrics

• Measuring quality of data sources
• Measuring and improving quality of an integration (e.g. mediated schema, schema mapping, etc.)

• The FD-based quality measuring framework is an example:
  – Identifying dirty data sources
  – Improving the mediated schema

What’s Missing?
• Metrics!!!!
  – Key idea: you pay more to get better data. Must define “better”!
  – Application-, user-, context-dependent
  – Relation to Data Quality work
• Benchmarks
  – Key to progress
• Support for collaboration/data-sharing/visualization
  – Particularly with uncertainty in base data and inferences
• More data/media types
• Focus on “serious” analytics workloads
• …Your ideas here…

Metcalfe’s (not Moore’s) Law will drive future DBMS innovation

[Figure: labels EDGE, ERP, PoS, Inventory, Data Center, and Data Warehouse.]

• More connectivity means more data to integrate.
• Dataspace-style techniques will play an ever-larger role.

Conclusions
• More connectivity means more data.
• Many would simply throw away the benefits of structure due to “schema-first” problems.
• Dataspaces provide a framework for intelligent use of structural information.
• Could also meet the goal of a “grand challenge” for the DB Community.

As an inherently unsolvable problem…

Dataspace may, in fact, be the final frontier.
