Data Integration
Sam Birch & Alex Leblang

Two faces of data integration
● Businesses
○ Have relatively more structured databases, which they need to organize
● Research on integrating less structured data
○ Databases coming from different organizations without a common architecture

Businesses have a lot of data
…in many different databases

One safe place: the data warehouse
Businesses want to control and access their operational data in a single place:
1. Backups and versioning
2. Single query interface
3. Quality & consistency
4. Separation of analytic and operational workloads

Data warehouse design
According to Bill Inmon, a data warehouse:
1. Links all data about a single real-world thing;
2. Never over-writes or deletes data;
3. Comprises all (or nearly all) of an organization's data in a consistent way;
4. Comprises all versions of the data in the operational system;
5. Is not virtualized

Update methods
1. ETL
2. Virtual / federated databases
3. Change data capture

ETL
● Extract, Transform, Load
● Take data from one data source, transform it, and load it into another location, usually a data warehouse
● Generally periodic (hourly, daily, …)

Virtual (federated) DBs
● A method for integrating data virtually
● No actual (physical) data integration
● Virtual database systems give you the feel of data integration without the need to maintain a single data warehouse

How they work
● Federated databases map autonomous database systems into a single database
● This is done over a computer network, which allows for geographic distribution
● Federated databases can be loosely or tightly coupled

Loosely / Tightly Coupled
● Loosely coupled federations require each component database to expose its own schema
○ …which forces the user to know each component's schema when using the database
● Tightly coupled federations use independent processes to create a single schema used across the federated database
○ …which shifts much of the work from the user or DBA to the software itself

Change data capture
● Keep track of diffs, like version control for your database
● Helpful for large data with smaller changes
● Different implementations:
○ Timestamps on rows
○ Version numbers on rows
○ Triggers on tables

Reforming unstructured data

Large and unstructured
The 4 Vs (according to Dong):
○ large Volume of sources
○ changing at high Velocity
○ a huge Variety of sources
○ and many questions about data Veracity
Dong et al.

Goals
● Schema alignment
● Record linkage
● Data fusion

Schema Alignment
● Mediated schema
○ Identify domain-specific modeling
● Attribute matching
○ Identify similarities between schema attributes
● Schema mapping
○ Specify how to map records between different schemas
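Attribute matching, as described above, can be sketched with a simple name-similarity heuristic. The schemas, the tokenizer, and the 0.5 threshold below are illustrative assumptions, not part of Dong et al.'s method:

```python
# A minimal sketch of attribute matching: pair up columns from two
# hypothetical schemas by Jaccard similarity over name tokens.

def tokens(name):
    """Split an attribute name like 'customer_name' into lowercase tokens."""
    return set(name.lower().replace("-", "_").split("_"))

def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b)

def match_attributes(schema_a, schema_b, threshold=0.5):
    """Return (attr_a, attr_b, score) pairs whose name similarity clears the threshold."""
    matches = []
    for a in schema_a:
        for b in schema_b:
            score = jaccard(tokens(a), tokens(b))
            if score >= threshold:
                matches.append((a, b, round(score, 2)))
    return matches

pairs = match_attributes(
    ["customer_name", "phone_number", "zip"],
    ["name", "customer_phone_number", "postal_code"],
)
print(pairs)  # links customer_name↔name and phone_number↔customer_phone_number
```

Real systems combine name similarity with data-type and value-distribution evidence; this sketch only shows the shape of the problem.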
Record Linkage
Data Fusion
● Reconciliation of conflicting non-identifying content
○ Voting
○ Source quality
○ Copy detection
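The first two reconciliation strategies above, voting weighted by source quality, can be sketched as follows. The sources and their quality weights are made up for illustration:

```python
# Data fusion by quality-weighted voting: conflicting values for one
# attribute are resolved by summing source-quality weights and keeping
# the value with the most support.
from collections import defaultdict

def fuse(claims, source_quality):
    """claims: list of (source, value) pairs; returns the value with the
    highest total quality-weighted vote."""
    votes = defaultdict(float)
    for source, value in claims:
        votes[value] += source_quality.get(source, 1.0)  # unknown sources get weight 1
    return max(votes, key=votes.get)

# Hypothetical sources disagreeing on a ZIP code.
quality = {"official_registry": 0.9, "blog_a": 0.3, "blog_b": 0.3}
claims = [
    ("official_registry", "02912"),
    ("blog_a", "02906"),
    ("blog_b", "02906"),
]
print(fuse(claims, quality))  # the official source (0.9) outvotes the two blogs (0.6)
```

With uniform weights this degenerates to plain majority voting; copy detection would further discount sources that merely copy each other.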
Dealing With Different Data Sources
● Semantic heterogeneity
● Access-pattern limitations
● Integrity constraints
● Data-level heterogeneity
http://research.cs.wisc.edu/dibook/, Chapter 3: Data Source

Semantic Heterogeneity
● Data integration must cope with many kinds of mismatch
● Differences in:
○ organization of tables
○ naming of schemas
○ data-level representation
Data-level Heterogeneity
● “115 Waterman St. 02912” / “Brown University CIT”
● “Tim Kraska” / “Timothy Kraska” / “[email protected]”
● IRA: Individual Retirement Account or Irish Republican Army?
● Arbitrarily hard: different descriptions / photos of the same place

Entity resolution (ER)
“[The] problem of identifying and linking/grouping different manifestations of the same real world object.”
Ironically, also known as: deduplication, entity clustering, merge/purge, fuzzy match, record linkage, approximate match…

Getoor, 2012.

Motivating examples
● Mining unstructured data (e.g. webpages)
● Governance (census, intelligence)
● Generally, whenever data comes from different organizations
ER Challenges
● Fundamental ambiguity
● Diversity in representations (format, truncation, ambiguity)
● Errors
● Missing data
● Records from different times
● Relationships in addition to equality
Normalization
● Transform data into a format that is more likely to match other similar data
○ Splitting / combining rows
● Canonicalization (e.g. phone numbers, URLs, case of text, expanding truncations)
○ Maximally informative, but standard, format
● Logic is specific to the data
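Canonicalizing phone numbers, one of the examples above, might look like this minimal sketch. The US-centric input formats and the NNN-NNN-NNNN target form are assumptions for illustration:

```python
# Canonicalization sketch: map differently formatted US phone numbers
# to a single maximally informative standard form.
import re

def canonicalize_phone(raw):
    """Strip punctuation, drop a leading country code '1', and reformat
    a 10-digit US number as NNN-NNN-NNNN; return None if it doesn't fit."""
    digits = re.sub(r"\D", "", raw)          # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                  # drop US country code
    if len(digits) != 10:
        return None                          # not a recognizable US number
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

for raw in ["(401) 863-1000", "+1 401.863.1000", "4018631000"]:
    print(canonicalize_phone(raw))  # all three map to 401-863-1000
```

As the slide says, the logic is specific to the data: a URL or address canonicalizer would need entirely different rules.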
Pairwise matching
Pipeline: raw data → normalized data → matching features
Matching features
● Edit distance (e.g. Levenshtein) for typos
● Set/vector similarity (Jaccard index, TF/IDF, dot product)
● Alignment (e.g. Monge-Elkan)
● Phonetic (e.g. Soundex)
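Two of these features are small enough to sketch from scratch: Levenshtein edit distance (for typos) and the Jaccard index over word sets:

```python
# Matching-feature sketches: Levenshtein distance and Jaccard similarity.

def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def jaccard(a, b):
    """Jaccard index of two strings' word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

print(levenshtein("Kraska", "Krasak"))          # 2: a transposition costs two edits
print(jaccard("Tim Kraska", "Timothy Kraska"))  # 1/3: only 'kraska' is shared
```

Note how the two features disagree on the slide's example: edit distance sees "Tim" and "Timothy" as close, while word-set Jaccard treats them as entirely different tokens.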
Record linkage
● Matching record to record rather than datum to datum
○ May also require schema alignment
● Average of component similarities
● Specific rules for each column
● Probabilistic models with ML
○ Training data is not trivial to obtain: most pairs are obviously not matches
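Averaging component similarities with a specific rule per column can be sketched as follows. The two fields, the per-column rules, and the example records are illustrative assumptions:

```python
# Record linkage sketch: score a pair of records as the average of
# per-field similarities, with one rule per column.

def name_sim(a, b):
    """Crude name rule: exact match, else shared-word Jaccard overlap."""
    if a == b:
        return 1.0
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def zip_sim(a, b):
    """ZIP codes either match or they don't."""
    return 1.0 if a == b else 0.0

def record_sim(r1, r2):
    """Average the per-column similarities into one record-level score."""
    return (name_sim(r1["name"], r2["name"]) + zip_sim(r1["zip"], r2["zip"])) / 2

r1 = {"name": "Tim Kraska", "zip": "02912"}
r2 = {"name": "Timothy Kraska", "zip": "02912"}
print(record_sim(r1, r2))  # (1/3 + 1) / 2 ≈ 0.667
```

A probabilistic model would learn the per-column weights from labeled pairs instead of averaging them uniformly, which is where the training-data problem on the slide comes in.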
Collective matching and constraints
● Some data matching operations aren’t independent of the other data in the record
○ e.g. two research papers in the same venue are more likely to be by the same authors
● Expressed as constraints over the matching relationships of columns in a record
○ Transitivity (if A = B and B = C, then A = C)
○ Exclusivity (if A = B, then B != C)
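The transitivity constraint above is exactly what a union-find structure enforces: pairwise match decisions get merged into consistent entity clusters. A minimal sketch:

```python
# Enforcing transitivity (A = B and B = C implies A = C) with union-find:
# pairwise 'same entity' decisions are merged into clusters.

def cluster(items, matches):
    """Group items into entity clusters given pairwise match decisions."""
    parent = {x: x for x in items}

    def find(x):
        """Path-compressing root lookup."""
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:            # union the two clusters
        parent[find(a)] = find(b)

    groups = {}
    for x in items:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

# A = B and B = C, so all three land in one cluster; D stays alone.
print(cluster(["A", "B", "C", "D"], [("A", "B"), ("B", "C")]))
```

Exclusivity constraints are harder: they can make a set of pairwise decisions inconsistent, which is why collective matching is often posed as a joint optimization rather than simple closure.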
ACSDb
● Developed at Google in 2008
● The authors looked at 14.1 billion HTML tables and found 154 million that they considered to contain high-quality relational data
Cafarella et al.
● The authors created the attribute correlation statistics database (ACSDb)
● The ACSDb is “a set of statistics about the schemas in the corpus”
● The ACSDb makes possible:
○ schema auto-complete
○ attribute synonym finding
○ join-graph traversal
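Schema auto-complete can be sketched as a lookup over attribute co-occurrence statistics. The tiny corpus below is made up and stands in for the real ACSDb; the actual system ranks by probabilities estimated over millions of extracted schemas:

```python
# ACSDb-style schema auto-complete sketch: suggest the attribute that
# most often co-occurs with the attributes the user has typed so far.
from collections import Counter

# Hypothetical corpus: each entry is one extracted table schema.
SCHEMAS = [
    {"make", "model", "year", "price"},
    {"make", "model", "year"},
    {"make", "model", "color"},
    {"name", "address", "phone"},
]

def suggest(context):
    """Return the attribute co-occurring most often with the full context."""
    counts = Counter()
    for schema in SCHEMAS:
        if context <= schema:              # schema contains every context attribute
            counts.update(schema - context)
    return counts.most_common(1)[0][0] if counts else None

print(suggest({"make", "model"}))  # 'year' appears alongside make+model twice
```

Attribute synonym finding works from the same statistics in reverse: two attributes that co-occur with the same neighbors but rarely with each other are likely synonyms.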
2013 Follow-Up
Schema Extraction for Tabular Data on the Web, a VLDB 2013 paper, introduces the idea of “row classes,” a more flexible method for determining a table’s schema
Adelfio et al.

Conclusion
● For businesses, there are tradeoffs between specialized systems and integration
● Lots of research is being done on combining very large amounts of disparate data

References
● Luna Dong and Divesh Srivastava. Big Data Integration. Tutorial in Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2013.
● Lise Getoor and Ashwin Machanavajjhala. Entity Resolution: Theory, Practice & Open Challenges. PVLDB 5(12): 2018–2019, 2012.
● Michael J. Cafarella, Alon Y. Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang. WebTables: Exploring the Power of Tables on the Web. PVLDB 1(1): 538–549, 2008.
● Marco D. Adelfio and Hanan Samet. Schema Extraction for Tabular Data on the Web. In International Conference on Very Large Data Bases (VLDB), 2013.
● http://research.cs.wisc.edu/dibook/