Entity Resolution in the Web of Data

Vasilis Efthymiou

The Web of Data

What is an entity?

Any real-world object

The same entity in multiple sources

Videos about the St. of Liberty
Wiki pages about the St. of Liberty
Images of the St. of Liberty
General and lexicographical information about the St. of Liberty
Articles about the St. of Liberty
General information about the St. of Liberty

Description 1:
Name            Statue of Liberty
Art Form        Sculpture
Architect       Frédéric Auguste Bartholdi, Gustave Eiffel, Richard Morris Hunt
Opened          Oct 28, 1886
Artist          Frédéric Auguste Bartholdi
Contained by    Statue of Liberty National Monument, Ellis Island and Liberty Island, Liberty Island
Struct. Height  93 m (310 ft)
Media           Copper, Wrought iron
Archit. Style   Neoclassical architecture
Date Completed  Oct 28, 1886
Also known as   Liberty Enlightening the World
Art Subject     Freedom
Dimensions      46 m (150 ft)

Description 2:
about                       Statue of Liberty
dbpprop:architect           Frédéric Auguste Bartholdi
dbpedia-owl:location        dbpedia:United_States, dbpedia:New_York_City, dbpedia:New_York, dbpedia:Liberty_Island
dbpprop:built               1886-10-28
dbpprop:height              151
is dbpprop:basis of         dbpedia:Miss_Liberty
dbpprop:hasPhotoCollection  http://www4.wiwiss.fu-berlin.de/flickrwrappr/photos/Statue_of_Liberty

Description 3:
URI                    Statue of Liberty
preferred label        Statue of Liberty
yago:hasHeight         46.0248
yago:wasCreatedOnDate  1886-##-##
yago:isLocatedIn       New York City, New York, Manhattan, Liberty Island

Red: inconsistency, Blue: schema diversity, Green: additional values, underlined: missing values

The standard data format for the Web of Data: RDF

Data in the form of (subject predicate object) triples:

( #subject #predicate #object ), e.g. ( Statue of Liberty, Architect, Frédéric Auguste Bartholdi )

– Subject: who/what are we talking about?
– Predicate: how is the subject related to the object?
– Object: who/what is related to the subject?
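A minimal sketch of the triple model in Python may help here. It treats each triple as a plain tuple and groups triples by subject to obtain an entity description. The helper name `describe` and the plain-string identifiers are assumptions for illustration; real RDF data would use URIs and an RDF library.

```python
from collections import defaultdict

# Illustrative triples; the values are taken from the Statue of Liberty example.
TRIPLES = [
    ("Statue of Liberty", "Architect", "Frédéric Auguste Bartholdi"),
    ("Statue of Liberty", "Art Form", "Sculpture"),
    ("Statue of Liberty", "Opened", "Oct 28, 1886"),
]


def describe(triples):
    """Group triples by subject: {subject: {predicate: [objects]}}."""
    descriptions = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        descriptions[subject][predicate].append(obj)
    return descriptions


description = describe(TRIPLES)["Statue of Liberty"]
print(description["Architect"])
```

Grouping by subject is what turns a bag of triples into the attribute-value descriptions that the rest of the slides compare.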

[Figure, built up over several slides: an RDF graph centred on dbpedia:Cambridge (with neighbours such as dbpedia:Wellingborough, dbpedia:John_Marshall_(filmmaker), yago:CitiesInTheEastOfEngland and category:University_towns_in_the_UK), to which descriptions of the same city from other sources are added step by step: geonames:2653941, fb:Cambridge, lgdo:node20971094 and an nyt: resource with skos:prefLabel “Cambridge (England)”. In the final step the descriptions are connected by owl:sameAs links.]

Prefixes:
dbpedia: http://dbpedia.org/resource/
geonames: http://sws.geonames.org
fb: http://rdf.freebase.com/ns/
lgdo: http://linkedgeodata.org/triplify/
nyt: http://data.nytimes.com

Ideally we would like to have such links for every entity, but...
● can we always have such links?
● are these links correct?

Entity Resolution (ER)

Entity Resolution (ER) is the problem of matching and merging references to the same real-world objects

Useful because:
– improves data quality and integrity
– fosters re-use of existing data sources

Input: datasets possibly containing duplicates

The goal: produce a “clean” dataset (no duplicates), the result of merging the identified duplicates
– Identifying the duplicates requires a number of comparisons, wrt. similarity metrics

Application examples: Linking Census Records, Public Health, Web search, Counter-terrorism, ...

Need to reduce the number of comparisons

Naive ER requires a quadratic number of comparisons, i.e. O(n^2)
– Every entity must be compared with all others

How to reduce the number of comparisons without losing (many) true matches?

Idea: Blocking
– group similar entities together and
– compare only entities within the same group

Blocking

Goal: Put similar records in the same block and dissimilar records in different blocks
– multiple records may refer to the same entity

ER process compares only pairs within the same block
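The effect on the number of comparisons can be sketched in a few lines of Python. The function names are hypothetical, and the sketch counts each unordered pair once, which is an assumption (counting ordered pairs instead gives n × n for the naive case).

```python
def naive_comparisons(n):
    """Number of distinct unordered pairs among n records."""
    return n * (n - 1) // 2


def blocked_comparisons(block_sizes):
    """Pairs compared when only records inside the same block are compared."""
    return sum(naive_comparisons(size) for size in block_sizes)


without_blocking = naive_comparisons(1000)        # 499500 pairs
with_blocking = blocked_comparisons([5] * 200)    # 200 blocks x 10 pairs = 2000
print(without_blocking, with_blocking)
```

With disjoint blocks of equal size, the cost grows with the square of the block size rather than the square of the dataset size, which is where the savings come from.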

Disjoint Blocking (Partitioning): Each record appears in exactly one block
Non-disjoint Blocking (Overlapping): Each record appears in at least one block
– Pros & Cons?

Why Blocking?

1,000 x 1,000 = 1,000,000

● If each comparison needs 1 sec
– Without blocking: 1K x 1K sec ≈ 11.6 days
– With blocking: 200K sec ≈ 2.3 days
● If we had 200 blocks of 5 records each:
– 5 x 5 x 200 sec ≈ 83 minutes

Blocking in relational datasets

Typical assumptions:
– A-priori known schema
– Every record consists of a uniquely identified set of name-value pairs
– For each attribute, we know some metadata:
• e.g. distinctiveness of values

Traditional blocking

Identifier  Name     ZIP code
r1          Smith    12345
r2          Smyth    54321
r3          Smiths   12345
r4          Do       55555
r5          Jackson  77551
r6          Doe      55555
r7          Oliver   12345
r8          Jackson  77551

Blocks:
12345: r1, r3, r7
54321: r2
55555: r4, r6
77551: r5, r8

Entries with the same ZIP code (Blocking Key Value - BKV) end up in the same block
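As a small illustration, the ZIP-code blocking of the example records can be sketched with a dictionary acting as the inverted index. The function names `build_blocks` and `candidate_pairs` are hypothetical; the records are the ones from the example table.

```python
from collections import defaultdict
from itertools import combinations

# Records from the example: identifier -> (name, ZIP code).
RECORDS = {
    "r1": ("Smith", "12345"), "r2": ("Smyth", "54321"),
    "r3": ("Smiths", "12345"), "r4": ("Do", "55555"),
    "r5": ("Jackson", "77551"), "r6": ("Doe", "55555"),
    "r7": ("Oliver", "12345"), "r8": ("Jackson", "77551"),
}


def build_blocks(records):
    """Inverted index: blocking key value (ZIP code) -> record identifiers."""
    index = defaultdict(list)
    for record_id, (_name, zip_code) in records.items():
        index[zip_code].append(record_id)
    return dict(index)


def candidate_pairs(blocks):
    """All unordered pairs of records that share a block."""
    pairs = set()
    for record_ids in blocks.values():
        pairs.update(combinations(sorted(record_ids), 2))
    return pairs


blocks = build_blocks(RECORDS)
pairs = candidate_pairs(blocks)
print(blocks["12345"], len(pairs))
```

Note that r1 (Smith) and r2 (Smyth) never meet, because their ZIP codes differ: a single blocking key is cheap but can miss true matches.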

Data structure used: inverted index

Sorted Neighborhood

Blocks of the same size
A small window size might not be large enough to cover all records with the same BKV
Sorting is sensitive to errors/variations in the first few positions of values
– (e.g. ‘Christina’, ‘Kristina’)

Canopy Clustering

Canopies: overlapping clusters
Input: Records R, distance metric d, thresholds T1 > T2

1. Pick a random record r from R
2. Create new canopy Cr using records r' s.t. d(r,r') < T1
3. Delete all records r' from R s.t. d(r,r') < T2
4. Return to Step 1 if R is not empty
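The four steps above can be sketched as follows. This is an illustrative reading, not a reference implementation: it assumes canopies are drawn from the records still in R, that d(r, r) = 0 so the picked record is always deleted, and it uses a toy numeric distance. The `seed` parameter is an addition to make runs repeatable.

```python
import random


def canopy_clustering(records, distance, t1, t2, seed=0):
    """Return a list of (possibly overlapping) canopies, following steps 1-4."""
    if not (t1 > t2 > 0):
        raise ValueError("expected T1 > T2 > 0")
    rng = random.Random(seed)
    remaining = list(records)
    canopies = []
    while remaining:                                    # step 4
        center = rng.choice(remaining)                  # step 1
        canopies.append(
            [r for r in remaining if distance(center, r) < t1])  # step 2
        remaining = [
            r for r in remaining if distance(center, r) >= t2]   # step 3
    return canopies


values = [1, 2, 3, 10, 11, 20]
canopies = canopy_clustering(values, lambda a, b: abs(a - b), t1=3, t2=2)
print(canopies)
```

Records whose distance to the picked record lies between T2 and T1 join the canopy but stay in R, which is what makes canopies overlap.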

q-grams

q-grams: substrings of length q
Create variations for each BKV using q-grams, and insert record identifiers into more than one block
Can lead to more true matches than traditional blocking and sorted neighborhood
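A simplified sketch of the idea, assuming each q-gram of the BKV is used directly as a block key (variants in the literature build keys from combinations of q-grams instead). The function names are hypothetical and the names are taken from the earlier example table.

```python
from collections import defaultdict


def qgrams(value, q=2):
    """Set of substrings of length q of a lower-cased value."""
    value = value.lower()
    if len(value) < q:
        return {value}
    return {value[i:i + q] for i in range(len(value) - q + 1)}


def qgram_blocks(records, q=2):
    """Map every q-gram of a record's BKV to the records containing it."""
    blocks = defaultdict(set)
    for record_id, bkv in records.items():
        for gram in qgrams(bkv, q):
            blocks[gram].add(record_id)
    return blocks


names = {"r1": "Smith", "r2": "Smyth", "r3": "Smiths"}
blocks = qgram_blocks(names)
print(sorted(blocks["sm"]), sorted(blocks["mi"]))
```

Here "Smith" and "Smyth" share the blocks for "sm" and "th" even though the full values differ, which is the robustness to small variations that the slide refers to.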

but our datasets are not so tidy...

Our world: The Web of Data

Highly Heterogeneous Information Spaces:
– Rich diversity of schemata
– Data imperfection (incomplete, missing, inconsistent, erroneous data)
– Large scale

Some specific characteristics

● Size of data
– 32 billion RDF triples (until 2011)
● Data heterogeneity
– structural
• {Address} vs {Street, Number, ZIPcode}
– lexical [Elmagarmid et al. 2007]
• StreetAddress: “44 W. 4th St.” vs StreetAddress: “44 West Fourth Street”
– logical [Ferrara et al. 2008]
• Politician:Obama vs President:Obama

Blocking in the Web of Data

Can we use one of the relational blocking approaches?
– lack of schema

Blocking approaches for heterogeneous data
– Use only the values [Instance, Token, Attribute Clustering Blocking (ACB)]

Can we apply ideas from other areas to exploit the schema?
– Use only the schema [Characteristic Sets, Web Tables]

Blocking based on Values

Only consider the values, to correlate two records
● Few true matches are missed
– high recall
● Works well with diverse schemata (schema-agnostic)
● High space & time requirements
– #attributes << #values
● Results in many redundant comparisons
– the same value could refer to many things
• e.g. Jaguar, Boston, John

Token Blocking [Papadakis et al. 2011]

Every distinct token ti creates a separate block that contains all records having ti in their values
Blocks are built independently of the attribute names associated with a token
– attribute-agnostic functionality
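A minimal sketch of token blocking, under a few assumptions: tokens are lower-cased runs of word characters, and blocks holding a single record are dropped because they produce no comparisons. The records and the names `tokenize` and `token_blocking` are illustrative.

```python
import re
from collections import defaultdict


def tokenize(value):
    """Lower-cased word tokens of an attribute value."""
    return re.findall(r"\w+", str(value).lower())


def token_blocking(records):
    """One block per distinct token, ignoring attribute names entirely."""
    blocks = defaultdict(set)
    for record_id, attributes in records.items():
        for value in attributes.values():
            for token in tokenize(value):
                blocks[token].add(record_id)
    # Blocks with a single record yield no comparisons, so drop them.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


records = {
    "e1": {"Name": "Statue of Liberty",
           "Architect": "Frédéric Auguste Bartholdi"},
    "e2": {"about": "Statue of Liberty",
           "dbpprop:architect": "Frédéric Auguste Bartholdi"},
    "e3": {"name": "Liberty Island", "yago:isLocatedIn": "New York"},
}
blocks = token_blocking(records)
print(sorted(blocks["liberty"]), sorted(blocks["bartholdi"]))
```

The two Statue of Liberty descriptions meet in several blocks despite sharing no attribute names, while the unrelated e3 also lands in the "liberty" block, which illustrates both the recall and the redundancy of the approach.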

Big space and time requirements
– for this specific example of 4 records, 16 blocks are generated
Many dissimilar entities share common values
– businesses, people, movies, historical events, touristic landmarks of LA will be put in the same block

Attribute Clustering Blocking [Papadakis et al. 2012]

A blocking scheme that exploits patterns in the values

1. Partition attribute names into clusters
– According to the similarity of their values
2. Within each cluster k
– apply Token Blocking
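Step 1 can be sketched roughly as follows. This is a simplified reading rather than the published algorithm: it assumes Jaccard similarity over the value sets of each attribute, links every attribute name to its most similar name in the other collection, and takes connected components of those links as clusters. The attribute name "Region" and all function names are hypothetical.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(values, others):
    """Name of the most similar attribute in `others`, or None if all are 0."""
    best_name, best_sim = None, 0.0
    for name, other_values in others.items():
        similarity = jaccard(values, other_values)
        if similarity > best_sim:
            best_name, best_sim = name, similarity
    return best_name


def cluster_attributes(coll1, coll2):
    """Connected components of the 'most similar attribute' links."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            node = parent[node]
        return node

    for tag, source, other_tag, target in (("c1", coll1, "c2", coll2),
                                           ("c2", coll2, "c1", coll1)):
        for name, values in source.items():
            match = best_match(values, target)
            if match is not None:
                parent[find((tag, name))] = find((other_tag, match))
    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())


coll1 = {"N1.1": {"george", "nick", "kostas"},
         "State": {"georgia", "california", "texas", "florida"}}
coll2 = {"N2.2": {"kostas", "niki", "georgia"},
         "Region": {"texas", "florida", "ohio"}}
clusters = cluster_attributes(coll1, coll2)
print(clusters)
```

Step 2 would then run token blocking separately inside each cluster, so that a token is only shared between attributes whose values look alike.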

Intuition

[Figure: the attribute names N1.1 to N1.5 of Collection 1 and N2.1 to N2.5 of Collection 2, where a link from N1 to N2 means that the most similar name to N1 is N2; linked names are grouped into clusters.]

N1.1 and N2.2 are in the same cluster, because their values are similar. E.g.:
N1.1: George, Nick, Kostas
N2.2: Kostas, Niki, Georgia

The good case:
State: Georgia, California, Texas, Florida
Name: Georgia, Niki, Kostas
would be placed in different clusters

Weakness

The bad case:
Built In: 2003, 2010, 1821, 1789
WarOf: 1789, 1821, 1939, 2003
would be placed in the same cluster, although the attributes are unrelated

year: 2000, continent: Europe, gender: male, ...

The other alternative: Blocking based on Schema

Blocking based on Schema

Only consider the schema, to correlate two records
● Low space & time requirements
– #attributes << #values
● Many true matches are missed in diverse datasets
– low recall

No blocking approaches based on schema
– Rich diversity of schemata

Can we adopt ideas from other areas?

Characteristic Sets [Neumann et al. 2011]

Data tends to have a latent soft schema
– Books tend to have authors and titles, etc.
So, an entity can be characterized by its set of predicates, called characteristic set

Idea: Each characteristic set corresponds to a block, but...
– data schemata are highly diverse
– similar entities with small schema variations would be placed in different blocks
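A small sketch of the "one block per characteristic set" idea, using hypothetical book triples. It also shows the weakness noted above: one extra predicate is enough to push an otherwise similar entity into a different block.

```python
from collections import defaultdict


def characteristic_set_blocks(triples):
    """Group subjects by their characteristic set (their set of predicates)."""
    predicates = defaultdict(set)
    for subject, predicate, _obj in triples:
        predicates[subject].add(predicate)
    blocks = defaultdict(set)
    for subject, predicate_set in predicates.items():
        blocks[frozenset(predicate_set)].add(subject)
    return blocks


# Hypothetical triples for three books.
triples = [
    ("book1", "author", "A"), ("book1", "title", "T1"),
    ("book2", "author", "B"), ("book2", "title", "T2"),
    ("book3", "author", "A"), ("book3", "title", "T1"),
    ("book3", "year", "2011"),
]
blocks = characteristic_set_blocks(triples)
print(blocks[frozenset({"author", "title"})])
```

In this toy data book1 and book3 share author and title but are never compared, because book3 also carries a year.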

Web Tables: The Attribute Correlation Statistics Database (ACSDb) [Cafarella et al. 2008]

ACSDb keeps statistics on co-occurrences of schema elements
– ACSDb is a set of pairs of the form (S, c)
• S is the schema of a record, c is the number of records that have the schema S

ACSDb allows finding the probability of seeing various attributes in a schema
– p(address) = sum of counts c for records whose schema contains “address” / total sum of all counts
– p(address|name) = sum of counts for all the schemata in which “address” appears along with “name” / sum of counts for all the schemata that contain “name”
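The two probabilities can be sketched over a toy ACSDb. The schemata and counts below are made up for illustration, and the function names `p` and `p_given` are hypothetical.

```python
# Toy ACSDb: schema (set of attribute names) -> number of records with it.
ACSDB = {
    frozenset({"name", "address"}): 3,
    frozenset({"name", "phone"}): 2,
    frozenset({"address", "city"}): 1,
}


def p(attribute, acsdb):
    """Probability of seeing `attribute` in a schema."""
    total = sum(acsdb.values())
    hits = sum(c for schema, c in acsdb.items() if attribute in schema)
    return hits / total


def p_given(attribute, given, acsdb):
    """Probability of seeing `attribute` in schemata that contain `given`."""
    base = sum(c for schema, c in acsdb.items() if given in schema)
    if base == 0:
        return 0.0
    both = sum(c for schema, c in acsdb.items()
               if attribute in schema and given in schema)
    return both / base


print(p("address", ACSDB), p_given("address", "name", ACSDB))
```

In this toy data "address" appears in 4 of 6 records overall, and in 3 of the 5 records whose schema contains "name".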

Synonym finding, based on the observations:
– synonymous attributes do not appear in the same schema
– two synonyms will appear in similar contexts

Degree of synonymity between attribute names a, b:

C: context attributes
A: all attributes that appear in ACSDb with C

Schema clustering, based on common attributes:

X, Y: two schemas, D: a shared attribute between X and Y

We could use these clusters as blocks, but...
– same schemata could describe different things
– e.g. (name, year, country) could be used to describe movies, buildings, people, wars, awards, ...

Performance of Blocking

Good balance between:
– Efficiency
• wrt. the number of pairwise comparisons within a block
– Effectiveness
• wrt. the pairs of matching records compared in at least one block

The more comparisons executed in blocks, the higher the effectiveness, but the lower the efficiency (and vice versa)

Evaluation metrics for blocking

● Pairs Completeness (PC) = Recall: True matches put in the same block / real true matches
● Pairs Quality (PQ) = Precision: True matches put in the same block / total comparison pairs generated
● F-measure = 2 PC PQ / (PC + PQ)
● Reduction Ratio (RR) = 1 - (generated comparisons / initial comparisons)
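The four metrics can be computed directly from a set of candidate pairs and a set of known matches. The sketch below assumes a single dataset, so "initial comparisons" is taken to be n(n-1)/2; the ground-truth matches in the usage example are invented for illustration.

```python
def blocking_metrics(candidate_pairs, true_matches, n_records):
    """Return PC, PQ, F-measure and RR for a set of candidate pairs."""
    candidates = {frozenset(pair) for pair in candidate_pairs}
    truth = {frozenset(pair) for pair in true_matches}
    found = candidates & truth
    pc = len(found) / len(truth) if truth else 0.0
    pq = len(found) / len(candidates) if candidates else 0.0
    f_measure = 2 * pc * pq / (pc + pq) if pc + pq else 0.0
    initial = n_records * (n_records - 1) // 2  # assumed brute-force cost
    rr = 1 - len(candidates) / initial if initial else 0.0
    return {"PC": pc, "PQ": pq, "F": f_measure, "RR": rr}


# Candidate pairs produced by blocking 8 records; matches are hypothetical.
candidates = [("r1", "r3"), ("r1", "r7"), ("r3", "r7"),
              ("r4", "r6"), ("r5", "r8")]
matches = [("r1", "r2"), ("r1", "r3"), ("r4", "r6"), ("r5", "r8")]
metrics = blocking_metrics(candidates, matches, n_records=8)
print(metrics)
```

Here three of the four true matches survive blocking (PC = 0.75), three of the five generated pairs are matches (PQ = 0.6), and 5 of the 28 possible comparisons remain (RR ≈ 0.82).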

A comparison of blocking approaches*

Approach         Pros                     Cons
Attributes only  low computational cost   low recall
                 low space requirements   work best with homogeneous data
Values only      handle schema diversity  low precision
                 high recall              high computational cost
                                          high space requirements
                                          low RR

*assuming the same matching function is used

Our focus: Exploit available information both at schema and instance level!

To achieve better:
– efficiency
– effectiveness
(check out the following example)

Description 1:
Name            Statue of Liberty
Art Form        Sculpture
Architect       Frédéric Auguste Bartholdi, Gustave Eiffel, Richard Morris Hunt
Opened          Oct 28, 1886
Artist          Frédéric Auguste Bartholdi
Media           Copper, Wrought iron
Archit. Style   Neoclassical architecture
Date Completed  Oct 28, 1886
Also known as   Liberty Enlightening the World
Art Subject     Freedom
Dimensions      46 m (150 ft)

Description 2:
about                 Statue of Liberty
dbpprop:architect     Frédéric Auguste Bartholdi
dbpedia-owl:location  dbpedia:United_States, dbpedia:New_York_City, dbpedia:New_York, dbpedia:Liberty_Island
dbpprop:built         1886-10-28
dbpprop:height        151

Description 3:
name           Frédéric Auguste Bartholdi
Date of birth  Aug 2, 1834
Date of death  Oct 4, 1904 (age 70 years)
Artworks       Statue of Liberty, Fontaine Bartholdi, Bartholdi Fountain, The 667 Madison Statue of Liberty
Gender         Male

Red: inconsistency, Blue: schema diversity, Green: additional values

Blocking Based on Attributes and Values

Our proposal: exploit the information from both attributes and values

Intuitively, put together records with similar values for similar attribute names

● Handles schema diversity
● Matches similar attribute-value pairs

We need more than similarity between strings

For computing the similarity between two records r1, r2, consider both their schema and values:

– sim(r1,r2) = simschema(r1,r2) * simvalues(r1,r2)

where

– simvalues could be a typical string similarity metric (string_sim)

String similarity metrics examples

● Boolean
● Edit distance
– Levenshtein, Smith-Waterman, Affine
● Set similarity
– Jaccard, Dice
● Vector based
– Cosine similarity, TF-IDF
● Phonetic similarity
– Soundex
● Alignment based or two-tiered
– Jaro-Winkler, Soft-TFIDF, Monge-Elkan
● Numeric distance
● Domain-specific

Or, consider attribute-value pairs for computing simvalues(r1,r2)

simvalues(r1,r2)

Given:
r1 = {A11 = a11, … , A1n = a1n}
r2 = {A21 = a21, … , A2n = a2n}
and the synonymous attributes A = {(Ai1,Aj1), … , (Aik, Ajk)}

\[
sim_{values}(r_1, r_2) = \frac{\sum_{\forall (A_{ix}, A_{jx}) \in A} syn(A_{ix}, A_{jx}) \cdot string\_sim(a_{ix}, a_{jx})}{k}
\]

where syn counts the degree of synonymity
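A rough sketch of this weighted average, with two assumptions that are not part of the slides: `difflib.SequenceMatcher` stands in for string_sim (any of the metrics listed earlier could be substituted), and a synonymous pair contributes 0 when either record lacks the attribute. The synonymity degrees in the example are invented.

```python
from difflib import SequenceMatcher


def string_sim(a, b):
    """Stand-in string similarity in [0, 1] (an assumption, not the slides')."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sim_values(r1, r2, synonyms):
    """Average of syn * string_sim over the k synonymous attribute pairs.

    `synonyms` maps (attribute of r1, attribute of r2) -> degree of synonymity.
    """
    if not synonyms:
        return 0.0
    total = 0.0
    for (attr1, attr2), syn in synonyms.items():
        if attr1 in r1 and attr2 in r2:  # missing values contribute 0
            total += syn * string_sim(r1[attr1], r2[attr2])
    return total / len(synonyms)


r1 = {"Architect": "Frédéric Auguste Bartholdi", "Opened": "Oct 28, 1886"}
r2 = {"dbpprop:architect": "Frédéric Auguste Bartholdi",
      "dbpprop:built": "1886-10-28"}
synonyms = {("Architect", "dbpprop:architect"): 1.0,
            ("Opened", "dbpprop:built"): 0.5}
score = sim_values(r1, r2, synonyms)
print(round(score, 3))
```

The identical architect values contribute fully, while the two date formats contribute only part of their (lower) synonymity weight.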

We need a similarity metric that considers both the schema and the values of two records r1, r2:

– sim(r1,r2) = simschema(r1,r2) * simvalues(r1,r2)

where

– simschema measures how similar two schemata are
• e.g. based on how many attributes they share
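One simple way to realise "how many attributes they share" is the Jaccard coefficient over attribute names, sketched below. This is only one possible choice and an assumption here; with highly diverse schemata, exact name overlap would need to be replaced by a synonym-aware comparison. The value similarity is passed in as a function so the sketch stays self-contained.

```python
def sim_schema(r1, r2):
    """Jaccard overlap of the attribute names of two records."""
    names1, names2 = set(r1), set(r2)
    union = names1 | names2
    return len(names1 & names2) / len(union) if union else 0.0


def sim(r1, r2, sim_values):
    """sim(r1, r2) = sim_schema(r1, r2) * sim_values(r1, r2)."""
    return sim_schema(r1, r2) * sim_values(r1, r2)


r1 = {"name": "Statue of Liberty", "architect": "Bartholdi", "height": "93 m"}
r2 = {"name": "Statue of Liberty", "architect": "Bartholdi",
      "location": "Liberty Island"}
schema_score = sim_schema(r1, r2)                 # 2 shared of 4 distinct names
combined = sim(r1, r2, lambda a, b: 0.8)          # 0.8 is a placeholder value
print(schema_score, combined)
```

Because the two factors are multiplied, a pair needs both a compatible schema and similar values to score highly.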

Adding sim(r1,r2) in the Big Picture

Given a set of block collections B1, …, Bn constructed for a set of records R, the “selected” block collection is the collection B* for which

B* = argmax_i (goodness(Bi))

Bi is a set of blocks bi1,...,bip containing the records of R
goodness(Bi) = aggr_f(overall_sim(bi1), …, overall_sim(bip))
overall_sim(bij) = F(sim(rx,ry)), for all pairs (rx,ry) in bij
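The selection of B* can be sketched as below. The slides leave F and aggr_f open, so taking the arithmetic mean for both is purely an assumption, as are the toy numeric records and the similarity function; blocks with a single record are scored 0.

```python
from itertools import combinations
from statistics import mean


def overall_sim(block, sim):
    """F over all pairs in a block; here F is assumed to be the mean."""
    pairs = list(combinations(block, 2))
    return mean(sim(x, y) for x, y in pairs) if pairs else 0.0


def goodness(collection, sim):
    """aggr_f over the blocks of a collection; here assumed to be the mean."""
    if not collection:
        return 0.0
    return mean(overall_sim(block, sim) for block in collection)


def select_best(collections, sim):
    """B* = the block collection with maximum goodness."""
    return max(collections, key=lambda collection: goodness(collection, sim))


def toy_sim(x, y):
    return 1 / (1 + abs(x - y))


b1 = [[1, 2], [10, 11]]   # close records share a block
b2 = [[1, 10], [2, 11]]   # distant records share a block
best = select_best([b1, b2], toy_sim)
print(best, goodness(b1, toy_sim), goodness(b2, toy_sim))
```

Here only two candidate collections are scored; enumerating every possible collection for a realistic dataset is what makes the exact optimisation impractical.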

The best solution is too expensive!

Intuitively, to find the best B*, we have to construct all possible block collections and keep the one with the maximum goodness

Use a heuristic method to solve the problem

Our blocking sketch solution: Two-phase blocking

Motivation: schema similarity is a lot cheaper to compute
– #attributes << #values
– #schemata << #attributes

Two-phase blocking:
1st phase: use the schema
• Targeting higher efficiency
2nd phase: use the values (or attribute-value pairs)
• Targeting higher effectiveness

A comparison of blocking approaches*

Approach                     Pros                     Cons
Attributes only              low computational cost   low recall
                             low space requirements   work best with homogeneous data
Values only                  handle schema diversity  low precision
                             high recall              high computational cost
                                                      high space requirements
                                                      low RR
Attributes & Values: Goals   improve RR               preserve recall
                             improve precision

*assuming the same matching function is used

Conclusions

● ER is the problem of matching and merging records referring to the same real-world entity
● ER is an inherently quadratic problem
● Blocking can significantly reduce the matching-pairs search space
● Data heterogeneity is the main reason why ER and blocking techniques for relational data do not apply to the Web of Data

Challenges

● Larger and more datasets
– Need efficient parallel techniques
● Heterogeneity
– Diverse data types, unclean and incomplete data
● Lack of links
– Need to infer more relationships in addition to equality
● Multi-relational
– Deal with the structure of entities (address vs street, no)
● Multi-domain
● Multiple applications

References

● Christen, Peter. "Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection" (2012).
● L. Getoor, A. Machanavajjhala. Entity Resolution Tutorial. In VLDB 2012. http://www.cs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf
● Publishing Relational Data on the Semantic Web (tutorial at ESWC2011). http://db.disi.unitn.eu/pages/Rel2RDFTutorial2011/S0.pdf
● Papadakis, George, Ekaterini Ioannou, Themis Palpanas, and W. Nejdl. "A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces." (2012): 1-1.
● Papadakis, George, Ekaterini Ioannou, Claudia Niederée, Themis Palpanas, and Wolfgang Nejdl. "Eliminating the redundancy in blocking-based entity resolution methods." In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries, pp. 85-94. ACM, 2011.
● Neumann, Thomas, and Guido Moerkotte. "Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins." In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pp. 984-994. IEEE, 2011.
● Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, Jennifer Widom. Swoosh: A Generic Approach to Entity Resolution. The VLDB Journal, vol. 18, no. 1, pp. 255-276, Jan. 2009.
● David Menestrina, Steven Euijong Whang, Hector Garcia-Molina. Evaluating Entity Resolution Results. In Proc. 36th Int'l Conf. on Very Large Data Bases (PVLDB), pp. 208-219, Singapore, Sept. 2010.
● Palpanas, Papadakis. Entity Resolution for BIG Data: Blocking-based Entity Resolution in Highly Heterogeneous Information Spaces. https://team.inria.fr/oak/files/2012/.../20121219-Themis-Palpanas.pdf
● Rajaraman, Anand, and Jeffrey David Ullman. Mining of massive datasets. Cambridge University Press, 2011.
● Jure Leskovec, Stanford CS246: Mining Massive Datasets. http://www.stanford.edu/class/cs246/
● Instance-Based Matching of Large Ontologies Using Locality-Sensitive Hashing
● Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate Record Detection: A Survey, Knowledge and Data Engineering, IEEE Transactions on, Pages 1-16, Volume 19, Number 1, January 2007.
● A. Ferrara, D. Lorusso, S. Montanelli, and G. Varese. Towards a Benchmark for Instance Matching. In The 7th International Semantic Web Conference, 2008.
● Cafarella, Michael J., Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang. "Webtables: exploring the power of tables on the web." Proceedings of the VLDB Endowment 1, no. 1 (2008): 538-549.