Entity Resolution in the Web of Data
Vasilis Efthymiou
The Web of Data
What is an entity?
Any real-world object
The same entity in multiple sources
– Videos about the Statue of Liberty
– Wiki pages about the Statue of Liberty
– Images of the Statue of Liberty
– General and lexicographical information about the Statue of Liberty
– Articles about the Statue of Liberty
– General information about the Statue of Liberty

Description 1 (plain attribute names):
  Name = Statue of Liberty
  Art Form = Sculpture
  Architect = Frédéric Auguste Bartholdi, Gustave Eiffel, Richard Morris Hunt
  Opened = Oct 28, 1886
  Artist = Frédéric Auguste Bartholdi
  Contained by = Statue of Liberty National Monument, Ellis Island and Liberty Island, Liberty Island
  Struct. Height = 93 m (310 ft)
  Media = Copper, Wrought iron
  Archit. Style = Neoclassical architecture
  Date Completed = Oct 28, 1886
  Also known as = Liberty Enlightening the World
  Art Subject = Freedom
  Dimensions = 46 m (150 ft)
Description 2 (dbpprop: and dbpedia-owl: properties):
  about = Statue of Liberty
  dbpprop:architect = Frédéric Auguste Bartholdi
  dbpedia-owl:location = dbpedia:United_States, dbpedia:New_York_City, dbpedia:New_York, dbpedia:Liberty_Island
  dbpprop:built = 1886-10-28
  dbpprop:height = 151
  is dbpprop:basis of = dbpedia:Miss_Liberty
  dbpprop:hasPhotoCollection = http://www4.wiwiss.fu-berlin.de/flickrwrappr/photos/Statue_of_Liberty
Description 3 (yago: properties):
  URI = Statue of Liberty
  preferred label = Statue of Liberty
  yago:hasHeight = 46.0248
  yago:wasCreatedOnDate = 1886-##-##
  yago:isLocatedIn = New York City, New York, Manhattan, Liberty Island
Red: inconsistency, Blue: schema diversity, Green: additional values, underlined: missing values

The standard data format for the Web of Data: RDF
Data in the form of (subject, predicate, object) triples:
e.g. (Statue of Liberty, Architect, Frédéric Auguste Bartholdi)
– Subject: who/what are we talking about?
– Predicate: how is the subject related to the object?
– Object: who/what is related to the subject?
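As a minimal sketch of the triple model, the statements about the Statue of Liberty can be held as plain Python tuples. The predicates and values come from the example description in these slides; the subject identifier `dbpedia:Statue_of_Liberty` and the helper `objects` are assumptions made for illustration.

```python
# Each RDF statement is one (subject, predicate, object) tuple.
# The subject identifier below is an assumed name, used only for illustration.
S = "dbpedia:Statue_of_Liberty"

triples = [
    (S, "dbpprop:architect", "Frédéric Auguste Bartholdi"),
    (S, "dbpprop:built", "1886-10-28"),
    (S, "dbpedia-owl:location", "dbpedia:New_York_City"),
    (S, "dbpedia-owl:location", "dbpedia:Liberty_Island"),
]


def objects(triples, subject, predicate):
    """Return every object related to `subject` through `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# A predicate may occur several times for the same subject.
locations = objects(triples, S, "dbpedia-owl:location")
```

Unlike a relational row, a subject has no fixed set of columns: any predicate may be absent or repeated.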
[Figure, built up over several slides: an RDF graph describing the city of Cambridge in five datasets.
– DBpedia: dbpedia:Cambridge, with rdf:type yago:CitiesInTheEastOfEngland, dcterms:subject category:University_towns_in_the_UK, dbprop:east dbpedia:Wellingborough, and dbprop:birthPlace from dbpedia:John_Marshall_(filmmaker)
– GeoNames: geonames:2653941, with geonames:countryCode “GB” and geonames:parentFeature geonames:7290660
– Freebase: fb:Cambridge, with fb:tourist_attractions fb:ely_cathedral, fb:nearby_airports fb:cambridge_airports and rdf:type fb:travel_destination
– LinkedGeoData: lgdo:node20971094, with rdf:type lgdo:City, geo:long 0.124862e0 and geo:lat 52.2033051e0
– New York Times: nyt:5242016742 5542471781, with skos:prefLabel “Cambridge (England)” and nyt:associated_article_count 4
In the last step, owl:sameAs links connect the five descriptions.]

Prefixes:
dbpedia: http://dbpedia.org/resource/
geonames: http://sws.geonames.org
fb: http://rdf.freebase.com/ns/
lgdo: http://linkedgeodata.org/triplify/
nyt: http://data.nytimes.com

Ideally we would like to have such links for every entity, but...
● can we always have such links?
● are these links correct?
Entity Resolution (ER)
Entity Resolution (ER) is the problem of matching and merging references to the same real-world object
Useful because:
– it improves data quality and integrity
– it fosters re-use of existing data sources
Input: datasets possibly containing duplicates
Goal: produce a “clean” dataset (no duplicates), obtained by merging the identified duplicates
– Identifying the duplicates requires a number of comparisons, based on similarity metrics
Application examples: linking census records, public health, Web search, counter-terrorism, ...
Need to reduce the number of comparisons
Naive ER requires a quadratic number of comparisons, i.e. O(n^2)
– Every entity must be compared with all others
How to reduce the number of comparisons without losing (many) true matches?
Idea: Blocking
– group similar entities together and
– compare only entities within the same group
Blocking
Goal: put similar records in the same block and dissimilar records in different blocks
– multiple records may refer to the same entity
The ER process compares only pairs within the same block
Disjoint Blocking (Partitioning): each record appears in exactly one block
Non-disjoint Blocking (Overlapping): each record appears in at least one block
– Pros & Cons?
Why Blocking?
1,000 x 1,000 = 1,000,000 comparisons
If each comparison needs 1 sec
– Without blocking: 1K x 1K sec = 1,000,000 sec ≈ 11.6 days
– With blocking: 200K sec ≈ 2.3 days
If we had 200 blocks of 5 records each:
– 5 x 5 x 200 sec = 5,000 sec ≈ 83 minutes
Blocking in relational datasets
Typical assumptions:
– A-priori known schema
– Every record consists of a uniquely identified set of name-value pairs
– For each attribute, we know some metadata
  • e.g. distinctiveness of values
Traditional blocking
Identifier   Name      ZIP code
r1           Smith     12345
r2           Smyth     54321
r3           Smiths    12345
r4           Do        55555
r5           Jackson   77551
r6           Doe       55555
r7           Oliver    12345
r8           Jackson   77551

Resulting blocks:
12345: r1, r3, r7
54321: r2
55555: r4, r6
77551: r5, r8

Entries with the same ZIP code (Blocking Key Value, BKV) end up in the same block
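The ZIP-code example can be sketched in a few lines, with a dictionary playing the role of the index from each BKV to its record identifiers. The records are the ones in the table; the function names `key_blocking` and `candidate_pairs` are assumptions of this sketch.

```python
from collections import defaultdict
from itertools import combinations

# Records from the table: identifier -> (name, ZIP code).
records = {
    "r1": ("Smith", "12345"), "r2": ("Smyth", "54321"),
    "r3": ("Smiths", "12345"), "r4": ("Do", "55555"),
    "r5": ("Jackson", "77551"), "r6": ("Doe", "55555"),
    "r7": ("Oliver", "12345"), "r8": ("Jackson", "77551"),
}


def key_blocking(records):
    """Group record identifiers by their blocking key value (the ZIP code)."""
    blocks = defaultdict(list)
    for rid, (_name, zip_code) in records.items():
        blocks[zip_code].append(rid)
    return dict(blocks)


def candidate_pairs(blocks):
    """Only records that share a block are compared with each other."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


blocks = key_blocking(records)
pairs = candidate_pairs(blocks)
all_pairs = len(list(combinations(records, 2)))
```

On these eight records the blocks yield 5 candidate pairs instead of the 28 pairs of the naive approach. The price is visible too: Smith (r1) and Smyth (r2) have different ZIP codes and are never compared.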
Data structure used: inverted index
Sorted Neighborhood
– Blocks of the same size
– A small window size might not be large enough to cover all records with the same BKV
– Sorting is sensitive to errors/variations in the first few positions of values (e.g. ‘Christina’, ‘Kristina’)
Canopy Clustering
Canopies: overlapping clusters
Input: Records R, distance metric d, thresholds T1 > T2
1. Pick a random record r from R
2. Create a new canopy Cr using the records r' s.t. d(r,r') < T1
3. Delete from R all records r' s.t. d(r,r') < T2
4. Return to Step 1 if R is not empty
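The four steps above can be sketched as follows. This is an illustrative sketch, not a reference implementation: the seeded random generator, the requirement T2 > 0 (so that the picked record is always removed and the loop terminates) and the toy numeric data with absolute difference as distance are all assumptions.

```python
import random


def canopy_clustering(records, distance, t1, t2, seed=0):
    """Build overlapping canopies with a loose threshold t1 and a tight one t2."""
    if not t1 > t2 > 0:
        raise ValueError("expected T1 > T2 > 0")
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    remaining = list(records)
    canopies = []
    while remaining:                                   # Step 4
        center = rng.choice(remaining)                 # Step 1
        canopy = [r for r in remaining if distance(center, r) < t1]       # Step 2
        canopies.append(canopy)
        remaining = [r for r in remaining if distance(center, r) >= t2]   # Step 3
    return canopies


# Toy data: numbers, compared by absolute difference.
records = [1, 2, 3, 10, 11, 20]
canopies = canopy_clustering(records, lambda a, b: abs(a - b), t1=3, t2=2)
```

Records whose distance to the picked record lies between T2 and T1 join the canopy but stay in R, which is what makes canopies overlap.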
q-grams
q-grams: substrings of length q
– Create variations of each BKV using q-grams, and insert record identifiers into more than one block
– Can find more true matches than traditional blocking and sorted neighborhood
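A simplified sketch of the idea: here every q-gram of a BKV is used directly as a block key, so each record lands in several blocks. Full q-gram blocking builds key variations from combinations of q-grams; that step is left out, and the sample names and function names are assumptions.

```python
from collections import defaultdict


def qgrams(value, q=2):
    """All substrings of length q, in order of appearance."""
    value = value.lower()
    return [value[i:i + q] for i in range(len(value) - q + 1)]


def qgram_blocking(bkvs, q=2):
    """Insert each record identifier into one block per q-gram of its BKV."""
    blocks = defaultdict(set)
    for rid, bkv in bkvs.items():
        for gram in qgrams(bkv, q):
            blocks[gram].add(rid)
    return blocks


bkvs = {"r1": "Smith", "r2": "Smyth", "r3": "Christina", "r4": "Kristina"}
blocks = qgram_blocking(bkvs)
```

With q = 2, ‘Christina’ and ‘Kristina’ share blocks such as "ri" and "st" even though they differ in their first characters, the case that defeats sorting-based approaches.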
but our datasets are not so tidy...
Our world: The Web of Data
Highly Heterogeneous Information Spaces:
– Rich diversity of schemata
– Data imperfection (incomplete, missing, inconsistent, erroneous data)
– Large scale
Some specific characteristics
● Size of data
  – 32 billion RDF triples (as of 2011)
● Data heterogeneity
  – structural
    • {Address} vs {Street, Number, ZIPcode}
  – lexical [Elmagarmid et al. 2007]
    • StreetAddress: “44 W. 4th St.” vs StreetAddress: “44 West Fourth Street”
  – logical [Ferrara et al. 2008]
    • Politician:Obama vs President:Obama
Blocking in the Web of Data
Can we use one of the relational blocking approaches? – lack of schema
Blocking approaches for heterogeneous data
– Use only the values [Instance, Token, ACB]
Can we apply ideas from other areas to exploit the schema?
– Use only the schema [Characteristic Sets, Web Tables]
Blocking based on Values
Only consider the values, to correlate two records
● Few true matches are missed – high recall
● Works well with diverse schemata (schema-agnostic)
● High space & time requirements – #attributes << #values
● Results in many redundant comparisons – the same value could refer to many things
  • e.g. Jaguar, Boston, John
Token Blocking [Papadakis et al. 2011]
Every distinct token ti creates a separate block that contains all records having ti in their values
Blocks are built independently of the attribute names associated with a token
– attribute-agnostic functionality
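A minimal sketch of this scheme follows. The tokenizer (lower-cased word characters) and the choice to drop blocks with a single record, which produce no comparison, are assumptions of the sketch; the sample values are taken from the Statue of Liberty descriptions in these slides.

```python
import re
from collections import defaultdict


def tokenize(value):
    """Assumed tokenizer: lower-cased runs of word characters."""
    return re.findall(r"\w+", value.lower())


def token_blocking(records):
    """One block per distinct token; attribute names are ignored."""
    blocks = defaultdict(set)
    for rid, attributes in records.items():
        for value in attributes.values():
            for token in tokenize(value):
                blocks[token].add(rid)
    # Blocks with a single record yield no comparison, so they are dropped here.
    return {t: ids for t, ids in blocks.items() if len(ids) > 1}


records = {
    "e1": {"Name": "Statue of Liberty", "Architect": "Frédéric Auguste Bartholdi"},
    "e2": {"about": "Statue of Liberty", "dbpprop:architect": "Frédéric Auguste Bartholdi"},
    "e3": {"name": "Frédéric Auguste Bartholdi", "Artworks": "Statue of Liberty, Fontaine Bartholdi"},
}
blocks = token_blocking(records)
```

All three records share the block "liberty", although e3 describes the sculptor rather than the statue: the token matches regardless of the attribute it appears in.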
Big space and time requirements
– for this specific example of 4 records, 16 blocks are generated
Many dissimilar entities share common values
– businesses, people, movies, historical events, touristic landmarks of LA will be put in the same block
Attribute Clustering Blocking [Papadakis et al. 2012]
A blocking scheme that exploits patterns in the values
1. Partition attribute names into clusters
   – according to the similarity of their values
2. For each cluster k
   – apply Token Blocking to the values of the attributes in k
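The two steps can be sketched as below. This is a simplified illustration, not the published algorithm: it assumes Jaccard similarity over value tokens, links each attribute name to its most similar name in the other collection, takes connected components as clusters, and assumes record identifiers are unique across collections. All names and sample data are hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def value_tokens(tag, collection):
    """Map (tag, attribute name) to the set of tokens of all its values."""
    tokens = defaultdict(set)
    for record in collection.values():
        for name, value in record.items():
            tokens[(tag, name)].update(value.lower().split())
    return tokens


def cluster_attributes(coll1, coll2):
    """Step 1: link every name to its most similar name in the other collection."""
    t1, t2 = value_tokens("c1", coll1), value_tokens("c2", coll2)
    parent = {name: name for name in list(t1) + list(t2)}

    def find(name):
        while parent[name] != name:
            name = parent[name]
        return name

    for src, dst in ((t1, t2), (t2, t1)):
        for name, toks in src.items():
            best = max(dst, key=lambda other: jaccard(toks, dst[other]))
            if jaccard(toks, dst[best]) > 0:
                parent[find(name)] = find(best)  # merge the two clusters
    return {name: find(name) for name in parent}


def attribute_clustering_blocking(coll1, coll2):
    """Step 2: Token Blocking, with tokens kept apart per attribute cluster."""
    cluster_of = cluster_attributes(coll1, coll2)
    blocks = defaultdict(set)
    for tag, coll in (("c1", coll1), ("c2", coll2)):
        for rid, record in coll.items():
            for name, value in record.items():
                for token in value.lower().split():
                    blocks[(cluster_of[(tag, name)], token)].add(rid)
    return cluster_of, blocks


coll1 = {"a1": {"first": "Georgia", "state": "Texas"},
         "a2": {"first": "Kostas", "state": "Georgia"},
         "a3": {"first": "Niki", "state": "Florida"}}
coll2 = {"b1": {"name": "Georgia", "region": "Texas"},
         "b2": {"name": "Niki", "region": "Florida"},
         "b3": {"name": "Kostas", "region": "California"}}
cluster_of, blocks = attribute_clustering_blocking(coll1, coll2)
```

In this toy data the token "georgia" ends up in two separate blocks, one for the name-like cluster and one for the state-like cluster, whereas plain Token Blocking would put a1, a2 and b1 together.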
Intuition
[Figure: attribute names N1.1 to N1.5 of Collection 1 and N2.1 to N2.5 of Collection 2. Each name is linked to the most similar name of the other collection, and linked names form clusters.]
N1.1 and N2.2 are in the same cluster, because their values are similar. E.g.:
– N1.1: George, Nick, Kostas
– N2.2: Kostas, Niki, Georgia
The good case:
– State: Georgia, California, Texas, Florida
– Name: Georgia, Niki, Kostas
would be placed in different clusters
Weakness
The bad case:
– Built In: 2003, 2010, 1821, 1789
– WarOf: 1789, 1821, 1939, 2003
OR
– year: 2000, continent: Europe, gender: male, ...
The other alternative: Blocking based on Schema
Only consider the schema, to correlate two records
● Low space & time requirements – #attributes << #values
● Many true matches are missed in diverse datasets – low recall
There are no blocking approaches based only on the schema
– because of the rich diversity of schemata
Can we adopt ideas from other areas?
Characteristic Sets [Neumann et al. 2011]
Data tends to have a latent soft schema
– Books tend to have authors and titles, etc.
So, an entity can be characterized by its set of predicates, called its characteristic set
Idea: each characteristic set corresponds to a block, but...
– data schemata are highly diverse
– similar entities with small schema variations would be placed in different blocks
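A short sketch of the idea and of its weakness: subjects are grouped by the exact set of predicates they use. The triples and the function name are made up for illustration.

```python
from collections import defaultdict


def characteristic_set_blocking(triples):
    """Group subjects by their characteristic set (their set of predicates)."""
    predicates = defaultdict(set)
    for subject, predicate, _obj in triples:
        predicates[subject].add(predicate)
    blocks = defaultdict(set)
    for subject, preds in predicates.items():
        blocks[frozenset(preds)].add(subject)
    return blocks


# Hypothetical book descriptions.
triples = [
    ("book1", "author", "A. Writer"), ("book1", "title", "First Book"),
    ("book2", "author", "B. Writer"), ("book2", "title", "Second Book"),
    ("book3", "author", "A. Writer"), ("book3", "title", "First Book"),
    ("book3", "year", "2011"),
]
blocks = characteristic_set_blocking(triples)
```

Here book3 carries one extra predicate and therefore falls into a different block than book1, although the two descriptions agree on author and title.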
Web Tables: The Attribute Correlation Statistics Database (ACSDb) [Cafarella et al. 2008]
ACSDb keeps statistics on co-occurrences of schema elements
– ACSDb is a set of pairs of the form (S, c)
  • S is the schema of a record, c is the number of records that have the schema S
ACSDb allows finding the probability of seeing various attributes in a schema
– p(address) = sum of the counts c of the schemata that contain “address” / sum of all counts
– p(address|name) = sum of the counts of the schemata in which “address” appears along with “name” / sum of the counts of the schemata that contain “name”
Synonym finding, based on the observations:
– synonymous attributes do not appear in the same schema
– two synonyms will appear in similar contexts
Degree of synonymity between attribute names a, b:
– C: context attributes
– A: all attributes that appear in ACSDb with C
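The attribute probabilities that ACSDb provides, and on which such a synonymity score relies, can be sketched as follows. The schemata and counts are invented for illustration, and the function names `p` and `p_cond` are assumptions.

```python
# A toy ACSDb: schema (set of attribute names) -> number of records with it.
# The schemata and counts are hypothetical.
acsdb = {
    frozenset({"name", "address"}): 30,
    frozenset({"name", "phone"}): 50,
    frozenset({"title", "year"}): 20,
}


def p(attr, acsdb):
    """Probability of seeing `attr` in a schema, weighted by the counts."""
    seen = sum(c for schema, c in acsdb.items() if attr in schema)
    return seen / sum(acsdb.values())


def p_cond(a, b, acsdb):
    """p(a | b): how often `a` appears among the schemata that contain `b`."""
    with_b = sum(c for schema, c in acsdb.items() if b in schema)
    both = sum(c for schema, c in acsdb.items() if a in schema and b in schema)
    return both / with_b if with_b else 0.0


p_address = p("address", acsdb)                       # 30 / 100
p_address_given_name = p_cond("address", "name", acsdb)  # 30 / 80
```

Note that "address" and "phone" never co-occur in this toy ACSDb while both appear next to "name"; that alone does not make them synonyms, which is why the score also compares their contexts.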
Schema clustering, based on common attributes:
– X, Y: two schemas, D: a shared attribute between X and Y
We could use these clusters as blocks, but...
– same schemata could describe different things
– e.g. (name, year, country) could be used to describe movies, buildings, people, wars, awards, ...
Performance of Blocking
Good balance between:
– Efficiency
  • wrt. the number of pairwise comparisons within a block
– Effectiveness
  • wrt. the pairs of matching records compared in at least one block
The more comparisons executed in blocks, the higher the effectiveness, but the lower the efficiency (and vice versa)
Evaluation metrics for blocking
Pairs Completeness (PC) = Recall: true matches put in the same block / all true matches
Pairs Quality (PQ) = Precision: true matches put in the same block / total comparison pairs generated
F-measure = 2 · PC · PQ / (PC + PQ)
Reduction Ratio (RR) = 1 − (generated comparisons / initial comparisons)
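The four metrics can be computed directly from the set of candidate pairs. In the sketch below the candidate pairs are those produced by ZIP-code blocking on eight records (28 initial comparisons); the ground-truth matches are hypothetical, chosen only to make the numbers concrete.

```python
def blocking_metrics(candidate_pairs, true_matches, initial_comparisons):
    """Return (PC, PQ, F-measure, RR) for a set of candidate pairs."""
    candidates = {frozenset(p) for p in candidate_pairs}
    truth = {frozenset(p) for p in true_matches}
    found = candidates & truth            # true matches put in the same block
    pc = len(found) / len(truth)
    pq = len(found) / len(candidates)
    f = 2 * pc * pq / (pc + pq) if pc + pq else 0.0
    rr = 1 - len(candidates) / initial_comparisons
    return pc, pq, f, rr


candidate_pairs = [("r1", "r3"), ("r1", "r7"), ("r3", "r7"), ("r4", "r6"), ("r5", "r8")]
# Hypothetical ground truth: (r1, r2) is a match that blocking missed.
true_matches = [("r1", "r2"), ("r1", "r3"), ("r4", "r6"), ("r5", "r8")]

pc, pq, f, rr = blocking_metrics(candidate_pairs, true_matches, initial_comparisons=28)
```

Under these assumptions PC = 3/4, PQ = 3/5, and RR = 1 − 5/28, i.e. most comparisons are saved at the cost of one missed match.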
A comparison of blocking approaches*
Attributes only
  Pros: low computational cost; low space requirements
  Cons: low recall; work best with homogeneous data
Values only
  Pros: handle schema diversity; high recall
  Cons: low precision; high computational cost; high space requirements; low RR
*assuming the same matching function is used
Our focus: Exploit available information both at schema and instance level!
To achieve better:
– efficiency
– effectiveness
(check out the following example)

Description 1 (plain attribute names):
  Name = Statue of Liberty
  Art Form = Sculpture
  Architect = Frédéric Auguste Bartholdi, Gustave Eiffel, Richard Morris Hunt
  Opened = Oct 28, 1886
  Artist = Frédéric Auguste Bartholdi
  Contained by = Statue of Liberty National Monument, Ellis Island and Liberty Island, Liberty Island
  Struct. Height = 93 m (310 ft)
  Media = Copper, Wrought iron
  Archit. Style = Neoclassical architecture
  Date Completed = Oct 28, 1886
  Also known as = Liberty Enlightening the World
  Art Subject = Freedom
  Dimensions = 46 m (150 ft)
Description 2 (dbpprop: and dbpedia-owl: properties):
  about = Statue of Liberty
  dbpprop:architect = Frédéric Auguste Bartholdi
  dbpedia-owl:location = dbpedia:United_States, dbpedia:New_York_City, dbpedia:New_York, dbpedia:Liberty_Island
  dbpprop:built = 1886-10-28
  dbpprop:height = 151
  is dbpprop:basis of = dbpedia:Miss_Liberty
  dbpprop:hasPhotoCollection = http://www4.wiwiss.fu-berlin.de/flickrwrappr/photos/Statue_of_Liberty
Description 3 (plain attribute names):
  name = Frédéric Auguste Bartholdi
  Date of birth = Aug 2, 1834
  Date of death = Oct 4, 1904 (age 70 years)
  Artworks = Statue of Liberty, Fontaine Bartholdi, Bartholdi Fountain, The 667 Madison Statue of Liberty
  Gender = Male
Red: inconsistency, Blue: schema diversity, Green: additional values

Blocking Based on Attributes and Values
Our proposal: exploit the information from both attributes and values
Intuitively, put together records with similar values for similar attribute names
● Handles schema diversity
● Matches similar attribute-value pairs
We need more than similarity between strings
For computing the similarity between two records r1, r2, consider both their schema and values:
– sim(r1,r2) = simschema(r1,r2) * simvalues(r1,r2)
where
– simvalues could be a typical string similarity metric (string_sim)
String similarity metrics: examples
– Boolean
– Edit distance: Levenshtein, Smith-Waterman, Affine gap
– Set similarity: Jaccard, Dice
– Vector based: Cosine similarity, TF-IDF
– Phonetic similarity: Soundex
– Alignment based or two-tiered: Jaro-Winkler, Soft-TFIDF, Monge-Elkan
– Numeric distance
– Domain-specific
Or, consider attribute-value pairs for computing simvalues(r1,r2)
simvalues(r1,r2)
Given:
r1 = {A11 = a11, … , A1n = a1n}
r2 = {A21 = a21, … , A2n = a2n}
and the synonymous attributes A = {(Ai1,Aj1), … , (Aik, Ajk)}

\[
\mathrm{sim}_{\mathrm{values}}(r_1, r_2) = \frac{1}{k} \sum_{(A_{ix}, A_{jx}) \in A} \mathrm{syn}(A_{ix}, A_{jx}) \cdot \mathrm{string\_sim}(a_{ix}, a_{jx})
\]

where syn measures the degree of synonymity
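A small sketch of this weighted average follows. The slides leave string_sim open; here it is assumed to be the ratio of `difflib.SequenceMatcher` from the Python standard library, and the synonymity degrees in the example are invented.

```python
from difflib import SequenceMatcher


def string_sim(a, b):
    """Assumed string similarity: difflib's matching ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sim_values(r1, r2, synonyms):
    """Average of syn * string_sim over the k synonymous attribute pairs.

    `synonyms` maps (attribute of r1, attribute of r2) to a degree of
    synonymity; both attributes are assumed to be present in the records.
    """
    if not synonyms:
        return 0.0
    total = sum(syn * string_sim(r1[a1], r2[a2])
                for (a1, a2), syn in synonyms.items())
    return total / len(synonyms)


r1 = {"Architect": "Frédéric Auguste Bartholdi", "Opened": "Oct 28, 1886"}
r2 = {"dbpprop:architect": "Frédéric Auguste Bartholdi", "dbpprop:built": "1886-10-28"}

# Hypothetical synonymity degrees.
synonyms = {("Architect", "dbpprop:architect"): 1.0, ("Opened", "dbpprop:built"): 0.8}
score = sim_values(r1, r2, synonyms)
```

The date pair shows the limits of a plain string metric: both values denote the same day, yet their character overlap is small, so a domain-specific comparison would serve better there.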
We need more than similarity between strings
We need a similarity metric that considers both the schema and the values of two records r1, r2:
– sim(r1,r2) = simschema(r1,r2) * simvalues(r1,r2)
where
– simschema measures how similar two schemata are • e.g. based on how many attributes they share
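One possible reading of simschema, sketched below, is the Jaccard overlap of the two attribute-name sets; this choice and the placeholder value similarity `equal_values` are assumptions, not the definition used in the slides.

```python
def sim_schema(r1, r2):
    """Assumed schema similarity: Jaccard overlap of the attribute names."""
    a1, a2 = set(r1), set(r2)
    return len(a1 & a2) / len(a1 | a2) if a1 | a2 else 0.0


def equal_values(r1, r2):
    """Placeholder value similarity: share of common attributes with equal values."""
    shared = set(r1) & set(r2)
    if not shared:
        return 0.0
    return sum(r1[a] == r2[a] for a in shared) / len(shared)


def sim(r1, r2, sim_values=equal_values):
    """sim(r1, r2) = sim_schema(r1, r2) * sim_values(r1, r2)."""
    return sim_schema(r1, r2) * sim_values(r1, r2)


r1 = {"name": "Statue of Liberty", "architect": "Bartholdi", "height": "93 m"}
r2 = {"name": "Statue of Liberty", "architect": "Bartholdi", "location": "Liberty Island"}
score = sim(r1, r2)
```

Because the two factors are multiplied, records that agree on every shared value are still penalized when their schemata overlap only partly (here 2 of 4 attribute names).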
Adding sim(r1,r2) in the Big Picture
Given a set of block collections B1, …, Bn constructed for a set of records R, the “selected” block collection is the collection B* for which
B* = argmax_{Bi} goodness(Bi)
where
– Bi is a set of blocks bi1,...,bip containing the records of R
– goodness(Bi) = aggr_f(overall_sim(bi1), …, overall_sim(bip))
– overall_sim(bij) = F(sim(rx,ry)), for all pairs (rx,ry) in bij
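The selection can be sketched once aggr_f and F are fixed. The slides leave both open; the sketch below assumes the arithmetic mean for each, and uses a deliberately crude toy similarity (same first letter) on hypothetical records.

```python
from itertools import combinations
from statistics import mean


def overall_sim(block, sim):
    """F, assumed to be the mean of sim over all pairs in the block."""
    pairs = list(combinations(block, 2))
    return mean(sim(x, y) for x, y in pairs) if pairs else 0.0


def goodness(collection, sim):
    """aggr_f, assumed to be the mean of the per-block overall similarities."""
    return mean(overall_sim(b, sim) for b in collection) if collection else 0.0


def select_block_collection(collections, sim):
    """B* = the block collection with the maximum goodness."""
    return max(collections, key=lambda c: goodness(c, sim))


def same_initial(x, y):
    """Toy similarity: 1.0 if the two names start with the same letter."""
    return 1.0 if x[0] == y[0] else 0.0


b1 = [["smith", "smyth"], ["doe", "do"]]
b2 = [["smith", "doe"], ["smyth", "do"]]
best = select_block_collection([b1, b2], same_initial)
```

The sketch only ranks the collections it is given; it does not enumerate all possible block collections.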
The best solution is too expensive!
Intuitively, to find the best B*, we have to construct all possible block collections and keep the one with the maximum goodness
Use a heuristic method to solve the problem
Our blocking sketch solution: two-phase blocking
Motivation: schema similarity is a lot cheaper to compute
– #attributes << #values
– #schemata << #attributes
Two-phase blocking:
– 1st phase: use the schema
  • targeting higher efficiency
– 2nd phase: use the values (or attribute-value pairs)
  • targeting higher effectiveness
A comparison of blocking approaches*
Attributes only
  Pros: low computational cost; low space requirements
  Cons: low recall; work best with homogeneous data
Values only
  Pros: handle schema diversity; high recall
  Cons: low precision; high computational cost; high space requirements; low RR
Attributes & Values
  Goals: improve RR; improve precision; preserve recall
*assuming the same matching function is used
Conclusions
ER is the problem of matching and merging records referring to the same real-world entity
ER is an inherently quadratic problem
Blocking can significantly reduce the matching-pairs search space
Data heterogeneity is the main reason why ER and blocking techniques for relational data do not apply to the Web of Data
Challenges
Larger and more datasets
– Need efficient parallel techniques
Heterogeneity
– Diverse data types, unclean and incomplete data
Lack of links
– Need to infer more relationships in addition to equality
Multi-relational
– Deal with the structure of entities (address vs street, no)
Multi-domain
Multiple applications
References
Christen, Peter. "Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection" (2012).
L. Getoor, A. Machanavajjhala. Entity Resolution Tutorial. In VLDB 2012 http://www.cs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf
Publishing Relational Data on the Semantic Web (tutorial at ESWC2011 http://db.disi.unitn.eu/pages/Rel2RDFTutorial2011/S0.pdf)
Papadakis, George, Ekaterini Ioannou, Themis Palpanas, and W. Nejdl. "A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces." (2012): 1-1.
Papadakis, George, Ekaterini Ioannou, Claudia Niederée, Themis Palpanas, and Wolfgang Nejdl. "Eliminating the redundancy in blocking-based entity resolution methods." In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries, pp. 85-94. ACM, 2011.
Neumann, Thomas, and Guido Moerkotte. "Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins." In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pp. 984-994. IEEE, 2011.
Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, Jennifer Widom. Swoosh: A Generic Approach to Entity Resolution. The VLDB Journal, vol. 18, no. 1, pp. 255-276, Jan. 2009.
David Menestrina, Steven Euijong Whang, Hector Garcia-Molina. Evaluating Entity Resolution Results. In Proc. 36th Int'l Conf. on Very Large Data Bases (PVLDB), pp. 208-219, Singapore, Sept. 2010.
Palpanas, Papadakis. Entity Resolution for BIG Data: Blocking-based Entity Resolution in Highly Heterogeneous Information Spaces https://team.inria.fr/oak/files/2012/.../20121219-Themis-Palpanas.pdf
Rajaraman, Anand, and Jeffrey David Ullman. Mining of massive datasets. Cambridge University Press, 2011.
Jure Leskovec, Stanford C246: Mining Massive Datasets http://www.stanford.edu/class/cs246/
Instance-Based Matching of Large Ontologies Using Locality-Sensitive Hashing
Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate Record Detection: A Survey, Knowledge and Data Engineering, IEEE Transactions on, Pages 1-16, Volume 19, Number 1, January 2007.
A. Ferrara, D. Lorusso, S. Montanelli, and G. Varese. Towards a Benchmark for Instance Matching. In The 7th International Semantic Web Conference, 2008.
Cafarella, Michael J., Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang. "Webtables: exploring the power of tables on the web." Proceedings of the VLDB Endowment 1, no. 1 (2008): 538-549.