Web Data Integration Types of Structured Data on the Web

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 1 Topology of the Web Today

The Classic The Web Document of Data Web

Deep Web (via and forms)

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 2 Outline

1. Data Catalogs and Marketplaces 2. Web APIs 3. 4. HTML-embedded Data 1. RDFa, Microdata, 2. HTML Tables and Templates 3. Wikipedia as Data Source 5. References

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 3 1. Data Catalogs and Marketplaces

 The Web traditionally contains structured data in various formats: • CSV files, Excel worksheets • XML documents, SQL dumps  Data Catalogs and Data Marketplaces • collect and host data sets plus • provide free or payment-based access to the data sets  Examples • data.gov.uk, data.gov.us, publicdata.eu: Thousands of public sector data sets • Factual, Dun & Bradstreet, DataStreamX: Commercial data market places  List of Data Catalogs • http://data.wu.ac.at/portalwatch/portalslist

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 4 University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 5 University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 6 2. Web 2.0 Applications and Web APIs

 A multitude of Web-based applications enable users to share information.  These applications form seperate data spaces that are only partly accessible via the Web. • HTML interfaces • Web APIs

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 7 Example: Facebook

 Users (September 2012) • 1 billion monthly active users • including 600 million mobile users • 140.3 billion friend connections • 1.13 trillion likes since launch in February 2009 • 219 billion photos uploaded • 17 billion location-tagged posts, including check-ins  Data Volume • over 100 Petabyte • inluding profile data, communication, usage logs, ...

Sources • https://s3.amazonaws.com/OneBillionFB/Facebook+1+Billion+Stats.docx • http://www.technologyreview.com/featuredstory/428150/what-facebook-knows/

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 8 Web APIs

 Provide limited access to the collected data • restricted to specific queries (canned queries) • restrictred number of queries / number of results  ProgrammableWeb API Catalog • lists over 17,000 Web APIs (2017) • list over 6,800 mashups

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 9 Most Popular Web API

Popularity = Number of Mashups using an API

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 10 Mashups are based on a fixed set of data sources

Web APIs expose proprietary Mashup interfaces. Up  Not index-able by generic Web crawlers  No automatic discovery of additional data sources Web Web Web Web API API API API  No single global data space

A B C D

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 11 Web APIs slice the Web into Data Silos

Image:University Bob Jagensdorf, of Mannheim http://flickr.com/photos/darwinbell/, – Bizer: Web Data Integration CC-BY – HWS2018 (Version: 30.8.2018) Slide 12 3. Alternative Approach: Linked Data

 Extend the Web with a single global data graph • by using RDF to publish structured data on the Web • by setting links between data items within different data sources.

RDF RDF RDF RDF RDF

RDF RDF RDF RDF RDF

RDF RDF RDF RDF link links links links

A B C D E

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 13 Entities are identified with HTTP URIs

rdf:type pd:cygri :Person

foaf:name Richard Cyganiak foaf:based_near :Berlin

HTTP URIs take the role of global primary keys.

pd:cygri = http://richard.cyganiak.de/foaf.rdf#cygri dbpedia:Berlin = http://dbpedia.org/resource/Berlin

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 14 URIs can be looked up on the Web

rdf:type pd:cygri foaf:Person

foaf:name 3.405.259 Richard Cyganiak dp:population foaf:based_near dbpedia:Berlin

skos:subject

dp:Cities_in_Germany

 By following RDF links applications can • navigate the global data graph • discover new data sources

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 15 The Marbles Browser

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 16 The SigMa Linked Data

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 17 LOD Datasets on the Web: April 2014

More statistics: http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/ Dataset Catalogs: https://lod-cloud.net/datasets (as of 2018) http://linkeddatacatalog.dws.informatik.uni-mannheim.de/ (as of 2014)

Source: Max Schmachtenberg, Christian Bizer, Heiko Paulheim: Adoption of the Linked Data Best Practices in Different Topical Domains. In: 13th International Conference (ISWC2014).

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 18 Uptake in the Life Science Domain

 Goals: 1. Connect life science datasets in order to support • biological knowledge discovery • drug discovery 2. Reuse results of previous integration efforts

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 19 Uptake in the Libraries Community

 Institutions publishing Linked Data • Library of Congress (subject headings) • German National Library (PND dataset and subject headings) • Swedish National Library (Libris - catalog) • Hungarian National Library (OPAC and ) • Europeana Digital Library (4 million artifacts) • Springer (metadata about conference proceedings)

 Goals: 1. Interconnect resources between repositories (by topic, by location, by historical period, by ...) 2. Integrate library catalogs on global scale

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 20 RDF Links between LOD Datasets

httUniversityps://lod-cloud.net/ of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 21 Hands-on: How to get the Data?

 Download the Billion Triples Challenge Dataset • 4 billion RDF triples (52 GB gzipped, 1.1 TB uncompressed) • crawled from the public Web of Linked Data in Februar/June 2014 • http://km.aifb.kit.edu/projects/btc-2014/

 Use SPARQL endpoints of individual data sets • Endpoint list: http://sparqles.ai.wu.ac.at/availability

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 22 4. HTML-embedded Data

1. Webpages traditionally contain structured data in the form of HTML tables as well as template data. 2. More and more Websites semantically markup the content of their HTML pages using standardized markup formats.

Microformats RDFa

Microdata

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 23 4.1 Microformats

effort dates back to 2003  Small set of fixed formats • : people, companies, organizations, and places • XFN : relationships between people • hCalendar : calendaring and events • hListing : small-ads; classifieds • hReview : reviews of products, businesses, events  Shortcoming of Microformats • can not represent any kind of data.  indexed by and Yahoo since 2009

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 24 RDFa

 serialization format for embedding RDF data into HTML pages  proposed in 2004, W3C Recommendation in 2008  can be used together with any vocabulary  can assign URIs as global primary keys to entities

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 25 Open Graph Protocol

 allows site owners to determine how entities are described in Facebook  relies on RDFa for embedding data into HTML pages  available since April 2010

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 26 Microdata

 alternative technique for embedding structured data  proposed in 2009 by WHATWG as part of HTML5 work  tries to be simpler than RDFa (5 new attributes instead of 8)

Vienna Marriott Hotel Parkring 12a Vienna
4 stars-based on 250 reviews.

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 27 Schema.org

 ask site owners since 2011 to annotate data for enriching search results  675 Types: Event, Place, Local Business, Product, Review, Person  Encoding: Microdata, RDFa, JSON-LD

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 28 Usage of Schema.org Data @ Google

Data snippets within search results

Data snippets within info boxes

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 29 Rich-Snippets Get More User Attention

 Suchen

Potential business incentive.

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Source: www.looktracker.com Slide 30 The Web Data Commons Project

 extracts all Microformat, Microdata, RDFa, JSON-LD data from the Common Crawl  analyzes and provides the extracted data for download  statistics about some extraction runs • 2017 CC Corpus: 3.1 billion HTML pages  38.2 billion RDF triples • 2016 CC Corpus: 3.1 billion HTML pages  44.2 billion RDF triples • 2014 CC Corpus: 2.0 billion HTML pages  20.4 billion RDF triples • 2012 CC Corpus: 3.0 billion HTML pages  7.3 billion RDF triples  uses 100 machines on Amazon EC2 • approx. 2000 machine/hours (100 spot instances of type c3.xlarge)  350 Euro  http://www.webdatacommons.org/structureddata/

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 31 Overall Adoption 2017

1.2 billion HTML pages out of the 3.2 billion pages provide semantic annotations (38%).

7.4 million pay-level-domains (PLDs) out of the 26 million pay-level-domains covered by the crawl provide annotations (28.4%).

Google, 2014*: 5 million websites provide Schema.org data. * Guha in LDOW2014 Keynote

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 32 Development of the Adoption over Time Number of PLDs using a specific format

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 33 Most Popular Classes

RDFa # PLDs

Microdata

Strong focus on # PLDs Schema.org and Facebook (og:) vocabularies.

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 34 Topical Focus – Microdata 2014

2014 2013  Top Classes Class Instances # PLDs PLDs #% #% 1 schema:WebPage 51.757.000 148,893 18,16% 69.712 15,04  Topics: 2 schema:Article 54.972.000 88,7 10,82% 65.930 14,22 • CMS and blog 3 schema:Blog 3.787.000 110,663 13,50% 64.709 13,96 4 schema:Product 288.083.000 89,608 10,93% 56.388 12,16 metadata 5 schema:PostalAddress 48.804.000 101,086 12,33% 52.446 11,31 6 dv:Breadcrumb 269.088.000 76,894 9,38% 44.187 9,53 • products and 7 schema:AggregateRating 59.070.000 50,510 6,16% 36.823 7,94 offers 8 schema:Offer 236.953.000 62,849 7,66% 35.635 7,69 9 schema:LocalBusiness 20.194.000 62,191 7,58% 35.264 7,61 • ratings and 10 schema:BlogPosting 11.458.000 65,397 7,98% 32.056 6,92 reviews 11 schema:Organization 101.769.000 52,733 6,43% 24.255 5,23 12 schema:Person 115.376.000 47,936 5,85% 21.107 4,55 • business listings 13 schema:ImageObject 35.356.000 25,573 3,12% 16.084 3,47 14 dv:Product 12.411.000 16,003 1,95% 13.844 2,99 • address data 15 schema:Review 42.561.000 20,124 2,45% 13.137 2,83 16 dv:Review‐aggregate 3.964.000 14,094 1,72% 13.075 2,82 • ...and a massive 17 dv:Organization 3.155.000 10,649 1,30% 9.582 2,07 long tail 18 dv:Offer 7.170.000 11,64 1,42% 9.298 2,01 19 dv:Address 2.138.000 9,674 1,18% 8.866 1,91 20 dv:Rating 1.732.000 9,367 1,14% 8.360 1,8

schema: = Schema.org dv: = Google Rich Snippet Vocabulary (deprecated)

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 35 Adoption by E-Commerce Websites

Distribution by Alexa Top-15 Shopping Sites Top-Level Domain Website schema:Product TLD #PLDs Amazon.com  com 38344 Ebay.com  co.uk 3605 NetFlix.com  net 1813 Amazon.co.uk  de 1333 Walmart.com  pl 1273 etsy.com  com.br 1194 Ikea.com  ru 1165 Bestbuy.com  com.au 1062 Homedepot.com  nl 1002 Target.com  Groupon.com  Newegg.com  Adoption by Top-15: Lowes.com  60 % Macys.com  Nordstrom.com 

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 36 Properties used to Describe Products

Top 15 Properties PLDs #% schema:Product/name 78,292 87 % schema:Product/image 59,445 66 % schema:Product/description 58,228 65 % schema:Product/offers 57,633 64 % schema:Offer/price 54,290 61 % schema:Offer/availability 36,789 41 % schema:Offer/priceCurrency 30,610 34 % schema:Product/ 23,723 26 % schema:Product/aggregateRating 21,166 24 % schema:AggregateRating/ratingValue 20,513 23 % schema:AggregateRating/reviewCount 14,930 17 % schema:Product/manufacturer 10,150 11 % schema:Product/brand 9,739 11 % schema:Product/productID 9,221 10 % schema:Product/sku 7955 9 %

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 37 Adoption by Travel Websites

Top 15 Travel Websites schema:Hotel Any Class Adoption: Booking.com (uses DataVoc) 73 % TripAdvisor  Expedia  Agoda  Hotels.com  Kayak  Priceline  Travelocity  Orbitz  ChoiceHotels  HolidayCheck  ChoiceHotels  InterContinental Hotels Group  Marriott International  Global Hyatt Corp. 

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 38 Hands-on: How to get the Data?

 http://www.webdatacommons.org/structureddata/

 Only tip of the iceberg, as each website is only partly crawled.

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 39 There are hundreds of millions of high- quality tables on the Web and in Wikipedia.

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 40 4.2 HTML Tables

In corpus of 14B raw tables, 154M are “good” relations (1.1%). Cafarella (2008)

• Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008. • Crestan, Pantel: Web-Scale Table Census and Classification. WSDM 2011.

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 41 Hands-on: Web Data Commons – Web Tables Corpus

 Large public corpus of relational Web tables  extracted from Common Crawl 2015 (1.78 billion pages)  90 million relational tables  selected out of 10.2 B raw tables (0.9%)  download includes the HTML pages of the tables (1TB zipped)  http://webdatacommons.org/webtables/

 Alternative: Google Table Search  https://research.google.com/tables

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 42 Web Data Commons – Web Tables Corpus

 Attribute Statistics  Subject Attribute Values

Attribute #Tables Value #Rows name 4,600,000 usa 135,000 price 3,700,000 germany 91,000 date 2,700,000 greece 42,000 artist 2,100,000 new york 59,000 location 1,200,000 london 37,000 year 1,000,000 athens 11,000 manufacturer 375,000 david beckham 3,000 counrty 340,000 ronaldinho 1,200 isbn 99,000 oliver kahn 710 area 95,000 twist shout 2,000 population 86,000 yellow submarine 1,400

28,000,000 different attribute labels 1.74 billion rows 253,000,000 different subject labels

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 43 Exploiting the Template-Structure of HTML Pages

 Many websites are generated from using HTML-templates.  Approaches to extract the data: • Hand-written wrappers using Xpath or regexes • Wrapper induction using machine learning techniques (see Bing Liu: Web Data Mining book)  Problem: • Wrappers are site-specific • Thus the approach does not scale to large numbers of websites

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 44 4.3 Wikipedia as Data Source

Title Geo‐ Coordinates Description Images

Infoboxes Cross Language Links

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 45 Extracting Knowledge from Wikipedia

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 46 The DBpedia - Release 2016

 Describes 6.6 million things, out of which 5.5 million are classified in a consistent ontology using 760 classes and 2729 different properties • 1,500,000 persons • 840,000 places • 260,000 organizations • 139,000 music albums  Altogether 13 billion pieces of information (RDF triples) • 1.7 million were extracted from the English edition of Wikipedia • 29,000,000 links to external web pages • 50,000,000 external links into other RDF datasets  DBpedia Internationalization • provides data from 125 Wikipedia language editions for download • For 28 popular languages DBpedia provides cleaned infobox data

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 47 University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 48 Hands-on: How to get DBpedia Data?

 Download Data Dumps  Use SPARQL endpoint

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 49 Knowledge Graphs

Large cross-domain knowledge bases which aim to cover all “relevant” entities in the world.

 Google • development started 2012, builds on Freebase • 570 million objects described by over 18 billion facts (2012) • 1500 classes, 35,000 properties  Satori Knowledge Base • revealed to the public in mid-2013  Yahoo Knowledge Graph • revealed to the public early-2014  Knowledge Graphs employ RDF-style graph data models

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 50 Data Sources used to Build Knowledge Graphs

1. Wikipedia • infoboxes, category system, information extraction from text 2. Open license sources • e.g. CIA World Factbook, MusicBrainz, … 3. Commercial third-party data • e.g. IMDB, company listings, … 4. schema.org annotations in web pages • e.g. contact information for companies • e.g. logos of companies

Lots of effort is spend on data integration and manual data curation

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 51 Application of the Google Knowledge Graph

 Enrich search results with knowledge cards and lists  Goal: Fulfil information need without having users navigate to other websites

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 52 Applications of the Google Knowledge Graph

1. Answer fact queries: “birthdate michael douglas” 2. Compare things: ”compare eiffel tower vs empire state building”

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 53 Behind-the-Scenes Applications of KGs

Various tasks become easier, if you know all entities in the world.  Google • uses its knowledge graph to identity entities in web pages (Entity Linking) • Hummingbird ranking algorithm (deployed in 2013) uses knowledge graph as background knowledge for ranking search results.  Yahoo • uses its knowledge graph to “support applications across the company: • Web Search, Content Understanding • Recommendation, Personalization, Advertisement”*  Data Integration • becomes matching data sources against knowledge graphs as intermediate schemata. *Source: Nicolas Torzec, Yahoo 2014

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 54 SEO Topic: How to influence Knowledge Graphs?

Source: http://searchengineland.com/ leveraging-wikidata-gain-google- knowledge-graph-result-219706

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 55 Summary

 In addition to text, the Web contains a vast amount of structured data.  The topics of the published data partly correlate with the publication methods used: • Web APIs: User generated content, geographic data • Schema.org data: E-commerce, local business, event, job data • Linked Data: E-government data, library data, research data • Wikipedia, HTML tabels: General knowledge  The Web is the perfect playground for researching and applying Big Data integration techniques • tough challenges concerning heterogeneity, volume, and data quality • rewarding if challenges can be handled, e.g. web-scale queries and mining

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 56 5. References

 Linked Data • Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web. Morgan & Claypool Publishers (also: free online version), 2011.  RDFa, Microdata and Microformats • Christian Bizer, et al.: Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis. 12th International Semantic Web Conference, 2013.  Extracting HTML Table Data • Michael Cafarella, Alon Halevy, et al.: WebTables: Exploring the power of Tables on the Web. Proceedings of the VLDB Endowment, 2008. • Petros Venetis, Alon Halevy, et al.: Recovering of Tables on the Web. Proceedings of the VLDB Endowment, 2011.  Wrapper Induction • Bing Liu: Web Data Mining. Chapter 9. Springer, 2011.  DBpedia • Jens Lehmann, et al: DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web Journal, 2014.

University of Mannheim – Bizer: Web Data Integration – HWS2018 (Version: 30.8.2018) Slide 57