Graphs: Use Cases, Analytics and Linking
Big Knowledge Graphs: Use Cases, Analytics and Linking
Company Importance and Similarity Demo
Atanas Kiryakov
LAMBDA Big Data School, Belgrade, June 2019

Presentation Outline
o Introduction
o GraphDB
o Use Cases
o Market Intelligence Vision
o Concept and Entity Awareness via Big Knowledge Graphs
o FactForge: Showcase KG with 2B Statements
o KG Analytics: Similarity and Importance

Mission
We help enterprises get better insights by interlinking:
o Diverse databases & unstructured information
o Proprietary & global data
We master knowledge graphs, combining several AI technologies:
o Graph analytics, text mining, computer vision
o Symbolic reasoning & machine learning

Essential Facts
o Leader
  ✓ Semantic technology vendor established in 2000
  ✓ Part of Sirma Group: 400 people, listed on the Sofia Stock Exchange
o Profitable and growing
  ✓ HQ and R&D in Sofia, Bulgaria
  ✓ More than 70% of commercial revenues from London and New York
o Innovator: attracted more than €10M in R&D funding
o Trendsetter
  ✓ Member of: W3C, EDMC, ODI, LDBC, STI, DBpedia Foundation, Pistoia Alliance

Ontotext GraphDB™ - Flagship Product
Source: db-engines.com popularity ranking of graph databases
Note: This is not a ranking by revenue; such information is not available for most vendors

Fancy Stuff and Heavy Lifting
o We do advanced analytics: we predicted Brexit
  ✓ 14 Jun 2016 whitepaper: #BRExit Twitter Analysis: More Twitter Users Want to Split with EU and Support #Brexit
    https://ontotext.com/white-paper-brexit-twitter-analysis/
o But most of the time we do the heavy lifting of data integration and information extraction
  ✓ Enabling data scientists to do fancy things

Technology Excellence
o Unique: GraphDB™ + text mining
o Enterprise robust: powers BBC.CO.UK/SPORT and FT.COM
o Serving the most knowledge-intensive enterprises

What is a Knowledge Graph?
o A KG represents a collection of interlinked descriptions of entities (real-world objects), where:
  ✓ Descriptions have a formal structure that allows both people and computers to process them efficiently and unambiguously;
  ✓ Entity descriptions contribute to one another, forming a network in which each entity represents part of the description of the entities related to it.
o The Knowledge Graph can be seen as a specific type of:
  ✓ Database, because it can be queried via structured queries;
  ✓ Graph, because it can be analyzed like any other network data structure;
  ✓ Knowledge base, because its data bears formal semantics, which can be used to interpret the data and infer new facts.

Discovery in Knowledge Graphs
o Find suspicious patterns like:
  ✓ Company in USA
  ✓ Controls another company in USA
  ✓ Through a company in an off-shore zone
o Show news relevant to these companies

Text Analytics: Annotate Content, Get Suggestions
[Figure: NLP pipeline backed by a dynamic vocabulary from GraphDB. Stage labels: Language Detection, POS, Vocabulary Gazetteer, Entity Detection from Vocabulary, Semantic Disambiguation, Relevance Ranking.]
o Example text: "Apple CEO Tim Cook was at a conference with the CEO of Samsung. Tim explained how smart phones are changing the consumer electronics market."
o Candidate annotations: Apple : Organisation; Tim Cook : Person, CEO; Tim Cook : Person, Footballer; Samsung : Organisation
o Relevance ranking: 87% - Tim Cook : Person, CEO; 68% - Apple : Organisation; 56% - Samsung : Organisation

Approach and Applications Portfolio

GraphDB Essentials
o Scalable RDF / SPARQL engine
  ✓ W3C standards support
o Platform independent (100% Java)
o Open source API
  ✓ Main contributor to the RDF4J project
o Reasoning and consistency checking
  ✓ UNIQUE! Efficient reasoning support for big data sets across the full lifecycle of the data: load, query, updates

Architecture
o GraphDB Workbench: user-friendly interface for database administration
o GraphDB Engine: REST API for database access
o Plugins / Connectors

GraphDB Workbench
o SPARQL editor & autocomplete
o Schema visualization
o Graph exploration
o Database monitoring and administration

GraphDB Workbench (2)
o Generation of RDF from structured sources
o Data cleaning and transformations
o Integration with OpenRefine and the GREL language

Visual Graph

GraphDB Enterprise: Resilience & Availability
Features compared across the Free, Standard and Enterprise editions:
o RDF 1.1 support
o SPARQL 1.1 support
o RDFS, OWL2 RL and QL reasoning
o Efficient query execution
o Workbench interface
o Community support
o Unlimited number of CPU cores
o Commercial support
o Connectors for Elasticsearch & SOLR
o High-availability cluster
o Managed service

High Availability Cluster Architecture
o Improved resilience
  ✓ failover, dynamic configuration
o Improved query bandwidth
  ✓ larger cluster means more queries per unit time
o Multiple data centres deployment
o Integration with search engines
o Integration with MongoDB
[Figure: GraphDB Cluster with Multi-DC Data Governance. Labels: Master R+W, Master RO; Worker 1, Worker 2, Worker 3, each with a Connector to SOLR/ES.]

GraphDB Benchmarking
o LDBC: TPC-like benchmarks for graph databases
o Members include: Ontotext, OpenLink, neo4j, CWI, UPM, ORACLE, IBM,
Sparsity
o LDBC Semantic Publishing Benchmark
  ✓ Based on BBC’s Dynamic Semantic Publishing editorial workflow
  ✓ Updates, adding new content metadata or updating the reference knowledge (e.g. new people)
  ✓ Aggregation queries retrieve content according to various criteria (e.g. to generate a topic web page)
  ✓ The only benchmark that involves reasoning and updates

LDBC SPB Results of GraphDB
o CPU: 1 x E5-1650
o RAM: 20G heap
o Dataset: LDBC SPB 256
o DB: GraphDB SE 8.0, RDF Statements: 254,948,985 (explicit), 480,405,141 (total)
o Creative works: 8,821,535

Clients reading / writing | Reads/s | Writes/s
0 / 1 | 0.0000 | 11.4067
0 / 2 | 0.0000 | 14.3033
0 / 4 | 0.0000 | 14.6700
0 / 8 | 0.0000 | 15.1067
1 / 0 | 17.8258 | 0.0000
4 / 0 | 43.0833 | 0.0000
8 / 0 | 70.3767 | 0.0000
16 / 0 | 83.2633 | 0.0000
8 / 2 | 52.5667 | 9.2867
8 / 4 | 54.0233 | 9.6167
8 / 8 | 54.9067 | 9.5733
10 / 2 | 59.9467 | 8.5333
10 / 4 | 62.2867 | 8.4767
10 / 8 | 61.7167 | 8.6067
16 / 2 | 68.8100 | 5.0600
16 / 4 | 70.3900 | 5.1067
16 / 8 | 70.2300 | 4.9967
16 / 16 | 70.9467 | 5.0567

GraphDB SE vs AWS Neptune
o Berlin SPARQL Benchmark (BSBM), 100M scale
  ✓ No inference (because Neptune does not support inference)
  ✓ Established for many years
  ✓ Requested help from AWS Neptune to get the best possible results

Setting | GraphDB SE | AWS Neptune
Version | 8.6 | 1.0.1.0200237.0
AWS instance | r4.large (2 vCPU, 15.25G RAM) | db.r4.large (2 vCPU, 15.25G RAM)
Storage | EBS (gp1) | ?
Data loading protocol | HTTP POST (RDF4J) | Load TTL from an S3 bucket
Load type | ACID-compliant | Non-ACID compliant

GraphDB SE vs AWS Neptune (2)
Metric | GraphDB SE | AWS Neptune | Neptune/GraphDB
RDF import operation (100M RDF triples dataset), lower is better:
Loading time | 1,895 | 12,149 | 641%
Read queries (1 client vs 100M RDF triples dataset), higher is better:
Query 1 QPS | 309.96 | 41.25 | 13%
Query 2 QPS | 255.23 | 67.77 | 27%
Query 3 QPS | 289.98 | 39.73 | 14%
Query 4 QPS | 232.70 | 37.71 | 16%
Query 5 QPS | 23.18 | 2.20 | 9%
Query 7 QPS | 171.75 | 39.78 | 23%
Query 8 QPS | 229.17 | 36.33 | 16%
Query 9 QPS | 406.16 | 115.93 | 29%
Query 10 QPS | 234.97 | 37.43 | 16%
Query 11 QPS | 266.93 | 63.97 | 24%
Query 12 QPS | 249.47 | 102.32 | 41%
Total | 20,122.74 | 2,679.15 | 13%

2010: Semantic Publishing in BBC Use Case
o Goals
  ✓ Create a dynamic semantic publishing platform that assembles web pages on the fly using a variety of data sources
  ✓ Deliver highly relevant data to web site visitors with sub-second response
"The goal is to be able to more easily and accurately aggregate content, find it and share it across many sources. From these simple relationships and building blocks you can dynamically build up incredibly rich sites and navigation on any platform."
John O’Donovan, Chief Technical Architect, BBC

Use Cases in Media
o Dynamic Semantic Publishing
  ✓ Client: BBC
  ✓ Task: Power dynamic media website with 1000s of topical pages
  ✓ Projects: BBC.CO.UK/Sport and BBC’s London 2012 Olympics websites
  ✓ Technology challenges: text analysis; reasoning; database load that combines frequent updates with high query throughput
o Metadata management for Scientific Publishers
  ✓ Client: Elsevier, John Wiley
  ✓ Task: Manage large volume of rich and complex metadata about scientific articles

Use Cases in Healthcare and Life Sciences
o Semantic Medical Coding
  ✓ Client: Insurance companies, EMR processing etc.
  ✓ Transforms raw patient data into structured knowledge
  ✓ Enrich data by applying medical ontologies (SNOMED, LOINC and UMLS)
  ✓ Load extracted and normalised information in the medical Knowledge Graph
o Data Integration for Pharma Insights
  ✓ Client: Pharmaceutical and biotech companies
  ✓ Unifies both public & private data sources, structured knowledge extracted by text mining & semantic data integration
  ✓ Combine internal and standard public terminology like MedDRA and SNOMED

Use Cases in Compliance
o Adverse Media Monitoring
  ✓ Client: Top-5 US bank
  ✓ Task: monitor negative news and media about people of interest and related entities
  ✓ News sourced by Factiva; Factiva’s adverse events coding does not meet client’s needs
o Compliance with changing regulations
  ✓ Client: