NLP Data Cleansing Based on Linguistic Ontology Constraints
Dimitris Kontokostas (1,3), Martin Brümmer (1), Sebastian Hellmann (1,3), Jens Lehmann (1), Lazaros Ioannidis (2)
(1) AKSW, University of Leipzig
(2) Aristotle University of Thessaloniki
(3) DBpedia Association
2014-05-27
Kontokostas et al. (ESWC2014) NLP Data Cleansing 2014-05-27

LOD Cloud (2011) [figure]

Linguistic Communities [figure]

Linguistic workshops & conferences [figure]

Linguistic LOD Cloud (LLOD Cloud) [figure]

Problem definition
Linguistic (related) Data
Purpose-driven definition
Increasing data, ontologies & vocabularies
Newcomers → hard to understand the ontologies / follow updates
Validation is essential
Many different pipelines (parsing, annotation, disambiguation, etc.)
Errors are propagated
Partially provided by maintainers (incomplete)
Focus on Lemon & NIF (proof of concept)

Lemon - Lexicon Model for Ontologies
Models lexicon and machine-readable dictionaries
http://lemon-model.net/
RDF-native form
Linguistically sound structure (LMF)
Separation of the lexicon and ontology layers
Linking to data categories → arbitrarily complex linguistic description
Principle of least power: the less expressive the language, the more reusable the data.

Lemon - Example
:lexicon a lemon:Lexicon ;
    lemon:entry :Pizza, :Tortilla .
:Pizza a lemon:LexicalEntry ;
    lemon:sense [ lemon:reference <http://dbpedia.org/resource/Pizza> ] .
:Tortilla a lemon:LexicalEntry ;
    lemon:sense [ lemon:reference <http://dbpedia.org/resource/Tortilla> ] .

Lemon - Example (Correct)
:lexicon a lemon:Lexicon ;
    lemon:language "en" ;
    lemon:entry :Pizza, :Tortilla .
:Pizza a lemon:LexicalEntry ;
    lemon:canonicalForm [ lemon:writtenRep "Pizza"@en ] ;
    lemon:sense [ lemon:reference <http://dbpedia.org/resource/Pizza> ] .
:Tortilla a lemon:LexicalEntry ;
    lemon:canonicalForm [ lemon:writtenRep "Tortilla"@en ] ;
    lemon:sense [ lemon:reference <http://dbpedia.org/resource/Tortilla> ] .

NIF - NLP Interchange Format
RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations
In a nutshell:
    Logical formalisation of strings and annotations
    Builds on existing standards, e.g. RDF, LAF/GrAF, RFC 5147
    Reuse of RDF tool stack
    Decreases development cost for integration
Integrated in: DBpedia Spotlight, Stanford CoreNLP, OpenNLP, RDFace, Validator, CoNLL converter, ...

NIF - Overview [figure]

NIF - Example
<http://abc.com/doc#char=0,17>
    a nif:Context ;
    a nif:RFC147String ;
    nif:beginIndex "0" ;
    nif:endIndex "17" ;
    nif:isString "My dog likes pizza" .
<http://abc.com/doc#char=2,7>
    a nif:RFC5147String ;
    nif:anchorOf " dog" ;
    itsrdf:taClassRef dbo:Animal ;
    nif:referenceContext <http://abc.com/doc#char=0,17> .

NIF - Example (Correct)
<http://abc.com/doc#char=0,18>
    a nif:Context ;
    a nif:RFC5147String ;
    nif:beginIndex "0"^^xsd:nonNegativeInteger ;
    nif:endIndex "18"^^xsd:nonNegativeInteger ;
    nif:isString "My dog likes pizza"^^xsd:string .
<http://abc.com/doc#char=2,7>
    a nif:RFC5147String ;
    nif:beginIndex "2"^^xsd:nonNegativeInteger ;
    nif:endIndex "7"^^xsd:nonNegativeInteger ;
    nif:anchorOf " dog"^^xsd:string ;
    itsrdf:taClassRef dbo:Animal ;
    nif:referenceContext <http://abc.com/doc#char=0,18> .

Maintainer validation
Lemon
    Python script
    24 tests for structural criteria
    Too slow on big datasets
    Poor reporting
NIF
    SPARQL queries
    11 tests for common errors
    Not complete

Built on previous work
Test-driven evaluation of linked data quality. Dimitris Kontokostas, Patrick Westphal, Sören Auer, Sebastian Hellmann, Jens Lehmann, Roland Cornelissen, and Amrapali J. Zaveri. In WWW 2014.
    Horizontal, multi-domain data quality assessment
    Massive detection of errors for five large-scale LOD data sets
    291 vocabularies, independent of their domain or purpose
New contributions:
    Relation to OWL reasoners
    Test Driven Data Engineering Ontology
    Domain-specific validation
    Quickly improving existing validation options provided by maintainers

Test-Driven Data Development Methodology
Test case: a data constraint that involves one or more triples
Test suite: a set of test cases for testing a dataset
Status: Success, Fail, Timeout (complexity) or Error (e.g. network)
Fail: Error, warning or notice
RDF: basis for both data and schema
Unified model facilitates automatic test case generation
SPARQL serves as the test case definition language

Example test case
A nif:RFC5147String should never have a nif:beginIndex greater than nif:endIndex
Test cases are written in SPARQL
SELECT ?s WHERE {
    ?s nif:beginIndex ?v1 .
    ?s nif:endIndex ?v2 .
    FILTER (?v1 > ?v2) }
We query for errors
Success: Query returns empty result set
Fail: Query returns results
    Every result we get is a violation instance
Timeout / Error: needs further investigation on SPARQL engine capabilities, query syntax or query complexity

Patterns & Bindings
Data Quality Test Patterns (DQTP)
    Abstract patterns, which can be further refined into concrete data quality test cases using test pattern bindings
    Existing library of 20 patterns
SELECT ?s WHERE {
    ?s %%P1%% ?v1 .
    ?s %%P2%% ?v2 .
    FILTER (?v1 %%OP%% ?v2) }
Bindings
    Mapping of variables to valid pattern replacement
P1 => nif:beginIndex
P2 => nif:endIndex
OP => >
Instantiated query:
SELECT ?s WHERE {
    ?s nif:beginIndex ?v1 .
    ?s nif:endIndex ?v2 .
    FILTER (?v1 > ?v2) }

Test Auto Generators (TAGs)
RDF(S) & OWL (partial) support
Query schema for supported axioms
SELECT DISTINCT ?T1 ?T2 WHERE {
    ?T1 owl:disjointWith ?T2 . }
For every result, a binding to a pattern is generated & a test case instantiated
Supported axioms at the moment:
    RDFS: domain & range
    OWL: minCardinality, maxCardinality, cardinality, functionalProperty, InverseFunctionalProperty, disjointClass, propertyDisjointWith, AsymmetricProperty and deprecated

Test Case Elicitation Workflow [figure]

TD(D)D vs Reasoners
SPARQL test cases detect a subset of validation errors detectable by an OWL reasoner.
    Limited by SPARQL endpoint reasoning support and by limitations of the OWL-to-SPARQL translation.
SPARQL test cases detect validation errors not expressible in OWL.
OWL reasoning is often not feasible on large datasets.
Datasets are already deployed and accessible via SPARQL endpoints.
Pattern library: a more user-friendly approach for building validation rules than modelling OWL axioms.
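The binding step behind the pattern library can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how RDFUnit implements it: the helper name `instantiate_pattern` and the use of a plain dictionary for bindings are hypothetical. Only the %%P1%%, %%P2%%, %%OP%% placeholders and the nif:beginIndex / nif:endIndex binding are taken from the Patterns & Bindings slide.

```python
import re

# Pattern from the slides: compare the values of two properties of one subject.
PATTERN = (
    "SELECT ?s WHERE { "
    "?s %%P1%% ?v1 . "
    "?s %%P2%% ?v2 . "
    "FILTER (?v1 %%OP%% ?v2) }"
)

# Matches placeholders of the form %%NAME%%.
PLACEHOLDER = re.compile(r"%%(\w+)%%")


def instantiate_pattern(pattern, bindings):
    """Replace each %%NAME%% placeholder with its bound value.

    Hypothetical helper for illustration; raises KeyError when a
    placeholder has no binding, so incomplete bindings are not silently
    turned into a broken query.
    """
    def substitute(match):
        name = match.group(1)
        if name not in bindings:
            raise KeyError("no binding for placeholder %%" + name + "%%")
        return bindings[name]

    return PLACEHOLDER.sub(substitute, pattern)


# Binding from the slides: beginIndex must not exceed endIndex.
query = instantiate_pattern(
    PATTERN,
    {"P1": "nif:beginIndex", "P2": "nif:endIndex", "OP": ">"},
)
print(query)
```

The printed query is the concrete test case from the Example test case slide. Running it against a dataset and finding any result would count as a Fail under the methodology above.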
The pattern library requires familiarity; non-common validations require manual SPARQL test cases.

Data Engineering Ontology
Input / Output entirely in RDF
Model the methodology in OWL
    Test suites, test cases, patterns, auto generators
    Strict to serve as a validation layer
Four different levels of error reporting
    Simple test case report (success, fail) / enriched with counts
    Violation instance reporting / enriched with annotations
Reuse dcterms, prov, spin, rlog

Data Engineering Ontology - Definition & Generation [figure]

Data Engineering Ontology - Result Representation [figure]

Lemon & NIF Test case elicitation
RDFUnit Suite implements our methodology
Run on Lemon & NIF ontologies
TAGs could not yet handle some complex owl:Restrictions
    owl:unionOf, owl:allValuesFrom, owl:someValuesFrom, owl:hasSelf and some rdfs:subPropertyOf cases
Manual test cases for constraints not captured in OWL.
Total Domain Range Datatype Card. Disj. Func. I. Func. Manual
Lemon 182 40 34 1 29 64 3 1 10
NIF 96 42 24 4 6 10 10

Example of manual Lemon test case
lemon:narrower denotes that one sense of a word is narrower than the other and must never be symmetric or contain cycles.
SELECT DISTINCT ?s WHERE {
    ?s lemon:narrower+ ?narrower .
    ?narrower lemon:narrower+ ?s . }
lemon:language must not have a language tag (RDF 1.1 to the rescue)
SELECT DISTINCT ?s WHERE {
    ?s lemon:language ?v1 .
    FILTER (lang(?v1) != "") }

Example of manual NIF test case
Ensure that the nif:beginIndex & nif:endIndex indices are correct
SELECT DISTINCT ?s WHERE {
    ?s nif:anchorOf ?anchorOf ;
       nif:beginIndex ?beginIndex ;
       nif:endIndex ?endIndex ;
       nif:referenceContext