Ontology lexicalization - Context System architecture Results Conclusions

Towards Lexicalization of DBpedia Ontology with Unsupervised Learning and

Anca Marginean, Kando Eniko

September 2016

Structured - unstructured data

?disease diseasome/possibleDrug Lepirudin . (Diseasome)
  Which are the diseases treated by Lepirudin?
  Which are the diseases Lepirudin is used for?
  Which are the diseases that have Lepirudin as possible drug?
  Which are the diseases whose possible drug is Lepirudin?

?person dbpedia:field Analytic number theory/Mathematician . (DBpedia)
  Heini Halberstam was a British mathematician, working in the field of analytic number theory.
    Heini Halberstam field Analytic number theory
  John Lewis Selfridge was an American mathematician who contributed to the fields of analytic number theory, computational number theory, and combinatorics.
    John Lewis Selfridge field Analytic number theory
  Traian Lalescu was a Romanian mathematician.
    Traian Lalescu field Mathematician

?person dbpedia:deathCause Cancer . (DBpedia)
  Stephen Shadegg died of cancer at his Phoenix home at the age of eighty on April 16, 1990.
    Stephen Shadegg deathCause Cancer
  Derek Waring died from cancer at Petworth Cottage Hospital in West Sussex in 2007, aged 79.
  Her (Bea Arthur) family acknowledged the cause of death was cancer, but declined to specify what type.
  Walter Kiernan died after a protracted struggle with cancer in 1978 at the age of 75.

How can ontology properties and concepts, such as field, deathCause, and possibleDrug, be expressed in natural language?

Outline

1 Ontology lexicalization - Context

2 System architecture

3 Results

4 Conclusions

Large Context

Translating between natural language and structured data

Text mining - Unstructured Data
  Industry: IBM Watson, Lexalytics, Alchemy API, SmartLogic, PoolParty, Cogito

Structured Data with NL interfaces
  Linked Data
  Meaning Representation Language (CHILL 1996), Controlled Natural Languages (ACE Attempto, GF based)

Specific Context

How are the elements of the ontology verbalized in natural language?

Ontology lexicalization = enriching the ontology with linguistic information
lemon - standard for sharing lexical information on the Semantic Web
2013 - CLEF challenge with 10 classes and 30 properties from DBpedia
2014 - manual lexicon for DBpedia - 1.8 entries per ontology element (1.3 per class and 2.4 per property)
2014 - M-ATOLL - framework for the Lexicalization of Ontologies in Multiple Languages
2010 - WRPA - relational paraphrase acquisition from Wikipedia - extensive use of Infoboxes

Existing approaches for Ontology lexicalization

Exploit the relation between Wikipedia and DBpedia
  string-level - common substring between domain and range of the property
  dependency tree - set of dependency patterns
  semantic role graphs
Without , but with external lexical resources:
  label-based - synonyms from BabelNet

Starting approaches

M-ATOLL [1]
For a property P do:
  1 triple retrieval from DBpedia
  2 sentence extraction for retrieved triples
  3 dependency patterns application (only the main grammatical structures: transitive/intransitive verbs, relational noun/adjectives)
  4 for each match of the dependency pattern, extract lemon lexicalization

Learning a Cross-Lingual Semantic Representation [2]
For a multilingual document corpus:
  1 build SRL graphs for each sentence
  2 construct similarity matrix for the identified graphs
  3 apply spectral clustering
  4 identify a ranked set of DBpedia properties for patterns for each cluster

[1] S. Walter, C. Unger, and P. Cimiano, "M-ATOLL: A framework for the lexicalization of ontologies in multiple languages," in The Semantic Web - ISWC 2014.
[2] A. Rettinger, A. Schumilin, S. Thoma, and B. Ell, "Learning a Cross-Lingual semantic representation of relations expressed in text," ESWC, 2015.

System architecture

1. Data gathering - Triple retrieval

dbr:Lord_of_the_Rings dbp:author dbr:J._R._R._Tolkien
dbr:Albert_Einstein dbp:almaMater dbr:University_of_Zurich
dbr:World_War_II dbp:causalties "Civilian dead"
dbr:Amelia_Warner dbp:spouse dbr:Jamie_Dornan
dbr:French_Alps dbp:highest dbr:Mont_Blanc

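PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbo: <http://dbpedia.org/ontology/>

# Spouse pairs in which the first person has British nationality
SELECT ?person1 ?person2
WHERE {
  ?person1 rdf:type dbo:Person .
  ?person2 rdf:type dbo:Person .
  ?person1 dbp:spouse ?person2 .
  ?person1 dbp:nationality ?nationality .
  # str() lets the filter also match IRI-valued nationalities
  FILTER(regex(str(?nationality), "British", "i"))
}

1. Data gathering - Documents extraction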

⟨Entity1 property Entity2⟩ - extract the Wikipedia page of Entity1
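A minimal sketch of this lookup, assuming English DBpedia resources, whose local names mirror English Wikipedia page titles. The names `wikipedia_url` and `subject_page` and the two base-URL constants are introduced here for illustration only; the slides do not say how the page is actually fetched.

```python
DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"
WIKIPEDIA_PAGE = "https://en.wikipedia.org/wiki/"


def wikipedia_url(dbpedia_uri: str) -> str:
    """Map an English DBpedia resource URI to the matching Wikipedia URL."""
    if not dbpedia_uri.startswith(DBPEDIA_RESOURCE):
        raise ValueError(f"not a DBpedia resource URI: {dbpedia_uri}")
    # The local name (e.g. "Barbara_Amiel") is reused as the page title.
    return WIKIPEDIA_PAGE + dbpedia_uri[len(DBPEDIA_RESOURCE):]


def subject_page(triple):
    """Return the Wikipedia URL for Entity1 of an (Entity1, property, Entity2) triple."""
    subject, _property, _object = triple
    return wikipedia_url(subject)


# Hypothetical full URIs for a retrieved triple of the married property.
example_triple = (
    "http://dbpedia.org/resource/Barbara_Amiel",
    "http://dbpedia.org/ontology/married",
    "http://dbpedia.org/resource/George_Jonas",
)
page = subject_page(example_triple)
```

Only the subject of the triple is resolved, which matches the rule that the page of Entity1 is the one extracted.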

Property married, one of the retrieved triples: ⟨dbr:Barbara Amiel dbo:married dbr:George Jonas⟩
  retrieve the Barbara Amiel Wikipedia page

2. Data processing - Sentence filtering

Full Matching

⟨Toby Ziegler dbp:married Andrea Wyatt⟩
Wiki sentence: Toby Ziegler was married to Andrea Wyatt, who serves as a Congresswoman from Ohio.

Partial Matching

⟨Rose McConnell Long dbp:married Huey Long⟩
Wiki sentence: Rose and Huey were married in 1913.

⟨Apple Inc. dbp:founders Steve Jobs⟩
Wiki sentence: Apple was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne.

⟨Monika Ritsch-Marte dbp:field Physics⟩
Wiki sentence: In 1995 Monika received her Habilitation in the field of Theoretical Physics at the University of Innsbruck.
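The two matching modes above can be sketched as follows. This is an illustrative approximation, not the authors' implementation: it assumes a full match means both entity labels occur verbatim in the sentence, and a partial match means at least one word of each label occurs. The function names are hypothetical.

```python
import re


def _tokens(text):
    """Lower-cased word tokens of a string."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def match_type(sentence, subject, obj):
    """Classify a sentence as 'full', 'partial', or None for a (subject, object) pair."""
    lowered = sentence.lower()
    words = set(_tokens(sentence))

    def full(label):
        # Assumption: the complete label appears verbatim in the sentence.
        return label.lower() in lowered

    def partial(label):
        # Assumption: any single word of the label is enough. Labels that
        # contain common words would need stop-word handling in practice.
        return any(tok in words for tok in _tokens(label))

    if full(subject) and full(obj):
        return "full"
    if partial(subject) and partial(obj):
        return "partial"
    return None


kept = match_type(
    "Rose and Huey were married in 1913.", "Rose McConnell Long", "Huey Long"
)
```

Sentences for which `match_type` returns None would be discarded by the filter.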


Co-reference matching

⟨Steve Jobs dbp:founder⁻¹ NeXT⟩
Wiki sentence: Steven Paul "Steve" Jobs .. was an American information technology... He was the .. and founder, chairman, and CEO of NeXT Inc

Stanford CoreNLP library

2. Data processing - Semantic role label graphs building

sentence → SRL (Mate tools) → SRL annotated graphs

Gerda Christian died of cancer in Dusseldorf in 1997, aged 83.
  died (die.01)
    AM-CAU: of cancer
    AM-LOC: in Dusseldorf
    AM-TMP: in 1997, aged 83
    A1: Gerda Christian

Saba held a master's degree in Philosophy and Islamic Studies and taught theology.
  held (hold.01)
    A0: Saba
    A1: a master's degree in Philosophy and Islamic Studies
  taught (teach.01)
    A0: Saba
    A1: theology

3. Unsupervised Learning - Similarity matrix computation

$$\mathrm{sim}(t_1, t_2) = \sum_{i=1}^{3} w_i \, m_i(t_1, t_2)$$

1 root predicate name of the SRL tree: w_1 = 0.8
2 role labels of the SRL tree (A0, AM-LOC, etc.): w_2 = 0.15
3 values of the detected roles: w_3 = 0.05
m_2 and m_3: common / total number
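The weighted sum can be sketched in Python as follows. Two points are assumptions rather than something the slides state: "common / total number" is read as a multiset overlap (shared items divided by the size of the multiset union), and m_3 compares (role, value) pairs. The tree dictionaries are a hypothetical representation, filled with three SRL trees from the deathCause data, one value shortened.

```python
from collections import Counter

WEIGHTS = (0.8, 0.15, 0.05)  # w_1, w_2, w_3


def overlap(a, b):
    """Multiset overlap: common items / total items (0.0 if both are empty)."""
    common = sum((a & b).values())
    total = sum((a | b).values())
    return common / total if total else 0.0


def measures(t1, t2):
    """Return (m_1, m_2, m_3) for two SRL trees."""
    m1 = 1.0 if t1["predicate"] == t2["predicate"] else 0.0
    m2 = overlap(Counter(r for r, _ in t1["roles"]),
                 Counter(r for r, _ in t2["roles"]))
    # Assumption: a role value counts as common only with the same label.
    m3 = overlap(Counter(t1["roles"]), Counter(t2["roles"]))
    return m1, m2, m3


def sim(t1, t2):
    """Weighted sum of the three measures."""
    return sum(w * m for w, m in zip(WEIGHTS, measures(t1, t2)))


tree1 = {"predicate": "die.01", "roles": [
    ("AM-CAU", "of cancer"), ("AM-LOC", "in Dusseldorf"),
    ("AM-TMP", "in 1997 , aged 83"), ("A1", "Gerda Christian")]}
tree2 = {"predicate": "die.01", "roles": [
    ("AM-MNR", "peacefully"), ("AM-LOC", "in his sleep"),  # value shortened
    ("AM-CAU", "of heart failure"), ("A1", "Arthur")]}
tree3 = {"predicate": "contract.02", "roles": [
    ("A2", "influenza and pneumonia while in New York City"),
    ("AM-TMP", "during the 1918 flu pandemic"),
    ("A1", "John and Horace"), ("AM-TMP", "In January 1920")]}

sim_12 = sim(tree1, tree2)
sim_13 = sim(tree1, tree3)
```

With these inputs the sketch gives m_2 = 3/5 for the two die.01 trees and m_2 = 2/6 for the die.01/contract.02 pair, so the similarities come out at about 0.89 and 0.05. The large weight on the root predicate dominates the score.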

Example with three SRL trees:

Tree 1: died (die.01)
  AM-CAU: of cancer
  AM-LOC: in Dusseldorf
  AM-TMP: in 1997, aged 83
  A1: Gerda Christian

Tree 2: died (die.01)
  AM-MNR: peacefully
  AM-LOC: in his sleep at his great-granddaughter's home ....
  AM-CAU: of heart failure
  A1: Arthur

Tree 3: contracted (contract.02)
  A2: influenza and pneumonia while in New York City
  AM-TMP: during the 1918 flu pandemic
  A1: John and Horace
  AM-TMP: In January 1920

m1(1, 2) = 1      m1(1, 3) = 0
m2(1, 2) = 3/5    m2(1, 3) = 2/6
m3(1, 2) = 0      m3(1, 3) = 0

3. Unsupervised Learning - Spectral Clustering

similarity matrix → spectral clustering (scikit-learn library) → clusters
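A minimal sketch of this step with scikit-learn's `SpectralClustering`, which accepts a ready-made similarity matrix through `affinity="precomputed"`. The 6x6 matrix, the number of clusters, and the random seed are made-up illustration values; the slides do not report the parameters actually used.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Hypothetical pairwise similarities for six SRL trees: trees 0-2 share one
# root predicate, trees 3-5 share another, and cross-group similarity is low.
HIGH, LOW = 0.89, 0.05
similarity = np.full((6, 6), LOW)
similarity[:3, :3] = HIGH
similarity[3:, 3:] = HIGH
np.fill_diagonal(similarity, 1.0)

model = SpectralClustering(
    n_clusters=2,            # assumed; not stated in the slides
    affinity="precomputed",  # the matrix already holds similarities
    random_state=0,          # fixed seed for reproducible label assignment
)
labels = model.fit_predict(similarity)

# Group tree indices by cluster label.
clusters = {}
for index, label in enumerate(labels):
    clusters.setdefault(int(label), []).append(index)
```

Each entry of `clusters` lists the trees assigned to one cluster; the root predicates of those trees are what the cluster examples report.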

Examples of clusters extracted for the deathCause property:
1 die.01
2 failure.01
3 home.01
4 cancer.01
5 murder.01
6 commit.02, lead.03, charge.05, sentence.01, reveal.01, rename.01, enact.01, arrest.01, kill.01, martyre.01, suffer.01, suffer.01, conduct.01, attack.01
7 aneurysm.01, alcoholism.01, pneumonia.01, suffer.01, cyberbully.01, arrest.01, stroke.02, sleep.01, supporter.01, edema.01, attack.01, infarction.01

Results

Property        DBpedia   Wikipedia  Extracted  Distinct   Distinct    Clusters
                triples   articles   sentences  SRL trees  predicates
married         26277     1829       972        3221       1420        marry, husband, wife, date
board           782       780        124        478        470         chairman, member, president, executive
field           24103     3024       525        1464       738         professor, director, work, science, study, physics
cause of death  4517      4485       359        1267       699         murder, die, cancer, failure
writer          1750      1588       194        394        239         write, direct, star, produce, co-write
rector          581       557        77         65         52          rector, institution, director
starring        3000      2981       364        941        509         star, direct, film, role, play, cast

Results - Word clouds for deathCause and field property

Results - Inside a cluster

1 A married B - corresponding to trees marry.01 with at least A0 and A2 roles
2 A and B married - trees marry.01 with only A1 roles
3 A,..., married B, - trees marry.01 with only A1 roles
4 A was/is (firstly) married to B - trees marry.01 with at least A1 and A2 roles
5 A married to B - trees marry.01 with at least A1 and A2 roles
6 A, married to B, - trees marry.01 with at least A1 and A2 roles

Conclusions

Goal: extract lexicalizations of terms in the DBpedia ontology
Solution:
  Wikipedia - DBpedia relation
  Semantic role labeling graphs - a meaning-based alternative to dependency trees
  Spectral Clustering - unsupervised, connectivity-based clustering
Resulting clusters:
  (1) uniform in the root of the included trees: i) relevant for lexicalization, ii) but also not relevant
  (2) non-uniform
Future work:
  automatically identify the uniform clusters relevant for lexicalization: use or instance-based approach
  extend the testing to existing DBpedia lexicalizations
  use SRL with FrameNet for predicate-specific roles
  test on medical Linked Data (Diseasome, DrugBank)

Thank you

PropBank roles

Role     Meaning
A0       Subject
A1       Object
A2       Indirect object
AM-DIR   Direction
AM-LOC   Location
AM-MNR   Manner
AM-PRP   Purpose
AM-TMP   Temporal