Generating Lexicalization Patterns for Linked Open Data

Rivindu Perera and Parma Nand
Auckland University of Technology
Auckland, New Zealand
{rperera, pnand}@aut.ac.nz

Proceedings of the Second Workshop on Natural Language Processing and Linked Open Data, pages 2–5, Hissar, Bulgaria, 11 September 2015.

Abstract

The concept of Linked Data has attracted increased interest in recent times due to its free and open availability and sheer volume. We present a framework to generate patterns which can be used to lexicalize Linked Data. We use DBpedia as the Linked Data resource, which is one of the most comprehensive and fastest growing Linked Data resources available for free. The framework incorporates a text preparation module which collects and prepares the text, after which Open Information Extraction is employed to extract relations which are then aligned with triples to identify patterns. The framework also uses the lexical semantic resources VerbNet and WordNet to mine patterns. The framework achieved 70.36% accuracy and a Mean Reciprocal Rank value of 0.72 for five DBpedia ontology classes, generating 101 lexicalizations.

1 Introduction

The Semantic Web continues to grow rapidly in various forms. Two key areas that recent Semantic Web research has focused on are the enrichment of Linked Data resources and the use of these resources in different applications.

DBpedia, Freebase, and YAGO (dbpedia.org, freebase.com, mpi-inf.mpg.de/yago/) are frontrunners in the Linked Data area. Linked Data is represented as triples (a data structure in the form of ⟨subject, predicate, object⟩) using the Resource Description Framework (RDF). As the Linked Data concept moves forward, there is also a need to utilize this data in applications. A major area that requires Linked Data is Natural Language Processing (NLP) and applications such as Question Answering (QA) (Perera, 2012a; Perera, 2012b). A drawback of Linked Data is that it lacks the linguistic information which can be used to turn it back into a natural textual format.

Generating linguistic structures and choosing words to communicate a particular abstract representation (e.g., a triple) is referred to as lexicalization, which is a subtask of Natural Language Generation (NLG). The work described in this paper is a part of our NLG project (http://rivinduperera.com/information/realtextlex) currently under way (Perera and Nand, 2014a; Perera and Nand, 2014b; Perera and Nand, 2014c). The framework presented in this paper uses DBpedia as the Linked Data resource, and lexicalization is presented as mining the best available pattern to generate a natural language representation for the triple being considered.

The remainder of the paper is structured as follows. Section 2 presents related work in the area of lexicalization. In Section 3 we describe the proposed framework in detail. Section 4 presents the experiments used to validate the framework. Section 5 concludes the paper with an outlook on future work.

2 Related work

Duma and Klein (2013) present an approach to extract templates to verbalize triples using a heuristic. The main drawbacks noticed in this model are that it ignores additional textual resources and gives little consideration to cohesive pattern generation.

The Lemon model (Walter et al., 2013) extracts lexicalizations for DBpedia using dependency patterns extracted from Wikipedia sentences. However, the initial experiments we performed have shown that this approach fails completely when provided with sentences with grammatical conjunctions.

Ell and Harth (2014) introduce a language-independent approach to generate RDF verbalization templates. This model utilizes the maximal sub-graph pattern extraction model. However, in our approach Open Information Extraction (OpenIE) is utilized to get more coherent lexicalization patterns (Perera and Nand, 2015a; Perera and Nand, 2015b).

3 RealTextlex framework

Fig. 1 depicts the high-level overview of the process of generating lexicalization patterns in the proposed framework. The process starts with a given DBpedia ontology class (e.g., person, organization, etc.). The following sections explain the process in detail.

Figure 1: Schematic representation of the complete framework

3.1 Candidate sentence extraction

The objective of the candidate sentence extractor is to identify potential sentences that can lexicalize a given triple. The input is taken as a collection of co-reference resolved sentences and a set of triples. This unit first verbalizes the triples using a set of rules. Then each sentence is analysed to check whether the complete subject (s), the object (o), or the predicate (p) is mentioned in the sentence (S). This sentence analysis assigns a score to each sentence based on the presence of a triple. The score is the ratio of subject, predicate and object present in the sentence.

3.2 Open Information Extraction

Once the candidate sentences are selected for each triple, we then extract relations from these candidate sentences employing Open IE. Open IE (Etzioni et al., 2008) essentially focuses on domain-independent relation extraction and predominantly targets the web as a corpus for deriving the relations. The framework proposed in this paper uses textual content extracted from the web, which works with a diverse set of domains. Specifically, the framework uses the Ollie Open IE system (knowitall.github.io/ollie/) for relation extraction. This module associates each relation with the triple and outputs a triple-relations collection. A relation is composed of a first argument (arg1), a relation (rel), and a second argument (arg2).

3.3 Pattern processing and combination

This module generates patterns from the aligned relations of Section 3.2. In addition to these patterns, verb frame based patterns are also determined and added to the pattern list.

3.3.1 Relation based patterns

Based on the aligned relations and triples, a string based pattern is generated. These string based patterns can take two forms, as shown in Fig. 2 for two sample scenarios. The subject and object are denoted by the symbols s? and o? respectively.

• ⟨Walt Disney, birth date, 1901-12-05⟩
  – arg1: Walt Disney, rel: was born on, arg2: December 5, 1901
    pattern: s? was born on o?
• ⟨Walt Disney, designer, Mickey Mouse⟩
  – arg1: Mickey Mouse, rel: is designed by, arg2: Walt Disney
    pattern: o? is designed by s?

Figure 2: Basic patterns generated for two sample triples. s? and o? represent subject and object respectively.

3.3.2 Verb frame based patterns

The framework utilizes two lexical semantic resources, VerbNet and WordNet, to mine patterns. Currently, the framework generates only one type of pattern (s? Verb o?), if the predicate is a verb and if that verb has the frame {Noun phrase, Verb, Noun phrase} in either VerbNet or WordNet.

3.3.3 Property based patterns

The predicates which cannot be associated with a pattern in the processes described in Section 3.3.1 and Section 3.3.2 are properties belonging to the DBpedia resources selected. The leftover predicates are assigned a generic pattern (s? has ⟨predicate⟩ of o?) based on the specific predicate.

3.4 Pattern enrichment

Pattern enrichment adds two types of additional information: the grammatical gender related to the pattern and the multiplicity level associated with the determined pattern. When searching for a pattern in the lexicalization pattern database, this additional information is also mined in the lexicalization patterns for a given predicate of an ontology class.

3.4.1 Grammatical gender determination

The lexicalization patterns can be accurately reused later only if the grammatical gender is recorded with the pattern. For example, consider the triple ⟨Walt Disney, spouse, Lillian Disney⟩ and the lexicalization pattern "s? is the husband of o?". This pattern cannot be reused to lexicalize the triple ⟨Lillian Disney, spouse, Walt Disney⟩, because the grammatical gender of the subject is now different, even though the property (spouse) is the same in both scenarios. The framework uses three grammatical gender types (male, female, neutral) based on the triple subject, determined by the DBpedia grammatical gender dataset (Mendes et al., 2012).

3.4.2 Multiplicity determination

The DBpedia page for the Nile River lists three countries under the predicate "country" because the river does not belong to one country, but flows through these countries. However, the East River belongs only to the United States. The lexicalization patterns generated for these two scenarios will also be different and cannot be shared. For example, the lexicalization pattern for the Nile River will be of the form "s? flows through o?", and for the East River it will be like "s? is in o?". To address this variation, our framework checks whether there are multiple object values for the same subject and predicate, and then adds the appropriate property value (multiple/single) to the pattern.

[…]ness was considered as being both grammatically accurate and coherent. The results of this evaluation are shown in Fig. 3 for each of the ontology classes.

Figure 3: Analysis of syntactic correctness of the extracted patterns

Since the framework ranks lexicalization patterns using a scoring system, we considered it as a method that provides a set of possible outputs. We decided to get a statistical measurement incorporating Mean Reciprocal Rank (MRR), as shown below, to compute the rank of the first correct pattern of each predicate in each ontology class.

$$\mathrm{MRR} = \frac{1}{|P|} \sum_{i=1}^{|P|} \frac{1}{\mathit{rank}_i} \tag{1}$$

where $P$ and $\mathit{rank}_i$ represent the predicates and the rank of the correct lexicalization for the $i$-th predicate respectively. Table 2 depicts the MRR results for the 5 ontology classes being considered.
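The MRR measure of Eq. (1) can be illustrated with a minimal sketch. The function name `mean_reciprocal_rank` and the handling of predicates with no correct pattern (a `None` rank contributing zero) are assumptions made for illustration and are not taken from the paper; the ranks in the usage line are made-up values, not the reported results.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Compute MRR = (1/|P|) * sum(1/rank_i) over all predicates.

    ranks[i] is the 1-based rank of the first correct pattern for the
    i-th predicate. Assumption: None means no correct pattern was found,
    and such a predicate contributes 0 to the sum.
    """
    if not ranks:
        raise ValueError("at least one predicate is required")
    total = 0.0
    for rank in ranks:
        if rank is None:
            continue  # no correct pattern: reciprocal rank treated as 0
        if rank < 1:
            raise ValueError("ranks are 1-based")
        total += 1.0 / rank
    return total / len(ranks)


# Illustrative ranks for four hypothetical predicates.
print(mean_reciprocal_rank([1, 2, 1, 4]))  # 0.6875
```

A value close to 1 means the first correct pattern is usually ranked at the top of the list for each predicate.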

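The relation based patterns of Section 3.3.1, the generic fallback of Section 3.3.3, and the multiplicity check of Section 3.4.2 can be sketched as follows. This is only an illustration under stated assumptions: the `Triple` and `Relation` classes, the function names, and the use of normalized exact string matching for alignment are hypothetical, since the paper does not specify how arguments are matched. The country names in the usage lines are placeholders.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Relation:
    arg1: str
    rel: str
    arg2: str


def _norm(text: str) -> str:
    # Assumption: alignment uses case-insensitive exact matching.
    return text.strip().lower()


def relation_pattern(triple: Triple, relation: Relation,
                     verbalized_object: Optional[str] = None) -> Optional[str]:
    """Return 's? rel o?' or 'o? rel s?' if the relation aligns with the triple."""
    subj = _norm(triple.subject)
    obj = _norm(verbalized_object or triple.obj)
    arg1, arg2 = _norm(relation.arg1), _norm(relation.arg2)
    if arg1 == subj and arg2 == obj:
        return f"s? {relation.rel} o?"
    if arg1 == obj and arg2 == subj:
        return f"o? {relation.rel} s?"
    return None  # relation does not align with this triple


def property_pattern(predicate: str) -> str:
    """Generic fallback pattern for predicates with no mined pattern."""
    return f"s? has {predicate} of o?"


def multiplicity(triple: Triple, all_triples: Iterable[Triple]) -> str:
    """'multiple' if the subject/predicate pair has more than one object value."""
    objects = {t.obj for t in all_triples
               if t.subject == triple.subject and t.predicate == triple.predicate}
    return "multiple" if len(objects) > 1 else "single"


birth = Triple("Walt Disney", "birth date", "1901-12-05")
born_on = Relation("Walt Disney", "was born on", "December 5, 1901")
print(relation_pattern(birth, born_on, verbalized_object="December 5, 1901"))

nile = [Triple("Nile", "country", c) for c in ("Country A", "Country B", "Country C")]
print(multiplicity(nile[0], nile))
```

The optional `verbalized_object` argument stands in for the rule based verbalization step, which turns a literal such as 1901-12-05 into the surface form found in text.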
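The candidate sentence score of Section 3.1, described as the ratio of subject, predicate and object present in the sentence, might be computed along the following lines. The function name `score_sentence` and the case-insensitive substring test are assumptions for illustration; the paper does not state how presence is detected.

```python
def score_sentence(sentence: str, subject: str, predicate: str, obj: str) -> float:
    """Fraction of the verbalized triple elements that occur in the sentence.

    Assumption: an element counts as present if it appears as a
    case-insensitive substring of the sentence.
    """
    text = sentence.lower()
    parts = (subject, predicate, obj)
    # Empty strings are treated as absent rather than trivially present.
    present = sum(1 for part in parts if part and part.lower() in text)
    return present / len(parts)


score = score_sentence(
    "Walt Disney was born on December 5, 1901.",
    subject="Walt Disney",
    predicate="birth date",
    obj="December 5, 1901",
)
print(round(score, 2))
```

In this example the subject and the verbalized object occur in the sentence while the predicate label does not, so two of the three elements are counted.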