Transforming Natural Language Query to SPARQL for Semantic Information Retrieval Sharmela Shaik #1, Prathyusha Kanakam*2, S Mahaboob Hussain#3, D
Total Page:16
File Type:pdf, Size:1020Kb
International Journal of Engineering Trends and Technology (IJETT) – Volume-41 Number-7 - November 2016 Transforming Natural Language Query to SPARQL for Semantic Information Retrieval Sharmela Shaik #1, Prathyusha Kanakam*2, S Mahaboob Hussain#3, D. Suryanarayana*4 #1PG Scholar, CSE, Vishnu Institute of Technology *2Assistant Professor, CSE, Vishnu Institute of Technology *4Professor, CSE, Vishnu Institute of Technology Bhimavaram, A.P, India Abstract— To retrieve the information in a semantic parts-of-speech tagging. NLP used for text mining, manner requires a special query language to apply machine translation, and automated question on the huge web and database. In general, the entire answering. Searching for information mainly relies user queries will be in the form of natural language on handling a large amount of unstructured data. to search using the traditional search engine Archived and streaming data have to be included applications and no guarantee that user will satisfy into the diagnosis to achieve proper results. Varying with the outcome results. According to the users, in data representations and the difficulties of querying the databases in natural language is a very processing unstructured data require additional easy method for the desired data but it might be support for engineers through predefined search difficult to understand the NL query by a machine. queries or diagnostic tools. Search queries have to Therefore, this paper clearly explains the procedure update to different logical representations. To and importance of reforming the natural language formulate and perform target search tasks, engineers (NL) query into SPARQL query to apply to the need additional support from IT experts more database to retrieve the accurate semantic results. frequent [3]. SPARQL is an RDF query language which is a semantic query language used to retrieve data and II. WORKING METHODOLOGY WITH LODQA give precise results. Natural language query is an SYSTEM English sentence interpreted by the computer and LODQA (Linked Open Data Question Answering) appropriate action taken. Thus, in this paper is a system that takes natural language query as input, architecture introduced to translate an NL query and finally produces SPARQL queries as output, into SPARQL to retrieve semantic results and together with the answers to them certain SPARQL compared them with the traditional search engines. endpoint [4]. Figure 1 represents the three modules of the system are Graphicator, TermFinder, and Keywords — Information retrieval, NLP, natural GraphFinder. language, RDF, SPARQL, semantic Web, semantic search, URIs, tagging, POS. I/P: Natural Language Query I. INTRODUCTION The Web offers huge amounts of unstructured textual documents that do not have a predefined data Graphicator TermFinder GraphFinder model do not fit well into relational tables, product descriptions, and reviews are representative of this type of information. Recently researchers discovered Psuedo Graph Anchored Psuedo that, unstructured information about 70-80 % of all Pattern Graph Pattern data in organizations growing fifty times of structured data [1]. The huge amounts of textual data available on the Internet and company intranets are an opportunity and a challenge to a number of O/P: SPARQL queries applications. It performs on the results of information retrieval (IR) engines and it requires a Fig 1: Architecture of a LODQA system certain level of understanding of the natural A. Graphicator language in which such textual data is given. For this It is the first module of the LODQA system. The and many other reasons, IR increased use of natural first module, which handles the NL query, is also language processing (NLP) methods [2]. responsible for parsing the NL query, and producing NLP used to analyse text, allowing machines to a graph representation of the query, and called as understand how humans speak. This human- pseudo graph pattern (PGP). A PGP contains nodes computer interaction used in real-world applications and relations. Typically, the nodes correspond to the like automatic text, summarization, semantic basic noun phrases (BNPs) and the relations to the analysis, topic extraction, named entity recognition, dependency paths between the BNPs as expressed in ISSN: 2231-5381 http://www.ijettjournal.org Page 347 International Journal of Engineering Trends and Technology (IJETT) – Volume-41 Number-7 - November 2016 the NL query. Additionally, a PGP specifies which GraphFinder is the third module of the LODQA node is the focus of the query, i.e. what the user system. For an APGP produced through the wants to get as the answer to the query, e.g., "Post Graphicator and the TermFinder modules, the Graduation” in the following example query. GraphFinder module is responsible for searching the NLQ: What is the Post Graduation course after target dataset for corresponding parts, considering engineering? For this query, the pseudo graph possible variations, which may occur in the dataset. pattern represented as in Figure 2. To absorb structural discrepancy between the APGP and actual structure in the target dataset, engineering GraphFinder attempts to generate SPARQL queries for all the possible structural variations. Then the after SPARQL queries submitted to the target endpoint and collected answers provided to the user. Post Graduation The graph finder converted into SPARQL query course with arguments and relations. These arguments are can be either a primitive type, such as S, N, or NP, Fig 2: Pseudo Graph Pattern or complex, such as S\NP, or NP/N. A forward A PGP is a graph pattern as to search a target slash denotes that the argument should appear to the RDF graph for sub-graphs that correspond to the right, while a backslash denotes that the argument PGP. However, it called as a “pseudo” graph pattern, should appear on the left. The graph finder has Parts since it is not yet grounded at the target dataset. of Speech (POS) grammar notations and some of them represented in Table I. B. TermFinder: It is the second module of the LODQA system. TABLE I PARTS OF SPEECH (POS) REPRESENTATION Once Graphicator module has produced a PGP from Tag Description a given NL query, the TermFinder module is DT Determiner responsible for finding the URIs and values of the IN Preposition subordinating conjunction nodes in the PGP the URIs and the values have to be NN Noun, singular or mass actually present in the target dataset. Other than that, NNS Noun, plural NNP Proper noun, singular there is no chance for the PGP to match with any NNPS Proper noun, plural part of the dataset. For the normalization, each node WP Wh –pronoun of the PGP is connected to a URI in the dataset. It is VBZ Verb, 3rd person singular present described as the PGP and is represented in the VB Verb, base form dataset. It is also called as the PGP as an anchored To represent the third module GraphFinder, PGP (APGP). A natural language term may be consider the query. normalized to more than one RDF terms due to NLQ:"What is the Post Graduation course after ambiguity. Therefore, more than one APGPs may be engineering?" produced from one PGP through normalization. For The resultant GraphFinder represented in Figure 3 instance, consider the NL query. with all the terms separated and denoted with POS NLQ:"What is the Post Graduation course after tags. Here, every term in the sentence represented engineering?" with any one of the POS tags. Thus, words in Now, the Termfinder is applied to the above NL sentence is lemmatized and assigned with proper query, URIs will be formed and represented as three categories of grammar for easy identification of terms {Post Graduation; engineering; after}. triplets. C. GraphFinder What is the Post Graduation course after engineering /WP/WP /VBZ/VB /DT/DT /NN/NN /NN/NN /IN/IN /NN/NN ARG2 ARG1 Fig 3: GraphFinder for the query with POS tags representation. III. EXPERIMENTAL ANALYSIS natural language query. Now, the duty of this search The importance of the machine is to provide the engine is to retrieval according to the query given. relevant information to the users. Here, the machine Here, the novelty of this paper is to introduce this is the search engine that take inputs in the form of a SPARQL and RDF database for semantic ISSN: 2231-5381 http://www.ijettjournal.org Page 348 International Journal of Engineering Trends and Technology (IJETT) – Volume-41 Number-7 - November 2016 information retrieval [5]. The natural language query prefix dbo: <http://dbpedia.org/ontology/> translates into SPARQL query for retrieving of prefix dbp: <http://dbpedia.org/property/> accurate results from RDF knowledge base [6]. prefix dbpedia: <http://dbpedia.org/resource/> SPARQL also having a select statement to extract prefix rdf: <http://www.w3.org/1999/02/22-rdf- the data and results will be displayed in tabular form. syntax-ns#> The SPARQL is a query language, which traverse prefix rdfs: <http://www.w3.org/2000/01/rdf- through RDF graph. schema#> Therefore, the natural language should convert prefix tto: <http://example.org/tuto/ontology#> into a specific syntactical format, i.e., SPARQL. In prefix ttr: <http://example.org/tuto/resource#> between the conversion of NL query to SPARQL the prefix xsd: <http://www.w3.org/2001/XMLSchema#> sentence undergoes a process for lemmatization and prefix ont:<http://www.co- parts of speech (POS) tagging. This procedure will ode.org/ontologies/ont.owl#> make ready to convert the sentence into SPARQL SELECT ?thing ?postgraduationcourse ?duration ?e with important terms by removing all necessary ntranceexam ?job words by this lemmatization process [7]. The main WHERE { ?thing rdf:type dbo:engineering . } purpose of Natural Language Query is for an English To retrieve the semantic results, this SPARQL sentence to be interpreted the computer and query is applied on the RDF database. This process appropriate action taken and the query is translated is applied to several natural language queries and into SPARQL query.