Dbpedia Based Factual Questions Answering System
Total Page:16
File Type:pdf, Size:1020Kb
IADIS International Journal on WWW/Internet Vol. 15, No. 1, pp. 80-95 ISSN: 1645-7641 DBPEDIA BASED FACTUAL QUESTIONS ANSWERING SYSTEM Maksym Ketsmur1, Mário Rodrigues2 and António Teixeira1 1Department of Electronics Telecommunications and Informatics / IEETA, University of Aveiro. Portugal 2Águeda School of Technology and Management / IEETA, University of Aveiro. Portugal ABSTRACT The creation of generic natural language query and answering (QA) systems is an active goal of the Semantic Web since it would allow people to conduct any query using their native language. Current solutions already handle factual questions mostly in English. The aim of this work was to develop a QA system to query knowledge bases (KB) such as DBpedia, in a first version, using factual questions in Portuguese, and in a second version, factual questions in English, French and German. This involves representing queries in terms of the KB ontology using SPARQL. The process of constructing a SPARQL query representing the natural language input involves determining: (1) the type of answer that is being sought - a person, a place, etc. - which is done by looking at the wh-words of wh-questions; (2) the main topic of the question - which person, place, etc. - obtained by morphosyntactic analysis to discover the potential subjects of the question; and (3) the properties that can be mapped to the KB ontology for creating a SPARQL query as precise as possible. The first version of the system, working for Portuguese, was tested by with 22 questions without guarantee that the answer was in the KB, and the multilingual version was tested with 30 random questions from QALD 7 (Question Answering over Linked Data) training set. The correctness of the answers was verified as well if the answer exists in the KB when the system did not produced results. A correct answer was produced for 67% of the questions for the Portuguese version and up to 55% (for English) of times for multi-language version considering that the answer existed in the KB. Results show that this approach is promising and further investigation should be done to improve it. The robustness observed, and capability to handle several languages, fosters future work to expand the system to answer questions of other types. KEYWORDS Question Answering, Multi-lingual, Knowledge Base 80 DBPEDIA BASED FACTUAL QUESTIONS ANSWERING SYSTEM 1. INTRODUCTION The predominant form of search on the internet is the use of keywords. However, the use of this form of information search is often inadvertent to express the true intention of the user (Song, D. et al., 2015). The traditional keyword searching returns an extensive list of possible results, where the user must spend some time to find the desired information. On one hand, a greater amount of information allows users to compare results and convey some conclusion, yet, it makes the search more difficult due to the quantity and growth of online information nowadays. So, systems that allow anyone to perform structured search become increasingly needed (Hirschman, L. et al., 2001; Navigli, R. et al., 2012). Using a well-structured search allows to better infer users' intent, returning the desired answer and not a list of documents as in traditional approaches. With this, a need to develop semantic search systems arise. Such systems are intended to allow users carrying out structured searches using natural language, and should identify users' intentions, generating formal semantic representations that are able to combine distinct sources of information to obtain an answer. Increasingly, the information is provided in the form of Linked Data, which consists of using the Web to interconnect different sources of information. These can be as diverse as the databases of two distinct organizations in different geographic locations. Technically, Linked Data refers to data published on the Internet in a way that can be understood by both humans and computers (Damljanovic, D. et al., 2011; Unger, C. et al., 2012). Semantic web data is published in Resource Description Framework (RDF) form, which is a standard for expressing data graphs and sharing them with other people and, perhaps more importantly, with machines. A large collection of tools and services emerged around the RDF (Bizer, C. et al., 2009; Segaran, T. et al., 2009). RDF is a language to express data models using triples, which are constituted by subject, predicate, and object. In addition, it adds several important concepts that make these models more accurate and robust, removing the ambiguity when transmitting semantic data between machines (Segaran, T. et al., 2009). The main goal of this study is to develop two versions of a QA system that allows users to get the desired information using natural language as a query. The characteristic points of this system are the use of Portuguese language in the first version, being expanded to handle English, French, and German languages in the second version, and being designed to operate directly over already existing knowledge bases (KB), being for the moment instantiated in the popular DBpedia. The paper continues in section two with the background and the relevant related work. The third section explains how the two version of QA system works and the fourth section shows representative results of each version and discusses them. The paper ends with the conclusions in section 5. 2. BACKGROUND AND RELATED WORK The first step in the development of QA systems is to understand how users search information and what kind of answer they expect (Hirschman, L. et al., 2001). Different types of questions can be asked and can be identified per the type of answer: factual, opinion, and abstract (Hirschman, L. et al., 2001). The different types of questions can be done in different 81 IADIS International Journal on WWW/Internet ways, such as indirect requests or even commands. “I would like to list all presidents of Portugal” is considered as an indirect request and “Name all presidents of Portugal” is considered as a command (Hirschman, L. et al., 2001). However, this can bring some difficulties to systems that heavily depend on identifying terms such as Who, Where, When and What in users' queries. For example, by identifying the word Who in the query, the system will determinate that the user wants to know about some human or organization, or the word Where that will indicate that the answer will be about some location. These types of systems cannot respond to the questions asked in indirect request form or even commands. Also, the user is not expected to perform search in a structured way, which in this case makes the autosuggest a good feature during the construction of the query, as seen in the TR Discover system. Not all types of questions are similar, there is evidence that some types are more difficult than others, such as Why and How tend to be more difficult to analyze because they require understanding of text and relations between entities (Hirschman, L. et al., 2001). The identification of the correct expected answer type reduces the number of possible answers, making the system more efficient and precise (Breck, E. et al., 2000). Different systems use distinct approaches to form the answer that is shown to the user. Answers can be formed in two ways, extraction or generation. In the first, the fragment that contains the answer is extracted from the original document and presented to user, while in the second the answer is extracted from multiple sentences or from several documents (Hirschman, L. et al., 2001). The ability of identifying users’ intention implies a natural language analysis where the main challenge is to understand the users’ intent that may be ambiguous. In addition, natural language processing aims to achieve a natural language analysis like humans, using different techniques that allow to identify entities, actions, relation between entities, etc. in a natural language sentence (Hirschman, L. et al., 2001; E. Liddy. et al., 2001). Many evaluations of semantic search systems referred that users prefer to use free natural language during the search rather than controlled inputs or view-based searches. Although the flexibility offered by this approach is a significant advantage, it can also become very complex (Kumar, A. et al., 2016). Allowing users full freedom in choosing terms increases the difficulty of these tools to disambiguate the search and understand the users’ intent (Elbedweihy, K. et al., 2013). Word sense disambiguation aims to understand what the terms mean. While for humans it is easy to understand the meaning of a word, for a machine it is a difficult task (Guha, R., et al., 2003; Lopes, L.S. et al. 2005; Elbedweihy, K. et al., 2013). Therefore, the use of semantic search tools is essential to match the document content with users’ intention (Hoffart, J. et al., 2011). Semantic search systems adopt different search approaches, ranging from natural language (free or guided) to visualization-based interfaces (forms or graphs). Each of these strategies provides different levels of user flexibility, query language expression and support during the query formulation (Elbedweihy, K. et al., 2013). However, as mentioned earlier, flexibility makes it difficult to map terms with concepts, properties and ontological entities. One of these difficulties is polysemy (a word with more than one meaning) and synonymy (multiple words with the same meaning). While the first affects accuracy by providing false correspondences, the second affects recall by causing the lack of true semantic correspondences. In this way, both use the word disambiguation techniques (Elbedweihy, K. et al., 2013). For example, to answer the question “How tall is ...?”, the query term tall needs to be mapped to the height property in the ontology. However, the term tall is polysemous and has different meanings including (from WordNet): "Great in vertical dimension", "High in stature", "Tall people", "Tall buildings", etc.