Using Linked Data for Semi-Automatic Guesstimation
Jonathan A. Abourbih, Alan Bundy, and Fiona McNeill∗
[email protected], [email protected], [email protected]
University of Edinburgh, School of Informatics
10 Crichton Street, Edinburgh, EH8 9AB, United Kingdom

Abstract

GORT is a system that combines Linked Data from several Semantic Web data sources to solve guesstimation problems, with user assistance. The system uses customised inference rules over the relationships in the OpenCyc ontology, combined with data from DBPedia, to reason and perform its calculations. The system is extensible with new Linked Data as it becomes available, and is capable of answering a small range of guesstimation questions.

Introduction

The true power of the Semantic Web will come from combining information from heterogeneous data sources to form new knowledge. A system that is capable of deducing an answer to a query, even when necessary knowledge is missing or incomplete, could have broad applications and appeal. In this paper, we describe a system that uses Linked Data to perform such compound query answering, as applied to the domain of guesstimation—the process of finding approximate (within an order of magnitude) numeric answers to queries, potentially combining unrelated knowledge in interesting ways (Weinstein and Adam 2008). The hypothesis that we are attempting to test is that data contained in heterogeneous Linked Data sources can be reasoned over and combined, using a small set of rules, to calculate new information, with occasional human intervention.

In this paper, we describe our implementation of a system, called GORT (Guesstimation with Ontologies and Reasoning Techniques), that performs compound reasoning to answer a small range of guesstimation questions. The system's inference rules are implemented in Prolog, using its Semantic Web module (Wielemaker, Schreiber, and Wielinga 2003) for RDF inference. The system uses the OpenCyc OWL ontology as a basis for its reasoning, and uses facts from other Linked Data sources in its calculations. This work is the result of a three-month Master's dissertation project (Abourbih 2009), supervised by Bundy and McNeill. Only the advent of interlinked data sources has made this system possible.

The rest of this paper is organised as follows. First, we describe prior work in the fields of deductive query answering and Semantic Web systems. Next, we outline the process of guesstimation. Then, we describe the organisation and implementation of GORT. Finally, we close with an evaluation of the system's performance and adaptability, a comparison with several related systems, and a brief section on future work.
Literature Survey

Combining facts to answer a user query is a mature field. The DEDUCOM system (Slagle 1965) was one of the earliest systems to perform deductive query answering. DEDUCOM applies procedural knowledge to a set of facts in a knowledge base to answer user queries, and a user can also supplement the knowledge base with further facts. DEDUCOM also uses theorem-proving techniques to show that the calculated answer follows logically from the axioms in the knowledge base. The QUARK/SNARK/Geo-Logica systems (Waldinger et al. 2003) exploit large, external data sources to perform deductive query answering; however, their focus was on yes/no and geographical queries rather than numeric calculations. Cyc (Lenat 1995) is a large, curated knowledge base that can also apply deductive reasoning to derive new numeric facts in response to a user query. Wolfram|Alpha (http://www.wolframalpha.com) is a newer system that combines facts in a curated knowledge base to arrive at new knowledge, and can produce a surprisingly large amount of information from a very simple query.

Query answering over the Semantic Web is a new field. The most developed system in this field is PowerAqua (Lopez, Motta, and Uren 2006), which uses a Semantic Web search engine to perform fact retrieval, but it does not do any deduction based on that knowledge. There has also been work on using Prolog's Semantic Web module in conjunction with custom-built ontologies to construct interactive query systems (Wielemaker 2009). Our GORT system, described in this paper, builds on this prior work in knowledge-based systems and on Semantic Web reasoning with Prolog. The system implements inference rules and plans for combining facts in Semantic Web data sources to derive new information in response to a user query.
∗Funded by ONR project N000140910467.
Copyright © 2010, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Guesstimation

Guesstimation is a method of calculating an approximate answer to a query when some needed fact is partially or completely missing. The types of problems that lend themselves well to guesstimation are those for which a precise answer cannot be known or easily found. Weinstein and Adam (2008) present several dozen problems that lend themselves to the method, and illustrate a range of guesstimation tactics. Two worked examples, of the type that GORT has been designed to solve, are given below.

Example 1. How many golf balls would it take to circle the Earth at the equator?

Solution. It is clear that this problem can be solved by calculating:

    Circumference(Earth) / Diameter(GolfBall)    (1)

The solution requires the circumference of the Earth and the diameter of a golf ball. Weinstein and Adam (2008) show how the Earth's circumference can be deduced from common knowledge: a 5-hour flight from Los Angeles to New York crosses 3 time zones at a speed of about 1000 km/h, and there are 24 time zones around the Earth, yielding:

    (5 h × 24 time zones × 10³ km/h) / 3 time zones ≈ 4 × 10⁴ km ≈ 4 × 10⁷ m.

Note that in (1), Earth is an individual object, but GolfBall does not denote any particular golf ball; rather, it denotes an arbitrary, or 'prototypical', golf ball. To distinguish between the class of objects called GolfBall and the arbitrary golf ball, we refer to the latter as εGolfBall. Continuing, we make an educated guess for Diameter(εGolfBall) of about 4 cm. Substituting into (1) gives:

    Circumference(Earth) / Diameter(εGolfBall) ≈> (4 × 10⁷ m) / (4 × 10⁻² m) ≈> 10⁹

In the solution to Example 1, the notation εS represents an arbitrary, 'prototypical' element from the set S. This notation and interpretation of our ε-operator is based on that of the Hilbert-ε operator (Avigad and Zach 2007).
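The arithmetic in Example 1 is simple enough to check mechanically. The following Python sketch is illustrative only (GORT itself expresses such steps as Prolog inference rules over RDF data); the values are the common-knowledge estimates quoted above:

```python
# Order-of-magnitude estimate for Example 1: golf balls around the Earth.
flight_hours = 5              # Los Angeles to New York
time_zones_crossed = 3
speed_km_h = 1e3              # about 1000 km/h
zones_around_earth = 24

# Circumference(Earth): (5 h x 24 zones x 10^3 km/h) / 3 zones = 4 x 10^4 km
circumference_km = flight_hours * zones_around_earth * speed_km_h / time_zones_crossed
circumference_m = circumference_km * 1000          # 4 x 10^7 m

golf_ball_diameter_m = 4e-2   # educated guess: about 4 cm

count = circumference_m / golf_ball_diameter_m
print(f"{count:.0e}")         # → 1e+09
```

The result agrees with the 10⁹ derived in the text; within guesstimation's order-of-magnitude tolerance, floating-point rounding in a sketch like this is irrelevant.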
Example 2. What would be the total area occupied if all humans in the world were crammed into one place?

Solution. We begin by summing up the area required by each human on Earth (2). If we assume that every human requires the same area, to within an order of magnitude, then this rewrites to:

    Σ_{e ∈ Humans} Area(e)    (2)
    ≈> Area(εHumans) × |Humans|,    (3)

where εHumans is an arbitrary, representative element of the set of all Humans, and ≈> represents a rewrite that is accurate to within an order of magnitude. The value for Area(εHumans) is a somewhat arbitrary choice (how close is too close?), so we make an educated guess of about 1 m². It is also generally accepted common knowledge that |Humans|, the population of the Earth, is about 6 × 10⁹. Substituting into (3) gives:

    Area(εHumans) × |Humans| ≈> 1 m² × (6 × 10⁹) ≈> 6 × 10⁹ m²

Guesstimation Plans

The examples from the previous section follow patterns that are common to a range of guesstimation queries. In this section, we generalise the examples above into plans that can be instantiated for other similar problems.

Count Plan

Example 1 asks for a count of how many golf balls fit inside the circumference of the Earth. This can be generalised to the form, "How many of some Smaller entity will fit inside some Bigger entity?" and expressed as:

    Count(Smaller, Bigger) ≈> F(Bigger) / G(Smaller),    (4)

where Count(Smaller, Bigger) represents the count of how many Smaller objects fit inside some Bigger object. The functions F(x) and G(x) represent measurements on an object, such as volume, mass, length, or duration; the units of the two measurements must be the same. The variables Smaller and Bigger may be instantiated with a class or an instance. In Example 1, this guesstimation plan was instantiated with the substitutions:

    Smaller/GolfBall, and Bigger/Earth.

Size Plan

Example 2 asks for the total area that would be occupied by the set of all humans. This, too, can be generalised: "What is the total of some property F(e) for all instances e of a class S?", expressed as:

    Σ_{e ∈ S} F(e) ≈> F(εS) × |S|,    (5)

where |S| represents the count of elements in the set S, and εS is an arbitrary, 'typical' element of S. In Example 2, this guesstimation plan was instantiated with the substitution S/Humans.
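Once their arguments are fixed, both plans reduce to one-line computations. The Python sketch below is purely illustrative (in GORT the plans are Prolog inference rules, and the measurement values would come from Linked Data rather than literals); the numbers are taken from the two worked examples:

```python
def count_plan(f_bigger: float, g_smaller: float) -> float:
    """Count Plan (4): how many Smaller objects fit inside Bigger,
    as F(Bigger) / G(Smaller); both measurements must share units."""
    return f_bigger / g_smaller

def size_plan(f_typical: float, set_size: float) -> float:
    """Size Plan (5): total of a property F over a set S, approximated
    as F(epsilon-S) x |S| for a 'typical' element of S."""
    return f_typical * set_size

# Example 1: Smaller/GolfBall, Bigger/Earth (both lengths in metres)
print(count_plan(4e7, 4e-2))   # ≈ 10⁹ golf balls

# Example 2: S/Humans (area of a typical human in m², population count)
print(size_plan(1.0, 6e9))     # ≈ 6 × 10⁹ m²
```

Keeping the plans as generic functions of measurements mirrors the paper's point that the same pattern can be re-instantiated for other Smaller/Bigger or S substitutions.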
Implementation

A system to perform compound query answering has a small set of requirements. First, the system needs some general background knowledge of the nature of the world. This knowledge is encoded in a basic ontology that forms the heart of the system's reasoning. Second, the system needs access to some set of numeric ground facts with which to compute answers. These facts may be available directly in the basic ontology, or from other linked data sources. Third, the system needs to have some procedural knowledge, and needs to know how to apply the procedures to advance a …

… population of Canada from the CIA Factbook ontology if it is provided with the URIs for Canada and population.
The system is also capable of retrieving a fact for a given Subject and Predicate by following the graph of owl:sameAs relationships on Subject and rdfs:subPropertyOf relationships on Predicate until an appropriate Object is found.

[Figure: GORT architecture: User Interface; Inference System (Domain Independent, Domain Specific); Knowledge Base (Basic Ontology)]
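The retrieval strategy just described can be illustrated with a toy in-memory store. This Python sketch is a deliberately simplified stand-in: the identifiers and the population value are invented for illustration, it follows only one hop of each relationship (the strategy in the text walks the whole graph), and GORT's actual implementation sits on SWI-Prolog's RDF store:

```python
# Toy triple store keyed by (subject, predicate). All identifiers and the
# population value are illustrative stand-ins, not real Linked Data URIs.
triples = {
    ("factbook:Canada", "factbook:population"): 33_000_000,
}
same_as = {"dbpedia:Canada": ["factbook:Canada"]}                # owl:sameAs
sub_property = {"dbpedia:population": ["factbook:population"]}   # rdfs:subPropertyOf

def lookup(subject, predicate):
    """Try (subject, predicate) directly, then retry with subjects reached
    via owl:sameAs and predicates reached via rdfs:subPropertyOf."""
    for s in [subject] + same_as.get(subject, []):
        for p in [predicate] + sub_property.get(predicate, []):
            if (s, p) in triples:
                return triples[(s, p)]
    return None

print(lookup("dbpedia:Canada", "dbpedia:population"))  # → 33000000
```

The fact is stored only under the Factbook identifiers, yet the query phrased in DBPedia terms still succeeds, which is the behaviour the paragraph above describes.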