
AMERICAN UNIVERSITY OF BEIRUT

DBpediaSearch: An Effective Search Engine for DBpedia

by HIBA KHALED ARNAOUT

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science to the Department of Computer Science of the Faculty of Arts and Sciences at the American University of Beirut

Beirut, Lebanon February 2017

AMERICAN UNIVERSITY OF BEIRUT

THESIS, DISSERTATION, PROJECT RELEASE FORM

Student Name: Last First Middle

Master’s Thesis     Master’s Project     Doctoral Dissertation

I authorize the American University of Beirut to: (a) reproduce hard or electronic copies of my thesis, dissertation, or project; (b) include such copies in the archives and digital repositories of the University; and (c) make freely available such copies to third parties for research or educational purposes.

I authorize the American University of Beirut to: (a) reproduce hard or electronic copies of it; (b) include such copies in the archives and digital repositories of the University; and (c) make freely available such copies to third parties for research or educational purposes after:
One year from the date of submission of my thesis, dissertation, or project.
Two years from the date of submission of my thesis, dissertation, or project.
Three years from the date of submission of my thesis, dissertation, or project.

Signature Date

This form is signed when submitting the thesis, dissertation, or project to the University Libraries.

Acknowledgements

Firstly, I would like to thank my thesis advisor Prof. Shady Elbassuoni for supporting me throughout the course of this experience. It all started with an Information Retrieval course, during which I began my research journey with Dr. Elbassuoni. His patience, motivation, and positivity made everything look easier. I learned a lot from his vast experience in two domains, namely Information Retrieval and Machine Learning. I would like to thank him for teaching me how to write a proper research paper. I would like to thank my committee members Prof. Mohamad Jaber and Prof. Wassim ElHajj for their comments and suggestions. Additionally, I am very thankful for Dr. ElHajj’s constant support since day one. I am grateful to the American University of Beirut’s research board (URB) for funding my research. Finally, a special thanks to my family. Words cannot express how grateful I am to my mother, who encouraged me. My appreciation goes to my father for his moral and financial support, and for his interest in having a look at the results of my thesis.

An Abstract of the Thesis of

Hiba Arnaout for Master of Science Major: Computer Science

Title: DBpediaSearch: An Effective Search Engine for DBpedia

The active progress of knowledge-sharing communities like Wikipedia has made it achievable to build large knowledge-bases, such as DBpedia. These knowledge-bases use the Resource Description Framework (RDF) as a flexible data model for representing information on the Web. The semantic query language for RDF is known as SPARQL. Although querying with SPARQL gives very specific results, the user’s knowledge of the underlying data is a must. In this thesis, we propose an effective search engine for DBpedia. The search system takes SPARQL queries augmented with keywords as input and gives the most relevant results as output. To be able to do this, we develop a novel ranking model based on statistical machine translation for both triple-pattern queries and keyword-augmented triple-pattern queries. Our system supports automatic query relaxation in case no results were found. Our ranking model also takes into consideration result diversity in order to ensure that the user is provided with a wide range of aspects of the query results. We develop a diversity-aware evaluation metric based on the Discounted Cumulative Gain to evaluate diversified result sets. Finally, we build an evaluation benchmark on DBpedia, which we use to evaluate the effectiveness of our search engine.

Contents

Acknowledgements v

Abstract vi

1 Introduction 1
1.1 Motivation ...... 1
1.2 Challenges ...... 3
1.3 Contributions ...... 3

2 Indexing and Query Processing 6
2.1 Knowledge Graph: DBpedia ...... 6
2.2 Triple Patterns ...... 6
2.3 Keyword-augmented Triple Patterns ...... 7
2.4 Query Relaxation ...... 8

3 Ranking 11
3.1 Ranking Model ...... 11
3.1.1 Result Ranking ...... 11
3.1.2 Triple Weight ...... 14
3.1.3 Keyword Augmentation ...... 15
3.2 Relaxation Model ...... 17
3.3 Related Work ...... 18
3.3.1 Unstructured Queries on Unstructured Data ...... 18
3.3.2 Unstructured Queries on Structured Data ...... 19
3.3.3 Structured Queries on Structured Data ...... 20
3.3.4 Query Relaxation ...... 20
3.4 Experimental Evaluation ...... 22
3.4.1 Setup ...... 22
3.4.2 Results ...... 22

4 Diversity 26
4.1 Diversity Problem ...... 26
4.2 Maximal Marginal Relevance ...... 28
4.3 Diversity Notions in RDF Setting ...... 28
4.3.1 Resource-based Diversity ...... 29
4.3.2 Term-based Diversity ...... 29
4.3.3 Text-based Diversity ...... 30
4.4 Diversity-aware Re-ranking Algorithm ...... 31
4.5 Diversity-aware Evaluation Metric ...... 32
4.5.1 Resource-based Novelty ...... 32
4.5.2 Text-based Novelty ...... 32
4.6 Related Work ...... 33
4.7 Experimental Evaluation ...... 34
4.7.1 Evaluation Using Diversity-aware Metric ...... 34
4.7.2 Evaluation Using Crowdsourcing ...... 37

5 Efficiency 42
5.1 Database Statistics ...... 42
5.2 Running Time Statistics ...... 43
5.3 Running Time Challenges ...... 44

A Guidelines: Choosing the Better Result Set 45

B Guidelines: Assessing a Subgraph 48
B.1 Purely Structured Queries Guidelines ...... 48
B.2 Keyword-augmented Queries Guidelines ...... 48
B.3 Requiring Relaxation Queries Guidelines ...... 48
B.4 Keyword-augmented and Requiring Relaxation Queries Guidelines ...... 49
B.5 Gold Queries ...... 49

C Abbreviations 53

List of Figures

1.1 An example RDF graph on movies ...... 2
2.1 Facts table in PostgreSQL ...... 7
2.2 Sample answer of a SPARQL query in PostgreSQL ...... 7
2.3 Witnesscounts associated with DBpedia triples ...... 8
2.4 Triples with keywords and keyword weights ...... 8
3.1 A subgraph of the ”WikiLinks” graph ...... 15
4.1 A subgraph and its description ...... 38
5.1 Percentage of time needed per step ...... 44
A.1 Which set is better? ...... 46
A.2 One query and two sets of answers ...... 47
B.1 Purely structured queries guidelines ...... 49
B.2 Sample assessment question (purely structured) ...... 49
B.3 Keyword-augmented queries guidelines ...... 50
B.4 Sample assessment question (keyword-augmented) ...... 51
B.5 Requiring relaxation queries guidelines ...... 51
B.6 Sample assessment question (requiring relaxation) ...... 51
B.7 Keyword-augmented and requiring relaxation queries guidelines ...... 52
B.8 Sample assessment question (keyword-augmented and requiring relaxation) ...... 52

List of Tables

1.1 The set of RDF triples for the RDF graph in Figure 1.1 ...... 2
1.2 Un-ranked result set for: director/star of the same show. ...... 4
1.3 Non-diversified result set for: director couples. ...... 5
2.1 List of relaxed queries for movies starred by Allen and directed by Madonna ...... 9
2.2 List of possible subgraphs for queries in Table 2.1 ...... 9
3.1 Top-10 un-ranked results: producer/director of the same show. ...... 12
3.2 Top-10 ranked results: producer/director of the same show. ...... 13
3.3 Top-10 ranked results: producer/director of the same show, with the keyword: ”Chaplin”. ...... 14
3.4 Top-10 ranked results: people who graduated from Oxford and were born in Agatha Christie’s place. ...... 18
3.5 Top-10 ranked results of: people who worked with Frank Sinatra. ...... 19
3.6 Results for 100 evaluation queries: ranking algorithm ...... 23
3.7 Ours vs. No-rank explanation samples. ...... 23
3.8 Ours vs. Indegrees explanations samples ...... 24
3.9 Ours vs. LMRQ explanations samples ...... 25
4.1 Top-10 ranked results of: Democratic politicians who had roles in military events or crises. ...... 27
4.2 Average DIV-NDCG_10 of the training queries for different values of λ ...... 35
4.3 Average DIV-NDCG_10 of the test queries ...... 35
4.4 Top-5 results: director-actor couple query with and without diversity ...... 36
4.5 Top-5 results for the Disney query with and without diversity ...... 37
4.6 Top-5 results for the Columbia Pictures query with and without diversity ...... 37
4.7 Results for 100 evaluation queries: diversity algorithm ...... 38
4.8 No Diversity vs. Resource-based Diversity ...... 39
4.9 No Diversity vs. Term-based Diversity ...... 40
4.10 No Diversity vs. Text-based Diversity ...... 41
5.1 Database information ...... 43
5.2 Running time per query category ...... 43
5.3 Running time per diversity notion ...... 43

Chapter 1

Introduction

1.1 Motivation

The continuous growth of knowledge-sharing communities like Wikipedia and the active progress in modern information extraction technologies have made it achievable to build large knowledge-bases (KBs) such as DBpedia [1], Freebase [2], and YAGO [3]. In addition, some domain-specific knowledge-bases such as DBLife [4] and Libra [5] have also been constructed. These knowledge-bases use the Resource Description Framework (RDF) [6] as a flexible data model for representing the extracted information and for publishing it on the Web. The RDF data model is based on representing binary facts as triples of the form Subject-Predicate-Object, where Subjects are URIs for resources, Predicates are URIs for properties, and Objects are URIs for resources or literals. A certain fact can be represented in a table store as a triple, or in a graph as two nodes with one edge relating them. As an example, the statement “The sky has the color blue” can be converted into the triple ⟨Sky, hasColor, Blue⟩. It can also be transformed into two nodes, Sky and Blue, and one edge labeled hasColor. Unlike traditional web search, querying this type of structured data permits capturing the underlying semantics of the data. Figure 1.1 shows a larger example of a partial RDF graph of a movie knowledge base. Table 1.1 shows the corresponding RDF triples. The semantic query language for RDF is known as SPARQL [7]. It allows users to compose structured queries consisting of triple patterns and conjunctions. In general, a SPARQL query Q consists of n triple patterns, where a triple pattern is an SPO triple with variables. A variable occurring in one of the triple patterns can be seen again in another triple pattern in the same query Q, denoting a join condition. For example, if the user’s information need is to extract shows that were created and starred by the same person, the following triple-pattern query can be used: ⟨?show, creator, ?person⟩, ⟨?show, starring, ?person⟩.

Figure 1.1: An example RDF graph on movies

Subject (S) | Predicate (P) | Object (O)
Brad Pitt | description | American actor
Courteney Cox | description | American actor
Hello I Must Be Going | director | Courteney Cox
Chandler Bing | series | Friends
Chandler Bing | firstAppearance | Episode1.01
Matthew Perry | knownfor | Friends
Friends | starring | Courteney Cox
The Wife | guest | Courteney Cox
The Wife | director | Tom Cherones
Seinfeld | director | Tom Cherones
Seinfeld | genre | Sitcom
Friends | genre | Sitcom
Seinfeld | creator | Jerry Seinfeld
Seinfeld | starring | Jerry Seinfeld
Seinfeld | company | Castle Rock Entertainment

Table 1.1: The set of RDF triples for the RDF graph in Figure 1.1

Given this 2-triple-pattern query, the results are all the subgraphs with 2 triples that are isomorphic to the query Q when compared to the triples in the knowledge-base. For the above example, the predicates in a subgraph of size two (2 triples) should be creator and starring, and both the subjects and objects in both triples should be identical. One relevant result for the query is: ⟨Seinfeld, creator, Jerry Seinfeld⟩, ⟨Seinfeld, starring, Jerry Seinfeld⟩. Another result is: .
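To make the notion of isomorphic matching concrete, the following is a minimal Python sketch of how a 2-triple-pattern query with shared variables can be matched against a small set of triples through consistent variable bindings. The data and function names here are illustrative assumptions, not the actual implementation of the system.

# Minimal sketch: matching a 2-triple-pattern query against RDF triples.
# Variables start with '?'; a result is a consistent binding of all variables.
from itertools import permutations

triples = [
    ("Seinfeld", "creator", "Jerry Seinfeld"),
    ("Seinfeld", "starring", "Jerry Seinfeld"),
    ("Friends", "starring", "Courteney Cox"),
]

query = [("?show", "creator", "?person"),
         ("?show", "starring", "?person")]

def match(pattern, triple, binding):
    """Try to extend 'binding' so that 'pattern' matches 'triple'."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if new.get(p, t) != t:       # conflicting binding for the variable
                return None
            new[p] = t
        elif p != t:                     # constant does not match
            return None
    return new

results = []
for combo in permutations(triples, len(query)):   # candidate 2-triple subgraphs
    binding = {}
    for pat, tr in zip(query, combo):
        binding = match(pat, tr, binding)
        if binding is None:
            break
    if binding is not None:
        results.append(combo)

print(results)   # only the Seinfeld creator/starring subgraph matches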

1.2 Challenges

Although querying with SPARQL gives very specific results, we face a number of problems that we need to address in order to build our effective search engine:

• Result ranking: Running a SPARQL query over a knowledge base, such as DBpedia, would return a set of unranked results. Users usually prefer seeing a ranked result list rather than a list of un-ranked matches [8]. An example of a subset of an un-ranked set of results for the query ⟨?movie, director, ?person⟩, ⟨?movie, starring, ?person⟩, which requires movies directed and starred by the same person, is shown in Table 1.2. This example query produces 1793 subgraphs, making it difficult to choose the most relevant ones.

• In-expressiveness of triple-pattern queries: In some cases, the user’s information need cannot be completely written in terms of triple patterns. As an example: movies starred and directed by the same person, and related to New York and love. With the absence of triples which describe the plot of the movie, the second part of the question can only be expressed as a set of keywords such as [new york, love].

• Exact matching of triple-pattern queries: The user’s knowledge of the underlying data is a must. That is, the user must use the exact URIs for resources and predicates to construct a query. Triple-pattern queries are very rigid, and a query might have few or no results on many occasions. For example, a query asking for movies starring Scarlett Johansson and produced by a particular producer would return zero results if Scarlett Johansson never acted in a movie produced by that producer.

• Results diversity: It is often the case that the top-ranked results are dominated by one aspect of the query. Consider, for example, the query ⟨?film1, director, ?person1⟩, ⟨?film2, director, ?person2⟩, ⟨?person1, spouse, ?person2⟩, which asks for director couples. The ranked but non-diversified top-10 set of results is shown in Table 1.3. The set is dominated by Madonna and the two men she was married to. The user would prefer to see a variety of names instead of a homogeneous set of results.

1.3 Contributions

In this thesis, we develop a search engine that allows users to retrieve information from DBpedia.

Position | Subject (S) | Predicate (P) | Object (O)
1 | 45 (film) | starring | Peter Coster
  | 45 (film) | director | Peter Coster
2 | Mothers & Daughters | starring | David Conolly
  | Mothers & Daughters | director | David Conolly
3 | A Daughter of Australia | starring | Lawson Harris
  | A Daughter of Australia | director | Lawson Harris
4 | Out In Fifty | starring | Scott Leet
  | Out In Fifty | director | Scott Leet
5 | A Man Among Giants | starring | Rod Webber
  | A Man Among Giants | director | Rod Webber
6 | Single | starring | Wilder Shaw
  | Single | director | Wilder Shaw
200 | City Lights | starring | Charlie Chaplin
  | City Lights | director | Charlie Chaplin
1500 | Annie Hall | starring | Woody Allen
  | Annie Hall | director | Woody Allen

Table 1.2: Un-ranked result set for: director/star of the same show.

• We develop a novel ranking model based on statistical machine translation for triple-pattern queries.

• To address the in-expressiveness of triple-pattern queries, we allow keyword-augmented triple-pattern queries.

• Our system supports automatic query relaxation (similar to query modification in traditional information retrieval) in case no results were found.

• Our ranking model takes into consideration result diversity in order to ensure that the user is provided with a wide range of aspects of the query results. We develop the first result diversity algorithm in the RDF setting.

• We propose a novelty-aware evaluation metric to assess the result set with respect to both relevance and diversity.

• Finally, we create a DBpedia benchmark that consists of 100 queries: purely structured queries, keyword-augmented queries, queries that require relaxation, and keyword-augmented queries that require relaxation. We use the benchmark to evaluate our various models.

Subject (S) | Predicate (P) | Object (O)
W.E. | director | Madonna (entertainer)
The Pledge (film) | director | Sean Penn
Madonna (entertainer) | spouse | Sean Penn

Secretprojectrevolution | director | Madonna (entertainer)
The Pledge (film) | director | Sean Penn
Madonna (entertainer) | spouse | Sean Penn

W.E. | director | Madonna (entertainer)
The Crossing Guard | director | Sean Penn
Madonna (entertainer) | spouse | Sean Penn

W.E. | director | Madonna (entertainer)
The Platinum Collection (video) | director | Sean Penn
Madonna (entertainer) | spouse | Sean Penn

Secretprojectrevolution | director | Madonna (entertainer)
The Crossing Guard | director | Sean Penn
Madonna (entertainer) | spouse | Sean Penn

Secretprojectrevolution | director | Madonna (entertainer)
The Platinum Collection (video) | director | Sean Penn
Madonna (entertainer) | spouse | Sean Penn

W.E. | director | Madonna (entertainer)
Snatch (film) | director | Guy Ritchie
Madonna (entertainer) | spouse | Guy Ritchie

Secretprojectrevolution | director | Madonna (entertainer)
Snatch (film) | director | Guy Ritchie
Madonna (entertainer) | spouse | Guy Ritchie

W.E. | director | Madonna (entertainer)
RocknRolla | director | Guy Ritchie
Madonna (entertainer) | spouse | Guy Ritchie

Secretprojectrevolution | director | Madonna (entertainer)
RocknRolla | director | Guy Ritchie
Madonna (entertainer) | spouse | Guy Ritchie

Table 1.3: Non-diversified result set for: director couples.

Chapter 2

Indexing and Query Processing

The Resource Description Framework (RDF) represents information on the Web. An RDF knowledge graph, such as DBpedia [1], is a collection of RDF triples. In this chapter, we discuss setting up the appropriate environment for dealing with such data. In Section 2.1, we go over DBpedia [1] and its datasets. In Section 2.2, we discuss running triple-pattern queries over the stored facts, and in Section 2.3, running keyword-augmented queries. Finally, in Section 2.4, we discuss how to compose a set of relaxed queries in case the user’s query did not produce any results.

2.1 Knowledge Graph: DBpedia

DBpedia [1] is a large structured knowledge base extracted from the information in Wikipedia. The English version of DBpedia, which we use in this thesis, includes 5.8 million things: persons, places, creative works, organizations, species, and diseases. DBpedia [1] has 1337 unique relations and 15,778,883 facts. In order to manipulate the facts (i.e., triples), we use PostgreSQL [9], an object-relational database. We store the collection of facts in a 3-column table: a Subject column, a Predicate column, and an Object column. A sample facts table is shown in Figure 2.1. To enhance retrieval performance, we construct B-tree indexes on all columns: ”subject”, ”predicate”, and ”object”, as well as on all combinations of the columns: ”subject, predicate”, ”subject, object”, ”predicate, object”, and ”subject, predicate, object”.
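For illustration, the facts table and the B-tree indexes described above could be created as in the following sketch. This is a hedged example using the psycopg2 driver; the connection parameters, table name, and index names are assumptions for illustration, not necessarily those of the actual system.

# Sketch: creating the 3-column facts table and B-tree indexes in PostgreSQL.
# Requires the psycopg2 package; connection parameters are placeholders.
import psycopg2

conn = psycopg2.connect(dbname="dbpedia", user="postgres", password="secret")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS facts (
        subject   TEXT NOT NULL,
        predicate TEXT NOT NULL,
        object    TEXT NOT NULL
    );
""")

# B-tree indexes on single columns and on column combinations,
# mirroring the index set described in the text.
index_columns = [
    ("subject",), ("predicate",), ("object",),
    ("subject", "predicate"), ("subject", "object"),
    ("predicate", "object"), ("subject", "predicate", "object"),
]
for cols in index_columns:
    name = "idx_facts_" + "_".join(cols)
    cur.execute("CREATE INDEX IF NOT EXISTS {} ON facts ({});".format(name, ", ".join(cols)))

conn.commit()
cur.close()
conn.close()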

2.2 Triple Patterns

Querying DBpedia is accomplished using SPARQL [7]. A SPARQL query is a set of triple patterns and conjunctions. The same variable can be seen in more than one triple pattern of the query, indicating a join condition. For example, the query ⟨?m, starring, ?p⟩, ⟨?m, director, ?p⟩ requires the names of shows or movies that were directed and starred by the same person.

Figure 2.1: Facts table in PostgreSQL

Figure 2.2: Sample answer of a SPARQL query in PostgreSQL

Since we are using PostgreSQL to store our data, an SQL query should be created and executed for each SPARQL query. We develop a script that takes a SPARQL query as input and produces its corresponding SQL query. For our example, the SQL translation is: ”SELECT * FROM facts f1, facts f2 WHERE f1.subject=f2.subject and f1.object=f2.object and f1.predicate=’starring’ and f2.predicate=’director’;” A snapshot of the answer is shown in Figure 2.2. Each row represents one answer, which is a subgraph isomorphic to the query. In this case, a 2-triple-pattern query produces a 2-triple subgraph. In our ranking model, which we discuss in Chapter 3, we need to associate a weight with each triple. The weight of a fact indicates the importance of this fact. The computation of the weight is proposed in Section 3.1.2. To store the triple weights, we augment our facts table with a new column we call ”witnesscount”. A sample table is shown in Figure 2.3.
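The translation script can be sketched as follows. This is a simplified Python illustration of the idea, assuming the SPARQL query has already been parsed into a list of triple patterns; the real script may differ in details.

# Sketch: translating a conjunctive triple-pattern query into a self-join SQL
# query over the facts table. Variables start with '?'.
def to_sql(triple_patterns):
    aliases = ["f{}".format(i + 1) for i in range(len(triple_patterns))]
    conditions, seen_vars = [], {}
    for alias, (s, p, o) in zip(aliases, triple_patterns):
        for column, term in (("subject", s), ("predicate", p), ("object", o)):
            field = "{}.{}".format(alias, column)
            if term.startswith("?"):
                if term in seen_vars:                      # repeated variable: join condition
                    conditions.append("{}={}".format(seen_vars[term], field))
                else:
                    seen_vars[term] = field
            else:                                          # constant: equality condition
                conditions.append("{}='{}'".format(field, term.replace("'", "''")))
    return "SELECT * FROM {} WHERE {};".format(
        ", ".join("facts {}".format(a) for a in aliases), " AND ".join(conditions))

print(to_sql([("?m", "starring", "?p"), ("?m", "director", "?p")]))
# SELECT * FROM facts f1, facts f2 WHERE f1.predicate='starring'
#   AND f1.subject=f2.subject AND f2.predicate='director' AND f1.object=f2.object;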

2.3 Keyword-augmented Triple Patterns

Another type of query that our search engine accepts is the keyword-augmented structured query. It has the same format as the queries in Section 2.2, except that the user is allowed to add one or more keywords to the triple patterns. For example: ⟨?m, director, ?p⟩, ⟨?m, starring, ?p⟩ [romantic, comedy, new york]. This example asks for the name of a movie that was directed and starred by the same person.

Figure 2.3: Witnesscounts associated with DBpedia triples

Figure 2.4: Triples with keywords and keyword weights

It also indicates the user’s preference for receiving romantic comedy movies that happened in New York. The execution of this type of query is handled the same way as the queries in Section 2.2. All the subgraphs that are isomorphic to the query would be retrieved. In order to handle the keywords, each triple is associated with a set of keywords. In addition, each keyword associated with a certain triple also has a weight. The keyword weight is used to indicate the relevance of the triple to the keywords in the query. The details of keyword augmentation are discussed in Section 3.1.3. A sample of triples associated with keywords and keyword weights is shown in Figure 2.4.

2.4 Query Relaxation

The user must know the exact URIs for resources and predicates in order to construct a query that will return a set of results. Consider a structured query Q which consists of triple patterns q1, q2, ..., qn, where one or more triple patterns may be associated with a set of keywords w1, w2, ..., wl. The query can return no results on many occasions:

Conjunction has no results: In this case, the user knows the exact URIs for the query, and the triple patterns are written correctly. However, the combination of those triple patterns has no result subgraphs. The following query is an example of such a case: ⟨?m, starring, Woody Allen⟩, ⟨?m, director, Madonna (entertainer)⟩. The query requires the name of a movie that was starred by Woody Allen and directed by Madonna.

This query produces no results since the knowledge base does not contain any movies that were directed by Madonna and starred by Allen. Queries that have such a problem are handled in the following manner: the query is transformed into a number of relaxed queries, where the number of relaxed queries is equal to the number of constants in the main query. For our example, 4 queries are produced and displayed in Table 2.1. The table shows that for each relaxed query, one constant is replaced by a variable, and the constant (URI) is moved to the keywords space.

Query# | Relaxed Query
Q1 | ?m ?A Woody Allen [starring] ; ?m director Madonna (entertainer)
Q2 | ?m starring ?A [woody allen] ; ?m director Madonna (entertainer)
Q3 | ?m starring Woody Allen ; ?m ?A Madonna (entertainer) [director]
Q4 | ?m starring Woody Allen ; ?m director ?A [madonna entertainer]

Table 2.1: List of relaxed queries for movies starred by Allen and directed by Madonna

Table 2.2 shows a number of un-ranked subgraphs for the relaxed queries in Table 2.1.

Isomorphic to Q# | Subgraph
Q3 | Shadows and Fog starring Woody Allen ; Shadows and Fog starring Madonna (entertainer)
Q4 | Annie Hall starring Woody Allen ; Annie Hall director Woody Allen
Q4 | New York Stories starring Woody Allen ; New York Stories director Martin Scorsese
Q4 | Fading Gigolo starring Woody Allen ; Fading Gigolo director John Turturro
Q2 | W.E. starring Abbie Cornish ; W.E. director Madonna (entertainer)

Table 2.2: List of possible subgraphs for queries in Table 2.1

No matches for one or more triple patterns: This case can be divided into two subcases.

- Subject, Predicate, or Object in a triple pattern does not exist in DBpedia. For example, a query whose first triple pattern has the object ”Woody Allan” has no results, because that object does not exist in the knowledge base; more precisely, ”Allan” must be ”Allen”. This kind of problem is handled as follows: the constant causing the problem is replaced by a variable and then moved to the keyword space.

In this example, one of the relaxed queries produced is: . One sample result for the relaxed query is: . Another sample result is: .

- Subject, Predicate, and Object exist in DBpedia but the triple does not. For example, the query ⟨?book, author, Jacques Chirac⟩, ⟨?book, publisher, ?p⟩ requires the name of a book written by former president Chirac, and the name of the publisher of the book. In DBpedia, there exists a relation ”author” and an entity ”Jacques Chirac”, but there are no triples that include both of them at the same time. This problem is solved by relaxing the triple which is causing the issue into: ?book ?A Jacques Chirac [author]; ?book publisher ?p, and ?book author ?A [jacques chirac]; ?book publisher ?p. One sample answer is: . The subgraphs produced by all the relaxed queries of the main query are ranked based on our proposed model in Section 3.2.
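The construction of the relaxed queries for the case where the conjunction has no results can be sketched as follows. This is an illustrative Python sketch; the query representation and variable naming are assumptions.

# Sketch: for each constant in the failing query, produce one relaxed query in
# which that constant is replaced by a fresh variable and moved to the
# keyword space of its triple pattern.
def relax(query):
    """query: list of (s, p, o, keywords) tuples; variables start with '?'."""
    relaxed_queries = []
    for i, (s, p, o, keywords) in enumerate(query):
        for pos, term in enumerate((s, p, o)):
            if term.startswith("?"):
                continue                      # only constants are relaxed
            pattern = [s, p, o]
            pattern[pos] = "?A"               # fresh variable replacing the constant
            new_triple = (pattern[0], pattern[1], pattern[2],
                          keywords + [term.lower()])
            relaxed_queries.append(query[:i] + [new_triple] + query[i + 1:])
    return relaxed_queries

original = [("?m", "starring", "Woody Allen", []),
            ("?m", "director", "Madonna (entertainer)", [])]
for rq in relax(original):
    print(rq)   # four relaxed queries, one per constant, as in Table 2.1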

Chapter 3

Ranking

In this chapter, we discuss the problem of producing the top-k subgraphs for a user’s structured or keyword-augmented structured query. In Section 3.1, we propose a ranking model based on statistical machine translation. We also explain how to calculate the weight of a triple, augment a triple with keywords, and give weights to those keywords. In Section 3.2, we discuss the solution of the ”no results found” problem by proposing a query relaxation model. We review related work on ranking and query relaxation in Section 3.3. In Section 3.4, we finally use a benchmark of 100 queries in order to compare the effectiveness of our models with other state-of-the-art models.

3.1 Ranking Model

3.1.1 Result Ranking

Often, triple-pattern queries will return too many results, making it difficult for users to find the most relevant ones. Querying RDF graphs using SPARQL will result in subgraphs that match the query structure without taking into consideration which subgraph is more relevant than others. We develop a ranking model based on statistical language models (LMs). Consider a query Q which consists of n triple patterns q1, q2, ..., qn, and assume G is a result subgraph which consists of n triples t1, t2, ..., tn. To rank G with respect to Q, we use the following LM-based ranking:

P(Q|G) = \prod_{i=1}^{n} P(q_i | G)    (3.1)

We then use a translation model as follows:

P(q_i | G) = \sum_{j=1}^{m} P(q_i | t_j) P(t_j | G)    (3.2)

where P(t_j | G) is defined as the probability of generating triple t_j given result subgraph G:

P(t_j | G) = \begin{cases} \frac{1}{|G|} & \text{if } t_j \text{ is part of } G \\ 0 & \text{otherwise} \end{cases}    (3.3)

The probability of translating triple pattern qi into triple tj is defined in two ways depending on whether qi is keyword augmented or not.

In case qi is not keyword-augmented, we have:

P(q_i | t_j) = \frac{wc(t_j)}{\sum_{t \in \hat{q_i}} wc(t)}    (3.4)

where \hat{q_i} is the set of all triples matching q_i and wc(t) is the witnesscount (i.e., weight) of a triple t.
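Putting Equations 3.1-3.4 together, the score of a result subgraph for a purely structured query can be computed as in the following sketch. This is illustrative Python, assuming the matching triples and their witnesscounts have already been retrieved from the index; it is not the system's actual code.

# Sketch: ranking a result subgraph G for a query of n triple patterns
# (Equations 3.1-3.4). wc[t] is the witnesscount of triple t; matches[i] is
# the set of all triples in the knowledge base that match pattern q_i.
def score_subgraph(G, matches, wc):
    score = 1.0
    for i, _ in enumerate(matches):               # product over triple patterns (3.1)
        p_qi_given_G = 0.0
        for t in G:                               # sum over the triples of G (3.2)
            p_t_given_G = 1.0 / len(G)            # Equation 3.3
            if t in matches[i]:
                denom = sum(wc[x] for x in matches[i])
                p_qi_given_t = wc[t] / denom      # Equation 3.4
            else:
                p_qi_given_t = 0.0                # t does not translate q_i
            p_qi_given_G += p_qi_given_t * p_t_given_G
        score *= p_qi_given_G
    return score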

Example: Consider the following query Q: ⟨?show, producer, ?person⟩, ⟨?show, director, ?person⟩, which asks for a person who produced and directed the same show, his/her name and the name of the show. Running this query without a ranking model would produce the results shown in Table 3.1.

Subject (S) | Predicate (P) | Object (O)
Koi Tujh Sa Kahan | producer | Reema Khan
Koi Tujh Sa Kahan | director | Reema Khan
Chanda (film) | producer | S. Narayan
Chanda (film) | director | S. Narayan
The Bling Ring | producer | Sofia Coppola
The Bling Ring | director | Sofia Coppola
Tadpole (film) | producer | Gary Winick
Tadpole (film) | director | Gary Winick
Barbed Wire (1927 film) | producer | Rowland V. Lee
Barbed Wire (1927 film) | director | Rowland V. Lee
Tarnation (film) | producer | Jonathan Caouette
Tarnation (film) | director | Jonathan Caouette
Rock of Ages (2012 film) | producer | Adam Shankman
Rock of Ages (2012 film) | director | Adam Shankman
Crocodile Hunter (film) | producer | Wong Jing
Crocodile Hunter (film) | director | Wong Jing
Stoic (film) | producer | Uwe Boll
Stoic (film) | director | Uwe Boll
The Rosebud Beach Hotel | producer | Harry Hurwitz
The Rosebud Beach Hotel | director | Harry Hurwitz

Table 3.1: Top-10 un-ranked results: producer/director of the same show.

Now we run the same query, but we rank the results using our LM-based ranking model. The results are shown in Table 3.2.

Subject (S) | Predicate (P) | Object (O)
W.E. | producer | Madonna (entertainer)
W.E. | director | Madonna (entertainer)
None But the Brave | producer | Frank Sinatra
None But the Brave | director | Frank Sinatra
Snoop Dogg’s Hustlaz: Diary of a Pimp | producer | Snoop Dogg
Snoop Dogg’s Hustlaz: Diary of a Pimp | director | Snoop Dogg
Lincoln (2012 film) | producer | Steven Spielberg
Lincoln (2012 film) | director | Steven Spielberg
A.I. Artificial Intelligence | producer | Steven Spielberg
A.I. Artificial Intelligence | director | Steven Spielberg
Munich | producer | Steven Spielberg
Munich | director | Steven Spielberg
Amistad (film) | producer | Steven Spielberg
Amistad (film) | director | Steven Spielberg
Always (1989 film) | producer | Steven Spielberg
Always (1989 film) | director | Steven Spielberg
The Karnival Kid | producer | Walt Disney
The Karnival Kid | director | Walt Disney
The Opry House | producer | Walt Disney
The Opry House | director | Walt Disney

Table 3.2: Top-10 ranked results: producer/director of the same show.

In case q_i is keyword-augmented with keywords w_1, w_2, ..., w_l:

P(q_i | t_j) = \prod_{k=1}^{l} P(q_i, w_k | t_j)    (3.5)

where P(q_i, w_k | t_j) is defined as follows:

P(q_i, w_k | t_j) = \alpha \frac{wc(t_j, w_k)}{\sum_{t \in \hat{q_i}} wc(t, w_k)} + (1 - \alpha) \frac{wc(t_j)}{\sum_{t \in \hat{q_i}} wc(t)}    (3.6)

where \alpha is a smoothing parameter, and wc(t_j, w_k) is the witnesscount of keyword w_k associated with triple t_j. \alpha is set to 0.8.
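The keyword-augmented case of Equations 3.5 and 3.6 can be sketched analogously. This is illustrative Python; kw_wc[(t, w)] stands for wc(t, w) and is assumed to be pre-computed.

# Sketch: translation probability of a keyword-augmented triple pattern q_i
# into triple t_j (Equations 3.5 and 3.6), with smoothing parameter alpha = 0.8.
def keyword_translation_prob(t_j, keywords, matches_i, wc, kw_wc, alpha=0.8):
    prob = 1.0
    for w in keywords:                                   # product over keywords (3.5)
        kw_num = kw_wc.get((t_j, w), 0)
        kw_denom = sum(kw_wc.get((t, w), 0) for t in matches_i)
        wc_num = wc[t_j]
        wc_denom = sum(wc[t] for t in matches_i)
        kw_part = kw_num / kw_denom if kw_denom else 0.0
        wc_part = wc_num / wc_denom if wc_denom else 0.0
        prob *= alpha * kw_part + (1 - alpha) * wc_part  # Equation 3.6
    return prob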

Example: Consider the following query Q: ⟨?show, producer, ?person⟩, ⟨?show, director, ?person⟩ [chaplin], which asks for a person, preferably Charlie Chaplin, who produced and directed the same show, his/her name and the name of the show. Running this query using our ranking model, which allows keyword-augmented queries, would produce the results shown in Table 3.3.

Subject (S) | Predicate (P) | Object (O)
Modern Times (film) | producer | Charlie Chaplin
Modern Times (film) | director | Charlie Chaplin
Limelight (1952 film) | producer | Charlie Chaplin
Limelight (1952 film) | director | Charlie Chaplin
The Kid (1921 film) | producer | Charlie Chaplin
The Kid (1921 film) | director | Charlie Chaplin
A Countess from Hong Kong | producer | Charlie Chaplin
A Countess from Hong Kong | director | Charlie Chaplin
A King in New York | producer | Charlie Chaplin
A King in New York | director | Charlie Chaplin
A Dog’s Life | producer | Charlie Chaplin
A Dog’s Life | director | Charlie Chaplin
A Day’s Pleasure | producer | Charlie Chaplin
A Day’s Pleasure | director | Charlie Chaplin
Nashville (film) | producer | Robert Altman
Nashville (film) | director | Robert Altman
A Busy Day | producer | Mack Sennett
A Busy Day | director | Mack Sennett
Cruel, Cruel Love | producer | Mack Sennett
Cruel, Cruel Love | director | Mack Sennett

Table 3.3: Top-10 ranked results: producer/director of the same show, with the keyword: ”Chaplin”.

3.1.2 Triple Weight

In order to give each fact a certain value of importance, we associate each triple with a corresponding weight that denotes how important this triple is. These weights are pre-computed, indexed, and stored along the actual triples in the knowledge base. We calculate the weight of a certain triple by leveraging external corpora to estimate the importance of a given triple (i.e., fact). For instance, for a knowledge base constructed from Wikipedia, such as DBpedia, we use the number of links pointing to the subject and object of a given triple as a measure of the weight of the triple. The intuition behind this is that the more articles point to the subject and object of the triple, the more important the triple is. The following example presents two triples of TV shows and their companies. The triples have the same structure of t = (s, p, o) plus the new witnesscount denoting the importance of that triple, t = (s, p, o, w):

In the case of DBpedia, the link information is present as part of the knowledge base in the form of additional triples. For example, the triple ⟨Afghanistan, wikiPageWikiLink, BBC⟩ states that the Wikipedia article of Afghanistan has a link to the Wikipedia article of BBC. An example of links pointing to the ”BBC” entity is shown in Figure 3.1.

Figure 3.1: A subgraph of the ”WikiLinks” graph

More generally, to compute the weight of a triple in any knowledge base, co-occurrence statistics from external corpora such as the Web can be used. However, in this thesis, we build our search engine on top of a particular RDF knowledge base, DBpedia, which is based on Wikipedia, and thus we leverage the link structure to compute the weights of the triples in this knowledge base. In particular, the weight of a triple t = (s, p, o) is computed as shown in Equation 3.7.

WitnessCount(t) = |links(l, s)| + |links(l, o)|    (3.7)

where t is a triple, |links(l, s)| is the number of WikiLinks pointing to the Subject s in triple t, and |links(l, o)| is the number of WikiLinks pointing to the Object o in triple t.
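Given the wikilink triples, the witnesscounts of Equation 3.7 can be pre-computed roughly as follows. This is a Python sketch; the predicate name and the toy data are assumptions for illustration.

# Sketch: pre-computing WitnessCount(t) = |links(l, s)| + |links(l, o)|
# (Equation 3.7) from wikilink triples of the form (page, wikiPageWikiLink, target).
from collections import Counter

def compute_witnesscounts(facts, wikilink_triples):
    inlinks = Counter(target for _, _, target in wikilink_triples)
    return {(s, p, o): inlinks[s] + inlinks[o] for (s, p, o) in facts}

wikilinks = [("Afghanistan", "wikiPageWikiLink", "BBC"),
             ("London", "wikiPageWikiLink", "BBC")]
facts = [("BBC", "city", "London")]
print(compute_witnesscounts(facts, wikilinks))   # {('BBC', 'city', 'London'): 3}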

3.1.3 Keyword Augmentation

While RDF represents a flexible means to represent a variety of information or facts, some information is better suited as free text. For instance, in the domain of movies, a movie plot is typically text-based and cannot be represented naturally using RDF triples. To accommodate this, we augment the triples in DBpedia with keywords. This allows the user to combine SPARQL queries with keyword conditions to find information that cannot be found using triple-pattern queries only.

Example: If the user is interested in finding movies/shows whose directors also starred in them, and which in addition are movies/shows about war whose directors were nominated for Oscars, the user can issue the following keyword-augmented query: ⟨?m, director, ?p⟩, ⟨?m, starring, ?p⟩ [war, oscar].

One of the results for such a keyword-augmented query would be ⟨Henry V (1944 film), director, Laurence Olivier⟩, ⟨Henry V (1944 film), starring, Laurence Olivier⟩, whose description in Wikipedia contains the following snippet: “Henry V is a 1944 war movie. This was Laurence Olivier’s third Oscar-nominated performance.”

This step can be handled differently depending on the knowledge base searched. For example, in the case of DBpedia, each entity is associated with an ”abstract”, which is the first paragraph of the Wikipedia page representing the entity. Thus, for each triple in the knowledge base, we extract the abstracts of the subject and the object, stem those paragraphs, remove stop words and duplicates, and then associate the triple with the resulting list of keywords. For example, the triple ⟨Henry V (1944 film), director, Laurence Olivier⟩ is associated with the following keywords: [oscar, academi, 1944, shakespear, hamlet, war, cinema]. In the case of keyword-augmented queries, ranking the results should be based not only on the importance or weight of the triples, but also on the relevance of the triple to the keywords in the query. To be able to do this, we associate each keyword for a given triple with a weight reflecting the relevance of the triple with respect to the keyword. This score is expressed as the intersection of all the links pointing to the Subject entity and the Keyword entity, plus the intersection of all the links pointing to the Object entity and the Keyword entity. After augmenting the data with keywords, the keywords are given weights based on the following equation:

WitnessCount(k, t) = |inter(links(l, s), links(l, k))| + |inter(links(l, o), links(l, k))|    (3.8)

where WitnessCount(k, t) is the witnesscount of keyword k associated with triple t, |inter(links(l, s), links(l, k))| is the number of WikiLinks pointing at both the subject s and any other resource or literal containing the keyword k, and |inter(links(l, o), links(l, k))| is the number of WikiLinks pointing at both the object o and any other resource or literal containing the keyword k. For the above example, the triple ⟨Henry V (1944 film), director, Laurence Olivier⟩ is associated with the following weight-augmented keywords: [oscar: 175, academi: 1072, 1944: 245, shakespear: 595, hamlet: 458, war: 903, cinema: 165].
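Equation 3.8 can be computed from the same link structure as Equation 3.7. The following is a rough sketch; matching a keyword against resource names is simplified here to a substring test, which is an assumption for illustration only.

# Sketch: WitnessCount(k, t) (Equation 3.8) -- the number of pages that link to
# both the subject and some resource containing keyword k, plus the same
# quantity for the object.
from collections import defaultdict

def keyword_witnesscount(triple, keyword, wikilink_triples):
    linkers = defaultdict(set)                 # target resource -> pages linking to it
    for page, _, target in wikilink_triples:
        linkers[target].add(page)
    keyword_pages = set()
    for target, pages in linkers.items():
        if keyword in target.lower():          # resources mentioning the keyword
            keyword_pages |= pages
    s, _, o = triple
    return len(linkers[s] & keyword_pages) + len(linkers[o] & keyword_pages)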

16 3.2 Relaxation Model

The problem of ”no results found” is discussed in Section 2.4, as well as the reasons causing such a problem, and a solution is provided for each case. In this section, we discuss ranking the subgraphs produced by a set of relaxed queries. The goal of ranking the subgraphs produced by the relaxed queries is to push to the top the results that are as related to the given query as possible. Given Q, a structured query or a keyword-augmented structured query with zero results, the following steps are applied:

1. Q is relaxed based on the causes and solutions discussed in Section 2.4.

2. The relaxed queries are executed.

3. All subgraphs are retrieved and scored (a minimal sketch of this scoring is given after this list). More formally, the score of a subgraph G produced by a set of relaxed queries R is computed as shown in Equation 3.9:

P(R|G) = \frac{1}{k} \sum_{i=1}^{k} P(R_i | G)    (3.9)

where k is the size of R and P(R_i | G) is defined in Equation 3.10:

P(R_i | G) = \begin{cases} 0 & \text{if } G \notin \text{Result Set of } R_i \\ \text{Equation 3.1} & \text{otherwise} \end{cases}    (3.10)

The intuition behind this scoring is that subgraphs returned by more than one relaxed query should be given higher weights.

4. Rank the subgraphs in descending order of score.

5. Display top-k results.
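The scoring in step 3 can be sketched as follows. This is illustrative Python; ranking_score stands for the ranking model of Section 3.1 applied to G under relaxed query R_i and is assumed to be given.

# Sketch of Equations 3.9 and 3.10: the score of a subgraph G is the average,
# over all relaxed queries, of its ranking-model score under the queries whose
# result sets contain it.
def relaxed_score(G, relaxed_queries, result_sets, ranking_score):
    k = len(relaxed_queries)
    total = 0.0
    for R_i, results_i in zip(relaxed_queries, result_sets):
        if G in results_i:                   # Equation 3.10
            total += ranking_score(G, R_i)   # Equation 3.1
    return total / k                         # Equation 3.9

# Subgraphs returned by several relaxed queries accumulate more terms in the
# sum and are therefore pushed toward the top of the ranking.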

Example: Consider a query asking for people who graduated from Oxford and were born in Agatha Christie’s death place. The query returns zero results. The relaxation strategy is applied and the top-10 subgraphs are shown in Table 3.4.

Example: Consider a query involving the entity ”Frank Senatra”. The entity ”Frank Senatra” must be ”Frank Sinatra”. Because of this mistake, the query produces no results. The relaxation strategy is applied and the top-10 subgraphs are shown in Table 3.5.

Subject (S) | Predicate (P) | Object (O)
Agatha Christie | deathPlace | Oxfordshire
Roundell Palmer, 1st Earl of Selborne | birthPlace | Oxfordshire
Roundell Palmer, 1st Earl of Selborne | almaMater | University of Oxford

Viola Keats | deathPlace | United Kingdom
Robert Bridges | birthPlace | United Kingdom
Robert Bridges | almaMater | Oxford

Agatha Christie | deathPlace | Oxfordshire
David Cameron | birthPlace | Oxfordshire
David Cameron | almaMater | Eton College

Agatha Christie | deathPlace | Oxfordshire
Roger Norrington | birthPlace | Oxfordshire
Roger Norrington | birthPlace | Oxford

Agatha Christie | deathPlace | Oxfordshire
Martin Keown | birthPlace | Oxfordshire
Martin Keown | birthPlace | Oxford

Agatha Christie | deathPlace | Oxfordshire
Orlando Gibbons | birthPlace | Oxfordshire
Orlando Gibbons | birthPlace | Oxford

Agatha Christie | deathPlace | Oxfordshire
David Oyelowo | birthPlace | Oxfordshire
David Oyelowo | birthPlace | Oxford

Agatha Christie | deathPlace | Oxfordshire
Dexter Blackstock | birthPlace | Oxfordshire
Dexter Blackstock | birthPlace | Oxford

Agatha Christie | deathPlace | Oxfordshire
Dexter Blackstock | birthPlace | Oxfordshire
Dexter Blackstock | birthPlace | Oxford

Agatha Christie | deathPlace | Oxfordshire
Heather Angel (actress) | birthPlace | Oxfordshire
Heather Angel (actress) | birthPlace | Oxford

Agatha Christie | deathPlace | Oxfordshire
Garry Parker | birthPlace | Oxfordshire
Garry Parker | birthPlace | Oxford

Table 3.4: Top-10 ranked results: people who graduated from Oxford and were born in Agatha Christie’s death place.

3.3 Related Work

3.3.1 Unstructured Queries on Unstructured Data

The traditional information retrieval technique was based on running a keyword query over unstructured data, mostly text-based documents. In works such as [10], [11], and [12], given a query Q (a set of keywords) and a result (a text document), two probability distributions are computed for the query and the document. If the probability distributions are close enough, the document is considered highly relevant to the query. In this thesis, our corpus is a knowledge graph instead of a set of documents, our query is a keyword-augmented set of triple patterns instead of keywords, and our output is a set of ranked subgraphs instead of a set of relevant documents.

Subject (S) | Predicate (P) | Object (O)
None But the Brave | starring | Frank Sinatra
None But the Brave | producer | Frank Sinatra
On the Town (film) | starring | Frank Sinatra
On the Town (film) | producer | Arthur Freed
The Amazing Mr. Bickford | starring | Frank Zappa
The Amazing Mr. Bickford | producer | Frank Zappa
From Here to Eternity | starring | Frank Sinatra
From Here to Eternity | producer | Buddy Adler
Take Me Out to the Ball Game (film) | starring | Frank Sinatra
Take Me Out to the Ball Game (film) | producer | Arthur Freed
The Pride and the Passion | starring | Frank Sinatra
The Pride and the Passion | producer | Stanley Kramer
On the Town (film) | starring | Frank Sinatra
On the Town (film) | producer | Roger Edens
Double Dynamite | starring | Frank Sinatra
Double Dynamite | producer | Irwin Allen
High Society (1956 film) | starring | Frank Sinatra
High Society (1956 film) | producer | Sol C. Siegel
None But the Brave | starring | Frank Sinatra
None But the Brave | producer | William H. Daniels

Table 3.5: Top-10 ranked results of: people who worked with Frank Sinatra.

3.3.2 Unstructured Queries on Structured Data

The work in [13], [14], [15], and [16] is keyword search on XML data or graphs. The output of a keyword search in this case is a ranked list of trees or graphs. The authors in [17] and [5] proposed a model of entity search on records, where the output is a list of ranked entities. Our work ranks subgraphs consisting of triples instead of ranking entities. A closer work is [18], where the authors proposed a ranking model for keyword search over RDF graphs; the output in this case is a ranked list of subgraphs. Content-based measures such as tf*idf, which were applied in [17], [15], and [19], would be irrelevant in our work since we do not deal with terms. Finally, the graphs which are proposed as results in [13] and [18] do not have a specific structure or size. On the other hand, in our system, the results (subgraphs) and the query are isomorphic, so aggregation over the nodes and edges [17], [15], and [19] would not work for us. Our approach handles keyword-augmented structured queries, which is different from all works in this category.

3.3.3 Structured Queries on Structured Data

The work in [20] deals with SPARQL queries on RDF graphs. The result graphs are estimated based on the confidence of the result, which is irrelevant in our work, because all the triples are extracted from the same knowledge-base using the same method, thus having the same confidence values. It is also based on compactness, which is also unrelated since the size of the outputted subgraphs is determined by the user’s query (i.e., the number of triple patterns of the query). The third element of result estimation is the informativeness component, which we evaluate using the witnesscount. Unlike our work, NAGA does not support relaxed queries and keyword-augmented queries. It also does not address diversifying the set of results. The authors in [21] proposed extending the knowledge graph to account for missing entities, classes, predicates, or facts. They combine the knowledge graph with textual corpora that might or might not include facts that already exist in the knowledge graph. In our work, we do not modify the triples. The queries in this work are SPARQL plus text phrases, where the subject, predicate, or object can be augmented with one or more keywords. In our work, we allow augmenting the whole triple with one or more keywords. They handle query relaxation, which we do as well. However, they do not address the diversity problem. One of the closest related works to ours is [22, 23], where the authors propose searching RDF graphs with SPARQL and keywords. They introduce a language-model-based approach to ranking the results of exact, relaxed, and keyword-augmented queries over RDF graphs. Their work differs from ours in the following four aspects. First, our work develops a new ranking model based on statistical machine translation that utilizes the Wikipedia link structure. Second, the authors’ work was evaluated using two relatively small datasets, unlike our case where it is evaluated on DBpedia. Third, our approach performs a careful and efficient relaxation of queries to ensure that the relaxed queries are as close to the original query intention as possible. Finally, our approach takes into consideration not only the relevance of results but also their diversity, based on Maximal Marginal Relevance [24].

3.3.4 Query Relaxation

On relaxation, there is [25]. The authors develop a comprehensive set of query relaxation techniques, where the relaxation candidates can be collected from both the RDF data itself and from external ontological and textual sources. Their query relaxation techniques consist of four types of relaxations. The first one is replacing entities and relations with other entities and relations that have the same meaning; for example, bornIn can be substituted with citizenOf. The second type is replacing entities and relations with variables. This type has a lot in common with our relaxation techniques, in which we do not just add variables randomly but according to a certain study of why the query is not giving any results. The third type that the authors present is removing a certain triple entirely, thus converting an n-triple query into an (n-1)-triple query. In our work, we do not change the size of the main query; we neither remove nor add a triple pattern. The last type is dropping all or some of the keywords associated with a triple. Additionally, expansion terms could also be considered for each keyword. To rank their subgraphs, each relaxed query has its own weight: the higher the weight, the greater the similarity between the relaxed query and the main query.

Also on relaxation, there is [26]. The authors address the issue of relaxing the user query on RDF databases and computing the most relevant answers. They measure the similarities of the relaxed queries with the original query and design the algorithm to retrieve the most relevant answers as soon as possible. Not all the relaxed queries produced will be executed on the data, they will disregard the unnecessary ones. For our work, we are only producing relaxed queries that we think are suitable and close enough to the user’s main query.

The authors in [21] propose two strategies for query relaxation: structural relaxation and predicate paraphrasing. Structural relaxation is done by replacing a triple pattern by a set of triple patterns that denote a path. For example: is replaced by . The other type of query relaxation is done by generating paraphrases for each predicate. For example, the predicate graduatedFrom would be transformed into the textual predicate "went to".

In [27], the authors investigate techniques to find the parts of the main query that are responsible for its failure. They propose producing a set of relaxed queries, called Maximal Succeeding Subqueries (XSSs). These generated subqueries must have the maximal number of triple patterns of the initial query. They continue to improve their work in [28] by proposing three relaxation strategies. In their first strategy, they deal with the failure causes of the initial query. In their second strategy, they proceed to deal with the failure causes in the set of relaxed queries. Since some of the failure causes might not be discovered in the second strategy, they propose a third strategy, which is an optimization of the second one, where the gaps are filled.

21 3.4 Experimental Evaluation

3.4.1 Setup

In order to evaluate our ranking algorithm, we construct a benchmark of 100 queries over DBpedia, which can be broadly divided into 4 categories: structured, keyword-augmented, requiring relaxation, and keyword-augmented and requiring relaxation. We run the benchmark over four models, where we collect the top-10 subgraphs for each query. The first model is ours. The second model is the no-rank model, where the order of the subgraphs is random. The third model is similar to ours for the queries with no keywords; however, the witnesscounts of the facts are calculated differently, as shown in Equation 3.11. This model does not support keyword-augmented queries. The fourth and last model is [22], which has been proven to outperform [13], [29], and [5].

WitnessCount(t) = deg(s) + deg(o)    (3.11)

where deg(s) and deg(o) are the numbers of edges in the knowledge graph G that have s and o as objects, respectively. To gather the relevance assessments, we relied on the crowdsourcing platform CrowdFlower [30] as follows. We run each query over the four models and collect the top-10 results of each model. For each query, we submit two sets to be compared: one produced by our ranking model, and one produced by a competitor. More particularly, each set of ours goes through 3 comparisons: one with the unranked set, one with the set produced by the ”indegrees” model, and one with the set produced by the model in [22], which we call LMRQ. The assessor in each comparison must choose the set that she feels answers the query better. Appendix A gives more details about the guidelines. The assessors are also asked to pass a test and get at least a 90% score to be able to continue assessing. Moreover, a set of gold queries is embedded in the benchmark to catch and exclude any bad behavior or random answers.

3.4.2 Results

The results of our evaluation are shown in Table 3.6. The reported numbers are out of 100 queries. In the case of no-rank and indegrees, our model outperforms both competitors by winning 70% and 60% of the queries, respectively. For the third model, we proved to be as good as the state-of-the-art model, by having the same quality of result sets for 95% of the queries. Moreover, we did not lose any query against [22] (LMRQ), and we won 5% of the queries. Each assessor had to provide a valid explanation of her choice. In Tables 3.7, 3.8, and 3.9, we show some of these explanations.

Ours vs.: | No-rank | Indegrees | LMRQ
Won | 70 | 60 | 5
Tie | 29 | 36 | 95
Lost | 1 | 2 | 0

Table 3.6: Results for 100 evaluation queries: ranking algorithm

Ours vs. No-rank

WON
Q: List of music composers in Disney movies? (X = ours, Y = no-rank)
Comments: ”Y does not include any of music composers in Disney movies.” ”X lists all the movies from Disney.”
Q: List of NBA players who are also actors? (X = no-rank, Y = ours)
Comments: ”Y lists includes more famous NBA players who are also actors.” ”X list does not answer to the question.”

TIE
Q: List of couples who graduated from the same universities? (X = ours, Y = no-rank)
Comments: ”They both are good answers.” ”Different lists, both followed instruction.”
Q: List of scientists who were born and died in the same place? (X = no-rank, Y = ours)
Comments: ”They both are good answers.” ”Everything OK in both lists.”

LOST
Q: List of married actors who are influenced by the same people? (X = no-rank, Y = ours)
Comments: ”More results in X.” ”There are more actors in the X list.” ”Same.” ”Only correct list is X.”

Table 3.7: Ours vs. No-rank explanation samples.

Ours vs. Indegrees

WON
Q: List of death dates of Indian actors? (X = indegrees, Y = ours)
Comments: ”X list are not Indian.” ”Y More indian actors.”
Q: List of black comedy movies? (X = indegrees, Y = ours)
Comments: ”Y list includes black comedies.” ”List X is irrelevant.”

TIE
Q: List of late actresses from Los Angeles, and their death dates? (X = ours, Y = indegrees)
Comments: ”Both Y and X list correct answers.” ”Both lists contain information requested.”
Q: List of doctors who graduated from the University of Oxford? (X = ours, Y = indegrees)
Comments: ”Both have the same accuracy.” ”X and Y both answer the question.”

LOST
Q: List of comedy movies starred by a couple and directed by Frank Sinatra? (X = indegrees, Y = ours)
Comments: ”X list includes Sinatra’s movies.” ”X is good and lists the correct answers.”

Table 3.8: Ours vs. Indegrees explanations samples

The inter-rater agreement produced by CrowdFlower was 76%, 88%, and 90% for the comparisons with the no-rank model, the indegrees model, and [22], respectively. We outperformed no-rank since it does not support ranking: the subgraphs are chosen randomly. In addition, the no-rank model does not take into consideration the keywords associated with the query when ranking a subgraph. We also outperformed the second model, indegrees, because it does not handle the keywords associated with the query. In addition, the witnesscount computation in this model does not reflect the quality of a certain fact very well; this appears clearly in the case of the purely structured queries. In the case of the ties, where our sets and the no-rank sets, and our sets and the indegrees sets, were considered the same, the logic of the assessors was that both sets answered the question, and that the question was somehow specific enough that almost all subgraphs are good enough answers. For the model in [22], it was expected that we would do as well as they did, since we both use statistical language models in building our ranking models, we both support keyword-augmented queries, and we both support query relaxation.

Ours vs. LMRQ

WON
Q: List of directors of comedy shows or movies about school and friends? (X = LMRQ, Y = ours)
Comments: ”ONLY Y HAS SHOW ABOUT SCHOOL.” ”Y is more accurate.”
Q: List of people who were born in Agatha Christie’s death place and who graduated from Oxford? (X = ours, Y = LMRQ)
Comments: ”X fully answers the question. Y features random people’s death places.” ”y doesn’t cover topic.”

TIE
Q: List of award-winning authors? (X = LMRQ, Y = ours)
Comments: ”Both fully answer the question.” ”lists are identical.”
Q: List of people who graduated from Harvard University and their nationalities? (X = LMRQ, Y = ours)
Comments: ”X and Y both include people who graduated from Harvard University and their nationalities. So X and Y are the same.” ”same content.”

LOST
No loss.

Table 3.9: Ours vs. LMRQ explanations samples

However, we later improve our search engine to support query diversification, which their system does not.

Chapter 4

Diversity

It is often the case that the top-ranked results are homogeneous. First, we discuss the problem of result diversity in Section 4.1. We then discuss the Maximal Marginal Relevance (MMR) approach in Section 4.2. In Section 4.3, we define different notions of result diversity in the setting of RDF. In Section 4.4, we develop an approach for result diversity based on MMR. Finally, in Section 4.5, we develop a diversity-aware evaluation metric based on the Discounted Cumulative Gain [31] and use it, in Section 4.7, to evaluate our benchmark of 100 queries over DBpedia.

4.1 Diversity Problem

Although result ranking improves user satisfaction, it is often the case that the top-ranked results are dominated by one aspect of the query. This is a common problem in IR in general [24]. For example, consider the query ⟨?politician, party, Democratic Party (United States)⟩, ⟨?event, commander, ?politician⟩, which requires the name of a Democratic politician who had a role in a military event or crisis. As can be seen from Table 4.1, a large number of the top-10 results are events by the same commander, namely Barack Obama.

Subject (S) | Predicate (P) | Object (O)
Jefferson Davis | party | Democratic Party (United States)
American Civil War | commander | Jefferson Davis

Barack Obama | party | Democratic Party (United States)
Operation Odyssey Dawn | commander | Barack Obama

Barack Obama | party | Democratic Party (United States)
New York Air National Guard | commander | Barack Obama

Barack Obama | party | Democratic Party (United States)
Texas Air National Guard | commander | Barack Obama

Barack Obama | party | Democratic Party (United States)
Ohio Air National Guard | commander | Barack Obama

Barack Obama | party | Democratic Party (United States)
Arkansas Army National Guard | commander | Barack Obama

Barack Obama | party | Democratic Party (United States)
Georgia Air National Guard | commander | Barack Obama

Barack Obama | party | Democratic Party (United States)
Alabama Air National Guard | commander | Barack Obama

Barack Obama | party | Democratic Party (United States)
Michigan Air National Guard | commander | Barack Obama

Barack Obama | party | Democratic Party (United States)
Massachusetts Air National Guard | commander | Barack Obama

Table 4.1: Top-10 ranked results of: Democratic politicians who had roles in military events or crises.

The goal of a diversity-aware ranking model is to produce an ordering such that the top-k results are most relevant to the query and at the same time as diverse from each other as possible. This is an optimization problem where the objective is to produce an ordering that maximizes both the relevance of the top-k results and their diversity. This objective function is hard to solve, which is why most approaches address it as an approximation problem. The problem is known as the top-k set selection problem [32]. It can be formulated as follows. Let Q be a query and U be its result set. Furthermore, let REL be a function that measures the relevance of a subset of results S ⊆ U with respect to Q, and let DIV be a function that measures the diversity of a subset of results S ⊆ U. Finally, let f be a function that combines both relevance and diversity. The top-k set selection problem can be solved by finding S* in Equation 4.1:

S* = argmax_{S ⊆ U} f(Q, S, REL, DIV)  such that |S*| = k    (4.1)

In order to solve this optimization problem, one must clearly specify both the relevance function REL and the diversity function DIV, as well as how to combine them. Gollapudi and Sharma [32] proposed a set of axioms to guide the choice of the objective function f(Q, S, REL, DIV), and they showed that for most natural choices of the relevance and diversity functions, and the combination strategies between them, the above optimization problem is NP-hard. For instance, one such choice of the objective function is shown in Equation 4.2:

f(Q, S, REL, DIV) = (k - 1) \sum_{r \in S} rel(r, Q) + 2\lambda \sum_{r, r' \in S} d(r, r')    (4.2)

where rel(r, Q) is a (positive) score that indicates how relevant result r is with respect to query Q (the higher this score is, the more relevant r is to Q), d(r, r') is a discriminative and symmetric distance measure between two results r and r', and λ is a scaling parameter. The objective function in Equation 4.2 clearly trades off the relevance of the results in the top-k set with their diversity (as measured by their average distance).

Solving such an objective function is again NP-hard; however, there exist known approximation algorithms for the problem that mostly rely on greedy heuristics [32].

4.2 Maximal Marginal Relevance

Carbonell and Goldstein introduced the Maximal Marginal Relevance (MMR) method [24] which they use to re-rank a set of pre-retrieved documents U given a query Q. Given a query Q, a set of results U and a subset S ⊂ U, the marginal relevance of a result r ∈ U \ S is defined in Equation 4.3.

MR(r, Q, S) = \lambda\, rel(r, Q) + (1 - \lambda) \min_{r' \in S} d(r, r') \qquad (4.3)

where rel(r, Q) is a measure of how relevant r is to Q, d(r, r') is a symmetric distance measure between r and r', and λ is a weighting parameter. The idea behind the marginal relevance metric is very intuitive. Given a query Q and a set of already selected results S, the marginal relevance of a result r measures how much we gain, in terms of both relevance and diversity, by adding the result r to the selected set S. Measuring how much the result r contributes to the relevance aspect of S is straightforward: we can use the result's relevance to Q. On the other hand, measuring how much result r contributes to the diversity of S is more involved. The most natural approach is to compare r with all the results r' ∈ S, compute a similarity (or rather dissimilarity) between r and every other result r' ∈ S, and then aggregate these similarities over all the results in S. We do exactly this by assuming there is a distance function that measures how different result r is from any other result r', and we then use the minimum of the distances of r from all the results r' ∈ S as a measure of the overall contribution of result r to the diversity of set S. By maximizing this minimum over the results r ∉ S, we can find the result that, when added to S, would render it most diverse compared to any other candidate result. Given all these considerations, we set the relevance rel(r, Q) to the score of the result r obtained from our search engine.

4.3 Diversity Notions in RDF Setting

We propose three different notions of diversity and we explain how we build a subgraph representation that allows us to achieve each such notion.

4.3.1 Resource-based Diversity

In this notion of diversity, the goal is to diversify the different resources (i.e., entities and relations) that appear in the results. This ensures that no single resource will dominate the result set. Recall our example query asking for the name of a Democratic politician who had a role in a military event or crisis. Table 4.1 shows the top-10 subgraphs retrieved for the query using our ranking model. In order to diversify the top-k results of a certain query, we define a language model for each result as follows:

Resource-based Language Model: the resource-based language model of result r is a probability distribution over all resources in DBpedia. The parameters of the result language model are estimated using a smoothed maximum likelihood estimator as shown in Equation 4.4.

P(w|r) = \alpha \frac{c(w; r)}{|r|} + (1 - \alpha) \frac{1}{|Col|} \qquad (4.4)

where w is a resource, Col is the set of all unique resources in DBpedia, c(w; r) is the number of times resource w occurs in r, |r| is the number of times all resources occur in r, and |Col| is the number of unique resources in DBpedia. Finally, α is the smoothing parameter. For example, consider the first subgraph from Table 4.1, consisting of the two triples <Jefferson Davis, party, Democratic Party (United States)> and <American Civil War, commander, Jefferson Davis>. The resource-based language model of this subgraph is: P(Jefferson Davis)=0.266, P(party)=0.133, P(commander)=0.133, P(American Civil War)=0.133 and P(Democratic Party (United States))=0.133, with |Col| = 5771699 and α=0.8.
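As a rough illustration of Equation 4.4, the following Python sketch estimates a smoothed resource-based language model from a subgraph given as a list of triples; the function name and the way |Col| is passed in are illustrative assumptions, not part of our implementation.

```python
from collections import Counter

def resource_language_model(triples, collection_size, alpha=0.8):
    """Smoothed maximum-likelihood language model over the resources of a subgraph.

    triples         -- list of (subject, predicate, object) tuples forming the result subgraph
    collection_size -- |Col|, the number of unique resources in the collection
    alpha           -- smoothing parameter
    Returns a dict mapping each resource seen in the subgraph to P(w|r); unseen
    resources implicitly receive the background mass (1 - alpha) / |Col|.
    """
    counts = Counter(resource for triple in triples for resource in triple)
    total = sum(counts.values())  # |r|: number of resource occurrences in the subgraph
    return {w: alpha * c / total + (1 - alpha) / collection_size
            for w, c in counts.items()}

# Example with the Jefferson Davis subgraph of Table 4.1 (|Col| taken from the text):
subgraph = [("Jefferson Davis", "party", "Democratic Party (United States)"),
            ("American Civil War", "commander", "Jefferson Davis")]
lm = resource_language_model(subgraph, collection_size=5771699)
# lm["Jefferson Davis"] is roughly 0.266; each remaining resource gets roughly 0.133.
```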

4.3.2 Term-based Diversity

In this notion of diversity, we are only interested in diversifying the results in terms of the variable bindings. To be able to do this, we define a language model for each result as follows.

Term-based Language Model: the term-based language model of result r is a probability distribution over all terms (unigrams) in DBpedia. The parameters of the result language model are estimated using a smoothed maximum likelihood estimator as shown in Equation 4.5.

P(w|r) = \alpha \frac{c(w; r)}{|r|} + (1 - \alpha) \frac{1}{|Col|} \qquad (4.5)

such that w ∉ QTerms, where w is a term, QTerms is the list of the terms in the query, Col is the set of all unique terms in DBpedia, c(w; r) is the number of times term w occurs in r, |r| is the number of times all terms occur in r, and |Col| is the number of unique terms in DBpedia. Finally, α is the smoothing parameter. This notion of diversity is particularly important in the case of keyword-augmented queries and when performing query relaxation. By excluding terms that appear in the original query when representing each result, we ensure that when these representations are later used for diversity, the top-ranked results still stay close to the original user query. For example, consider one of our benchmark queries and one of its result subgraphs. The term-based language model of this subgraph is as follows: P(crimes)=0.4, P(misdemeanors)=0.4, P(franz)=0.2 and P(schubert)=0.2, with |Col| = 2424955 and α=0.8. As the term-based language model does not include any of the terms in the query, this ensures that we are only diversifying the results in terms of variable bindings. The relations (predicates) are also considered terms in this notion of diversity.
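The only mechanical difference from the resource-based estimator is the exclusion of the query terms. A hedged sketch of this variant, with the same caveats about illustrative names, could look as follows; treating |r| as the count of all term occurrences (including the excluded ones) is one possible reading of the definition above.

```python
from collections import Counter

def term_language_model(result_terms, query_terms, collection_size, alpha=0.8):
    """Smoothed term-based language model in the spirit of Equation 4.5.

    result_terms    -- list of term occurrences (unigrams) of the result's variable bindings
    query_terms     -- QTerms, the terms of the original query (excluded from the model)
    collection_size -- |Col|, the number of unique terms in the collection
    """
    query_terms = set(query_terms)
    total = len(result_terms)  # |r|: all term occurrences in r, per the definition above
    counts = Counter(t for t in result_terms if t not in query_terms)
    return {w: alpha * c / total + (1 - alpha) / collection_size
            for w, c in counts.items()}
```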

4.3.3 Text-based Diversity

In our search engine, each triple is also associated with a text snippet, which can be used to process keyword-augmented queries. A text snippet can be directly utilized to provide diversity among the different results using the MMR measure. We define a language model for each result as follows.

Text-based Language Model: the text-based language model of a result r is a probability distribution over all the keywords in all the text snippets of all the triples in DBpedia. The parameters of the text-based language model are computed using a smoothed maximum-likelihood estimator as shown in Equation 4.6.

P(w|r) = \alpha \frac{c(w; D(r))}{|D(r)|} + (1 - \alpha) \frac{1}{|Col|} \qquad (4.6)

where c(w; D(r)) is the number of times keyword w occurs in D(r) (the text snippet of subgraph r), |D(r)| is the number of occurrences of all keywords in D(r), and |Col| is the number of unique keywords in the text snippets of all triples in DBpedia. Finally, α is the smoothing parameter. For example, consider a subgraph consisting of two triples,

with the following two sets of keywords associated with triple 1 and triple 2, respectively: [need, story, standup, 2004, samantha, playwright, oscar] and [story, need, fox]. The keywords of this subgraph are then the union of the keywords of the two triples: need, story, standup, 2004, samantha, playwright, oscar, fox. Consequently, the text-based language model of this subgraph is as follows: P(need)=0.123, P(story)=0.123, P(standup)=0.061, P(2004)=0.061, P(playwright)=0.061, P(oscar)=0.061 and P(fox)=0.061, with α = 0.8 and |Col| = 2937745.

4.4 Diversity-aware Re-ranking Algorithm

We now explain how the marginal relevance can be used to provide a diversity-aware ranking of results given a query Q. Let U be the set of results ranked using any regular ranking model (i.e., one that depends only on relevance without taking diversity into consideration). The algorithm to re-rank the results works as follows:

Maximal Marginal Relevance Re-Ranking Algorithm

1. Initialize the top-k set S with the highest ranked result r ∈ U.
2. Iterate over all the results r ∈ U \ S, and pick the result r* with the maximum marginal relevance MR(r*, Q, S). That is,

r^* = \arg\max_{r \in U \setminus S} \left[ \lambda\, rel(r, Q) + (1 - \lambda) \min_{r' \in S} d(r, r') \right] \qquad (4.7)

3. Add r* to S.
4. If |S| = k or S = U, return S; otherwise repeat steps 2, 3 and 4.

The distance between two results r and r' is computed in Equation 4.8.

d(r, r') = \sqrt{JS(r \,\|\, r')} \qquad (4.8)

where JS(r || r') is the Jensen-Shannon Divergence [33] between the language models of results r and r', computed as shown in Equation 4.9.

JS(r \,\|\, r') = \frac{KL(r \,\|\, M) + KL(r' \,\|\, M)}{2} = \frac{1}{2} \sum_{i} \left[ r(i) \log \frac{r(i)}{M(i)} + r'(i) \log \frac{r'(i)}{M(i)} \right] \qquad (4.9)

where KL(x || y) is the Kullback-Leibler Divergence between two language models x and y, and M = (r + r')/2 is the average of the language models of r and r'.

We opted for the Jensen-Shannon Divergence because its square root is a symmetric distance measure, which is exactly what is required by the MMR measure.
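The following Python sketch puts the pieces of this section together: the square root of the Jensen-Shannon Divergence as the distance of Equation 4.8, and the greedy loop of Equations 4.3 and 4.7. It assumes each result is already represented by a relevance score and a language model stored as a dictionary; all names are illustrative, and the treatment of words missing from a model (falling back to the background probability) is an assumption consistent with the smoothing of Equations 4.4-4.6.

```python
import math

def js_distance(p, q, collection_size, alpha=0.8):
    """Square root of the Jensen-Shannon Divergence between two smoothed language models.

    p, q are dicts mapping words to probabilities; words absent from a model are
    assigned the background probability (1 - alpha) / |Col|.
    """
    bg = (1 - alpha) / collection_size
    js = 0.0
    for w in set(p) | set(q):
        pw, qw = p.get(w, bg), q.get(w, bg)
        m = (pw + qw) / 2.0
        js += 0.5 * (pw * math.log(pw / m) + qw * math.log(qw / m))
    return math.sqrt(js)

def mmr_rerank(results, k, lam, collection_size):
    """Greedy MMR re-ranking (Section 4.4).

    results -- list of (result_id, relevance_score, language_model) tuples,
               already ranked by relevance (highest first)
    k       -- number of results to return
    lam     -- weighting parameter lambda of Equation 4.7
    """
    remaining = list(results)
    selected = [remaining.pop(0)]           # step 1: seed with the top-ranked result
    while remaining and len(selected) < k:  # steps 2-4: greedily add the best candidate
        def marginal_relevance(cand):
            _, rel, lm = cand
            min_dist = min(js_distance(lm, s_lm, collection_size)
                           for _, _, s_lm in selected)
            return lam * rel + (1 - lam) * min_dist
        best = max(remaining, key=marginal_relevance)
        remaining.remove(best)
        selected.append(best)
    return selected
```

With λ set to 0.1, as tuned later in Section 4.7, most of the weight in the loop goes to the diversity term.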

4.5 Diversity-aware Evaluation Metric

To be able to evaluate the effectiveness of our result diversity approach, an evaluation metric that takes into consideration both the relevance of results and their diversity must be used. There is a wealth of work on diversity-aware evaluation metrics for IR systems, such as [34, 35, 36]. We adopt a similar strategy and propose a novel evaluation metric that takes into consideration both aspects we are concerned with here, namely relevance and diversity, to evaluate a result set for a given query. We introduce an adjustment to the Discounted Cumulative Gain (DCG) [31] metric by adding a component that takes into consideration the novelty of a certain result, which reflects result diversity in a given result set. In other words, for each result r, we assume there exist two scores: a relevance score for the result r, and a novelty score that reflects the novelty of result r with respect to the previously selected results S. More formally, given a particular result set (a result ordering) of p results, the diversity-aware DCG, which we coin DIV-DCG, is computed in Equation 4.10.

DIV\text{-}DCG_p = rel_1 + nov_1 + \sum_{i=2}^{p} \left( \frac{rel_i}{\log_2(i)} + nov_i \right) \qquad (4.10)

where rel_i is the relevance score of the result at position i and nov_i is its novelty.

4.5.1 Resource-based Novelty

For our first two diversity notions, the resource-based and the term-based notions, the novelty of a result at position i can be computed as follows:

nov_i = \frac{\#unseen_i}{\#variables} \qquad (4.11)

where #unseen_i is the number of resources that are bound to variables in the result at position i and that have not yet been seen, and #variables is the total number of variables in the query. Our goal here is to diversify the results with respect to the variable bindings.

4.5.2 Text-based Novelty

The computation of the text-based novelty metric is very similar in spirit to the resource-based one. The only difference is that in the case of text-based diversity,

our goal is to diversify the results with respect to their text snippets. To be able to quantify this, we measure, for each result, the amount of new keywords that this result contributes to the set of keywords of the previously ranked results. More precisely, the text-based novelty can be computed as follows:

nov_i = \frac{\left| keywords_i \setminus \left( keywords_i \cap \bigcup_{j=1}^{i-1} keywords_j \right) \right|}{|keywords_i|} \qquad (4.12)

where keywords_i is the set of the keywords associated with the subgraph at position i, \bigcup_{j=1}^{i-1} keywords_j is the set of all the keywords seen so far (up to subgraph i - 1), and |keywords_i| is the number of keywords in the set keywords_i.
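As a sketch of how Equations 4.10 and 4.12 fit together, the following Python fragment accumulates DIV-DCG over a ranked list using text-based novelty; the input representation (a relevance score and a keyword set per result) is assumed purely for illustration.

```python
import math

def div_dcg(relevances, keyword_sets):
    """Diversity-aware DCG (Equation 4.10) with text-based novelty (Equation 4.12).

    relevances   -- list of relevance scores, one per ranked result
    keyword_sets -- list of keyword sets, one per ranked result (same order)
    """
    seen = set()
    total = 0.0
    for i, (rel, kws) in enumerate(zip(relevances, keyword_sets), start=1):
        novelty = len(kws - seen) / len(kws) if kws else 0.0  # Equation 4.12
        discount = 1.0 if i == 1 else 1.0 / math.log2(i)      # no discount at rank 1
        total += rel * discount + novelty
        seen |= kws
    return total
```

For the resource-based and term-based notions, the novelty line would instead count unseen variable bindings and divide by the number of query variables (Equation 4.11).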

4.6 Related Work

Result diversity for document retrieval has gained much attention in recent years. The work in this area deals primarily with unstructured and semi-structured data [37, 24, 38, 35, 32, 36]. Most of the techniques perform diversification by optimizing a bi-criteria objective function that takes into consideration both result relevance and result novelty with respect to other results. Gollapudi and Sharma [32] presented an axiomatic framework for this problem and studied various objective functions that can be used to define such an optimization problem. They proved that in most cases the problem is hard to solve and proposed several approximation algorithms for it. Carbonell and Goldstein introduced the Maximal Marginal Relevance (MMR) method [24], which is one approximation solution to this optimization problem. Zhai et al. [36] studied a similar approach within the framework of language models and derived an MMR-based loss function that can be used to perform diversity-aware ranking. Agrawal et al. [37] assumed that query results belong to different categories and proposed an objective function that trades off the relevance of the results against the number of categories covered by the selected results. In an RDF setting, where results are constructed at query time by joining triples, we do not have an explicit notion of result categories. We thus adapted the Maximal Marginal Relevance approach [24] to the setting of RDF data, since it directly utilizes the results to perform diversification rather than explicitly taking the categories of the results into consideration. Apart from document retrieval, there is very little work on result diversity for queries over structured data. In [39], the authors propose to navigate SQL results through categorization, which takes into account user preferences. In [40], the authors introduce a pre-indexing approach for efficient diversification of query results on relational databases. However, they do not take into consideration the relevance of the results to the query.

4.7 Experimental Evaluation

4.7.1 Evaluation Using Diversity-aware Metric

4.7.1.1 Setup

In order to evaluate our greedy diversity-aware re-ranking algorithm, we use a benchmark of 100 queries which we constructed over DBpedia, and which can be broadly divided into 4 categories: structured, keyword-augmented, requiring relaxation, and keyword-augmented and requiring relaxation. Our query benchmark was used to tune the weighting parameter of MMR (λ in Equation 4.7), which trades off relevance and diversity. It was also used to compare the various notions of diversity and to evaluate the effectiveness of diversity versus no diversity based on our diversity-aware evaluation metric DIV-DCG described in Section 4.5. Recall that in order to compute the DIV-DCG of a given query, we need two measures for each result at position i, namely, the relevance of the result rel_i and its novelty nov_i. While nov_i can be directly computed from the results themselves, as explained in Section 4.5, we needed to gather a relevance assessment for each result to be able to compute rel_i. To gather these relevance assessments, we relied on the crowdsourcing platform CrowdFlower [30] as follows. Each query was run several times using our diversity-aware re-ranking algorithm with different notions of diversity and with different values of the weighting parameter λ in Equation 4.7. In particular, for each one of our three diversity notions, each query was run 10 times with λ ranging from 0.1 to 1 (i.e., no diversity) and the top-10 results were retrieved. This resulted in 30 possibly overlapping result sets. Finally, the results in each of these 30 sets were pooled together and the unique set of results was assessed on CrowdFlower [30]. We prepared a set of guidelines for each query category. We explained the idea of a structured result to the contributors. The queries were represented in natural language. The answers were represented as sets of triples (i.e., subgraphs). The guidelines which we submitted to CrowdFlower [30] can be found in detail in Appendix B.

4.7.1.2 Parameter Tuning

The main parameter in our diversity-aware re-ranking algorithm is the weighting parameter λ, which trades off relevance and diversity. To be able to set this parameter, we compute the normalized DIV-DCG10 for each query in the training set while varying the value of λ from 0.1 to 0.9. The normalized DIV-DCG10, or DIV-NDCG10, is computed by dividing the DIV-DCG10 by the ideal DIV-DCG10. To compute the ideal DIV-DCG10, we re-ranked the results using a greedy approach: the new ordering pushes the results with the best combination of diversity and relevance gain to the top. Table 4.2 shows the average DIV-NDCG10 of the training queries for our three notions of diversity and for different values of λ.

| Notion | λ = 0.1 | λ = 0.2 | λ = 0.3 | λ = 0.4 | λ = 0.5 | λ = 0.6 | λ = 0.7 | λ = 0.8 | λ = 0.9 |
| Resource-based | 0.798 | 0.789 | 0.780 | 0.769 | 0.762 | 0.758 | 0.751 | 0.740 | 0.727 |
| Term-based | 0.798 | 0.785 | 0.770 | 0.759 | 0.752 | 0.747 | 0.741 | 0.731 | 0.724 |
| Text-based | 0.932 | 0.931 | 0.928 | 0.923 | 0.918 | 0.915 | 0.906 | 0.892 | 0.875 |

Table 4.2: Average DIV-NDCG10 of the training queries for different values of λ

| Notion | Diversified Result Set | Non-diversified Result Set |
| Resource-based | 0.80 | 0.74 |
| Term-based | 0.81 | 0.74 |
| Text-based | 0.91 | 0.68 |

Table 4.3: Average DIV-NDCG10 of the test queries

The numbers reported in Table 4.2 are averages over 5 training repetitions using 5 different training sets of 80 queries each, i.e., 5-fold cross-validation [41]. The highest values in each row correspond to the optimal values of the parameter. For all three notions, the best value for λ is 0.1, which means we should give 90% importance to diversity over relevance in our re-ranking algorithm in order to obtain the best possible DIV-NDCG10.

Moreover, before a worker was allowed to work on one of our tasks, she had to pass a qualification test that we created to guarantee that she understood the task guidelines. Any worker with a score less than 90% was not allowed to work on our tasks. The average score of those workers that were allowed to work on our tasks was 93%. The inter-rater agreement reported by CrowdFlower was 67%. In addition, we computed the Fleiss Kappa Coefficient as another measure of agreement, which is more reliable than that of CrowdFlower as it takes agreement by chance into consideration. We obtained a Kappa Coefficient of 40%, which can be interpreted as "Fair Agreement" [42].
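A rough sketch of the tuning procedure described at the beginning of this subsection is shown below; it assumes a callable that returns the DIV-NDCG10 of a query for a given λ (the name div_ndcg10 and the fold representation are illustrative assumptions) and simply averages the scores over the training folds.

```python
def tune_lambda(folds, div_ndcg10, lambdas=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Pick the lambda with the best average DIV-NDCG10 over the training folds.

    folds      -- list of training-query lists (e.g., five folds of 80 queries each)
    div_ndcg10 -- function (query, lam) -> DIV-DCG10 of the diversified top-10,
                  divided by the ideal DIV-DCG10 obtained by a greedy re-ordering
    Returns the best lambda and the per-lambda average scores.
    """
    scores = {}
    for lam in lambdas:
        fold_means = [sum(div_ndcg10(q, lam) for q in fold) / len(fold) for fold in folds]
        scores[lam] = sum(fold_means) / len(fold_means)
    return max(scores, key=scores.get), scores
```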

4.7.1.3 Comparison of Various Notions of Diversity

Given the optimal values of the parameter λ that were set based on the five training sets of 80 queries each from our benchmark, we computed the average DIV-NDCG10 over the corresponding five test sets of 20 queries each, for each notion of diversity as well as for the case where no diversity was employed. The summary of our findings over the 20 test queries is shown in Table 4.3. As can be seen from the table, the average DIV-NDCG10 for all notions is significantly larger than when no diversification of results took place.

4.7.1.4 Qualitative Results

We show some qualitative results that highlight the importance of result diversity for RDF search and the difference between the various notions of diversity we discussed here.

Resource-based Diversity. Consider the query whose corresponding information need is "Give me the name of a director whose partner is an actor (or actress) and the name of a movie for each". Table 4.4 shows the top-5 results with and without diversification, with λ set to 0.1. The table shows that without diversification, the results are dominated by two resources, namely Madonna and Sean Penn, with different combinations of movies they acted in or directed. When diversification is involved, we instead get a set of different director-actor pairs, as shown in the second column of Table 4.4.

| No Diversity | Resource-based Diversity |
| W.E. director Madonna | W.E. director Madonna |
| Mystic River starring Sean Penn | Mystic River starring Sean Penn |
| Madonna spouse Sean Penn | Madonna spouse Sean Penn |
| Secretprojectrevolution director Madonna | Citizen Kane director Orson Welles |
| Mystic River starring Sean Penn | Cover Girl starring Rita Hayworth |
| Madonna spouse Sean Penn | Orson Welles spouse Rita Hayworth |
| W.E. director Madonna | Henry V director Laurence Olivier |
| The Tree of Life starring Sean Penn | Ship of Fools starring Vivien Leigh |
| Madonna spouse Sean Penn | Laurence Olivier spouse Vivien Leigh |
| W.E. director Madonna | That Thing You Do! director Tom Hanks |
| Dead Man Walking starring Sean Penn | Jingle All the Way starring Rita Wilson |
| Madonna spouse Sean Penn | Tom Hanks spouse Rita Wilson |
| W.E. director Madonna | Then She Found Me director Helen Hunt |
| This Must Be the Place starring Sean Penn | The Simpsons Movie starring Hank Azaria |
| Madonna spouse Sean Penn | Helen Hunt spouse Hank Azaria |

Table 4.4: Top-5 results: director-actor couple query with and without diversity

Term-based Diversity. Consider one of the test queries, whose information need is: "Give me the name of a movie, its director, and its star, preferably a Disney movie". The top-5 results with the diversity parameter λ set to 0.1 are shown in Table 4.5. In the non-diversified set of results, we have one popular Disney movie, Aladdin, that is repeated five times. On the other hand, the diversified set of results contains five different Disney movies. Note that, had we used the resource-based diversity notion to diversify the results of this query, we would have ended up with 5 different movies which would not necessarily have been Disney movies.

The merit of using the term-based diversity notion is that it diversifies the result set with respect to terms rather than resources, and excludes the terms that appear in the query when considering diversity.

| No Diversity | Term-based Diversity |
| Aladdin (1992 Disney film) director John Musker | Aladdin (1992 Disney film) director John Musker |
| Aladdin (1992 Disney film) starring Robin Williams | Aladdin (1992 Disney film) starring Robin Williams |
| Aladdin (1992 Disney film) director Ron Clements | 102 Dalmatians director Kevin Lima |
| Aladdin (1992 Disney film) starring Robin Williams | 102 Dalmatians starring Glenn Close |
| Aladdin (1992 Disney film) director John Musker | Cars 2 director John Lasseter |
| Aladdin (1992 Disney film) starring Frank Welker | Cars 2 starring Michael Caine |
| Aladdin (1992 Disney film) director Ron Clements | Leroy & Stitch director Tony Craig |
| Aladdin (1992 Disney film) starring Frank Welker | Leroy & Stitch starring Tara Strong |
| Aladdin (1992 Disney film) director John Musker | Snow White and the Seven Dwarfs director Wilfred Jackson |
| Aladdin (1992 Disney film) starring Gilbert Gottfried | Snow White and the Seven Dwarfs starring Pinto Colvig |

Table 4.5: Top-5 results for the Disney query with and without diversity

Text-based Diversity. Consider one of the test queries, whose corresponding information need is: "Give me the name of a film distributed by the Columbia Pictures company". The top-5 results for both no diversity and text-based diversity, with the diversity parameter λ set to 0.1, are shown in Table 4.6. While the non-diversified set contains different movies, these movies are in fact not very diverse. Unlike the non-diversified set, the diversified set of results contains movies that have totally different genres, locations, casts, and plots. This indirect level of diversity cannot be captured using our first two diversity notions.

| No Diversity | Text-based Diversity |
| Spider-Man 2 distributor Columbia Pictures | Spider-Man 2 distributor Columbia Pictures |
| Close Encounters distributor Columbia Pictures | Sharkboy & Lavagirl distributor Columbia Pictures |
| A Clockwork Orange distributor Columbia Pictures | Jungle Menace distributor Columbia Pictures |
| Spider-Man 3 distributor Columbia Pictures | Quantum of Solace distributor Columbia Pictures |
| Lawrence of Arabia distributor Columbia Pictures | The Da Vinci Code distributor Columbia Pictures |

Table 4.6: Top-5 results for the Columbia Pictures query with and without diversity

Note that our framework includes some other parameters, namely the smoothing parameters of the language models used to compute the distance between results; however, we opted to pre-set these to 0.8 in order to focus on the effect of the weighting parameter λ in our experiments.

4.7.2 Evaluation Using Crowdsourcing

4.7.2.1 Setup

We also evaluate our diversity algorithm using crowdsourcing [30]. We follow the same technique in setting up the guidelines as in Section 3.4. The assessor is

asked to choose between two sets: one is diversified, and the other one is ranked using our ranking model (i.e., based on relevance only). The assessor does not know which set is the diversified set. The positions of the sets are randomized for each query. She is also asked to provide a valid explanation of her choice. We run this test for all three diversity notions over our 100-query benchmark. The assessors are also asked to pass a test with at least a 90% score to be able to continue assessing. Moreover, a set of gold queries is embedded in the benchmark to catch and exclude any bad behavior or random answers.

Figure 4.1: A subgraph and its description

4.7.2.2 Results

The results of this evaluation are shown in Table 4.7. The reported numbers are out of 100 queries.

| Non-diversified vs.: | Resource-based | Term-based | Text-based |
| Won | 52 | 57 | 38 |
| Tie | 43 | 38 | 57 |
| Lost | 5 | 5 | 5 |

Table 4.7: Results for 100 evaluation queries: diversity algorithm

For both the resource-based notion and the term-based notion, we outperform the non-diversified model by 9%. Unlike the text-based notion, the diversity for these two notions is clear and straightforward: the assessor can see the differences between the sets by just looking at the answers. Some sample comments are shown in Tables 4.8 and 4.9. The assessors who preferred the diversified sets noticed that these sets have more than one flavor in them. For the third notion, the text-based notion, the diversity was hidden. To help the assessors compare the sets, we provided a set of keywords for each answer. As shown in Figure 4.1, by mousing over an answer, they can see its description. 57% of the diversified sets were considered as good as the non-diversified ones, and 38% of the diversified sets were considered better than the non-diversified ones. Some explanations are shown in Table 4.10. The inter-rater agreement produced by CrowdFlower was 77%, 71%, and 71% for the comparison with the resource-based notion, the term-based notion, and the text-based notion, respectively.

Non-diversified vs. Diversity (Resource-based)

WON
Q: List of movies for which the directors and the actors have won awards? (X: Non-Diversified, Y: Diversified Notion 1)
Comments: "more actors and movies in Y." "x all orson welles."
Q: List of people who died in WWII and their resting places? (X: Diversified Notion 1, Y: Non-Diversified)
Comments: "x various y repeat." "X looks better."

TIE
Q: List of writers who had worked with D.W. Griffith? (X: Diversified Notion 1, Y: Non-Diversified)
Comments: "both X and Y contain the same number of writer who had worked with D.W. Griffith." "both answer the question right."
Q: List of publishers that published Stephen King books? (X: Diversified Notion 1, Y: Non-Diversified)
Comments: "same information in both lists." "both right."

LOST
Q: List of Bette Davis movies with her late co-stars? (X: Diversified Notion 1, Y: Non-Diversified)
Comments: "Y is better because it's more accurate." "X about something else."
Q: List of Brazilian actors? (X: Non-Diversified, Y: Diversified Notion 1)
Comments: "X and Y both answer the question with some precision. In Y they speak of an actor and of Brazilian films or news. Both include List of Brazilian actors, but in Y that only sees 1 actor: Meira. So X is better in this case." "X is better."

Table 4.8: No Diversity vs. Resource-based Diversity

Non-diversified vs. Diversity (Term-based)

WON
Q: List of award-winning spielberg movies? (X: Non-Diversified, Y: Diversified Notion 2)
Comments: "y - less duplicates." "y better than x."
Q: List of award-winning cartoonists? (X: Non-Diversified, Y: Diversified Notion 2)
Comments: "y has more various results." "Y list more varied."

TIE
Q: List of people influenced by Freud? (X: Non-Diversified, Y: Diversified Notion 2)
Comments: "Both are same list." "Both answer the question right."
Q: List of art directors? (X: Non-Diversified, Y: Diversified Notion 2)
Comments: "Both have varied lists." "They are the same."

LOST
Q: List of Woody Allen movies that Dick Hyman worked on? (X: Diversified Notion 2, Y: Non-Diversified)
Comments: "X no Dick." "Y better."
Q: List of producers who had worked with Frank Sinatra? (X: Non-Diversified, Y: Diversified Notion 2)
Comments: "x is all about frank sinatra but y not." "x gives better answer."

Table 4.9: No Diversity vs. Term-based Diversity

Non-diversified vs. Diversity (Text-based)

WON
Q: List of classical music composers? (X: Diversified Notion 3, Y: Non-Diversified)
Comments: "X is diversified." "Y all about Mozart."
Q: List of authors who are also actors in drama movies? (X: Diversified Notion 3, Y: Non-Diversified)
Comments: "There is more variety in X list" "X has a variety of actors"

TIE
Q: List of people who worked with Kevin Bacon? (X: Diversified Notion 3, Y: Non-Diversified)
Comments: "X and Y answer the question accurately. X and Y are all varied and diverse and responds with precision. For this reason, the sets X and Y are equal and precise. Therefore, X and Y are the same." "all same."
Q: List of people influenced by Albert Einstein? (X: Non-Diversified, Y: Diversified Notion 3)
Comments: "Both X & Y are same lists just in different order." "Both have varied names."

LOST
Q: List of rock music singers from Ottawa? (X: Diversified Notion 3, Y: Non-Diversified)
Comments: "y has more relevant answers." "y relevant."
Q: List of Bette Davis movies with her late co-stars? (X: Non-Diversified, Y: Diversified Notion 3)
Comments: "In the list Y there are films in which they do not star by Bette Davis" "Y is the correct one."

Table 4.10: No Diversity vs. Text-based Diversity

Chapter 5

Efficiency

In this chapter, we discuss the running time of our algorithms. More specifically, the search engine can be divided into two parts. The first part begins with the user entering her query and ends with a list of the top-10 subgraphs ranked by relevance only. The second part starts with the full relevance-ranked result set and ends by producing a list of the top-10 subgraphs ranked by both relevance and diversity.

5.1 Database Statistics

Ranking a query's results requires visiting several indexed PostgreSQL tables. The tables' titles, columns, descriptions, and sizes are shown in Table 5.1. For example, an entry of the "Keywords" table is <1141107: standup: 1, playwright: 119, role: 164, humor: 39...>, i.e., a triple identifier followed by the weighted keywords of that triple's text snippet.
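As a rough illustration of how a single triple pattern is answered against such a schema, the sketch below builds an in-memory SQLite mirror of a tiny Facts table with the columns of Table 5.1 and looks up all facts with a fixed subject. SQLite stands in for PostgreSQL only to keep the example self-contained; the table name, column names, and sample rows are assumptions.

```python
import sqlite3

# Hypothetical miniature mirror of the "Facts" table of Table 5.1 (S, P, O, WC, ID).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (s TEXT, p TEXT, o TEXT, wc INTEGER, id INTEGER)")
conn.execute("CREATE INDEX idx_facts_s ON facts (s)")  # index used for subject lookups
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?, ?, ?)",
    [("Barack Obama", "party", "Democratic Party (United States)", 120, 1),
     ("Operation Odyssey Dawn", "commander", "Barack Obama", 15, 2)])

# Triple pattern <Barack Obama, ?p, ?o>: fetch all facts with a fixed subject,
# ordered by witness count (WC) so that more frequent facts come first.
rows = conn.execute(
    "SELECT p, o, wc FROM facts WHERE s = ? ORDER BY wc DESC",
    ("Barack Obama",)).fetchall()
for predicate, obj, wc in rows:
    print(predicate, obj, wc)
```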

| Title | Columns | Description | Number of Rows |
| Facts | S, P, O, WC, ID | DBpedia triples | 15.7 million |
| Keywords | ID, Keywords | Weighted keywords of a triple | 15.7 million |
| op var | S, PWC, PKWC | P of WCs and KWCs for all facts with S=x | 2 million |
| sp var | O, PWC, PKWC | P of WCs and KWCs for all facts with O=x | 4.6 million |
| so var | P, PWC, PKWC | P of WCs and KWCs for all facts with P=x | 1337 |
| s var | P, O, PWC, PKWC | P of WCs and KWCs for all facts with P=x and O=y | 5.4 million |
| p var | S, O, PWC, PKWC | P of WCs and KWCs for all facts with S=x and O=y | 15.4 million |
| o var | S, P, PWC, PKWC | P of WCs and KWCs for all facts with S=x and P=y | 12.6 million |

Table 5.1: Database information

5.2 Running Time Statistics

The average running time for each query category is shown in Table 5.2.

| Category | Average Running Time |
| Purely Structured | 18 s |
| Keyword-augmented | 83 s |
| Requiring Relaxation | 115 s |

Table 5.2: Running time per query category

The average ranking running time per query is 47 s, and the standard deviation is 0.2 s. The minimum running time per query is 0.093 s (93 ms), and the maximum running time per query is 431 s (7 minutes). We follow this by reporting the average diversity running time for each of the three diversity notions in Table 5.3. Moreover, the average of the diversity running times for the resource-based and term-based notions is 33 s.

| Notion | Average Running Time |
| Resource-based | 32.8 s |
| Term-based | 33.1 s |
| Text-based | 145.9 s |

Table 5.3: Running time per diversity notion

We record the running time of each step in the process over the 100-query benchmark and then average the measurements to obtain the values in Figure 5.1, which indicate the percentage of time each step needs. The percentage of time spent on processing the input, scoring the subgraphs, and re-scoring them is reasonable. However, 60% of the time is spent on running the required SQL queries on PostgreSQL [9]. Another step that slows down the system is computing the language model for each subgraph in the result set, which takes 37% of the time.

Figure 5.1: Percentage of time needed per step

In addition, we record the running time of the scoring loop for keyword-augmented queries versus that for purely structured queries. The job for this type of queries is slightly more complicated, since we need to take care of the keywords associated with the queries. On average, the running time

needed for scoring a keyword-augmented set of subgraphs is 0.095 s. This value is greater than the time needed to score a purely structured set of subgraphs: 0.084 s. Finally, we observe the average running time for producing a set of relaxed queries, for running a set of relaxed queries, and for scoring the sets of results produced by the relaxed queries, which are 0.1 s, 54 s, and 0.2 s, respectively.

5.3 Running Time Challenges

The two main issues in our system, in terms of execution time, are running the SQL queries on PostgreSQL and computing the language models of all the subgraphs for the purpose of diversifying the results. For the first problem, we can consider implementing the system on top of a NoSQL database, for instance MongoDB [43]. However, this kind of database has some difficulty with self joins; the "$lookup" operator, which MongoDB introduced in a more recent version, can be used to join collections. Another solution is to use a rank-aware join algorithm such as the one proposed in [44]: unlike the traditional materialize-then-sort method for retrieving the results of a SPARQL query, the authors propose a rank-aware join algorithm optimized for native RDF stores. As for the second issue, one solution is to re-rank only a subset of the result set. Instead of computing the language models of all n subgraphs returned by the SQL query, we can compute language models only for the top-k subgraphs (ranked by relevance).
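A minimal sketch of this truncation idea is shown below; it reuses the hypothetical mmr_rerank function sketched in Section 4.4 and assumes the results arrive already sorted by relevance, with a pool size of 100 chosen purely for illustration.

```python
def diversify_top_candidates(ranked_results, build_lm, k=10, candidate_pool=100,
                             lam=0.1, collection_size=5771699):
    """Diversify only the best candidate_pool results instead of the full result set.

    ranked_results -- list of (result_id, relevance_score, triples), sorted by relevance
    build_lm       -- function mapping a result's triples to its language model (a dict)
    """
    pool = ranked_results[:candidate_pool]  # language models are built for the pool only
    with_lms = [(rid, score, build_lm(triples)) for rid, score, triples in pool]
    # mmr_rerank: the greedy MMR re-ranking sketched in Section 4.4's example
    return mmr_rerank(with_lms, k=k, lam=lam, collection_size=collection_size)
```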

Appendix A

Guidelines: Choosing the Better Result Set

We have developed a set of guidelines for comparing two sets of top-10 subgraphs. They help the assessor understand how a subgraph (one or more triples) can describe the answer to a natural language question. Figure A.1 shows the guidelines for choosing between set X, set Y, or "same". X and Y are anonymous sets, which means that in each query our model can be either X or Y. This way, the assessor cannot simply vote for the same side every time. In addition, the assessor is asked to provide a valid explanation of her choice. Figure A.2 shows a sample query. Moreover, before a worker was allowed to work on one of our tasks, she had to pass a qualification test that we created to guarantee that she understood the task guidelines. Any worker with a score less than 90% was not allowed to work on our tasks.

Figure A.1: Which set is better?

Figure A.2: One query and two sets of answers

Appendix B

Guidelines: Assessing a Subgraph

In Section 4.7, we mentioned our 100-query benchmark, whose results we submitted to CrowdFlower [30] for relevance assessments. Since our benchmark's results are subgraphs, we needed to explain the concept of subgraphs and triples to the workers (judges). To accomplish this, we have developed a set of guidelines for each query category. They help the judge understand how a subgraph (one or more triples) can describe the answer to a natural language question.

B.1 Purely Structured Queries Guidelines

Figure B.1 shows the guidelines for purely structured queries and Figure B.2 shows a sample query for this category. We had 17360 unique subgraphs (results) for 62 purely structured queries.

B.2 Keyword-augmented Queries Guidelines

Figure B.3 shows the guidelines for keyword-augmented structured queries and Figure B.4 shows a sample query for this category. We had 7280 unique subgraphs (results) for 26 keyword-augmented queries.

B.3 Requiring Relaxation Queries Guidelines

Figure B.5 shows the guidelines for queries that need relaxation and Figure B.6 shows a sample query for this category. We had 2240 unique subgraphs (results) for 8 requiring relaxation queries.

Figure B.1: Purely structured queries guidelines

Figure B.2: Sample assessment question (purely structured)

B.4 Keyword-augmented and Requiring Relaxation Queries Guidelines

Figure B.7 shows the guidelines for keyword-augmented queries that also require relaxation, and Figure B.8 shows a sample query for this category. We had 1120 unique subgraphs (results) for 4 keyword-augmented and requiring relaxation queries.

B.5 Gold Queries

To ensure a high quality of the relevance assessments gathered and a careful understanding of the guidelines, we constructed a set of gold queries that were embedded within the benchmark queries. This enabled us to exclude assessments from untrusted workers.

Figure B.3: Keyword-augmented queries guidelines

An example of a gold answer (an obvious answer) is the answer to the following query:

”Give me the name of an operating system and its developer ?”

This one-triple subgraph answers the query and contains popular names like "Windows" and "Microsoft". On the other hand, the following answer is obviously irrelevant to the query:

”Give me the name of an Asian country and its capital ?”

Moreover, before a worker was allowed to work on one of our tasks, she had to pass a qualification test that we created to guarantee that she understood the task guidelines. Any worker with a score less than 90% was not allowed to work on our tasks. The average score of those workers that were allowed to work on our tasks was 93%.

Figure B.4: Sample assessment question (keyword-augmented)

Figure B.5: Requiring relaxation queries guidelines

Figure B.6: Sample assessment question (requiring relaxation)

Figure B.7: Keyword-augmented and requiring relaxation queries guidelines

Figure B.8: Sample assessment question (keyword-augmented and requiring relaxation)

Appendix C

Abbreviations

KB    Knowledge-Base
KG    Knowledge-Graph
KWC   Keyword Witnesscount
LMRQ  Language-Model-based Ranking for Queries on RDF-graph
O     Object
P     Predicate
RDF   Resource Description Framework
S     Subject
WC    Witnesscount

Bibliography

[1] S. Auer, C. Bizer, R. Cyganiak, G. Kobilarov, J. Lehmann, and Z. Ives, “Dbpedia: A nucleus for a web of open data,” in ISWC/ASWC, 2007.

[2] “Freebase: A social database about things you know and love.” http://www.freebase.com.

[3] F. M. Suchanek, G. Kasneci, and G. Weikum, “Yago: A large ontology from wikipedia and wordnet,” J. Web Sem., vol. 6, no. 3, 2008.

[4] P. DeRose, X. Chai, B. J. Gao, W. Shen, A. Doan, P. Bohannon, and X. Zhu, “Building community wikipedias: A machine-human partnership approach,” in ICDE, 2008.

[5] Z. Nie, Y. Ma, S. Shi, J. Wen, and W. Ma, “Web object retrieval,” in WWW, 2007.

[6] “W3c: Resource description framework (rdf).” www.w3.org/RDF/, 2004.

[7] “W3c: Sparql query language for rdf.” www.w3.org/TR/rdf-sparql-query/, 2008.

[8] S. Chaudhuri, G. Das, V. Hristidis, and G. Weikum, “Probabilistic infor- mation retrieval approach for ranking of database query results,” SIGMOD Record, vol. 35, no. 4, 2006.

[9] “A powerful, open source object-relational database system..” https://www.postgresql.org/.

[10] D. Petkova and W. Croft, “Hierarchical language models for expert finding in enterprise corpora,” Int. J. on AI Tools, vol. 17, no. 1, 2008.

[11] H. Fang and C. Zhai, “Probabilistic models for expert finding,” in ECIR, 2007.

[12] J. D. Lafferty and C. Zhai, “Document language models, query models, and risk minimization for information retrieval,” in SIGIR, 2001.

[13] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan, “Keyword searching and browsing in databases using banks,” in ICDE, 2002.

[14] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar, “Bidirectional expansion for keyword search on graph databases,” in VLDB, 2005.

[15] V. Hristidis, H. Hwang, and Y. Papakonstantinou, “Authority-based keyword search in databases,” TODS, vol. 33, no. 1, 2008.

[16] K. Golenberg, B. Kimelfeld, and Y. Sagiv, “Keyword proximity search in complex data graphs,” in SIGMOD, 2008.

[17] T. Cheng, X. Yan, and K. C.-C. Chang, “Entityrank: Searching entities directly and holistically,” in VLDB, 2007.

[18] S. Elbassuoni and R. Blanco, “Keyword search over rdf graphs,” in CIKM, CIKM ’11, 2011.

[19] G. Li, B. Ooi, J. Feng, J. Wang, and L. Zhou, “Ease: an effective 3-in-1 keyword search method for unstructured, semistructured and structured data,” in SIGMOD, 2008.

[20] G. Kasneci, F. M. Suchanek, G. Ifrim, M. Ramanath, and G. Weikum, “Naga: Searching and ranking knowledge,” in ICDE, 2008.

[21] M. Yahya, D. Barbosa, K. Berberich, Q. Wang, and G. Weikum, “Relationship queries on extended knowledge graphs,” in Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, WSDM ’16, (New York, NY, USA), pp. 605–614, ACM, 2016.

[22] S. Elbassuoni, M. Ramanath, R. Schenkel, M. Sydow, and G. Weikum, “Language-model-based ranking for queries on RDF-graphs,” in CIKM, 2009.

[23] S. Elbassuoni, M. Ramanath, R. Schenkel, and G. Weikum, “Searching rdf graphs with SPARQL and keywords,” IEEE Data Engineering Bulletin, vol. 33, no. 1, 2010.

[24] J. Carbonell and J. Goldstein, “The use of mmr, diversity-based reranking for reordering documents and producing summaries,” in SIGIR, 1998.

[25] S. Elbassuoni, M. Ramanath, and G. Weikum, “Query relaxation for entity-relationship search,” in Proc. of the 8th Extended Semantic Web Conference (ESWC), 2010.

[26] H. Huang, C. Liu, and X. Zhou, “Computing relaxed answers on rdf databases,” in WISE, 2008.

[27] G. Fokou, S. Jean, A. Hadjali, and M. Baron, “Cooperative techniques for sparql query relaxation in rdf databases,” in Proceedings of the 12th European Semantic Web Conference on The Semantic Web. Latest Advances and New Domains - Volume 9088, (New York, NY, USA), pp. 237–252, Springer-Verlag New York, Inc., 2015.

[28] G. Fokou, S. Jean, A. Hadjali, and M. Baron, RDF Query Relaxation Strategies Based on Failure Causes, pp. 439–454. Cham: Springer International Publishing, 2016.

[29] G. Kasneci, F. Suchanek, G. Ifrim, S. Elbassuoni, M. Ramanath, and G. Weikum, “NAGA: Harvesting, Searching and Ranking Knowledge,” in The 2008 ACM SIGMOD 2008 International Conference on Management of Data (SIGMOD), 2008.

[30] “CrowdFlower: A crowdsourcing company.” https://www.crowdflower.com/.

[31] K. Järvelin and J. Kekäläinen, “Cumulated gain-based evaluation of ir techniques,” ACM Transactions on Information Systems (TOIS), pp. 422–446, 2002.

[32] S. Gollapudi and A. Sharma, “An axiomatic approach for result diversification,” in Proceedings of the 18th international conference on World wide web, WWW ’09, (New York, NY, USA), pp. 381–390, ACM, 2009.

[33] J. Lin, “Divergence measures based on the shannon entropy,” IEEE Transactions on Information Theory, pp. 145–151, 1991.

[34] J. Allan, C. Wade, and A. Bolivar, “Retrieval and novelty detection at the sentence level,” in SIGIR, pp. 314–321, 2003.

[35] C. L. Clarke, M. Kolla, G. V. Cormack, O. Vechtomova, A. Ashkan, S. Büttcher, and I. MacKinnon, “Novelty and diversity in information retrieval evaluation,” in Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’08, (New York, NY, USA), pp. 659–666, ACM, 2008.

[36] C. X. Zhai, W. W. Cohen, and J. Lafferty, “Beyond independent relevance: methods and evaluation metrics for subtopic retrieval,” in Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’03, (New York, NY, USA), pp. 10–17, ACM, 2003.

[37] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong, “Diversifying search results,” in Proceedings of the Second ACM International Conference on Web Search and Data Mining, WSDM ’09, (New York, NY, USA), pp. 5–14, ACM, 2009.

[38] H. Chen and D. R. Karger, “Less is more: probabilistic models for retrieving fewer relevant documents,” in Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’06, (New York, NY, USA), pp. 429–436, ACM, 2006.

[39] Z. Chen and T. Li, “Addressing diverse user preferences in SQL-query-result navigation,” in Proceedings of the 2007 ACM SIGMOD international conference on Management of data, SIGMOD ’07, (New York, NY, USA), pp. 641–652, ACM, 2007.

[40] E. Vee, U. Srivastava, J. Shanmugasundaram, P. Bhat, and S. A. Yahia, “Efficient Computation of Diverse Query Results,” in Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, (Washington, DC, USA), pp. 228–236, IEEE Computer Society, 2008.

[41] R. Kohavi, “A study of cross-validation and bootstrap for accuracy estimation and model selection,” in IJCAI, pp. 1137–1143, Morgan Kaufmann, 1995.

[42] J. L. Fleiss, “Measuring nominal scale agreement among many raters,” Psychological Bulletin, vol. 76, no. 5, pp. 378–382, 1971.

[43] K. Chodorow and M. Dirolf, MongoDB: The Definitive Guide. O’Reilly Media, Inc., 1st ed., 2010.

[44] S. Magliacane, A. Bozzon, and E. Della Valle, “Efficient execution of top-k sparql queries,” in Proceedings of the 11th International Conference on The Semantic Web - Volume Part I, ISWC’12, (Berlin, Heidelberg), pp. 344–360, Springer-Verlag, 2012.
