
Leveraging Entity Types and Properties for Knowledge Graph Exploitation


DEPARTMENT OF INFORMATICS

UNIVERSITY OF FRIBOURG (SWITZERLAND)

Leveraging Entity Types and Properties for Knowledge Graph Exploitation

THESIS

Presented to the Faculty of Science of the University of Fribourg (Switzerland) in consideration for the award of the academic grade of Doctor scientiarum informaticarum

by

ALBERTO TONON

from

ITALY

Thesis No: 2018
UniPrint 2017

Accepted by the Faculty of Science of the University of Fribourg (Switzerland) upon the recommendation of Prof. Dr. Krisztian Balog and Dr. Gianluca Demartini.

Fribourg, May 22, 2017

Thesis supervisor: Prof. Dr. Philippe Cudré-Mauroux

Dean: Prof. Dr. Christian Bochet


Declaration of Authorship

Title: Leveraging Entity Types and Properties for Knowledge Graph Exploitation

I, Alberto Tonon, declare that I have authored this thesis independently, without illicit help, that I have not used any other than the declared sources/resources, and that I have explicitly marked all material which has been quoted either literally or by content from the used sources.

Signed:

Date:


“Educating the mind without educating the heart is no education at all.”

— Aristotle

Acknowledgments

There are many people without whom I would never have achieved this goal. All of them contributed in their own way and proved to be essential to my academic growth, to my personal development, and to my mental health. Here I want to thank them all.

Thanks to my supervisor, Professor Philippe Cudré-Mauroux. Philippe provided me with inestimable advice and guidance; he also showed great patience and always understood the issues that each Ph.D. candidate had. One of Philippe's best achievements is, in my opinion, the creation of a familiar and friendly group where students and researchers can freely interact, provide feedback, collaborate, and have a lot of fun. The path to obtaining a Ph.D. is long and difficult; I think that working in such an environment helped me a lot in dealing with all the difficulties I faced.

Special thanks go to Gianluca Demartini and Monica Noselli, my first family in Fribourg. Similarly to Philippe, Gianluca gave me invaluable advice, introduced me to the academic world, came with me to my first conference, and contributed significantly to the research presented in this thesis.

I would also like to acknowledge an important component of my daily life in the lab: Dr. Roman Prokofyev. Roman bravely shared an office with me for almost five years and managed to put up with me even during deadline marathons and stressful periods. No less important are all the other members and friends of the lab with whom I shared important moments (in alphabetic order): Alisa, Alyia, Artem, Dingqi, Djellel, Esther, Giuseppe, Ines, Julia, Laura, Marcin, Martin, Michael, Michele, Michelle, Paolo, Ruslan, Sabrina, Victor. You guys are great!

My friends in Italy and in Fribourg were also essential during the last five years. I would like, in particular, to mention all friends from “FR Fun!”, and my friends from all times, Alessandro, Annalisa, and Lorena.

Finally, I owe eternal gratitude to my parents, Gianna and Luigi, my sister Cristiana, my grandparents, Edoardo and Vally, and to Aurora for supporting me unconditionally. They inspired me and helped me become who I am.

Fribourg, June 2, 2017 Alberto Tonon


Abstract

A Knowledge Graph is a knowledge base containing semi-structured information represented as a graph. Entries (nodes) in a knowledge graph are called entities. Knowledge Graphs today play a central role in the services provided by major Web players. Recent examples include Google's Knowledge Vault, Yahoo!'s Knowledge Graph, and Bing's Knowledge and Action Graph. In addition, open-source Knowledge Graphs, such as DBpedia, are also available for reuse. Being able to tap into a Knowledge Graph and exploit connections among its entities is key to numerous tasks related, for example, to computational linguistics, information retrieval, and question answering. In this thesis we develop new methods for effectively retrieving entities and for ranking their types. In addition, we present algorithms that improve data quality in knowledge graphs by finding misused entity properties (labeled edges in the Knowledge Graph), and we show how Knowledge Graphs can be used in practice by exploiting them to effectively detect newsworthy events. Finally, we propose a novel evaluation methodology that can be used for continuously evaluating entity-centric retrieval systems.

We start this thesis by presenting novel techniques to retrieve entities as responses to keyword queries. This task is called Ad-hoc Object Retrieval, and is essential for effectively exploiting Knowledge Graphs. We show that triplestores can be used to efficiently retrieve relevant responses by exploring the surroundings of entities obtained by applying standard Information Retrieval methods. This leads to significant improvements in average precision and NDCG with little increase in execution time. Subsequently, we introduce the novel task of Type Ranking (TRank), which consists in ranking all types of a given entity based on the textual context in which it appears. Information on the entity types that best fit a certain context can be shown to end-users for text understanding, text summarization, and for search results diversification. We experimentally demonstrate that approaches for TRank based on the type hierarchy of the Knowledge Graph provide more accurate results.

As information on the schema of the Knowledge Graph is exploited by several of our methods, we also studied how to verify the adherence of the relations contained in the knowledge base to their formal specification. Specifically, we propose a method for detecting when properties in the Knowledge Graph are misused or need to be specialized. We present results that show how entropy can be used to detect the misused properties. We then switch to a more applied context by presenting a real-world application that makes use of semantic technologies and Knowledge Graphs. The system we developed uses a combination of entity linking, anomaly detection, and reasoning to efficiently and effectively detect newsworthy events in microblogs. We show that our system outperforms state-of-the-art methods based on query expansion.

Finally, we tackle the problem of how to evaluate entity-centric retrieval systems like those we proposed. In this context, we introduce a new evaluation methodology that uses crowdsourcing to evaluate sets of systems in a fair and continuous fashion, and define techniques to weight the strictness or leniency of different crowds evaluating the retrieved entities. We analyze the benefits and drawbacks of our methodology by comparing AOR systems developed at different points in time and study how standard Information Retrieval metrics, AOR system ranking, and several pooling techniques behave in such a continuous evaluation context.

Keywords: Knowledge Graphs, Entities, Entity Types, Data Integration.

Résumé

Un graphe de connaissances est une base de connaissances contenant des informations semi-structurées représentées comme un graphe. Les entrées (nœuds) dans un graphe de connaissances sont appelées entités. De nos jours, les graphes de connaissances jouent un rôle central dans les services fournis par les acteurs majeurs du Web. Des exemples récents incluent le "Google Knowledge Vault", le graphe de connaissances de Yahoo et le "Knowledge and Action Graph" de Bing. De surcroît, les graphes de connaissances open-source, comme DBpedia, sont aussi disponibles pour réutilisation. Être en mesure de bénéficier des graphes de connaissances et d'exploiter les connexions entre les entités est la clé pour diverses tâches reliées, par exemple, au calcul linguistique, à l'extraction de connaissances, ou aux questions-réponses. Dans cette thèse, nous développons de nouvelles méthodes pour la recherche d'entités et pour le classement de leurs types. En plus, nous introduisons des algorithmes pour améliorer la qualité des données dans les graphes de connaissances en trouvant les propriétés d'entités mal utilisées (les arêtes étiquetées dans le graphe de connaissances) et nous illustrons comment les graphes de connaissances peuvent être utilisés en pratique en les exploitant dans l'optique de détecter les événements pertinents aux nouvelles (newsworthy). Finalement, nous proposons une nouvelle méthodologie d'évaluation qui peut être utilisée pour l'évaluation continue des systèmes d'extraction centrés sur les entités.

Nous commençons cette thèse par présenter de nouvelles techniques pour extraire les entités comme réponses à des requêtes par mots-clés. Cette tâche, appelée Ad-hoc Object Retrieval, est essentielle pour l'exploitation efficace des graphes de connaissances. Nous montrons aussi que les triplestores peuvent être utilisés pour extraire efficacement les réponses pertinentes en exploitant les voisins des entités obtenus par l'application des méthodes classiques de l'extraction de connaissances. Ceci conduit à des améliorations significatives en précision moyenne et NDCG avec une augmentation minime du temps d'exécution. Par la suite, nous introduisons une nouvelle tâche de classement de types (TRank), qui consiste dans le classement de tous les types d'une certaine entité basé sur le contexte textuel dans lequel elle apparaît. Les informations sur les types des entités qui conviennent le mieux à un contexte spécifique peuvent être montrées aux utilisateurs finaux pour la compréhension du texte, le résumé de texte et la diversification des résultats de recherche. Nous montrons empiriquement que les approches pour TRank basées sur la hiérarchie de types du graphe de connaissances fournissent des résultats plus précis. Comme les informations sur le schéma du graphe de connaissances sont exploitées par plusieurs de nos méthodes, nous avons aussi étudié comment vérifier l'adhésion des relations contenues dans la base de connaissances à leurs spécifications formelles. Plus précisément, nous proposons une méthode pour détecter quand les propriétés dans un graphe de connaissances sont mal utilisées ou ont besoin d'être spécialisées. Nous présentons des résultats qui montrent comment l'entropie peut être utilisée pour la détection des propriétés mal utilisées. Ensuite, nous nous tournons vers un contexte plus appliqué en présentant une application du monde réel qui met en exergue l'utilisation des technologies sémantiques et des graphes de connaissances. Le système que nous avons développé utilise une combinaison de liaison d'entités, de détection d'anomalies et de raisonnement pour une détection efficace et précise des événements pertinents aux nouvelles dans les microblogs. Nous montrons que notre système surpasse les méthodes de l'état de l'art basées sur l'expansion de requêtes. Finalement, nous abordons le problème de l'évaluation des systèmes d'extraction d'entités comme ceux que nous avons proposés. Dans ce contexte, nous introduisons une nouvelle méthodologie qui utilise le crowdsourcing pour évaluer un ensemble de systèmes d'une manière équitable et continue, et nous définissons des techniques pour peser la rigueur ou la clémence des différents crowds qui évaluent l'extraction des entités. Nous analysons les mérites et les inconvénients de notre méthodologie en comparant des systèmes AOR développés à différents moments et étudions comment les mesures classiques d'extraction d'information, le classement des systèmes AOR, et plusieurs techniques de regroupement se comportent dans un pareil contexte d'évaluation continue.

Mots clefs : Graphe de Connaissances, Entités, Types d’Entités, Intégration de Données.

Contents

Acknowledgments ix

Abstract (English) xi

1 Introduction 1
1.1 Scientific Contributions 3
1.1.1 Entity Retrieval 3
1.1.2 Entity Type Ranking 3
1.1.3 Schema Adherence 4
1.1.4 Exploiting Knowledge Graphs for Detecting Events 5
1.1.5 Evaluating Entity Retrieval Systems 5
1.1.6 Other Contributions 6
1.2 Thesis Outline 6

2 Background Knowledge 7
2.1 Introduction to Knowledge Graphs 7
2.1.1 What is a Knowledge Graph? A Simple Intuition 7
2.1.2 Resource Description Framework 9
2.1.3 Knowledge Graphs 12
2.1.4 Entities and Entity Types 13
2.1.5 Properties 14
2.2 Creating Knowledge Graphs 14
2.2.1 DBpedia 15
2.2.2 YAGO 15
2.2.3 Wikidata 16
2.2.4 Freebase 16
2.2.5 Linked Open Data 17
2.2.6 Other Knowledge Graphs 17
2.3 Using Knowledge Graphs 18
2.3.1 Entity Retrieval 18
2.3.2 Entity Linking 19
2.3.3 Entity Summarization 20
2.3.4 Systems Leveraging Knowledge Graphs 21
2.4 Evaluating Entity Retrieval Systems 22
2.4.1 Evaluation Metrics 23

3 Retrieving Entities from a Knowledge Graph 27
3.1 Introduction 27
3.2 Related Work 29
3.2.1 Early Entity-Centric IR Evaluation Initiatives 29
3.2.2 Semantic Search 30
3.2.3 Entity Linking 30
3.2.4 Ontology Alignment 31
3.2.5 Ad-hoc Object Retrieval 31
3.3 System Architecture 32
3.4 Query Processor 33
3.4.1 Query Expansion 33
3.4.2 Pseudo-Relevance Feedback 34
3.4.3 Named Entity Recognition 34
3.4.4 Entity Type Recognition 34
3.5 Inverted Index Searcher 35
3.5.1 Unstructured Inverted Index 35
3.5.2 Structured Inverted Index 36
3.5.3 NER in Queries to Improve Inverted Indexes Effectiveness 36
3.6 Graph-Based Result Refinement 36
3.6.1 Entity Retrieval by Entity Type 37
3.6.2 Entity Graph Traversals 37
3.6.3 Neighborhood Queries and Scoring 40
3.7 Results Combiner 41
3.8 Experimental Evaluation 43
3.8.1 Experimental Setting 43
3.8.2 Continuous Evaluation of AOR Systems Based on Crowdsourcing 44
3.8.3 Completing Relevance Judgments through Crowdsourcing 45
3.8.4 Baselines 46
3.8.5 Evaluation of the Inverted Index Searcher 46
3.8.6 Evaluation of Entity Retrieval by Entity Type 49
3.8.7 Evaluation of Graph Traversal Techniques 50
3.8.8 Learning to Rank for AOR 51
3.8.9 Effectiveness Evaluation of Hybrid Approaches 53
3.8.10 Efficiency Considerations 53
3.9 Conclusions 57

4 Displaying Entity Information: Ranking Entity Types 59
4.1 Introduction 59
4.2 Related Work 61
4.2.1 Named Entity Recognition 61
4.2.2 Entity Types 62
4.3 Task Definition 63
4.4 System Architecture 63
4.5 Approaches to Entity Type Ranking 66
4.5.1 Entity-Centric Ranking Approaches 66
4.5.2 Hierarchy-Based Ranking Approaches 68
4.5.3 Text-Based Context-Aware Approaches 68
4.5.4 Entity-Based Context-Aware Ranking Approaches 70
4.5.5 Mixed Approaches 71
4.5.6 Scalable Entity Type Ranking with MapReduce 72
4.6 Crowdsourced Relevance Judgments 73
4.6.1 Pilot Study 74
4.7 Experiments 74
4.7.1 Experimental Setting 74
4.7.2 Dataset Analysis 75
4.7.3 Effectiveness Results 77
4.7.4 Scalability 79
4.8 Discussion 82
4.9 Conclusions 85

5 Towards Cleaner KGs: Detecting Misused Properties 87
5.1 Introduction 87
5.2 Related Work 88
5.3 Motivation and Core Ideas 89
5.4 Detecting and Correcting Multi-Context Properties 91
5.4.1 Statistical Tools 92
5.4.2 LeXt 93
5.4.3 Discussion 94
5.4.4 ReXt and LeRiXt 95
5.5 Experiments 95
5.6 Conclusions 97

6 Applications of KGs: Entity-Centric Event Detection on Twitter 99
6.1 Introduction 99
6.2 Related Work 100
6.3 Motivation & Methodology 101
6.3.1 Approach 102
6.3.2 Semantic Event Descriptions 102
6.3.3 System Output 103
6.3.4 System Architecture 103
6.4 Natural Language Processing of Tweets 104
6.4.1 Data Preparation 104
6.4.2 Location Extraction 105
6.4.3 Passive Voice Correction 106
6.4.4 Entity Resolution 106
6.4.5 Verb Resolution 106
6.4.6 Quad Output 107
6.5 Semantic Analysis 107
6.5.1 The RDF Knowledge Graph for Event Detection 107
6.5.2 Resolving Location in the Knowledge Graph 108
6.5.3 Describing Complex Events and Extracting Time Series 109
6.6 Event Detection 110
6.7 Evaluation 111
6.7.1 Determining Complex Events 111
6.7.2 Creating Category Queries 111
6.7.3 Event Validation 112
6.7.4 Results 112
6.8 Conclusion 113

7 Continuous Evaluation of Entity Retrieval Systems 115
7.1 Introduction 115
7.2 Related Work 118
7.2.1 Evaluation Metrics 118
7.2.2 Pooling Strategies 119
7.2.3 Crowdsourcing Relevance Judgments 119
7.3 Continuous IRS Evaluation 120
7.3.1 Limitations of Current IR Evaluations 120
7.3.2 Organizing a Continuous IR Evaluation Campaign 121
7.3.3 Assumptions and Limitations of the Methodology 123
7.4 Continuous IR Evaluation Statistics 124
7.4.1 Measuring the Fairness of the Judgment Pool 124
7.4.2 Optimistic and Pessimistic Effectiveness 125
7.4.3 Opportunistic Number of Relevant Documents 126
7.5 Selecting Documents to Judge 127
7.5.1 Existing Pooling Strategies 127
7.5.2 Novel Pooling Strategies 127
7.6 Obtaining and Integrating Judgments 130
7.6.1 Dealing with Assessment Diversity 130
7.7 Experimental Evaluation 131
7.7.1 Experimental Setting 131
7.7.2 Continuous Evaluation Statistics 132
7.7.3 Pooling Strategies in a Continuous Evaluation Setting 135
7.7.4 Real Deployment of a Continuous Evaluation Campaign 139
7.8 Discussion 141
7.8.1 Integrating Relevance Judgments 142
7.8.2 Building the CJS 143
7.8.3 Economical Viability 143
7.8.4 More Continuous Continuous Evaluations 144
7.9 Conclusions 144

8 Conclusions 147
8.1 Lessons Learned 147
8.2 Future Work 148
8.2.1 A Knowledge Graph Spanning Over the World Wide Web 149
8.2.2 Actionable Knowledge Graphs 149
8.3 Outlook 150

1 Introduction

The last fifty years have witnessed a radical change in the way knowledge is gathered, stored, and represented. Starting from the 70s, much effort was put into fitting knowledge into relational databases, and into defining strict schemata describing what information each table contains and how it is modeled. This trend began to decline at the beginning of the new millennium when “Not only SQL” (NoSQL) databases, exploring other alternatives for modeling, storing, and retrieving information, were designed. The adoption of such alternatives accelerated as a consequence of the needs of big Web 2.0 players, and evolved about ten years ago with the adoption of entity-centric data modeled as a graph: Knowledge Graphs (KGs). Knowledge graphs have been embraced by important Web enterprises: Google products are enhanced by the Google Knowledge Graph;1 Microsoft developed a knowledge graph called Knowledge and Action Graph, which is used to empower Microsoft’s personal assistant service, Cortana;2 Facebook is building its Entity Graph by collecting facts from Wikipedia and from its users, who are asked to provide pieces of data that will probably be incorporated into such a graph;3 LinkedIn is organizing information about its users in a knowledge graph;4 Yandex, Baidu and Yahoo! are also investing in such technologies.5

Conceptually, a knowledge graph resembles a semantic network and is composed of nodes and labeled edges. Nodes represent concepts, also called entities, of the domain taken into consideration, while labeled edges represent relations between them. Knowledge graphs are “data-first, schema later” knowledge bases, meaning that they can feature a schema modeling their data, but the constraints it defines are not strictly enforced, that is, data may not adhere strictly to its formal model. Existing knowledge graphs are automatically populated by extracting information from existing structured, unstructured, or semi-structured data sources. Such techniques can be used to build considerably large knowledge graphs. DBpedia, for instance, is a well-known publicly available knowledge graph containing encyclopedic knowledge about

1https://www.google.com/intl/es419/insidesearch/features/search/knowledge.html 2https://www.bing.com/partners/knowledgegraph 3https://www.technologyreview.com/s/511591/facebook-nudges-users-to-catalog-the-real-world/ 4https://engineering.linkedin.com/blog/2016/10/building-the-linkedin-knowledge-graph 5https://goo.gl/o9OZ2S (from http://ir.baidu.com), https://goo.gl/crjZRA (from http://searchengineland.com).


17 million entities extracted from semi-structured data contained in Wikipedia pages [9]. In some cases humans can participate in the curation of a knowledge graph either by modifying the information it contains, or by adding additional data. This was the case for Freebase, a knowledge graph maintained by Google containing both information harvested from different sources and information added by members of its community [36].

Exploiting knowledge graphs is often not trivial. Even the task of retrieving an entity without knowing its unique identifier can be challenging and was tackled in the literature by several scientists who proposed different methods to address the problem. Pound et al., for example, suggest building a search engine that allows users to retrieve entities by using keyword queries [152]. This task is called Ad-hoc Object Retrieval (AOR). At the Text Retrieval Conference (TREC), one of the major events in the Information Retrieval (IR) community, two other ways of retrieving entities have been studied: retrieving entities given other example entities (Entity List Completion, ELC), and retrieving entities given a reference entity and the textual description of a relation (Related Entity Finding, REF) [18, 19]. Another popular task, addressed by the Databases, Natural Language Processing, and Semantic Web communities, is called Entity Linking (EL) and consists in spotting entity mentions in a given piece of text, and linking them to entries in the knowledge graph. All such tasks are difficult since they both have to deal with possibly very large repositories of data and with ambiguity. This last point is crucial as it often happens that similar inputs can match several entities; for example, the keyword “python” can identify many entities in DBpedia including snakes, programming languages, movies, comedians, orators, and a square in Fribourg (Switzerland) named after the founder of the University of Fribourg, Georges Python. The ambiguity issues are exacerbated by the presence of entities identified by common nouns such as the song “Yesterday”, by the Beatles.

Maintaining a knowledge graph is also difficult. To automatically add data into a knowledge graph one has to decide if the new information is true or not, and appropriate nodes and edges must be created in order to encode the new content. In addition, due to their “data-first, schema later” policy, keeping data and schema aligned in a knowledge graph can be challenging. While some knowledge graphs such as DBpedia and Freebase rely on manually curated schemata which, as we see in Chapter 5, can be violated, some researchers decided to adopt data-driven approaches and tried to automatically infer schemata by using statistics on the usage of the knowledge graph components [190]. In both cases, there might be data that does not adhere to its specification.

Numerous other issues and open questions related to knowledge graphs can be mentioned, such as the lack of a standard definition of a knowledge graph pointed out by Ehrlinger et al. [71]. This suggests that much work has still to be done in order to explore the potential of such objects, their properties, and the best ways to exploit them. As we see in the next section, this thesis is devoted exactly to this purpose.


1.1 Scientific Contributions

During our research work we tackled several tasks connected to entity retrieval, knowledge graph maintenance, and entity summarization, and we proposed effective methods to deal with them. Additionally, we developed and evaluated an application of knowledge graphs to event detection. Finally, we studied the common denominator of all our contributions, that is, the use of test collections built by using crowdsourcing. In this context, we propose an evaluation methodology that exploits crowdsourcing to fairly evaluate methods for retrieving information, including entity retrieval systems and Web search engines.

In the following we give a succinct overview of the tasks we studied and of the contributions we made. We also mention the peer-reviewed articles published in the course of our research work.

1.1.1 Entity Retrieval

The first task we consider is Ad-hoc Object Retrieval (AOR, for short) which, as we mentioned previously, amounts to building a search engine for the entities contained in a given knowledge graph, allowing users to retrieve entities with keyword queries. AOR is motivated by the fact that users of knowledge graphs often do not know the identifiers of the entities they are searching for, but rather know some other information about them that can be simply expressed by using keywords (e.g., “the First Lady”).

The solution that we propose is described in Chapter 3 and makes use both of Information Retrieval techniques and of structured repositories allowing us to explore knowledge graphs. Specifically, we index the entities composing the knowledge graph by using inverted indexes and then adopt standard IR ranking functions to obtain an initial ranked list of answers (entities) to the users’ keyword queries. We then use the structured repository to explore the surroundings of each retrieved entity to look for other entities related to the user query, or to collect further evidence on the relevance of entities already retrieved.
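
To give a flavor of this hybrid strategy, the following minimal Python sketch (not the actual implementation evaluated in Chapter 3; all function and variable names are illustrative) combines a standard inverted-index score with evidence collected from the graph neighborhood of each candidate entity.

def keyword_score(index, query, entity):
    # Standard IR relevance (e.g., a BM25-like score) of the entity's textual
    # description for the query; here the "index" is simply a dictionary.
    return index.get((entity, query), 0.0)

def neighbors(graph, entity):
    # Entities directly connected to `entity` in the knowledge graph,
    # where `graph` is a set of (subject, predicate, object) triples.
    return ({o for (s, p, o) in graph if s == entity} |
            {s for (s, p, o) in graph if o == entity})

def hybrid_rank(index, graph, query, candidates, alpha=0.7, top_k=10):
    # Combine the inverted-index score of each candidate with evidence
    # gathered from its surroundings; neighbors matching the query both
    # boost the candidate and become candidates themselves.
    scores = {}
    for e in candidates:
        neigh = neighbors(graph, e)
        boost = sum(keyword_score(index, query, n) for n in neigh)
        scores[e] = alpha * keyword_score(index, query, e) + (1 - alpha) * boost
        for n in neigh:
            scores.setdefault(n, (1 - alpha) * keyword_score(index, query, n))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]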

A large part of what we present in that context was also published at the SIGIR conference in 2012 [181].

1.1.2 Entity Type Ranking

Once an entity is retrieved, for example by using the method for AOR described previously, it is often summarized for the end-user. Summarizing an entity means selecting which information is best to show to end-users to describe the entity. This task is motivated by the fact that in current knowledge graphs entities are sometimes associated with too much information, which human users cannot process quickly: some nodes in a knowledge graph can have hundreds or thousands of incoming or outgoing edges describing the entities they represent. Among all the information that can be attached to entities, information on their types is of vital importance as it defines what the entities are; for example, in DBpedia the entity Georges


Python6 is defined to be a person, a politician, and several other things, including human, deputy, and thing. Nevertheless, as the reader might have noticed, entities are often associated with several types (Barack Obama has 119 types in DBpedia), and selecting the one type to show to the user can be challenging as it can depend on the context in which the entity appears, on the user who looked for it, etc.

For this reason we tackle a task connected to entity summarization that we call entity Type Ranking (TRank, for short). TRank consists in ranking the types of a given entity appearing in a given textual context. In this thesis, we report on how different methods for type ranking perform by studying how they rank entity types appearing in news articles. The approaches we analyze take into consideration several factors including the textual content in which entities are mentioned, relations between types in the knowledge graph (e.g., a politician is also a person), and between entities. We also describe a pipeline for detecting entities and ranking their types that runs on a cluster of computers, and we test its scalability by processing a large dataset composed of more than one million webpages.
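
As a purely illustrative sketch (the actual TRank approaches are defined in Chapter 4), the following Python fragment ranks the types of an entity by combining lexical overlap between each type label and the textual context with the depth of the type in the hierarchy, so that more specific types are preferred; all identifiers and weights are hypothetical.

def depth(type_uri, parents):
    # Number of subtypeOf steps from `type_uri` up to the root of the hierarchy.
    d = 0
    while type_uri in parents:
        type_uri = parents[type_uri]
        d += 1
    return d

def rank_types(entity_types, type_labels, parents, context_text):
    # Score each type by (i) overlap of its label with the context and
    # (ii) a small bonus for being deeper (more specific) in the hierarchy.
    context = set(context_text.lower().split())
    def score(t):
        label_terms = set(type_labels.get(t, "").lower().split())
        return len(label_terms & context) + 0.1 * depth(t, parents)
    return sorted(entity_types, key=score, reverse=True)

# Toy usage with hypothetical DBpedia-style types:
parents = {"dbo:Politician": "dbo:Person", "dbo:Person": "owl:Thing"}
labels = {"dbo:Politician": "politician", "dbo:Person": "person"}
print(rank_types(["dbo:Person", "dbo:Politician"], labels, parents,
                 "The politician gave a speech in Fribourg"))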

Our contributions in this area are reported in Chapter 4 and were also published in the proceedings of the 12th International Semantic Web Conference (ISWC) [178], where our article was a best paper award candidate, and in the special issue on knowledge graphs of the Journal of Web Semantics [179].

1.1.3 Schema Adherence

We now change our point of view and focus on maintenance operations. As we mentioned previously, type information is essential in knowledge graphs but, while earlier we focused on nodes, we now focus on edges. We call each edge in a knowledge graph a property of the entity it departs from. It is possible to formally describe each property by defining which types of entities it can connect. In DBpedia, for example, the property “fastest driver” has to connect an entity of type “Grand Prix” to an entity of type “Person”. In this case we say that “Grand Prix” is the domain of the property, and that “Person” is its range. Having data that complies with its schema, in our case edges connecting entities of specific types, allows users to better exploit the information contained in the knowledge graph and is beneficial to many tasks.

In Chapter 5 we tackle the problem of detecting misused and unspecified properties, that is, properties that are either used to describe entities of the wrong type, or lack a domain or range specification. Our contribution in this context consists in a method that leverages the statistical notion of entropy to detect property misuse, and to suggest modifications to the knowledge graph to make it more compliant with its schema.
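
As a rough illustration of this idea (a simplified sketch, not the method evaluated in Chapter 5), the following Python fragment computes the entropy of the distribution of subject types observed for a property: a property used consistently on one type yields low entropy, while a property spread over many unrelated types yields high entropy and becomes a candidate for being misused or needing specialization. Identifiers and the toy data are hypothetical.

import math
from collections import Counter

def subject_type_entropy(triples, types, prop):
    # Entropy (in bits) of the distribution of the types of the subjects
    # that use `prop`; a high value hints at a misused or underspecified property.
    counts = Counter(t for (s, p, o) in triples if p == prop
                       for t in types.get(s, ["unknown"]))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy data with hypothetical DBpedia-style identifiers.
triples = [("dbr:Monaco_GP", "dbo:fastestDriver", "dbr:Lewis_Hamilton"),
           ("dbr:Italian_GP", "dbo:fastestDriver", "dbr:Charles_Leclerc"),
           ("dbr:Some_Film", "dbo:fastestDriver", "dbr:Someone")]
types = {"dbr:Monaco_GP": ["dbo:GrandPrix"],
         "dbr:Italian_GP": ["dbo:GrandPrix"],
         "dbr:Some_Film": ["dbo:Film"]}
print(subject_type_entropy(triples, types, "dbo:fastestDriver"))  # ~0.92 bits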

Results on the effectiveness of our method were presented in 2015 at the Linked Data on the Web workshop (LDOW) [177].

6http://fr.dbpedia.org/page/Georges_Python


1.1.4 Exploiting Knowledge Graphs for Detecting Events

One of the goals of this thesis is to show how knowledge graphs, and semantic technologies in general, can be applied in order to solve various tasks. To do so, in Chapter 6 we show how we exploited knowledge graphs to solve the task of detecting newsworthy events on Twitter. The system we propose, ArmaTweet, takes as input semantic queries describing precisely what kind of events the user wants to detect (e.g., “deaths of politicians”), and constantly monitors Twitter to identify relevant tweets that are then shown to the user together with a semantic summary of the event they describe. Relevant tweets are identified by using Entity Linking techniques to link entity mentions to DBpedia nodes, and state-of-the-art Natural Language Processing tools to extract relations between entities.

ArmaTweet is the result of a collaboration between Oxford University, Armasuisse, the R&D agency of the Swiss Armed Forces, and the University of Fribourg. The results we present in this context were also published in the proceedings of the Industrial Track of the 14th edition of the Extended Semantic Web Conference (ESWC) in 2017 [180].

1.1.5 Evaluating Entity Retrieval Systems

The last contribution we describe in this thesis is not directly related to techniques that exploit knowledge graphs, but is rather connected to how such techniques are evaluated. In particular, in Chapter 7 we focus on how to use crowdsourcing to fairly and continuously evaluate information retrieval systems. Delving into this topic is important since we adopted similar techniques to evaluate most of the research work we have done so far, including the contributions we describe in Chapters 3 and 4. We briefly describe our contribution by using as an example the task of Ad-hoc Object Retrieval that we introduced previously in this chapter. Typically, academic approaches for AOR are evaluated by using test collections composed of a knowledge graph, a set of keyword queries, and a set of relevance judgments specifying whether a given entry of the knowledge graph is relevant to a certain query or not. Nevertheless, recall that knowledge graphs are typically very large, so labeling each possible combination of entities and queries is unfeasible. For this reason, current evaluation initiatives adopt a technique called pooling to select which documents to evaluate. In practice, all systems participating in an evaluation initiative are run on the given queries and create ranked lists of entities retrieved from the input graph. The top-n documents retrieved by each system for each query compose a pool of documents that human annotators will judge. Obviously, other systems willing to compare their results against those achieved by systems participating in the evaluation initiative might not have their top-n documents judged, and are thus penalized compared to the original participants. In Chapter 7 we study this phenomenon and we discuss how crowdsourcing can be used to mitigate it. The results we present were also published in the Information Retrieval Journal [182].
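
The following Python sketch illustrates standard top-n pooling and the resulting blind spot for systems that did not contribute to the pool; the data structures and names are illustrative, not those of any specific evaluation campaign.

def build_pool(runs, n=10):
    # `runs` maps system name -> {query: ranked list of entity identifiers}.
    # The judged pool is the union of the top-n results of all participants.
    pool = {}
    for per_query in runs.values():
        for query, ranking in per_query.items():
            pool.setdefault(query, set()).update(ranking[:n])
    return pool

def unjudged(new_run, pool, n=10):
    # Top-n results of a later system that fall outside the judged pool and
    # would therefore be treated as non-relevant by the original judgments.
    return {q: [e for e in ranking[:n] if e not in pool.get(q, set())]
            for q, ranking in new_run.items()}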


1.1.6 Other Contributions

In addition to the core contributions described above, we also contributed to other scientific research reported in the following.

In 2014 we contributed to the design of “B-Hist”, a Web browser plug-in by Catasta et al. that uses entity types to better organize users’ browsing history [52]. In the same year we also co-developed the concept of memory queries, that is, queries where users try to recall from their memory some personal past experience. We proposed an approach called transactive search that consists in reconstructing such memories by using pieces of information coming from other users [53].

In 2016 we participated in the work of Prokofyev et al. Together we proved that entity linking techniques can be used to improve the performance of existing NLP tools for coreference resolution in textual content, that is, determining if two strings denote the same concept [154]. For example, the strings “Larry Page” and “CEO” refer to the same entity in the following piece of text: “Larry Page met Barack Obama yesterday evening, the CEO was pleased to meet the president”.

Finally, in 2016 we participated in the construction of VoldemortKG, a dataset built over semantic annotations contained in a large Web crawl. We did this to foster research on ontology and instance matching to achieve the final vision of creating a large knowledge graph that potentially spans a large fraction of the World Wide Web [183].

1.2 Thesis Outline

We conclude this first chapter by outlining the structure of this dissertation. This thesis is structured in eight chapters, five of which are devoted to presenting the core scientific contributions we made.

In the following chapter we introduce basic knowledge to help readers understand the rest of the dissertation. We do this by first presenting an intuitive example knowledge graph, and then by formally defining its components. We also summarize current automatic methods for building knowledge graphs and existing datasets. Chapter 3 describes our contributions to the task of Ad-hoc Object Retrieval and the experimental evaluation of our methods. We then move towards entity summarization in Chapter 4, where we describe several approaches for entity Type Ranking, and subsequently change perspective in Chapter 5, where we tackle a problem more connected to data quality, that is, improving the adherence of the knowledge graph data to its schema. Chapter 6 is devoted to a practical use of knowledge graphs: ArmaTweet, our entity-centric event detection system. We conclude the description of our contributions in Chapter 7, where we discuss how to fairly evaluate entity retrieval systems by using crowdsourcing. Finally, Chapter 8 closes this thesis by summarizing our contributions, the lessons we learned, and by discussing possible future work.

2 Background Knowledge

This part of the thesis is devoted to giving readers the background information needed to understand the rest of this dissertation. After this chapter, readers will have an idea of what a knowledge graph is, of the current methods and available resources that can be exploited to automatically build knowledge graphs, and of what it is possible to achieve by using them. In addition, we will briefly cover aspects related to the evaluation of the methods we propose in subsequent chapters.

In what follows, we focus on simplicity and concision: we try to provide all and only the information needed to follow the content of the thesis, sometimes by omitting technical details which are not relevant to our context, and that would just burden the reader. We try, however, to provide pointers to further material that interested readers can consult to get more information on the topics we mention.

2.1 Introduction to Knowledge Graphs

In the following, we first introduce knowledge graphs by proposing an intuitive example; we then focus on how a knowledge graph (KG) can be represented and formally defined. We point readers interested in a deeper mathematical background or in other topics such as available ontologies, reasoning, storage of knowledge graphs, etc. to the well-known “Handbook of Semantic Web Technologies” [69].

2.1.1 What is a Knowledge Graph? A Simple Intuition

Figure 2.1 shows a small example knowledge graph. Knowledge graphs are mainly composed of three types of objects: entities, properties, and entity types.


"Cobie Smulders" name "How I Met Your Mother" type type showName Person starring type subtypeOf

starring type Actor name network "Neil Patrick Harris" type type

TV Network Work TV Show type type Broadcast Network

Figure 2.1 – An example Knowledge Graph. Entities are represented by images, datatype properties by thin arrows, and object properties by bold arrows. Blue bubbles represent entity types.

Entities

Entities are the concepts which are described by the knowledge graph. In Figure 2.1 entities are represented by images: the TV show titled “How I Met Your Mother”, the actors Cobie Smulders and Neil Patrick Harris, and the television network CBS are the entities composing our small knowledge graph. In the rest of this dissertation we use the words “entity”, “concept”, and “resource” interchangeably even if in some contexts they denote different concepts.

Properties

Properties are used in two ways: either they are used to give information about entities by using literals such as the string “How I Met Your Mother”, or they represent relations between pairs of entities. Note that we use different fonts to distinguish string literals from entities. In Figure 2.1 properties are represented by labeled arrows. For example, the property “showName” gives information on the name of the entity “How I Met Your Mother” by linking it to a literal, while the property “starring” connects the entity to the actors (other entities) playing in the TV show. The direction of such arrows is essential since it defines which entity does what: TV shows star actors but not vice versa. In the rest of this dissertation, if not explicitly stated, we use the words “property” and “predicate” interchangeably.


Entity Types

Entity Types are used to specify what an entity is and which properties it usually has. Entities can be instances of one or more types, for example, in Figure 2.1 “Cobie Smulders” is both a “Person” and an “Actor”. In many knowledge graphs types are ordered by specificity, for instance, in our example the property “subtypeOf” is used to state that “Actor” is a subtype of “Person”, thus every instance of “Actor” is also an instance of “Person”, but not vice versa. In this case we say that “Actor” is more specific than “Person”. In the rest of this dissertation, if not explicitly stated, we use the terms “entity type”, “type”, and “class” interchangeably.

2.1.2 Resource Description Framework

To represent a knowledge graph we usually adopt the Resource Description Framework (RDF).1 RDF is meant to describe resources, in our case entities, and is a major component of the Semantic Web recommended by the W3C.

The data model of RDF is very simple and is able to encode directed graphs by using triples of the form (subject,predicate,object), where subject and object are nodes in the graph, and predicate is a labeled edge connecting them. Entities, properties, and entity types are identified by Uniform Resource Identifiers2 (URI) or, more generally, by Internationalized Resource Identifiers3 (IRI). For example, the following triples give information on the entity identified by the URI http://dbpedia.org/resource/How_I_Met_Your_Mother, which is an instance of the type TV Show (http://dbpedia.org/ontology/TelevisionShow is the URI identifying the type “TV Show”), its English show name is “How I Met Your Mother” (notice the @en mark), and stars the entity identified by http://dbpedia.org/resource/Cobie_Smulders, which is an actress named “Cobie Smulders”.

. "How I Met Your Mother"@en . . . "Cobie Smulders"@en .

1https://www.w3.org/TR/2014/NOTE-rdf11-primer-20140624/ 2https://tools.ietf.org/html/rfc3986 3http://www.ietf.org/rfc/rfc3987.txt


Table 2.1 – Common namespace prefixes used in this thesis (cf. http://prefix.cc).

Prefix    Namespace
dbo       http://dbpedia.org/ontology/
dbr       http://dbpedia.org/resource/
dc        http://purl.org/dc/elements/1.1/
fb        http://rdf.freebase.com/ns/
foaf      http://xmlns.com/foaf/0.1/
geonames  http://www.geonames.org/ontology#
owl       http://www.w3.org/2002/07/owl#
rdf       http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs      http://www.w3.org/2000/01/rdf-schema#
skos      http://www.w3.org/2004/02/skos/core#
wikidata  http://www.wikidata.org/entity/
wn        http://xmlns.com/wordnet/1.6/
xsd       http://www.w3.org/2001/XMLSchema#

As can be noticed from the triples reported above, several URIs share the same namespace, e.g., http://dbpedia.org/resource/How_I_Met_Your_Mother and http://dbpedia.org/resource/Cobie_Smulders. To simplify the syntax and increase readability, abbreviations identifying namespace prefixes are defined; for example, the prefix dbr is used to denote DBpedia resources and corresponds to the namespace http://dbpedia.org/resource. The entity “Cobie Smulders” can thus be denoted by dbr:Cobie_Smulders. In this thesis we use the namespace prefixes defined by http://prefix.cc; the most common prefixes we use are also reported in Table 2.1.
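
For illustration, prefix expansion amounts to a simple string concatenation, as in the following Python snippet (the dictionary simply mirrors part of Table 2.1):

# Expanding a prefixed name such as dbr:Cobie_Smulders into a full IRI.
PREFIXES = {
    "dbo": "http://dbpedia.org/ontology/",
    "dbr": "http://dbpedia.org/resource/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

def expand(prefixed_name):
    prefix, local = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local

print(expand("dbr:Cobie_Smulders"))  # http://dbpedia.org/resource/Cobie_Smulders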

In addition, the reader might also have noticed that objects of triples can be either entity identifiers enclosed in angle brackets or basic values called literals. Literals are associated with a datatype, enabling such values to be parsed and interpreted correctly; for example, the literal "5"^^xsd:integer represents the integer 5. As we mentioned previously, literals can also be associated with language tags denoting their language.

Finally, RDF allows us to declare blank nodes to denote resources that do not have a global identifier. The following triples, for example, state that “Cobie Smulders” is married to a person called Taran Killam; however, since blank nodes are only visible in the graph being described, other knowledge graphs cannot refer to Cobie Smulders’ spouse by using any URI.

<http://dbpedia.org/resource/Cobie_Smulders> <http://dbpedia.org/ontology/spouse> _:a .
_:a <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Person> .
_:a <http://xmlns.com/foaf/0.1/name> "Taran Killam"@en .


Data Modeling

It is possible to define data models for RDF data by using special vocabularies recommended by the W3C. The most used at the time of writing are RDF Schema (RDFS)4 and the Web Ontology Language (OWL)5. By using such vocabularies it is possible to define entity types, subsumption relations between them, properties, constraints on the entities that properties can connect, etc. For example, we can use RDFS to state that dbo:Person and dbo:Actor are types, and that dbo:Person subsumes dbo:Actor (i.e., the former is more generic than the latter) by using the following triples.

<http://dbpedia.org/ontology/Person> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/01/rdf-schema#Class> .
<http://dbpedia.org/ontology/Actor> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/01/rdf-schema#Class> .
<http://dbpedia.org/ontology/Actor> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://dbpedia.org/ontology/Person> .

The main difference between RDFS and OWL is that OWL provides a larger vocabulary that can be used to describe additional characteristics of the data. For example, while with OWL it is possible to state that two resources denote the same thing, or that two types or two properties are equivalent, RDFS does not provide vocabulary to express such relations.

Querying RDF Data

Since RDF data can be represented by using a graph, the most intuitive way of querying it is by using graph patterns. SPARQL,6 the query language for RDF data recommended by the W3C, allows us to do so by using triple patterns, which are like RDF triples except that each of the subject, predicate and object can be a variable. For example, if we consider the graph depicted in Figure 2.1, the following SPARQL query selects all the actors who played in “How I Met Your Mother”.

SELECT ?actor WHERE {
  ?actor <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/Actor> .
  <http://dbpedia.org/resource/How_I_Met_Your_Mother> <http://dbpedia.org/ontology/starring> ?actor .
}

Notice that the variable ?actor is used in two triple patterns in a conjunctive way, that is, the same entity has both to be an actor and to play in the TV show.

In this thesis we use queries very similar to the one described above, so we do not describe in detail more advanced features of SPARQL. We point the interested reader to the W3C specification of the language mentioned previously.
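
As an illustration of how such a query can be executed programmatically, the following Python snippet loads a simplified version of the example data and runs the query with the rdflib library (assuming rdflib is installed; identifiers follow the prefixes of Table 2.1):

from rdflib import Graph

DATA = """
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
dbr:Cobie_Smulders rdf:type dbo:Actor .
dbr:How_I_Met_Your_Mother dbo:starring dbr:Cobie_Smulders .
"""

QUERY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?actor WHERE {
  ?actor rdf:type dbo:Actor .
  dbr:How_I_Met_Your_Mother dbo:starring ?actor .
}
"""

g = Graph()
g.parse(data=DATA, format="turtle")   # load the toy graph
for row in g.query(QUERY):            # run the SPARQL query
    print(row.actor)                  # http://dbpedia.org/resource/Cobie_Smulders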

4https://www.w3.org/TR/rdf-schema/ 5https://www.w3.org/TR/owl2-overview/ 6https://www.w3.org/TR/sparql11-query/


Serialization

Knowledge graphs are usually serialized by using RDF serialization formats such as RDF/XML7, N-Triples8, N-Quads9, Turtle10, and JSON-LD11. RDF triples can then be stored and queried by using conventional database management systems, noSQL databases, or ad-hoc solutions called triple stores. In this thesis we make use of two triple stores to store and query knowledge graphs: in Chapter 3 we use a research prototype called RDF3X [139], while in Chapter 6 we use an in-memory triple store called RDFox [138].

2.1.3 Knowledge Graphs

As Ehrlinger and Wöß pointed out recently [71], there is no standard definition of a knowledge graph yet. Some authors use the term “knowledge graph” to denote a semantic network, some consider knowledge graphs as ontologies, and some others use the term to denote knowledge bases. Finally, several people associate knowledge graphs with Big Data and state, basically, that a knowledge graph is a large ontology (or knowledge base), despite the notion of “large” being quite subjective.

The formal definition of knowledge graph we used in this thesis is based on the RDF data model and described by Equation 2.1, where E denotes the set of all entities (individuals), T the set of all entity types (classes), P the set of all the properties used in the knowledge graph, and L the set of all literals (data values).12

KG = { (s, p, o) | s ∈ E ∪ T ∪ P, p ∈ P, o ∈ E ∪ L ∪ T ∪ P }    (2.1)

The equation basically states that a knowledge graph is a semantic network defined as a set of triples (s, p, o) specifying that a node s (either an entity or a type) is connected to another node o by the property p. We call the components of each triple as in RDF: subject, predicate, and object. Notice that, since knowledge is modeled by using a graph structure, only binary properties connecting pairs of entities are allowed; however, n-ary relations can still be encoded by using some of the techniques proposed by the W3C consortium.13 The simple definition of a knowledge graph we gave is accurate enough to express all the concepts we present in this dissertation; more complete and detailed definitions can be given by using a variant of the ontology model presented by Domingue et al. [69] but doing so does not provide any added value to the reader of this thesis.
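
For illustration, Equation 2.1 can be transcribed almost literally into executable form; the following Python snippet uses shortened, hypothetical identifiers for the entities of Figure 2.1.

# A direct transcription of Equation 2.1 on toy data: the knowledge graph is a
# set of (s, p, o) triples whose components are drawn from E, T, P and L.
E = {"HIMYM", "Cobie_Smulders"}
T = {"TVShow", "Actor", "Person"}
P = {"type", "starring", "showName", "subtypeOf"}
L = {"How I Met Your Mother"}

KG = {
    ("HIMYM", "type", "TVShow"),
    ("HIMYM", "showName", "How I Met Your Mother"),
    ("HIMYM", "starring", "Cobie_Smulders"),
    ("Actor", "subtypeOf", "Person"),
}

def well_formed(kg):
    # Check that every triple respects the membership conditions of Eq. 2.1.
    return all(s in E | T | P and p in P and o in E | L | T | P
               for (s, p, o) in kg)

print(well_formed(KG))   # True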

Unless otherwise stated, we assume that a knowledge graph contains all information users

7https://www.w3.org/TR/2014/REC-rdf-syntax-grammar-20140225/ 8https://www.w3.org/TR/2014/REC-n-triples-20140225/ 9https://www.w3.org/TR/2014/REC-n-quads-20140225/ 10https://www.w3.org/TR/2014/REC-turtle-20140225/ 11https://www.w3.org/TR/2014/REC-json-ld-20140116/ 12Since in the rest of this thesis we make no distinction between different kinds of basic types (integer, string, float, etc.) we decided to group all their instances together in the set L, for the sake of simplicity. 13https://www.w3.org/TR/swbp-n-aryRelations/

need to tackle their tasks. In this sense, a knowledge graph can include (or make use of) several, possibly interlinked ontologies. With this statement we want to highlight the fact that KGs can reuse vocabulary defined by existing ontologies (e.g., by using both GeoNames properties and DBpedia properties), as well as entities which are part of other ontologies (e.g., by using links to both GeoNames and DBpedia entities). Moreover, we say that the schema of a knowledge graph is composed of its types, the subsumption relations among them, and the formal declaration of the properties used in the knowledge graph, including their subsumption relations and the declarations of their domains and ranges. Knowledge graph schemata are usually defined by using RDFS and OWL. We will cover all these aspects in the following sections.

Knowledge Graphs are often mentioned together with “Linked Data”; the two concepts are related but denote very different things since a knowledge graph is a knowledge base while Linked Data is a method of publishing structured data. A knowledge graph can be Linked Data if it is published according to the three “extremely simple” rules described by Tim Berners-Lee.14 Nevertheless, we consider the Linked Open Data Cloud, which we describe in Section 2.2, to be a Knowledge Graph.

2.1.4 Entities and Entity Types

Entity types are analogous to classes in ontologies: it is possible to define subsumption relations between entity types as well as instantiation axioms specifying that an entity is an instance of a certain type. Subsumption relations define an ordering on entity types that we call type hierarchy and that we often represent by using a Directed Acyclic Graph (DAG) or a tree, depending on the structure of the relations; in this last case we use the symbol ⊤ to denote the root of the tree. We say that a certain type t1 is a subtype of a more generic type t0 if the latter type subsumes the former; in such a situation we say that t1 is a child of t0 and that t0 is the parent of t1. In general, we denote by Ch(t) the set containing the direct children of a certain type t, by Par(t) the parent of t (or its set of parents, depending on the context), and by Ancestors(t) the set containing all the types on the path from t to ⊤, provided that the type hierarchy is a tree, or a forest of trees. The type hierarchy is normally contained in the knowledge graph and represented by triples of the form (t1, subtypeOf, t0), where t0, t1 ∈ T and subtypeOf is a special property encoding the subsumption relation between two types, such as rdfs:subClassOf.
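
The following Python sketch makes this notation concrete on a toy hierarchy stored as a child-to-parent dictionary (identifiers are hypothetical, and the root ⊤ is written "Thing"):

PAR = {"Actor": "Person", "Politician": "Person", "Person": "Thing"}

def par(t):
    # Par(t): the parent of t, if any.
    return PAR.get(t)

def ch(t):
    # Ch(t): the direct children of t.
    return {c for c, p in PAR.items() if p == t}

def ancestors(t):
    # Ancestors(t): all types on the path from t up to the root.
    result = []
    while t in PAR:
        t = PAR[t]
        result.append(t)
    return result

print(ancestors("Actor"))   # ['Person', 'Thing']
print(ch("Person"))         # {'Actor', 'Politician'}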

In general, as we will see in Chapter 4, we do not make assumptions on the number of types associated with an entity; for example, the entity “Tom Cruise” can be both of type “Actor” and of type “Scientology Adept” even if they do not have any common ancestor in the type hierarchy. We use the symbol “a” to denote instantiations and we write “e a t” to state that entity e is an instance of type t, that is, (e, a, t) ∈ KG. We usually represent instantiations by using the rdf:type property.

14http://www.ted.com/talks/tim_berners_lee_on_the_next_web


2.1.5 Properties

Properties are used to describe entities and relations between entities. In the following chapter we distinguish datatype properties from object properties. Datatype properties connect entities in E to literals in L, while object properties connect entities. In Figure 2.1 object properties and datatype properties are represented by using thick and thin edges, respectively.

It is possible to specify domain and range constraints on properties. A domain constraint (or axiom, in ontology jargon) on an object property p states that, for any connection (e0, p, e1) ∈ KG between two entities e0 and e1, the source element e0 is an instance of the domain entity type. Similarly, a range constraint (or axiom) imposes a similar constraint on e1: for any connection (e0, p, e1), e1 has to be an instance of the range entity type. Properties can have more than one domain; in this case we treat them conjunctively, that is, if a property has two domains we force source entities to be instances of both entity types. Cases in which a property has more than one range are treated analogously. Notice that, although here and in the rest of this thesis we focus on object properties, it is possible to give such constraints for datatype properties, too.

Similarly to what we saw for entity types, one can define subsumption relations between properties. If such a relation exists between two properties p0 and p1, we say that p0 subsumes (is more general than, or is the parent of) p1; this indicates that all entities connected by p1 are also connected by p0. This implies that domain and range constraints defined on p1 must be compatible with those defined on p0, that is, the domain of p1 has to be either the same as or a descendant of p0's domain. Range constraints are treated analogously.

Property subsumptions, and domain and range constraints, are usually contained in the KG and represented by using special properties denoting the constraints we impose, such as rdfs:domain, rdfs:range, and rdfs:subPropertyOf. For example, the triple (p, rdfs:domain, t) ∈ KG, where p ∈ P and t ∈ T, denotes that t is the domain of p. Similar examples can be made to encode property subsumptions and range constraints.
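
As a minimal sketch of how such constraints can be checked in practice (an illustration only, not the detection method of Chapter 5), the following Python fragment tests whether a triple violates the domain declared for its property; all identifiers are hypothetical.

def violates_domain(triple, domains, instance_types, ancestors):
    # Return True if the triple breaks the domain constraint of its property:
    # one of the subject's types (or their ancestors, via subsumption) must be
    # the declared domain type.
    e0, p, _ = triple
    required = domains.get(p)
    if required is None:
        return False                      # no domain declared, nothing to check
    types_of_e0 = set(instance_types.get(e0, []))
    expanded = types_of_e0 | {a for t in types_of_e0 for a in ancestors(t)}
    return required not in expanded

# Toy data: dbo:fastestDriver expects a dbo:GrandPrix as its subject.
domains = {"dbo:fastestDriver": "dbo:GrandPrix"}
instance_types = {"dbr:Monaco_GP": ["dbo:GrandPrix"], "dbr:Some_Film": ["dbo:Film"]}
flat_hierarchy = lambda t: []             # toy hierarchy with no ancestors
print(violates_domain(("dbr:Monaco_GP", "dbo:fastestDriver", "dbr:X"),
                      domains, instance_types, flat_hierarchy))   # False
print(violates_domain(("dbr:Some_Film", "dbo:fastestDriver", "dbr:X"),
                      domains, instance_types, flat_hierarchy))   # True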

2.2 Creating Knowledge Graphs

The goal of this section is to introduce readers to the knowledge graphs that we use throughout this thesis and to give them an overview of the core ideas currently used to create knowledge graphs. In the following we first briefly describe DBpedia, YAGO, Freebase and Wikidata, and then introduce Linked Open Data, and other methods for creating knowledge graphs worth mentioning. Interested readers can find more information on methods for creating and refining knowledge graphs in the survey by Paulheim [144]. Finally, readers who wish to know more about the differences between the knowledge graphs we present can consult the article by Färber et al., devoted to this topic [75].


2.2.1 DBpedia

Many important approaches for building knowledge graphs use Wikipedia as a starting point.15 The DBpedia project, for example, extracts the structured information that is included in Wikipedia articles by using wiki markup [29].16 The most valuable source of information for DBpedia are the so-called “infoboxes”, which are tables summarizing the most important properties of the entity described by a Wikipedia page. This extraction process generates one of the most popular knowledge graphs publicly available which, at the time of writing, contains more than 17M entities.17 DBpedia also features a manually curated ontology defining, in particular, a hierarchy of types and domain and range constraints for many properties. Being derived from Wikipedia, DBpedia features encyclopedic knowledge mostly focused on people, locations, organizations, and creative works such as books, pieces of art, movies, and music.

2.2.2 YAGO

In 2007, the same year in which the original DBpedia article was published [9], Suchanek et al. built YAGO (Yet Another Great Ontology) [174], a knowledge graph that shares with DBpedia some core ideas but aims at integrating Wikipedia and WordNet, a well-known lexicon for the English language [127].18

YAGO2 enhances the knowledge graph by adding spatial and temporal information to its data [93]. Spatial information is attached to entities of type Event, Group (or Organization), and Artifact by means of special properties. Such properties connect the entity to geographical entities extracted either from Wikipedia or GeoNames, a large knowledge graph about locations containing more than 7M entries.19 Temporal information is mostly extracted from Wikipedia infoboxes and is attached to people, groups, artifacts, and events. In all cases, temporal information denotes the time of existence of the entity; for example, people exist from the point in time in which they were born to the one in which they died. YAGO2 puts particular emphasis also on the time dimensions of facts by using special algorithms to detect when facts, encoded by using properties, are valid.

The latest version of YAGO, YAGO3, uses Wikidata (described later in this section) in order to merge information coming from different versions of Wikipedia written in different languages. Interestingly enough, in YAGO3 the spatial and time dimensions of entities and facts introduced in YAGO2 were not taken into consideration [117].

According to Hoffart et al., who also built YAGO2, one of the main differences between YAGO and DBpedia is the fact that the two knowledge graphs have different type hierarchies: DBpedia

15 http://wikipedia.org
16 http://dbpedia.org
17 http://wiki.dbpedia.org/dbpedia-2016-04-statistics accessed on February 17th 2017
18 Yago and WordNet are available online at https://goo.gl/s38cXt (from www.mpi-inf.mpg.de) and https://wordnet.princeton.edu, respectively.
19 http://www.geonames.org

features about 300 types while YAGO features more than 300,000 entity types [93]. This is due to the fact that DBpedia's type hierarchy was created manually, while YAGO's was automatically derived starting from Wikipedia categories and WordNet. The main consequence of this fact is that YAGO has very specific entity types which can have very few instances and, therefore, are not always interesting for the users of the knowledge graph. We extensively discuss this issue in Chapter 4. Conversely, while YAGO relies on carefully handmade patterns to extract information from Wikipedia infoboxes and tries to merge values coming from similar infobox attributes (such as "birthdate" and "dateofbirth"), DBpedia does not. This results in a remarkable difference in the number of properties used by the two knowledge graphs: while YAGO features about 100 manually curated properties, DBpedia features more than 1,000 properties, which are sometimes duplicated or too specific (e.g., dbo:aircraftHelicopterAttack) [93]. Although DBpedia and YAGO can be considered to be competitors, they actually complement each other: there are numerous owl:sameAs links between the two datasets and DBpedia entities also feature YAGO types.

2.2.3 Wikidata

Wikidata, operated by the Wikimedia Foundation (which also created Wikipedia), is the common source of data for all versions of Wikipedia.20 Contrary to DBpedia and YAGO, Wikidata does not automatically extract information from Wikipedia but, rather, uses mechanisms similar to those used by Wikipedia to allow its users to extend and edit its content [195]. Nevertheless, there are constraints that must be satisfied in order to publish information in Wikidata. Most importantly, a notability criterion states that each entity in Wikidata has to be either an entry in any of the Wikimedia projects, or has to be a notable entity, that is, "it can be described using serious and publicly available references".21 One of the particularities of Wikidata is that it is possible to add references supporting the facts it contains. Although Wikidata was born as a document-oriented database, it was recently released in RDF [73]; however, its data model is quite complicated as it is based on RDF reification. This motivated Wikimedia developers to also provide a simplified RDF export of the information contained in Wikidata that is easier to exploit.

2.2.4 Freebase

A knowledge graph somehow similar in spirit to Wikidata is Freebase [35, 36]. Freebase was created in 2007 by a company called Metaweb Technologies Inc., which sold the knowledge graph to Google in 2010. Since its inception, Freebase was meant to be a community edited collection of structured data, however, information was not only provided by online users in a Wikidata-like fashion, but also harvested from different datasets such as Wikipedia,

20https://wikidata.org 21https://www.wikidata.org/wiki/Wikidata:Notability accessed on February 12th, 2017.

part of MusicBrainz, and the Notable Names Database (NNDB).22 Freebase was shut down by Google on May 2nd 2016 and its data was "donated" to Wikidata, though only 9.5% of its entities have actually been included in Wikidata [175], partly because of the notability criteria mentioned previously. The last dump of Freebase is still available for download at https://developers.google.com/freebase/.

2.2.5 Linked Open Data

Linked Open Data (LOD) is an online movement whose origins can be traced back to a note published by Tim Berners-Lee in 2006.23 LOD suggests publishing data on the Web by following four principles:

1. use URIs to identify things;

2. use HTTP URIs such that things can be dereferenced online by using, for example, a Web browser;

3. provide useful, structured information about the things when they are dereferenced, for example by using some RDF serialization format;

4. add links to other related URIs in the exposed data to foster data aggregation and discovery.

Data published by following these principles can be seen as a gigantic knowledge graph that we call the Linked Open Data cloud, which is composed of many interlinked datasets, including the knowledge graphs we have presented so far and many more: in 2014 the LOD cloud was composed of 1,024 datasets.24 Andrejs Abele and John McCrae maintain an interesting diagram showing the relations among such datasets at http://lod-cloud.net.

2.2.6 Other Knowledge Graphs

Freebase was used by Dong et al. as a starting point to create Google's Knowledge Vault (KV), a knowledge graph built by extracting information from several Web resources [70]. Specifically, KV fuses information contained in Freebase with data extracted from text documents, DOM trees, HTML tables, and semantic annotations included in Web pages. Each extracted fact/triple is associated with its probability of being true, computed based on the number of sources providing the fact (e.g., the fact was extracted from the text of two webpages and from semantic annotations included in three other different pages). This approach is probably used

22 MusicBrainz, https://musicbrainz.org/, is a database of music metadata containing information about musical artists, albums, musical tracks, etc. NNDB, http://nndb.com/, is a database containing information about more than forty thousand people of note.
23 https://www.w3.org/DesignIssues/LinkedData.html
24 http://lod-cloud.net/state/state_2014/

to populate the Google Knowledge Graph.25 Similarly, the Never-Ending Language Learning project (NELL) [129] enriches an initial knowledge graph containing hundreds of categories and relations by continuously extracting information from webpages 24 hours per day.

Analogously, Rospocher et al. propose an automatic method to extract event-centric knowledge graphs from news articles [161]. In the knowledge graphs generated by using such a method, events are first-class citizens and are described by using DBpedia entities connected by special relations used to model events.

Another automatic approach for knowledge graph construction worth mentioning is DeepDive [142], a system that is able to build knowledge graphs starting from a corpus of unstructured textual content. The system exploits rules written by domain experts using a SQL-based declarative language to process the data and to extract the information that will constitute the knowledge graph. Similarly to KV, each fact is associated with a probability representing the confidence of the system in the fact.

2.3 Using Knowledge Graphs

Now that we know how to create or obtain a knowledge graph, we can discuss how to use it. Including a knowledge graph in an existing application, or building an application leveraging a knowledge graph, is not as easy as it seems: even the simple task of retrieving an entity without knowing its identifier can be very complicated. Based on this observation, in this section we first describe Entity Retrieval, Entity Linking, and Entity Summarization, three tasks which are essential to effectively exploiting knowledge graphs. Finally, at the end of this section we briefly present systems and research projects that leverage knowledge graphs to achieve their goals. After this section the reader will have an idea of the challenges that must be faced in order to use knowledge graphs, and of what knowledge graphs can be used for.

2.3.1 Entity Retrieval

As we saw previously in Section 2.1.2, entities are identified by URIs. Such URIs are sometimes human readable, such as those presented previously in Section 2.1; however, this might not be the case in general: for example, wikidata:Q42 identifies the entity "Douglas Adams" in Wikidata. If we do not know the URI of the entity we are looking for, we can still retrieve it by using structured query languages such as SPARQL, but this requires users to know such formalisms. To simplify the retrieval of entities in knowledge graphs, Pound et al. defined the task of Ad-hoc Object Retrieval (AOR), which takes as input an unstructured keyword query and a knowledge graph, and returns as output a ranked list of entity identifiers taken from the knowledge graph [152]. AOR systems are evaluated by computing well-known IR metrics exploiting relevance judgments made by humans (see Section 2.4 for more details).

25https://www.google.com/intl/es419/insidesearch/features/search/knowledge.html


Current methods for Ad-hoc Object Retrieval are based on standard IR techniques and rely on modeling each entity e as a textual document, called entity profile, whose content is composed of the literals linked to e by some datatype property. IR data structures such as inverted indexes are then applied to index entity profiles, and entities can then be retrieved by using keyword queries and IR ranking functions. For example, Pound et al. used a variant of the well-known TF-IDF ranking function [152], while Blanco et al. used structured entity profiles composed of one field per datatype property attached to the entity taken into consideration [33]. By using such structured profiles it is possible to assign different weights to different information attached to the entity, for example by using the BM25F ranking function. Different structured entity profiles were used by Neumayer et al., who show that in AOR language models give better results than TF-IDF based ranking functions [140]. Similarly, Zhiltsov et al. propose another structured representation of entity profiles that includes information taken from similar or related entities [209]. A recent method worth reporting was developed by Hasibi et al. and focuses on integrating information about entities contained in the input query into a term-based retrieval model [91]. Facebook engineers also tackled Ad-hoc Object Retrieval when developing methods to retrieve users from the Facebook social graph. Unicorn, the system they developed, is an entity search system designed to serve billions of queries per day by indexing a very large knowledge graph [59]. In Chapter 3 we extensively discuss AOR and propose new methods to tackle Ad-hoc Object Retrieval by combining both well-known IR data structures and structured search.
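To make the entity-profile idea concrete, the sketch below builds bag-of-words profiles from a handful of invented (entity, datatype property, literal) triples and ranks entities for a keyword query with a plain BM25 scoring function; it is a brute-force illustration of the principle, whereas the systems cited above rely on proper inverted indexes.

    import math
    from collections import Counter, defaultdict

    # Toy datatype-property triples: (entity URI, property, literal). Purely illustrative.
    triples = [
        ("ex:Barack_Obama", "rdfs:label", "Barack Obama"),
        ("ex:Barack_Obama", "dbo:abstract", "Barack Obama served as president of the United States"),
        ("ex:Michelle_Obama", "rdfs:label", "Michelle Obama"),
        ("ex:Michelle_Obama", "dbo:abstract", "Michelle Obama is a lawyer and former first lady"),
    ]

    # Build one bag-of-words entity profile per entity by aggregating its literals.
    profiles = defaultdict(list)
    for entity, _prop, literal in triples:
        profiles[entity].extend(literal.lower().split())

    def bm25_score(query, profile, profiles, k1=1.2, b=0.75):
        """Score one entity profile against a keyword query with plain BM25."""
        n_docs = len(profiles)
        avg_len = sum(len(p) for p in profiles.values()) / n_docs
        tf = Counter(profile)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for p in profiles.values() if term in p)
            if df == 0:
                continue
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(profile) / avg_len))
            score += idf * norm
        return score

    query = "president obama"
    ranked = sorted(profiles, key=lambda e: bm25_score(query, profiles[e], profiles), reverse=True)
    print(ranked)  # the entity whose profile best matches the query comes first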

Finally, for the sake of completeness, we want to mention two other tasks designed for retrieving entities from a knowledge graph, namely Related Entity Finding (REF) and Entity List Completion (ELC). Both tasks are about finding entities of a specified type related to a given main entity; however, while in REF a textual description of the relation between related entities and the input entity is provided, in ELC such a specification is substituted by a small number of known relevant related entities to be used as examples. Interested readers can find the complete definitions of the tasks as well as the description of several approaches in the Overview of the TREC Entity Track (editions 2010 and 2011) [18, 19].

2.3.2 Entity Linking

In many contexts, in order to efficiently exploit knowledge graphs, one has to somehow create a link between one's data and the knowledge graph. For example, a system designed to enrich textual content with additional information on the entities it contains first needs to link textual mentions of entities to entries in the knowledge graph. This task is called Entity Linking (EL) and has been tackled in different fields of Computer Science such as Databases, Natural Language Processing, and the Semantic Web.

The main difference between EL and AOR lies in the presence of textual context that can be leveraged to better rank entities: in the former task the context is usually a coherent piece of text in which an entity is mentioned (e.g., a news article or a paragraph of text), while in

the latter the input is merely a set of keywords which has to be resolved to an entity identifier. Of course, AOR techniques can be used for Entity Linking, for example by using entity mentions as keyword queries, but state-of-the-art techniques also make use of the textual context to produce better links.

The first step in entity linking consists in extracting the entity mentions from the input textual content. Several approaches developed within the NLP field provide high-quality Named Entity Recognition (NER) for people, locations, and organizations [21, 55] as well as for other entity types (e.g., scientific concepts [153]). State-of-the-art techniques for NER are implemented in tools like GATE [58], the Stanford parser [119], and Extractiv.26

Once entity mentions have been identified, they still need to be disambiguated and linked to semantically equivalent but syntactically different occurrences of the same real-world object (e.g., “Mr. Obama” and “former president of the USA”).

The final step in entity linking is that of deciding which links to retain in order to enrich the entity. Systems performing such a task are available as well; for example, both Open Calais and DBpedia Spotlight [123] provide Web APIs.27 Relevant techniques aim, for instance, at enriching documents by automatically creating links to Wikipedia pages [126, 170], which are seen as entity identifiers. Crowdsourcing has also been used to improve entity linking. ZenCrowd [64], for example, uses crowdsourcing to link entities which cannot be linked automatically or for which no candidate can be selected with high enough confidence.

The state of the art in EL is a very elegant method that exploits probabilistic graphical models to compute the probability of an entity being associated with a certain mention given a certain context [79].

2.3.3 Entity Summarization

In some knowledge graphs entities can have hundreds or thousands of properties. Imagine a scenario in which a small table describing an entity is shown to end-users in response to a keyword query. It would be simply unthinkable to fill such a table with all the information contained in the knowledge graph; moreover, relevant information about entities can be different depending on the context in which the entity is mentioned, or on the profile of the user issuing the query. For this reason, we often have to face the problem of selecting the few properties that best describe the entity we take into consideration. This task is called Entity Summarization (ES).

An interesting approach for entity summarization is included in Spark [30], a system that merges AOR and entity summarization and that was used by Yahoo! Search to make entity recommendations. Spark takes as input a keyword query and makes recommendations by exploiting direct connections between the target entity (that is, the entity the user is looking

26http://extractiv.com/ 27http://www.opencalais.com/, https://github.com/dbpedia-spotlight/dbpedia-spotlight

for) and other entities in the knowledge graph. This is equivalent to creating a summary of the target entity based on its incoming and outgoing object properties. Spark makes recommendations by extracting features encoding co-occurrences of entities and entity popularity from query logs and other types of text. Graph-theoretic features related to the structure of the knowledge graph are also taken into consideration.

More recent approaches do not use statistics computed on external corpora and focus only on the structure of the knowledge graph. In this context it is worth mentioning FACES, an approach that creates faceted entity summaries including both datatype properties and object properties [85]. Facets of an entity are meant to be orthogonal sets of properties capturing different aspects of the entity, and are produced by a hierarchical clustering algorithm that creates disjoint sets of property-value pairs. For example, the algorithm may distinguish between properties providing information about the academic career of a person and properties describing his or her hobbies. The pairs contained in each facet are then ranked based on TF-IDF and, finally, summaries are created by taking the top pairs of each facet.

LinkSUM [176] uses a different approach to summarize entities: instead of using facets composed of properties, it first ranks related resources by using a method based on PageRank, and then selects which properties connecting the target entity with the selected resources should be shown to users. Such properties are selected by using a mixture of statistics such as how frequently they are used in the knowledge graph and in the description of the entities taken into consideration.

In this thesis we tackle entity summarization from a different angle and focus on a particular aspect that has not been covered so far: ranking the types attached to an entity based on the textual context in which the entity is mentioned. As we will see in Chapter 4, entities contained in knowledge graphs such as YAGO are associated with many types, and deciding which type to show to the end user is often not trivial.

2.3.4 Systems Leveraging Knowledge Graphs

After having presented several challenges we have to face in order to exploit knowledge graphs, we want to briefly mention a few systems based on what we described so far.

We start with what is, in our opinion, the most used application of knowledge graphs: the Knowledge Graph panels integrated in Google Search.28 This is a clear example that makes use of methods for AOR and for entity summarization. Related to this, Google also developed Explore, a feature added to Google Docs that taps into Google's knowledge graph to make real-time suggestions of content that is relevant for documents being typed by users.29

Another well-known application that uses knowledge graphs is IBM Watson: the AI system that beat humans at Jeopardy!, a television game in which participants are given answers and

28https://www.google.com/intl/es419/insidesearch/features/search/knowledge.html 29https://research.googleblog.com/2016/11/research-suggestions-at-your-fingertips.html

have to suggest questions.30 Watson is a very complex system that uses several data sources including WordNet, DBpedia, and YAGO, which we described previously in Section 2.2.

In 2013 we were finalists at the Semantic Web Challenge with a system, BHist, that links entities extracted from webpages visited by users, and uses their types to better organize the users' browsing history [52]. The winner of the challenge was "The BBC World Service Archive Prototype", developed by the BBC to annotate its audio and video archives by exploiting textual metadata and audio transcription. The system extracts entity mentions from its data sources and links them to DBpedia; the entity data is then used to automatically categorize the documents [156]. The BBC also uses knowledge graphs to empower services such as BBC Sport, BBC Education, etc.31

Knowledge graphs are also used by robots to model and share their knowledge. In this context, RoboBrain [167] is a knowledge engine that integrates knowledge coming from different sources in a knowledge graph whose core includes WordNet, Freebase, and OpenCyc.32 As RoboBrain is a never-ending learning system, meaning that its knowledge graph is continuously extended and modified as new knowledge enters the system, several issues similar to EL can arise. For example, when new knowledge is gathered, the system has to decide if it is already contained in the knowledge graph or if it should be added; moreover, in the latter case, the presence of related entities that can be exploited to describe the new information should be checked.

Finally, as we will see in Chapter6, knowledge graphs can be used to detect events from microblogs such as Twitter. The system we describe, ArmaTweet, builds small knowledge graphs based on DBpedia and containing information extracted from tweets. Such knowledge graphs are then used to identify topics (e.g., “Mario Cuomo died”) and to associate them with time series of tweets which are analyzed to spot events.

2.4 Evaluating Entity Retrieval Systems

As we mentioned previously in the introduction, in this thesis we also explore techniques for evaluating entity retrieval systems. This choice was motivated by the outcome of preliminary experiments on methods for AOR, in which we realized that existing evaluation techniques used in academic research to evaluate approaches for tasks such as AOR, REF, and ELC sometimes do not treat systems fairly. Such techniques are borrowed from Information Retrieval and are based on reusable test collections, which are datasets consisting of a document collection, several tests representing possible interactions of the system with users, and a set of relevance judgments specifying if a certain document is relevant or irrelevant for a certain test [118]. For example, the corpus of a test collection for AOR is a knowledge graph (i.e., each of its entities is a document) and the tests are keyword queries expressing users' information needs.

30http://www.aaai.org/Magazine/Watson/watson.php 31http://www.bbc.co.uk/ontologies 32http://opencyc.org/


A relevance score is associated with each (query, entity) pair, indicating whether the entity is relevant for the information need expressed by the query or not. Normally relevance scores are either binary (i.e., 0 means irrelevant, 1 means relevant) or follow a graded scale, usually composed of four levels ranging from zero (the document is not at all relevant) to three (the document is a perfect match). Different approaches are then compared based on metrics that take into account such relevance scores. This methodology was developed in order to facilitate the evaluation of Information Retrieval Systems (IRSs) by acting as a proxy to actual user studies, in which real users are examined in a controlled environment, or to other techniques such as A/B testing, where a random small fraction of the users of a service use the variant of the system to be tested, and their behavior is studied.

2.4.1 Evaluation Metrics

Relevance scores, also called relevance judgments, can be leveraged to compute several metrics evaluating different aspects of information retrieval systems. In the following we define a few well-known metrics we use in this dissertation.33

Precision

Precision measures the fraction of the retrieved documents that are relevant, and is computed by dividing the number of relevant documents retrieved by the system by the total number of retrieved documents. A precision of 1 means that all the documents retrieved by the system are relevant to the information need, a precision of 0.5 means that only half of the retrieved documents are relevant, and a precision of 0 means that the system retrieved no relevant document. Formally, the precision of a ranked list L of documents retrieved for a query q can be defined as follows:

P(L,q) = \frac{\sum_{i=1}^{|L|} rel(d_i, q)}{|L|}

where rel(d_i, q) is 1 if the i-th returned document d_i is relevant with respect to the query q, and zero otherwise.

A commonly used variant of precision is Precision at i, often denoted by P@i. P@i is simply precision computed by only considering the first i retrieved documents of the given ranked list.
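A minimal sketch of the two definitions above, where a ranked list is represented by its binary relevance judgments (1 = relevant, 0 = irrelevant); the judgments are invented.

    def precision(rels):
        """Precision of a ranked list given as binary relevance judgments."""
        return sum(rels) / len(rels) if rels else 0.0

    def precision_at(rels, i):
        """P@i: precision computed over the first i retrieved documents only."""
        return precision(rels[:i])

    # rel(d_j, q) for a hypothetical ranked list of 5 retrieved documents.
    rels = [1, 0, 1, 1, 0]
    print(precision(rels))        # 3/5 = 0.6
    print(precision_at(rels, 3))  # 2/3, about 0.67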

33In the literature people refer to what we call “metrics” by using different names. Yilmaz and Aslam, for example, call them measures [204], other researchers use “metrics” (treated as singular or plural), while Moffat and Zobel prefer to separate “metric” (singular) and “metrics” (plural) [132]. In this thesis we choose to use this last option.


[Figure 2.2: two ranked lists, L1 and L2, with relevant documents marked by ticks and annotated with their contribution P@i; Rel(q) = 20, AP(L1) ≈ 0.17, AP(L2) ≈ 0.09.]

Figure 2.2 – How to compute average precision. Suppose that two IR systems produce the ranked lists of documents L1 and L2 as answers for a given query q. We marked each relevant document by a tick, and noted its contribution to AP, that is the precision at its rank (P@i), next to every such document. Irrelevant documents d have a contribution of 0 since rel(d, q) = 0. Finally, we computed AP supposing that the test collection contains 20 documents that are relevant for q.

Recall

Recall is another widely used metric in Information Retrieval. It measures the fraction of all the relevant documents contained in the test collection that are retrieved by an IRS. A recall of 1 means that all the relevant documents for a certain query were retrieved by the system, a recall of 0.5 means that half of the relevant documents were retrieved, and so on. Formally, the recall of a ranked list L of documents retrieved for a query q can be defined as follows:

R(L,q) = \frac{\sum_{i=1}^{|L|} rel(d_i, q)}{|Rel(q)|}

where rel is defined as for precision, and Rel(q) is the set containing all relevant documents for q contained in the test collection.

Average Precision

Both precision and recall do not take into consideration the order of the documents contained in the input ranked list. Average precision, usually denoted by AP, assigns a weight to each retrieved document, giving more importance to highly ranked documents. As a result, documents at the top of the ranked list contribute more to the value of the metric than documents at the end of the ranked list. Formally, the AP of a ranked list L of documents retrieved for a query q is defined as follows:

AP(L,q) = \frac{\sum_{i=1}^{|L|} rel(d_i, q) \cdot P@i(L,q)}{|Rel(q)|}

where rel and Rel are as for precision and recall, and P@i is precision at i. Figure 2.2 shows how to compute the AP of two different ranked lists of documents. As can be seen, the list

having most of its retrieved relevant documents at the top has a higher AP than the list having relevant documents ranked in lower positions.

Often IRSs are evaluated by using the average of the AP values the given system achieves over all queries. This derived metric is called Mean Average Precision (MAP).
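The sketch below computes AP as defined above, and MAP as its mean over a set of queries; the binary judgments are invented and |Rel(q)| is assumed to be 20 for both queries, in the spirit of Figure 2.2.

    def average_precision(rels, n_relevant):
        """AP of a ranked list of binary judgments, given |Rel(q)| for the query."""
        hits, total = 0, 0.0
        for i, rel in enumerate(rels, start=1):
            if rel:
                hits += 1
                total += hits / i  # rel(d_i, q) * P@i
        return total / n_relevant if n_relevant else 0.0

    def mean_average_precision(runs):
        """MAP over a list of (judgments, |Rel(q)|) pairs, one per query."""
        return sum(average_precision(r, n) for r, n in runs) / len(runs)

    # Two hypothetical queries, each with 20 relevant documents in the collection.
    l1 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    l2 = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
    print(average_precision(l1, 20))  # 0.15
    print(average_precision(l2, 20))  # about 0.09
    print(mean_average_precision([(l1, 20), (l2, 20)]))  # about 0.12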

Normalized Discounted Cumulative Gain

When using non-binary relevance scores, indicating that some documents are more relevant to the query than others, it is possible to compute other, more sophisticated, metrics such as the Normalized Discounted Cumulative Gain (NDCG) [97]. As the name suggests, NDCG is based on the Discounted Cumulative Gain (DCG), defined by Equation 2.2. DCG makes use of a gain vector G, containing the relevance judgments at each rank of L. As can be seen, DCG[k] measures the overall gain up to rank k, giving more weight to the top of the ranking.

DCG[k] = \sum_{j=1}^{k} \frac{G[j]}{\log_2(j + 1)}    (2.2)

Finally, we obtain NDCG by dividing DCG by its optimal value, computed on the optimal gain vector obtained by placing the most relevant results at the top of the ranked list. By doing this we obtain a number in [0,1]. As the reader can notice, the procedure for computing DCG is similar to the one shown in Figure 2.2.
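A short sketch of Equation 2.2 and of the normalization step described above; the graded judgments (0-3) are invented, and the ideal gain vector is obtained by sorting the observed gains in decreasing order, as in the text.

    import math

    def dcg(gains, k):
        """DCG[k] as in Equation 2.2: sum of G[j] / log2(j + 1) for j = 1..k."""
        return sum(g / math.log2(j + 1) for j, g in enumerate(gains[:k], start=1))

    def ndcg(gains, k):
        """NDCG[k]: DCG[k] divided by the DCG[k] of the ideally re-ordered gain vector."""
        ideal = dcg(sorted(gains, reverse=True), k)
        return dcg(gains, k) / ideal if ideal > 0 else 0.0

    # Hypothetical graded judgments (0-3) for a ranked list of six results.
    gains = [3, 2, 0, 1, 0, 2]
    print(ndcg(gains, 6))  # about 0.95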

In all the following chapters of this dissertation we use test collections and evaluation metrics to evaluate our methods. In particular, in Chapter 3, where we discuss methods for AOR, we give an example justifying our claim that this evaluation methodology sometimes treats systems unfairly, and in Chapter 7 we analyze this phenomenon and study a methodology to deal with it.


3 Retrieving Entities from a Knowledge Graph

In Chapter 2 we provided a brief introduction to several aspects related to Knowledge Graphs. Starting with this chapter we turn to a more practical description of what it is possible to do with a KG and what techniques we can use to best exploit it. In particular, in this chapter we focus on how to retrieve entities. For example, let us assume that we have a KG and that we want to retrieve one specific entity but that, unfortunately, we do not know its identifier. We only know that the entity we are looking for is the former CEO of Microsoft Inc. How do we retrieve the entity from the KG in order to show to a user all the information we have about it?

3.1 Introduction

As previously mentioned in Chapter 1, several of the major players in the technology market are today exploiting knowledge graphs to provide entity-centric functionalities. Common examples include the aggregation of all pieces of content related to a given entity, or the extraction of the most important entities appearing in a given article. Some companies, like the New York Times, manually maintain a KG and ask human experts to create links between their resources (e.g., news articles) and the corresponding entities in their knowledge base (e.g., celebrities appearing in the articles). Increasingly, however, websites are turning to automated methods due to the sheer size of the resources they have to analyze, and the large number of entities they have to consider.

Recently, the Linked Open Data (LOD) movement1, which we briefly introduced in Chapter 2, started an effort to make entity data coming from different datasets and knowledge graphs openly available on the Web. In this initiative, Uniform Resource Identifiers2 (URIs) are used to identify entities. Each entity can be looked up (dereferenced) online, where it is described and linked to further entities using the Resource Description Framework3 (RDF). The fundamental difference between LOD and standard entity datasets such as, for instance, Wikipedia, lies in

1http://linkeddata.org/ 2http://tools.ietf.org/html/rfc3986 3http://www.w3.org/RDF/

the inherent structure of the data. On the LOD cloud, the data is a mixture of interlinked KGs provided by uncorrelated parties, thus producing a giant graph of semi-structured data.

One of the problems one has to tackle in order to effectively exploit LOD is related to retrieving specific entities. This task is not trivial since different KGs can provide different representations of the same entity, possibly by using different URIs. Moreover, LOD users often do not have prior information on the URIs of the resources they want to retrieve and have little experience in using structured query languages like SPARQL.4 This motivated researchers to study other methods for retrieving entities from knowledge graphs. In this context, the TREC Entity track [19] has studied two related search tasks: “Related Entity Finding” (i.e., finding all entities related to a given entity query) and “Entity List Completion” (i.e., finding entities with common properties given some examples). The SemSearch challenge5 focused on Ad-hoc Object Retrieval (AOR), that is, finding the identifier of a specific entity described by a user query [152].

In this chapter, of which an earlier and condensed version was published at a major information retrieval conference [181], we focus on Ad-hoc Object Retrieval over semi-structured data. The task we tackle is very challenging since, in addition to the issues generated by having multiple representations of the same entity coming from different datasets, we have to deal with all the complexity and the challenges involved in understanding user keyword queries. We propose a novel search architecture that exploits both IR and graph data management techniques to effectively and efficiently answer AOR queries. Specifically, we propose a hybrid solution that starts by retrieving an initial list of results from an inverted index by leveraging a ranking function, and then re-ranks and extends such a list by exploiting evidence of relevance coming from the user query and from the underlying graph representation of the data. Our extended experimental evaluation, performed over standard collections, shows that our proposed solution significantly improves effectiveness (up to 12% improvement over a state-of-the-art baseline) while ensuring low query execution times. We consider our results as especially promising since the LOD test collections that are available today are noisy and incomplete, and since we expect the quality, the coverage, and the documentation of LOD datasets to rapidly improve in the future.6

In summary, the goal of this work is to obtain some insight into how the underlying graph structure of LOD can be leveraged to improve the effectiveness of AOR systems by integrating different IR techniques. In particular, we study:

• How standard IR techniques, including query expansion, pseudo-relevance feedback, query term dependencies, and well-known ranking functions, perform when applied to AOR tasks.

4https://www.w3.org/TR/rdf-sparql-query/ 5http://km.aifb.kit.edu/ws/semsearch10/ and http://km.aifb.kit.edu/ws/semsearch11/ 6 In Chapter5 we discuss possible methods to increase the quality of LOD datasets and to improve their documentation.


• How entity types and Named Entity Recognition techniques can be exploited to improve the precision of our system.

• How to best combine IR and graph data management techniques to exploit the graph structure of the underlying data for improving system effectiveness.

• How to combine different entity relevance signals based on learning-to-rank approaches to blend high-precision, low-recall ranking functions into an overall effective AOR system.

The rest of the chapter is structured as follows. In Section 3.2 we discuss related work, focusing on Entity Search and AOR approaches. Section 3.3 is devoted to a high-level description of the components of our AOR system, while the subsequent sections present novel approaches for AOR. In particular, in Section 3.4 we describe several approaches to analyze user queries based on NLP and Semantic Web techniques, in Section 3.5 we report on our inverted index structures, and in Section 3.6 complementary methods based on structured queries and graph data management techniques are discussed. All presented techniques are then combined by using the methods detailed in Section 3.7. The core of our experimental evaluation, in which we compare the performance of our hybrid approaches to a series of state-of-the-art AOR approaches on two standard test collections, can be found in Section 3.8. We then conclude this chapter by summarizing our findings in Section 3.9.

3.2 Related Work

In Section 2.3 we provided a brief overview of the efforts made to solve Ad-hoc Object Retrieval and we discussed how AOR is different from Entity Linking. In the following we provide more detail on previous approaches for AOR and on other related work.

3.2.1 Early Entity-Centric IR Evaluation Initiatives

Early entity-centric IR approaches focused on single-type entity search. This task is called expert finding and was widely studied; in particular, in 2008 the Text REtrieval Conference (TREC), the main IR evaluation initiative, featured this task in its Enterprise track. In this context several techniques based on standard IR approaches such as language modeling were evaluated [20, 11, 15].

Complex entity search tasks have been addressed more recently. For instance, the INEX Entity Ranking (XER) track studied the problem of finding entities matching a keyword query (e.g., "Countries where I can pay in Euro") using Wikipedia as entity corpus [66]. In addition, in 2010 and 2011 TREC featured an "Entity track" consisting of two entity-centric tasks: Related Entity Finding (REF) and Entity List Completion (ELC). REF, which we briefly introduced in Chapter 2 and which is the most related to AOR, aims at finding entities related to a given entity (e.g., "Airlines using the Boeing 747") using both a collection of webpages and a collection of RDF triples [10, 18]. Several effective approaches for this task exploit information on the types

of the entities in the corpus and on their co-occurrences [41, 100]. The Entity Recognition and Disambiguation Challenge [49], organized jointly by Yahoo! Labs, Microsoft Research, and Google, focuses on detecting entity mentions in short and long texts and linking them to Freebase. One innovative aspect of the challenge is that the response time also plays a role in the evaluation of the participating systems.

Finally, the task of ranking entities appearing in one document taking into account the time dimension has been studied by Demartini et al. [67].

3.2.2 Semantic Search

A large amount of online search queries are about entities [152], and modern search engines exploit entities and structured data to build their result pages [87]. Related to this, a number of semantic search systems have already been developed. Elbassuoni and Blanco, for example, propose ranking models for keyword queries over RDF data [72]. However, the task they tackle is slightly different from AOR since it requires selecting and ranking subgraphs matching the input query instead of just providing a ranked list of entity identifiers. Similarly, another system that handles queries over RDF data is CE2 [197], a hybrid IR-DBMS system that provides ranking schemes for keyword queries describing joins and links between entities. A related system that supports both keyword queries and structured queries over RDF data is SIREn [63]. SIREn focuses on scalability and efficiency through novel indexing schemes that can index data and answer user queries efficiently by using an inverted index; we, instead, propose methods to improve ranking effectiveness for AOR.

A lot of work has also been carried out on semantic parsing, that is, parsing natural language text into logical formulas. We note, in particular, a recent work by Berant et al. [26], which aims at parsing natural language phrases into logical forms having Freebase properties as logical predicates in order to build a question answering system. In this work, we focus exclusively on keyword queries describing specific entities, which are often shorter and less linguistically structured than natural language questions.

Finally, Minack et al. bring semantic search to desktop environment users by building Beagle++, a desktop search engine that exploits both an inverted index and a structured repository containing file metadata to provide more effective search functionalities [128]. While we also use a combination of inverted indexes and structured repositories, the task we tackle is different as we do not focus on file metadata but on LOD.

3.2.3 Entity Linking

Recall from Section 2.3.2 that Entity Linking (EL) is the task that consists in identifying entity mentions in textual documents and in creating a link from each mention to the unique identifier of the corresponding entity in a given KG. As we mentioned previously, EL is highly related to AOR since in both cases entity identifiers have to be retrieved starting from some piece of text; however, the two tasks

differ in the fact that in EL the textual context we can exploit to decide which entities to return is much larger than the entity description given as input in the case of AOR (a mere keyword query). In addition, while current EL systems such as DBpedia Spotlight [123] and the state-of-the-art method designed by Ganea et al. [79] focus on selecting Uniform Resource Identifiers (URIs) from a single Knowledge Graph (e.g., DBpedia, Freebase), the system we present assigns entity identifiers from a larger LOD crawl.

3.2.4 Ontology Alignment

The task we tackle requires us to retrieve entities from LOD which, as we saw in Section 2.2.5, is composed of many interlinked KGs. In this context, the Semantic Web community has put much effort in designing systems for ontology alignment, that is, systems that determine correspondences between concepts of different KGs. Examples are SLINT+ [141], LINDA [34], and RiMOM [111]. Our approach shares with those systems the use of both the text attached to an entity and the graph structure in which the entity is located. Nevertheless, the search task we tackle is different as it consists in understanding which entity the user is looking for given only keywords (in particular, not a structured representation of the entity to find).

A related effort has been carried out in the context of the OKKAM project7, which suggested the idea of an Entity Name System (ENS) to assign a unique identifier to each entity on the Web [38]. The ENS could leverage the techniques presented in this chapter to improve its matching effectiveness.

3.2.5 Ad-hoc Object Retrieval

In Section 2.3.1 we motivated the need for AOR and presented some existing techniques used to tackle the task. To summarize, the task definition was proposed by Pound et al. [152], and current methods are based on the notion of entity profile. Each such profile characterizes an entity and is composed of text taken from datatype properties attached to the entity it describes or to some other related entity. IR techniques can then be used to index such entity profiles and to retrieve entities. Moreover, various methods already proposed for different entity search tasks can be exploited for AOR as well, such as approaches exploiting probabilistic models [12], Natural Language Processing techniques [65], and relevance feedback [96]. In our work, we leverage such methods (e.g., query annotation and relevance feedback) and suggest further AOR approaches relying on recent IR techniques in Section 3.4.

More recently, entity search approaches were compared in a setting where users provide example relevant entities together with their query (Entity List Completion, cf. Section 2.3.1) [42]. While the proposed techniques are also evaluated on the SemSearch AOR collections, our techniques show more effective results for the AOR task without the need for example entities to be provided together with the keyword query.

7http://www.okkam.org


Many methods for AOR are evaluated by using the test collections that were developed in the context of different editions of the SemSearch Challenge [89], using methods similar to those described in Section 2.4. In particular, such test collections use a large crawl of LOD as corpus and feature 50 to 92 queries. Relevance judgments are obtained by using crowdsourcing and distinguish 4 different levels of relevance [31]. The contribution of this work consists of novel techniques and their combination, which benefit both from known IR ranking functions and from an analysis of the graph structure of the entities and their properties (e.g., types). We compare below our novel approach against state-of-the-art entity search techniques. To the best of our knowledge, our proposed AOR approach is one of the best performing approaches evaluated on the SemSearch test collections so far.

3.3 System Architecture

In the following we describe the architecture of our hybrid search system. Our approach leverages, on the one hand, inverted indexes that support full-text search and, on the other hand, a structured database maintaining a graph representation of the original data.

The ability to query an inverted index and to obtain a ranked list of results allows us to retrieve an initial set of entities that match the user query. At the same time, our structured repository allows us to use algorithms exploiting the graph structure to either retrieve new entities or to produce pieces of evidence suggesting that entities that were already retrieved are indeed relevant. Specifically, our system is composed of the following macro-components:

Query Processor (Section 3.4), which is responsible for parsing and extending user queries by using IR techniques, and for recognizing named entities and entity types.

Inverted Index Searcher (Section 3.5), which exploits one or more inverted indices and produces as output one or more lists of entities ranked by state-of-the-art IR ranking functions (e.g., BM25).

Graph Refiner (Section 3.6), which queries the structured repository in order to enrich an initial ranked list of entities and to provide evidence of relevance for each retrieved entity.8 Our graph refinement techniques take advantage of both the entity type mentions found in the query and of two types of graph queries: scope queries following datatype properties, and graph traversal queries following object properties.

Result Combiner (Section 3.7), which aggregates all the evidence of the relevance of each entity and produces the final ranked list of entities by using either linear combinations or a supervised machine learning model.

8 We note that the score or the rank assigned by the Inverted Index Searcher can be viewed as evidence of relevance: a high score suggests a stronger correlation of the entity to the query.


[Figure 3.1 depicts the pipeline: the Query Processor (query annotation, expansion, pseudo-relevance feedback, named entity and entity type recognition), the Inverted Index Searcher with its unstructured and structured inverted indexes, the Graph Refiner operating on an RDF store (neighborhoods over datatype properties, graph traversals over object properties, entity type matching), and the Result Combiner using linear combinations or machine learning models over the collected evidence of relevance.]

Figure 3.1 – Our hybrid AOR search system. First, the input query is processed and expanded. Then, results retrieved from inverted indices are extended and refined through structured graph queries. All the retrieved entities are then ranked by using a linear combination of their scores or a machine learning model leveraging different evidence of relevance.


Figure 3.1 illustrates the main components of our system and their interactions. The search process proceeds as follows. First, a user initiates an entity search through a keyword query. The query is parsed, annotated, and extended by the Query Processor and is then forwarded to the Inverted Index Searcher, which produces one or more lists of results. Those lists are then exploited by the Graph Refiner in order to retrieve more entities and compute additional evidence of relevance. Entities and evidence are then given as input to the Result Combiner that generates a ranked list of results. Finally, the top entities of such a list are returned to the user.

3.4 Query Processor

Here we describe the Query Processor: the component of our system devoted to parsing, annotating, and expanding the user query.

3.4.1 Query Expansion

Natural Language Processing techniques have been applied to other entity search tasks. In this chapter we exploit query expansion and relevance feedback techniques on top of a BM25 baseline for the AOR task.

33 Chapter 3. Retrieving Entities from a Knowledge Graph

Specifically, we extend a given disjunctive keyword query by adding related terms such as synonyms, hypernyms, and hyponyms from WordNet [127]. This makes the query (e.g., "New York") retrieve additional results including different ways of referring to the specified entity (e.g., "Big Apple").
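A minimal sketch of this WordNet-based expansion using NLTK's WordNet interface; the exact terms returned depend on the installed WordNet version, and a real system would also weight the added terms rather than treating them as equal to the original keywords.

    # Requires: pip install nltk; then nltk.download('wordnet') once.
    from nltk.corpus import wordnet as wn

    def expand_query(query, max_terms=10):
        """Disjunctively extend the query with synonyms, hypernyms, and hyponyms."""
        expansions = set()
        # WordNet stores multi-word entries with underscores (e.g., "new_york").
        candidates = [query.lower().replace(" ", "_")] + query.lower().split()
        for candidate in candidates:
            for synset in wn.synsets(candidate):
                for s in [synset] + synset.hypernyms() + synset.hyponyms():
                    expansions.update(l.replace("_", " ") for l in s.lemma_names())
        expansions.difference_update(query.lower().split() + [query.lower()])
        return query.lower().split() + sorted(expansions)[:max_terms]

    print(expand_query("new york"))  # original keywords plus related WordNet lemmas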

In addition, we exploit the query autocompletion features of commercial search engines to obtain additional keywords: we send the original user query to a commercial search engine (Google) and we expand it by appending the first 5 terms its autocompletion algorithm suggests (e.g., "New York Times").

3.4.2 Pseudo-Relevance Feedback

We implement pseudo-relevance feedback techniques on top of the baseline approach by first running the original query q and then considering the labels of the top-3 retrieved entities to expand the user query. For instance, the query “q=NY” can be expanded to “New York”, “New York City” and “North Yorkshire” (labels of the top-3 entities retrieved for the original query q).
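A sketch of this feedback loop, assuming a hypothetical baseline_search(query, k) function that returns ranked entity URIs and a labels dictionary mapping URIs to their labels; both are stand-ins for the actual index lookups.

    def pseudo_relevance_feedback(query, baseline_search, labels, k=3):
        """Expand the query with the labels of the top-k entities retrieved for it."""
        feedback_terms = []
        for uri in baseline_search(query, k):
            feedback_terms.extend(labels.get(uri, "").lower().split())
        # The expanded query keeps the original keywords and adds the feedback terms.
        return query.lower().split() + feedback_terms

    # Hypothetical usage mirroring the "NY" example above.
    labels = {"ex:NY1": "New York", "ex:NY2": "New York City", "ex:NY3": "North Yorkshire"}
    fake_search = lambda q, k: ["ex:NY1", "ex:NY2", "ex:NY3"][:k]
    print(pseudo_relevance_feedback("NY", fake_search, labels))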

We report evaluation results comparing those various techniques on standard AOR collections in Section 3.8.

3.4.3 Named Entity Recognition

We use a Conditional Random Field Named Entity Recognizer9 based on the work by Finkel et al. [77] in order to find entity mentions of people, locations, and organizations inside the query. This method is very well known and widely used; however, it is mainly designed to annotate long text documents. In order to understand how the tool applies to queries, we report in Section 3.8.5 its precision and recall on the set of queries we use. The other components of our system can then exploit the information generated at this stage in order to take decisions on which entities to include in the result set. For example, if we do NER on the query "charles darwin", the named entity recognizer identifies that the whole query is about a specific person, thus the other components can decide to filter out all results regarding, for example, Erasmus Darwin (see Sections 3.5.3 and 3.7 for more details).

3.4.4 Entity Type Recognition

Users often specify in their queries the type of the entities they are trying to retrieve; for example, in the query "james clayton MD" the string "MD" suggests that the user is looking for a medical doctor. This couples well with the fact that entities in the LOD cloud often have one or more entity types connected to them via the rdf:type property.

In order to exploit the entity type hints given by the user, we need to extract them from the

9http://nlp.stanford.edu/software/CRF-NER.shtml

query. In order to do that, we collect all the types of an entity from the data graph by using the following SPARQL query, and we compute their labels (i.e., human readable representations of entity types) by tokenizing the retrieved URIs and by doing some basic processing in order to remove their domain name, their hierarchical part and, possibly, the query string they contain.10

SELECT DISTINCT ?type WHERE { ?e <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type }

The resulting strings are then normalized by separating CamelCased words, downcasing, removing all English stop-words and all non-alphabetical characters, and by stemming with the Porter stemming algorithm [151]. At the end of this process, we have a list of strings we can use to recognize entity type mentions in the user queries.

In order to match a substring of the user query to one of the type labels we computed, we first normalize the query as described previously. Then, we find the longest n-grams of the query that exactly match some type label (e.g., if the query only contains the entity type "actor", the label "american actor" is not a match). Notice that the final output of this query processing phase can be composed of more than one entry, since different parts of the query can match different entity type labels. Moreover, more than one type can be retrieved if the matched n-gram is the label of more than one type. We use an inverted index containing the normalized entity type labels in order to efficiently perform type recognition.
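The following sketch illustrates the normalization and the greedy longest-n-gram matching described above; the stop-word list is deliberately tiny, the type labels are invented, and a real implementation would query the inverted index of labels instead of scanning a dictionary.

    import re
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    STOPWORDS = {"the", "of", "a", "an", "in"}  # tiny stand-in for a full English stop-word list

    def normalize(text):
        """Split CamelCase, downcase, drop stop-words and non-letters, and Porter-stem."""
        text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)       # CamelCase -> Camel Case
        tokens = re.findall(r"[a-zA-Z]+", text.lower())
        return [stemmer.stem(t) for t in tokens if t not in STOPWORDS]

    def match_type_labels(query, type_labels):
        """Greedy left-to-right search for the longest query n-grams matching a type label."""
        q = normalize(query)
        matches, i = [], 0
        while i < len(q):
            for n in range(len(q) - i, 0, -1):                  # try the longest n-gram first
                ngram = " ".join(q[i:i + n])
                hits = [t for t, label in type_labels.items() if label == ngram]
                if hits:
                    matches.append((ngram, hits))
                    i += n
                    break
            else:
                i += 1
        return matches

    # Hypothetical type labels, normalized with the same function as the query.
    type_labels = {
        "dbo:Actor": " ".join(normalize("Actor")),
        "yago:AmericanActor": " ".join(normalize("AmericanActor")),
    }
    print(match_type_labels("famous american actor", type_labels))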

3.5 Inverted Index Searcher

Here we describe the Inverted Index Searcher component shown at the top of Figure 3.1. As can be seen from the diagram, the Inverted Index Searcher is composed of two inverted indexes. Information on how such indexes are built is provided in the following.

3.5.1 Unstructured Inverted Index

The first approach to index graph-structured datasets such as those found in the LOD cloud is to aggregate all information attached to the entities. Thus, we create entity profiles by building an inverted index on the entity labels. Similarly to what was done by Balog et al. in the context of candidate-centric expert finding [11], each document we index is an entity profile containing all the labels directly attached to the entity in the graph. The resulting index structure, which considers entity profiles as bags of words and is combined with a BM25 ranking function [160], will be one of the baseline AOR approaches we compare against (cf. Section 3.8.4).

10 We noticed that the labels computed with this method had fewer irregularities than those retrieved by using the rdfs:label property.


3.5.2 Structured Inverted Index

Similarly to what was done by Blanco et al. [33], we create a structured inverted index based on structured entity profiles built by separating, for each entity, the following pieces of information:

URI. We tokenize the URI identifying the entity in the LOD cloud. As the URI often contains the entity name (e.g., http://dbpedia.org/resource/Barack_Obama), this field often matches the user query well.

Labels. We consider a list of manually selected datatype properties which frequently occur in various LOD datasets and which point to a label or textual description of an entity.11

Attributes. Finally, we consider all the other datatype properties (i.e., non-label attributes) attached to the entity.

We then build a structured index composed of the three mentioned fields and obtain a ranked list of results by using ranking functions such as BM25F.

3.5.3 NER in Queries to Improve Inverted Indexes Effectiveness

In order to improve the effectiveness of the baseline, we exploit all the annotations provided by the Query Processor (Section 3.4). In particular, we use NER to detect entity mentions and to issue focused queries containing them. More precisely, whenever the query is the mention of some entity (that is, the named entity recognizer recognizes the whole query as one entity mention), we use the Labels index in order to extract the nodes representing the mentioned entity. To increase precision we issue a phrase query to the index, that is, we search for entities whose label contains the exact sequence of words in the query (for example, the label "university of north dakota" matches the phrase query "north dakota" but does not match the phrase query "north dakota university"). Retrieved entities are then ranked by using the BM25 ranking function. An extensive study of this heuristic, of how NER applies to queries, and of the other methods based on query processing is presented in Section 3.8.5.
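A toy illustration of the phrase-matching condition used by this heuristic; in the actual system the phrase query is issued against the Labels field of the inverted index rather than checked label by label.

    def phrase_match(label, query):
        """True if the query words appear in the label as a contiguous sequence."""
        l, q = label.lower().split(), query.lower().split()
        return any(l[i:i + len(q)] == q for i in range(len(l) - len(q) + 1))

    print(phrase_match("university of north dakota", "north dakota"))             # True
    print(phrase_match("university of north dakota", "north dakota university"))  # False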

3.6 Graph-Based Result Refinement

As previously mentioned, the Graph Refiner component takes ranked lists of entities (i.e., nodes in the data graph) coming from the Inverted Index Searcher and uses them in order to find new results and to produce new evidence of relevance.

11The properties we use are: akt:family-name, akt:full-name, akt:given-name, akt:has-pretty-name, akt:has-title, cycann:label, dbp:label, dbp:name, dbp:title, dc:title, foaf:name, foaf:nickname, geonames:name, nie:title, rdfs:label, rss:title, sioc:name, skos:prefLabel, vcard:given-name


3.6.1 Entity Retrieval by Entity Type

In this step we use entity type information to retrieve new entities related to the user query. We do this by exploiting the entity type annotations computed by the Query Processor in order to select a starting set of entities that we then filter and rank by comparing their textual description in the graph with the content of the user query. For example, in the query "pierce county washington" the word "county" is recognized as an entity type mention, so all the counties in the knowledge base are retrieved. They are then scored by counting how many times "pierce" and "washington" occur in their literals; since the same entity may be retrieved multiple times by using different n-grams, we aggregate by summing up its scores. Finally, we only keep the results with a score greater than zero. By doing this we expect to retrieve only a few relevant entities, thus achieving high precision. Notice that we use the score we computed both to filter entities and as evidence of relevance. Moreover, this technique is different from the approach by Dalton and Huston [61] since their system infers the most suitable entity types for a given entity query by using a language model, while we detect actual mentions of entity types in the query. Section 3.8.6 reports the evaluation results of the described methodology.
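A sketch of the scoring step just described: given the candidate entities retrieved for a recognized type, we count the occurrences of the remaining query terms in their literals, sum the counts over duplicate retrievals, and keep only positive scores; the candidate data is invented.

    from collections import Counter

    def score_candidates(remaining_terms, candidates):
        """Sum, per entity, the occurrences of the leftover query terms in its literals."""
        scores = Counter()
        for entity, literals in candidates:          # the same entity may appear several times
            text = " ".join(literals).lower().split()
            scores[entity] += sum(text.count(t) for t in remaining_terms)
        return {e: s for e, s in scores.items() if s > 0}   # keep only positive scores

    # Hypothetical candidates of type "county" for the query "pierce county washington".
    candidates = [
        ("ex:Pierce_County_WA", ["Pierce County", "county in the state of Washington"]),
        ("ex:King_County_WA",   ["King County", "county in the state of Washington"]),
        ("ex:Dane_County_WI",   ["Dane County", "county in Wisconsin"]),
    ]
    print(score_candidates(["pierce", "washington"], candidates))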

3.6.2 Entity Graph Traversals

All the techniques described above neglect two key characteristics of LOD data: i) LOD entities are interlinked, in the sense that related entities are often connected, and ii) data values are clustered in the graph, such that many relevant values can be retrieved by following series of links iteratively from a given entity. In this section, we describe how we can leverage structured graph queries to improve the performance of our system taking advantage of those two characteristics.

Formally, let Retr = [e1, e2, ..., en] be the input ranked list of entities. Our goal is to define a set of functions that use the top-N elements of Retr to create an improved set of results StructRetr = {e′1, e′2, ..., e′m}. This new set of results may contain entities from Retr as well as new entities discovered when traversing the LOD graph. To obtain StructRetr, we exploit top LOD properties that are likely to lead to relevant results.12 We finally use the score(·,·,·) ranking function defined in Section 3.6.3 and the properties used to retrieve the new entities as evidence of relevance.

We describe next how we take advantage of SPARQL queries to retrieve additional results by following object property links in the LOD graph.

Structured Queries at Scope One Our simplest graph traversal approach looks for candidate entities directly attached to entities that have already been identified as relevant (i.e., entities in Retr).

12It is possible to obtain such a list of properties by means of a training test collection. In this work, we perform cross-validation across the two different AOR test collections in order to identify the most promising properties.


Table 3.1 – Top object properties in the SemSearch collections by estimated likelihood (Recall) of connecting retrieved and relevant entities.

SemSearch 2010
From retrieved (x) to relevant entities (y)        From relevant (y) to retrieved entities (x)
Object Property      Recall   Precision            Object Property      Recall   Precision
dbp:wikilink         82.10%   0.81%                dbp:wikilink         67.73%   0.52%
skos:subject         11.75%   0.7%                 dbp:redirect          7.47%   0.44%
owl:sameAs            1.60%   1.54%                owl:sameAs            2.93%   0.76%
dbo:artist            0.98%   15.42%               dbp:disambiguates     1.60%   3.85%
dbp:disambiguates     0.68%   1.98%                skos:subject          1.33%   1.87%
dbp:title             0.55%   1.81%                foaf:homepage         1.33%   2.95%
dbo:producer          0.43%   2.87%                dbo:artist            0.80%   1.97%
dbp:region            0.43%   8.37%
dbp:first             0.37%   7.32%
dbp:redirect          0.25%   3.91%

SemSearch 2011
From retrieved (x) to relevant entities (y)        From relevant (y) to retrieved entities (x)
Object Property      Recall   Precision            Object Property      Recall   Precision
dbp:wikilink         89.50%   0.44%                dbp:wikilink         37.50%   0.41%
skos:subject          4.42%   0.22%                skos:subject         37.50%   0.32%
owl:sameAs            1.66%   1.66%                owl:sameAs           10.71%   0.7%
dbp:disambiguates     1.10%   0.89%                skos:broader          7.14%   0.93%
skos:broader          1.10%   0.67%                foaf:homepage         1.79%   2.08%
swatr:piOrg           1.10%   67.44%               sioc:links_to         1.79%   0.04%
foaf:page             0.55%   0.13%                dbp:redirect          1.79%   0.24%

First, we need to ensure that only meaningful edges are followed. The object property dbp:wikilink, for instance, represents links between entities in Wikipedia. As an example, the entity dbr:Stevie_Wonder has a dbp:wikilink pointing to dbr:Jimi_Hendrix. Since the AOR task aims at finding all the identifiers of one specific entity, this kind of link is not very promising. On the other hand, the owl:sameAs object property is used to connect identifiers that refer to the same real-world entity. For instance, the entity dbr:Barack_Obama links via owl:sameAs to the corresponding Freebase entity freebase:Barack_Obama. This object property is hence probably more valuable for improving the effectiveness of AOR in a LOD context.

To obtain a list of object properties worth following, we rank all properties in the dataset by their observed likelihood of leading to relevant results.13 Table 3.1 gives the top object property scores for the SemSearch collections. Using such a ranked list of properties as a reference, we define a series of result extension mechanisms exploiting structured queries over the data graph to identify new results. At this stage, we focus on recall rather than precision in order to preserve all candidate entities. Then, we rank candidate entities by means of a scoring function.

13We compute the property scores by counting the number of times a property represents a path from a retrieved to a relevant entity (or the other way around) and dividing the result by the total number of such paths.
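A minimal sketch of our reading of this scoring (footnote 13) follows; `edges(e)` is an assumed helper returning the (property, target) pairs of an entity's outgoing links, and the precision column is our interpretation as the fraction of paths via a property that end in a relevant entity.

from collections import Counter

def rank_properties(retrieved, relevant, edges):
    hits = Counter()      # property -> paths from a retrieved to a relevant entity
    totals = Counter()    # property -> all paths starting from a retrieved entity
    total_hits = 0
    for e in retrieved:
        for prop, target in edges(e):
            totals[prop] += 1
            if target in relevant:
                hits[prop] += 1
                total_hits += 1
    scored = []
    for prop in totals:
        recall = hits[prop] / total_hits if total_hits else 0.0   # share of all useful paths
        precision = hits[prop] / totals[prop]                     # useful paths via this property
        scored.append((prop, recall, precision))
    return sorted(scored, key=lambda x: x[1], reverse=True)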


The first and simplest approach (SAMEAS) we consider is to follow exclusively the owl:sameAs links. Specifically, for each entity e in the top-N retrieved results, we issue the following query:

SELECT ?x WHERE {
  { <e> owl:sameAs ?x }
  UNION
  { ?x owl:sameAs <e> }
}

which returns all entities that are directly linked to e (denoted <e> in the query) via an owl:sameAs object property.

Our second approach is based on the property scores defined above and exploits information from DBpedia. As DBpedia originates from Wikipedia, it contains disambiguation and redirect links to other entities. For example, dbr:Jaguar links via dbp:disambiguates to dbr:Jaguar_Cars as well as to dbr:Jaguar_(computer) and others. This approach (S1_1) also follows such links to reach additional relevant entities. Thus, we define the following structured query:

SELECT ?x WHERE {
  { <e> owl:sameAs ?x }
  UNION
  { ?x owl:sameAs <e> }
  UNION
  { <e> dbp:redirect ?x }
  UNION
  { <e> dbp:disambiguates ?x }
}

which extends SAMEAS queries with two additional triple patterns following redirect and disambiguation links.

Our third scope one approach (S1_2) includes properties that are more specific to the user queries. In addition to the generic properties of the previous two approaches, we add to the list of links to follow object properties like dbo:artist, skos:subject, dbp:title, and foaf:homepage, which appear in Table 3.1 and which lead to more general, albeit still related, entities.

The final scope one approach we propose (S1_3) extends S1_2 by adding matching patterns using the skos:broader property that links to more general entities. In addition to those four approaches, we additionally evaluated more complex query patterns based on the prop- erty scores defined above but did not obtain significant improvements over our simpler approaches.

Structured Queries at Scope Two An obvious extension of the above approach is to look for related entities further in the graph by following object property links iteratively. In the following, we describe a few scope two approaches following pairs of links.

The number of potential paths to follow increases exponentially with the scope of the queries.


To reduce the search space, we rank object property pairs by their likelihoods of leading to relevant entities as done for scope one queries.

Based on this, the first strategy (S2_1) to retrieve potentially interesting entities at scope two is based on structured join queries containing top-two property pairs. The query issued to the structured repository looks as follows:

SELECT ?y WHERE {
  { <e> <p1> ?x . ?x <p2> ?y }
  UNION
  { <e> <p3> ?w . ?w <p4> ?y }
  UNION [...]
}

where <e> is the starting entity and <p1>, ..., <p4> stand for the top-ranked property pairs.

The second approach (S2_2) uses a selection of the top-two property pairs considering the starting entity e both as a subject and as an object of the join queries. Finally, the third scope two approach (S2_3) applies a join query with the most frequent property pairs that do not include a dbp:wikilink property. The assumption is that due to the high frequency of this type of link too many entities are retrieved, which produces a noisy result set.14 On the other hand, by focusing on non-frequent properties we aim at reaching a few high-quality entities.

We note that it would be straightforward to generalize our graph traversal approach by considering scopes greater than two, or by considering transitive closures of object properties. Such approaches, however, impose a higher overhead and, in our experience, only marginally improve on scope one and scope two techniques.

3.6.3 Neighborhood Queries and Scoring

Once a new set of entities (StructRetr) has been reached, we need to 1) rank them by means of a scoring function and 2) merge the original ranking Retr with StructRetr. Given an entity e′ ∈ StructRetr and the entity e ∈ Retr from which e′ originated, we compute the score of e′ by defining a function score(q,e,e′) that is used to rank all entities in StructRetr.

Our scoring function exploits a text similarity metric applied to the query and the literals directly or indirectly attached to the entity by means of datatype properties. As noted above, related literals are implicitly clustered around entities in the LOD cloud. While many literals are directly attached to their entities (e.g., age, label), some are attached indirectly, either through RDF blank nodes (e.g., name → firstname, address → zip code), or are attached to related entities.

14As described at http://km.aifb.kit.edu/projects/btc-2009/, dbp:wikilink is the most frequent property in the test collection.


We use neighborhood queries similar to the one shown below to retrieve all literals attached to a given entity through datatype properties, either at scope one or at scope two. To minimize the overhead of such queries, we only retrieve the most promising literals. Table 3.2 lists the datatype properties whose values are most similar to the user queries. We adopt here the Jaro-Winkler (JW ) similarity metric, which fits the problem of matching entity names well [57].

SELECT ?S WHERE {
  { <e> <dp> ?S }
  UNION
  { <e> ?x ?y . ?y <dp> ?S }
}

where <e> stands for the considered entity and <dp> for a datatype property.

Additionally, we adopt a modified version of this scoring function by applying a threshold τ on the value of JW(e′,q) to filter entities that do not match well. Thus, only entities e′ for which JW(e′,q) > τ are included in the result set. Also, we check the number of outgoing links of each entity in StructRetr and only keep those entities having at least one outgoing link. The assumption (which was also made by the creators of the AOR evaluation collections) is that entities with no literals attached and that are not the subject of any statement are not relevant.
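A minimal sketch of this scoring and filtering step follows, under assumed helpers: `jaro_winkler`, `literals_of`, and `out_degree` stand in for the similarity metric and the repository access, and the value of TAU is only an example (Section 3.8.7 reports 0.8–0.9 as effective).

TAU = 0.8  # example threshold

def score_candidates(query, struct_retr, jaro_winkler, literals_of, out_degree):
    """struct_retr: iterable of (e_prime, e, bm25_of_e) tuples produced by the traversals."""
    results = []
    for e_prime, e, _ in struct_retr:
        # Entities that are not the subject of any statement are assumed not relevant.
        if out_degree(e_prime) == 0:
            continue
        # Text similarity between the query and the candidate's best-matching literal.
        jw = max((jaro_winkler(lit, query) for lit in literals_of(e_prime)), default=0.0)
        if jw > TAU:
            results.append((e_prime, jw))
    return sorted(results, key=lambda x: x[1], reverse=True)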

The final results are then constructed by linearly combining the Jaro-Winkler scores with the original entity score as noted above. The following section gives more information about this combination and compares the performance of several ranking methods.

Table 3.2 – Top datatype properties with 10+ occurrences ranked by JW(e′,q) text similarity on the 2010 collection.

Datatype Property           JW(e′,q)   Occurrences
wn20schema:lexicalForm      0.8449     19
dbp:county                  0.8005     17
daml:name                   0.7674     27
geonames:name               0.7444     78
akt:full-name               0.7360     55
dbp:wikiquoteProperty       0.7096     10
skos:prefLabel              0.6911     158
dc:title                    0.6711     236
opencyc:prettyString        0.6680     48
dbp:officialName            0.6623     54

3.7 Results Combiner

As we anticipated in Section 3.3, the Results Combiner is responsible for merging the evidence of relevance coming from the other components.

Our system features two methods for aggregating results. The first one is a simple model that linearly combines two types of numeric evidence:

finalScore(e) = λ · s(q,e) + (1 − λ) · s′(q,e)    (3.1)

where q is the user query, e is a retrieved entity, s(q,e) is the first type of evidence for e, and s′(q,e) is the second. If no evidence of relevance (neither positive nor negative) comes from one of the two lists we combine, we consider the entity as neutral (w.r.t. that method) and assign it a score of 0. Section 3.8 presents experimental results on how to select values for λ.
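A minimal sketch of Equation 3.1 is given below; entities missing from one list get the neutral score of 0 for that list, as described above.

def combine(scores_a, scores_b, lam):
    """scores_a, scores_b: dicts entity -> score; lam weights the first type of evidence."""
    entities = set(scores_a) | set(scores_b)
    final = {e: lam * scores_a.get(e, 0.0) + (1 - lam) * scores_b.get(e, 0.0)
             for e in entities}
    return sorted(final.items(), key=lambda x: x[1], reverse=True)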

In order to combine the evidence of relevance coming from all the other components, we train a decision list model based on M5 [155] to predict the relevance of each entity w.r.t. the user query. The features we use are:

1. bm25baseline, the BM25 score given by the unstructured inverted index;

2. bm25uri, the BM25 score given by the structured inverted index on the field URI;

3. bm25label, the BM25 score given by the structured inverted index on the field Label;

4. nerFullQuery, the BM25 score given by using the NER heuristic described in Section 3.5.3;

5. typeRetrievalScore, the score given to the entity by using the approach described in Section 3.6.1;

6. isSameAs, whether or not the Graph Refiner obtained this entity by using a owl:sameAs link;

7. isRedirects, whether or not the Graph Refiner obtained this entity by using a dbp:redirect link;

8. isDisambiguates, whether or not the Graph Refiner obtained this entity by using a dbp:disambiguates link;

9. nTypesMatched, the number of entity type URIs matched in the user query;

10. maxTypeNGramLen, the length of the longest entity type n-gram matched in the user query;

11. nNgramsMatched, the number of longest entity type n-grams matched;

12. isFullQueryEntity, whether or not the entire user query is an entity mention;

13. nNEREntities, the number of entities the NER found in the user query;

14. containsEntities, whether or not the user query contains any entity mention.


We choose to merge entity-specific and query-specific features (resp., 1–8 and 9–14) since we believe that the relevance of an entity can only be decided with respect to the specific user query it answers.
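For illustration only, a feature vector for one (query, entity) pair could look as follows; the feature names come from the list above, while the values are made up.

features = {
    # entity-specific evidence (features 1-8)
    "bm25baseline": 7.3, "bm25uri": 2.1, "bm25label": 4.8, "nerFullQuery": 5.2,
    "typeRetrievalScore": 0.0, "isSameAs": 1, "isRedirects": 0, "isDisambiguates": 0,
    # query-specific evidence (features 9-14)
    "nTypesMatched": 1, "maxTypeNGramLen": 1, "nNgramsMatched": 1,
    "isFullQueryEntity": 1, "nNEREntities": 1, "containsEntities": 1,
}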

3.8 Experimental Evaluation

We present below the results of a performance evaluation of our approaches for AOR. Our aim is to evaluate the overall effectiveness and efficiency of the proposed methods as well as the impact of the various parameters involved. Particular emphasis is also placed on obtaining relevance judgments by means of crowdsourcing and on its economic aspects.

3.8.1 Experimental Setting

In order to evaluate and compare the proposed approaches, we adopt standard collections for the AOR task. We use the testsets created in the context of the 2010 and 2011 editions of the SemSearch challenge.15 These collections are tailored to the AOR task and they were already used in the literature for the same task (see Section 3.2). This choice allows us to compare with previous work that was evaluated on the same collections.

Corpus The two test collections contain 92 and 50 queries respectively, together with relevance judgments on a 3-level scale obtained by using paid crowdsourcing. The underlying dataset used in both testsets is the Billion Triple Challenge 2009 dataset (BTC09), which consists of 1.3 billion RDF triples crawled from different domains of the LOD cloud. Using a crawl of LOD is different from using a smaller and more homogeneous data collection (e.g., DBpedia, Freebase, or WikiData), mostly in terms of noisiness in the data. We observed that BTC09 contains literals in many languages as well as literals coming from user comments and posts containing spelling errors, slang, many non-standard onomatopoeia, and all other characteristics of informal text written by non-professional writers. According to the official dataset statistics, DBpedia (see Chapter 2) is the most popular dataset in our collection; moreover, many of the top datasets in BTC09 cover data from social networks (e.g., sioc-project.org) and blogs (e.g., livejournal.com), thus explaining the low-quality text that our system needs to deal with.16 This can influence the performance of ad-hoc retrieval methods based on term frequency or language models. Nevertheless, limiting ourselves to using one specific knowledge base (e.g., Freebase) would result in a reduced amount of retrievable results since we would not consider entities covered in more detail by other specialized LOD datasets.

Other related datasets we could have used include the Freebase-annotated ClueWeb corpus, and the schema.org annotations included in ClueWeb and CommonCrawl pages.17 However,

15http://km.aifb.kit.edu/ws/semsearch10/ for 2010 and http://km.aifb.kit.edu/ws/semsearch11/ for 2011. 16http://km.aifb.kit.edu/projects/btc-2009/ 17http://lemurproject.org/clueweb09/FACC1/, http://lemurproject.org/clueweb12/ and http://commoncrawl.org, respectively.

differently from datasets based on LOD, these corpora contain webpages annotated with entities, thus making such collections a good choice for evaluating entity linking systems as done, for example, in the context of the ERD Challenge (see Section 3.2). We believe that, given that the content of webpages is often discursive, queries generated from text extracted from webpages can be more verbose and structurally different from actual keyword queries issued by users, thus not reflecting the true nature of the AOR task.

Metrics We adopt the official evaluation metrics from the SemSearch initiative and from previous work: Mean Average Precision (MAP), Normalized Discounted Cumulative Gain (NDCG), and early Precision (P10). Statistical significance is measured against the baselines described in Section 3.8.4 by means of a two-tailed paired t-test, considering a difference significant when p < 0.05. Interested readers can find additional information on the metrics mentioned above in Chapter 2.
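As a minimal sketch of how the per-query scores underlying these metrics can be computed and compared, the snippet below shows Average Precision (binarizing the 3-level judgments, which is our assumption, as is standard for MAP) and the paired t-test.

from scipy.stats import ttest_rel

def average_precision(ranked, relevant):
    hits, score = 0, 0.0
    for i, entity in enumerate(ranked, start=1):
        if entity in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant) if relevant else 0.0

def significant(ap_system, ap_baseline, alpha=0.05):
    """Two-tailed paired t-test over per-query Average Precision values."""
    _, p = ttest_rel(ap_system, ap_baseline)
    return p < alpha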

3.8.2 Continuous Evaluation of AOR Systems Based on Crowdsourcing

To fairly compare different approaches for AOR we embrace the continuous evaluation methodology described in detail in Chapter 7 and summarized in the following.

As the reader might know, traditional effectiveness evaluation methods based on the Cranfield paradigm have been largely used in TREC initiatives. In this paradigm, document collections, queries, and relevance judgments form test collections, which are typically made available to foster repeatable and comparable experiments. However, because of the increasing size of the corpora and of the consequent impossibility of manually judging all documents, a pooling methodology is typically used: top retrieved results from each run submitted to the evaluation initiative are judged; non-judged results are assumed not to be relevant. This approach was shown to be fair for comparing systems that participated in the original evaluation initiative but unfair with respect to any following system evaluated on the same collection “because the non-contributors will have highly ranked unjudged documents that are assumed to be not relevant” [193]. For instance, the top part of Figure 7.1 (Chapter 7, Page 122) shows a scenario in which three systems, A, B, and C, participated in an evaluation initiative and, therefore, contributed to the assessment pool with their top-10 results. As a consequence, each result of A ranked in the top 10 will be judged, while results beyond rank 10 may or may not be judged, depending on whether they appear in the top-10 results of B or C. Some time after the evaluation initiative, a new system D has to be compared against all other systems; however, its top-10 results may contain unjudged results that are typically considered as not relevant. This strongly penalizes D and other systems that retrieve results that are very different from the original documents that were part of the evaluation initiative.

On the one hand, to overcome the limitations of pooling, novel metrics taking judgment incompleteness into account (e.g., bpref [44]) or adopting sampling to delve deeper into the ranked list (e.g., infAP [204]) have been proposed. On the other hand, the recent trend of crowdsourcing relevance judgments enables the alternative approach depicted at the bottom of Figure 7.1. In this new scenario new systems exploit crowdsourcing to obtain the missing judgments by running the same micro-tasks that provided the original (crowdsourced) relevance judgments.

One key advantage of this approach is that subsequent runs created after the initial evaluation campaign can be judged on a fair basis as if they were part of the original results. One drawback however is that the number of relevant results changes (i.e., increases) in this way, thus making metrics that take this into account (e.g., MAP) incomparable with those computed originally by the evaluation initiative. Anyhow, the availability of the original retrieved results (i.e., the submitted runs which are available for all tasks at TREC) allows the authors of later approaches to recompute the metrics and compare them against previous approaches.

As the AOR approaches proposed in this chapter substantially differ from the approaches run during the evaluation campaign, both in terms of system architecture and retrieved results (see Table 3.3), we decided to adopt this new approach to obtain fair relevance judgments for the top-10 results of our various approaches.

In Chapter 7 we provide a deeper analysis of the advantages and disadvantages of running a continuous evaluation campaign and study how traditional IR evaluation metrics and pooling strategies compare in such settings.

3.8.3 Completing Relevance Judgments through Crowdsourcing

In order to obtain additional relevance judgments for unjudged entities in the top-10 results of the baselines and of our runs, we published micro-tasks (HITs) on Amazon MTurk.18 We followed the same task design and settings used to create the test collections for the AOR task at SemSearch [31, 89]. We asked the crowd to judge a total of 2,806 and 1,467 additional query-entity pairs for the 2010 and 2011 collections, respectively.

We split the tasks into batches of 10 entities per HIT, which were rewarded $0.20 each and assigned to 3 different workers. The overall number of relevant entities increased by 46% (from 2,028 to 2,959 entities) for the 2010 collection and by 75% (from 470 to 821 entities) for the 2011 collection.19

As a consequence, we can compare our approaches against the original SemSearch submissions by re-computing the selected metrics using the new relevance judgments, as suggested in Chapter 7. The results of the comparison are summarized in Table 3.11 and discussed later in this section.

18https://www.mturk.com/ 19The extended assessment files we created together with the script to generate them from MTurk output are available at https://git.io/vMTBY.


Table 3.3 – New retrieved results that were not retrieved by the BM25 baseline for top-10 results of IR and graph-based approaches in both collections. “Unj” stands for “Unjudged”.

                 SemSearch 2010                   SemSearch 2011
               New Retr.  New Rel.  New Unj.    New Retr.  New Rel.  New Unj.
Extension        264        9         260         179        23        175
Query Autoc.     310        13        306         216        22        212
PRF3             220        6         218         153        14        152
NER-heuristic    20         18        14          14         13        10
SAMEAS           84         50        68          20         10        12
S1_1             204        118       155         56         18        39
S1_2             310        86        277         103        20        94
S1_3             303        84        271         103        20        94
S2_1             14         7         11          2          1         1
S2_2             14         7         11          2          1         1
S2_3             2          1         2           0          0         0
Entity-Types     34         24        32          3          1         2
Decision-List    218        143       143         51         22        29

3.8.4 Baselines

We evaluate the effectiveness of our methods by comparing them against three baselines: the unstructured inverted index based on BM25 described in Section 3.5, the Sequential Dependence Model (SDM) [125], and a structured language model (StrLM) designed for AOR [140]. The first two baselines are implemented on top of an inverted index built on all the literals attached to the entity (entity profile). The parameter b in BM25 was selected by cross-validation on the 2010 and 2011 datasets, while for SDM we fixed λT = 0.8, λO = 0.1, and λU = 0.1, since such an assignment is very robust and near-optimal for many retrieval tasks [125]. The last baseline, StrLM, is the Structured Entity Model proposed in the work by Neumayer et al., cited previously; the authors kindly provided us with runs of their system on the two datasets we use. We obtained relevance judgments as described previously for all baselines. The effectiveness of the three baselines is reported in the upper part of Table 3.10.

3.8.5 Evaluation of the Inverted Index Searcher

Table 3.4 gives the results of the comparison of the inverted index based baselines, BM25 and SDM, against standard IR techniques such as query expansion using related words (Extension), query autocompletion as provided by commercial Web search engines (Query Autoc.), and Pseudo Relevance Feedback using the top-3 retrieved entities (PRF3). We experimented with various approaches to handle the terms in the query and in the end opted for a disjunctive approach (i.e., we take each term in the query separately and “OR” the results) since it performed best.


Table 3.4 – Standard IR approaches over the inverted index.

                2010 Collection                2011 Collection
              MAP     P10     NDCG           MAP     P10     NDCG
Query Autoc.  0.0883  0.2076  0.4319         0.1195  0.2160  0.4530
Extension     0.1452  0.3098  0.5536         0.1629  0.2660  0.4902
PRF           0.1660  0.3033  0.5305         0.1407  0.2680  0.5216
BM25          0.2582  0.5196  0.6385         0.2438  0.4100  0.6194
SDM           0.2552  0.5728  0.6276         0.2178  0.3400  0.5546

Table 3.5 – Approaches based on a structured inverted index with BM25 and BM25F scoring. “*” denotes statistically significant difference against the BM25 baseline.

                        2010 Collection                2011 Collection
                      MAP      P10      NDCG         MAP      P10      NDCG
BM25   URI only       0.1791   0.3304   0.6348       0.0941   0.1740   0.4413
       Label only     0.1817   0.4196   0.6180       0.1822   0.2880   0.5913
       Attrib. only   0.2079   0.4207   0.5773       0.1422   0.2580   0.4884
       NER-heuristic  0.2686*  0.5467*  0.6453       0.2566   0.4120   0.6289
BM25F  URI-Label      0.2669   0.4554   0.6704       0.1875   0.2840   0.5740
       Label-Attrib.  0.2349   0.4348   0.5870       0.1576   0.2640   0.4943
       URI-Attrib.    0.2475   0.4185   0.5811       0.1583   0.2540   0.4604
       ULA            0.2659   0.4511   0.5924       0.1724   0.2640   0.4710

The only exception concerns methods based on a structured index and BM25F scoring, for which we adopt a conjunctive approach.

As we can see, the BM25 baseline performs best on both test collections. We observed that, if we consider only Average Precision, it outperforms the standard IR techniques in 82 queries out of 92 in the 2010 collection, and in 34 queries out of 50 in the 2011 collection. This indicates that methods working well for other search tasks may not be directly applicable to AOR due to the specific semantics of the user query, meant to uniquely describe one specific entity. We also observed that the effectiveness of SDM is comparable to that of BM25, outperforming its Precision@10 in the 2010 dataset.

Table 3.5 gives results for the structured inverted index approach (see Section 3.5.2). The table lists results for indexes built on entity profiles constructed using different types of literals. When compared to the BM25 baseline index (which aggregates all literals directly attached to the entities in a single document), we observe that structured indexes perform better. Specifically, when a conjunctive query and BM25F ranking are used over a structured inverted index, effectiveness increases. In this context, the best results are obtained by the index encompassing all three fields, that is, URI, Label, and Attributes (ULA). However, none of these approaches outperform the StrLM baseline.


Figure 3.2 – Average Precision (left), Precision (center), and number of retrieved entities (right) of the NER heuristic compared to the BM25 baseline for the two datasets.

NER and Structured Inverted Indexes We studied how the heuristic presented previously in Section 3.5.3 behaves by first evaluating how the named entity recognizer we use, which was originally designed for long documents, applies to the short keyword queries composing our datasets. To do that we manually annotate the queries by marking occurrences of named entities, and then use such ground truth to compute the named entity recognizer's precision and recall. On the 2010 dataset we measured a precision and a recall of, respectively, 0.67 and 0.2, while on the 2011 dataset the tool achieved a precision of 1 and a recall of 0.38 (F1 = 0.31 on the 2010 dataset and F1 = 0.55 on the 2011 dataset). Since our NER heuristic is applied only to named entities whose mentions span the whole query, we also report precision and recall computed by taking into account only queries which consist solely of an entity mention. We found 9 and 8 such queries in the 2010 and the 2011 datasets, respectively. Precision is 1 on both datasets while recall is 0.22 on the former (F1 = 0.36) and 0.3 on the latter (F1 = 0.53). Moreover, our approach was able to retrieve results for only 6 of the 8 entity-mention queries of the 2011 dataset due to the strict matching policy that we adopt. We notice that the F1 scores achieved by the NER on our limited dataset are substantially different from those reported by Finkel et al. [77] (F1 = 0.92 in the best case). We believe this is a consequence of the small test set that we used, and of the fact that we are running the tool on short text, thus reducing the context it can exploit to detect named entities. Nevertheless, as reported in the following, our experiments show that NER can positively contribute to improving AOR effectiveness if correctly combined with other retrieval techniques.

To evaluate the effectiveness of the NER heuristic, we measured the Average Precision it achieves on the above-mentioned queries, and compared it against that attained by the BM25 baseline. As Figure 3.2 shows, the heuristic does not outperform the baseline on several queries.


Figure 3.3 – Average Precision (left), Precision (center), and number of retrieved entities (right) of the Entity Retrieval by Entity Type method compared to the BM25 baseline for the 2010 dataset.

This is mostly due to the different number of results the two approaches retrieve: on the 2010 dataset the NER heuristic retrieved on average 203 entities per query, against the 895 retrieved by the baseline. On the other dataset the heuristic retrieved an average of 44 entities against the 1,000 retrieved by the baseline. Figure 3.2 (right) gives more detail by showing the number of retrieved entities per query. This effect is more evident when we compare the precision of the two methods: Figure 3.2 (center) clearly shows that in both datasets the heuristic trades off recall for precision. This motivates our choice to use it as evidence of relevance rather than as a standalone retrieval model (cf. the description of the nerFullQuery feature in Section 3.7).

Finally, we assess how our heuristic improves the BM25 baseline by linearly combining the two methods as described in Section 3.7. We first normalize their scores to lie in the interval [0,1] and then use the parameter λ to weight the BM25 score. We found that the best value of such parameter is λ = 0.1 in both cases, yielding a MAP of 0.2813 for SemSearch 2010 and of 0.2869 for SemSearch 2011. Notice that such a low value of λ indicates that, as expected, the NER-heuristic score, when present, better predicts the relevance of an entity than the BM25 one. In Table 3.5 we reported the score obtained by cross-validating between the two testsets.
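A minimal sketch of how such a value for λ can be selected follows: both score lists are min-max normalized and λ is swept on one collection, keeping the value that maximizes MAP. The `combine` function is the sketch given for Equation 3.1, while `mean_average_precision` is an assumed helper.

def min_max_normalize(scores):
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {e: (s - lo) / span for e, s in scores.items()}

def select_lambda(per_query_scores, qrels, combine, mean_average_precision):
    """per_query_scores: dict query -> (bm25_scores, heuristic_scores)."""
    best = (None, -1.0)
    for lam in [i / 10 for i in range(11)]:               # 0.0, 0.1, ..., 1.0
        runs = {q: combine(min_max_normalize(a), min_max_normalize(b), lam)
                for q, (a, b) in per_query_scores.items()}
        m = mean_average_precision(runs, qrels)
        if m > best[1]:
            best = (lam, m)
    return best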

3.8.6 Evaluation of Entity Retrieval by Entity Type

We now evaluate the methodology described in Section 3.6.1. We were able to find entity type mentions in 62 queries out of 92 in the 2010 dataset, and in 33 queries out of 50 in the 2011 dataset. However, since we only keep entities to which the method associates a positive score, our approach could retrieve results for only 14 and 2 queries for SemSearch 2010 and 2011, respectively. Relaxing the similarity constraint in order to retrieve results for more queries leads to the retrieval of many unrelated results with a null score that cannot be used for ranking.

The number of entities we retrieve varies from 1 to 52 (with an average of 11) for the 2010 dataset and from 2 to 78 (average 40) for the 2011 collection. As for the effectiveness of the methodology, we notice an effect similar to that of the NER-heuristic described previously, that is, its average precision is lower than that of the baseline because of the few retrieved entities; however, analogously to what we presented before, the precision of entity retrieval by type is much higher, showing that the methodology trades off recall for precision, making it a good feature to predict the relevance of an entity. Figure 3.3 reports Average Precision, Precision, and number of retrieved entities for the 2010 dataset.

In addition, we analyze how this methodology behaves when combined with the BM25 baseline by linearly combining the two approaches as described in Section 3.7, that is, we first normalize the scores and then we use the parameter λ to weight the BM25 scores. Its best values are λ = 0.4 for the 2010 collection (MAP = 0.3348) and λ = 0.8 for the 2011 collection (MAP = 0.2944). Table 3.10 shows the cross-validated evaluation of the approach (line “Entity-Types”).

3.8.7 Evaluation of Graph Traversal Techniques

Here we evaluate the graph traversal techniques introduced in Section 3.6.2 by investigating the impact of the following parameters on the linear combination of the graph traversal techniques scores with the BM25 baseline:

• N, the number of top entities from the IR method which are used as seeds for the structured queries;

• score(q,e,e′), the scoring function used to rank entities from the structured repository (cf. Section 3.6.3);

• τ, the threshold on the text similarity measure we use to consider an entity description a match to the query;

• λ, the weight used to linearly combine the results from the IR and the structured query approaches (cf. Section 3.7).

Parameter Analysis Figure 3.4 shows how effectiveness (MAP) varies with the parameter N, indicating the number of entities considered as input for the structured queries, both on the 2010 and 2011 collections. We observe that the best number of entities to consider lies between 3 and 4 in most cases, therefore we fix this parameter to 3 for the following experiments. The effectiveness is evaluated with the original relevance judgments from the SemSearch initiative.

Moreover, we observe that high values (i.e., 0.8−0.9) for the threshold τ used to discard entities lead to higher MAP in all cases. The optimal value for λ, weighting the BM25 score, varies for scope one and two approaches. Interestingly, for the best performing method S1_1, the optimal value for λ is 0.5 for both the 2010 and 2011 collections: for such an approach, our hybrid system reaches an optimal trade-off between IR and graph data management techniques.

Table 3.6 gives the performance of different functions used to compute score(q,e,e′). As we can see, the most effective method is the one that exploits the original BM25 score of e to score e′.



Figure 3.4 – Effectiveness values varying the number of top-N entities retrieved by IR approaches.

Table 3.6 – MAP values for S1_1 and S2_3 obtained by means of different instantiations of score(q,e,e′) using λ = 0.5 on the 2010 collection.

score(q,e,e′)              S1_1     S2_3
Count JW(e′,q) > τ         0.3099   0.2582
Avg count JW(e′,q) > τ     0.3058   0.2582
Sum JW(e′,q)               0.3105   0.2582
Avg JW(e′,q)               0.3057   0.2582
Sum BM25(e,q) − ε          0.3123   0.2642

This is the scoring function we use to compute the final results of the hybrid system as reported in Table 3.10.

3.8.8 Learning to Rank for AOR

In Section 3.7 we described the features used to predict the relevance of an entity with respect to a query; in the following, we report on the training of a learning to rank approach based on a decision list model.

In order to evaluate the effectiveness of this method, we perform cross-validation between the two SemSearch datasets. As we mentioned previously, the 2011 dataset has fewer queries than the 2010 one (50 and 92 queries, respectively); hence, the number of training examples is different for the two models we build. Specifically, we use 4,176 examples to build the model based on SemSearch 2010 that we use to classify the SemSearch 2011 instances, and 2,260 examples in the other case. In both cases, the value we want to predict (the relevance score of an entity) is a real number that lies in the interval [0,3]: the higher the number, the more relevant the entity is.

20In our experiments we set ε = 10⁻³.


Table 3.7 – Statistics on the training set. Number of examples split into the four grades of relevance we consider and time taken to generate them.

                        SemSearch 2010   SemSearch 2011
# examples              4176             2260
# not relevant          2186             1765
# somewhat relevant     338              398
# fair                  1016             97
# excellent             636              0
Feature Extraction (s)  639              277

Table 3.8 – Set of features selected by the correlation based feature selection algorithm for the two datasets.

SemSearch 2010        SemSearch 2011
bm25uri               bm25uri
typeRetrievalScore    typeRetrievalScore
nNEREntities          nNEREntities
bm25label             bm25label
isSameAs              isSameAs
isRedirects           isRedirects
isFullQueryEntity     containsEntities
isDisambiguates       isFullQueryEntity
                      isDisambiguates

We then use the predicted score to rank the retrieved entities. Table 3.7 shows statistics on the training data.21 As can be seen, the ratio of non-relevant entities is greater for the 2011 dataset, suggesting that its queries are “more difficult” than those contained in the 2010 dataset; this is also highlighted by the absence of entities judged excellent. The difference between the time taken to extract features from the training data, that is, the time spent computing all the evidence of relevance described previously (NER, entity type recognition, graph queries, etc.), is also remarkable: on average it took 6.95s to produce data from a single SemSearch 2010 query and 5.54s for a SemSearch 2011 query.

Finally, we run the correlation based feature selection algorithm proposed by Hall [88] to select the most predictive set of features, leading to the results shown in Table 3.8. We note that, except for the baseline, at least one evidence of relevance per approach was selected, upholding our intuition that the different approaches provide complementary evidence of relevance.

The effectiveness evaluation of the machine learning based method for AOR is summarized in Table 3.10 (line “Decision-List”).

21The data we used to train and test the model is available at https://git.io/vMTBY.


3.8.9 Effectiveness Evaluation of Hybrid Approaches

Tables 3.9 and 3.10 give results for the graph-based extensions of the BM25 ranking respectively before and after having obtained missing judgments for all the methods. We notice that, before obtaining the needed relevance judgments, StrLM dominates all other approaches, while after having collected the missing judgments its effectiveness is matched, and sometimes outperformed, by the new methods we proposed. This indicates how the systems participating in the original evaluation campaign were similar to StrLM, and how different approaches which retrieve many unjudged results are disadvantaged.

After having judged the top-10 documents retrieved by each method, we observe that most approaches based on scope one queries significantly improve over the baseline BM25 ranking. The simple S1_1 approach, which exploits owl:sameAs links plus Wikipedia redirect and disambiguation information, performs best among the graph traversal methods, obtaining a 21% improvement in MAP over the BM25 baseline on the 2010 dataset. The only method outperforming the structured language model baseline, StrLM, is the machine learning approach, which attains a 12% improvement in MAP and a 3% improvement in Precision@10 on the 2010 dataset. Nevertheless, the fact that the NDCG of the baseline is 12% higher than that of our approach suggests that on the 2010 dataset StrLM is able to prioritize highly relevant documents by ranking them higher. Interestingly, on the 2011 dataset the two approaches swap roles, with StrLM leading for Precision@10 and our machine learning algorithm leading for MAP and NDCG.

Figure 3.5 shows Precision/Recall curves for the BM25 and StrLM baselines, for the best graph traversal method, and for the machine learning approach (Decision-List). Again, we can see how Decision-List outperforms all other approaches.

To conclude, Table 3.11 reports on the comparison among our best method (Decision-List), the best baseline we selected (StrLM), and the submissions to the SemSearch initiative re-evaluated with our extended set of relevance judgments. We can see how the proposed techniques outperform the systems participating in the official SemSearch initiative.

3.8.10 Efficiency Considerations

Table 3.12 shows execution times for the different components of our system and for various approaches.22 Quite naturally, the IR baseline is faster than more complex approaches, and the execution time of the NER-heuristic is mostly spent doing NER. Interestingly, we observe that structured approaches based on scope one queries perform very well, only adding a very limited overhead to the inverted index approach. The S1_1 approach, which is the second best approach in terms of effectiveness, adds a cost of only 17% in terms of execution time to the BM25 baseline. The approaches based on scope two queries that use dbp:wikilink are more costly and would require a specific effort to make them scalable over large datasets.

22We did not put any emphasis on improving the efficiency of our system. Experiments were run on a single machine with a cold cache and disk-resident indexes for both the inverted indexes and the structured repository.


Table 3.9 – Graph-based refinements compared to the baselines before judging unjudged documents.

            2010 Collection               2011 Collection
          MAP     P10     NDCG          MAP     P10     NDCG
BM25      0.2260  0.3348  0.5755        0.1700  0.2020  0.4362
SDM       0.1806  0.2935  0.5485        0.1690  0.1980  0.4298
StrLM     0.2820  0.3978  0.6796        0.2606  0.2440  0.5406
SAMEAS    0.2213  0.3207  0.5643        0.1856  0.2160  0.4431
S1_1      0.2213  0.2891  0.5560        0.1898  0.2080  0.4416
S1_2      0.1965  0.2511  0.5323        0.1723  0.1920  0.4246
S1_3      0.1964  0.2533  0.5329        0.1700  0.1880  0.4252
S2_1      0.2275  0.3315  0.5763        0.1773  0.2060  0.4446
S2_2      0.2272  0.3304  0.5761        0.1779  0.2080  0.4450
S2_3      0.2298  0.3391  0.5813        0.1823  0.2120  0.4446
Ent-Types 0.2274  0.3359  0.5750        0.1720  0.2060  0.4413
Dec-List  0.2538  0.3424  0.5745        0.2164  0.2280  0.4530

Table 3.10 – Graph-based refinements compared to the IR baseline after having judged un- judged documents. “*” indicates statistically significant difference against the StrLM baseline.

            2010 Collection                2011 Collection
          MAP      P10     NDCG          MAP     P10     NDCG
BM25      0.2582   0.5196  0.6385        0.2438  0.4100  0.6194
SDM       0.2552   0.5728  0.6276        0.2178  0.3400  0.5546
StrLM     0.3099   0.6120  0.7377        0.2660  0.4240  0.6141
SAMEAS    0.2807   0.5370  0.6433        0.2529  0.4140  0.6247
S1_1      0.3123   0.5489  0.6469        0.2568  0.4060  0.6239
S1_2      0.2752   0.4565  0.6173        0.2483  0.3720  0.6062
S1_3      0.2741   0.4554  0.6169        0.2470  0.3680  0.6070
S2_1      0.2630   0.5163  0.6402        0.2458  0.4040  0.6199
S2_2      0.2632   0.5163  0.6402        0.2462  0.4040  0.6201
S2_3      0.2642   0.5250  0.6446        0.2492  0.4060  0.6194
Ent-Types 0.2600   0.5196  0.6386        0.2442  0.4040  0.6187
Dec-List  0.3476*  0.6337  0.6580        0.2853  0.4080  0.6257



Figure 3.5 – Precision/Recall curves for graph-based approaches on the 2010 collection.

The scope two approach S2_3, which uses less frequent object properties, still has a reasonable overhead (37%). The graph approach based on entity types, the only one that does not exploit any inverted index, suffers from a conspicuous overhead mostly due to the high cardinality of general types. For example, if in a query the type dbo:City is detected, then all the cities in the knowledge base are retrieved and analyzed. The execution time of the machine learning approach is lower than the sum of the execution times of its components, being only about 100 milliseconds slower than Entity-Types. We argue that its efficiency may be improved by parallelizing the computation of all the evidence of relevance, but this is out of the scope of this chapter.

In terms of storage consumption, the original collection containing 1.3 billion statements occupies 253GB. The baseline inverted index created with MG4J23 is 8.9GB, the structured index used with BM25F scoring is 5GB, the index containing the entity type labels is 23MB, and the graph index created with RDF-3X [139]24 is 87GB. We note that the graph index could easily be optimized by discarding all the properties that are not used by the graph traversals and neighborhood queries (we could save considerable space this way, from 50% to 95% approximately, depending on the graph approaches used).

23http://mg4j.di.unimi.it/ 24http://code.google.com/p/rdf3x/


Table 3.11 – Comparison between the best baseline, our best method, and the SemSearch submissions.

2010 Collection                               2011 Collection
Submission     MAP     P10     NDCG           Submission     MAP     P10     NDCG
sub27-dlc      0.0774  0.3891  0.6522         team32         0.1368  0.2300  0.5028
sub27-dpr      0.0738  0.3793  0.6479         team33         0.1350  0.2280  0.5291
sub27-gpr      0.0774  0.3891  0.6523         team51         0.0689  0.1420  0.3835
sub28-AX       0.0968  0.4359  0.6972         team52         0.0662  0.1420  0.3551
sub28-Dir      0.0736  0.3652  0.6555         team91         0.1183  0.2180  0.5223
sub28-Okapi    0.0931  0.4228  0.7113         team92         0.1638  0.2660  0.5632
sub29          0.0560  0.2848  0.6109         team93         0.1141  0.1940  0.4788
sub30-RES.1    0.1044  0.4163  0.7699         team131        0.1190  0.2000  0.4633
sub30-RES.2    0.1033  0.4185  0.7536         team132        0.1167  0.1900  0.4020
sub30-RES.3    0.1251  0.4924  0.7795
sub31-run1     0.0776  0.3717  0.6726
sub31-run2     0.0968  0.4239  0.6982
sub31-run3     0.1210  0.4826  0.7894
sub32          0.0454  0.2641  0.5885
StrLM          0.3099  0.6120  0.7377         StrLM          0.2660  0.4240  0.6141
Decision-List  0.3476  0.6337  0.6580         Decision-List  0.2853  0.4080  0.6257

Table 3.12 – Average query execution time (ms) on the 2011 dataset. Since the Decision-List approach is composed of many parts, we could not precisely compute its IR time and RDF time; we only report the average total time.

Approach        IR time   RDF time   Total time
BM25 baseline   285       -          285
Extension       580       -          580 (+104%)
Query Autoc.    1447      -          1447 (+408%)
PRF3            2670      -          2670 (+837%)
NER-heuristic   309       -          309 (+8%)
SAMEAS          285       30         315 (+11%)
S1_1            285       48         333 (+17%)
S1_2            285       84         369 (+29%)
S1_3            285       86         371 (+30%)
S2_1            285       1746       2031 (+613%)
S2_2            285       2192       2477 (+769%)
S2_3            285       105        390 (+37%)
Entity-Types    -         1103       1103 (+287%)
Decision-List   N/D       N/D        1658 (+482%)


3.9 Conclusions

In this chapter, we tackled the problem, introduced in Chapter 2 (Page 18), of retrieving entities without knowing their identifiers, by proposing a hybrid system for effectively and efficiently solving the Ad-hoc Object Retrieval task. Our approach gathers evidence of the relevance of an entity with respect to a query by exploiting both an inverted index storing entity profiles and a structured database storing graph data.

Our extensive experimental evaluation shows that standard IR techniques like query expansion and pseudo-relevance feedback do not substantially improve the results provided by an inverted index, while detecting named entities and entity types in the query leads to better precision. Finally, we show that the use of structured search on top of standard IR approaches can lead to significantly better results (up to 12% MAP improvement over a state-of-the-art language model based approach for AOR). This is obtained by incorporating additional components in the system architecture and by implementing additional merging and ranking functions in the processing pipeline, including training a Decision-List supervised classifier. Our results also suggest a trade-off between effectiveness (in terms of Mean Average Precision) and execution time. The proposed techniques are easily scalable using scale-out indexing architectures. The overhead of our best approach, which combines all the evidence of relevance we extracted, is four times greater than that of the BM25 baseline. This overhead is mostly due to the fact that all the features used to compute the score of an entity are computed sequentially. We believe that a parallelization of the feature extraction methods would significantly reduce the execution time of the approach while maintaining its effectiveness. However, we show that a simple graph-based approach, leading to results only slightly inferior to those of our best method, causes only a limited overhead (+17%) in terms of latency.

We consider our hybrid results very promising, especially given that the LOD sample datasets used in the test collections were extremely noisy and incomplete. We believe that recent advancements in entity linking and automated data cleaning techniques such as those we report in Chapter5 could, hopefully, alleviate those issues.


4 Displaying Entity Information: Ranking Entity Types

In Chapter 3 we discussed methods to retrieve entities given keyword queries; for example, given the query “Microsoft CEO”, we obtain the URI of the entity “Bill Gates”. We can now check all the information we have about the entity; however, when we are asked the question “What is Bill Gates?” we can choose from 42 different answers, including “Person”, “Object”, “Director of Microsoft”, and “Bridge Player”. How do we select the one that best fits our use-case?

4.1 Introduction

A large fraction of online queries targets entities [107]. As a result, commercial search engines are becoming more and more specialized in handling this kind of query and are now able to return rich Search Engine Result Pages (SERPs) that do not just provide the user with ten blue links but also feature images, videos, news, entity summaries, etc. To create such SERPs, search engines first need to identify the entity described by the user query, for example by using methods similar to those described previously in Chapter 3, and then need to decide what information about the detected entity should be extracted from the underlying knowledge graph and shown to the user. It is possible, for example, to display pictures, a short textual description, and a few specific related entities. One interesting entity facet which can be displayed in the SERP is the type of the retrieved entity. As we discussed in Chapter 2, in public knowledge graphs such as Yago or DBpedia, entities are associated with several types. For example, the entity “Bill Gates” in Yago has 42 types, including “Person”, “Object”, “Director of Microsoft”, and “Bridge Player”.1 When deciding what to show in the SERP, it is important to select the few types the user would find relevant. Types such as “Object” or “Person” can be considered too generic and thus not interesting, while others may be interesting for a user who does not know much about the entity, for example “American Billionaire”. Finally, users who already know the entity but are looking for some of its specific facets might be interested in less obvious types, such as “Bridge Player”, and its associated search results.

1https://gate.d5.mpi-inf.mpg.de/webyago3spotlx/Browser?entity=%3CBill_Gates%3E accessed on January 4, 2017.


More than just for search, entity types can be displayed to Web users while browsing and reading webpages. In such a case, pop-ups displaying contextual entity summaries similar to those displayed on SERPs such as the Google Knowledge Panel can be shown to users who want to know more about the entities they are reading about. In this case again, picking the types that are relevant is critical and highly context-dependent.

A third example scenario involves using selected entity types to summarize the content of webpages or online articles. For example, one might build the summary of a given news article by extracting the most important entities in the article and listing their most relevant types (e.g., “this article is about two actors and the president of Kenya”).

In this chapter, we focus on TRank: the novel task of ranking available entity types based on their relevance given a context. We propose several methods exploiting the entity type hierarchy (i.e., types and their subtypes such as “Person” and “Politician”), the graph structure connecting semantically related entities (potentially through the type hierarchy), collection statistics such as the popularity of the types or their co-occurrences, and text statistics such as the probability of observing a mention of an entity of a certain type after a given n-gram of text.

We experimentally evaluate our different approaches by using crowdsourced judgments on real world data and by extracting different amounts of context in which entity mentions appear (e.g., only the entity mention, one sentence, one paragraph, etc.). Our experimental results show that approaches based on the type hierarchy perform more effectively in selecting the entity types to be displayed to the user, and that combining several of the ranking functions we propose by means of learning to rank models yields the best effectiveness. We also assess the scalability of our approach by designing and evaluating a Map/Reduce version of our ranking process over a large sample of the CommonCrawl dataset exploiting existing schema.org annotations.2

Summarizing, the main contributions described in this chapter are:

• The definition of the new task of entity type ranking, whose goal is to select the most relevant types for an entity given some context.

• Several type-hierarchy and graph-based approaches that exploit both schema and instance relations to select the most relevant entity types based on a query entity and the user browsing context.

• An extensive experimental evaluation of the proposed entity type ranking techniques over a Web collection and over different entity type hierarchies including YAGO [174] and Freebase by means of crowdsourced relevance judgments.

• A scalable version of our type ranking approach evaluated over a large annotated Web crawl.

2http://commoncrawl.org/


• The proposed techniques are available as an open-source library as well as an online web service.3

The rest of the chapter is structured as follows. We start below by describing related work and then we formally define our new type ranking task in Section 4.3. Section 4.4 is devoted to the presentation of our system's architecture, while in Section 4.5 we propose a series of approaches based on collection statistics, type hierarchies, and entity graphs. Our methods are then evaluated using a ground truth obtained with the crowdsourcing techniques described in Section 4.6. The results of our evaluation are reported in Section 4.7, in which we also describe a scalability validation of our Map/Reduce implementation over a large corpus. Finally, we discuss our findings in Section 4.8, and conclude the chapter in Section 4.9.

4.2 Related Work

Entity-centric data management is an emerging area of research at the overlap of several fields including Databases, Information Retrieval, and the Semantic Web. In Section 2.3 and Chapter 3 we described current advancements on techniques to retrieve entities given keyword queries (Ad-hoc Object Retrieval, AOR), to spot entity mentions in a given text (Named Entity Recognition, NER), and to connect them to entities in a KG (Entity Linking, EL). In this chapter we go one step further by targeting the specific problem of assigning types to entities that have already been through those two steps, that is, our input consists of entities already extracted from webpages and correctly linked to entries of a given knowledge graph. We next describe work by other researchers that is connected to what we present in the following sections.

4.2.1 Named Entity Recognition

Methods for NER are related to what we present later in this chapter since, as we briefly mentioned in Section 2.3.2, they typically provide as output some type information about the identified entities. For example, two of the most well-known tools for NER, namely CoreNLP and GATE, provide very generic types, including people, locations, and organizations [119, 58]. Nevertheless, approaches considering more types have been developed. Finkel and Manning, for example, develop a method for NER that distinguishes 18 types [78], while Nadeau et al. propose a system that recognizes 100 entity types by using a semi-supervised approach. These 100 types are defined by the BBN linguistic collection, which includes 12 top types and 64 subtypes.4 While this is useful for applications that need to focus on one of those generic types, for other applications such as entity-based faceted search it would be much more valuable to provide specific types that are also relevant to the user's browsing context.

3https://github.com/eXascaleInfolab/TRank and http://trank.exascale.info, respectively
4https://catalog.ldc.upenn.edu/LDC2005T33


4.2.2 Entity Types

The Semantic Web community has been creating large-scale knowledge graphs defining a multitude of entity types. Efforts such as YAGO [174], DBpedia [29] and Freebase [36], which we described in Chapter 2, have built large collections of structured representations of entities along with their related types. Such knowledge graphs hence represent extremely valuable resources when working on entity type ranking as we do in the following.

To the best of our knowledge, the first work proposing methods for ranking entity types was done by Vallet and Zaragoza [187]. The authors proposed a method to select the best entity type matching a Web search query. Similarly, Balog and Neumayer propose methods to select the best type given a query by exploiting the type hierarchy from a background knowledge graph [17]. As compared to such contributions, we aim to rank types assigned to entities mentioned on the Web and to select the right granularity of types from the background type hierarchy. This last point was reported to be “particularly challenging” by the authors of the above mentioned paper [17].

An interesting demo by Tylenda et al. [186] proposed the task of selecting the most relevant types used to summarize an entity. The focus of that work was on generating an entity description of a given size, while our focus is to select the most relevant types given the context in which the entity is described. However, similarly to that work, we build our approaches on large KGs such as YAGO and DBpedia. More recently, a few years after our original publication, the new related task of Triple Scoring was analyzed at the WSDM Cup 2017 [92]. This novel task requires systems to score relevant triples describing "type-like relations". It extends the task we tackle by also considering information that can help define what entities are, such as properties specifying the profession of the entities taken into consideration.

Another series of contributions that is related to what we are going to present focuses on assigning additional types to entities of a KG or to emerging entities, that is, new entities that are not yet part of the KG. For example, Tipalo is a system featuring an algorithm to extract entity types based on the natural language description of the entity taken from Wikipedia [80], while PEARL is an approach that selects the most appropriate type for an entity by leveraging a background textual corpus of co-occurring entities and performing fuzzy pattern matching for new entities [137]. Yao et al. also worked on extracting entity types from text [203]; however, contrary to our case, their approach is not bound to a fixed ontology and extracts entity types coming from textual surface-form expressions without linking them to any knowledge graph. For example, from the sentence "Jan Zamoyski, a magnate" the approach extracts the fact that the person is a magnate but does not link the string "magnate" to any concept in a KG. Finally, slightly differently from the other work we just mentioned, Paulheim and Bizer propose a method for the automatic creation of instantiation axioms stating that an entity is an instance of a certain type by exploiting only statistics on the usage of properties of the KG taken into consideration [145]. Although, similarly to PEARL, we also rely on a background n-gram corpus to compute

type probabilities (cf. Section 4.5.3), the task we tackle is different from the tasks discussed by the above-mentioned approaches since we aim at ranking existing information on the types of entities already contained in a KG, assuming that such information is correct.

4.3 Task Definition

Given a knowledge graph containing semi-structured descriptions of entities and their types, we define the task of entity Type Ranking (TRank, for short) for a given entity e appearing in a textual context c_e as the task of ranking all the types T_e = {t_1, ..., t_n} associated to e based on their relevance to c_e. In RDFS/OWL, the set T_e is typically given by the objects that are related to e's URI via the rdf:type property. Moreover, we take into consideration the types of entities belonging to other selected ontologies and connected to e via an owl:sameAs link. For example, dbr:Tom_Cruise has an owl:sameAs connection to fb:Tom_Cruise, which allows us to add the new type fb:fashionmodels.

The textual context c_e of an entity e is defined as the textual content surrounding the entity mention in a certain document. This context can have a direct influence on the rankings. For example, the entity "Barack Obama" can be mentioned in a Gulf War context or in a golf tournament context. The most relevant type for "Barack Obama" will likely differ between the two contexts. The different context types we consider in this chapter are: i) three paragraphs around the entity mention (one paragraph preceding, one following, and the paragraph containing the entity); ii) one paragraph only, containing the entity mention; iii) the sentence containing the entity mention; and iv) the entity mention itself with no further textual context.
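To make the four granularities concrete, the sketch below shows one possible way of extracting them around a known entity mention. The blank-line paragraph heuristic and the simple sentence splitter are illustrative assumptions, not the exact preprocessing used by TRank.

    import re

    def extract_contexts(document, mention):
        """Return the four context granularities for the first occurrence of `mention`.

        Paragraphs are assumed to be separated by blank lines and sentences are split
        on simple punctuation: a rough approximation of the original preprocessing.
        """
        paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
        for i, par in enumerate(paragraphs):
            if mention in par:
                sentences = re.split(r"(?<=[.!?])\s+", par)
                sentence = next((s for s in sentences if mention in s), par)
                three_par = " ".join(paragraphs[max(0, i - 1): i + 2])
                return {
                    "entity-only": mention,
                    "sentence": sentence,
                    "paragraph": par,
                    "3-paragraphs": three_par,
                }
        return None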

To rank entity types we also exploit hierarchies of entity types. In RDFS/OWL, a type hierarchy is typically defined based on the property rdfs:subClassOf. For example, in DBpedia we observe that dbo:Politician is a subclass of dbo:Person. Knowing the relations among types and their depth in the hierarchy is often helpful to automatically rank entity types. For example, we might prefer a more specific type (e.g., “American Politician”) rather than a more general one (e.g., “Person”).

We evaluate the quality of a given ranking (t_1, ..., t_n) by using ground truth relevance judgments assessing which types are most relevant to an entity e given a context c_e. We discuss rank-based evaluation metrics in Section 4.7.

4.4 System Architecture

Our solution, TRank, automatically selects the most appropriate entity types for an entity given its context and type information.5 TRank implements several components to extract entities and automatically determine relevant types. First, given a webpage (e.g., a news article), we identify entities mentioned in its textual content by using a state-of-the-art named entity recognizer focusing on persons, locations, and organizations.

5Note that we use the same name, TRank, both for the entity type ranking task and for the system we propose to address it.


Figure 4.1 – TRank's pipeline: text extraction (BoilerPipe), Named Entity Recognition (Stanford NER), entity linking through an inverted index mapping DBpedia labels to resource URIs, type retrieval through an inverted index mapping resource URIs to type URIs, and type ranking (hierarchy-based, textual-context, entity-context, and entity-based).

Next, in order to obtain a URI for each extracted entity, we use entity mentions to query an inverted index built over DBpedia literals and use the URI we obtain to retrieve all the types attached to the entity (for example by using a SPARQL query). For instance, the entity dbr:Tim_Berners-Lee is associated with more than 30 types, including owl:Thing, dbo:Scientist, schema:Person, and yago:Honoree. Finally, our system produces a ranking of the resulting types based on evidence computed by using different methods exploiting the textual context where the entity was mentioned. Figure 4.1 summarizes the different steps composing the TRank pipeline, which we briefly describe in the rest of this section, together with the integrated type hierarchy we built. The rest of this chapter will then be devoted to the definition and the experimental comparison of different methods for ranking entity types.

Entity Extraction The first component of the pipeline takes as input a document collection and performs NER, that is, the identification of entity mentions in the text. Entities that can be accurately identified are persons, locations, and organizations. The current implementation of our system adopts the Conditional Random Field approach for NER included in the Stanford CoreNLP library [77, 119].

Entity Linking The following step is entity linking, whose goal is to assign a URI to an entity mention identified in the previous step. To this end, we have to disambiguate entity mentions by uniquely identifying the entities they denote. For example, the entity mention "Michael Jordan" can denote either the former NBA basketball player or a Machine Learning professor. TRank decides which entity corresponds to a given mention by using a variant of the BM25 baseline described previously in Chapter 3 (Page 35), which was also used for linking entities in ZenCrowd [64]: we first create an inverted index over all the literals attached to entries in DBpedia, and then query it by using the given entity mention. Entities are then ranked by using the TF-IDF similarity between their labels and the entity mention given as input. We decided to opt for such a simple approach since it is easy to implement and to parallelize; note, however, that improving the state of the art in named entity recognition or linking is out of the scope of this study.


Figure 4.2 – TRank's integrated type hierarchy, combining DBpedia, schema.org, and YAGO types through explicit rdfs:subClassOf relationships, relationships inferred from the PARIS YAGO/DBpedia mappings, and a few manually added mappings.

Entity Type Retrieval and Ranking Finally, given an entity URI we retrieve all its types from a background RDF corpus or from a previously created inverted index, and rank them by using the methods described in Section 4.5. TRank retrieves entity types from the Sindice-2011 RDF dataset [48], which is a subset of the Web of Data of about 11 billion RDF triples and includes important components of Linked Open Data (we briefly described Linked Open Data in Chapter 2).

Integrating Different Type Hierarchies For the purpose of our task, we require a large and integrated collection of entity types to enable fine-grained typing of entities. As we mentioned in Chapter 2, there are several large ontologies available; however, the lack of alignments among such ontologies hinders the ability to compare types belonging to different datasets. In TRank we exploit existing mappings provided by DBpedia and PARIS [173] to build a coherent tree of 447,260 types, rooted on owl:Thing and with a maximum depth of 19. Figure 4.2 shows a visual representation of the integrated type hierarchy used by TRank. The tree is composed of all the rdfs:subClassOf relationships among DBpedia, YAGO and schema.org types. To eliminate cycles and to enhance coverage, we exploit owl:equivalentClass to create rdfs:subClassOf edges pointing to parent classes (in case one of the two classes does not have a direct parent). Considering that the probabilistic approach employed by PARIS does not provide a complete mapping between DBpedia and YAGO types, we manually added 4 rdfs:subClassOf relationships (reviewed by 4 domain experts) to obtain a single type tree rather than a forest of 5 trees.6

4.5 Approaches to Entity Type Ranking

The proposed approaches for entity type ranking can be grouped into entity-centric, context- aware, and hierarchy-based. Entity-centric approaches look at the relation of the given entity e with other entities in a background knowledge graph following edges such as dbo:wikilink and owl:sameAs, while context-aware approaches exploit data on the co-occurrences of e and words or other entities in the textual context given as input. Hierarchy-based approaches look at relations between types such as type subsumption and rank the types of a given entity based on them.

4.5.1 Entity-Centric Ranking Approaches

The first group of approaches we describe only considers background information about a given entity and its types and does not take into account the context in which the entity appears.

Our first entity-centric approach, FREQ, ranks entity types based solely on their frequency in the background knowledge graph, that is, the more instances a type has, the higher it is ranked. For example, the type “Person” has more instances than “EnglishBlogger”, thus it is more popular. As a consequence, the former is always ranked higher than the latter.

Our second approach, WIKILINK, exploits the relations existing between the given entity and further entities in the background knowledge graph. Specifically, we count the number of neighboring entities that share the same type. This can be performed, for example, by issuing the following SPARQL queries retrieving connected entities from/to e:

    SELECT ?x WHERE { <e> dbo:wikiLink ?x . ?x rdf:type <t> }

    SELECT ?x WHERE { ?x dbo:wikiLink <e> . ?x rdf:type <t> }

where <e> stands for the URI of the input entity and <t> for the type whose score is being computed. For instance, to rank the types of the entity e of Figure 4.3 we exploit the fact that several linked entities also have the type "Actor" to rank it higher than the type "Person".

In a similar way, in our third approach, SAMEAS, we exploit the owl:sameAs connections contained in the knowledge graph to get other equivalent entities. We then rank the types of the input entity based on how many times they occur among the newly retrieved entities, as shown by the following SPARQL query.

6The type hierarchy is available in the form of a small inverted index that provides for each type the path to the root and its depth in the hierarchy at https://git.io/vM0iI



Figure 4.3 – An example of entity-centric ranking approach exploiting the types of related entities in a background knowledge graph. The type “Actor” is ranked higher than the type “Person” since two related entities feature it (against one entity featuring the latter type).


Figure 4.4 – Examples of hierarchy-based ranking approaches: when using DEPTH the highest ranked type is "Humanitarian Foundation", while the top type is "Actor" if the methods ANCESTORS or ANC_DEPTH are applied.

    SELECT ?x WHERE { <e> owl:sameAs ?x . ?x rdf:type <t> }

Our last entity-centric approach, LABEL, adopts text similarity methods. We consider the label l_e of the input entity e and look for related entities by measuring the TF-IDF similarity between l_e and the labels of other entities contained in the background knowledge graph.7 At this point, we inspect the types of the most related entities to rank the types of e. More specifically, we select the top-10 entities having the most similar labels to l_e and rank types based on their frequency among those entities.
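A minimal sketch of the LABEL idea follows, using scikit-learn's TF-IDF vectorizer in place of the inverted index mentioned in the footnote; the dictionary-based representation of labels and types is an illustrative assumption.

    from collections import Counter
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def label_rank(entity_label, entity_types, kg_labels, kg_types, k=10):
        """Rank the types of an entity by their frequency among the k entities
        whose labels are most TF-IDF-similar to `entity_label`.

        kg_labels: dict URI -> label; kg_types: dict URI -> set of type URIs.
        """
        uris = list(kg_labels)
        vectorizer = TfidfVectorizer()
        label_matrix = vectorizer.fit_transform([kg_labels[u] for u in uris])
        query_vec = vectorizer.transform([entity_label])
        sims = cosine_similarity(query_vec, label_matrix)[0]
        top = sorted(range(len(uris)), key=lambda i: sims[i], reverse=True)[:k]
        counts = Counter(t for i in top for t in kg_types[uris[i]])
        # Order the candidate types of the input entity by that frequency.
        return sorted(entity_types, key=lambda t: counts[t], reverse=True)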


4.5.2 Hierarchy-Based Ranking Approaches

The more complex techniques described below rank the types t_i ∈ T_e attached to the input entity e by exploiting the type hierarchy described previously and depicted in Figure 4.2.

The simplest approach we define, DEPTH, gives to each type ti a score that is equal to its depth in the type hierarchy. As the reader can guess, this approach favors types that are more specific (i.e., deeper in the type hierarchy); for example, the type “Person Of Scottish Descent” is always ranked higher than “Person”.

In some cases, always selecting the most specific types leads us to favor types providing information that is not useful to the end user. For example, we can imagine several scenarios in which the user prefers to know that Bill Gates is a "Billionaire" rather than being told that he is a "BridgePlayer". That is, we sometimes want to favor certain branches of the type hierarchy focusing on particular facets of an entity; in this case, the branch of our integrated hierarchy containing types connected more to wealth and social status. To do this, we define a method,

ANCESTORS, that takes into consideration how many types t_j are both ancestors of t_i and types of e. This method is formally defined in Equation 4.1 and makes use of the function Ancestors that, given an entity type t, returns a set containing all its ancestors in the type hierarchy. As an example, suppose that the types marked in red in Figure 4.4 are all the types of a certain entity. If we apply ANCESTORS to rank them, we have that "Actor" is ranked before "Humanitarian Foundation" even if the latter type is placed deeper in the hierarchy.

ANCESTORS(t_i) = | { t_j | t_j ∈ Ancestors(t_i) ∧ t_j ∈ T_e } |    (4.1)

A variant of this approach, ANC_DEPTH, considers not just the number of such ancestors of t_i but also their depth; it is defined by Equation 4.2, where depth denotes the function mapping entity types to their depth in our integrated hierarchy.

ANC_DEPTH(t_i) = Σ_{t_j ∈ T_e ∧ t_j ∈ Ancestors(t_i)} depth(t_j)    (4.2)
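A compact sketch of the three hierarchy-based scorers is shown below, assuming the integrated type hierarchy is available as a parent map (child type URI -> parent type URI, with owl:Thing as root); the parent map itself is an assumption standing in for the hierarchy index described in Section 4.5.6.

    def ancestors(t, parent):
        """All ancestors of type t (excluding t) up to the root of the hierarchy."""
        result = set()
        while t in parent:
            t = parent[t]
            result.add(t)
        return result

    def depth(t, parent):
        return len(ancestors(t, parent))

    def depth_score(t, entity_types, parent):        # DEPTH
        return depth(t, parent)

    def ancestors_score(t, entity_types, parent):    # ANCESTORS, Eq. 4.1
        return len(ancestors(t, parent) & set(entity_types))

    def anc_depth_score(t, entity_types, parent):    # ANC_DEPTH, Eq. 4.2
        return sum(depth(tj, parent) for tj in set(entity_types) & ancestors(t, parent))

    def rank_types(entity_types, parent, scorer):
        """Rank the types of one entity with one of the scorers above."""
        return sorted(entity_types, key=lambda t: scorer(t, entity_types, parent), reverse=True)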

4.5.3 Text-Based Context-Aware Approaches

We now propose two context-aware approaches that exploit the textual context in which the considered entities appear. Formally, we want to rank the types of an entity e mentioned in a window of text (w_{-k}, w_{-k+1}, ..., e, w_1, ..., w_k). In this method, we rank types according to the probability of seeing an entity of type t given the tokens w_{-k}, w_{-k+1}, ..., w_{-1}, w_1, ..., w_k. The intuition behind this is that sentences such as "I'm going to ⟨entity⟩." give clear indications on the possible types of ⟨entity⟩ that are relevant given the textual context (in this case we expect ⟨entity⟩ to be some kind of place).

7This can be efficiently performed by means of an inverted index over entity labels.


We use the text of Wikipedia webpages to compute relevant statistics for type ranking. More precisely, given a Wikipedia article, we extend the set of its user-annotated entities by using DBpedia Spotlight [123] to obtain more annotations. All the linked entities are then substituted by special tokens containing their entity types. For example, if we suppose that an entity e, having types t_0 and t_1, appears in a certain window of text w̄, we replicate w̄ twice: the first time we replace e with t_0, and the second time with t_1. The resulting sequence of tokens is finally split into n-grams, and aggregated counts are computed. We then compute the probability of an entity type t, given a window of text w̄, by averaging the probability of finding an entity of type t in each individual n-gram we can extract from w̄. This last probability is computed as shown in Equation 4.3, where T is the set of all considered types.

Pr(t | ng) = Count(t | ng) / Σ_{t_j ∈ T} Count(t_j | ng)    (4.3)

Notice that the probability of owl:Thing given any text is always 1 since every entity is an instance of that type and that, in general, coarser entity types tend to have higher probabilities. Unfortunately, as a consequence of the sparsity of natural language data, we do not always have evidence of the occurrence of all considered types, given a certain textual context. This is exacerbated when the knowledge graph taken into consideration contains many entity types. We address this issue by using a well-known technique from statistical machine translation known as Stupid Back-off smoothing [39], in which we fall back to estimating the probability by using (n−1)-grams if the original n-gram is not contained in our background corpus. For unigrams, the probability estimate is the probability of the given type in the knowledge graph, which is based on the number of instances of the selected type. We denote by NGramScore(t | ng) the probability of a type t given a context ng computed as just described.

Finally, the score that our first text-based context-aware approach, NGRAMS0, assigns to a type, given a window of text w̄, is computed as shown in Equation 4.4. NGrams(w̄, n) is a function returning all the n-grams composing w̄. During our experimental evaluation, we fix the length of the textual context w̄ to 5 tokens (that is, we take two tokens before and two tokens after the entity).

NGRAMS0(t, w̄) = ( Σ_{ng ∈ NGrams(w̄)} NGramScore(t | ng) ) / |NGrams(w̄)|    (4.4)
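The sketch below illustrates NGRAMS0 together with the back-off fallback of Equation 4.3; the nested `count` dictionary of (n-gram, type) co-occurrence counts, the `type_prior` prior, and the n-gram order are all assumptions about how the background statistics might be stored.

    def ngram_prob(t, ng, count, type_prior):
        """Pr(t | ng) as in Eq. 4.3, backing off to shorter n-grams and finally
        to the instance-count-based type prior (stupid back-off)."""
        while ng:
            totals = count.get(ng)
            if totals:                      # n-gram observed: apply Eq. 4.3
                return totals.get(t, 0) / sum(totals.values())
            ng = ng[1:]                     # back off to an (n-1)-gram
        return type_prior[t]                # unigram fallback: prior in the KG

    def ngrams(window, n=3):
        return [tuple(window[i:i + n]) for i in range(len(window) - n + 1)]

    def ngrams0(t, window, count, type_prior):
        """NGRAMS0(t, w) as in Eq. 4.4: average NGramScore over the window's n-grams."""
        grams = ngrams(window) or [tuple(window)]
        return sum(ngram_prob(t, g, count, type_prior) for g in grams) / len(grams)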

As one can notice, NGRAMS0 always prioritizes coarser types. In order to mitigate this phenomenon, we devise an approach drawing inspiration from the methods described in Chapter 5. Our second text-based context-aware approach, NGRAMS1, models the probability of seeing an instance of a certain entity type given a window of text as a flow crossing the type hierarchy, starting from the root (with probability 1) and descending the tree until either a

leaf is reached, or the flow of probability is broken (that is, we reach a node that does not transmit any probability to its children). Additionally, in order to avoid cases in which deeper or coarser types are always prioritized, when computing the score of an entity type we take into consideration the entropy of its children's probabilities, and that of its siblings' probabilities, too. For example, suppose that the probability that flowed to dbo:Actor is scattered almost equally among all its 465 children. This indicates that there is no clear preference on which of the children of dbo:Actor should co-occur with the given n-gram, so there is no added value in letting the probability mass flow any further, suggesting that the current node should be ranked higher than its children as it is more "informative" than each of them. In Section 4.8, we give more detail on the relation between entropy, number of children, and relevance.

Despite using the same concepts, we cannot directly apply the algorithm described in Chapter 5 since it was designed to return only one type. What we propose, instead, is a generalization of that approach based on machine learning techniques that uses features derived from the probability that a certain type co-occurs with the input n-gram. This is appropriate in our context as we can exploit the labeled data of our test set (see Section 4.6 for more information). Specifically, we use a decision tree [81] to assign a score to the input entity type and window of text by exploiting the following features:

1. NGRAMS0, the score of the NGRAMS0 method, playing the role of the type probability;

2. ratioParent, the ratio between the current type probability and the probability of its parent;

3. nChildren, the number of children to which some probability mass flows;

4. hChildren, the entropy of the probabilities of the children;

5. nSiblings, the number of siblings of the current node having some probability mass;

6. hSiblings, the entropy of the probabilities of the siblings.

Section 4.7 gives more technical detail and reports on the evaluation of NGRAMS0 and NGRAMS1; in addition, the relation between the features we selected and relevance is dis- cussed in Section 4.8.
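As a rough illustration of how these features could be derived, the sketch below computes the entropy- and ratio-based quantities for one candidate type from per-type probabilities and the hierarchy; the `probs`, `parent`, and `children` structures are assumptions about how the data is organized, not the thesis's actual implementation.

    import math

    def entropy(values):
        total = sum(values)
        if total == 0:
            return 0.0
        return -sum((v / total) * math.log2(v / total) for v in values if v > 0)

    def ngrams1_features(t, probs, parent, children):
        """Feature vector for one candidate type t.

        probs:    dict type -> NGRAMS0 score (probability mass reaching the type)
        parent:   dict type -> parent type
        children: dict type -> list of child types
        """
        p = probs.get(t, 0.0)
        parent_p = probs.get(parent.get(t), 0.0)
        child_p = [probs.get(c, 0.0) for c in children.get(t, [])]
        sib_p = [probs.get(s, 0.0) for s in children.get(parent.get(t), []) if s != t]
        return {
            "NGRAMS0": p,
            "ratioParent": p / parent_p if parent_p > 0 else 0.0,
            "nChildren": sum(1 for c in child_p if c > 0),
            "hChildren": entropy(child_p),
            "nSiblings": sum(1 for s in sib_p if s > 0),
            "hSiblings": entropy(sib_p),
        }

A regression tree trained on such vectors against the crowdsourced graded relevance (e.g., scikit-learn's DecisionTreeRegressor) can then play the role of the decision tree mentioned above.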

4.5.4 Entity-Based Context-Aware Ranking Approaches

This section focuses on context-aware approaches that, instead of analyzing the raw textual context in which the input entity e is mentioned, take into consideration other entities co-occurring with e. In this section we make an abuse of notation and write e_i ∈ c_e to denote the entities extracted from the textual context c_e.

The first approach we propose, SAMETYPE, is based on counting how many times each type t_i ∈ T_e appears among the other entities e_i ∈ c_e. In this case, we say that two types match when they share the same URI, or when they have the same label.


Figure 4.5 – Examples of entity-based context-aware approaches. On the left, the PATH method is used to rank first the type "American TV Actor" associated to the entity "Tom Hanks", co-occurring with "Tom Cruise" (T.H. and T.C. are used to mark ancestors of types associated to "Tom Hanks" and to "Tom Cruise", respectively). On the right, the SAMETYPE method is used to rank the type "Actor" associated to e first since another entity in the context features it.

For example, the type "Actor" associated to the entity e of Figure 4.5 (right) is ranked first because e co-occurs with other entities of the same type in the same context.

A slightly more complex approach, PATH, leverages both the type hierarchy and the context in which e appears. Given the entities e_i ∈ c_e, the approach measures how similar types are based on the type hierarchy. More precisely, we measure the similarity between two types t ∈ T_e and t′ ∈ T_{e_i} by taking the intersection between the paths from the root of the type hierarchy (i.e., owl:Thing) to t and to t′. The scoring method is formally defined by Equation 4.5.

PATH(t, c_e) = Σ_{e′ ∈ c_e} Σ_{t′ ∈ T_{e′}} |Ancestors(t) ∩ Ancestors(t′)|    (4.5)

Suppose, for example, that we want to rank the types "California State University Alumni" and "American TV Actor" associated to the entity "Tom Hanks" (T.H.), appearing in a context in which the entity "Tom Cruise" (T.C.) also appears, as depicted in Figure 4.5 (left). As can be seen, a large part of the path connecting "Thing" to "American TV Actor" is shared between the two entities, while there is no ancestor of "California State University Alumni" other than the root of the hierarchy that is also a type of "Tom Cruise"; thus, the former type is ranked higher than the latter.
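Equation 4.5 translates almost directly into code; in the sketch below the root-to-type path is taken to include the type itself, and `context_entity_types` is an assumed list of type sets, one per co-occurring entity.

    def root_path(t, parent):
        """Set of types on the path from t up to the root (t included)."""
        path = {t}
        while t in parent:
            t = parent[t]
            path.add(t)
        return path

    def path_score(t, context_entity_types, parent):
        """PATH(t, c_e) as in Eq. 4.5: overlap between the root-to-t path and the
        paths of the types of every other entity mentioned in the context."""
        own = root_path(t, parent)
        return sum(len(own & root_path(t2, parent))
                   for types in context_entity_types
                   for t2 in types)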

4.5.5 Mixed Approaches

Below, we report on techniques that exploit several features to rank entity types.


Mixing Evidence from Type Hierarchy and Knowledge Base

The first technique we propose, KB-HIER, uses decision trees [81] to combine features that come from the entity whose types are ranked, from the type hierarchy, and from the knowledge base. The features we consider in that context are:

1. popularity of the entity in the knowledge base, measured by computing the number of triples with the given entity as the subject.

2. nTypes, that is, the number of types connected to the entity.

3. nChildren, typeDepth, nSiblings, the number of children, depth in the type hierarchy, and number of siblings of the entity type taken into consideration.

We focus on the interplay among those features as we believe that it is the most interesting aspect of this approach.

Learning to Rank Entity Types

Finally, since the approaches we propose cover fairly different types of evidence to assess the relevance of a type, we also propose to combine our different techniques by determining the best potential combinations by using a training set, as it is commonly carried out by commercial search engines to decide how to rank webpages (see, for example, the work by Liu [114]). Specifically, we use decision trees [94] and linear regression models to combine the ranking techniques described above into new ranking functions.
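One way to realize this combination, sketched under the assumption that the per-method scores have already been assembled into a feature matrix and paired with the crowdsourced graded relevance labels, is to fit the two scikit-learn regressors and rank each entity's types by the predicted score; the specific hyper-parameters shown are arbitrary illustrations.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor

    def train_rankers(X, y):
        """X: one row per (entity, type) pair holding the scores of the individual
        ranking functions (FREQ, WIKILINK, DEPTH, NGRAMS1, ...); y: graded relevance."""
        lin_reg = LinearRegression().fit(X, y)                     # LIN-REG-style model
        dec_tree = DecisionTreeRegressor(max_depth=8).fit(X, y)    # DEC-TREE-style model
        return lin_reg, dec_tree

    def rank_with_model(model, type_uris, type_features):
        """Order the candidate types of one entity by the model's predicted relevance."""
        scores = model.predict(np.asarray(type_features))
        return [t for _, t in sorted(zip(scores, type_uris), reverse=True)]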

The effectiveness of these two approaches is discussed in Section 4.7, while in Section 4.8 we discuss the relations between the features we use.

4.5.6 Scalable Entity Type Ranking with MapReduce

Using a single machine and a SPARQL end-point to run the above methods on all the entities identified in a large-scale corpus is impractical, given the latency introduced by the end-points and the intrinsic performance limitations of a single node. Instead, we propose a self-sufficient and scalable Map/Reduce architecture for TRank, which does not require querying any SPARQL end-point and which pre-computes and distributes inverted indexes across the worker nodes to guarantee fast lookups and ranking of entity types. We build an inverted index over the DBpedia 3.8 entity labels for the entity linking step, and an inverted index over the integrated type hierarchy which provides, for each type URI, its depth and the path connecting it to the root of the hierarchy. This enables a fast computation of the hierarchy-based ranking methods proposed in Section 4.5.2.
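The sketch below conveys the shape of such a pipeline in a Hadoop-streaming style: a mapper that emits, for every input document, the ranked types of the entities it mentions, relying only on locally replicated lookup structures instead of a SPARQL end-point. The lookup formats, the JSON input, and the assumption that NER has already produced the mentions are all illustrative simplifications, not the pipeline's actual code.

    import json
    import sys

    # Locally replicated lookups, loaded once per worker (stand-ins for the
    # Lucene indexes distributed via HDFS in the real pipeline).
    LABEL_INDEX = {}   # entity label -> DBpedia URI
    TYPE_INDEX = {}    # entity URI   -> list of type URIs
    HIERARCHY = {}     # type URI     -> (depth, path-to-root list)

    def rank_types(type_uris):
        # ANCESTORS-style ranking using only the pre-computed hierarchy index.
        paths = {t: set(HIERARCHY.get(t, (0, []))[1]) for t in type_uris}
        return sorted(type_uris,
                      key=lambda t: len(paths[t] & set(type_uris)),
                      reverse=True)

    def mapper(stdin=sys.stdin, stdout=sys.stdout):
        for line in stdin:
            doc = json.loads(line)                   # one webpage per input line
            for mention in doc.get("mentions", []):  # NER assumed done upstream here
                uri = LABEL_INDEX.get(mention)
                if uri is None:
                    continue
                ranked = rank_types(TYPE_INDEX.get(uri, []))
                stdout.write(f"{doc['url']}\t{uri}\t{json.dumps(ranked)}\n")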


Figure 4.6 – (left) First task design tested in our pilot study in which the workers had to explicitly express their opinion on each of the proposed entity types. (right) Final version of the interface used by the crowd to generate relevance judgments.

4.6 Crowdsourced Relevance Judgments

Similarly to what we describe in Chapter 7, we use paid crowdsourcing to create the ground truth we use to evaluate our methods. This is also motivated by the fact that our system mostly targets anonymous Web users. Specifically, we run a series of micro-tasks over the Amazon MTurk platform.8 Each task was assigned to 3 different workers from the United States. The workers were asked to select the most relevant type for 5 different entities, and were paid $0.10 per task for entities shown without context and $0.15 for entities mentioned in some context (see Section 4.7 for more detail on how the dataset is composed). Additionally, we allowed workers to annotate entity extraction and entity linking errors, and to add additional types if the proposed ones were not appropriate. Overall, the creation of the ground truth cost $190.

8The collected data and task designs are available for others to reuse at https://git.io/vM0iI.


4.6.1 Pilot Study

In order to better understand how to obtain accurate relevance judgments from the crowd, we ran a pilot study in which we compared three different task designs. We assigned each different task to 10 workers, for a total of 30 requests and a budget of $6. The workers were requested to read a paragraph of text and rank the types of 6 entities having, on average, 11 entity types each.

Two of the task designs we analyzed were based on the interface depicted in Figure 4.6 (left) but differ in that in one the workers could select multiple relevant types, while in the other they could only select one relevant type. In this last setup workers had to explicitly mark all the other types as irrelevant. The rationale behind this choice was that having to express an opinion for each entity type forces the worker to read the type label, thus reducing spam (i.e., random selection of entity types). The high frequency of "yes" we recorded for the first task design (6.76 on average) led us to understand that the workers misinterpreted the crowdsourced task and signaled as relevant all the types that they knew were correct for the entity displayed, which was not the goal as clearly explained in the task instructions. The second task design was not popular among the workers due to the high number of clicks it required: only 7 workers out of 10 completed the task. This could have led to potentially long delays to obtain judgments for the whole dataset. Moreover, during the pilot study we monitored the main web forums where crowd workers exchange information about tasks (e.g., http://www.mturkforum.com) and we noticed that this task design was criticized for using "mega-bubbles", a neologism used to describe the presence of many radio buttons that must be clicked to complete the task. Finally, the interface of the third task design we considered is similar to the one shown in Figure 4.6 (right). In this case, users were still allowed to select only one relevant type for each entity displayed, but only one click was needed to complete the task.

Given the results of the pilot study, we selected our last task design to generate the relevance judgments in our datasets; this design also simulates our target use case of showing one single entity type to a user browsing the Web. Table 4.1 lists the inter-rater agreement among workers in terms of Fleiss’ κ computed on the final relevance judgments for each test collection we made. We observe that agreement increases with an increasing degree of context. We argue that the context helps the worker in selecting the right type for the entity. Without much context, workers are prone to subjective interpretation of the entity, that is, they associate to the entity the type that is the most relevant based on their background knowledge.

4.7 Experiments

4.7.1 Experimental Setting

We have created a ground truth of entity types mentioned in 128 news articles selected from the top news of each category of the New York Times website during the period 21 Feb – 7 Mar 2013.


Table 4.1 – Agreement rate among crowd assessors during the evaluation of the four test collections.

Collection      Fleiss' κ
Entity-only     0.3662
Sentence        0.4202
Paragraph       0.4002
3-Paragraphs    0.4603

On average, each article contains 12 entities and, after the entity linking step, each entity is associated to an average of 10.2 types from our integrated type hierarchy. We crowdsourced the selection of the most relevant types by following the results of the pilot study described in Section 4.6. To convert the raw annotations made by the crowd to binary relevance judgments we consider relevant each type which was selected by at least one worker, while, in a graded relevance setting, we use the number of workers who selected a certain type as the relevance score of that type. More than one worker can vote for a type, making it more relevant than other types which received just one vote; for example, if two workers decide that "Actor" is the best type for "Tom Cruise" in the current textual context and only one worker votes for "Scientology Adept", then "Actor" has a relevance score of 2 and "Scientology Adept" has a score of 1. Hence, we consider the former type more relevant than the latter.

Evaluation Metrics

As main evaluation metrics for comparing different ranking methods we use Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG), previously discussed in Chapter 2. Here, we compute the metrics for each entity appearing in a given textual context, that is, what we previously called "query" is now an entity/context pair.
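For concreteness, the small sketch below shows how NDCG could be computed for one entity/context "query" from the graded judgments just described (relevance = number of workers who selected the type); it follows the standard NDCG definition rather than any implementation detail specific to this thesis.

    import math

    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    def ndcg(ranked_types, relevance):
        """ranked_types: system ranking for one entity/context pair.
        relevance: dict type -> number of workers who selected it."""
        gains = [relevance.get(t, 0) for t in ranked_types]
        ideal = dcg(sorted(relevance.values(), reverse=True))
        return dcg(gains) / ideal if ideal > 0 else 0.0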

4.7.2 Dataset Analysis

We used the 128 articles we crawled to create four different datasets on which we test the proposed TRank approaches. The four datasets differ in the amount of textual context that surrounds entity mentions.

Entity-only Collection First, we use a collection consisting exclusively of entity mentions and their types, as extracted from the news articles. This collection is composed of 770 distinct entities: although we were able to extract 990 entities, we discarded 220 entities that either are associated to only one type or are marked as entity linking errors by the crowd. Each entity is associated with, on average, 10 types (which we then need to rank).


Table 4.2 – Overlap of relevance judgments obtained by displaying different amounts of textual context.

Context A       Context B       JudgOverlap
Sentence        Entity-only     0.7644
Paragraph       Entity-only     0.7689
3-Paragraphs    Entity-only     0.7815
Sentence        Paragraph       0.8501
Sentence        3-Paragraphs    0.8616
Paragraph       3-Paragraphs    0.8328

Sentence Collection We built a Sentence Collection consisting of all the sentences extracted from the original articles that contain at least two entities. As mentioned previously, human assessors were asked to judge the relevance of the associated entity types with respect to the given sentence. Thus, every assessor has to read the context and select the type that best describes the entity given the presented text. This collection contains 419 sentences composed, on average, of 32 words and mentioning, on average, 2.45 entities each.

Paragraph Collection This collection is composed of all the paragraphs longer than one sentence and containing at least two entities having more than two types. As a result, we could extract 339 paragraphs of text composed of, on average, 66 words and 2.72 entities each.

3-Paragraphs Collection The last collection we built contains the largest context for an entity: the paragraph where it appears together with the preceding and following paragraphs in the news article. Like the paragraph collection, this collection contains 339 context elements. Each of them is composed, on average, of 165 words. The entire context contains on average 11.8 entities that can be used to support the relevance of the entity types appearing in the mid paragraph.

To understand how textual context influences how entity types are ranked, we measured the similarity among relevance judgments obtained by displaying different levels of context to human judges. Specifically, to compare the judgments over two contexts A and B we use the formula reported in Equation 4.6. Table 4.2 shows the results we obtained. As can be seen, showing human judges entity mentions with no textual context yields somewhat different judgments as compared to displaying some contextual information about the entity. No major differences can be observed among different sizes of contexts.

JudgOverlap(A, B) = |sharedRelevant(A, B)| / min(|relevant(A)|, |relevant(B)|)    (4.6)
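Equation 4.6 translates directly into code; the judgment sets are assumed to be sets of (entity, type) pairs marked relevant under each context size.

    def judg_overlap(relevant_a, relevant_b):
        """Overlap between the relevance judgments obtained with two context sizes,
        as in Eq. 4.6 (relevant_a, relevant_b: sets of (entity, type) pairs)."""
        shared = relevant_a & relevant_b
        return len(shared) / min(len(relevant_a), len(relevant_b))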



Figure 4.7 – MAP and NDCG of FREQ (left) and DEC-TREE (right) for entities with different numbers of types on the 3-paragraphs collection.

4.7.3 Effectiveness Results

Here we study how the methods proposed previously perform on the four datasets we built. We also report briefly on how the crowd assigned types to badly linked entities, thus justifying future work on using crowdsourcing to get better entity types.

Evaluation over different Contexts

Figure 4.7 shows the evolution of MAP and NDCG values by varying the number of types associated to an entity. We can see that ranking entity types is more difficult for entities belonging to many classes as there are more possible rankings that can be chosen. Even with the simple FREQ approach we can obtain effective results when few types are assigned to an entity. On the right side of Figure 4.7 we can see the robustness of DEC-TREE over an increasing number of types associated to the entity.

Figure 4.8 shows average effectiveness values for entities in different sections of our NYT collection. We can see that the most effective ranking can be obtained on articles from the Dealbook section, which deals with mergers, acquisitions, and venture capital topics. The most challenging categories for ranking entity types include Arts and Style. On the right side of Figure 4.8 we can see, for each section of the dataset, the number of types associated to the entities it contains. Entities with most types (16.9) appear in the Opinion category while Dealbook contains entities featuring, on average, the least number of types (8.7). This can explain the good results obtained in that category.

Table 4.3 reports the results of the evaluation of all the methods described in Section 4.5. We notice that effectiveness values obtained without any textual context are generally higher, supporting the conclusion that the type ranking task for an entity without context is somehow easier than when we need to consider the text in which the entity is mentioned.

Among the entity-centric approaches, we notice that in most of the cases WIKILINK-OUT works best.



Figure 4.8 – Distribution of NDCG and MAP scores over article categories (top) and average number of types per entity over article categories (bottom).

Recall that this method exploits dbo:wikiLink edges connecting the input entity e to other entities mentioned in e's Wikipedia article. These links thus induce an implicit entity context derived from the text of e's Wikipedia entry. As our results show, such a context does not generalize well to other textual contexts in which e is mentioned. Among the context-aware approaches, the NGRAMS1 method performs best while, as expected, NGRAMS0 performs poorly, obtaining scores which are similar to those of FREQ. Despite being conceptually simple, the hierarchy-based approaches clearly outperform most of the other methods, showing scores similar to the most sophisticated context-aware approach. Even the simple DEPTH approach performs effectively.

To evaluate NGRAMS1 and KB-HIER, we use 5-fold cross validation over 4,334, 6,011, 5,616, and 5,620 data points for the entity-only, sentence, paragraph, and 3-paragraphs datasets, respectively. To evaluate the combination of all the approaches proposed previously in Section 4.5.5, we select 12 features covering the different methodologies we studied (i.e., entity-centric, context-aware, and hierarchy-based), and use them to train regression models that we exploit to rank entity types. To train and test the models we use 10-fold cross validation over 7,884, 11,875, 11,279, and 11,240 data points for the entity-only, sentence, paragraph, and 3-paragraphs datasets, respectively.

We can observe that the best performing method is the one based on decision trees (DEC-TREE), which outperforms all other approaches. KB-HIER, which was initially designed to investigate how hierarchy-based features interplay with features extracted from the knowledge base, actually performs better than our linear regression approach, LIN-REG, which combines scores coming from the other approaches.

Crowd-powered Entity Type Assignment

For some entities the knowledge base may not contain good enough types. For example, some entities only feature the owl:Thing and rdfs:Resource types. In such cases, we ask the crowd to suggest a type for the entity they are judging. While it is not the focus of this chapter to extend existing LOD ontologies with additional schema, we claim that this can be easily done by means of crowdsourcing. Examples of crowd-generated entity types are listed in Table 4.4.


Table 4.3 – Type ranking effectiveness in terms of NDCG and MAP for different textual contexts. Statistically significant improvements (t-test, p < 0.05) of the mixed approaches over the best ranking approach are marked with "*".

Approach       Entity-only        Sentence           Paragraph          3-Paragraphs
               NDCG      MAP      NDCG      MAP      NDCG      MAP      NDCG      MAP
FREQ           0.5997    0.4483   0.5546    0.3898   0.5503    0.3932   0.5163    0.3682
WIKILINK-IN    0.6390    0.5110   0.5658    0.4194   0.5759    0.4361   0.5520    0.4054
WIKILINK-OUT   0.6483    0.5189   0.5795    0.4509   0.5864    0.4579   0.5668    0.4294
SAME-AS        0.6451    0.5088   0.5731    0.4192   0.5823    0.4243   0.5583    0.4046
LABEL          0.6185    0.4777   0.5693    0.4173   0.5555    0.4090   0.5344    0.3875
SAMETYPE       -         -        0.5964    0.4465   0.5938    0.4385   0.5583    0.4083
PATH           -         -        0.5959    0.4654   0.5966    0.4642   0.5609    0.4290
NGRAMS0        -         -        0.5559    0.3865   0.5449    0.3855   0.5144    0.3625
NGRAMS1        -         -        0.6401    0.5255   0.6608    0.5526   0.6413    0.5431
DEPTH          0.7016    0.5994   0.6210    0.5082   0.6279    0.5171   0.6117    0.4984
ANCESTORS      0.7058    0.6041   0.6484    0.5434   0.6510    0.5544   0.6335    0.5322
ANC_DEPTH      0.7138    0.6186   0.6352    0.5269   0.6420    0.5370   0.6211    0.5149
KB-HIER        0.7179    0.6167   0.6954*   0.5966*  0.7050*   0.6233*  0.6759*   0.5885*
DEC-TREE       0.7248    0.6224   0.7282*   0.6535*  0.7297*   0.6618*  0.7003*   0.6279*
LIN-REG        0.6930    0.5847   0.6337    0.5304   0.6488    0.5465   0.6202    0.5100

4.7.4 Scalability

We ran the MapReduce TRank pipeline over a sample of CommonCrawl9 which contains schema.org annotations. When we ran the experiment, CommonCrawl was composed of 177 crawling segments, accounting for 71TB of compressed Web content. We uniformly sampled 1TB of data and kept only the HTML content with schema.org annotations. This resulted in a corpus of 1,310,459 HTML pages, for a total of 23GB (compressed). Table 4.5 shows statistics on the domains composing our sample, and on the schema.org types of the entities it contains. We observed that the distribution of such entity types almost overlaps with the one previously found by Mühleisen and Bizer [134].

Our MapReduce testbed is composed of 8 slave servers, each with 12 cores at 2.33GHz, 32GB of RAM, and 3 SATA disks. The relatively small size of the 3 Lucene inverted indexes (∼600MB) used by the TRank pipeline allowed us to replicate them on each single server (transparently via HDFS). In this way, no server represented a read hot-spot or, even worse, a single point of failure, both mandatory requirements for any architecture which could be considered Web-scale ready. We argue that the good performance of our MapReduce pipeline is mostly due to the

9http://commoncrawl.org/


Table 4.4 – Examples of crowd-generated entity types.

Entity Label          Existing Types                                                            Crowd Suggested Type
David Glassberg       Alumnus, Resource, Uni. Alumni, Northwestern US television journalists    New York City policeman
Fox                   Thing, Eukaryote                                                          Television Network
Bowie                 Musical Artist, Minor league team                                         Minor league sports team
Atlantic              Resource, Populated Place                                                 Ocean
European Commission   Type of profession, Landmark                                              Governmental Organizations
Childress             Thing, Resource                                                           Locality

Table 4.5 – Statistics on the CommonCrawl sample we used.

Domain           % in Corpus      schema.org Type    % in Corpus
.com             39.65            VideoObject        40.79
blogspot.com      9.26            Product            32.66
over-blog.com     0.67            Offer              28.92
rhapsody.com      0.54            Person             20.95
fotolog.com       0.52            BlogPosting        18.97

use of small and pre-computed inverted indexes instead of expensive SPARQL queries.

Processing the corpus on such a testbed took 25 minutes on average, that is, each server runs the whole TRank pipeline on 72 documents per second. Table 4.6 shows a performance breakdown for each component of the pipeline. The value reported in the "Type Ranking" column refers to the implementation of ANCESTORS, but it is comparable for all the other techniques presented in this chapter except for those based on the Learning to Rank approach, which we did not test in the MapReduce setup.

We inspected the output produced by TRank on our sample, paying particular attention to the entity types it computed. We observed that, as can be seen from Table 4.8, entity types about actors are frequently mentioned together. Moreover, we analyzed how the types produced by TRank compare against types introduced in webpages by using schema.org annotations. Table 4.7 shows how TRank types refer to specific entities mentioned in the text of topic-specific pages. For example, yago:InternetCompaniesOfTheUnitedStates entities are contained in pages annotated with instances of schema:Product.


Table 4.6 – Performance breakdown of the MapReduce pipeline.

Text Extraction    NER      Entity Linking    Type Retrieval    Type Ranking
18.9%              35.6%    29.5%             9.8%              6.2%

Table 4.7 – Co-occurrences of schema.org annotations and entity types selected by TRank.

schema.org Type       Top-3 most frequent TRank types
schema:VideoObject    dbo:GivenName, dbo:Settlement, dbo:Company
schema:Product        yago:InternetCompaniesOfTheUnitedStates, yago:PriceComparisonServices, dbo:Settlement
schema:Offer          yago:InternetCompaniesOfTheUnitedStates, yago:PriceComparisonServices, dbo:Company
schema:Person         dbo:GivenName, dbo:Company, yago:FemalePornographicFilmActors
schema:BlogPosting    dbo:GivenName, dbo:Settlement, yago:StatesOfTheUnitedStates

Table 4.8 – Co-occurrences of entity types computed by TRank in the CommonCrawl sample.

Type                             Type                            %
yago:Actor109765278              yago:Actor109765278             0.193
dbo:GivenName                    dbo:GivenName                   0.077
dbo:Settlement                   dbo:Settlement                  0.072
yago:Actor109765278              yago:AmericanStageActors        0.064
dbo:Person                       yago:Actor109765278             0.061
dbo:GivenName                    dbo:Settlement                  0.040
yago:EnglishTelevisionActors     yago:Actor109765278             0.039
dbo:GivenName                    yago:FirstName106337307         0.038
yago:StatesOfTheUnitedStates     yago:StatesOfTheUnitedStates    0.035
yago:AmericanStageActors         yago:AmericanStageActors        0.030



Figure 4.9 – Distribution of NGRAMS1 features, number of leaves, and number of relevant results at different depth levels of the type hierarchy.

4.8 Discussion

In this section, we comment on the performance of the various approaches we evaluated empirically. We focus particularly on the text-based approaches and on KB-HIER since, in our opinion, they cover the most interesting aspects of the proposed methods.

We begin by pointing the reader's attention to the bottom part of Figure 4.9, as it illustrates where the relevant types are located in our type hierarchy. As can be observed, most of the relevant types can be found between the second and the fifth level of our tree, with a remarkable peak at Type Depth = 4. We also notice that the greatest number of leaves can be found at that depth level. This provides a hint as to why DEPTH performs worse than other methods: as many of the relevant types have a depth of four, returning deeper results is typically not optimal.

The upper part of the figure shows how some of the features taken into consideration by NGRAMS1 behave. From the first two charts, we can see that both nSiblings and hSiblings, defined in Section 4.5.3, are able to detect the peak of relevant results at Type Depth = 4. These results suggest that such features are good estimators of the relevance of entity types and, contrary to DEPTH, do not favor exceptionally long branches of the hierarchy.



Figure 4.10 – (left) Relation between hChildren, hSiblings, and relevance: notice the almost linear relation between hChildren and relevance. (right) Relations between the two most predictive features of NGRAMS1 and relevance.

As for the hChildren feature, the experimental results depicted in Figure 4.10 (left) confirm that relevance is related to entropy. As can be seen, nodes whose children's entropy is lower have a lower relevance score. Despite the interesting properties of the metrics just described, the most predictive features of NGRAMS1 are, in order, NGRAMS0 (41%), ratioParent (15%), hSiblings (14%), nSiblings (11%), hChildren (10%).10 Figure 4.10 (right) shows how NGRAMS0, ratioParent, and relevance relate to each other and explains why the NGRAMS0 approach performs poorly: it can easily detect the very few types which have the highest level of relevance but fails in detecting irrelevant types, which are by far more numerous. It is also interesting to notice that, on average, the n-gram probability mass is scattered among very few nodes at Type Depth = 5, ..., 7, where most of the leaves are. This can be due partly to the fact that the text window we used is not large enough to catch differences between deeper types, thus not assigning any probability to types of those levels, and partly to the fact that in our test set there are fewer occurrences of types having depth greater than 5.

As for KB-HIER, surprisingly the most predictive feature selected by the learner is nTypes (38%), followed by popularity (20%), nSiblings (17%), nChildren (14%), and typeDepth (11%).11 We studied the relation between nTypes, nSiblings, and relevance; the results of our analysis are shown in Figures 4.11 and 4.12. From the former figure we observe that, in general, types with fewer siblings are more likely to be relevant, and this tendency becomes more pronounced as the number of types to be ranked increases. The latter figure shows that popular entities have more types, suggesting a connection between nTypes and popularity.

We conclude this section with a final remark about the effectiveness of context-aware and context-unaware methods.

10The reported numbers are computed on the Sentence Collection; a similar trend was observed in the other datasets.
11The reported numbers are computed on the Entity-only collection; a similar trend was observed in the other datasets.



Figure 4.11 – Interplay among nTypes, nSiblings (two of the most predictive features of KB-HIER), and relevance.


Figure 4.12 – Relation between the popularity of entities, computed on Sindice-2011, and their number of types.

We noticed that, in general, preferring deeper types often has a positive effect on the final ranking of the types of a given entity. This is highlighted by the high scores achieved by all hierarchy-based methods. However, there are cases for which such a choice does not pay off and for which context-based methods perform better. For instance, in document P3-0146, "Mali" co-occurs with "Paris", "Greece", and "Europe". The top-3 results selected by ANCESTORS (context-unaware) are "LeastDevelopedCountries", "LandlockedCountries", and "French-speakingCountries", but they are all marked as non-relevant by the crowd since they were deemed too specific. In contrast, the top-3 types selected by PATH (context-aware), namely "PopulatedPlace", "Place", and "Country", are all relevant. In this case, ANCESTORS obtained a low score since it favored the most specific types, while PATH obtained a higher score since it exploited the types of the other entities to favor coarser types. We believe that this justifies the adoption of a mixed approach to rank entity types and partly explains the high effectiveness achieved by the DEC-TREE approach.


4.9 Conclusions

This chapter was the logical continuation of Chapter 3: while in the previous chapter we discussed possible methods to retrieve entities, in this one we discussed what information is best to show to end users, focusing mostly on information about the types of the entities taken into consideration. We argue that, for the same entity, different types should be presented to users depending on the context in which the entity is mentioned. We believe that displaying entity types selected based on such a context can improve user engagement in online platforms as well as improve a number of advanced services such as ad-hoc object retrieval, text summarization or knowledge capture. We studied this problem by defining the TRank task, that is, to rank a set of types associated to an entity in a background knowledge graph given a textual context in which the entity appears.

We proposed different classes of approaches for ranking entity types and evaluated their effectiveness using crowdsourced relevance judgments. We have also evaluated the efficiency of our approaches by using inverted indexes for fast access to entity and type hierarchy information and a Map/Reduce pipeline for entity type ranking over a Web crawl. Our experimental evaluation shows that features extracted from the type hierarchy and from the textual context surrounding the entities perform well for ranking entity types in practice, and that methods based on such features outperform other approaches. A regression model learned over training data that combines the different classes of approaches significantly improves over the individual ranking functions, reaching an NDCG value of 0.73.

We also implemented TRank, a system based on our research whose source code is publicly available.12

As future work, we could evaluate the user impact of the proposed techniques by running a large-scale experiment through the deployment of a browser plug-in for contextual entity type display. An additional dimension we plan to investigate in the future is the creation and display of supporting evidence for the selection of the entity types. For example, a sentence or related entities may be presented to motivate the fact that the selected type is not the most popular type related to the entity (e.g., Tom Cruise is a vegetarian according to Freebase).

12The source code of TRank can be downloaded from https://github.com/eXascaleInfolab/TRank. TRank can also be used by accessing http://trank.exascale.info/


5 Towards Cleaner KGs: Detecting Misused Properties

In the previous chapters of this thesis we discussed how to retrieve a particular entity by using keyword queries such as "CEO Microsoft", and how to select which of the types of the retrieved entity is best to show given a certain context (e.g., American Billionaire vs. Bridge Player). We now focus on the semantics of the properties used to encode information about the entities composing a KG. In DBpedia 2014, for example, the property "gender" is attached to both instances of "Person" and instances of "School". Is it conceptually the same property? Shall we display it to end-users in both cases? In this chapter we argue that such properties are either misused, that is, they do not comply with the specification provided by the schema of the KG, or, if we decide to be more flexible, they are ambiguous and thus their different meanings should be distinguished. In the following we discuss possible methods to detect such properties and to handle them.

5.1 Introduction

Linked Open Data (LOD) is rapidly growing in terms of the number of available datasets, moving from 295 available datasets in 2011 to 1,014 datasets in 2014.1 As we report in Section 5.2, LOD quality has already been analyzed from different angles. Nevertheless, one key LOD quality issue is the fact that, as we point out in Section 5.3, often there is little or no information on the schema of the dataset and, when there is, the data does not always comply with it. Having a schema to which the published data adheres allows for better parsing, automated processing, reasoning, or anomaly detection over the data. It also serves as de facto documentation that helps end-users query LOD datasets and fosters an easier deployment of Linked Data in practice.

To mitigate issues related to the non-conformity of the data to its schema, statistical methods for inducing schemata have been proposed. Völker et al., for example, extract OWL-EL axioms from the data and use statistics to associate them with confidence values [190]. Similar statistics were also used in order to detect inconsistencies in the data [184].

1http://lod-cloud.net


In this work, we focus on one particular issue of LOD schema adherence: the proper definition of domains and ranges of properties used in LOD.2 More precisely, in Section 5.4 we propose a new data-driven technique that amends both schema and instance data in order to assign better domains and ranges to properties. This goal is achieved by detecting cases in which a property is used for different purposes (i.e., with different semantics) and by disambiguating its different uses by dynamically creating new sub-properties extending the original property. Our approach modifies both the schema and the data since it can define new sub-properties that are then used to replace some of the occurrences of the original property. One of the interesting properties of our approach is that the modified data is retro-compatible, that is, a query made over the original version of the data can be posed as is over the amended version.

We evaluate our methods in Section 5.5 by first comparing how much data they can fix when adjusting different parameters, and then by asking Semantic Web experts to judge the quality of the modifications suggested by our approach.

5.2 Related Work

One of the most comprehensive pieces of work describing LOD is the article by Schmachtenberg et al., in which the adoption of best practices for various aspects of the 2014 LOD, from creation to publication, is analyzed [168]. Such practices, ultimately, are meant to preserve the quality of a body of data as large as LOD—a task that is even more daunting considering the inherently distributed nature of LOD.

Data quality is a thoroughly studied area in the context of companies because of its importance in economic terms [149]. Recently, LOD has also undergone similar scrutiny: Knuth and Sack show that the Web of Data is by no means a perfect world of consistent and valid facts [105]. Linked Data has multiple dimensions of shortcomings, ranging from simple syntactical errors over logical inconsistencies to complex semantic errors and wrong facts. For instance, Töpper et al. statistically infer the domain and range of properties in order to detect inconsistencies in DBpedia [184]. Similarly, Paulheim and Bizer propose a data-driven approach that exploits statistical distributions of properties and types for enhancing the quality of incomplete and noisy Linked Data, for adding missing type statements, and for identifying faulty statements [145]. Differently from us, they leverage the number of instances of a certain type appearing in the property's subject and object position in order to infer the type of an entity, while we use data as evidence to detect properties used with different semantics.

There is also a vast literature that introduces statistical schema induction and enrichment (based on association rule mining, etc.) as a means to generate ontologies from RDF data [190, 82, 62, 45]. Such methods can, for example, extract OWL axioms and then use probabilities to come up with confidence scores, thus building what can be considered a "probabilistic ontology" that can emerge from the messiness and dynamism of Linked Data. In

2See Section 2.1.3 for an introduction to properties, sub-properties, domains, and ranges.

this work, we focus on the analysis of property usage with the goal of fixing Linked Data and improving its quality.

5.3 Motivation and Core Ideas

The motivation that led us to the research presented in this chapter is summarized by Tables 5.1–5.4 below. There we report the top-5 DBpedia and Freebase properties which are most frequently associated with instances of wrong domains or ranges. The tables also report the absolute number of violations of the properties, together with their Wrong Domain Rate (WDR) and Wrong Range Rate (WRR). WDR corresponds to the ratio of the number of times the property is used with a wrong domain to its total number of uses, while WRR takes into consideration range violations and is defined analogously.3 The statistics are computed on the English version of DBpedia 2014 and on a dump of Freebase dated March 30th 2014.4

We can observe that the absolute number of occurrences of wrong domains or ranges in Freebase is two orders of magnitude greater than that of DBpedia. This cannot be explained only by considering the different numbers of entities contained in the two KGs since, according to their respective webpages, they contain approximately 47.43 and 4.58 million topics, respectively, meaning that the number of topics covered by Freebase is only one order of magnitude greater than that of DBpedia. We thus deduce that in Freebase the data adheres to the schema less than in DBpedia. This is also suggested by the fact that the top-3 most frequently used properties defined in the DBpedia ontology, namely dbo:birthPlace, dbo:birthYear, and dbo:birthDate, have WDR and WRR smaller than 0.01, while the top-3 most used properties in Freebase, namely fb:type.object.type, fb:type.type.instance, and fb:type.object.key, have an average WDR of 0.2987 and an average WRR of 0.8698. This disparity can in part be explained by the fact that types in Freebase are organized in a forest rather than in a single tree as in DBpedia. Therefore, while one could expect each entity in the dataset to descend from the root type "object", this is not the case when looking at the data. In addition, we noticed that in DBpedia, out of the 1,368 properties actually used in the KG, 1,109 feature a domain declaration and 1,181 feature a range declaration. Conversely, Freebase specifies the domain and the range of 65,019 properties, but only 18,841 properties are used in the data.

We argue that a number of occurrences of wrong domains or ranges are due to the fact that the same property is used in different contexts, thus with different semantics. We call such properties multi-context properties. For instance, dbo:gender, whose domain is not specified in the DBpedia ontology, is a multi-context property since it is used both to indicate the gender of a proper name and the gender of the people accepted in a certain school (that is, if it accepts only boys, girls or both). In this case, we say that dbo:gender appears both in the context of dbo:GivenName and of dbo:School. While this can make sense in spoken

3WDR and WRR are computed by taking into account the KGs' underlying type hierarchies, that is, if a property has "Actor" as range and is used in an RDF triple where the object is an "American Actor", we consider it as correct since "American Actor" is a subtype of "Actor". 4http://wiki.dbpedia.org/Downloads2014, http://freebase.com.


Table 5.1 – Top-5 DBpedia properties by absolute number of domain violations.

DBpedia property                 #Wrong Dom.    WDR
dbo:years                            641,528    1.0000
dbo:currentMember                    260,412    1.0000
dbo:class                            255,280    0.9548
dbo:managerClub                       47,324    0.9999
dbo:address                           36,449    0.8995

Table 5.2 – Top-5 DBpedia properties by absolute number of range violations.

DBpedia property                 #Wrong Rng.    WRR
dbo:starring                         298,713    0.9492
dbo:associatedMusicalArtist           70,307    0.6424
dbo:instrument                        60,385    1.0000
dbo:city                              55,697    0.5452
dbo:hometown                          47,165    0.5210

Table 5.3 – Top-5 Freebase properties by absolute number of domain violations.

Freebase Property                                   #Wrong Dom.    WDR
fb:type.object.type                                  99,119,559    0.6050
fb:type.object.name                                  41,708,548    1.0000
fb:type.object.key                                   35,276,872    0.2910
fb:type.object.permission                             7,816,632    1.0000
fb:dataworld.gardening_hint.last_referenced_by        3,371,713    1.0000

Table 5.4 – Top-5 Freebase properties by absolute number of range violations.

Freebase Property                                   #Wrong Rng.    WRR
fb:type.type.instance                                96,764,915    0.6103
fb:common.topic.topic_equivalent_webpage             53,338,833    0.9992
fb:type.permission.controls                           7,816,632    1.0000
fb:common.document.source_uri                         4,578,671    0.9995
fb:dataworld.gardening_hint.last_referenced_by        3,342,789    0.9914


Table 5.5 – Notation used to describe LERIXT.

Symbol      Meaning
KB          the knowledge base, composed of triples (s, p, o) such that s ∈ E ∪ T, p ∈ P,
            and o ∈ E ∪ L ∪ T, with E the set of all entities, P the set of all properties,
            T the set of all entity types, and L the set of all literals.
e, t        an entity and an entity type, respectively.
p           a property.
t^L         an entity type t on the left side of a property.
t^R         an entity type t on the right side of a property.
Cov(p′)     the coverage of a sub-property p′ of a property p, that is, the rate of
            occurrences of p covered by p′.

language, we believe that the two cases should be kept distinct in a knowledge base. Unfortunately, it is not possible to make a general rule out of this sole example as, for instance, we have that foaf:name, whose domain is not defined in the DBpedia ontology, is attached to 25 out of 33 direct subtypes of owl:Thing, including dbo:Agent (the parent of dbo:Person and dbo:Organization), dbo:Event, and dbo:Place. In this case, it does not make sense to claim that all these occurrences represent different contexts in which the property is used, since the right domain in this case is indeed owl:Thing, as specified by the FOAF Vocabulary Specification.5 Moreover, in this case creating a new property for each subtype would lead to an overcomplicated schema. Finally, the fact that foaf:name is not attached to all the subtypes of owl:Thing suggests that the property is optional.

What follows formalizes the intuition given by this example in terms of statistics computed on the KG and describes an algorithm to identify the use of properties in different contexts.

5.4 Detecting and Correcting Multi-Context Properties

In this section, we describe in detail the algorithm we propose, namely LERIXT (LEft and RIght conteXT). For the sake of presentation, we first describe a simpler version of the method, called LEXT (LEft conteXT), that uses the types of the entities appearing as subjects of the property in order to identify multi-context properties. We then present the full algorithm as an extension of this simpler version. For the description of the algorithm we add to the notation introduced in Chapter 2 the symbols defined in Table 5.5.

5http://xmlns.com/foaf/spec/.


5.4.1 Statistical Tools

LEXT makes use of two main statistics: Pr(t^L | p), that is, the conditional probability of finding an entity of type t as the subject of a triple having p as predicate (i.e., finding t "to the Left" of p), and Pr(p | t^L), that is, the probability of seeing a property p given a triple whose subject is an instance of t. Equation 5.1 formally defines the two probabilities.

$$\Pr(t^L \mid p) = \frac{\left|\{(s,p',o) \in KB \mid s \text{ a } t,\ p = p'\}\right|}{\left|\{(s,p',o) \in KB \mid p = p'\}\right|}, \qquad \Pr(p \mid t^L) = \frac{\left|\{(s,p',o) \in KB \mid s \text{ a } t,\ p = p'\}\right|}{\left|\{(s,p',o) \in KB \mid s \text{ a } t\}\right|} \qquad (5.1)$$

As one can imagine, Pr(t^L | p) = 1 indicates that t is a suitable domain for p; however, t can be very generic: in fact, Pr(⊤^L | p) = 1 for every property p, where ⊤ is the root of the type hierarchy. Conversely, Pr(p | t^L) measures how common a property is among the instances of a certain type: Pr(p | t^L) = 1 suggests that the property is mandatory for t's instances. In addition, whenever we have strong indicators that a property is mandatory for many children t_i of a given type t, that is, Pr(p | t_i^L) is close to 1 for all the t_i, we can deduce that t is a reasonable domain for p and that all the t_i are using p as an inherited, and possibly optional, property. For example, if in DBpedia we consider the property foaf:name and we analyze Pr(p | t_i^L) for all t_i ∈ Ch(owl:Thing), we see that this probability is positive in 25 cases and greater than 0.5 in 18 cases out of 33, suggesting that the t_i do not constitute uses of the property in other contexts but, rather, that the property is used in the more general context identified by owl:Thing.

Computationally, we only need to compute one value for each combination of property p and type t, that is, the number #(p ∧ t^L) of triples having an instance of t as subject and p as predicate. In fact, if we assume that whenever there is a triple stating that (e, a, t) there is also a triple (e, a, t′) for each ancestor t′ of t in the type hierarchy, the statistics needed to compute the probabilities shown in Equation 5.1 can be expressed by means of #(p ∧ t^L) using the following equations.

$$\forall p \in P.\quad \left|\{(s,p',o) \in KB \mid p = p'\}\right| = \#\left(p \wedge \top^L\right) \qquad (5.2)$$
$$\forall t \in T.\quad \left|\{(s,p',o) \in KB \mid s \text{ a } t\}\right| = \sum_{p' \in P} \#\left(p' \wedge t^L\right) \qquad (5.3)$$

The computation of these counts can be done with one map/reduce job similar to the well-known word-count example often used to show how the paradigm works; thus, it can be efficiently performed in a distributed environment, allowing the algorithms we propose to scale to large amounts of data. Another interesting property implied by the type hierarchy of the KG is that if t_1 ∈ Ch(t_0) then Pr(t_1^L | p) ≤ Pr(t_0^L | p). Under the same premises, however, nothing can be said about the relationship between Pr(p | t_0^L) and Pr(p | t_1^L).
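To make these statistics concrete, the following minimal Python sketch computes #(p ∧ t^L) from an in-memory list of (s, p, o) triples and derives the two probabilities of Equation 5.1. The function and variable names, as well as the toy data, are our own illustration (the actual computation runs as the map/reduce job described above), and we assume, as in the text, that each entity is already typed with all ancestors of its types.

from collections import Counter

def left_counts(triples, types_of):
    # Count #(p AND t^L): triples whose subject is an instance of t and whose predicate is p.
    # `types_of[s]` is the set of types of s, assumed to already include all ancestors (up to the root).
    counts = Counter()
    for s, p, _ in triples:
        for t in types_of.get(s, ()):
            counts[(p, t)] += 1
    return counts

def pr_t_given_p(counts, t, p, root):
    # Pr(t^L | p) = #(p AND t^L) / #(p AND root^L), cf. Equations 5.1 and 5.2.
    denom = counts[(p, root)]
    return counts[(p, t)] / denom if denom else 0.0

def pr_p_given_t(counts, p, t):
    # Pr(p | t^L) = #(p AND t^L) / sum over p' of #(p' AND t^L), cf. Equations 5.1 and 5.3.
    denom = sum(c for (_, t2), c in counts.items() if t2 == t)
    return counts[(p, t)] / denom if denom else 0.0

# Toy example (hypothetical data): two entities, both typed up to the root "Thing".
types_of = {"e1": {"Thing", "Person"}, "e2": {"Thing", "School"}}
triples = [("e1", "gender", "male"), ("e2", "gender", "mixed"), ("e1", "name", "Ada")]
c = left_counts(triples, types_of)
print(pr_t_given_p(c, "Person", "gender", root="Thing"))  # 0.5
print(pr_p_given_t(c, "gender", "Person"))                # 0.5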


Algorithm 1 LEXT
Require: 0 ≤ λ ≤ 1 strictness threshold, η ≥ 0 entropy threshold.
Require: curr_root ∈ T the current root, p ∈ P.
Require: acc a list containing all the meanings found so far.
Ensure: acc updated with all the meanings of p.
 1: p_given_t ← { Pr(p | t_c^L) | t_c ∈ Ch(curr_root) }
 2: H ← Entropy(p_given_t)
 3: candidates ← { t_c | t_c ∈ Ch(curr_root) ∧ Pr(t_c^L | p) ≥ λ }
 4: if H ≥ η ∨ candidates = ∅ then
 5:     if Pr(curr_root | p) = 1 then
 6:         acc ← (p, curr_root, 1) : acc
 7:     else
 8:         p′ ← new_property(p, curr_root)
 9:         KB ← KB ∪ { (p′, rdfs:subPropertyOf, p) }
10:         acc ← (p′, curr_root, Pr(curr_root | p)) : acc
11:     end if
12: else
13:     for c ∈ candidates do
14:         LEXT(λ, η, c, acc)
15:     end for
16: end if

5.4.2 LeXt

As previously anticipated, LEXT detects multi-context properties by exploiting the types of the entities found on the left-hand side of the property taken into consideration. Specifically, given a property p, the algorithm performs a depth-first search of the type hierarchy, starting from the root, to find all cases in which there is enough evidence that the property is used in a different context. Practically, at each step a type t—the current root of the sub-tree we are considering—is analyzed, and all the t_i ∈ Ch(t) having Pr(t_i^L | p) greater than a certain threshold λ are considered. If there is no such child, or if we are in a case similar to that of the foaf:name example described previously, a new property p_t, sub-property of p, is created with t as domain; otherwise the method is recursively called on each t_i. Finally, cases analogous to the foaf:name example are detected by using the entropy of the probabilities Pr(p | t_i^L) with t_i ∈ Ch(t), which captures the intuition presented while introducing the above-mentioned statistics. We obtain a probability distribution out of the separate Pr(p | t_i^L) by using Z = Σ_{t_i ∈ Ch(t)} Pr(p | t_i) to normalize them. We then compute the entropy H as shown in Equation 5.4.

$$H\left(p \mid \mathrm{Ch}(t)\right) = -\sum_{t_i \in \mathrm{Ch}(t)} \frac{\Pr(p \mid t_i)}{Z} \cdot \log_2 \frac{\Pr(p \mid t_i)}{Z} \qquad (5.4)$$

For example, if a type t has two children t_1 and t_2 such that Pr(p | t_1^L) = 1 and Pr(p | t_2^L) = 0.8, we have that H(p | Ch(t)) ≈ 0.99 Sh.
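For illustration, the worked example above can be reproduced with a few lines of Python; the entropy function below is a direct transcription of Equation 5.4 added here for clarity, not code from the released implementation.

import math

def entropy(p_given_children):
    # H(p | Ch(t)) as in Equation 5.4: normalize the Pr(p | t_i) values over the
    # children of t and compute the Shannon entropy (in shannons) of the result.
    z = sum(p_given_children)
    probs = [v / z for v in p_given_children if v > 0]
    return -sum(q * math.log2(q) for q in probs)

print(round(entropy([1.0, 0.8]), 2))  # 0.99, matching the example in the text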

[Figure: a fragment of the DBpedia type hierarchy explored by LEXT for dbo:manager, with each type annotated with Pr(t | m): owl:Thing (1.00) branches into SportSeason (0.55) down through SportsTeamSeason to SoccerClubSeason (0.55), and into Agent (0.44) down through Organisation to SportsTeam (0.44); the entropy is H = 0.09 at the root and H = 1.96 among the children of SportsTeam (Soccer, Baseball, Cricket, Rugby, ...).]

Figure 5.1 – Execution of LEXT on dbo:manager.

Algorithm 1 formally describes the whole process. In the pseudo-code, a context of the input property is encoded by a triple (p′, dom(p′), coverage) where p′ is a property identifying the context, dom(p′) is its domain, and coverage ≥ λ is the rate of the occurrences of the main property p covered by the context, denoted by Cov(p′). If the coverage is one, p is used in just one context (cf. Line 5). In Line 8, a new property p′ is created and its domain is set to curr_root while, in Line 9, p′ is declared to be a sub-property of p: this makes the data retro-compatible under the assumption that the clients can resolve sub-properties. Ideally, after the execution of the algorithm, all the triples referring to the identified meanings should be updated. The algorithm can also be used to obtain hints on how to improve the KG.

The execution steps of LEXT on dbo:manager (m, for short) with λ = 0.4 and η = 1 are depicted in Figure 5.1. The entity types are organized according to the DBpedia type hierarchy and each type t is subscripted by Pr(t | m). As can be observed, during the first step the children of owl:Thing are analyzed: the entropy constraint is satisfied and two nodes satisfy the Pr(t | m) constraint. The exploration of the dbo:sportsSeason branch ends when dbo:SoccerClubSeason is reached. The triple (SoccerClubSeason manager, dbo:SoccerClubSeason, 0.55) is returned. The new property is a sub-property of dbo:manager that covers 55% of its occurrences. Finally, the algorithm goes down the other branch of the type hierarchy until the entropy constraint is violated, and returns the context (SportsTeam manager, dbo:SportsTeam, 0.45).
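Algorithm 1 translates almost directly into code. The sketch below is a simplified, in-memory transcription under stated assumptions: pr_t_given_p(t, p) and pr_p_given_t(p, t) compute the statistics of Equation 5.1, children(t) returns Ch(t), and the new property name is simply minted from the current root. It illustrates the control flow of LEXT and is not the released implementation.

import math

def shannon_entropy(values):
    z = sum(values)
    if z == 0:
        return 0.0
    probs = [v / z for v in values if v > 0]
    return -sum(q * math.log2(q) for q in probs)

def lext(p, curr_root, lam, eta, children, pr_t_given_p, pr_p_given_t, acc, new_subproperties):
    # Depth-first search of the type hierarchy for the contexts of property p (cf. Algorithm 1).
    # Appends (sub_property, domain, coverage) triples to `acc` and records the new
    # rdfs:subPropertyOf triples that would be added to the KB in `new_subproperties`.
    kids = children(curr_root)
    h = shannon_entropy([pr_p_given_t(p, t) for t in kids])          # Lines 1-2
    candidates = [t for t in kids if pr_t_given_p(t, p) >= lam]      # Line 3
    if h >= eta or not candidates:                                   # Line 4
        coverage = pr_t_given_p(curr_root, p)
        if coverage == 1.0:                                          # Line 5: single context
            acc.append((p, curr_root, 1.0))
        else:                                                        # Lines 8-10: new sub-property
            p_new = f"{curr_root}_{p}"                               # new_property(p, curr_root)
            new_subproperties.append((p_new, "rdfs:subPropertyOf", p))
            acc.append((p_new, curr_root, coverage))
    else:
        for c in candidates:                                         # Lines 13-15: recurse
            lext(p, c, lam, eta, children, pr_t_given_p, pr_p_given_t, acc, new_subproperties)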

5.4.3 Discussion

The threshold λ sets a condition on the minimum degree of evidence we need in order to state that we have identified a new meaning of the input property p, expressed in terms of the Pr(t | p) probability. This threshold is of key importance in practice. On the one hand, low thresholds require little evidence and thus foster the creation of new properties, possibly over-populating the schema; on the other hand, high thresholds almost never accept a new meaning of p, thus inferring coarser domains. In particular, with λ = 1 the exact domain of p is inferred (which in several cases turns out to be ⊤). In Section 5.5 we show how the algorithm behaves with varying levels of strictness.

The presented algorithm has a number of limitations. In particular, it does not explicitly cover the case in which one type has more than one parent, thus multi-inheriting from several other types. In that case, an entity type can be processed several times, but at most once per parent. We leave to future work the study of whether simply making sure that each node is processed once is enough to cover that case.

5.4.4 ReXt and LeRiXt

It is straightforward to define a variant of LEXT that considers property ranges instead of property domains by using Pr(t^R | p) and Pr(p | t^R). We call this method RIXT. In our implementation we only consider object properties, that is, properties that connect an entity to another entity (rather than, for example, to a literal, since literal values are not entities and thus are not in the type hierarchy).

Generalizing LEXT to identify multi-context properties based on both domains and ranges is a more complicated task. The solution we propose is called LERIXT and consists in using two copies of the type hierarchy, one for the domains and one for the ranges. At each step there is a "current domain" t_l and a "current range" t_r whose children are analyzed (thus the algorithm takes one more parameter than LEXT). Instead of using the condition Pr(t^L | p) ≥ λ to select the candidate types to explore, we use Pr(t_i^L ∧ t_j^R | p) ≥ λ for each t_i ∈ Ch(t_l), t_j ∈ Ch(t_r), and we recursively call LERIXT for each pair of types satisfying the constraint (see Line 14 of Algorithm 1).

5.5 Experiments

We empirically evaluate the three methods described in Section 5.4, namely, LEXT, RIXT, and LERIXT, first by studying how they behave when varying the threshold λ, and then by measuring the precision of the modifications they suggest. The LOD dataset we selected for our evaluation is DBpedia 2014 since its entity types are organized in a well-specified tree, contrary to Freebase, whose type system is a forest. As we anticipated in Section 5.4.4, we consider only object properties when the range is used to identify multi-context properties by using RIXT and LERIXT. The numbers of properties we take into consideration when running LEXT and the other two algorithms are 1,368 and 643, respectively. Finally, during our experimentation we fix the η threshold to 1. This value was chosen based on the analysis of the entropy stopping criterion on a small subset of properties.


[Figure: three panels (LeXt, ReXt, LeRiXt) plotting, for threshold values between 0.0 and 1.0, the average coverage (0.0–1.0) and the number of new properties (0–1,000).]

Figure 5.2 – Average coverage and number of new properties with varying values of the threshold λ.

The impact of λ on the output of the algorithms is studied in terms of average property coverage and number of generated sub-properties. Recall that in Section 5.4.2 we defined the coverage of a sub-property. Here we measure the property coverage, defined as the overall rate of occurrences of a certain property p that is covered by its sub-properties, that is, the sum of

Cov(p′) over all sub-properties p′ generated for p.

In the upper part of Figure 5.2 the average property coverage is shown for various λ. We notice that, as expected, lower values of λ lead to a high coverage since many new properties covering small parts of the data are created. As the value of the threshold increases, fewer and fewer properties are created, reaching the minimum at λ = 1. Interestingly, we observe that the average coverage curve is M-shaped with a local minimum at λ = 0.5. That is a consequence of the fact that with λ ≥ 0.5 the new properties are required to cover at least half of the occurrences of the original property, leaving no space for other contexts; thus, at most one new context can be identified for each property. Finally, at λ = 1 the average coverage drops to 0 since no sub-property can cover all the instances of the original property.

In order to evaluate our methods, 3 authors and 2 external experts evaluated the output of the algorithms computed on a sample of fifty randomly selected DBpedia properties. The algorithms were run by setting λ = 0.1 and η = 1. To decide whether the context separation proposed by the algorithm is correct or not, we built a web application showing to the judges

the clickable URI of the original property together with the types of the entities it appears with. The judges then had to express their opinion on every generated sub-property.

The judgments were aggregated by majority voting and then precision was computed by dividing the number of positive judgments by the number of all judgments. LEXT, RIXT, and LERIXT achieved a precision of 96.5%, 91.4%, and 87.0%, respectively. We note that this result was obtained with just one configuration of the parameters—we leave a deeper evaluation of the algorithm as future work.

In practice, we envision our algorithms to be used as a decision-support tool for LOD curators rather than a fully automatic system to fix LOD datasets.

5.6 Conclusions

In this chapter, we tackled the problem of extracting and then amending domain and range information from LOD. The main idea behind our work stems from the observation that many properties are misused at the instance level, that is, they are used in several, distinct contexts. We call such properties multi-context properties. The three algorithms we proposed, namely, LEXT, RIXT, and LERIXT, exploit statistics about the types of the entities appearing as subject and object of the selected property in order to identify the various cases in which a multi-context property is used. Once a particular context is identified, a new sub-property is derived and used to substitute all the occurrences of the original property related to the identified context. Our methods can also be used to provide insight into the analyzed KG and how it could be improved in subsequent steps of its life cycle. We evaluated our methods by studying their behavior with different parameter settings and by asking Semantic Web experts to evaluate the generated sub-properties.

The algorithms we propose require the entities contained in the dataset to be typed with types organized in a tree-structured type hierarchy. As future work, we could run a deeper evaluation of our techniques, and design a method that overcomes the limitation presented above by considering the case in which the entity types are organized in a Directed Acyclic Graph, thus supporting multiple inheritance.


6 Applications of KGs: Entity-Centric Event Detection on Twitter

Up to now we have analyzed tasks related to KGs only at the academic level; we now show how KGs can be concretely exploited in the real world. In this chapter we present ArmaTweet, a system developed in collaboration with armasuisse Science and Technology, the R&D agency of the Swiss Armed Forces. ArmaTweet reuses some of the concepts described in the previous chapters to detect newsworthy events from Twitter, and allows users to express their information needs by using special semantic queries. For example, users could use the techniques described in Chapter 3 to retrieve the URI of an entity they are interested in, extract its most interesting types by using one of the TRank approaches shown in Chapter 4, and use ArmaTweet to track events involving entities which are instances of those types.

6.1 Introduction

Twitter is a popular microblogging service that in 2016 had an estimated 317 million users producing daily more than 500 million microposts—also known as tweets—that are broadcast to all of the users' followers. Tweets contain at most 140 characters and cover almost any topic, including personal messages and opinions, celebrity gossip, entertainment, news, and more. Current events are widely discussed on Twitter; for example, around 1.7 M tweets were sent on January 7th 2015 in response to the Charlie Hebdo attacks in Paris. Twitter users often provide live updates in critical situations; for example, users tweeted "In Brussels Airport. Been evacuated afer [sic] suspected bomb." and "Stampede now. Everyone running" during the attack at the Bruxelles airport on March 22nd 2016. Most tweets can be read even by unregistered users, so Twitter can potentially provide a real-time source of information for detecting newsworthy events before the conventional broadcast media channels. Thus, the development of techniques for tweet analysis and event detection has attracted considerable attention lately. The Natural Language Processing (NLP) community adapted their techniques to tweets [143, 159, 106], which are short and are often written in a colloquial style with nonstandard acronyms, slang, and typos. These tools were used in many event detection approaches, which we discuss in detail in Section 6.2.


Based on these results, armasuisse Science and Technology—the R&D agency of the Swiss Armed Forces—is developing a Social Media Analysis (SMA) system, whose aim is to help security analysts detect security-related events. Similarly to previous work [23, 120], analysts currently describe the relevant events using keywords, which are evaluated over tweets using standard Information Retrieval (IR) techniques. This, however, has been found to be insufficiently flexible for detecting events with complex descriptions. For example, to detect deaths of politicians, an analyst might try to query the SMA system using the keywords "politician die", but this results in both low precision and low recall: the system misses events such as the death of Edward Brooke (the first African American US senator) since, instead of the word "politician", most tweets reporting this event contain phrases such as "the senator" or "elected to the US Senate"; moreover, the word "die" is very frequent on Twitter and so the query retrieves mostly irrelevant tweets. To reliably detect such events, we must understand the intended meaning of the query, know which people are politicians, and understand which tweets contain a mention of such a person as the subject of the verb "to die". Similarly, to match a query for "militia terror act" to an attack of Boko Haram on a village in Nigeria, we must know that Boko Haram is a militia group and that terror acts include kidnappings and bombings.

This motivated us to develop ArmaTweet—an extension of SMA developed in an extensive collaboration between armasuisse and the Universities of Fribourg and Oxford that supports semantic event detection. Our system uses NLP techniques to extract a simple structured representation of the tweets' meaning. It then constructs an RDF KG with links to DBpedia and WordNet, allowing users to describe relevant event categories by using semantic queries over the knowledge graph. ArmaTweet uses state-of-the-art semantic technologies to evaluate such queries and detect relevant tweets, after which it uses an anomaly detection algorithm to determine whether and how these tweets correspond to actual events. We evaluated our system on the 1% sample of tweets collected by Twitter's streaming API during the first six months of 2015. The system detected a total of 941 events across seven different event categories. We evaluated our results using three different definitions of which tweets should be considered relevant to the query. Depending on the selected relevance metric, our system achieved precision between 46% and 67% across all categories. Most of these events could not be detected by the previous version of the system, which shows that our approach provides key improvements over standard keyword search.

6.2 Related Work

Although analyzing tweets is very challenging, initiatives such as the Named Entity rEcognition and Linking (NEEL) Challenge have spurred the NLP community on to develop a comprehensive set of tools including Part-of-Speech (POS) taggers [143], Named Entity Recognizers [158, 37], and dependency parsers [106]. To understand the syntactic structure of the tweets, our system must identify dependencies between terms (e.g., identify the subject of a given verb, determine grammatical cases, and so on). We do not know of a Twitter-specific system that provides such functionality, so we decided to use the Stanford CoreNLP library [119], mentioned

previously in Chapter 2, which was originally designed to analyze cleaner text.

A recent survey of the methods for event detection on Twitter [76] classifies existing approaches into three groups. The first one contains approaches for detecting unspecified events—that is, events of general interest with no advance description. These approaches typically detect trends in features extracted from tweets and/or cluster tweets based on their topic [24, 115, 200]. Several systems detect breaking news [166, 148, 147], and one additionally classifies events into predefined types such as "Sports", "Death", or "Fashion" [159]. Probabilistic similarity has also been used instead of clustering [210]. Similarly to these approaches, we also identify events by detecting trends, but only after semantic queries have been used to identify the tweets matching the user's interests (cf. Section 6.3).

The second group contains approaches for detecting predetermined events, such as concerts [25], controversial events [150], local festivals [108], earthquakes [163], crime and disaster events [112], and disease progression [171]. Such systems are specifically tailored to an event type, and they usually involve training a classifier on manually annotated tweets to learn the correlation of features that identifies tweets talking about an event. The EMBERS system [157] goes a step further by aggregating many sources of information (Twitter, Web searches, news, blogs, Internet traffic, and so on) to detect and predict instances of civil unrest.

The third group contains approaches for detecting specific events, which typically use IR methods to match a query (i.e., a Boolean combination of keywords) to a database of tweets. Queries are typically provided by users but can also be learned from the context [23]; moreover, query expansion can be used to improve recall [120]. These techniques have been combined with geographical proximity analysis to detect civil unrest [208], and to model events in Twitter streams [83]. ArmaTweet also uses queries provided by users to identify tweets and thus, broadly speaking, falls into this category; however, instead of keyword queries, it uses semantic queries that describe the relationships between entities mentioned in the tweets. The system thus supports queries about specific events (e.g., “Obama meets Trump”) that can be captured using keywords, as well as more complex queries specifying a type of an event (e.g., “somebody hacks a company”) for which a keyword-based approach is not effective. In addition, our system does not rely on a training phase, but requires users to specify their interests precisely by constructing semantic queries. An approach most similar to ours constructs a knowledge graph of events from news articles [161], and the main difference to our work is that it focuses on longer, cleaner texts.

6.3 Motivation & Methodology

Detecting events on Twitter matching complex descriptions (e.g., based on entities’ classes or their relationships) remains a challenge. Consider, for example, the “politician dying” description from the introduction, and the death of Edward Brooke on January 3rd 2015. The event has been widely discussed on Twitter, and running the keyword query “edward brooke” for that day on SMA returns 121 tweets. This, however, is just a small fraction of the

tweets produced on that day, and so this event is unlikely to be detected by the techniques for unspecified events (see Section 6.2); for example, the technique from [159] detects just five completely unrelated events on that day.1 Moreover, there are no obvious keyword queries: "die" returns 5,161 mostly irrelevant tweets in SMA, "politician" returns 46 irrelevant tweets, and "politician die" and "senator die" return no results (note that SMA uses just 1% of all tweets). Thus, although "edward brooke" is an effective query, it is unclear how to construct it from the "politician dying" description. Similarly, it is unclear how to exploit classification-based techniques since common features, such as n-grams or bags of words, are unlikely to reflect the semantic information that Edward Brooke was a politician. Other examples of complex events that we consider in this chapter include "politician visits a country", "militia terror act", or "capital punishment by country".

6.3.1 Approach

Since the objective of armasuisse was to detect events with complex descriptions, we depart from statistical and IR approaches and use semantic search instead. In particular, we use natural language processing to associate each tweet with a set of quads of the form (subject, predicate, object, location), describing who did what to whom and where; any of these components can be empty, which we denote by ×. We also associate with each tweet a set of entities whose role (subject or object) in the tweet could not be determined. Subjects, objects, locations, and entities are matched to DBpedia [109], a knowledge base extracted from Wikipedia and already discussed in Chapter 2, and predicates are matched to verb synsets in WordNet [127], an extensive lexicon. DBpedia and WordNet provide us with a vocabulary and background knowledge for describing complex events. For example, tweets about the death of Edward Brooke are associated with quads such as (dbr:Edward_Brooke, wnr:200359085-v, ×, ×), where wnr:200359085-v identifies the synset "to die" in WordNet, and DBpedia classifies dbr:Edward_Brooke as an instance of the type yago:Politician110451263. Our simple quad model cannot represent semantic relationships such as appositions, adverbs, dependent clauses, modalities, or causality. While such relationships would clearly be useful, our evaluation (see Section 6.7) demonstrates that our model is sufficient for detecting many kinds of complex events that cannot be detected using keywords only.
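For illustration only, the quad model can be captured by a small data structure such as the one below; the class, its field names, and the EMPTY marker are our own stand-ins for the internal representation, which the chapter does not spell out, and the example instance is the Edward Brooke quad mentioned above.

from dataclasses import dataclass
from typing import Optional

EMPTY = None  # stands in for the "x" marker used for empty quad components

@dataclass(frozen=True)
class Quad:
    # A (subject, predicate, object, location) quad: who did what to whom, and where.
    subject: Optional[str]    # DBpedia resource, or EMPTY
    predicate: Optional[str]  # WordNet verb synset, or EMPTY
    object: Optional[str]     # DBpedia resource, or EMPTY
    location: Optional[str]   # DBpedia resource, or EMPTY

# Quad extracted from a tweet about the death of Edward Brooke:
brooke_quad = Quad(subject="dbr:Edward_Brooke", predicate="wnr:200359085-v",
                   object=EMPTY, location=EMPTY)
print(brooke_quad)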

6.3.2 Semantic Event Descriptions

To use ArmaTweet, users must first describe formally the events of interest. To facilitate that, the system provides an intuitive and declarative query interface that allows users to query quads in our knowledge graph while exploiting the background knowledge from DBpedia and WordNet. For example, “politician dying” events can be precisely described by a query that identifies in our KG all quads whose subject is of type yago:Politician110451263, and whose predicate is wnr:200359085-v. As we discuss in Section 6.5, such queries are matched

1http://statuscalendar.com/month/201501/ accessed on 14 December 2016.


[Figure: the ArmaTweet pipeline. An NLP stage (preparing data, extracting locations, correcting passive voice, resolving verbs, resolving entities) turns tweets into quads; a Semantic Analysis stage (creating the knowledge graph, reasoning over event-category queries) turns quads into tweet time series; an Event Detection stage (aggregating tweets by day, anomaly detection) turns time series into semantic events.]

Figure 6.1 – ArmaTweet architecture.

to the knowledge graph in a way that attempts to compensate for the imprecision of natural language analysis. Queries are currently constructed manually, which allows users to precisely describe their information needs by using methods previously described in this thesis. In particular, the methods described in Chapter 3 can be used to look for entities of interest, and the approaches for TRank described in Chapter 4 to select relevant types. In our future work we shall investigate techniques that can automate, or at least provide some visual help for, building semantic queries.

6.3.3 System Output

Given a set of tweets and a set of queries describing complex events, ArmaTweet produces a list of events, each consisting of an event date, an event summary, and a set of relevant tweets. The event summary is specific to the event type; for example, for “politician dying”, it identifies the politician in question, and for “militia terror act”, it identifies the militia group and the verb of the act. Finally, the set of relevant tweets allows the user to validate the system’s output, gain additional information, and possibly initiate an appropriate event response. The system currently does not detect long-running events (e.g., political turmoil or health crises)—that is, each event is associated with a single day only. Thus, the same real-world event can be reported as several events having the same summary but occurring on distinct days. However, longer-running events are often reported as events with the same summary occurring close to each other, and we shall investigate this further in our future work.

6.3.4 System Architecture

Figure 6.1 shows the architecture of ArmaTweet and its three main components. The Natural Language Processing component analyzes the tweets' text and extracts the quads and entities. This component is independent of the complex event descriptions. The Semantic Analysis component converts the output of the NLP step into RDF, which is then analyzed and filtered using the user's event descriptions. The output of this component is a set of tweet time series, each consisting of a summary and a set of tweets. Finally, the Event Detection component

uses an anomaly detection algorithm to extract from each time series zero or more dates that correspond to the actual events. The resulting events and their summaries are finally reported to the user.

To understand the conceptual difference between time series and events, consider the “militia terror act” event query. Our Semantic Analysis component produces one time series with subject “Boko Haram” and predicate “to attack”, which contains all tweets talking about attacks by Boko Haram regardless of the time of the tweets. Next, the Event Detection component groups the tweets by time and detects anomalies (e.g., abrupt changes in the number of tweets per day). Since Boko Haram committed several attacks in our test period, our system extracts and reports several events from this tweet time series.

Our NLP processing is computationally intensive, but it is massively parallel since each tweet can be processed independently; hence, we parallelized it using Apache Spark. Moreover, we used the state of the art RDFox2 system to store and process our knowledge graph [138]. The parts of our system that are independent from the Spark environment (i.e., the core of the NLP component and the queries/rules used for semantic analysis) are available online.3

6.4 Natural Language Processing of Tweets

The NLP component of ArmaTweet extracts from the text of each tweet a set of quads consisting of a subject, predicate, object, and location, as well as a set of entities that could not be assigned to a quad. Subjects, objects, locations, and entities are matched to DBpedia resources, and predicates are matched to verb synsets in WordNet. The system currently handles only tweets in English.

Our system uses the OpenIE annotator from the Stanford CoreNLP library [6] to transform each tweet into text triples consisting of a subject, a predicate, and an object. By calling these objects “text triples”, we stress that their components are pieces of text, rather than DBpedia or WordNet resources. Figure 6.2 shows the text triples produced from two example tweets. The main job of the NLP component is to transform text triples into quads, which is done as follows.

6.4.1 Data Preparation

For each tweet, we first prepare certain data structures. Specifically, we first clean the tweet's text by removing emoticons, uncommon characters, and so on. We next use OpenIE to extract the text triples, which we immediately extend into text quads by specifying the location as unknown. As a side-effect of extracting text triples, OpenIE also annotates the parts of the tweet's text that correspond to named entities (i.e., real-world objects with a proper name) with their corresponding types (location, organization, or person).

2http://www.cs.ox.ac.uk/isg/tools/RDFox/ 3http://github.com/eXascaleInfolab/2016-armatweet


[Figure: two example tweets, "Hawija was bombed by ISIS again!" and "Obama met Trump in the White House", annotated with POS tags, named-entity types, and dependency edges (nsubjpass, auxpass, agent, nsubj, dobj, case, nmod:in, advmod, det). The extracted text triples are ('Hawija', 'was', 'bombed'), ('Hawija', 'was bombed by', 'ISIS'), ('Obama', 'met Trump in', 'White House'), and ('Obama', 'met', 'Trump').]

Figure 6.2 – Data structures produced by OpenIE on two example tweets.

The tool also computes part-of-speech (POS) tags describing the relation a word has with adjacent or related words, and a dependency-based parse tree (or parse tree, for short), which represents the syntactic dependencies between sentence parts using labeled edges between words in the text.

Figure 6.2 shows the output of OpenIE on two example tweets. The tweet text is shown in bold. Named entity types are coded using colors: the locations “Hawija” and “White House” are shown in green, the organization “ISIS” is shown in blue, and the persons “Obama” and “Trump” are shown in yellow. POS tags are shown in italic below the words: “Hawija” is a singular proper noun (NNP), “was” is a verb in past tense (VBD), and “again” is an adverb (RB). Finally, the parse trees are shown as labeled arrows connecting words. The roots of the trees are words without incoming edges, “bombed” and “met” in this case. Moreover, in the rightmost tree, “Obama” is the subject of the verb “met” (denoted by a nsubj dependency), while “Trump” is its direct object (denoted by a dobj dependency). Finally, the text triples are shown at the bottom of the figure.

In addition to OpenIE, the NLP component also uses DBpedia Spotlight [60] to annotate parts of the tweets' text with the relevant DBpedia resources (Entity Linking).4 For instance, on the example shown in Figure 6.2, DBpedia Spotlight annotates "Hawija" with dbr:Hawija, "Obama" with dbr:Barack_Obama, "Trump" with dbr:Donald_Trump, "White House" with dbr:White_House, and "ISIS" with dbr:ISIS.5 We chose Spotlight due to its scalability and ease of use. Spotlight is parameterized by a confidence value, which regulates the precision of the annotations, and a support value, which is used to filter out uncommon entities. We empirically determined 0.5 and 20, respectively, to be appropriate values for our system.
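As a hedged illustration of how such annotations can be obtained, the snippet below calls a DBpedia Spotlight web service with the confidence and support values reported above. The endpoint URL and JSON field names follow the publicly documented Spotlight REST interface and may differ from the self-hosted instance actually used in ArmaTweet.

import requests

def spotlight_annotate(text, confidence=0.5, support=20):
    # Ask DBpedia Spotlight to link entity mentions in `text` to DBpedia resources.
    resp = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",   # public endpoint; adjust for a local install
        params={"text": text, "confidence": confidence, "support": support},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    resources = resp.json().get("Resources", [])
    return [(r["@surfaceForm"], r["@URI"]) for r in resources]

print(spotlight_annotate("Obama met Trump in the White House"))
# e.g. [('Obama', 'http://dbpedia.org/resource/Barack_Obama'), ...]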

6.4.2 Location Extraction

To identify the location in the text quads, we note that words that introduce a grammatical case in a sentence and that are connected to a location often denote a spatial location. Thus, for

4Notice that the task of discovering and linking entities given a piece of text is very different from the AOR task we described previously in Chapter 3. 5We abbreviate the actual resource dbr:Islamic_State_of_Iraq_and_the_Levant.

each text quad where the object is a location (as indicated by the entity recognition), we check whether the parse tree contains a grammatical case dependency between a word occurring in the quad's predicate and a word occurring in its object and, if so, we move the quad's object to its location. As an example, the object of ('Obama', 'met Trump in', 'White House') in Figure 6.2 has been classified as a location, and the parse tree for the tweet contains a grammatical case dependency between the word "House" occurring in the object and the preposition "in" occurring in the predicate, and so we treat "White House" as a location instead of an object.
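The rule just described can be sketched as follows; the TextQuad class and the (head, dependent) encoding of case dependencies are simplified stand-ins introduced here for illustration, not the CoreNLP data structures used by the system.

from dataclasses import dataclass, replace

@dataclass
class TextQuad:
    subject: str
    predicate: str
    object: str
    location: str = ""   # initially unknown

def extract_location(quad, object_is_location, case_edges):
    # If the quad's object was tagged as a location and the parse tree contains a
    # grammatical `case` dependency between an object word (head) and a predicate
    # word (dependent), move the quad's object into its location slot.
    if not object_is_location:
        return quad
    pred_words = set(quad.predicate.split())
    obj_words = set(quad.object.split())
    for head, dep in case_edges:
        if head in obj_words and dep in pred_words:
            return replace(quad, object="", location=quad.object)
    return quad

# Example from Figure 6.2: case("House", "in") links the object to the preposition.
q = TextQuad("Obama", "met Trump in", "White House")
print(extract_location(q, object_is_location=True, case_edges={("House", "in")}))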

6.4.3 Passive Voice Correction

The usage of the passive voice can be problematic. For example, ('Hawija', 'was bombed by', 'ISIS') from Figure 6.2 would produce a quad where "Hawija" is the subject and "ISIS" is the object, which would not correctly reflect the intended meaning of the tweet. We therefore try to correct such situations: for each text quad, we check whether the predicate contains a word that was classified by the POS tagger as a verb and that has (i) an outgoing passive auxiliary modifier dependency (to any other word), (ii) a passive subject dependency to a word occurring in the subject, and (iii) an agent dependency to a word occurring in the object; if so, we swap the subject and the object. In our example, "was" is an auxiliary modifier, "ISIS" is an agent, and "Hawija" is a passive subject, therefore we apply the correction.
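A compact version of this check, again over a simplified (label, head, dependent) view of the parse tree rather than the actual CoreNLP objects, might look as follows.

def correct_passive(subject, obj, predicate_verbs, deps):
    # Swap subject and object when a predicate verb heads a passive construction:
    # (i) an outgoing auxpass edge, (ii) an nsubjpass edge to a subject word, and
    # (iii) an agent edge to an object word. `deps` holds (label, head, dependent) triples.
    for verb in predicate_verbs:
        has_auxpass = any(l == "auxpass" and h == verb for l, h, _ in deps)
        passive_subject = any(l == "nsubjpass" and h == verb and d in subject.split()
                              for l, h, d in deps)
        agent_object = any(l == "agent" and h == verb and d in obj.split()
                           for l, h, d in deps)
        if has_auxpass and passive_subject and agent_object:
            return obj, subject   # swap the two roles
    return subject, obj

# ('Hawija', 'was bombed by', 'ISIS') from Figure 6.2:
deps = {("auxpass", "bombed", "was"), ("nsubjpass", "bombed", "Hawija"), ("agent", "bombed", "ISIS")}
print(correct_passive("Hawija", "ISIS", ["bombed"], deps))   # ('ISIS', 'Hawija')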

6.4.4 Entity Resolution

For each text quad, we next try to match a subject, object, or location component to the annotations produced by Spotlight; if an exact match is found, we replace the component with the DBpedia resource, otherwise we replace the component with ×.

6.4.5 Verb Resolution

Since Spotlight does not handle verbs, we developed our own approach for verb resolution. We identify all verb occurrences in a tweet using POS tags and we next lemmatize each occurrence—that is, we substitute it with the verb’s infinitive form (e.g., ‘met’ becomes ‘to meet’, ‘bombed’ becomes ‘to bomb’, and so on)—and then we search the tweet’s parse tree for any phrasal verb particles connected to the verb’s occurrence. Such a dependency indicates that the verb and the particle form an idiomatic phrase (e.g., “take off” or “sort out”) and should therefore be analyzed together. Consequently, whenever we find such a dependency, we concatenate the verb with the phrasal verb particle. We finally match the (possibly extended) verb occurrence to a WordNet synset; if several candidate synsets exist, we select the one that is most frequent according to WordNet’s statistics. The output of this part of our system is thus similar to that of Spotlight.
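This lemmatize-then-lookup step can be approximated with NLTK's WordNet interface, which orders a lemma's synsets so that the most common sense comes first; the snippet is an illustrative approximation, not the exact tooling used in ArmaTweet.

from nltk.corpus import wordnet as wn          # requires nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

def resolve_verb(token, particle=None):
    # Map a verb occurrence (optionally with a phrasal particle, e.g. 'take' + 'off')
    # to a WordNet verb synset, picking the most frequent sense.
    lemma = WordNetLemmatizer().lemmatize(token.lower(), pos="v")   # 'bombed' -> 'bomb'
    if particle:
        lemma = f"{lemma}_{particle}"                               # WordNet stores 'take_off', etc.
    synsets = wn.synsets(lemma, pos=wn.VERB)
    return synsets[0] if synsets else None                          # first = most frequent sense

syn = resolve_verb("bombed")
if syn is not None:
    print(syn.name(), syn.offset(), syn.definition())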

Finally, we resolve the predicates in the quads to the matched verbs. Unlike entities, which we resolved using exact matches, we substitute the predicate of a quad with a matched verb if


[Figure: a knowledge graph fragment linking the tweet ast:551507074258325504 ("Edward Brooke, first black senator since Reconstruction, dies at 95 #Birmingham http://t.co/SixG7VOUC1", created 2015-01-03T22:35:07) to its quads asq:5705079 and asq:5705080, to the time series _:ts-sp_4344996_1855965 of type aso:TimeSeries-SP, to the WordNet synset wnr:200359085-v (gloss: "pass from physical life and lose all bodily attributes and functions necessary to sustain life"), and to the DBpedia resources dbr:Edward_Brooke (a yago:Politician110451263), dbr:Reconstruction_Era, dbr:Birmingham, and dbr:United_Kingdom (a dbo:Country, linked via aso:tweetCountry).]

Figure 6.3 – A fragment of the RDF KG.

the former contains the latter. This allows us to match "was bombed by" in Figure 6.2 to the synset for "to bomb". Again, we replace predicates that could not be resolved with ×.

6.4.6 Quad Output

For each tweet, we return all quads except those containing only × markers. In addition, for each verb that was resolved in the tweet's text but could not be associated with a quad, we also return a fresh quad where the subject, object, and location are empty. Finally, we return the set of all entities that were detected by Spotlight but could not be matched to a quad.

6.5 Semantic Analysis

The Semantic Analysis component constructs a knowledge graph by leveraging DBpedia, WordNet, and the output of the NLP component, and evaluates complex event descriptions provided by the users. We next discuss the structure of this knowledge graph and the complex event descriptions. We then explain how these are used to identify the tweet time series.

6.5.1 The RDF Knowledge Graph for Event Detection

ArmaTweet uses RDF as the data model for the knowledge graph. This allows us to import DBpedia and WordNet directly by using the state-of-the-art RDFox system (both resources are

available in RDF). We encode tweet information by using a simple schema where each tweet is represented by a URI obtained from the tweet's ID. Each tweet is an instance of the aso:Tweet class, and the data properties aso:createdAt and aso:tweetText specify the time of the tweet's creation and its text, respectively. A tweet can be associated with zero or more quads, each with at most one aso:quadSubject, aso:quadPredicate, aso:quadObject, and aso:quadLocation property value. Finally, a tweet can be associated with zero or more entities whose role in a sentence could not be determined (see Section 6.4).

As an example, Figure 6.3 shows a tweet with URI ast:551507074258325504 that is associated with two quads: one connects dbr:Edward_Brooke from DBpedia with the WordNet synset “to die”, and another connects dbr:Edward_Brooke with dbr:Reconstruction_Era (which is due to an imprecision in the NLP analysis). Finally, the tweet is also directly associated with dbr:Birmingham since the NLP component could not detect the role of this entity in the sentence.

As we explained in Section 6.3, the main objective of the Semantic Analysis component is to identify tweet time series, each consisting of a summary and a set of tweets; such time series are also stored in the knowledge graph. For example, _:ts-sp_4344996_1855965 in Figure 6.3 is a tweet time series containing all tweets about Edward Brooke dying, from which the Event Detection component (cf. Section 6.6) extracts zero or more events. Our system currently does not take into account that a person can die only once, and so it can potentially report multiple "Edward Brooke dies" events. Each time series is classified according to the type of the summary information; currently, this includes subject–predicate (SP), predicate–object (PO), subject–country (SC), predicate–country (PC), and subject–predicate–country (SPC) time series. For example, the time series in Figure 6.3 is determined by a subject and a verb; thus, this time series belongs to the aso:TimeSeries-SP class, and the values of aso:timeSeriesSubject and aso:timeSeriesPredicate determine the time series summary.

6.5.2 Resolving Location in the Knowledge Graph

We observed that the granularity of the event location often varies between tweets; for example, tweets about the Charlie Hebdo attacks refer both to France and to Paris. To simplify event detection, we decided to aggregate event information at the country level. Thus, we extend the knowledge graph by resolving references to locations mentioned in tweets to the corresponding countries. For example, the tweet shown in Figure 6.3 refers to dbr:Birmingham, so, since DBpedia contains the information that Birmingham is a city in the UK, we associate the tweet with dbr:United_Kingdom using the aso:tweetCountry property. Entities in tweet quads are resolved to countries in a similar vein.
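One way to perform such a resolution is to follow country links in DBpedia itself; the query below, sent to the public SPARQL endpoint, illustrates the idea under the assumption that dbo:country connects places to their countries (ArmaTweet works over its own copy of DBpedia, and the exact property path it follows is not detailed here).

import requests

def country_of(place_uri):
    # Look up the dbo:country value of a DBpedia place, if any.
    query = ("SELECT ?country WHERE { <%s> "
             "<http://dbpedia.org/ontology/country> ?country } LIMIT 1" % place_uri)
    resp = requests.get(
        "https://dbpedia.org/sparql",
        params={"query": query, "format": "application/sparql-results+json"},
        timeout=30,
    )
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return bindings[0]["country"]["value"] if bindings else None

print(country_of("http://dbpedia.org/resource/Birmingham"))
# e.g. http://dbpedia.org/resource/United_Kingdom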


6.5.3 Describing Complex Events and Extracting Time Series

Events of interest are described using conjunctive SPARQL queries that select the relevant quads. For example, Queries (6.1) and (6.2) describe the “politician dying” and the “unrest in a country” events, respectively, where aso:UnrestVerb contains all verbs from WordNet that we identified as indicating unrest. The answer variables of each query determine the time series summary.

SELECT ?S wnr:200359085-v WHERE {                      (6.1)
  ?Q aso:quadPredicate wnr:200359085-v .
  ?Q aso:quadSubject ?S .
  ?S rdf:type yago:Politician110451263
}

SELECT ?P ?C {                                         (6.2)
  ?Q aso:quadCountry ?C .
  ?Q aso:quadPredicate ?P .
  ?P rdf:type aso:UnrestVerb
}

Querying quads is important because quads preserve the semantic relationships between their components. For instance, when experimenting by looking for deaths of politicians, we noticed that tweets often mention a politician and the verb “to die”, but not in a desired semantic relationship. Tweet ast:551766588421312512 (not shown in Figure 6.3), for example, says “@BarackObama @pmharper I’m just trying to get some realization, is school supposed to cause you so much stress&anxiety that you want to die?” and it is annotated with dbr:Barack_Obama and wnr:200359085-v, but, as one might expect from the text, there is no quad connecting the two resources. The lack of a semantic relationship, however, does not always indicate that a tweet is irrelevant to the event query. For example, tweet ast:555598764589977600 (not shown in Figure 6.3) says “Edward Brooke, first black US senator elected by popular vote, dies - Reuters”, and it is annotated with dbr:Edward_Brooke and wnr:200359085-v, but, due to the complex sentence structure, the NLP component could not identify the semantic relationship correctly. In fact, our knowledge graph contains 44 tweets with quads matching ‘Edward Brooke dies’, as well as 111 additional tweets without the semantic relationship.

To exploit the knowledge graph as much as possible without losing precision, our system proceeds as follows. It creates a tweet time series for each distinct result of a quad query (or, equivalently, for each distinct time series summary), to which it adds as ‘high confidence’ members all tweets containing a matching quad. Next, for each time series created in this way, the system adds to the time series as ‘low confidence’ members all tweets mentioning the relevant entities/predicates without the semantic relationship. For example, our system creates a time series for each distinct value of ?S produced by Query (6.1); this includes _:ts-sp_4344996_1855965 from Figure 6.3 that contains tweets ast:551507074258325504 and ast:555598764589977600 as ‘high’ and ‘low confidence’ members, respectively. In con- trast, no time series is created for dbr:Barack_Obama since our knowledge graph does not contain a quad matching the above mentioned query where ?S is dbr:Barack_Obama. Intu- itively, the presence of ‘high confidence’ tweets raises the importance of the ‘low confidence’ tweets, which helps compensate for the imprecision of the NLP analysis.

We implemented the Semantic Analysis component using the RDFox system, which supports reasoning over RDF datasets using datalog rules. To facilitate that, we (automatically) turn

each query provided by the user into a datalog rule that creates the tweet time series and identifies the ‘high confidence’ tweets. For example, Query (6.1) is converted into the following datalog rule:

    [?TS, rdf:type, aso:PoliticianDying],
    [?TS, aso:timeSeriesSubject, ?S],
    [?TS, aso:timeSeriesVerb, wnr:200359085-v],
    [?TS, aso:timeSeriesHigh, ?TW] :-                                   (6.3)
        [?TW, aso:tweetQuad, ?Q],
        [?Q, aso:quadSubject, ?S],
        [?S, rdf:type, yago:Politician110451263],
        [?Q, aso:quadPredicate, wnr:200359085-v],
        BIND(SKOLEM("ts-sp", ?S, wnr:200359085-v) AS ?TS) .

This rule uses the datalog syntax of RDFox, which supports calling SPARQL built-in functions in its body. The SKOLEM function is an RDFox-specific extension that creates a blank node uniquely determined by the function’s parameters, thus simulating function symbols from logic programming. Thus, for each value of ?S, Rule (6.3) assigns to ?TS a unique blank node that identifies the time series, and its head atoms then attach to ?TS the relevant information and the “high confidence” tweets. A fixed (i.e., independent from the queries) set of rules then identifies the “low confidence” members of each time series by selecting tweets that mention all entities/predicates from the time series summary, but without the semantic relationship. For example, for subject–predicate time series, these rules select all tweets that mention the subject and the predicate outside a quad.

6.6 Event Detection

The Event Detection component accepts as input the tweet time series produced by the Semantic Analysis component and identifies zero or more associated events. This is done using the Seasonal Hybrid ESD (S-H-ESD) test [188] developed specifically for detecting anomalies in Twitter data. The algorithm is given a real number p between 0 and 1, a set of time points T, and a real-valued function x : T → R that can be seen as a sequence of observations of some value on T, where x(t) is the value observed at time t ∈ T. The algorithm identifies a subset T_a of T of time points at which the value of x is considered to be anomalous, while ensuring that |T_a| ≤ p·|T| holds; thus, p is the maximal proportion of the time points that can be deemed anomalous. Roughly speaking, the S-H-ESD test first determines the periodicity/seasonality of the input data, and then it splits the data into disjoint windows each containing at least two weeks of data; then, for each window, it subtracts from x the seasonal and the median component and applies to the result the Extreme Studentized Deviate (ESD) test—a well-known anomaly detection technique. Twitter is currently using this technique on a daily basis to analyze their server load. ArmaTweet uses the open-source implementation of this test from the R statistical platform.6

To apply the S-H-ESD test, each tweet time series is converted into a sequence of temporal observations by aggregating the tweets by day—that is, the set T corresponds to the set of all days with at least one tweet, and the value of x(t) is the number of (both “high” and “low confidence”) tweets occurring on the day t ∈ T. We then run the S-H-ESD test with p = 0.05—

6 http://github.com/twitter/AnomalyDetection


Table 6.1 – Evaluation results by event category.

                                                            Positive Instances by Relevance
    Event Category                    Type   Total Events   R3           R3+R2        R3–R1
    Aviation accident                 SP        84          44 (52%)     51 (61%)      64 (76%)
    Cyber attack on a company         PO       129          20 (16%)     42 (33%)      57 (44%)
    Capital punishment in a country   PC       153          47 (31%)     67 (44%)      92 (60%)
    Militia terror act                SP       220          92 (42%)    125 (57%)     141 (64%)
    Politician dying                  SP       111          76 (68%)     80 (72%)      85 (77%)
    Politician visits a country       SPC       44          29 (66%)     36 (82%)      44 (100%)
    Unrest in a country               PC       200         125 (63%)    133 (67%)     148 (74%)
    Total:                                     941         433 (46%)    534 (57%)     631 (67%)

that is, at most 5% of the time points can be deemed anomalous. Moreover, we configured the algorithm to detect only positive anomalies (i.e., cases where the number of tweets is above the expected value), which is natural for event detection.
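A small sketch of the aggregation step, assuming each tweet carries a created_at timestamp (a hypothetical field name). The median-based check is only a crude stand-in for S-H-ESD; the actual system passes the daily counts to the AnomalyDetection R package with p = 0.05 and positive-only anomalies.

    from collections import Counter
    from statistics import median

    def daily_counts(tweets):
        # x(t): number of tweets (high- and low-confidence members) per day t.
        counts = Counter(t["created_at"].date() for t in tweets)
        return dict(sorted(counts.items()))

    def positive_anomalies(counts, k=3.0):
        # Flag days whose count exceeds the median by more than k median
        # absolute deviations; only positive deviations are considered.
        values = list(counts.values())
        med = median(values)
        mad = median(abs(v - med) for v in values) or 1.0
        return [day for day, v in counts.items() if (v - med) / mad > k]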

6.7 Evaluation

To evaluate ArmaTweet, we constructed an RDF knowledge graph from 195.7M tweets collected in the first half of 2015 using Twitter’s streaming API, which is about 1% of all tweets from that period. Our main objective was to investigate the precision and the benefits of semantic event detection. We next present our experimental setup and then discuss our findings.

6.7.1 Determining Complex Events

We consulted the Wikipedia page for 2015 to identify interesting concrete events, which provided us with a starting point for a series of workshops in which we identified events and event types of interest to armasuisse customers. We eventually settled on the seven complex event categories shown in Table 6.1. We made sure that our categories cover many different types of event summary (i.e., subject–verb, verb–object, etc.).

6.7.2 Creating Category Queries

For each event category, we constructed a semantic query as follows. We first identified the entities from our example events on Wikipedia (e.g., dbr:Edward_Brooke), which we then looked up in DBpedia to identify their types (e.g., yago:Politician110451263). Next, we queried our knowledge graph for the verbs occurring together with the sample entities in the tweets. We ranked these verbs by the frequency of their occurrence, and then selected those

7http://en.wikipedia.org/wiki/2015

that best fit the event category. Finally, we formulated the category query and tested it on example events. Most queries capture the meaning of the categories directly, apart from the “Aviation accident” query where, to select useful data, we ask for a subject of type “airline” and a verb indicating a crash. Creating all the queries took about four person-days, and optimizing this process will be the main subject of future development of ArmaTweet.
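The verb-selection step described above can be sketched as follows, assuming the quads have already been projected to (subject, predicate) pairs; the function and field names are illustrative rather than ArmaTweet's actual interface.

    from collections import Counter

    def candidate_verbs(quads, sample_entities, top_k=20):
        # Rank WordNet verb synsets by how often they co-occur, inside a quad,
        # with the sample entities taken from Wikipedia's example events.
        counter = Counter(pred for subj, pred in quads if subj in sample_entities)
        return counter.most_common(top_k)

    # The top-ranked verbs are then inspected manually, and only those that
    # best fit the event category are kept for the category query.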

6.7.3 Event Validation

By evaluating the event categories over the knowledge graph and detecting events as discussed in Sections 6.5 and 6.6, we identified a total of 941 events (see Table 6.1), which we validated manually—that is, we determined whether the reported event is a positive instance. This, however, turned out to be surprisingly challenging. First, often we could not verify whether the event really happened, so we decided to just evaluate whether the retrieved tweets correctly talk about the event; we justify this choice by noting that detecting “invented” events could also be very important to security analysts. Moreover, some events happened in the past (e.g., the anniversary of Robert Kennedy’s assassination was widely discussed on Twitter), but we decided to count these as positive instances since they are also likely to be of interest. Finally, in some cases the retrieved events were only partially relevant to the query, and so we assigned each event one of the following relevance scores:

• R3 are positive event instances;

• R2 are positive instances where the entity resolution is incorrect (e.g., dbr:British_Raj vs dbr:India), or the subject–object relationships are inverted (e.g., “ISIS attacked X ” vs “X attacked ISIS”);

• R1 are events with a “fuzzy” relationship to the category (e.g., “ISIS kills X ” or “policeman killed” for the “Unrest in a country” category); and

• R0 are events with no relevance to the event category.

6.7.4 Results

For each event category, Table 6.1 shows the total number of detected events and the numbers of positive instances for different relevance scores. As one can see, precision varies considerably across categories. Visits and deaths of politicians were detected with high precision: entity resolution and type filtering seem very effective at identifying the relevant entities, and our NLP component seems very effective on sentences that report such events. In contrast, detecting cyberattacks is difficult: our query essentially searches for “company hacked”, but the verb “to hack” often means “to cut” or “to manage”, and so the query also selected many irrelevant tweets, such as tweets about stabbings.

A particular problem for ArmaTweet was to correctly differentiate the subject from the object of an action: the approach to passive voice detection we described in Section 6.4 was effective, but should be further improved. Moreover, precision often suffered due to acronyms; for example, “APIs” (i.e., “Application Programming Interfaces”) was resolved to “Associated Press”. Finally, popular entities posed a particular problem. For example, ISIS appears in a great number of tweets, which increases the likelihood of incorrect event recognition; in contrast, Boko Haram is not as well known and seems to be mentioned mainly in tweets reporting terrorist activity. We plan to further investigate ways to “normalize” the tweet time series based on the “popularity” of the entities involved.

6.8 Conclusion

In this chapter we moved from a purely academic setting to a more concrete use-case in which knowledge graphs are used by an industrial partner, armasuisse, in order to detect events from Twitter. ArmaTweet, the system presented in this chapter, represents tweets’ contents in an RDF knowledge graph containing links to DBpedia entities and WordNet synsets, thus allowing users to precisely describe the events of interest. The results of our evaluation show that ArmaTweet can detect events such as “politician dying” and “militia terror act”, which cannot be detected by conventional keyword-based methods.

Despite being more effective than other methods, our system is far from perfect. In particular, we see two main challenges that we could address in future work. First, to help users describe complex events, we could develop dedicated user interfaces, as well as investigate ways to extract semantic queries from sample tweets or from news articles. Second, we could improve the precision of the NLP component, particularly focusing on the detection of passive voice and on the quality of entity resolution in the presence of acronyms. Nevertheless, we believe that ArmaTweet is an interesting example of how technologies based on Knowledge Graphs can be used in a real-world scenario.


7 Continuous Evaluation of Entity Retrieval Systems

In this thesis we used crowdsourcing to obtain the ground truth used to evaluate the methodologies we proposed: in Chapter 3 we extended the relevance judgments created during two editions of the SemSearch initiative, and in Chapter 4 we asked the crowd to decide which entity type better fits the given textual context. We could have used crowdsourcing also to test ArmaTweet, the event detection system described in Chapter 6, and the algorithms studied in Chapter 5, but the size of the data used during their evaluations was not large enough to justify the expense. Given the extensive use we made of this technique, it is worth analyzing its advantages, its drawbacks and, most importantly, whether experimental results obtained by using it are reliable. We are particularly interested in situations such as the one we identified during the evaluation of the AOR methods presented in Chapter 3, in which runs of the new algorithms we proposed were treated unfairly since part of their top retrieved documents were not judged by any human annotator and were assumed to be non-relevant even when this was not the case.

In this chapter we study this phenomenon and propose a new continuous evaluation methodology in which new systems joining an IR evaluation initiative are also in charge of obtaining the missing judgments that would otherwise penalize their effectiveness.

7.1 Introduction

Evaluating the effectiveness of IR systems (IRSs) has been a focus of IR research for decades. Historically, the Cranfield paradigm [56] defined the standard methodology with which IRSs are evaluated by means of reusable test collections.

Over the past 20 years, the Text REtrieval Conference (TREC) has created standard and reusable test collections for different search tasks by refining and improving the original evaluation strategies first proposed by Cleverdon [56] and later refined by Salton [110]. A standard IR evaluation collection is composed of:


1. a fixed document collection;

2. a set of topics (from which keyword queries are created);

3. a set of manually edited relevance judgments defining the relevance of the documents with respect to the topics;

4. ranked results (called “runs”) for all topics and for all participating IRSs.

One of the pressing issues encountered by TREC and by commercial search engines over the years is the rapid growth of the document collections. Very large document collections are essential for assessing the scalability of the systems being evaluated, yet make it impractical to obtain relevance judgments for all the documents in the collection. This led to the idea of pooling [98], that is, collecting the top documents retrieved by the set of IRSs being evaluated in a pool of documents to be judged, obtaining relevance judgments for each of them, and assuming that the rest of the corpus contains documents which are irrelevant to the considered topic (which might not be the case in reality). In the following we refer to “the pool” to denote the pool of documents to judge, and say that an IRS “participates in” or “contributes to” the pool to denote the fact that it is one of the systems contributing documents to the pool of documents to be judged.

Recently, evaluation metrics dealing with incomplete judgments have been proposed [44, 204, 205, 7, 50]. While some of them are computed by only considering judged documents (e.g., bpref [44]), others attempt to estimate the values of well-known metrics by using information on the current set of judged documents (e.g., infAP [204]). In addition, it is worth reporting that several metrics based on weighted precision allow experimenters to quantify how much unjudged documents affect the measurement by aggregating the weight associated with each of them. Such weight represents the contribution that the unjudged documents give to the final measurement and can be reported as part of the evaluation. Unfortunately, to the best of our knowledge, this practice has not gained traction yet. Another issue of pooling is that, while the initial systems participating in the pool are fairly compared, other IRSs evaluated on the same test collection are disadvantaged as it can happen that their top results are actually relevant but are considered irrelevant because they were left unjudged since they were not part of the pool. As the document collection grows, it is more likely that an IRS which did not participate in the pool retrieves documents that were not judged, making it impossible to accurately measure its effectiveness. Such bias has already been highlighted by previous work (e.g., [198, 7]); moreover, recall that we faced the same situation during the evaluation of the methods for Ad-hoc Object Retrieval we presented previously in Chapter 3. In that case, as can be seen from Tables 3.9 and 3.10 (on Page 54), integrating missing relevance judgments drastically changed the ranking of the approaches we were evaluating.

Lately, crowdsourcing has been suggested as a practical means of building large IR evaluation collections using the Web. Instead of employing trained human assessors, micro-tasks are created on online crowdsourcing platforms to collect relevance judgments from what is

generally called the crowd, a crowd of untrained Web users willing to perform simple tasks. One example of this recent trend is the SemSearch initiative which provided us with the datasets we used to evaluate the AOR techniques mentioned above.1 In that context, crowdsourcing techniques were used to produce relevance judgments by granting a small economic reward to anonymous Web users who judged the relevance of semi-structured entities [89]. Similarly, in Chapter 4 we give a detailed presentation on how we used crowdsourcing to create a dataset to evaluate approaches for TRank. Techniques for using crowdsourcing in similar contexts were also studied at the TREC Crowdsourcing track, whose focus was on how to best obtain and aggregate relevance judgments from the crowd [172].

Such a novel approach to obtaining relevance judgments triggers obvious questions about the reliability of the results. Previous work experimentally showed how such an approach is reliable and, most importantly, repeatable [31, 32]. This result opens the door to new methodologies for IR evaluation where the crowd is exploited in a pay-as-you-go manner. That is, as a new search strategy or ranking feature is developed, its evaluation can trigger the update of existing test collections, which i) ensures a fair comparison of the new system against previous baselines and ii) provides the research community with an improved collection featuring more relevance judgments.

In this chapter, we introduce and study a new evaluation methodology that iteratively and continuously updates the evaluation collections by means of crowdsourcing as new IRSs appear and get evaluated. We claim that crowdsourcing relevance judgments is helpful to run continuous evaluation of IRSs since it is infeasible to involve TREC-style annotators each time a new IRS is developed or a new variant of a ranking function needs to be tested and compared to previous approaches. Thanks to continuously available crowd workers, it is possible to create and continuously maintain evaluation collections in an efficient and scalable way.

In the following, we first describe other research related to our methodology (Section 7.2); we then describe how to overcome some of the limitations of standard pooling-based evaluation methodologies by running a continuous evaluation campaign (Section 7.3). In this context particular emphasis will be placed on the limitations of the methodology we propose. Subsequently, several aspects of a continuous evaluation campaign are analyzed. We begin with a series of statistics that help participants assess how external IRSs compare to those which are already part of the campaign (Section 7.4); we then describe existing and novel strategies that can be used to select which documents should be judged by crowd workers (Section 7.5); finally, we discuss how to obtain relevance judgments and we treat the problem of merging new relevance judgments with existing judgments made by different sets of crowd workers, or by professional annotators (Section 7.6). All the methods we discuss are then tested in a simulated continuous evaluation campaign based on test collections made by following the Cranfield paradigm, specifically TREC collections. We also describe the results of a small campaign based on SemSearch 11, a dataset built to evaluate entity search engines like the one we described in Chapter 3 (Section 7.7). We conclude this chapter by further discussing

1http://km.aifb.kit.edu/ws/semsearch10/ and http://km.aifb.kit.edu/ws/semsearch11/

issues regarding the integration of relevance judgments, the economic viability of continuous evaluation campaigns, and by envisioning “more continuous” continuous evaluation campaigns.

7.2 Related Work

IR evaluation is a well-studied research topic. In this section, we briefly review the efforts in this area that are most relevant to the novel continuous evaluation paradigm we propose in this work.

Early research efforts studied the evaluation bias caused by incomplete relevance judgments and thus pointed out the limited reusability of test collections. In this context, Zobel studied the bias introduced by pooling in evaluating IRSs [211]. Although he concluded that available evaluation collections are still viable, he also experimentally illustrated the drawback of pooling by estimating the number of relevant results in the entire collection beyond those actually observed by the assessors. Later work by Büttcher et al. presented more alarming results [47]. The authors analyzed large collections (larger than those used previously by Zobel) and showed that IRSs rankings were unstable when considering or discarding certain IRSs contributing to the pool. Buckley et al. also observed that runs not participating in the pool are unfairly evaluated [43]. Further research focused on how different judgments can modify the outcome of the evaluation. In particular, Voorhees measured the correlation of IRSs rankings using different sets of relevance judgments [192]. Her results show that test collections are reliable since high ranking correlations were observed.

To overcome these issues the research community focused mainly on three directions: pooling strategies, metrics designed to deal with missing judgments, and different methods for obtaining relevance judgments. As we will see in Section 7.3, our approach is orthogonal to all three types of contribution as it suggests a novel evaluation methodology in which existing pooling strategies or metrics can be used. Our interest is in studying how such existing approaches to evaluate IRSs behave if used in the methodology we propose (cf. Section 7.7).

7.2.1 Evaluation Metrics

After analyzing the bias introduced by pooling, some researchers focused on the definition of novel evaluation metrics that could cope with incomplete relevance judgments.

The first metric proposed to mitigate the issue is bpref [44], which is based on the number of documents judged irrelevant and does not consider unjudged documents at all. A few years later Sakai wrote an article titled “Alternatives to bpref” in which he examines how other metrics behave if computed by using only judged documents and concludes that such metrics are actually a better solution to the incompleteness issue than bpref [162].

Two metrics which are currently popular are infAP [204] and xinfAP [205]. The former approximates Average Precision by using a probabilistic generative process, while the latter extends infAP by using stratified sampling to give higher weights to highly ranked results.

Finally, the problem of evaluating new runs with missing judgments was also studied by Webber and Park [198], who proposed to measure the bias of new IRSs based on the unjudged documents they retrieve. Such bias can then be used to adjust the evaluation scores of new IRSs which did not participate in the pool. We instead propose to extend the existing pool with the unjudged documents retrieved by new IRSs.

7.2.2 Pooling Strategies

Several alternatives to fix-depth pooling have been proposed in the literature. Aslam et al., for example, proposed a pooling method based on non-uniform random sampling. Documents to be judged are selected at random from a given run by following a non-uniform distribution defined over the retrieved documents contained in the run. Additionally, strata can be defined to give different priorities to different locations of the run, similar to xinfAP [7, 8].

Another direction was taken by Carterette et al., who propose an iterative process where the pool is constructed by selecting the next document to be judged after each relevance judgment [50]. In this case, the best document to be judged is selected based on its expected probability of being relevant. An approach to create test collections that aims at measuring their future reusability was also proposed by Carterette et al. [51]. Instead, we propose to update existing test collections over time by increasing both the collection quality and its reliability; moreover, we compare different pool construction strategies in such settings.

We conclude by reporting on an alternative to judging the relevance of documents proposed by Pavlu et al. [146]. The authors suggest judging relevant nuggets of information instead, and matching them to retrieved documents in order to automatically generate relevance judgments for documents. While the goal of obtaining scalable and reusable test collections is the same as ours, we propose to keep people judging documents rather than inferring their relevance based on imperfect text matching algorithms.

7.2.3 Crowdsourcing Relevance Judgments

Some recent research efforts in the field of IR evaluation focus on the use of crowdsourcing to replace trained assessors and create relevance judgments. This led to the creation and the study of a number of test collections featuring crowdsourced relevance judgments such as those used in different editions of the SemSearch challenge and of the TREC 2010 Blog track [122]. One relevant piece of work in that context is the study of repeatability of crowdsourced IR evaluation by Blanco et al. [31, 32]. Their findings show that, by repeating the crowdsourced judgments over time, the evaluation measures may vary, though the IRSs ranking is somewhat stable. Other researchers also put effort into studying how crowdsourcing should be done in order to obtain high quality test collections. For example, Kazai et al. studied how the crowdsourcing tasks (HITs) should be designed and suggested quality control techniques in the context of book search evaluations [102, 103]. Other researchers also studied how to design HITs for IR evaluations and observed that the crowd can be as precise as TREC assessors [3, 101]. Significant results proving the quality of crowdsourced relevance judgments were provided by Alonso and Mizzaro. The researchers crowdsourced some relevance judgments from TREC-7 and compared them against the original judgments. They found that crowdsourcing is a valid alternative to trained relevance assessors [5]. Finally, one of the latest efforts related to this topic was done by Maddalena et al. and studies how time constraints affect crowd workers. The authors show that it is possible to reduce the cost associated with the crowdsourced collection of relevance judgments by giving time constraints to the workers without significant losses in the quality of the judgments [116].

Since crowdsourcing might be affected by malicious workers completing tasks randomly to be paid as fast as possible, studies on how crowd workers behave have been conducted. Hosseini et al. [95] proposed a technique that takes into account each worker’s accuracy to weight each answer differently. Moreover, the TREC Crowdsourcing track [172] has studied how to best obtain and consolidate crowd answers to obtain relevance judgments. Scholer et al. analyze the effect of threshold priming, that is, how people’s relevance perception changes when seeing varying degrees of relevant documents [169]. They show that people exposed to only non-relevant documents tend to be more generous when deciding about relevance than people exposed to documents of higher relevance. While the authors did not experiment with anonymous Web users but rather with university employees, we believe their results are applicable to online crowdsourcing platforms; however, addressing this effect is out of the scope of this thesis.

All these results open the door to continuous IR evaluations such as the methodology we present and study in this chapter.

7.3 Continuous IRS Evaluation

In this section we describe the evaluation methodology we propose. We start by highlighting the limitations of current IR evaluation methodologies, by formally introducing our methodology, and by specifying its assumptions. We discuss each of its components in the following sections, in particular: we describe a new set of statistics that can be used to have a preliminary evaluation of a new IRS (Section 7.4); we discuss existing strategies to select the documents to judge, and we propose two novel approaches for this task (Section 7.5); finally, we discuss how to obtain and integrate relevance judgments using different populations of annotators (Section 7.6).

7.3.1 Limitations of Current IR Evaluations

Two problems often surface when applying current evaluation methodologies to large-scale evaluations of IRSs:


1. The difficulty in gathering comprehensive relevance judgments for large corpora.

2. The unfair bias towards systems that are evaluated as part of the original evaluation campaign (i.e., when the collection is created).

Both issues relate to the potential lack of information pertaining to the relevance of documents from the collection since often a significant fraction of their relevant documents are not retrieved by any IRS participating in an evaluation initiative, and are never judged since they do not appear in the pool of documents to be judged. This issue was first observed in smaller test collections [211] and was recently exacerbated by the large size of current corpora such as ClueWeb12, which contains around one billion webpages. While recent efforts have addressed such issues by sampling retrieved documents [7, 8, 50], relevance judgments are still incomplete for large collections. This motivates the need for new evaluation strategies that take into account or try to compensate for this shortcoming.

The second aspect highlights the bias of evaluating an IRS contributing to the pool versus another system being evaluated later on [198]. While the early IRSs will have, by definition of fix-depth pooling, their top retrieved documents judged, the later IRS may have a significant number of top documents unjudged. This occurs whenever one of its retrieved documents was not retrieved by the IRSs that contributed to the pool. Hence pooling penalizes later approaches that could actually be more effective than those proposed earlier but retrieve very different sets of results. This motivates the need for a different evaluation methodology that provides a fairer comparison of IRSs not contributing to the pool of documents to judge.

7.3.2 Organizing a Continuous IR Evaluation Campaign

We now describe how to organize and run a Continuous Evaluation Campaign to address the two limitations described above by creating and using evaluation collections that are not static since their sets of relevance judgments are updated with each new IRS being evaluated.

Formally, a continuous evaluation campaign (or just campaign, for short) is characterized by:

1. a fixed document collection D;

2. a fixed set of topics T ;

3. a set of relevance judgments J , whose size increases over time;

4. a set of runs R participating in the campaign, whose size also increases over time.

We note that, while the first two components are fixed, the third and the fourth ones vary over time. The idea behind this is that a new system can join the campaign at any time but, in order

2http://lemurproject.org/clueweb12/



Figure 7.1 – Traditional evaluation methodology (top) vs Continuous evaluation campaign (bottom).

to do that, its creators have to provide the missing relevance judgments needed to evaluate it, and to make both the evaluated runs of the system and the new relevance judgments available to future participants (thus, increasing the cardinality of J and R). When the i-th system joins the continuous evaluation campaign we say that we are at the i-th step of the campaign.

Figure 7.1 illustrates the difference between a traditional evaluation methodology (top), in which the systems are evaluated all at the same time, and a continuous evaluation campaign (bottom). The diagram shows that, in the first case, relevance judgments for systems A, B, and C are all produced at the same time, while judgments for system D, which did not participate in the evaluation, will never be computed. On the contrary, when a system joins a continuous evaluation campaign, it enriches the ground truth by providing additional relevance judgments. Notice that the set of documents judged after C arrived is the same in both cases; however, in the latter case it is extended when D joins the campaign.

Every step of a continuous evaluation campaign is composed of four stages, namely, document selection, relevance judgments collection, relevance judgments integration, and run evaluation. During the document selection stage, the new documents that have to be judged are chosen and, in the next stage, relevance judgments are obtained by means of crowdsourcing or by using professional annotators. The new judgments are then integrated into those obtained in the previous steps of the campaign during the relevance judgments integration stage and, finally, a run of the current system is evaluated and the scores of all the other runs participating in the campaign are updated (or recomputed) to take into account the new judgments. This is necessary since metrics computed by using different sets of judgments are typically not comparable. In the rest of this chapter we use the notation m|_i to indicate the value of the evaluation metric m computed by using the judgments available at the i-th step of the continuous evaluation campaign. We also use J|_i and R|_i to denote, respectively, the set of judgments and the set of runs of the i-th step of the campaign. It is up to the organizers of the continuous evaluation campaign to select the evaluation metrics used to evaluate the runs and the algorithms used to select the documents to judge and to integrate new judgments.

7.3.3 Assumptions and Limitations of the Methodology

As the reader might have noticed, we made several assumptions when defining our methodology. In the following we specify them and we report on some of the limitations our methodology has.

First of all, a continuous evaluation campaign is limited to comparing information retrieval systems based on relevance judgments made by humans. Consequently, the methodology we propose does not require any user study nor any direct contact with the final users of the system. In particular, in this work we do not consider methods such as A/B testing, in which the behavior of users using different systems is analyzed. The productivity of the users is estimated by standard IR metrics taking into account relevant and irrelevant documents, as is common practice in Information Retrieval research [118]. Such an evaluation does not reflect the satisfaction of the users as well as a user study does, but it allows one to reliably compare several different systems with considerably less effort. The fact that Moshfeghi et al. proved that crowdsourcing can be used to run user studies [133] suggests that it could be feasible to run a continuous evaluation featuring user studies; however, this is out of the scope of our study.

Moreover, our methodology is based on the assumption that re-judging the top documents retrieved by all systems of interest is out of the question, since this would impose a higher “price” for a system to join a continuous evaluation campaign. Furthermore, in order to facilitate the organization of the campaign and to foster participation, we assume that the corpus and the topics composing the dataset used during the evaluation cannot change over time. On the one hand, this allows the participants to submit one run of their method over the fixed dataset instead of either releasing their system to other participants of the campaign or providing and maintaining an endpoint to it; on the other hand, the organizers and the participants of the campaign can easily and rapidly update the scores of all the IRSs without having to obtain new judgments to evaluate the performance of the systems on new topics. This assumption implies that it is not possible to extend the evaluation by including new topics (the older systems cannot be evaluated on them); thus, contrary to what happens in TREC, all the participants know the topics used in the evaluation. Nevertheless, notice that this also applies to all researchers using one of the past TREC datasets in order to evaluate their systems.


Another aspect we do not tackle here is multi-graded relevance, as it is harder to integrate multi-graded relevance judgments made by different judges at different points in time (see Section 7.6 for more details on the integration of relevance judgments). Nevertheless, the results obtained by Blanco et al. [31] suggest that our evaluation methodology can be extended to multi-graded relevance and, consequently, to metrics based on it. The price to pay for this choice is less expressive (and thus less precise) relevance judgments.

Finally, as we describe in Section 7.6.1, we assume that judges can be characterized by their strictness/leniency; we thus do not take into account the case in which the dataset contains idiosyncratic queries or documents that judges cannot understand. Such cases can be handled, for example, by exploiting a push-crowdsourcing methodology, that is, by pushing tasks to people of whom we know (part of) the background [68].

7.4 Continuous IR Evaluation Statistics

As already discussed, people can submit their system to a continuous evaluation campaign at any time and, most importantly, they are allowed to use the data shared by other participants (runs and relevance judgments). This allows potential participants to assess whether or not it is worth joining the campaign by submitting a new system run and, in particular, whether there is any chance to outperform the current best method. Following this idea, we introduce several statistics that can be used to understand the cost/benefit trade-off of joining the continuous evaluation campaign and whether it is worth paying for obtaining new relevance judgments or not. Such statistics can also be used to set the minimum requirements for new runs joining the evaluation.

7.4.1 Measuring the Fairness of the Judgment Pool

Throughout a continuous evaluation campaign, each run has a varying number of judged documents depending on its overlap with other runs. For instance, if two systems participated in a campaign by submitting the runs r_0 = [a, b, c, d, e] and r_1 = [f, g, h, a, b] and we use pooling with a depth of two (the top-2 documents of each run are judged), the set J contains judgments for documents a, b, f, and g; thus, run r_0 has two judged documents, while run r_1 has four of them. A similar situation also occurs in classic block evaluation initiatives such as TREC. Given that unjudged documents are assumed to be irrelevant (or to have a relevance probability or score lower than 1), runs having a higher number of judged (and retrieved) documents are typically advantaged since their evaluation results are closer to those obtained in the ideal setting in which every retrieved document is judged.

To measure how this effect affects the evaluation of an IRS, we introduce a metric called Fairness Score (FS) that is defined similarly to Average Precision (AP) but focuses on the judgments rather than on the relevance of the results. Fairness Score is formally defined by Equation 7.1, where run = [1, ..., n] is a ranked list of retrieved results, JudCov(k) is the proportion of judged documents contained among the first k entries of the run, and J(k) equals 1 if the k-th retrieved document is judged and 0 otherwise.

    FS(run) = ( Σ_{k=1}^{n} JudCov(k) · J(k) ) / n                                   (7.1)

Fairness Score, similarly to AP, puts more weight on the judged documents that appear higher in the ranking. This follows from the intuition that the end user of an IRS is most likely to take into consideration the top results that are displayed, so having top documents unjudged is more critical than having unjudged documents at lower ranks. Moreover, since this intuition is followed by many well-known metrics built to evaluate ranked lists of results (AP, DCG [97], RBP [132], etc.), having high-ranked unjudged documents can make the difference when comparing two IRSs, as information on the relevance of high-ranked documents may contribute more to increasing or decreasing the value of the metrics used during the evaluation.

The Fairness Score of a given run equals 1 if all its retrieved documents are judged and 0 if not a single retrieved document is judged. Notice that in some cases it does not make sense to consider the whole list of retrieved results; for instance, if IRSs are ranked only by considering their Precision at 10, only the top-10 retrieved documents are used to evaluate the systems, so there is no need to consider lower-ranked documents. In such cases we assume that the input run is a ranked list composed of only the documents influencing the value of the considered metrics. Leveraging this definition, we can compute the Fairness Score of each run of a test collection in order to check whether the runs were all treated on a fair basis or not (see Section 7.7.2). Fairness Score can be used to assess the potential of a run before submitting it to a continuous evaluation campaign: if its Fairness Score is significantly lower than those of the other participants, then many of its top-ranked documents are unjudged and the final rank of the run (computed without obtaining new relevance judgments) has a good chance to improve.
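A short sketch of Equation 7.1, with a run represented as a ranked list of document identifiers and the judged documents as a set:

    def fairness_score(run, judged):
        # FS(run): average over all ranks of JudCov(k) * J(k), where JudCov(k)
        # is the fraction of judged documents among the top-k results and
        # J(k) = 1 iff the k-th retrieved document is judged.
        n = len(run)
        if n == 0:
            return 0.0
        total, judged_so_far = 0.0, 0
        for k, doc in enumerate(run, start=1):
            if doc in judged:
                judged_so_far += 1
                total += judged_so_far / k
        return total / n

    # Example from Section 7.4.1: r0 = [a, b, c, d, e], r1 = [f, g, h, a, b],
    # depth-2 pooling, hence the judged set is {a, b, f, g}.
    judged = {"a", "b", "f", "g"}
    print(fairness_score(["a", "b", "c", "d", "e"], judged))  # 0.4
    print(fairness_score(["f", "g", "h", "a", "b"], judged))  # ~0.71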

7.4.2 Optimistic and Pessimistic Effectiveness

Before participating in a continuous evaluation campaign, it is possible to assess the performance of an IRS by exploiting existing relevance judgments to compute optimistic and pessimistic bounds on its future performance.

Some metrics come with methods to bound the effectiveness of the systems being evaluated. With Rank-Biased Precision (RBP) [132], for example, it is possible to compute the “residual” of a measurement to quantify the uncertainty due to missing judgments. By using this residual it is possible to compute an upper and a lower bound on the effectiveness of the IRSs being evaluated. In the following we propose general bounds that give an idea of how the performance of the new system r compares to that of the best system of the campaign identified so far. On the one hand, the optimistic effectiveness of r, denoted by ∆+(r), gives information on the relative difference between the effectiveness of the two systems under the assumption that all documents the new IRS needs to judge are relevant; on the other hand, the pessimistic effectiveness of r, denoted by ∆−(r), is based on the opposite assumption. In both cases a score of 0 means that the new IRS has the same effectiveness as the best system found so far, a score greater than 0 means that the new IRS outperforms the best, and a negative score means that the system underperforms it. It is worth noticing that, since relevant documents are often a very small part of the entire document collection, the actual score of a system is closer to the pessimistic bound rather than to the optimistic one. Nevertheless, optimistic effectiveness can be used to understand whether the new system has any chance to outperform the best system found so far or not. Formally, we define the relative effectiveness of r at step n of the campaign as shown by Equation 7.2, where m denotes the evaluation metric used in the campaign to rank the runs, r denotes the new run willing to join the campaign at the (n+1)-th step, and max_{s ∈ R|_n} m|_n(s) is the score attained by the best system at the n-th step of the campaign.

    ∆_m(r)|_n = ( m|_n(r) − max_{s ∈ R|_n} m|_n(s) ) / max_{s ∈ R|_n} m|_n(s)        (7.2)

Practically, ∆_m is the relative difference between the effectiveness of the new system and that of the best system found so far, measured by using the metric m. Starting from ∆_m we define ∆_m^−(r) and ∆_m^+(r) as the values of ∆_m computed by setting all the documents selected to be judged as non-relevant and relevant, respectively. Notice that for many well-known metrics ∆_m(r)|_n = ∆_m^−(r)|_n because unjudged documents are assumed irrelevant; however, this may not be true if the probability of a document being relevant is used to evaluate runs (e.g., by using infAP).
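A sketch of how the two bounds ∆_m^+(r) and ∆_m^−(r) could be computed for a metric m that scores a run given a judgment dictionary; evaluate is a placeholder for whichever metric the campaign adopts, and for simplicity the best run's score is kept fixed, although in general it may also change once the new judgments are obtained.

    def relative_effectiveness(new_score, best_score):
        # Delta_m (Equation 7.2): relative difference w.r.t. the current best run.
        return (new_score - best_score) / best_score

    def optimistic_pessimistic(run, judgments, to_judge, best_score, evaluate):
        # Upper bound: every document selected for judgment is assumed relevant.
        # Lower bound: every such document is assumed non-relevant.
        optimistic = {**judgments, **{d: 1 for d in to_judge}}
        pessimistic = {**judgments, **{d: 0 for d in to_judge}}
        return (
            relative_effectiveness(evaluate(run, optimistic), best_score),
            relative_effectiveness(evaluate(run, pessimistic), best_score),
        )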

We observe a link between the two statistics we proposed and the use of the RBP residuals: in both cases an unjudged document is assumed to be relevant in order to compute a best case bound of the considered metrics. However, optimistic and pessimistic effectiveness are strictly related to a continuous evaluation setup as their goal is to give hints on how likely an IRS is to be better than the best among the IRSs participating in the campaign.

7.4.3 Opportunistic Number of Relevant Documents

Another measure based on the statistics defined by Equation 7.2 is the opportunistic number of relevant documents, denoted by ρ_m^+(r,t)|_n. This is a per-topic metric defined as the minimum number of new relevant documents needed in order to attain ∆_m(r)|_n ≥ t, where t is a predefined improvement threshold over the current best system, and n and r are as defined previously. ρ_m^+ can be used to assess how much money one has to spend in the best case (that is, all judged documents are relevant) in order to obtain enough judgments to outperform the best system so far by t%. Obviously, it can happen that the current best system is actually more effective than r, no matter how many judgments one obtains. In this case, it is not possible to reach the desired threshold, thus we set ρ_m^+(r)|_n = +∞. We note that, in order to compute ρ_m^+(r)|_n for a specified topic, one needs to take into consideration the documents retrieved by both r and by the current best run b, since it may happen that the same unjudged document retrieved by both runs is ranked higher in r but gives a greater improvement in the effectiveness of b rather than in that of r.
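A greedy sketch of ρ_m^+(r, t) for a single topic, again with evaluate standing in for the metric m; in the spirit of Algorithm 3 (introduced later in this chapter), at every iteration the unjudged document that most favors r over the current best run b is hypothetically marked relevant.

    import math

    def opportunistic_relevant(r, b, judgments, unjudged, evaluate, t, n_max):
        # Minimum number of hypothetical relevant judgments needed for run r
        # to exceed the current best run b by the relative threshold t;
        # +inf if the threshold cannot be reached within n_max judgments.
        hypo = dict(judgments)
        remaining = set(unjudged)
        for j in range(1, n_max + 1):
            if not remaining:
                break

            def gain(d):
                trial = {**hypo, d: 1}
                return (evaluate(r, trial) - evaluate(r, hypo)) - \
                       (evaluate(b, trial) - evaluate(b, hypo))

            best_doc = max(remaining, key=gain)
            remaining.discard(best_doc)
            hypo[best_doc] = 1
            best_score = evaluate(b, hypo)
            if (evaluate(r, hypo) - best_score) / best_score >= t:
                return j
        return math.inf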

7.5 Selecting Documents to Judge

The next stage of a continuous evaluation step is the selection of the additional documents to judge. This selection goes under the name of pooling. In the IR field, different pooling approaches have been proposed and are currently being used. We start this section by describing four well-known pooling strategies; we then propose two novel approaches based on the notions of Fairness Score and of opportunistic number of relevant documents described in Section 7.4.

7.5.1 Existing Pooling Strategies

In this chapter we take into consideration the following well-known pooling strategies:

Fix-depth pooling [98] is a widely used technique, which defines the set of documents to be judged as the set containing the top-n documents retrieved by a run for each topic (a minimal sketch is given after this list). Documents that are already judged are not re-evaluated.

Aslam and Pavlu’s random sampling [7] selects, for each topic, a specified number of random documents from the ranked lists produced by different runs following a probability distribution that gives more weight to high-ranked shared documents. In this method also, documents that are already judged are not re-evaluated.

Carterette et al.’s selective pooling [50] selects one document at a time and collects its judgment. Each time, it picks the document that is most likely to maximize the difference in AP between each pair of systems. This process continues until it is possible to conclude with 95% confidence that the difference between each pair of systems is greater than zero.

Moffat et al.’s adaptive pooling (RBP-pooling, for simplicity) [131] assigns to each document a score based on the contribution it gives to the effectiveness of all the systems being evaluated and on RBP residuals. We experiment with the “Method C” approach described in the original paper.
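The sketch referenced under fix-depth pooling above: collect the top-depth documents of every run, per topic, skipping pairs that are already judged.

    def fix_depth_pool(runs_by_topic, depth, already_judged):
        # runs_by_topic: topic -> list of runs, each run a ranked list of doc ids.
        # Returns the set of (topic, doc) pairs that still need a judgment.
        pool = set()
        for topic, runs in runs_by_topic.items():
            for run in runs:
                for doc in run[:depth]:
                    if (topic, doc) not in already_judged:
                        pool.add((topic, doc))
        return pool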

7.5.2 Novel Pooling Strategies

We propose two novel pooling strategies, namely fair pooling and opportunistic pooling, which are based on the Fairness Score and on the opportunistic number of relevant documents defined in Section 7.4.


Algorithm 2 Judgment pool construction based on Fairness Scores.
Require: prevRuns, judgedDocs, newRun, judgTokens
 1: FS[ ]                 ▷ Fairness Scores for the runs
 2: FStopics[ ][ ]        ▷ FS of each run for each topic
 3: toJudge ← poolingStrategy(newRun, judgTokens)
 4: moreJudgments ← TRUE
 5: while |toJudge| ≤ judgTokens ∧ moreJudgments do
 6:     poolDocs ← judgedDocs ∪ toJudge
 7:     for all run ∈ prevRuns ∪ {newRun} do
 8:         FS[run] ← FS(run, poolDocs)
 9:         FStopics[run] ← FSPerTopic(run, poolDocs)
10:     end for
11:     unfairRun ← argmin(FS)                 ▷ run with the lowest Fairness Score
12:     unfairTopic ← argmin(FStopics[unfairRun])
13:     fairestDoc ← topUnjudged(unfairRun, unfairTopic, poolDocs)
14:     toJudge ← toJudge ∪ {fairestDoc}
15:     moreJudgments ← (fairestDoc ≠ ∅)
16: end while
17: return toJudge

Fair Pool Construction

Algorithm 2 shows how to construct a judgment pool by maintaining the Fairness Scores as similar as possible across the participating runs. Our algorithm takes as input the list of the runs that participated in the previous steps of the continuous evaluation (prevRuns), a set of already judged documents (judgedDocs), the new run joining the continuous evaluation (newRun), and a predetermined number of judgment tokens representing the overall number of documents that can be judged at this step of the continuous evaluation (judgTokens).3 First, the judgment tokens are distributed among ranked documents according to the classic pooling strategy in use. We note that the chosen strategy may assign a token to an already judged document (this happens quite often with fix-depth pooling) or it may be designed not to spend all the available tokens. In both cases, the unassigned tokens are fairly distributed among all runs: to assign each remaining token, the run with the lowest Fairness Score is selected and, among all its ranked lists of documents (one for each topic), the one with the lowest Fairness Score is taken into consideration and its top-ranked unjudged result is added to the pool of documents to judge.

This algorithm has two desirable properties: i) it ensures that all runs participating in the pool contribute the same number of judgments (i.e., judgTokens judgments for each topic in Algorithm 2) and ii) it systematically attempts to improve the Fairness Score of the run that was treated most unfairly so far (i.e., the run that has the lowest number of judged high-ranked documents). We discuss those two points in more detail and run experiments showing how

3For example, given a pre-defined budget of new relevance judgments that can be obtained or crowdsourced.


Algorithm 3 Opportunistically select the best document to judge.
Require: b, r: ranked lists of documents for a given topic
 1: maxImp ← 0
 2: bestDoc ← None
 3: for all unjudged documents d in r do
 4:     r_imp ← m_{rel(d)=1}(r) − m(r)      ▷ gain for r if d were judged relevant
 5:     b_imp ← m_{rel(d)=1}(b) − m(b)      ▷ gain for b if d were judged relevant
 6:     imp ← r_imp − b_imp
 7:     if imp > maxImp then
 8:         bestDoc ← d
 9:         maxImp ← imp
10:     end if
11: end for
12: return bestDoc

our fair judgment pool construction compares to previous methods in Section 7.7.3.

Opportunistic Pooling

Opportunistic pooling is an application of the ρ_m^+ metric previously defined in Section 7.4 and is designed to work in the context of a continuous evaluation campaign. In order to use this pooling strategy, one needs to set two parameters: the overall number n of documents that can be judged (i.e., the judging budget), and an improvement threshold t. At the i-th step of the campaign, j = min(n, ρ_m^+(r,t)|_{i−1}) judgments per topic are made, where r is the new run joining the campaign. Hence, we take the least number of documents (lower than or equal to n) that allows the new system to perform better than the best system identified so far. Practically, the documents to be judged relative to a particular topic are chosen by running Algorithm 3 j times to select the j documents that maximize the gap between the effectiveness of r and that of the current best run, measured by using m|_{i−1}. Each such document is selected by scanning all of r to find the document that most improves r while increasing the effectiveness of the best system identified so far the least. Notice that the number n of documents to judge can be obtained from the threshold t, and vice versa. On the one hand, we decide to impose a limit, n, on the maximum number of documents to judge in order to avoid situations in which too many documents need to be judged; on the other hand, the number of documents selected for judgment can be lower than n if the threshold t is reached with fewer documents. In this case, relevance judgments (and thus budget) are saved by stopping the document selection process when a satisfying optimistic improvement over the currently best system is reached. Notice that n might not be large enough to achieve the specified improvement, or the new system might actually not be able to outperform the best system found so far. As this can be computed before obtaining the relevance judgments (and, in particular, before participating in the campaign), it is up to the aspiring participant to decide whether it is worth joining the campaign anyway by obtaining n relevance judgments per topic or not. An alternative could consist in generalizing opportunistic pooling to sequentially compare the new system against more than one run (e.g., against the top-k best runs). The algorithm should thus be run once for each comparison. For example, the new system may not outperform the best system so far within n judgments, but it could outperform the second best system, and so on. Finally, opportunistic pooling differs from the approach by Carterette et al. [51] since it does not require judgments to be made one after the other and it is generalizable to any metric. We study the behavior of opportunistic pooling in a simulated continuous evaluation campaign in Section 7.7.3.

7.6 Obtaining and Integrating Judgments

How should new relevance judgments be obtained? Ideally, the same group of human assessors who originally created the topics composing T should judge subsequently retrieved documents to extend the set of relevance judgments J. However, as pointed out by Mizzaro, even if this were possible, the resulting judgments might still be biased since they would be made at a different point in time [130]. In practice, in order to extend J by adding new relevance judgments, one could either use professional assessors or crowdsourcing. In both cases, there is a need to integrate the new judgments into the set of previously available ones.

7.6.1 Dealing with Assessment Diversity

As previously described in the literature, different expert annotators may provide different relevance judgments for the same topic/document pair, as relevance is a somewhat subjective notion [130]; in addition, the same annotator can provide different judgments at different points in time. These problems are exacerbated when replacing such experts with a crowd of Web users. While Blanco et al. showed that IR evaluation performed by means of crowdsourcing is reliable and repeatable, that is, IRSs rankings are stable according to Kendall’s correlation, they also observed that absolute values of effectiveness measures (e.g., Average Precision) can vary as judgments are made by different “crowds” [31]. Therefore, while it is sound to evaluate distinct sets of IRSs using crowdsourced or professional judgments, merging judgments coming from heterogeneous crowds/judges queried at different points in time might generate unstable rankings over the course of a continuous evaluation campaign. This implies that when extending a set of relevance judgments one has to consider assessment diversity, too. The approach we propose below consists in selecting a judgment baseline, i.e., a set of topic/document pairs that must be judged by all annotators involved in the creation and extension of the collection. Thanks to the judgments made over this set, it is possible to assess the strictness or the tolerance of individual judges as compared to other annotators.

More specifically, after evaluating the first run in the campaign we select a set of topic/document pairs to create a Common Judgment Set (CJS) that will be judged by every other annotator. We note that bigger CJSs yield higher costs to balance the crowd assessment diversity, with no addition to the actual set of relevance judgments. However, the larger the CJS is, the better the adjustments we can potentially make. In the context of crowdsourced relevance judgments, for example on the Amazon Mechanical Turk (AMT) platform, it is possible to implement

the CJS feature as a Qualification Test each worker has to complete in order to get access to relevance judgment HITs. An alternative is to implement the common judgments as additional tasks in each HIT in order to better integrate the CJS with the rest of the relevance judgment tasks.

Once the integration stage has finished, it is possible to use the new set of judgments to evaluate all the runs participating in the campaign.

7.7 Experimental Evaluation

Here we study how the concepts we have introduced so far apply to a continuous evaluation campaign. We do this by simulating continuous evaluation campaigns based on data from different editions of TREC. In particular, after having described our setup, we show how Average Precision varies during a campaign, we analyze the accuracy of the IRSs rankings it generates, and we study how fairly each system is treated.

The raw data we used for our experimental evaluation, our raw crowdsourcing results, and a set of software tools that can be used in a continuous evaluation campaign can be found at https://git.io/vD8Sw.

7.7.1 Experimental Setting

We use three standard collections in order to study our methodology: the testset created in the context of the SemSearch 2011 challenge (SemSearch11)4 for Ad-hoc Object Retrieval, which we also used in the research presented in Chapter 3, and the collections created in the context of the Ad-hoc task at TREC-7 (TREC7) [191] and TREC-8 (TREC8) [194].

The SemSearch11 collection is based on the Billion Triple Challenge 2009 dataset, which consists of 1.3 billion RDF triples crawled from the Web. In contrast, TREC7 and TREC8 are based on a dataset of 528,155 documents from the Financial Times, the Federal Register, the Foreign Broadcast Information Service, and the LA Times. All collections come with 50 topics together with relevance judgments. However, SemSearch11 differs from the two TREC collections in the number of documents the systems could retrieve for each topic, that is, the length of the runs they submitted, which is 100 and 1,000 documents, respectively. The number of submitted runs is also different: 10 runs are available in SemSearch11, 103 in TREC7, and 129 in TREC8. Nevertheless, the most important difference characterizing the collections is the way the relevance judgments are computed: in all cases fix-depth pooling is used to select the documents to judge, but in SemSearch11 the top 10 retrieved documents of each run were judged by means of crowdsourcing, while in TREC7 and in TREC8 the top 100 retrieved documents of each run were evaluated by expert NIST annotators. Using crowdsourcing to obtain relevance judgments can lead to situations in which relevance judgments for a topic

4http://km.aifb.kit.edu/ws/semsearch11/

are made by different assessors; this contrasts with the approach used by TREC, in which the same annotator who created a topic also makes the relevance judgments, and no assumption about the generalizability of the relevance assessments is made.

One of our goals is to compare the results obtained through a continuously updated evaluation collection against the optimal case of a collection having complete relevance judgments. For this reason, we created the fully-judged TREC7 sub-collection (JTREC7) as the ideal collection whose documents are composed of all judged documents of TREC7, and whose runs are computed from the original runs by removing all unjudged entries and by ranking the remaining documents according to their original order. The new JTREC7 collection, in which every retrieved document of every run is judged, is composed of 80,345 documents taken from TREC7, and 103 runs containing an average of 411 retrieved documents per topic. Analogously, starting from TREC8 we define JTREC8, another fully-judged collection composed of 86,830 documents and 129 runs containing an average of 431 retrieved documents per topic. Additionally, we define a variant of JTREC7 named JTREC7BPT, which includes only the best run per team, resulting in 41 runs.
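The construction of such a fully-judged sub-collection can be sketched as follows; the snippet (ours, not the thesis tooling) assumes standard TREC-formatted run and qrels files and simply drops unjudged documents while preserving the original ranking order.

from collections import defaultdict

def load_judged_documents(qrels_path):
    # topic -> set of judged document ids (standard qrels layout assumed).
    judged = defaultdict(set)
    with open(qrels_path) as f:
        for line in f:
            topic, _, doc, _ = line.split()
            judged[topic].add(doc)
    return judged

def fully_judged_run(run_path, judged):
    # topic -> ranked list of judged documents, in their original order.
    filtered = defaultdict(list)
    with open(run_path) as f:
        for line in f:
            topic, _, doc, rank, score, tag = line.split()
            if doc in judged[topic]:
                filtered[topic].append((int(rank), doc))
    return {t: [d for _, d in sorted(docs)] for t, docs in filtered.items()}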

7.7.2 Continuous Evaluation Statistics

Evaluation Metrics

Since the number of available relevance judgments increases at each step of a continuous evaluation campaign, the values of the evaluation metrics used to rank IRSs become increasingly accurate. For example, Figure 7.2 (top) shows the value of AP as a function of the number of judgments j for five different runs of JTREC7. We notice that, in general, the bigger the value of j, the lower the resulting AP (since its denominator, i.e., the number of relevant documents, steadily increases). As expected, these variations of AP lead to a variation in the accuracy of the ranking of the systems. Figure 7.2 (bottom) shows, for each step s of a continuous evaluation campaign, the evolution of the accuracy of the ranking computed at step s using AP. The accuracy is calculated by means of Kendall's τ correlation between the ranking produced by computing the AP of each IRS with the relevance judgments available at step s, and the ranking obtained by computing the AP of the same systems with all the relevance judgments. While the variability of the rankings is high in the first few steps, the rankings become relatively stable after 20 steps. As one can imagine, when passing from two to three systems it is more likely for the statistics to change than when passing from twenty to twenty-one systems, hence the increasing values of Kendall's τ. For example, the two rankings [a,b,c] and [a,c,b], which differ by only one swap, have a Kendall's τ correlation of 0.33, while two rankings of 20 IRSs which differ by only one swap have a correlation of 0.99. Notice, however, that Figure 7.2 (bottom) does not reflect the fact that a system can be unfairly ranked lower before obtaining new judgments than after having obtained them, as this is not captured by the statistics we used.
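The two correlation values quoted above can be verified with a few lines of Python using scipy's kendalltau; the helper below (ours) compares two orderings of the same set of systems.

from scipy.stats import kendalltau

def tau_between(order_a, order_b):
    # Compare two rankings of the same systems via the positions they assign.
    pos_b = {system: i for i, system in enumerate(order_b)}
    tau, _ = kendalltau(list(range(len(order_a))), [pos_b[s] for s in order_a])
    return tau

print(tau_between(["a", "b", "c"], ["a", "c", "b"]))    # ~0.33
big = [f"s{i}" for i in range(20)]
print(tau_between(big, big[:18] + [big[19], big[18]]))  # ~0.99 (one swap)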


[Figure 7.2 plots: (top) AP vs. percentage of judgments for five IRSs (Irs 1-5); (bottom) Kendall's τ vs. continuous evaluation step for fix-depth pooling with depths 10, 50, and 100.]

Figure 7.2 – Evolution of AP values (top) and Kendall’s correlation (bottom) during a simulation of a continuous evaluation campaign based on JTREC7. Runs join the campaign according to the lexicographic order of their names.


[Figure 7.3 plots: average Fairness Score among all topics vs. continuous evaluation step, one panel per pooling strategy.]

Figure 7.3 – Fairness Scores evolution for each run participating in a continuous evaluation using sampling-based pooling (left) and fix-depth pooling with Fairness (right) on a sample of 50 randomly selected permutations of JTREC7BPT runs. Each permutation represents a possible order in which the systems join the campaign. At each step 10 new relevance judgments are obtained for each topic.

Table 7.1 – Difference between maximum and minimum Fairness Scores after a continuous evaluation with and without Algorithm 2. In all cases 50 judgment tokens per topic were made.

Pooling approach               Max - Min Fairness Score
Sampling-based                 0.95
Opportunistic t = 0.25         0.86
Fix-depth                      0.76
Sampling-based w/ Fairness     0.23
Fix-depth w/ Fairness          0


Fairness

Figure 7.3 shows how the variance of the Fairness Scores increases as we perform more steps of a continuous evaluation campaign for both sampling-based pooling and fix-depth pooling with Fairness. As expected, the fairness-aware algorithm treats the runs in a fairer way, i.e., we observe higher FS values as we progress in the campaign. In addition, Table 7.1 shows the final difference between the maximum and minimum values of the Fairness Score among all the runs participating in the continuous evaluation campaign. As we can see, when we follow Algorithm 2, the final difference is reduced. The minimum is attained by using the fair variant of the fix-depth pooling strategy.

7.7.3 Pooling Strategies in a Continuous Evaluation Setting

To study how the pooling strategies described in Section 7.5 behave during a continuous evaluation campaign, we make use of the fully-judged collections described previously. We use such datasets since all documents retrieved by their runs are judged; thus, we can assume that the ranking computed by using such judgments, the real ranking of the systems, correctly reflects the real effectiveness of the IRSs taken into consideration. In our experiments we simulate a continuous evaluation campaign by associating a step to each submitted run, that is, if the dataset contains 40 runs, the simulated campaign is composed of 40 steps. We are interested in comparing the real ranking against the rankings produced by using the partial sets of judgments obtained through specific pooling strategies at each step of the simulated campaign. In this setting, a high correlation indicates that the selected pooling strategy succeeded in selecting the documents to judge in order to rank the runs correctly. In all cases we used AP in order to compute the rankings, except for RBP-pooling and Aslam-related pooling methods, in which RBP and a variant of AP [7] are used, respectively. We also report on the ranking produced by computing infAP [204] with the judgments available before the relevance judgment collection phase of the current step of the campaign. The ranking correlation is measured by using the Kendall's τ correlation between the real ranking and the ranking produced using only the judgments available at each step of the campaign, similarly to what was done in Section 7.7.2. A few more precise details: we set p = 0.8 both for the computation of RBP and for computing the documents to be judged by using RBP-pooling; we compare four values of the threshold t regulating opportunistic pooling: 0.25, 0.3, 0.5, and 0.75. A clear limitation of this experiment is that we only simulate a continuous evaluation campaign, that is, we simulate new relevance judgments by using the original assessments made by the TREC assessors rather than judgments elicited from people at different points in time.
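A compact sketch of this simulation protocol is given below; it is our own illustration rather than the released tooling, and both pooling_strategy and evaluate are placeholder callables standing in for the pooling algorithms and metrics discussed in the text.

from scipy.stats import kendalltau

def rank(scored):
    # Order run names by decreasing effectiveness score.
    return [name for name, _ in sorted(scored, key=lambda x: -x[1])]

def simulate_campaign(runs, full_qrels, pooling_strategy, evaluate):
    """Simulate one evaluation step per run joining the campaign.

    `runs` is an ordered list of (name, run) pairs; `pooling_strategy(run,
    judged)` returns the (topic, doc) pairs to judge for the new run;
    `evaluate(run, qrels)` returns a single effectiveness score (e.g., AP).
    """
    judged = {}  # partial relevance judgments accumulated so far
    real = rank([(name, evaluate(run, full_qrels)) for name, run in runs])
    correlations = []
    for step, (name, run) in enumerate(runs, start=1):
        for key in pooling_strategy(run, judged):
            judged[key] = full_qrels[key]       # "obtain" the new judgments
        if step < 2:
            continue                            # tau is undefined for one run
        partial = rank([(n, evaluate(r, judged)) for n, r in runs[:step]])
        tau, _ = kendalltau([real.index(n) for n in partial],
                            list(range(len(partial))))
        correlations.append(tau)
    return correlations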

Figure 7.4 summarizes our evaluation. We observe with little surprise that the most effective approaches in terms of correlation to the real ranking (top plot), namely RBP-pooling, Fix-depth w/Fairness and Sampling w/Fairness, are also those that make more relevance judgments per step, and thus more judgments in total. The steep drops we can notice in the bottom part of the figure are due to the fact that all documents of the dataset (JTREC7BPT) were


[Figure 7.4 plots: (top) Kendall's τ per step for infAP, sampling-based pooling, fix-depth pooling, fix-depth w/Fairness, Aslam w/Fairness, RBP, and opportunistic pooling with t = 0.25, 0.3, 0.5, and 0.75; (bottom) number of new judgments per step.]

Figure 7.4 – Effectiveness evaluation of pooling algorithms on 20 random samples of JTREC7BPT runs with 50 new judgments per topic for each new system participating in the campaign. Ranking correlation with the real ranking (top) and number of judgments per step (bottom). In the top plot three outliers were removed in order to increase readability.



Figure 7.5 – Total number of judged documents per pooling algorithm.

judged. An interesting fact that caught our attention is that, although RBP-pooling requires a number of judgments comparable to that of Sampling w/Fairness and Fix-depth w/Fairness, the correlation between the ranking produced by RBP and the real ranking is always perfect. This is a consequence of the geometric weighting used to compute the metric, which gives exponentially more weight to documents retrieved in top positions. We provide more detail on this in the next subsection.

Another interesting observation is that fix-depth pooling and opportunistic pooling with t = 0.25 require a very similar number of judgments (30,538 and 30,568, respectively), but the former correlates better with the real ranking than the latter. We believe that such a difference is due to the fact that fix-depth pooling selects top documents to be judged, and those documents influence AP more. Opportunistic pooling, on the other hand, tends to avoid selecting those documents if they are shared with other runs (we expect good runs to have many relevant documents shared in top positions).

Finally, we also observed that using infAP to estimate AP before performing the judgments selected at a given step gives a very good indication of what, according to AP, the IRSs ranking will be after having performed the judgments.

In the charts discussed previously we do not report on the results obtained by Carterette et al.'s pooling strategy since its correlation to the ideal ranking is significantly lower than that of all other methods, attaining τ = 0.25 on JTREC7BPT. We believe that this is due to the very low number of relevance judgments it requires.


[Figure 7.6 plot: Kendall's τ per continuous evaluation step for RBP-pooling with p = 0.8 and p = 0.95, using 10 and 50 judgments per topic at each step.]

Figure 7.6 – Evolution of Kendall’s correlation during a continuous evaluation campaign on JTREC7BPT. The chart highlights the effect of using different values of the p parameter of RBP and different numbers of judgments per topic at each step.

Analysis of the Sensitivity of Rank-Biased Precision

As we mentioned previously, RBP and RBP-pooling obtain a perfect correlation to their corresponding real ranking, that is, the ranking obtained by computing RBP using all the judgments included in JTREC7BPT. We argue that this is due to the weighting scheme used by the metric. For example, in our setting in which p = 0.8, relevant documents ranked 20, 10, and 5 by an IRS contribute 0.0029, 0.0268, and 0.0819, respectively. Thus the ranking is mostly decided by the first five documents retrieved by each IRS, not capturing differences in lower ranks.
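The following short Python snippet (ours) reproduces the contributions quoted above from the standard RBP weighting, in which a relevant document at rank i contributes (1 - p) * p^(i-1).

def rbp_contribution(rank, p=0.8):
    # Weight of a relevant document at the given rank under RBP.
    return (1 - p) * p ** (rank - 1)

for rank in (20, 10, 5):
    print(rank, round(rbp_contribution(rank), 4))   # 0.0029, 0.0268, 0.0819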

Here we analyze this effect by studying the influence of modeling a more persistent user, that is, a user who is more prone to analyze lower-ranked results. Figure 7.6 shows the behavior of RBP-pooling over a continuous evaluation campaign with varying numbers of judgments and values of p. As expected, higher values of p make the metric more sensitive to changes in the lower part of the rankings, thus increasing the need for additional relevance judgments in order to produce a more accurate ranking. Nevertheless, the lowest correlation we obtained is 0.9, which is a remarkable result that is mostly due to the fact that most of the probability mass is concentrated on the top-50 ranked documents. This implies that changes in the relevance of lower-ranked documents do not significantly affect RBP.


7.7.4 Real Deployment of a Continuous Evaluation Campaign

As pointed out in Section 7.3.2, continuous evaluation campaigns can leverage crowdsourcing in order to obtain relevance judgments over time. To demonstrate the feasibility of such an approach, we ran a continuous evaluation campaign featuring the IRSs participating in the SemSearch 2011 competition. We crowdsourced relevance judgments for SemSearch11 following the same HIT design and using AMT settings similar to the ones originally chosen by the SemSearch 2011 organizers [89] and used previously in Chapters 3 and 4.5 We ran the continuous evaluation by grouping together all runs submitted by the same research group in one evaluation step.6 As four groups submitted runs, we obtain four steps in our continuous evaluation. Additionally, to correctly run a continuous evaluation, we make sure that no crowd worker participates in two different evaluation steps, since evaluations at different points in time are typically handled by different crowds. The judgments were collected at different points in time, as reported in the following:

• First step and CJS: Taken from SemSearch11.

• Second step: May 8–13, 2012.

• Third step: May 14–19, 2012.

• Fourth step: May 18–23, 2012.

The short time span we used to collect relevance judgments (15 days) and the limited number of IRSs taken into consideration are limitations of this evaluation, since we expect a real continuous evaluation to last for some years and to involve many systems. However, the data we collected gives insights into the diversity of the assessments and into the economic feasibility of the proposed methodology.

We analyze the results along two dimensions: How the IRSs ranking compares to the original final SemSearch 2011 results, and how stable the IRSs ranking is across the steps of the continuous evaluation, that is, how much the ranking changes as compared to the previous step measured by Kendall’s τ. We call this ranking stability.

From the correlation with the original SemSearch 11 ranking reported in Table 7.2, we observe that the best correlation is obtained with fix-depth pooling since it is the same strategy as the one used by the organizers of SemSearch 11. Looking at the ranking stability reported in Table 7.3, we see that in all cases more evaluation steps make the IRSs ranking more stable. The minimum ranking stability (τ = 0.6) occurs at the third step of our campaign, when the new judgments produced by the third team reveal that the run ranked third during the second step of the campaign is actually the most effective. Nevertheless, it is worth noticing that the differences

5Three judgments per document made by workers from the U.S. and aggregated with majority vote.
6We suppose that during a continuous evaluation initiative each group would have submitted all its runs together, as it is quite likely they come from the same system tuned with different values for its parameters.


Table 7.2 – Kendall’s τ correlation between each step of a continuous evaluation campaign on SemSearch 11 and the original results of the initiative done by using a “block” evaluation.

Metric                                     1st step   2nd step   3rd step   4th step
                                           (3 runs)   (5 runs)   (8 runs)   (10 runs)
τ vs SemSearch11 (fix-depth pooling)        -0.33       0.60       0.86       0.87
τ vs SemSearch11 (sampling)                  1.00       0.40       0.79       0.60
τ vs previous step (fix-depth pooling)         -        0.33       0.80       1.00
τ vs previous step (sampling)                  -       -0.33       0.40       1.00

Table 7.3 – Stability of the ranking at each step of a continuous evaluation campaign on SemSearch 11.

Measure                  1st step   2nd step   3rd step   4th step
                         (3 runs)   (5 runs)   (8 runs)   (10 runs)
τ vs previous step           -          1         0.6       0.8571

between the scores of the two systems were not statistically significant (t-test: p = 0.9310 at the second step and p = 0.9861 at the third step). Recall that a similar situation occurred in the evaluation of the AOR methods described in Chapter 3.
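For completeness, the sketch below shows how such a significance test can be computed from per-topic AP scores with scipy; we assume a paired t-test over aligned topics, which is a common choice in IR evaluation, and the score values are invented for illustration only.

from scipy.stats import ttest_rel

ap_run_a = [0.31, 0.12, 0.45, 0.27, 0.08]   # hypothetical per-topic AP values
ap_run_b = [0.30, 0.14, 0.44, 0.26, 0.09]

t_stat, p_value = ttest_rel(ap_run_a, ap_run_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # a high p-value means the difference is not significant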

To deal with the assessment diversity of different crowds, as explained in Section 7.6.1, we defined the CJS containing topic-result pairs from the first step of the continuous evaluation and built an AMT Qualification Test that each worker has to perform before starting to judge documents for the current IRS. To balance the judgments we create the parametric function defined by Equation 7.3, where w(i) is the judgment given by worker w to the topic/document pair i, cjs(i) is its original judgment, and α is a parameter in [0,1] that defines how we treat strict and lenient workers (so, weight(w) = α > 0 for strict workers and weight(w) = -α < 0 for tolerant workers). In our experiments we set α = 0.5.

\[
\text{weight}(w) =
\begin{cases}
\alpha & \text{if } w(i) - \text{cjs}(i) < 0 \\
-\alpha & \text{if } w(i) - \text{cjs}(i) > 0 \\
0 & \text{if } w(i) - \text{cjs}(i) = 0
\end{cases}
\tag{7.3}
\]
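The following Python sketch illustrates how Equation 7.3 can be applied in practice. The per-pair comparison between a worker's judgment w(i) and the reference judgment cjs(i) follows the equation; how the per-pair values are aggregated into a single per-worker weight over the CJS, and how that weight is then used to adjust new judgments, are our own assumptions (here: the majority sign over the CJS), and all function names are ours.

ALPHA = 0.5

def pair_weight(worker_judgment, cjs_judgment, alpha=ALPHA):
    diff = worker_judgment - cjs_judgment
    if diff < 0:      # stricter than the reference judgment
        return alpha
    if diff > 0:      # more lenient than the reference judgment
        return -alpha
    return 0

def worker_weight(worker_judgments, cjs_judgments, alpha=ALPHA):
    # Aggregate per-pair weights over the Common Judgment Set (assumption).
    values = [pair_weight(worker_judgments[i], cjs_judgments[i], alpha)
              for i in cjs_judgments]
    total = sum(values)
    return alpha if total > 0 else -alpha if total < 0 else 0

# Example: a lenient worker judging the CJS higher than the reference.
cjs = {"t1/d1": 0, "t1/d2": 1, "t2/d5": 0}
worker = {"t1/d1": 1, "t1/d2": 1, "t2/d5": 1}
print(worker_weight(worker, cjs))   # -0.5, i.e., downscale this worker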

An interesting comparison can be made between the correlation values of the rankings obtained by the raw judgments of different crowds and those obtained by considering the CJS to adjust the values. As we can observe in Table 7.4, there is a clear improvement in the stability


Table 7.4 – Kendall’s τ correlation between rankings computed at each step of the continuous evaluation campaign on SemSearch 11 and the rankings computed by using the original relevance judgments included in the dataset. The relevance judgments used during the continuous evaluation campaign were modified by using Equation 7.3.

Measure                                    1st step   2nd step   3rd step   4th step
                                           (3 runs)   (5 runs)   (8 runs)   (10 runs)
τ vs SemSearch11 (fix-depth pooling)        -0.33       0.40       0.71       0.73
τ vs SemSearch11 (sampling)                 -0.33       0.40       0.71       0.82
τ vs previous step (fix-depth pooling)         -        1.00       1.00       0.86
τ vs previous step (sampling)                  -        1.00       1.00       1.00

Table 7.5 – Relevant/Judged rate over the steps of the continuous evaluation campaign on SemSearch 11.

                         1st Step   2nd Step   3rd Step   4th Step
Raw Crowd                 0.098      0.091      0.078      0.092
CJS-balanced Crowd        0.095      0.068      0.063      0.073

of the ranking over the evaluation steps since we adjust later crowds to be more similar to the first one.

Table 7.5 shows the rate of documents considered relevant versus the number of judgments at each step of the evaluation. As we can see, the non-adjusted crowd behaves in a more tolerant way (i.e., more results are considered relevant) compared to the adjusted crowd, where judgments on the CJS are compared to those of SemSearch 11 to downscale the new judgments.

7.8 Discussion

In this section we discuss points that are of key importance when deploying a continuous evaluation campaign in practice. In particular, we focus on issues related to the integration of relevance judgments and to the construction of the CJS; we discuss the economic viability of a continuous evaluation campaign; finally, we conclude the section by describing how we envision "more continuous" continuous evaluation campaigns.


7.8.1 Integrating Relevance Judgments

In Section 7.7.4 we used a simplistic method to integrate new relevance judgments into an existing set of judgments. That approach implicitly characterizes judges based on their strictness or leniency; however, we believe that other aspects of the integration of relevance judgments are to be considered when running a continuous evaluation campaign.

In particular, inconsistencies in the relevance judgments can arise as a consequence of having several judges and of continuously judging documents on the same corpus at different points in time. We observed that the more a continuous evaluation advances, the fewer new relevant documents can be found. As a consequence, judges are exposed to many non-relevant documents and might tend to be more lenient, possibly introducing inconsistencies [169]. We believe that the problem could be attenuated by using more complex weighting functions, possibly depending on the number of relevant documents found so far and/or on the "age" of the continuous evaluation campaign (that is, the number of steps done so far).

Inconsistencies in the relevance judgments can also arise because things change over time and, as a result, the answer to a certain topic can change as well (e.g., "Who is the president of Italy?"). In this case, we think that the training of the judges and/or the design of the HITs might play a central role: the judge has to know when the topic was chosen in order to provide assessments that are coherent with those composing the dataset. If we rely on crowdsourcing for relevance judgments, one option could be to ask the workers to take a particular HIT only if we deem them old enough to answer; however, when this research was done (late 2013) crowdsourcing platforms did not allow users to impose constraints on the age of the workers. Another viable option could be to add an age requirement in the description of the task, but even then we could not be sure that workers comply with such a requirement. A more reliable solution would be to use push crowdsourcing [68], in which we can leverage the background of the workers and push the tasks to workers that we deem adequate. Still, these methods can only mitigate the effect of time since it may be difficult for the judges to place themselves in the time period necessary to make the requested relevance judgments. Understanding how the design of the judging interface and the provision of additional contextual information can help judges remember the time in which the collection was built, and how this affects crowdsourced relevance judgments, is an interesting direction for future work.

Finally, inconsistencies can arise simply because different judges interpret the intent of a keyword query in different ways, or because they use a different interpretation of relevance. As for the first case, Verberne et al. show that keywords are not enough to convey the searcher's intent to the judge and that, in general, external judges can only correctly extrapolate the topic and the spatial sensitivity of the original search intent [189]. This aspect also has to be taken into consideration when merging different kinds of relevance judgments, for example when mixing TREC-like judgments made by the creators of the topics with crowdsourced judgments. In particular, the resulting test collection might, at some point, have a large number of relevance assessments made by a few people, and a small number made by a lot of

different people. We think that a longer description of the topic (e.g., the "narrative" field of TREC topics) can help the judge in deciding about the relevance of a document. We suspect, however, that there might be a trade-off between the precision of the description of the topic and the quality of the judgments obtained, since crowd workers are often less prone to read long instructions. It is also worth noticing that filtering the crowd workers by country might help dealing with linguistic and cultural issues.7

7.8.2 Building the CJS

In our work we defined the judgments composing the CJS to be selected by the organizers of the campaign, possibly with the help of the creators of the topics. Optimally selecting the best judgments to include in the CJS and/or deciding how to adjust the relevance scores provided by the judges is an open research question which will require further investigation. We can envision several ways of populating the CJS: the most trivial (and possibly the least valid) option would be not to involve the creators of the topics, to crowdsource the first relevance judgments, and to select the documents with the highest agreement. A more sophisticated method could consist in asking the creators of the topics to provide positive and negative examples of documents for each topic, where the negative examples cover different interpretations of the query. For example, the INEX Entity Ranking track topics contain example relevant results as defined by the topic creator [66]. Doing so would possibly make the size of the CJS too large, but it would probably still be manageable if the HITs are organized by topic. We believe that it would also be interesting to use per-topic qualification tests in order to accept workers with the same "point of view" as the topic creators.

7.8.3 Economical Viability

Another question we tackle in this section is whether or not it is economically viable to use crowdsourcing to create additional relevance judgments for a new run participating in a continuous evaluation campaign. While the cost of crowdsourcing the relevance judgments for an entire TREC collection can be too high for a single research group (around $10,000 [5]), the per-run cost in a fair, continuous evaluation campaign is much more affordable.

As shown in Figure 7.4 (bottom), in a fix-depth pooling setting the judgment cost per run decreases on average as we add more runs. We observe that the first runs are more expensive to evaluate, which leads to the conclusion that the first steps of a continuous evaluation should be carried out in the context of a classic TREC-like initiative. Assuming that the first 20 runs participated in the evaluation initiative, we compute the cost of creating the remaining judgments for each run, assuming that three workers are asked to judge each document and are paid $0.10 for each relevance judgment (this is the standard setting used by most of the approaches we refer to in Section 7.2). With such settings, the average cost per run is $22 for pool depth 10, $90 for pool depth 50, and $160 for pool depth 100. This, in our opinion, would

7This option is supported, for example, by AMT.

be acceptable for most research groups proposing a new system. It is also worth noticing that in some cases it is possible, during the organization of the campaign, to decide to lower the accuracy of the score in order to reduce the cost of relevance evaluations. This can be done, for example, by using Rank-Biased Precision.
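Returning to the cost estimate above, the small sketch below (ours) computes the per-run judging cost from the number of newly pooled documents, three workers per document, and $0.10 per judgment; the document counts used in the example are hypothetical values chosen only to match the order of magnitude of the reported costs.

WORKERS_PER_DOC = 3
PAY_PER_JUDGMENT = 0.10   # dollars

def cost_per_run(new_documents_to_judge):
    # Each new document is judged by three workers, each paid $0.10.
    return new_documents_to_judge * WORKERS_PER_DOC * PAY_PER_JUDGMENT

# E.g., about 73 new documents per run cost roughly $22,
# 300 about $90, and 533 about $160.
for new_docs in (73, 300, 533):
    print(new_docs, f"${cost_per_run(new_docs):.0f}")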

Various further financial schemes could sustain our evaluation methodology, for example reimbursing the crowdsourcing costs through sponsoring, splitting the costs by introducing registration fees, etc.

7.8.4 More Continuous Continuous Evaluations

We envision an extended version of the continuous evaluation methodology we proposed in this chapter that is not bound to a fixed document collection, to a fixed set of topics, or to a fixed set of judgments, as all its components change continuously. By adding new topics, the inaccuracies of the old topics get amortized over a growing number of newer topics, which contributes to a more reliable evaluation [165]. One of the crucial points implied by this definition is that it should always be possible to obtain new runs of all the IRSs participating in the evaluation, thus requiring effort both from the participants (maintaining a possibly long-term access point to their system or providing the organizers with their system) and from the organizers (maintaining a directory of systems ready to run, developing new topics, etc.), depending on how centralized the campaign is. Microsoft already implemented a system that was able to run all the IRSs participating in the Entity Recognition and Disambiguation Challenge workshop held at SIGIR 2014. Continuous access to running IRSs is also adopted by the living-lab approach [16]; however, this process is highly centralized and not really continuous, as it does not allow systems to join the evaluation after the end of the initiative. We think that such an approach would not accommodate the needs of many researchers since it requires a constant effort from the organizers. Further open issues include the process of integrating new topics, extending the corpus, and dealing with the conflicts of interest that can arise, such as participants maliciously gathering relevance judgments to favor their systems and penalize other peers.

During our research we decided to take a conservative approach by proposing an evaluation methodology with fixed topics, "static" systems, and honest participants in order to keep our methodology simple and focused.

7.9 Conclusions

In this chapter we analyzed the situation we encountered in Chapter 3, in which a system was not evaluated in a fair way since relevance judgments were missing. To overcome this issue we proposed a continuous evaluation methodology in which new systems participating in the evaluation campaign provide missing relevance judgments, thus continuously updating the underlying testset. Together with the new evaluation methodology we introduced a series of

statistics to monitor and compare the IRSs rankings over the course of a continuous evaluation campaign, and to ensure that all systems are treated fairly. We use such statistics to create fairness-aware pooling strategies and to define opportunistic pooling. The methodology can be readily applied both to new and to existing document collections, as long as the system runs and relevance judgments are available. To demonstrate our approach, we created and made available a set of tools that ease the setup and the handling of continuous evaluation campaigns using TREC-like evaluation collections.

As future work, we could better simulate a continuous evaluation by actually obtaining relevance judgments continuously from different crowds in order to better analyze the impact of time and of different people on the relevance judgments. Moreover, the proposed methodology could be adapted to "more continuous" continuous evaluation campaigns featuring evolving collections whose set of topics can change, as well as their corpora and other components.


8 Conclusions

In this thesis, we investigated, designed, and evaluated a number of methods and algorithms to exploit knowledge graphs and to increase the quality of their data. Our work contributed to advancing the state-of-the-art in several tasks related to knowledge graphs, namely, entity retrieval, entity type ranking, and schema adherence. We also studied how other applications can benefit from knowledge graphs by designing and evaluating an entity-centric system for detecting events on Twitter and, finally, we analyzed how crowdsourcing can be applied to continuously evaluate entity retrieval systems.

8.1 Lessons Learned

The lessons we have learned in this context are numerous and are related to the various aspects of research on knowledge graphs we have explored so far. In particular, in Chapter 3, where we described our system for Ad-hoc Object Retrieval, we learned that well-known Information Retrieval techniques that are effective for retrieving text documents and webpages are not as effective when used to retrieve entities from knowledge graphs. Nevertheless, combining them with structured repositories allowing us to walk through the knowledge graph is time-efficient and leads to better effectiveness. An important aspect of this approach is related to how to explore the graph to find additional relevant entities. When tackling this problem we realized that only a few properties lead to good entities, and that considering properties connecting an entity with many other entities of different types increases the search space and thus should be avoided to keep the execution time low.

Our experience with AOR also taught us that reusable test collections created in the context of IR evaluation initiatives can penalize later systems whose top ranked documents might not have been evaluated by any human annotator. This last remark motivated us to explore new evaluation methodologies that mitigate this issue. We reported our findings in Chapter 7, where we also discussed how to organize a continuous evaluation campaign in which information retrieval systems can be evaluated at any point in time, provided that they complete the test collection with their missing judgments, and that they share their results with the rest

of the community. During that work we understood that exploiting crowd workers to obtain relevance judgments for information needs designed by somebody else is not straightforward: several factors can influence the annotators, including time, the number of irrelevant documents they are exposed to, and their tendency to be strict or lenient.

In Chapter 4, we showed that entities in knowledge graphs can be associated with many types, and we tackled the task of selecting the best type to show to users given an entity and the textual context in which it is mentioned. Surprisingly, we realized that always returning the most specific type is often not the best choice. We also proved that textual context plays a role in determining which type should be preferred; however, small textual contexts composed of short n-grams appearing before and after the entity mention are good enough to suggest the relevance of very generic types, but cannot generalize to more specific types. Conversely, features extracted from the type hierarchy of the knowledge graph give good evidence of relevance.

The more we worked with knowledge graphs, the more we realized that their data can be noisy and that, despite all the effort put into manually creating coherent data models, knowledge graph data does not always comply with its schemata. This motivated us to focus on improving data quality in knowledge graphs by designing the algorithms we introduced in Chapter 5. There, we first showed that object properties often do not comply with their formal definition by describing several examples extracted from Freebase and DBpedia, and then reported on how the statistical notion of entropy can help detect misused properties to improve data quality.

Finally, in Chapter 6 we presented ArmaTweet, a system that leverages knowledge graphs to detect events in Twitter. ArmaTweet takes as input semantic queries describing precisely what kind of events the user wants to detect (e.g., "deaths of politicians"), and constantly monitors Twitter to identify relevant posts that are then shown to the user, together with a semantic summary of the event they describe. Our system exploits entity linking techniques to mine entities in tweets and then builds time series based on the detected entities and on verbs spotted in the tweets. The combination of Natural Language Processing methods and techniques for exploiting knowledge graphs yields good results in terms of precision. Experimenting with ArmaTweet taught us that conventional NLP tools are often not suitable for tweets due to their peculiar linguistic features. Moreover, we soon realized that popular techniques for named entity recognition were not enough for our use case: well-known tools such as the Stanford CoreNLP framework or GATE can only extract entities belonging to a few categories (typically, people, organizations, and locations), while ArmaTweet users can also be interested in other types of entities such as natural disasters (e.g., earthquakes, tsunamis, floods), digital products, etc.

8.2 Future Work

We believe that in the near future knowledge graphs will play a central role in services offered by Web players. In addition, we think that smaller companies will also start leveraging such a technology in order to interlink data coming from different silos to improve their products, or for business intelligence purposes. In the following we present some compelling ideas that could be pursued as an extension of this work and that can help advance the current state of knowledge graph technologies.

[Figure 8.1 content: an HTML excerpt annotated with Microdata markup describing an entity of type schema.org/Person, shown next to its graph representation.]

Figure 8.1 – An HTML excerpt in which Microdata and schema.org are used to define a Web Entity (left), and its graphical representation (right). Web Entities are often "locked" in the webpage in which they are defined: since they often do not have a URI, they cannot be referenced in other contexts; they cannot "leave" their webpage.

8.2.1 A Knowledge Graph Spanning Over the World Wide Web

Following the original vision of the Semantic Web by Tim Berners-Lee [27], Web publishers have increasingly been using Semantic Web technologies to add machine-readable content describing the information they publish. Figure 8.1, for example, shows a small HTML excerpt defining the entity "Emma Watson", and suggests that such semantic annotations can be viewed as small knowledge graphs embedded into Web documents. We call the entities contained in such knowledge graphs Web Entities. In one of our recent publications, we argue that such data composes an underlying unexploited knowledge graph that we call VoldemortKG, and that overlaps and possibly complements other knowledge graphs in the Linked Open Data (LOD) cloud [183]. The size of VoldemortKG can be estimated by considering the work by Bizer et al., in which the authors measured that in 2012 at least 3 billion webpages, originating from over 40 million websites, contained semantic annotations [28].

Improving VoldemortKG is an interesting direction for future work as we believe it is very challenging. Issues that should be taken into consideration include the fact that Web Entities are often not associated with global URIs and thus their scope is the webpage where they are defined. As a consequence, the same Web Entity can be defined and described by several webpages, possibly using different vocabularies and data formats.

8.2.2 Actionable Knowledge Graphs

A recent trend in academic research is connected to actionable knowledge graphs. In an actionable knowledge graph entities include information on how to perform actions on them;

for example, an entity of type "Book" can have properties specifying how to buy it, how to write and share a review about it, or how to read it online. That is, entities are actionable. Ontologies allowing us to describe actionable entities already exist: schema.org, for example, features specific entity types to identify actions and properties to connect them to entities.1

There are several challenging issues connected to actionable knowledge graphs to be tackled. First of all, information on actions that can be performed on specific entities should be collected. This is not trivial, since preliminary studies we conducted show that very few webpages are semantically annotated with action information. Another possible task consists in understanding what actions users want to perform on certain entities, and mapping such transactional needs to their representations in the knowledge graph. If the interface between the knowledge graph and the user is a search box, the task is connected to Ad-hoc Object Retrieval (cf. Chapter 3). This task has been studied by Lin et al., who used probabilistic graphical models to mine latent intents from query logs [113]. Nevertheless, in their approach actions are represented by clusters of keywords which are not related to any knowledge graph. The tasks we described will also be studied for the first time this year at NTCIR, the Japanese counterpart of the American Text Retrieval Conference (TREC).2

8.3 Outlook

On a global scale, the future work described previously suggests that the World Wide Web is about to experience a paradigm shift in which transactional needs of users will not be satisfied by ordinary webpages but rather by information extracted from Web documents and organized in knowledge graphs. This was also prophesied by Andrei Broder in his WWW2015 keynote [40] and opens the door to a wide range of new opportunities which, however, can lead to consequences affecting how information is published and consumed. Opportunities are of course related to the available data: many companies could exploit actionable knowledge graphs containing information available on the Web to provide all kinds of services. We can envision apps that exploit public governmental data to help people organize their bureaucratic life by suggesting and automating formal procedures, by helping them plan and actuate financial strategies, and by suggesting and building personal profiles including needed insurance policies, tax declarations, etc. At the same time, search engines could exploit semantic annotations of webpages to directly answer user information needs. For example, they could allow users issuing the query "arena verona nabucco tickets" to directly buy the tickets they are looking for at the lowest price.

However, what if apps and search engines become so effective at satisfying user needs that people do not need to use webpages anymore? On the one hand, Web publishers who integrate machine-readable content explaining how to use the services they offer would lose viewers. This would penalize them since they often take advantage of pages connected to the services

1http://schema.org/Action
2http://ntcirakg.github.io/cfp.html

they offer to display recommendations, advertisements, or general information they want Web users to see. On the other hand, popular search engines or apps might privilege services offered by providers who semantically annotate their content, thus reducing the number of users who are redirected to service providers that decide to leave their content only human-readable. We believe that this situation has the potential to initiate an evolutionary process that will lead to techniques for retrieving information and providing services on the Web that will benefit both producers and consumers of data. The World Wide Web could thus shift from being a document-centric resource to being an entity-centric repository of knowledge and services, that is, a knowledge graph.


Bibliography

[1] J. Allan, W. B. Croft, A. Moffat, and M. Sanderson. Frontiers, challenges, and opportunities for information retrieval: Report from SWIRL 2012 the second strategic workshop on information retrieval in Lorne. SIGIR Forum, 46(1):2–32, 2012.

[2] O. Alonso. Implementing crowdsourcing-based relevance experimentation: an industrial perspective. Inf. Retr., 16(2):101–120, 2013.

[3] O. Alonso and R. A. Baeza-Yates. Design and implementation of relevance assessments using crowdsourcing. In ECIR, volume 6611 of Lecture Notes in Computer Science, pages 153–164. Springer, 2011.

[4] O. Alonso and S. Mizzaro. Relevance criteria for e-commerce: a crowdsourcing-based experimental analysis. In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009, Boston, MA, USA, July 19-23, 2009, pages 760–761, 2009.

[5] O. Alonso and S. Mizzaro. Using crowdsourcing for TREC relevance assessment. Inf. Process. Manage., 48(6):1053–1066, 2012.

[6] G. Angeli, M. J. J. Premkumar, and C. D. Manning. Leveraging linguistic structure for open domain information extraction. In ACL (1), pages 344–354. The Association for Computer Linguistics, 2015.

[7] J. Aslam and V. Pavlu. A practical sampling strategy for efficient retrieval evaluation. Working Draft, http://www.ccs.neu.edu/home/jaa/papers/drafts/statAP.html, 2007.

[8] J. A. Aslam, V. Pavlu, and E. Yilmaz. A statistical method for system evaluation using incomplete judgments. In SIGIR 2006: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, USA, August 6-11, 2006, pages 541–548, 2006.

[9] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. G. Ives. Dbpedia: A nucleus for a web of open data. In The Semantic Web, 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11-15, 2007., pages 722–735, 2007.

[10] P.Bailey, A. P.de Vries, N. Craswell, and I. Soboroff. Overview of the TREC 2007 enterprise track. In TREC, volume Special Publication 500-274. National Institute of Standards and Technology (NIST), 2007.

[11] K. Balog, L. Azzopardi, and M. de Rijke. A language modeling framework for expert finding. Inf. Process. Manage., 45(1):1–19, 2009.


[12] K. Balog, M. Bron, and M. de Rijke. Query modeling for entity search based on terms, categories, and examples. ACM Trans. Inf. Syst., 29(4):22:1–22:31, 2011.

[13] K. Balog, D. Carmel, A. P.de Vries, D. M. Herzig, P.Mika, H. Roitman, R. Schenkel, P.Serdyukov, and D. T. Tran. The first joint international workshop on entity-oriented and semantic search (JIWES). SIGIR Forum, 46(2):87–94, 2012.

[14] K. Balog, A. P. de Vries, P. Serdyukov, and J. Wen. The first international workshop on entity-oriented search (EOS). SIGIR Forum, 45(2):43–50, 2011.

[15] K. Balog, Y. Fang, M. de Rijke, P.Serdyukov, and L. Si. Expertise retrieval. Foundations and Trends in Information Retrieval, 6(2-3):127–256, 2012.

[16] K. Balog, L. Kelly, and A. Schuth. Head first: Living labs for ad-hoc search evaluation. In CIKM, pages 1815–1818. ACM, 2014.

[17] K. Balog and R. Neumayer. Hierarchical target type identification for entity-oriented queries. In CIKM, pages 2391–2394. ACM, 2012.

[18] K. Balog, P.Serdyukov, and A. P.de Vries. Overview of the TREC 2010 entity track. In TREC, volume Special Publication 500-294. National Institute of Standards and Technology (NIST), 2010.

[19] K. Balog, P.Serdyukov, and A. P.de Vries. Overview of the TREC 2011 entity track. In TREC, volume Special Publication 500-296. National Institute of Standards and Technology (NIST), 2011.

[20] K. Balog, I. Soboroff, P.Thomas, N. Craswell, A. P.de Vries, and P.Bailey. Overview of the TREC 2008 enterprise track. In TREC 2008.

[21] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. Open information extraction from the web. In IJCAI, pages 2670–2676, 2007.

[22] P.Basile, M. Degemmis, A. L. Gentile, P.Lops, and G. Semeraro. The algorithm for word sense disambiguation and semantic indexing of documents. In AI*IA, volume 4733 of Lecture Notes in Computer Science, pages 314–325. Springer, 2007.

[23] H. Becker, F.Chen, D. Iter, M. Naaman, and L. Gravano. Automatic identification and presentation of twitter content for planned events. In ICWSM. The AAAI Press, 2011.

[24] H. Becker, M. Naaman, and L. Gravano. Beyond trending topics: Real-world event identification on twitter. In ICWSM. The AAAI Press, 2011.

[25] E. Benson, A. Haghighi, and R. Barzilay. Event discovery in social media feeds. In ACL, pages 389–398. The Association for Computer Linguistics, 2011.

[26] J. Berant, A. Chou, R. Frostig, and P.Liang. Semantic parsing on freebase from question-answer pairs. In EMNLP, pages 1533–1544. ACL, 2013.

[27] T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, 284(5):34–43, May 2001.

[28] C. Bizer, K. Eckert, R. Meusel, H. Mühleisen, M. Schuhmacher, and J. Völker. Deployment of rdfa, microdata, and microformats on the web - A quantitative analysis. In The Semantic Web - ISWC 2013 - 12th International Semantic Web Conference, Sydney, NSW, Australia, October 21-25, 2013, Proceedings, Part II, pages 17–32, 2013.

[29] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. Dbpedia - A crystallization point for the web of data. J. Web Sem., 7(3):154–165, 2009.


[30] R. Blanco, B. B. Cambazoglu, P.Mika, and N. Torzec. Entity recommendations in web search. In The Semantic Web - ISWC 2013 - 12th International Semantic Web Conference, Sydney, NSW, Australia, October 21-25, 2013, Proceedings, Part II, pages 33–48, 2013.

[31] R. Blanco, H. Halpin, D. M. Herzig, P.Mika, J. Pound, H. S. Thompson, and D. T. Tran. Repeatable and reliable search system evaluation using crowdsourcing. In SIGIR, pages 923–932. ACM, 2011.

[32] R. Blanco, H. Halpin, D. M. Herzig, P.Mika, J. Pound, H. S. Thompson, and T. Tran. Repeatable and reliable semantic search evaluation. J. Web Sem., 21:14–29, 2013.

[33] R. Blanco, P.Mika, and S. Vigna. Effective and efficient entity search in RDF data. In International Semantic Web Conference (1), volume 7031 of Lecture Notes in Computer Science, pages 83–97. Springer, 2011.

[34] C. Böhm, G. de Melo, F.Naumann, and G. Weikum. LINDA: distributed web-of-data-scale entity matching. In 21st ACM International Conference on Information and Knowledge Management, CIKM’12, Maui, HI, USA, October 29 - November 02, 2012, pages 2104–2108, 2012.

[35] K. Bollacker, P.Tufts, T. Pierce, and R. Cook. A platform for scalable, collaborative, structured information integration. In Intl. Workshop on Information Integration on the Web (IIWeb’07), 2007.

[36] K. D. Bollacker, C. Evans, P.Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD Conference, pages 1247–1250. ACM, 2008.

[37] K. Bontcheva, L. Derczynski, A. Funk, M. A. Greenwood, D. Maynard, and N. Aswani. Twitie: An open-source information extraction pipeline for microblog text. In RANLP, pages 83–90. RANLP 2013 Organising Committee / ACL, 2013.

[38] P.Bouquet, H. Stoermer, C. Niederee, and A. Maña. Entity Name System: The back-bone of an open and scalable web of data. In 2008 IEEE International Conference on Semantic Computing, pages 554–561, Aug 2008.

[39] T. Brants, A. C. Popat, P.Xu, F.J. Och, and J. Dean. Large language models in machine translation. In EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic, pages 858–867, 2007.

[40] A. Broder. How good was our crystal ball? A personal perspective and retrospective on favorite web research topics. In Proc. of WWW, 2015.

[41] M. Bron, K. Balog, and M. de Rijke. Ranking related entities: components and analyses. In CIKM, pages 1079–1088. ACM, 2010.

[42] M. Bron, K. Balog, and M. de Rijke. Example based entity search in the web of data. In ECIR, volume 7814 of Lecture Notes in Computer Science, pages 392–403. Springer, 2013.

[43] C. Buckley, D. Dimmick, I. Soboroff, and E. M. Voorhees. Bias and the limits of pooling for large collections. Inf. Retr., 10(6):491–508, 2007.

[44] C. Buckley and E. M. Voorhees. Retrieval evaluation with incomplete information. In SIGIR, pages 25–32. ACM, 2004.

[45] L. Bühmann and J. Lehmann. Universal OWL axiom enrichment for large knowledge bases. In Knowledge Engineering and Knowledge Management - 18th International Conference, EKAW 2012, Galway City, Ireland, October 8-12, 2012. Proceedings, pages 57–71, 2012.


[46] L. Bühmann and J. Lehmann. Pattern based knowledge base enrichment. In The Semantic Web - ISWC 2013 - 12th International Semantic Web Conference, Sydney, NSW, Australia, October 21-25, 2013, Proceedings, Part I, pages 33–48, 2013.

[47] S. Büttcher, C. L. A. Clarke, P. C. K. Yeung, and I. Soboroff. Reliable information retrieval evaluation with incomplete and biased judgements. In SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, July 23-27, 2007, pages 63–70, 2007.

[48] S. Campinas, D. Ceccarelli, T. E. Perry, R. Delbru, K. Balog, and G. Tummarello. The sindice-2011 dataset for entity-oriented search in the web of data. In Balog et al., pages 26–32.

[49] D. Carmel, M. Chang, E. Gabrilovich, B. P. Hsu, and K. Wang. Erd’14: entity recognition and disambiguation challenge. In The 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’14, Gold Coast , QLD, Australia - July 06 - 11, 2014, page 1292, 2014.

[50] B. Carterette, J. Allan, and R. K. Sitaraman. Minimal test collections for retrieval evaluation. In SIGIR 2006: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, USA, August 6-11, 2006, pages 268–275, 2006.

[51] B. Carterette, E. Kanoulas, V. Pavlu, and H. Fang. Reusable test collections through experimental design. In SIGIR, pages 547–554. ACM, 2010.

[52] M. Catasta, A. Tonon, G. Demartini, J. Ranvier, K. Aberer, and P. Cudré-Mauroux. B-hist: Entity-centric search over personal web browsing history. J. Web Sem., 27:19–25, 2014.

[53] M. Catasta, A. Tonon, D. E. Difallah, G. Demartini, K. Aberer, and P. Cudré-Mauroux. Hippocampus: answering memory queries using transactive search. In 23rd International World Wide Web Conference, WWW ’14, Seoul, Republic of Korea, April 7-11, 2014, Companion Volume, pages 535–540, 2014.

[54] D. Chen and C. D. Manning. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750. ACL, 2014.

[55] M. Ciaramita and Y. Altun. Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In EMNLP, pages 594–602. ACL, 2006.

[56] C. Cleverdon. Report on the testing and analysis of an investigation into the comparative efficiency of indexing systems. College of Aeronautics, Cranfield, England, 1962.

[57] W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string metrics for matching names and records. In KDD Workshop on Data Cleaning and Object Consolidation, 2003.

[58] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02), 2002.

[59] M. Curtiss, I. Becker, T. Bosman, S. Doroshenko, L. Grijincu, T. Jackson, S. Kunnatur, S. B. Lassen, P. Pronin, S. Sankar, G. Shen, G. Woss, C. Yang, and N. Zhang. Unicorn: A system for searching the social graph. PVLDB, 6(11):1150–1161, 2013.

[60] J. Daiber, M. Jakob, C. Hokamp, and P. N. Mendes. Improving efficiency and accuracy in multilingual entity extraction. In I-SEMANTICS, pages 121–124. ACM, 2013.


[61] J. Dalton and S. Huston. Semantic entity retrieval using web queries over structured RDF data. In Semantic Search Workshop 2010, 2010.

[62] C. d’Amato, N. Fanizzi, and F. Esposito. Inductive learning for the semantic web: What does it buy? Semantic Web, 1(1-2):53–59, 2010.

[63] R. Delbru, N. Toupikov, M. Catasta, and G. Tummarello. A node indexing scheme for web entity retrieval. In ESWC (2), volume 6089 of Lecture Notes in Computer Science, pages 240–256. Springer, 2010.

[64] G. Demartini, D. E. Difallah, and P. Cudré-Mauroux. ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In WWW, pages 469–478. ACM, 2012.

[65] G. Demartini, C. S. Firan, T. Iofciu, R. Krestel, and W. Nejdl. Why finding entities in Wikipedia is difficult, sometimes. Inf. Retr., 13(5):534–567, 2010.

[66] G. Demartini, T. Iofciu, and A. P. de Vries. Overview of the INEX 2009 entity ranking track. In Focused Retrieval and Evaluation, 8th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2009, Brisbane, Australia, December 7-9, 2009, Revised and Selected Papers, pages 254–264, 2009.

[67] G. Demartini, M. M. S. Missen, R. Blanco, and H. Zaragoza. TAER: time-aware entity retrieval-exploiting the past to find relevant entities in news articles. In CIKM, pages 1517–1520. ACM, 2010.

[68] D. E. Difallah, G. Demartini, and P. Cudré-Mauroux. Pick-a-crowd: tell me what you like, and I'll tell you what to do. In WWW, pages 367–374. International World Wide Web Conferences Steering Committee / ACM, 2013.

[69] J. Domingue, D. Fensel, and J. A. Hendler, editors. Handbook of Semantic Web Technologies. Springer, 2011.

[70] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA - August 24 - 27, 2014, pages 601–610, 2014.

[71] L. Ehrlinger and W. Wöß. Towards a definition of knowledge graphs. In Joint Proceedings of the Posters and Demos Track of the 12th International Conference on Semantic Systems - SEMANTiCS2016 and the 1st International Workshop on Semantic Change & Evolving Semantics (SuCCESS'16) co-located with the 12th International Conference on Semantic Systems (SEMANTiCS 2016), Leipzig, Germany, September 12-15, 2016.

[72] S. Elbassuoni and R. Blanco. Keyword search over RDF graphs. In CIKM, pages 237–242. ACM, 2011.

[73] F. Erxleben, M. Günther, M. Krötzsch, J. Mendez, and D. Vrandečić. Introducing Wikidata to the linked data web. In The Semantic Web - ISWC 2014 - 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part I, pages 50–65, 2014.

[74] Y. Fang, L. Si, Z. Yu, et al. Purdue at TREC 2010 entity track: A probabilistic framework for matching types between candidate and target entities. In Proceedings of the Text REtrieval Conference (TREC), 2010.


[75] M. Färber, B. Ell, C. Menne, and A. Rettinger. A Comparative Survey of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web, 1:1–5, 2015.

[76] A. Farzindar and W. Khreich. A survey of techniques for event detection in Twitter. Computational Intelligence, 31(1):132–164, 2015.

[77] J. R. Finkel, T. Grenager, and C. D. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL, pages 363–370. The Association for Computer Linguistics, 2005.

[78] J. R. Finkel and C. D. Manning. Joint parsing and named entity recognition. In HLT-NAACL, pages 326–334. The Association for Computational Linguistics, 2009.

[79] O. Ganea, M. Ganea, A. Lucchi, C. Eickhoff, and T. Hofmann. Probabilistic bag-of-hyperlinks model for entity linking. In Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal, Canada, April 11 - 15, 2016, pages 927–938, 2016.

[80] A. Gangemi, A. G. Nuzzolese, V. Presutti, F. Draicchio, A. Musetti, and P. Ciancarini. Automatic typing of DBpedia entities. In International Semantic Web Conference (1), volume 7649 of Lecture Notes in Computer Science, pages 65–81. Springer, 2012.

[81] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.

[82] G. A. Grimnes, P. Edwards, and A. D. Preece. Learning meta-descriptions of the FOAF network. In International Semantic Web Conference, volume 3298 of Lecture Notes in Computer Science, pages 152–165. Springer, 2004.

[83] H. Gu, X. Xie, Q. Lv, Y. Ruan, and L. Shang. Etree: Effective and efficient event modeling for real-time online social media networks. In Web Intelligence, pages 300–307. IEEE Computer Society, 2011.

[84] J. Guiver, S. Mizzaro, and S. Robertson. A few good topics: Experiments in topic set reduction for retrieval evaluation. ACM Trans. Inf. Syst., 27(4):21:1–21:26, 2009.

[85] K. Gunaratna, K. Thirunarayan, and A. P. Sheth. FACES: diversity-aware entity summarization using incremental hierarchical conceptual clustering. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA, pages 116–122, 2015.

[86] Y. Guo, Z. Pan, and J. Heflin. An evaluation of knowledge base systems for large OWL datasets. In International Semantic Web Conference, volume 3298 of Lecture Notes in Computer Science, pages 274–288. Springer, 2004.

[87] K. Haas, P. Mika, P. Tarjan, and R. Blanco. Enhanced results for web search. In SIGIR, pages 725–734. ACM, 2011.

[88] M. A. Hall. Correlation-based feature selection for discrete and numeric class machine learning. In ICML, pages 359–366. Morgan Kaufmann, 2000.

[89] H. Halpin, D. M. Herzig, P. Mika, R. Blanco, J. Pound, H. Thompson, and D. T. Tran. Evaluating ad-hoc object retrieval. In Proceedings of the International Workshop on Evaluation of Semantic Technologies (IWEST 2010), Shanghai, China, November 8, 2010, 2010.

[90] D. Harman. Is the Cranfield paradigm outdated? In SIGIR, page 1. ACM, 2010.


[91] F. Hasibi, K. Balog, and S. E. Bratsberg. Exploiting entity linking in queries for entity retrieval. In Proceedings of the 2016 ACM on International Conference on the Theory of Information Retrieval, ICTIR 2016, Newark, DE, USA, September 12-16, 2016, pages 209–218, 2016.

[92] S. Heindorf, M. Potthast, H. Bast, B. Buchhold, and E. Haussmann. WSDM cup 2017: Vandalism detection and triple scoring. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM 2017, Cambridge, United Kingdom, February 6-10, 2017, pages 827–828, 2017.

[93] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.

[94] G. Holmes, M. A. Hall, and E. Frank. Generating rule sets from model trees. In Australian Joint Conference on Artificial Intelligence, volume 1747 of Lecture Notes in Computer Science, pages 1–12. Springer, 1999.

[95] M. Hosseini, I. J. Cox, N. Milic-Frayling, G. Kazai, and V. Vinay. On aggregating labels from multiple crowd workers to infer relevance of documents. In Advances in Information Retrieval - 34th European Conference on IR Research, ECIR 2012, Barcelona, Spain, April 1-5, 2012. Proceedings, pages 182–194, 2012.

[96] T. Iofciu, G. Demartini, N. Craswell, and A. P. de Vries. ReFER: Effective relevance feedback for entity ranking. In ECIR, volume 6611 of Lecture Notes in Computer Science, pages 264–276. Springer, 2011.

[97] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422–446, 2002.

[98] K. Jones and C. Van Rijsbergen. Report on the Need for and Provision of an Ideal Information Retrieval Test Collection. British Library Research and Development reports. 1975.

[99] A. Kalyanpur, J. W. Murdock, J. Fan, and C. A. Welty. Leveraging community-built knowledge for type coercion in question answering. In International Semantic Web Conference (2), volume 7032 of Lecture Notes in Computer Science, pages 144–156. Springer, 2011.

[100] R. Kaptein, P. Serdyukov, A. P. de Vries, and J. Kamps. Entity ranking using Wikipedia as a pivot. In CIKM, pages 69–78. ACM, 2010.

[101] G. Kazai. In search of quality in crowdsourcing for search engine evaluation. In ECIR, volume 6611 of Lecture Notes in Computer Science, pages 165–176. Springer, 2011.

[102] G. Kazai, J. Kamps, M. Koolen, and N. Milic-Frayling. Crowdsourcing for book search evaluation: impact of HIT design on comparative system ranking. In Proceeding of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, Beijing, China, July 25-29, 2011, pages 205–214, 2011.

[103] G. Kazai, J. Kamps, and N. Milic-Frayling. Worker types and personality traits in crowdsourcing relevance labels. In CIKM, pages 1941–1944. ACM, 2011.

[104] D. Klein and C. D. Manning. Accurate unlexicalized parsing. In ACL, pages 423–430. ACL, 2003.

[105] M. Knuth and H. Sack. Data cleansing consolidation with PatchR. In ESWC (Satellite Events), volume 8798 of Lecture Notes in Computer Science, pages 231–235. Springer, 2014.

[106] L. Kong, N. Schneider, S. Swayamdipta, A. Bhatia, C. Dyer, and N. A. Smith. A dependency parser for tweets. In EMNLP, pages 1001–1012. ACL, 2014.


[107] R. Kumar and A. Tomkins. A characterization of online search behavior. IEEE Data Eng. Bull., 32(2):3–11, 2009.

[108] R. Lee and K. Sumiya. Measuring geographical regularities of crowd behaviors for Twitter-based geo-social event detection. In GIS-LBSN, pages 1–10. ACM, 2010.

[109] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195, 2015.

[110] M. E. Lesk and G. Salton. Relevance assessments and retrieval system evaluation. Information Storage and Retrieval, 4(4):343–359, 1968.

[111] J. Li, J. Tang, Y. Li, and Q. Luo. RiMOM: A dynamic multistrategy ontology alignment framework. IEEE Trans. Knowl. Data Eng., 21(8):1218–1232, 2009.

[112] R. Li, K. H. Lei, R. Khadiwala, and K. C. Chang. TEDAS: A Twitter-based event detection and analysis system. In ICDE, pages 1273–1276. IEEE Computer Society, 2012.

[113] T. Lin, P. Pantel, M. Gamon, A. Kannan, and A. Fuxman. Active objects: actions for entity-centric search. In Proceedings of the 21st World Wide Web Conference 2012, WWW 2012, Lyon, France, April 16-20, 2012, pages 589–598, 2012.

[114] T. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.

[115] R. Long, H. Wang, Y. Chen, O. Jin, and Y. Yu. Towards effective event detection, tracking and summarization on microblog data. In WAIM, volume 6897 of Lecture Notes in Computer Science, pages 652–663. Springer, 2011.

[116] E. Maddalena, M. Basaldella, D. D. Nart, D. Degl’Innocenti, S. Mizzaro, and G. Demartini. Crowdsourcing relevance assessments: The unexpected benefits of limiting the time to judge. In Proceedings of the 4th AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2016), 2016.

[117] F. Mahdisoltani, J. Biega, and F. M. Suchanek. YAGO3: A knowledge base from multilingual Wikipedias. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings, 2015.

[118] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.

[119] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky. The Stanford CoreNLP natural language processing toolkit. In ACL (System Demonstrations), pages 55–60. The Association for Computer Linguistics, 2014.

[120] K. Massoudi, M. Tsagkias, M. de Rijke, and W. Weerkamp. Incorporating query expansion and quality indicators in searching microblog posts. In ECIR, volume 6611 of Lecture Notes in Computer Science, pages 362–367. Springer, 2011.

[121] C. Matuszek, J. Cabral, M. J. Witbrock, and J. DeOliveira. An introduction to the syntax and content of Cyc. In AAAI Spring Symposium: Formalizing and Compiling Background Knowledge and Its Applications to Knowledge Representation and Question Answering, pages 44–49. AAAI, 2006.

[122] R. McCreadie, C. Macdonald, and I. Ounis. Crowdsourcing Blog Track Top News Judgments at TREC. In Crowdsourcing for Search and Data Mining (CSDM) at WSDM 2011, pages 23–26, 2011.


[123] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: shedding light on the web of documents. In I-SEMANTICS, ACM International Conference Proceeding Series, pages 1–8. ACM, 2011.

[124] D. Metzler, C. Cai, and E. H. Hovy. Structured event retrieval over microblog archives. In HLT-NAACL, pages 646–655. The Association for Computational Linguistics, 2012.

[125] D. Metzler and W. B. Croft. A Markov random field model for term dependencies. In SIGIR, pages 472–479. ACM, 2005.

[126] R. Mihalcea and A. Csomai. Wikify!: linking documents to encyclopedic knowledge. In CIKM, pages 233–242. ACM, 2007.

[127] G. A. Miller. WordNet: A lexical database for English. Commun. ACM, 38(11):39–41, 1995.

[128] E. Minack, R. Paiu, S. Costache, G. Demartini, J. Gaugaz, E. Ioannou, P. Chirita, and W. Nejdl. Leveraging personal metadata for desktop search: The Beagle++ system. J. Web Sem., 8(1):37–54, 2010.

[129] T. M. Mitchell, W. W. Cohen, E. R. H. Jr., P. P. Talukdar, J. Betteridge, A. Carlson, B. D. Mishra, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. A. Platanios, A. Ritter, M. Samadi, B. Settles, R. C. Wang, D. T. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves, and J. Welling. Never-ending learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA., pages 2302–2310, 2015.

[130] S. Mizzaro. Relevance: The whole history. JASIS, 48(9):810–832, 1997.

[131] A. Moffat, W. Webber, and J. Zobel. Strategic system comparisons via targeted relevance judgments. In SIGIR, pages 375–382. ACM, 2007.

[132] A. Moffat and J. Zobel. Rank-biased precision for measurement of retrieval effectiveness. ACM Trans. Inf. Syst., 27(1):2:1–2:27, 2008.

[133] Y. Moshfeghi, M. Matthews, R. Blanco, and J. M. Jose. Influence of timeline and named-entity components on user engagement. In ECIR, volume 7814 of Lecture Notes in Computer Science, pages 305–317. Springer, 2013.

[134] H. Mühleisen and C. Bizer. Web data commons - extracting structured data from two large web corpora. In WWW2012 Workshop on Linked Data on the Web, Lyon, France, 16 April, 2012, 2012.

[135] D. Nadeau. Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision. PhD thesis, University of Ottawa, November 2007.

[136] D. Nadeau, P. D. Turney, and S. Matwin. Unsupervised named-entity recognition: Generating gazetteers and resolving ambiguity. In Canadian Conference on AI, volume 4013 of Lecture Notes in Computer Science, pages 266–277. Springer, 2006.

[137] N. Nakashole, T. Tylenda, and G. Weikum. Fine-grained semantic typing of emerging entities. In ACL (1), pages 1488–1497. The Association for Computer Linguistics, 2013.

[138] Y. Nenov, R. Piro, B. Motik, I. Horrocks, Z. Wu, and J. Banerjee. RDFox: A highly-scalable RDF store. In The Semantic Web - ISWC 2015 - 14th International Semantic Web Conference, Bethlehem, PA, USA, October 11-15, 2015, Proceedings, Part II, pages 3–20, 2015.

[139] T. Neumann and G. Weikum. The RDF-3X engine for scalable management of RDF data. VLDB J., 19(1):91–113, 2010.


[140] R. Neumayer, K. Balog, and K. Nørvåg. On the modeling of entities for ad-hoc entity search in the web of data. In Advances in Information Retrieval - 34th European Conference on IR Research, ECIR 2012, Barcelona, Spain, April 1-5, 2012. Proceedings, pages 133–145, 2012.

[141] K. Nguyen, R. Ichise, and B. Le. Interlinking linked data sources using a domain-independent system. In JIST, volume 7774 of Lecture Notes in Computer Science, pages 113–128. Springer, 2012.

[142] F. Niu, C. Zhang, C. Ré, and J. W. Shavlik. DeepDive: Web-scale knowledge-base construction using statistical learning and inference. In Proceedings of the Second International Workshop on Searching and Integrating New Web Data Sources, Istanbul, Turkey, August 31, 2012, pages 25–28, 2012.

[143] O. Owoputi, B. O’Connor, C. Dyer, K. Gimpel, N. Schneider, and N. A. Smith. Improved part-of-speech tagging for online conversational text with word clusters. In HLT-NAACL, pages 380–390. The Association for Computational Linguistics, 2013.

[144] H. Paulheim. Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web, 8(3):489–508, 2017.

[145] H. Paulheim and C. Bizer. Improving the quality of linked data using statistical distributions. Int. J. Semantic Web Inf. Syst., 10(2):63–86, 2014.

[146] V. Pavlu, S. Rajput, P. B. Golbus, and J. A. Aslam. IR system evaluation using nugget-based test collections. In WSDM, pages 393–402. ACM, 2012.

[147] S. Petrovic, M. Osborne, and V. Lavrenko. Streaming first story detection with application to Twitter. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 2-4, 2010, Los Angeles, California, USA, pages 181–189, 2010.

[148] S. Phuvipadawat and T. Murata. Breaking news detection and tracking in Twitter. In Web Intelligence/IAT Workshops, pages 120–123. IEEE Computer Society, 2010.

[149] L. Pipino, Y. W. Lee, and R. Y. Wang. Data quality assessment. Commun. ACM, 45(4):211–218, 2002.

[150] A. Popescu and M. Pennacchiotti. Detecting controversial events from Twitter. In CIKM, pages 1873–1876. ACM, 2010.

[151] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.

[152] J. Pound, P. Mika, and H. Zaragoza. Ad-hoc object retrieval in the web of data. In WWW, pages 771–780. ACM, 2010.

[153] R. Prokofyev, G. Demartini, and P. Cudré-Mauroux. Effective named entity recognition for idiosyncratic web collections. In WWW, pages 397–408. ACM, 2014.

[154] R. Prokofyev, A. Tonon, M. Luggen, L. Vouilloz, D. E. Difallah, and P. Cudré-Mauroux. SANAPHOR: ontology-based coreference resolution. In The Semantic Web - ISWC 2015 - 14th International Semantic Web Conference, Bethlehem, PA, USA, October 11-15, 2015, Proceedings, Part I, pages 458–473, 2015.

[155] J. R. Quinlan. Learning with continuous classes. In Proceedings of the 5th Australian joint Conference on Artificial Intelligence, volume 92, pages 343–348. Singapore, 1992.


[156] Y. Raimond, T. Ferne, M. Smethurst, and G. Adams. The BBC world service archive prototype. J. Web Sem., 27:2–9, 2014.

[157] N. Ramakrishnan, P. Butler, S. Muthiah, N. Self, R. P. Khandpur, P. Saraf, W. Wang, J. Cadena, A. Vullikanti, G. Korkmaz, C. J. Kuhlman, A. Marathe, L. Zhao, T. Hua, F. Chen, C. Lu, B. Huang, A. Srinivasan, K. Trinh, L. Getoor, G. Katz, A. Doyle, C. Ackermann, I. Zavorin, J. Ford, K. M. Summers, Y. Fayed, J. Arredondo, D. Gupta, and D. Mares. ’Beating the news’ with EMBERS: forecasting civil unrest using open source indicators. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA - August 24 - 27, 2014, pages 1799–1808, 2014.

[158] A. Ritter, S. Clark, Mausam, and O. Etzioni. Named entity recognition in tweets: An experimental study. In EMNLP, pages 1524–1534. ACL, 2011.

[159] A. Ritter, Mausam, O. Etzioni, and S. Clark. Open domain event extraction from Twitter. In KDD, pages 1104–1112. ACM, 2012.

[160] S. E. Robertson and H. Zaragoza. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.

[161] M. Rospocher, M. van Erp, P. Vossen, A. Fokkens, I. Aldabe, G. Rigau, A. Soroa, T. Ploeger, and T. Bogaard. Building event-centric knowledge graphs from news. J. Web Sem., 37-38:132–151, 2016.

[162] T. Sakai. Alternatives to bpref. In SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, July 23-27, 2007, pages 71–78, 2007.

[163] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes Twitter users: real-time event detection by social sensors. In WWW, pages 851–860. ACM, 2010.

[164] M. Sanderson. Test collection based evaluation of information retrieval systems. Foundations and Trends in Information Retrieval, 4(4):247–375, 2010.

[165] M. Sanderson and J. Zobel. Information retrieval system evaluation: effort, sensitivity, and reliability. In SIGIR, pages 162–169. ACM, 2005.

[166] J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman, and J. Sperling. TwitterStand: news in tweets. In GIS, pages 42–51. ACM, 2009.

[167] A. Saxena, A. Jain, O. Sener, A. Jami, D. K. Misra, and H. S. Koppula. RoboBrain: Large-scale knowledge engine for robots. CoRR, abs/1412.0691, 2014.

[168] M. Schmachtenberg, C. Bizer, and H. Paulheim. Adoption of the linked data best practices in different topical domains. In Semantic Web Conference (1), volume 8796 of Lecture Notes in Computer Science, pages 245–260. Springer, 2014.

[169] F. Scholer, D. Kelly, W. Wu, H. S. Lee, and W. Webber. The effect of threshold priming and need for cognition on relevance calibration and assessment. In SIGIR, pages 623–632. ACM, 2013.

[170] W. Shen, J. Wang, P. Luo, and M. Wang. LIEGE: link entities in web lists with knowledge base. In KDD, pages 1424–1432. ACM, 2012.

[171] A. Signorini, A. Segre, and P. Polgreen. The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic. PLoS ONE, 6(5):1–10, 2011.


[172] M. D. Smucker, G. Kazai, and M. Lease. Overview of the TREC 2012 crowdsourcing track. In TREC, volume Special Publication 500-298. National Institute of Standards and Technology (NIST), 2012.

[173] F. M. Suchanek, S. Abiteboul, and P. Senellart. PARIS: probabilistic alignment of relations, instances, and schema. PVLDB, 5(3):157–168, 2011.

[174] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: a core of semantic knowledge. In WWW, pages 697–706. ACM, 2007.

[175] T. P. Tanon, D. Vrandečić, S. Schaffert, T. Steiner, and L. Pintscher. From Freebase to Wikidata: The great migration. In Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal, Canada, April 11-15, 2016, pages 1419–1428, 2016.

[176] A. Thalhammer, N. Lasierra, and A. Rettinger. LinkSUM: Using link analysis to summarize entity data. In Web Engineering - 16th International Conference, ICWE 2016, Lugano, Switzerland, June 6-9, 2016. Proceedings, pages 244–261, 2016.

[177] A. Tonon, M. Catasta, G. Demartini, and P. Cudré-Mauroux. Fixing the domain and range of properties in linked data by context disambiguation. In Proceedings of the Workshop on Linked Data on the Web, LDOW 2015, co-located with the 24th International World Wide Web Conference (WWW 2015), Florence, Italy, May 19, 2015.

[178] A. Tonon, M. Catasta, G. Demartini, P. Cudré-Mauroux, and K. Aberer. TRank: Ranking entity types using the web of data. In International Semantic Web Conference (1), volume 8218 of Lecture Notes in Computer Science, pages 640–656. Springer, 2013.

[179] A. Tonon, M. Catasta, R. Prokofyev, G. Demartini, K. Aberer, and P. Cudré-Mauroux. Contextualized ranking of entity types based on knowledge graphs. J. Web Sem., 37-38:170–183, 2016.

[180] A. Tonon, P. Cudré-Mauroux, A. Blarer, V. Lenders, and B. Motik. ArmaTweet: Detecting events by semantic tweet analysis. In Proceedings of the Extended Semantic Web Conference (ESWC), 2017.

[181] A. Tonon, G. Demartini, and P. Cudré-Mauroux. Combining inverted indices and structured search for ad-hoc object retrieval. In SIGIR, pages 125–134. ACM, 2012.

[182] A. Tonon, G. Demartini, and P. Cudré-Mauroux. Pooling-based continuous evaluation of information retrieval systems. Inf. Retr. Journal, 18(5):445–472, 2015.

[183] A. Tonon, V. Felder, D. E. Difallah, and P. Cudré-Mauroux. VoldemortKG: Mapping schema.org and web entities to linked open data. In The Semantic Web - ISWC 2016 - 15th International Semantic Web Conference, Kobe, Japan, October 17-21, 2016, Proceedings, Part II, pages 220–228, 2016.

[184] G. Töpper, M. Knuth, and H. Sack. DBpedia ontology enrichment for inconsistency detection. In I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS ’12, Graz, Austria, September 5-7, 2012, pages 33–40, 2012.

[185] G. Tummarello, R. Cyganiak, M. Catasta, S. Danielczyk, R. Delbru, and S. Decker. Sig.ma: Live views on the web of data. J. Web Sem., 8(4):355–364, 2010.

[186] T. Tylenda, M. Sozio, and G. Weikum. Einstein: physicist or vegetarian? Summarizing semantic type graphs for knowledge discovery. In WWW (Companion Volume), pages 273–276. ACM, 2011.

[187] D. Vallet and H. Zaragoza. Inferring the most important types of a query: a semantic approach. In SIGIR, pages 857–858. ACM, 2008.


[188] O. Vallis, J. Hochenbaum, and A. Kejariwal. A novel technique for long-term anomaly detection in the cloud. In HotCloud. USENIX Association, 2014.

[189] S. Verberne, M. van der Heijden, M. Hinne, M. Sappelli, S. Koldijk, E. Hoenkamp, and W. Kraaij. Reliability and validity of query intent assessments. JASIST, 64(11):2224–2237, 2013.

[190] J. Völker and M. Niepert. Statistical schema induction. In The Semantic Web: Research and Applications - 8th Extended Semantic Web Conference, ESWC 2011, Heraklion, Crete, Greece, May 29-June 2, 2011, Proceedings, Part I, pages 124–138, 2011.

[191] E. Voorhees and D. Harman. Overview of the Seventh Text REtrieval Conference (TREC-7). NIST Special Publication 500-242, 1998.

[192] E. M. Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness. In SIGIR, pages 315–323. ACM, 1998.

[193] E. M. Voorhees. The philosophy of information retrieval evaluation. In CLEF, volume 2406 of Lecture Notes in Computer Science, pages 355–370. Springer, 2001.

[194] E. M. Voorhees and D. Harman. Overview of the eighth text retrieval conference (TREC-8). In TREC, volume Special Publication 500-246. National Institute of Standards and Technology (NIST), 1999.

[195] D. Vrandečić and M. Krötzsch. Wikidata: A free collaborative knowledgebase. Commun. ACM, 57(10):78–85, Sept. 2014.

[196] J. Waitelonis, N. Ludwig, M. Knuth, and H. Sack. WhoKnows? Evaluating linked data heuristics with a quiz that cleans up DBpedia. Interact. Techn. Smart Edu., 8(4):236–248, 2011.

[197] H. Wang, T. Tran, C. Liu, and L. Fu. Lightweight integration of IR and DB for scalable hybrid search with integrated ranking support. J. Web Sem., 9(4):490–503, 2011.

[198] W. Webber and L. A. F. Park. Score adjustment for correction of pooling bias. In SIGIR, pages 444–451. ACM, 2009.

[199] C. Welty, J. W. Murdock, A. Kalyanpur, and J. Fan. A comparison of hard filters and soft evidence for answer typing in Watson. In International Semantic Web Conference (2), volume 7650 of Lecture Notes in Computer Science, pages 243–256. Springer, 2012.

[200] J. Weng and B. Lee. Event detection in Twitter. In ICWSM. The AAAI Press, 2011.

[201] S. E. Whang and H. Garcia-Molina. Joint entity resolution on multiple datasets. VLDB J., 22(6):773–795, 2013.

[202] M. Wylot, J. Pont, M. Wisniewski, and P. Cudré-Mauroux. dipLODocus[RDF]: short and long-tail RDF analytics for massive webs of data. In International Semantic Web Conference (1), volume 7031 of Lecture Notes in Computer Science, pages 778–793. Springer, 2011.

[203] L. Yao, S. Riedel, and A. McCallum. Universal schema for entity type prediction. In AKBC@CIKM, pages 79–84. ACM, 2013.

[204] E. Yilmaz and J. A. Aslam. Estimating average precision with incomplete and imperfect judgments. In CIKM, pages 102–111. ACM, 2006.

[205] E. Yilmaz, E. Kanoulas, and J. A. Aslam. A simple and efficient sampling method for estimating AP and NDCG. In SIGIR, pages 603–610. ACM, 2008.


[206] H. Zaragoza, H. Rode, P. Mika, J. Atserias, M. Ciaramita, and G. Attardi. Ranking very many typed entities on Wikipedia. In CIKM, pages 1015–1018. ACM, 2007.

[207] J. Zhang, J. Tang, and J. Li. Expert finding in a social network. In DASFAA, volume 4443 of Lecture Notes in Computer Science, pages 1066–1069. Springer, 2007.

[208] L. Zhao, F. Chen, J. Dai, T. Hua, C.-T. Lu, and N. Ramakrishnan. Unsupervised spatial event detection in targeted domains with applications to civil unrest modeling. PLoS ONE, 9(10):1–12, 2014.

[209] N. Zhiltsov, A. Kotov, and F. Nikolaev. Fielded sequential dependence model for ad-hoc entity retrieval in the web of data. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, August 9-13, 2015, pages 253–262, 2015.

[210] X. Zhou and L. Chen. Event detection over Twitter social media streams. VLDB J., 23(3):381–400, 2014.

[211] J. Zobel. How reliable are the results of large-scale information retrieval experiments? In SIGIR, pages 307–314. ACM, 1998.
