Efficiently Charting

Efficiently Charting RDF Oren Kalinsky Oren Mishali Aidan Hogan Technion Technion University of Chile Haifa, Israel Haifa, Israel Santiago, Chile Yoav Etsion Benny Kimelfeld Technion Technion Haifa, Israel Haifa, Israel ABSTRACT Due to their scale and diversity, a major challenge faced We propose a visual query language for interactively explor- when considering a knowledge graph is to understand what ing large-scale knowledge graphs. Starting from an overview, content it contains: what sorts of entities it describes, what the user explores bar charts through three interactions: class sorts of relations are represented, how extensive the cov- expansion, property expansion, and subject/object expansion. erage of particular domains is, etc. Prominent knowledge A major challenge faced is performance: a state-of-the-art graphs, such as DBpedia [17], Freebase [18], Wikidata [75], SPARQL engine may require tens of minutes to compute the contain in the order of tens of millions of nodes and bil- multiway join, grouping and counting required to render lions of edges represented using thousands of classes and a bar chart. A promising alternative is to apply approxi- properties, spanning innumerable different domains. While mation through online aggregation, trading precision for a variety of approaches have been proposed to summarize performance. However, state-of-the-art online aggregation or profile the content of such graphs [31], the general trend algorithms such as Wander Join have two limitations for our is towards either computing statistics and summaries offline, exploration scenario: (1) a high number of rejected paths or relying on off-the-shelf query engines. slows the convergence of the count estimations, and (2) no In this paper, we propose a conceptual approach and tech- unbiased estimator exists for counts under the distinct op- niques for interactive exploration of large-scale knowledge erator. We thus devise a specialized algorithm for online graphs through a visual query language. This query language aggregation that augments Wander Join with exact partial captures user interactions that follow Shneiderman’s princi- computations to reduce the number of rejected paths en- ple for effective data visualization and exploration: “overview countered, as well as a novel estimator that we prove to be first, zoom and filter, then details-on-demand”[69]. The result unbiased in the case of the distinct operator. In an experimen- of a query is a bar chart over a set of focus nodes that are tal study with random interactions exploring two large-scale defined iteratively by the user via three interactions: class ex- knowledge graphs, our algorithm shows a clear reduction in pansion, which focuses on the sub-classes of a selected class error with respect to computation time versus Wander Join. bar, property expansion, which focuses on the properties defined on instances of a class, and subject/object expansion, 1 INTRODUCTION which focuses on entities in the source/target of a given property. At each stage, the focus nodes of the current bar A variety of prominent knowledge graphs have emerged in chart can be filtered by a search condition. Each interaction recent years, including DBpedia [17], Freebase [18], Wiki- constitutes a visual exploration step, with the sequence of data [75], and YAGO [44] covering multiple domains, Linked- arXiv:1811.10955v2 [cs.DB] 26 Jan 2019 interactions captured by the query language. GeoData [13] for geographic data, LinkedMDB [26] for movie Given the size and diversity of prominent knowledge information, and LinkedSDMX [23] for financial and geopo- graphs, the number of (intermediate) results that can be litical data. A number of companies have also announced the generated by queries, and the goal of supporting interactive creation of proprietary knowledge graphs to power a vari- exploration, a major challenge faced is that of performance. ety of end-user applications, including Google,1 Microsoft,2 In preliminary experiments with Virtuoso [32]—a state-of- Amazon3 and eBay4, among others. the-art SPARQL query engine—we found, for example, that 1https://googleblog.blogspot.com/2012/05/introducing-knowledge-graph- computing the distribution of properties over all nodes in things-not.html DBpedia takes over 5 minutes; such runtimes preclude the 2https://blogs.bing.com/search-quality-insights/2017-07/bring-rich- possibility of interactive exploration. knowledge-of-people-places-things-and-local-businesses-to-your-apps To face the critical performance problem, we investigate 3https://blog.aboutamazon.com/innovation/making-search-easier 4https://www.ebayinc.com/stories/news/cracking-the-code-on- two orthogonal approaches. First, we explore the deploy- conversational-commerce/ ment of a query engine from the recent breed of worst-case optimal join algorithms [57], in order to avoid an explosion Exploration Tools of intermediate results when processing multiway joins over A variety of approaches have been proposed in recent years large graphs; for these purposes, we select the Cache Trie for exploring and visualizing graph-structured data [28]. Join algorithm [47] to evaluate queries. Second, with the intu- ition that precise counts are not always required, we explore Faceted Browsing: Among the most popular approaches online aggregation algorithms [42] that trade precision for that have been studied for exploring knowledge graphs is performance, computing approximate counts at a fraction of that of faceted browsing, where users incrementally add re- the cost observed even in the worst-case optimal setting; for strictions – called facets – to restrict the current results [72]. these purposes, we select the Wander-Join algorithm [51]. Early works mainly focused on smaller, domain-specific In essence, Wander Join applies a random walk between graphs, among which we mention the mSpace system [68] database tuples that (jointly) match the join query, and upon in the multimedia domain, BrowseRDF [60] in the crime do- termination, updates an estimator of the aggregate function. main, /facet [43] and Ontogator [52] in the art domain, or Ultimately, inspired by the work of Zhao et al. [78], we more recently, ReVeaLD [48] in the biomedical domain, and conclude that these two approaches are complementary. We Hippalus [71] in the zoology domain. Such works typically offer an algorithm that combines online aggregation with have dealt with smaller-scale and/or homogeneous graphs exact computation. The general idea to apply the random with few classes and properties, focusing on usability and walk of Wander Join, and at each step, consider replacing expressivity rather than issues of scale or performance. the remaining walk with a precise computation of the space However, with the growth of large-scale multi-domain of possible suffixes, this time using Cached Trie Join. This knowledge graphs like DBpedia, Freebase or Wikidata, a consideration is done via an estimate of selectivity. The esti- number of faceted browsers have been proposed that sup- mator needs to be updated accordingly, and we prove that port thousands of classes and properties and upwards of a it remains unbiased. Furthermore, we extend our algorithm hundred million edges, as needed for such datasets. Among to estimate counts in the presence of the distinct operator, these systems, we can mention Neofonie [40], Rhizomer [19], which is crucial to our exploration use case. We call the SemFacet [10], Semplore [77] and Sparklis [33] for explor- resulting algorithm Audit Join, and prove that it provides ing DBpedia; Broccoli [14] and Parallax [46] for exploring unbiased estimators of counts, with and without the distinct Freebase; and GraFa for exploring Wikidata [54]. Of these operator. In experiments that evaluate randomly-generated systems, many do not present runtime performance evalua- exploration queries over two knowledge graphs, we show tion [19, 40, 46, 76], delegate query processing to a general- that our algorithm dramatically reduces error with respect purpose query engine [19, 33, 41, 46], apply a manual se- to the computation time when compared with Wander Join. lection of useful facets or a subset of data [10, 40], and/or otherwise rely on a materialization approach to cache meta- Contributions data (such as counts) [14, 19, 54]. Compared to such works, we focus on scalability and performance; more concretely, we Our contributions are summarized as follows. propose a novel query engine specifically optimized for the types of online aggregation queries needed by such systems. • We propose a formal model of an exploration approach over knowledge graphs. Graph Profiling: While faceted browsing aims to allow • We describe a system implementation of the explo- users to express and answer specific questions in an intuitive ration model. manner, other works have focused on the problem of sum- • We devise Audit Join—a specialized online-aggregation marizing the content of a large knowledge graph: to provide algorithm for the backend of our proposed model, and users insights as to what the graph does or does not contain, prove that it produces unbiased estimations of counts. what are the relationships between entities, what are the • We describe an experimental study of performance most common types, and so forth [31]. over random explorations, showing the benefits of One approach to provide users with an overview of a Audit Join over the state of the art. knowledge graph is to compute a graph summary or quotient graph [24], which groups nodes

Load more