Exploring the Data Wilderness Through Examples
Total Page:16
File Type:pdf, Size:1020Kb
Exploring the Data Wilderness through Examples Davide Mottin Matteo Lissandrini Aarhus University Aalborg University [email protected] [email protected] Yannis Velegrakis Themis Palpanas Utrecht University Paris Descartes University [email protected] [email protected] Machine Machine ABSTRACT Interface learning Exploration is one of the primordial ways to accrue knowl- edge about the world and its nature. As we accumulate, Example-based approaches mostly automatically, data at unprecedented volumes and Middleware speed, our datasets have become complex and hard to un- derstand. In this context exploratory search provides a handy Data structures and hardware tool for progressively gather the necessary knowledge by starting from a tentative query that hopefully leads to an- Relational Graph Text swers at least partially relevant and that can provide cues about the next queries to issue. Recently, we have witnessed a rediscovery of the so-called example-based methods, in which Figure 1: A view of example-based data exploration. the user or the analyst circumvent query languages by using examples as input. This shift in semantics has led to a num- ber of methods receiving as query a set of example members 1 SCOPE OF THE TUTORIAL of the answer set. The search system then infers the entire Data exploration includes methods to eciently extract knowl- answer set based on the given examples and any additional edge from data, even if we do not know what exactly we are information provided by the underlying database. In this tu- looking for, nor how to precisely describe our needs. In such torial, we present an excursus over the main example-based exploration, the user progressively acquires the knowledge methods for exploratory analysis, show techniques tailored by issuing a sequence of generic queries to gather intelli- to dierent data types, and provide a unifying view of the gence about the data. However, the existing body of work problem. We show how dierent data types require dier- in data analysis, data visualization, and predictive models ent techniques, and present algorithms that are specically assumes the user is willing to pose several queries to the designed for relational, textual, and graph data. underlying database in order to progressively gather the required information. This assumption stems from the intu- ACM Reference Format: ition that the user, being accustomed to data analysis, can Davide Mottin, Matteo Lissandrini, Yannis Velegrakis, and Themis Palpanas. 2019. Exploring the Data Wilderness through Examples. more intuitively dig into the data. In 2019 International Conference on Management of Data (SIGMOD Recently, examples became a popular proxy for data explo- ’19), June 30-July 5, 2019, Amsterdam, Netherlands. ACM, New York, ration. One of the earliest attempts to bring examples as a NY, USA, 5 pages. https://doi.org/10.1145/3299869.3314031 query method is query-by-example [51]. The main idea was to help the user in the query formulation, allowing them to Permission to make digital or hard copies of all or part of this work for specify the results in terms of templates for tuples, i.e., exam- personal or classroom use is granted without fee provided that copies ples. Nowadays, examples are not anymore a mere template, are not made or distributed for prot or commercial advantage and that copies bear this notice and the full citation on the rst page. Copyrights but rather the representative of the intended results the user for components of this work owned by others than the author(s) must would like to have. These example-based approaches are fun- be honored. Abstracting with credit is permitted. To copy otherwise, or damentally dierent from the initial query-by-example idea, republish, to post on servers or to redistribute to lists, requires prior specic and have are successfully applied to relational [10, 36, 43], permission and/or a fee. Request permissions from [email protected]. textual [7, 48, 50], and graph [12, 17, 25] data. SIGMOD ’19, June 30-July 5, 2019, Amsterdam, Netherlands We note that the exibility of examples does not compro- © 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM. mise the richness of the results, yet, it overcomes the am- ACM ISBN 978-1-4503-5643-5/19/06...$15.00 biguity of simple keyword searches, which is traditionally https://doi.org/10.1145/3299869.3314031 studied in information retrieval. On the other hand, while data exploration techniques (Idreos et al.: Overview of data in recent vision papers [45, 47]. We will discuss the following exploration techniques, tutorial in SIGMOD 2015) assume major challenges. the user is willing to pose several exploratory queries, the Adaptivity: current data management technologies and • use of examples requires almost no supervision from the user systems do not take into account individual user preferences, perspective, making example-based methods a more palat- and tend to optimize certain kind of queries and respond able choice for novice users, as well as for practitioners. This slowly to others. new functionality can empower existing data exploration Explanation: Data management systems usually include • methods with a complementary tool: whenever a query is little or no explanation for the results of a query. In example- too complex to be expressed with a query language, such as based methods, in which the user query is only implicit, this SQL, examples represent a natural alternative. In this respect requirement is even more prominent. (cf. Figure 1) example-based exploration is a middle ground Interactivity: Current prototypes show the advantage of • between the user interface, and the middleware/hardware, example-based methods with regards to visualization tech- enabling new functionalities for the former and allowing niques. However, in order to achieve the real-time, interactive more natural exploitation of the latter. Moreover, the use of performance needed by visualization tools, the algorithms examples has been demonstrated to be very eective in data should incorporate intelligent and ecient techniques for visualization [24, 39]. navigating through the search space. In this tutorial, we aim at describing the main develop- Finally, we conclude the tutorial with remarks about the ments of examples as an expressive and powerful method current state of aairs and engage the audience in a discus- for exploratory data analysis. sion about their experiences and challenges in this area. The rst part of the tutorial introduces the broad topic of data exploration, highlighting the hardness of query lan- 2 TUTORIAL OUTLINE guages for simple users and advocating the need for dierent In this tutorial, we provide a detailed overview of the new query methods. We will introduce the example-based meth- area of example-based methods for exploratory search, sur- ods as exible delegates for more complex queries that would veying the relevant state-of-the-art techniques. We also present otherwise need to be expressed through a very complex tra- future directions discussing various machine learning tech- ditional query. In this part, we will discuss various cases, niques used to infer user preferences interactively. where queries cannot be expressed in declarative languages Next, we report the summary of the outline. We also provide without requiring complex constructs. We will also present an extended description of example-based approaches in an expressive formulation of example-based approaches as Section 2.1, and machine learning approaches in Section 2.2. seeking a similarity among objects. I. Introduction, motivation, and formulation The second part of the tutorial discusses the current main Why example-based approaches are important techniques for relational, textual, and graph data. In this part, • – Usefulness of exploratory analysis we will present the algorithms, show how they work, and – Main characteristics of exploratory analysis demonstrate their ability to (conceptually) solve complex – Example-based methods for exploratory analysis tasks (e.g., data integration, community detection) from sim- – Use cases of failing keyword and declarative queries ple examples. We will also highlight the dierences among – Applications in current database systems and data data types, focusing on the scalability perspective, present- analysis ing the motivations and drawing parallels among methods Connection to data exploration for dierent data types. • Problem formulation as similarity discovery • The third part of the tutorial focuses on the latest devel- II. Example-based approaches opments of machine learning to progressively discover user Query-by-example: [51] • intention. We will introduce the general area of online learn- Example methods in relational databases: • ing, some early methods based on relevance feedback [16], – Reverse engineering of SQL queries [19, 28, 31, 36, 42, and show some recent applications of multi-armed bandits 43, 46, 49]; theories, that include active search. – Schema mapping [1, 6, 13]; Challenges and open research questions. The last part of – Data cleaning: entity matching [38], data repairing [15]; the tutorial is dedicated to the challenges and open research – Example-based systems [8, 34, 35]. questions. Exploratory search based on examples is rapidly Example methods in textual data: • attracting attention and getting traction, though, the support – Exploring Web documents as examples [7, 50]; for such techniques in modern data management systems is – Example based