Rich Explanations for Query Answers Using Join Graphs

Putting Things into Context: Rich Explanations for Query Answers using Join Graphs

Chenjie Li (IIT), Zhengjie Miao (Duke University), Qitian Zeng (IIT), Boris Glavic (Illinois Institute of Technology), Sudeepa Roy (Duke University)

arXiv:2103.15797v1 [cs.DB] 29 Mar 2021

ABSTRACT

In many data analysis applications, there is a need to explain why a surprising or interesting result was produced by a query. Previous approaches to explaining results have directly or indirectly used data provenance (input tuples contributing to the result(s) of interest), which is limited by the fact that relevant information for explaining an answer may not be fully contained in the provenance. We propose a new approach for explaining query results by augmenting provenance with information from other related tables in the database. We develop a suite of optimization techniques, and demonstrate experimentally using real datasets and through a user study that our approach produces meaningful results by efficiently navigating the large search space of possible explanations.

ACM Reference Format:
Chenjie Li, Zhengjie Miao, Qitian Zeng, Boris Glavic, and Sudeepa Roy. 2021. Putting Things into Context: Rich Explanations for Query Answers using Join Graphs. In SIGMOD ’21: ACM Symposium on Neural Gaze Detection, June 03–05, 2018, Woodstock, NY. ACM, New York, NY, USA, 32 pages. https://doi.org/10.1145/1122445.1122456

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SIGMOD ’21, June 03–05, 2021, Woodstock, NY
© 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-XXXX-X/18/06...$15.00
https://doi.org/10.1145/1122445.1122456

1 INTRODUCTION

Today’s world is dominated by data. Recent advances in complex analytics enable businesses, governments, and scientists to extract value from their data. However, results of such operations are often hard to interpret and debugging such applications is challenging, motivating the need to develop approaches that can automatically interpret and explain results to data analysts in a meaningful way. Data provenance [14, 21], which has been studied for several decades, is an immediate form of explanation that describes how an answer is derived from input data. However, provenance is often insufficient for unearthing interesting insights from the data that led to a surprising result, especially for aggregate query answers. In the last few years, several “explanation” methods have emerged in the data management literature [7, 19, 34, 41, 42, 47, 50, 52] that return insightful answers in response to questions from a user. However, real world data often exhibits complex correlations and inter-relationships that connect the provenance of a query with data that has not been accessed by the query. Current approaches do not take these crucial inter-relationships into account. Thus, the explanations they produce may lack important contextual information that can aid the user in developing a deeper understanding of the results. We illustrate how to use context to explain a user’s question using data extracted from the official website of the NBA [35].

Example 1. Consider a simplified NBA database with the following relations (the keys are underlined, the full schema has 11 relations). Some tuples from each relation are shown in Figure 1. Each team participating in a game can use multiple lineups consisting of five players. Home refers to the home team in a game.

• Game(year, month, day, home, away, home_pts, away_pts, winner, season): participating teams and the winning team.
• PlayerGameScoring(player, year, month, day, home, pts): the points each player scored in each game he played in.
• LineupPerGameStats(lineupid, year, month, day, home, mp): the minutes played by each lineup.
• LineupPlayer(lineupid, player): players for a lineup.

Suppose we are interested in exploring the winning records of the team GSW (Golden State Warriors) in every season. The following query Q1 returns this information:

    SELECT winner as team, season, count(*) as win
    FROM Game g WHERE winner = 'GSW' GROUP BY winner, season

Figure 1e shows the number of games team GSW won for each season. GSW made history in the NBA in the 2015-16 season to be the team that won the most games in a single season. Observe that team GSW improved its performance significantly from season 2012-13 (t1) to season 2015-16 (t2). Such a drastic increase in a relatively short period of time naturally raises the question of what changed between these 2 seasons (denoted as the user question UQ1 in Figure 1f). Note that only the Game table (shown in Figure 1a) was accessed by Q1. This table provides the user with information about each game such as the name of the opponent team or whether GSW was the home team or not. However, such information is not enough for understanding why GSW won or lost more games than in the other seasons, since in each season a team would play the same number of games and home games, and roughly the same number of times against each opponent.

(a) Game Table

         year  mon  day  home  away  home_pts  away_pts  winner  season
    g1 → 2013  01   02   MIA   DAL   119       109       MIA     2012-13
    g2 → 2012  12   05   DET   GSW   97        104       GSW     2012-13
    g3 → 2015  10   27   GSW   NOP   111       95        GSW     2015-16
    g4 → 2014  01   05   GSW   WAS   96        112       GSW     2013-14
    g5 → 2016  01   22   GSW   IND   122       110       GSW     2015-16

(b) LineupPlayer Table

    lineupid  player
    58420     K. Thompson
    58420     D. Green
    13507     S. Battier
    13507     L. James
    67949     D. Green

(c) PlayerGameScoring Table

         player      year  mon  day  home  pts
    p1 → S.Curry     2012  12   05   DET   22
    p2 → S.Curry     2015  10   27   GSW   40
    p3 → S.Curry     2016  01   22   GSW   39
    p4 → K.Thompson  2012  12   05   DET   27
    p5 → K.Thompson  2016  01   22   DET   18
    p6 → D.Green     2012  12   05   DET   2

(d) LineupPerGameStats Table

    lineupid  year  mon  day  home  mp
    13507     2013  11   09   MIA   4.30
    77727     2012  12   12   MIA   14.70
    58420     2015  11   07   SAC   10.30
    58482     2015  11   07   SAC   11.10
    58420     2014  12   08   MIN   11.70

(e) Result of Q1

         team  season   win
    t1 → GSW   2012-13  47
         GSW   2013-14  51
         GSW   2014-15  67
    t2 → GSW   2015-16  73
         GSW   2016-17  67

(f) User question UQ1

    UQ1: Why did GSW win 73 games in season 2015-16 (t2) compared to 47 games in 2012-13.

Figure 1: Input/output tables, and the user question for Example 1.

In this paper, we present an approach that answers questions like UQ1 (Figure 1f). Our approach produces insightful explanations [...]

    GSW won more games in season 2015-16 because Player S. Curry scored ≥ 23 points in 58 out of 73 games in 2015-16 compared to 21 out of 47 games in 2012-13.

Given this explanation, the user can infer that S. Curry was one of the key contributors for the improvement of GSW’s winning record since his points significantly improved in the 2015-16 season. Similarly, the explanation in Figure 2c can be interpreted as:

    GSW won more games in season 2015-16 because Player D. Green and Player K. Thompson’s on-court minutes together were ≥ 19 minutes in 70 out of 73 games in the 2015-16 season compared to only 2 out of 47 games in 2012-13 season.

This implies that Green and Thompson’s increase of playing time together might have helped improve GSW’s record. We will discuss other example queries, user questions, and explanations returned by our approach using the NBA and the MIMIC hospital records dataset [26] in Section 6.

(a) Join graph Ω1 + pattern Φ1 for UQ1: Star Player

    l_edge(e1) = (PT.year=P.year ∧ PT.month=P.month ∧ PT.day=P.day ∧ PT.home=P.home)

(b) Legend

(c) Join graph Ω2 with pattern Φ2 for UQ1: Pair of players

    l_edge(e1) = (PT.year=LS.year ∧ PT.month=LS.month ∧ PT.day=LS.day ∧ PT.home=LS.home)
    l_edge(e2) = (LS.lineupid = L1.lineupid)
    l_edge(e3) = (L1.lineupid = L2.lineupid)

Figure 2: Explanations for UQ1

Our Contributions. In this paper, we develop CaJaDE (Context-Aware Join-Augmented Deep Explanations), the first system that automatically augments provenance data with related contextual information from other tables to produce more informative summaries of the difference between the values of two tuples in the answer of an aggregate query, or, the high/low value of a single outlier tuple. We make the following contributions in this paper.

(1) Join-augmented provenance summaries as explanations. We propose the notion of join-augmented provenance and use summaries of augmented provenance as explanations. The join-augmented provenance is generated based on a join graph that encodes how the provenance should be joined with tables that provide context. We use patterns, i.e., conjunctions of equality and inequality predicates, to summarize the difference between the join-augmented provenance of two tuples t1, t2 from a query’s output selected by the user’s question.
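To make the notion of join-augmented provenance more concrete, the following minimal sketch replays Example 1 on the handful of sample tuples from Figure 1 using Python's built-in sqlite3 module. It is only an illustration under stated assumptions, not the CaJaDE implementation: the join condition is hard-coded to the edge label shown in Figure 2a, the pattern (player = S.Curry, pts ≥ 23) is supplied by hand rather than mined, and the names APT_QUERY and pattern_support are hypothetical. An ORDER BY clause is added to the query from Example 1 only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Game(year INT, month INT, day INT, home TEXT, away TEXT,
                  home_pts INT, away_pts INT, winner TEXT, season TEXT);
CREATE TABLE PlayerGameScoring(player TEXT, year INT, month INT, day INT,
                               home TEXT, pts INT);
""")

# Sample tuples copied as-is from Figure 1a and Figure 1c.
conn.executemany("INSERT INTO Game VALUES (?,?,?,?,?,?,?,?,?)", [
    (2013, 1, 2, "MIA", "DAL", 119, 109, "MIA", "2012-13"),
    (2012, 12, 5, "DET", "GSW", 97, 104, "GSW", "2012-13"),
    (2015, 10, 27, "GSW", "NOP", 111, 95, "GSW", "2015-16"),
    (2014, 1, 5, "GSW", "WAS", 96, 112, "GSW", "2013-14"),
    (2016, 1, 22, "GSW", "IND", 122, 110, "GSW", "2015-16"),
])
conn.executemany("INSERT INTO PlayerGameScoring VALUES (?,?,?,?,?,?)", [
    ("S.Curry", 2012, 12, 5, "DET", 22),
    ("S.Curry", 2015, 10, 27, "GSW", 40),
    ("S.Curry", 2016, 1, 22, "GSW", 39),
    ("K.Thompson", 2012, 12, 5, "DET", 27),
    ("K.Thompson", 2016, 1, 22, "DET", 18),
    ("D.Green", 2012, 12, 5, "DET", 2),
])

# The query from Example 1 (ORDER BY added for a deterministic result order).
Q1 = """
SELECT winner AS team, season, count(*) AS win
FROM Game g WHERE winner = 'GSW'
GROUP BY winner, season ORDER BY season
"""
q1_result = conn.execute(Q1).fetchall()
wins = {season: win for _team, season, win in q1_result}

# Provenance of Q1 (the GSW wins in Game) joined with PlayerGameScoring on
# (year, month, day, home), mirroring the edge label in Figure 2a.
APT_QUERY = """
SELECT g.season, g.year, g.month, g.day, g.home, p.player, p.pts
FROM Game g JOIN PlayerGameScoring p
  ON g.year = p.year AND g.month = p.month
 AND g.day = p.day AND g.home = p.home
WHERE g.winner = 'GSW'
"""
apt = conn.execute(APT_QUERY).fetchall()

def pattern_support(season, player, min_pts):
    """Return (wins matching the pattern, total wins) for one season.

    Hypothetical helper: the pattern is 'player = ? AND pts >= ?'.
    """
    games = {
        (year, month, day, home)
        for s, year, month, day, home, pl, pts in apt
        if s == season and pl == player and pts >= min_pts
    }
    return len(games), wins[season]

for season in ("2012-13", "2015-16"):
    print(season, pattern_support(season, "S.Curry", 23))
```

On these few sample tuples the pattern covers 0 of 1 GSW wins in 2012-13 and 2 of 2 in 2015-16, which has the same shape as the "58 out of 73 compared to 21 out of 47" explanation computed on the full data. Counting distinct games rather than joined rows matters here, because a single game joins with one row per player.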
