Finding Pareto Optimal Groups: Group-Based Skyline

Finding Pareto Optimal Groups: Groupbased Skyline Jinfei Liu Li Xiong Jian Pei Emory University Emory University Simon Fraser University [email protected] [email protected] [email protected] Jun Luo Haoyu Zhang Lenovo; CAS Emory University [email protected] [email protected] ABSTRACT i, p[i] < p0[i] (1 ≤ i ≤ d). Given the set of points P , the Skyline computation, aiming at identifying a set of skyline skyline is defined as the set of points that are not dominated points that are not dominated by any other point, is par- by any other point in P . In other words, the skyline rep- ticularly useful for multi-criteria data analysis and decision resents the best points or Pareto optimal solutions from the making. Traditional skyline computation, however, is in- dataset since the points within the skyline cannot dominate adequate to answer queries that need to analyze not only each other. individual points but also groups of points. To address this price gap, we generalize the original skyline definition to the nov- hotel distance price p1 4 400 400 1 el group-based skyline (G-Skyline), which represents Pareto 2 p p 24 380 p2 3 optimal groups that are not dominated by other groups. In p 14 340 p3 4 300 p 36 300 p4 order to compute G-Skyline groups consisting of k points 5 5 p p 26 280 p6 efficiently, we present a novel structure that represents the p6 8 260 200 p7 p7 40 200 points in a directed skyline graph and captures the domi- p8 8 p 20 180 p9 p10 nance relationships among the points based on the first k p9 34 140 100 10 skyline layers. We propose efficient algorithms to compute p 28 120 p11 the first k skyline layers. We then present two heuristic al- p11 16 60 10 20 30 40 gorithms to efficiently compute the G-Skyline groups: the distance to the destination (a) (b) point-wise algorithm and the unit group-wise algorithm, us- Figure 1: A skyline example of hotels. ing various pruning strategies. The experimental results on the real NBA dataset and the synthetic datasets show that Figure 1(a) illustrates a dataset P = fp1; p2; :::; p11g, each G-Skyline is interesting and useful, and our algorithms are representing a hotel with two attributes: the distance to efficient and scalable. the destination and the price. Figure 1(b) shows the corre- sponding points in the two dimensional space where the x 1. INTRODUCTION and y coordinates correspond to the attributes of distance to the destination and price, respectively. We can see that Skyline, also known as Maxima in computational geom- p (14; 340) dominates p (24; 380) as an example of domi- etry or Pareto in business management field, is important 3 2 nance. The skyline of the dataset contains p , p , and p . for many applications involving multi-criteria decision mak- 1 6 11 Suppose the organizers of a conference need to reserve one ing. The skyline of a set of multi-dimensional data points hotel considering both distance to the conference destination consists of the points for which no other point exists that and the price for participants, the skyline offers a set of best is better in at least one dimension and at least as good in options or Pareto optimal solutions with various tradeoffs every other dimension. between distance and price: p is the nearest to the destina- Assume that we have a dataset of n points, referred to 1 tion, p is the cheapest, and p provides a good compromise as P . Each point p of d real-valued attributes can be rep- 11 6 of the two factors. p will not be considered as p is better resented as a d-dimensional point (p[1]; p[2]; :::; p[d]) 2 Rd 8 11 than p in both factors. where p[i] is the i-th attribute of p. Given two points p = 8 (p[1]; p[2]; :::; p[d]) and p0 = (p0[1]; p0[2]; :::; p0[d]) in Rd, p 0 0 Motivation. While the skyline definition has been extend- dominates p if for every i, p[i] ≤ p [i] and for at least one ed with different variants and the skyline computation problem for finding the skyline of a given dataset has been stud- This work is licensed under the Creative Commons Attribution ied extensively in recent years, most existing works focus on NonCommercialNoDerivs 3.0 Unported License. To view a copy of this li skyline consisting of individual points. One important prob- cense, visit http://creativecommons.org/licenses/byncnd/3.0/. Obtain per lem that has been surprisingly neglected to the large extent mission prior to any use beyond those covered by the license. Contact is the need to find groups of points that are not dominated copyright holder by emailing [email protected]. Articles from this volume by others as many real-world applications may require the were invited to present their results at the 42nd International Conference on Very Large Data Bases, September 5th September 9th 2016, New Delhi, selection of a group of points. India. Proceedings of the VLDB Endowment, Vol. 8, No. 13 Hotels Example. Consider our hotel example again, sup- Copyright 2015 VLDB Endowment 21508097/15/09. pose the organizers need to reserve a group of hotels (instead of one) considering both distance to the conference destina- those groups that are not g-dominated by any other group tion and the price for participants. In contrast to the tradi- with same size. Intuitively, if we consider the points in each tional skyline problem which finds Pareto optimal solutions group as a set of dimensions orthogonal to the attributes where each solution is a single point, we are interested in of each point, the definition of G-Skyline groups with the finding Pareto optimal solutions where each solution is a group dominance is in spirit similar to skyline definition, in group of points. One may use the traditional skyline def- that a group is a skyline group if no permutation of any inition, and return all subsets from the skyline points p1, other group exists that is better for at least one point and p6, and p11. If the desired group size is 2, group fp1; p6g, at least as good for every other point. fp1; p11g, and fp6; p11g can be returned. However, we show G-Skyline not only captures groups of points from tradi- that this definition does not capture all the best groups. For tional skyline points but also groups that may contain non- example, fp11; p10g should clearly be considered a Pareto skyline points. Back to our hotel example, fp11; p10g is a optimal group to users who use price as the main criterion, G-Skyline group as we discussed earlier even though p10 is e.g. PhD students with low travel budget, since p11 provides not a skyline point. Group fp6; p3g is also a G-Skyline. On the best price and p10 the second best price. Note that p10 the other hand, group fp1; p3g is not as it is dominated by is only second best to p11 which is also part of the group, fp1, p6g. Group fp3; p8g is also not as it is dominated by hence no other groups are better than this group in terms of fp6; p11g. In summary, the G-Skyline in this example consist price. As another example, fp6; p3g also presents a Pareto of all groups composed of skyline points, fp1; p6g, fp1; p11g, optimal group, as both p6 and p3 provide a good tradeoff fp6; p11g, as well as groups that contain non-skyline points, and no other groups are better than this group considering fp6; p3g, fp11; p8g, and fp11; p10g. both price and distance. On the other hand, group fp3; p8g It's non-trivial to solve G-Skyline problem efficiently. To is not a best group because p11(p6) is better than p8(p3), (find) k-point G-Skyline groups from n points, there can be f g f g n i.e., group p3; p8 is dominated by group p6; p11 . k different possible groups. Unfortunately, the G-Skyline problem is significantly different from the traditional skyline Table 1: Top five players on Attribute PTS. problem, to the extent that algorithms for the latter are( i-) n Player PTS REB AST STL BLK napplicable. A brute force solution is to enumerate all k Michael Jordan 33.4 6.4 5.7 2.1 0.9 possible groups, then for each group, to compare it with all Anthony Davis 30.5 8.5 2 1.5 3 other groups( to) determine whether it cannot be dominated. Kyrie Irving 30 3 2 1 1 n 2 So there are k comparisons. For each comparison, there Allen Iverson 29.7 3.8 6 2.1 0.2 are k! possible permutations of the points, and for each per- Jerry West 29.1 5.6 6.3 0 0 ... ... ... ... ... ... mutation, it requires k comparisons.( ) Therefore, the time n 2 × × complexity is in the order of O( k k! k). NBA Example. Consider another real example with NBA In this paper, we present a novel structure that represents players. Table 1 shows the top five players on attribute PTS the points in a directed skyline graph and captures all the (Points). For other attributes, please see the experimental dominance relationship among the points based on the no- section for detailed explanations.

Load more