Print This Article
Total Page:16
File Type:pdf, Size:1020Kb
Finding Potential Objects in Uncertain Dataset by Using Competition Power 1305 Finding Potential Objects in Uncertain Dataset by Using Competition Power Sheng-Fu Yang, Guanling Lee, Shou-Chih Lo Department of Computer Science and Information Engineering, National Dong Hwa University, Taiwan [email protected], [email protected], [email protected] * Abstract points, {B, E} is the skyline point in this database. By using the skyline concept, we can retrieve excellent In the past studies, it has been proven that skyline data points in the dataset. queries and dominating queries are very useful in applications such as multi-preference analysis and multi- Table 1. Player information criteria decision making. In real applications, such as Player Scoring Rebounding environmental monitoring and market analysis, the data A 20 6 often have uncertain characteristics, and the uncertainty B 18 18 of the data mainly comes from the data randomness, the C 20 3 limitation of the measuring instrument or the delay of D 30 6 updating the data. Therefore, in this paper, by using the E 35 6 dominance concept, we propose an efficient method to F 32 4 help users screen out better data objects in multi- dimensional uncertain dataset. Furthermore, an The concept of the skyline operator was first appropriate probability model is also proposed to introduced in [3], which explores three algorithms: the objectively calculate the scores of uncertain data. To block nested loops (BNL), divide and conquer, and B- show the benefit of the approach, a set of experiment is performed on both synthetic and real datasets. According tree-based schemes. In [4], by presorting the dataset to the experimental results on real dataset, the proposed according to some monotonic scoring function, a method can find the potential data objects efficiently. variant of BNL named the sort-filter skyline algorithm was proposed. As mentioned in [4], with the resulting Keywords: Skyline query, Uncertain data, Competition order of points, the data object is impossible to power, Dominance dominate the data object listed before it, which simplifies the comparisons. Furthermore, based on the 1 Introduction object-based space partitioning concept, an efficient skyline computation method was proposed [5]. Moreover, the problem of how to process skyline query Skyline query and dominance query [1-2] have in a parallel way was discussed in [6-9] and several recently become important in many applications, such efficient parallel skyline processing algorithms were as multi-criteria decision making, data mining, and presented. multi-preference analysis. Generally speaking, an A traditional skyline query can only extract data A B A object is said to dominate another object if is not points which are not dominated by other data points in B A worse than on every dimension and is strictly the full space. Therefore, by extending the problem of B better than on at least one dimension. In skyline full-space skyline computation to subspace skyline query, it returns the points that are not dominated by computation, the concept of subspace skyline was any other points in the dataset. For example, Table 1 introduced in [10]. Moreover, an efficient subspace shows the performance of six players on the scoring approach for calculating the skyline that exploits the A C and rebounding. In Table 1, is a better player than , seed skyline group lattice formed by the full-space A C because and are equally good at scoring, but the skyline points was proposed in [11]. In [12-13], the C number of rebounds is larger than . In this case, we interesting problem of computing the skyline according A C C A can say that dominates , or is dominated by . to a user’s preference was proposed. Furthermore, A B Furthermore, and are data points that are because the probability that a data object dominates the A B incomparable, because is better than in scoring, but other data object decreases as the number of dimension B A is better than in rebounding. In the example, since increases. There will be a large amount of skyline B E the two points and are not dominated by other points in a high dimensional dataset. To respond this *Corresponding Author: Guanling Lee; E-mail: [email protected] DOI: 10.3966/160792642019072004029 1306 Journal of Internet Technology Volume 20 (2019) No.4 problem, in [14-16], the concept of k-dominant skyline we use average value to represent the performance. was discussed and several efficient algorithm were Therefore, in [17], the concept of using the probability proposed to solve the problem. model to compute the probability which an uncertain The shortcoming of skyline query is that the number object belongs to a skyline point was introduced. And of returned data points cannot be determined. In the it retrieves the data points (called p-skyline) whose above example, if we need to find three outstanding probability of becomming a skyline is higher than a players, the skyline operator will only return two data predefined threshold (p). Furthermore, based on the points. To solve this problem, the concept of concept of p-skyline, [18-26] continue to propose more dominance query is introduced. In dominance query, efficient algorithms or explore the issues related to the the goodness of an object A can be naturally measured probabilistic skyline, such as subspace skyline and by the number of other objects dominated by A in a set uncertainty preferences. of objects. Refer to Table 2, There is one player ({C}) However, these models have some shortcomings. dominated by A, and are two players ({A, C}) Refer to Table 1 and Table 2, althought player B dominated by D, the dominant score of A and D are 1 (special functional player) is in skyline, in dominating and 2, respectively. To continue with the above query, player B will be excluded from the result example by using the dominant score as the goodness (dominant score of B is 0). And the same problem will measure, {E, D, F} can be retrieved as the three occur for uncertain data. Therefore, in this paper, we outstanding players. discuss the problem of how to efficiently and effectively help users to select better objects in multi- Table 2. The dominant score of players dimensional uncertain data. In this paper, we propose an appropriate probability model to objectively Player Dominate Players Dominant Score calculate the scores of uncertain data and find objects A C 1 B - 0 with considerable potential. Furthermore, by adapting C - 0 the concept of Minimal Bounding Box and Z-order, an D A, C 2 efficient algorithm for retrieving potential objects in a E A, C, D, F 4 large set of uncertain data is proposed. A set of F A, C 2 experiment is performed on both real and syntatic datasets to show the advantage of our approach. As mentioned above, in skyline query, it returns the The remainder of this paper is organized as follows. points that are not dominated by any other points in the Section 2 provides the main idea and the problem dataset. In real applications, such as environmental definition of the work. In Section 3, the proposed monitoring and market analysis, the data often have algorithm is discussed and presented. Section 4 uncertain characteristics. The uncertainty of the data discusses the experimental results and analysis. mainly comes from the data randomness, the limitation Conclusions are finally drawn in Section 5. of the measuring instrument or the delay of updating the data. For example, the NBA players may have 2 Problem Formulation different performances in different games. If the game- by-game performance data are considered, the player’s Generally speaking, an object p is said to dominate performance becomes uncertain data. another object q if p is not worse than q on every In the early studies, the average of the instances of dimension and p is strictly better than q on at least one each data object was used to represent the uncertain dimension. And we use p ≺ q to denote p dominate q. data itself. However, according to the experimental In skyline query, it returns the points that are not result performed on the database of NBA player dominated by any other points in the dataset. In real performance (scores, assists, rebounds) from 1991 to applications, such as environmental monitoring and 2005 [17], it pointed out the inadequacies of using market analysis, the data often have uncertain average to represent the uncertain data. For example, a characteristics, and the uncertainty of the data mainly player like Hakeem Olajuwon, because his average comes from the data randomness, the limitation of the performance will be dominated by Shaquille O’Neal, measuring instrument or the delay of updating the data. Charles Barkley, Tim Duncan, and Chris Webber, he For example, Table 3 shows the basketball players will not be retrieved in skyline queries if we ignore the performances (score and rebound) in different games. uncertanty and use the mean value to represent the Because each player has different performances in performance. However, taking into account the different games, it forms an uncertain dataset. In the performance of different games, Hakeem Olajuwon following discussion, we use object to denote the data does not perform worse than these players in most of we concerned. In the example, we have 6 objects games. Just because of the variance of his performance (players A~F). Furthermore, (10, 6) is called an is greater than that of these players, he was not instance of A, and objects A and E contain 2 and 3 presented in the skyline query. In addition, stars like instances, respectively.