A χ²-based Summarization Framework

For Differences Between Two Data Sets

A thesis submitted to Kent State University in partial fulfillment of the requirements for the degree of Master of Science

by

Dong Wang

May 2009

Thesis written by
Dong Wang
B.S., University of Science and Technology of China, 1997
M.S., University of Science and Technology of China, 2000
M.S., Kent State University

Approved by

Dr. Ruoming Jin, Advisor

Dr. Robert Walker, Chair, Department of Computer Science

Dr. John R.D. Stalvey, Dean, College of Arts and Sciences

Table of Contents

List of Figures

List of Tables

List of Algorithms

1 Introduction
1.1 Ubiquitous Change
1.2 Hypothesis Testing
1.3 Test Statistics Properties
1.4 Test Statistic Methods
1.5 Utilizing the χ² Test

2 Related Work
2.1 Change Detection
2.2 Measuring Differences in Data Sets
2.3 Describing Differences Between Multidimensional Data Sets

3 Problem Definition
3.1 Histogram-based Analysis
3.2 Multi-dimensional Contingency Table
3.3 Decision-tree Like Structure Approach

4 Dynamic Programming Algorithm

5 Greedy Recursive Scheme

6 Experimental Results
6.1 Data Preparation
6.2 Testing Environment
6.3 Running Costs
6.4 Test Results

6.5 Accuracy of the Greedy Algorithm

7 Conclusion and Future Work

Bibliography

Appendices

A Dynamic Programming Algorithm Results For the Real Data Sets

List of Figures

3.1 Building of one-dimensional contingency table
3.2 Illustration of a 2-d grid

3.3 Final result of the optimal cuts

4.1 Possible cuts of an 8 × 6 2d hypercube
4.2 Illustration of the decision tree building for 1-d hypercube
4.3 Illustration of the decision tree building for 2-d hypercube

5.1 A simple 1-d test hypercube

6.1 A simple tree-structure explanation
6.2 Running time of both algorithms vs grid sizes

A.1 Cutting result for the first kind of change of Abalone
A.2 Cutting result for the second kind of change of Abalone

A.3 Cutting result for the first kind of change of Auto MPG
A.4 Cutting result for the second kind of change of Auto MPG
A.5 Cutting result for the first kind of change of Clouds
A.6 Cutting result for the second kind of change of Clouds
A.7 Cutting result for the first kind of change of Cement
A.8 Cutting result for the second kind of change of Cement
A.9 Cutting result for the first kind of change of Credit

A.10 Cutting result for the second kind of change of Credit
A.11 Cutting result for the first kind of change of Vowel Context
A.12 Cutting result for the second kind of change of Vowel Context

A.13 Cutting result for the first kind of change of Cylinder
A.14 Cutting result for the second kind of change of Cylinder

List of Tables

6.1 List of Data Sets
6.2 Comparison of the two algorithms

List of Algorithms

1 Dynamic Programming Algorithm To Find The Maximum Independence

2 Greedy Algorithm To Find The Maximum Independence

3 Third Kind Of Change Generation
4 Hypercube Generation

Chapter 1

Introduction

1.1 Ubiquitous Change

One of the fundamental problems in data mining is comparing two data sets that share the same set of attributes and characterizing the change between them. Detecting and describing such changes has great potential in many research areas. Listed below are some examples:

• Every month a retail store generates a sales report. The manager of the store wants to compare the reports from different periods and find out which factor contributes most to the differences.

• In several locations of the tropical ocean, water temperatures at multiple depths are collected. By comparing data sets collected at different times of the year, the scientists want to know which area contributes most significantly to the differences between the two collection times.

• The log files of two popular web sites might have very different patterns. The administrators want to know what causes the differences. The reason might be a certain point of time or a certain group of visitors.

In these examples, the two data sets under study have identical sets of attributes; what differs is the distribution of the data. A summary of these differences could help the managers find trends in the consumer market and make the right decisions. Unlike the widely used OLAP tools [6], which can drill down or roll up to different levels of aggregation and help the user find the differences, what is required here is a method to describe and explain these differences.

1.2 Hypothesis Testing

To describe differences, they must first be defined statistically. Suppose there are two different data sets, say two transactional data sets D1 and D2 from a chain store in two different locations. For D1 the underlying distribution function is F1, and for D2 it is F2. The question is whether there is any difference between F1 and F2. So there are two hypotheses, a null hypothesis H0 and an alternative hypothesis H1:

H0 : F1 = F2

H1 : F1 ≠ F2

A test statistic s(F1,F2) is needed, which is a function of F1 and F2 (and therefore a function of D1 and D2); from the value of s it can be decided (with a certain degree of confidence) to reject one of the hypotheses above.

1.3 Test Statistics Properties

There are several properties a good test statistic has to have:

1. Consistency: When min(|D1|, |D2|) → ∞, we can always reject one of the hypotheses. For example, suppose D1 is drawn from a Bernoulli distribution with p = 0.5 and D2 is also drawn from a Bernoulli distribution, but with p = 0.49. If the difference of the two sample means, s = |µ1 − µ2|, is used as the test statistic, the right decision (which is to reject H0) might not be reached when the sample size is small; however, when the sample size goes to infinity, even though the difference between D1 and D2 is very small, s will become so significant that H0 can be rejected with high confidence.

2. Distribution-free: In most real data sets, it is difficult to determine the functions F1 and F2 in advance, and even if they are determined once, the distribution functions could change from time to time. So it is necessary to make the test statistic distribution-free.

3. Power: If more than one test statistic meets the requirements above, which one should be used? There are two types of errors in a statistical decision process. A type I error is rejecting a hypothesis that should have been accepted; a type II error is accepting a hypothesis that should have been rejected. The power of a test statistic is the probability of NOT committing a type II error. A good test statistic should have a power close to 1, so the hypothesis can be rejected when it is not correct.

So the goal is to find a test statistic that is consistent, distribution-free and has a power as close to 1 as possible.

1.4 Test Statistic Methods

In real-world applications, many data sets have more than one attribute, so besides the properties listed in Section 1.3, the test statistic methods must be multivariate. Listed below are some of the qualified methods:

1. Friedman's MST method [8]. As an extension of the univariate Wald-Wolfowitz test [17], Friedman's MST method can be used to find the number of homogeneous regions in the mixture of two types of multivariate points. At the beginning, a complete graph is built based on the distances δij between any two points i and j. Then a minimum spanning tree (MST) is built from the complete graph. Next, all the edges across the two different groups are removed. Since it is an MST, each time an edge is removed, the total number of isolated subgraphs increases by one. After all the removals, a set of homogeneous and isolated subgraphs is formed, and the number of these isolated subgraphs can be used as the test statistic (a short code sketch of this statistic is given after this list).

2. The k-nearest neighbors [10]. In multi-dimensional space, once the distance δij between any two points i and j is calculated (it could be a metric like the Euclidean distance), the k nearest neighbors of each point can be determined. Each point pi, together with its k nearest neighbors, forms a sphere centered at pi. At the end, the number of homogeneous spheres can be used as the test statistic.

3. Cross-matching [14]. First, the Kullback-Leibler divergence [16] is calculated between any two points from both types. Supposing the total number of points of the two types is N, a minimum-distance non-bipartite matching can be performed, generating N/2 pairs (if N is odd, a dummy point d with δid = 0, i ∈ {1, 2, . . . , N}, is added, and the pair containing the dummy point is discarded). The test statistic is the number of homogeneous pairs among these N/2 pairs.

1.5 Utilizing the χ² Test

Among these tests, the χ² test [7] is of particular interest. A χ² test is a statistical hypothesis test whose test statistic has a χ² distribution when the null hypothesis is true. It is one of the most widely used non-parametric statistical tests for goodness-of-fit or for determining whether two random variables are independent. The χ² test statistic is calculated as below:

X² = Σ_{i=1}^{n} (O_i − E_i)² / E_i

where X² is the test statistic that asymptotically approaches a χ² distribution, O_i is an observed value, E_i is the corresponding expected value assuming the null hypothesis, and n is the number of possible outcomes. In our case, we are interested in whether the two testing data sets are generated from the same distribution or not, so the χ² test is a perfect candidate.
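As a quick illustration of this formula, the following sketch computes X² for a 2 × n contingency table whose rows hold the binned counts of the two data sets, with the expected counts taken from the usual independence model (row total × column total / grand total). The helper name and the example numbers are made up for illustration only.

```python
import numpy as np

def chi_square_statistic(observed):
    """X^2 = sum over all cells of (O_i - E_i)^2 / E_i, with E_i taken from
    the independence model of the contingency table."""
    observed = np.asarray(observed, dtype=float)
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return float(((observed - expected) ** 2 / expected).sum())

# Two data sets binned into the same four bins, one row per data set.
table = np.array([[25, 30, 22, 23],
                  [40, 18, 21, 21]])
x2 = chi_square_statistic(table)   # compare against a chi^2 critical value, df = n - 1
```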

Chapter 2

Related Work

2.1 Change Detection

Much research has been done in this area. Breunig et al. [5] used the degree of being an outlier, called the local outlier factor, for each object; the larger this measurement is, the more likely the object is an outlier. Knorr et al. [13] studied distance-based outliers, which also use the number of nearest neighbors, and showed that even for large data sets the detection can be done at a complexity of O(kN²). Change detection in data streams is also an interesting topic. Kifer et al. [12] used a novel measurement to find the change point in a multidimensional data stream. Aggarwal et al. [2] used the concept of velocity density estimation to diagnose the change in evolving data streams. Zhu et al. [18] proposed a data structure called the Shifted Wavelet Tree to monitor elastic bursts. Bay et al. [4] utilized contrast set mining to detect group differences. All these methods focus on the detection of abnormal points (or regions), not on quantifying or describing the change.

2.2 Measuring Differences in Data Sets

Ganti et al. [9] developed a framework to measure the deviation between two data sets. This framework mainly studied two classes of data mining models: frequent itemset models (called lits-models) and decision tree models (called dt-models). For lits-models, each of the two data sets has a set of frequent itemsets (which can be viewed as a set of interesting regions) and each itemset has a support value. So the difference between these two data sets can be quantified by the structure component (which is the set of frequent itemsets) and the measure component (which is the support value of each frequent itemset).

Similarly, for dt-models, each of the two data sets has a decision tree structure consisting of leaf nodes and inner nodes. Each node can also be viewed as an interesting region. Since it is a decision tree, each node is associated with the fractions of the two possible values of the dependent attribute. So by using the tree structure as the structure component and the two fraction values as the measure component, the difference between these two data sets can also be quantified. If the two data sets have identical structure components, the difference between the two data sets is simply the sum of the differences over every region. When the structure components of the two data sets are not the same, both data sets need to extend their structure components so that the two extended structure components become identical; the deviations of all the regions in the two extended structure components are then aggregated to get the difference between the two data sets. The limitation of this framework is that when the structure of the data set cannot fit into this two-component model, there is no way to calculate the value of the difference.

2.3 Describing Differences Between Multidimensional Data Sets

There are several methods for describing the changes between two data sets.

1. Compress a large data set into a compact summary [11]. The Haar+ tree is used to summarize data sets with sharp discontinuities or bursts.

2. Use the native hierarchical structure of the data set and find the most parsimonious explanations of changes [1].

3. Use the so-called discovery-driven exploration [15]. This method uses pre-computed exceptions to help the user navigate large OLAP cubes.

All the above methods have one common limitation: they have to use the inherent hierarchical structures of the attributes. When the attributes are continuous, without these kinds of hierarchical structures, a method is needed to effectively summarize the differences between the two data sets. Inspired by these methods, we propose a novel framework that can deal with multi-dimensional continuous attributes, detect the abnormal regions between two data sets, and give an intuitive hierarchical explanation of the difference between the two data sets.

Chapter 3

Problem Definition

3.1 Histogram-based Analysis

The first step in analyzing the difference between two data sets using a non-parametric, distribution-free statistical test against continuous data is discretization of the continuous data. The easiest way of discretization is using histograms. For a data set D with continuous attribute A, if the size of this data set is |D|, a histogram can be described as a set of n (here n ≪ |D|) consecutive bins Bi, where 1 ≤ i ≤ n. For each Bi, an integer value vi is assigned, which represents the number of entities whose value falls into the bin Bi. To divide D, we can use the equal distance method or the equal frequency method. In the equal distance method, all of the n bins have the same width in the same dimension, while in the equal frequency method, the widths of the bins are adjusted so that all of the n bins hold the same number of entities. For the sake of simplicity and because of the nature of the tests, the equal distance method is used in the algorithms of this thesis, but all the algorithms can be easily modified to use the equal frequency method as well.
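The sketch below shows the equal distance discretization of one continuous attribute, with the equal frequency variant differing only in how the bin edges are chosen. The function names are illustrative, not taken from the thesis.

```python
import numpy as np

def equal_distance_bins(values, n):
    """Discretize one continuous attribute into n consecutive equal-width bins
    B_1..B_n and return the count v_i of entities falling into each bin."""
    edges = np.linspace(np.min(values), np.max(values), n + 1)
    counts, _ = np.histogram(values, bins=edges)
    return counts, edges

def equal_frequency_bins(values, n):
    """Same idea, but the bin widths are adjusted so that every bin holds
    roughly the same number of entities (edges taken at quantiles)."""
    edges = np.quantile(values, np.linspace(0.0, 1.0, n + 1))
    counts, _ = np.histogram(values, bins=edges)
    return counts, edges
```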


3.2 Multi-dimensional Contingency Table

The concept of a multi-dimensional contingency table is used to represent the data set before dividing. The traditional n × m contingency table only has two variables, which are represented by the columns and rows of the table. In this problem, if the traditional contingency table is used, since there are two data sets and each data set takes one row, only one variable can be represented by the columns. Figure 3.1 shows how a one-dimensional contingency table is built.

Figure 3.1: Building of one-dimensional contingency table

To utilize the contingency table for multivariate data sets, the multi-dimensional contingency table can be used. Suppose there are two data sets and both of them have two attributes; a 2-d grid can be created, in which each axis represents one attribute and covers all the possible values of that attribute in both data sets. Figure 3.2 shows what a 2-d grid looks like.

Figure 3.2: Illustration of a 2-d grid

3.3 Decision-tree Like Structure Approach

The approach taken in this thesis to the describing problem is to build a decision-tree-like structure in which each node is a subset of the original data set. This structure is called the "Subset Tree," and it has the following characteristics:

1. The subset tree is a full binary tree where the root is the entire table.

2. There’s no overlap between any two siblings.

3. All the leaf nodes are either independent subsets or dependent subsets with only two cells.

Definition 1 Given a 2 × m contingency table D, the cut cost of the cut between point i and i + 1 is defined as the result of a χ² test against a 2 × 2 contingency table V, where V_{j,1} = Σ_{k=1}^{i} D_{j,k} and V_{j,2} = Σ_{k=i+1}^{m} D_{j,k}, for j = 1, 2.

This definition can be easily extended to multi-dimensional contingency tables, since a multi-dimensional contingency table can always be linearized into a 1-d contingency table.
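A small sketch of Definition 1 is given below. It assumes a significance level α, which the thesis does not fix, and uses SciPy's chi-square test on the collapsed 2 × 2 table; Chapter 4 later reduces this cost to 0 (independent cut) or 1 (dependent cut).

```python
import numpy as np
from scipy.stats import chi2_contingency

def cut_cost(table, i, alpha=0.05):
    """Cut cost between columns i and i+1 (1-based) of a 2 x m contingency
    table: collapse the columns on each side of the cut into one, giving a
    2 x 2 table V, and run a chi-square test on V."""
    table = np.asarray(table, dtype=float)
    v = np.column_stack([table[:, :i].sum(axis=1),    # V_{j,1} = sum_{k<=i} D_{j,k}
                         table[:, i:].sum(axis=1)])   # V_{j,2} = sum_{k>i}  D_{j,k}
    chi2, p, _, _ = chi2_contingency(v, correction=False)
    return chi2, p < alpha      # the statistic, and whether the cut is "dependent"

# e.g. cut a 2 x 5 table between columns 2 and 3
d = np.array([[10, 12, 30, 28, 31],
              [11, 13, 12, 10, 9]])
print(cut_cost(d, 2))
```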

The ultimate goal of the algorithms is to find the set of optimal cuts that partition the original data set into subsets which are statistically independent. Figure 3.3 illustrates the final result of the optimal cuts:

Figure 3.3: Final result of the optimal cuts.

Definition 2 The optimal solution of the hierarchical cut of a multi-dimensional contingency table is the hierarchical cut that gives the least sum of cut costs and dependent leaf subsets.

To get the optimal solution, a dynamic programming algorithm and a greedy algorithm are introduced.

Chapter 4

Dynamic Programming Algorithm

The optimal-cutting problem defined in Chapter 3 can be solved using a divide-and-conquer strategy. For a k-dimensional hypercube h, first all the attributes need to be independent of each other. In certain scenarios, some attributes might be dependent on each other, but they can be orthogonalized into k′ independent attributes, where k′ ≤ k. The orthogonalization methods are not discussed in this thesis, and the assumption is that the k attributes are independent of each other. Under this assumption, for a k-dimensional hypercube with c_i regions along dimension i, the candidate cuts are the planes x_i = 1, 2, . . . , c_i − 1 for each dimension i. For example, for a two-dimensional 8 × 6 hypercube, all 12 possible cuts are illustrated in Figure 4.1.

Figure 4.1: Possible cuts of an 8 × 6 2d hypercube


The next step is to divide the original problem into sub-problems. This step is trivial since each cut intuitively generates two non-overlapping and exhaustive sub-hypercubes. In the definition of the optimal cut, the difference between the two data sets comes from two sources: the cut costs and the dependency of the leaves. These values are set arbitrarily. In the implementation of this algorithm, the cost of each independent cut is quantified as 0 and the cost of each dependent cut as 1. At the same time, an independent leaf contributes 0 to the total dependency and a dependent leaf contributes 1 to the total dependency. The test statistic is arbitrarily set too; theoretically any distribution-free test statistic can be used. In this algorithm, the χ² test is used as the criterion to determine the dependency of a contingency table. Since the χ² test is suitable for nominal attributes, the multi-dimensional hypercube can be linearized into a 1-d table, and the order of the linearization does not affect the test result. Algorithm 1 is the classic top-down dynamic programming algorithm together with memoization. In this algorithm, a cut is labeled as dependent if the resulting two sub-cubes are dependent on each other, and the total dependency number Σ is increased by 1 for this cut; otherwise it is labeled as independent and the value of Σ is unchanged. The hypercube will be cut unless

• the number of cells within the hypercube is 2 or less, or

• the linearized hypercube is independent according to the χ² test.

The cut process continues recursively from the top down. Memoization is applied, so the total cost of this algorithm is O(n³). To test the effectiveness of this algorithm, several synthetic data sets with different dependency numbers are used. Figure 4.2 shows the test result for a 1-d hypercube. The two data sets of this hypercube are a baseline data set with the uniform distribution f(i) = 0.05, 0 ≤ i ≤ 19, and a changed data set with four regions: region 1 from cell 0 to cell 4, with the uniform distribution f = 0.02; region 2 from cell 5 to cell 9, with the uniform distribution f = 0.08; region 3 from cell 10 to cell 14, with the uniform distribution f = 0.02; and region 4 from cell 15 to cell 19, with the uniform distribution f = 0.08.

Algorithm 1 Dynamic Programming Algorithm To Find The Maximum Independence
Require: Hypercube Hc filled with two sets of k-d points
Ensure: The partition yielding the maximal independent sub-tables, along with the total dependency number Σ

1: sub-cube h ← Hc, Σ ← 0
2: linearize h and fill the result into a 1-d contingency table T
3: if size of T ≤ 2 then
4:   Σ ← χ²(T)
5:   return Σ
6: end if
7: if χ²(T) = 0 then
8:   return 0
9: end if
10: for all candidate cuts of h do
11:   cut h into h1 and h2
12:   if χ²(h1, h2) > 0 then
13:     Σ ← Σ + 1
14:   end if
15:   Σ(h) ← min(Σ(h1) + Σ(h2))
16: end for
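The sketch below restates Algorithm 1 for the 1-d case (a 2 × m table) in Python. It is a simplified illustration, not the thesis's C++ implementation: the 0/1 cost convention follows the text above, the significance level ALPHA is an assumption, and SciPy's chi-square test stands in for χ²(·).

```python
import numpy as np
from scipy.stats import chi2_contingency

ALPHA = 0.05   # assumed significance level; the thesis does not fix one

def dependent(table):
    """True if the chi-square test rejects independence for a 2 x m slice."""
    table = np.asarray(table, dtype=float)
    if table.shape[1] < 2 or (table.sum(axis=0) == 0).any() or (table.sum(axis=1) == 0).any():
        return False                       # degenerate slice: treat as independent
    _, p, _, _ = chi2_contingency(table, correction=False)
    return p < ALPHA

def min_dependency(table, lo, hi, memo=None):
    """Minimal total dependency Sigma for columns [lo, hi): 1 is charged for
    every dependent cut and for every dependent leaf, as in Algorithm 1."""
    table = np.asarray(table, dtype=float)
    memo = {} if memo is None else memo
    if (lo, hi) in memo:
        return memo[(lo, hi)]
    sub = table[:, lo:hi]
    if hi - lo <= 2:                       # leaf with at most two cells per row
        best = int(dependent(sub))
    elif not dependent(sub):               # already independent: stop cutting
        best = 0
    else:
        best = float('inf')
        for cut in range(lo + 1, hi):      # try every candidate cut position
            halves = np.column_stack([table[:, lo:cut].sum(axis=1),
                                      table[:, cut:hi].sum(axis=1)])
            cost = int(dependent(halves))  # cut cost: 0 if independent, 1 if not
            best = min(best, cost
                       + min_dependency(table, lo, cut, memo)
                       + min_dependency(table, cut, hi, memo))
    memo[(lo, hi)] = best
    return best

# Usage: counts is a 2 x 20 table of binned baseline / changed counts.
# sigma = min_dependency(counts, 0, counts.shape[1])
```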

Figure 4.2: Illustration of the decision tree building for 1-d hypercube

Figure 4.3 shows the test result for a 2-d hypercube. The two data sets of this hypercube are a baseline data set with the uniform distribution across the entire area and a changed data set with four regions: all four of them have uniform distribution, but two regions have a higher probability than the other two regions.

Figure 4.3: Illustration of the decision tree building for 2-d hypercube

From the above examples it can be concluded that the dynamic programming algorithm works as expected; the optimal cuts are found in both examples. Note that this implementation works for k-d hypercubes since it uses the top-down approach.

Chapter 5

Greedy Recursive Scheme

The dynamic programming algorithm proposed in the previous chapter gives the globally optimal solution. Its implementation for a hypercube with size n costs O(n³) time and O(n²) memory, so when the number of cells of the hypercube is small, the running cost is not a big problem. But when the number of attributes k is large, if each attribute has m divided subsets, the resulting hypercube will have m^k cells. Because of the limitations of the hardware, an efficient approximation is required to address the problem. A reasonable approach is to use a greedy algorithm instead of the dynamic programming algorithm. At each level of the decision tree, all the possible candidate cut positions are evaluated and the best local cut position is chosen; the cut process continues until all the subsets are statistically independent.

Algorithm 2 Greedy Algorithm To Find The Maximum Independence
Require: Hypercube Hc filled with two sets of k-d points
Ensure: The partition yielding the maximal independent sub-tables

1: h ← Hc
2: if χ²(h) = 0 then
3:   return 0
4: end if
5: cut point ← the cut point with min(χ²(h1) + χ²(h2))
6: cut h into h1 and h2
7: repeat steps 5 and 6 until all sub-cubes are independent or too small to cut
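For comparison with Algorithm 1, here is a 1-d Python sketch of the greedy scheme. Again this is only an illustration under the same assumptions (the ALPHA threshold, SciPy's chi-square test as χ²(·)); the thesis's actual implementation is in C++ and handles k-d hypercubes.

```python
import numpy as np
from scipy.stats import chi2_contingency

ALPHA = 0.05   # assumed significance level

def chi2_value(table):
    """Plain X^2 statistic of a 2 x m slice (0 for degenerate slices)."""
    table = np.asarray(table, dtype=float)
    if table.shape[1] < 2 or (table.sum(axis=0) == 0).any() or (table.sum(axis=1) == 0).any():
        return 0.0
    stat, _, _, _ = chi2_contingency(table, correction=False)
    return float(stat)

def is_dependent(table):
    """True if the chi-square test rejects independence for a 2 x m slice."""
    table = np.asarray(table, dtype=float)
    if table.shape[1] < 2 or (table.sum(axis=0) == 0).any() or (table.sum(axis=1) == 0).any():
        return False
    _, p, _, _ = chi2_contingency(table, correction=False)
    return p < ALPHA

def greedy_cuts(table, lo, hi, cuts=None):
    """Recursively cut columns [lo, hi): pick the single cut whose two halves
    have the smallest combined X^2 (step 5 of Algorithm 2), then recurse."""
    cuts = [] if cuts is None else cuts
    table = np.asarray(table, dtype=float)
    if hi - lo <= 2 or not is_dependent(table[:, lo:hi]):
        return cuts                        # independent or too small to cut
    best = min(range(lo + 1, hi),
               key=lambda c: chi2_value(table[:, lo:c]) + chi2_value(table[:, c:hi]))
    cuts.append(best)
    greedy_cuts(table, lo, best, cuts)
    greedy_cuts(table, best, hi, cuts)
    return cuts
```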


Figure 5.1: A simple 1-d test hypercube

Figure 5.1 shows the result of the greedy algorithm against a simple 1-d hypercube. The baseline data set in this test hypercube has a uniform distribution f(i) = 0.05, while the changed data set has two regions: region 1 has the uniform distribution f(i) = 0.02 from cell 0 to cell 9; region 2 has the uniform distribution f(i) = 0.08 from cell 10 to cell 19. The result shows that for a simple hypercube like this, the result of the greedy algorithm is the same as that of the dynamic programming algorithm. But for complex multi-dimensional hypercubes, the greedy algorithm usually introduces unnecessary cuts. This issue will be discussed in the next chapter. From the algorithm above we can predict that the cost of the greedy algorithm is O(n); the next chapter shows that the cost is indeed O(n).

Chapter 6

Experimental Results

6.1 Data Preparation

To test the algorithms with real data sets, 10 data sets are extracted from the UCI Repository [3]. All 10 data sets are multivariate with discrete or continuous variables. Table 6.1 lists the basic properties of the data sets.

data set name     data set size   number of attributes
Abalone           4177            2
Anneal             798            3
Auto MPG           398            2
Clouds            1024            2
Cement            1030            2
Credit             678            2
Vowel Context      990            2
Cylinder           527            2
Echocardiogram     120            2
Ecoli              336            2

Table 6.1: List of Data Sets.

Next, another data set with an identical set of attributes is generated. Three kinds of changes are used to build this new data set from the original data set.

1. The first kind of change is to shift part of the data set uniformly. This is done by adding or subtracting a constant to part of the original data set so that there is a jump between the two parts with different amounts of shift. In this kind of change, the algorithms used in this thesis should be able to find the exact positions where the jump happened. For instance, for a two-dimensional data set located in a rectangular area, if a positive constant is added to every non-empty cell of the upper half of the rectangle, the algorithm should find the horizontal line across the middle of the rectangle.

2. The second kind of change is to add one or more abnormal points. The position of the abnormal point(s) is set randomly. In this kind of change, the algorithms should be able to find the exact positions of the abnormal points. For instance, for a two dimensional data set, if there is only one abnormal point and the position of the abnormal point is (x, y), the algorithms should find two cuts: one on the X axis at the position x, the other on the Y axis at the position y.

3. The third kind of change is a user-defined change. In this case the user needs to provide a file containing the desired tree structure. For example, if the desired tree structure looks like Figure 6.1, then in the user-defined file each line represents a node.

Figure 6.1: A simple tree-structure explanation

Since the tree in this case is always a full binary tree, each node has either two child nodes or no child node. If the node is a leaf node, there are two fields in its line: the first field contains the name of the node, and the second field indicates the dependency of this leaf node. If the node is the root or an internal node, there are three fields in its line: the first field contains the name of the node, followed by the names of its left child and its right child; the second field indicates the dependency of the cut at this node; the third field contains the cutting point of this node. The tree structure in Figure 6.1 can be represented by the following text file:

A,B,C I 3,7-4,0
B,D,E D 1,7-2,0
C,F,G D 5,7-6,0
D I
E I
F I
G I

To generate the changed data set, Algorithm 3 (given below) is applied; a sketch of how such a file might be parsed follows first.
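To make the file format concrete, here is one way such a description could be read. The dictionary layout and the function name are illustrative assumptions, not the loader used in the thesis.

```python
def parse_tree_file(path):
    """Read the user-defined tree file.  Inner-node lines have three fields,
    e.g. 'A,B,C I 3,7-4,0' (name,left,right  dependency  cut point); leaf
    lines have two fields, e.g. 'D I' (name  dependency)."""
    nodes = {}
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if not fields:
                continue
            if len(fields) == 2:                           # leaf node
                name, dep = fields
                nodes[name] = {'dependent': dep == 'D'}
            else:                                          # root or internal node
                name, left, right = fields[0].split(',')
                nodes[name] = {'left': left, 'right': right,
                               'dependent': fields[1] == 'D',
                               'cut_point': fields[2]}
    return nodes
```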

To generate the qualified data sets with the first kind of change using the real data sets, each of the real data sets is sampled with replacement twice. The first sampling is uniform across the entire data set: all the points in the data set have the same probability (1/|D|) of being picked. In the second sampling, the points in the designated areas have a smaller probability (for example, 1/(3 × |D|)) of being picked, and the other areas have a larger probability, but within each of the two types of areas the distribution is still uniform.

The two hypercubes Hc1 and Hc2 can then be superimposed to form a k-dimensional contingency table. The dynamic programming algorithm and the greedy algorithm are applied to each of the data sets.

Algorithm 3 Third Kind Of Change Generation
Require: A baseline hypercube and a user-defined tree structure
Ensure: A changed hypercube with the user-defined difference compared to the baseline hypercube

1: read the baseline hypercube and determine size Sall and value Vall
2: for all nodes i of the user-defined tree structure do
3:   if i is an inner node then
4:     if i is the root then
5:       S(i) ← Sall, V(i) ← Vall
6:     end if
7:     S(i.left) ← left part of S(i), S(i.right) ← right part of S(i)
8:     if the cut at i is independent then
9:       V(i.left) ← V(i) × S(i.left) / S(i)
10:      V(i.right) ← V(i) × S(i.right) / S(i)
11:    else
12:      V(i.left) ← V(i) × S(i.left) × 0.5 / S(i)
13:      V(i.right) ← V(i) − V(i.left)
14:    end if
15:  else
16:    if leaf node i is independent then
17:      for all cells k in i do
18:        V(k) ← Vbaseline(k) × V(i) / Vbaseline(i)
19:      end for
20:    else
21:      V(i.left) ← Vbaseline(i.left) × V(i) × 0.25 / Vbaseline(i)
22:      V(i.right) ← V(i) − V(i.left)
23:    end if
24:  end if
25: end for

Algorithm 4 Hypercube Generation
Require: A data set D with k attributes, each attribute having m subsets
Require: N sample points which are candidates to be filled into the hypercubes
Ensure: Two hypercubes Hc1 and Hc2 with m^k cells, filled with two shifted data sets

1: initialize the hypercubes Hc1 and Hc2 with m^k empty cells
2: for i = 1 to N do
3:   uniformly pick point p1 with replacement from D
4:   fill p1 into Hc1 according to its position
5:   pick point p2 with replacement from D, where the probability can change according to the position of p2
6:   fill p2 into Hc2 according to its position
7: end for
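A Python sketch of Algorithm 4 for the first kind of change follows: both hypercubes are filled by sampling the real data set with replacement, and the second pass simply down-weights points inside a designated region. The region, the 1/3 down-weighting factor, and the function name are illustrative assumptions.

```python
import numpy as np

def generate_hypercubes(points, edges, n_samples, region, factor=1.0 / 3.0):
    """Build Hc1 (uniform sampling) and Hc2 (biased sampling) from a data set
    `points` of shape (|D|, k).  `edges` holds the bin edges for each attribute;
    `region` is a (low, high) interval on the first attribute whose points are
    picked with probability scaled down by `factor` in the second pass."""
    points = np.asarray(points, dtype=float)
    n, k = points.shape
    shape = tuple(len(e) - 1 for e in edges)           # m cells per dimension
    hc1 = np.zeros(shape, dtype=int)
    hc2 = np.zeros(shape, dtype=int)

    lo, hi = region                                    # down-weight this slab
    weights = np.where((points[:, 0] >= lo) & (points[:, 0] < hi), factor, 1.0)
    weights /= weights.sum()

    def cell(p):
        """Map a point to the index of the grid cell it falls into."""
        return tuple(int(np.clip(np.searchsorted(edges[d], p[d], side='right') - 1,
                                 0, shape[d] - 1)) for d in range(k))

    for i in np.random.choice(n, size=n_samples, replace=True):
        hc1[cell(points[i])] += 1                      # uniform pick (prob. 1/|D|)
    for i in np.random.choice(n, size=n_samples, replace=True, p=weights):
        hc2[cell(points[i])] += 1                      # biased pick
    return hc1, hc2
```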

The hypercube generation algorithm for the second kind of change is almost the same as the algorithm above, except that instead of a designated area, only one abnormal point is randomly picked with a very high probability while all the other normal points are picked with the same low probability.

6.2 Testing Environment

Tests were run on one of the lab servers (buckeye), which has two dual-core 2.0 GHz AMD Opteron processors and 4 GB of main memory. The operating system is Linux (Fedora Core 4) with a 2.6.17 x86-64 kernel. All the algorithms were implemented in C++ and compiled with GNU gcc 4.0.2.

6.3 Running Costs

To test the running costs of both algorithms, we use the same data set with different grid sizes. Figure 6.2 shows the running time of both algorithms.

Figure 6.2: Running time of both algorithms vs grid sizes

From Figure 6.2 we can see that the dynamic programming algorithm has a much higher running cost of O(n³), while the greedy algorithm is much faster, with a linear cost.

6.4 Test Results

We tested the accuracy of the algorithms using 7 of the 10 UCI data sets. The results are listed in Appendix A. All of them give the optimal solution as we desired.

6.5 Accuracy of the Greedy Algorithm

The results of the dynamic programming algorithm are always the globally optimal solution. But it has a huge O(n³) running cost, so this algorithm can only run on relatively small data grids (for example 20 × 20). The greedy algorithm is much faster and has the preferred linear running cost, but it can only give a locally optimal solution, so the accuracy of the greedy algorithm should be compared to that of the dynamic programming algorithm. Table 6.2 shows the cutting results of the dynamic programming algorithm and the greedy algorithm for both kinds of changes.

                  First kind       Second kind
                  DP    Greedy     DP    Greedy
Abalone            2       3       21      28
Anneal             1       1        5      15
Auto MPG           1       1        7      15
Clouds             1       1        2       2
Cement             1       1       10      31
Credit             1       1        2       4
Vowel Context      1       1        5      14
Cylinder           3      10        7      21
Echocardiogram    11      25       11      21
Ecoli              5      27       11      21

Table 6.2: Comparison of the two algorithms

From Table 6.2 we can see that for the first kind of change, the greedy algorithm performs well on most data sets, but for the second kind of change, the greedy algorithm usually makes many unnecessary cuts.

Chapter 7

Conclusion and Future Work

This thesis proposes a framework to describe the statistical changes between two data sets with identical attributes. The concept of the optimal partition is introduced to solve the describing problem. A dynamic programming algorithm is implemented, and its effectiveness and efficiency in solving the describing problem are tested. A greedy algorithm is implemented as well; it is much more cost-efficient than the dynamic programming algorithm, but it cannot give the optimal solution in high-dimensional cases. An open problem remains for data sets with categorical attributes. Finding the optimal partition point in one categorical attribute with size n requires O(2ⁿ) time, so using the algorithms in this thesis is not practical. But a well-designed heuristic might give a good approximation. One possible way to do this is to utilize the inherent hierarchical structure of the categorical attributes, which can greatly reduce the number of combinations of categorical attribute values.

Bibliography

[1] Deepak Agarwal, Dhiman Barman, Dimitrios Gunopulos, Neal E. Young, Flip Korn, and Divesh Srivastava. Efficient and effective explanation of change in hierarchical summaries. In KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 6–15, New York, NY, USA, 2007. ACM.

[2] Charu C. Aggarwal. On change diagnosis in evolving data streams. IEEE Transactions on Knowledge and Data Engineering, 17(5):587–600, 2005.

[3] A. Asuncion and D.J. Newman. UCI machine learning repository, 2007.

[4] Stephen D. Bay and Michael J. Pazzani. Detecting group differences: Mining contrast sets. Data Min. Knowl. Discov., 5(3):213–246, 2001.

[5] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. LOF: Identifying density-based local outliers. In Weidong Chen, Jeffrey F. Naughton, and Philip A. Bernstein, editors, SIGMOD Conference, pages 93–104. ACM, 2000.

[6] Surajit Chaudhuri and Umeshwar Dayal. An overview of data warehousing and OLAP technology. SIGMOD Record, 26(1):65–74, 1997.

[7] E. B. Hunt et al. Experiments in Induction. Academic Press, New York, 1966.


[8] Jerome H. Friedman and Lawrence C. Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, 7(4):697–717, 1979.

[9] Venkatesh Ganti, Johannes Gehrke, Raghu Ramakrishnan, and Wei-Yin Loh. A framework for measuring differences in data characteristics. J. Comput. Syst. Sci., 64(3):542–578, 2002.

[10] Norbert Henze. A multivariate two-sample test based on the number of nearest neighbor type coincidences. The Annals of Statistics, 16(2):772–783, 1988.

[11] Panagiotis Karras and Nikos Mamoulis. The Haar+ tree: A refined synopsis data structure. In ICDE, pages 436–445. IEEE, 2007.

[12] Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting change in data streams. In VLDB '04: Proceedings of the Thirtieth international conference on Very large data bases, pages 180–191. VLDB Endowment, 2004.

[13] Edwin M. Knorr, Raymond T. Ng, and Vladimir Tucakov. Distance-based outliers: Algorithms and applications. VLDB Journal: Very Large Data Bases, 8(3–4):237–253, 2000.

[14] Paul R. Rosenbaum. An exact distribution-free test comparing two multivariate distributions based on adjacency. Journal Of The Royal Statistical Society Series B, 67(4):515–530, 2005.

[15] Sunita Sarawagi, Rakesh Agrawal, and Nimrod Megiddo. Discovery-driven exploration of OLAP data cubes. In Hans-Jörg Schek, Fèlix Saltor, Isidro Ramos, and Gustavo Alonso, editors, EDBT, volume 1377 of Lecture Notes in Computer Science, pages 168–182. Springer, 1998.

[16] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–86, 1951.

[17] A. Wald and J. Wolfowitz. On a test whether two samples are from the same population. The Annals of Mathematical Statistics, 11(2):147–162, 1940.

[18] Yunyue Zhu and Dennis Shasha. Efficient elastic burst detection in data streams. In Lise Getoor, Ted E. Senator, Pedro Domingos, and Christos Faloutsos, editors, KDD, pages 336–345. ACM, 2003.

Appendix A

Dynamic Programming Algorithm Results For the Real Data Sets

Figure A.1: Cutting result for the first kind of change of Abalone


Figure A.2: Cutting result for the second kind of change of Abalone

Figure A.3: Cutting result for the first kind of change of Auto MPG

Figure A.4: Cutting result for the second kind of change of Auto MPG

Figure A.5: Cutting result for the first kind of change of Clouds

Figure A.6: Cutting result for the second kind of change of Clouds

Figure A.7: Cutting result for the first kind of change of Cement

Figure A.8: Cutting result for the second kind of change of Cement

Figure A.9: Cutting result for the first kind of change of Credit

Figure A.10: Cutting result for the second kind of change of Credit

Figure A.11: Cutting result for the first kind of change of Vowel Context

Figure A.12: Cutting result for the second kind of change of Vowel Context

Figure A.13: Cutting result for the first kind of change of Cylinder

Figure A.14: Cutting result for the second kind of change of Cylinder