A χ²-based Summarization Framework

For Differences Between Two Data Sets

A thesis submitted to Kent State University in partial fulfillment of the requirements for the degree of Master of Science

by

Dong Wang

May 2009

Thesis written by
Dong Wang
B.S., University of Science and Technology of China, 1997
M.S., University of Science and Technology of China, 2000
M.S., Kent State University

Approved by

Dr. Ruoming Jin, Advisor

Dr. Robert Walker, Chair, Department of Computer Science

Dr. John R.D. Stalvey, Dean, College of Arts and Sciences

Table of Contents

List of Figures

List of Tables

List of Algorithms

1 Introduction
1.1 Ubiquitous Change
1.2 Hypothesis Testing
1.3 Test Statistics Properties
1.4 Test Statistic Methods
1.5 Utilizing the χ² Test

2 Related Work
2.1 Change Detection
2.2 Measuring Differences in Data Sets
2.3 Describing Differences Between Multidimensional Data Sets

3 Problem Definition
3.1 Histogram-based Analysis
3.2 Multi-dimensional Contingency Table
3.3 Decision-tree Like Structure Approach

4 Dynamic Programming Algorithm

5 Greedy Recursive Scheme

6 Experimental Results
6.1 Data Preparation
6.2 Testing Environment
6.3 Running Costs
6.4 Test Results

6.5 Accuracy of the Greedy Algorithm

7 Conclusion and Future Work

Bibliography

Appendices

A Dynamic Programming Algorithm Results For the Real Data Sets

List of Figures

3.1 Building of one-dimensional contingency table
3.2 Illustration of a 2-d grid

3.3 Final result of the optimal cuts

4.1 Possible cuts of an 8 × 6 2d hypercube
4.2 Illustration of the decision tree building for 1-d hypercube
4.3 Illustration of the decision tree building for 2-d hypercube

5.1 A simple 1-d test hypercube

6.1 A simple tree-structure explanation
6.2 Running time of both algorithms vs grid sizes

A.1 Cutting result for the first kind of change of Abalone
A.2 Cutting result for the second kind of change of Abalone

A.3 Cutting result for the first kind of change of Auto MPG
A.4 Cutting result for the second kind of change of Auto MPG
A.5 Cutting result for the first kind of change of Clouds
A.6 Cutting result for the second kind of change of Clouds
A.7 Cutting result for the first kind of change of Cement
A.8 Cutting result for the second kind of change of Cement
A.9 Cutting result for the first kind of change of Credit

A.10 Cutting result for the second kind of change of Credit
A.11 Cutting result for the first kind of change of Vowel Context
A.12 Cutting result for the second kind of change of Vowel Context

A.13 Cutting result for the first kind of change of Cylinder
A.14 Cutting result for the second kind of change of Cylinder

List of Tables

6.1 List of Data Sets
6.2 Comparison of the two algorithms

List of Algorithms

1 Dynamic Programming Algorithm To Find The Maximum Independence

2 Greedy Algorithm To Find The Maximum Independence

3 Third Kind Of Change Generation
4 Hypercube Generation

Chapter 1

Introduction

1.1 Ubiquitous Change

One of the fundamental problems in data mining is comparing two data sets that share the same set of attributes and characterizing the change between them. Detecting and describing such changes has great potential in many research areas. Listed below are some examples:

• Every month a retail store generates a sales report. The manager of the store wants to compare the reports from different periods and find out which factor contributes most to the differences.

• In several locations of the tropical ocean, water temperatures at multiple depths are collected. By comparing data sets collected at different times of the year, the scientists want to know which area contributes most significantly to the differences between the two collection times.

• The log files of two popular web sites might have very different patterns. The administrators want to know what causes the differences. The reason might be a certain point of time or a certain group of visitors.

In these examples, the two data sets under study have identical sets of attributes; what differs is the distribution of the data. A summary of these differences could help the managers find trends in the consumer market and make the right decisions. Unlike the widely used OLAP tools [6], which can drill down or roll up to different levels of aggregation and help the user find the differences, what is required here is a method to describe and explain these differences.

1.2 Hypothesis Testing

To describe differences, they must first be defined statistically. Suppose there are two different data sets, say two transactional data sets D1 and D2 from a chain store in two different locations. For D1 the underlying distribution function is F1, and for D2 it is F2. The question is whether there is any difference between F1 and F2. So there are two hypotheses, a null hypothesis H0 and an alternative hypothesis H1:

H0 : F1 = F2

H1 : F1 ≠ F2

A test statistic s(F1,F2) is needed, which is a function of F1 and F2 (and therefore a function of D1 and D2); from the value of s it can be decided (with a certain degree of confidence) to reject one of the hypotheses above.

1.3 Test Statistics Properties

There are several properties a good test statistic has to have:

1. Consistency: When min(|D1|, |D2|) → ∞, we can always reject one of the hypotheses. For example, suppose D1 is drawn from a Bernoulli distribution with p = 0.5 and D2 is also drawn from a Bernoulli distribution, but with p = 0.49. If the difference of the two sample means, s = |µ1 − µ2|, is used as the test statistic, the right decision (which is to reject H0) might not be reached when the sample size is small; however, when the sample size goes to infinity, even though the difference between D1 and D2 is very small, s will become so significant that H0 can be rejected with high confidence.

2. Distribution-free: In most real data sets, it is difficult to determine the functions F1 and F2 in advance, and even if they are determined once, the distribution functions could change from time to time. So it is necessary to make the test statistic distribution-free.

3. Power: If more than one test statistic meets the requirements above, which one should be used? There are two types of errors in a statistical decision process. A type I error is rejecting a hypothesis that should have been accepted; a type II error is accepting a hypothesis that should have been rejected. The power of a test statistic is the probability of NOT committing a type II error. A good test statistic should have a power close to 1, so the hypothesis can be rejected when it is not correct.

So the goal is to find a test statistic that is consistent, distribution-free and has a power as close to 1 as possible.

1.4 Test Statistic Methods

In real-world applications, many data sets have more than one attribute, so besides the properties listed in Section 1.3, the test statistic methods must be multivariate. Listed below are some of the qualified methods:

1. Friedman's MST method [8]. As an extension of the univariate Wald-Wolfowitz test [17], Friedman's MST method can be used to find the number of homogeneous regions in the mixture of two types of multivariate points. At the beginning, a complete graph is built based on the distances δij between any two points i and j. Then a minimum spanning tree (MST) is built from the complete graph. Next, all the edges across the two different groups are removed. Since it is an MST, each time an edge is removed, the total number of isolated subgraphs increases by one. After all the removals, a set of homogeneous and isolated subgraphs is formed, and the number of these isolated subgraphs can be used as the test statistic (a short code sketch of this statistic is given after this list).

2. The k-nearest neighbors [10]. In multi-dimensional space, once the distance δij between any two points i and j is calculated (it could be a metric like the Euclidean distance), the k nearest neighbors of each point can be determined. Each point pi, together with its k nearest neighbors, forms a sphere centered at pi. At the end, the number of homogeneous spheres can be used as the test statistic.

3. Cross-matching [14]. First, the Kullback-Leibler divergence [16] is calculated between any two points from both types. Supposing the total number of points of the two types is N, a minimum-distance non-bipartite matching can be performed, generating N/2 pairs (if N is odd, a dummy point d with δid = 0, i ∈ {1, 2, . . . , N}, is added, and the pair containing the dummy point is discarded). The test statistic is the number of homogeneous pairs among these N/2 pairs.

1.5 Utilizing the χ² Test

Among these tests, the χ² test [7] is of particular interest. A χ² test is a statistical hypothesis test whose test statistic has a χ² distribution when the null hypothesis is true. It is one of the most widely used non-parametric statistical tests for goodness-of-fit or for determining whether two random variables are independent. The χ² test statistic is calculated as below:

X² = Σ_{i=1}^{n} (O_i − E_i)² / E_i

where X² is the test statistic that asymptotically approaches a χ² distribution, O_i is an observed value, E_i is the corresponding expected value assuming the null hypothesis, and n is the number of possible outcomes. In our case, we are interested in whether the two testing data sets are generated from the same distribution or not, so the χ² test is a perfect candidate.
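As a quick illustration of this formula, the following sketch computes X² for a 2 × n contingency table whose rows hold the binned counts of the two data sets, with the expected counts taken from the usual independence model (row total × column total / grand total). The helper name and the example numbers are made up for illustration only.

```python
import numpy as np

def chi_square_statistic(observed):
    """X^2 = sum over all cells of (O_i - E_i)^2 / E_i, with E_i taken from
    the independence model of the contingency table."""
    observed = np.asarray(observed, dtype=float)
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return float(((observed - expected) ** 2 / expected).sum())

# Two data sets binned into the same four bins, one row per data set.
table = np.array([[25, 30, 22, 23],
                  [40, 18, 21, 21]])
x2 = chi_square_statistic(table)   # compare against a chi^2 critical value, df = n - 1
```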

Chapter 2

Related Work

2.1 Change Detection

Much research has been done in this area. Breunig et al. [5] used the degree of being an outlier, called the local outlier factor, for each object; the larger this measurement is, the more likely the object is an outlier. Knorr et al. [13] studied distance-based outliers, which also use the number of nearest neighbors, and showed that even for large data sets the detection can be done at a complexity of O(kN²). Change detection in data streams is also an interesting topic. Kifer et al. [12] used a novel measurement to find the change point in a multidimensional data stream. Aggarwal et al. [2] used the concept of velocity density estimation to diagnose the change in evolving data streams. Zhu et al. [18] proposed a data structure called the Shifted Wavelet Tree to monitor elastic bursts. Bay et al. [4] utilized contrast set mining to detect group differences. All these methods focus on the detection of abnormal points (or regions), not on quantifying or describing the change.

2.2 Measuring Differences in Data Sets

Ganti et al. [9] developed a framework to measure the deviation between two data sets. This framework mainly studied two classes of data mining models: frequent itemset models (called lits-models) and decision tree models (called dt-models). For lits-models, each of the two data sets has a set of frequent itemsets (which can be viewed as a set of interesting regions) and each itemset has a support value. So the difference between these two data sets can be quantified by the structure component (which is the set of frequent itemsets) and the measure component (which is the support value of each frequent itemset).

Similarly, for dt-models, each of the two data sets has a decision tree structure consisting of leaf nodes and inner nodes. Each node can also be viewed as an interesting region. Since it is a decision tree, each node is associated with the fractions of the two possible values of the dependent attribute. So by using the tree structure as the structure component and the two fraction values as the measure component, the difference between these two data sets can also be quantified. If the two data sets have identical structure components, the difference between the two data sets is simply the sum of the differences over every region. When the structure components of the two data sets are not the same, both data sets need to extend their structure components so that the two extended structure components become identical; the deviations of all the regions in the two extended structure components are then aggregated to get the difference between the two data sets. The limitation of this framework is that when the structure of the data set cannot fit into this two-component model, there is no way to calculate the value of the difference.

2.3 Describing Differences Between Multidimensional Data Sets

There are several methods for describing the changes between two data sets.

1. Compress a large data set into a compact summary [11]. The Haar+ tree is used to summarize data sets with sharp discontinuities or bursts.

2. Use the native hierarchical structure of the data set and find the most parsimonious explanations of changes [1].

3. Use the so-called discovery-driven exploration [15]. This method uses pre-computed exceptions to help the user navigate large OLAP cubes.

All the above methods have one common limitation: they have to use the inherent hierarchical structures of the attributes. When the attributes are continuous, without these kinds of hierarchical structures, a method is needed to effectively summarize the differences between the two data sets. Inspired by these methods, we propose a novel framework that can deal with multi-dimensional continuous attributes, detect the abnormal regions between two data sets, and give an intuitive hierarchical explanation of the difference between the two data sets.

Chapter 3

Problem Definition

3.1 Histogram-based Analysis

The first step in analyzing the difference between two data sets using a non-parametric, distribution-free statistical test against continuous data is discretization of the continuous data. The easiest way of discretization is using histograms. For a data set D with continuous attribute A, if the size of this data set is |D|, a histogram can be described as a set of n (here n ≪ |D|) consecutive bins Bi, where 1 ≤ i ≤ n. For each Bi, an integer value vi is assigned, which represents the number of entities whose value falls into the bin Bi. To divide D, we can use the equal distance method or the equal frequency method. In the equal distance method, all of the n bins have the same width in the same dimension, while in the equal frequency method, the widths of the bins are adjusted so that all of the n bins hold the same number of entities. For the sake of simplicity and because of the nature of the tests, the equal distance method is used in the algorithms of this thesis, but all the algorithms can be easily modified to use the equal frequency method as well.
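The sketch below shows the equal distance discretization of one continuous attribute, with the equal frequency variant differing only in how the bin edges are chosen. The function names are illustrative, not taken from the thesis.

```python
import numpy as np

def equal_distance_bins(values, n):
    """Discretize one continuous attribute into n consecutive equal-width bins
    B_1..B_n and return the count v_i of entities falling into each bin."""
    edges = np.linspace(np.min(values), np.max(values), n + 1)
    counts, _ = np.histogram(values, bins=edges)
    return counts, edges

def equal_frequency_bins(values, n):
    """Same idea, but the bin widths are adjusted so that every bin holds
    roughly the same number of entities (edges taken at quantiles)."""
    edges = np.quantile(values, np.linspace(0.0, 1.0, n + 1))
    counts, _ = np.histogram(values, bins=edges)
    return counts, edges
```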


3.2 Multi-dimensional Contingency Table

The concept of a multi-dimensional contingency table is used to represent the data set before dividing. The traditional n × m contingency table only has two variables, which are represented by the columns and rows of the table. In this problem, if the traditional contingency table is used, since there are two data sets and each data set takes one row, only one variable can be represented by the columns. Figure 3.1 shows how a one-dimensional contingency table is built.

Figure 3.1: Building of one-dimensional contingency table

To utilize the contingency table for multivariate data sets, the multi-dimensional contingency table can be used. Suppose there are two data sets and both of them have two attributes; a 2-d grid can be created, in which each axis represents one attribute and covers all the possible values of that attribute in both data sets. Figure 3.2 shows what a 2-d grid looks like.

Figure 3.2: Illustration of a 2-d grid

3.3 Decision-tree Like Structure Approach

The approach taken in this thesis to the describing problem is to build a decision-tree-like structure in which each node is a subset of the original data set. This structure is called the "Subset Tree," and it has the following characteristics:

1. The subset tree is a full binary tree where the root is the entire table.

2. There’s no overlap between any two siblings.

3. All the leaf nodes are either independent subsets or dependent subsets with only two cells.

Definition 1 Given a 2 × m contingency table D, the cut cost of the cut between point i and i + 1 is defined as the result of a χ² test against a 2 × 2 contingency table V, where V_{j,1} = Σ_{k=1}^{i} D_{j,k} and V_{j,2} = Σ_{k=i+1}^{m} D_{j,k}, for j = 1, 2.

This definition can be easily extended to multi-dimensional contingency tables, since a multi-dimensional contingency table can always be linearized into a 1-d contingency table.
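A small sketch of Definition 1 is given below. It assumes a significance level α, which the thesis does not fix, and uses SciPy's chi-square test on the collapsed 2 × 2 table; Chapter 4 later reduces this cost to 0 (independent cut) or 1 (dependent cut).

```python
import numpy as np
from scipy.stats import chi2_contingency

def cut_cost(table, i, alpha=0.05):
    """Cut cost between columns i and i+1 (1-based) of a 2 x m contingency
    table: collapse the columns on each side of the cut into one, giving a
    2 x 2 table V, and run a chi-square test on V."""
    table = np.asarray(table, dtype=float)
    v = np.column_stack([table[:, :i].sum(axis=1),    # V_{j,1} = sum_{k<=i} D_{j,k}
                         table[:, i:].sum(axis=1)])   # V_{j,2} = sum_{k>i}  D_{j,k}
    chi2, p, _, _ = chi2_contingency(v, correction=False)
    return chi2, p < alpha      # the statistic, and whether the cut is "dependent"

# e.g. cut a 2 x 5 table between columns 2 and 3
d = np.array([[10, 12, 30, 28, 31],
              [11, 13, 12, 10, 9]])
print(cut_cost(d, 2))
```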

The ultimate goal of the algorithms is to find the set of optimal cuts that partition the original data set into subsets which are statistically independent. Figure 3.3 illustrates the final result of the optimal cuts:

Figure 3.3: Final result of the optimal cuts.

Definition 2 The optimal solution of the hierarchical cut of a multi-dimensional contingency table is the hierarchical cut that gives the least sum of cut costs and dependent leaf subsets.

To get the optimal solution, a dynamic programming algorithm and a greedy algorithm are introduced.

Chapter 4

Dynamic Programming Algorithm

The optimal-cutting problem defined in Chapter 3 can be solved using a divide-and-conquer strategy. For a k-dimensional hypercube h, first all the attributes need to be independent of each other. In certain scenarios, some attributes might be dependent on each other, but they can be orthogonalized into k′ independent attributes, where k′ ≤ k. The orthogonalization methods are not discussed in this thesis, and the assumption is that the k attributes are independent of each other. Under this assumption, for a k-dimensional hypercube with c_i regions along dimension i, the candidate cuts are the planes x_i = 1, 2, . . . , c_i − 1 for each dimension i. For example, for a two-dimensional 8 × 6 hypercube, all 12 possible cuts are illustrated in Figure 4.1.

Figure 4.1: Possible cuts of an 8 × 6 2d hypercube


The next step is to divide the original problem into sub-problems. This step is trivial since each cut intuitively generates two non-overlapping and exhaustive sub-hypercubes. In the definition of the optimal cut, the difference between the two data sets comes from two sources: the cut costs and the dependency of the leaves. These values are set arbitrarily. In the implementation of this algorithm, the cost of each independent cut is quantified as 0 and the cost of each dependent cut as 1. At the same time, an independent leaf contributes 0 to the total dependency and a dependent leaf contributes 1 to the total dependency. The test statistic is arbitrarily set too; theoretically any distribution-free test statistic can be used. In this algorithm, the χ² test is used as the criterion to determine the dependency of a contingency table. Since the χ² test is suitable for nominal attributes, the multi-dimensional hypercube can be linearized into a 1-d table, and the order of the linearization does not affect the test result. Algorithm 1 is the classic top-down dynamic programming algorithm together with memoization. In this algorithm, a cut is labeled as dependent if the resulting two sub-cubes are dependent on each other, and the total dependency number Σ is increased by 1 for this cut; otherwise it is labeled as independent and the value of Σ is unchanged. The hypercube will be cut unless

• the number of cells within the hypercube is 2 or less, or

• the linearized hypercube is independent according to the χ² test.

The cut process continues recursively from the top down. Memoization is applied, so the total cost of this algorithm is O(n³). To test the effectiveness of this algorithm, several synthetic data sets with different dependency numbers are used. Figure 4.2 shows the test result for a 1-d hypercube. The two data sets of this hypercube are a baseline data set with the uniform distribution f(i) = 0.05, 0 ≤ i ≤ 19, and a changed data set with four regions: region 1 from cell 0 to cell 4, with the uniform distribution f = 0.02; region 2 from cell 5 to cell 9, with the uniform distribution f = 0.08; region 3 from cell 10 to cell 14, with the uniform distribution f = 0.02; and region 4 from cell 15 to cell 19, with the uniform distribution f = 0.08.

Algorithm 1 Dynamic Programming Algorithm To Find The Maximum Independence
Require: Hypercube Hc filled with two sets of k-d points
Ensure: The partition yielding the maximal independent sub-tables, along with the total dependency number Σ

1: sub-cube h ← Hc, Σ ← 0
2: linearize h and fill the result into a 1-d contingency table T
3: if size of T ≤ 2 then
4:   Σ ← χ²(T)
5:   return Σ
6: end if
7: if χ²(T) = 0 then
8:   return 0
9: end if
10: for all candidate cuts of h do
11:   cut h into h1 and h2
12:   if χ²(h1, h2) > 0 then
13:     Σ ← Σ + 1
14:   end if
15:   Σ(h) ← min(Σ(h1) + Σ(h2))
16: end for
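The sketch below restates Algorithm 1 for the 1-d case (a 2 × m table) in Python. It is a simplified illustration, not the thesis's C++ implementation: the 0/1 cost convention follows the text above, the significance level ALPHA is an assumption, and SciPy's chi-square test stands in for χ²(·).

```python
import numpy as np
from scipy.stats import chi2_contingency

ALPHA = 0.05   # assumed significance level; the thesis does not fix one

def dependent(table):
    """True if the chi-square test rejects independence for a 2 x m slice."""
    table = np.asarray(table, dtype=float)
    if table.shape[1] < 2 or (table.sum(axis=0) == 0).any() or (table.sum(axis=1) == 0).any():
        return False                       # degenerate slice: treat as independent
    _, p, _, _ = chi2_contingency(table, correction=False)
    return p < ALPHA

def min_dependency(table, lo, hi, memo=None):
    """Minimal total dependency Sigma for columns [lo, hi): 1 is charged for
    every dependent cut and for every dependent leaf, as in Algorithm 1."""
    table = np.asarray(table, dtype=float)
    memo = {} if memo is None else memo
    if (lo, hi) in memo:
        return memo[(lo, hi)]
    sub = table[:, lo:hi]
    if hi - lo <= 2:                       # leaf with at most two cells per row
        best = int(dependent(sub))
    elif not dependent(sub):               # already independent: stop cutting
        best = 0
    else:
        best = float('inf')
        for cut in range(lo + 1, hi):      # try every candidate cut position
            halves = np.column_stack([table[:, lo:cut].sum(axis=1),
                                      table[:, cut:hi].sum(axis=1)])
            cost = int(dependent(halves))  # cut cost: 0 if independent, 1 if not
            best = min(best, cost
                       + min_dependency(table, lo, cut, memo)
                       + min_dependency(table, cut, hi, memo))
    memo[(lo, hi)] = best
    return best

# Usage: counts is a 2 x 20 table of binned baseline / changed counts.
# sigma = min_dependency(counts, 0, counts.shape[1])
```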

Figure 4.2: Illustration of the decision tree building for 1-d hypercube

Figure 4.3 shows the test result for a 2-d hypercube. The two data sets of this hypercube are a baseline data set with the uniform distribution across the entire area and a changed data set with four regions: all four of them have uniform distribution, but two regions have a higher probability than the other two regions.

Figure 4.3: Illustration of the decision tree building for 2-d hypercube

From the above examples it can be concluded that the dynamic programming algorithm works as expected; the optimal cuts are found in both examples. Note that this implementation works for k-d hypercubes since it uses the top-down approach.

Chapter 5

Greedy Recursive Scheme

The dynamic programming algorithm proposed in the previous chapter gives the globally optimal solution. Its implementation for a hypercube with size n costs O(n³) time and O(n²) memory, so when the number of cells of the hypercube is small, the running cost is not a big problem. But when the number of attributes k is large, if each attribute has m divided subsets, the resulting hypercube will have m^k cells. Because of the limitations of the hardware, an efficient approximation is required to address the problem. A reasonable approach is to use a greedy algorithm instead of the dynamic programming algorithm. At each level of the decision tree, all the possible candidate cut positions are evaluated and the best local cut position is chosen; the cut process continues until all the subsets are statistically independent.

Algorithm 2 Greedy Algorithm To Find The Maximum Independence
Require: Hypercube Hc filled with two sets of k-d points
Ensure: The partition yielding the maximal independent sub-tables

1: h ← Hc
2: if χ²(h) = 0 then
3:   return 0
4: end if
5: cut point ← the cut point with min(χ²(h1) + χ²(h2))
6: cut h into h1 and h2
7: repeat steps 5 and 6 until all sub-cubes are independent or too small to cut
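For comparison with Algorithm 1, here is a 1-d Python sketch of the greedy scheme. Again this is only an illustration under the same assumptions (the ALPHA threshold, SciPy's chi-square test as χ²(·)); the thesis's actual implementation is in C++ and handles k-d hypercubes.

```python
import numpy as np
from scipy.stats import chi2_contingency

ALPHA = 0.05   # assumed significance level

def chi2_value(table):
    """Plain X^2 statistic of a 2 x m slice (0 for degenerate slices)."""
    table = np.asarray(table, dtype=float)
    if table.shape[1] < 2 or (table.sum(axis=0) == 0).any() or (table.sum(axis=1) == 0).any():
        return 0.0
    stat, _, _, _ = chi2_contingency(table, correction=False)
    return float(stat)

def is_dependent(table):
    """True if the chi-square test rejects independence for a 2 x m slice."""
    table = np.asarray(table, dtype=float)
    if table.shape[1] < 2 or (table.sum(axis=0) == 0).any() or (table.sum(axis=1) == 0).any():
        return False
    _, p, _, _ = chi2_contingency(table, correction=False)
    return p < ALPHA

def greedy_cuts(table, lo, hi, cuts=None):
    """Recursively cut columns [lo, hi): pick the single cut whose two halves
    have the smallest combined X^2 (step 5 of Algorithm 2), then recurse."""
    cuts = [] if cuts is None else cuts
    table = np.asarray(table, dtype=float)
    if hi - lo <= 2 or not is_dependent(table[:, lo:hi]):
        return cuts                        # independent or too small to cut
    best = min(range(lo + 1, hi),
               key=lambda c: chi2_value(table[:, lo:c]) + chi2_value(table[:, c:hi]))
    cuts.append(best)
    greedy_cuts(table, lo, best, cuts)
    greedy_cuts(table, best, hi, cuts)
    return cuts
```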


Figure 5.1: A simple 1-d test hypercube

Figure 5.1 shows the result of the greedy algorithm against a simple 1-d hypercube. The baseline data set in this test hypercube has a uniform distribution f(i) = 0.05, while the changed data set has two regions: region 1 has the uniform distribution f(i) = 0.02 from cell 0 to cell 9; region 2 has the uniform distribution f(i) = 0.08 from cell 10 to cell 19. The result shows that for a simple hypercube like this, the result of the greedy algorithm is the same as that of the dynamic programming algorithm. But for complex multi-dimensional hypercubes, the greedy algorithm usually introduces unnecessary cuts. This issue will be discussed in the next chapter. From the algorithm above we can predict that the cost of the greedy algorithm is O(n); the next chapter shows that the cost is indeed O(n).

Chapter 6

Experimental Results

6.1 Data Preparation

To test the algorithms with real data sets, 10 data sets are extracted from the UCI Repository [3]. All 10 data sets are multivariate with discrete or continuous variables. Table 6.1 lists the basic properties of the data sets.

data set name     data set size   number of attributes
Abalone           4177            2
Anneal             798            3
Auto MPG           398            2
Clouds            1024            2
Cement            1030            2
Credit             678            2
Vowel Context      990            2
Cylinder           527            2
Echocardiogram     120            2
Ecoli              336            2

Table 6.1: List of Data Sets.

Next, another data set with an identical set of attributes is generated. Three kinds of changes are used to build this new data set from the original data set.

1. The first kind of change is to shift part of the data set uniformly. This is done by adding or subtracting a constant to part of the original data set so that there is a jump between the two parts with different amounts of shift. In this kind of change, the algorithms used in this thesis should be able to find the exact positions where the jump happened. For instance, for a two-dimensional data set located in a rectangular area, if a positive constant is added to every non-empty cell of the upper half of the rectangle, the algorithm should find the horizontal line across the middle of the rectangle.

2. The second kind of change is to add one or more abnormal points. The position of the abnormal point(s) is set randomly. In this kind of change, the algorithms should be able to find the exact positions of the abnormal points. For instance, for a two dimensional data set, if there is only one abnormal point and the position of the abnormal point is (x, y), the algorithms should find two cuts: one on the X axis at the position x, the other on the Y axis at the position y.

3. The third kind of change is a user-defined change. In this case the user needs to provide a file containing the desired tree structure. For example, if the desired tree structure looks like Figure 6.1, then in the user-defined file each line represents a node.

Figure 6.1: A simple tree-structure explanation

Since the tree in this case is always a full binary tree, each node has either two child nodes or no child node. If the node is a leaf node, there are two fields in its line: the first field contains the name of the node, and the second field indicates the dependency of this leaf node. If the node is the root or an internal node, there are three fields in its line: the first field contains the name of the node, followed by the names of its left child and its right child; the second field indicates the dependency of the cut at this node; the third field contains the cutting point of this node. The tree structure in Figure 6.1 can be represented by the following text file:

A,B,C I 3,7-4,0
B,D,E D 1,7-2,0
C,F,G D 5,7-6,0
D I
E I
F I
G I

To generate the changed data set, Algorithm 3 (given below) is applied; a sketch of how such a file might be parsed follows first.
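To make the file format concrete, here is one way such a description could be read. The dictionary layout and the function name are illustrative assumptions, not the loader used in the thesis.

```python
def parse_tree_file(path):
    """Read the user-defined tree file.  Inner-node lines have three fields,
    e.g. 'A,B,C I 3,7-4,0' (name,left,right  dependency  cut point); leaf
    lines have two fields, e.g. 'D I' (name  dependency)."""
    nodes = {}
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if not fields:
                continue
            if len(fields) == 2:                           # leaf node
                name, dep = fields
                nodes[name] = {'dependent': dep == 'D'}
            else:                                          # root or internal node
                name, left, right = fields[0].split(',')
                nodes[name] = {'left': left, 'right': right,
                               'dependent': fields[1] == 'D',
                               'cut_point': fields[2]}
    return nodes
```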

To generate the qualified data sets with the first kind of change using the real data sets, each of the real data sets is sampled with replacement twice. The first sampling is uniform across the entire data set: all the points in the data set have the same probability (1/|D|) of being picked. In the second sampling, the points in the designated areas have a smaller probability (for example, 1/(3 × |D|)) of being picked, and the other areas have a larger probability, but within each of the two types of areas the distribution is still uniform.

The two hypercubes Hc1 and Hc2 can then be superimposed to form a k-dimensional contingency table. The dynamic programming algorithm and the greedy algorithm are applied to each of the data sets.

Algorithm 3 Third Kind Of Change Generation
Require: A baseline hypercube and a user-defined tree structure
Ensure: A changed hypercube with the user-defined difference compared to the baseline hypercube

1: read the baseline hypercube and determine size Sall and value Vall
2: for all nodes i of the user-defined tree structure do
3:   if i is an inner node then
4:     if i is the root then
5:       S(i) ← Sall, V(i) ← Vall
6:     end if
7:     S(i.left) ← left part of S(i), S(i.right) ← right part of S(i)
8:     if the cut at i is independent then
9:       V(i.left) ← V(i) × S(i.left) / S(i)
10:      V(i.right) ← V(i) × S(i.right) / S(i)
11:    else
12:      V(i.left) ← V(i) × S(i.left) × 0.5 / S(i)
13:      V(i.right) ← V(i) − V(i.left)
14:    end if
15:  else
16:    if leaf node i is independent then
17:      for all cells k in i do
18:        V(k) ← Vbaseline(k) × V(i) / Vbaseline(i)
19:      end for
20:    else
21:      V(i.left) ← Vbaseline(i.left) × V(i) × 0.25 / Vbaseline(i)
22:      V(i.right) ← V(i) − V(i.left)
23:    end if
24:  end if
25: end for

Algorithm 4 Hypercube Generation
Require: A data set D with k attributes, each attribute having m subsets
Require: N sample points which are candidates to be filled into the hypercubes
Ensure: Two hypercubes Hc1 and Hc2 with m^k cells, filled with two shifted data sets

1: initialize the hypercubes Hc1 and Hc2 with m^k empty cells
2: for i = 1 to N do
3:   uniformly pick point p1 with replacement from D
4:   fill p1 into Hc1 according to its position
5:   pick point p2 with replacement from D, where the probability can change according to the position of p2
6:   fill p2 into Hc2 according to its position
7: end for
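A Python sketch of Algorithm 4 for the first kind of change follows: both hypercubes are filled by sampling the real data set with replacement, and the second pass simply down-weights points inside a designated region. The region, the 1/3 down-weighting factor, and the function name are illustrative assumptions.

```python
import numpy as np

def generate_hypercubes(points, edges, n_samples, region, factor=1.0 / 3.0):
    """Build Hc1 (uniform sampling) and Hc2 (biased sampling) from a data set
    `points` of shape (|D|, k).  `edges` holds the bin edges for each attribute;
    `region` is a (low, high) interval on the first attribute whose points are
    picked with probability scaled down by `factor` in the second pass."""
    points = np.asarray(points, dtype=float)
    n, k = points.shape
    shape = tuple(len(e) - 1 for e in edges)           # m cells per dimension
    hc1 = np.zeros(shape, dtype=int)
    hc2 = np.zeros(shape, dtype=int)

    lo, hi = region                                    # down-weight this slab
    weights = np.where((points[:, 0] >= lo) & (points[:, 0] < hi), factor, 1.0)
    weights /= weights.sum()

    def cell(p):
        """Map a point to the index of the grid cell it falls into."""
        return tuple(int(np.clip(np.searchsorted(edges[d], p[d], side='right') - 1,
                                 0, shape[d] - 1)) for d in range(k))

    for i in np.random.choice(n, size=n_samples, replace=True):
        hc1[cell(points[i])] += 1                      # uniform pick (prob. 1/|D|)
    for i in np.random.choice(n, size=n_samples, replace=True, p=weights):
        hc2[cell(points[i])] += 1                      # biased pick
    return hc1, hc2
```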

The hypercube generation algorithm for the second kind of change is almost the same as the algorithm above, except that instead of a designated area, only one abnormal point is randomly picked with a very high probability while all the other normal points are picked with the same low probability.

6.2 Testing Environment

Tests were run on one of the lab servers (buckeye), which has two dual-core 2.0 GHz AMD Opteron processors and 4 GB of main memory. The operating system is Linux (Fedora Core 4) with a 2.6.17 x86-64 kernel. All the algorithms were implemented in C++ and compiled with GNU gcc 4.0.2.

6.3 Running Costs

To test the running costs of both algorithms, we use the same data set with different grid sizes. Figure 6.2 shows the running time of both algorithms.

Figure 6.2: Running time of both algorithms vs grid sizes

From Figure 6.2 we can see that the dynamic programming algorithm has a much higher running cost of O(n³), while the greedy algorithm is much faster, with a linear cost.

6.4 Test Results

We tested the accuracy of the algorithms using 7 of the 10 UCI data sets. The results are listed in Appendix A. All of them give the optimal solution as we desired.

6.5 Accuracy of the Greedy Algorithm

The results of the dynamic programming algorithm are always the globally optimal solution. But it has a huge O(n³) running cost, so this algorithm can only run on relatively small data grids (for example 20 × 20). The greedy algorithm is much faster and has the preferred linear running cost, but it can only give a locally optimal solution, so the accuracy of the greedy algorithm should be compared to that of the dynamic programming algorithm. Table 6.2 shows the cutting results of the dynamic programming algorithm and the greedy algorithm for both kinds of changes.

                  First kind       Second kind
                  DP    Greedy     DP    Greedy
Abalone            2       3       21      28
Anneal             1       1        5      15
Auto MPG           1       1        7      15
Clouds             1       1        2       2
Cement             1       1       10      31
Credit             1       1        2       4
Vowel Context      1       1        5      14
Cylinder           3      10        7      21
Echocardiogram    11      25       11      21
Ecoli              5      27       11      21

Table 6.2: Comparison of the two algorithms

From Table 6.2 we can see that for the first kind of change, the greedy algorithm performs well on most data sets, but for the second kind of change, the greedy algorithm usually makes many unnecessary cuts.

Chapter 7

Conclusion and Future Work

This thesis proposes a framework to describe the statistical changes between two data sets with identical attributes. The concept of the optimal partition is introduced to solve the describing problem. A dynamic programming algorithm is implemented, and its effectiveness and efficiency in solving the describing problem are tested. A greedy algorithm is implemented as well; it is much more cost-efficient than the dynamic programming algorithm, but it cannot give the optimal solution in high-dimensional cases. An open problem remains for data sets with categorical attributes. Finding the optimal partition point in one categorical attribute with size n requires O(2ⁿ) time, so using the algorithms in this thesis is not practical. But a well-designed heuristic might give a good approximation. One possible way to do this is to utilize the inherent hierarchical structure of the categorical attributes, which can greatly reduce the number of combinations of categorical attribute values.

Bibliography

[1] Deepak Agarwal, Dhiman Barman, Dimitrios Gunopulos, Neal E. Young, Flip Korn, and Divesh Srivastava. Efficient and effective explanation of change in hierarchical summaries. In KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 6–15, New York, NY, USA, 2007. ACM.

[2] Charu C. Aggarwal. On change diagnosis in evolving data streams. IEEE Transactions on Knowledge and Data Engineering, 17(5):587–600, 2005.

[3] A. Asuncion and D.J. Newman. UCI machine learning repository, 2007.

[4] Stephen D. Bay and Michael J. Pazzani. Detecting group differences: Mining contrast sets. Data Min. Knowl. Discov., 5(3):213–246, 2001.

[5] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. LOF: Identifying density-based local outliers. In Weidong Chen, Jeffrey F. Naughton, and Philip A. Bernstein, editors, SIGMOD Conference, pages 93–104. ACM, 2000.

[6] Surajit Chaudhuri and Umeshwar Dayal. An overview of data warehousing and OLAP technology. SIGMOD Record, 26(1):65–74, 1997.

[7] E. B. Hunt et al. Experiments in Induction. Academic Press, New York, 1966.


[8] Jerome H. Friedman and Lawrence C. Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, 7(4):697–717, 1979.

[9] Venkatesh Ganti, Johannes Gehrke, Raghu Ramakrishnan, and Wei-Yin Loh. A framework for measuring differences in data characteristics. J. Comput. Syst. Sci., 64(3):542–578, 2002.

[10] Norbert Henze. A multivariate two-sample test based on the number of nearest neighbor type coincidences. The Annals of Statistics, 16(2):772–783, 1988.

[11] Panagiotis Karras and Nikos Mamoulis. The Haar+ tree: A refined synopsis data structure. In ICDE, pages 436–445. IEEE, 2007.

[12] Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting change in data streams. In VLDB '04: Proceedings of the Thirtieth international conference on Very large data bases, pages 180–191. VLDB Endowment, 2004.

[13] Edwin M. Knorr, Raymond T. Ng, and Vladimir Tucakov. Distance-based outliers: Algorithms and applications. VLDB Journal: Very Large Data Bases, 8(3–4):237–253, 2000.

[14] Paul R. Rosenbaum. An exact distribution-free test comparing two multivariate distributions based on adjacency. Journal Of The Royal Statistical Society Series B, 67(4):515–530, 2005.

[15] Sunita Sarawagi, Rakesh Agrawal, and Nimrod Megiddo. Discovery-driven exploration of OLAP data cubes. In Hans-Jörg Schek, Fèlix Saltor, Isidro Ramos, and Gustavo Alonso, editors, EDBT, volume 1377 of Lecture Notes in Computer Science, pages 168–182. Springer, 1998.

[16] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–86, 1951.

[17] A. Wald and J. Wolfowitz. On a test whether two samples are from the same population. The Annals of Mathematical Statistics, 11(2):147–162, 1940.

[18] Yunyue Zhu and Dennis Shasha. Efficient elastic burst detection in data streams. In Lise Getoor, Ted E. Senator, Pedro Domingos, and Christos Faloutsos, editors, KDD, pages 336–345. ACM, 2003.

Appendix A

Dynamic Programming Algorithm Results For the Real Data Sets

Figure A.1: Cutting result for the first kind of change of Abalone


Figure A.2: Cutting result for the second kind of change of Abalone

Figure A.3: Cutting result for the first kind of change of Auto MPG

Figure A.4: Cutting result for the second kind of change of Auto MPG

Figure A.5: Cutting result for the first kind of change of Clouds

Figure A.6: Cutting result for the second kind of change of Clouds

Figure A.7: Cutting result for the first kind of change of Cement

Figure A.8: Cutting result for the second kind of change of Cement

Figure A.9: Cutting result for the first kind of change of Credit

Figure A.10: Cutting result for the second kind of change of Credit

Figure A.11: Cutting result for the first kind of change of Vowel Context

Figure A.12: Cutting result for the second kind of change of Vowel Context

Figure A.13: Cutting result for the first kind of change of Cylinder

Figure A.14: Cutting result for the second kind of change of Cylinder