Nikolaj Tatti, Taneli Mielikäinen, Aristides Gionis, and Heikki Mannila. 2006. What is the dimension of your binary data? In: Proceedings of the Sixth IEEE International Conference on Data Mining (ICDM 2006), Hong Kong, 18-22 December 2006, pages 603-612.

© 2006 IEEE

Reprinted with permission.

This material is posted here with permission of the IEEE. Such permission of the IEEE does not in any way imply IEEE endorsement of any of Helsinki University of Technology's products or services. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to [email protected].

By choosing to view this document, you agree to all provisions of the copyright laws protecting it.

What is the dimension of your binary data?

Nikolaj Tatti    Taneli Mielikäinen    Aristides Gionis    Heikki Mannila

HIIT Basic Research Unit, Department of Computer Science, University of Helsinki and Helsinki University of Technology, Helsinki, Finland

Abstract

Many 0/1 datasets have a very large number of variables; however, they are sparse and the dependency structure of the variables is simpler than the number of variables would suggest. Defining the effective dimensionality of such a dataset is a nontrivial problem. We consider the problem of defining a robust measure of dimension for 0/1 datasets, and show that the basic idea of fractal dimension can be adapted for binary data. However, as such the fractal dimension is difficult to interpret. Hence we introduce the concept of normalized fractal dimension. For a dataset D, its normalized fractal dimension counts the number of independent columns needed to achieve the unnormalized fractal dimension of D. The normalized fractal dimension measures the degree of dependency structure of the data. We study the properties of the normalized fractal dimension and discuss its computation. We give empirical results on the normalized fractal dimension, comparing it against PCA.

1. Introduction

Many binary datasets occurring in data mining applications are, on one hand, complex as they have a very large number of columns. On the other hand, some of those datasets could be potentially simple, as they are very sparse or have lots of structure. In this paper we consider the problem of defining a notion of effective dimension for a binary dataset. We study ways of defining a concept of dimension that would somehow capture the complexity or simplicity of the dataset. Such a notion of effective dimension can be used as a general score describing the complexity or simplicity of the dataset; some potential applications of the intrinsic dimensionality of a dataset include model selection.

For continuous data there are many ways of defining the dimension of a dataset. One approach is to use decomposition methods such as SVD, PCA, or NMF (nonnegative matrix factorization) [14, 19] and to count how many components are needed to express, say, 90% of the variance in the data. This number of components can be viewed as the number of effective dimensions in the data.

In the aforementioned methods it is assumed that the dataset is embedded into a higher-dimensional space by some (smooth) mapping. The other main approach is to use a different concept, that of fractal dimensions [3, 9, 15, 23]. Very roughly, the concept of fractal dimension is based on the idea of counting the number of observations in a ball of radius r and looking at the rate of growth of this number as a function of r. If the number grows as r^k, then the dimensionality of the data can be considered to be k. Note that this approach does not provide any mapping that can be used for dimension reduction. Such a mapping does not even make sense because the dimension can be non-integral.

Applying these approaches to binary data is not straightforward. Many of the component methods, such as PCA and SVD, are strongly based on the assumption that the data are real-valued. NMF looks for a matrix decomposition with nonnegative entries and hence is somewhat better suited for binary data. However, the factor matrices may have continuous values, which makes them difficult to interpret. The component techniques aimed at discrete data (such as multinomial PCA [6] or latent Dirichlet allocation (LDA) [4]) are possible alternatives, but interpreting the results is hard.

In this paper we explore the notion of effective dimension for binary datasets by using the basic ideas from fractal dimensions. Essentially, we consider the distribution of the pairwise distances between random points in the dataset. Denoting by Z this distance, we study the ratio of log P(Z …
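To make the preceding idea concrete for 0/1 data, the following minimal Python sketch estimates a dimension of this kind by sampling pairs of rows, computing their Hamming distances, and fitting a slope to the log-log points of the distance distribution, restricted to the middle of that distribution. It illustrates only the general idea: the paper's exact Definition 1 of cd_A(D; α, 1 − α) is not reproduced in this excerpt, and all function and parameter names here are ours.

    import numpy as np

    def sample_distances(data, n_pairs=10000, seed=0):
        # Hamming distances between randomly chosen pairs of rows of a 0/1 matrix.
        rng = np.random.default_rng(seed)
        n = data.shape[0]
        i = rng.integers(0, n, size=n_pairs)
        j = rng.integers(0, n, size=n_pairs)
        return (data[i] != data[j]).sum(axis=1)

    def correlation_dimension(data, alpha=0.25, n_pairs=10000, seed=0):
        # Slope of log P(Z <= r) versus log r, fitted between the alpha and
        # (1 - alpha) quantiles of the pairwise-distance distribution Z.
        # An illustrative stand-in for cd_A(D; alpha, 1 - alpha), not Definition 1.
        z = sample_distances(data, n_pairs, seed)
        lo, hi = np.quantile(z, [alpha, 1.0 - alpha])
        rs = np.arange(max(lo, 1.0), hi + 1.0)
        cdf = np.array([(z <= r).mean() for r in rs])
        slope, _intercept = np.polyfit(np.log(rs), np.log(cdf), 1)
        return slope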

cd_A(D; α, 1 − α) ≤ cd_A(ind(D); α, 1 − α),

i.e., the correlation dimension of the original data is not larger than the correlation dimension of the data with each column permuted randomly.

Support for this conjecture is provided by the fact that the variance Var[Z_D] of the variable Z_D can be shown to be no more than the variance Var[Z_ind(D)]; this does not, however, suffice for the proof. The intuition behind the above conjecture is similar to what one observes in other types of definitions of dimension: if we randomly permute each column of a dataset, we expect the rank of the matrix to grow, and also the number of PCA components needed to explain, say, 90% of the variance to increase. In the experimental section we show empirical evidence for Conjecture 6.

4. Normalized correlation dimension

The definition of correlation dimension (Definition 1) is based on the definition of correlation dimension for continuous data. We have argued that the definition has some simple intuitive properties: for a dataset with independent variables the dimension is smaller if the variables are sparse, and the dimension seems to shrink if we add structure to the data by making variables positively correlated.

However, the scale of the correlation dimension is not very intuitive: the dimension of a dataset with K independent variables is not K, although this would be the most natural value. The correlation dimension gives much smaller values and hence we need some kind of normalization.

We showed in Section 3 that under some conditions independent variables maximize the correlation dimension. Informally, we define the normalized correlation dimension of a dataset D to be the number of variables that a dataset with independent variables must have in order to have the same correlation dimension as D does.

More formally, let ind(H, p) be a dataset with H independent variables, each of which is equal to 1 with probability p. From Proposition 1 we have an explicit formula for cd_A(ind(H, p); α, 1 − α), so one could choose H such that cd_A(ind(H, 1/2); α, 1 − α) and cd_A(D; α, 1 − α) are as close to each other as possible.

The problem with this way of normalizing the dimension is that it takes as the point of comparison a dataset where all the variables have the same marginal frequency. This is very far from being true in real data. Thus we modify the definition slightly. We first find a value s such that

cd_A(ind(K, s); α, 1 − α) = cd_A(ind(D); α, 1 − α),

i.e., a summary of the marginal frequencies of the columns of D: s is the frequency that the variables of an independent dataset should have in order that it has the same correlation dimension as D has when the columns of D have been randomized. We define the normalized correlation dimension, denoted by ncd_A(D; α, 1 − α), to be an integer H such that

|cd_A(ind(H, s); α, 1 − α) − cd_A(D; α, 1 − α)|

is minimized. Proposition 2 implies the following statement.

Proposition 7. Given a dataset D with K columns, the dimension ncd_A(D; α, 1 − α) can be approximated by

ncd_A(D; α, 1 − α) ≈ K · cd_A(D; α, 1 − α)² / cd_A(ind(D); α, 1 − α)².

For examples, see the beginning of the next section.
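As a concrete illustration of Proposition 7, the sketch below approximates ncd_A by the formula above, taking ind(D) to be the data with each column permuted independently at random (the column randomization discussed in Section 3) and reusing the illustrative correlation_dimension estimator sketched in the Introduction; this is an assumed reading, not the authors' exact procedure.

    import numpy as np

    def independent_version(data, seed=0):
        # ind(D): permute each column independently; this keeps the column
        # margins but destroys the dependencies between columns.
        rng = np.random.default_rng(seed)
        return np.column_stack([rng.permutation(col) for col in data.T])

    def normalized_correlation_dimension(data, alpha=0.25, seed=0):
        # Proposition 7: ncd_A(D) is approximately K * cd_A(D)^2 / cd_A(ind(D))^2.
        K = data.shape[1]
        cd_d = correlation_dimension(data, alpha, seed=seed)
        cd_ind = correlation_dimension(independent_version(data, seed), alpha, seed=seed)
        return K * cd_d ** 2 / cd_ind ** 2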

5. Experimental results

In this section we describe our experimental results. We first describe some results on synthetic data, and then discuss real datasets and compare the normalized correlation dimension against PCA. Unless otherwise mentioned, the dimension used in our experiments was cd_A(D; 1/4, 3/4, 50).

5.1. Synthetic datasets

In this section we provide empirical evidence to support the analysis in Sections 3 and 4. In the first experiment we generated 100 datasets with K independent columns and random margins p_i. For each dataset, the margins p_i were randomly picked by first picking p_max uniformly at random from [0, 1]. Then, the probability p_i was picked uniformly from [0, p_max]; this method results in datasets with different densities. The box plot in Figure 1 shows that the normalized dimension is very close to K, the number of variables in the data. This shows that for independent data the normalized correlation dimension is equal to the number of variables, and that the sparsity of the data does not influence the results.

Figure 1. Normalized correlation dimension for data having K independent dimensions for K ∈ {50, 100, 150, 200}. (The y-axis is ncd_A(D; 1/4, 3/4) and the x-axis is the number of independent variables.)
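Read literally, the generation procedure of the first experiment can be sketched as follows (an assumed reading; the number of rows per dataset is not stated in this excerpt):

    import numpy as np

    def independent_dataset(n_rows, n_cols, seed=None):
        # First-experiment generator: independent 0/1 columns whose margins p_i
        # are drawn uniformly from [0, p_max], with p_max itself uniform on [0, 1].
        rng = np.random.default_rng(seed)
        p_max = rng.uniform(0.0, 1.0)
        p = rng.uniform(0.0, p_max, size=n_cols)
        return (rng.random((n_rows, n_cols)) < p).astype(np.int8)

    # D = independent_dataset(10000, 100)   # row count here is arbitrary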

In the second experiment we tested Proposition 2 with synthetic data. We generated 100 datasets having independent columns and random margins, generated as described above. Figure 2 shows the correlation dimension as a function of µ/σ, where µ = E[Z_D] and σ² = Var[Z_D]. The figure shows the behavior predicted by Proposition 2: the normalized fractal dimension is a linear function of µ/σ, and the slope is very close to C(1/4) = 0.815.

Figure 2. Correlation dimension as a function of µ/σ for data with independent columns (see Proposition 2). The y-axis is cd_A(D; 1/4, 3/4) and the x-axis is µ/σ, where µ = E[Z_D] and σ² = Var[Z_D]. The slope of the line is about C(1/4) = 0.815.

The theoretical section analyzes only the simplest form of the correlation dimension, that is, the case where N = 1. We tested how the dimension behaves for different N. In order to do that, we used the generated datasets from the previous experiments and plotted cd_A(D; 1/4, 3/4, 50) against cd_A(D; 1/4, 3/4, 1). We see from Figure 3 that the correlation dimension has little dependency on N.

Figure 3. Correlation dimension cd_A(D; 1/4, 3/4, 50) as a function of cd_A(D; 1/4, 3/4, 1) for data having K independent dimensions for K ∈ {50, 100, 200}.

In the fourth experiment we verified the quality of the approximation of Proposition 4. We used the same data as in the previous experiment. Figure 4 shows the correlation dimension against µ = E[Z_D], the average distance of two random points. From the figure we see that Proposition 4 is partly supported: the correlation dimension behaves as a linear function of µ. However, the slope becomes more gentle as the number of columns increases.

Figure 4. Correlation dimension as a function of µ for data with independent columns (see Proposition 4). The y-axis is cd_A(D; 1/4, 3/4) and the x-axis is µ = E[Z_D], the average distance between two random points.

Our fifth experiment tested how positive correlation affects the correlation dimension. Conjecture 6 predicts that positive correlation should decrease the correlation dimension. We tested this conjecture by creating random datasets D such that column i depends on column i − 1. Let X_i be variable number i in the generated dataset. We generated data by a Markov process between the variables:

P(X_i = 1 | X_{i−1} = 0) = P(X_i = 0 | X_{i−1} = 1) = t_i

and

P(X_1 = 1) = P(X_1 = 0) = 0.5,

where X = [X_1, ..., X_K] is the random element of D.

The reversal probabilities t_i were randomly picked as follows: for each dataset we picked uniformly a random number t_max from the interval [0, 1]. We picked t_i uniformly from the interval [0, t_max]. Note that if the reversal probabilities were 0.5, then the dataset would have independent columns. Denoting Z = Z_D, we have

P(Z_i = 1 | Z_{i−1} = 0) = P(Z_i = 0 | Z_{i−1} = 1) = 2 t_i (1 − t_i).

A rough measure of the amount of correlation in the data is t = Σ_i 2 t_i (1 − t_i). Figure 5 shows the correlation dimension as a function of the quantity t. We see that the datasets with strong correlations tend to have small dimensions, as the theory predicts.

Figure 5. Correlation dimension as a function of t, a rough measure of correlation in a dataset. The y-axis is cd_A(D; 1/4, 3/4) and the x-axis is the quantity t = Σ_i 2 t_i (1 − t_i), where t_i is the reversal probability between columns i and i − 1.
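The Markov process above translates directly into a small generator; the sketch below is an assumed reading with illustrative names, not the authors' code.

    import numpy as np

    def markov_dataset(n_rows, n_cols, seed=None):
        # X_1 is a fair coin; column i flips column i-1 with reversal
        # probability t_i, where t_i ~ U[0, t_max] and t_max ~ U[0, 1].
        rng = np.random.default_rng(seed)
        t_max = rng.uniform(0.0, 1.0)
        t = rng.uniform(0.0, t_max, size=n_cols)   # t[0] is unused
        data = np.empty((n_rows, n_cols), dtype=np.int8)
        data[:, 0] = rng.random(n_rows) < 0.5
        for i in range(1, n_cols):
            flip = rng.random(n_rows) < t[i]
            data[:, i] = np.where(flip, 1 - data[:, i - 1], data[:, i - 1])
        return data, t

    # Rough amount of correlation, as in the text:
    # data, t = markov_dataset(10000, 100); t_measure = np.sum(2 * t[1:] * (1 - t[1:]))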

Next, we go back to the first experiment to see whether the normalized correlation dimension depends on the sparsity of data. Note that sparse datasets have small µ = E[Z_D]. Figure 6 shows the normalized correlation dimension as a function of µ for the datasets used in Figure 1. We see that the normalized dimension does not depend on sparsity, as expected.

Figure 6. Normalized correlation dimension as a function of µ, the average distance between two random points. The x-axis is µ = E[Z_D] and the y-axis is ncd_A(D; 1/4, 3/4).

We tested Proposition 7 by plotting the normalized dimension as a function of K · cd_A(D)² / cd_A(ind(D))². We used the generated datasets from the previous experiment and from our fifth experiment as well. Figure 7 reveals that the approximation is good for the used datasets.

Figure 7. Normalized correlation dimension as a function of K · cd_A(D)² / cd_A(ind(D))². The left figure contains datasets with independent columns, and in the right figure adjacent columns of the datasets depend on each other.

5.2. Real datasets

In this section we investigate how our dimensions behave with 9 real-world datasets: Accidents, Courses, Kosarak, Paleo, POS, Retail, WebView-1, WebView-2 and 20 Newsgroups. The basic information about the datasets is summarized in Table 1.

The datasets are as follows. 20 Newsgroups¹ is a collection of approximately 20 000 newsgroup documents across 20 different newsgroups [18]. Data in Accidents² were obtained from the Belgian "Analysis Form for Traffic Accidents" form that is filled out by a police officer for each traffic accident that occurs with injured or deadly wounded casualties on a public road in Belgium. In total, 340 183 traffic accident records are included in the dataset [12].

¹ http://people.csail.mit.edu/jrennie/20Newsgroups/
² http://fimi.cs.helsinki.fi/data/accidents.dat.gz

Table 1. The basic information of the datasets. The column K corresponds to the number of columns and the column N to the number of rows. The last column is the density of 1's in percentages.

We also computed cd_A(ind(D); 1/4, 3/4), the correlation dimension for the datasets with the same column margins but independent columns. Our goal was to use these numbers to provide empirical evidence for the theoretical sections. To calculate the dimensions we need to estimate the probabilities P(Z_ind(D) …

Table 3.

Data        K        ncd_A    ncd_A(D)/K    K·cd_A(D)²/cd_A(ind(D))²
Accidents   469      220      0.47          222.91
Courses     5 021    304      0.06          344.24
Kosarak     41 271   2 378    0.06          3 684.78
Paleo       139      15       0.11          19.90
POS         1 657    181      0.11          242.91
Retail      16 470   1 791    0.11          2 107.52
WebView-1   497      190      0.38          214.33
WebView-2   3 340    359      0.11          512.97

… the same in D and ind(D). Thus we note, for example, that for the Paleo data the dimensionality is about 15, a fraction of 11% of the number of columns in the original data.

The last column in Table 3 is the estimate predicted by Proposition 7. Unlike with the synthetic datasets (see Section 5.1), the estimate is poor in some cases. A probable reason is that the examined datasets are extremely sparse, and hence the techniques used to obtain Proposition 7 are no longer accurate. This is supported by the observation that Accidents has the best estimate and the largest density.

We also tested the accuracy of Proposition 7 with the 20 Newsgroups dataset⁹. In Figure 8 we plotted the normalized correlation dimension as a function of the estimate. We see that the approximation overestimates the dimension but the accuracy is better than in Table 3.

⁹ The messages were converted into bag-of-words representations and the 200 most informative variables were kept.

Figure 8. Normalized correlation dimension as a function of K · cd_A(D)² / cd_A(ind(D))². Each point represents one newsgroup in the 20 Newsgroups dataset.

We will compare the normalized correlation dimensions against PCA in the next subsection.

Next we studied the running times of the computation of the correlation dimension. Computing the distance of two binary vectors can be done in O(M) time, where M is the number of 1's in the vectors. Fitting the slope to get the actual dimension is also needed, but the time needed for this operation is negligible compared to the time needed for estimating the probabilities. Note that in our setup, the size of D_s was fixed to 10 000 (except for Paleo and Courses). Hence, the running time is proportional to the number of 1's in a dataset. The running times are given in Table 4.

Table 4. The running times of the correlation dimension in seconds for various datasets. Time/# of 1's: time in milliseconds divided by the number of 1's in the data.

Data        # of 1's      Time    Time/# of 1's
Accidents   11 500 870    973     0.085
Courses     64 743        9       0.141
Paleo       3 537         0.1     0.039
Kosarak     8 019 015     793     0.099
POS         3 367 020     447     0.133
Retail      908 576       103     0.113
WebView-1   149 639       17      0.114
WebView-2   358 278       40      0.112
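The O(M) bound mentioned above is what one obtains by storing each row as a sorted list of the positions of its 1's and merging two such lists; a sketch of that computation (not necessarily the implementation used by the authors):

    def hamming_sparse(a, b):
        # Hamming distance of two binary vectors given as sorted lists of the
        # positions of their 1's; time is linear in the number of 1's.
        i = j = common = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                common += 1
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        # |a xor b| = |a| + |b| - 2 * |a and b|
        return len(a) + len(b) - 2 * common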

5.3. Correlation dimension vs. other methods

There are different approaches for measuring the structure of a dataset. In this section we study how the normalized dimension compares with PCA.

We performed PCA on our datasets and computed the percentage of the variance explained by the first M PCA variables, where M = ncd_A(D). Additionally, we calculated how many PCA components are needed to explain 90% of the variance. The results are given in Table 5.

Table 5. Normalized correlation dimensions versus PCA for various datasets. The second column is the percentage of variance explained …

We observe that ncd_A(D) PCA components explain a relatively large portion of the variance for Accidents, POS, and WebView-1, but explain less for Paleo and WebView-2.

The most interesting behavior is observed in the Paleo dataset. We see that whereas the PCA dimension says that Paleo should have a relatively high dimension, the normalized dimension suggests a very small value. We know that Paleo has a very strong structure (by looking at the data), so this suggests that the PCA approach overestimates the intrinsic dimension for Paleo. This behavior can perhaps be partly explained also by considering the margins of the datasets. The margins of Paleo are relatively homogeneous whereas the margins of the rest of the datasets are skewed.
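The PCA side of this comparison can be sketched with a plain SVD: center the data, take the singular values, and read off both the fraction of variance explained by the first M components and the number of components needed for 90%. The excerpt does not say which PCA implementation the authors used, so this is only an illustration.

    import numpy as np

    def pca_explained(data, m):
        # Returns (fraction of variance explained by the first m components,
        #          number of components needed to explain 90% of the variance).
        X = data - data.mean(axis=0)
        s = np.linalg.svd(X, compute_uv=False)
        ratios = s ** 2 / np.sum(s ** 2)
        cum = np.cumsum(ratios)
        n90 = int(np.searchsorted(cum, 0.9) + 1)
        return float(cum[m - 1]), n90

    # e.g. fraction_at_ncd, components_for_90 = pca_explained(D, m=220)  # m = ncd_A(D)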

6. Related work

There has been a significant amount of work in defining the concept of dimensionality in datasets. Even though most of the methods can be adapted to the case of binary data, they are not specifically tailored for it. For instance, many methods assume real-valued numbers and they compute vectors/components that have negative or continuous values that are difficult to interpret. Such methods include PCA, SVD, and non-negative matrix factorization (NMF) [14, 19]. Other methods such as multinomial PCA (mPCA) [6] and latent Dirichlet allocation (LDA) [4] assume specific probabilistic models of generating the data and the task is to discover latent components in the data rather than reasoning about the intrinsic dimensionality of the data. Methods for exact and approximate decompositions of binary matrices in the Boolean semiring have also been proposed [11, 21, 22], but similarly to mPCA and LDA, they focus on finding components instead of the intrinsic dimensionality.

The concept of fractal dimension has found many applications in the database and data mining communities, such as making nearest neighbor computations more efficient [24], speeding up feature selection methods [29], outlier detection [27], and performing clustering tasks based on the local dimensionality of the data points [13].

Many different notions of complexity of binary datasets have been proposed and used in various contexts, for instance VC-dimension [2], discrepancy [7], Kolmogorov complexity [20] and entropy-based concepts [8, 25]. In some of the above cases, such as Kolmogorov complexity and entropy methods, there is no direct interpretation of the measures as a notion of dimensionality of the data as they are measures of compressibility. VC-dimension measures the dimensionality of discrete data, but it is rather conservative, as a binary dataset having VC-dimension d means that there are d columns such that the projection of the dataset on those coordinates results in all possible vectors of length d. Hence, VC-dimension does not make any difference between the datasets {0, 1}^d and {x ∈ {0, 1}^K : Σ_{i=1}^K x_i ≤ d}, although there is a great difference when d < …

7. Conclusions

We have given a definition of the effective dimension of a binary dataset. The definition is based on ideas from fractal dimensions: we studied how the distribution of the distances between two random data points from the dataset behaves, and fit a slope to the log-log set of points. We defined the notion of normalized correlation dimension. It measures the number of dimensions of the appropriate density that a dataset with independent variables should have to have the same correlation dimension as the original dataset.

We studied the behavior of correlation dimension and normalized correlation dimension, both theoretically and empirically. Under certain simplifying assumptions, we were able to prove approximations for correlation dimension, and we verified these results using synthetic data.

Our empirical results for real data show that different datasets have clearly very different normalized correlation dimensions. In general, the normalized correlation dimension correlates with the number of PCA components that are needed to explain 90% of the variance in the data, but there are also intriguing differences.

Traditionally, dimension means the degrees of freedom in the dataset. One can consider a dataset embedded into a high-dimensional space by some (smooth) embedding map. Traditional methods such as PCA try to negate this embedding. Fractal dimensions, however, are based on a different notion, the behavior of the volume of data as a function of neighborhoods. This means that the methods in this paper do not provide a mapping to a lower-dimensional space, and hence traditional applications, such as feature reduction, are not (directly) possible. However, our study shows that fractal dimensions have promising properties and we believe that these dimensions are important as such.

A fundamental difference between the normalized correlation dimension and PCA is the following. For a dataset with independent columns PCA has no effect and selects the columns that have the highest variance until some selected percentage of the variance is explained. Thus, the number of PCA components needed depends on the margins of the columns. On the other hand, the normalized correlation dimension is always equal to the number of variables for data with independent columns.

Obviously, several open problems remain. It would be interesting to have more general results about the theoretical behavior of the normalized correlation dimension. On the empirical side, the study of the correlation dimensions of the data and its subsets seems to be a promising direction.

References
[1] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.
[2] M. Anthony and N. Biggs. Computational Learning Theory: An Introduction. Cambridge Tracts in Theoretical Computer Science. Cambridge University Press, 1997.
[3] M. Barnsley. Fractals Everywhere. Academic Press, 1988.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[5] T. Brijs, G. Swinnen, K. Vanhoof, and G. Wets. Using association rules for product assortment decisions: A case study. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 15-18, 1999, San Diego, CA, USA, pages 254–260. ACM, 1999.
[6] W. Buntine and S. Perttu. Is multinomial PCA multi-faceted clustering or dimensionality reduction? In C. Bishop and B. Frey, editors, Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, pages 300–307, 2003.
[7] B. Chazelle. The Discrepancy Method. Cambridge University Press, 2000.
[8] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley, 1991.
[9] C. Faloutsos and I. Kamel. Beyond uniformity and independence: Analysis of R-trees using the concept of fractal dimension. In PODS, pages 4–13. ACM Press, 1994.
[10] M. Fortelius. Neogene of the old world database of fossil mammals (NOW). University of Helsinki, http://www.helsinki.fi/science/now/, 2005.
[11] F. Geerts, B. Goethals, and T. Mielikäinen. Tiling databases. In E. Suzuki and S. Arikawa, editors, Discovery Science, volume 3245 of Lecture Notes in Computer Science, pages 278–289. Springer, 2004.
[12] K. Geurts, G. Wets, T. Brijs, and K. Vanhoof. Profiling high frequency accident locations using association rules. In Proceedings of the 82nd Annual Transportation Research Board, Washington DC (USA), January 12-16, 2003.
[13] A. Gionis, A. Hinneburg, S. Papadimitriou, and P. Tsaparas. Dimension induced clustering. In R. Grossman, R. Bayardo, and K. P. Bennett, editors, KDD, pages 51–60. ACM, 2005.
[14] I. Jolliffe. Principal Component Analysis. Springer Series in Statistics. Springer, 2nd edition, 2002.
[15] B. Kégl. Intrinsic dimension estimation using packing numbers. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 681–688, 2003.
[16] R. Kohavi, C. Brodley, B. Frasca, L. Mason, and Z. Zheng. KDD-Cup 2000 organizers' report: Peeling the onion. SIGKDD Explorations, 2(2):86–98, 2000.
[17] J. B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29:1–26, 1964.
[18] K. Lang. Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning, pages 331–339, 1995.
[19] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 556–562, 2001.
[20] M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Texts in Computer Science. Springer-Verlag, 3rd edition, 1997.
[21] P. Miettinen, T. Mielikäinen, A. Gionis, G. Das, and H. Mannila. The discrete basis problem. In J. Fürnkranz, T. Scheffer, and M. Spiliopoulou, editors, Knowledge Discovery in Databases: PKDD 2006 – 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, Berlin, Germany, September 18–22, 2006, Proceedings, Lecture Notes in Computer Science. Springer, 2006.
[22] S. D. Monson, N. J. Pullman, and R. Rees. A survey of clique and biclique coverings and factorizations of (0, 1)-matrices. Bulletin of the ICA, 14:17–86, 1995.
[23] E. Ott. Chaos in Dynamical Systems. Cambridge University Press, 1997.
[24] B.-U. Pagel, F. Korn, and C. Faloutsos. Deflating the dimensionality curse using multiple fractal dimensions. In ICDE, pages 589–598. IEEE Computer Society, 2000.
[25] P. Palmerini, S. Orlando, and R. Perego. Statistical properties of transactional databases. In H. Haddad, A. Omicini, R. L. Wainwright, and L. M. Liebrock, editors, SAC, pages 515–519. ACM, 2004.
[26] C. H. Papadimitriou and M. Yannakakis. On limited nondeterminism and the complexity of the V-C dimension. Journal of Computer and System Sciences, 53(2):161–170, 1996.
[27] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. LOCI: Fast outlier detection using the local correlation integral. In U. Dayal, K. Ramamritham, and T. M. Vijayaraman, editors, ICDE, pages 315–326. IEEE Computer Society, 2003.
[28] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[29] C. Traina Jr., A. J. M. Traina, L. Wu, and C. Faloutsos. Fast feature selection using fractal dimension. In K. Becker, A. A. de Souza, D. Y. de Souza Fernandes, and D. C. F. Batista, editors, SBBD, pages 158–171. CEFET-PB, 2000.