Hindawi Publishing Corporation Mathematical Problems in Engineering Volume 2015, Article ID 759567, 21 pages http://dx.doi.org/10.1155/2015/759567

Review Article
Intrinsic Dimension Estimation: Relevant Techniques and a Benchmark Framework

P. Campadelli,1 E. Casiraghi,1 C. Ceruti,1 and A. Rozza2

1 Dipartimento di Informatica, Università degli Studi di Milano, Via Comelico 39, 20135 Milano, Italy
2 Research Group, Hyera Software, Via Mattei 2, Coccaglio, 25030 Brescia, Italy

Correspondence should be addressed to E. Casiraghi; [email protected]

Received 25 February 2015; Accepted 17 May 2015

Academic Editor: Sangmin Lee

Copyright © 2015 P. Campadelli et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

When dealing with datasets comprising high-dimensional points, it is usually advantageous to discover some data structure. A fundamental piece of information needed to this aim is the minimum number of parameters required to describe the data while minimizing the information loss. This number, usually called intrinsic dimension, can be interpreted as the dimension of the manifold from which the input data are supposed to be drawn. Due to its usefulness in many theoretical and practical problems, in recent decades the concept of intrinsic dimension has gained considerable attention in the scientific community, motivating the large number of intrinsic dimensionality estimators proposed in the literature. However, the problem is still open, since most techniques cannot efficiently deal with datasets drawn from manifolds of high intrinsic dimension and nonlinearly embedded in higher dimensional spaces. This paper surveys some of the most interesting, widely used, and advanced state-of-the-art methodologies. Unfortunately, since no benchmark database exists in this research field, an objective comparison among different techniques is not possible. Consequently, we suggest a benchmark framework and apply it to comparatively evaluate relevant state-of-the-art estimators.

1. Introduction

Since the 1950s, the rapid pace of technological advances has allowed us to measure and record increasing amounts of data, motivating the urgent need to develop dimensionality reduction systems to be applied on real datasets comprising high-dimensional points.

To this aim, a fundamental piece of information is provided by the intrinsic dimension (id), defined by Bennett [1] as the minimum number of parameters needed to generate a data description by maintaining the "intrinsic" structure characterizing the dataset, so that the information loss is minimized. More recently, a quite intuitive definition employed by several authors in the past has been reported by Bishop in [2], p. 314, where the author writes that "a set in D dimensions is said to have an id equal to d if the data lies entirely within a d-dimensional subspace of R^D."

Though more specific and different id definitions have been proposed in different research fields [3–5], throughout the pattern recognition literature the presently prevailing id definition views a point set as a sample set uniformly drawn from an unknown smooth (or locally smooth) manifold structure, possibly embedded in a higher dimensional space through a nonlinear smooth mapping; in this case, the id to be estimated is the manifold's topological dimension.

Due to the importance of id in several theoretical and practical application fields, in the last two decades a great deal of research effort has been devoted to the development of effective id estimators. Though several techniques have been proposed in the literature, the problem is still open for the following main reasons.

At first, it must be highlighted that, though Lebesgue's definition of topological dimension [6] (see Section 3.2) is quite clear, in practice its estimation is difficult if only a finite set of points is available. Therefore, id estimation techniques proposed in the literature are either founded on different notions of dimension (e.g., fractal dimensions, Section 3.2.1) approximating the topological one or on various techniques aimed at preserving the characteristics of data-neighborhood distributions, which reflect the topology of the underlying manifold. Besides, the estimated id value markedly changes as the scale used to analyze the input dataset changes [7] (see

[Figure 1 panels: (a) d∼0, (b) d∼2, (c) d∼1, (d) d∼2]
Figure 1: At very small scales (a) the dataset seems zero-dimensional; in this example, when the resolution is increased until including all the dataset (b) the id looks larger and seems to equal the embedding space dimension; the same effect happens when it is estimated at noise level (d); the correct id estimate is obtained at an intermediate resolution. an example in Figure 1), and with the number of available Furthermore, when using an autoassociative neural network points being practically limited, several methods underesti- to perform a nonlinear feature extraction, the id value 𝑑 mate id when its value is sufficiently high (namely, id ⩾ 10). can suggest a reasonable value for the number of hidden Other serious problems arise when the dataset is embedded in neurons [14]. Indeed, a network with a single hidden layer of higher dimensional spaces through a nonlinear map. Finally, neurons with linear activation functions has an error function the too high computational complexity of most estimators with a unique global minimum and, at this minimum, the makes them unpractical when the need is to process datasets network performs a projection on the subspace spanned comprising huge amounts of high-dimensional data. by the first 𝑑 principal components [15]estimatedonthe In this work, after recalling the application domains of dataset (see 8.6.2of[2]), with 𝑑 being the number of hidden interest, we survey some of the most interesting, widespread neurons. Besides, according to statistical learning theory [16], used, and advanced id estimators. Unfortunately, since each the capacity and generalization capability of a given classifier method has been evaluated on different datasets, it is difficult may depend on the id. More specifically, in the particular tocomparethembysolelyanalyzingtheresultsreported case of linear classifiers where the data are drawn from a by the authors. This highlights the need of a benchmark manifold embedded through an identical map, the Vapnik- framework, such as the one proposed in this work, to Chervonenkis (VC) dimension of the separation hyperplane objectively assess and compare different techniques in terms is 𝑑+1(see[16], pp. 156–158). Since the generalization of robustness with respect to parameter settings, high- error depends on the VC dimension, it follows that the dimensional datasets, datasets characterized by a high id,and generalization capability may depend on the id value 𝑑. noisy datasets. Moreover, in [17] the authors mark that, in order to balance The paper is organized as follows: in Section 2 the a classifier generalization ability and its empirical error, the usefulness of the id knowledge is motivated and interesting complexity of the classification model should also be related id application domains profitably exploiting it are recalled; to the id of the available dataset. Furthermore, since complex in Section 3 we survey notable state-of-the-art id estimators, objects can be considered as structures composed by multiple grouping them according to the employed methods; in manifolds that must be clustered to be processed separately, Section 4 we summarize mostly used experimental settings, the knowledge of the local ids characterizing the considered we propose a benchmark framework, and we employ it object is fundamental to obtain a proper clustering [18]. 
to objectively assess and compare relevant id estimators; These observations motivate applications employing in Section 5 conclusions and open research problems are global or local id estimates to discover some structure within reported. thedata.Inthefollowingwesummarizeorsimplyrecallsome interesting examples [19–25]. 2. Application Domains In [19] the authors introduce a fractal dimension esti- mator, called correlation dimension (CD)estimator(see In this section we motivate the increasing research interest Section 3.2.1), and show that the id estimate it computes is aimed at the development of automatic id estimators, and a reliable approximation of the strange attractor dimension we recall different application contexts where the knowledge in chaotic dynamical systems. of the id of the available input datasets is a profitable Inthefieldofgeneexpressionanalysis,theworkproposed information. in [20] shows that the id estimate computed by the nearest In the field of pattern recognition, the id is one of the first neighbor estimator (described in [26]andSection 3.2.2)isa and fundamental pieces of information required by several lowerboundforthenumberofgenestobeusedinsuper- dimensionality reduction techniques [8–12], which try to vised and unsupervised class separation of cancer and other represent the data in a more compact, but still informative, diseases. This information is important since generally used way to reduce the “curse of dimensionality” effects [13]. datasets contain large number of genes and the classification Mathematical Problems in Engineering 3 results strongly depend on the number of genes employed to In [24], to the aim of analyzing gene expression time learn the separation criteria. series, the authors compute id estimates by comparing the In [21], the authors show that id estimation methods fractal CD estimator and the nearest neighbor (NN)estimator being derived from the basis theory of fractal dimensions [26]. The results on both simulated and real data show that ([7, 19, 27], see Section 3.2.1) can be successfully used to NN seems to be more robust than CD with respect to nonlinear evaluate the model order in and time series, which embedding and the underlying time-series model. is the number of past samples required to model the time In the field of geophysical processing, hyperspec- series adequately and is crucial to make reliable predictions. tral images, whose pixels represent spectra generated by This comparative work employs fractal dimension estimators, the combination of an unknown set of independent con- since the domain of attraction of nonlinear dynamic systems tributions, called endmembers, often require estimating the has a very complex geometric structure, which could be number of endmembers. To this aim, the proposal in [25]is captured by closely related studies on fractal geometry and to substitute state-of-the-art algorithms specifically designed fractal dimensions. to solve this task with id estimators. After motivating the idea A noteworthy research work in the field of crystallog- id CD by describing the relation between the of a dataset and the raphy [22] employs the fractal estimator [19]followed number of endmembers, the authors choose to experiment by a correction method [28] that, according to the authors, two fractal id estimators [7, 19]andanearestneighbor- “is needed because the CD estimator, to give correct esti- based one [33]. 
They obtain the most reliable results with mations of the id, requires an unrealistically large number the latest one after opportunely tuning the number of nearest of points.” Anyway, the experimental results show that id is a useful information to be exploited when analyz- neighborstobeconsidered. ing crystal structures. This study not only proves that id Finally, other noteworthy examples of research works id estimates are especially useful when dealing with practical that profitably exploit and estimate it by usually applying tasks concerning real data, but also underlines the need to fractal dimension estimators concern financial time series compute reliable estimates on datasets drawn from manifolds prediction [34], biomedical signal analysis [35–37], analysis characterized by high id andembeddedinspacesofmuch of ecological time series [38], radar clutter identification greater dimensionality. [39], speech analysis [40], data mining and low dimensional The work of Carter et al. [23] is very interesting and representation of (biomedical) time series [41], and plant notable because it is one of the first considering that the input traits representation [42]. data might be drawn from a multimanifold structure, where each submanifold has a (possibly) different id.Toseparate 3. Intrinsic Dimension Estimators the manifolds, the authors compute local id estimates, by applying both a fractal dimension estimator (namely, MLE In this section we survey some of the most notable, recent, [27]; see Section 3.2.1) and a nearest neighbor-based esti- and effective state-of-the-art id estimators, grouping them mator (described in [29, 30]; see Section 3.2.2)onproperly according to the main ideas they are based on. defined data neighborhoods. The authors then show that Specifically, in Section 3.1 we describe projective id esti- the computed local ids might be helpful for the following P ≡{p }𝑁 ⊆ mators, which basically process a dataset 𝑁 𝑖 𝑖=1 id 𝐷 interesting applications: (1) “Debiasing global estimates”: R to identify a somehow appealing lower dimensional the negative bias caused both by the limited number of subspace where to project the data; the space dimension of available sample points and by the curse of dimensionality id id theidentifiedsubspaceisviewedasthe estimate. is reduced by computing global estimates through a id weighted average of the local ones, which assign greater More recent projective estimators exploit the assump- P ≡{p }𝑁 ⊆ R𝐷 importance to the points away from the boundaries. However tion of datasets 𝑁 𝑖 𝑖=1 being uniformly drawn from a smooth (or locally smooth) manifold M ⊆ the authors themselves note that this method is only applica- 𝑑 ble for data with a relatively low id, since in high dimensions R , embedded into a higher 𝐷-dimensional space through the points lie nearby the boundaries [31]. (2) “Statistical a nonlinear map; this is also the basic assumption of all the manifold learning”: the local id estimates are used to reduce othergroupsofmethodsthatwillbereferredtoastopological the dimension of statistical manifolds [32], that is, manifolds id estimators (see Section 3.2)andgraph-based id estimators whose points represent a pdf.Whenthisstepisappliedasthe (see Section 3.3). 
first step of document classification applications, and analysis We note that the taxonomy we are using to group the of patients’ samples acquired in the field of flow cytometry, reviewed methods is different from the one, commonly used it allows us to obtain lower dimensional points showing by several authors in the past (as an example, see [43]), that a good class separation. (3) “Network anomaly detection”: viewed methods as global,whenid estimation is performed considering that the overall complexity of a router network is by considering a dataset as a whole, or local,whenallthe decreased when few sources account for a disproportionate data neighborhoods are analyzed separately and an estimate amount of traffic, a decrease in the id oftheentirenetworkis is computed by combining all the local results. All the recent searched for. (4) “Clustering”: problems of data clustering and methods have abandoned the global approach since it is image segmentation are dealt with by assuming that different now clear that analyzing a dataset at its biggest scale cannot clusters and image patches belong to manifold structures producereliableresults.Theythusestimatetheglobalid by characterized by different complexity (and ids). somehow combining local ids. This way of proceeding comes 4 Mathematical Problems in Engineering from the assumption that the underlying manifold is locally Among deterministic approaches, we recall the Kernel PCA smooth. (KPCA)[61]andthelocalPCA (LPCA)[62] and its extensions to automatically select the number of PCs[63, 64]. Weobserve 3.1. Projective id Estimators. The first projective id esti- that the work presented in [64] is one of the first works id mators introduced in the literature explicitly compute the that estimates by considering an underlying topological 𝐷 structure and therefore applies LPCA on data neighborhoods mapping that projects input points P𝑁 ∈ R to the subspace 𝑑 represented by an Optimally Topology Preserving Map M ⊆ R minimizing the information loss [27, 43]and (OTPM)builtonclustereddata(givenaninputdatasetP𝑁, therefore view the id as the minimal number of vectors lin- its OTPM is usually computed through Topology Representing early spanning the subspace M.Itmustbenotedthat,since Networks (TRNs); these are unsupervised neural networks these methods were originally designed for exploratory data [65]developedtomapP𝑁 to a set of neurons whose learnt analysis and dimensionality reduction, they often require the connections define proximities in P𝑁. These proximities dimensionality of M (the id to be estimated) to be provided correspond to the optimal topology preserving Voronoi as input parameter. However, their extensions and variants tessellation and the corresponding Delaunay triangulation. include methodologies to automatically estimate id. In other words, TRNs compute connectivity structures that id Most of the projective estimators can be grouped into define and perfectly preserve the topology originally present two main categories: projection techniques based on Multidi- in the data, forming a discrete path-preserving representa- MDS mensional Scaling ( )[44, 45] or its variants, which tend tion of the inner (topological) structure of the topological to preserve as much as possible pairwise distances among the manifold underlying the dataset P𝑁). 
The authors of this data, and projection techniques based on Principal Compo- method state that their approach is more efficient and less PCA nent Analysis ( )[15, 46]anditsvariantsthatsearchforthe sensitive to noise with respect to the PCA-based approaches. M best projection subspace minimizing the projection error. However, they do not show any experimental comparison Some of the best known examples of MDS algorithms and, besides, their algorithm employs critical thresholds and a are MDSCAL [47–52], Bennett’s algorithm [1, 53], Sammon’s data clustering technique whose result heavily influences the mapping [54], Curvilinear Component Analysis (CCA)[55], precision of the computed estimate [27]. ISOMAP [56], and Local Linear Embedding (LLE)[57]. As The usage of a probabilistic approach has been firstly shown by experiments reported in [56, 57] ISOMAP and introduced by Tipping and Bishop in [66]. Considering that variants of LLE compute the most reliable id estimates. We deterministic methodologies lack an associated probabilistic believe that their better performance is due to the fact that model for the observed data, their Probabilistic PCA (PPCA) both ISOMAP and LLE have been the first projective methods reformulates PCA as the maximum likelihood solution of a basedontheassumptionthattheinputpointsaredrawnfrom specific latent variable model. PPCA and its extensions to an underlying manifold, whose curvature might affect the both mixture and hierarchical mixture models have been precision of data neighborhoods computed by employing the successfully applied to several real problems, but they still Euclidean distance. However, as noted in [58, 59], these algo- provide an id-estimation mechanism depending on critical rithms have shown that they suffer of all the major drawbacks thresholds. This motivates its subsequent variants67 [ ]and affecting MDS-based algorithms, which are too much tied by developments, whose examples are Bayesian PCA (BPCA)[68] the preservation of the pairwise distance values. Besides, as and two Bayesian model order selection methods introduced highlighted in [30], ISOMAP as an id estimator, as well as in [69, 70]. In [71]theasymptoticconsistencyofid estima- other spectral based methods like PCA, relies on a specific tion by (constrained) isotropic version of PPCA is shown with estimated eigenstructure that may not exist in real data. numerical experiments on simulated and real datasets. Regarding LLE, it either requires the id value to be known While the aforementioned methods have been simply in advance or may automatically estimate it by analyzing recalled since their id estimation results have shown to be the eigenvalues of the data neighborhoods [60]; however, unreliable [7, 27],inthefollowingrecentandpromising as outlined in [7, 27], id estimates computed by means of proposals are described with more details. eigenvalue analysis are as unreliable as those computed by The Simple Exponential Family PCA (SePCA)[72]has most PCA-based approaches. Moreover, in [46]itisnotedthat been developed to overcome the assumption of Gaussian- methods such as LLE are based on the solution of a sparse distributed data that makes it difficult to handle all types eigenvalue problem under the unit covariance constraint; of practical observations, for example, integers and binary however, due to this imposed constraint, the global shape of values. SePCA achieves promising results by using exponen- the embedded data cannot reflect the underlying manifold. 
tial family distributions; however, it is highly influenced by PCA [15, 46] is one of the most cited and well-known critical parameter settings and it is successful only if the data projective id estimators, often used as the first step of several distribution is known, which is often not the case, specially pattern recognition problems, to compute low dimensional when highly nonlinear manifold structures must be treated. representations of the available datasets. When PCA is used for In [73] the authors propose the Sparse Probability PCA id estimation, the estimate is the number of “most relevant” (SPPCA) as a probabilistic version of the Sparse PCA (SPCA) eigenvectors of the sample covariance matrix, also called [74]. Precisely, SPCA selects id by forcing the sparsity of principal components (PCs). Due to the promising dimen- theprojectionmatrixthatisthematrixcontainingthe sionality reduction results, several PCA-based approaches, PCs. However, based on the consideration that the level of both deterministic and probabilistic, have been published. sparsity is not automatically determined by SPCA, SPPCA Mathematical Problems in Engineering 5 employs a Bayesian formulation of SPCA, achieving sparsity topological dimension definition is Lebesgue’s Covering by employing a different prior and automatically learning the Dimension, reported in the following. hyperparameter related to the constraint weight through Evi- dence Approximation ([75] Section 3.5). The authors’ results Definition 1 (cover). Given a topological space X,acoverof and also the results of the comparative evaluation proposed asetY ⊆ X is a countable collection C ={C𝑖} of open sets in [76]showthatthismethodseemstobelessaffectedbythe such that each C𝑖 ⊂ X and ⋃𝑖 C𝑖 ⊇ Y. problems of the aforementioned projective schemes. Definition 2 (refinement of a cover). A refinement of a cover An alternative method (MLSVD)[77]appliesSingular 󸀠 󸀠 SVD PCA C of a set Y is another cover C such that each set in C is Value Decomposition ( ), basically a variant of ,locally C and in a multiscale fashion to estimate the id characterizing containedinsomesetsof . 𝐷-dimensional datasets drawn from nonlinearly embedded Definition 3 (topological dimension (Lebesgue Covering 𝑑-dimensional manifolds M corrupted by Gaussian noise. Dimension)). Given the aforementioned definitions, the Precisely, exploiting the same ideas of the theoretical PCA- topological dimension of the topological space X, also called based id estimator presented in [78], the authors note that the Lebesgue Covering Dimension, is 𝑑 if every finite cover of X best way to avoid the effects of the curvature (induced by the 󸀠 admits a refinement C such that no subset of X has more nonlinearity of the embedding) is to apply SVD locally, that 󸀠 than 𝑑+1 intersecting open sets in C . If no such minimal is, in hyperspheres B(p,𝑟)centered on the data points p and integer value exists, X is said to be of infinite topological having radius 𝑟.However,thechoiceof 𝑟 is constrained by the dimension. following considerations: (1) 𝑟 must be big enough to have at 𝑘≥𝑑 𝑟 least neighbors, (2) must be small enough to ensure To our knowledge, at the state of the art only two M∩B 𝑟 that islinear(oratleastsmooth),and(3) must be big estimators have been explicitly designed to estimate the enough to ensure that the effects of noise are negligible. When topological dimension. 
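Since several of the projective schemes recalled above ultimately estimate the id by counting the "most relevant" eigenvalues of the sample covariance matrix, a minimal sketch of this idea may help to fix it. The explained-variance threshold used below is our own illustrative choice (threshold selection is precisely the critical step criticized in the text), not the selection rule adopted by any of the cited methods.

```python
import numpy as np

def pca_id_estimate(X, alpha=0.95):
    """Estimate the id as the number of principal components needed to
    explain a fraction `alpha` of the total variance (an illustrative
    threshold-based rule, not a specific published algorithm).

    X : (N, D) array of N points in R^D.
    """
    Xc = X - X.mean(axis=0)                      # center the data
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values of the centered data
    eigvals = s**2 / (X.shape[0] - 1)            # eigenvalues of the sample covariance
    ratios = np.cumsum(eigvals) / eigvals.sum()  # cumulative explained variance
    return int(np.searchsorted(ratios, alpha) + 1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 5-dimensional Gaussian data linearly embedded in R^20
    Z = rng.standard_normal((1000, 5))
    A = rng.standard_normal((5, 20))
    print(pca_id_estimate(Z @ A, alpha=0.999))   # prints 5
```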
𝑇𝑑 (p,𝑟) these three constraints are met, the tangent space M , One of them, the Tensor Voting Framework (TVF)[84] computed by applying SVD on the 𝑘 neighbors, is a good and its variants [85], relies on the usage of an iterative approximation of the tangent space of M ∩ B and the information diffusion process based on Gestalt principles of number of its relevant eigenvalues corresponds to the (local) perceptual organization [86]. TVF iteratively diffuses local id of M.Tofindapropervaluefor𝑟,theauthorspropose information describing, for each p𝑖 ∈ P𝑁,thetangentspace a multiscale approach that applies SVD on neighborhoods approximating the underlying neighborhood of M.Tothis B(p,𝑟𝑠) whose radius varies in a range 𝑟𝑠 ∈{𝑟𝐿,...,𝑟𝐻}.This aim, the information diffused at each iteration is second- allows us to compute 𝐷 scale-dependent, local singular values order symmetric positive definite tensors whose eigenvectors 𝜆 (p,𝑟𝑠)≥⋅⋅⋅≥𝜆𝐷(p,𝑟𝑠); using a least squares fitting proce- span the local tangent space. Practically, during the initial- 1 T0 dure the SVs can be expressed as functions of 𝑟 whose analysis ization step a ball tensor 𝑖 , which is an identity matrix allows us to identify the range of scales [𝑟 ,...,𝑟 ] not representing the absence of orientation, is used to initialize min max 𝑝 p {𝑝 =(p , T0)}𝑁 influenced by either noise or curvature. Finally, in the range atoken 𝑖 for each point 𝑖 as 𝑖 𝑖 𝑖 𝑖=1.During 𝑟 =[𝑟 ,...,𝑟 ] SV 𝑡 𝑝 {T𝑡 } 𝑠 min max the squared sareanalyzedtogetthe iteration each token 𝑖 “generates” the set of tensors 𝑖,𝑗 𝑗=𝑖̸ id 𝑑̂ Δ(𝑗) = 𝜆 (p,𝑟)− that enact as votes cast to neighboring tokens; precisely, estimate that maximizes the gap 𝑗 𝑠 𝑡 𝜆 (p,𝑟) 𝑟 T is sent to the 𝑗th neighbor, and it encodes information 𝑗+1 𝑠 for the largest range of 𝑠.Theproposedalgorithm 𝑖,𝑗 has been evaluated on unit 𝑑-dimensional hyperspheres and relatedtothelocaltangentspaceestimateinp𝑖 at time 𝑡 cubes embedded in R100 and affected by Gaussian noise. The and decays as the curvature and the distance from the 𝑗th reported results are very good, while other ten well-known neighbor increase. On the other side, at iteration 𝑡 each token 𝑝 𝑝 methods [19, 23, 27, 30, 79–81] show that the idsestimatedon 𝑗 receives votes that are summed to update the 𝑗’s tensor 𝑡+1 𝑡 thesamedatasetsareunreliablealsointheabsenceofnoise. as T𝑗 = ∑𝑖=𝑗̸ T𝑖,𝑗; this essentially refines the estimate of the local tangent space in p𝑗. Based on the definition of 3.2. Topological Approaches. Topological approaches for id 𝑑 topological dimension provided by Brouwer [82], in [87]it estimation consider a manifold M ⊆ R embedded in is noted that TVF can be employed to estimate the local ids 𝐷 a higher dimensional space R through a proper (locally) by identifying the number of most relevant eigenvalues of the 𝐷 smooth map 𝜙:M → R and assume that the given computed second-order tensors. Although interesting, this 𝑁 𝑁 𝐷 dataset is P𝑁 ={p𝑖} ={𝜙(x𝑖)} ⊂ R ,wherex𝑖 are method has a too high computational cost, which makes it 𝑖=1 𝑖=1 𝐷≥ independent identically distributed (i.i.d.) points drawn from unfeasible for spaces of dimension 4. M through a smooth probability density function (pdf) 𝑓: From the definition of Lebesgue Covering Dimension + M → R . 
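The multiscale use of local SVD described above can be sketched in a few lines. The version below is a strong simplification of the MLSVD procedure: it simply votes for the largest spectral gap at each scale, whereas the original method fits the singular values as functions of the radius and explicitly excludes the scales dominated by noise or curvature; neighborhood sizes, radii, and the voting rule are therefore assumptions made only for illustration.

```python
import numpy as np

def local_svd_id(X, center, radii, min_pts=10):
    """Simplified multiscale local-SVD sketch (inspired by, but not
    reproducing, MLSVD): for each radius r, take the points inside the
    ball B(center, r), compute the singular values of the centered
    neighborhood, and record the dimension maximizing the spectral gap
    s_d - s_{d+1}; the most frequent gap position across scales is returned."""
    votes = []
    dists = np.linalg.norm(X - center, axis=1)
    for r in radii:
        nbrs = X[dists <= r]
        if len(nbrs) < min_pts:
            continue
        s = np.linalg.svd(nbrs - nbrs.mean(axis=0), compute_uv=False)
        gaps = s[:-1] - s[1:]
        votes.append(int(np.argmax(gaps)) + 1)   # gap after the d-th singular value
    return int(np.bincount(votes).argmax()) if votes else None

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # 2-d unit disk embedded in the first two coordinates of R^10
    theta = rng.uniform(0, 2 * np.pi, 2000)
    rad = np.sqrt(rng.uniform(0, 1, 2000))
    P = np.zeros((2000, 10))
    P[:, 0], P[:, 1] = rad * np.cos(theta), rad * np.sin(theta)
    print(local_svd_id(P, P[0], radii=np.linspace(0.1, 0.4, 6)))  # prints 2 for most seeds
```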
itcanbederived[88] that the topological dimension of 𝑑 id any M ⊆ R coincides with the affine dimension 𝑑 of a Under this assumption the to be estimated is the 𝑑 manifold’s topological dimension, defined either through the finite simplicial complex (a simplicial complex in R has firstly proposed Brouwer Large Inductive Dimension [82] affine dimension 𝑑 if it is a collection of affine simplexes 𝑑 or the equivalent Lebesgue Covering Dimension [83]. Since in R , having at most dimension 𝑑 or having at most 𝑑+ Brouwer’s definition has been soon neglected by mathe- 1 vertices) covering M.Thisessentiallymeansthata𝑑- maticiansforitsdifficultproof[83], the commonly adopted dimensional manifold could be approximated by a collection 6 Mathematical Problems in Engineering of 𝑑-dimensional simplexes (each having at most 𝑑+1 by Eckmann and Ruelle in [92] for the correlation dimension vertices); therefore, the topological dimension of M could estimator (CD [19], see below): be practically estimated by analyzing the number of vertices 𝑑< 2 ∗ 𝑁, of the collection of simplexes estimated on P𝑁.Tothisaim, log log (1/𝜌) in [89] a method is proposed to find the number of relevant (1) positive coefficients that are needed to reconstruct each p𝑖 ∈ 𝑟 𝑟 𝑑 𝜌= ≪ , 1𝑁2 ( ) ≫ . P𝑁 from a linear combination of its 𝑘 neighbors, where 𝑘 being 1 1 𝐷 2 𝐷 is a parameter to be manually set in the range 𝑑<𝑘≤ 𝑁 𝐷+1. This algorithm is based on the fact that neighbors Thiswillleadtoalargevalueof ,evenforadatasetwith id with positive reconstruction coefficients are the vertices ofa lower . simplex with dimension equal to the topological dimension Among fractal dimension estimators, one of the most of M. Practically, to ensure that 𝑘>𝑑,itsvalueissetto cited algorithms is presented in [19]andwillbereferredto CD 𝐷, the reconstruction coefficients are calculated by means as in the following. It is an estimator of the correlation of an optimization problem constrained to be nonnegative, dimension (dimCorr), whose formal definition is as follows. and the coefficients bigger than a user-defined threshold are Definition 4 (correlation dimension). Given a finite sample considered as the relevant ones. The id estimate is then set P𝑁,let computed by employing two alternative approaches: the first onesimplycomputesthemodeofthenumberofrelevant 𝑁 󵄩 󵄩 2 󵄩 󵄩 coefficients for each neighborhood and the second one sorts 𝐶𝑁 (𝑟) = ∑ 𝐼(𝑟−󵄩p𝑖 − p𝑗󵄩), (2) 𝑁 (𝑁−1) in descending order the coefficients computed for each neigh- 𝑖=1,𝑖<𝑗 borhood, computes the mean c of the sorted coefficients, and where ‖⋅‖is the Euclidean norm and 𝐼(⋅) is the step function estimates id asthenumberofrelevantvaluesinc.Notethat, used to simulate a closed ball of radius 𝑟 centered on each 𝑘>𝑑 since ,thismethodisstronglyaffectedbythecurvature p𝑖 (𝐼(𝑦) = 0if𝑦<0, and 𝐼(𝑦) = 1otherwise).Then,fora id of the manifold when the is big enough. Indeed, the results countable set, the correlation dimension dimCorr is defined as of the reported experimental evaluation make the authors log 𝐶𝑁 (𝑟) assert that the method works well only on noisy-free data of dim = lim lim . (3) Corr 𝑟→ 𝑁→∞ 𝑟 low id (id ≤ 6), under the assumption that the sampling 0 log process is uniform and the data points are sufficient. 
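To give a concrete feeling for bound (1), which we read as d < 2 log N / log(1/ρ) with ρ = r/D ≪ 1 the ratio between the analysis scale and the dataset diameter, the following illustrative computation (ρ = 0.1 is an arbitrary choice of ours) shows that 10^3, 10^4, and 10^6 points can only support correlation-dimension estimates up to about 6, 8, and 12, respectively.

```python
import numpy as np

# Illustrative reading of bound (1): d < 2*log10(N) / log10(1/rho), rho = r/D << 1.
# With rho = 0.1 (an arbitrary choice), the largest id that the correlation
# dimension can resolve is about 2*log10(N):
for N in (10**3, 10**4, 10**6):
    print(N, 2 * np.log10(N) / np.log10(1 / 0.1))   # -> 6.0, 8.0, 12.0
```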
CD 𝑑̂ In practice computes an estimate, ,ofdimCorr by Though interesting, both approaches have shown to be computing 𝐶𝑁(𝑟) for different 𝑟𝑖 and applying least squares effective only for manifolds of low curvature as well as low to fit a line through the points (log 𝑟𝑖; log 𝐶𝑁(𝑟𝑖)).Ithasto id values. be noted that, to produce correct id estimates, this estimator In the following we survey other id estimators employing needs a very large number of data points [22], which is never two different estimation approaches, which allow us to cate- available for practical applications; however, the computed gorize them. More precisely, in Section 3.2.1 fractal id esti- unreliable estimations can be corrected by the correction mators are described, which estimate different fractal dimen- method proposed in [28]. sions since they are good approximations of the topological CD one; in Section 3.2.2 nearest neighbors-based (𝑁𝑁) id estima- The relevance of the estimatorisshownbyitsseveral variants and extensions. An example is the work proposed in tors arerecalled,whichareoftenbasedonthestatisticalanal- CD ysis of the distribution of points within small neighborhoods. [91], where the authors propose a normalized estimator for binary data and achieve estimates approximating those computed by CD. 3.2.1. Fractal id Estimators. Since topological dimension Since CD is heavily influenced by the setting of the scale cannot be practically estimated, several authors implicitly parameters, in [93] the authors estimate the id by computing M assume that hasasomehowfractalstructure(see[90]for the expectation value of dimCorr through maximum likeli- an exhaustive description of fractal sets) and estimate id by hood estimate of the distribution of distances among points. ̂ employing fractal dimension estimators, the most relevant of The estimated 𝑑 is computed as which are surveyed in this section. − |𝑄| 1 Very roughly, since the basis concept of all fractal dimen- 1 𝑑̂ =−( ∑𝑟 ) , (4) sionsisthatthevolumeofa𝑑-dimensional ball of radius |𝑄| 𝑘 𝑑 𝑘= 𝑟 scales with its size 𝑠 as 𝑟 [90, 91], all fractal dimension 1 𝑄 𝑄={𝑟 |𝑟 <𝑟} 𝑟 estimators are based on the idea of counting the number of where is the set 𝑘 𝑘 and 𝑘 is the Euclidean 𝑟 observations in a neighborhood of radius 𝑟 to (somehow) distance between two generic data points and is a real value, estimatetherateofgrowthofthisnumber.Iftheestimated called cut-off radius. 𝑑 To develop an estimator more efficient than CD,in[94] growth is 𝑟 , then the estimated fractal dimension of the data the authors choose a different notion of Fd,namely,the is considered to be equal to 𝑑. Information Dimension dim𝐼: Note that all the derived estimators have the fundamental ∑N(𝛿) 𝑝𝑟 ( 𝑝𝑟 ) limitation that, in order to get an accurate estimation, the size 𝑖=1 𝑖 log 𝑖 dim𝐼 =−lim , (5) 𝑁 of the dataset with id 𝑑 has to satisfy the inequality proved 𝛿→0 log 𝛿 Mathematical Problems in Engineering 7 where N(𝛿) is the minimum number of 𝛿-sized hypercubes function 𝐼(⋅) with a suitable kernel function. Precisely, they covering a topological space and 𝑝𝑟𝑖 is the probability of find- define ing a point in the 𝑖th hypercube. Noting that when the scale 󵄩 󵄩 𝑁 󵄩p − p 󵄩 𝛿 in (5) is big enough the different coverings used to estimate 2 1 󵄩 𝑖 𝑗󵄩 𝑈 (𝑁, ℎ,) 𝑑 = ∑ 𝐾ℎ ( ) , (8) dim𝐼 could produce different values for N(𝛿),theauthors 𝑁 (𝑁− ) ℎ𝑑 ℎ2 1 1≤𝑖<𝑗≤𝑁 look for the covering composed by the minimum number N (𝛿) of nonempty sets. 
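The log-log fitting at the core of the CD estimator can be sketched as follows; the radius range (here two empirical quantiles of the pairwise distances) and the number of radii are heuristic choices of ours, not part of the original algorithm of [19].

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(X, n_radii=20):
    """Minimal Grassberger-Procaccia-style sketch of the CD estimator:
    estimate the slope of log C_N(r) versus log r, where C_N(r) is the
    fraction of point pairs closer than r (cf. (2)-(3))."""
    d = pdist(X)                                   # all pairwise Euclidean distances
    r_lo, r_hi = np.quantile(d, [0.01, 0.10])      # heuristic small-scale range (our choice)
    radii = np.geomspace(r_lo, r_hi, n_radii)
    C = np.array([np.mean(d < r) for r in radii])  # correlation integral C_N(r)
    slope, _ = np.polyfit(np.log(radii), np.log(C), 1)
    return slope

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.uniform(size=(2000, 5))                # uniform 5-d hypercube
    print(correlation_dimension(X))                # close to 5 (slightly below, edge effects)
```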
Similar to the CD algorithm, the min where 𝐾ℎ is a kernel function with bandwidth ℎ and id istheaverageslopeofthecurveobtainedbyfittingthe 𝑑 N (𝛿) is the assumed dimensionality of the manifold from ( 𝛿; ∑ min 𝑝𝑟 𝑝𝑟 ) points with coordinates log 𝑖=1 𝑖 log 𝑖 . which the points are sampled. Note that, to guarantee the This algorithm is compared with the CD estimator, and the converge of (8),thebandwidthℎ has to fulfill the constraint 𝑑 experimental tests show that both methods compute the same lim𝑁→∞(𝑁ℎ )=∞. For this reason the authors formalize estimates. However, the achieved computation time is much ℎ as a function of 𝑁 and, to achieve scale-independency, lower than that of CD. propose a method that estimates the id by analyzing the con- Considering that CD can severely underestimate the vergence of 𝑈(𝑁, ℎ, 𝑑) when varying the parameters 𝑁 and 𝑑. topological dimension if the data distribution on the man- Precisely, the dataset is subsampled to create sets of different 𝑛 ∈ N = {𝑁, 𝑁/ ,𝑁/ ,𝑁/ ,𝑁/ } ifold is nearly nonuniform, in [7] the author proposes the cardinalities sub sub 2 3 4 5 and the 𝐷 𝑈(𝑛 ,ℎ(𝑛 ), 𝑑), Packing Number (PN), a fractal dimension estimator that curves whose points have coordinates ( sub sub 𝑛 ) are considered. Employing this information the approximates the Capacity Dimension (dimCap). This choice sub following id estimator is proposed: is motivated by the fact that dimCap does not depend on the data distribution on the manifold and, if both dimCap and 󵄨 󵄨 󵄨𝜕𝑈 (𝑛 ,ℎ(𝑛 ),𝑑)󵄨 the topological dimension exist (which is certainly the case Slope (𝑑) = max 󵄨 sub sub 󵄨 , 𝑛 ∈N 󵄨 𝜕𝑛 󵄨 in machine learning applications), the two dimensions agree. sub sub 󵄨 󵄨 To formally define dim ,the𝜖-covering number N(𝜖) of a (9) Cap 𝑑=̂ (𝑑) . set S ⊂ X must be defined; N(𝜖) is the minimum number of arg min Slope B(x ,𝜖)={x ∈ X :‖x −x‖<𝜖} 𝑑∈{1,...,𝐷} open balls 0 0 whose union is a covering of S,where‖⋅‖is a distance metric. The definition S ⊂ X This work is notable since the empirical tests are performed of dimCap of isbasedontheobservationthatthe on synthetic datasets specifically designed to study the covering number N(𝜖) of a 𝑑-dimensional set is proportional −𝑑 influence of high curvature as well as noise on the proposed to 𝜖 : estimator. The usefulness of these datasets is confirmed by thefactthattheyhavebeenalsoemployedtoassessseveral N (𝜖) =− log . subsequent methods [76, 96]. dimCap lim (6) 𝜖→0 log 𝜖 In [97] the authors present a fractal dimension estimator derived by the analysis of a vector quantizer applied to N(𝜖) P ⊆ R𝐷 Y ={y ,..., Since the estimation of is computationally expen- datasets 𝑁 . Considering the codebook 1 N(𝜖) ≤ N (𝜖) ≤ N(𝜖/ ) 𝐷 sive, based on the relation Pack 2 , y𝑘}⊂R containing 𝑘 code-vectors y𝑖,ak-point quantizer 𝜖 N (𝜖) 𝐷 the authors employ the -Packing Number Pack , defined 𝑄 : R → Y 𝜖 is defined by a measurable function 𝑘 ,which in [95] as the maximum cardinality of an -separated set. Y N (𝜖) brings each data point to one of the code-vectors in .This Employing a greedy algorithm to compute Pack ,the 𝑘 S = 𝑑̂ partitions the dataset into so-called quantizer cells 𝑖 estimate, ,ofdimCap is computed as {p ∈ P :𝑄(p )=y } (𝑘) 𝑖 𝑁 𝑘 𝑖 𝑖 , where log2 is called the rate of the quantizer.BeingX a random vector distributed according to N (𝜖 )− N (𝜖 ) ̂ log Pack 1 log Pack 2 ] 𝑒 (𝑄 | 𝑑(𝜖 ,𝜖 ) =− . (7) aprobabilitydistribution ,thequantization error is 𝑟 𝑘 1 2 𝜖 − 𝜖 𝑟 1/𝑟 log 1 log 2 ])=(𝐸][‖X −𝑄𝑘(X)‖ ]) ,where𝑟∈[1,∞)and ‖⋅‖is the 𝐷 EuclideannorminR . 
Given the set Q𝑘 of all 𝐷-dimensional ̂ To estimate 𝑑 a greedy algorithm is used; however, as noted 𝑘-point quantizers, the performance achieved by an optimal 𝑑̂ 𝑘 X 𝑒∗(𝑄 | ])= (𝑒 (𝑄 | ])) by the author, the dependency of with respect to the order -point quantizer on is 𝑟 𝑘 inf𝑄𝑘∈Q𝑘 𝑟 𝑘 . in which the points are visited by the greedy algorithm intro- When the quantizer rate is high, the quantizer cells can be well duces a high variance. To avoid this problem, the algorithm approximated by 𝐷-dimensional hyperspheres with radius iterates the procedure 𝑀 times on random permutations of equal to 𝜖 andcenteredoneachcode-vectory𝑖 ∈ Y.In the data and considers the average as the final id estimate. this case, the regularity of ] ensures that the probability of /𝑑 The comparative evaluation with the CD estimator makes the such balls is proportional to 𝜖1 ,anditcanbeshown[98] authors assert that PN “seems more reliable if data contains ∗ −1/𝑑 that 𝑒𝑟 (𝑄𝑘 | ])≈𝑘 . This is referred to as the high-rate noise or the distribution on the manifold is not uniform.” approximation and motivates the definition of quantization Unfortunately, also this method is scale-dependent. dimension of order 𝑟: To avoid any scale-dependency in [79]theauthors Hein 𝑘 propose an estimator ( ) based on the asymptotes of 𝑑 (]) =− log . 𝑟 lim ∗ (10) a smoothed version of (2),obtainedbyreplacingthestep 𝑘→∞log 𝑒𝑟 (𝑘|]) 8 Mathematical Problems in Engineering

The theory of high-rate quantization98 [ ] confirms that, for where 𝑘 isthenumberofnearestneighborstox within the aregular] supported on the manifold M, 𝑑𝑟(]) exists for hypersphere B𝑑(x,𝑟)of radius 𝑟 and centered on x and 𝑉(𝑑) 𝑑 each 1 ≤𝑟≤∞and equals the intrinsic dimension of M. is the volume of the (unit 𝑑-dimensional) ball in R . Furthermore, the limit 𝑘→∞allows us to motivate the This means that the proportion of sample points falling in relation between the quantization dimension and the Capac- B𝑑(x,𝑟)is roughly approximated by 𝑝(x) times the volume of 𝑑 ity Dimension. Indeed, according to the theory of high-rate B𝑑(x,𝑟).Sincethisvolumegrowsas𝑟 , assuming the density {𝜖 } quantization [98, 99], there exists a decreasing sequence 𝑘 , 𝑝(𝑥) to be a constant, it follows that the number of samples such that for sufficiently large values of 𝑘 (i.e., in the high-rate 𝑑 ∗ in B𝑑(x,𝑟) is proportional to 𝑟 . From the relationship in regime that is when 𝑘→∞)theratio−(log 𝑘)/(log 𝑒𝑟 (𝑘 | ])) (11), and assuming that the samples are locally uniformly can be approximated increasingly finely, both from below distributed, the authors derive an id estimator for 𝑑: and from above, by quantities converging to the common 𝑟 value dimCap. To practically compute an estimate of the quan- 𝑑̂ = 𝑘 , tization dimension, having fixed the value of 𝑟,theauthors 𝑘(𝑟 − 𝑟 ) (12) 𝑘 ≤𝑘≤𝑘 𝑘+1 𝑘 select a range 1 2 of codebook sizes and design 𝑘 2 𝑟 asetofquantizers{𝑄𝑘}𝑘=𝑘 giving good approximations where 𝑘 is the average of the distances from each sample ∗ 1 𝑘 𝑟(𝑘) 𝑒̂𝑟(𝑘 | ]) of 𝑒 (𝑘 | 𝑝) over the chosen range of 𝑘.Anid point to its th nearest neighbors; defining 𝑖 as the distance 𝑟 x 𝑘 𝑟 estimate is obtained by fitting the points with coordinates between 𝑖 and its th-nearest neighbor, 𝑘 is expressed as ( (𝑘); − 𝑒̂ (𝑘 | ])) 𝑟 =( /𝑁) ∑𝑁 𝑟(𝑘) log log 𝑟 andmeasuringtheaverageslope 𝑘 1 𝑖=1 𝑖 . over the chosen range 𝑘.Thoughtheauthorsmentionthat Since this algorithm is limited by the choice of a suitable their algorithm is less affected by underestimation biases than value for parameter 𝑘,in[63]theauthorsproposeavariant [𝑘 ,𝑘 ] neighborhood-based methods (see Section 3.2.2), in [23]this which considers a range of neighborhood sizes min max . statement is confuted with theoretical arguments. However, in the same work the authors themselves show that this technique generally yields an underestimate of the id 3.2.2. Nearest Neighbors-Based id Estimators. In this section when its value is high. 𝑁𝑁 𝑁 we consider estimators, referred to as estimators in the Taking into account relation (11),in[101]thenumber B𝑑 following, that describe data-neighborhoods’ distributions as of data points in B𝑑(x,𝑟)is described by a polynomial 𝑓(𝑟) = 𝑑 ∑𝑑 𝛽 𝑟𝑠 𝑑 p , p ∈ P functions of .Theyusuallyassumethatclosepointsareuni- 𝑠=0 𝑠 of degree . In practice, considering 𝑖 𝑘 𝑁, formly drawn from small 𝑑-dimensional balls (hyperspheres) calling 𝑟𝑖𝑘 =‖p𝑖 − p𝑘‖ the interpoint distances, and being 𝑟= B (x,𝑟) 𝑟→ ∈ R+ 𝑁 𝑑 having radius (a small radius 0 guar- min𝑖,𝑘= 𝑟𝑖𝑘, 𝑅 a parameter adaptively estimated (to estimate B (x,𝑟) 1 antees that samples included into 𝑑 ,beinglessinflu- 𝑅 by means of P𝑁, the radius value corresponding to the first 𝜙 enced by the curvature induced by the map ,areapproxi- significant peak of the histogram of the 𝑟𝑖𝑗 sisfound),asetof𝑛 mating well enough the intrinsic structure of the underlying r ={𝑟 = 𝑟+𝑗(𝑅−𝑟)/𝑛}𝑛 + radius values 𝑗 𝑗=1 is selected and used to portion of M) 𝑟→0 ∈ R and being centered on x ∈ M. 
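The nearest-neighbor estimator (12) of Pettis et al. is easily turned into code; the sketch below computes only the raw estimate from the average k-th and (k+1)-th neighbor distances, omitting the iterative refinement and outlier handling of the original proposal [26].

```python
import numpy as np
from scipy.spatial import cKDTree

def pettis_id(X, k=10):
    """Raw nearest-neighbor estimate of (12): d ~ r_k / (k * (r_{k+1} - r_k)),
    where r_k is the mean distance from each point to its k-th neighbor."""
    tree = cKDTree(X)
    dist, _ = tree.query(X, k=k + 2)   # column 0 is the query point itself
    r_k = dist[:, k].mean()            # mean distance to the k-th neighbor
    r_k1 = dist[:, k + 1].mean()       # mean distance to the (k+1)-th neighbor
    return r_k / (k * (r_k1 - r_k))

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = rng.uniform(size=(5000, 6))    # uniform 6-d hypercube
    print(pettis_id(X, k=10))          # roughly 6
```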
̂ 𝑛 ̂ 𝑁 calculate 𝑛 pairs {(𝑟𝑗, 𝑓(𝑟𝑗))}𝑗= ,where𝑓(𝑟𝑗)=#[𝑟𝑖𝑘 <𝑟𝑗]𝑖,𝑘= Practically, given an input dataset P𝑁,thevalueof 1 1 𝑟 functions 𝑓(𝑑) is computed by approximating the sampling is the number of interpoint distances strictly lower than 𝑗.To B 𝑘 {𝛽 }𝐷 process related to 𝑑 through the -nearest neighbor algo- estimate the coefficients 𝑗 𝑗=1,thecomputedpairsarefitby rithm (kNN). a least squares fitting procedure that estimates exactly 𝐷+1 Among 𝑁𝑁 id estimators, Trunk’s method [100]isoften coefficients. Since by hypothesis the degree of 𝑓 is 𝑑,thesig- ̂ cited as one of the first methods. It formulates the distribution nificance test described in [17] is used to estimate the degree 𝑑 𝑓(𝑑) ̂ function, , with an ad hoc statistic based on geometric of 𝑓, which is considered as the id estimate. The comparative considerations concerning angles; in practice, having fixed a evaluation of this algorithm with the well-known Maximum 𝛾 𝑘 threshold and a starting value for the parameter ,itapplies Likelihood Estimator (MLE)[27]anditsimprovedversion kNN p ∈ P to find the neighbors of each 𝑖 𝑁 and calculates [102], both described below, has shown that it is more robust ] (𝑘 + ) the angle 𝑖 between the 1 th-nearest neighbor and the than them when dealing with high-dimensional datasets. 𝑘 subspace spanned by the -nearest neighbors. Considering MLE [27], one of the most cited estimators, treats the 𝛾 ( /𝑁) ∑𝑁 ] ≤𝛾 𝑘 a threshold parameter ,if 1 𝑖=1 𝑖 ,then is neighbors of each point p𝑖 ∈ P𝑁 as events in a Poisson process id 𝑘 (𝑗) considered as the estimate; otherwise, is incremented and the distance 𝑟 (p𝑖) between the query point p𝑖 and its by 1 and the process is repeated. The main limitation of this 𝑗th nearest neighbor as the event’s arrival time. Since this 𝛾 method is the difficult choice of a proper value for . process depends on 𝑑, MLE estimates id by maximizing the The work presented by Pettis et al. [26]isnotablesinceit log-likelihood of the observed process. In practice a local id is one of the first works providing a mathematical motivation estimate is computed as for the use of nearest-neighbor distances. −1 P ⊆ R𝐷 𝑘 (𝑘+1) Indeed, for an i.i.d. sample 𝑁 drawn from a 1 𝑟 (p𝑖) 𝑑 𝑑(̂ p ,𝑘)=( ∑ ) . 𝑝(x) R 𝑖 log (𝑗) (13) density distribution in ,thefollowingapproximation 𝑘 𝑗= 𝑟 (p𝑖) holds: 1 𝑑(̂ p ,𝑘) id 𝑑(𝑘)̂ = 𝑘 𝑑 Averaging the 𝑖 s, the global estimate is ≃𝑝(x) 𝑉 (𝑑) 𝑟 , (11) ( /𝑁) ∑𝑁 𝑑(̂ p ,𝑘) 𝑁 1 𝑖=1 𝑖 . Mathematical Problems in Engineering 9

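A compact sketch of the MLE of (13) [27], together with the harmonic averaging of [102], could look as follows; the neighborhood size k is the only parameter, and the toy dataset is an assumption used only for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

def mle_id(X, k=10, average="arithmetic"):
    """Levina-Bickel-style maximum likelihood estimator, following (13):
    the local estimate at p_i inverts the mean log-ratio between the
    (k+1)-th neighbor distance and the first k neighbor distances.
    `average` selects the arithmetic mean of [27] or the harmonic mean of [102]."""
    dist, _ = cKDTree(X).query(X, k=k + 2)   # column 0 is the point itself
    r = dist[:, 1:]                          # r[:, j-1] = distance to the j-th neighbor
    local_inv = np.mean(np.log(r[:, k][:, None] / r[:, :k]), axis=1)  # 1 / d_hat(p_i, k)
    if average == "harmonic":
        return 1.0 / np.mean(local_inv)      # harmonic mean of the local estimates
    return np.mean(1.0 / local_inv)          # arithmetic mean of the local estimates

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    X = rng.uniform(size=(4000, 8))
    # both values are close to 8; MLE mildly underestimates as the id grows
    print(mle_id(X, k=10), mle_id(X, k=10, average="harmonic"))
```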
id KL(𝑔 ,𝑔𝑑 ) The theoretical stability of the proposed estimator for Data ,whichisevaluatedbymeansofthedata- 𝐷 Sphere data living in 𝐶1 submanifold of R , 𝑑≤𝐷,andfordata driven technique proposed in [108]. 𝐷 in an affine subspace of R has been proved, respectively, IDEA relies on the authors’ observation that the quantities (𝑗) (𝑘+1) in [103, 104]. Though the authors’ comparative evaluation 1 −𝑟 (p𝑖)/𝑟 (p𝑖) are distributed according to the beta 𝛽 𝑑 shows the superior performance of the proposed estimator distribution 1,𝑑 with parameters 1 and ,respectively. CD E[𝛽 ]=𝑚= /( +𝑑) id with respect to the estimator [19](seeSection 3.2.1)and Therefore, since 1,𝑑 1 1 ,aconsistent NN ̂ the estimator [26], they further improve it by removing its estimator 𝑑≃𝑑equals dependency from the parameter 𝑘; to this end, different val- ues for 𝑘 are adopted and the computed results are averaged ̂ ̂ 𝑚̂ 𝑚 to obtain the final id estimate: 𝑑=(1/𝑡) ∑𝑘∈{𝑘 ,...,𝑘 } 𝑑(𝑘). 𝑑=̂ ≃𝑑= , 1 𝑡 − 𝑚̂ −𝑚 Considering that, in practice, MLE is highly biased for 1 1 both large and small values of 𝑘,avariantofMLE is proposed 𝑁 𝑘 (𝑗) (14) 1 𝑟 (p𝑖) in [102], where the arithmetic mean is substituted with the where 𝑚=̂ ∑∑ ≃𝑚. ̂ 𝑁𝑘 𝑟(𝑘+1) (p ) harmonic average, leading to the following estimator: 𝑑(𝑘) = 𝑖=1𝑗=1 𝑖 (( /𝑁) ∑𝑁 ( /𝑑(̂ p , 𝑘)))−1 1 𝑖=1 1 𝑖 . Though the proposal in [102]seemstoachievemoreaccu- To reduce the effect of the aforementioned bias, IDEA finally rate results, it is based on the assumption that neighbors sur- applies an asymptotic correction step that, inspired by the p rounding each 𝑖 are independent, which is clearly incorrect. correction method presented in [28], models the underesti- To cope with this problem, in [105] an interesting regularized mation error by considering both the base algorithm and the MLE version of applies a regularized maximum likelihood given dataset. techniquetodistancesbetweenneighbors.Thecomparative Motivated by the promising results achieved by MiNDKL, MLE evaluation with the aforementioned methods [27, 102] in [76] the authors propose its extension, namely, DANCo;it makes the authors state that, though the method might be further reduces the underestimation effect by combining an thefirsttoconvergetotheactualestimategivenenoughdata estimator employing normalized nearest-neighbor distances points, its estimation accuracy is comparable to that achieved with one employing mutual angles. More precisely, DANCo by the competing schemes. compares the statistics estimated on P𝑁 with those estimated In [59, 106]afurtherimprovementofMLE is presented; on (uniformly drawn) synthetic datasets of known id.The it achieves a better performance by substituting euclidean comparisons are performed by two Kullback-Leibler diver- distances with geodesic ones. gences applied to the distribution of normalized nearest- 𝑑− Despite the good results achieved by MLE-based neighbor distances 𝑔(𝑟; 𝑘,,having 𝑑) 𝑔(𝑟; 𝑘, 𝑑) =𝑘𝑑𝑟 1(1 − 𝑑 𝑘− approaches, these techniques have shown to be affected by 𝑟 ) 1, and the distribution of pairwise angles 𝑞(x; ^,𝜏), thecurvatureinducedby𝜙 on the manifold neighborhoods 𝑞(x; ^,𝜏)being the von Mises-Fisher distribution (VMF)[109] approximated by kNN. To reduce this effect, various id with parameters ^ and 𝜏. ̂ estimators have been proposed in [76, 107];here,wereview The id estimate 𝑑 is the one minimizing the sum of the those achieving the most promising experimental results. 
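The IDEA estimate (14) reduces to averaging normalized neighbor distances; the sketch below implements only this base estimator and omits the asymptotic correction step mentioned above, so it inherits the underestimation bias that the correction is designed to remove.

```python
import numpy as np
from scipy.spatial import cKDTree

def idea_id(X, k=10):
    """Sketch of the IDEA-style estimator of (14): for locally uniform data the
    mean m of the normalized neighbor distances r^(j)/r^(k+1) tends to d/(d+1),
    so d is recovered as m / (1 - m)."""
    dist, _ = cKDTree(X).query(X, k=k + 2)             # column 0 is the point itself
    ratios = dist[:, 1:k + 1] / dist[:, k + 1:k + 2]   # r^(j) / r^(k+1), j = 1..k
    m = ratios.mean()
    return m / (1.0 - m)

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    X = rng.uniform(size=(4000, 10))
    print(idea_id(X, k=10))   # somewhat below 10 without the correction step
```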
two divergences: In [107] the authors firstly propose a family of id esti- mators (MiNDML∗), which exploit the pdf 𝑔(𝑟; 𝑘, 𝑑) describing (1) ̂ 𝑑 the distance 𝑟 (x) between the center x of B𝑑(x,𝑟), x ∈ 𝑑=arg min KL (𝑔 ,𝑔 ) + Data Sphere M 𝑟→ 𝑑∈{1,...,𝐷} , 0 and its nearest neighbor. Briefly, formulating (15) 𝑔(𝑟; 𝑘, 𝑑) as a function of the id value 𝑑 (𝑔(𝑟; 𝑘, 𝑑)= 𝑑 𝑑− 𝑑 𝑘− + KL (𝑞 ,𝑞 ). 𝑘𝑑𝑟 1(1 −𝑟) 1), the id estimator is computed by a Data Sphere maximum likelihood approach. After noting that this algorithm is still affected by a bias A fast implementation of DANCo (Fast-DANCo)isalsodevel- causing underestimations when the dataset dimensionality oped. Comparative evaluations show that this algorithm becomes sufficiently high (i.e., 𝑖𝑑 ≥ 10), the authors present achieves promising results (as shown in [76]andSection 4). theoretical considerations which relate the bias to the fact that Another work, which is notable because the authors not id estimators based on nearest-neighbor distances are often only prove the consistency in probability of the presented founded on statistics derived under the assumption that the estimators but also derive upper bounds (see (19) below) on amount of available data is unlimited, which is never the case the probability of the estimation-error for finite, and large in practical applications. Based on these considerations, two enough, values of 𝑁,isproposedin[33]. More precisely, different estimators, named MiNDKL and IDEA, are presented. the authors introduce two estimators by firstly defining a 𝐷 + MiNDKL compares the empirical pdf of the neighborhood function 𝜂:R × R → R slowly varying near the 𝑔 distances computed on the dataset ( Data)withthedistribu- origin (see [33] for a detailed description and motivation tion of the neighborhood distances computed from points of this assumption). The function 𝜂 is then used to express uniformly drawn from hyperspheres of known increasing the logarithm of the probability of a point p of being 𝑔𝑑 id B (p ,𝑟) (P(p ∈ B (p , 𝑟))) = dimensionality ( Sphere). The estimate is the dimen- in the hypersphere 𝐷 𝑖 :log 𝐷 𝑖 𝑑 sionality that minimizes their Kullback-Leibler divergence log(𝜂(p,𝑟))+𝑑log(𝑟),havingP(p ∈ B𝐷(p𝑖, 𝑟)) = 𝜂(p,𝑟)𝑟 . 10 Mathematical Problems in Engineering

(𝑘) Considering that P(p ∈ B(p𝑖,𝑟 (p𝑖))) ≈ 𝑘/𝑛 for 𝑁 big A MST(P𝑁) is the spanning tree minimizing the sum of enough, the authors derive the following system of equations: the edge weights. When the weights approximate Geodesic distances [56], a GMST𝑁(P𝑁) is obtained. 𝑘 ̂ (𝑘) A SIG𝑁(P𝑁) is defined by connecting nodes p𝑖 and p𝑗 iff log ( )≈log (𝜂 (p𝑖,𝑟))+𝑑(p𝑖) log (𝑟 (p𝑖)), 𝑛 ‖p𝑖 − p𝑗‖ ≤ 𝜌(𝑖) + 𝜌(𝑗),where𝜌(𝑖) is the distance between p𝑖 and its nearest neighbor in P𝑁. Essentially, two vertices p𝑖 𝑘 (16) log ( )≈log (𝜂 (p𝑖,𝑟)) and p𝑗 are connected if the corresponding NN hyperspheres (2𝑛) intersect. A generalization of SIG𝑁(P𝑁) is kSIG(P𝑁),where ̂ (⌈𝑘/2⌉) nodes p𝑖 and p𝑗 are connected iff ‖p𝑖 −p𝑗‖≤𝜌𝑘(𝑖)+𝜌𝑘(𝑗), 𝜌𝑘(𝑖) + 𝑑(p𝑖) log (𝑟 (p𝑖)), being the distance between p𝑖 and its kNN in P𝑁.Thismeans ̂ that the kNN hyperspheres centered on p𝑖 and p𝑗 intersect. and solve it for 𝑑(p𝑖) to obtain a local id estimate: In the following we recall interesting id estimators based GMST(P ) kNNG(P ) kSIG(P ) ̂ log (2) on 𝑁 , 𝑁 ,and 𝑁 . 𝑑(p𝑖) = . (𝑟(𝑘) (p )/𝑟 (⌈𝑘/2⌉) (p )) (17) In [29, 30], after defining the length functional log 𝑖 𝑖 𝛾 L(𝐺𝑁(P𝑁)) = ∑ |𝑒𝑖,𝑗| , 𝛾∈(0,𝑑), to build either the The two proposed estimators are then computed either by GMST(P𝑁) or the MST(P𝑁) of kNNG(P𝑁),graphtheoriesare 𝑑̂ 𝑑̂ id averaging ( avg)orbyvoting( vote): exploited to estimate both the of the underlying manifold structure M and its intrinsic Renyi` 𝛼-entropy HM.Tothis 𝑁 aim, the authors derive the linear model: logL(MST(P𝑁)) = 𝑑̂ = 1 ∑𝑑(̂ p ), avg 𝑖 𝑎 log 𝑑+𝑏, 𝑎 = (𝑑−𝛾)/𝑑, 𝑏=log𝑐+HM, 𝑐 being an unknown 𝑁 𝑖= 1 (18) constant, and exploit it to define an estimator of both 𝑑 and 𝐾 󸀠 𝑁 H {𝑛 } 𝑑̂ = [𝑑(̂ p )=𝑑 ] , M. Briefly, a set of cardinalities 𝑘 𝑘=1 is chosen and, for vote arg max # 𝑖 𝑖= 󸀠 + 1 𝑛 MST(P ) P 𝑑 ∈N each 𝑘,the 𝑛𝑘 isconstructedontheset 𝑛𝑘 ,which contains 𝑛𝑘 points randomly sampled from P𝑁,toobtain [ ]𝑁 p 𝐾 ( L(MST(P )), 𝑛 ) where # cond 𝑖=1 denotes the number of points 𝑖 for which asetof pairs log 𝑛𝑘 𝑘 . Fitting them with a ̂ cond is verified. least squares procedure the estimates 𝑎≃𝑎̂ and 𝑏≃𝑏are Under differentiability assumptions on the function 𝜂 𝑎 = (𝑑−𝛾)/𝑑 id M computed. Recalling that ,the is calculated as and regularity assumptions on the authors prove the 𝑑=̂ {𝛾/( − 𝑎)}̂ ≃ 𝑑 consistency in probability of their estimators and provide round 1 . This process is iterated to produce upper bounds (see (19)) on the probability of the estimation- thefinalestimateastheaverageoftheobtainedresults. kNNG error for finite, and large enough, values of 𝑁.However, The aforementioned based algorithm [29, 30] the derived bounds depend on unknown universal constants is exploited in [112], where the authors consider datasets 󸀠 󸀠󸀠 𝑐, 𝑐 ,𝑐 > 0: sampled from a union of disjoint manifolds with possibly different ids. To estimate the local ids, the authors propose 󸀠 ̂ 𝑐 𝑁 a heuristic, which is not described here, to automatically P (𝑑 =𝑑)≰ exp (− ), avg (𝐷𝑐𝑑𝑘)2 determine the local neighborhoods with similar geometric structures without any prior knowledge on the number of (19) id 𝑐󸀠󸀠𝑁 manifolds, their s, and their sampling distributions. P (𝑑̂ =𝑑)≰ (− ). In [96] the authors present three id estimation vote exp 𝑑 2 (𝑐 𝑘) approaches, defined as “graph theoretic methods” since the statistics they compute are functions only of graph prop- 3.3. Graph-Based id Estimators. 
Asnotedin[96], the work erties (such as vertex degrees and vertex eccentricities) and of [110] has cleared up the fact that theories underlying do not directly depend on the interpoint distances. 1 graphscanbeappliedtosolveavarietyofstatisticalproblems; The first statistic, denoted by 𝑆𝑁(P𝑁)=𝑟𝑗(kNNG(P𝑁)) indeed, also in the field of id estimation various types in the following, is based on the reach (the reach 𝑟𝑗,𝑖(p𝑖,𝐺), of graph structures have been proposed [29, 96, 111, 112] in 𝑗 steps of a node p𝑖 ∈𝐺,isthetotalnumberofvertices andusedforid estimation. Examples are the kNN graph which are connected to p𝑖 by a path composed of 𝑗 arcs kNNG MST ( )[30], the Minimum Spanning Tree ( )[113]and or less in 𝐺)ofverticesinthekNNG(P𝑁). Considering that MST GMST its variation, the geodesic ( )[29]; and the sphere the reach of each vertex p𝑖 ∈ kNNG(P𝑁) grows as the id SIG of influence graph ( )[114] and its generalization, the increases, in [115]theaveragereach𝑟𝑗(kNNG) in 𝑗 steps of 𝑘−sphere of influence graph (kSIG)[96]. 1 vertices in kNNG(P𝑁) is employed: 𝑆 (P𝑁)=𝑟𝑗(kNNG(P𝑁)) = P ={p }𝑁 P 𝑁 Given a sample set 𝑁 𝑖 𝑖=1 agraphbuilton 𝑁, 𝑁 𝑁 (1/𝑁) ∑𝑖= 𝑟𝑗,𝑖(p𝑖, kNNG(P𝑁)). usually denoted by 𝐺(P𝑁)=({p𝑖} ,{𝑒𝑖,𝑗}𝑖,𝑗∈{ ,...,𝑁}), employs 1 𝑖=1 1 𝑆2 (P )= the sample points p𝑖 as nodes (vertices) of the graph and The second statistic, denoted by 𝑁 𝑁 𝑀 (MST(P )) connects them with weighted arcs (edges) {𝑒𝑖,𝑗}𝑖,𝑗∈{ ,...,𝑁}. 𝑁 𝑁 , is computed by considering the degree 1 MST(P ) P A kNNG𝑁(P𝑁) is built by employing a distance function, of vertices in the 𝑁 . Recalling that, for datasets 𝑁 𝑑 which commonly is the Euclidean one, to weight the arcs obtained from a continuous distribution on R ,theratio connecting each p𝑖 to its kNNs. of nodes with a given degree 𝑗 in MST𝑁(P𝑁) converges a.s. Mathematical Problems in Engineering 11 to a limit depending only on 𝑗 and 𝑑 [116] and that the This interesting comparison strengthens the need of average degree in a tree is a constant depending only on defining a benchmark framework to allow an objective and the number of vertices, the authors empirically observe a reproducible comparative evaluation of id estimators. For dependency between the average degree and the id.This this reason, in Section 4 we describe our proposal in this leads to the definition of an id estimator employing the direction. 𝑆2 =𝑀 (MST(P )) = ( /𝑁)𝑁 ∑ ( (p ))2 statistic 𝑁 𝑁 𝑁 1 𝑖=1 degMST(P𝑁) 𝑖 . 3 𝑘 The third statistic, denoted by 𝑆𝑁(P𝑁)=𝑈𝑁(kSIG(P𝑁)), 4. A Benchmark Proposal is motivated by studies in the literature [117] showing that the At the present, an objective comparison of different id expected number of neighbors shared by a given pair of points estimators is not possible due to the lack of a standard- depends on the id of the underlying manifold. Accordingly, ized benchmark framework; therefore, in this section, after calling 𝑁𝑖,𝑗 the number of samples in the intersection of recalling experimental datasets and evaluation procedures the two kNN hyperspheres centered on p𝑖 and p𝑗,intuitions introduced in the literature (see Sections 4.1 and 4.2), we similar to those considered for 𝑟𝑗(kNNG) lead to defining of choose some of them to propose a benchmark framework 𝑆3 (P )=𝑈𝑘 (kSIG(P )) = ( /𝑛) ∑ 𝑁 𝑁 𝑁 𝑁 𝑁 1 𝑖≤𝑗 𝑖,𝑗. (see Section 4.3) that allows for reproducible and comparable Based on their theoretical results and empirical tests on experimental setups. 
The first statistic, denoted by S^1_N(P_N) = r̄_j(kNNG(P_N)) in the following, is based on the reach of vertices in the kNNG(P_N) (the reach r_{j,i}(p_i, G), in j steps, of a node p_i ∈ G is the total number of vertices which are connected to p_i by a path composed of j arcs or less in G). Considering that the reach of each vertex p_i ∈ kNNG(P_N) grows as the id increases, in [115] the average reach r̄_j(kNNG) in j steps of vertices in kNNG(P_N) is employed: S^1_N(P_N) = r̄_j(kNNG(P_N)) = (1/N) Σ_{i=1}^{N} r_{j,i}(p_i, kNNG(P_N)).

The second statistic, denoted by S^2_N(P_N) = M_N(MST(P_N)), is computed by considering the degree of vertices in the MST(P_N). Recalling that, for datasets obtained from a continuous distribution on R^d, the ratio of nodes with a given degree j in MST_N(P_N) converges a.s. to a limit depending only on j and d [116], and that the average degree in a tree is a constant depending only on the number of vertices, the authors empirically observe a dependency between the average degree and the id. This leads to the definition of an id estimator employing the statistic S^2_N = M_N(MST(P_N)) = (1/N) Σ_{i=1}^{N} (deg_{MST(P_N)}(p_i))^2.

The third statistic, denoted by S^3_N(P_N) = U^k_N(kSIG(P_N)), is motivated by studies in the literature [117] showing that the expected number of neighbors shared by a given pair of points depends on the id of the underlying manifold. Accordingly, calling N_{i,j} the number of samples in the intersection of the two kNN hyperspheres centered on p_i and p_j, intuitions similar to those considered for r̄_j(kNNG) lead to defining S^3_N(P_N) = U^k_N(kSIG(P_N)) = (1/n) Σ_{i≤j} N_{i,j}.

Based on their theoretical results and on empirical tests performed on synthetically generated datasets characterized by id values d_j in a finite range F ⊆ N^+ (where F = {d_j}_{d_j=2}^{12} in the reported experiments), the authors propose an approximate Bayesian estimator that can indistinctly employ each of the three statistics S^1_N, S^2_N, and S^3_N, denoted by S^*_N in the following. To this aim, they assume that each statistic can be approximated by a Gaussian density f_{d_j}(·) = N(μ(d_j), σ^2(d_j)); to estimate μ(d_j) and σ^2(d_j), for each d_j ∈ F, L datasets of large size are synthetically generated by random sampling from the uniform distribution on the unit d_j-cube. These datasets are then used to estimate the parameters μ̃(d_j) ≃ μ(d_j) and σ̃^2(d_j) ≃ σ^2(d_j) that define the approximation f̃_{d_j}(·), computed on a generic sample set with size N and id = d_j, of the Gaussian density f_{d_j}(·) of S^*_N.

At this stage, given a new input dataset P_N having unknown id, the statistic S^*_N(P_N) = s_P is computed and used to calculate the approximated value f̃_{d_j}(s_P) = N(μ̃(d_j), σ̃^2(d_j)/N) ≃ f_{d_j}(s_P). Assuming equal a priori probability for all the d_j ∈ F, the posterior probability P[d_j | S^*_N] is given by

P[d_j \mid S^{*}_{N}] = \frac{\tilde{f}_{d_j}(s_P)}{\sum_{d_j \in F} \tilde{f}_{d_j}(s_P)}, \qquad d_j \in F,   (20)

and employed to compute an "a posteriori expected value" of the id:

\hat{d} = \mathrm{round}\Bigl\{ \sum_{d_j \in F} d_j\, P[d_j \mid S^{*}_{N}] \Bigr\}.   (21)
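The calibration-plus-posterior machinery in (20)-(21) can be sketched as follows. Here the statistic is a shared-neighbour proxy for S^3_N (N_{i,j} approximated by the overlap of the kNN index sets), and the calibration sets share the test-set cardinality; both choices are our own simplifications and not the exact procedure of [96].

```python
import numpy as np
from scipy.spatial import cKDTree

def shared_neighbour_statistic(X, k=10):
    """Proxy for S^3_N: average overlap of the kNN index sets over all pairs (i <= j)."""
    _, idx = cKDTree(X).query(X, k=k + 1)
    n = len(X)
    A = np.zeros((n, n), dtype=np.int8)
    np.put_along_axis(A, idx[:, 1:], 1, axis=1)   # A[i, j] = 1 iff p_j is among the kNN of p_i
    shared = A @ A.T                              # shared[i, j] = |kNN(p_i) n kNN(p_j)|
    return np.triu(shared).sum() / n

def calibrate(dims, n, L=5, k=10, rng=None):
    """Mean/variance of the statistic on L uniform samples of the unit d-cube, per candidate d."""
    rng = np.random.default_rng(rng)
    mu, var = {}, {}
    for d in dims:
        vals = [shared_neighbour_statistic(rng.random((n, d)), k) for _ in range(L)]
        mu[d], var[d] = np.mean(vals), np.var(vals)
    return mu, var

def bayesian_id(X, dims, mu, var, k=10):
    """Posterior over candidate ids, cf. (20), and its rounded expectation, cf. (21).
    Calibration sets have the same cardinality as X, so var[d] already plays the role of sigma^2(d)/N."""
    s = shared_neighbour_statistic(X, k)
    dens = np.array([np.exp(-0.5 * (s - mu[d]) ** 2 / var[d]) / np.sqrt(var[d]) for d in dims])
    post = dens / dens.sum()
    return int(round(np.dot(dims, post)))

dims = np.arange(2, 13)                           # F = {2, ..., 12}, as in the reported experiments
mu, var = calibrate(dims, n=500, rng=0)
X_test = np.random.default_rng(1).random((500, 5))   # uniform samples from the unit 5-cube
print(bayesian_id(X_test, dims, mu, var))             # should lie near 5
```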
The authors evaluate the performance of their methods on synthetic datasets, some of which have been used by similar studies in the literature [79], while the others (challenging ones) are proposed by the authors to obtain manifolds with nonconstant curvature. The comparison of the achieved results with those obtained by the estimators proposed in [27, 33, 112, 118] has led to the conclusion that none of the methods has a good performance on all the tested datasets. However, graph theoretic approaches would appear to behave better when manifolds of nonconstant curvature are processed.

This interesting comparison strengthens the need of defining a benchmark framework to allow an objective and reproducible comparative evaluation of id estimators. For this reason, in Section 4 we describe our proposal in this direction.

4. A Benchmark Proposal

At present, an objective comparison of different id estimators is not possible due to the lack of a standardized benchmark framework; therefore, in this section, after recalling experimental datasets and evaluation procedures introduced in the literature (see Sections 4.1 and 4.2), we choose some of them to propose a benchmark framework (see Section 4.3) that allows for reproducible and comparable experimental setups. The usefulness of the proposed benchmark is then shown by employing it to compare relevant state-of-the-art id estimators whose code is publicly available (see Section 4.4).

4.1. Datasets. The datasets employed in the literature are both synthetically generated datasets and real ones. In the following sections we describe those we choose to use in our benchmark study.

4.1.1. Synthetic Datasets. Synthetic datasets are generated by drawing samples from manifolds (M) linearly or nonlinearly embedded in higher dimensional spaces.

The publicly available tool (http://www.mL.uni-saarland.de/code/IntDim/IntDim.htm) proposed by Hein and Audibert in [79] allows us to generate 13 kinds of synthetic datasets by uniformly drawing samples from 13 manifolds of known id; they are schematically described in Table 1, where they are referred to as M^H_*. These manifolds are embedded in higher dimensional spaces through both linear and nonlinear maps and are characterized by different curvatures. We note that manifold M^H_8 is particularly challenging for its high curvature; indeed, when it is used for testing, most relevant id estimators compute pronounced id overestimates (see also the results reported in [107]).

Another interesting dataset [96] is generated by sampling a d-dimensional paraboloid, M_Pd, nonlinearly embedded in a higher (3(d + 1))-dimensional space, according to a multivariate Burr distribution with parameter α = 1. Tests on this dataset are particularly challenging since the underlying manifold is characterized by a nonconstant curvature.

To perform tests on datasets generated by employing a smooth nonuniform pdf, we propose the dataset M_beta, obtained as follows: we sample N points in [0, 1)^10 according to a beta distribution β_{0.5,10}, with parameters 0.5 and 10, respectively (high skewness), and store them in a matrix X_N ∈ R^{N×10}; we multiply each entry X_N(i, j) by sin(cos(2πX_N(i, j))), thus obtaining a matrix D_1 ∈ R^{N×10}; we multiply each entry of X_N by cos(sin(2πX_N(i, j))), thus obtaining another matrix D_2 ∈ R^{N×10}; we append D_1 and D_2 to generate a matrix D_3 ∈ R^{N×20}; finally, we append D_3 to its duplicate to generate a test dataset containing N points in R^40.
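A minimal sketch of the M_beta construction just described (the helper name and the use of numpy's Beta sampler are our choices):

```python
import numpy as np

def generate_m_beta(n=2500, rng=None):
    """M_beta: Beta(0.5, 10) samples in [0,1)^10 nonlinearly embedded in R^40 (id = 10)."""
    rng = np.random.default_rng(rng)
    X = rng.beta(0.5, 10.0, size=(n, 10))        # highly skewed samples
    D1 = X * np.sin(np.cos(2.0 * np.pi * X))     # first nonlinear image, N x 10
    D2 = X * np.cos(np.sin(2.0 * np.pi * X))     # second nonlinear image, N x 10
    D3 = np.hstack([D1, D2])                     # N x 20
    return np.hstack([D3, D3])                   # append D3 to its duplicate -> N x 40

print(generate_m_beta().shape)                   # (2500, 40)
```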



Figure 2: (a) Samples from ISOMAP face database. (b) Samples from digit “0” to digit “9” in MNIST database.

Table 1: The 13 types of synthetic datasets generated with the tool proposed in [79].

Dataset name   Underlying manifold (description)                           d              D
M^H_1          d-dimensional sphere, linearly embedded                     D − 1          user-defined
M^H_2          Affine space                                                3              5
M^H_3          Concentrated figure, mistakable with a 3-dimensional one    4              6
M^H_4          Nonlinear manifold                                          4              8
M^H_5          2-dimensional helix                                         2              3
M^H_6          Nonlinear manifold                                          6              36
M^H_7          Swiss-Roll                                                  2              3
M^H_8          Nonlinear (highly curved) manifold                          12             72
M^H_9          Affine space                                                D              user-defined
M^H_10         d-dimensional hypercube                                     D − 1          user-defined
M^H_11         Möbius band, 10-times twisted                               2              3
M^H_12         Isotropic multivariate Gaussian                             D              user-defined
M^H_13         1-dimensional helix curve                                   1              user-defined

To further test estimators' performance on nonlinearly embedded manifolds of high id, we propose to generate two datasets, referred to as M_N1 and M_N2 in the following (a tool to generate the datasets sampled from d-dimensional paraboloids, the M_beta dataset, the M_N1 dataset, and the M_N2 dataset is available at http://security.di.unimi.it/∼fox721/dataset_generator.m). Precisely, to generate M_N1 we uniformly draw N points in [0, 1]^18, we transform each point by means of tan(x_i cos(x_{18−i+1})), where i = 1,...,18, we obtain points in R^36 by appending each transformed x_i to arctan(x_{18−i+1} sin(x_i)), and we duplicate the coordinates of each point to finally generate points in R^72. The id of M_N1 is 18, and its points are drawn from a manifold nonlinearly embedded in R^72. To generate M_N2, containing N points in R^96, we applied the same procedure to vectors sampled in [0, 1]^24.
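The M_N1/M_N2 construction can be sketched as follows (a best-effort reading of the steps above; the function name is ours):

```python
import numpy as np

def generate_m_n(n=2500, d=18, rng=None):
    """M_N1 (d = 18) or M_N2 (d = 24): uniform samples on [0,1]^d nonlinearly embedded in R^(4d)."""
    rng = np.random.default_rng(rng)
    X = rng.random((n, d))
    Xr = X[:, ::-1]                              # pairs x_i with x_{d-i+1}
    A = np.tan(X * np.cos(Xr))                   # first d nonlinear coordinates
    B = np.arctan(Xr * np.sin(X))                # next d nonlinear coordinates -> points in R^(2d)
    Y = np.hstack([A, B])
    return np.hstack([Y, Y])                     # duplicate the coordinates -> R^(4d)

print(generate_m_n().shape)                      # (2500, 72), id = 18
print(generate_m_n(d=24).shape)                  # (2500, 96), id = 24
```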
4.1.2. Real Datasets. Real datasets employed in the literature generally concern problems in the fields of image analysis, time series prediction, and biochemistry. Among them, the most known and used are the ISOMAP face database [56], the MNIST database [119], the Isolet dataset [120], the D2 Santa Fe dataset [121], and the DSVC1 time series [21]. Recently, the Crystal Fingerprint space for the chemical compound silicon dioxide has also been proposed [22].

The ISOMAP face database consists in 698 gray-level images of size 64 × 64 depicting the face of a sculpture. This dataset has three degrees of freedom: two for the pose and one for the lighting direction (see Figure 2(a)).

The MNIST database consists in 70000 gray-level images of size 28 × 28 of hand-written digits (see Figure 2(b)). The real id of this database is not actually known, but some works [79, 122] propose similar estimates for the different digits; as an example, the proposed id values for the digit "1" are in the range {8,...,11}.

The Isolet dataset has been generated as follows: 150 subjects spoke the name of each letter of the alphabet twice, thus producing about 52 training examples from each speaker, for a total of 7797 samples. The speakers are grouped into 5 sets of 30 speakers each, referred to as isolet1, isolet2, isolet3, isolet4, and isolet5. The real id value characterizing this dataset is not actually known, but a study reported in [123] shows that the correct estimate could be in the range {16,...,22}.

The version D2 of the Santa Fe dataset is a time series of 50000 one-dimensional points having nine degrees of freedom (id = 9), generated by a simulation of particle motion. In order to estimate the attractor dimension of this time series, it is possible to employ the method of delays described in [124], which generates D-dimensional vectors by partitioning the original dataset into blocks containing D consecutive values; as an example, by choosing D = 50 a dataset containing 1000 points in R^50 is obtained.

DSVC1 is a time series composed of 5000 samples measured from a hardware realization of Chua's circuit [125]. Employing the method of delays with D = 20, a dataset containing 250 points in R^20 is obtained. The id characterizing this dataset is ∼2.26 [21].
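For completeness, a sketch of the method of delays used above for the Santa Fe and DSVC1 time series; a plain non-overlapping blocking, which is one common reading of the procedure in [124]:

```python
import numpy as np

def delay_embedding(series, D):
    """Method of delays: cut a scalar time series into consecutive, non-overlapping blocks of D values."""
    series = np.asarray(series, dtype=float)
    n_blocks = len(series) // D
    return series[: n_blocks * D].reshape(n_blocks, D)

# 50000 samples with D = 50 -> 1000 points in R^50; 5000 samples with D = 20 -> 250 points in R^20
print(delay_embedding(np.random.default_rng(0).standard_normal(50000), D=50).shape)   # (1000, 50)
print(delay_embedding(np.random.default_rng(1).standard_normal(5000), D=20).shape)    # (250, 20)
```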

Table 2: Synthetic datasets and real datasets suggested by the benchmark; N is the dataset cardinality, d is the id, and D is the embedding space dimension.

Dataset      Dataset name   N       d             D
Synthetic    M_1            2500    10            11
             M_2            2500    3             5
             M_3            2500    4             6
             M_4            2500    4             8
             M_5            2500    2             3
             M_6            2500    6             36
             M_7            2500    2             3
             M_9            2500    20            20
             M_10a          2500    10            11
             M_10b          2500    17            18
             M_10c          2500    24            25
             M_10d          2500    70            71
             M_11           2500    2             3
             M_12           2500    20            20
             M_13           2500    1             13
             M_N1           2500    18            72
             M_N2           2500    24            96
             M_beta         2500    10            40
             M_P3           2500    3             12
             M_P6           2500    6             21
             M_P9           2500    9             30
Real         M_DSVC1        250     2.26          20
             M_ISOMAP       698     3.00          4096
             M_SantaFe      1000    9.00          50
             M_MNIST1       70000   8.00–11.00    784
             M_SiO2         4738    12.00         1800
             M_Isolet       7797    16.00–22.00   617

Crystal Fingerprint spaces, or Crystal Finger spaces, have been recently proposed in crystallography [22] with the aim of representing crystalline structures; these spaces are built starting from the measured distances between atoms in the crystalline structure. The theoretical id of one Crystal Finger space consists in 3N_a + 3 crystal degrees of freedom, where N_a is the number of atoms in the crystalline unitary cell.

4.2. Experimental Procedures and Evaluation Measures. At the state of the art, two approaches have been mainly used to assess id estimators on datasets of known id.

The first one subsamples the test dataset to obtain T subsets of fixed cardinality and computes the percentage of correct estimations. To analyze estimators' behavior with respect to the cardinality of input datasets, this procedure may be repeated by using different cardinality values [29, 30, 79, 122], thus obtaining a distinct performance evaluation measure for each cardinality.

The second approach estimates the id on T permutations of the same dataset and averages the T id estimates to obtain the final one [27, 76, 107, 126]. This value is then compared with the real one to assess the id estimator.

To also test the estimator's robustness with respect to its parameter settings, in [27, 107, 126] the authors apply a further test, originally proposed by Levina and Bickel in [27]. Precisely, sample sets with different cardinalities are drawn from the standard Gaussian pdf in R^5 and, for each set, the estimator is applied varying the values of its parameters in fixed ranges; this allows us to analyze the behavior of the id estimate as a function of both the dataset's cardinality and the parameter settings.

Note that, since id estimators are usually tested on different datasets to evaluate their reliability when confronted by different dataset structures and configurations, in [126] an overall evaluation measure is proposed. This indicator, called Mean Percentage Error (MPE), summarizes all the obtained results in a unique value computed as MPE = (100/#M) Σ_M (|d̂_M − d_M|/d_M), where #M is the number of tested datasets, d̂_M is the id estimated on the dataset M, and d_M is the real id of M. To apply this technique to real datasets whose id belongs to the range {d_min,...,d_max}, the same authors propose to calculate the associated MPE term as min_{d ∈ {d_min,...,d_max}}(|d̂_M − d|/d̄_M), where d̄_M is the mean of the range.
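The MPE indicator and its range-valued variant are straightforward to compute; a minimal sketch (function names are ours):

```python
import numpy as np

def mpe(estimated_ids, true_ids):
    """Mean Percentage Error over a collection of datasets with known id."""
    est, true = np.asarray(estimated_ids, float), np.asarray(true_ids, float)
    return 100.0 * np.mean(np.abs(est - true) / true)

def mpe_term_for_range(estimated_id, d_min, d_max):
    """MPE term for a real dataset whose id is only known to lie in {d_min, ..., d_max}."""
    candidates = np.arange(d_min, d_max + 1)
    d_mean = 0.5 * (d_min + d_max)               # mean of the range
    return np.min(np.abs(estimated_id - candidates)) / d_mean

print(mpe([9.1, 2.9, 4.2], [10, 3, 4]))          # about 5.8
print(mpe_term_for_range(9.98, 8, 11))           # about 0.002
```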
4.3. Benchmark. In this section we propose an evaluation approach which can be used as a standard framework to assess estimator performance, and we apply it to relevant id estimators whose code is publicly available. In this benchmark, we suggest to use the following estimators (see Section 3): Hein, MLE, kNNG, MLSVD, BPCA, CD, MiNDKL, and DANCo. The source code of the mentioned methods is available at Hein: http://www.mL.uni-saarland.de/code.shtml, MLE: http://dept.stat.lsa.umich.edu/∼elevina/mledim.m, kNNG: http://web.eecs.umich.edu/∼hero/IntrinsicDim/, MLSVD: http://www.math.duke.edu/∼mauro/code.html#MSVD, BPCA: http://research.microsoft.com/en-us/um/cambridge/projects/infernet/blogs/bayesianpca.aspx, CD: http://cseweb.ucsd.edu/∼lvdmaaten/dr/download.php, MiNDKL and DANCo: http://www.mathworks.it/matlabcentral/fileexchange/40112-intrinsic-dimensionality-estimation-techniques. Note that these estimators cover all the groups described in Section 3, that is, Projective, Fractal, Nearest-Neighbors-based, and Graph-based estimators.

The benchmark is composed of the following steps:

(1) Test all the considered estimators on both the synthetic and real datasets described below. We highlight that the synthetic datasets whose id is a user-defined parameter should be created with sufficiently high id values (id ≥ 10).

(2) Comparative evaluation steps are as follows:

(a) compute the MPE indicator both for synthetic and real datasets,

(b) compute a ranking test with control methods; to this aim, we suggest the Friedman test with Bonferroni-Dunn post hoc analyses [127],

(c) perform the tests proposed in [27] to evaluate the robustness with respect to different cardinalities and parameter settings.

The 21 synthetic datasets used in the benchmark, referred to as M_* in the following, are listed in Table 2 with their relevant characteristics (N, d, and D). The first 15 datasets are generated with the tool proposed in [79]; they include 4 instances, M_10*, of dataset M^H_10, which are drawn from M^H_10 after its embedding in R^D by setting D = {11, 18, 25, 71}. Note that we did not include the dataset sampled from M^H_8 (see Table 1), since relevant and recent id estimators have similarly produced highly overestimated results when tested on it [107]. Indeed, dealing with highly curved manifolds is still a quite challenging problem in the field.

The last six synthetic datasets are M_N1, M_N2, M_beta, and 3 instances of dataset M_P*, which are sampled from paraboloids M_Pd whose id is, respectively, d = {3, 6, 9}. To perform multiple tests, 20 instances of each dataset have been generated, and the achieved results have been averaged.

Regarding the real datasets, we used the DSVC1 time series [21] (M_DSVC1, id ∼ 2.26), the ISOMAP face database [56] (M_ISOMAP, id = 3), the Santa Fe dataset [121] (M_SantaFe, id = 9), the MNIST database [119] (M_MNIST1, id ∈ {8,...,13}), the Isolet dataset [120] (M_Isolet, id ∈ {16,...,22}), and the Crystal Fingerprint space for the chemical compound silicon dioxide SiO2 with 3 atoms (this allows us to obtain the M_SiO2 dataset containing 4738 points embedded in R^1800 and characterized by an id equal to 12). To run multiple tests also on M_MNIST1, M_SiO2, and M_Isolet, for each of them we generated 5 instances by extracting random subsets containing 2500 points each, and we averaged the achieved results.

Table 3 summarizes the parameter values we employed for the different estimators.

Table 3: Parameter settings for the different estimators: k represents the number of neighbors, γ the edge weighting factor for kNNG, M the number of Least Square (LS) runs, N the number of resampling trials per iteration, α and π the parameters (shape and rate) of the Gamma prior distributions describing the hyperparameters and the observation noise model of BPCA, and μ contains the mean and the precision of the Gaussian prior distribution describing the bias inserted in the inference of BPCA.

Dataset     Method    Parameters
Synthetic   MLE       k1 = 6, k2 = 20
            DANCo     k = 10
            kNNG1     k1 = 6, k2 = 20, γ = 1, M = 1, N = 10
            kNNG2     k1 = 6, k2 = 20, γ = 1, M = 10, N = 1
            BPCA      iters = 2000, α = (2.0, 2.0), π = (2.0, 2.0), μ = (0.0, 0.01)
            Hein      none
            CD        none
            MLSVD     none
            MiNDKL    k = 10
Real        MLE       k1 = 3, k2 = 8
            DANCo     k = 5
            kNNG1     k1 = 3, k2 = 8, γ = 1, M = 1, N = 10
            kNNG2     k1 = 3, k2 = 8, γ = 1, M = 10, N = 1
            BPCA      iters = 2000, α = (2.0, 2.0), π = (2.0, 2.0), μ = (0.0, 0.01)
            Hein      none
            CD        none
            MLSVD     none
            MiNDKL    k = 5

Note that, to relax the dependency of the kNNG algorithm on the setting of its parameter k, we performed multiple runs with k1 ≤ k ≤ k2 and averaged the achieved results. Furthermore, we tested two versions of the algorithm (referred to as kNNG1 and kNNG2) obtained by varying the parameters M and N.

4.4. Experimental Results. Table 4 summarizes the results obtained by the compared estimators on the synthetic datasets, while Table 5 reports the results obtained on the real datasets.

Table 4: Results achieved on the synthetic datasets. The bottom row reports the MPE achieved by each algorithm; for each test dataset, the best approximation is highlighted in boldface.

Dataset 𝑑 MLE kNNG1 kNNG2 BPCA Hein CD MiNDKL DANCo MLSVD M 10.00 1 10.00 9.10 9.16 9.89 5.45 9.45 9.12 10.30 10.09 M 3.00 3.00 3.00 3.00 3.00 2 3.00 2.88 2.95 3.03 2.88 M 4.00 4.00 4.00 4.00 3 4.00 3.83 3.75 3.82 3.23 2.08 M 4.00 4.00 4 4.00 3.95 4.05 4.76 4.25 3.88 4.15 8.00 M 2.00 2.00 2.00 2.00 2.00 5 2.00 1.97 1.96 2.06 1.98 M 5.95 6 6.00 6.39 6.46 11.24 12.00 5.91 6.50 7.00 12.00 M 2.00 2.00 2.00 7 2.00 1.96 1.97 2.09 1.93 2.07 2.35 M 20.00 9 20.00 14.64 15.25 10.59 13.55 15.50 13.75 19.15 19.71 M 10.00 10𝑎 10.00 8.26 8.62 10.21 5.20 8.90 8.09 9.85 9.86 M 17.00 10𝑏 17.00 12.87 13.69 15.38 9.46 13.85 12.30 16.25 16.62 M 24.00 10𝑐 24.00 16.96 17.67 21.42 13.3 17.95 15.58 22.55 24.28 M 70.00 10𝑑 70.00 36.49 39.67 40.31 71.00 38.69 31.4 64.38 70.52 M 2.00 2.00 11 2.00 2.21 1.95 2.03 1.55 2.00 2.19 1.00 M 20.00 12 20.00 15.82 16.40 24.89 13.7 15.00 11.26 19.35 19.90 M 1.00 1.00 1.00 1.00 1.00 13 1.00 0.97 1.07 5.70 1.14 M 18.00 𝑁1 18.00 12.25 14.26 19.8 36.00 14.10 10.40 17.76 18.76 M 24.00 𝑁2 24.00 14.72 17.62 26.87 48.00 17.76 12.43 23.76 25.76 M 10.00 beta 10.00 6.36 6.45 14.77 19.7 4.00 3.05 7.00 7.00 M 3.00 3.00 𝑃3 3.00 2.89 2.93 3.12 7.00 2.00 2.43 1.00 M 6.00 𝑃6 6.00 4.96 4.98 5.82 7.00 2.66 3.58 5.04 1.00 M 8.00 𝑃9 9.00 6.35 6.89 8.04 10.95 2.85 4.55 7.00 1.00 MPE 17.29 14.50 16.79 62.62 19.92 25.96 5.55 3.70 26.34

Table 5: Results achieved on the real datasets by the compared approaches. The bottom row reports the MPE achieved by each algorithm; for each test dataset, the best approximation is highlighted in boldface (when the real id takes values in a range, we highlight the results that best approximate the mean value of the range).

Dataset id MLE kNNG1 kNNG2 BPCA Hein CD MiNDKL DANCo MLSVD M 2.26 DSCV1 2.26 2.03 1.77 1.86 6.00 3.00 1.92 2.50 1.75 MISOMAP 3.00 4.05 3.60 4.32 4.00 3.00 3.37 3.9 4.00 1.00 MSanta Fe 9.00 7.16 7.28 7.43 18.00 6.00 4.39 7.60 8.19 1.00 M MNIST1 8.00–11.00 10.29 10.37 9.58 11.00 8.00 6.96 11.00 9.98 1.00 M 12.60 SiO2 12.00 39.28 10.24 10.36 3.00 4.80 1.05 17.20 1.00 MIsolet 16.00–22.00 15.78 6.50 8.32 19.00 3.00 3.65 20.00 19.00 1.00 MPE 53.83 27.41 26.76 71.68 34.50 43.34 27.00 15.14 75.17

Looking at the number of correct estimations computed by each algorithm (highlighted in boldface), we have the following ranking: MLSVD is correct on 13 out of 21 synthetic datasets, followed by DANCo (correct on 10 out of 21 datasets), Hein (correct on 6 out of 21), MiNDKL (6 out of 21), BPCA (4 out of 21), and MLE (1 out of 21). It can be further noted that kNNG*, CD, MLE, and Hein obtain good estimates only for low id manifolds, while they produce underestimated values when processing datasets of high id.

By observing the MPE indicator, which accounts for the precision of the achieved estimates, we obtain a different ranking: DANCo, MiNDKL, kNNG1 and kNNG2, MLE, Hein, CD, and MLSVD. This difference is due to the fact that algorithms such as kNNG1 and kNNG2, MLE, and Hein most of the times produce results close to the correct value.

Regarding the real datasets, all the algorithms achieve a much worse MPE indicator, and again DANCo is the best performing method.

Furthermore, we compute the Friedman ranking test with the Bonferroni-Dunn post hoc analysis, as proposed in Section 4.3, to state the quality of the achieved results on both the synthetic and real datasets. Tables 6 and 7 summarize the ranking results.

Table 6: Friedman ranking results achieved on all the datasets. The null hypothesis that the algorithms perform comparably is rejected with p value < 0.00001.

Method     Ranking
DANCo      2.40
MiNDKL     3.46
Hein       4.67
kNNG2      5.11
MLSVD      5.17
kNNG1      5.17
MLE        5.70
CD         6.63
BPCA       6.68
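The Friedman ranking itself can be reproduced, for instance, with scipy; the sketch below uses randomly generated toy errors only to show the mechanics (they are not the paper's results), and the Bonferroni-Dunn post hoc correction is omitted.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# errors[i, j] = |d_hat - d| / d for dataset i and estimator j (toy values for illustration only)
rng = np.random.default_rng(0)
errors = np.abs(rng.normal(0.1, 0.05, size=(27, 9)))

stat, p = friedmanchisquare(*errors.T)              # one sample per estimator, paired by dataset
avg_ranks = rankdata(errors, axis=1).mean(axis=0)   # lower average rank = better, as in Table 6
print(p, avg_ranks.round(2))
```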


Figure 3: Behavior of (a) MLE, (b) DANCo, (c) kNNG1, (d) kNNG2, and (e) MiNDKL applied to points drawn from a 5-dimensional standard normal distribution; in this test N ∈ {200, 500, 1000, 2000} and k ∈ {5,...,100}.

Finally, we performed the tests proposed in [27] to evaluate the robustness of MiNDKL, MLE, DANCo, and kNNG* with respect to the settings of their k parameter. Precisely, these tests employ synthetic datasets subsampled from the standard Gaussian pdf in R^5 (id = 5). As proposed in Section 4.2, we repeated the tests for datasets with cardinalities N ∈ {200, 500, 1000, 2000}, varying the parameter k in the range {5,...,100}.

As shown in Figure 3, many of the tested techniques are strongly influenced by the parameter settings; therefore, studying the variability of the algorithms' behavior when changing their parameter settings is of utmost importance.
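A sketch of this robustness test, using our own minimal implementation of the Levina-Bickel MLE (with the (k − 1)-normalized form) as the example estimator rather than the released code:

```python
import numpy as np
from scipy.spatial import cKDTree

def mle_id(X, k):
    """Levina-Bickel MLE estimate averaged over all points (minimal sketch)."""
    dists, _ = cKDTree(X).query(X, k=k + 1)
    # local MLE: (k - 1) / sum_j log(T_k / T_j), j = 1..k-1
    local = (k - 1) / np.sum(np.log(dists[:, [k]] / dists[:, 1:k]), axis=1)
    return local.mean()

rng = np.random.default_rng(0)
for N in (200, 500, 1000, 2000):
    X = rng.standard_normal((N, 5))              # 5-dimensional standard Gaussian (id = 5)
    estimates = {k: round(mle_id(X, k), 2) for k in (5, 10, 20, 50, 100)}
    print(N, estimates)
```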

Table 7: Hypothesis testing of significance between techniques. Bonferroni-Dunn’s procedure rejects those hypotheses that have a 𝑝 value ≤ 0.0125.

          MiNDKL   Hein     kNNG1    kNNG2    MLE      CD       MLSVD    BPCA
DANCo     0.1567   0.0024   0.0003   0.0002   0.0002   0.0000   0.0000   0.0000
MiNDKL    ∗∗∗      0.0801   0.0303   0.0244   0.0055   0.0020   0.0000   0.0000
Hein      ∗∗∗      ∗∗∗      0.7528   0.6366   0.1564   0.1474   0.0034   0.0018
kNNG1     ∗∗∗      ∗∗∗      ∗∗∗      0.8557   0.3443   0.2301   0.0164   0.0071
kNNG2     ∗∗∗      ∗∗∗      ∗∗∗      ∗∗∗      0.9314   0.3894   0.1113   0.0282
MLE       ∗∗∗      ∗∗∗      ∗∗∗      ∗∗∗      ∗∗∗      0.3428   0.1876   0.0307
CD        ∗∗∗      ∗∗∗      ∗∗∗      ∗∗∗      ∗∗∗      ∗∗∗      0.7337   0.1961

5. Conclusions and Open Problems

This work presents the base theories of state-of-the-art id estimators and surveys the most relevant and recent among them, highlighting their strengths and their drawbacks. Unfortunately, performing an objective comparative evaluation among the surveyed methods is difficult because, to our knowledge, no benchmark framework exists in this research field; therefore, in Section 4 we propose an evaluation approach that employs both real and synthetic datasets and suggests experiments to evaluate the estimators' robustness with respect to their parameter settings. Note that the benchmark is designed to evaluate the performance achieved by id estimators when both low and high id data must be processed; this consideration is due to the fact that, to our knowledge, only a few methods [28, 76, 107, 126] have empirically investigated the problem of datasets drawn from manifolds nonlinearly embedded in higher dimensional spaces and characterized by a sufficiently high id (i.e., id ⩾ 10). However, due to the continuous technological advances, high id datasets are becoming more and more common, and the construction of a theoretically well-formed and robust id estimator able to deal with high id data and a limited amount of points remains one of the open research challenges in machine learning. Besides, id estimators should be developed by also considering datasets drawn through nonuniform smooth pdfs from manifolds M characterized by a nonconstant curvature; indeed, most of the algorithms are tested only on data drawn by means of uniform pdfs.

We further note that, though the aforementioned problems still need further investigation, most research in this field presently focuses on tasks that require id estimation as a first step. Examples are "multimanifold learning," whose aim is to process datasets drawn from multiple manifolds, each characterized by a different id, to identify the underlying structures (see [128] for an example); "nonlinear dimensionality reduction"; and "manifold reconstruction," whose aim is to find the mapping that projects the data (embedded in a higher D-dimensional space) onto the lower d̂-dimensional subspace, d̂ being the id estimated on the input dataset (as examples, see [9, 11, 129]).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] R. S. Bennett, "The intrinsic dimensionality of signal collections," IEEE Transactions on Information Theory, vol. 15, no. 5, pp. 517–525, 1969.
[2] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, UK, 1995.
[3] E. Chávez, G. Navarro, R. Baeza-Yates, and J. L. Marroquín, "Searching in metric spaces," ACM Computing Surveys, vol. 33, no. 3, pp. 273–321, 2001.
[4] V. Pestov, "An axiomatic approach to intrinsic dimension of a dataset," Neural Networks, vol. 21, no. 2-3, pp. 204–213, 2008.
[5] V. Pestov, "Intrinsic dimensionality," SIGSPATIAL Special, vol. 2, no. 2, pp. 8–11, 2010.
[6] M. Katetov and P. Simon, "Origins of dimension theory," in Handbook of the History of General Topology, vol. 1, 1997.
[7] B. Kégl, "Intrinsic dimension estimation using packing numbers," in Proceedings of the Neural Information Processing Systems (NIPS '02), S. Becker, S. Thrun, and K. Obermayer, Eds., pp. 681–688, MIT Press, 2002.
[8] Z. Zhang and H. Zha, "Adaptive manifold learning," in Advances in Neural Information Processing Systems, vol. 17, 2005.
[9] M. Gashler and T. Martinez, "Tangent space guided intelligent neighbor finding," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '11), pp. 2617–2624, August 2011.
[10] M. Gashler and T. Martinez, "Robust manifold learning with CycleCut," Connection Science, vol. 24, no. 1, pp. 57–69, 2012.
[11] P. Zhang, H. Qiao, and B. Zhang, "An improved local tangent space alignment method for manifold learning," Pattern Recognition Letters, vol. 32, no. 2, pp. 181–189, 2011.
[12] N. Verma, "Distance preserving embeddings for general n-dimensional manifolds," Journal of Machine Learning Research, vol. 14, pp. 2415–2448, 2013.
[13] R. E. Bellman, Adaptive Control Processes: A Guided Tour, Princeton University Press, Princeton, NJ, USA, 1961.
[14] M. Kirby, Geometric Data Analysis: An Empirical Approach to Dimensionality Reduction and the Study of Patterns, John Wiley & Sons, 2001.
[15] I. T. Jolliffe, Principal Component Analysis, Springer Series in Statistics, Springer, New York, NY, USA, 1986.
[16] V. N. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998.
[17] J. H. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning—Data Mining, Inference and Prediction, Springer, Berlin, Germany, 2009.
[18] P. Campadelli, E. Casiraghi, C. Ceruti, G. Lombardi, and A. Rozza, "Local intrinsic dimensionality based features for clustering," in Image Analysis and Processing—ICIAP 2013, A. Petrosino, Ed., vol. 8156 of Lecture Notes in Computer Science, pp. 41–50, Springer, Berlin, Germany, 2013.

[19] P. Grassberger and I. Procaccia, "Measuring the strangeness of strange attractors," Physica D. Nonlinear Phenomena, vol. 9, no. 1-2, pp. 189–208, 1983.
[20] H. Lähdesmäki, O. Yli-Harja, W. Zhang, and I. Shmulevich, "Intrinsic dimensionality in gene expression analysis," in Proceedings of the International Workshop on Genomic Signal Processing and Statistics (GENSIPS '05), September 2005.
[21] F. Camastra and M. Filippone, "A comparative evaluation of nonlinear dynamics methods for time series prediction," Neural Computing and Applications, vol. 18, no. 8, pp. 1021–1029, 2009.
[22] M. Valle and A. R. Oganov, "Crystal fingerprint space—a novel paradigm for studying crystal-structure sets," Acta Crystallographica Section A, vol. 66, no. 5, pp. 507–517, 2010.
[23] K. M. Carter, R. Raich, and A. O. Hero, "On local intrinsic dimension estimation and its applications," IEEE Transactions on Signal Processing, vol. 58, no. 2, pp. 650–663, 2010.
[24] J. Lapuyade-Lahorgue and A. Mohammad-Djafari, "Nearest neighbors and correlation dimension for dimensionality estimation. Application to factor analysis of real biological time series data," in Proceedings of the European Symposium on Artificial Neural Networks (ESANN '11), pp. 363–368, Bruges, Belgium, April 2014.
[25] R. Heylen and P. Scheunders, "Hyperspectral intrinsic dimensionality estimation with nearest-neighbor distance ratios," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 6, no. 2, pp. 570–579, 2013.
[26] K. W. Pettis, T. A. Bailey, A. K. Jain, and R. C. Dubes, "An intrinsic dimensionality estimator from near-neighbor information," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 1, pp. 25–37, 1979.
[27] E. Levina and P. J. Bickel, "Maximum likelihood estimation of intrinsic dimension," in Proceedings of the NIPS, vol. 1, pp. 777–784, 2004.
[28] F. Camastra and A. Vinciarelli, "Estimating the intrinsic dimension of data with a fractal-based method," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 10, pp. 1404–1407, 2002.
[29] J. A. Costa and A. O. Hero, "Geodesic entropic graphs for dimension and entropy estimation in manifold learning," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2210–2221, 2004.
[30] J. A. Costa and A. O. Hero, "Learning intrinsic dimension and entropy of high-dimensional shape spaces," in Proceedings of the European Signal Processing Conference (EUSIPCO '04), pp. 231–252, September 2004.
[31] K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When is 'nearest neighbor' meaningful?" in Proceedings of the 7th International Conference on Database Theory (ICDT '99), pp. 217–235, Springer, London, UK, 1999.
[32] K. M. Carter, R. Raich, W. G. Finn, and A. O. Hero III, "FINE: fisher information nonparametric embedding," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 11, pp. 2093–2098, 2009.
[33] A. M. Farahmand, C. Szepesvári, and J.-Y. Audibert, "Manifold-adaptive dimension estimation," in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 265–272, June 2007.
[34] J. A. Scheinkman and B. LeBaron, "Nonlinear dynamics and stock returns," The Journal of Business, vol. 62, no. 3, pp. 311–337, 1989.
[35] D. R. Chialvo, R. F. Gilmour Jr., and J. Jalife, "Low dimensional chaos in cardiac tissue," Nature, vol. 343, no. 6259, pp. 653–657, 1990.
[36] A. Mekler, "Calculation of EEG correlation dimension: large massifs of experimental data," Computer Methods and Programs in Biomedicine, vol. 92, no. 1, pp. 154–160, 2008.
[37] G. N. Derry and P. S. Derry, "Age dependence of the menstrual cycle correlation dimension," Open Journal of Biophysics, vol. 2, no. 2, pp. 40–45, 2012.
[38] V. Isham, Statistical Aspects of Chaos: A Review, Chapman and Hall, London, UK, 1993.
[39] S. Haykin and X. B. Li, "Detection of signals in chaos," Proceedings of the IEEE, vol. 83, no. 1, pp. 95–122, 1995.
[40] P. Somervuo, "Speech dimensionality analysis on hypercubical self-organizing maps," Neural Processing Letters, vol. 17, no. 2, pp. 125–136, 2003.
[41] B. Hu, T. Rakthanmanon, Y. Hao, S. Evans, S. Lonardi, and E. Keogh, "Towards discovering the intrinsic cardinality and dimensionality of time series using MDL," in Algorithmic Probability and Friends. Bayesian Prediction and Artificial Intelligence, vol. 7070 of Lecture Notes in Computer Science, pp. 184–197, Springer, Berlin, Germany, 2013.
[42] D. C. Laughlin, "The intrinsic dimensionality of plant traits and its relevance to community assembly," Journal of Ecology, vol. 102, no. 1, pp. 186–193, 2014.
[43] F. Camastra, "Data dimensionality estimation methods: a survey," Pattern Recognition, vol. 36, no. 12, pp. 2945–2954, 2003.
[44] A. K. Romney, R. N. Shepard, and S. B. Nerlove, Multidimensional Scaling, Volume I: Theory, Seminar Press, 1972.
[45] A. K. Romney, R. N. Shepard, and S. B. Nerlove, Multidimensional Scaling, Volume II: Applications, Seminar Press, 1972.
[46] T. Lin and H. Zha, "Riemannian manifold learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 5, pp. 796–809, 2008.
[47] R. N. Shepard, "The analysis of proximities: multidimensional scaling with an unknown distance function. Part I," Psychometrika, vol. 27, pp. 125–140, 1962.
[48] R. N. Shepard, "The analysis of proximities: multidimensional scaling with an unknown distance function. Part II," Psychometrika, vol. 27, pp. 219–246, 1962.
[49] J. B. Kruskal, "Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis," Psychometrika, vol. 29, pp. 1–27, 1964.
[50] J. B. Kruskal and J. D. Carrol, Geometrical Models and Badness-of-Fit Functions, vol. 2, Academic Press, 1969.
[51] R. N. Shepard and J. D. Carroll, Parametric Representation of Nonlinear Data Structures, Academic Press, New York, NY, USA, 1969.
[52] J. B. Kruskal, Linear Transformation of Multivariate Data to Reveal Clustering, vol. 1, Academic Press, New York, NY, USA, 1972.
[53] C. K. Chen and H. C. Andrews, "Nonlinear intrinsic dimensionality computations," IEEE Transactions on Computers, vol. C-23, no. 2, pp. 178–184, 1974.
[54] J. W. J. Sammon, "A nonlinear mapping for data structure analysis," IEEE Transactions on Computers, vol. 18, pp. 401–409, 1969.

[55] P. Demartines and J. Hérault, "Curvilinear component analysis: a self-organizing neural network for nonlinear mapping of data sets," IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 148–154, 1997.
[56] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
[57] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[58] J. A. Lee and M. Verleysen, Nonlinear Dimensionality Reduction, Springer, New York, NY, USA, 2007.
[59] R. Karbauskaite, G. Dzemyda, and E. Mazetis, "Geodesic distances in the maximum likelihood estimator of intrinsic dimensionality," Nonlinear Analysis: Modelling and Control, vol. 16, no. 4, pp. 387–402, 2011.
[60] M. Polito and P. Perona, "Grouping and dimensionality reduction by locally linear embedding," Advances in Neural Information Processing Systems, vol. 14, pp. 1255–1262, 2001.
[61] B. Schölkopf, A. Smola, and K.-R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998.
[62] K. Fukunaga and D. R. Olsen, "An algorithm for finding intrinsic dimensionality of data," IEEE Transactions on Computers, vol. 20, no. 2, pp. 176–183, 1971.
[63] P. J. Verveer and R. P. W. Duin, "An evaluation of intrinsic dimensionality estimators," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 1, pp. 81–86, 1995.
[64] J. Bruske and G. Sommer, "Intrinsic dimensionality estimation with optimally topology preserving maps," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 5, pp. 572–575, 1998.
[65] T. Martinetz and K. Schulten, "Topology representing networks," Neural Networks, vol. 7, no. 3, pp. 507–522, 1994.
[66] M. E. Tipping and C. M. Bishop, "Probabilistic principal component analysis," Journal of the Royal Statistical Society. Series B. Statistical Methodology, vol. 61, no. 3, pp. 611–622, 1999.
[67] R. Everson and S. Roberts, "Inferring the eigenvalues of covariance matrices from limited, noisy data," IEEE Transactions on Signal Processing, vol. 48, no. 7, pp. 2083–2091, 2000.
[68] C. M. Bishop, "Bayesian PCA," in Proceedings of the 12th Annual Conference on Neural Information Processing Systems (NIPS '98), pp. 382–388, December 1998.
[69] J. J. Rajan and P. J. W. Rayner, "Model order selection for the singular value decomposition and the discrete Karhunen-Loeve transform using a Bayesian approach," IEE Proceedings—Vision, Image and Signal Processing, vol. 144, no. 2, pp. 116–123, 1997.
[70] T. P. Minka, "Automatic choice of dimensionality for PCA," Tech. Rep. 514, MIT, 2000.
[71] C. Bouveyron, G. Celeux, and S. Girard, "Intrinsic dimension estimation by maximum likelihood in isotropic probabilistic PCA," Pattern Recognition Letters, vol. 32, no. 14, pp. 1706–1713, 2011.
[72] J. Li and D. Tao, "Simple exponential family PCA," in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS '10), pp. 453–460, Sardinia, Italy, May 2010.
[73] Y. Guan and J. G. Dy, "Sparse probabilistic principal component analysis," Journal of Machine Learning Research, vol. 5, pp. 185–192, 2009.
[74] H. Zou, T. Hastie, and R. Tibshirani, "Sparse principal component analysis," Journal of Computational and Graphical Statistics, vol. 15, no. 2, pp. 265–286, 2006.
[75] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, NY, USA, 2nd edition, 2006.
[76] C. Ceruti, S. Bassis, A. Rozza, G. Lombardi, E. Casiraghi, and P. Campadelli, "DANCo: an intrinsic dimensionality estimator exploiting angle and norm concentration," Pattern Recognition, vol. 47, no. 8, pp. 2569–2581, 2014.
[77] A. V. Little, M. Maggioni, and L. Rosasco, "Multiscale geometric methods for data sets I: multiscale SVD, noise and curvature," MIT-CSAIL-TR 2012-029, 2012.
[78] F. G. Kaslovsky and D. N. Meyer, "Optimal tangent plane recovery from noisy manifold samples," http://xxx.tau.ac.il/abs/1111.4601v2.
[79] M. Hein and J. Y. Audibert, "Intrinsic dimensionality estimation of submanifolds in euclidean space," in Proceedings of the International Conference on Machine Learning (ICML '05), pp. 289–296, 2005.
[80] G. Haro, G. Randall, and G. Sapiro, "Translated poisson mixture model for stratification learning," International Journal of Computer Vision, vol. 80, no. 3, pp. 358–374, 2008.
[81] M. Chen, J. Silva, J. Paisley, C. Wang, D. Dunson, and L. Carin, "Compressive sensing on manifolds using a nonparametric mixture of factor analyzers: algorithm and performance bounds," IEEE Transactions on Signal Processing, vol. 58, no. 12, pp. 6140–6155, 2010.
[82] L. Brouwer, Collected Works, Volume I, Philosophy and Foundations of Mathematics and II, Geometry, Analysis, Topology and Mechanics, North-Holland/American Elsevier, 1976.
[83] I. M. James, History of Topology, Mathematics, Elsevier, 1999.
[84] G. Medioni and P. Mordohai, "The tensor voting framework," in Emerging Topics in Computer Vision, pp. 191–255, Prentice Hall, 2004.
[85] G. Lombardi, E. Casiraghi, and P. Campadelli, "Curvature estimation and curve inference with tensor voting: a new approach," in Proceedings of the 10th International Conference on Advanced Concepts for Intelligent Vision Systems (ACIVS '08), vol. 5259, pp. 613–624, 2008.
[86] M. Wertheimer, "Untersuchungen zur Lehre von der Gestalt II," in Psychologische Forschung, vol. 4, pp. 301–350, 1923, Translation: A Source Book of Gestalt Psychology.
[87] P. Mordohai and G. Medioni, "Dimensionality estimation, manifold learning and function approximation using tensor voting," Journal of Machine Learning Research, vol. 11, pp. 411–450, 2010.
[88] J. C. Robinson, Dimensions, Embeddings, and Attractors, Cambridge Tracts in Mathematics, Cambridge University Press, 2010.
[89] C.-G. Li, J. Guo, and B. Xiao, "Intrinsic dimensionality estimation within neighborhood convex hull," International Journal of Pattern Recognition and Artificial Intelligence, vol. 23, no. 1, pp. 31–44, 2009.
[90] K. Falconer, Fractal Geometry—Mathematical Foundations and Applications, John Wiley & Sons, 2nd edition, 2003.
[91] N. Tatti, T. Mielikäinen, A. Gionis, and H. Mannila, "What is the dimension of your binary data?" in Proceedings of the 6th International Conference on Data Mining (ICDM '06), pp. 603–612, December 2006.
[92] J.-P. Eckmann and D. Ruelle, "Fundamental limitations for estimating dimensions and Lyapunov exponents in dynamical systems," Physica D. Nonlinear Phenomena, vol. 56, no. 2-3, pp. 185–187, 1992.

[93] F. Takens, "On the numerical determination of the dimension of an attractor," in Dynamical Systems and Bifurcations, B. J. Braaksma, H. W. Broer, and F. Takens, Eds., vol. 1125 of Lecture Notes in Mathematics, pp. 99–106, Springer, Berlin, Germany, 1985.
[94] Y. Ashkenazy, "The use of generalized information dimension in measuring fractal dimension of time series," Physica A: Statistical Mechanics and Its Applications, vol. 271, no. 3-4, pp. 427–447, 1999.
[95] C. Tricot Jr., "Two definitions of fractional dimension," Mathematical Proceedings of the Cambridge Philosophical Society, vol. 91, no. 1, pp. 57–74, 1982.
[96] M. R. Brito, A. J. Quiroz, and J. E. Yukich, "Intrinsic dimension identification via graph-theoretic methods," Journal of Multivariate Analysis, vol. 116, pp. 263–277, 2013.
[97] M. Raginsky and S. Lazebnik, "Estimation of intrinsic dimensionality using high-rate vector quantization," in Proceedings of the NIPS, pp. 1105–1112, 2005.
[98] P. L. Zador, "Asymptotic quantization error of continuous signals and the quantization dimension," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 139–149, 1982.
[99] K. Kumaraswamy, V. Megalooikonomou, and C. Faloutsos, "Fractal dimension and vector quantization," Information Processing Letters, vol. 91, no. 3, pp. 107–113, 2004.
[100] G. V. Trunk, "Statistical estimation of the intrinsic dimensionality of a noisy signal collection," IEEE Transactions on Computers, vol. 25, no. 2, pp. 165–171, 1976.
[101] M. Fan, H. Qiao, and B. Zhang, "Intrinsic dimension estimation of manifolds by incising balls," Pattern Recognition, vol. 42, no. 5, pp. 780–787, 2009.
[102] D. MacKay and Z. Ghahramani, "Comments on maximum likelihood estimation of intrinsic dimension by E. Levina and P. Bickel," 2005, http://www.inference.phy.cam.ac.uk/mackay/dimension/.
[103] M. D. Penrose and J. E. Yukich, "Limit theory for point processes in manifolds," The Annals of Applied Probability, vol. 23, no. 6, pp. 2161–2211, 2013.
[104] P. J. Bickel and D. Yan, "Sparsity and the possibility of inference," Sankhya: The Indian Journal of Statistics, vol. 70, no. 1, 23 pages, 2008.
[105] M. Das Gupta and T. S. Huang, "Regularized maximum likelihood for intrinsic dimension estimation," in Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI '10), P. Grünwald and P. Spirtes, Eds., pp. 220–227, AUAI Press, Catalina Island, Calif, USA, July 2010.
[106] R. Karbauskaite and G. Dzemyda, "Investigation of the maximum likelihood estimator of intrinsic dimensionality," in Proceedings of the 10th International Conference on Computer Data Analysis and Modeling, vol. 2, pp. 110–113, 2013.
[107] A. Rozza, G. Lombardi, C. Ceruti, E. Casiraghi, and P. Campadelli, "Novel high intrinsic dimensionality estimators," Machine Learning, vol. 89, no. 1-2, pp. 37–65, 2012.
[108] Q. Wang, S. R. Kulkarni, and S. Verdú, "A nearest-neighbor approach to estimating divergence between continuous random vectors," in Proceedings of the IEEE International Symposium on Information Theory (ISIT '06), pp. 242–246, July 2006.
[109] K. V. Mardia, Statistics of Directional Data, Academic Press, 1972.
[110] A. J. Quiroz, "Graph-theoretical methods," in Encyclopedia of Statistical Sciences, vol. 5, Wiley and Sons, New York, NY, USA, 2006.
[111] A. O. Hero III, B. Ma, O. J. J. Michel, and J. Gorman, "Applications of entropic spanning graphs," IEEE Signal Processing Magazine, vol. 19, no. 5, pp. 85–95, 2002.
[112] J. A. Costa, A. Girotra, and A. O. Hero III, "Estimating local intrinsic dimension with k-nearest neighbor graphs," in Proceedings of the IEEE/SP 13th Workshop on Statistical Signal Processing, pp. 417–421, July 2005.
[113] J. H. Friedman and L. C. Rafsky, "Graph-theoretic measures of multivariate association and prediction," Annals of Statistics, vol. 11, no. 2, pp. 377–391, 1983.
[114] M. D. Penrose and J. E. Yukich, "Central limit theorems for some graphs in computational geometry," The Annals of Applied Probability, vol. 11, no. 4, pp. 1005–1041, 2001.
[115] M. R. Brito, A. J. Quiroz, and J. E. Yukich, "Graph-theoretic procedures for dimension identification," Journal of Multivariate Analysis, vol. 81, no. 1, pp. 67–84, 2002.
[116] J. M. Steele, L. A. Shepp, and W. F. Eddy, "On the number of leaves of a Euclidean minimal spanning tree," Journal of Applied Probability, vol. 24, no. 4, pp. 809–826, 1987.
[117] M. F. Schilling, "Mutual and shared neighbor probabilities: finite- and infinite-dimensional results," Advances in Applied Probability, vol. 18, no. 2, pp. 388–405, 1986.
[118] K. Sricharan, R. Raich, and A. O. Hero III, "Optimized intrinsic dimension estimator using nearest neighbor graphs," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '10), pp. 5418–5421, Dallas, Tex, USA, March 2010.
[119] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[120] A. Frank and A. Asuncion, UCI Machine Learning Repository, UCI, 2010.
[121] F. Pineda and J. Sommerer, "Estimating generalized dimensions and choosing time delays: a fast algorithm," in Time Series Prediction: Forecasting the Future and Understanding the Past, pp. 367–385, 1994.
[122] J. A. Costa and A. O. Hero, Determining Intrinsic Dimension and Entropy of High-Dimensional Shape Spaces, Birkhäuser, Boston, Mass, USA, 2006.
[123] I. Kivimäki, K. Lagus, I. Nieminen, J. Väyrynen, and T. Honkela, "Using correlation dimension for analysing text data," in Artificial Neural Networks—ICANN 2010: Proceedings of the 20th International Conference, Thessaloniki, Greece, September 15–18, 2010, Part I, vol. 6352 of Lecture Notes in Computer Science, pp. 368–373, Springer, Berlin, Germany, 2010.
[124] E. Ott, Chaos in Dynamical Systems, Cambridge University Press, Cambridge, UK, 1993.
[125] L. O. Chua, M. Komuro, and T. Matsumoto, "The double scroll," IEEE Transactions on Circuits and Systems, vol. 32, no. 8, pp. 797–818, 1985.
[126] G. Lombardi, A. Rozza, C. Ceruti, E. Casiraghi, and P. Campadelli, "Minimum neighbor distance estimators of intrinsic dimension," in Machine Learning and Knowledge Discovery in Databases: Proceedings of the European Conference, ECML PKDD 2011, Athens, Greece, September 5–9, 2011, Part II, vol. 6912 of Lecture Notes in Computer Science, pp. 374–389, Springer, Berlin, Germany, 2011.

[127] J. Jaccard, M. A. Becker, and G. Wood, "Pairwise multiple comparison procedures: a review," Psychological Bulletin, vol. 96, no. 3, pp. 589–596, 1984.
[128] D. Gong, X. Zhao, and G. Medioni, "Robust multiple manifolds structure learning," in Proceedings of the 29th International Conference on Machine Learning (ICML '12), pp. 321–328, July 2012.
[129] J. Wei, H. Peng, Y.-S. Lin, Z.-M. Huang, and J.-B. Wang, "Adaptive neighborhood selection for manifold learning," in Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC '08), vol. 1, pp. 380–384, IEEE, Kunming, China, July 2008.
