Data Mining Mauro Maggioni
Total Page:16
File Type:pdf, Size:1020Kb
WHATIS... ? Data Mining Mauro Maggioni Data collected from a variety of sources has been Typically, xi is noisy (e.g., noisy pixel values), and accumulating rapidly. Many fields of science have so is f (xi) (e.g., mislabeled samples in the training gone from being data-starved to being data-rich set). and needing to learn how to cope with large data Of course mathematicians have long concerned sets. The rising tide of data also directly affects themselves with high-dimensional problems. One our daily lives, in which computers surrounding us example is studying solutions of PDEs as func- use data-crunching algorithms to help us in tasks tions in infinite-dimensional function spaces and ranging from finding the quickest route to our performing efficient computations by projecting destination considering current traffic conditions the problem onto low-dimensional subspaces (via to automatically tagging our faces in pictures; from discretizations, finite elements, or operator com- updating in near real time the prices of sale items pression) so that the reduced problem may be to suggesting the next movie we might want to numerically solved on a computer. In the case watch. of solutions of a PDE, the model for the data The general aim of data mining is to find useful is specified: a lot of information about the PDE and interpretable patterns in data. The term can is known, and that information is exploited to encompass many diverse methods and therefore predict the properties of the data and to construct means different things to different people. Here we low-dimensional projections. For the digital data discuss some aspects of data mining potentially of discussed above, however, typically we have little interest to a broad audience of mathematicians. information and poor models. We may start with Assume a sample data point xi (e.g., a picture) crude models, measure their fitness to the data and may be cast in the form of a long vector of numbers predictive ability, and, those being not satisfactory, (e.g., the pixel intensities in an image): we represent improve the models. This is one of the key pro- it as a point in RD. Two types of related goals cesses in statistical modeling and data mining. It exist. One is to detect patterns in this set of points, is not unlike what an applied mathematician does and the other is to predict a function on the data: when modeling a complex physical system: he may start with simplifying assumptions to construct a given a training set (xi; f (xi))i, we want to predict f at points outside the training set. In the case of “tractable” model, derive consequences of such a text documents or webpages, we might want to model (e.g., properties of the solutions) analytically automatically label each document as belonging and/or with simulations, and compare the results to an area of research; in the case of pictures, we to the properties exhibited by the real-world phys- might want to recognize faces; when suggesting the ical system. New measurements and real-world next movie to watch given past ratings of movies simulations may be performed, and the fitness by a viewer, f consists of ratings of unseen movies. of the model reassessed and improved as needed for the next round of validation. While physics Mauro Maggioni is assistant professor of mathematics and drives the modeling in applied mathematics, a computer science at Duke University. His email address is new type of intuition, built on experiences in the [email protected]. world of high-dimensional data sets rather than DOI: http://dx.doi.org/10.1090/noti831 in the world of physics, drives the intuition of the 532 Notices of the AMS Volume 59, Number 4 mathematician set to analyze high-dimensional data sets, where “tractable” models are geomet- 3 ric or statistical models with a small number of ψ φ parameters. 2 One of the reasons for focusing on reduc- ing the dimension is to enable computations, φ 1 but a fundamental motivation is the so-called 0.04 curse of dimensionality. One of its manifestations 0.02 arises in the approximation of a 1-Lipschitz func- 0 tion on the unit cube, f : [0; 1]D ! R satisfying jf (x) − f (y)j ≤ jjx − yjj for x; y 2 [0; 1]D. To 0 −0.05 −1 achieve uniform error , given samples (x ; f (x )), i i −0.02 in general one needs at least one sample in each cube of side , for a total of −D samples, which is −2 −0.04 0 too large even for, say, = 10−1 and D = 100 (a −0.03 −0.02 −0.01 rather small dimension in applications). A common 0 0.01 −3 assumption is that either the samples xi lie on 0.02 0.05 D a low-dimensional subset of [0; 1] and/or f is 0.15 not simply Lipschitz but has a smoothness that Anthropology is suitably large, depending on D (see references Astronomy in [3]). Taking the former route, one assumes 0.1 Social Sciences Earth Sciences that the data lies on a low-dimensional subset in Biology the high-dimensional ambient space, such as a Mathematics low-dimensional hyperplane or unions thereof, or 0.05 Medicine Physics low-dimensional manifolds or rougher sets. Re- 6 search problems require ideas from different areas 0 of mathematics, including geometry, geometric measure theory, topology, and graph theory, with their tools for studying manifolds or rougher sets; −0.05 probability and geometric functional analysis for studying random samples and measures in high dimensions; harmonic analysis and approximation −0.1 −0.5 theory, with their ideas of multiscale analysis and 0 function approximation; and numerical analysis, 0.5 −0.15 −0.1 −0.05 0 0.05 because we need efficient algorithms to analyze 5 4 real-world data. As a concrete example, consider the following Figure 1. Top: Diffusion map embedding of the n D construction. Given n points fxigi=1 ⊂ R and > set of configurations of a small biomolecule 2 = − jjxi −xj jj = P (alanine dipeptide) from its 36-dimensional state 0, construct Wij exp( 2 ), Dii j Wij , − 1 − 1 space. The color is one of the dihedral angles = − 2 2 and the Laplacian matrix L I D WD on '; of the molecule, known to be essential to the weighted graph G with vertices fxig and edges the dynamics [4]. This is a physical system weighted by W . When xi is sampled from a man- where (approximate) equations of motion are ifold M and n tends to infinity, L approximates known, but their structure is too complicated (in a suitable sense) the Laplace-Beltrami operator and the state space too high-dimensional to be on M [2], which is a completely intrinsic object. amenable to analysis. Bottom: Diffusion map of a The random walk on G, with transition matrix data set consisting of 1161 Science News P = D−1W , approximates Brownian motion on M. articles, each modeled by a 1153-dimensional Consider, for a time t > 0, the so-called diffusion vector of word frequencies, embedded in a t t distance dt (x, y) := jjP (x, ·) − P (y; ·)jjL2(G) (see low-dimensional space with diffusion maps, as [2]). This distance is particularly useful for cap- described in the text and in [2]. turing clusters/groupings in the data, which are regions of fast diffusion connected by bottlenecks t q t q t that slow diffusion. Let 1 = λ0 ≥ λ1 ≥ · · · be space, where Φd(x) := ( λ1'1(x), : : : ; λd'd(x)), the eigenvalues of P and 'i be the correspond- for some t > 0 [2]. One can show that the Euclidean t t ing eigenvectors ('0, when G is a web graph, is distance between Φd(x) and Φd(y) approximates related to Google’s pagerank). Consider a diffu- dt (x, y), the diffusion distance at time scale t t sion map Φd that embeds the graph in Euclidean between x and y on the graph G. April 2012 Notices of the AMS 533 David In Figure 1 we apply this technique to two completely different data sets. The first one is a set of configurations of a small peptide, obtained by a 12×3 Blackwell molecular dynamics simulation: a point xi 2 R contains the coordinates in R3 of the 12 atoms Memorial Conference in the alanine dipeptide molecule (represented as an inset in Figure 1). The forces between the April 19–20, 2012 atoms in the molecule constrain the trajectories Howard University to lie close to low-dimensional sets in the 36- Washington, DC dimensional state space. In Figure 1 we apply the construction above1 and represent the diffusion Join us on April map embedding of the configurations collected [4]. 19–20, 2012, at The second one is a set of text documents (articles from Science News), each represented as a R1153 Howard University in vector whose kth coordinate is the frequency of the Washington, DC for kth word in a 1153-word dictionary. The diffusion a special conference embedding in low dimensions reveals even lower- honoring David Blackwell dimensional geometric structures, which turn out (1919–2010), former to be useful for understanding the dynamics of President of the Institute the peptide considered in the first data set and for of Mathematical Statistics automatically clustering documents by topic in the and the fi rst African- case of the second data set. Ideas from probability American to be inducted (random samples), harmonic analysis (Laplacian), into the National and geometry (manifolds) come together in these types of constructions.