<<

WHATIS... ? Mining Mauro Maggioni

Data collected from a variety of sources has been Typically, xi is noisy (e.g., noisy pixel values), and accumulating rapidly. Many fields of science have so is f (xi) (e.g., mislabeled samples in the training gone from being data-starved to being data-rich set). and needing to learn how to cope with large data Of course mathematicians have long concerned sets. The rising tide of data also directly affects themselves with high-dimensional problems. One our daily lives, in which surrounding us example is studying solutions of PDEs as func- use data-crunching to help us in tasks tions in infinite-dimensional function spaces and ranging from finding the quickest route to our performing efficient computations by projecting destination considering current traffic conditions the problem onto low-dimensional subspaces (via to automatically tagging our faces in pictures; from discretizations, finite elements, or operator com- updating in near real time the prices of sale items pression) so that the reduced problem may be to suggesting the next movie we might want to numerically solved on a . In the case watch. of solutions of a PDE, the model for the data The general aim of is to find useful is specified: a lot of about the PDE and interpretable patterns in data. The term can is known, and that information is exploited to encompass many diverse methods and therefore predict the properties of the data and to construct means different things to different people. Here we low-dimensional projections. For the discuss some aspects of data mining potentially of discussed above, however, typically we have little interest to a broad audience of mathematicians. information and poor models. We may start with Assume a sample data point xi (e.g., a picture) crude models, their fitness to the data and may be cast in the form of a long vector of predictive ability, and, those being not satisfactory, (e.g., the pixel intensities in an image): we represent improve the models. This is one of the key pro- it as a point in RD. Two types of related goals cesses in statistical modeling and data mining. It exist. One is to detect patterns in this set of points, is not unlike what an applied mathematician does and the other is to predict a function on the data: when modeling a complex physical system: he may start with simplifying assumptions to construct a given a training set (xi, f (xi))i, we want to predict f at points outside the training set. In the case of “tractable” model, derive consequences of such a text documents or webpages, we might want to model (e.g., properties of the solutions) analytically automatically label each document as belonging and/or with simulations, and compare the results to an area of research; in the case of pictures, we to the properties exhibited by the real-world phys- might want to recognize faces; when suggesting the ical system. New measurements and real-world next movie to watch given past ratings of movies simulations may be performed, and the fitness by a viewer, f consists of ratings of unseen movies. of the model reassessed and improved as needed for the next round of validation. While physics Mauro Maggioni is assistant professor of mathematics and drives the modeling in applied mathematics, a at Duke University. His email address is new type of intuition, built on experiences in the [email protected]. world of high-dimensional data sets rather than DOI: http://dx.doi.org/10.1090/noti831 in the world of physics, drives the intuition of the

532 Notices of the AMS Volume 59, Number 4 mathematician set to analyze high-dimensional data sets, where “tractable” models are geomet- 3 ric or statistical models with a small number of ψ φ parameters. 2 One of the reasons for focusing on reduc- ing the dimension is to enable computations, φ 1 but a fundamental motivation is the so-called 0.04 . One of its manifestations 0.02 arises in the approximation of a 1-Lipschitz func- 0 tion on the unit cube, f : [0, 1]D → satisfying |f (x) − f (y)| ≤ ||x − y|| for x, y ∈ [0, 1]D. To 0 −0.05 −1 achieve uniform error , given samples (x , f (x )), i i −0.02 in general one needs at least one sample in each cube of side , for a total of −D samples, which is −2 −0.04 0 too large even for, say,  = 10−1 and D = 100 (a −0.03 −0.02 −0.01 rather small dimension in applications). A common 0 0.01 −3 assumption is that either the samples xi lie on 0.02 0.05 D a low-dimensional subset of [0, 1] and/or f is 0.15 not simply Lipschitz but has a smoothness that Anthropology is suitably large, depending on D (see references Astronomy in [3]). Taking the former route, one assumes 0.1 Social Sciences Earth Sciences that the data lies on a low-dimensional subset in Biology the high-dimensional ambient space, such as a Mathematics low-dimensional hyperplane or unions thereof, or 0.05 Medicine Physics low-dimensional manifolds or rougher sets. Re- 6 search problems require ideas from different areas 0 of mathematics, including geometry, geometric measure theory, topology, and , with their tools for studying manifolds or rougher sets; −0.05 and geometric functional analysis for studying random samples and measures in high dimensions; harmonic analysis and approximation −0.1 −0.5 theory, with their ideas of multiscale analysis and 0 function approximation; and , 0.5 −0.15 −0.1 −0.05 0 0.05 because we need efficient algorithms to analyze 5 4 real-world data. As a concrete example, consider the following Figure 1. Top: Diffusion map embedding of the n D construction. Given n points {xi}i=1 ⊂ R and  > set of configurations of a small biomolecule 2 = − ||xi −xj || = P (alanine dipeptide) from its 36-dimensional 0, construct Wij exp( 2 ), Dii j Wij , − 1 − 1 space. The color is one of the dihedral angles = − 2 2 and the Laplacian matrix L I D WD on ϕ, ψ of the molecule, known to be essential to the weighted graph G with vertices {xi} and edges the dynamics [4]. This is a physical system weighted by W . When xi is sampled from a man- where (approximate) equations of motion are ifold M and n tends to infinity, L approximates known, but their structure is too complicated (in a suitable sense) the Laplace-Beltrami operator and the state space too high-dimensional to be on M [2], which is a completely intrinsic object. amenable to analysis. Bottom: Diffusion map of a The random walk on G, with transition matrix consisting of 1161 Science News P = D−1W , approximates Brownian motion on M. articles, each modeled by a 1153-dimensional Consider, for a time t > 0, the so-called diffusion vector of word frequencies, embedded in a t t distance dt (x, y) := ||P (x, ·) − P (y, ·)||L2(G) (see low-dimensional space with diffusion maps, as [2]). This distance is particularly useful for cap- described in the text and in [2]. turing clusters/groupings in the data, which are regions of fast diffusion connected by bottlenecks t q t q t that slow diffusion. Let 1 = λ0 ≥ λ1 ≥ · · · be space, where Φd(x) := ( λ1ϕ1(x), . . . , λdϕd(x)), the eigenvalues of P and ϕi be the correspond- for some t > 0 [2]. One can show that the Euclidean t t ing eigenvectors (ϕ0, when G is a web graph, is distance between Φd(x) and Φd(y) approximates related to ’s pagerank). Consider a diffu- dt (x, y), the diffusion distance at time scale t t sion map Φd that embeds the graph in Euclidean between x and y on the graph G.

April 2012 Notices of the AMS 533 David In Figure 1 we apply this technique to two completely different data sets. The first one is a set of configurations of a small peptide, obtained by a 12×3 Blackwell molecular dynamics simulation: a point xi ∈ R contains the coordinates in R3 of the 12 atoms Memorial Conference in the alanine dipeptide molecule (represented as an inset in Figure 1). The forces between the April 19–20, 2012 atoms in the molecule constrain the trajectories Howard University to lie close to low-dimensional sets in the 36- Washington, DC dimensional state space. In Figure 1 we apply the construction above1 and represent the diffusion Join us on April map embedding of the configurations collected [4]. 19–20, 2012, at The second one is a set of text documents (articles from Science News), each represented as a R1153 Howard University in vector whose kth coordinate is the frequency of the Washington, DC for kth word in a 1153-word dictionary. The diffusion a special conference embedding in low dimensions reveals even lower- honoring David Blackwell dimensional geometric structures, which turn out (1919–2010), former to be useful for understanding the dynamics of President of the Institute the peptide considered in the first data set and for of Mathematical automatically clustering documents by topic in the and the fi rst African- case of the second data set. Ideas from probability American to be inducted (random samples), harmonic analysis (Laplacian), into the National and geometry (manifolds) come together in these types of constructions.

Photograph by Lotfi A. Zadeh. A. Lotfi by Photograph Academy of Sciences. This is only the beginning of one of many re- Th is conference will bring together a diverse search avenues explored in the last few years. Many group of leading theoretical and applied other exciting opportunities exist, for example the statisticians and mathematicians to discuss study of stochastic dynamic networks, where a advances in mathematics and statistics sample is a network and multiple samples are that are related to, and in many cases have collected in time: quantifying and modeling change requires introducing sensible and robust metrics grown out of, the work of David Blackwell. between graphs. Th ese include developments in dynamic Further reading: [5, 3, 1] and the references programming, , game therein. theory, design of experiments, renewal theory, and other fi elds. Other speakers will References discuss Blackwell’s legacy for the community 1. Science: Special issue: Dealing with data, February 2011, of African-American researchers in the pp. 639–806. mathematical sciences. 2. R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker, Geometric Th e conference is being organized by the diffusions as a tool for harmonic analysis and structure Department of Mathematics at Howard definition of data: Diffusion maps, Proc. Natl. Acad. Sci. USA 102 (2005), no. 21, 7426–7431. University in collaboration with the 3. D. Donoho, High-dimensional : The curses University of California, Berkeley, Carnegie and blessings of dimensionality, “Math Challenges of the Mellon University, and the American 21st Century”, AMS, 2000. Statistical Association. Funding is being 4. M. A. Rohrdanz, W. Zheng, M. Maggioni, and C. Clementi, Determination of reaction coordinates via provided by the National Science Foundation locally scaled diffusion map, J. Chem. Phys. 134 (2011), and the Army Research Offi ce. To learn more 124116. about the program, invited speakers, and 5. J. W. Tukey, The Future of Data Analysis, Ann. Math. Statist. 33 registration, visit: https://sites.google.com/ , Number 1 (1962), 1–67. site/conferenceblackwell.

1We use here a slightly different definition of the weight matrix W , which uses distances between molecular con- figurations up to rigid affine transformations, instead of Euclidean distances.

534 Notices of the AMS Volume 59, Number 4