Data Mining and Exploration: A Quick Overview
S. G. Djorgovski
AyBi199, Feb. 2009

Today:
• A general intro to Data Mining
  – What is it, and what for?
• Clustering and classification (a quick and very superficial intro)
  – An example from astronomy: star-galaxy separation
• Exploratory statistics
  – An example from multivariate statistics: Principal Component Analysis (PCA) and multivariate correlations
• Second part of the lecture: some examples and demos, by Ciro Donalek

Note: This is just a very modest start! We posted some web links for you to explore, and go from there.

What is Data Mining (DM)?
(or: KDD = Knowledge Discovery in Databases)

• Many different things, but generally what the name KDD says
  – It includes data discovery, cleaning, and preparation, which is a key component (and can be very problematic)
  – It often involves a search for patterns, correlations, etc., and automated and objective classification
  – It includes data modeling and testing of the models
• It depends a lot on the type of data, the study domain (science, commerce, …), the nature of the problem, etc., etc.
• Generally, DM algorithms are computational embodiments of statistics

This is a Huge, HUGE field! Lots of literature, lectures, software… And yet, lots of unsolved applied CS research problems.

So what is Data Mining (DM)?

• The job of science is Knowledge Discovery; data are incidental to this process, representing the empirical foundations, but not the understanding per se
• A lot of this process is pattern recognition (including discovery of correlations, clustering/classification), discovery of outliers or anomalies, etc.
• DM is Knowledge Discovery in Databases (KDD)
• DM is defined as “an information extraction activity whose goal is to discover hidden facts contained in (large) databases”
• Machine Learning (ML) is the field of Computer Science research that focuses on algorithms that learn from data
• DM is the application of ML algorithms to large databases
  – And these algorithms are generally computational representations of some statistical methods

A Schematic View of KDD
[diagram]

Data Mining Methods and Some Examples

Methods:
• Clustering: group together similar items and separate dissimilar items in a database
• Classification: classify new data items using the known classes & groups
• Associations: find unusual co-occurring associations of attribute values among database items
• Regression: predict a numeric attribute value
• Organizing information in the database based on relationships among key data descriptors
• Link (affinity) analysis: identify linkages between data items based on features shared in common

Some examples: Neural Nets, Decision Trees, Pattern Recognition, Correlation/Trend Analysis, Principal Component Analysis, Independent Component Analysis, Regression Analysis, Outlier/Glitch Identification, Visualization, Autonomous Agents, Self-Organizing Maps (SOM)

Some Data Mining Techniques Graphically Represented:
Clustering, Self-Organizing Map (SOM), Neural Network

Here we show selected Kirk Borne’s slides from the NVO Summer School 2008:
http://nvo-twiki.stsci.edu/twiki/bin/view/Main/NVOSS2008Sched

Outlier (Anomaly) Detection, Link Analysis, Decision Tree
[figures from Kirk Borne’s slides]

Clustering and Classification

• Answering questions like:
  – How many statistically distinct kinds of things are there in my data, and which data object belongs to which class?
  – Are there anomalies/outliers? (e.g., extremely rare classes)
  – I know the classes present in the data, but would like to classify efficiently all of my data objects
• Clustering can be:
  1. Supervised: a known set of data objects (“ground truth”) can be used to train and test a classifier
     – Examples: Artificial Neural Nets (ANN), Decision Trees (DT)
  2. Unsupervised: the class membership (and the number of classes) is not known a priori; the program should find them
     – Examples: Kohonen Nets (SOM), Expectation Maximization (EM), various Bayesian methods…

Classification ~ Mixture Modeling

• A lot of DM involves automated classification or mixture modeling
  – How many kinds of data objects are there in my data set?
  – Which object belongs to which class with what probability?
• Different classes often follow different correlations
  – Or, correlations may define the classes which follow them
• Classes/clusters are defined by their probability density distributions in a parameter space
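To make the supervised/unsupervised distinction concrete, here is a minimal sketch in Python on synthetic data (scikit-learn is our assumption here, not a tool named in the lecture):

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.mixture import GaussianMixture

    # Synthetic data: 3 clusters in a 2-D parameter space
    X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)

    # Supervised: a labeled "ground truth" sample trains and tests a classifier
    clf = DecisionTreeClassifier().fit(X[:200], y_true[:200])
    print("supervised accuracy:", clf.score(X[200:], y_true[200:]))

    # Unsupervised: no labels; the program partitions the data itself
    gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
    labels = gmm.predict(X)  # class memberships found by the algorithm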

There are many good tools out there, but you need to choose the right ones for your needs:

• Inference Engine (Inputs → Learn P(E1|E2)): Joint DE, Bayes Net Structure Learning
• Classifier (Inputs → Predict category): Dec Tree, Sigmoid Perceptron, Sigmoid N.Net, Gauss/Joint BC, Gauss Naïve BC, N.Neigh, Bayes Net Based BC, Cascade Correlation
• Density Estimator (Inputs → Probability): Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE, Bayes Net Structure Learning, GMMs
• Regressor (Inputs → Predict real no.): Linear Regression, Polynomial Regression, Perceptron, Neural Net, N.Neigh, Kernel, LWR, RBFs, Robust Regression, Cascade Correlation, Regression Trees, GMDH, Multilinear Interp, MARS

(from Moore 2002)

Exploration of observable parameter spaces and searches for rare or new types of objects, a simple, real-life example: now consider ~10^9 data vectors in ~10^2 – 10^3 dimensions …

Gaussian Mixture Modeling

• Data points are distributed in some N-dimensional parameter space, x_j, j = 1, …, N
• There are k clusters, w_i, i = 1, …, k, where the number of clusters, k, may be either given by the scientist, or derived from the data themselves
• Each cluster can be modeled as an N-variate Gaussian with mean µ_i and covariance matrix S_i
• Each data point has an association probability of belonging to each of the clusters, P_i

An Example (from Moore et al.)
[figures: original data; GMM result with cluster means µ1, µ2, µ3; model density distribution]
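A minimal sketch of the mixture-modeling setup just described, with scikit-learn's EM-based GaussianMixture standing in for the modelers cited above (data synthetic):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # data drawn from k = 3 Gaussian clusters in an N = 2 dim. parameter space
    X = np.vstack([rng.normal(mu, 0.5, size=(100, 2))
                   for mu in ([0, 0], [3, 0], [1.5, 2.5])])

    gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
    print(gmm.means_)         # fitted cluster means mu_i
    print(gmm.covariances_)   # fitted covariance matrices S_i
    P = gmm.predict_proba(X)  # association probabilities P_i per point, per cluster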

A Popular Technique: K-Means
(Moore et al.)

• Start with k random cluster centers
• Assume a data model (e.g., Gaussian)
  – In principle, it can be some other type of a distribution
• Iterate until it converges
  – There are many techniques; Expectation Maximization (EM) is very popular; multi-resolution kd-trees are great (Moore, Nichol, Connolly, et al.)
• Repeat for a different k if needed
• Determine the optimal k (a sketch of this loop follows below):
  – Monte-Carlo Cross-Validation
  – Akaike Information Criterion (AIC)
  – Bayesian Information Criterion (BIC)

In modern data sets: D_D >> 1, D_S >> 1
Data Complexity → Multidimensionality → Discoveries

But the bad news is … the computational cost of clustering analysis:
  K-means:                       K × N × I × D
  Expectation Maximization:      K × N × I × D^2
  Monte Carlo Cross-Validation:  M × K_max^2 × N × I × D^2

  N = no. of data vectors, D = no. of data dimensions
  K = no. of clusters chosen, K_max = max no. of clusters tried
  I = no. of iterations, M = no. of Monte Carlo trials/partitions

→ Terascale (petascale?) computing and/or better algorithms!
Some dimensionality reduction methods do exist (e.g., PCA, class prototypes, hierarchical methods, etc.), but more work is needed.
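A sketch of the "repeat for different k, then pick the optimal k" loop above, scored here with BIC, one of the criteria listed; the scikit-learn calls are an assumption, not the lecture's code:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=4, random_state=1)

    for k in range(1, 8):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        # BIC from a Gaussian mixture with the same k (K-means has no BIC itself)
        bic = GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        print(f"k={k}  inertia={km.inertia_:.0f}  BIC={bic:.0f}")
    # choose the k that minimizes BIC (or AIC, or cross-validated likelihood)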

Some Practical and Theoretical Problems in Clustering Analysis

• Data heterogeneity, biases, selection effects …
• Non-Gaussianity of clusters (data models)
• Outlier population, or a non-Gaussian tail?
• Missing data, upper and lower limits
• Non-Gaussian (or non-Poissonian) noise
• Non-trivial topology of clustering
• Useful vs. “useless” parameters …

Some Simple Examples of Challenges for Clustering Analysis from “Standard” Astronomical Galaxy Clustering Analysis
[figures: clustering on a clustered background – DPOSS Clusters (Gal et al.); clustering with a nontrivial topology – LSS Numerical Simulation (VIRGO)]

Useful vs. “Useless” Parameters

Clusters (classes) and correlations may exist/separate in some parameter subspaces (e.g., x_i vs. x_j), but not in others (e.g., x_n vs. x_m).

A Relatively Simple Classification Problem: Star-Galaxy Separation

• Important, since for most astronomical studies you want either stars (~ quasars), or galaxies; the depth to which a reliable classification can be done is the effective limiting depth of your catalog - not the detection depth
  – There is generally more to measure for a non-PSF object
• You’d like to have an automated and objective process, with some estimate of the accuracy as a function of magnitude
  – Generally classification fails at the faint end
• Most methods use some measures of light concentration vs. magnitude (perhaps more than one), and/or some measure of the PSF fit quality (e.g., χ^2)
• For more advanced approaches, use some machine learning method, e.g., neural nets or decision trees

Typical Parameter Space for S/G Classif. (From DPOSS)
[figure: light concentration vs. magnitude, showing the stellar locus and the galaxies]

More S/G Classification Parameter Spaces: Normalized By The Stellar Locus
[figure]

A set of such parameters can then be fed into an automated classifier (ANN, DT, …), which can be trained with a “ground truth” sample.
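A toy version of that pipeline: train a decision tree on a "ground truth" sample using a concentration-vs-magnitude type parameter. All numbers and feature definitions below are invented for illustration:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(42)
    n = 1000
    mag = rng.uniform(16, 21, n)       # magnitudes
    is_star = rng.random(n) < 0.5
    # toy "light concentration" index: stars sit on a tight locus, galaxies
    # are more extended; the scatter grows toward the faint end
    conc = (np.where(is_star, 1.0, 1.8) + 0.02 * (mag - 16)
            + rng.normal(0.0, 0.05 + 0.03 * (mag - 16), n))

    X = np.column_stack([mag, conc])
    y = is_star.astype(int)            # 1 = star, 0 = galaxy

    clf = DecisionTreeClassifier(max_depth=4).fit(X[:700], y[:700])
    print("accuracy on held-out 'ground truth':", clf.score(X[700:], y[700:]))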

Automated Star-Galaxy Classification: Artificial Neural Nets (ANN)
[diagram: Input: various image shape parameters → Output: Star, p(s); Galaxy, p(g); Other, p(o)]
(Odewahn et al. 1992)

Automated Star-Galaxy Classification: Decision Trees (DTs)
(Weir et al. 1995)

Automated Star-Galaxy Classification: Unsupervised Classifiers

No training data set - the program decides on the number of classes present in the data, and partitions the data set accordingly.

An example: AutoClass (Cheeseman et al.), which uses a Bayesian approach in machine learning (ML). This application, from DPOSS, found the classes Star, Star+fuzz, Gal1 (E?), and Gal2 (Sp?). (Weir et al. 1995)

Star-Galaxy Classification: The Next Generation
[diagram: multiple imaging data sets → individually derived classifications C_i, with dataset-dependent constraints; optimally combined imagery → optimal classification ⟨C⟩, with context-dependent constraints]
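A sketch of the ANN classifier diagrammed above (inputs: image shape parameters; outputs: p(star), p(galaxy), p(other)), with scikit-learn's MLPClassifier standing in for the networks of Odewahn et al.; the data are invented:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    # invented image-shape parameters for 3 classes: star / galaxy / other
    X = np.vstack([rng.normal(m, 0.3, size=(200, 4))
                   for m in (0.0, 1.0, 2.0)])
    y = np.repeat(["star", "galaxy", "other"], 200)

    ann = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                        random_state=0).fit(X, y)
    print(ann.classes_)              # ['galaxy' 'other' 'star']
    print(ann.predict_proba(X[:1]))  # p(g), p(o), p(s) for the first object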

How to Incorporate the External or A Priori (Contextual) Knowledge?

• Examples: seeing and transparency for a given night; the direction on the sky, in Galactic coordinates; continuity in the star/galaxy fraction along the scan; etc.
• Still an open problem in machine learning
• In principle, it should lead to an improved classification
• The problem occurs both in a “single pass” classification, and in combining of multiple passes
• In machine learning approaches, one must somehow convert the external or a priori knowledge into classifier inputs - but the nature of this information is qualitatively different from the usual input (individual measurement vectors)

One key external constraint is the “seeing” quality for multiple imaging passes (quantifiable, e.g., as the PSF FWHM).
[figures: good seeing vs. mediocre seeing]

Two Approaches Using ANN (a sketch follows below):

1. Include the external knowledge among the input parameters: the image parameters {p_1, …, p_n} plus the external parameters (object coordinates, seeing, etc.) feed a single NN, whose output is the stellarity index S.
2. A two-step classification: the image parameters {p_1, …, p_n} feed NN_1, which outputs S_1; then S_1 and the external parameters feed NN_2, which outputs S_2.

Classification Bias and Accuracy

[figures: numbers of stars and galaxies vs. stellarity index S, from 0 (pure galaxy) to 1 (pure star), under good and bad seeing, with the classification boundary marked]

Assuming a classification boundary divider (stars/galaxies) derived from good quality data, and applying it to poorer quality data, would lead to a purer, but biased sample, as some stars will be misclassified as galaxies. Shifting the boundary (e.g., on the basis of external knowledge) would diminish the bias, but also degrade the purity.
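A sketch of approach 1 above: fold the external knowledge into the input vector by appending the per-pass seeing FWHM to each object's image parameters. Feature names, values, and labels are illustrative only:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(1)
    n = 600
    image_params = rng.normal(size=(n, 4))       # per-object shape measurements
    seeing_fwhm = rng.uniform(1.0, 3.0, (n, 1))  # external: PSF FWHM of the pass

    # Approach 1: the external knowledge becomes just another input parameter
    X = np.hstack([image_params, seeing_fwhm])
    y = (image_params[:, 0] + 0.3 * seeing_fwhm[:, 0]
         + rng.normal(0, 0.3, n) > 1.2).astype(int)  # toy star/galaxy labels

    nn = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                       random_state=0).fit(X, y)
    # Approach 2 (two-step) would instead feed this network's output S1,
    # together with the external parameters, into a second network.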

Combining Multiple Classifications

A metaclassifier, or a committee of machines with a chairman? The measured attributes {p_i} and classifications {e_i} from n individual (independent) passes feed classifiers NN_1, …, NN_n, whose outputs S_1, …, S_n are combined by a metaclassifier MC into the final joint classification ⟨S⟩.

Note: the individual classifiers may be optimized or trained differently. Open questions: Design? Weighting algorithm? Training data set? Validation data set?

The (Proper) Uses of Statistics

• Hypothesis testing
• Model fitting
• Data exploration:
  – Multivariate analysis (MVA) and correlation search
  – Clustering analysis and classification
  – Image processing and deconvolutions
• Data Mining (or KDD):
  – Computational/algorithmic implementations of statistical tools

NB: Statistical Significance ≠ Scientific Relevance!

BAD uses of statistics:
  – As a substitute for data (quantity or quality)
  – To justify something a posteriori
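A minimal committee-of-machines sketch: combine the stellarity indices S_i from n independently trained per-pass classifiers by a weighted mean, with weights, e.g., reflecting each pass's data quality. The weights and values are invented, and a real metaclassifier could itself be a trained machine:

    import numpy as np

    # stellarity index S_i (0 = galaxy, 1 = star) from n = 3 independent passes
    S = np.array([[0.92, 0.88, 0.45],   # object 1
                  [0.10, 0.22, 0.30]])  # object 2
    w = np.array([1.0, 0.8, 0.3])       # per-pass weights (e.g., seeing quality)

    S_final = S @ w / w.sum()           # committee output <S> per object
    print(S_final)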

Multivariate Analysis (MVA): Correlation Searches in Attribute Space

• Are there significant, nontrivial correlations present in the data?
• Simple monovariate correlations are rare: multivariate data sets can contain more complex correlations, f(x_i, x_j, …)
• What is the statistical dimensionality of the data?

If D_S < D_D, correlations are present.
[figures: data dimension D_D = 2, with statistical dimension D_S = 2 (no correlation) vs. D_S = 1 (a correlation is present)]

Correlations are clusters with dimensionality reduction.

Clusters vs. Correlations:
  “Physics” → Correlations
  Correlations → reduction of the statistical dimensionality

A real-life example: the “Fundamental Plane” of elliptical galaxies, a set of bivariate scaling relations in a parameter space of ~10 dimensions, containing valuable insights into their physics and evolution.
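A sketch of a basic correlation search: build the correlation matrix of the attributes and look for large off-diagonal entries; the eigenvalues of that matrix give a first estimate of D_S. Numpy only; the synthetic data have one built-in multivariate correlation:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    x3 = 2.0 * x1 - x2 + rng.normal(0, 0.1, n)  # hidden multivariate correlation
    data = np.column_stack([x1, x2, x3])

    R = np.corrcoef(data, rowvar=False)   # D_D x D_D correlation matrix
    print(np.round(R, 2))
    # one eigenvalue of R is ~0, so D_S = 2 < D_D = 3: a correlation is present
    print(np.round(np.linalg.eigvalsh(R), 3))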

Principal Component Analysis

Solving the eigen-problem of the data hyperellipsoid in the parameter space of measured attributes:

  ξ_i = a_i1 p_1 + a_i2 p_2 + …
  p_i = b_i1 ξ_1 + b_i2 ξ_2 + …

  p_i = observables (i = 1, …, D_data)
  ξ_j = eigenvectors, or principal axes of the data hyperellipsoid (j = 1, …, D_stat)
  e_j = eigenvalues, or amplitudes of ξ_j

Correlation Vector Diagrams: Projections of the data and observable axes onto the planes defined by the eigenvectors

  cos φ_12 = correlation coef. of p_1 and p_2
[figure: observable axes p_1, p_2, p_3 projected onto the plane of ξ_1 and ξ_2, with the angle φ_12 between p_1 and p_2]

An Example, Using VOStat

Here is a data file, with 6 observed and 5 derived quantities (columns) for a few hundred elliptical galaxies (rows, data vectors).

Pairwise Plots for Independent Observables
[figure]
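The eigen-problem above, written out with numpy: diagonalize the correlation matrix of the standardized observables p_i; the eigenvectors ξ_j are the principal axes, and the eigenvalues e_j their amplitudes. The covariance structure here is invented:

    import numpy as np

    rng = np.random.default_rng(0)
    p = rng.multivariate_normal([0, 0, 0],
                                [[1.0, 0.8, 0.2],
                                 [0.8, 1.0, 0.1],
                                 [0.2, 0.1, 1.0]], size=1000)

    Z = (p - p.mean(0)) / p.std(0)    # standardized observables p_i
    R = np.corrcoef(Z, rowvar=False)  # shape of the data "hyperellipsoid"
    e, xi = np.linalg.eigh(R)         # eigenvalues e_j, eigenvectors xi_j
    # the columns of xi hold the coefficients a_ij in xi_i = a_i1 p_1 + a_i2 p_2 + ...
    print(e[::-1])                    # amplitudes, largest first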

Their Correlation Matrix:

You can learn a lot just from the inspection of this matrix, and comparison with the pairwise (bivariate) plots …

Now Let’s Do the Principal Component Analysis (PCA):

5 independent observables, but only 2 significant dimensions: the first 2 components account for all of the sample variance! The data sit on a plane in a 5-dim. parameter space: this is the Fundamental Plane of elliptical galaxies. Any one variable can be expressed as a combination of any 2 others, within errors.

PCA Results in More Detail
(This from a slightly different data set …)
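The same effect as in the VOStat example, reproduced on synthetic data: five observables generated from two latent variables sit on a plane in 5-D, so the first two principal components carry essentially all of the variance (scikit-learn's PCA is assumed):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    # 2 latent variables generate 5 observables -> the data lie on a plane in 5-D
    u, v = rng.normal(size=(2, 400))
    obs = np.column_stack([u, v, u + v, 2*u - v, 0.5*u + 3*v])
    obs += rng.normal(0, 0.01, obs.shape)  # small "measurement errors"

    pca = PCA().fit(StandardScaler().fit_transform(obs))
    # the first 2 components account for essentially all of the sample variance
    print(np.round(pca.explained_variance_ratio_, 4))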

Now Project the Observable Axes Onto the Plane Defined by the Principal Eigenvectors:

[figure: eigenvectors and projections of the parameter axes]
The cosines of the angles between parameter axes give the correlation coefficients; compare with the correlation matrix.

Another Approach: Correlated Residuals

[figures: Y vs. X and Z vs. X each show only a mediocre correlation about a best-fit line, with residuals ΔZ; but the residuals ΔZ correlate with the third variable, Y → the data lie on a plane in the XYZ space. A direct Z vs. Y plot shows only a poor correlation.]

Bivariate Correlations in Practice

Once the dimensionality has been established from PCA, one can either derive the optimal bivariate combinations of variables from the PCA coefficients, or optimize the mixing ratios for any two variables vs. a third one (for a 2-dimensional manifold; the generalization to higher-dimensional manifolds is obvious).
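A sketch of the correlated-residuals trick from the slide: fit Z against X alone, then check whether the residuals ΔZ correlate with Y; if they do, the data lie on a plane in XYZ. Numpy only; the data are synthetic:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 400
    X = rng.normal(size=n)
    Y = rng.normal(size=n)
    Z = 1.5 * X - 0.7 * Y + rng.normal(0, 0.05, n)  # a plane in the XYZ space

    a, b = np.polyfit(X, Z, 1)      # best-fit line of the mediocre Z-X relation
    dZ = Z - (a * X + b)            # residuals about the best-fit line
    print(np.corrcoef(X, Z)[0, 1])  # mediocre correlation
    print(np.corrcoef(Y, dZ)[0, 1]) # strong: the residuals track the 3rd variable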

Some Data Mining Software & Projects

General data mining software packages:
• Weka (Java): http://www.cs.waikato.ac.nz/ml/weka/
• Weka4WS (Grid-enabled): http://grid.deis.unical.it/weka4ws/
• RapidMiner: http://www.rapidminer.com/

Astronomy-specific software and/or user clients:
• VO-Neural: http://voneural.na.infn.it/
• AstroWeka: http://astroweka.sourceforge.net/
• OpenSkyQuery: http://www.openskyquery.net/
• ALADIN: http://aladin.u-strasbg.fr/
• MIRAGE: http://cm.bell-labs.com/who/tkh/mirage/
• AstroBox: http://services.china-vo.org/

Astronomical and/or scientific data mining projects:
• GRIST: http://grist.caltech.edu/
• ClassX: http://heasarc.gsfc.nasa.gov/classx/
• LCDM: http://dposs.ncsa.uiuc.edu/
• F-MASS: http://www.itsc.uah.edu/f-mass/
• NCDM: http://www.ncdm.uic.edu/

Examples of Data Mining Packages: Weka
http://www.cs.waikato.ac.nz/ml/weka/

• A collection of open source machine learning algorithms for data mining tasks
• Algorithms can either be applied directly to a dataset or called from your own Java code
• Comes with its own GUI
• Contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization

Examples of Data Mining Packages: Mirage
http://cm.bell-labs.com/who/tkh/mirage/

• A Java package for exploratory data analysis (EDA), correlation mining, and interactive pattern discovery

Here are some useful books:
• P.-N. Tan, M. Steinbach, & V. Kumar, Introduction to Data Mining, Addison-Wesley, 2005. ISBN: 9780321321367
• M. Dunham, Data Mining: Introductory and Advanced Topics, Prentice-Hall, 2002. ISBN: 9780130888921
• R. J. Roiger & M. W. Geatz, Data Mining: A Tutorial-Based Primer, Addison-Wesley, 2002. ISBN: 9780201741285

• Lots of good links to follow from the class webpage!