Data Mining and Exploration: A Quick Overview
S. G. Djorgovski
AyBi199, Feb. 2009

Today:
• A general intro to Data Mining
  – What is it, and what for?
• Clustering and classification (a quick and very superficial intro)
  – An example from astronomy: star-galaxy separation
• Exploratory statistics
  – An example from multivariate statistics: Principal Component Analysis (PCA) and multivariate correlations
• Second part of the lecture: some examples and demos, by Ciro Donalek

Note: This is just a very modest start! We posted some web links for you to explore, and go from there.

What is Data Mining (DM)?
(or: KDD = Knowledge Discovery in Databases)

• Many different things, but generally what the name KDD says
  – It includes data discovery, cleaning, and preparation, which is a key component (and can be very problematic)
  – It often involves a search for patterns, correlations, etc., and automated and objective classification
  – It includes data modeling and testing of the models
• It depends a lot on the type of data, the study domain (science, commerce, …), the nature of the problem, etc., etc.
• Generally, DM algorithms are computational embodiments of statistics

This is a Huge, HUGE field! Lots of literature, lectures, software… And yet, lots of unsolved applied CS research problems.

So what is Data Mining (DM)?

• The job of science is Knowledge Discovery; data are incidental to this process, representing the empirical foundations, but not the understanding per se
• A lot of this process is pattern recognition (including discovery of correlations, clustering/classification), discovery of outliers or anomalies, etc.
• DM is Knowledge Discovery in Databases (KDD)
• DM is defined as “an information extraction activity whose goal is to discover hidden facts contained in (large) databases”
• Machine Learning (ML) is the field of Computer Science research that focuses on algorithms that learn from data
• DM is the application of ML algorithms to large databases
  – And these algorithms are generally computational representations of some statistical methods

A Schematic View of KDD
[diagram]

Data Mining Methods and Some Examples

Methods:
• Clustering: group together similar items and separate dissimilar items in a database
• Classification: classify new data items using the known classes & groups
• Associations: find unusual co-occurring associations of attribute values among database items
• Regression: predict a numeric attribute value
• Organizing information in the database based on relationships among key data descriptors
• Link (affinity) analysis: identify linkages between data items based on features shared in common

Some examples: Neural Nets, Decision Trees, Pattern Recognition, Correlation/Trend Analysis, Principal Component Analysis, Independent Component Analysis, Regression Analysis, Outlier/Glitch Identification, Visualization, Autonomous Agents, Self-Organizing Maps (SOM)

Some Data Mining Techniques Graphically Represented:
Clustering, Self-Organizing Map (SOM), Neural Network

Here we show selected Kirk Borne’s slides from the NVO Summer School 2008:
http://nvo-twiki.stsci.edu/twiki/bin/view/Main/NVOSS2008Sched

Outlier (Anomaly) Detection, Link Analysis, Decision Tree
[figures from Kirk Borne’s slides]

Clustering and Classification

• Answering questions like:
  – How many statistically distinct kinds of things are there in my data, and which data object belongs to which class?
  – Are there anomalies/outliers? (e.g., extremely rare classes)
  – I know the classes present in the data, but would like to classify efficiently all of my data objects
• Clustering can be:
  1. Supervised: a known set of data objects (“ground truth”) can be used to train and test a classifier
     – Examples: Artificial Neural Nets (ANN), Decision Trees (DT)
  2. Unsupervised: the class membership (and the number of classes) is not known a priori; the program should find them
     – Examples: Kohonen Nets (SOM), Expectation Maximization (EM), various Bayesian methods…

Classification ~ Mixture Modeling

• A lot of DM involves automated classification or mixture modeling
  – How many kinds of data objects are there in my data set?
  – Which object belongs to which class with what probability?
• Different classes often follow different correlations
  – Or, correlations may define the classes which follow them
• Classes/clusters are defined by their probability density distributions in a parameter space
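To make the supervised/unsupervised distinction concrete, here is a minimal sketch in Python on synthetic data (scikit-learn is our assumption here, not a tool named in the lecture):

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.mixture import GaussianMixture

    # Synthetic data: 3 clusters in a 2-D parameter space
    X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)

    # Supervised: a labeled "ground truth" sample trains and tests a classifier
    clf = DecisionTreeClassifier().fit(X[:200], y_true[:200])
    print("supervised accuracy:", clf.score(X[200:], y_true[200:]))

    # Unsupervised: no labels; the program partitions the data itself
    gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
    labels = gmm.predict(X)  # class memberships found by the algorithm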

There are many good tools out there, but you need to choose the right ones for your needs:

• Inference Engine (Inputs → Learn P(E1|E2)): Joint DE, Bayes Net Structure Learning
• Classifier (Inputs → Predict category): Dec Tree, Sigmoid Perceptron, Sigmoid N.Net, Gauss/Joint BC, Gauss Naïve BC, N.Neigh, Bayes Net Based BC, Cascade Correlation
• Density Estimator (Inputs → Probability): Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE, Bayes Net Structure Learning, GMMs
• Regressor (Inputs → Predict real no.): Linear Regression, Polynomial Regression, Perceptron, Neural Net, N.Neigh, Kernel, LWR, RBFs, Robust Regression, Cascade Correlation, Regression Trees, GMDH, Multilinear Interp, MARS

(from Moore 2002)

Exploration of observable parameter spaces and searches for rare or new types of objects, a simple, real-life example: now consider ~10^9 data vectors in ~10^2 – 10^3 dimensions …

Gaussian Mixture Modeling

• Data points are distributed in some N-dimensional parameter space, x_j, j = 1, …, N
• There are k clusters, w_i, i = 1, …, k, where the number of clusters, k, may be either given by the scientist, or derived from the data themselves
• Each cluster can be modeled as an N-variate Gaussian with mean µ_i and covariance matrix S_i
• Each data point has an association probability of belonging to each of the clusters, P_i

An Example (from Moore et al.)
[figures: original data; GMM result with cluster means µ1, µ2, µ3; model density distribution]
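A minimal sketch of the mixture-modeling setup just described, with scikit-learn's EM-based GaussianMixture standing in for the modelers cited above (data synthetic):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # data drawn from k = 3 Gaussian clusters in an N = 2 dim. parameter space
    X = np.vstack([rng.normal(mu, 0.5, size=(100, 2))
                   for mu in ([0, 0], [3, 0], [1.5, 2.5])])

    gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
    print(gmm.means_)         # fitted cluster means mu_i
    print(gmm.covariances_)   # fitted covariance matrices S_i
    P = gmm.predict_proba(X)  # association probabilities P_i per point, per cluster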

A Popular Technique: K-Means
(Moore et al.)

• Start with k random cluster centers
• Assume a data model (e.g., Gaussian)
  – In principle, it can be some other type of a distribution
• Iterate until it converges
  – There are many techniques; Expectation Maximization (EM) is very popular; multi-resolution kd-trees are great (Moore, Nichol, Connolly, et al.)
• Repeat for a different k if needed
• Determine the optimal k (a sketch of this loop follows below):
  – Monte-Carlo Cross-Validation
  – Akaike Information Criterion (AIC)
  – Bayesian Information Criterion (BIC)

In modern data sets: D_D >> 1, D_S >> 1
Data Complexity → Multidimensionality → Discoveries

But the bad news is … the computational cost of clustering analysis:
  K-means:                       K × N × I × D
  Expectation Maximization:      K × N × I × D^2
  Monte Carlo Cross-Validation:  M × K_max^2 × N × I × D^2

  N = no. of data vectors, D = no. of data dimensions
  K = no. of clusters chosen, K_max = max no. of clusters tried
  I = no. of iterations, M = no. of Monte Carlo trials/partitions

→ Terascale (petascale?) computing and/or better algorithms!
Some dimensionality reduction methods do exist (e.g., PCA, class prototypes, hierarchical methods, etc.), but more work is needed.
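A sketch of the "repeat for different k, then pick the optimal k" loop above, scored here with BIC, one of the criteria listed; the scikit-learn calls are an assumption, not the lecture's code:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=4, random_state=1)

    for k in range(1, 8):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        # BIC from a Gaussian mixture with the same k (K-means has no BIC itself)
        bic = GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        print(f"k={k}  inertia={km.inertia_:.0f}  BIC={bic:.0f}")
    # choose the k that minimizes BIC (or AIC, or cross-validated likelihood)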

Some Practical and Theoretical Problems in Clustering Analysis

• Data heterogeneity, biases, selection effects …
• Non-Gaussianity of clusters (data models)
• Outlier population, or a non-Gaussian tail?
• Missing data, upper and lower limits
• Non-Gaussian (or non-Poissonian) noise
• Non-trivial topology of clustering
• Useful vs. “useless” parameters …

Some Simple Examples of Challenges for Clustering Analysis from “Standard” Astronomical Galaxy Clustering Analysis
[figures: clustering on a clustered background – DPOSS Clusters (Gal et al.); clustering with a nontrivial topology – LSS Numerical Simulation (VIRGO)]

Useful vs. “Useless” Parameters

Clusters (classes) and correlations may exist/separate in some parameter subspaces (e.g., x_i vs. x_j), but not in others (e.g., x_n vs. x_m).

A Relatively Simple Classification Problem: Star-Galaxy Separation

• Important, since for most astronomical studies you want either stars (~ quasars), or galaxies; the depth to which a reliable classification can be done is the effective limiting depth of your catalog - not the detection depth
  – There is generally more to measure for a non-PSF object
• You’d like to have an automated and objective process, with some estimate of the accuracy as a function of magnitude
  – Generally classification fails at the faint end
• Most methods use some measures of light concentration vs. magnitude (perhaps more than one), and/or some measure of the PSF fit quality (e.g., χ^2)
• For more advanced approaches, use some machine learning method, e.g., neural nets or decision trees

Typical Parameter Space for S/G Classif. (From DPOSS)
[figure: light concentration vs. magnitude, showing the stellar locus and the galaxies]

More S/G Classification Parameter Spaces: Normalized By The Stellar Locus
[figure]

A set of such parameters can then be fed into an automated classifier (ANN, DT, …), which can be trained with a “ground truth” sample.
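A toy version of that pipeline: train a decision tree on a "ground truth" sample using a concentration-vs-magnitude type parameter. All numbers and feature definitions below are invented for illustration:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(42)
    n = 1000
    mag = rng.uniform(16, 21, n)       # magnitudes
    is_star = rng.random(n) < 0.5
    # toy "light concentration" index: stars sit on a tight locus, galaxies
    # are more extended; the scatter grows toward the faint end
    conc = (np.where(is_star, 1.0, 1.8) + 0.02 * (mag - 16)
            + rng.normal(0.0, 0.05 + 0.03 * (mag - 16), n))

    X = np.column_stack([mag, conc])
    y = is_star.astype(int)            # 1 = star, 0 = galaxy

    clf = DecisionTreeClassifier(max_depth=4).fit(X[:700], y[:700])
    print("accuracy on held-out 'ground truth':", clf.score(X[700:], y[700:]))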

Automated Star-Galaxy Classification: Artificial Neural Nets (ANN)
[diagram: Input: various image shape parameters → Output: Star, p(s); Galaxy, p(g); Other, p(o)]
(Odewahn et al. 1992)

Automated Star-Galaxy Classification: Decision Trees (DTs)
(Weir et al. 1995)

Automated Star-Galaxy Classification: Unsupervised Classifiers

No training data set - the program decides on the number of classes present in the data, and partitions the data set accordingly.

An example: AutoClass (Cheeseman et al.), which uses a Bayesian approach in machine learning (ML). This application, from DPOSS, found the classes Star, Star+fuzz, Gal1 (E?), and Gal2 (Sp?). (Weir et al. 1995)

Star-Galaxy Classification: The Next Generation
[diagram: multiple imaging data sets → individually derived classifications C_i, with dataset-dependent constraints; optimally combined imagery → optimal classification ⟨C⟩, with context-dependent constraints]
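A sketch of the ANN classifier diagrammed above (inputs: image shape parameters; outputs: p(star), p(galaxy), p(other)), with scikit-learn's MLPClassifier standing in for the networks of Odewahn et al.; the data are invented:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    # invented image-shape parameters for 3 classes: star / galaxy / other
    X = np.vstack([rng.normal(m, 0.3, size=(200, 4))
                   for m in (0.0, 1.0, 2.0)])
    y = np.repeat(["star", "galaxy", "other"], 200)

    ann = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                        random_state=0).fit(X, y)
    print(ann.classes_)              # ['galaxy' 'other' 'star']
    print(ann.predict_proba(X[:1]))  # p(g), p(o), p(s) for the first object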

How to Incorporate the External or A Priori (Contextual) Knowledge?

• Examples: seeing and transparency for a given night; the direction on the sky, in Galactic coordinates; continuity in the star/galaxy fraction along the scan; etc.
• Still an open problem in machine learning
• In principle, it should lead to an improved classification
• The problem occurs both in a “single pass” classification, and in combining of multiple passes
• In machine learning approaches, one must somehow convert the external or a priori knowledge into classifier inputs - but the nature of this information is qualitatively different from the usual input (individual measurement vectors)

One key external constraint is the “seeing” quality for multiple imaging passes (quantifiable, e.g., as the PSF FWHM).
[figures: good seeing vs. mediocre seeing]

Two Approaches Using ANN (a sketch follows below):

1. Include the external knowledge among the input parameters: the image parameters {p_1, …, p_n} plus the external parameters (object coordinates, seeing, etc.) feed a single NN, whose output is the stellarity index S.
2. A two-step classification: the image parameters {p_1, …, p_n} feed NN_1, which outputs S_1; then S_1 and the external parameters feed NN_2, which outputs S_2.

Classification Bias and Accuracy

[figures: numbers of stars and galaxies vs. stellarity index S, from 0 (pure galaxy) to 1 (pure star), under good and bad seeing, with the classification boundary marked]

Assuming a classification boundary divider (stars/galaxies) derived from good quality data, and applying it to poorer quality data, would lead to a purer, but biased sample, as some stars will be misclassified as galaxies. Shifting the boundary (e.g., on the basis of external knowledge) would diminish the bias, but also degrade the purity.
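A sketch of approach 1 above: fold the external knowledge into the input vector by appending the per-pass seeing FWHM to each object's image parameters. Feature names, values, and labels are illustrative only:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(1)
    n = 600
    image_params = rng.normal(size=(n, 4))       # per-object shape measurements
    seeing_fwhm = rng.uniform(1.0, 3.0, (n, 1))  # external: PSF FWHM of the pass

    # Approach 1: the external knowledge becomes just another input parameter
    X = np.hstack([image_params, seeing_fwhm])
    y = (image_params[:, 0] + 0.3 * seeing_fwhm[:, 0]
         + rng.normal(0, 0.3, n) > 1.2).astype(int)  # toy star/galaxy labels

    nn = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                       random_state=0).fit(X, y)
    # Approach 2 (two-step) would instead feed this network's output S1,
    # together with the external parameters, into a second network.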

Combining Multiple Classifications

A metaclassifier, or a committee of machines with a chairman? The measured attributes {p_i} and classifications {e_i} from n individual (independent) passes feed classifiers NN_1, …, NN_n, whose outputs S_1, …, S_n are combined by a metaclassifier MC into the final joint classification ⟨S⟩.

Note: the individual classifiers may be optimized or trained differently. Open questions: Design? Weighting algorithm? Training data set? Validation data set?

The (Proper) Uses of Statistics

• Hypothesis testing
• Model fitting
• Data exploration:
  – Multivariate analysis (MVA) and correlation search
  – Clustering analysis and classification
  – Image processing and deconvolutions
• Data Mining (or KDD):
  – Computational/algorithmic implementations of statistical tools

NB: Statistical Significance ≠ Scientific Relevance!

BAD uses of statistics:
  – As a substitute for data (quantity or quality)
  – To justify something a posteriori
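A minimal committee-of-machines sketch: combine the stellarity indices S_i from n independently trained per-pass classifiers by a weighted mean, with weights, e.g., reflecting each pass's data quality. The weights and values are invented, and a real metaclassifier could itself be a trained machine:

    import numpy as np

    # stellarity index S_i (0 = galaxy, 1 = star) from n = 3 independent passes
    S = np.array([[0.92, 0.88, 0.45],   # object 1
                  [0.10, 0.22, 0.30]])  # object 2
    w = np.array([1.0, 0.8, 0.3])       # per-pass weights (e.g., seeing quality)

    S_final = S @ w / w.sum()           # committee output <S> per object
    print(S_final)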

Multivariate Analysis (MVA): Correlation Searches in Attribute Space

• Are there significant, nontrivial correlations present in the data?
• Simple monovariate correlations are rare: multivariate data sets can contain more complex correlations, f(x_i, x_j, …)
• What is the statistical dimensionality of the data?

If D_S < D_D, correlations are present.
[figures: data dimension D_D = 2, with statistical dimension D_S = 2 (no correlation) vs. D_S = 1 (a correlation is present)]

Correlations are clusters with dimensionality reduction.

Clusters vs. Correlations:
  “Physics” → Correlations
  Correlations → reduction of the statistical dimensionality

A real-life example: the “Fundamental Plane” of elliptical galaxies, a set of bivariate scaling relations in a parameter space of ~10 dimensions, containing valuable insights into their physics and evolution.
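A sketch of a basic correlation search: build the correlation matrix of the attributes and look for large off-diagonal entries; the eigenvalues of that matrix give a first estimate of D_S. Numpy only; the synthetic data have one built-in multivariate correlation:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    x3 = 2.0 * x1 - x2 + rng.normal(0, 0.1, n)  # hidden multivariate correlation
    data = np.column_stack([x1, x2, x3])

    R = np.corrcoef(data, rowvar=False)   # D_D x D_D correlation matrix
    print(np.round(R, 2))
    # one eigenvalue of R is ~0, so D_S = 2 < D_D = 3: a correlation is present
    print(np.round(np.linalg.eigvalsh(R), 3))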

Principal Component Analysis

Solving the eigen-problem of the data hyperellipsoid in the parameter space of measured attributes:

  ξ_i = a_i1 p_1 + a_i2 p_2 + …
  p_i = b_i1 ξ_1 + b_i2 ξ_2 + …

  p_i = observables (i = 1, …, D_data)
  ξ_j = eigenvectors, or principal axes of the data hyperellipsoid (j = 1, …, D_stat)
  e_j = eigenvalues, or amplitudes of ξ_j

Correlation Vector Diagrams: Projections of the data and observable axes onto the planes defined by the eigenvectors

  cos φ_12 = correlation coef. of p_1 and p_2
[figure: observable axes p_1, p_2, p_3 projected onto the plane of ξ_1 and ξ_2, with the angle φ_12 between p_1 and p_2]

An Example, Using VOStat

Here is a data file, with 6 observed and 5 derived quantities (columns) for a few hundred elliptical galaxies (rows, data vectors).

Pairwise Plots for Independent Observables
[figure]
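The eigen-problem above, written out with numpy: diagonalize the correlation matrix of the standardized observables p_i; the eigenvectors ξ_j are the principal axes, and the eigenvalues e_j their amplitudes. The covariance structure here is invented:

    import numpy as np

    rng = np.random.default_rng(0)
    p = rng.multivariate_normal([0, 0, 0],
                                [[1.0, 0.8, 0.2],
                                 [0.8, 1.0, 0.1],
                                 [0.2, 0.1, 1.0]], size=1000)

    Z = (p - p.mean(0)) / p.std(0)    # standardized observables p_i
    R = np.corrcoef(Z, rowvar=False)  # shape of the data "hyperellipsoid"
    e, xi = np.linalg.eigh(R)         # eigenvalues e_j, eigenvectors xi_j
    # the columns of xi hold the coefficients a_ij in xi_i = a_i1 p_1 + a_i2 p_2 + ...
    print(e[::-1])                    # amplitudes, largest first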

Their Correlation Matrix:

You can learn a lot just from the inspection of this matrix, and comparison with the pairwise (bivariate) plots …

Now Let’s Do the Principal Component Analysis (PCA):

5 independent observables, but only 2 significant dimensions: the first 2 components account for all of the sample variance! The data sit on a plane in a 5-dim. parameter space: this is the Fundamental Plane of elliptical galaxies. Any one variable can be expressed as a combination of any 2 others, within errors.

PCA Results in More Detail
(This from a slightly different data set …)
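The same effect as in the VOStat example, reproduced on synthetic data: five observables generated from two latent variables sit on a plane in 5-D, so the first two principal components carry essentially all of the variance (scikit-learn's PCA is assumed):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    # 2 latent variables generate 5 observables -> the data lie on a plane in 5-D
    u, v = rng.normal(size=(2, 400))
    obs = np.column_stack([u, v, u + v, 2*u - v, 0.5*u + 3*v])
    obs += rng.normal(0, 0.01, obs.shape)  # small "measurement errors"

    pca = PCA().fit(StandardScaler().fit_transform(obs))
    # the first 2 components account for essentially all of the sample variance
    print(np.round(pca.explained_variance_ratio_, 4))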

Now Project the Observable Axes Onto the Plane Defined by the Principal Eigenvectors:

[figure: eigenvectors and projections of the parameter axes]
The cosines of the angles between parameter axes give the correlation coefficients; compare with the correlation matrix.

Another Approach: Correlated Residuals

[figures: Y vs. X and Z vs. X each show only a mediocre correlation about a best-fit line, with residuals ΔZ; but the residuals ΔZ correlate with the third variable, Y → the data lie on a plane in the XYZ space. A direct Z vs. Y plot shows only a poor correlation.]

Bivariate Correlations in Practice

Once the dimensionality has been established from PCA, one can either derive the optimal bivariate combinations of variables from the PCA coefficients, or optimize the mixing ratios for any two variables vs. a third one (for a 2-dimensional manifold; the generalization to higher-dimensional manifolds is obvious).
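A sketch of the correlated-residuals trick from the slide: fit Z against X alone, then check whether the residuals ΔZ correlate with Y; if they do, the data lie on a plane in XYZ. Numpy only; the data are synthetic:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 400
    X = rng.normal(size=n)
    Y = rng.normal(size=n)
    Z = 1.5 * X - 0.7 * Y + rng.normal(0, 0.05, n)  # a plane in the XYZ space

    a, b = np.polyfit(X, Z, 1)      # best-fit line of the mediocre Z-X relation
    dZ = Z - (a * X + b)            # residuals about the best-fit line
    print(np.corrcoef(X, Z)[0, 1])  # mediocre correlation
    print(np.corrcoef(Y, dZ)[0, 1]) # strong: the residuals track the 3rd variable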

Some Data Mining Software & Projects

General data mining software packages:
• Weka (Java): http://www.cs.waikato.ac.nz/ml/weka/
• Weka4WS (Grid-enabled): http://grid.deis.unical.it/weka4ws/
• RapidMiner: http://www.rapidminer.com/

Astronomy-specific software and/or user clients:
• VO-Neural: http://voneural.na.infn.it/
• AstroWeka: http://astroweka.sourceforge.net/
• OpenSkyQuery: http://www.openskyquery.net/
• ALADIN: http://aladin.u-strasbg.fr/
• MIRAGE: http://cm.bell-labs.com/who/tkh/mirage/
• AstroBox: http://services.china-vo.org/

Astronomical and/or scientific data mining projects:
• GRIST: http://grist.caltech.edu/
• ClassX: http://heasarc.gsfc.nasa.gov/classx/
• LCDM: http://dposs.ncsa.uiuc.edu/
• F-MASS: http://www.itsc.uah.edu/f-mass/
• NCDM: http://www.ncdm.uic.edu/

Examples of Data Mining Packages: Weka
http://www.cs.waikato.ac.nz/ml/weka/

• A collection of open source machine learning algorithms for data mining tasks
• Algorithms can either be applied directly to a dataset or called from your own Java code
• Comes with its own GUI
• Contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization

Examples of Data Mining Packages: Mirage
http://cm.bell-labs.com/who/tkh/mirage/

• A Java package for exploratory data analysis (EDA), correlation mining, and interactive pattern discovery

Here are some useful books:
• P.-N. Tan, M. Steinbach, & V. Kumar, Introduction to Data Mining, Addison-Wesley, 2005. ISBN: 9780321321367
• M. Dunham, Data Mining: Introductory and Advanced Topics, Prentice-Hall, 2002. ISBN: 9780130888921
• R. J. Roiger & M. W. Geatz, Data Mining: A Tutorial-Based Primer, Addison-Wesley, 2002. ISBN: 9780201741285

• Lots of good links to follow from the class webpage!