Data Mining and Materials Informatics: a primer
Krishna Rajan
Department of Materials Science and Engineering NSF Intl. Materials Institute Combinatorial Sciences & Materials Informatics Collaboratory Iowa State University
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 OUTLINE
What is data? primary, secondary , derivative sources of data
What do we mean by mining data? data correlations – dimensional analysis approach data correlations – data mining
What can we learn from data mining?
Classification data mining databases hierarchy of data trends in data Prediction predicting structure-property relationships computational informatics vs. computational materials science data mining for building databases
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 INFORMATICS
What is it ? Why ?
• Searching for patterns of • Establish new correlations behavior among multivariate data sets • Identify outliers
• Can pattern recognition lead • Enlarge database / virtual libraries to predictions? • Evaluate databases
• Establish predictions
Krishna Rajan Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 COMPLEXITY OF BIOSYSTEMS
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 INFORMATICS STRATEGY
Data + Correlations + Theory = Knowledge-base
Informatics- Based Design
• Atomistic based calculations • Continuum based theories
• Machine learning • Data compression • Pattern recognition Machine learning toolkit: • Classification • Prediction • Simulations • Hybrid tools ULTRA LARGE • Combinatorially derived datasets SCALE • Digital libraries INFORMATION • Spectral and imaging libraries SPACE
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 INFORMATICS-BASED DESIGN STRATEGIES
BIOLOGICAL ANALOGUE CRYSTAL STRUCTURE ANALOGUE
Ideker and Lauffenburger:(2003) Nye (1950)
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 Statistical Tools Materials Databases Experimental data Computational derived Principal Component Analysis data sets (PCA) Simulations
Prediction Support Vector Machine (SVM)
Latent Variables / Combinatorial & Partial Least Squares (PLS) Classification Spectral Libraries Input & Output Association Mining (AM)
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 MACHINE LEARNING AND MATERIALS DISCOVERY
Requirements Pitfalls
• Global minima • Local minima • Categorical data • Categorical data difficult
• Missing / variable data • Convergence difficult for large
• Skewed distributions data sets
• Large data sets • Few outliers can lead to poor
• Scalable performance
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 Database administration & management
DATA STORAGE • thermodynamic, crystallographic & property data bases • combinatorial experimental data
Oracle, Unix, Supercomputing, SQL
• taxonomy and ontology of materials science data DATA CURATION • data sharing , networking / cyber infrastructure
JAVA, HTML, Python
• object oriented programming language DATA REPRESENTATION • visualization of high dimensional data
Data mining algorithms
• Clustering analysis KNOWLEDGE • Quantitative Structure-Activity Relationships DISCOVERY (QSAR) for materials design
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 WHY COUPLE COMPUTATIONAL MATERIALS SCIENCE AND INFORMATICS ?
• Accelerated insertion of materials into engineering systems
• Rapid multiscale design and optimization of materials properties
• Establishment of new structure –property correlations among large, heterogeneous and distributed data sets
• Discovery of new chemistries and compounds
• Formulation and / or refinement of new theories for materials behavior
• Rapid identification of critical data and theoretical needs for future problems
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 SOFT MODELING vs. HARD MODELING
Interpretation SoftSoft modeling modeling Data Mining & Visualization Feature Extraction Data Warehousing Knowledge
Patterns Transformed Data Constitutive Equations
Dimensional Analysis and/or Original Theory HardHardmodelingmodeling Data Data bases
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th Krishna2006 Rajan DATA MAP
22x22 (484 cells – 2500 data points): Silicon nitride descriptors
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 DIMENSIONALITY REDUCTION
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 EIGENVALUE DECOMPOSITION Sp= λ p
eigenvector eigenvalue SP= PΛ
Eigenvalues on the diagonal of this diagonal matrix
Eigenvectors forming the columns of this matrix If V is nonsingular, this become the eigenvalue decomposition
Eigenvalue Matrix SPP=Λ−1 Loading Matrix=Eigenvector Matrix
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 PREDICTIONS PLS ‘Upgradeable’ linear predictive tool Quadratic and interaction terms :flexibility
T U
Factors Responses
Calculate latent factors from original predictor variables T=XW where T- latent factors, X – predictors, W - weights
Use latent factors to predict response Y=TQ+E where Y - responses, Q – loadings, E- noise
• W maximizes covariance between Y and T
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 INFORMATICS AND THE PERIODIC TABLE
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 DATA WAREHOUSING
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 BIVARIATE MAPS
1. # of atoms/unit cell 2. Valence electron # 3. Electronegativity 4. 1st Ionization potential 5. Atomic radius 6. Melting point 7. Boiling point 8. Density
6
4
2 20
10
0 4
2
0 10
5
0
4 3 2 1 4 3 2 3
2
1 4000
2000
0 10000
5000
0 40
20 5 8 7 6
0 2 4 6 0 10 20 0 2 4 0 5 10 1 2 3 0 2000 4000 0 5000 10000 0 20 40 1 2 3 4 5 6 7 8
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 SEEKING PATTERNS
6
4
2
0
-2
-4
-6 6
4
PC1 2
0
-2
-4
-6
2
1
0 PC3 PC2 -1
-2
-6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6 -2 -1 0 1 2 PC1 PC2 PC3
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 MENDELEEV SEQUENCING
5 54 4 Cs, Ba, La,55 Ce, Pr, Nd, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb, 27 Lu, Hf, Ta, W, Re, Os, Ir, Pt, Au, Hg, Tl, Pb, and Bi 3 14 5657 2 28 52 58 53 15 1 3029 45
26 41 4438 31 0 324937 42 51 12 16 25 36 4043 5 13 17 3934 46 3 2 18 33 35 Scores on PC 3 (12.02%) 4847 50 -1 1 11 4 24 22 20 19 10 9 23 -2 Na, Mg, Al 76 21 8 -3 -4 -3 -2 -1 0 1 2 K, Ca, Sc, Ti, V, Cr, Mn, Fe, Co, Ni, Cu,Scores Zn, and on Ga PC 2 (22.61%) Sr, Y, Zr, Nb, Mo, Tc, Ru, Rh, Pd, Ag, Cd, In, and Sn
Each cluster represent each row in the periodic table Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 DEVELOPING PHYSICAL LAWS
• Location of property in loading plot indicates influence of property on PC • Atomic weight has no influence on PC1, high influence on PC2 and 3
Valence • Clustered properties indicate Electron Melting relationships Heat of number Vaporization Point • Electrical and thermal conductivity Heat of Boiling • Melting point and density Fusion Point Pseudopotential • Molar volume and atomic radius Martynov- Radius Batsanov • Melting and boiling points, heats of Covalent electronegativity Specific Heat radius fusion and vaporization, and Molar valence number Metallic Density Volume • Pauling electronegativity and first radius Pauling ionization potential electronegativity Atomic First Weight Ionization Thermal Potential Conductivity
Electrical Conductivity
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 DEVELOPING PHYSICAL LAWS
Activation energy for self-diffusion versus melting point
70 Ni Fe Pt Linear regression Co 60
Cu 50 Ag 18R g 40 Au Al 29. Copper (Cu) 30 Bokshtein & Zhukhovitskii Pb
Activation Energy, Q [kcal/mol] 6. Carbon, diamond (C) 100 20 500 1000 1500 2000 2500 T [K] m 32. Germanium (Ge) From Glicksman
10
52. Tellurium (Te) Thermal Conductivity (W/(m*K)) Conductivity Thermal
83. Bismuth (Bi) 1 34. Selenium (Se)
1e-12 1e-10 1e-8 1e-6 1e-4 0.01 RPI Electrical Conductivity (0.00000001*ohm*m)
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 PCA of 34 binary, ternary, quaternary compounds: Score plot
9 descriptors from NIST data - Lattice parameters (a, b, c, α, β, γ) - c/a, b/c - V/abc Orthorhombic 21 22 9
21 8 10 233 Scores Plot Tetragonal 15
(I-42d) 16 Hexagonal 4 26 25 2827 10 Cubic 0 11 2024 305733 18 21 2310 2238291 -10 31343226272825 16 154 121314 3431 19 17 32 -20 12 1312 14
PC( 5.85%) 3 -30 29
-40 50 6 Monoclinic
Tetragonal 0 50 0 (P42/mmc) -50 PCA Score plot (34 binary, ternary, quaternary compounds) PC 2 (29.17%) -50 -100 PC 1 (61.71%)
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 Cluster analysis of 34 binary, ternary, quaternary compounds
α, β, γ = 90° α, β, γ≠90°
Tetragonal (I4/mcm) 5 7 11 20 24 18 30 33 6 17 19 1 2 9 22 25 28 27 26 3 23 10 15 8 4 16 21 12 13 14 31 32 34 29 Hexagonal
Cubic (P63/mmc) Tetragonal Monoclinic (I-42d) (Im-3,Pm-3) Cubic Hexagonal (P-62m,P-6) Orthorhombic (Fm-3m) Cubic Hexagonal (P63/mmc,P6/mmm, P-6m2) (P4132,P-43m,Pm-3, / Trigonal (P-3m1) Pm-3,Fd-3ms) Tetragonal Krishna Rajan TMS / ASM Materials Informatics Workshop (P42/mmc) Cincinatti, OH October 15th 2006 RULE EXTRACTION
Atomic packing of single element
Ex.) Engel’s model using “# of sp electrons/atom”†
BCC < 1.5 HCP 1.7 – 2.1 FCC 2.5-3
Li Al Mg Cu Sc
sp electrons/atom† 1 3 2 11 2
BCC FCC HCP FCC HCP
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 PATTERN DETECTION
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 DATA MINING DATABASES
classification prediction
6
4
2
0
PC3(19.47%) -2
-4 covalent ionic 6 -6-6 4 -4 2 -2 0 0 ) -2 % PC 2 3 1(4 -4 .4 9.8 4 2 9 (2 %) 6 -6 2 PC Graduciz, Gaune-Escard Univ.Novi Sad, Serbia CNRS, Marseilles
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 COMPUTATIONAL ISSUES Focus on properties of signal / macroscopic behavior rather than noise/ error. Assume complexity !!! Establish multivariate database: Data can come across length and time scales Seek DIVERSITY in datasets Model relationships in data to seek Utilize data dimensionality reduction techniques heuristic relationships:
Analyze variation and correlation Advanced statistical learning tools in data can deal with: Covariance - skewed data - missing data Establish correlations across - differentiate between local diverse data sets ( ie. length & time scales) and global minima - ultra large scale datasets Identify outliers: explore cause - variable uncertainty
Develop predictive models •Singular value decomposition - Target requirements of missing •Support vector machines data •Association mining - Quantitatively assess data diversity •Fuzzy clustering Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th Krishna2006 Rajan CRYSTAL CHEMISTRY DESIGN
Ching et.al J. Amer. Ceram.Soc. 85 75-80 (2002)
1. Assess influence of latent variables ( i.e. electronic structure parameters) on properties of known data
2. Establish heuristic relationships on database of all input variables instead of phenomenological relationships in bivariate manner
3. Use statistical learning to predict new materials behavior on new multivariate input data
4. Inverse problem approach to formulate quantitative structure- property relationships
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th Krishna2006 Rajan Virtual library: via informatics
Refractory metals
“Real” library: via first principles calculations
Krishna Rajan Suh and Rajan (2005, 2006) TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th Krishna2006 Rajan LINKING DISPARATE LENGTH SCALES Visualizing Associations
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 SYNTHESIZING REMOTE DATA SETS
hcp
fcc
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 STRATEGY FOR MATERIALS DESIGN
• Search and retrieve • Structure databases
• Refining descriptors • Diffraction spectra databases
• Developing predictions • Hyper spectral imaging databases
• Filling in missing data
Mapping / Visualization of databases:
• Correlations • Trend analysis
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 CONCLUSIONS
The Facts of the Matter: Finding, Science was originally empirical, like understanding, and using information about our Leonardo, making wonderful drawings of physical world (2000) nature. Next came the theorists who tried to write down the equations that Information science based design of explained the observed behaviors, like Kepler or Einstein. Then when we got to materials……next stage of the scientific complex enough systems like the clustering discovery process of a million galaxies, there came the computer simulations, the computational branch of science. Now we are getting into the data exploration part of science, which is kind of a little bit of them all”..
Division of Materials Research:
Dr. Alex Szalay: Virtual Observatory Project Cyber infrastructure and cyber discovery in materials science (2006)
May 20th 2003
Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006