<<

Data Mining and Materials : a primer

Krishna Rajan

Department of and Engineering NSF Intl. Materials Institute Combinatorial Sciences & Collaboratory Iowa State University

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 OUTLINE

What is data? primary, secondary , derivative sources of data

What do we mean by mining data? data correlations – dimensional analysis approach data correlations –

What can we learn from data mining?

Classification data mining hierarchy of data trends in data Prediction predicting structure-property relationships computational informatics vs. computational materials science data mining for building databases

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 INFORMATICS

What is it ? Why ?

• Searching for patterns of • Establish new correlations behavior among multivariate data sets • Identify outliers

• Can lead • Enlarge / virtual libraries to predictions? • Evaluate databases

• Establish predictions

Krishna Rajan Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 COMPLEXITY OF BIOSYSTEMS

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 INFORMATICS STRATEGY

Data + Correlations + Theory = Knowledge-base

Informatics- Based Design

• Atomistic based calculations • Continuum based theories

• Machine • Data compression • Pattern recognition toolkit: • Classification • Prediction • Simulations • Hybrid tools ULTRA LARGE • Combinatorially derived datasets SCALE • Digital libraries • Spectral and imaging libraries SPACE

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 INFORMATICS-BASED DESIGN STRATEGIES

BIOLOGICAL ANALOGUE CRYSTAL STRUCTURE ANALOGUE

Ideker and Lauffenburger:(2003) Nye (1950)

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 Statistical Tools Materials Databases ƒ Experimental data ƒ Computational derived Principal Component Analysis data sets (PCA) ƒSimulations

Prediction Support Vector Machine (SVM)

Latent Variables / Combinatorial & Partial Least Squares (PLS) Classification Spectral Libraries Input & Output Association Mining (AM)

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 MACHINE LEARNING AND MATERIALS DISCOVERY

Requirements Pitfalls

• Global minima • Local minima • Categorical data • Categorical data difficult

• Missing / variable data • Convergence difficult for large

• Skewed distributions data sets

• Large data sets • Few outliers can lead to poor

• Scalable performance

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 Database administration & management

DATA STORAGE • thermodynamic, crystallographic & property data bases • combinatorial experimental data

Oracle, Unix, Supercomputing, SQL

and ontology of materials science data DATA CURATION • data sharing , networking / cyber infrastructure

JAVA, HTML, Python

• object oriented DATA REPRESENTATION • visualization of high dimensional data

Data mining

• Clustering analysis KNOWLEDGE • Quantitative Structure-Activity Relationships DISCOVERY (QSAR) for materials design

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 WHY COUPLE COMPUTATIONAL MATERIALS SCIENCE AND INFORMATICS ?

• Accelerated insertion of materials into engineering systems

• Rapid multiscale design and optimization of materials properties

• Establishment of new structure –property correlations among large, heterogeneous and distributed data sets

• Discovery of new chemistries and compounds

• Formulation and / or refinement of new theories for materials behavior

• Rapid identification of critical data and theoretical needs for future problems

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 SOFT MODELING vs. HARD MODELING

Interpretation SoftSoft modeling modeling Data Mining & Visualization Feature Extraction Data Warehousing Knowledge

Patterns Transformed Data Constitutive Equations

Dimensional Analysis and/or Original Theory HardHardmodelingmodeling Data Data bases

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th Krishna2006 Rajan DATA MAP

22x22 (484 cells – 2500 data points): Silicon nitride descriptors

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 DIMENSIONALITY REDUCTION

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 EIGENVALUE DECOMPOSITION Sp= λ p

eigenvector eigenvalue SP= PΛ

Eigenvalues on the diagonal of this diagonal matrix

Eigenvectors forming the columns of this matrix If V is nonsingular, this become the eigenvalue decomposition

Eigenvalue Matrix SPP=Λ−1 Loading Matrix=Eigenvector Matrix

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 PREDICTIONS PLS ƒ ‘Upgradeable’ linear predictive tool ƒ Quadratic and interaction terms :flexibility

T U

Factors Responses

Calculate latent factors from original predictor variables T=XW where T- latent factors, X – predictors, W - weights

Use latent factors to predict response Y=TQ+E where Y - responses, Q – loadings, E- noise

• W maximizes covariance between Y and T

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 INFORMATICS AND THE PERIODIC TABLE

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 DATA WAREHOUSING

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 BIVARIATE MAPS

1. # of atoms/unit cell 2. Valence electron # 3. Electronegativity 4. 1st Ionization potential 5. Atomic radius 6. Melting point 7. Boiling point 8. Density

6

4

2 20

10

0 4

2

0 10

5

0

4 3 2 1 4 3 2 3

2

1 4000

2000

0 10000

5000

0 40

20 5 8 7 6

0 2 4 6 0 10 20 0 2 4 0 5 10 1 2 3 0 2000 4000 0 5000 10000 0 20 40 1 2 3 4 5 6 7 8

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 SEEKING PATTERNS

6

4

2

0

-2

-4

-6 6

4

PC1 2

0

-2

-4

-6

2

1

0 PC3 PC2 -1

-2

-6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6 -2 -1 0 1 2 PC1 PC2 PC3

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 MENDELEEV SEQUENCING

5 54 4 Cs, Ba, La,55 Ce, Pr, Nd, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb, 27 Lu, Hf, Ta, W, Re, Os, Ir, Pt, Au, Hg, Tl, Pb, and Bi 3 14 5657 2 28 52 58 53 15 1 3029 45

26 41 4438 31 0 324937 42 51 12 16 25 36 4043 5 13 17 3934 46 3 2 18 33 35 Scores on PC 3 (12.02%) 4847 50 -1 1 11 4 24 22 20 19 10 9 23 -2 Na, Mg, Al 76 21 8 -3 -4 -3 -2 -1 0 1 2 K, Ca, Sc, Ti, V, Cr, Mn, Fe, Co, Ni, Cu,Scores Zn, and on Ga PC 2 (22.61%) Sr, Y, Zr, Nb, Mo, Tc, Ru, Rh, Pd, Ag, Cd, In, and Sn

Each cluster represent each row in the periodic table Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 DEVELOPING PHYSICAL LAWS

• Location of property in loading plot indicates influence of property on PC • Atomic weight has no influence on PC1, high influence on PC2 and 3

Valence • Clustered properties indicate Electron Melting relationships Heat of number Vaporization Point • Electrical and thermal conductivity Heat of Boiling • Melting point and density Fusion Point Pseudopotential • Molar volume and atomic radius Martynov- Radius Batsanov • Melting and boiling points, heats of Covalent electronegativity Specific Heat radius fusion and vaporization, and Molar valence number Metallic Density Volume • Pauling electronegativity and first radius Pauling ionization potential electronegativity Atomic First Weight Ionization Thermal Potential Conductivity

Electrical Conductivity

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 DEVELOPING PHYSICAL LAWS

Activation energy for self-diffusion versus melting point

70 Ni Fe Pt Linear regression Co 60

Cu 50 Ag 18R g 40 Au Al 29. Copper (Cu) 30 Bokshtein & Zhukhovitskii Pb

Activation Energy, Q [kcal/mol] 6. Carbon, diamond (C) 100 20 500 1000 1500 2000 2500 T [K] m 32. Germanium (Ge) From Glicksman

10

52. Tellurium (Te) Thermal Conductivity (W/(m*K)) Conductivity Thermal

83. Bismuth (Bi) 1 34. Selenium (Se)

1e-12 1e-10 1e-8 1e-6 1e-4 0.01 RPI Electrical Conductivity (0.00000001*ohm*m)

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 PCA of 34 binary, ternary, quaternary compounds: Score plot

9 descriptors from NIST data - Lattice parameters (a, b, c, α, β, γ) - c/a, b/c - V/abc Orthorhombic 21 22 9

21 8 10 233 Scores Plot Tetragonal 15

(I-42d) 16 Hexagonal 4 26 25 2827 10 Cubic 0 11 2024 305733 18 21 2310 2238291 -10 31343226272825 16 154 121314 3431 19 17 32 -20 12 1312 14

PC( 5.85%) 3 -30 29

-40 50 6 Monoclinic

Tetragonal 0 50 0 (P42/mmc) -50 PCA Score plot (34 binary, ternary, quaternary compounds) PC 2 (29.17%) -50 -100 PC 1 (61.71%)

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 Cluster analysis of 34 binary, ternary, quaternary compounds

α, β, γ = 90° α, β, γ≠90°

Tetragonal (I4/mcm) 5 7 11 20 24 18 30 33 6 17 19 1 2 9 22 25 28 27 26 3 23 10 15 8 4 16 21 12 13 14 31 32 34 29 Hexagonal

Cubic (P63/mmc) Tetragonal Monoclinic (I-42d) (Im-3,Pm-3) Cubic Hexagonal (P-62m,P-6) Orthorhombic (Fm-3m) Cubic Hexagonal (P63/mmc,P6/mmm, P-6m2) (P4132,P-43m,Pm-3, / Trigonal (P-3m1) Pm-3,Fd-3ms) Tetragonal Krishna Rajan TMS / ASM Materials Informatics Workshop (P42/mmc) Cincinatti, OH October 15th 2006 RULE EXTRACTION

Atomic packing of single element

Ex.) Engel’s model using “# of sp electrons/atom”†

BCC < 1.5 HCP 1.7 – 2.1 FCC 2.5-3

Li Al Mg Cu Sc

sp electrons/atom† 1 3 2 11 2

BCC FCC HCP FCC HCP

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 PATTERN DETECTION

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 DATA MINING DATABASES

classification prediction

6

4

2

0

PC3(19.47%) -2

-4 covalent ionic 6 -6-6 4 -4 2 -2 0 0 ) -2 % PC 2 3 1(4 -4 .4 9.8 4 2 9 (2 %) 6 -6 2 PC Graduciz, Gaune-Escard Univ.Novi Sad, Serbia CNRS, Marseilles

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 COMPUTATIONAL ISSUES Focus on properties of signal / macroscopic behavior rather than noise/ error. Assume complexity !!! Establish multivariate database: Data can come across length and time scales Seek DIVERSITY in datasets Model relationships in data to seek Utilize data dimensionality reduction techniques heuristic relationships:

Analyze variation and correlation Advanced statistical learning tools in data can deal with: Covariance - skewed data - missing data Establish correlations across - differentiate between local diverse data sets ( ie. length & time scales) and global minima - ultra large scale datasets Identify outliers: explore cause - variable uncertainty

Develop predictive models •Singular value decomposition - Target requirements of missing •Support vector machines data •Association mining - Quantitatively assess data diversity •Fuzzy clustering Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th Krishna2006 Rajan CRYSTAL DESIGN

Ching et.al J. Amer. Ceram.Soc. 85 75-80 (2002)

1. Assess influence of latent variables ( i.e. electronic structure parameters) on properties of known data

2. Establish heuristic relationships on database of all input variables instead of phenomenological relationships in bivariate manner

3. Use statistical learning to predict new materials behavior on new multivariate input data

4. Inverse problem approach to formulate quantitative structure- property relationships

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th Krishna2006 Rajan Virtual library: via informatics

Refractory metals

“Real” library: via first principles calculations

Krishna Rajan Suh and Rajan (2005, 2006) TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th Krishna2006 Rajan LINKING DISPARATE LENGTH SCALES Visualizing Associations

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 SYNTHESIZING REMOTE DATA SETS

hcp

fcc

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 STRATEGY FOR MATERIALS DESIGN

• Search and retrieve • Structure databases

• Refining descriptors • Diffraction spectra databases

• Developing predictions • Hyper spectral imaging databases

• Filling in missing data

Mapping / Visualization of databases:

• Correlations • Trend analysis

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006 CONCLUSIONS

The Facts of the Matter: Finding, Science was originally empirical, like understanding, and using information about our Leonardo, making wonderful drawings of physical world (2000) nature. Next came the theorists who tried to write down the equations that based design of explained the observed behaviors, like Kepler or Einstein. Then when we got to materials……next stage of the scientific complex enough systems like the clustering discovery process of a million galaxies, there came the computer simulations, the computational branch of science. Now we are getting into the data exploration part of science, which is kind of a little bit of them all”..

Division of Materials Research:

Dr. Alex Szalay: Virtual Observatory Project Cyber infrastructure and cyber discovery in materials science (2006)

May 20th 2003

Krishna Rajan TMS / ASM Materials Informatics Workshop Cincinatti, OH October 15th 2006