

Annual Review of Materials Research

Machine Learning in Materials Discovery: Confirmed Predictions and Their Underlying Approaches

James E. Saal,1 Anton O. Oliynyk,2 and Bryce Meredig1

1Citrine Informatics, Redwood City, California 94063, USA; email: [email protected]
2Department of Chemistry and Biochemistry, Manhattan College, Riverdale, New York 10471, USA

Annu. Rev. Mater. Res. 2020. 50:49–69

First published as a Review in Advance on May 18, 2020

Keywords: machine learning, materials discovery, materials informatics

The Annual Review of Materials Research is online at matsci.annualreviews.org

https://doi.org/10.1146/annurev-matsci-090319-010954

Copyright © 2020 by Annual Reviews. All rights reserved.

Abstract

The rapidly growing interest in machine learning (ML) for materials discovery has resulted in a large body of published work. However, only a small fraction of these publications includes confirmation of ML predictions, either via experiment or via physics-based simulations. In this review, we first identify the core components common to materials informatics discovery pipelines, such as training data, choice of ML algorithm, and measurement of model performance. Then we discuss some prominent examples of validated ML-driven materials discovery across a wide variety of materials classes, with special attention to methodological considerations and advances. Across these case studies, we identify several common themes, such as the use of domain knowledge to inform ML models.


1. INTRODUCTION

Over the past decade, machine learning (ML) has emerged as a powerful tool to accelerate materials development (1–5). Academic, government, and commercial entities are broadly deploying ML in service of materials discovery. Publication activity in ML for materials is growing exponentially even as a fraction of all materials research (Figure 1). Despite an increasing body of literature on data-driven materials research, which we refer to as materials informatics, only a fraction of published studies culminate in predictions that are subsequently validated by an experiment, either in the laboratory or as a "virtual" experiment via physics-based simulation. A trained ML model is merely a means to an end, and the utility of materials informatics is fully realized only when ML predictions are confirmed. In this review, we (a) describe the key components of a materials informatics discovery pipeline; (b) highlight recent works that describe validation of materials informatics predictions, as summarized in Table 1; and (c) note some materials discovery–specific considerations for ML. We begin by describing a typical materials informatics pipeline in more detail.

2. THE MATERIALS INFORMATICS DISCOVERY PIPELINE

In this section, we discuss critical components of a materials informatics pipeline common to validated ML studies. These standard steps are summarized in the generalized pipeline of Figure 2. The pipeline begins with establishing a materials data set for training, as well as a set of materials descriptors to extend the data with available physical information. This data set is then used to train an ML model, which is used to make a prediction of novel materials for validation.

2.1. Training Data

Data of sufficient quality and quantity are an essential prerequisite for the successful application of ML methods to materials problems. Large companies in the technology industry, such as Google,

Figure 1

Share of materials and chemistry publications referencing machine learning (ML) as a function of time. The data are normalized to account for the overall exponential growth (6) in scientific publications over time, illustrating the relative growth of ML-related work.

Table 1 Summary of validated machine learning (ML) predictions. For each of the 23 case studies, the table lists the materials class, application, predicted properties, initial training data set, ML algorithm, design space, number of candidates evaluated (by experiment or DFT calculation), reference, and the year the work was submitted for publication. The materials classes span inorganic binary and ternary solids (including Heusler compounds and perovskites), metal alloys and bulk metallic glasses, high-entropy alloys, metal–organic frameworks, polymers, small molecules, and battery and phosphor materials; training data sources include the ICSD (95), Pearson's Crystal Data (92), the Materials Project (28), PubChem (12), PoLyInfo (98), QM9 (99), the Landolt–Börnstein handbook (79), and literature reviews.


Figure 2

Schematic of a materials informatics discovery pipeline: training data (commercial databases, publications, laboratory data, simulations); descriptor representation (chemistry and structure, heuristics, domain knowledge); design space (define where to search) and ML algorithm with cross-validation (measure performance); model and data refinement; and validation by experiment or theory, with interpretation and visualization. The pipeline begins with input property data, which require a descriptor (i.e., vector) representation suitable for machine learning (ML). Model training and refinement comprise an iterative process, resulting in a model with known predictive performance that may be used to evaluate candidate materials. Data gathered from experiments or simulations should be used to further improve models, as described by Ren et al. (7).

Facebook, and Amazon, are able to train ML models on up to 10¹¹ examples (8). However, data are far more scarce in materials science, and data sets for materials discovery typically include 10¹–10⁴ training examples, as seen in Table 1.

Training data quality is associated with reliability and consistency. Reliability issues can result from unreported sources of variance (e.g., poorly calibrated or aged instruments) or human factors (e.g., typos that arise during manual data entry). Data consistency is related to the method that was used to generate the data. Inconsistency can be experimental (for example, thermal conductivity can be measured with different techniques, giving slightly different results) or computational (different simulation input choices can give different results).

2.1.1. Challenges in training data aggregation. The source of training data for each materials problem must be considered on a case-by-case basis due to the lack of a centralized, homogeneous database of all materials measurements. Such authoritative databases are more common in the biological and pharmaceutical sciences; examples include the Cambridge Crystal Structure Database (9), ChemBank (10), GenBank (11), and PubChem (12). The materials science community has fewer data resources, although the Materials Data Facility (13), the Novel Materials Discovery (NOMAD) Repository (14), and Open Citrination (15) have emerged to provide free-of-charge materials data services to the research community. However, for most materials problems, relevant experimental data are scattered across publications in the literature (16), requiring researchers to manually extract and structure training data sets (17–22). While manually constructed data sets are highly time and resource intensive to build, they benefit from expert curation, which can be essential to providing the context needed for successful ML models.


2.1.2. Density functional theory as a source of training data. Among successful materials informatics use cases, density functional theory (DFT) (23, 24) is a widely used source of training data and descriptors. DFT, as a means of generating large-scale materials data, has become more accessible over time (25, 26) as computational power has increased and the software has improved. The construction of high-throughput DFT (HT-DFT) databases (27–29) has been a key contributor to many early informatics successes (30). Advancing beyond relatively simple calculations of thermodynamic stability, more computationally costly HT-DFT data sets are enabling data-driven solutions to materials problems, such as DFT-calculated elastic tensors to predict superhard materials (31) and discovery of novel phosphors for solid-state lighting (32).

HT-DFT databases offer an attractive supplement to experimental databases for use in informatics as they can be resource efficient to build, comprehensive across chemistries, well structured, and internally consistent. For instance, identifying the DFT-calculable Debye temperature as a reliable indicator for the complex materials property of photoluminescent quantum yield enabled the use of an HT-DFT elastic tensor data set to predict novel light-emitting diode (LED) materials (32). DFT, when coupled with active learning, can also be a rapid engine for exploring a design space and optimizing properties (33).

2.2. Descriptor Representation

Any successful application of ML to materials discovery relies on a suitable choice of representation (34). Representation refers to how a material is encoded in a machine-readable format, typically as a fixed-length vector of descriptors (also referred to as features or input variables). For

instance, Ni3Al could be represented as simply the combination of character strings in the composition, the atomic coordinates of Ni and Al in the L12 crystal structure, or a scanning electron micrograph of the microstructure. Representation is an opportunity to inject known physical information into an ML prediction problem. For example, the facts that Ni crystallizes in the face-centered cubic structure and has a melting point of 1,455°C can be directly exploited in differentiating Ni from other elements when training ML models. The choice of representation can have a large effect on ML model performance, as observed by Askerka et al. (35) for the thermodynamic stability of double perovskites (Figure 3).

In the majority of case studies in Table 1, researchers used simple empirical descriptors associated with the elemental compositions of materials under investigation. The magpie (40) and matminer (37) packages are widely used sources of these chemical features, which include physical concepts such as atomic size, electronegativity, and electron configuration. The selection

of a high-quality representation is likely to be important for success in using ML for materials discovery.

Across many materials problems, the optimal choice of representation is nonobvious. For example, recent research (41, 42) has developed representations for the concept of a grain boundary, which may be intuitively clear to a human scientist but is not straightforwardly captured as a vector or matrix object suitable for linear algebra operations. Likewise, new atomic-scale representations for small molecules (43, 44), periodic systems (e.g., crystal structures) (39, 45), and combinations thereof (34, 46) are being actively developed.

Including as many descriptors as possible is one way to ensure that no known physics is left out of a modeling exercise. On the other hand, many ML algorithms are susceptible to degraded performance in the presence of correlated and/or meaningless descriptors (47). The processes of dimensionality reduction [e.g., principal component analysis (48)] and feature selection aim to identify a reduced set of maximally informative and ideally uncorrelated descriptors for input to an ML model. For example, a metric called cluster resolution (49), which uses the relative positions


Figure 3

Askerka et al. (35) investigated the representation dependence of machine learning model predictions (mean absolute error, in eV/atom) for double perovskite thermodynamic stability. Their learning-in-templates (LiT) approach assumes nominal atomic positions, which they compared to using optimized atomic positions. They benchmarked several crystal structure representations from the literature: radial distribution functions (RDF) and partial radial distribution functions (PRDF) (36), density features and bond fractions (37), the orbital field matrix (38), and Voronoi tessellations (39). Figure adapted with permission from Reference 35. Copyright 2019, American Chemical Society.

and geometries of clusters present in data sets to quantify the preservation of differences between various types of materials, was used to perform feature selection in several case studies listed in Table 1.
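To make the descriptor concept concrete, the sketch below builds a minimal magpie-style composition representation: a material is reduced to composition-weighted statistics of tabulated elemental properties, yielding a fixed-length vector regardless of how many elements the compound contains. The three-property table and the featurize helper are illustrative stand-ins for the far larger descriptor sets provided by packages such as magpie (40) and matminer (37), not their actual APIs.

```python
import numpy as np

# Illustrative elemental property table: (electronegativity, atomic radius in pm,
# valence electron count). Real featurization packages tabulate dozens of properties.
ELEMENT_PROPS = {
    "Ni": (1.91, 124, 10),
    "Al": (1.61, 143, 3),
}

def featurize(composition: dict) -> np.ndarray:
    """Map {element: amount} to a fixed-length descriptor vector containing
    composition-weighted means and ranges of the elemental properties."""
    fracs = np.array(list(composition.values()), dtype=float)
    fracs /= fracs.sum()  # normalize to atomic fractions
    props = np.array([ELEMENT_PROPS[el] for el in composition])  # (n_elements, 3)
    weighted_mean = fracs @ props  # e.g., mean electronegativity
    prop_range = props.max(axis=0) - props.min(axis=0)
    return np.concatenate([weighted_mean, prop_range])

# Ni3Al becomes one 6-component vector, ready for any standard ML algorithm
print(featurize({"Ni": 3, "Al": 1}))
```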

2.3. Design Space and Cross-Validation

Cross-validation (CV), which quantifies the performance of ML models, can be customized to a

desired design space in order to obtain realistic estimates of model performance for the discovery application at hand. The concept of design space refers to the chemistries, experimental conditions, and processing routes one is interested in searching for new materials. A critical question is the relationship between the training data and the design space; generally speaking, the more different these two regimes are, the more challenging ML-driven discovery will be.

2.3.1. Measuring model performance via cross-validation. CV is the gold standard approach for quantifying the performance of ML models. In CV, models are trained on a subset of all available data and then used to predict values (for regression-type problems) or labels (for classification-type problems) for a held-out set of data for which the ground truth is known. Conceptually, CV is intended to probe an ML model's ability to generalize to unseen examples and embodies the idea that models should not be evaluated on data to which they were fit.

The most widely used form of CV is random k-fold, in which the available training data are randomly divided into k partitions (i.e., folds). Over k iterations, the ML model is trained on


k − 1 partitions and used to predict the held-out partition. Performance metrics such as root mean square error (RMSE) or Pearson r² may be computed across all held-out partitions. Among the case studies in Table 1, random k-fold CV is almost universally employed to compute these model accuracy statistics.

Interestingly, while random k-fold CV is widely used in both the ML research and materials informatics communities, materials discovery may pose challenges for traditional CV methods. In particular, because (a) materials data sets typically exhibit strong clustering and (b) materials discovery may involve extrapolation, random k-fold CV has a tendency to overestimate the performance of ML models (50). CV techniques such as leave-one-cluster-out (LOCO) (50) and leave-out-group (LOG) (51) have emerged to simulate materials discovery use cases. Drug discovery researchers have made similar observations (52, 53), and appropriately matching CV techniques to the problem at hand remains an essential step in scientific applications of ML (54).
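A minimal sketch of random k-fold CV, using scikit-learn and placeholder arrays standing in for a featurized materials data set; swapping KFold for a group-aware splitter (e.g., scikit-learn's GroupKFold with cluster labels as groups) approximates the LOCO idea of holding out entire clusters:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                      # placeholder descriptors
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=500)  # placeholder property

fold_rmse = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X[train_idx], y[train_idx])           # fit on k - 1 folds
    pred = model.predict(X[test_idx])               # predict the held-out fold
    fold_rmse.append(np.sqrt(np.mean((pred - y[test_idx]) ** 2)))

print(f"5-fold CV RMSE: {np.mean(fold_rmse):.3f} +/- {np.std(fold_rmse):.3f}")
```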

2.3.2. Extrapolative design spaces. Supervised ML approaches generally assume that all training data, along with unseen examples we wish to predict, are independent of one another and drawn from the same underlying distribution; this is the so-called independent and identically distributed (i.i.d.) assumption (55). However, real-world materials discovery applications often strain the i.i.d. conditions. As mentioned in the preceding section, experiments may be clustered due to sampling bias. In particular, scientists tend to measure many small changes to a few successful parent materials and to heavily underreport failed materials. Further, we may be interested in designing materials with structures, chemistries, or properties very different from those reflected in our training data. We refer to these latter situations as extrapolation.

Based on this definition, extrapolative materials discovery may or may not actually violate the i.i.d. assumption. The key question is whether the physics associated with as-yet-undiscovered materials of interest is satisfactorily sampled in the training data. To take superconductors as an example, a training set of Bardeen–Cooper–Schrieffer materials does not at all sample the anomalous physics of the cuprates, and the i.i.d. assumption is catastrophically violated. Thus, as we would intuitively expect, an ML model trained on Bardeen–Cooper–Schrieffer superconductors has no predictive power whatsoever for the cuprates (50, 56). In this vein, none of the case studies in Table 1 describe the discovery of entirely new physical regimes with ML, although ML-optimized sampling of chemical space can accelerate such discoveries beyond the pace of a purely random search (57).

To illustrate an alternative scenario, imagine that we focus specifically on the discovery of materials with exceptional (i.e., extreme) values of certain properties. In this case, ML may be able

to extrapolate beyond the property ranges present in the training data, if the necessary physics is present in the training data. However, this type of prediction problem is difficult. As shown in Figure 4, an ML model for bulk modulus (B) based on the Materials Project elastic property database (58) shows good performance over a range of B from 0 to 400 GPa when standard k-fold CV is applied. However, when an ML model is trained only with compounds with B < 300 GPa, the model's accuracy under extrapolation to greater values of B suffers.
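The effect in Figure 4 can be reproduced in a few lines: train only on compounds below a property threshold and evaluate on those above it. The arrays below are synthetic stand-ins; in practice X and B would come from the Materials Project elasticity data (58).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 8))                               # placeholder descriptors
B = 400 * X[:, 0] * X[:, 1] + rng.normal(scale=5, size=2000)  # placeholder bulk moduli (GPa)

interp = B < 300                                   # interpolative training regime
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[interp], B[interp])

pred = model.predict(X[~interp])                   # extrapolate to B >= 300 GPa
print("MAE above 300 GPa:", np.mean(np.abs(pred - B[~interp])))
# Averaged tree ensembles cannot predict beyond the targets they were trained on,
# so every extrapolative prediction is capped near the largest training value of B.
print("largest prediction:", pred.max(), "vs largest true value:", B[~interp].max())
```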

2.4. Machine Learning Algorithms

The choice of algorithm underlying trained models in materials informatics is largely driven by the specific materials problem being addressed, i.e., the nature of the model inputs and the desired outputs. As seen in Table 1, various algorithms have been successfully employed for various problems. The popular random forest (RF) (59, 60) and (deep) neural network (NN) (61, 62) algorithms are illustrated conceptually in Figure 5.


Figure 4

Example illustrating the effects of property extrapolation on ML model performance, using DFT-calculated mechanical property data from the Materials Project (31, 58). Blue circles show the results of typical CV across the entire data set, whereas orange triangles demonstrate the degradation in model performance that occurs when only materials with B < 300 GPa are used to train, and the remaining higher-B compounds are predicted. Abbreviations: CV, cross-validation; DFT, density functional theory; ML, machine learning.

A NN consists of an interconnected set of logical gates, called neurons, that transform a series of input data into an output decision. NN models lend themselves to pattern matching when training data sets are substantial and complex. In the informatics use cases reviewed here, such training data sets consist of data generated from high-throughput physics-based simulations (63, 64). The series of logical decisions made by the NN enable reproduction of complex abrupt changes in output

Figure 5

Schematic representation of (a) neural network and (b) random forest machine learning algorithms. In this hypothetical example, the models predict the band gap of an AB compound based on three features: the electronegativity difference between A and B (χA − χB), the atomic radius ratio between A and B (rA/rB), and the valence electron concentration (VEC) of the AB compound.

www.annualreviews.org • Machine Learning in Materials Discovery 57 MR50CH03_Meredig ARjats.cls June 16, 2020 11:45

space given inputs, such as how a minor change in a simplified molecular-input line-entry system (SMILES) string can radically alter the rate constant (64).

The RF algorithm trains multiple decision tree models on randomized bootstrapped subsets of the training data. Model predictions are then made based on the collected predictions of the decision trees. This results in a model robust to overfitting, which works well for the smaller data set sizes common in materials science, such as the ∼2,000 Heusler structures used to predict stability (65) and the ∼6,800 compositions used to predict glass formability (7). Two of the main advantages of the RF algorithm are robustness to sparse data and ease of performing feature selection.

Comparison of performance between these and other algorithms is a subject of study in the materials informatics literature (66) and is an important consideration when selecting one for a design problem. For instance, Gómez-Bombarelli et al. (64) found that the NN algorithm dramatically outperformed linear regression for prediction of molecular organic light-emitting diode (OLED) performance when the training data set size grew large, which is expected for molecular design spaces. However, for continuous composition-dependent design spaces (such as inorganic solids) and smaller training data sets, algorithm choice becomes less significant for model performance. Iwasaki et al. (67) and Wen et al. (22) observed little change in RMSE between NN and RF models for inorganic solid property prediction. Xue et al. (68) report similar prediction errors among various regression techniques, with polynomial regression yielding slightly lower errors. However, in the context of discovery and design of experiments, RMSE and other simple error metrics are not necessarily the ideal performance criteria. Ultimately, trained models are a means to an end, and measuring the ability of a model to predict high-performance candidates is preferred. Gómez-Bombarelli et al. (64), for instance, quantify this as the fraction of molecules in the test set correctly ranked in the top 5% of all molecules.
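As a sketch of both ideas, the snippet below trains a random forest on the three hypothetical descriptors of Figure 5 (synthetic data in place of a real band gap set) and reports the discovery-oriented metric alongside RMSE: the fraction of the true top 5% of test materials recovered in the predicted top 5%.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical descriptors: electronegativity difference, radius ratio, VEC
X = np.column_stack([rng.uniform(0, 3, 1000),
                     rng.uniform(0.5, 2, 1000),
                     rng.integers(1, 15, 1000)])
y = 0.8 * X[:, 0] + 0.3 / X[:, 1] + rng.normal(scale=0.1, size=1000)  # toy "band gap"

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)

rmse = np.sqrt(np.mean((pred - y_te) ** 2))
k = max(1, len(y_te) // 20)              # size of the top 5% of the test set
true_top = set(np.argsort(y_te)[-k:])    # indices of the truly best materials
pred_top = set(np.argsort(pred)[-k:])    # indices the model ranks highest
print(f"RMSE = {rmse:.3f}; top-5% overlap = {len(true_top & pred_top) / k:.2f}")
```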

3. CONFIRMED PREDICTIONS WITH METHODOLOGICAL HIGHLIGHTS

Table 1 presents a selection of 23 applications of ML to materials discovery wherein the predictions from ML were subsequently confirmed by experiment or simulation. In the following subsections, we highlight some of the key aspects of these case studies that are likely to have contributed to verifiable success.

3.1. Augmenting Domain Expertise with Machine Learning

After his defeat by IBM's Deep Blue in 1997, chess champion Garry Kasparov concluded that great complementarity exists between human and computer intelligence. In 1998, Kasparov created advanced chess, a format in which humans play with the aid of a chess computer (69). An analogous pairing of domain expertise and ML has led to several successes in materials design.

This concept of combining existing expert knowledge with ML to exploit complementarity can be powerful. While an unaided domain expert may simply miss an underlying trend in a large data set, ML can readily surface these patterns. Conversely, while an ML model is capable of generating massive numbers of candidate materials, a domain expert can sensibly prioritize materials for synthesis from the suggested list, taking into account, e.g., compatibility with laboratory equipment and economic factors that may be difficult to formally incorporate into ML models. Feedback from domain experts regarding predictive failures can also drive the subsequent refinement of ML models.

In an early example of the application of ML to materials discovery, Meredig et al. (70) demonstrated that models trained on DFT calculations could very accurately predict the thermodynamic


Figure 6

Summary of the machine learning (ML)–driven discovery results of Oliynyk et al. (65), showing predicted Heusler formation probabilities for the MRu2Ga, RuM2Ga, and LaM2Ga series across the transition metals Ti–Ni. Out of 14 ML-predicted high-probability Heusler compounds, 12 were experimentally confirmed; 7 predicted negatives were also verified in the laboratory. Adapted with permission from Reference 65. Copyright 2016, American Chemical Society.

stability of new compounds. Interestingly, they found that a model that combined pure ML with a domain knowledge–derived heuristic outperformed either ML or the heuristic individually. The ML model used only chemical composition as input; the heuristic used information from binary phase diagrams to predict the stability of ternary phases. With ML and the heuristic together, Meredig et al. identified nine materials as candidates for discovery, and confirmed using DFT calculations that eight of these compounds were more energetically stable than any combinations of known materials (70).

The work of Oliynyk et al. (65) offers another instance of extending domain knowledge with ML. A common question that arises when ML is applied to materials discovery is whether the predictions from ML are "nonobvious," and in this paper, Oliynyk et al. directly compare a pure electron-counting heuristic [akin to the 18-electron rule for half-Heuslers (71)] to the results of using electron counts plus many other descriptors in an ML model. Oliynyk et al. selected for synthesis several compounds whose likelihood of Heusler formation was predicted to be low by

electron counting but high according to the ML model. The compounds TiRu2Ga, VRu2Ga, CrRu2Ga, RuTi2Ga, RuV2Ga, RuCr2Ga, and RuMn2Ga were experimentally confirmed to crystallize in the Heusler structure (65), in spite of very low predicted probabilities from the electron count–only approach. Figure 6 summarizes the discovery results of Oliynyk et al. This example illustrates how heuristics often used to identify "usual suspect" materials may miss interesting candidates that a more elaborate ML approach can successfully identify.
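Operationally, a discovery workflow of this kind reduces to ranking an enumerated candidate list by a classifier's predicted formation probability. A minimal sketch, with random arrays standing in for the Heusler training set and descriptors of Oliynyk et al. (65):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(2000, 6))            # placeholder composition descriptors
y_train = X_train[:, 0] + X_train[:, 1] > 0     # placeholder "forms Heusler" labels

clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

X_candidates = rng.normal(size=(100000, 6))     # enumerated candidate compositions
p_form = clf.predict_proba(X_candidates)[:, 1]  # predicted probability of formation
shortlist = np.argsort(p_form)[::-1][:14]       # highest-probability picks for synthesis
print("top predicted probabilities:", np.round(p_form[shortlist][:5], 3))
```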

3.2. Combining Machine Learning with Physics-Based Simulations

Widely used physics-based simulations, such as DFT and molecular dynamics, are natural complements to ML for three primary reasons. First, as ML (especially in materials) is often starved for precious training data, physical simulations can generate realistic training examples in abundance. Second, ML algorithms have no innate knowledge of physics, and concurrent use of simulations can guard against unphysical or pathological behavior in ML models. Third, ML in the context of transfer learning (i.e., using an ML model trained on a larger data set to assist in a related


prediction problem that has fewer training data available) can build predictive connections between simulation and laboratory experimentation (72).

Work by Mansouri Tehrani et al. (31) illustrates these points. In their effort to discover new earth-abundant superhard materials, these researchers trained ML models on approximately 2,500 DFT-calculated bulk (B) and shear (G) moduli (73) in the Materials Project, generated consistently in a high-throughput fashion. Over 100,000 candidates were then screened for very large values of these elastic moduli. The training data were limited mainly to binary compounds due to the computational cost of DFT mechanical property calculations on more complex unit cells. Despite this restriction, the employed ML model successfully extrapolated into ternary and quaternary systems. Mansouri Tehrani et al.'s approach was informed by the fact that DFT mechanical property data are much more abundant than experimental data and by the observation that Vickers hardness (HV) values tend to correlate with B and G (74). This ML-driven screening process surfaced Mo0.9W1.1BC and Re0.5W0.5C as promising candidates, with the highest predicted bulk modulus values of quaternary and ternary phases, respectively. Subsequent experiments confirmed that both materials exhibited superhard behavior (taken to be HV > 40 GPa) at low load (31).

Besides the DFT-based approach exemplified by Mansouri Tehrani et al. (31), other physics-based or mechanistic models can be used as critical components in an informatics pipeline. Menon et al. (75) utilize a suite of physicochemical models to transform an initially simple molecular candidate space, based on mixtures of only seven commercial polymers, into a complex space of physical parameters which determine how these mixtures plasticize cement. In this way, domain expertise and prior knowledge are encoded in these physical models, enabling a predictive model [in this case, least absolute shrinkage and selection operator (LASSO) (76)] from a more limited data set. Another example by Bucior et al. (77) generates 3D H2 adsorption energy profiles for known metal–organic framework (MOF) structures using grand canonical Monte Carlo and Lennard-Jones plus Coulombic potentials. These 3D profiles were transformed into 1D histograms of adsorption energy, with the histogram bin heights used as the MOF descriptors for LASSO training. Interestingly, ignoring spatial dimensions of adsorption behavior did not diminish the model's ability to predict MOF performance.

3.3. Active Learning and Iterative Discovery

As is often the case in materials informatics problems, one starts with a limited quantity of training data on which to build a model in search of new materials in a large design space. Consequently, the initial model will likely be inadequate to describe the entirety of the design space. In such

instances, iterative active learning approaches are useful to efficiently sample the design space for additional training data to collect. Active learning involves the use of a model's prediction uncertainties to identify the ideal candidate experiments to achieve a goal, either to broaden the model's applicability in the design space and improve accuracy (exploration) or to identify the highest-performing candidates (exploitation), or some combination thereof. Upon execution of these suggested experiments, the resulting data are used to retrain and enhance the underlying ML model, such that the candidates in the next iteration are likelier to represent improvements. Active learning, when applied to materials data sets, has been shown to significantly reduce the number of experiments needed to identify the highest-performing material when compared to random guessing (57).

Such an iterative approach was used by Ren et al. (7) to efficiently develop a model for bulk metallic glass (BMG) formability in ternary metal systems and identify novel BMG-forming systems. The authors started with a training data set of 6,780 experimental reports, covering 293 ternary systems. The authors noted that most of these data come from only 38 ternary systems


Figure 7

Receiver operating characteristic (ROC) curves (true positive rate versus false positive rate for the first-, second-, and third-generation models) illustrate a model's trade-off between true positives and false positives, where greater area under the ROC (AUROC) curve indicates superior performance. Ren et al. (7) iteratively improved their machine learning (ML) models of bulk metallic glass (BMG) formability. The first-generation model contained only BMG data derived from the Landolt–Börnstein handbook (79). The second and third generations introduced experimental data obtained in testing ML predictions. Adapted from Reference 7. Copyright The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. Distributed under a Creative Commons (CC BY-NC) license.

and are biased toward systems with known BMGs (i.e., a bias toward positive results). A classifier model for whether a ternary composition will form a BMG was trained on these data (first generation; Figure 7). The Co-V-Zr system was identified as a ternary with many novel BMG-forming compositions and experimentally validated by a combinatorial thin-film synthesis approach, which revealed a large glass region. These results were added to the training data, and the resulting model (second generation) has significantly superior performance to the first, as shown by the receiver operating characteristic curve in Figure 7. Importantly, the authors attribute this fact to the considerable number of negative training data added, improving the data set quality for learning. Raccuglia et al. (78) made a similar observation. For this reason, in the second round of model-selected experiments, a system predicted to not form BMGs (Fe-Ti-Nb) was synthesized alongside two that do (Co-Ti-Zr and Co-Fe-Zr), and the results agreed well with the second-generation model. Adding these new systems to the training data, a third-generation model exhibited further improvement over the second, particularly at low false-positive rate (i.e., true prediction of glass-forming systems).
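A generic exploration-style loop can be sketched in a few lines: score each untested candidate by the spread of predictions across a random forest's trees (a crude uncertainty proxy), run the most uncertain candidate, and retrain. The run_experiment function is a hypothetical oracle standing in for laboratory synthesis or a physics-based simulation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def run_experiment(x):
    """Hypothetical oracle: a laboratory measurement or simulation."""
    return x[0] * x[1] + 0.05 * rng.normal()

X_pool = rng.uniform(size=(5000, 5))     # untested candidate descriptors
X_lab = list(rng.uniform(size=(20, 5)))  # small initial training set
y_lab = [run_experiment(x) for x in X_lab]

for _ in range(10):                      # ten acquisition rounds
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(np.array(X_lab), np.array(y_lab))
    # Uncertainty proxy: standard deviation of the per-tree predictions.
    tree_preds = np.stack([t.predict(X_pool) for t in model.estimators_])
    pick = int(np.argmax(tree_preds.std(axis=0)))  # pure exploration
    X_lab.append(X_pool[pick])
    y_lab.append(run_experiment(X_pool[pick]))     # run it, then retrain
    X_pool = np.delete(X_pool, pick, axis=0)
```

Replacing the argmax criterion with one that trades predicted value against uncertainty (e.g., an upper confidence bound) shifts the loop from exploration toward exploitation.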

3.4. Identifying Materials Exhibiting Property Extrema

Because ML is often used in pattern-matching applications, it is particularly challenging to surface materials with property values that lie outside the range of any initially known training materials. However, in an example of successful property value extrapolation, Xue et al. (68) found that a simple polynomial containing up to quadratic terms in three features (valence electron


count, atomic radius, and electronegativity) could successfully extrapolate to much higher values of shape-memory alloy transition temperatures Tp than were present in the training data. The training data generated by the researchers contained Tp values no greater than approximately 100°C, but the model accurately predicted values extracted from the published literature that ranged up to nearly 300°C. Further, a material whose predicted Tp was 189.56°C was found experimentally to have a transformation temperature of 182.89°C.

The case studies in Table 1 contain several additional examples of using ML to design materials with property values outside the range of the initially available training data. In rare cases, these discoveries can be accomplished in a single shot, as was done by Xue et al. (68). In many cases, an iterative active learning strategy must be employed. Rickman et al. (17) and Wen et al. (22) designed high-entropy alloys with measured hardness values greater than the alloys in their respective training sets. Sakurai et al. (80) used Bayesian optimization to develop a layered metamaterial thermal radiator with a Q-factor higher than previously observed. Thus, with appropriate training data, and often also with the aid of active learning, ML can enable discovery of materials with extreme property values.
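Because the model of Xue et al. (68) is an explicit low-order polynomial, its fitted functional form can extend beyond the training range, unlike the averaged tree ensemble of Figure 4. A minimal sketch, with synthetic data in place of the alloy measurements:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
# Placeholder features: valence electron count, atomic radius, electronegativity
X = rng.uniform(size=(200, 3))
T = 50 + 120 * X[:, 0] + 80 * X[:, 0] * X[:, 2] + rng.normal(scale=3, size=200)

low = T < 100                        # train only on low transition temperatures
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X[low], T[low])

pred = model.predict(X[~low])        # extrapolate to the higher-T compositions
print("largest prediction:", round(pred.max(), 1),
      "vs largest training target:", round(T[low].max(), 1))
```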

3.5. Designing for Multiple Properties

Materials discovery applications often do not involve optimizing only a single property of interest. Rather, real-world problems are inherently multiobjective, which is more complex than single-variable optimization. Most works take one of two approaches to this challenge: (a) passing materials candidates through a number of property screens, with the hope that a tractable list of candidates satisfies all of the screening criteria, or (b) applying scalarizing functions to transform a multiobjective design into a surrogate single-objective problem. The first approach assumes one can satisfactorily evaluate all properties for a preenumerated list of candidate materials; the second approach involves iteratively evaluating individual candidates or batches, and retraining ML models at each iteration.

The Li-ion battery electrolyte screening work of Sendek et al. (81), which led to a case study (82) in Table 1, illustrates the screening strategy. Sendek et al. trained an ML model on experimental ionic conductivity values for Li-ion conductors and screened a DFT-derived database (28) of candidate materials using this ML model along with a number of other criteria, as shown in Figure 8. This multiproperty screening approach, which involved tabulated elemental properties, heuristics, and DFT calculations in addition to ML, reduced 12,831 candidate electrolytes to 12. Subsequent DFT calculations confirmed materials in the Li-B-S system to have extraordinarily

high ionic conductivity values (82).

Scalarizing functions are a second approach for multiproperty materials design, where multiple property requirements are combined into a single function. This strategy maps a multiobjective problem to a single-objective problem, which can then be solved with any standard optimization approach. For example, Häse et al. (83) developed Chimera, which takes as user input a prioritized list of all materials properties under consideration (e.g., a user could specify that thermodynamic stability matters more than high ionic conductivity) and outputs a single-objective function suitable for optimization.

In measuring the success of scalarizing functions, studies tend to focus on the Pareto optimality of surfaced candidates (83, 84). A candidate is Pareto optimal (or, equivalently, lies on the Pareto frontier) if there exists no other candidate that is superior across every property dimension. Thus, the candidates on the Pareto frontier each offer a unique set of trade-offs across properties of interest. For instance, two superconducting materials could be Pareto optimal if one offers a high critical temperature Tc but subpar ductility (important for making wires of the material), while the other possesses superior ductility but a lower Tc. As it is impossible,


Figure 8

Multiproperty Li-ion battery electrolyte screening strategy of Sendek et al. (81). A combination of physics-based simulation, economic and practicality arguments, and machine learning modeling reduced a list of 12,831 candidate materials to 12. Abbreviation: DFT, density functional theory. Figure adapted with permission from the Royal Society of Chemistry, from "Holistic computational structure screening of more than 12000 candidates for solid lithium-ion conductor materials," Sendek et al., Energy Environ. Sci. 10(1) 2017 (81); permission conveyed through Copyright Clearance Center, Inc.

without user input, to declare a winner among Pareto-optimal materials, the ability to efficiently surface as many materials on the Pareto frontier as possible is a sensible benchmark for scalarizing functions.
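Pareto optimality is straightforward to test directly: a candidate is dominated if some other candidate is at least as good in every objective and strictly better in at least one. A minimal sketch for two properties to be maximized (say, Tc and ductility, with random placeholder values):

```python
import numpy as np

def pareto_mask(props: np.ndarray) -> np.ndarray:
    """Boolean mask of Pareto-optimal rows; every column is to be maximized."""
    optimal = np.ones(len(props), dtype=bool)
    for i in range(len(props)):
        dominated_by = (np.all(props >= props[i], axis=1)
                        & np.any(props > props[i], axis=1))
        optimal[i] = not dominated_by.any()  # i is optimal if nothing dominates it
    return optimal

rng = np.random.default_rng(0)
candidates = rng.uniform(size=(1000, 2))     # columns: Tc, ductility (placeholders)
frontier = candidates[pareto_mask(candidates)]
print(f"{len(frontier)} of {len(candidates)} candidates lie on the Pareto frontier")
```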

3.6. Generating Novel Materials

For some materials development challenges, particularly those involving organic small-molecule design, the design space is so massive that it cannot be exhaustively enumerated. In such cases, if novel molecules are desired, the design space will be populated by either (a) candidates generated de novo via ML without any prior structural input, perhaps using variational autoencoders (85), as illustrated in Figure 9 (86), or generative adversarial networks (87); or (b) candidates generated from known constituent building blocks and filtered by rules or heuristics based on domain expertise. Of course, a third common route involves using a preexisting library of candidates, e.g., PubChem entries, but this strategy by definition will not yield any chemical novelty; it can only predict unknown properties of previously observed materials.


Figure 9

Gómez-Bombarelli et al. (86) utilized a variational autoencoder to encode molecular structures, as represented by SMILES strings, into a latent space that allows for continuous optimization of properties. A decoder then enables recovery of feasible molecules with human-interpretable SMILES representations. Abbreviation: SMILES, simplified molecular-input line-entry system. Adapted with permission from the American Chemical Society (ACS) from Reference 86 (https://pubs.acs.org/doi/10.1021/acscentsci.7b00572); further permissions related to the material excerpted should be directed to the ACS.

Chemical generation approaches were utilized in a pair of studies on OLED design. In the work of Gómez-Bombarelli et al. (64), 222 moieties and several design rules based on electronic structure and synthesizability were used to generate 1.6 million candidate molecules. The SMILES representations for these molecules were converted into the fixed-length vector extended-connectivity fingerprint (ECFP) representation and used as input into a NN trained on kTADF, a time-dependent DFT-derived OLED performance metric. The NN predicted kTADF for all candidate molecules, the resulting list was ranked, and DFT was performed on the top-ranked candidates to validate the prediction and add to the training data. After each iteration, the improved NN was used to identify top candidates, which were further downselected for experiment.

Kim et al. (63) dynamically generated candidate OLED structures through a pair of NNs. For their inverse design pipeline, they trained a deep NN on the DFT-predicted electronic properties of 50,000 randomly selected structures from the PubChem database. This deep NN then predicted the electronic properties of randomly generated ECFP vectors. For the top-performing candidates, a recurrent NN converted the ECFP vector into a SMILES string. If the SMILES string corresponded to a chemically valid structure, Kim et al. ran DFT on the structure, and recommended those structures exceeding performance targets for subsequent experimental investigation. The authors ran this pipeline until they had identified 1,500 molecules satisfying their design constraints. Approximately 10% existed in PubChem but were not present in the training set, suggesting that the pipeline is capable of reproducing molecules found by traditional chemistry expertise.
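The validity filter and fixed-length fingerprint step shared by both pipelines can be sketched with RDKit (an assumed dependency here; the review does not name the cheminformatics toolkit these groups used):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def ecfp_or_none(smiles: str, n_bits: int = 1024):
    """Return a fixed-length ECFP-style bit vector, or None when the SMILES
    string does not correspond to a chemically valid molecule."""
    mol = Chem.MolFromSmiles(smiles)  # returns None for invalid strings
    if mol is None:
        return None
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    return np.array(fp)

candidates = ["c1ccccc1", "CC(=O)O", "c1ccc"]  # the last string is invalid
valid = [(s, v) for s in candidates if (v := ecfp_or_none(s)) is not None]
print(f"{len(valid)} of {len(candidates)} candidate strings are valid molecules")
```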


4. CONCLUSION

ML is beginning to exert a major impact on materials discovery. In Table 1, we highlighted a number of ML discovery case studies across a wide variety of materials applications. In areas ranging from small molecules to metal alloys, ML predictions have subsequently been confirmed by laboratory experiment or, in some cases, validated via physics-based simulations. In this review, we identified some strategies that have lent themselves to success in ML-aided materials discovery, such as explicitly combining domain knowledge with ML modeling. We also discussed the underlying components of a typical materials informatics discovery pipeline, such as high-quality input data and carefully selected materials representations.

Looking ahead, we anticipate important work in several areas of materials informatics. First, we must confront the issue of profound data scarcity in materials science; here, autonomous materials research (88), wherein ML-driven robotic synthesis and characterization apparatuses run in closed-loop fashion, could play a major role. Autonomy is particularly interesting because the time required for experimentation (i.e., data generation) is a severe bottleneck in the materials development process. Second, ML model interpretability, which enables researchers to gain greater physical insight from using ML models, will broaden the appeal and applicability of ML as a scientific tool. ML models are often considered black boxes (89) because the process by which inputs are transformed into outputs is opaque for many ML algorithms, but greater interpretability would ameliorate this limitation. Finally, we should be able to quantify the behavior of ML models during extrapolative materials discovery. In particular, design space optimization could maximize the likelihood that ML will find nonobvious, high-performing materials. If we can optimize where we are searching for discoveries, accelerate our rate of data generation, and enhance the transparency of ML models for scientific users, ML will be transformative for our ability to discover and design groundbreaking new materials.

DISCLOSURE STATEMENT

The authors are not aware of any affiliations, memberships, funding, or financial holdings that might be perceived as affecting the objectivity of this review.

LITERATURE CITED

1. Hill J, Mulholland G, Persson K, Seshadri R, Wolverton C, Meredig B. 2016. Materials science with large-scale data and informatics: unlocking new opportunities. MRS Bull. 41:399–409
2. Jain A, Hautier G, Ong SP, Persson K. 2016. New opportunities for materials informatics: resources and techniques for uncovering hidden relationships. J. Mater. Res. 31:977–94
3. Mueller T, Kusne AG, Ramprasad R. 2016. Machine learning in materials science: recent progress and emerging applications. Rev. Comput. Chem. 29:186–273
4. Ramprasad R, Batra R, Pilania G, Mannodi-Kanakkithodi A, Kim C. 2017. Machine learning in materials informatics: recent applications and prospects. npj Comput. Mater. 3:54
5. Butler KT, Davies DW, Cartwright H, Isayev O, Walsh A. 2018. Machine learning for molecular and materials science. Nature 559:547–55
6. Larsen P, Von Ins M. 2010. The rate of growth in scientific publication and the decline in coverage provided by Science Citation Index. Scientometrics 84:575–603
7. Ren F, Ward L, Williams T, Laws KJ, Wolverton C, et al. 2018. Accelerated discovery of metallic glasses through iteration of machine learning and high-throughput experiments. Sci. Adv. 4:eaaq1566
8. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, et al. 2016. TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467 [cs]
9. Allen FH. 2002. The Cambridge Structural Database: a quarter of a million crystal structures and rising. Acta Crystallogr. B 58:380–88


10. Seiler KP, George GA, Happ MP, Bodycombe NE, Carrinski HA, et al. 2007. ChemBank: a small-molecule screening and cheminformatics resource database. Nucleic Acids Res. 36:D351–59
11. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. 2008. GenBank. Nucleic Acids Res. 37:D26–31
12. Kim S, Thiessen PA, Bolton EE, Chen J, Fu G, et al. 2015. PubChem substance and compound databases. Nucleic Acids Res. 44:D1202–13
13. Blaiszik B, Chard K, Pruyne J, Ananthakrishnan R, Tuecke S, Foster I. 2016. The Materials Data Facility: data services to advance materials science research. JOM 68:2045–52
14. Draxl C, Scheffler M. 2018. NOMAD: the FAIR concept for big data–driven materials science. MRS Bull. 43:676–82
15. O'Mara J, Meredig B, Michel K. 2016. Materials data infrastructure: a case study of the Citrination platform to examine data import, storage, and access. JOM 68:2031–34
16. Seshadri R, Sparks TD. 2016. Perspective: interactive material property databases through aggregation of literature data. APL Mater. 4:053206
17. Rickman J, Chan H, Harmer M, Smeltzer J, Marvel C, et al. 2019. Materials informatics for the screening of multi-principal elements and high-entropy alloys. Nat. Commun. 10:2618
18. Iwasaki Y, Takeuchi I, Stanev V, Kusne AG, Ishida M, et al. 2019. Machine-learning guided discovery of a new thermoelectric material. Sci. Rep. 9:2751
19. Balachandran PV, Kowalski B, Sehirlioglu A, Lookman T. 2018. Experimental search for high-temperature ferroelectric perovskites guided by two-step machine learning. Nat. Commun. 9:1668
20. Min K, Choi B, Park K, Cho E. 2018. Machine learning assisted optimization of electrochemical properties for Ni-rich cathode materials. Sci. Rep. 8:15778
21. Hatakeyama-Sato K, Tezuka T, Nishikitani Y, Nishide H, Oyaizu K. 2018. Synthesis of lithium-ion conducting polymers designed by machine learning–based prediction and screening. Chem. Lett. 48:130–32
22. Wen C, Zhang Y, Wang C, Xue D, Bai Y, et al. 2019. Machine learning assisted design of high entropy alloys with desired property. Acta Mater. 170:109–17
23. Hohenberg P, Kohn W. 1964. Inhomogeneous electron gas. Phys. Rev. 136:B864–71
24. Kohn W, Sham LJ. 1965. Self-consistent equations including exchange and correlation effects. Phys. Rev. 140:A1133–38
25. Kresse G, Hafner J. 1993. Ab initio molecular dynamics for liquid metals. Phys. Rev. B 47:558–61
26. Hafner J. 2008. Ab-initio simulations of materials using VASP: density-functional theory and beyond. J. Comput. Chem. 29:2044–78
27. Curtarolo S, Setyawan W, Wang S, Xue J, Yang K, et al. 2012. AFLOWLIB.ORG: a distributed materials properties repository from high-throughput ab initio calculations. Comput. Mater. Sci. 58:227–35
28. Jain A, Ong SP, Hautier G, Chen W, Richards WD, et al. 2013. Commentary: The Materials Project: a materials genome approach to accelerating materials innovation. APL Mater. 1:011002
29. Saal JE, Kirklin S, Aykol M, Meredig B, Wolverton C. 2013. Materials design and discovery with high-throughput density functional theory: the Open Quantum Materials Database (OQMD). JOM 65:1501–9
30. Ward L, Aykol M, Blaiszik B, Foster I, Meredig B, et al. 2018. Strategies for accelerating the adoption of materials informatics. MRS Bull. 43:683–89
31. Mansouri Tehrani A, Oliynyk AO, Parry M, Rizvi Z, Couper S, et al. 2018. Machine learning directed search for ultraincompressible, superhard materials. J. Am. Chem. Soc. 140:9844–53
32. Zhuo Y, Mansouri Tehrani A, Oliynyk AO, Duke AC, Brgoch J. 2018. Identifying an efficient, thermally robust inorganic phosphor host via machine learning. Nat. Commun. 9:4377
33. Bassman L, Rajak P, Kalia RK, Nakano A, Sha F, et al. 2018. Active learning for accelerated design of layered materials. npj Comput. Mater. 4:74
34. Huo H, Rupp M. 2017. Unified representation for machine learning of molecules and crystals. arXiv:1704.06439 [physics.chem-ph]
35. Askerka M, Li Z, Lempen M, Liu Y, Johnston A, et al. 2019. Learning-in-templates enables accelerated discovery and synthesis of new stable double perovskites. J. Am. Chem. Soc. 141:3682–90
36. Schütt K, Glawe H, Brockherde F, Sanna A, Müller K, Gross E. 2014. How to represent crystal structures for machine learning: towards fast prediction of electronic properties. Phys. Rev. B 89:205118

66 Saal • Oliynyk • Meredig MR50CH03_Meredig ARjats.cls June 16, 2020 11:45

37. Ward L, Dunn A, Faghaninia A, Zimmermann NE, Bajaj S, et al. 2018. Matminer: an open source toolkit for materials data mining. Comput. Mater. Sci. 152:60–69 38. Faber FA, Christensen AS, Huang B, von Lilienfeld OA. 2018. Alchemical and structural distribution based representation for universal quantum machine learning. J. Chem. Phys. 148:241717 39. Ward L, Liu R, Krishna A, Hegde VI, Agrawal A, et al. 2017. Including crystal structure attributes in machine learning models of formation energies via Voronoi tessellations. Phys. Rev. B 96:024104 40. Ward L, Agrawal A, Choudhary A, Wolverton C. 2016. A general-purpose machine learning framework for predicting properties of inorganic materials. npj Comput. Mater. 2:16028 41. Rosenbrock CW, Homer ER, Csányi G, Hart GL. 2017. Discovering the building blocks of atomic sys- tems using machine learning: application to grain boundaries. npj Comput. Mater. 3:29 42. Gomberg JA, Medford AJ, Kalidindi SR. 2017. Extracting knowledge from molecular mechanics simula- tions of grain boundaries using machine learning. Acta Mater. 133:100–8 43. Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE. 2017. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, ed. D Precup, YW Teh, pp. 1263–72. New York: ACM. https://dl.acm.org/doi/10.5555/3305381.3305512 44. Schütt KT, Sauceda HE, Kindermans PJ, Tkatchenko A, Müller KR. 2018. SchNet—a deep learning architecture for molecules and materials. J. Chem. Phys. 148:241722 45. Xie T, Grossman JC. 2018. Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Phys. Rev. Lett. 120:145301 46. Bartók AP, De S, Poelking C, Bernstein N, Kermode JR, et al. 2017. Machine learning unifes the mod- eling of materials and molecules. Sci. Adv. 3:e1701816 47. Hall MA. 1999. Correlation-based feature selection for machine learning. PhD Thesis, Univ. Waikato, Hamilton, N. Z. 48. Jolliffe IT, Cadima J. 2016. Principal component analysis: a review and recent developments. Philos. Trans. R. Soc. A 374:20150202 49. Sinkov NA, Harynuk JJ. 2011. Cluster resolution: a metric for automated, objective and optimized feature selection in chemometric modeling. Ta l a n t a 83:1079–87 50. Meredig B, Antono E, Church C, Hutchinson M, Ling J, et al. 2018. Can machine learning identify the next high-temperature superconductor? Examining extrapolation performance for materials discovery. Mol. Syst. Des. Eng. 3:819–25 51. Lu HJ, Zou N, Jacobs R, Afferbach B, Lu XG, Morgan D. 2019. Error assessment and optimal cross- validation approaches in machine learning applied to impurity diffusion. Comput. Mater. Sci. 169:109075 52. Sheridan RP. 2013. Time-split cross-validation as a method for estimating the goodness of prospective prediction. J. Chem. Inf. Model. 53:783–90 53. Wallach I, Heifets A. 2018. Most ligand-based classifcation benchmarks reward memorization rather than generalization. J. Chem. Inf. Model. 58:916–32 54. Riley P. 2019. Three pitfalls to avoid in machine learning. Nature 572:27–29

Access provided by Manhattan College on 09/01/21. For personal use only. 55. Vapnik VN. 1999. An overview of statistical learning theory. IEEE Trans. Neural Netw. 10:988–99

Annu. Rev. Mater. Res. 2020.50:49-69. Downloaded from www.annualreviews.org 56. Stanev V, Oses C, Kusne AG, Rodriguez E, Paglione J, et al. 2018. Machine learning modeling of super- conducting critical temperature. npj Comput. Mater. 4:29 57. Ling J, Hutchinson M, Antono E, Paradiso S, Meredig B. 2017. High-dimensional materials and pro- cess optimization using data-driven experimental design with well-calibrated uncertainty estimates. Integr. Mater. Manuf. Innov. 6:207–17 58. De Jong M, Chen W, Angsten T, Jain A, Notestine R, et al. 2015. Charting the complete elastic properties of inorganic crystalline compounds. Sci. Data 2:150009 59. Breiman L. 2001. Random forests. Mach. Learn. 45:5–32 60. Wager S, Hastie T, Efron B. 2014. Confdence intervals for random forests: the jackknife and the in- fnitesimal jackknife. J. Mach. Learn. Res. 15:1625–51 61. Krogh A, Vedelsby J. 1996. Neural network ensembles, cross validation, and active learning. In Advances in Neural Systems 8 (NIPS 1995), ed. DS Touretzky, MC Mozer, ME Hasselmo, pp. 231–38. Cambridge, MA: MIT Press 62. LeCun Y, Bengio Y, Hinton G. 2015. Deep learning. Nature 521:436–44

www.annualreviews.org • Machine Learning in Materials Discovery 67 MR50CH03_Meredig ARjats.cls June 16, 2020 11:45

63. Kim K, Kang S, Yoo J, Kwon Y, Nam Y, et al. 2018. Deep-learning-based inverse design model for intel- ligent discovery of organic molecules. npj Comput. Mater. 4:67 64. Gómez-Bombarelli R, Aguilera-Iparraguirre J, Hirzel TD, Duvenaud D, Maclaurin D, et al. 2016. Design of effcient molecular organic light-emitting diodes by a high-throughput virtual screening and experi- mental approach. Nat. Mater. 15:1120–27 65. Oliynyk AO, Antono E, Sparks TD, Ghadbeigi L, Gaultois MW, et al. 2016. High-throughput machine- learning-driven synthesis of full-Heusler compounds. Chem. Mater. 28:7324–31 66. Agrawal A, Deshpande PD, Cecen A, Basavarsu GP, Choudhary AN, Kalidindi SR. 2014. Exploration of data science techniques to predict fatigue strength of steel from composition and processing parameters. Integr. Mater. Manuf. Innov. 3:90–108 67. Iwasaki Y, Sawada R, Stanev V, Ishida M, Kirihara A, et al. 2019. Materials development by interpretable machine learning. arXiv:1903.02175 [cond-mat.mtrl-sci] 68. Xue D, Xue D, Yuan R, Zhou Y, Balachandran PV, et al. 2017. An informatics approach to transformation temperatures of NiTi-based shape memory alloys. Acta Mater. 125:532–41 69. Hassabis D. 2017. Artifcial intelligence: chess match of the century. Nature 544:413–14 70. Meredig B, Agrawal A, Kirklin S, Saal JE, Doak J, et al. 2014. Combinatorial screening for new materials in unconstrained composition space with machine learning. Phys.Rev.B89:094104 71. Zeier WG, Anand S, Huang L, He R, Zhang H, et al. 2017. Using the 18-electron rule to understand the nominal 19-electron half-Heusler NbCoSb with Nb vacancies. Chem. Mater. 29:1210–17 72. Hutchinson ML, Antono E, Gibbons BM, Paradiso S, Ling J, Meredig B. 2017. Overcoming data scarcity with transfer learning. arXiv:1711.05099 [cs.LG] 73. De S, Bartók AP, Csányi G, Ceriotti M. 2016. Comparing molecules and solids across structural and alchemical space. Phys. Chem. Chem. Phys. 18:13754–69 74. Liu AY, Cohen ML. 1989. Prediction of new low compressibility solids. Science 245:841–42 75. Menon A, Childs CM, Poczós B, Washburn NR, Kurtis KE. 2019. Molecular engineering of superplas- ticizers for metakaolin-portland cement blends with hierarchical machine learning. Adv. Theory Simul. 2:1800164 76. Santosa F, Symes WW. 1986. Linear inversion of band-limited refection seismograms. SIAM J. Sci. Stat. Comput. 7:1307–30 77. Bucior BJ, Bobbitt NS, Islamoglu T, Goswami S, Gopalan A, et al. 2019. Energy-based descriptors to rapidly predict hydrogen storage in metal–organic frameworks. Mol. Syst. Des. Eng. 4:162–74 78. Raccuglia P, Elbert KC, Adler PD, Falk C, Wenny MB, et al. 2016. Machine-learning-assisted materials discovery using failed experiments. Nature 533:73–76 79. Kawazoe Y, Yu J-Z, Tsai A-P, Masumoto T, eds. 1997. Phase Diagrams and Physical Properties of Nonequi- librium Alloys, Subvol. A: Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys. Landolt–Börnstein Numer. Data Funct. Relatsh. Sci. Technol.37. Berlin/Heidelberg/New York: Springer 80. Sakurai A, Yada K, Simomura T, Ju S, Kashiwagi M, et al. 2019. Ultranarrow-band wavelength-selective

Access provided by Manhattan College on 09/01/21. For personal use only. thermal emission with aperiodic multilayered metamaterials designed by Bayesian optimization. ACS

Annu. Rev. Mater. Res. 2020.50:49-69. Downloaded from www.annualreviews.org Cent. Sci. 5:319–26 81. Sendek AD, Yang Q, Cubuk ED, Duerloo KAN, Cui Y, Reed EJ. 2017. Holistic computational structure screening of more than 12,000 candidates for solid lithium-ion conductor materials. Energy Environ. Sci. 10:306–20 82. Sendek AD, Antoniuk ER, Cubuk ED, Francisco BE, Buettner-Garrett J, et al. 2019. A new solid Li-ion electrolyte from the crystalline lithium-boron-sulfur system. SSRN Electron. J. https://dx.doi.org/10. 2139/ssrn.3404263 83. Häse F, Roch LM, Aspuru-Guzik A. 2018. Chimera: enabling hierarchy based multi-objective optimiza- tion for self-driving laboratories. Chem. Sci. 9:7642–55 84. Solomou A, Zhao G, Boluki S, Joy JK, Qian X, et al. 2018. Multi-objective Bayesian materials discovery: application on the discovery of precipitation strengthened NiTi shape memory alloys through microme- chanical modeling. Mater. Des. 160:810–27 85. Kingma DP, Welling M. 2013. Auto-encoding variational Bayes. arXiv:1312.6114 [stat.ML]

68 Saal • Oliynyk • Meredig MR50CH03_Meredig ARjats.cls June 16, 2020 11:45

86. Gómez-Bombarelli R, Wei JN, Duvenaud D, Hernández-Lobato JM, Sánchez-Lengeling B, et al. 2018. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4:268–76 87. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, et al. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems 27 (NIPS 2014), ed. Z Ghahramani, M Welling, C Cortes, ND Lawrence, KQ Weinberger. San Diego: Neural Inf. Process. Syst. 88. Nikolaev P, Hooper D, Webber F, Rao R, Decker K, et al. 2016. Autonomy in materials research: a case study in carbon nanotube growth. npj Comput. Mater. 2:16031 89. Holm EA. 2019. In defense of the black box. Science 364:26–27 90. Mannodi-Kanakkithodi A, Pilania G, Huan TD, Lookman T, Ramprasad R. 2016. Machine learning strategy for accelerated design of polymer dielectrics. Sci. Rep. 6:20952 91. Oliynyk AO, Adutwum LA, Harynuk JJ, Mar A. 2016. Classifying crystal structures of binary com- pounds AB through cluster resolution feature selection and support vector machine analysis. Chem. Mater. 28:6672–81 92. Villars P, ed. 2007. Pearson’s Crystal Data®: crystal structure database for inorganic compounds. Database, ASM Int., Materials Park, OH 93. Oliynyk AO, Adutwum LA, Rudyk BW, Pisavadia H, Lotf S, et al. 2017. Disentangling structural con- fusion through machine learning: structure prediction and polymorphism of equiatomic ternary phases ABC. J. Am. Chem. Soc. 139:17870–81 94. Seko A, Hayashi H, Kashima H, Tanaka I. 2018. Matrix-and tensor-based recommender systems for the discovery of currently unknown inorganic compounds. Phys. Rev. Mater. 2:013805 95. Belsky A, Hellenbrandt M, Karen VL, Luksch P. 2002. New developments in the Inorganic Crystal Struc- ture Database (ICSD): accessibility in support of materials research and design. Acta Crystallogr.B 58:364– 69 96. Lu S, Zhou Q, Ouyang Y, Guo Y, Li Q, Wang J. 2018. Accelerated discovery of stable lead-free hybrid organic–inorganic perovskites via machine learning. Nat. Commun. 9:3405 97. Wu S, Kondo Y, Kakimoto M, Yang B, Yamada H, et al. 2019. Machine-learning-assisted discovery of polymers with high thermal conductivity using a molecular design algorithm. npj Comput. Mater. 5:66 98. Otsuka S, Kuwajima I, Hosoya J, Xu Y, Yamazaki M. 2011. PoLyInfo: polymer database for poly- meric materials design. In 2011 International Conference on Emerging Intelligent Data and Web Technologies, pp. 22–29. Piscataway, NJ: IEEE 99. Ramakrishnan R, Dral PO, Rupp M, Von Lilienfeld OA. 2014. Quantum chemistry structures and prop- erties of 134 kilo molecules. Sci. Data 1:140022 Access provided by Manhattan College on 09/01/21. For personal use only. Annu. Rev. Mater. Res. 2020.50:49-69. Downloaded from www.annualreviews.org

www.annualreviews.org • Machine Learning in Materials Discovery 69 MR50_FrontMatter ARI 12 June 2020 10:37
