
Lawrence Berkeley National Laboratory Recent Work

Title: Big-Data Science in Porous Materials: Materials Genomics and Machine Learning
Permalink: https://escholarship.org/uc/item/3ss713pj
Journal: Chemical Reviews, 120(16)
ISSN: 0009-2665
Authors: Jablonka, Kevin Maik; Ongari, Daniele; Moosavi, Seyed Mohamad; et al.
Publication Date: 2020-08-01
DOI: 10.1021/acs.chemrev.0c00004
Peer reviewed

This is an open access article published under an ACS AuthorChoice License, which permits copying and redistribution of the article or any adaptations for non-commercial purposes.


Big-Data Science in Porous Materials: Materials Genomics and Machine Learning
Kevin Maik Jablonka, Daniele Ongari, Seyed Mohamad Moosavi, and Berend Smit*

Cite This: Chem. Rev. 2020, 120, 8066−8129


ABSTRACT: By combining metal nodes with organic linkers we can potentially synthesize millions of possible metal−organic frameworks (MOFs). The fact that we have so many materials opens many exciting avenues but also creates new challenges. We simply have too many materials to be processed using conventional, brute-force methods. In this review, we show that having so many materials allows us to use big-data methods as a powerful technique to study these materials and to discover complex correlations. The first part of the review gives an introduction to the principles of big-data science. We show how to select appropriate training sets, survey approaches that are used to represent these materials in feature space, and review different learning architectures, as well as evaluation and interpretation strategies. In the second part, we review how the different approaches of machine learning have been applied to porous materials. In particular, we discuss applications in the field of gas storage and separation, the stability of these materials, their electronic properties, and their synthesis. Given the increasing interest of the scientific community in machine learning, we expect this list to rapidly expand in the coming years.

CONTENTS

1. Introduction
2. Machine Learning Landscape
2.1. Machine Learning Pipeline
2.1.1. Machine Learning Workflow
2.1.2. Machine Learning Algorithms
2.2. Theory-Guided Data Science
2.3. Scientific Method in Machine Learning: Strong Inference and Multiple Models
3. Selecting the Data: Dealing with Little, Imbalanced, and Nonrepresentative Data
3.1. Limitations of Hypothetical Databases
3.2. Sampling to Improve Predictive Performance
3.2.1. Diverse Set Selection
3.3. Active Learning
3.4. Dealing with Little Data
3.5. Dealing with Imbalanced Data Labels
4. What to Learn from: Translating Structures into Feature Vectors
4.1. Descriptors
4.2. Overview of the Descriptor Landscape
4.2.1. Local Descriptors
4.2.2. Global Descriptors
4.3. Feature Learning
4.3.1. Feature Engineering
4.3.2. Feature Selection
4.3.3. Data Transformations
5. How to Learn: Choosing a Learning Algorithm
5.1. Lots of (Unstructured) Data (Tall Data)
5.1.1. Neural Networks
5.2. Limited Amount of (Structured) Data (Wide Data)
5.2.1. Linear and Logistic Regression
5.2.2. Kernel Methods
5.2.3. Bayesian Learning
5.2.4. Instance-Based Learning
5.2.5. Ensemble Methods
6. How to Learn Well: Regularization, Hyperparameter Tuning, and Tricks
6.1. Hyperparameter Tuning
6.2. Regularization
6.2.1. Explicit Regularization: Adding a Term or Layer
6.2.2. Implicit Regularization: More Subtle Ways to Stop the Model from Remembering
7. How to Measure Performance and Compare Models
7.1. Holdout Splits and Cross-Validation: Sampling without Replacement
7.2. Bootstrap: Sampling with Replacement
7.3. Choosing the Appropriate Regression Metric
7.4. Classification
7.4.1. Probabilities That Can Be Interpreted as Confidence
7.4.2. Choosing the Appropriate Classification Metric
7.5. Estimating Extrapolation Ability
7.6. Domain of Applicability
7.7. Confidence Intervals and Error Estimates
7.7.1. Ensemble Approach
7.7.2. Distance-Based
7.7.3. Conformal Prediction
7.8. Comparing Models
7.8.1. Ablation Studies
7.9. Randomization Tests: Is the Model Learning Something Meaningful?
8. How to Interpret the Results: Avoiding the Clever Hans
8.1. Consider Using Explainable Models
8.2. Post-Hoc Techniques to Shine Light Into Black Boxes
8.3. Auditing Models: What Are Indirect Influences?
9. Applications of Supervised Machine Learning
9.1. Gas Storage and Separation
9.1.1. Starting on Small Data Sets
9.1.2. Moving to Big Data
9.1.3. Bridging the Gap between Process Engineering and Materials Science
9.1.4. Interpreting the Models
9.2. Stability
9.3. Reactivity and Chemical Properties
9.4. Electronic Properties
9.5. ML for Molecular Simulations
9.6. Synthesis
9.6.1. Synthesizability
9.7. Generative Models
10. Outlook and Concluding Remarks
10.1. Automatizing the Machine Learning Workflow
10.2. Reproducibility in Machine Learning
10.2.1. Comparability and Reporting Standards
10.3. Transfer Learning and Multifidelity Optimization
10.4. Multitask Prediction
10.5. Future of Big-Data Science in Porous Materials
Author Information
Corresponding Author
Authors
Notes
Biographies
Acknowledgments
Abbreviations
References

Special Issue: Porous Framework Chemistry
Received: January 3, 2020
Published: June 10, 2020


1. INTRODUCTION

One of the fascinating aspects of metal−organic frameworks (MOFs) is that by combining linkers and metal nodes we can synthesize millions of different materials.1 Over the past decade, over 10,000 porous2,3 and 80,000 nonporous MOFs have been synthesized.4 In addition, one also has covalent organic frameworks (COFs), porous polymer networks (PPNs), zeolites, and related porous materials. Because of their potential in many applications, ranging from gas separation and storage, sensing, catalysis, etc., these materials have attracted a lot of attention. From a scientific point of view, these materials are interesting as their chemical tunability allows us to tailor-make materials with exactly the right properties. As one can only synthesize a tiny fraction of all possible materials, these experimental efforts are often combined with computational approaches, often referred to as materials genomics,5 to generate libraries of predicted or hypothetical MOFs, COFs, and other related porous materials. These libraries are subsequently computationally screened to identify the most promising material for a given application.

That we now have of the order of ten thousand synthesized porous crystals and over a hundred thousand predicted materials does create new challenges; we simply have too many structures and too much data. Issues related to having so many structures can be simple questions on how to manage so much data but also more profound ones on how to use the data to discover new science. Therefore, a logical next step in materials genomics is to apply the tools of big-data science and to exploit "the unreasonable effectiveness of data".6 In this review, we discuss how machine learning (ML) has been applied to porous materials and review some aspects of the underlying techniques in each step. Before discussing the specific applications of ML to porous materials, we give an overview of the ML landscape to introduce some terminology; a short overview of the technical terms we will use throughout this review is given in Table 1.

In this review, we focus on applications of ML in materials science and chemistry with a particular focus on porous materials. For a more general discussion on ML, we refer the reader to some excellent reviews.7,8

2. MACHINE LEARNING LANDSCAPE

Nowadays it is difficult, if not impossible, to avoid ML in science. Because of recent developments in technology, we now routinely store and analyze large amounts of data. The underlying idea of big-data science is that if one has large amounts of data, one might be able to discover statistically significant patterns that are correlated to some specific properties or events. Arthur Samuel was among the first to use the term "machine learning" for the algorithms he developed in 1959 to teach a computer to play the game of checkers.9 His ML algorithm let the computer look ahead a few moves. Initially, each possible move had the same weight and hence probability of being executed. By collecting more and more data from actual games, the computer could learn which move for a given board configuration would develop a winning strategy. One of the reasons why Arthur Samuel looked at checkers was that in the practical sense the game of checkers is not deterministic; there is no known algorithm that leads to winning the game and the complete evaluation of all 10^40 possible moves is beyond the capacity of any computer.

There are some similarities between the game of checkers and the science of discovering new materials. Making a new material is in practice equally nondeterministic. The number of possible ways we can combine atoms is simply too large to evaluate all possible materials. For a long time, materials discovery has been based on empirical knowledge.

Significant advances were made once some of this empirical knowledge was generalized in the form of theoretical frameworks. Combined with supercomputers these theoretical frameworks resulted in accurate predictions of the properties of materials. Yet, the number of atoms and possible materials is simply too large to predict all properties of all possible materials. Hence, there will be large parts of our material space that are, in practical terms, out of reach of the conventional paradigms of science. Some phenomena are simply too complex to be explicitly described with theory. Teaching the computer the concepts using big data might be an interesting route to study some of these problems. The emergence of off-the-shelf machine learning methods that can be used by domain experts,10 not only specialized data scientists, in combination with big data is thought to spark the "fourth industrial revolution" and the "fourth paradigm of science" (cf. Figure 1).11,12 In this context, big data can add a new dimension to material discovery. One needs to realize that even though ML might appear as "black box" engineering in some instances, good predictions from a black box are infinitely better than no prediction at all. This is to some extent similar to an engineer that can make things work without understanding all the underlying physics. And, as we will discuss below, there are many techniques to investigate the reliability and domain of applicability of a ML model as well as techniques that can help in understanding the predictions made by the model.

Material science and chemistry may not be the most obvious topics for big-data science. Experiments are labor-intensive and the amount of data about materials that has been collected in the last centuries is minute compared to what Google and the likes collect every single second. However, recently the field of materials genomics has changed the landscape.13 High-throughput density-functional theory (DFT) calculations14 and molecular simulations15 have become routine tools to study the properties of real and even hypothetical materials. In these studies, ML is becoming more popular and widely used as a filter in the computational funnel of high-throughput screenings16 but also to assist and guide simulations17−20 or experiments,21 or to even replace them,22,23 and to design new high-performing materials.24

Another important factor is the prominent role patterns have played in chemistry. The most famous example is Mendeleev's periodic table, but also Pauling's rules,25 Pettifor's maps,26 and many other structure−property relationships were guided by a combination of empirical knowledge and chemical intuition. What we hope to show in this review is that ML holds the promise to discover much more complex relationships from (big) data.

We continue this section with a broad overview of the main principles of ML. This section will be followed by a more detailed and technical discussion of the different subtopics introduced in this section.

2.1. Machine Learning Pipeline

2.1.1. Machine Learning Workflow. ML is no different from any other method in science. There are questions for which ML is an extremely powerful method to find an answer, but if one sees ML as the modern solution to any ill-posed problem, one is bound to be disappointed. In section 9, we will discuss the type of questions that have been successfully addressed using ML in the contexts of synthesis and applications of porous materials.

Independent of the learning algorithm or goal, the ML workflow from materials' data to prediction and interpretation can be divided into the following blueprint of a workflow, which also this review follows:

1. Understanding the problem: An understanding of the phenomena that need to be described is important. For example, if we are interested in methane storage in porous media, the key performance parameter is the deliverable capacity, which can be obtained directly from the experimental adsorption isotherms at a given temperature. In more general terms, an understanding of the phenomena helps us to guide the generation and transformation of the data (discussed in more detail in the next step).
In the case of the deliverable capacity we have a continuous variable and hence a regression problem, which can be more difficult to learn compared to classification problems (e.g., whether the channels in our porous material form a 1, 2, or 3-dimensional network, or classifying the deliverable capacity as "high" or "low"). Importantly, the problem definition guides the choice of the strategies for model evaluation, selection, and interpretation (cf. section 7): In some classification cases, such as in a part of the high-throughput funnel, in which we are interested in finding the top-performing materials by down selecting materials, missing the highest-performing material is worse than doing an additional simulation for a mediocre material; this is something one should realize before building the model.

2. Generating and exploring data: Machine learning needs data to learn from. In particular, one needs to ensure that we have suitable training data. Suitable, in the sense that the data are reliable and provide sufficient coverage of the design space we would like to explore. Sometimes, suitable training data must be generated or augmented. The process of exploring a suitable data set (known as exploratory data analysis (EDA)27) and its subsequent featurization can help to understand the problem better and inform the modeling process.
Once we have collected a data set, the next steps involve:
(a) Data selection: If the goal is to predict materials properties, which is the focus of this review, it is crucial to ensure that the available labels y, i.e., the targets we want to predict, are consistent, and special care has to be taken when data from different sources are used. We discuss this step in more detail in section 3 and the outlook.
(b) Featurization is the process in which the structures or raw data are mapped into feature (or design) matrices X, where one row in this matrix characterizes one material. Domain knowledge in the context of the problem we are addressing can be particularly useful in this step, for example, to select the relevant length scales (atomistic, coarse-grained, or global) or properties (electronic, geometric, or involved experimental properties). We give an overview of this process in section 4.


Figure 1. Different approaches to science that evolved over time, starting from empirical observation, generalizations to theories, and simulation of different, complex, phenomena. The latest addition is the data-driven discovery (“fourth paradigm of science”). The supercomputer image was taken from the Oak Ridge National Laboratory.

(c) Sampling: Often, training data are randomly selected from a large pool of training points. But this is not necessarily the best choice as most likely the materials are not uniformly distributed for all possible labels we are potentially interested in. For example, one class (often the low-performing structures) might constitute the majority of the training set and the algorithm will have problems in making predictions for the minority class (which are often the most interesting cases). Special methods, e.g., farthest point sampling (FPS), have been developed to sample the design space more uniformly. In section 3.2 we discuss ways to mitigate this problem and approaches to deal with little data.

3. Learning and Prediction: In section 5 we examine several ways in which one can learn from data, and what one should consider when choosing a particular algorithm. We then describe different methods with which one can improve predictive performance and avoid overfitting (cf. section 6).
To guide the modeling and model selection, methods for performance evaluation are needed. In section 7 we describe best practices for model evaluation and comparison.

4. Interpretation: Often it is interesting to understand what and how the model learned, e.g., to better grasp structure−property relationships or to debug ML models. ML is often seen as a black-box approach to predict numerical values with zero understanding, defeating the goal of science to understand and explain phenomena. Therefore, the need for causal models is seen as a step toward machines "that learn and think like people" (learning as model building instead of mere pattern recognition).28 In section 8 we present different approaches to look into black-box models, or how to avoid them in the first place.

It is important to remember that model development is an iterative process; the understanding gained from the first model evaluations can help to understand the model better and help in refining the data, the featurization, and the model architecture. For this, interpretable models can be particularly valuable.29

The scope of this review is to provide guidance along this path and to highlight the caveats, but also to point to more detailed resources and useful Python packages that can be used to implement a specific step.

An excellent general overview that digs deeper into the mathematical background than this review is the "High-Bias, Low-Variance Introduction to Machine Learning for Physicists" by Mehta et al.;7 recent applications of ML to materials science are covered by Schmidt et al.30 But also many textbooks cover the fundamentals of machine learning; e.g., Tibshirani and Friedman,31 Shalev-Shwartz and Ben-David,32 as well as Bishop33 (from a more Bayesian point of view) focus more on the theoretical background of statistical learning, whereas Géron provides a "how-to" for the actual implementation, also of neural network (NN) architectures, using popular Python frameworks,34 which were recently reviewed by Raschka et al.35

2.1.2. Machine Learning Algorithms. Step three of the workflow described in the previous section, learning and prediction, usually receives the most attention. Broadly, there are three classes, though with fuzzy boundaries, for this step, namely supervised, unsupervised, and reinforcement learning. We will focus only on supervised learning in this review, and only briefly describe possible applications of the other categories and highlight good starting points to help the reader orient in the field.

2.1.2.1. Supervised Learning: Feature Matrix and Labels Are Given. The most widely used flavor, which is also the focus of this review, is supervised learning. Here, one has access to features that describe a material and the corresponding labels (the property one wants to predict).

A common use case is to completely replace expensive calculations with the calculation of features that can then be fed into a model to make a prediction. A different use case can be to still perform molecular simulations, but to use ML to generate better potential energy surfaces (PES), e.g., using "machine learned" force fields. Another promising avenue is Δ-ML in which a model is trained to predict a correction to a coarser level of theory:36 One example would be to predict the correction to DFT energies to predict coupled-cluster energies. Supervised learning can also be used as part of an active learning loop for self-driving laboratories and to efficiently optimize reaction conditions. In this review, we do not focus on this aspect; good starting points are reports from the groups around Alán Aspuru-Guzik37−40 and Lee Cronin.41−44
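To make the supervised setting concrete, the following is a minimal sketch (our illustration, not code from any cited work; the file name and column names are hypothetical) that trains a regression model on precomputed descriptors to predict a continuous target such as the deliverable capacity:

```python
# Minimal supervised-learning sketch: predict a continuous target from
# precomputed descriptor columns. "mof_features.csv" and the column name
# "deliverable_capacity" are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("mof_features.csv")             # one row per material
X = df.drop(columns=["deliverable_capacity"])    # feature matrix X
y = df["deliverable_capacity"]                   # labels y

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=500, random_state=0)
model.fit(X_train, y_train)

print("test MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```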

2.1.2.2. Unsupervised Learning: Using Only the Feature Matrix. Unsupervised learning differs from supervised learning in the sense that it only uses the feature matrix and not the labels (which are often unknown when unsupervised learning is used). Unsupervised learning can help to find patterns in the data, which in turn might provide chemical insight.

2.1.2.2.1. Dimensionality Reduction and Clustering. The importance of unsupervised methods becomes clear when dealing with high-dimensional data, which are notoriously difficult to visualize and understand (cf. section 4.1.0.1). And in fact some of the earliest applications of these techniques were to analyze45−47 and then speed up molecular simulations.48,49 The challenge with molecular simulations is that we explore a 3N dimensional space, where N is the number of particles. For large N, as it is, for example, the case for the simulation of protein dynamics, it can be hard to identify low energy states.48 To accelerate the sampling, one can apply biasing potentials that help the simulation to move over barriers between metastable states. Typically, such potentials are constructed in terms of a small number of variables, known as collective variables, but it can be a challenge to identify what a good choice of the collective variables is when the dimensionality of the system is high. In this context, ML has been employed to lower the dimensionality of the system (cf. Figure 2 for an example of such a dimensionality reduction) and to express the collective variables in this low-dimensional space.

Dimensionality reduction techniques, like principal component analysis (PCA), ISOMAP, t-distributed stochastic neighbor embedding (t-SNE), self-organizing maps,50,51 growing cell structures,52 or sketchmap,53,54 can be used to do so.48 But they can also be used for "materials cartography",55 i.e., to present the high-dimensional space of material properties in two dimensions to help identify patterns in big and high-dimensional data.56 A book chapter by Samudrala et al.57 and a perspective by Ceriotti58 give an overview of applications in materials science.

Figure 2. (A) Three-dimensional energy landscape and (B) its two-dimensional projection using sketchmap, which is a dimensionality reduction technique. The biasing potentials can now be represented in terms of sketchmap coordinates. Figure reproduced from ref 48. Copyright 2012 National Academy of Sciences.
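As a minimal illustration of such a dimensionality reduction (a sketch; the feature matrix below is a random placeholder standing in for real descriptors), a two-dimensional PCA projection can be computed with scikit-learn:

```python
# Sketch of a "materials cartography"-style projection: map a high-dimensional
# feature matrix X (n_materials x n_features) onto two principal components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 50)                    # placeholder for real descriptors

X_scaled = StandardScaler().fit_transform(X)    # PCA is sensitive to feature scales
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print("explained variance ratio:", pca.explained_variance_ratio_)
# X_2d can now be plotted, e.g., colored by a property of interest.
```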
Recently, unsupervised learning, in the form of word-embeddings, which are vectors in the multidimensional "vocabulary space" that are usually used for natural language processing (NLP), has also been used to discover chemistry in the form of structure−property relationships in the chemical literature. This technique could also be used to make recommendations based on the distance of the word-embedding of a compound to the vector of a concept such as thermoelectricity in the word-embedding space.59

2.1.2.2.2. Generative Models. One ultimate goal of ML is to design new materials (which recently has also been popularized as "inverse design"). Generative models, like generative adversarial networks (GANs) or variational autoencoders (VAEs), hold the promise to do this.60 GANs and VAEs can create new molecules,61 or probability distributions,62 with the desired properties on the computer.18 One example for the success of generative techniques (in combination with reinforcement learning) is the discovery of inhibitors for a kinase target implicated in fibrosis, which were discovered in 21 days on the computer and also showed promising results in experiments.63 An excellent outline of the promises of generative models and their use for the design of new compounds is given by Sanchez24 and Elton.64

The interface between unsupervised and supervised learning is known as semisupervised learning. In this setting, only some labels are known, which is often the case when labeling is expensive. This was also the case in a recent study of the group around Ceder,65 where they attempted to classify synthesis descriptions in papers according to different categories like hydrothermal or solid-state synthesis. The initial labeling for a small subset was performed manually, but they could then use semisupervised techniques to leverage the full data sets, i.e., also the unlabeled parts.

2.1.2.3. Reinforcement Learning: Agents Maximizing Rewards. In reinforcement learning67 agents try to figure out the optimal sequence of actions (which is known as a policy) in some environment to maximize a reward. An interesting application of this subfield of ML in chemistry is to find the optimal reaction conditions to maximize the yield or to create structures with desired properties (cf. Figure 3).66,68 Reinforcement learning has also been in the news for the superhuman performance achieved on some video games.69,70 Still, it tends to require a lot of training. AlphaGo Zero, for example, needed nearly 5 million matches, requiring millions of dollars of investment in hardware and computational time.71

2.2. Theory-Guided Data Science

We are at an age in which some argue that "the end of theory" is near,72 but throughout this review we will find that many successful ML models are guided by physics and physical insights.73−75 We will see that the symmetry of the systems guides the design of the descriptors and can guide the design of the models (e.g., by decomposing the problems into subproblems) or the choice of constraints. Sometimes, we will also encounter hybrid approaches where one component of the problem (often the local part, as locality is often an assumption for the ML models, cf. section 4.1.0.2) is solved using ML and, for example, the electrostatic, long-range interaction is added using well-known theory.

Generally, the decomposition of the problem can help to debug the model and make the model more interpretable and physical.76 For example, a physics-guided breakdown of the target proved to be useful in the creation of a model for the equation of state of fluid methane.77

Physical insight can also be introduced using sparsity78 or physics-based functional forms.79 Constraints, introduced for example via Euler−Lagrange constrained minimization or coordinate scaling (stretching the coordinates should also stretch the density), have also proven to be successful in the development of ML learned density functionals.80,81 That physical insight can guide model development has been shown by Chmiela et al., who built a model of potential energy surfaces using forces instead of energies to respect energy conservation (also, the force is a quantity that is well-defined for atoms, whereas the energy is only defined for the full system).82,83

This paradigm of incorporating domain knowledge into the ML workflow is also known as theory-guided data science.84,85 Theory-guided data science can help to get the right answers for the right reasons, and we will revisit it in every chapter of this review.

2.3. Scientific Method in Machine Learning: Strong Inference and Multiple Models

Throughout this review we will encounter the method of strong inference,86,87 i.e., the need for alternative hypotheses, or more generally the integral role of critical thinking, at different places, mostly in the later stages of the ML pipeline when one analyzes a model. The idea here is to always pursue multiple alternative hypotheses that could explain the performance of a model: Is the improved performance really because of a more complex architecture or rather due to better hyperparameter optimization (cf. ablation testing in section 7.8.1), or does the model really learn sensible chemical relationships, or could we achieve similar performance with random labels (cf. randomization tests as discussed in section 7.9)?88,89

ML comes with many opportunities but also many pitfalls. In the following, we review the details of the supervised ML workflow to aid the use of ML for the progress of our field.


Table 1. Common Technical Terms Used in ML and Their Meanings

bagging: acronym for bootstrap aggregating; ensemble technique in which models are fitted on bootstrapped samples from the data and then averaged
bias: error that remains for an infinite number of training examples, e.g., due to limited expressivity
boosting: ensemble technique in which weak learners are iteratively combined to build a stronger learner
bootstrapping: calculating statistics by randomly drawing samples with replacement
classification: process of assigning examples to a particular class
confidence interval: interval of confidence around the predicted mean response
feature: vector with a numeric encoding of a description of a material that the ML uses for learning
fidelity: measure of how close a model represents the real case
fitting: estimating the parameters of some model with high accuracy
gradient descent: optimization by following the gradient; stochastic gradient descent approximates the gradient using a mini-batch of the available data
hyperparameters: tuning parameters of the learner (like learning rate, regularization strength) which, in contrast to model parameters, are not learned during training and have to be specified before training
instance-based learning: learning by heart; query data are compared to training examples to make a prediction
irreducible error: error that cannot be reduced (e.g., due to noise in the data), i.e., that is also there for a perfect model; also known as Bayes error rate
label (target): the property one wants to predict
objective function (cost function): the function that a ML algorithm tries to minimize
one-hot encoding: method to represent categorical variables by creating a feature column for each category and using a value of one to encode the presence and zero to encode the absence
overfitting: the gap between training and test error is large, i.e., the model solely "remembers" the training data but fails to predict on unseen examples
predicting: making predictions for future samples with high accuracy
prediction interval: interval of confidence around the predicted sample response; always wider than the confidence interval
regression: process of estimating the continuous relationship between a dependent variable and one or more independent variables
regularization: describes techniques that add terms or information to the model to avoid overfitting
stratification: data is divided into homogeneous subgroups (strata) such that sampling will not disturb the class distributions
structured data: data that is organized in tables with rows and columns, i.e., data that resides in relational databases
test set: collection of labels and feature vectors that is used for model evaluation and which must not overlap with the training set
training set: collection of labels and feature vectors that is used for training
transfer learning: use knowledge gained on one distribution to perform inference on another distribution
unstructured data: e.g., image, video, audio, text; i.e., data that is not organized in a tabular form
validation set: also known as development set; collection of labels and feature vectors that is used for hyperparameter tuning and which must not overlap with the test and training sets
variance: part of the error that is due to finite-size effects (e.g., fluctuations due to a random split in training and test set)
3. SELECTING THE DATA: DEALING WITH LITTLE, IMBALANCED, AND NONREPRESENTATIVE DATA

The first, but most important, step in ML is to generate good training data.90 This is also captured in the "garbage in, garbage out" saying among ML practitioners. Data matters more than algorithms.6,91 In this section, we will mostly focus on the rows of the feature matrix, X, and discuss the columns of it, the descriptors, in the next section.

That the selection of suitable data can be far from trivial is illustrated with Anscombe's quartet (cf. Figure 4).92 In this archetypal example four different distributions, with distinct graphs, have the same statistics, e.g., due to single high-leverage points. This example emphasizes the notion in ML that statistics can be deceiving, and why in ML so much emphasis is placed on the visualization of the data sets.
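The point can be checked numerically. The sketch below (assuming seaborn is installed; its bundled "anscombe" example data set is downloaded on first use) reproduces the near-identical summary statistics of the four data sets:

```python
# Compute per-data-set summary statistics for Anscombe's quartet: the numbers
# are (almost) the same for all four groups even though the plots differ.
import seaborn as sns

df = sns.load_dataset("anscombe")
for name, group in df.groupby("dataset"):
    print(name,
          "x mean:", round(group.x.mean(), 2),
          "y mean:", round(group.y.mean(), 2),
          "y std:", round(group.y.std(), 2),
          "corr(x, y):", round(group.x.corr(group.y), 2))
```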


Figure 3. Reinforcement learning scheme illustrated based on the approach chosen by Popova et al.66 for drug design. They use a recurrent neural network (RNN) (cf. section 5.1.1.5) for the generation of simplified molecular input line entry system (SMILES) strings and a deep NN for property prediction. In a first stage, both models are trained separately, and then they are used jointly to bias, using the target properties as the reward, the generation of new molecules. This example also nicely shows that the boundary between the different "flavors" of ML is fuzzy and that they are often used together.

Figure 4. Anscombe's quartet shows the importance of visualization.92 The four data sets have the same mean (7.50), standard deviation (1.94), and regression line, but still look completely different.

3.1. Limitations of Hypothetical Databases

Hypothetical databases of COFs, MOFs, and zeolites have become popular and are frequently used as training sets for ML models, mostly because they are the largest self-consistent data sources that are available in this field. But due to the way in which the databases are constructed, they can only cover a limited part of the design space (as one uses a finite, small number of linkers and nodes), which is also not necessarily representative of the "real world".

The problem of idealized models and hypothetical structures is even more pronounced for materials with unconventional electronic properties. Many features that favor topological materials, which are materials with special shapes of their electronic bands due to the symmetries of the atom positions, work against stability. For example, creating a topological insulator (which is insulating in the bulk, but conductive on the surface) involves moving electrons into antibonding orbitals, which weakens the lattice.93 Also, in the real world one often has to deal with defects and kinetic phenomena (real materials are often nonequilibrium structures93,94), while most databases assume ideal crystal structures.

3.2. Sampling to Improve Predictive Performance

A widespread technique in ML is to randomly split all the available data into a training and a test set. But this is not necessarily the best approach as random sampling might not sample some sparsely populated regions of the chemical space. A more reasonable sampling approach would cover as much of the chemical space as feasible to construct a maximally informed training set. This is especially important when one wants to minimize the number of training points. Limiting the number of training points can be reasonable or even essential when the featurization or labeling is expensive, e.g., when it involves experiment or ab initio calculations. But it can also be necessary for computational reasons, as in the case of kernel methods (cf. section 5.2.2), for which the data need to be kept in memory and for which the computational cost scales cubically with the number of training points.

3.2.1. Diverse Set Selection. 3.2.1.1. (Greedy) Farthest Point Sampling. Instead of randomly selecting training points, one can try to create a maximally diverse data set to ensure a more uniform sampling of the design space and to avoid redundancy. Creating such a data set, in which the distances between the chosen data points are maximized, is known as the maximum diversity problem (MDP).95 Unfortunately, the MDP is of factorial computational cost and hence becomes computationally prohibitive for large data sets.96−98 Therefore, in practice, one usually uses a greedy algorithm to perform FPS. Those algorithms add points for which the minimum distance to the already chosen points is maximal (i.e., using the max-min criterion; this sampling approach is also known as Kennard−Stone sampling, cf. pseudocode in Chart 1).

This FPS is also a key to the work by Moosavi et al.,21 in which they use a diverse set of initial reaction conditions, most of which will yield failed reactions, to build their model for reaction condition prediction.

3.2.1.2. Design of Experiments. Efficient exploration is also the main goal of most design of experiment (DoE) methods,99,100 which in chemistry have been widely used for reaction condition or process optimization,101−104 where the task is to understand the relationship between input variables (temperature, reaction time, ...) and the reaction outcome in the least time and effort possible. But they also have been used to generate good initial guesses for computer codes.105,106

If our goal is to perform reaction condition prediction, the use of DoE techniques can be a good starting point to get an initial training set that covers the design space. Similarly, they can also be a good starting point if we want to build a model that correlates polymer building blocks with the properties of the polymer, since also in this case we want to make sure that we sample all relevant combinations of building blocks efficiently. The most trivial approach in DoE is to use a full-factorial design in which the combination of all factors at all possible levels (e.g., all relevant temperatures and reaction times) is tested. But this can easily lead to a combinatorial problem. As we discussed in section 3.2.1.1, one could cover the design space using FPS. But the greedy FPS also has some properties that might not be desirable in all cases.107


Chart 1. Pseudocode for the Greedy Implementation of a FPS Scheme^a
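The pseudocode originally shown in Chart 1 did not survive the text extraction. The following is a minimal Python sketch (not the original chart) of the greedy max-min selection it describes; the feature matrix and the random initialization are illustrative assumptions, and the footnote below lists alternative initializations.

```python
# Greedy farthest point sampling (max-min / Kennard-Stone-like selection).
# X is a (n_samples x n_features) feature matrix; returns indices of the
# selected, maximally diverse points.
import numpy as np

def greedy_fps(X, n_select, seed=0):
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(X)))]          # start from a random point
    # distance of every point to its closest already-selected point
    min_dist = np.linalg.norm(X - X[selected[0]], axis=1)
    while len(selected) < n_select:
        new_idx = int(np.argmax(min_dist))          # max-min criterion
        selected.append(new_idx)
        new_dist = np.linalg.norm(X - X[new_idx], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected

X = np.random.rand(500, 10)                         # placeholder descriptors
print(greedy_fps(X, n_select=10))
```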

^a The initialization could also be to choose a point that is maximally distant from the center, or to use the two most separated points, as in the original Kennard−Stone framework.

For instance, it tends to preferentially select points that lie at the boundaries of the design space. Also, one might prefer that the samples are equally spaced along the different dimensions. Different classical DoE techniques can help to overcome these issues.107 In Latin hypercube sampling (LHS) the range of each variable is binned in equally spaced intervals and the data is randomly sampled from each of these intervals, but in this way some regions of space might remain unexplored. For this reason, the max-min-LHS has been developed, in which evenly spread samples are selected from LHS samples using the max-min criterion.
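As an illustration of the LHS idea (a sketch assuming SciPy 1.7 or newer; the two variables and their bounds are invented for illustration), a small design can be generated as follows:

```python
# Latin hypercube sampling of two hypothetical reaction variables.
from scipy.stats import qmc

sampler = qmc.LatinHypercube(d=2, seed=0)   # d=2: temperature and reaction time
unit_sample = sampler.random(n=8)           # 8 points in the unit hypercube

# scale to physical ranges: temperature 60-120 C, reaction time 1-24 h
points = qmc.scale(unit_sample, l_bounds=[60.0, 1.0], u_bounds=[120.0, 24.0])
print(points)
```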
3.2.1.3. Alternative Techniques. An alternative for the selection of a good set of training points can be the use of special matrix decompositions. CUR is a low-rank matrix decomposition into matrices of actual columns (C) and rows (R) of the original matrix, whose main advantage over other matrix decompositions, such as PCA, is that the decomposition is much more interpretable due to the use of actual columns and rows of the original matrix.108 In the case of PCA, which builds linear combinations of features, one would have to analyze the loadings of the principal components to get an understanding. In contrast, the CUR algorithm selects the columns (features) and rows (structures) which have the highest influence on the low-rank fit of the matrix. And selecting structures with high statistical leverage is what we aim for in diverse set selection. Bernstein et al. found that the use of CUR to select the most relevant structures was the key for their self-guided learning of PES, in which a ML force field is built in an automated fashion.109

Further, also D-optimal design algorithms have been put to use, in which samples are selected that maximize ∥XᵀX∥, where X is the information matrix (in some references it is also called the dispersion matrix), which contains the model coefficients in the columns and the different examples in the rows.110−112 Since it requires the model coefficients, it was mostly used with multivariate linear regression models.

Moreover, other unsupervised learning approaches such as self-organizing maps,50 k nearest neighbor (kNN),113 sphere exclusion,114 or hierarchical clustering115,116 have been used, though mostly for cheminformatics applications.117

3.2.1.4. Sampling Configurations. For the fitting of models for potential energy surfaces, nonequilibrium configurations are needed. Here, it can be practical to avoid arbitrarily sampling from trajectories of molecular simulations as consecutive frames are usually highly correlated. To avoid this, normal mode sampling, where the atomic positions are displaced along randomly scaled normal modes, has been suggested to generate out-of-equilibrium chemical environments and has been successfully applied in the training of the ANI-1 potential.118 Similarly, binning procedures, where e.g. the amplitude of the force in images of a trajectory is binned, have been proposed. When generating the training data, one can then sample from all bins (like in LHS).83

Still, one needs to remember that the usage of rational sampling techniques does not necessarily improve the predictive performance on a brand-new data set, which might have a different underlying distribution.119 For example, hypothetical databases of COFs contain mainly large pore structures, which are not as frequent in experimental structures. Training a model on a diverse set of hypothetical COFs will hence not guarantee that our model can predict properties of experimental structures, which might be largely nonporous.

An alternative to rationally chosen (e.g., using DoE techniques or FPS), and hence static, data sets is to let the model (actively) decide which data to use. We discuss this active learning technique next.

3.3. Active Learning

An alternative to using static training sets, which are assembled before training, is to let the machine decide which data are most effective to improve the model at its current state.120 This is known as active learning.121 And it is especially valuable in cases where the generation of training data is expensive, such as for experimental data or high-accuracy quantum chemical calculations, where a simple "Edisonian" approach, in which we create a large library of reference data by brute force, might not be feasible.

Similar ideas, like adding quantum-mechanical data to a force field when needed, have already been used in molecular dynamics simulations before they became widespread among the ML practitioners in materials science and chemistry.122,123

One of the ways to determine where the current model is ambiguous, i.e., to decide when new data is useful, is to use an ensemble of models (which is also known as "query by committee").124,125 The idea here is to train an ensemble of models, which are slightly different and hence will likely give different, wrong, answers if the model is used outside its domain of applicability (cf. section 7.6); but the answers will tend to agree mostly when the model is used within the domain of applicability.

Another form of uncertainty sampling is to use a model that can directly output a probability estimate, like the width of the posterior (target) distribution of a Gaussian process (cf. section 5.2.3 for more details). One can then add training points to the space where the distribution is wide and the model is uncertain.126

Botu and Ramprasad reported a simpler strategy, which is related to the concept of the domain of applicability, which we will discuss below (cf. section 7.6). The decision if a configuration needs new training data is not made based on an uncertainty measure but merely by using the distance of the fingerprints to the already observed ones.127 Active learning is closely linked to Bayesian hyperparameter optimization (cf. section 6.1) and self-driving laboratories, as they have the goal to choose experiments in the most efficient way, where active learning tries to choose data in the most efficient way.128,129
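To make the query-by-committee idea concrete, the following minimal sketch (our illustration, not code from the cited works) trains a small committee on bootstrap resamples of the labeled data and selects the unlabeled candidate with the largest prediction spread for labeling next; the arrays are synthetic placeholders.

```python
# Query by committee: ensemble disagreement as a proxy for model uncertainty.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

X_train = np.random.rand(50, 8); y_train = np.random.rand(50)   # labeled data (placeholder)
X_pool = np.random.rand(1000, 8)                                 # unlabeled candidates

rng = np.random.default_rng(0)
committee = []
for _ in range(5):
    idx = rng.integers(0, len(X_train), len(X_train))   # bootstrap resample of the training set
    committee.append(GradientBoostingRegressor(random_state=0).fit(X_train[idx], y_train[idx]))

predictions = np.stack([m.predict(X_pool) for m in committee])   # shape (5, n_pool)
disagreement = predictions.std(axis=0)

next_idx = int(np.argmax(disagreement))
print("label this candidate next:", next_idx)
```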
3.4. Dealing with Little Data

Often, one can use tricks to artificially enlarge the data set to improve model performance. But these tricks generally require some domain knowledge to decide which transformations are applicable to the problem, i.e., which invariances exist. For example, if we train a force field for a porous crystal, one can use the symmetry of the crystal to generate configurations with equivalent energies (which would be a redundant operation when one uses descriptors that already respect this symmetry). For image data, like steel microstructures130 or 2D diffraction patterns,131 several techniques have been developed, which include randomly rotating, flipping, or mirroring the image, which is, for example, implemented in the ImageDataGenerator module of the keras Python package. Notably, there is also effort to automate the augmentation process, and promising results have been reported for images.132 However, data augmentation always relies on assumptions about the equivariances and invariances of the data, wherefore it is difficult to develop general rules for any type of data set.

Still, the addition of Gaussian noise is a method that can be applied to most data sets.133 This works effectively as data augmentation if the data is presented multiple times to the model (e.g., in NNs where one has multiple forward and backward passes of the data through the network). By the addition of random noise, the model will then see a slightly different example upon each pass of the data. The addition of noise also acts as a "smoother", which we will explore in more detail when we discuss regularization in section 6.2.1.

Oviedo et al. reported the impact data augmentation can have in materials science. Thin-film X-ray diffraction (XRD) patterns are often distorted and shifted due to strain or lattice contraction or expansion. Also, the orientations of the grains are not randomized, as they are in a powder, and some reflexes will have an increased intensity depending on the orientation of the film. For this reason, conventional simulations cannot be used to form a training set for a ML model to predict the space group based on the diffraction pattern. To combat the data scarcity problem, the authors expanded the training set, generated by simulating diffraction patterns from a crystal structure database, by taking data from the training set and by scaling, deleting, or shifting reflexes in the patterns. In this way, the authors generated new training data that correspond to the typical experimental distortions.134 A similar approach was also chosen by Wang et al., who built a convolutional neural network (CNN) to identify MOFs based on their X-ray powder diffraction (XRPD) patterns. Wang et al. predicted the patterns for MOFs in the Cambridge Structure Database (CSD) and then augmented their data set by creating new patterns by merging the main peaks of the predicted patterns with (shuffled) noise from patterns they measured in their own lab.135

Sometimes, data augmentation techniques have also been used to address nonuniqueness or invariance problems. The Chemception model is a CNN, inspired by models for image recognition, that is trained to predict chemical properties based on images of molecular drawings.136 The prediction should, of course, not depend on the relative orientation of the molecule in the drawing. For this reason, the authors introduced augmentation methods such as rotation. Interestingly, many image augmentation techniques also use cropping. However, the local information density in drawings of molecules is higher than in usual images and hence losing a part of the image would be a more significant problem.

Another issue is that not all data sets are unique. For example, if one uses (noncanonical) SMILES strings to describe molecules, one has to realize that they are not unique. Therefore, Bjerrum trained a model on all possible SMILES strings for a molecule and obtained a data set that was 130 times bigger than the original data set.137
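As a concrete illustration of such SMILES enumeration (a sketch assuming RDKit is installed; this is not Bjerrum's original implementation), one can shuffle the atom order before writing a noncanonical SMILES string:

```python
# SMILES-enumeration augmentation: write the same molecule as several
# (noncanonical) SMILES strings by randomizing the atom numbering.
import random
from rdkit import Chem

def enumerate_smiles(smiles, n_variants=5, seed=0):
    random.seed(seed)
    mol = Chem.MolFromSmiles(smiles)
    variants = set()
    for _ in range(n_variants):
        order = list(range(mol.GetNumAtoms()))
        random.shuffle(order)                       # random atom numbering
        shuffled = Chem.RenumberAtoms(mol, order)
        variants.add(Chem.MolToSmiles(shuffled, canonical=False))
    return variants

print(enumerate_smiles("CC(=O)Oc1ccccc1C(=O)O"))    # aspirin as an example molecule
```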
This idea was also used for the Coulomb matrix, a popular descriptor that encodes the structure by capturing all pairwise Coulomb terms, based on the nuclear charges, in a matrix (cf. section 4.2.2.3). Without additional steps, this representation is not permutation invariant (swapping rows or columns does not change the molecule but would change the representation). Montavon et al. used an augmented data set in which they mapped each molecule to a set of randomly sorted Coulomb matrices and could improve upon other techniques of enforcing permutation symmetry, likely due to the increased data set size.138

But also simple physical heuristics can help if there is only little data to learn from. Rhone et al. used ML to predict the outcome of reactions in heterogeneous catalysis, where only little curated data is available.139 Hence, they aided their model with a reaction tree and chose the prediction of the model that is closest to a point in the reaction tree (and hence a chemically meaningful reaction). Moreover, they also added heuristics like conservation rules and penalties for some transformations (e.g., based on the difference of heavy atoms in educts and products) to support the model.

Another promising avenue is multitask learning approaches where a model, like a deep neural network (DNN), is trained to predict several properties. The intuition here is to capture the implicit information in the relationship between the multimodal variables.140,141 Closely related to this are transfer learning approaches (cf. section 10.3), which train a model on a large data set and then "refine" the weights of the model using a smaller data set.142 Again, this approach is a well-established practice in the "mainstream" ML community.


Given the importance of the data scarcity problem, there is a lot of ongoing effort in developing alternative solutions to combat this challenge, many of which build on encoding−decoding architectures. Generative models like GANs or VAEs can be used to create new examples by learning how to generate an underlying distribution of the data.143

Some problems may also be suitable for so-called one-shot learning approaches.76,144,145 In the field of image recognition, the problem of correctly classifying an image after seeing only one training example for this class (e.g., correctly assigning names to images of persons after having seen only one image for each person) has received a lot of interest, supposedly because this is what humans are able to do, but machines are not, at least not in the "usual" classification setting.28

One- or few-shot learning is based on learning a so-called attention mechanism.146 Upon inference, the attention mechanism, which is a distance measure to the memory, can be exploited to compare the new example to all training points and express the prediction as a linear combination of all labels in the support set.147 One approach to do this is Siamese learning, using an NN that takes two inputs and then learns an attention mechanism. This has also been used, in a refined formulation, by Pande and co-workers to classify the activity of small molecules on different assays for pharmaceutical activity.148 Such techniques are especially appealing for problems where only little data is available.

Still, one always should remember that there is no absolute number that defines what "little data" is. This number depends on the problem, the model, and the featurization. But it can be estimated using learning curves, in which one plots the error of the model against the number of training points (cf. section 7).

3.5. Dealing with Imbalanced Data Labels

Often, data is imbalanced, meaning that the different classes which we attempt to predict (e.g., "stable" and "unstable" or "low performing" and "high performing") do not have the same number of examples in our training set. Balachandran et al. faced this challenge when they tried to predict compounds that break spatial inversion symmetry and hence could be interesting for, e.g., their piezoelectric properties.149 They found that one symmetry group was misclassified to 100% due to imbalanced data. To remedy this problem, they used an oversampling technique, which we will briefly discuss next.

Oversampling, which means adding points to the underrepresented class, is one of the most widely used approaches to deal with imbalanced data. The opposite approach is undersampling, in which instances of the majority class are removed. Since random oversampling can cause overfitting (due to replication of training points) and undersampling can lead to poorer predictive performance (as training points are eliminated), both strategies have been refined by means of interpolative procedures.150

The synthetic minority oversampling technique (SMOTE), for example, creates new (synthetic) data for the minority class by randomly selecting a point on the vector connecting a data point from the minority class with one of its nearest neighbors. In SMOTE, each point in the minority class is treated equally, which might not be ideal since one would expect that examples close to class boundaries are more likely to be misclassified. Borderline-SMOTE and the adaptive synthetic sampling approach (ADASYN) try to improve on this point. In a similar vein, it can also be easier to learn clear classification rules when so-called Tomek links151 are deleted. Tomek links are pairs of two points from different classes for which the distance to the example from the alternative class is smaller than to any other example from their own class.

Still, care needs to be taken in the case of very imbalanced data, in which algorithms can have difficulties recognizing class structures. In this case over- or undersampling can even deteriorate the performance.152

A useful Python package to address data imbalance problems is imbalanced-learn, which implements all the methods we mentioned and which are analyzed in more detail in a review by He and Garcia.150 There they also discuss cost-sensitive techniques. In these approaches, a cost matrix is used to describe a higher penalty for misclassifying examples from a certain class, which can be an alternative strategy to deal with imbalanced data.150 Importantly, oversampling techniques should only be applied, as all data transformations, after the split into training and test sets.

In any case, it is also advisable to use stratified sampling, which ensures that the class proportions in the training set are equal to the ones in the test set. An example of the influence of stratified sampling is shown in Figure 5, where we contrast the random with the stratified splitting of structures from the database of Boyd et al.13

Figure 5. Example for the importance of stratification. For this example, we use a threshold of 2.5 mmol CO2/g to group structures into low and high performing materials, which is slightly higher than the threshold chosen by Boyd et al.13 Then, we randomly draw 100 structures and can observe that the class distribution gets distorted; sometimes we do not have any high performing materials in our sample. Stratification can be used to remedy this effect.
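The following sketch illustrates the two recommendations above, a stratified train/test split and SMOTE oversampling applied only to the training portion, using scikit-learn and imbalanced-learn; the feature matrix and labels are synthetic placeholders.

```python
# Stratified split + SMOTE applied only to the training set.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

X = np.random.rand(500, 20)                 # placeholder features
y = np.array([0] * 450 + [1] * 50)          # imbalanced binary labels (90/10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0   # keep class proportions in both splits
)

X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
print("class counts before:", np.bincount(y_train), "after:", np.bincount(y_res))
```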
4. WHAT TO LEARN FROM: TRANSLATING STRUCTURES INTO FEATURE VECTORS

After having reviewed the rows of the feature matrix, we now focus on the columns and discuss ways to generate those columns (descriptors) and how to select the best ones (as more is not always better in the case of feature columns). The possibilities for structural descriptors are so vast that it is impossible to give a comprehensive overview, especially since there is no silver bullet and the performance of descriptors depends on the problem and the learning setting. In some cases, local fingerprints based on symmetry functions might be more appropriate, e.g., for potential energy surfaces, whereas in other cases, where structure−property insights are needed, higher-level features such as pore shapes and sizes can be more instructive.

An important distinction of NNs compared to classical ML models, like kernel methods (cf. section 5.2.2), is that NNs can perform representation learning; that is, the need for highly engineered structural descriptors is less pronounced than for "classical" learners as NNs can learn their own features from unstructured data. Therefore, one will find NN models that directly use the positions and the atomic charges, whereas such an approach is deemed to fail with classical ML models, like kernel ridge regression (KRR), that rely on structured data.

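To make the imbalanced-data workflow of section 3.5 concrete, the following is a minimal sketch using the imbalanced-learn and scikit-learn packages mentioned above. The data set is synthetic (a stand-in for "low"/"high" performing materials); note that the stratified split happens first, so that the oversampling never sees the test set.

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# synthetic, strongly imbalanced labels as a stand-in for "low"/"high" performing materials
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# stratified split first, so that class proportions are preserved and the test set stays untouched
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# oversample the minority class in the training data only
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
print(Counter(y_train), Counter(y_res))
```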

Figure 6. Building principle of the MOFid and MOFkey identifiers for HKUST-1. Bucior et al. use a SMILES-derived format for the MOFid, whereas the MOFkey is inspired by the InChIKey format, a hashed version of the InChI identifier, which is more standardized than SMILES. Figure adapted from Bucior et al.157
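To illustrate what encoding the linker chemistry with SMILES (as in the MOFid scheme of Figure 6) looks like in practice, the following is a hedged sketch using RDKit, which is introduced later in this section. It is not the MOFid code of Bucior et al.; the molecule and parameters are merely illustrative.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# benzene-1,3,5-tricarboxylic acid (the BTC linker of HKUST-1), written as SMILES
btc = Chem.MolFromSmiles("OC(=O)c1cc(C(=O)O)cc(C(=O)O)c1")

canonical_smiles = Chem.MolToSmiles(btc)  # canonical, unique string (useful for data mining)
fingerprint = AllChem.GetMorganFingerprintAsBitVect(btc, radius=2, nBits=2048)
print(canonical_smiles, fingerprint.GetNumOnBits())
```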

Figure 7. Illustration of transformations of crystal structures to which an ideal descriptor should be invariant. Structures drawn with iRASPA.167 an approach is deemed to fail with classical ML models, like topologies wherefore not all standard featurization approaches kernel ridge regression (KRR), that rely on structured data. are ideally suited for MOFs. Therefore, molecular fingerprints The representation learning of NNs can potentially leverage can still be interesting to encode the chemistry of the linkers in regularities in the data that cannot be described with classical MOFs, which can be important for electronic properties or  descriptors but it only works with large amounts of data. We more complex gas adsorption phenomena (e.g., involving CO2, will discuss this in more detail when we revisit special NN H2O). architectures in section 5.1.1.2. A decomposition of MOFs into the building blocks and The quest for good structural descriptors is not new. encoding of the linker using SMILES was proposed in the Cheminformatics researchers tried to devise strategies to MOFid scheme from Bucior et al. (cf. Figure 6).157 This describe structures, e.g., to determine whether a compound has scheme is especially interesting to generate unique names for already been deposited on the chemical abstract services MOFs and in this way to simplify data-mining efforts. For (CAS) database, which led to the development of Morgan example, Park et al. had to use a six-step process to identify fingerprints.153 Also the demand for a quantitative structure whether a string represents the name of a MOF in their text- activity relationship (QSAR) in drug development led to the mining effort,158 and then one still has to cope with development of a range of descriptors that are often highly nonuniqueness problems (e.g., Cu-BTC vs HKUST-1). One optimized for a specific application (also because simple linear main problem of such fingerprinting approaches for MOFs is models have been used) as well as heuristics (e.g., Lipinkski’s that they require the assignment of bonds and bond orders, rule of five154). But also fingerprints (e.g., Daylight finger- which is not trivial for solid structures,159 and especially for prints)i.e., representations of the molecular graphs have experimental structures that might contain disorder or been developed. We will not discuss them in detail in this incorrect protonation. review as most of them are not directly applicable to solid-state The most popular fingerprints for molecular systems are systems.155,156 Still, one needs to note that for the description implemented and documented in libraries like RDKit,160 of MOFs one needs to combine information about organic PaDEL,161 or Mordred.162 For a more detailed introduction molecules (linkers), metal centers, and the framework into descriptors for molecules we can recommend a review by


Warr163 and the Deep Learning for the Life Sciences book,164 A descriptor should be efficient to calculate. The cardinal which details how to build ML systems for molecules. reason for using supervised ML is to make simulations more ffi 4.1. Descriptors e cient or to avoid expensive experiments or calculations. If the descriptors are expensive to compute, ML no longer fulfills There are several requirements that an ideal descriptor should this objective and there is no reason to add a potential error 165,166 fulfill to be suitable for ML: source. A descriptor should be invariant with respect to trans- A descriptor should be continuous: For differentiability, formations that preserve the target property (cf. Figure 7). which is needed to calculate, e.g., forces, and for some For crystal structures, this means that the representations materials design applications61 it is desirable to have should respect periodicity, translational, rotational, and continuous descriptors. If one aims to use the force in the permutation symmetry (i.e., the numbering of the atoms in loss function (force-matching) of a gradient descent algorithm, the fingerprint should not influence the prediction). Similarly, at least second order differentiability is needed. This is not one would want equivariances to be conserved. Equivariant given for many of the descriptors which we will discuss below functions transform in the same way as their arguments, as is, (like global features as statistics of elemental properties) and is for example, the case for the tensorial properties like the force one of the main distinctions of the symmetry functions from (negative gradient of energy) or the dipole moment, which the other, often not localized, tabular descriptors which we will both translate the same way as the positions.168,169 discuss. Respecting those symmetries is important from a physics Before we discuss some examples in more detail, we will perspective as (continuous) symmetries are generally linked to review some principles that we should keep in mind when a conserved property (cf. Noether’s theorem, e.g., rotational designing the columns of the feature matrix. invariance corresponds to conservation of angular momen- 4.1.0.1. Curse of Dimensionality. One of the main tum). Conceptually, this is different from classical force field paradigms that guide the development of materials descriptors design where one usually focuses on correct asymptotic is the so-called curse of dimensionality, which describes that it is fi behavior. In ML, the intuition is to rather use symmetries to often hard to nd decision boundaries in a high-dimensional preclude completely nonphysical interactions. space as the data often no longer covers the space. For As discussed above, one could in principle also attempt to example, in 100 dimensions nearly the full edge length is include those symmetries using data augmentation techniques, needed to capture 10% of the total volume of the 100- but it is often more robust and efficient to “hard-code” them dimensional hypercube (cf. Figure 8). This is also known as on the level of the descriptor. Notably, the introduction of the invariances on the descriptor level also removes alignment problems, when one would like to compare two systems. A descriptor should be unique (i.e., nondegenerate). This means that each structure should be characterized by one unique descriptor and that different structures should not share the same descriptor. 
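A small numerical check can make the invariance requirement illustrated in Figure 7 tangible: raw Cartesian coordinates change under rotation and translation, whereas a descriptor built from sorted interatomic distances does not. This is a toy sketch with made-up coordinates, not taken from any of the cited references.

```python
import numpy as np

coords = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.5, 0.0]])

def distance_descriptor(x):
    """Sorted pairwise distances: invariant to rotation, translation, and atom ordering."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return np.sort(d[np.triu_indices(len(x), k=1)])

theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
rotated = coords @ rot.T + np.array([1.0, -2.0, 0.5])   # rotate and translate the structure

print(np.allclose(coords, rotated))                                              # False
print(np.allclose(distance_descriptor(coords), distance_descriptor(rotated)))    # True
```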
When this is not the case, the model will produce prediction errors that cannot be removed with the addition of data.170 Von Lilienfeld et al. nicely illustrate this in analogy to the proof of the first Hohenberg−Kohn theorem Figure 8. Illustration of the empty space phenomenon (the curse of trough reductio ad absurdum.171 This uniqueness is automati- dimensionality). For this illustration we consider the data to be cally the case for invertible descriptors. uniformly distributed in a d dimensional unit cube. The edge length of a hypercube corresponding to a fraction q of the total volume is q1/d, A descriptor should allow for (cross-element) generalization. which we plotted here for different d. The dotted line in the figure Ideally, one does not want to be limited in system size or represents 10% of the volume, for which we would nearly need to system composition. Fixed vector or matrix descriptors, like the consider the full edge length in 100-dimensional space. This means Coulomb matrix (see section 4.2.2.3), can only represent that locality is lost in high dimensions, which can be problematic for systems smaller than or equal to the dimensionality of the algorithms that use the local neighborhood for their reasoning. descriptor. Also, one sometimes finds that the linker type172 or the monomer type is used as a feature. Obviously, such an empty space phenomenon and describes that similarity-based approach does not allow for generalization to new linkers or reasoning can fail in high dimensions given that also the monomer types. nearest neighbors are no longer close in such high-dimensional The cross-element generalization is typically not possible if 90 ’ ff spaces. Often, this is also discussed in terms of Occam s di erent atom types are encoded as being orthogonal (e.g., by razor: “Simpler solutions are more likely to be correct than using a separate NN for each atom type in a high-dimensional complex ones.” This not only reflects that learning in high- neural network potential (HDNNP) or by grouping dimensional space brings its own problems but also that interactions by the atomic numbers, e.g., bag of bonds simplicity, which might be another way of asking of (BoB), partial radial distribution function (RDF)). To explainability, for itself is a value (due to its aesthetics) we introduce generalizability across atom types one needs to use should strive for.173 More formally, this is related to the descriptors that allow for a chemically reasonable measure of minimum descriptor length principle174 which views learning similarity between atom types (and trends in the periodic as a compression process and in which the best model is the table). What an appropriate measure of similarity is depends smallest one in terms of itself and the data (this idea is rooted on the task at hand, but an example for a descriptor that can be in Solomonoff’s general theory of inference175).176,177 relevant for chemical reactivity or electronic properties is the 4.1.0.2. Chemical Locality Assumption. Many descriptors electronegativity. that we discuss below are based on the assumption of chemical


Figure 9. Nonexhaustive overview of the landscape of descriptors for solids. In blue, we highlight descriptors for which we are aware of an application in the field of porous materials, for which we give an example in green.

locality, meaning that the total property of a compound can be decomposed into a sum of contributions of local (atom-centered) environments:

\text{property}(\text{descriptor}) = \sum_i^{\text{atoms}} \text{model}_i(\text{descriptor}_i) \qquad (1)

This approximation (cf. eq 1) is often used in models describing the PES. The locality approximation is usually justified based on the nearsightedness principle of electronic matter, which says that a perturbation at a distance has little influence on the local density.178 This "nearsighted" approach also guided the development of many-body potentials like embedded atom methods, linear-scaling DFT methods, or other coarse-grained models in the past (also here the system is divided into subsystems).179,180

The division into subsystems can also be an asset for the training of ML models, as one can learn on fragments to predict larger systems, as has been done for example for a HDNNP for MOF-5.125 Also, this approach makes it easier to incorporate size extensivity, i.e., to ensure that the energy of a system composed of the subsystems A + B is indeed the sum of the energies of A and B.181

But such an approach might be less suited for cases like gas adsorption, where not only the local chemical environment (especially for chemisorption) but also the pore shape, size, and accessibility play a role—i.e., one wants pore-centered rather than atom-centered descriptors. For this case, global, "farsighted" descriptors of the pore size and shape, like pore limiting diameters, accessible surface areas,182−184 or persistent homology fingerprints,185 can be better suited. This is important to keep in mind as target similarity, i.e., how well we can approach the property of interest (e.g., the PES or the gas adsorption properties), is one of the main contributions to the error of ML models.186

Also, one should be aware that typically cutoffs of 6 Å around an atom are used to define the local chemical environments. In some systems, the physics of the phenomenon is, however, dominated by long-range behavior that cannot be described within the locality approximation.187 Correctly describing such long-range effects is one of the main challenges of ongoing research.188

Importantly, a model that assumes atom-centered descriptors is invariant to the order of the inputs (permutational invariance).189 Interestingly, classical force fields do not show this property: the interactions are defined on a bond graph, and the exchange of an atom pair can change the energy.168,190

4.2. Overview of the Descriptor Landscape

In Figure 9 we show an overview of the space of material descriptors. We distinguish two main classes of descriptors: local ones, which only describe the local (chemical) environment, and global ones, which describe the full structure at once.

Nearly as vast as the descriptor landscape is the choice of tools that are available to calculate these descriptors. Some notable developments are the matminer package,191 which is written in Python; the DSCribe package, which has a Python interface but implements the computationally expensive routines in C/C++; and AMP, which also has a Python interface and can perform the expensive fingerprinting in Fortran.192 The von Lilienfeld group is currently also implementing efficient Fortran routines in their QML package.193 Other packages like CatLearn,194 which also has functionalities for surfaces, or QUIP,195 aenet,196 simple-nn,197 and ai4materials198 also contain functions for fingerprinting of solid systems. For the calculation of features based on elemental properties, i.e., statistics based on the chemical composition, the Magpie package is frequently used.199
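As a concrete illustration of how such a featurization library is used in practice, the following is a minimal sketch with the DSCribe package mentioned above. The structure, species, and hyperparameter values are placeholders (a porous framework would normally be read from a CIF file, e.g., via ASE), and keyword names may differ slightly between DSCribe versions.

```python
from ase.build import bulk
from dscribe.descriptors import SOAP

# placeholder structure; a MOF or zeolite would typically be loaded with ase.io.read("structure.cif")
structure = bulk("Cu", "fcc", a=3.6)

# SOAP descriptor with illustrative (not converged) hyperparameters
soap = SOAP(species=["Cu"], r_cut=5.0, n_max=8, l_max=6, periodic=True)
features = soap.create(structure)  # one feature vector per atom in the structure
print(features.shape)
```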


4.2.0.1. General Theme of Local and Global Fingerprints. In the following, we will also see that many fingerprinting approaches are just a variation of the same theme, namely many-body correlation functions, which can be expressed in Dirac notation as

\langle \mathbf{r} \,|\, \chi_j^{(1)} \rangle = \sum_i g^{(2)}(\mathbf{r} - \mathbf{r}_{ij}) \, | \alpha_i \rangle \qquad (2)

This shows that the abstract atomic configuration |χ_j^{(v)}⟩, in terms of the (v + 1)-body correlation, can be described with a cross-correlation function (g^{(2)} being equivalent to the radial distribution function) and information about the elemental identity of atom i, |α_i⟩ (see Figure 10). It also already indicates why the term "symmetry functions" is often used for functions of this type. Descriptors based on eq 2 are said to be symmetrized, e.g., invariant to translations of the entire structure (symmetrically equivalent positions will all give rise to the same fingerprint).

Figure 10. Illustration of the concept of featurization using symmetry functions. There are atom-centered local environments that we can represent with abstract kets |χ⟩, expressed in the basis of Cartesian coordinates ⟨r|. The figure is a modified version of an illustration from Ceriotti and co-workers.200

Some fingerprints take into account higher orders of correlations (like triples in the bispectrum), but the idea behind most of them is the same—they are just projected onto a different basis (e.g., spherical harmonics, ⟨nlm|, instead of the Cartesian basis ⟨r|).200,201 Notably, it was recently shown that also three-body descriptors do not uniquely specify the environment of an atom, but Pozdnyakov et al. also showed that in combination with many neighbors, such degeneracies can often be lifted.202

Different flavors of correlation functions are used for both local and global descriptors, and the different flavors might converge differently with respect to the addition of terms in the many-body expansion (going from two-body to the inclusion of three-body interactions and so on).203 Local descriptors are usually derived by multiplying a version (projection onto some basis) of the many-body correlation function with a smooth cutoff function such as

f_c(r_{ij}) = \begin{cases} 0.5 \times \left[ \cos\!\left( \dfrac{\pi r_{ij}}{r_c} \right) + 1 \right] & \text{for } r_{ij} \leq r_c \\ 0 & \text{for } r_{ij} > r_c \end{cases} \qquad (3)

where r_c is the cutoff radius, which determines the set of atoms i that the summation in eq 2 runs over.

We will start our discussion with local descriptors that use such a cutoff function (cf. eq 3) and which are usually employed when atomic resolution is needed. In some cases, especially when only the nearest neighbors should be considered, Voronoi tessellations are used to assign which atoms from the environment should be included in the calculation of the fingerprint. This approach is based on the nearest-neighbor assignment method that was put forward by O'Keeffe.204

4.2.1. Local Descriptors. 4.2.1.1. Instantaneous Correlation Functions via Cutoff Functions. For the training of models for PES, flavors of instantaneous correlation functions have become the most popular choices and are often used with kernel methods (cf. section 5.2.2) or HDNNPs (cf. section 5.1.1.1). The archetypal examples of this type are the atom-centered symmetry functions suggested by Behler and Parinello, where the two-body term has the following form

G_i^2 = \sum_{j \neq i} \exp[-\eta (r_{ij} - R_s)^2] \, f_c(r_{ij}) \qquad (4)

which is a sum of Gaussians, and the number of neighbors that are taken into account in the summation is determined by the cutoff function f_c (cf. eq 3). Behler and Parinello also suggest a three-body term, which takes the internal angles θ_ijk for all triplets of atoms into account. This featurization approach has been the driver of the development of many HDNNPs (cf. section 5.1.1.1).

One should note that these fingerprints contain a set of hyperparameters that should be optimized, like the shift R_s or the width of the Gaussian η, for which usually a set of different values is used to fingerprint the environment. Also, similar to molecular simulations, the cutoff r_c is a parameter that should be carefully set to ensure that the results are converged.
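As a concrete illustration, a minimal NumPy sketch of the cutoff function of eq 3 and the radial symmetry function of eq 4 could look as follows. The neighbor distances and hyperparameter values are arbitrary examples, not settings used in any of the cited works.

```python
import numpy as np

def f_cut(r, r_c):
    """Cosine cutoff function of eq 3."""
    return np.where(r <= r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)

def g2(distances, eta, r_s, r_c):
    """Radial (two-body) symmetry function of eq 4.
    `distances` are the distances r_ij from the central atom i to its neighbors j."""
    return np.sum(np.exp(-eta * (distances - r_s) ** 2) * f_cut(distances, r_c))

# illustrative neighbor distances (in Angstrom) and arbitrary hyperparameters
r_ij = np.array([1.1, 1.8, 2.5, 4.0, 6.5])
print(g2(r_ij, eta=0.5, r_s=0.0, r_c=6.0))
```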
workers.205 They started by proposing the bispectrum We will start our discussion with local descriptors that use descriptor which is based on expanding the atomic density ff such a cuto function (cf. eq 3) and which are usually distribution (with Dirac delta functions for g in eq 2)in employed when atomic resolution is needed. spherical harmonics. This allows, as advantage over the In some cases, especially when only the nearest neighbors Behler−Parinello symmetry functions, for systematic improve- should be considered, Voronoi tessellations are used to assign ments via the addition of spherical harmonics. which atoms from the environment should be included in the This corresponds to a projection of the atomic density onto calculation of the fingerprint. This approach is based on the a four-dimensional sphere and representing the location in nearest neighbor assignment method that was put forward by terms of four-dimensional spherical harmonics.203,206 This 204 O’Keeffe. descriptor was improved with the smooth overlap of atomic 4.2.1. Local Descriptors. 4.2.1.1. Instantaneous Corre- positions (SOAP) methodology, which is a smooth similarity lation Functions via Cutoff Functions. For the training of measure of local environments (covariance kernel, which we models for PES, flavors of instantaneous correlation functions will discuss in section 5.2.2) by writing g(r)ineq 2 using atom- have become the most popular choices and are often used with centered Gaussians as expansions with sharp features (Dirac kernel methods (cf. section 5.2.2) or HDNNP (cf. section delta functions in the bispectrum) that are slowly converging. 5.1.1.1). Given that SOAP is a kernel, this descriptor found the most The archetypal examples of this type are the atom-centered application in kernel-based learning (which we will discuss symmetry functions suggested by Behler and Parinello, where below in more detail, cf. section 5.2.2), as it directly defines a the two-body term has the following form similarity measure between environments (overlap between


the smooth densities), which has recently been extended to tensorial properties.207 This enabled Wilkins et al. to create models for the polarizability of molecules.208

Figure 11. Schema illustrating the construction of property labeled materials fragments (PLMF). The concept behind this descriptor is that for a crystal structure (a) the nearest neighbors are assigned using Voronoi tessellation (b) and then used to construct a crystal graph that can be colored with properties, which is then decomposed into subgraphs (d). Figure reprinted from Isayev et al.210

4.2.1.2. Voronoi Tessellation Based Assignment of Local Environments. In some cases the partitioning into Wigner−Seitz cells using Voronoi tessellation is used instead of a cutoff function. These Wigner−Seitz cells are regions which are closer to the central atom than to any other atom. The faces of these cells can then be used to assign the nearest neighbors and to determine coordination numbers.204 Ward et al. used this method of assigning neighbors to construct local descriptions of the environment that are not sensitive to small changes that might occur during a geometry relaxation.209 These local descriptors can be based on comparing elemental properties, like the electronegativity, of the central atom to its neighbors

\delta p_i = \frac{\sum_n A_n (p_n - p_i)}{\sum_n A_n} \qquad (5)

where A_n is the surface area of the face of the Wigner−Seitz cell and p_i and p_n are the properties of the central and neighboring atoms, respectively.
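To make eq 5 concrete, a minimal NumPy sketch with made-up Voronoi face areas and elemental properties could look as follows; the numbers are purely illustrative and not taken from Ward et al.

```python
import numpy as np

def local_property_difference(p_central, p_neigh, face_areas):
    """Face-area-weighted difference between the property of the central atom
    and the properties of its Voronoi neighbors (cf. eq 5)."""
    face_areas = np.asarray(face_areas)
    p_neigh = np.asarray(p_neigh)
    return np.sum(face_areas * (p_neigh - p_central)) / np.sum(face_areas)

# illustrative numbers: electronegativities of the neighbors and Voronoi face areas
print(local_property_difference(1.9, [3.4, 3.4, 2.2, 2.2], [5.1, 5.1, 2.3, 2.3]))
```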
A similar approach was also used in the construction of the PLMFs proposed by Isayev et al.210 There, a crystal graph is constructed based on the nearest-neighbor assignment from the Voronoi tessellation, where the nodes represent atoms that are labeled with a variety of different (elemental) properties. Then, the graph is partitioned into subgraphs, and the descriptors are calculated using differences in properties between the graph nodes (neighboring atoms) (cf. Figure 11).

The Voronoi decomposition is also used to assign the environment in the calculation of the orbital field matrix descriptor, which is the weighted sum of the one-hot encoded vector of the electron configuration.211 One-hot encoding is a technique that is frequently used in language processing and that represents a feature vector of n possibilities with zeros (feature not present) and ones (feature present). In the original work, the sum and average of the local descriptors were used as descriptors for the entire structure, and the authors also suggested gaining insight into the importance of specific electronic configurations using a decision tree analysis.

Voronoi tessellation is the dual problem of Delaunay triangulation, which attempts to assign points into tetrahedrons (in three dimensions; into triangles in two dimensions, etc.) whose circumspheres contain no other point in their interior. The Delaunay tessellation found use in the analysis of zeolites, where the geometrical properties of the tetrahedrons, like the tetrahedrality or the volume, have been used to build models that can classify zeolite framework types.212,213

Overall, we will see that a common approach to generate global, fixed-length descriptors is to calculate statistics (like the mean, standard deviation, maximum, or minimum) of base descriptors, which can be based on elemental properties for each site.

4.2.2. Global Descriptors. 4.2.2.1. Global Correlation Function. As already indicated, some properties are less amenable to decomposition into contributions of local environments and might be better described using full, global correlation functions. These approaches can be seen, completely analogous to the local descriptors, as approximations to the many-body expansion, for example for the energy

E = \sum_{i=1}^{N} E^{(1)}(\mathbf{r}_i) + \sum_{i<j}^{N} E^{(2)}(\mathbf{r}_i, \mathbf{r}_j) + \sum_{i<j<k}^{N} E^{(3)}(\mathbf{r}_i, \mathbf{r}_j, \mathbf{r}_k) + \ldots \qquad (6)

As we discussed in the context of the symmetry functions for local environments, we can choose where we truncate this expansion (two-body pairwise distance terms, three-body angular terms, ...) to trade off computational and data efficiency (more terms will need more training data) against uniqueness. Similar to the symmetry functions for local chemical environments, different projections of the information have been developed. For example, the BoB representation214 bags different off-diagonal elements of the Coulomb matrix into bags depending on the combination of nuclear charges and has then been extended to higher-order interactions in the bond-angles machine learning (BAML)

8080 https://dx.doi.org/10.1021/acs.chemrev.0c00004 Chem. Rev. 2020, 120, 8066−8129 Chemical Reviews pubs.acs.org/CR Review representation.186 A main motivation behind this approach, 4.2.2.3. Distance-Matrix Based Descriptors. Another large which has been generalized in the many-body tensor family of descriptors is built around different encodings of the representation (MBTR) framework,215 is to have a more distance matrix. Intuitively, one might think that a natural notion of chemical similarity than the Coulomb representation such as the z-matrix, which is popular in repulsion terms. One problem with building bags is that they quantum chemistry and is written in terms of internal are not of fixed length and hence need to be padded with zeros coordinates, might be suitable as input for a ML model. And to make them applicable for most ML algorithms. indeed, the z-matrix is translational and rotational invariant An alternative method to record pairwise distances, that is duetotheuseofinternalcoordinatesbutitisnot familiar to chemists from XRD, is the RDF, g(2)(r). Here, permutational invariant, i.e., the ordering matters. This was pairwise distances are recorded in a binned fashion in also a problem with the original formulation of the Coulomb histograms. This representation inspired Schuett et al. to matrix which encodes structures using the Coulomb repulsion build a ML model for the density of states (DOS).216 They use of atomic charges (proton count Z) on the off-diagonal and a matrix of partial RDFs, i.e., a separate RDF for each element rescaled atomic charges on the diagonal:166 pairsimilar to how the element pairs were recorded in different bags in the BoB representation and quite similar to 0.5Zij2.4 = ’ fi 217 fi i Valle s crystal ngerprint in which modi ed RDFs for each xij = ZZϕ()rr−≠ i j element pair are concatenated. ij i j (8) Von Lilienfeld et al. took inspiration in the plane-wave basis l o sets of electronic structure calculations, which remove many as one structureo could have many different Coulomb matrices, om problems that local (e.g., Gaussian) basis sets can cause, e.g., dependingo on where one starts counting. The Coulomb matrix Pulay forces and basis set superposition errors, and created a shares thisn problem with the older Weyl matrix,225 which is an descriptor that is a Fourier series of atomic RDFs. Most N × N matrix composed of inner products of atomic positions importantly, the Fourier transform removes the translational and in this way also an overcomplete set. To remedy this variance of local basis setswhich is one of the main 171 problem it was suggested to use sorted Coulomb matrices or requirements for a good descriptor. The Fourier transform the eigenvalue spectrum (but this violates the uniqueness of the RDF also is directly related to the XRD pattern which criterion as there can be multiple Coulomb matrices with the has found widespread use in ML models for the classification 131,218,219 same eigenspectrum). Also, to be applicable to periodic of crystal symmetries. systems, eq 8 needs to be modified. For the prediction of gas adsorption properties property 220 To deal with electrostatic interactions in molecular labeled RDFs have been introduced by Fernandez et al. The simulations, one usually uses the Ewald-summation technique property labeled RDF is given by which splits one nonconverging infinite sum into two RDFP =[−−fPPBrR∑ exp ( )2] converging ones. 
This trick has also been used to deal with ij ij the infinite summations which would occur if one attempted to ij, (7) use eq 8 for periodic systemsthe corresponding descriptor is 166 where Pi and Pj are elemental properties of atom i and j in a known as the Ewald sum matrix. The sine-Coulomb matrix spherical volume of radius R. B is a smoothing factor, and f is a is a more ad hoc solution to apply the Coulomb matrix to scaling factor. It was designed based on the insight that for periodic systems. Here, the off-diagonal terms are calculated fi ϕ some type of adsorption processes, like CO2 adsorption, not using a modi ed potential that introduces periodicity using a only the geometry but also the chemistry is important. Hence, sine over the product of the lattice vectors and the vector they expected that stronger emphasis on e.g. the electro- between the two sites i and j.166 negativity might help the ML model. 4.2.2.4. Point Cloud Based. In object recognition much 4.2.2.2. Structure Graphs. Encoding structures in the form success has been achieved by representing objects as point of graphs, instead of using explicit distance information, has the clouds.226,227 This can also be applied to materials science, advantage that the descriptors can also be used without any where solids can be represented as point clouds by sampling precise geometric information, i.e., a geometry optimization is the structures with n points. This point cloud can be then usually not needed. In structure graphs, the atoms define the further processed to generate an input for a (supervised) ML nodes and the bonds define the edges of the graph. The power algorithm. Such processing is often needed because most of such descriptors was demonstrated by Kulik and co-workers algorithms cannot deal with irregular data structures, like point in their work on transition metal complexes. They introduced clouds, wherefore the data is often mapped to a grid. the revised autocorrelation (RAC) functions221 (which is a 4.2.2.4.1. Topological Data Analysis. A fruitful approach to local descriptor that correlates some atomic heuristics, like the generate features from point clouds is to use the persistence atom type, on the structure graph) and used it to predict for homology analysis rooted in topological data analysis example metal-oxo formation energies,222 or the success of (TDA).228,229 Here, the underlying topological structures are electronic structure calculations.19 Recently, they also have extracted using a process called filtration. In a filtration one been adapted for MOFs.576 uses a sequence of growing spaces, e.g., using balls of growing For crystals, Xie and Grossmann built a graph-convolutional radii, to understand how the topological features change as a NN (GCNN) that directly learns from the crystal structure function of the radius. A persistence diagram records when a graph (cf. section 5.1.1.6) and could predict a variety of topological feature is created or destroyed. This is shown in properties such as formation energy or mechanical properties Figure 12 where at some radius the first circles start to overlap, as the bulk moduli for structures from the Materials which is reflected in the end of a bar in the persistence Project.223,224 This architecture also allowed them to identify diagram. Then, the circles form two holes (c), which is chemical environments that are relevant for a particular reflected with the birth of new bars that die with increasing prediction. radius, when the holes disappear (d).


output was summarized in persistence diagrams (“barcodes”). The similarity between these persistence diagrams was then used to rank the materials, i.e., if the persistence diagram of one structure is similar to a high-performing structure, it is likely to also perform well. As Moosavi, Xu, et al. recently showed, the similarity between barcodes can also be used to build kernels for KRR which then can be used to predict the performance for methane storage applications.185 4.2.2.4.2. Neural-Network Engineered Features. A promis- ing alternative to TDA is to use specific NN architectures such 227 Figure 12. Illustration of the filtration of the distance function of a as PointNet that can directly learn from point cloud inputs. cloud of points. For the birth of each point, we create an interval (bar) DeFever et al. used the PointNet for a task similar to object in the persistence diagram. As we increase the radius of the points, recognition: the classification of local structures in trajectories some components die (and merge) as the circles start to overlap. The of molecular simulations.238 Interestingly, the authors also persistence diagram takes track of this by putting an end to the demonstrated that one can use PointNet to create hydro- interval (b). As the radius of the circle further increases, we form new, philicity maps, e.g., for self-assembled monolayers and proteins. one-dimensional, connected components (the holes, blue in c) and all 4.2.2.5. Coarse Tabular Descriptors. the intervals associated with single points come to an end. The only Our discussion so far interval that never dies is due to the union of all points. The figure is a guided us from atomic-level descriptors to more coarse, global modified version of the illustration from Chazal and Michel.229 descriptors. In this section, we will explore some more examples of such coarse descriptors. Those coarse descriptors are frequently used in top-down modeling approaches, where a Using this technique has recently become even easier with model is trained on experimental or high-level properties. the scikit-tda suite of packages,230 which gives an easy-to-use Obviously, such coarse, high-level descriptors are not suited to Python interface to the C++ Ripser library231 and functions to 232 describe properties with atomic resolution, e.g., to describe a plot persistent images and diagrams. PES, but they can be efficient to model, for example, gas Unfortunately, most ML algorithms only accept fixed length inputs, wherefore the persistent homology barcodes cannot adsorption phenomena. 4.2.2.5.1. Based on Elemental Properties. directly be used as descriptors. To work around this limitation, Widely used in Lee and co-workers233 used a strategy that is similar to the this context are compositional descriptors that encode general strategy for creating fix-length global descriptors that information about the chemical elements a compound is made up of. Typically, one finds that simple statistics such as we discussed above, namely by computing statistics of the ff persistent homology barcodes (cf. section 9). sums, di erences, minimums, maximums, or covariance of Alternative finite-dimensional representations are persistence elemental properties such as electronegativity or covalent radii images,232 which have recently been employed by Krishnap- are calculated and used as feature vectors. There has been some success with using such descriptors for perovskites,239,240 riyan et al. 
to predict the methane uptakes in zeolites between 241 1 to 200 bar (cf. Figure 13).234 half-Heussler compounds, analysis of topological transi- − tions,242 the likelihood of substitutions,243,244 as well as the In persistence images, the birth death pairs (b, d), which 245 are shown in persistence diagrams, are transformed into birth− conductivity of MOFs. Generally, one can expect such persistence pairs (b, d − b) which are spread using a Gaussian. descriptors to work if the target property is directly related to The images are then created by binning the function of (b, d − the constituent elements. A prime example of this concept are b). Krishnapriyan et al. then used RFs to learn from this perovskites for which there are empirical rules, like the descriptor, but it might also be promising to investigate the use Goldschmidt tolerance factor, that relate the radii of the ions to of transformations of the homology information that can be the stability, wherefore it is reasonable to expect that one can learned during training (e.g., using NNs, see section build meaningful ML models for perovskite stability, that 5.1.1.2).235 outperform the empirical rules, with ion radii as features. The capabilities of TDA have been demonstrated in the 4.2.2.5.2. Cheap Calculations Crude Estimates of Target high-throughput screening of the nanoporous materials and Experimental Features. Especially for our case-study genome.236,237 Here, the zeo++ code has been used to analyze problem, gas adsorption in porous materials, tabular the pore structure of zeolites (using Voronoi tessellations), descriptors that are based on cheap calculations (e.g., geometry which then could be sampled to create point clouds that were analysis, energy grids) are most commonly used. As gas used as an input for a persistent homology analysis, which adsorption requires that the pore properties are “just right” it is

Figure 13. Illustration of the scheme used to predict gas adsorption properties using persistence images. A filtration is used to create a persistence diagram (as illustrated in Figure 12). This is then transformed into a persistence image that is used to train a random forest (RF) model to predict the methane uptake. Figure redrawn based on ref 234.
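A minimal sketch of the first step of such a persistent homology workflow (point cloud to persistence diagram, cf. Figures 12 and 13) can be written with the ripser package from the scikit-tda suite discussed above. The point cloud here is synthetic; in the screening studies the points would be sampled from the pore surface of a real structure.

```python
import numpy as np
from ripser import ripser

rng = np.random.default_rng(0)
# toy point cloud: noisy samples around a circle, a stand-in for points sampled on a pore surface
theta = rng.uniform(0, 2 * np.pi, 100)
points = np.column_stack([np.cos(theta), np.sin(theta)]) + rng.normal(0, 0.05, (100, 2))

diagrams = ripser(points)["dgms"]   # one persistence diagram per homology dimension
births_deaths_h1 = diagrams[1]      # one-dimensional features ("holes"), cf. Figure 12
print(births_deaths_h1)
```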


Figure 14. Overview of different feature selection strategies. Figure redrawn based on an illustration by Janet et al.221
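Two of the feature selection strategies summarized in Figure 14, a simple filter and a wrapper, can be sketched with scikit-learn as follows; the data is synthetic and the choice of estimator and the number of retained features are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression, RFE
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=30, n_informative=5, random_state=0)

# filter: keep the 5 features with the highest mutual information with the target
X_filter = SelectKBest(mutual_info_regression, k=5).fit_transform(X, y)

# wrapper: recursive feature elimination (RFE) with a random forest
rfe = RFE(RandomForestRegressor(n_estimators=100, random_state=0), n_features_to_select=5)
X_wrapper = rfe.fit_transform(X, y)
print(X_filter.shape, X_wrapper.shape)
```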

− natural to calculate them and use them as features,246 249 The term feature engineering describes this process where especially, since we know that target similarity governs the new features are formed via the combination and/or error of ML models.186 Typically, such descriptors as the pore mathematical transformation of raw features. And this is one size distribution (PSD),250 and accessible surface areas or pore of the main avenues for domain knowledge to enter into the volumes, can be computed with programs such as Zeo++,251 modeling process. One approach to automate this process is to Poreblazer,252 or MOFomics/ZEOMICS.182,183 automatically try different mathematical operations and A cheaper calculation was also used by Bucior et al. to transformation functions as well as combinations of features. construct descriptors. On a coarse grid they computed the Unfortunately, this leads to an exponential growth of the interactions between the adsorbateandtheframework, number of features and the modeler now faces the problem to summarized this data in histograms, and then used these select the best features to avoid the curse of dimensionality (cf. histograms to construct ML models for the adsorption of section 4.1.0.2), which is not a trivial problem. In fact, the 253 featurization process is equivalent to finding the optimal basis H2. This is related to the approach Zhang and Ling put forward to use ML on small data sets.254 They suggest set for the description of a physical problem. including crude estimates of the target property into the 4.3.2. Feature Selection. For some phenomena one feature set. As an example, they included force-field derived would like to develop ML models but it might not be a priori bulk moduli to predict bulk moduli on DFT level of theory. clear which descriptors one should use to describe the This idea is directly related to Δ-ML and cokriging approaches phenomenon, e.g., because it is a complex multiscale problem. Intuitively, one might try all possible combinations of which we will discuss below in more detail. descriptors that one can come up with to find the smallest, Especially when one uses a large collection of tabular most informative set of features to avoid the curse of features it can be useful to curate feature dictionaries, which dimensionality. But this approach is deemed to fail as it is a describe what the feature means and why it is usefulto aid nondeterministic polynomial-time (NP) hard problem. This collaboration and model development. means that a candidate solution for this problem can be 4.2.2.5.3. Using Building Blocks as Features. For materials verified in polynomial time, but that the solution itself can such as MOF, COF, or also polymers that are constructed by probably not be found in polynomial time. Hence, self-assembly of simpler building blocks, one can attempt to approximations or heuristics are needed to allow us to make directly use the building blocks as features. Here, one typically the problem computationally tractable. One generally dis- one-hot encodes the presence of building blocks with ones and tinguishes three approaches to tackle this problem: First, the absence with zeros. Therefore, there will be as many simple filters can be used to filter out features (e.g., based on columns in the feature matrix as there are building blocks. Due correlation with the target). 
Second, iterations in wrapper to the nature of this encoding, such a model cannot generalize methods (pruning, recursive feature elimination) can be used to new building blocks. This featurization was for example used to find a good subset, or one can attempt to directly include by Borboudakis et al., who one-hot encoded linker and metal the objective of minimizing the dimensionality in the loss − node types to learn gas adsorption properties of MOFs from a function (Figure 14).221,256 258 small database.172 Recently, Fanourgakis et al. reported a more 4.3.2.1. Filter Heuristics. Given a large set of possible general approach in which they use statistics over atom types features one can use some heuristics to compact the feature set. (e.g., minimum, maximum, and average of triple bonded A simple filter is to use the correlation, mutual information,259 carbon per unit cell), that would usually be used to set up force or fitting errors for single features as surrogates and only use field topologies, as descriptors for RF models to predict the the features that show the highest correlation or mutual methane adsorption in MOFs.255 information with the target or the ones for which a simple 4.3. Feature Learning model shows the lowest error. Obviously, this approach is unable to capture interaction effects between variables. 4.3.1. Feature Engineering. A key insight is that the Another heuristic that can be used to eliminate features is to “raw” features are often not the best inputs for a ML model. eliminate those that do not show a lot of variance Therefore, it can be useful to transform the features. This is (VarianceThreshold in sklearn). The intuition here is that also what every chemist or modeler already intuitively knows: (nearly) constant features cannot help the model to distinguish Some phenomena such as the dependence of the activation between labels. energy on the diffusion constant are better visible after a This is to some extent similar to PCA based feature logarithmic transformation. Sometimes it is also more engineering, where one tries to find the linear combinations of meaningful to look at ratios, such as the Goldschmidt tolerance features that describe most of the variance and then only keeps ratio, rather than at the raw values. those principal components. This approach has the drawback


Chart 2. Pseudocode for Iterative Hard Thresholding (Also Known as Projected Gradient Descent)
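Since the pseudocode of Chart 2 is not reproduced here, the following is a minimal NumPy sketch of iterative hard thresholding (projected gradient descent): after each gradient step, all but the k largest-magnitude weights are set to zero. The step size, iteration count, and toy data are illustrative choices, not the settings of any cited work.

```python
import numpy as np

def iterative_hard_thresholding(X, y, k, step=None, n_iter=200):
    n_samples, n_features = X.shape
    if step is None:
        step = 1.0 / np.linalg.norm(X, ord=2) ** 2   # conservative step size for the quadratic loss
    w = np.zeros(n_features)
    for _ in range(n_iter):
        w = w + step * X.T @ (y - X @ w)             # gradient step on the squared error
        keep = np.argsort(np.abs(w))[-k:]            # indices of the k largest-magnitude weights
        sparse_w = np.zeros_like(w)
        sparse_w[keep] = w[keep]                     # projection: all other weights set to zero
        w = sparse_w
    return w

# toy example: only 3 of 50 features carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
true_w = np.zeros(50)
true_w[[3, 17, 42]] = [1.5, -2.0, 0.7]
y = X @ true_w + 0.01 * rng.normal(size=100)
print(np.nonzero(iterative_hard_thresholding(X, y, k=3))[0])
```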

that arbitrary linear combinations are not necessarily physically absolute shrinkage and selection operator (LASSO) and widely meaningful and that explaining the variance does not used to avoid overfitting (regularization), by penalizing high necessarily mean being predictive. weights (cf. section 6.2.1).262 Compressed sensing263 uses this 4.3.2.2. Wrapper Approaches. Often, one also finds idea to recover a signal with only a few sensors while giving stagewise feature selection approaches,260 either by weight conditions on the design matrix (with materials in the rows pruning, i.e., by fitting the model on all features and then and the descriptors in the columns) for which the l0 and the removing those with low weights, or by recursive feature LASSO solution will likely coincide. An in-depth discussion of elimination (RFE). RFE starts by fitting a model on all features the formalism of feature learning using compressed sensing is and then iteratively removes the least important features until a given by Ghiringhelli et al.261 This approach works well in desired number of features is reached. This iterative procedure materials science as many physical problems are sparse, and it is needed because the feature importance can change after each also works well with noise, which is also common to physical 263 elimination, but it is computationally expensive for moderately problems. Ghiringhelli et al. applied this idea to materials sized feature sets. The opposite approach, i.e., the iterative science but also highlighted that a procedure based only on the addition of features is known as recursive feature addition LASSO has difficulties in selecting between correlated features 165 (RFA) and is often used in conjunction with RF feature and dealing with large feature spaces. With sure importance, which is used to decide which features should be independence screening and sparsifying operator (SISSO) included. This approach was for example used in a work by Ouyang et al. add a sure independence (si) layer before the 264 Kulik and co-workers in which they built models to predict LASSO. This si layer preselects a subspace of features that metal-oxo formation energies, which are relevant for catalysis. show the highest correlation with the target and that can then In doing so, they found that they can reduce the size feature set be further compressed using the LASSO. This approach, for 265 ffl from ca. 150 to 22 features using RF-RFA which led to which open-source code was published, allowed Sche er 9 reduction of the mean absolute error (MAE) on the test set and co-workers to construct massive sets of 10 descriptors from 9.5 to 5.5 kcal/mol.222 using combinations of algebraic functions applied to primary 4.3.2.3. Direct Approximations: LASSO/Compressed Sens- features, such as the atomic radii, and to discover new 240 ing. As an alternative to iterative approaches, there are efforts tolerance factors for the stability of perovskites or to predict new quantum spin-Hall insulators using interpretable descrip- to use objective functions that directly describe both modeling 242 goals: first, to find a model that minimizes the error and, tors. second, to find a model that minimizes the number of variables Another approach to the feature selection problem uses (following Occam’s razor, cf. section 4.1.0.1). In theory, this projected gradient descent to locally approximate the 266 ffi can be achieved by adding a regularization term minimization of the l0 norm. 
It is e cient as it uses the gradient and it achieves sparsity by, stepwise, setting the p = ∑n p 1/p w ()i=1 wi to the loss function and attempting to smallest components of the weights vector w to be zero (cf. find the coefficients w that minimize this loss function. In the Chart 2 for pseudocode).267,268 limit p = 0, there is nothing won as it is the NP hard problem A modified version was also used by Pankajakshan et of minimizing the number of variables, we mentioned al.269,270 They combined this feature selection method with 261 above. Hence, the l1 norm (also known as Taxicab or clustering (to combine correlated features) and created a Manhattan norm), i.e., the case p = 1, is often used as an representative feature for each cluster, which they then used in 262 approximation (to relax the l0 condition). This has the the projected gradient algorithm to compress the feature set. advantage that the optimization is now convex and that the Additionally, they also employed the bootstrap technique to edges of the regularization region tend to favor sparsity (cf. make their selection more stable. Figure 30 and accompanying discussion for more details). The The bootstrapping step is also the key to another method minimization of the l1 is known in statistics as the least known as stability selection. Here, the selection algorithm (e.g.,

8084 https://dx.doi.org/10.1021/acs.chemrev.0c00004 Chem. Rev. 2020, 120, 8066−8129 Chemical Reviews pubs.acs.org/CR Review the LASSO) is run on different bootstrapped samples of the Min-max scaling transforms features to a range between zero data set and only those features that are important in every and one (by subtracting the minimum and dividing by the bootstrap are selected, which can help to counter chance range), and in this way minimizes the effect of outliers. In correlation.271 This is currently being implemented as contrast to that, the standard scaling transforms feature randomized LASSO in the sklearn Python framework. distributions to distributions centered around zero and unity 4.3.3. Data Transformations. An additional problem with variance by subtracting the mean and dividing by the standard features is that their distribution or the scale on which they are deviation. Note that by using this transformation we do not on (e.g., due to the choice of units) might not be appropriate bind the range of features, which can be important for some for ML. One of the most important reasons to transform data analyses such as PCA, which work on the variance of the data. is to improve interpretability. Some features are more natural In case there are many outliers or strong skew, it might be to think about on a logarithmic scale (e.g., the concentration of more reasonable to scale data based on robust estimators of − + protons is known as pH = lg10 H3O in chemistry and also centrality and spread, like subtracting the median and dividing ffi the Henry coe cient is naturally represented on logarithmic by the interquartile range (this is implemented as RobustScaler scale), or reciprocal scale (e.g., temperature in the case of in sklearn). Arrhenius activation energy analysis). In other cases, the fi It is important that those transformations need to be applied underlying algorithm will pro t from transformations, e.g., if it to training and test databut using the distribution assumes a particular distribution for the data (e.g., the parameters “learned” from the training set. If we computed archetypal linear regression assumes a normal distribution of those parameters also on the test set we would risk data the residuals). The most widely used transformations are λ leakage, i.e., provide information about the test data to the power transformations like the Box−Cox (defined as (x − 1)/ λ λ λ λ model. for >0,lnx for = 0, where can be used to tune the 4.3.3.2. Decorrelation. Often, one finds oneself in a position skew),272 the inverse hyperbolic sine,273,274 or the Yeo− where the initial feature set contains multiple variables that are Johnson transformation which all aim to make the data more highly correlated with each other, like gravimetric and normally distributed. The Box−Cox transformation, or a volumetric pore volumes or surface areas. Usually, it is better simple logarithmic transformation (lg x), is the most popular to remove those correlations. The reasoning behind this is that technique, but the inverse hyperbolic sine and the Yeo− Johnson transformation have the advantage that they can also multicolinearity usually means that there is data redundancy, be used on negative values. which violates the minimum description length principle we 4.3.3.1. Normalization and Standardization. In the discussed above (cf. section 4.1.0.1). 
In particular severe cases, following, we will show that many algorithms perform it can make the predictions unstable (and also the feature interference by calculating distances between examples. But selection as we discussed above) and in general it undermines ff causal interference as it is not clear which of the correlated in the physical world, our features might have di erent scales, 275,276 e.g., due to the arbitrary choice of units. Surface areas might be variables is the reason for a particular prediction. recorded as numbers in the order of 103 and void fractions as Widespread ways to estimate the severity of multicolinearity − fl numbers on the order of 10 3. For ML one wants to remove are to use pair-correlation matrices or the variance in ation such influences from the model, as illustrated in Figure 15. factor (VIF), which estimates how much of the variance is inflated by colinearity with other features.277,278 It does this by predicting all the features using the remaining features VIF = − 2 ffi 1/(1 Ri ), where Ri is the coe cient of determination for the prediction of feature i. A VIF of ten would mean that the variance is ten times larger than it would be for fully orthogonal features.
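The VIF formula given above can be sketched directly with scikit-learn on a synthetic feature table that contains a deliberately colinear column; the feature names are illustrative pore descriptors and not taken from any data set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif(df):
    """Variance inflation factor 1/(1 - R_i^2) for every column of a feature table."""
    out = {}
    for col in df.columns:
        X = df.drop(columns=col).values
        y = df[col].values
        r2 = LinearRegression().fit(X, y).score(X, y)
        out[col] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
df = pd.DataFrame({"density": rng.normal(1.0, 0.2, 500),
                   "grav_surface_area": rng.normal(2000.0, 300.0, 500)})
# a highly colinear third feature: volumetric ~ gravimetric surface area x density
df["vol_surface_area"] = df["grav_surface_area"] * df["density"] + rng.normal(0.0, 20.0, 500)
print({k: round(v, 1) for k, v in vif(df).items()})
```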

5. HOW TO LEARN: CHOOSING A LEARNING ALGORITHM Figure 15. Influence of scaling, here standard scaling (z-scaling), of After data selection (cf. section 3 and featurization (cf. section features on the dimensionality reduction using PCA. For this example 4) one can proceed to training a ML model. But also here, we used the data from Boyd et al. and performed the PCA on the there are a lot of choices one can make. In Figure 16 we give a feature matrix of pore properties descriptors and plot the data in fi nonexhaustive overview of the learning algorithm landscape. terms of the rst two principal components (PC1 and PC2). We then In the following, we discuss some rules of thumb that can color code the structures with above-median CO2 uptake (red) ff help to choose the appropriate algorithm for a given problem di erent from those with below-median CO2 uptake (blue) and plot the points in random order. It is observable that the separation after and discuss the principles of the most popular ones. Typically, scaling is clearer. we will not distinguish between classification and regression as many algorithms can be formulated both for regression and fi Also, optimization algorithms will have problems if different classi cation problems. 5.0.0.1. Principles of Learning. directions in feature space have different scales. This is One of the main principles intuitive if we look at the gradient descent update step, where of statistical learning theory is the bias-variance decomposition (cf. eq 9), which describes that the total error can be described the values of the features, xi, are directly involved and for which reason some weights might update faster than others (using a as the sum of squared bias, variance, and an irreducible error fixed learning rate η). (Bayes error) The most popular choices to remedy these problems are 2 min-max scaling and standard scaling (z-score normalization). error=+ bias variance + Bayes error (9)


Figure 16. Overview of the supervised ML algorithm landscape. We do not distinguish between classification and regression as many of the algorithms can be formulated both for regression and classification problems. and can easily be derived by rewriting of the cost function for less interpretable, like a high-order polynomial, tends to have a the mean square error.7 The variance of a model describes the high variance whereas a simple model, such as a regularized error due to finite training size effects, i.e., how much the linear regression, tends to have a high bias (cf. Figure 18). In estimation fluctuates due to the fact that we need to use a finite number of data points for training and testing (cf. Figure 17).

Figure 17. Train and test error as a function of the number of training points and the definition of bias and variance, with bias being the error that remains on the training set for an infinite number of training points and variance the error due to the finite size of the training set.
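A learning curve of the kind shown in Figure 17 can be estimated with scikit-learn as follows; the data, estimator, and training-set sizes are placeholders, and in practice one would plot the resulting train and validation errors against the number of training points.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    RandomForestRegressor(n_estimators=50, random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    scoring="neg_mean_absolute_error")

print(sizes)
print(-train_scores.mean(axis=1).round(1))   # training error vs. number of training points
print(-test_scores.mean(axis=1).round(1))    # cross-validated error vs. number of training points
```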

ff The bias is the di erence between the prediction and the Figure 18. Bias, variance, training, and test error as well as Bayes error expectation value; it is the error we would obtain for an infinite (irreducible error) as a function of the model flexibility. number of training points (cf. Figure 17). In this case, the bias represents the limit of expressivity for our model, e.g., that the order of the polynomial is not high enough to describe the practice, it is often useful to first create a model that overfits, problem that should be modeled. But this error could in and hence has close to zero training error, and in this way principle be removed by choosing a better model. All the ensure that the expressivity is high enough to model the remaining error, which cannot be removed by building a better phenomenon. Then, one can use techniques which we will model, is for example due to noise in the training data. For this describe in section 6 to reduce overfitting.279 reason, this term is called irreducible error (also known as The classical bias variance-trade-off curve (cf. Figure 18) Bayes error). suggests that there is a “sweetspot” (dotted line) in which the This trade-off between bias and variance is directly linked to test error is minimal. One current research question in deep model flexibility. A highly flexible model, which is also often learning (DL) is why one still can achieve good testing error


with highly overparameterized models, i.e., models for which the number of parameters is larger than the number of training points.280,281 Belkin et al. suggest that "modern", overparametrized, models do not work in the regime described by the bias-variance trade-off curve in Figure 18. Rather, they suggest a double descent curve where, following a jamming transition, when we reach approximately zero train error (the interpolation threshold), the error decreases with the number of parameters.282 Belkin et al. hypothesize that this is due to the larger function space that is accessible to more complex models, which might allow them to find interpolating functions that are simpler (and hence better approximations according to Occam's razor, cf. section 4.1.0.1).

In the following, we give an overview of the most popular learning techniques. We see NNs mostly suited for large, unstructured data sets, data sources such as images or spectra, or feature sets which are not yet highly preprocessed (e.g., directly using the coordinates and atom identities), as NNs can also be used to create features (representation learning), which in the chemical sciences is often used in a "message passing" approach (cf. section 5.1.1.2).

5.1. Lots of (Unstructured) Data (Tall Data)

In (computational) materials science a large array of data is created every day and some of it is even deposited in a curated form on repositories. Still, most of it does not contain highly engineered features. To learn from such large amounts of data, NNs are one of the most promising approaches. The field of deep learning (DL), which describes the use of deep NNs, is too wide to be comprehensively reviewed, wherefore we just give an overview of the basic building principles of the most popular building blocks.

5.1.1. Neural Networks. Classical, feed-forward NNs approximate a function f using a chain of matrix evaluations

f(X; W) = g^(L)(W^(L) ··· g^(2)(W^(2) g^(1)(W^(1) X)))   (10)

where X is the input vector, g are activation functions—nonlinear functions such as sigmoid functions or the rectified linear unit (ReLU)—and the W are the weight matrices the neural network learns using the data. L is here the number of layers, and the most popular and promising case is when there are multiple nonlinear layers. This is known as deep learning (DL). The multiplication with the weight matrix is a linear transformation of the data, the bias corresponds to a translation, and the activation function enables us to introduce nonlinearities.

One of the most frequently cited theorems in the deep learning (DL) community is the universal approximator theorem, which states that, under given constraints, a single hidden layer of finite width is able to approximate any continuous function (on a subset of ℝⁿ). What is perhaps more surprising is that those models work, that we can train them on random labels without any convergence problems,284 and that they still generalize—these questions are active areas of research in computer science.

One of the strengths of neural networks is that they scale really well since training them does not involve an expensive matrix inversion (which scales with O(n³)) and since they can be trained efficiently in batch mode with stochastic gradient descent, where only a small part of the complete data needs to be loaded into memory.283 The large expressivity of deep networks combined with the benign scaling makes them the preferred choice for massive (unstructured) data sets, whereas classical statistical learning methods might be the preferred choice for small data sets of structured data.285
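Equation 10 is compact enough to be written out directly; the following sketch (plain NumPy, with made-up layer sizes and random, i.e., untrained, weights) evaluates a two-hidden-layer feed-forward network of this form:

```python
import numpy as np

def relu(z):
    """Rectified linear unit, one of the activation functions g in eq 10."""
    return np.maximum(0.0, z)

rng = np.random.default_rng(42)

# Random (untrained) weight matrices and biases for a 10 -> 16 -> 8 -> 1 network.
W1, b1 = rng.normal(size=(16, 10)), np.zeros(16)
W2, b2 = rng.normal(size=(8, 16)), np.zeros(8)
W3, b3 = rng.normal(size=(1, 8)), np.zeros(1)

def forward(x):
    """Chain of affine transformations and nonlinearities, cf. eq 10."""
    h1 = relu(W1 @ x + b1)      # first hidden layer
    h2 = relu(W2 @ h1 + b2)     # second hidden layer
    return W3 @ h2 + b3         # linear output layer (e.g., a regression target)

x = rng.normal(size=10)         # one feature vector, e.g., ten pore descriptors
print(forward(x))
```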
5.1.1.1. High-Dimensional Neural Network Potential. One of the cases where neural networks shine in the field of chemistry is high-dimensional neural networks that can be used to "machine learn" potential energy surfaces—as has recently been done for MOF-5 (cf. section 9)286—and which can be used to access time or length scales that are not accessible with ab initio techniques at accuracies that are not accessible with force fields. One prime example is the ANI-1X potential, which is a general-purpose potential that approaches coupled-cluster theory accuracy on benchmark sets.118,287 And due to the nature of molecular simulation, in which there is a lot of correlation between the properties at different time steps, and hence data redundancy, they are an ideal application for ML.288

Figure 19. Schematic representation of the architecture of a HDNNP (Behler−Parinello scheme) at the example of methanol. The local environment around each atom is described with symmetry functions (pink, Gaussians). Each symmetry function can probe different length scales and will return one value. The values can then be concatenated into one fingerprint vector. This fingerprint vector can then be fed into a NN corresponding to one particular element, i.e., we will feed the four fingerprints for the four hydrogens into the same neural network but will receive different outputs due to the different fingerprints. The predictions can then be added up to calculate the energy of the entire system.

NN models for potential energy surfaces have already been proposed more than two decades ago. But due to the architecture of those models, it was difficult to scale them to


Figure 20. Schematic illustration of the idea behind the message-passing architecture. Following the initial embedding of the molecule, each environment χ represents one atom. Successive interactions in the message passing architecture refine the local chemical environments χ by taking into consideration the interaction between the neighboring environments.

larger systems, and the models did not incorporate fundamental invariances of the potential.289 This has been overcome with the so-called HDNNP (also known as the Behler−Parinello scheme, cf. Figure 19). Each atom of the structure will be represented by a fingerprint vector (using symmetry functions) that describes its chemical environment within a cutoff radius (cf. chemical locality approximation in section 4.1.0.2). For each element, a separate NN is trained (cf. Figure 19) and each atomic fingerprint vector is fed into its corresponding NN that predicts an energy. The total energy is then the sum of all atomic contributions (cf. eq 1). This additive approach is scalable by construction (nearly linear with system size), and the invariances with respect to rotation and translation are introduced on the level of the symmetry functions. Also, the weight sharing (one NN for many environments of a particular element) makes this approach efficient and allows for generalization (similar to the sharing of filters in CNN which we will discuss in section 5.1.1.3). One additional advantage of such models is that they are not only efficient and accurate, but they are also reactive (again due to the locality assumption combined with the fact that no functional form is assumed)—which most classical force fields are not. For more technical details, we recommend reviews from Behler.124,290

5.1.1.2. Message-Passing Neural Networks/Representation Learning. In message-passing neural networks, the input can be nuclear charges and positions, which are also the variables of the Schrödinger equation. A DNN then constructs descriptors that are relevant for the problem at hand (representation learning). The idea behind this approach is to build descriptors χ by recursively adding interactions v with more and more complex neighboring environments at a distance d_ij (cf. Figure 20)

χ_i^(t+1) = χ_i^(t) + Σ_{j≠i} v(χ_j^(t), d_ij)   (11)

This approach is for example used in the deep tensor neural network (DTNN),291 SchNet,292 SchNOrb,293 hierarchically interacting particle (HIP)-NN,294 and PhysNet.295 A detailed discussion of this architecture type is provided by Gilmer et al.283
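A single interaction pass of eq 11 can be sketched in a few lines (NumPy; the embedding size, the distance-dependent weighting, and the random initial environments are illustrative choices, not the update functions used by SchNet or the other architectures mentioned above):

```python
import numpy as np

rng = np.random.default_rng(0)

n_atoms, n_features = 5, 8
chi = rng.normal(size=(n_atoms, n_features))       # initial atomic environments chi_i^(0)
coords = rng.uniform(0, 5, size=(n_atoms, 3))      # Cartesian coordinates
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)  # distances d_ij

W = rng.normal(size=(n_features, n_features))      # learnable interaction weights (random here)

def interaction(chi_j, d_ij):
    """Toy message v(chi_j, d_ij): neighbor environment modulated by distance."""
    return np.tanh(W @ chi_j) * np.exp(-d_ij)

def message_pass(chi):
    """One refinement step, cf. eq 11: chi_i <- chi_i + sum_{j != i} v(chi_j, d_ij)."""
    new_chi = chi.copy()
    for i in range(n_atoms):
        for j in range(n_atoms):
            if i != j:
                new_chi[i] += interaction(chi[j], d[i, j])
    return new_chi

# After a few such passes, each chi_i encodes an increasingly large environment.
chi_t1 = message_pass(chi)
print(chi_t1.shape)
```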
5.1.1.3. Images or Spectra. For learning from images or patterns, CNN are particularly powerful. They are inspired by the concept of receptive fields in biological processes, where each neuron responds only to activation in a specific region of the visual field.

Figure 21. Example of the use of a CNN. One slides convolution layers (red) over an image, which for example can be a two-dimensional diffraction pattern.131 Usually, one then uses a pooling layer to compress the matrices after convolution. After flattening, the output can be used for conventional hidden NN layers.

CNNs work by sliding a filter matrix over the input to extract higher-level features (cf. Figure 21). An example of how such filters work is the set of the Sobel filter matrices, which can be used as edge detectors:

G_x = [ 1  0  −1
        2  0  −2
        1  0  −1 ]   (12)

The middle column, which is centered on the cell (pixel) on which the filter is used, is filled with zeros, and the columns to the left and right of it have opposite signs. In case there is no edge, the values on the left and the right of the pixel will be equal. But in case there is an edge, this is no longer the case and the matrix multiplication will give a result that highlights the edge. By sliding the G_x matrix horizontally over an image one can hence highlight horizontal edges. A collection of different filter layers is used to learn the different correlations between (neighboring) elements. CNNs apply, on each layer, a set of different filters that share weights (similar to the way in which different atoms of the same element share weights in HDNNP). Usually, convolutions are used together with pooling layers that compress the matrix by, again, sliding a filter matrix, which for example takes the maximum or the average in a 2 × 2 block of the matrix, over the matrix (cf. Figure 21). This leads to approximate translational invariance


Figure 22. Using a dilated CNN to predict the methane uptake of COFs assembled by Mercado et al.297 For this example, we use dilated convolutions to extract correlations from the XRPD pattern (a). We can then pass the output to some hidden layers to predict the methane uptake (b). We overfit to the training set, but can also get decent performance on the test set without major tuning of the model. MAPE is an acronym for the mean absolute percentage error. as the maximum pixel after the convolution will still be convolutions with holes and in our model for which we extracted by a maximum pooling layer if the translation was increase the hole size from layer to layer. To avoid overfitting, not too large (since the pooling effectively filters out small we use spatial dropout, which is especially well suited for translations). convolutional layers (cf. section 5.1.1.3) and which randomly CNNs tend to generalize well and are computationally deactivates some neurons. From Figure 22 we see that such a efficient due to the weight sharing between the different filter model is indeed able to predict the deliverable capacity for layers for each convolutional layer. Not surprisingly, ample methane in COFs based on the XRD pattern. works attempted to use CNNs to analyze spectra. Ziletti et al. 5.1.1.5. Sequences. RNNs are frequently used for the used this approach to classify crystal structures based on two- modeling of time-series data as they, in contrast to classical dimensional diffraction patterns.131 Others used them to feed-forward models, have a feedback loop that gives the perform classification based on steel microstructures,130 or a network a “memory” which it can use to recognize information representation based on the periodic table, where the positions that is encoded in the sequence itself (cf. Figure 23). This of the elements of full-Heussler compounds were encoded and fitness for temporal data was for example used by van the authors hoped to implicitly leverage the information encoded in the structure in the periodic table using the CNN.296 5.1.1.4. Case Study: Predicting the Methane Uptake in COFs Using a Dilated CNN. For this case study, we use the XRD pattern as a geometric fingerprint of the structure as it fulfills many of the criteria for an ideal descriptor: it is cheap and invariant to symmetry operations like an expansion of the unit cell. But the way in which information is encoded in the fingerprint makes it not suitable for all learners: one could try using it in kernel machines to do similarity-based reasoning similar to what von Lilienfeld and co-workers have done with radial distribution functions.171 However, one could also try to create a “pattern recognition” modelthis is where CNNs are powerful. Importantly, the patterns do not only span a small range, like neighboring reflexes, but are composed of both fl nearby and far-apart re exes (due to the symmetry selection Figure 23. Schematic illustration of the building principle of an RNN. rules). For this reason, conventional convolution layers might An NN A, uses some input x, like a peak of an XRPD pattern, to be not ideal. We use dilated convolutions to exponentially produce some output h. Importantly, some information is passed from increase the receptive field: Dilated convolutions are basically one NN to the next.


Nieuwenburg to classify phases of matter based on their dynamics, which in their case was a sequence of magnetizations.298 Similarly, Pfeiffenberger and Bates used an RNN to find improved protein conformations in molecular dynamics (MD) trajectories for protein structure prediction.299

Another approach to model sequences is to use autoregressive models, which also incorporate reference to p prior sequence points

X_t − φ_1 X_{t−1} − φ_2 X_{t−2} − ··· − φ_p X_{t−p} = ε_t   (13)

where the φ are the parameters of the model and ε_t is white noise. This approach has for example been used by Long et al. to model the degradation of lithium-ion batteries based on their capacity as a function of the number of charge/discharge cycles.300
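As a small illustration of eq 13 (a sketch with NumPy; the synthetic capacity-fade series and the choice p = 2 are ours), the coefficients φ of an AR(p) model can be estimated by ordinary least squares on lagged copies of the series:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "capacity vs cycle" series: slow decay plus noise.
n_cycles = 300
capacity = 1.0 - 0.001 * np.arange(n_cycles) + rng.normal(0, 0.005, n_cycles)

def fit_ar(series, p):
    """Least-squares estimate of the AR(p) coefficients phi_1 ... phi_p (cf. eq 13)."""
    # Each column contains the series shifted by one additional lag.
    X = np.column_stack([series[p - k - 1 : len(series) - k - 1] for k in range(p)])
    y = series[p:]
    phi, *_ = np.linalg.lstsq(X, y, rcond=None)
    return phi

phi = fit_ar(capacity, p=2)
print("AR(2) coefficients:", phi)

# One-step-ahead prediction for the next cycle from the two most recent values.
next_value = phi @ capacity[-1:-3:-1]
print("predicted capacity for the next cycle:", next_value)
```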

5.1.1.6. Graphs. As indicated above (cf. section 4.2.2.2), graphs are promising descriptors of molecules and crystals as they can provide rich information without the need for precise geometries. But learning from the graph directly requires special approaches. Similar to message-passing neural networks, Xie and Grossman developed convolution operations on the structure graph that let an edge interact iteratively with its neighbors to update the descriptor vector (cf. Figure 24); in this sense it is a special case of the message-passing NNs (cf. section 5.1.1.2).223 Again, this approach has been shown to be promising in the molecular domain before it has been applied to crystals.301

Figure 24. Schematic illustration of the crystal graph CNN developed by Xie and Grossman.223 Reprinted from Xie and Grossman.223 Copyright 2018 by the American Physical Society. After representing the crystal structure as a graph (a) by using the atoms as nodes and the bonds as edges, the graph can be fed into a graph CNN (b). For each node of the graph, K convolutional layers and L1 hidden layers are used to create a new graph that is then, after pooling, sent to L2 hidden layers.

5.2. Limited Amount of (Structured) Data (Wide Data)

Especially for structured data, conventional ML models, like kernel-based models, can often perform equally well or better than neural networks—especially when the amount of data is limited. In any case, it is generally useful to implement the simplest model possible first, to have a baseline and also to ensure that the infrastructure (getting the data into the model, calculating metrics, ...) works before starting to implement a more complex architecture.

5.2.1. Linear and Logistic Regression. The most widely known regression method is probably linear regression. In its ordinary form, it assumes a normal distribution of residuals, but we want to note that also generalized versions are available that work for other distributions. One significant advantage of linear regression is that it is simple and interpretable. One can directly inspect the weights of the model to understand how predictions are made, and it has been the workhorse of cheminformatics. Even though the simple architecture limits the expressivity of the model, this is also an asset as one can use it for initial debugging, feedback loops, and to get some initial baseline results.

5.2.2. Kernel Methods. One of the most popular learning techniques in chemistry is KRR (Figure 25). The core idea behind kernel methods is to improve beyond linear methods by implicitly mapping into a higher-dimensional space which allows treating nonlinearities in a systematic and efficient way (cf. Figure 26).

Figure 25. KRR to learn the Lennard-Jones (12,6) potential. We randomly sampled 80 points on the potential and then tuned the hyperparameters of the kernel and then predicted for all points. The model fails completely to model the strong repulsion due to the lack of training examples in that region.

Figure 26. Visualization of one idea behind the kernel trick—mapping to higher-dimensional spaces to make problems linearly separable. In two dimensions, the data (two different classes, colored in red and blue, respectively) are not linearly separable, but after applying a kernel of the form K(x,y) = x·y + ∥x∥²∥y∥², we can draw a plane to separate the classes (three-dimensional plot on the right).

A naive approach for introducing nonlinearities would be to compute all monomials of the feature columns, e.g., φ(x₁,x₂) = (x₁², x₁x₂, x₂x₁, x₂²). But this can become computationally infeasible for many features. The kernel trick avoids this by using kernel functions, i.e., inner products in some feature space.302 If they are used, the computation scales no longer with the number of features but with the number of data points.

There are strict mathematical rules that govern what a function needs to fulfill to be a valid kernel (Mercer's theorem),302 but the most popular choices for kernel functions are the Gaussian (K(x,x*) = exp(−γ∥x − x*∥²)) or the


Laplacian (K(x,x*) = exp(−γ∥x − x*∥)) kernels, whose width (γ) controls how local the similarity measure is.

The general intuition behind a kernel is to not consider the isolated data points but rather the similarity between a query point x, for which we want to make a prediction, and the training points x* (landmarks, which are usually multidimensional vectors) and to measure this similarity with inner products (as many algorithms can be rewritten in terms of dot products). At the same time, one then uses this similarity measure to work implicitly in a higher-dimensional space where the data might be more easily separable. That is, it is most useful to think about predictions with KRR using the following equation

y(x) = Σ_i a_i K(x, x*_i)   (14)

where y(x) is the prediction for the query point x, the a_i are the weights, and the x*_i are the landmarks (training points). Or, in matrix form, we write

y = K a ⇔ a = K⁻¹ y   (15)

But this equation assumes that K⁻¹ can be found, which might not be the case if no solution or more than one solution satisfies the equation (i.e., it is an ill-posed, unstable, or nonunique problem). For this reason, one typically adds a regularization term λI, with I being the identity matrix (we will explore the concept of regularization in more depth and from another viewpoint in section 6), which acts as a filter; that is, it filters out the noise and makes the inversion more stable and the solution smoother. One then solves

a = (K + λI)⁻¹ y   (16)

The most widely known algorithms which use this kernel trick are support vector machines (SVMs) and KRR. They are equivalent except for the loss function and the fact that the KRR is usually solved analytically. The SVMs use a special loss function, the ϵ-insensitive loss, where errors smaller than ϵ are not considered. The KRR, on the other hand, uses the ridge loss function, which penalizes high weights and which we will discuss in section 6.2.1 in more detail.

One virtue of kernel learning is the mathematical framework which it provides. It allows deriving a scheme in which data of different fidelity can be combined to predict on the high-fidelity level—a concept that was used to learn using a lot of generalized-gradient approximation (GGA) data (PBE functional) to predict hybrid functional level (HSE06 functional) band gaps.303 We will explore this concept, which can be promising for the ML of electronic properties of porous materials with large unit cells, in more detail in section 10.3.

Also, kernels pave an intuitive way to multitask predictions; by using the same kernel for different regression tasks and predicting the coefficients for the different tasks at the same time, Ramakrishnan and von Lilienfeld could predict many properties from only one kernel (computing the kernel model is usually the expensive step as it involves a matrix inversion which scales cubically).304 Due to the relative ease of use of kernel methods and their mathematical underpinning, they are the workhorse of many of the quantum ML works.97,305 Also, kernel methods are useful for the development of new descriptors as they are much more sensitive to the quality of the descriptor than NN or tree-based models, as they are similarity-based. That is, a kernel-based method will likely fail if two compounds that are distant in property space are close in fingerprint space.
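Equations 14–16 translate almost line by line into code. The following sketch (NumPy only; the Lennard-Jones toy data, the Gaussian kernel width, and the regularization strength are arbitrary illustrative choices, not the settings used for Figure 25) fits a small KRR model:

```python
import numpy as np

def lennard_jones(r, eps=1.0, sigma=1.0):
    return 4 * eps * ((sigma / r) ** 12 - (sigma / r) ** 6)

def gaussian_kernel(a, b, gamma=10.0):
    """K(x, x*) = exp(-gamma * ||x - x*||^2), evaluated for all pairs of points."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(0)
x_train = rng.uniform(0.95, 3.0, 80)          # 80 random points on the potential
y_train = lennard_jones(x_train)

lam = 1e-8                                    # regularization strength lambda
K = gaussian_kernel(x_train, x_train)
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)   # eq 16

x_test = np.linspace(0.95, 3.0, 500)
y_pred = gaussian_kernel(x_test, x_train) @ alpha                  # eq 14
print("max abs error on the grid:", np.max(np.abs(y_pred - lennard_jones(x_test))))
```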

5.2.3. Bayesian Learning. Up to now, we surveyed the models from a frequentist point of view in which probabilities are considered as long-run frequencies of events. A more natural framework to look at probabilities is the Bayesian point of view. Bayesian learning is built around Bayes rule306

P(θ|D) = P(D|θ) P(θ) / P(D)   (17)

which describes how the likelihood P(D|θ) (the probability of observing the data given the model parameters) updates the prior beliefs P(θ) after observing the data D; the normalization P(D) is called the evidence. This updated distribution is the posterior distribution P(θ|D) of model parameters θ.

Similar to molecular Monte Carlo simulations, one can use Markov chain Monte Carlo to sample the posterior distribution P(θ|D). Several packages like pymc3307 and Edward308 offer a good starting point for probabilistic programming in Python.

The power of Bayesian modeling is that one can incorporate prior knowledge with the choice of the prior distribution and that it allows for a natural way to deal with uncertainties as the output; the posterior distribution P(θ|D) is a distribution of model parameters θ. Furthermore, it gives us a natural way to compare models: The best model is the one with the highest evidence, i.e., probability of the data given the model.309

An example of how prior knowledge can be incorporated is a work by Mueller and Ceder, who incorporated physical insight to fit cluster expansions, which are simple but powerful models that express the property of a system using single-site descriptors. An archetypal example is the Ising model. They used physically intuitive insights such as the distance of the prediction to a simple model, like a weighted average of pure component properties for the energy of an alloy, or the observation that similar cluster functions should have similar values, to improve the predictive power of such cluster expansions. This is effectively a form of regularization, equivalent to Tikhonov regularization (cf. section 6.2.1).

5.2.3.1. Gaussian Process Regression. Bayesian methods are most commonly used in the form of GPR,310 which drives the Gaussian approximation potentials (GAPs).195 GPR is the Bayesian version of KRR, i.e., it also solves eq 16. In GPR one no longer uses a parametric functional form (like polynomials or a multilayer perceptron (MLP)) to model the data but uses learning to adapt the distribution ("ensemble" of functions), where the initial distribution (the prior) reflects the prior knowledge.311 That is, in contrast to standard (multi)linear regression, one does not directly choose the basis functions but rather allows for a family of different possible functions (this is also reflected in the uncertainty band shown in Figure 27 and the spread of the functions in Figure 28).

We can think of the prior distribution as samples that are drawn from a multivariate normal distribution that is characterized by a mean μ and a covariance C; that is, we can write the prior probability as

P(y) = Normal(μ = 0, C) ∝ exp(−½ yᵀ C⁻¹ y)   (18)


Figure 27. Using Gaussian process regression (GPR) to learn the Lennard-Jones potential (same as in Figure 25). Here, we trained two different GPR models: first, on the same 80 points we used for Figure 25, and then one for a bad training set with "holes", i.e., areas from which we did not sample training points. Again, we tuned the hyperparameters of the kernel and then predicted for all points. We can observe that, similar to our KRR results, our model cannot predict the strong repulsion due to the lack of training points. But, in contrast to the KRR, the GPR gives us an estimate for the uncertainty that is larger when we lack examples in a particular region.

Figure 28. Samples from the prior and posterior distributions for the fit shown in Figure 27 using the same scale for the axes. Here, we assume a zero mean (thick black line) for the prior, but the mean in the posterior is no longer zero after the inference. The standard deviation is shown as a gray area.

Usually, one uses a mean of zero and a covariance matrix cov(y(x), y(x*)) that describes the covariance of function values at x and x*—i.e., it is fully analogous to the kernel in KRR.
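A compact way to reproduce the qualitative behavior shown in Figures 27 and 28 is the GaussianProcessRegressor of scikit-learn (a sketch; the RBF kernel, its initial length scale, and the noise level are our own choices and not necessarily those used for the figures): the model returns a standard deviation alongside every prediction, which grows in regions without training data.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def lennard_jones(r, eps=1.0, sigma=1.0):
    return 4 * eps * ((sigma / r) ** 12 - (sigma / r) ** 6)

rng = np.random.default_rng(0)
# Training set with a deliberate "hole" between r = 1.5 and r = 2.0.
x_train = np.concatenate([rng.uniform(1.0, 1.5, 40), rng.uniform(2.0, 3.0, 40)])
y_train = lennard_jones(x_train)

kernel = RBF(length_scale=0.3) + WhiteKernel(noise_level=1e-6)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gpr.fit(x_train.reshape(-1, 1), y_train)   # kernel hyperparameters optimized on the marginal likelihood

x_test = np.linspace(1.0, 3.0, 200).reshape(-1, 1)
mean, std = gpr.predict(x_test, return_std=True)

# The predictive standard deviation is largest inside the hole of the training set.
hole = (x_test.ravel() > 1.5) & (x_test.ravel() < 2.0)
print("mean uncertainty inside the hole :", std[hole].mean())
print("mean uncertainty outside the hole:", std[~hole].mean())
```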
But in KRR one needs to perform a search over the kernel hyperparameters (like the width of the Gaussian), whereas the GPR framework allows learning the hyperparameters using gradient descent on the marginal likelihood, which is the objective function in GPR.

Also, the regularization term has another interpretation in GPR, as it can be thought of as noise σ_f in the observation

cov(y_i, y_j) = C_ij + σ_f² δ_ij   (19)

with Kronecker delta δ_ij (1 for i = j, else 0). Hence, the regularization also has a physical interpretation, whereas in KRR we introduced a hyperparameter λ that we need to tune. But the most important practical difference is that the formulation in the Bayesian framework generates a posterior distribution and hence a natural estimate of the uncertainty of the prediction. This is especially valuable in active learning settings (cf. section 3.3) where one needs an estimate of the uncertainty to decide whether to trust the prediction for a given point or whether additional training data are needed. This was for example successfully used by Jinnouchi et al. employing ab initio force fields derived in the SOAP-GAP framework.312 During the molecular dynamics simulations of hybrid perovskites, they monitored the uncertainty of the predictions and then could switch to DFT in case the uncertainty was too high and refined the force field with this new training point. Using this approach, which is implemented in VASP 6, they could access time scales that would require years of simulations with first-principles techniques.

5.2.4. Instance-Based Learning. Thinking in terms of distances to training examples, as we do in kernel methods, is also the key ingredient to the understanding of instance-based learning algorithms such as kNN regression. Here, the learner only memorizes the training data and the prediction is a weighted average of the training data. For this reason, kNN regressors are said to be nonparametric—as they do not learn any parameters and only need the data itself to make predictions.

The difference between kernel learning and kNN is that in the case of kernel learning the prediction is influenced by all training examples and the nature of the locality is influenced by the kernel. kNN, on the other hand, only uses a weighted average of the k nearest training examples. This limits the expressivity of the model but makes it easy to inspect and understand. As it requires that examples that are close in feature space are also close in property space, there might be problems in the case of activity cliffs,313 and per definition, such a model cannot extrapolate. Still, such models can be useful especially due to the interpretability. For example, Hu et al. combined kNN with a Gaussian kernel weighting over the k neighbors to predict the capacity of lithium-ion batteries.314 An interesting extension of kNN for virtual high-throughput screenings was developed by Swamidass et al. The idea here is to refine the weighting of the neighbors using a small NN, which allows taking nonlinearities into account.315 The advantages here are the short training time, the low number of parameters, and hence the low risk of overfitting, and the interpretability, which is only slightly lower than for a vanilla kNN.

5.2.5. Ensemble Methods. Ensemble models try to use the "wisdom of the crowds" by using a collection (an ensemble) of several weak base learners, which are often high-variance models such as decision trees, to produce a more powerful predictor.316,317

The power of ensemble models is to reduce the variance (the error due to the finite sample, i.e., the instability of the model) while not increasing the bias of the model. This works if the predictors are uncorrelated.7 In detail, one finds that the variance is given by

variance(x) = ρ(x) σ² + [(1 − ρ(x))/M] σ²   (20)

where ρ(x) is the correlation between the M predictors with variance σ². The bias is given by

bias²(x) = (f(x) − μ)²   (21)

These equations mean that for an infinite number of predictors (M → ∞) with no correlations with each other (ρ = 0) we can completely remove the variance and the only remaining sources of error are the bias of the single predictor and the noise. Hence, this approach can be especially valuable to improve unstable models with high variance. One example for high-variance models are decision trees (DTs) (also known as classification and regression trees (CART)), which build flowchart-like models by splitting the data based on particular values of variables, i.e., based on rules like "density greater than 1 g cm⁻³?" Only one such rule is usually not enough to describe physical phenomena, wherefore usually many rules are chained. But such deep trees can have the problem that their structure


Figure 29. Schematic representation of the two most popular approaches for the creation of ensemble models, bagging (a) and boosting (b).

(splitting rules) is highly dependent on the training set, combines different learners and can use a meta learner to make wherefore the variance is high. One approach to minimize this the final prediction based on the prediction of the different variance is to build ensemble models. Another motivation for models. This approach was, for example, successfully used by ensemble models can be given based on the Rashomon effect Wang, who could reduce the error in predicting atomization which describes that there are usually several models with energies by 38%, compared to the best single learner, using a different functional forms that perform similarly. (Rashomon is stacked model.329 a Japanese movie in which one person dies and four persons witness the crime, and report the same facts at court but in a 6. HOW TO LEARN WELL: REGULARIZATION, different story.) Averaging over them using an ensemble can HYPERPARAMETER TUNING, AND TRICKS resolve to some extent this nonuniqueness problem and make models more accurate and stable.318 6.1. Hyperparameter Tuning There are two main approaches for the creation of ensemble “ ” fi Almost all ML models have several knobs that need to be models (cf. Figure 29): The rst one is called bagging tuned to achieve good predictive performance. The problem is (bootstrap aggregating) in which bootstraps of the training are that one needs to evaluate the model to find the best fitted to a model and the predictions of all models are averaged  fi hyperparameters which is expensive because this involves to give the nal prediction. In RFs, which are one of the most training the model with the set of parameters and then popular models in materials informatics, this idea is combined evaluating its performance on a validation set. This problem fi with random feature selection, in which the model is tted only setting is similar to the optimization of reaction conditions, on a subset of randomly selected features. ExtraTrees, are even where the execution of experiments is time-consuming, ff more randomized by not using the optimal cut at di erent wherefore akin techniques are used. points in the decision tree but the best one from a random 319 The most popular way in the materials informatics selection of possible cuts. Additionally, they also do not use community is to use grid search, where one loops over a bootstraps but the original training set. In a benchmark of ML grid of all possible hyperparameter combinations. Unfortu- models for the prediction of the thermodynamic stability of nately, this is not efficient as all the information about previous perovskites (based on composition features), Schmidt et al. evaluations remains unused and one has to perform an found that ExtraTrees outperform random forest, neural exponentially growing number of model evaluations. It was networks, ridge regression, and also adaptive boosting (which shown that even random search is more efficient than grid 320 we will discuss in the following). search, but especially Bayesian hyperparameter optimization The other approach for the ensembling of models is was demonstrated to be drastically more efficient.330,331 This boosting. Here, models are not trained in parallel but approach is formalized in sequential model-based optimization iteratively, one after another, on the error of the previous (SMBO). The idea behind SMBO is that a (Bayesian) model is model. 
The most popular learners from this category are initialized with some examples and then used to select new 321 322 AdaBoost and gradient boosted decision trees (GBDTs) examples that maximize a so-called acquisition (or selection) which are efficiently (and in a refined version) implemented in function a, which is used to decide which points to choose the XGBoost323 and LightGBM324 libraries. Given that GBDT nextbased on the surrogate model. The task of the models are fast to train on data sets of moderate size, easy to acquisition function is to balance exploration and exploitation, use, and robust, they are a good choice as a first baseline model i.e., to choose a balanced ratio between points x where the on tabular descriptor data.325,326 GBDTs were used in many surrogate model is uncertain (exploration) and points where f, studies on porous materials (cf. section 9). For example, they the target, is maximized (exploitation). The need for an were used by Evans et al. to predict mechanical properties of uncertainty estimate (to be able to balance exploration and zeolites based on structural properties such as Si−O−Si bond exploitation) and the ability to incorporate prior knowledge lengths and angles as well as additional descriptors such as the makes this task ideally suited for Bayesian surrogate models. porosity.327,328 For example, Gaussian processses (GPs) are used to model the An approach that is different from bagging and boosting is expensive function in the spearmint332 and MOE (Metric model stacking. In boosting and bagging one usually uses the Optimization Engine)333 libraries. The SMAC library334 on the same base estimator, like a DT, whereas in stacking one other hand uses ensembles of RFs, which are appealing as they


naturally allow incorporating conditional reasoning.335 A popular optimization scheme is the tree-Parzen estimator (TPE) algorithm, which is implemented in the hyperopt package336 and which has an interface to the sklearn framework337 with the hyperopt-sklearn package.338 The key idea behind the TPE algorithm is to model the hyperparameter selection process with two distributions, one for the good parameters and one for the bad ones. In contrast to that, GPs and trees model it as dependent on the entire joint variable configuration. The Parzen estimator, which is a nonparametric method to estimate distributions, is used to build these distributions. To encode conditional hyperparameter choices, the Parzen estimators are structured in a tree.

6.2. Regularization

Many problems in which we are interested in the chemical sciences and materials science are ill-posed. In some cases, they are not smooth, in other cases, not every input vector is feasible (only a fraction of all imaginable compounds exist at standard conditions), and in other cases, our descriptors might not be as unique as we would want them to be, or we have to deal with noise in the data. Moreover, we often have to cope with little (and wide) data which can easily lead to overfitting. To remedy these problems, one can use regularization techniques.339

Particularly powerful regularization techniques are based on physical or chemical insights, such as the reaction tree heuristic from Rhone et al., where they only consider reaction products that are close to possible outcomes of a rule-based reaction tree.139

In the following, we will discuss more conventional techniques that require no physical or chemical insight and that are applicable to most problems.

6.2.1. Explicit Regularization: Adding a Term or Layer. The most popular way to avoid overfitting is to add a term that penalizes high model weights ("large slopes") to the loss function:

L(w) = λ∥w∥_p   (22)

In most of the cases, one uses either the Manhattan norm (p = 1), which is known as the LASSO (l1), or p = 2, which is known as ridge regularization. As we discussed previously (cf. section 4.3.2.3), the LASSO yields sparse solutions, which can be seen as a general physical constraint. Since the ridge term shrinks high weights smoothly (there are no edges in the regularization hypercube, cf. Figure 30), it does not lead to sparse solutions, but it can be seen as a way to enforce smoother solutions. For example, we do expect potential energy surfaces to vary smoothly with conformational changes—a squiggly polynomial with high weights will hence be a bad solution that does not generalize. Ridge regression can be used to enforce this when training models. For both LASSO and ridge regression, we recover the original solution for λ → 0 and force it to zero for λ → ∞.

Figure 30. Visualization of the l1 and l2 constraints and the solution paths. The solution (dots) of the constrained optimization is at the intersection between the contours of the least-squares solution (red/blue colored ellipses indicating with the color the error for different parameter choices) and the regularization constraint region (black), whose extent depends on λ ∝ 1/t. For λ = 0, we recover the least-squares solution; for λ → ∞, the solution will lie at (0,0). If we increase λ, the optimal solution will tend to be zero in one dimension at the vertex of the LASSO constraint region. For the ridge case, the smooth constraint region will lower the magnitude of the weights but will not force them to exactly zero. Figure created based on an illustration in Tibshirani, Friedman, and Tibshirani31 and code by Sicotte.340

In deep learning (DL) specific regularization layers are often used to avoid overfitting. The most widely known technique, dropout, randomly disables some neurons from training.341 As it is computationally cheap and can be implemented in almost any network architecture, it belongs to the most popular choices.

For trees, one usually uses pruning heuristics to limit overfitting. One can either limit the number of splits or the maximum depth of the trees before fitting them or eliminate some leaves after fitting.342 This idea was also used in NNs, e.g., by automatically deleting weights (also known as optimal brain damage (OBD)).343 This procedure not only improves generalization but can also speed up inference (and training).344
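The practical difference between the two norms in eq 22 is easy to see on synthetic data (a sketch with scikit-learn; the data, the regularization strengths, and the number of informative features are arbitrary): the l1 penalty drives most coefficients to exactly zero, while the l2 penalty only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 100 samples, 30 features, but only 5 of them actually carry signal.
X, y = make_regression(n_samples=100, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)     # l1 (LASSO) regularization
ridge = Ridge(alpha=1.0).fit(X, y)     # l2 (ridge) regularization

print("LASSO: nonzero coefficients:", np.sum(lasso.coef_ != 0))
print("Ridge: nonzero coefficients:", np.sum(ridge.coef_ != 0))
print("Ridge: largest |coefficient|:", np.max(np.abs(ridge.coef_)))
```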
6.2.2. Implicit Regularization: More Subtle Ways to Stop the Model from Remembering. But there are also other, more subtle ways to avoid overfitting. One of the simplest, most powerful, and generally applicable techniques is early stopping. Here, one monitors both the error on the training and a validation set over the training process and stops training as soon as the validation error no longer decreases (cf. Figure 31).346 Another simple and general technique is to inject noise in the training process.347,348

For the training of NN, batch normalization is widely used.349 Here, the input to layers of a DNN is normalized in each training batch; that is, the means and the variance are fixed in this way. It was shown that this can accelerate training, but it also acts as a regularizer as each training example no longer produces a deterministic value as it depends on which batch it is in.349

Similarly, the training algorithm itself, batched stochastic gradient descent (SGD), was shown to induce implicit


distribution when we train a model. To optimize our models toward good performance on unseen data, we need to develop surrogates for the performance on the unseen data (empirical error estimates). An article by Sebastian Rascka gives an excellent overview (see Figure 32)ofdifferent techniques for model evaluation and selection (the mlxtend Python library of the same author implements all the methods we discuss).353 Often, one finds that models are selected, compared, and evaluated based on only one single number, which is the MAE Figure 31. Example of early stopping. For this example, we trained a in many materials informatics applications. But this might not NN (three hidden layers with ReLU activation and 250, 100, and 10  neurons, respectively, followed by linear activation in the output be the optimal metric in all cases especially since such global 345 metrics depend on the distribution of data points (cf. Figure layer) using the Adam optimizer, to predict the CO2 uptake for structures in the database from Boyd et al.13 using RACs and pore 33) and in materials informatics we often do not only want a geometry descriptors as features. We can observe that after model that is “on average right” but one that can also reliably approximately 43 epochs (dotted vertical line) the training error find the top performers. Moreover, in some cases, we want to still decreases, whereas the validation error starts to increase again. consider other parameters such as the training time, the feature set, or the amount of training data needed. Latter we can for example extract from learning curves in which a metric for the regularization due to its stochasticity as only a part of all 350,351 predictive performance, like the MAE, is plotted against the training examples is used to approximate the gradient. 186,354,355 fi number of training points. In general, one nds that stochasticity is a theme underlying The optimal (and feasible) model evaluation methodology many regularization techniques. Either through the addition of depends on the amount of available data, the problem setting noise, by randomly dropping layers, or by making the (e.g., if extrapolation ability is important), and the available prediction not fully deterministic by means of batch normal- computational resources. We will discuss these trade-offs in the ization. This is in some sense similar to bagging as we also following. average over many slightly different models.352 7.1. Holdout Splits and Cross-Validation: Sampling 7. HOW TO MEASURE PERFORMANCE AND without Replacement COMPARE MODELS The most common approach to measure the performance is to In ML, we want to create a model that performs well on create two (or three) different data sets: the training set, on unseen data for which we often do not know the underlying which the learning algorithm is trained on, the development

Figure 32. Model performance evaluation and comparison landscape, following the schema from Raschka.353 Blue boxes represent training data, red ones test data. Validation data, for hyperparameter optimization, is shown with orange boxes. Differences between groups can be shown with Gardner-Altman plots where the data for each group are shown with dots and the effect size is shown with a bootstrapped confidence interval.


Figure 33. Influence of class imbalance on different classification metrics. For this experiment, we used different thresholds (median, mean 2.5 −1 “ ” “ ” mmol g ) for CO2 uptake to divide structures in high performing and low performing (see histogram inset). That is, we convert our problem fi with continuous labels for CO2 uptake to a binary classi cation problem for which we now need to select an appropriate performance measure. We then test different baselines that randomly predict the class (uniform), i.e., sample from a uniform distribution, that randomly draw from the training set distribution (stratified), and that only predict the majority class (majority). For each baseline and threshold, we then evaluate the fi predictive performance on a test set using common classi cation metrics such as the accuracy (red), precision (blue), recall (yellow), F1 score (green), and the area under the curve (AUC) (pink). We see that by only reporting one number, without any information about the class distribution, one might be overly optimistic about the performance of a model; that is, some metrics give rise to a high score even for only random guessing in the case of imbalanced distributions. For example, using a threshold of 2.5 mmol g−1 we find high values for precision for all of our sampling strategies. Note that some scores are set to zero due to not being defined due to zero division.

(or validation set), which is used for hyperparameter tuning, data sets where one does not want to waste any data point, and and the test set, which is the ultimate surrogate for the it is also an almost unbiased estimator since nearly all data is performance on unseen data (cf. Figure 34b). We do not use used for the training. But it comes with a high computational the test set for hyperparameter tuning to avoid data leakage, burden and a high variance (the training set merely changes fi i.e., by tuning our hyperparameters on the test we might over t but the test example can change drastically from one fold to to this particular test set. The most common choice to the next). Empirically, it was found that k = 10 provides a good generate these sets is to use a random split of the available trade-off between bias and variance for many data sets.356 But, data. But there are caveats with this approach.353 First, and one needs to keep in mind that a pessimistic bias might not be especially for small data sets, the number of training points is a problem as in some cases, as in the model selection, we are ff reduced (which introduces a pessimistic bias) in this way. But only interested in relative errors of di erent models. at the same time, the test set must still be large enough to A remedy for the second problem of the holdout method detect statistically significant differences (and avoid too much (the change of the class distributions upon sampling) is variance). Second, one should note that random splitting can stratification (cf. Figure 5), which is a name for the constraint change the statistic, i.e., we might find different class ratios in that the original class proportions are kept in all sets. To use the test set than in the training set, especially in the case of this approach in regression one can bin the data range and little data (cf. the discussion for Figure 5). apply stratification on the bins. fi The most common approach to deal with the rst problem One caveat one should always keep in mind when using is k-fold cross-validation (cf. inner loop in Figure 34a), which cross-validation is that the data splitting procedure must be is an ensemble approach to the holdout technique. The idea applied before any other step of the modeling pipeline here is to give every example the chance to be part of the fi training set by splitting the data set into k parts, using one part ( ltering, feature selection, standardization, ...) to avoid data for the validation and the remaining k − 1 parts for training leakage. The problem of performing for example feature and iterate this procedure k times. A special case of the k-fold selection before splitting the data is that feature selection is method is when the number of folds is equal to the number of then performed based on all data (including the test data) data points, i.e., k = n. This case has a special name, leave-one- which can bias which features are selected (based on the out cross validation (LOOCV), as it is quite useful for small information from the test set)which is an unfair advantage.
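A sketch of how such leakage can be avoided in practice (using scikit-learn; the random data and the model choice are placeholders): by wrapping the preprocessing and feature-selection steps in a Pipeline, they are refit on the training portion of every fold instead of on the full data set.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))          # e.g., 50 descriptors for 200 structures
y = X[:, 0] * 2.0 + rng.normal(size=200)

# Scaling and feature selection are part of the pipeline, so they are
# fitted only on the training folds inside the cross-validation loop.
model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_regression, k=10)),
    ("regressor", RandomForestRegressor(n_estimators=100, random_state=0)),
])

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
print("MAE per fold:", -scores)
print("mean MAE:", -scores.mean())
```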


7.2. Bootstrap: Sampling with Replacement

An alternative to k-fold cross-validation is to artificially create new data sets by means of sampling with replacement, i.e., bootstrapping. If one samples n examples from n data points with replacement, some points might not be sampled (in the limit of large data, only 63.2% will be sampled).357 Those can be used as a leave-one-out bootstrap (LOOB) estimator of the generalization error, and using 50−100 bootstraps, one also finds reliable estimates for confidence intervals (vide infra). Since only 63.2% of the examples are selected, also this estimator is pessimistically biased, and corrections such as the 0.632(+) bootstrap358 have been developed to correct for this pessimistic bias. In practice, the bootstrap is more complicated than the k-fold cross-validation for the estimation of the prediction error, e.g., because the size of the test set is not fixed in the LOOB approach. Therefore, in summary, the 10-fold cross-validation offers the best compromise for model evaluation on modestly sized data sets—also compared to the holdout method, which is the method of choice for large data sets (like for deep learning (DL) applications).359

Figure 34. Comparison of model selection techniques for little and big data. For little data, one can use k-fold cross-validation with a separate test set (a), whereas the holdout method with three sets can be used for big data (b). In k-fold cross-validation the data is split into k folds and one loops over all the k folds, using k − 1 folds as the training set and the kth fold for testing.

7.3. Choosing the Appropriate Regression Metric

One of the most widely known metrics is the R² value (for which several definitions exist, which are equal for the linear case).360 The most basic definition of this score is the ratio between the variance of the predictions and the labels. The problem is that in this way it can be arbitrarily low even if the model is correct and, e.g., on Anscombe's quartet it has the same value for all data sets (cf. Figure 4). Hence, this metric should be used with great care. The choice between the MAE and the mean squared error (MSE) depends on how one wants to treat outliers. If all errors should be treated equally, one should choose the MAE; if large errors should get higher weights, one should choose the MSE. Often, the square root of the latter, the root MSE (RMSE), is used to achieve a metric that is more easily interpretable. To get a better estimate of the central tendency of the errors, one can use for example the median or trimean361 absolute error, which is a weighted average of the median, the first quartile, and the third quartile.

Especially in the process of model development it is valuable to analyze the cases with maximum errors by hand to develop ideas why the model's prediction was wrong. This can for example show that a particular structure class is underrepresented—in which case it might be worth generating more data for this class or to try techniques for imbalanced learning (cf. section 3). In other cases one might also realize that the feature set is inadequate for some examples or that features or labels are wrong.

7.4. Classification

7.4.1. Probabilities That Can Be Interpreted as Confidence. An appealing feature of many classification models is that they output probabilities, and one might be tempted to interpret them as "confidence in the prediction". But this is not always possible without additional steps. Ensemble models, such as random forests, for example tend to rarely predict high or low probabilities.362 To remedy this, one can calibrate the probabilities using either Platt scaling or isotonic regression. Platt scaling is a form of logistic regression where the outputs of the classifier are used as input for a sigmoid function and the parameters of the sigmoid are estimated using maximum likelihood estimation on a validation set. In isotonic regression, on the other hand, one fits to a piecewise constant, stair-shaped function, which tends to be more prone to overfitting. To study the quality of the probabilities that are produced by a classifier, it is convenient to plot a reliability diagram in which the probabilities are divided into bins and plotted against their relative frequency. A well-calibrated classifier should fall onto the diagonal of this plot.

7.4.2. Choosing the Appropriate Classification Metric. Especially in a case in which one wants to identify the few best materials, accuracy—although widely used—is not the ideal classification metric. This is the case as accuracy is defined as the ratio of correct predictions over the total number of predictions and can, in the case of imbalanced classes, be maximized by always predicting the majority class—which certainly is not the desired outcome (cf. Figure 33). Popular alternatives to the accuracy are precision and recall:

accuracy = (true pos + true neg)/(true pos + true neg + false pos + false neg)   (23)

precision = true pos/(true pos + false pos)   (24)

recall = true pos/(true pos + false neg)   (25)

The precision will be low when the model classifies many negatives as positives, and the recall, on the other hand, will be low if the model misses many positive results. Similar to

accuracy, these metrics have their issues, e.g., recall can be maximized by predicting only the positive class. But as there is usually a trade-off between precision and recall, summary metrics have been developed. The F1 score tries to summarize precision and recall using a harmonic mean

F1 = 2/(1/precision + 1/recall) = 2 · (precision · recall)/(precision + recall)   (26)

which is useful for imbalanced data. Since the classification usually relies on a probability (or score) threshold (e.g., for binary classification we could treat all predictions with probability >0.3 as positive), receiver-operating characteristic (ROC) curves are widely used. Here, one measures the classifier performance for different probability thresholds and plots the true positive rate [true positives/(true positives + false negatives)] against the false positive rate [1 − true negatives/(true negatives + false positives)]. A random classifier would fall on the diagonal of a ROC curve, and the optimal classifier would touch the top left corner (only true positives). This motivated the development of metrics that try to capture the full curve in only one number. The most popular one is the AUC,363,364 but also this metric is no "silver bullet". For example, care has to be taken when one wants to use the AUC as a model selection criterion. For instance, the AUC will not carry information about how confident the models are in their predictions—which would be important for model selection.365

Related to ROC curves are precision-recall curves. They share the recall (true positive rate) with the ROC curves but plot it against the precision, which is, for a small number of positives, more sensitive to false positive predictions than the false positive rate. For this reason, we see an increasing difference between the ROC and the precision-recall curves366 with increasing class imbalance (cf. Figure 35).

Figure 35. Comparison of precision-recall (top) and ROC (bottom) curves for different thresholds for the binary classification of CO2 uptake (same as for Figure 5). For this example, we fitted a GBDT classifier on the data set from Boyd et al.13 We can observe that for increasingly imbalanced class distributions (e.g., higher threshold for "high performing" MOFs, i.e., there are few of them) the difference between the shape of the precision-recall curve and the ROC, as well as the area under those curves, is more different. For imbalanced classes, the precision-recall curve (and the area under this curve) is a more sensible measure of model performance.

Usually, it is also useful to print a confusion matrix in which the rows represent the actual classes and the columns the predicted ones. This table can be useful to understand between which classes misclassification happens and allows for a more detailed analysis than a single metric. A particularly useful Python package is PyCM, which implements most of the classification metrics, including multiclass confusion matrices.367
7.5. Estimating Extrapolation Ability

For some tasks, like the discovery of new materials, one wants models that can robustly extrapolate. To estimate the extrapolation ability, specific metrics have been developed. The leave-one-cluster-out cross-validation (loco-cv) technique proposed by Meredig et al.368 is an example of such a metric. The key idea is to perform clustering in the n cross-validation runs, leave one of the clusters out of the training set, and then use this cluster as the test set. Xiong et al. propose a closely related approach: instead of clustering the data in feature space, they partition the data in target property space and use only a part of property space for training in a k-fold cross-validation loop and the holdout part for testing purposes.369 Similar to that is the scaffold splitting technique,366 in which the two-dimensional framework of molecules370 is used to separate structurally dissimilar molecules into training and test set.

7.6. Domain of Applicability

In production, one would like to know if the predictions the model gives are reliable. This question received particular attention in cheminformatics371,372 with the emphasis of the registration, evaluation and authorization of chemicals (REACH) regulations on the reliability of QSAR predictions.373−375 Often, comparing the training and production distributions is a good starting point to understand if a model can work. Here, one could first consider whether the descriptor values of the production (test) examples fall into the range of the descriptors of the training examples (bounding box estimate). This approach gives a first estimate of whether the prediction is made on solid ground, but it does not consider the distribution of the training examples; that is, it might overlook "holes" in the training distribution.371 But it is easy to implement and can, for example, be used during a molecular simulation with a NN potential: if a fingerprint vector outside the bounding box is detected, a warning could be raised (or the ab initio data can be calculated in an active learning setting).290

More involved methods often use clustering,376 subgroup discovery,377 and distances to the nearest neighbors of the test datum. If this distance is greater than a threshold, which can be based on the average distance of the points in the training set, the model can be considered unreliable. Again, the choice of the distance metric requires some testing.

More elaborate are methods based on the estimation of the probability density distributions of data sets and the evaluation of their overlaps. These methods are closely related to kernel mean matching (KMM), a method to mitigate covariate shift, which attempts to estimate the density ratio between the test (production) and training distributions and then reweighs the training distribution to more closely resemble the test (or production) distribution.378
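A bounding-box applicability check of the kind described above is straightforward to implement; the sketch below (hypothetical descriptor matrices, only for illustration) flags production samples whose descriptors fall outside the range spanned by the training set:

```python
# Minimal sketch of a bounding-box domain-of-applicability check (illustrative data).
import numpy as np

def outside_bounding_box(X_train: np.ndarray, X_query: np.ndarray) -> np.ndarray:
    """Return a boolean mask that is True for query rows with at least one
    descriptor outside the [min, max] range seen during training."""
    lower, upper = X_train.min(axis=0), X_train.max(axis=0)
    return ((X_query < lower) | (X_query > upper)).any(axis=1)

# Usage: raise a warning (or trigger an active-learning step) for flagged samples.
X_train = np.random.rand(100, 5)          # placeholder training descriptors
X_query = np.random.rand(10, 5) * 1.2     # some of these will fall outside the box
mask = outside_bounding_box(X_train, X_query)
print(f"{mask.sum()} of {len(mask)} query points are outside the training bounding box")
```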


Figure 36. Example of inductive conformal prediction for regression with tree models.

7.7. Confidence Intervals and Error Estimates

The outputs of ML models are random variables: with respect to the sampling, e.g., how the training and test set are created (cf. sections 3.2 and 6),379 with respect to the optimization (one may end up in a different local minimum for stochastic minimization), and in some cases also with respect to the initialization. Hence, one needs to be aware that there are error bars around the predictions of any ML model, which one needs to consider when comparing models (cf. section 7.8), when using the predictions, or simply to estimate the stability of a learning algorithm. In addition, reliable error estimates are also needed to make predictions based on ML models trustworthy. Bayesian approaches automatically produce uncertainty estimates (cf. section 5.2.3) but are not applicable to all problem settings. In the following, we review techniques that can be used to get error estimates in a model-agnostic way.

7.7.1. Ensemble Approach. Based on the insight that the outputs are random variables, it seems natural to use an ensemble approach to calculate error bars.380 One of the most popular ways to do this is to train the same model on different bootstraps of the data set and then take the variance of this ensemble as a proxy for the error bars. This is connected to two insights: first, the training set is only one particular realization of a probability distribution (which is the key idea behind the bootstrap), and second, the variance of the ensemble will be larger for cases in which the model is uncertain and has seen few training data.381

A related approach is to use the same data but to vary the architecture of the model, e.g., the number of hidden layers. If the variance between the predictions in a particular part of chemical space is too large, this indicates that the models are still too "flexible" and need more training data in that particular region.290 In contrast to the bootstrap approach, the ensemble surrogate can also be used in production, i.e., when we do not know the actual labels.

The fact that all ensemble or resampling approaches increase the computational cost motivated the development of other approaches for uncertainty quantification.

7.7.2. Distance-Based. Most of the distance-based uncertainty surrogates are based on the idea that there is a relationship between the distance of a query example from the training set and the uncertainty of the prediction. This is directly related to the concept of the domain of applicability, which we discussed above (cf. section 7.6). Although this approach may seem straightforward, there are caveats: the feature vector and the distance metric must be carefully chosen to allow for the calculation of a meaningful distance. Also, this approach is not applicable to models that perform representation learning (cf. section 5.1.1.2).

This motivated Kulik and co-workers to develop uncertainty estimators that are cheaper than ensemble approaches and applicable to NNs in which feature engineering happens in the hidden layers.382 The idea of this approach is to use the distance in the latent space of the NN, calibrated by fitting it to a conditional Gaussian distribution of the errors, as a surrogate for the uncertainty.
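As a concrete illustration of the ensemble approach of section 7.7.1, the following minimal sketch (illustrative data; scikit-learn assumed) trains the same model on several bootstrap resamples and uses the spread of the ensemble predictions as an error-bar surrogate:

```python
# Minimal sketch of bootstrap-ensemble error bars (illustrative data).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 4))                                   # placeholder descriptors
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=300)

models = []
for seed in range(20):                                           # 20 bootstrap resamples
    X_boot, y_boot = resample(X, y, random_state=seed)
    models.append(GradientBoostingRegressor(random_state=seed).fit(X_boot, y_boot))

X_query = rng.uniform(size=(5, 4))
predictions = np.stack([m.predict(X_query) for m in models])     # shape (n_models, n_query)
mean, std = predictions.mean(axis=0), predictions.std(axis=0)    # std as a proxy for the error bar
for mu, sigma in zip(mean, std):
    print(f"{mu:.2f} +/- {sigma:.2f}")
```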


7.7.3. Conformal Prediction. A less widely known technique is conformal prediction, which is a rigorous mathematical framework that only assumes exchangeability (which holds for independently and identically distributed (i.i.d.) data, as is usually assumed for interpolative applications of ML) and can be used with any learning framework at minimal cost. Practically, given a test datum x_i and a significance level of choice ϵ ∈ (0, 1), a conformal predictor calculates a prediction region Γ_i^ϵ ⊆ Y that contains the ground truth y_i ∈ Y with a probability of 1 − ϵ. The idea behind this concept (cf. Figure 36) is to compute nonconformity scores that measure the "uniqueness" of an example, using a nonconformity function evaluated on a calibration set (green in Figure 36). For regression this can be the absolute error (∥y_i − ŷ_i∥), which can additionally be scaled by a measure of uncertainty, like the variance σ_i between the different trees in a random forest:383−385

\alpha_i = \frac{\lVert y_i - \hat{y}_i \rVert}{\exp(\sigma_i)}   (27)

One then sorts this list of nonconformity scores, chooses the nth percentile (e.g., the 60th percentile, α_CL, corresponding to a confidence level of 60%), and computes the prediction region for a test example (red in Figure 36) as

\hat{y}_i \pm \exp(\sigma_i)\,\alpha_{\mathrm{CL}}   (28)

The review by Cortés-Ciriano and Bender gives a more detailed overview of the possibilities and limitations of conformal prediction in the chemical sciences, especially for drug discovery,385 and a tutorial by Shafer and Vovk provides more theoretical background.386 A Python package that implements the conformal prediction framework is nonconformist.387

7.8. Comparing Models

One of the reasons why we focus on developing robust metrics and measures of variance is to be able to compare the predictive performance of different models. Even though it is sometimes done, one cannot simply compare the metrics; such a comparison is not meaningful given that the predictions are random variables with an error bar around them. The task of the modeler is to identify statistically significant and relevant differences in model performance. There is a range of statistical tools that try to identify significant differences.388 Some of the fallacies and the most common techniques are discussed in a seminal paper by Dietterich.388 If the difference between the errors of two models is small, or not even statistically significant, one usually prefers, following Occam's razor, the simpler model. One popular rule of thumb is the one-standard-error rule, according to which one chooses the simplest model within one standard error of the best performing one.31,353

The simplest approach to compare two models is to perform a z-test, which practically means checking whether their confidence intervals overlap. But this tends to show differences even if there are none (due to non-independent training and/or test sets in resampling approaches, which results in a variance estimate that is too small). It was found that one of the most reliable estimates is the 5 × 2-fold cross-validated t-test, in which the data is split into training and test set five times. For each fold, the two models that shall be compared are fitted on the training set and evaluated on the test set (and the sets are rotated afterward), which results in two performance difference estimates per fold. The variance of this procedure can be used to calculate a t-statistic, which was shown to have a low type-1 error but also low replicability, i.e., different results are obtained when the test is rerun.389

Using statistical tests for model comparison leads to another problem when one compares more than two models: the problem of multiple comparisons, for which additional corrections, like the Bonferroni correction, need to be applied. Also, problems with the interpretability of p-values are widely discussed outside the ML domain. For these reasons, it is often not practical to use such statistical tests, and estimation statistics might be the method of choice.390−392 It is more meaningful to compare effect sizes, e.g., differences between the accuracies of two classifiers, and the corresponding confidence interval than to rely on a dichotomous decision based on the p-value. A convenient format to do this can be a Gardner-Altman plot for bootstrapped performance estimates. Here, each measurement is plotted together with the means and the bootstrapped confidence interval of the effect size, which is particularly useful if the main focus of a study is to compare algorithms. A Python package that creates such plots is DABEST.393

7.8.1. Ablation Studies. When designing a new model, one often changes multiple parameters at the same time: the network architecture, the optimizer, or the hyperparameters. But to understand what caused an improvement, ablation studies, where one removes one part of the set of changes and monitors the change in model performance, can be used. In several instances, it was shown that not a more complex model architecture but rather a better hyperparameter optimization is the reason for improved model performance.394−396 Understanding and reporting where the improvement stems from is especially important when the main objective of the work is to report a new model architecture.
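A minimal sketch of the 5 × 2-fold cross-validated paired t-test described in section 7.8 (Dietterich's procedure; illustrative data; scikit-learn and SciPy assumed) could look as follows:

```python
# Minimal sketch of the 5x2cv paired t-test for comparing two classifiers (illustrative data).
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
clf_a = GradientBoostingClassifier(random_state=0)
clf_b = LogisticRegression(max_iter=1000)

variances, first_diff = [], None
for i in range(5):                                         # five different 50/50 splits
    X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=i)
    diffs = []
    for X_tr, y_tr, X_te, y_te in [(X1, y1, X2, y2), (X2, y2, X1, y1)]:
        err_a = 1 - clf_a.fit(X_tr, y_tr).score(X_te, y_te)
        err_b = 1 - clf_b.fit(X_tr, y_tr).score(X_te, y_te)
        diffs.append(err_a - err_b)                        # performance difference on this fold
    if first_diff is None:
        first_diff = diffs[0]                              # first difference of the first replication
    mean = np.mean(diffs)
    variances.append((diffs[0] - mean) ** 2 + (diffs[1] - mean) ** 2)

t = first_diff / np.sqrt(np.mean(variances))               # t-statistic with 5 degrees of freedom
p_value = 2 * stats.t.sf(abs(t), df=5)
print(f"t = {t:.3f}, p = {p_value:.3f}")
```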
7.9. Randomization Tests: Is the Model Learning Something Meaningful?

With the number of tested variables, the probability of chance correlations increases; ideally, however, we want a meaningful model. Randomization tests, where either the labels or the feature vectors are randomized, are powerful ways to ensure that the model learned something for the right, or at least reasonable, reasons. y-scrambling,397 where the labels are randomly shuffled, is hence known as the "probably most powerful validation strategy" for QSAR (cf. Figure 37).398 A web app available at go.epfl.ch/permutationplotter allows performing basic permutation analysis online and exploring how easy it is to generate "patterns" using random data. The importance of randomization tests has recently been demonstrated for a model for C−N cross-coupling reactions.399 Chuang and Keiser showed that "straw" models which use random fingerprints perform similarly to the original model trained on chemical features.400 This showcases that randomization tests can be a powerful tool to understand whether the model learns causal chemical relationships or not.

Figure 37. Example of a y-scrambling analysis to assess the significance of a performance metric. For this example, we built two simple GBDT classifiers that attempt to classify the materials from Boyd et al.13 into structures with high and low CO2 uptake, respectively. We trained one of them using RACs and pore property descriptors and the other one using only the cell volume as descriptor. We measure the performance using the AUC and can observe that the model trained on the full feature set can capture a relationship in the real data (red line) and significantly (p < 0.01) outperforms the models with permuted labels (bars). The model trained only on the cell volume does not perform better than random.

8. HOW TO INTERPRET THE RESULTS: AVOIDING THE CLEVER HANS

Clever Hans was a horse that was believed to be able to perform intellectual tasks like arithmetic operations (it was later shown that it did this by observing the questioner). In ML, there is also the risk that the user of a model can be deceived by the model and (unrightfully) believe that a model makes predictions based on physical or chemical rules it (supposedly) learned.401 In the following, we describe methods that can be used to avoid "black boxes", or at least peek inside them, in order to debug models, to understand problems with the underlying data set, or to extract design rules.
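A y-scrambling analysis of the kind shown in Figure 37 can be sketched as follows (illustrative data; scikit-learn assumed); the reported p-value is the fraction of label-permuted models that reach at least the score of the model trained on the real labels:

```python
# Minimal sketch of a y-scrambling (label permutation) test (illustrative data).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import permutation_test_score

X, y = make_classification(n_samples=500, random_state=0)
clf = GradientBoostingClassifier(random_state=0)

# Score on the real labels vs. the distribution of scores on shuffled labels.
score, perm_scores, p_value = permutation_test_score(
    clf, X, y, scoring="roc_auc", n_permutations=100, random_state=0
)
print(f"AUC on real labels: {score:.2f}")
print(f"mean AUC on permuted labels: {perm_scores.mean():.2f}, p = {p_value:.3f}")
```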

This is especially valuable when high-level, physical, and interpretable features are used.

Unfortunately, the term "interpretable" is not well-defined.402 Sometimes, the term might be used to describe efforts to understand how the model works (e.g., whether one could replicate what the model does using pen and paper), and in other instances it might be used to generate post-hoc explanations that one could hope to use for inferring general design rules. Still, one needs to keep in mind that we draw conclusions and interpretations only based on the model's reasoning (and the underlying training data), which can be a crude approximation of nature; without proof of the predictive ability of the underlying models, such analyses remain inutile.318 For a more comprehensive overview of the field of interpretable ML we recommend the book by Molnar.403

8.1. Consider Using Explainable Models

Cynthia Rudin makes a strong point against post-hoc explanations.29 If they were completely faithful, there would be no need for the original model in the first place. Especially for high-stakes decisions, a post-hoc explanation that is right 90% of the time is not trustworthy. To avoid such problems, one can attempt to first use simple models that might be intrinsically interpretable, e.g., in terms of their weights. Obviously, simple models such as linear regression reach their limits of expressivity for some problems, especially if the feature sets are not optimal.

Generalized additive models (GAMs) try to combine the advantages of linear models, namely that for each feature one can analyze the weight (due to the additivity) and get confidence intervals around it, with the flexibility to describe nonlinear patterns (cf. Figure 38). This can be achieved by using the features via smooth, nonparametric functions, like splines:

g(\mathbb{E}_Y(y \mid x)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_p(x_p)   (29)

GAMs are hence additive models that describe the outcome by adding up smooth relationships between the target and the features. Linear models can be seen as a special case of GAMs, where the f are restricted to be linear.

Figure 38. Examples of the splines for features that we used in a GAM to predict the N2 uptake for structures in the database of Boyd et al.13 Overall, we can observe that the surface area (ASA) and the minimum negative charge (MNC) have only a small influence on the prediction, whereas an increase in density leads to a stark decrease in the model outcome.

One drawback of such additive models is that interaction effects have to be incorporated by creating a specific interaction feature like f(density · surface area) (in case one assumes that the interaction between the density and the surface area is important). A modification by Caruana et al. includes pairwise interactions in the form of f(x1, x2) by default404 and is implemented in the interpret package.405
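A GAM of the form of eq 29 can be fitted, for example, with the pygam package (a sketch assuming pygam is installed; the data are only illustrative):

```python
# Minimal sketch of fitting a GAM with one spline term per feature (illustrative data).
import numpy as np
from pygam import LinearGAM, s

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))          # e.g., density, surface area, pore diameter (placeholders)
y = np.sin(3 * X[:, 0]) - 2 * X[:, 1] ** 2 + 0.1 * rng.normal(size=500)

gam = LinearGAM(s(0) + s(1) + s(2)).fit(X, y)   # additive spline terms, cf. eq 29

# Inspect the smooth contribution of each feature (the kind of curves plotted in Figure 38).
for term_idx in range(3):
    grid = gam.generate_X_grid(term=term_idx)
    partial = gam.partial_dependence(term=term_idx, X=grid)
    print(f"feature {term_idx}: contribution ranges from {partial.min():.2f} to {partial.max():.2f}")
```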
Similar to DTs, which we do not recommend due to their instability and the fact that they are only interpretable when they are short, decision rules formulate if−then statements. The simplest approach to create such rules is to discretize continuous variables and then create cross tables between feature values and model outcomes. Afterward, one can attempt to create decision rules based on the frequency of the outcomes, e.g., "if ρ > 2 g cm−3 then deliverable capacity low, and if 1 g cm−3 < ρ < 2 g cm−3 then deliverable capacity high". Further developments provide safeguards against overfitting, and multiple features can be taken into account by deriving rules from small DTs. One of the main disadvantages of this method is that it needs discretization of features and targets, which induces steps in the decision surfaces. The skater Python library implements this technique.406

Short DTs are also used in the RuleFit algorithm.407 Here, Friedman and Popescu propose to create a linear model with additional features that have been created by decomposing decision trees. The model is then sparsified using the LASSO. The problem with this approach is that, although the features and rules themselves might be interpretable, there might be problems in combining them when there are overlapping rules. This is the case since the interpretation of the weights of linear models assumes that all other weights remain fixed (e.g., there can be problems with collinear features).

Another form of interpretability can be achieved using kNN models. As the model does not learn anything (cf. section 5.2.4), the explanation for any prediction is the k closest examples from the training set, which works well if the dimensionality is not too high (cf. section 4.1.0.1).

This also illustrates the two different levels of interpretation one might aim for. Some methods, such as the coefficients of linear models or the feature importance rankings for tree models (see below), give us global interpretations (integrated over all data points), whereas other techniques, such as kNN, give us local explanations for each sample; some techniques can give us both (like SHapley Additive exPlanations (SHAP), see below).

8.2. Post-Hoc Techniques to Shine Light Into Black Boxes

The most popular approach to extract interpretations from ML models in the materials informatics domain is to use feature importance, often based on where in a tree model a feature contributed to a split (an early split is more important) or how good this split was, e.g., by measuring how much it reduces the model's variance. Most of these methods fall under the umbrella of sensitivity analysis,408,409 which is widely known as the study of how uncertainty in the output of models is related to the uncertainties in the inputs, by studying how the model reacts to changes in the input. Unfortunately, there are problems with several of those techniques, such as the fact that some of them are biased toward high-variance features.410,411

There are several model-agnostic alternatives that attempt to avoid this problem. Isayev et al. used partial dependence plots (cf. Figure 39) to interrogate the influence of the features and their interactions on the model outcome.210
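Such partial dependence analyses are available off the shelf; a minimal sketch with scikit-learn (illustrative data) is:

```python
# Minimal sketch of a partial dependence analysis (illustrative data).
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

X, y = make_regression(n_samples=500, n_features=4, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# One-way partial dependence of feature 0 and two-way interaction of features (0, 1),
# obtained by averaging the prediction over all remaining features (cf. eq 30 below).
pd_single = partial_dependence(model, X, features=[0])
pd_pair = partial_dependence(model, X, features=[(0, 1)])
print(pd_single["average"].shape)   # averaged predictions on the grid of feature 0
print(pd_pair["average"].shape)     # 2D grid that can be plotted as an interaction heatmap
```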


Figure 39. Partial dependence plots of ΔIPbond. The first plot reflects the physical intuition that more polar bonds (larger ionization potential difference) have larger band gaps. Interactions between two features are shown in (b) and (c). For example, we can observe that materials with higher density, ρ, and lower average ΔIPbond statistically have a larger band gap. Figure reprinted from Isayev et al.210

Figure 40. Summary plot of SHAP feature importance for a GBDT model, trained using pore property descriptors (POV: pore occupiable volume, Di: diameter of the largest included sphere, Df: diameter of the largest free sphere, Dif: diameter of the largest included sphere along the free path) to predict the N2 uptake from the CO2/N2 mixture data from Boyd et al.13 Note that we chose the N2 uptake as one expects the pore geometry to be more important than the chemistry, which simplifies the example. The violins in this plot show the distributions of the importance, i.e., the spread of the SHAP values (along the abscissa) and how many samples we have for different SHAP values (the thickness of the violin). The coloring encodes the value of the features, red meaning high feature values whereas blue represents low feature values (e.g., high vs low density). The SHAP value is shown on the abscissa and reflects how a particular feature (one feature per row) with a value represented by the color impacts the prediction. For example, a high density (red color in the second row) leads to lower predictions for N2 uptake (indicated by negative SHAP values).

Such partial dependence plots are computed by marginalizing over all the other features x_c, which are not plotted (cf. eq 30):

\hat{f}_{x_s}(x_s) = \int \hat{f}(x_s, x_c)\,\mathrm{d}P(x_c)   (30)

The integral over all the other features x_c is in practice estimated using Monte Carlo (MC) integration. By integrating over all but two variables, one can generate heatmaps that show how the target property varies as a function of those two features, assuming that they are independent of all other features. The latter assumption is the biggest problem with partial dependence plots.

Another powerful method, the permutation technique, shares this problem. In the permutation technique, one tries to estimate the global importance of features by measuring the difference between the error of a model trained with fully intact feature columns and one where the values for the feature of interest are randomly permuted. To remedy issues due to correlated features,412 one can permute them together. The permutation technique was, for example, used by Moosavi et al. to capture the importance of synthesis parameters in the synthesis of HKUST-1.21

One technique that attempts to provide consistent interpretations, on both local and global levels, is the use of Shapley values. The idea is based on a game-theoretical problem in which a group of players receives a reward and one wants to estimate the optimal payout for each player, in such a way that it reflects the contribution of each player. The players in the case of ML are the features, and the reward is the prediction. Again, this involves marginalization over all the features we are not interested in, but considering all possible ways in which the feature can enter the model (similar to all the possible teams a player could be in). Considering all possible combinations of features is computationally unfeasible, wherefore Lundberg and Lee developed new algorithms, called SHAP, to calculate it efficiently (exact for trees and approximate for kernel methods; see Figure 40 for an example).413−415 In contrast to partial dependence plots, which show average effects, the plots of the feature values against the importance will appear dispersed in the case of the SHAP technique, which can give more insight into interaction effects. This technique has started to find use in materials informatics. For example, Korolev et al. used SHAP values to probe their ML model for partial charges of MOFs. There they, for example, find that the model (a GBDT) correctly recovers that the charge should decrease with increasing electronegativity.416
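Both the permutation technique and SHAP values are available as off-the-shelf tools; the sketch below (illustrative data; scikit-learn and the shap package assumed) shows both for a GBDT regressor:

```python
# Minimal sketch of permutation importance and SHAP values (illustrative data).
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=500, n_features=4, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Global importance: drop in performance when one feature column is shuffled.
perm = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print("permutation importances:", perm.importances_mean)

# Local (and, when aggregated, global) importance: exact TreeSHAP for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)          # one value per sample and feature
print("mean |SHAP| per feature:", abs(shap_values).mean(axis=0))
# shap.summary_plot(shap_values, X)             # produces a plot of the kind shown in Figure 40
```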


Table 2. Overview of Learning Methods That We Discussed in Section 5 and Examples of Their Use in the Field of Porous Materials^a

method | section | application to porous materials

Representation learning:
HDNNP | 5.1.1.1 | trained on fragments for MOF-5 by Behler and co-workers286
message-passing NN | 5.1.1.2 | not used for porous materials so far
convolutional or recurrent NN | 5.1.1.3 | Wang et al. used CNNs to classify MOFs based on their XRPD patterns135
crystal-graph based models | 5.1.1.6 | Korolev et al. use them to predict bulk and shear moduli of pure-silica zeolites and the Xe/Kr selectivity of MOFs433
generative models | 2.1.2.2.2 | ZeoGAN by Kim and co-workers434 (cf. section 9.7)

Classical statistical learning:
linear models | 5.2.1 | predicting gas uptakes based on tabular data of simple geometric descriptors246
kernel methods | 5.2.2 | predicting gas uptakes based on graphs and geometric properties;435 might also be interesting in the SOAP-GAP framework, as work by Ceriotti and co-workers as well as Chehaibou et al. showed436,437
ensemble models | 5.2.5 | often used in the form of RF or GBDT to predict gas uptakes based on tabular data of simple geometric descriptors; ensembles used to estimate uncertainty when predicting oxidation states438
Bayesian methods | 5.2.3 | have been used, e.g., in the form of GPR435 or Bayesian NNs,439,440 but not all features, like the uncertainty measure, have been fully exploited so far; this might be useful for active learning, e.g., for MD simulations in the Bayesian formulation of the SOAP-GAP framework
TDA | 4.2.2.4.1 | Moosavi, Xu et al. built KRR models for gas uptake in porous organic cages,185 Zhang et al. for gas uptake in MOFs,233 and Lee et al. for similarity analysis237

Other ML techniques:
automated machine learning | 10.1 | Tsamardinos et al. use the Just Add Data tool to predict the CH4 and CO2 capacity of MOFs,441 and Borboudakis et al. use the same tool to predict CO2 and H2 uptakes172
data augmentation | 3 | Wang et al. used it for the detection of MOFs based on their diffraction patterns135
transfer learning | 10.3 | He et al. used it for the prediction of band gaps245
active learning | 3.3 | could be used for MD simulations using ML force fields,312 or to guide the selection of the next experiments or computations
capturing the provenance of ML experiments | 10.2 | Jablonka et al. used comet.ml to track the experiments they ran for building models that can predict the oxidation state of metal centers in MOFs438
Δ-ML | 10.3 | Chehaibou et al. used a Δ-ML approach to predict random phase approximation (RPA) adsorption energies in zeolites437

^a For some methods there has been no application reported in the field of porous materials, and we instead provide ideas of possible applications.
conditions that they know to work (Matthew effect Still, one should keep in mind that it has also been shown 423  that there can be stability problems with SHAP.417 mechanism ) which is similar to the bias toward certain reaction types which Schneider et al. found in the U.S. patent For NNs techniques that analyze the gradients are popular. 424 The magnitude of the partial derivative of the outputs with database. Jia et al. trained ML models on randomly selected respect to the input was for example also used by Esfandiari et reaction conditions and on larger, human-selected reaction al. to assign importance values to the features they used for conditions from the chemical literature and found that the 418 models trained on random conditions outperform the models their NN that predicts the CO2/CH4 separation factor. Related is work by Umehara et al., who used gradient trained on (anthropogenically biased) conditions from the analysis to visualize the predictions of neural networks and literature for the prediction of crystal formation of amine- showed that this analysis can reveal structure−property templated metal oxidesdue to a better sampling of feature 425 relationships for the design of photoanodes.419 This technique, space. where one calculates the partial derivative in the ith feature Some features in our feature set might encode such dimension for the jth sample anthropogenic biases. Auditing techniques, as for example implemented in the BlackBoxAuditing package,426 try to ∂f ()x estimate such indirect influences. In a high-stake decision = Gij case, an example for indirect influence might be a zip-code ∂xi (31) feature that is a proxy for ethnicitywhich we then should is also known as saliency mapping. Thanks to libraries like tf- drop to avoid that our model is biased due to the ethnicity. explain420 and keras-vis,421 appealing visualizations of model In scientific data sets, such indirect influences might stem explanations are often only one function call away, but one from artifacts in the data collection process or nonuniqueness should be aware that there are many caveats wherefore some of specific identifiers (which could be interpreted in different sanity checks (such as randomization tests or addition of ways by different tools).427 The estimation of indirect noise) should be used before relying on such a model influences works by perturbing a feature in such a way interpretation.417,422 (typically by random perturbation) that it no longer can be

8.3. Auditing Models: What Are Indirect Influences?

In the mainstream ML community, algorithmic fairness, e.g., to prevent racial bias, is a pressing problem. One might expect that this is not a problem in scientific data sets. Jia et al. showed that reaction data sets are also anthropogenically biased, e.g., by experimenters selecting reactants and reaction conditions that they know to work (Matthew effect mechanism423), which is similar to the bias toward certain reaction types which Schneider et al. found in the U.S. patent database.424 Jia et al. trained ML models on randomly selected reaction conditions and on larger, human-selected reaction conditions from the chemical literature and found that the models trained on random conditions outperform the models trained on (anthropogenically biased) conditions from the literature for the prediction of crystal formation of amine-templated metal oxides, due to a better sampling of feature space.425

Some features in our feature set might encode such anthropogenic biases. Auditing techniques, as for example implemented in the BlackBoxAuditing package,426 try to estimate such indirect influences. In a high-stakes decision case, an example of an indirect influence might be a zip-code feature that is a proxy for ethnicity, which we then should drop to avoid that our model is biased due to the ethnicity. In scientific data sets, such indirect influences might stem from artifacts in the data collection process or from the nonuniqueness of specific identifiers (which could be interpreted in different ways by different tools).427

The estimation of indirect influences works by perturbing a feature in such a way (typically by random perturbation) that it no longer can be predicted by the other features. Similar to the perturbation techniques discussed above for (direct) feature importance, one then measures the drop in performance between the original model and the one with the perturbed feature. And indeed, Jia et al. found the indirect feature importances for models trained on the reaction conditions reported in the literature to be linearly correlated to those for models trained on randomly selected conditions, except for the features that describe the chemistry of the amines.425

9. APPLICATIONS OF SUPERVISED MACHINE LEARNING

As we mentioned in the introduction, ML in the field of MOFs, COFs, and related porous materials relies on the availability of tens of thousands of experimental structures2,3 and, to a large extent, on the large libraries of (hypothetical) structures that have been assembled and scrutinized with computational screenings.5,13,428−432 But even with the most efficient computational techniques, like force-field-based simulations, the total number of materials has become so large that it is prohibitive to screen all possible materials for any given application. In addition, brute-force screening is not the best way to uncover structure−property relationships. More importantly, other phenomena, especially electronic properties or fuzzy concepts such as synthesis or reactivity, are so complex that there is no good theory to describe the phenomenon (reaction outcomes) or the theory is too expensive for a large-scale screening (electronic phenomena). For these reasons, researchers started to employ (supervised) ML for porous materials.

In Table 2 we give an overview of the techniques which we discussed in the first part and some examples where they have been used in the field of porous materials; we will discuss those examples in more detail in the following. It is striking that many of the techniques that we discussed in the first part have not yet found an application for porous materials. We discuss those possibilities in more detail in the following sections and in the outlook.

9.1. Gas Storage and Separation

Gas storage is one of the simplest screening studies. Most screening studies focus on designing a material with the highest deliverable capacity, which is defined as the difference between the amount of gas a material can adsorb at the high, charging, pressure and the amount of gas that stays in the material at the lowest operational pressure.442 Hence, these screening studies typically require two data points on the adsorption isotherm. Most of the studies for gas storage have focused on methane429,442−446 and hydrogen.445,447,448

Gas separations are another important application of porous materials.449,450 Given the importance of reducing CO2 emissions,451,452 a lot of research has focused on finding materials for carbon capture, both experimentally453−456 and by means of computational screening studies.15,457,458 Gas separations require the (mixture) adsorption isotherms of the gases one would like to separate. In most screening studies, the mixture isotherms are predicted from the pure component isotherms using ideal adsorbed solution theory. For gas separations, the objective function is less obvious. Of course, one can argue that for a good separation the selectivity and working capacity are important, but one often has to carry out a more detailed design of an actual separation process to find out what the key performance parameters are that one would like to screen for.

Most screening studies focus on thermodynamic properties. Yet, if the diffusion coefficients of the gases that need to be adsorbed are too low, excellent thermodynamic properties are of little use. Therefore, it is also important to screen for transport properties. However, only a few studies have been reported that study the dynamics.459−462 The conventional method to compute transport properties, such as diffusion coefficients, is molecular dynamics. However, depending on the value of the diffusion coefficients, these simulations can be time-consuming.460 Because of these limitations, free-energy-based methods have been developed to estimate the diffusion coefficients from transition state theory (cf. refs 461 and 462).

A popular starting point is methane storage, a topic which has been studied extensively.443,444 As in most of the screening studies methane is considered a united atom without net charge, and without dipole or quadrupole, the interactions with the framework atoms are described by van der Waals interactions.429 As these interactions do not vary much from one atom in the framework to another, one can expect that methane storage is dominated by the pore topology rather than the specific chemistry. Hence, most of the ML models are trained using simple geometric properties such as the density, the pore diameter, or the surface area. These characteristics are obviously directly related to physisorption, but they are sometimes multicollinear, which can lead to problems with some algorithms, as we discussed above (cf. section 4.3.3.2).

For gases such as CO2 or H2O, the specific chemistry of the material will be more significant. For these gases, the pore geometry descriptors will not be sufficient, and we will need descriptors that can capture phenomena that involve specific chemical interactions. One also has to keep in mind that conventional high-throughput screenings can have difficulties in properly describing the strong interactions of CO2 with open metal sites (OMSs).463 For example, especially for the low-pressure regime of the adsorption isotherm of CO2, the method used to efficiently (i.e., avoiding DFT calculations for each structure) assign partial charges to the framework atoms can lead to systematic errors in the results.

One also needs to realize that descriptors that are only based on geometric properties have limited use for materials' design. Even if we find a model that relates pore properties with the gas uptake and then use optimization tools (like particle swarm optimization, genetic algorithms, or random searches435) to maximize the uptake with respect to the pore properties, there still remains a burden of proof, as a given combination of pore properties might optimize gas adsorption in our model but might not be feasible or synthesizable (cf. section 3.1).

9.1.1. Starting on Small Data Sets. As in other fields of chemistry, ML for porous materials developed from quantitative structure−property relationship (QSPR) models on small data sets (tens of data points) to the use of more complex models, such as neural networks, on large data sets with hundreds of thousands of data points. Generally, one needs to keep in mind that all boundaries or trends that are observed in QSPR studies can either be due to underlying physics or due to limitations of the data set, which necessarily leaves some areas of the enormous design space of MOFs unexplored.464

As in computer-aided drug design (CADD), the first studies also used high-level descriptors. Kim reported one of the first QSPR models for gas storage in MOFs.465 Inspired by previous works in CADD, they calculated descriptors such as the polar surface area and the molar refractivity but also used the iso-value of the electrostatic potential to create a model for the H2 adsorption capacity of ten MOFs.


Figure 41. Overall machine learning workflow used by Bucior et al.253 to predict the H2 storage capacity of MOFs. For each MOF, an energy grid within the MOF unit cell was computed, from which an energy histogram was obtained, which is a feature in the regression model that they used to predict the H2 uptake. Figure adopted from Bucior et al.253

Similar to that, Amrouche et al. built models based on descriptors of the linker chemistry of zeolitic imidazolate frameworks (ZIFs), such as the dipole moment, as well as descriptors of the adsorbing gas molecules, to predict the heat of adsorption for 15 ZIFs and 11 gas molecules.466 Also Duerinck et al. used descriptors such as polarizability and dipole moment, which are familiar from cheminformatics, to build a model for the adsorption of aromatics and heterocyclic molecules on a set of 22 functionalized MIL-47 structures and found that polarizability and dipole moment are the most important features.467

9.1.1.1. Pore Geometry Descriptors. Sezginel et al. used a small set of 45 MOFs and trained multivariate linear models to predict the methane uptake based on geometric properties,468 and Yildiz and Uzun used a small set of 15 structures to train a NN to predict methane uptakes in MOFs based on geometric properties.469 Wu et al. increased the number of structures in their study to 105 and built a model that can predict the CO2/N2 selectivity of MOFs based on the heat of adsorption and the porosity.470 They used this relationship to create a map of the interplay between the porosity and the heat of adsorption and their impact on the selectivity, which showed that simultaneously increasing the heat of adsorption while decreasing the porosity is a route to increase the selectivity for this separation.

9.1.2. Moving to Big Data. 9.1.2.1. Development of New Descriptors. Fernandez et al. started working with considerably larger sets of structures and also introduced more elaborate techniques like DTs or SVMs, which reflects the shift from cheminformatics with (multi)linear models on small data sets to complex nonlinear models trained on large data sets that other fields of chemistry also experienced.246 In their first work,246 they used geometric descriptors such as the density or the pore volume to predict the methane uptake but then realized220 the need to introduce more chemistry to build predictive models for carbon dioxide adsorption. They did so by introducing the atomic-property-weighted RDF (AP-RDF). For different fields of chemistry, different encodings of the RDF emerged as powerful descriptors (cf. section 4.2.1.1), and Fernandez et al.220 achieved good predictive performance for gas adsorption using this descriptor and could also show that the principal components of this descriptor discriminate well between geometrical and gas adsorption properties. Importantly, they also demonstrated that ML techniques can be used for prescreening purposes to avoid running grand-canonical Monte Carlo (GCMC) simulations for low-performing materials. For this, they trained a support vector classifier (SVC) using their AP-RDF as descriptors and found that this classifier correctly identifies 945 of the top 1,000 MOFs while only flagging 10% for further investigation with GCMC simulations. Recently, Dureckova et al. also used this descriptor to screen a database of hypothetical materials with more than 1000 topologies for CO2/N2 selectivity.471

9.1.2.2. Interaction Energy Based Descriptors. Related to the Voronoi energy introduced by Simon et al.431 is the energy histogram developed by Bucior et al.253 (see Figure 41). In this descriptor, the interaction energy between gas and framework is binned and used as input for the learning algorithm, which the group around Snurr used to learn the H2 uptake for a large library of hypothetical structures and more than 50,000 experimental structures from the CSD. Notably, the authors also investigated the limits of the domain of applicability by training a model only on hypothetical structures (from only one database as well as a random mix of two databases) and evaluating its performance on experimental structures from the CSD.
Overall, they found better performance for the "mixed" model that was trained on data from two different hypothetical databases.

Fanourgakis et al. developed a descriptor that uses ideas similar to the ones used for the interaction energy histogram of Bucior et al. Instead of using the actual probe atom, they decided to use multiple probes with different Lennard-Jones parameters and to compute the average interaction energy for each of them by randomly inserting the probes into the framework, basically computing void fractions for different probe radii.248 In doing so, Fanourgakis et al. observed an improvement in predictive performance in the low methane loading regime compared to conventional descriptors such as the void fraction, density, and surface area.
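The energy-histogram featurization described above reduces, in essence, to binning precomputed host-guest interaction energies; a minimal sketch (the energy grid is assumed to come from an external grid calculation and is replaced here by random numbers) could look as follows:

```python
# Minimal sketch of an energy-histogram feature vector (illustrative energy grid).
import numpy as np

def energy_histogram(energy_grid_kjmol: np.ndarray,
                     bin_edges_kjmol: np.ndarray) -> np.ndarray:
    """Bin the guest-framework interaction energies sampled on a grid inside the
    unit cell and normalize the counts so that structures of different cell sizes
    are comparable. Energies outside the last bin edge (strongly repulsive region)
    are simply ignored in this sketch."""
    counts, _ = np.histogram(energy_grid_kjmol, bins=bin_edges_kjmol)
    return counts / counts.sum()

# In practice the grid would come from, e.g., a Lennard-Jones grid calculation;
# here random numbers only illustrate the shapes involved.
energies = np.random.normal(loc=-5.0, scale=3.0, size=100_000)
edges = np.arange(-20.0, 1.0, 1.0)      # 1 kJ/mol wide bins, an illustrative choice
feature_vector = energy_histogram(energies, edges)
print(feature_vector.shape)
```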


Closely related is the use of the heat of adsorption as a descriptor in ML models. Similar to the interaction energy captured by the energy histograms, it is a crude estimate of the target. It was, for example, used in recent studies on adsorption-based heat pumps, where a working fluid is adsorbed by the adsorbent and the released heat is used to drive the heat pump. MOFs are an interesting alternative to the conventional adsorbents.472 The most commonly used working fluid is water, but for applications below 0 °C one would like to use an alternative fluid.473 Shi et al.474 used ML to identify that the density and the heat of adsorption are the most important features from their descriptor set (including geometric properties and the maximal working capacity) for models identifying the optimal MOF for a methanol-based adsorption-driven heat pump. Li et al.475 used a similar approach, using the Henry coefficient KH as a surrogate for the target, to build ML models that identify promising COFs and MOFs for ethanol-based adsorption.

9.1.2.3. Geometric Descriptors. As we already indicated, most of the works on ML of the adsorption of nonpolar gases in porous materials simply trained their models using geometric descriptors.418,476,477

Following the idea that MOF databases are likely to contain redundant information, Fernandez et al. performed archetypal analysis (AA) and clustering on geometrical properties to identify the "truly significant" structures.478 AA is a matrix decomposition technique that deconstructs the feature matrix, in their case built from geometric properties, into archetypes that do not need to be contained in the data and which can be linearly combined to describe all the data. They trained classifiers on the 20% of structures that are closest to the archetypes and cluster centroids and propose the rules which their DTs learned as rules of thumb for enhancing CO2 and N2 uptake.

Using only geometric descriptors, Thornton et al. developed an iterative prescreening workflow to explore the limits of hydrogen storage in the Nanoporous Materials Genome. After running GCMC simulations on a diverse set of zeolites, they trained a NN on that data and used it to predict a set of 1,000 promising candidates, for which they again ran GCMC simulations, and repeated this cycle two more times to reduce the computational time (cf. Figure 42).

Figure 42. Net deliverable energy as a function of the void fraction for the data predicted using the NN and experimental data. The solid dark line shows the Langmuir model. Figure reproduced from Thornton et al.440

9.1.2.4. Using the Building Blocks as Features. In contrast to all aforementioned studies, Borboudakis et al. chose a featurization approach that is not based on geometric properties but that encodes the presence (and absence) of building blocks. In this way, it is not possible for the model, which they trained with an automated ML tool (cf. section 10.1), to perform predictions for structures with building blocks that are not in the training set.172 This approach was recently generalized by Fanourgakis et al., who use statistics over atom types (e.g., minimum, maximum, and average of triple-bonded carbon atoms per unit cell), of the kind usually used to set up force field topologies, as descriptors to predict the methane adsorption in MOFs.255

9.1.2.5. Graph-Based Descriptors. Ohno and Mukae used a different set of descriptors, which have also been used with great success in other parts of chemistry. They decided to use molecular graphs to describe the building blocks of the structures (cf. section 4.2.2.2) and then used a kernel-based technique (Gaussian process regression, cf. section 5.2.3) to measure similarities between the structures.435 They used this kernel in a multiple kernel approach together with pore descriptors and then performed a random search to find the combination of linkers and pore properties that maximizes the prediction (methane uptake) of their model.

Recently, Korolev et al. benchmarked GCNNs (cf. section 4.2.2.2) on different materials classes and also considered the prediction of the bulk and shear moduli of pure-silica zeolites and the Xe/Kr selectivity of MOFs.433 For both applications they found worse performance than with the GBDT baselines, which led the authors to conclude that pore-centered descriptors are more suitable for porous materials than atom-centered descriptors. Still, GCNNs are a promising avenue as the same framework can be applied to many structure classes without tedious feature engineering.

9.1.2.6. Describing the Pore Shape Using Topological Data Analysis. A different approach for describing the similarity between pores has been developed by Lee et al. Using topological data analysis, they create persistent homology barcodes (see section 4.2.2.4). By means of this pore-shape analysis, the authors could find hypothetical zeolites that have similar methane uptake as the top-performing experimental structures.236,237 Lee and co-workers recently also used this descriptor to train machine learning models to predict the methane deliverable capacity of zeolites and MOFs.233 To do so, they had to derive fixed-length descriptors based on the original persistent homology barcodes, which cannot easily be used in ML applications as the number of nonzero elements of the barcodes is of varying length. They worked around this limitation by using the distances with respect to landmarks, which are a selection of the most diverse structures, as well as some statistics describing the persistent homology barcode (like the mean survival time or the latest birth time). An approach related to the distance to barcodes was chosen by Moosavi, Xu, et al., who used the distance between barcodes to define a kernel which they then used to train a KRR model for the methane deliverable capacities of porous molecular crystals.185

9.1.2.7. Predicting Full Isotherms. The works we described so far were built to predict one specific point on a gas adsorption isotherm (i.e., at one specific temperature and pressure). But in practice, one often wants multiple points on the isotherm, or even the full isotherm, for process development. In principle, one could imagine training one model per pressure point.


But we all know that this is a waste of resources, as there are laws that connect the pressure and the loading (e.g., Langmuir adsorption). This motivated researchers to investigate whether one single ML model can be used to predict the full isotherm.

Recently, Sun et al. reported a multitask deep NN (SorbNet) for the prediction of binary adsorption isotherms on zeolites.479 Their idea was to use a model architecture in which the two components have two independent branches in the neural network close to the output and share layers close to the inputs, which are the initial loading, the volume, and the temperature. They then used this model to optimize process conditions for desorptive drying, which highlights that such models can help avoid the need for iteratively running simulations for the optimization of process conditions (we discuss the connection between materials simulation and process engineering in more detail in the next section). A limitation of the reported model is that it does not use any descriptors of the sorbate or the porous framework and is therefore limited to a specific combination of sorbates and framework and needs to be retrained for new systems. A recent work by Desgranges uses the same inputs (N, V, T, or N1, N2, V, T, respectively) to predict the partition function, which in principle gives access to all thermodynamic quantities.480 But similar to the work of Sun et al., the model remains limited to the systems (gas and framework) it was trained on. An interesting avenue might be to combine this approach with the ideas from Anderson et al., who encode the sorbates by training with different alchemical species (e.g., varying the Lennard-Jones interaction strength, ϵ).481,482

Most of the works we discussed so far trained their models on data that were generated with force fields (FFs). But in some cases this is not accurate enough. A correlated method such as RPA might enable simulations to reach chemical accuracy (1 kcal/mol). Unfortunately, such methods are prohibitively expensive for use in MD simulations. For this reason, Chehaibou et al. combined several (ML) techniques to predict adsorption energies of CO2 and CH4 in zeolites.437 First, they ran MD simulations with an affordable DFT functional; then they selected a few distant snapshots on which they performed RPA calculations. They used these calculations to train a KRR model for which they used a SOAP kernel to describe the similarity between structures. Interestingly, they also use the Δ-ML approach in which they predict the difference between the RPA and DFT energies. This is based on the reasoning that the DFT result already gives the majority of the contribution to the RPA total energy, wherefore it is not necessary to learn this part (cf. section 10.3). Using thermodynamic perturbation theory, they reweighed the DFT trajectory using the RPA energies predicted with the KRR model to get ensemble averages on the RPA level.

9.1.3. Bridging the Gap between Process Engineering and Materials Science. Materials' design is nearly always a multiobjective optimization in which the goal is to find an optimal spot on the Pareto front of multiple performance metrics. One issue with performance metrics is that it is not always clear how they relate to the actual performance on a process level, e.g., in a pressure swing adsorption system. This is also reflected in the 2018 Mission Innovation report, which highlights the need to "understand the relationship between material and process integration to produce optimal capture designs for flexible operation bridging the gap between process engineering and materials science".483 ML might help to bridge this gap.484−487 Motivated by the need to integrate materials science and process engineering, Burns et al. performed molecular simulations and detailed simulations of a vacuum swing adsorption process for carbon capture on 1632 MOFs.488 When attempting to build ML models that can predict the process-level performance metrics, they realized that they can predict the ability of a material to reach the 95% CO2 purity and 90% CO2 recovery targets (95/90-PRT) but not the parasitic energy, which is the energy needed to recover the sorbent and to compress the CO2. Furthermore, using their RF models they found the N2 adsorption properties to be of the highest importance for the prediction of the 95/90-PRT.

9.1.4. Interpreting the Models. Over the years, QSPR has evolved from the visual inspection of relationships,464 over the use of more and more complex models, to the interpretation of these models, e.g., using some feature importance analysis. On the one hand, these analyses can potentially give more insights, also for new materials, but on the other hand, they introduce new error sources.

Figure 43. Relative importance of the different material descriptors for carbon capture, as obtained for GBDT models trained by Anderson et al.494 In these plots S = selectivity, WC = working capacity, N = adsorption loading, FG = functional group, VF = void fraction, HDBM = highest dipole moment, MPC = most positive charge, MNC = most negative charge, LPD = largest pore diameter, PLD = limiting pore diameter, SE = sum of epsilons, GSA = gravimetric surface area. Figure adopted from Anderson et al.494

On the one hand, these analyses can give potentially more insights, also for new materials, but on the other hand, they introduce new error sources. As we discussed in section 8, we not only have to consider the limitations of the data set for such analyses but also the limitations of the ML model, which might not be able to capture these relationships.

The use of tree-based models474−476,489−492 and the feature importance that can be extracted from them (e.g., based on how high in the tree a feature was used for a split) have evolved into the most popular techniques to interrogate ML models in the MOF community.431,477,493,494

For example, Gülsoy fitted decision trees for the CH4 capacity of MOFs using two different feature sets.247 Similar trees were also derived by Fernandez and Barnard as "rules of thumb" for CO2 and N2 uptake in MOFs.478

Anderson et al. used feature importance analysis on a library of hypothetical databases for a selection of storage and separation tasks and found that the importance of different features depends on the task. For example, they found chemistry-related metrics (such as the maximum charges) to be more important for CO2/N2 mixtures than for only the uptake of CO2 (see Figure 43).494 One advantage of ML models is that they can potentially be used for materials' design, i.e., to design a material with an optimal performance from scratch. Anderson et al. attempted to do so by using a genetic algorithm to find feature combinations that maximize the performance indicators.

9.2. Stability

But also the MOF with the best gas adsorption properties is not of much use if it is not stable. One needs to distinguish between chemical stability and mechanical stability.495

The issue of chemical stability is one of the most asked questions after a MOF presentation. Indeed, MOF-5, one of the first published MOFs, is not stable in water, and therefore there is a strong perception that all MOFs have a water issue. However, one has to realize that MOFs are, like polymers, a class of materials. Some can be boiled in acids for months without losing their crystallinity while others readily dissolve in water.496 For most practical applications it is important, however, to know whether a structure is stable in water. For this reason, there have been efforts to develop models that are able to predict the stability of porous materials based on readily available descriptors. This is a typical example of a less well-defined property, as can be seen by the different proxies that are used to mimic the notion of stability. Most of these proxies are based on the idea that for a chemically unstable MOF it is favorable to replace a linker by water. To the best of our knowledge, no ML studies have been reported that investigate the chemical stability. Yet this is a complex topic in which ML might give us some interesting insights.

Sufficient mechanical stability is also of considerable practical importance. In most practical applications MOFs need to be processed, and during this processing there will be pressure and shear forces applied on the crystal. If this causes the pores to deform, the properties of the material may change significantly. Therefore, sufficient mechanical stability is an important practical requirement. Yet, it is not a property that is often studied.497−499

Evans and Coudert took on this challenge by training a GBDT to predict the bulk and shear moduli based on geometrical properties for 121 training points calculated using DFT.327 Moghadam et al. followed up this work by training a NN on bulk moduli of more than 3000 MOFs that they obtained from FF-based simulations.500 Their model uses geometric descriptors and also information about the topology, which their EDA showed to be of utter importance. Recently, the group around Coudert extended their analysis of the mechanical properties of zeolites using FF-derived mechanical properties for all structures from Deem's database of hypothetical zeolites,501 for a subset of which they also computed the mechanical properties using DFT. Motivated by the lackluster performance of the FF to describe the mechanical properties, they trained a GBDT (using the same approach which they also used in their first work) on the mechanical properties derived with DFT. And they found that, on average, their model can predict the Poisson's ratio better than the FF.

For a related family of porous materials, organic cages, mechanical stability is an even bigger problem as they lack 3D bonding. Turcani et al. built models to predict the stability of the cages based on the precursors to focus more elaborate investigations on materials that are likely mechanically stable.502

Such a tool would certainly also benefit screenings of MOFs, but the lack of good training data makes it difficult to create such a model and also explains the scarcity of studies in this field. An important part of a solution for this problem is the adoption of standardized computing protocols, such that different databases can be combined into one training set, and the sharing of the data in a findable, accessible, interoperable, reusable (FAIR) compliant way.503

9.3. Reactivity and Chemical Properties

One of the emerging topics in MOFs is catalysis.504−507 MOFs are interesting for catalysis as the presence of OMS or the specifics of the linker can be combined with concepts of shape selectivity known from zeolite catalysis.508

For reactivity on surfaces,509 but also in zeolites,510−512 scaling relations (that often incorporate the heat of adsorption of the reactants) have been proven to be a powerful tool to predict and rationalize chemical reactivity. Rosen et al. recently introduced such relationships, for example, based on the H-affinity of open metal sites, for methane activation in MOFs.513 As Andersen et al. recently pointed out, more elaborate ML techniques such as compressed sensing (cf. section 4.3.2.3) might help us to go beyond scaling relationships and discover hidden patterns in big data. This approach is motivated by the realization that some phenomena might not be describable by a simple equation and that data-driven techniques might be able to approximate those complex relationships.514

9.4. Electronic Properties

Other emerging applications of MOFs are photocatalysis,515 luminescence,516,517 and sensing.518,519 For these properties it is important to know the electronic (band) structure. However, ML studies on the electronic properties of MOFs are scarce due to the lack of training data in open databases and the fact that this data is expensive to create using DFT due to the large unit cells of many MOFs. This motivated He et al. to attempt to use transfer learning.245 They trained four different classifiers on inorganic structures from the open quantum materials database (OQMD), in which the band gaps have been calculated for about 52,300 materials using DFT, and then retrained the model to classify nearly 3,000 materials from the computationally ready experimental (CoRE)-MOF database as either metallic or nonmetallic using their ML model.
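A minimal sketch of this pretrain-then-retrain strategy is shown below (this is not the architecture of He et al.; it assumes PyTorch and uses random tensors as stand-ins for descriptors and metallic/nonmetallic labels): a classifier is first trained on a large source data set and then only its last layer is re-optimized on a small target data set.

```python
import torch
from torch import nn

torch.manual_seed(0)
n_features = 32
# Abundant "source" data (e.g., inorganic crystals) and scarce "target" data (e.g., MOFs).
X_source, y_source = torch.randn(5000, n_features), torch.randint(0, 2, (5000, 1)).float()
X_target, y_target = torch.randn(200, n_features), torch.randint(0, 2, (200, 1)).float()

model = nn.Sequential(
    nn.Linear(n_features, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),  # logit for metallic vs. nonmetallic
)
loss_fn = nn.BCEWithLogitsLoss()

def train(model, X, y, epochs, lr):
    # Only parameters with requires_grad=True are updated.
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()

train(model, X_source, y_source, epochs=200, lr=1e-3)   # pretraining on abundant data

for p in model[:-1].parameters():                        # freeze everything but the last layer
    p.requires_grad = False
train(model, X_target, y_target, epochs=100, lr=1e-3)    # fine-tune on the scarce target data
```

The same pattern (freeze the learned representation, retune the output layer) underlies many of the transfer-learning applications discussed in section 10.3.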

A key descriptor for the chemistry of materials, which is also needed as input for electronic structure calculations, is the oxidation state of a material. Jablonka et al. retrieved the oxidation states assigned in the chemical names of MOFs in the CSD and trained an ensemble of classifiers to assign the oxidation state,438 using features that, among others, describe the geometry of local coordination environments.520 Using the ensemble they not only made the model more robust (cf. section 5.2.5) but also obtained an uncertainty measure. In this way, they could not only assign oxidation states with high predictive performance but also find some errors in the underlying training data.

9.5. ML for Molecular Simulations

In other parts of chemical science, HDNNPs received a lot of attention as they promise to create potentials of ab initio quality that can be used to run simulations at the cost of a FF-based simulation, with the additional advantage of the ability to describe reactions (with bond breaking and formation). Also, popular molecular simulation codes such as the large-scale atomic/molecular massively parallel simulator (LAMMPS) have been extended to perform simulations with such potentials. However, such models are usually trained on DFT reference data, which can make it a demanding task to create a training set given the large unit cells of MOFs.

Eckhoff and Behler attempted to avoid this problem by constructing a potential based on more than 4,500 small molecular fragments (the base fragments are shown in Figure 44) that were constructed by cutting out fragments from the crystal structure of MOF-5. The HDNNP which they trained in this way was able to correctly describe the negative thermal expansion and the phonon density of states.286

Figure 44. Molecular basis fragments used by Eckhoff and Behler as starting points for the generation of reference structures for the training of the HDNNP for MOF-5. Based on the five fragments, more than 4,500 other fragments were generated by scaling of the coordinates and small random displacements. All atoms that have complete bulk-like environments within a cutoff radius of 12 Å are shown as balls; capping hydrogen atoms, to saturate broken bonds, are shown in orange. Figure reprinted from Eckhoff and Behler.286

Besides a potential that describes the interatomic interactions, the assignment of partial charges is needed to calculate the Coulomb contribution to the energy in molecular simulations. The most reliable methods to assign those charges rely on DFT-derived electrostatic potentials and in this way can easily become the bottleneck for molecular simulations. As an alternative, Xu and Zhong proposed to use connectivity-based atom types, for which it is assumed that atoms with the same connectivity have the same charge.521 Korolev and co-workers attempted to solve the main limitation of the connectivity-based atom types, namely that all relevant atom types need to be included in the training set, using a ML approach.416 To do so, they trained a GBDT on 440,000 partial charge assignments using local descriptors such as the electronegativity of the atom or local order parameters, which are based on a Voronoi tessellation of the neighborhood of a given site.

9.6. Synthesis

Synthesis is at the heart of chemistry. Still, it is unfeasible to use computational approaches to predict reactivity or to suggest ideal reaction conditions, also because, for example, crystallization is a complex interfacial phenomenon that is influenced by structure-directing agents or modifiers.522 For this reason, chemical reactivity is one of the most promising fields for ML.

Nevertheless, there are only a few reports that try to use artificial intelligence techniques in the synthesis of MOFs. This is likely due to the same reasons as for reactivity and electronic properties, for which there are also no large open databases of properties and for which the training data is expensive to generate. Some of the early works in the field set out to optimize the synthesis of zeolites.

Corma et al. attempted to make high-throughput synthesis (e.g., using robotic systems) more efficient, i.e., improve on classical DoE techniques such as full factorial design (generating all possible combinations of experimental parameters, cf. section 3.2.1.2)523,524 by reducing the number of low-promising experiments.525 First, they attempted to use simple statistical analysis to estimate the importance of different experimental parameters and then moved to actual predictive modeling. After training a NN on synthesis descriptors to predict and optimize crystallinity,525,526 they combined a genetic algorithm (GA) with a NN to guide the next experiments suggested by the GA with the knowledge extracted by the NN (using the NN to predict the fitness).527

A related approach was introduced to the field of MOF synthesis by Moosavi et al., where the synthesis parameters were optimized using a GA. To make this more efficient, the authors introduced the importance of variables derived from a RF model, which was also trained on the failed experiments, as weights for the distance metric for the selection of a diverse set of experimental parameters. In this way, they could synthesize HKUST-1 with the highest Brunauer−Emmett−Teller (BET) surface area reported so far.21

In a similar vein, Xie et al.528 analyzed failed and partly successful experiments and used a GBDT to determine the importance of experimental variables that determine the crystallization of metal−organic nanocapsules (MONCs), which are compounds that can self-assemble and form porous crystals in some cases.529
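The two studies above share a common pattern that the following minimal sketch illustrates (purely hypothetical synthesis parameters and outcomes; this is not the actual workflow of Moosavi et al. or Xie et al.): a gradient boosted classifier is trained on successful and failed syntheses, and its feature importances are then used as weights in the distance metric of a greedy farthest-point selection of new, diverse reaction conditions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
names = ["temperature", "time", "metal_conc", "linker_conc", "pH"]
X = rng.random((300, len(names)))                         # normalized synthesis parameters
y = (X[:, 0] + 0.5 * X[:, 4] + 0.2 * rng.random(300) > 1.0).astype(int)  # 1 = crystalline product

gbdt = GradientBoostingClassifier(random_state=0).fit(X, y)
weights = gbdt.feature_importances_                       # learned importance of each parameter
print(dict(zip(names, weights.round(2))))

# Greedy farthest-point selection of 10 new experiments, using an
# importance-weighted Euclidean distance between candidate conditions.
candidates = rng.random((1000, len(names)))
selected = [0]
for _ in range(9):
    diff = candidates[:, None, :] - candidates[selected][None, :, :]
    dist = np.sqrt((weights * diff**2).sum(axis=-1))      # weighted distances to the selected set
    selected.append(int(dist.min(axis=1).argmax()))
print("indices of suggested experiments:", selected)
```

Training on the failed experiments as well is what makes the importance weights informative; with only successful syntheses in the data set, the classifier has nothing to contrast against.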


Figure 45. First four layers of a decision tree for zeolite synthesis that Muraoka et al. extracted from a GBDT model fitted on literature data. The percentages give the fractions of the dominant phases for the complete tree with 12 layers. According to this tree, the most important factor for predicting the synthesis result is the Na/(Si + Al) ratio. Figure reprinted from Muraoka et al.530
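A tree like the one in Figure 45 can be extracted and inspected with a few lines of code. The sketch below (toy data; the composition ratios and phase labels are only illustrative and are not the features of Muraoka et al., whose study is discussed in the text that follows) fits a depth-limited decision tree and prints its splits as text:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
feature_names = ["Na/(Si+Al)", "Si/Al", "H2O/Si", "OSDA/Si"]
X = rng.random((400, len(feature_names)))
# Toy labels: the dominant phase is assumed to depend mostly on the first two ratios.
y = np.where(X[:, 0] > 0.6, "phase A", np.where(X[:, 1] > 0.5, "phase B", "amorphous"))

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
print(export_text(tree, feature_names=feature_names))
```

In practice one would typically fit a more accurate model such as a GBDT and then inspect or distill a shallow surrogate tree from it, which is essentially what is shown in Figure 45.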

Given the large body of experimental procedures for the synthesis of porous materials, many works attempted to mine or extract this collective knowledge to create structured data sets that can be used to train ML models for reaction condition prediction. A recent study of Muraoka et al. was enabled by a literature review on the synthesis of zeolites. Using this data, they trained ML models to predict the phase based on parameters describing the synthetic conditions, producing decision trees, as shown in Figure 45, that reflect chemically reasonable trends extracted from the literature data. For example, the authors compare the early split based on the Si/Al ratio with Löwenstein's rule that forbids Al−O−Al bonds. By optimizing the structural fingerprint by reweighing the similarity between zeolites to be similar in the synthesis and structure space, they could build a similarity network in which they could uncover an overlooked similarity between zeolites that also manifested itself in the synthesis conditions.530

Jensen et al. developed algorithms to retrieve the synthesis conditions from 70,000 zeolite papers and used this to build a model that can predict the framework density of germanium zeolites based on the synthetic conditions.531 Also, Schwalbe-Koda mined the literature about polymorphic transformations between zeolites to enable their work in which they showed that graph isomorphism can be used as a metric for these transformations.532 For MOFs, Park et al.,158 as well as Tayfuroglu et al.,533 parsed the literature to retrieve surface areas and pore volumes for a large collection of MOFs. But so far, the data generated from these studies have not yet been used to build predictive models for MOF properties and synthesis.

Another approach was taken by Deem and co-workers, who addressed the design of organic structure directing agents (OSDAs).534 Zeolites are all isomorphic structures, and OSDAs are used during the synthesis to favor the formation of the desired isomorph. Finding the right OSDA to synthesize a particular zeolite is seen as one of the bottlenecks. To support this effort, Deem and co-workers developed a materials' design program to generate synthetically accessible OSDA.501 To expedite this process, Deem and co-workers developed a ML approach, in which they calculated the stabilization energy of different OSDAs inside of zeolite beta and then trained a NN using molecular descriptors derived from ideas of electron diffraction.535 In this way, they could speed up the search for novel OSDA by a factor of 350 and suggest 469 new and promising OSDA (see Figure 46).

Figure 46. Deem and co-workers used a ML approach to identify chemically synthesizable OSDAs for zeolite beta, which is one of the top-six zeolites of commercial interest. The figure shows the top-three OSDAs that Deem and co-workers discovered.534 The scores in the figure are the binding energy in kJ/(mol Si). Figure adapted from ref 534.
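The following minimal sketch captures the surrogate-model idea behind this OSDA search (synthetic descriptors and energies; these are not the descriptors or network of Deem and co-workers): a neural network is trained on already computed stabilization energies and then used to rank a much larger pool of candidates before any expensive calculation is run.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X_known = rng.random((500, 16))                                # descriptors of already-scored OSDAs
y_known = -20 + 10 * X_known[:, 0] + rng.normal(0, 1, 500)     # toy stabilization energy, kJ/(mol Si)

surrogate = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0),
).fit(X_known, y_known)

X_pool = rng.random((100_000, 16))            # cheap descriptors for a large candidate library
predicted = surrogate.predict(X_pool)
shortlist = np.argsort(predicted)[:10]        # most negative = most stabilizing
print("candidates to pass on to expensive simulations:", shortlist)
```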


Figure 47. Distribution of Henry coefficient, void fraction and heat of adsorption for generated structures with a user-defined target range of 18−22 kJ mol−1 for the heat of adsorption. Reproduced from ref 434.
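Distribution shifts of the kind shown in Figure 47 are straightforward to quantify. The sketch below (synthetic numbers, not the data behind the figure) compares a property distribution of generated structures with that of a reference set using a simple histogram overlap and the Wasserstein distance; the same overlap idea reappears as a synthesizability proxy in section 9.6.1 below.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(4)
heat_ref = rng.normal(15, 4, 5000)        # heat of adsorption of reference structures (kJ/mol)
heat_gen = rng.normal(20, 1.5, 5000)      # generated structures targeted at 18-22 kJ/mol

bins = np.linspace(0, 35, 70)
p, _ = np.histogram(heat_ref, bins=bins, density=True)
q, _ = np.histogram(heat_gen, bins=bins, density=True)
overlap = np.minimum(p, q).sum() * np.diff(bins)[0]   # 1 = identical distributions, 0 = disjoint

print(f"histogram overlap: {overlap:.2f}")
print(f"Wasserstein distance: {wasserstein_distance(heat_ref, heat_gen):.2f} kJ/mol")
```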

Daeyaert and Deem536 further extended this work to find an OSDA for some of the hypothetical zeolites that were found to perform optimally in a screening study for the separation of CO2 and CH4.461

Even if one manages to create some material, it is not always trivial what the material is. To address this, Wang et al. built models, including CNNs similar to the one we described in section 5.1.1.4, to identify the material based on its experimental XRPD pattern. To do so, they predicted diffraction patterns for structures deposited in the CSD and used data augmentation techniques (cf. section 3.4) such as the addition of noise and then tested their model using experimental diffraction patterns.135

9.6.1. Synthesizability. One question that always arises in the context of hypothetical materials is the question of synthesizability. In the context of zeolites, this question received a lot of attention. Early works proposed that low framework energies are the distinctive criterion,537−539 akin to the recent attempt of Anderson and Gómez-Gualdrón to assess the synthetic feasibility of MOFs.540 But this quickly got overturned with the discovery of high-energy zeolites and replaced by a "flexibility window",541 which was eventually also found to not be reliable and replaced by criteria that focus on local interatomic distances.542 A library of such criteria was used in a screening study of Perez et al. to reduce the pool of candidate materials from over 300,000 to below 100. As a conclusion of their study, they suggest using the overlap between the distribution of descriptors of experimental materials and those generated in silico as a metric to evaluate how feasible the materials produced by an algorithm are.543

Such an approach, which is related to approaches suggested for benchmarking of generative techniques for small molecules,544,545 might also be useful for evaluation of the generative models that we discuss in the following.

9.7. Generative Models

The ultimate goal of materials design is to build a model that, given desired (application) properties, can produce candidate structures using generative techniques such as GANs. Though this flavor of ML is formally not supervised learning, on which we focused in this review, we give a short overview of recent progress in this promising application of ML to porous materials. One model architecture that is often used in this context is the GAN, where a first NN acts as generator and tries to "deceive" a discriminator NN that tries to distinguish real data (structures) from the "fake" ones that the generator generated. For molecules, this approach received wide attention,24,64 but the works on nanoporous solids proved to be more difficult due to the periodicity and the nonunique representation of the unit cell. Kim and co-workers started building GANs that can generate energy grids of zeolites546 and recently extended their model to predict the structure of all-silica zeolites.434 To do so, they used a separate channel (as is used for the RGB channels in color images) for the oxygen and silicon atom positions, which they encoded by placing Gaussians at the atom positions. By adjusting the loss function to target structures with a specific heat of adsorption, they could observe a drastic shift in the shape of the distribution of this property but not in the one for the void fraction or the Henry coefficient (Figure 47).

10. OUTLOOK AND CONCLUDING REMARKS

One of the aims of this review is to provide a comprehensive overview of the state of the art of ML in the field of materials science. In our review, we not only discuss the technical details, but we also try to point out the potential caveats that are more specific for materials science. As part of the outlook, we discuss some techniques that are, as of yet, little, if at all, used for porous materials. Yet, these methods can address some of the issues that we have discussed in the previous sections.

10.1. Automatizing the Machine Learning Workflow

Given that the complete process from structure to prediction, which we discussed in this review, is quite laborious, there is a significant barrier for scientists with little computational background to enter the field. To lower the entrance barrier, a lot of effort is spent to automatize the ML process.547 In the ML community, tools like H2O's AutoML,548 TPOT,549 or Google's AutoML are widely known and receive mounting attention.550 In the materials science community especially the chemml551,552 and the automatminer packages553 are worth mentioning. The latter uses matminer to calculate descriptors that are relevant for materials science and performs the feature selection (using TPOT) as well as training and cross-validation.553 Such tools will lower the barrier for domain experts even more and also help practitioners of ML to expedite tedious tasks.

10.2. Reproducibility in Machine Learning

Reproducibility, and being able to build on top of previous results, is one of the hallmarks of science. And it is also one of the main technical debts of ML systems, where technical debt describes costs due to (code) rework that are caused by choosing an easy solution now instead of a proper one that might take longer to be developed.554
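As an illustration of the automated pipeline search mentioned in section 10.1, the sketch below uses TPOT (one of the tools named above; assumed to be installed, with random placeholder data standing in for, e.g., matminer-style descriptors). TPOT evolves a preprocessing-plus-model pipeline and can export it as plain Python, which also helps with the reproducibility issues discussed in this section.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tpot import TPOTRegressor

rng = np.random.default_rng(5)
X = rng.random((1000, 20))                          # placeholder descriptor matrix
y = 3 * X[:, 0] + rng.normal(0, 0.1, 1000)          # placeholder target (e.g., gas uptake)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

automl = TPOTRegressor(generations=5, population_size=20, cv=5,
                       random_state=0, verbosity=2)
automl.fit(X_train, y_train)                        # evolutionary search over pipelines
print("held-out score:", automl.score(X_test, y_test))
automl.export("best_pipeline.py")                   # the winning pipeline as editable code
```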


Figure 48. Screenshot of the Materials Cloud (https://archive.materialscloud.org/2019.0034/v2) pages for the screening of COFs for carbon capture. In the "Discover" section (a) there is an interactive plot and table that links to the original references and structures as well as to plots of the relaxation and the process optimization, which are all linked to the "Explore" section (b), where one can find the source code for the workflows and interactive provenance graphs.

If one cannot even replicate published experiments, one can ask if we are making any progress as a community. This question was posed by a recent study that found that they could only reproduce 7 of 18 recommender algorithms. Moreover, six of the recommender algorithms which were reproducible could be outperformed by simple heuristics.555

It is also the authors' personal experience that reproducing computational data from the literature can be a painful process. Even if the literature is an article from the same group, reproducing the results from only a few years earlier can be a difficult search for the information that was not reported in the original article. Often, the reason for being unable to reproduce the data is that many programs use default settings. These default settings can be hidden in the input files (or in the code itself), and since they are never changed during the reported studies, these settings get overlooked and do not get reported. However, if in a new release or for any other reason the defaults get changed, the results become nearly impossible to reproduce. Of course, if we had realized the importance of these unknown unknowns, we, and any other author, would have reported the values in the original article. The only way to avoid these issues is to rigorously report all input and output files as well as workflows for all computations.556 In ML the same holds: for example, different implementations of performance measures (e.g., in off-the-shelf ML libraries) can lead to different, biased, estimates that hinder comparability and reproducibility.557

In computational materials science there are ongoing efforts, such as the AiiDA infrastructure558 or the Fireworks workflow management system,559 to make computational workflows more reproducible and to lower the barrier of applying the FAIR principles of data sharing.560 For example, Ongari et al.561 developed a workflow to optimize and screen

experimental COF structures for their potential for carbon capture.561

Figure 48 shows a snapshot from the Materials Cloud Web site where, by clicking on a data point, one obtains not only all data that have been computed for this particular material but also the complete provenance. This provenance includes an optimization step of the experimental structure, the computation of the charges on the framework, the GCMC simulations to compute the isotherms and heats of adsorption, and finally the program that computes the objective function used to rank the materials for carbon capture. The idea here is that anybody in the world can reproduce the data by simply downloading the AiiDA scripts and running the programs on a local computer or, by adding more structures, extending the work to other materials, or reproducing the complete study using a different force field, by simply replacing the force field input file. Given that the data contain rich metadata, and all parameters of the calculations, it is easy to identify with which other databases it could be combined to create a training set for a ML algorithm.

But these workflow management tools, and even version control systems such as git, are not easily applicable to ML problems, where one usually wants to share and curate data separately from the code, but still retain the link between data, hyperparameters, code, and metrics. Tools like comet,562 Neptune,563 provenance,564 Renku,565 mlflow,566 ModelDB,567 and dvc568 try to make ML more reproducible by providing parts of this solution, such as data version control or automatic tracking of hyperparameters and metrics together with data hashes.

We consider both reproducibility and sharing of data as essential for the progress in this field. Therefore, to promote the adoption of good practices, we encourage using tools such as the data-science cookiecutter,569 which automatically sets up a ML development environment that promotes good development practices.

Journals in the chemical domain might also encourage good practices by providing "reproducibility checklists", similar to the major ML conferences like NeurIPS.570

Publishing the full provenance of the model development process, as can be done for example with tools such as comet, can to some extent also remedy the problem that negative results (e.g., plausible architectures that do not work) are usually not reported.

10.2.1. Comparability and Reporting Standards. One factor that makes it difficult to build on top of previous work is the lack of standardization. In the MOF community, many researchers use hypothetical databases to build their models. But unfortunately, they typically use different databases, or different train/test splits of the same database. This makes it difficult to compare different works, as the chemistry in some databases might be less diverse and easier to learn than, for example, in the CoRE-MOF database, which contains experimental structures. Also, in comparing the protocols with which the various labels (y) for different databases are created, one often finds worrying differences, e.g., in the details of the truncation of the interaction potential571 or the choice of the method for assigning partial charges. This can make it necessary to recompute some of the data, as the discrepancy between the two approaches will dictate the Bayes basis error.215,279 Unfortunately, there are no widely accepted benchmark sets in the porous materials community, even though the ML efforts on (small) molecules greatly benefited from such benchmark sets (see, e.g., http://quantum-machine.org/datasets/ or MoleculeNet366), which allow for a fair comparison between studies.427 We are currently working on assembling such sets for ML studies on porous materials.

In addition to the lack of benchmark sets, there is also a lack of common reporting standards. Not all works provide full access to data, features, code, trained models, and choice of hyperparameters, even though this would be needed to ensure replicability. The crystals.ai project is an effort to create a repository for such data.572 Again, reproducibility checklists, like the one for NeurIPS, might be beneficial for our community to ensure that researchers stick to some common reporting standard.

10.3. Transfer Learning and Multifidelity Optimization

A problem of ML for materials science, and in particular MOFs with their large unit cells, is that the data sets of the ground truth (the experimental results) are scarce and only available for a few materials. Often, experimental data are replaced by estimates from computations, and these computational data necessarily introduce errors due to approximations in the theories.142 Similarly, it is much easier to create large data sets using DFT than using expensive, but more accurate, wave function methods. But even DFT can still be prohibitively expensive for large libraries of materials with large unit cells. This is why multifidelity optimization (which combines low- and high-fidelity data, such as semiempirical and DFT-level data) and transfer learning are promising avenues for materials science.

Transfer learning has found widespread use in the "mainstream" ML community, e.g., for image recognition, where models are trained on abundant data and then partially (re)trained on the less abundant (and more expensive) data. Hutchinson et al. used transfer learning techniques to predict experimental band gaps and activation energies using DFT labels as the main data source and showed that transfer learning generally seems to be able to improve predictive performance.142 Related to this is a recent physics-based neural network from the Pande group in which a cheap electron density, for example from Hartree−Fock (HF), is used to predict the energetics and electron density on the "gold standard" level of theory (Coupled Cluster Single−double with perturbative triple excitations, CCSD(T)).573 The authors relate the expensive electron density ρ to the cheap one using a Taylor expansion and use a CNN to learn Δρ and ΔE. Since both Taylor expansions for ΔE and Δρ share terms such as δ²E/(δρ(r) δρ(r′)), they can use the same first layers and then branch into two separate output channels for Δρ and ΔE, respectively. The NN was first trained using less expensive DFT data, and then transfer learning was used to refine the weights using the more expensive and less abundant CCSD(T) densities. This is similar to the approach which was used to bring the ANI-1 potential to CCSD(T) accuracy on many benchmark sets.287

But for transfer learning to find more widespread use in the materials science domain, it would be necessary to share the trained models, and the training as well as evaluation data, in an interoperable way.

The fact that inaccurate, but inexpensive, simulation data are widely available motivated the development of the Δ-ML technique, where the objective of the ML model is to predict the difference between the result of a cheap calculation and one obtained at a higher level of theory.36
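A minimal sketch of the Δ-ML idea (synthetic energies; kernel ridge regression is an arbitrary choice of learner) is the following: the model is trained only on the difference between the two levels of theory, and predictions are obtained by adding the learned correction back onto the cheap result.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(7)
X = rng.random((300, 8))                                    # placeholder structural descriptors
E_cheap = 10 * X[:, 0] + rng.normal(0, 0.2, 300)            # e.g., force-field or DFT energies
E_expensive = E_cheap + 0.5 * np.sin(6 * X[:, 1])           # e.g., RPA or CCSD(T) reference energies

# Learn only the correction Delta = E_expensive - E_cheap on a small training set.
train = slice(0, 200)
delta_model = KernelRidge(kernel="rbf", alpha=1e-3, gamma=1.0)
delta_model.fit(X[train], E_expensive[train] - E_cheap[train])

# Predict the expensive result as cheap result + learned correction.
test = slice(200, None)
E_pred = E_cheap[test] + delta_model.predict(X[test])
print(f"MAE of the Delta-ML corrected energies: {np.abs(E_pred - E_expensive[test]).mean():.3f}")
```

Because the correction is usually smoother than the absolute property, far fewer expensive reference points are needed than for a direct model of the high-level result.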

This approach was subsequently formalized and extended in multiple dimensions using the sparse grid combination technique, which combines models trained on different subspaces (e.g., combinations of basis set size and correlation level) such that only a few samples are needed on the highest, target, level of accuracy.574

A different multifidelity learning approach, known as cokriging, can combine low- and high-fidelity training data to predict properties at the highest fidelity level, without using the low-fidelity data as features or baseline. This technique was used by Pilania et al. to predict band gaps of elpasolites on the hybrid functional level of theory using a training set of properties on both the GGA and hybrid functional level.303

All these methods are promising avenues for ML for porous materials.

10.4. Multitask Prediction

In the search for new materials, we usually do not want to optimize only one property but multiple. Also, we usually not only have training data for one target but also for related targets, e.g., for Xe, Kr, and CH4 adsorption. Multitask models are built around this insight and the idea that models, particularly NNs, might learn similar high-level representations to predict related properties (e.g., one might expect the gas uptake for noble gases and CH4 to follow the same basic relationship). Hence, training a model to predict several properties at the same time might improve its generalization performance due to the implicit information captured between the different targets. In the chemical sciences, Zubatyuk et al. used multimodal training to create an information-rich representation using a message-passing NN.141 This representation could then be used to efficiently (with less training data) learn new properties. Similar benefits of multitask learning were also observed in models trying to predict properties relevant for drug discovery.140,575

10.5. Future of Big-Data Science in Porous Materials

It is tempting to conclude that MOFs and related porous materials are synthesized for ML. MOFs are among the most studied materials in chemistry, and the number of MOFs that are being synthesized is still growing. In addition, the number of possible applications of these materials is also increasing. We are already in a situation that if a group has synthesized a novel MOF it is in practice impossible to test this novel material for all possible applications. One can then clearly envision the role of ML. If we can capture the different computational screening studies using ML, we should be able to indicate the potential performance of a novel material for a range of different applications. Clearly, a lot of work needs to be done to reach this aim; with this review we intended to show that the foundations for such an approach are being built.

The other important domain where we expect significant progress is in the field of MOF synthesis. The global trend in science is to share more data, and technology makes it easier to share large amounts of data. But the common practice to only publish successful synthesis routes is throwing away lots of valuable information. For example, an essential step in MOF synthesis is finding the right conditions for the material to crystallize. At present, this is mainly trial and error. Moosavi et al.21 have shown how to learn from the failed and partially successful experiments. Interestingly, they used as example HKUST-1, which is one of the most synthesized MOFs, but they had to reproduce the failed experiments to be able to analyze the data using ML techniques. One can only dream about the potential of such studies if all synthetic MOF groups would share their failed and partially successful experiments. This would open the possibility to use ML to find correlations between linker/metal nodes and crystallization conditions and would allow us to make predictions of the optimal synthesis conditions for novel MOFs. Also here, ML methods have the potential to change the way we do chemistry, but the challenges are enormous in solving the practical issues of creating an infrastructure and changing the mind set so that all synthesis attempts are shared in such a way that the data are publicly accessible.

Hence, a key factor in the success of ML in the field of MOFs will be the extent to which the community is willing and able to share data. If all data on these hundreds of thousands of porous materials are shared, it will open up possibilities that go beyond the conventional ways of doing science. We hope that the examples of ML applied to MOFs we discussed in this review illustrate how ML can change the way we do and think about science.

AUTHOR INFORMATION

Corresponding Author

Berend Smit − Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques (ISIC), École Polytechnique Fédérale de Lausanne (EPFL), Sion, Switzerland; orcid.org/0000-0003-4653-8562; Email: berend.smit@epfl.ch

Authors

Kevin Maik Jablonka − Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques (ISIC), École Polytechnique Fédérale de Lausanne (EPFL), Sion, Switzerland; orcid.org/0000-0003-4894-4660

Daniele Ongari − Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques (ISIC), École Polytechnique Fédérale de Lausanne (EPFL), Sion, Switzerland; orcid.org/0000-0001-6197-2901

Seyed Mohamad Moosavi − Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques (ISIC), École Polytechnique Fédérale de Lausanne (EPFL), Sion, Switzerland; orcid.org/0000-0002-0357-5729

Complete contact information is available at:
https://pubs.acs.org/10.1021/acs.chemrev.0c00004

Notes

The authors declare no competing financial interest.

Biographies

Kevin Maik Jablonka received his undergraduate degree in chemistry from the Technical University of Munich and then joined EPFL for his graduate studies, supported by the Alfred Werner fund, during which he also obtained a degree in data science. Currently, he is a Ph.D. student in Berend Smit's group, investigating the use of data-driven methods for the design and discovery of new materials for energy-related applications.

Daniele Ongari received his diploma in chemical engineering from Politecnico di Milano. In 2019, he completed his Ph.D. under the supervision of Prof. Berend Smit. His research focuses on the investigation of microporous materials, in particular MOFs and COFs, using computational methods to assess their performances for molecular adsorption and catalysis.


Seyed Mohamad Moosavi was born in Boroujerd, Iran. He received his undergraduate degree in mechanical engineering from Sharif University of Technology in Tehran, Iran. He has recently defended his Ph.D. in chemistry and chemical engineering at EPFL under the supervision of Prof. Berend Smit. He visited Prof. Kulik's group at MIT on an SNSF fellowship award. His research interests focus on computational and data-driven design and engineering of novel materials for energy-related applications.

Berend Smit received a M.Sc. in Chemical Engineering in 1987 and a M.Sc. in Physics, both from the Technical University in Delft (The Netherlands). In 1990, he received a cum laude Ph.D. in Chemistry from Utrecht University (The Netherlands). He was a (senior) Research Physicist at Shell Research before he joined the University of Amsterdam (The Netherlands) as Professor of Computational Chemistry. In 2004, he was elected Director of the European Center of Atomic and Molecular Computations (CECAM), Lyon, France. Since 2007 he has been Professor of Chemical Engineering and Chemistry at U.C. Berkeley and Faculty Chemist at the Materials Sciences Division, Lawrence Berkeley National Laboratory. Since July 2014 he has been full professor at EPFL. Berend Smit's research focuses on the application and development of novel molecular simulation techniques, with emphasis on energy-related applications. Together with Daan Frenkel he wrote the textbook Understanding Molecular Simulations, and together with Jeff Reimer, Curt Oldenburg, and Ian Bourg the textbook Introduction to Carbon Capture and Sequestration.

ACKNOWLEDGMENTS

The research in this article was supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement 666983, MaGic) and by the NCCR-MARVEL, funded by the Swiss National Science Foundation.

ABBREVIATIONS

kNN k-nearest neighbor
AA archetypal analysis
ADASYN adaptive synthetic oversampling
AP atomic property
AUC area under the curve
BAML bond-angles machine learning
BET Brunauer−Emmett−Teller
BoB bag of bonds
CADD computer-aided drug design
CART classification and regression tree
CAS Chemical Abstract Services
CCSD(T) Coupled Cluster Single−double with perturbative triple excitations
CNN convolutional neural network
COF covalent organic framework
CoRE computationally ready experimental
CSD Cambridge Structural Database
DFT density-functional theory
DL deep learning
DNN deep neural network
DoE design of experiment
DOS density of states
DT decision tree
DTNN deep tensor neural network
EDA exploratory data analysis
FAIR findable, accessible, interoperable, reusable
FF force field
FPS farthest point sampling
GA genetic algorithm
GAM generalized additive model
GAN generative adversarial network
GAP Gaussian approximation potential
GBDT gradient boosted decision trees
GCMC grand-canonical Monte Carlo
GCNN graph-convolutional NN
GGA generalized gradient approximation
GP Gaussian process
GPR Gaussian process regression
HDNNP high-dimensional neural network potential
HF Hartree−Fock
HIP hierarchically interacting particle
i.i.d. independently and identically distributed
KMM kernel-mean matching
KRR kernel ridge regression
LAMMPS large-scale atomic/molecular massively parallel simulator
LASSO least absolute shrinkage and selection operator
LHS Latin hypercube sampling
lococv leave-one-cluster-out cross-validation
LOOB leave-one-out bootstrap
LOOCV leave-one-out cross-validation
MAE mean absolute error
MBTR many-body tensor representation
MC Monte Carlo
MD molecular dynamics
MDP maximum diversity problem
ML machine learning
MLP multilayer perceptron
MOF metal−organic framework
MONC metal−organic nanocapsules
MSE mean squared error
NLP natural language processing
NN neural network
NP nondeterministic polynomial-time
OBD optimal brain damage
OMS open metal site
OQMD open quantum materials database
OSDA organic structure directing agent
PCA principal component analysis
PES potential energy surface
PLMF property-labeled materials fragments
PPN porous polymer network
PSD pore size distribution
QSAR quantitative structure−activity relationship
QSPR quantitative structure−property relationship
RAC revised autocorrelation
RDF radial distribution function
REACH registration, evaluation, and authorization of chemicals
ReLU rectified linear unit
RF random forest
RFA recursive feature addition
RFE recursive feature elimination
RMSE root MSE
RNN recurrent neural network
ROC receiver-operating characteristic
RPA random phase approximation
SGD stochastic gradient descent
SHAP SHapley Additive exPlanations
SIS sure independence screening


SISSO sure independence screening and sparsifying Accelerate Materials Discovery. Adv. Funct. Mater. 2015, 25, 6495− operator 6502. SMBO sequential model-based optimization (17) Curtarolo, S.; Morgan, D.; Persson, K.; Rodgers, J.; Ceder, G. SMILES simplified molecular input line entry system Predicting Crystal Structures with of Quantum SMOTE synthetic minority oversampling technique Calculations. Phys. Rev. Lett. 2003, 91, 135503. (18) Collins, S. P.; Daff, T. D.; Piotrkowski, S. S.; Woo, T. K. SOAP smooth overlap of atomic positions fi Materials Design by Evolutionary Optimization of Functional Groups SVC support vector classi er in Metal-Organic Frameworks. Sci. Adv. 2016, 2, No. e1600954. SVM support vector machine (19) Duan, C.; Janet, J. P.; Liu, F.; Nandy, A.; Kulik, H. J. Learning t-SNE t-distributed stochastic neighbor embedding from Failure: Predicting Electronic Structure Calculation Outcomes TDA topological data analysis with Machine Learning Models. J. Chem. Theory Comput. 2019, 15, TPE tree-Parzen estimator 2331−2345. VAE variational autoencoders (20) Heinen, S.; Schwilk, M.; von Rudorff, G. F.; von Lilienfeld, O. VIF variance inflation factor A. Machine Learning the Computational Cost of Quantum XRD X-ray diffraction Chemistry. Mach. Learn.: Sci. Technol. 2020, 1, 025002. XRPD X-ray powder diffraction (21) Moosavi, S. M.; Chidambaram, A.; Talirz, L.; Haranczyk, M.; ZIF zeolitic imidazolate framework Stylianou, K. C.; Smit, B. Capturing Chemical Intuition in Synthesis of Metal-Organic Frameworks. Nat. Commun. 2019, 10, 539. (22) Aspuru-Guzik, A.; Lindh, R.; Reiher, M. The Matter Simulation (R)Evolution. ACS Cent. Sci. 2018, 4, 144−152. REFERENCES (23) De Luna, P.; Wei, J.; Bengio, Y.; Aspuru-Guzik, A.; Sargent, E. ’ Use Machine Learning to Find Energy Materials. Nature 2017, 552, (1) Furukawa, H.; Cordova, K. E.; O Keeffe, M.; Yaghi, O. M. The − Chemistry and Applications of Metal-Organic Frameworks. Science 23 27. 2013, 341, 1230444. (24) Sanchez-Lengeling, B.; Aspuru-Guzik, A. Inverse Molecular Design Using Machine Learning: Generative Models for Matter (2) Chung, Y. G.; Camp, J.; Haranczyk, M.; Sikora, B. J.; Bury, W.; − Krungleviciute, V.; Yildirim, T.; Farha, O. K.; Sholl, D. S.; Snurr, R. Q. Engineering. Science 2018, 361, 360 365. − (25) Pauling, L. The Principles Determining the Structure of Computation-Ready, Experimental Metal Organic Frameworks: A − Tool To Enable High-Throughput Screening of Nanoporous Crystals. Complex Ionic Crystals. J. Am. Chem. Soc. 1929, 51, 1010 1026. Chem. Mater. 2014, 26, 6185−6192. (26) Pettifor, D. G. Bonding and Structure of Molecules and Solids; (3) Chung, Y. G.; et al. Advances, Updates, and for the Clarendon Press; Oxford University Press: Oxford: New York, 1995. Computation-Ready, Experimental Metal−Organic Framework Data- (27) Tukey, J. W. Exploratory Data Analysis; Addison-Wesley Series base: CoRE MOF 2019. J. Chem. Eng. Data 2019, 64, 5985−5998. in Behavioral Science; Addison-Wesley Pub. Co: Reading, MA, 1977. (4) Moghadam, P. Z.; Li, A.; Wiggin, S. B.; Tao, A.; Maloney, A. G. (28) Lake, B. M.; Ullman, T. D.; Tenenbaum, J. B.; Gershman, S. J. P.; Wood, P. A.; Ward, S. C.; Fairen-Jimenez, D. Development of a Building machines that learn and think like people. Behav. Brain Sci. Cambridge Structural Database Subset: A Collection of Metal− 2017, 40, DOI: 10.1017/S0140525X16001837 Organic Frameworks for Past, Present, and Future. Chem. Mater. (29) Rudin, C. Stop Explaining Black Box Machine Learning Models 2017, 29, 2618−2625. 
for High Stakes Decisions and Use Interpretable Models Instead. Nat. − (5) Boyd, P. G.; Lee, Y.; Smit, B. Computational Development of Mach. Intell. 2019, 1, 206 215. the Nanoporous Materials Genome. Nat. Rev. Mater. 2017, 2, 17037. (30) Schmidt, J.; Marques, M. R. G.; Botti, S.; Marques, M. A. L. (6) Halevy, A.; Norvig, P.; Pereira, F. The Unreasonable Recent Advances and Applications of Machine Learning in Solid-State Effectiveness of Data. IEEE Intell. Syst. 2009, 24,8−12. Materials Science. npj Comput. Mater. 2019, 5, 83. (7) Mehta, P.; Bukov, M.; Wang, C.-H.; Day, A. G. R.; Richardson, (31) Tibshirani, T.; Friedman, J.; Tibshirani, R. The Elements of C.; Fisher, C. K.; Schwab, D. J. A High-Bias, Low-Variance Statistical Learning - Data Mining, Inference, and Prediction, 2nd ed.; Introduction to Machine Learning for Physicists. Phys. Rep. 2019, Springer Series in Statistics; Springer, 2017. 810,1−124. (32) Shalev-Shwartz, S.; Ben-David, S. Understanding Machine (8) Butler, K. T.; Davies, D. W.; Cartwright, H.; Isayev, O.; Walsh, Learning: From Theory to Algorithms; Cambridge University Press: A. Machine Learning for Molecular and Materials Science. Nature Cambridge, 2014. 2018, 559, 547−555. (33) Bishop, C. M. Pattern Recognition and Machine Learning; (9) Samuel, A. L. Some Studies in Machine Learning Using the Information Science and Statistics; Springer: New York, 2006. ́ Game of Checkers. IBM J. Res. Dev. 2000, 44, 206−226. (34) Geron, A. Hands-on Machine Learning with Scikit-Learn, Keras, (10) Hutson, M. Bringing Machine Learning to the Masses. Science and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent 2019, 365, 416−417. Systems;O’Reilly Media, Inc: Sebastopol, CA, 2019. (11) Gray, J.; Szalay, A. eScience-A Transformed Scientific Method, (35) Raschka, S.; Patterson, J.; Nolet, C. Machine Learning in Presentation to the Computer Science and Technology Board of the Python: Main Developments and Technology Trends in Data National Research Council; 2007; https://www.slideshare.net/ Science, Machine Learning, and Artificial Intelligence. Information dullhunk/escience-a-transformed-scientific-method (accessed 2019- 2020, 11, 193. 11-11). (36) Ramakrishnan, R.; Dral, P. O.; Rupp, M.; von Lilienfeld, O. A. (12) Hey, A. J. G., Ed. The Fourth Paradigm: Data-Intensive Scientific Big Data Meets Quantum Chemistry Approximations: The Δ- Discovery; Microsoft Research: Redmond, WA, 2009. Machine Learning Approach. J. Chem. Theory Comput. 2015, 11, (13) Boyd, P. G.; et al. Data-driven design of metal−organic 2087−2096. − ̈ frameworks for wet flue gas CO2 capture. Nature 2019, 576, 253 (37) Hase, F.; Roch, L. M.; Aspuru-Guzik, A. Next-Generation 256. Experimentation with Self-Driving Laboratories. Trends Chem. 2019, (14) Curtarolo, S.; Hart, G. L. W.; Nardelli, M. B.; Mingo, N.; 1, 282−291. Sanvito, S.; Levy, O. The High-Throughput Highway to Computa- (38) MacLeod, B. P.; Parlane, F. G. L.; Morrissey, T. D.; Hase,̈ F.; tional Materials Design. Nat. Mater. 2013, 12, 191−201. Roch, L. M.; Dettelbach, K. E.; Moreira, R.; Yunker, L. P. E.; Rooney, (15) Lin, L.-C.; et al. Silico Screening of Carbon-Capture Materials. M. B.; Deeth, J. R.; Lai, V.; Ng, G. J.; Situ, H.; Zhang, R. H.; Elliott, Nat. Mater. 2012, 11, 633−641. M. S.; Haley, T. H.; Dvorak, D. J.; Aspuru-Guzik, A.; Hein, J. E.; (16) Pyzer-Knapp, E. O.; Li, K.; Aspuru-Guzik, A. Learning from the Berlinguette, C. P. 
Self-Driving Laboratory for Accelerated Discovery Harvard Clean Energy Project: The Use of Neural Networks to of Thin-Film Materials. Sci. Adv. 2020, 6 (20), No. eaaz8867.


(39) Hase,̈ F.; Roch, L. M.; Aspuru-Guzik, A. Chimera: Enabling (60) Sanchez-Lengeling, B.; Wei, J. N.; Lee, B. K.; Gerkin, R. C.; Hierarchy Based Multi-Objective Optimization for Self-Driving Aspuru-Guzik, A.; Wiltschko, A. B. Machine Learning for Scent: Laboratories. Chem. Sci. 2018, 9, 7642−7655. Learning Generalizable Perceptual Representations of Small Molecules; (40) Tabor, D. P.; et al. Accelerating the Discovery of Materials for 2019; https://arxiv.org/abs/1910.10685. Clean Energy in the Era of Smart Automation. Nat. Rev. Mater. 2018, (61) Gomez-Bombarelli,́ R.; Wei, J. N.; Duvenaud, D.; Hernandez-́ 3,5−20. Lobato,J.M.;Sanchez-Lengeling,́ B.; Sheberla, D.; Aguilera- (41) Gromski, P. S.; Henson, A. B.; Granda, J. M.; Cronin, L. How Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; Aspuru-Guzik, A. to Explore Chemical Space Using Algorithms and Automation. Nat. Automatic Chemical Design Using a Data-Driven Continuous Rev. Chem. 2019, 3, 119−128. Representation of Molecules. ACS Cent. Sci. 2018, 4, 268−276. (42) Gromski, P. S.; Granda, J. M.; Cronin, L. Universal Chemical (62) Noe,́ F.; Olsson, S.; Köhler, J.; Wu, H. Boltzmann Generators: Synthesis and Discovery with ‘The Chemputer’. Trends Chem. 2020, Sampling Equilibrium States of Many-Body Systems with Deep 2,4−12. Learning. Science 2019, 365, No. eaaw1147. (43) Salley, D.; Keenan, G.; Grizou, J.; Sharma, A.; Martin, S.; (63) Zhavoronkov, A.; et al. Deep Learning Enables Rapid Cronin, L. A Nanomaterials Discovery Robot for the Darwinian Identification of Potent DDR1 Kinase Inhibitors. Nat. Biotechnol. Evolution of Shape Programmable Gold Nanoparticles. Nat. Commun. 2019, 37, 1038−1040. 2020, 11 (1), 2771. (64) Elton, D. C.; Boukouvalas, Z.; Fuge, M. D.; Chung, P. W. Deep  (44) Dragone, V.; Sans, V.; Henson, A. B.; Granda, J. M.; Cronin, L. Learning for Molecular Design a Review of the State of the Art. Mol. − An Autonomous Organic Reaction for Chemical Syst. Des. Eng. 2019, 4, 828 849. Reactivity. Nat. Commun. 2017, 8, 15733. (65) Huo, H.; Rong, Z.; Kononova, O.; Sun, W.; Botari, T.; He, T.; (45) Gasparotto, P.; Meißner, R. H.; Ceriotti, M. Recognizing Local Tshitoyan, V.; Ceder, G. Semi-Supervised Machine-Learning and Global Structural Motifs at the Atomic Scale. J. Chem. Theory Classification of Materials Synthesis Procedures. npj Comput. Mater. Comput. 2018, 14, 486−498. 2019, 5, 62. (46) Das, P.; Moll, M.; Stamati, H.; Kavraki, L. E.; Clementi, C. (66) Popova, M.; Isayev, O.; Tropsha, A. Deep Reinforcement Low-Dimensional, Free-Energy Landscapes of Protein-Folding Learning for de Novo Drug Design. Sci. Adv. 2018, 4, No. eaap7885. Reactions by Nonlinear Dimensionality Reduction. Proc. Natl. Acad. (67) Sutton, R. S.; Barto, A. G. Reinforcement Learning: An Sci. U. S. A. 2006, 103, 9885−9890. Introduction, 2nd ed.; Adaptive Computation and Machine Learning (47) Xie, T.; France-Lanord, A.; Wang, Y.; Shao-Horn, Y.; Series; The MIT Press: Cambridge, MA, 2018. (68) Zhou, Z.; Li, X.; Zare, R. N. Optimizing Chemical Reactions Grossman, J. C. Graph Dynamical Networks for Unsupervised − Learning of Atomic Scale Dynamics in Materials. Nat. Commun. with Deep Reinforcement Learning. ACS Cent. Sci. 2017, 3, 1337 1344. 2019, 10, 2667. (69) Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, (48) Tribello, G. A.; Ceriotti, M.; Parrinello, M. Using Sketch-Map I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Coordinates to Analyze and Bias Molecular Dynamics Simulations. 
Learning; 2013; https://arxiv.org/abs/1312.5602. Proc. Natl. Acad. Sci. U. S. A. 2012, 109, 5196−5201. (70) Silver, D.; et al. Mastering the Game of Go with Deep Neural (49) Hashemian, B.; Millan,́ D.; Arroyo, M. Modeling and Enhanced Networks and Tree Search. Nature 2016, 529, 484−489. Sampling of Molecular Systems with Smooth and Nonlinear Data- (71) Carey, R. Interpreting AI Compute Trends; AI Impacts, 2018; Driven Collective Variables. J. Chem. Phys. 2013, 139, 214101. https://aiimpacts.org/interpreting-ai-compute-trends/ (accessed (50) Kohonen, T. Exploration of Very Large Databases by Self- 2019-11-20). Organizing Maps. Proc. Int. Conf. Neural Networks (ICNN’97). 1997; − (72) Anderson, C. End of Theory: The Data Deluge Makes the Vol. 1, pp PL1 PL6. Scientific Method Obsolete. Wired 2008; https://www.wired.com/ (51) Beckonert, O.; Monnerjahn, J.; Bonk, U.; Leibfritz, D. 2008/06/pb-theory/ (accessed 2019-08-08). Visualizing Metabolic Changes in Breast-Cancer Tissue Using 1H- (73) Ceriotti, M.; Willatt, M. J.; Csanyi,́ G. In Handbook of Materials NMR Spectroscopy and Self-Organizing Maps. NMR Biomed. 2003, Modeling; Andreoni, W., Yip, S., Eds.; Springer International 16,1−11. −  Publishing: Cham, 2018; pp 1 27. (52) Fritzke, B. Growing Cell Structures A Self-Organizing (74)Childs,C.M.;Washburn,N.R.EmbeddingDomain Network for Unsupervised and Supervised Learning. Neural Netw Knowledge for Machine Learning of Complex Material Systems. − 1994, 7, 1441 1460. MRS Commun. 2019, 9,1−15. (53) Ceriotti, M.; Tribello, G. A.; Parrinello, M. Demonstrating the (75) Maier, A. K.; Syben, C.; Stimpel, B.; Würfl, T.; Hoffmann, M.; Transferability and the Descriptive Power of Sketch-Map. J. Chem. Schebesch, F.; Fu, W.; Mill, L.; Kling, L.; Christiansen, S. Learning − Theory Comput. 2013, 9, 1521 1532. with Known Operators Reduces Maximum Error Bounds. Nat. Mach. ́ ́ (54) De, S.; Bartok, A. P.; Csanyi, G.; Ceriotti, M. Comparing Intell. 2019, 1, 373−380. Molecules and Solids across Structural and Alchemical Space. Phys. (76) Lake, B. M.; Salakhutdinov, R.; Tenenbaum, J. B. Human-Level − Chem. Chem. Phys. 2016, 18, 13754 13769. Concept Learning through Probabilistic Program Induction. Science (55) Isayev, O.; Fourches, D.; Muratov, E. N.; Oses, C.; Rasch, K.; 2015, 350, 1332−1338. Tropsha, A.; Curtarolo, S. Materials Cartography: Representing and (77) Veit, M.; Jain, S. K.; Bonakala, S.; Rudra, I.; Hohl, D.; Csanyi,́ Mining Materials Space Using Structural and Electronic Fingerprints. G. Equation of State of Fluid Methane from First Principles with Chem. Mater. 2015, 27, 735−743. Machine Learning Potentials. J. Chem. Theory Comput. 2019, 15, (56) Kunkel, C.; Schober, C.; Oberhofer, H.; Reuter, K. Knowledge 2574−2586. Discovery through Chemical Space Networks: The Case of Organic (78) Constantine, P. G.; del Rosario, Z.; Iaccarino, G. Many Physical Electronics. J. Mol. Model. 2019, 25, 87. Laws Are Ridge Functions; 2016; https://arxiv.org/abs/1605.07974. (57) Samudrala, S.; Rajan, K.; Ganapathysubramanian, B. Informatics (79) Bereau, T.; DiStasio, R. A.; Tkatchenko, A.; von Lilienfeld, O. for Materials Science and Engineering; Elsevier, 2013; pp 97−119. A. Non-Covalent Interactions across Organic and Biological Subsets (58) Ceriotti, M. Unsupervised Machine Learning in Atomistic of Chemical Space: Physics-Based Potentials Parametrized from Simulations, between Predictions and Understanding. J. Chem. Phys. Machine Learning. J. Chem. Phys. 2018, 148, 241706. 2019, 150, 150901. (80) Li, L.; Snyder, J. 
C.; Pelaschier, I. M.; Huang, J.; Niranjan, U.- (59) Tshitoyan, V.; Dagdelen, J.; Weston, L.; Dunn, A.; Rong, Z.; N.; Duncan, P.; Rupp, M.; Müller, K.-R.; Burke, K. Understanding Kononova, O.; Persson, K. A.; Ceder, G.; Jain, A. Unsupervised Word Machine-Learned Density Functionals: Understanding Machine- Embeddings Capture Latent Knowledge from Materials Science Learned Density Functionals. Int. J. Quantum Chem. 2016, 116, Literature. Nature 2019, 571, 95. 819−833.

