Acknowledgments

Materials Data Analytics: A Path-Finding Workshop Report was prepared by Nexight Group under the guidance of ASM International staff: Scott Henry, Director, Content & Knowledge-Based Solutions and Larry Berardinis, Technical Projects Manager, CMD Network. The work was sponsored by the National Institute of Standards and Technology (NIST). On behalf of ASM, we would like to express our appreciation to the participants in the Materials Data Analytics Workshop (see Appendix A) for their input and recommendations.

DISCLAIMER: This report represents the opinions of the workshop participants and not necessarily those of their home organizations or their affiliated professional societies.

Cover graphic adapted from M.C. Flemings and R.W. Cahn, Organization and Trends in MSE Education in the US and in Europe, Acta Mater., 48, 2000, pp. 371-383.

Table of Contents

I. Executive Summary
II. Background and Workshop Objectives
III. Overview of Presentations on MDA
IV. Challenges to Advancing MDA
V. Priority Applications and Opportunities to best leverage MDA Tools
VI. Near-Term Pathways for MDA Development
VII. Supporting Needs
VIII. Concluding Remarks and Next Steps
Appendix A. List of Participants
Appendix B. Workshop Agenda
Appendix C. Summary of Prior Work

I. Executive Summary

Materials data analytics (MDA), an emerging discipline that helps researchers extract knowledge and insights from materials data, will play a critical role in enabling multiple stakeholders to discover, design, develop, and deploy new materials twice as fast at half the cost, per the goals of the Materials Genome Initiative (MGI). While many researchers across disciplines are working independently to develop and use MDA to make their work more efficient, no coordinated approach yet exists that leverages existing knowledge and outlines a path to drive MDA forward.

To address this issue, ASM International convened Materials Data Analytics: A Path-Finding Workshop on October 8-9, 2015, at The Ohio State University in Columbus, Ohio. The workshop, sponsored by the National Institute of Standards and Technology (NIST), brought together a select group of more than 30 representatives from academia, industry, and government who are currently using and contributing to the development of MDA approaches (see Appendix A for a complete list of participants).

Through a series of professionally facilitated sessions, workshop participants shared their thoughts and ideas about the challenges, applications, and opportunities for the advancement of MDA. The resulting dialogue, captured in this report, provides a deeper understanding of the current state and impact potential of MDA and identifies critical pathways and actions to accelerate its development and help the MGI community more quickly achieve its goals.

Key High-Level Findings

1. While the materials community actively adapts, develops, and uses MDA algorithms for its R&D activities (see Section III and Appendix C), now is the time to pursue a deeper understanding of the current pitfalls and unexplored realms of MDA for further opportunities and growth.

2. Advancing MDA to achieve the MGI goals requires more than just developing computer algorithms to solve individual materials problems. It is also necessary to establish a collaborative computational environment with shared resources and develop a clear understanding of uncertainty in materials data and information.

3. MDA requires highly coordinated and concerted efforts across academia, industry, society, and government. There is an immediate need for collaboration with communities of different disciplines (e.g., computer science, bioinformatics) to learn from their experiences in data analytics and utilize tools and techniques proven to be effective.

Materials Data Analytics: A Path-Finding Workshop 2

Specifically, the workshop identified the following challenges, priorities, and near-term pathways to provide an actionable path toward the advancement and increased use of MDA in the development and deployment of new materials.

Detailed Summary of Workshop Findings

Top Five Challenges to Advancing MDA
• Understanding uncertainty in data and models
• Lack of data/knowledge sharing
• Complexity of multiscale optimization
• Limited decision-support resources
• Extracting knowledge from literature-based resources

Materials and Technical Applications with High Impact Potential
• High-temperature structural alloys (e.g., materials for turbine engines)
• High entropy alloys and metallic glasses
• Combinatorial materials development
• Semi-autonomous experimentation

Opportunities for Advancing MDA Tools (highest priority opportunities by category)

In-service data infrastructure (making existing data available to MDA):
• Automate mining and curation of legacy data
• Develop support resources for decisions (maintenance through data fusion)

Materials discovery and design:
• Integrated design and discovery cycle using feedback from materials in use and their application to development

Engineering design and manufacturing certification:
• Establish quantitative engineering standards and materials certification
• Integrate quality control with reliability engineering

Pathways for MDA Development (see Tables 4-8 for associated action plans)

Pathway: Establish quantitative engineering standards and materials certification
Challenge addressed: Understanding uncertainty in data and models

Pathway: Establish in-service data infrastructure for MDA
Challenge addressed: Lack of data/knowledge sharing

Pathway: Advance combinatorial materials science
Challenge addressed: Complexity of multiscale optimization

Pathway: Develop support resources for decisions (maintenance through data fusion)
Challenge addressed: Limited decision-support resources

Pathway: Automate mining and curation of legacy data
Challenge addressed: Extracting knowledge from literature-based resources


II. Background and Workshop Objectives

Materials data analytics (MDA) applies principles from materials science and engineering, physics, applied mathematics, and information/computer science to extract knowledge and insights from quantitative process-structure-property-performance relationships hidden in materials data (see Figure 1). By providing researchers with this information, MDA will play a critical role in accelerating the development and deployment of new materials and predicting how they will function in specific applications.

Figure 1. Flemings' Tetrahedron with MDA Perspective. MDA extracts knowledge and insights from quantitative process-structure-property-performance relationships hidden in materials data. Graphic adapted from M.C. Flemings and R.W. Cahn, Organization and Trends in MSE Education in the US and in Europe, Acta Mater., 48, 2000, pp. 371-383.

The MGI Strategic Plan (footnote 1), released in December 2014, clearly defines the role that MDA can play in achieving the goals of the MGI and the need to develop a better understanding of MDA tools and techniques:

Objective 2.4 (The MGI Strategic Plan): Develop Data Analytics to Enhance the Value of Experimental and Computational Data

“...The availability of high-quality experimental and computational data … presents an opportunity for data mining and analysis to expand and accelerate discovery of new materials and predictions of materials with new functionalities.”

• Milestone 2.4.1: Convene a pathfinding workshop focusing on the status of computational tools for data analytics for applications emerging from materials sciences and engineering.

To address this need, ASM International, with support from NIST, convened the Materials Data Analytics: A Path-Finding Workshop on October 8–9, 2015 with the following goals:
1. Assess the state-of-the-art in MDA, identifying the current state as well as gaps in the knowledge and tools.
2. Identify opportunities and high-priority directions for research and development as well as the application of MDA tools.
3. Share the outcomes to stimulate targeted future work.

1 https://www.whitehouse.gov/sites/default/files/microsites/ostp/NSTC/mgi_strategic_plan_-_dec_2014.pdf


III. Overview of Presentations on MDA

A select group of experts led off the workshop, giving presentations on the incorporation of data analytics into materials genomic approaches and the benefits of using MDA to analyze massive amounts of materials data. Table 1 provides a summary of the presentations and offers a glimpse of the broader context and current perspectives in the materials data community on MDA and its potential. All presentations can be viewed at http://www.asminternational.org/web/cmdnetwork/resources. Additional examples of MDA applications are listed in Appendix C.

Table 1. Summary of Plenary Presentations

1. National Materials Data Initiatives with Materials Genome Initiative Overview
Charles Ward, Materials & Manufacturing Directorate, Air Force Research Laboratory

Topics discussed:
• Public (digitally formatted scientific) data access/management plans for MDA
• Elements of data management plans
• Requirements for a materials data infrastructure for MDA

MDA technology & efforts:
• CHiMaD (Center for Hierarchical Materials Design) data mining, such as image-based analysis and supervised learning
• DoD materials project on SEM data (better preservation and re-use of SEM data, improved automatic metadata capture, and enhanced data flow from instruments)
• Materials data format for MDA

2. Hierarchical Materials Informatics
Surya Kalidindi, Georgia Institute of Technology

Topics discussed:
• MDA for hierarchical materials
• Templated workflows for mining materials knowledge
• Materials data science and e-collaboration environment such as Materials Informatics and matIN (matin-hub..io/materialsinnovation.github.io/)

Examples of MDA tools:
• Multivariate Polynomial Regression
• Cross-Validation Tools
• Instance Based K-Nearest Neighbor (KNN)
• Decision Table method
• KStar (Entropy KNN)
• Support Vector Machines
• Linear Regression
• Robust Regression
• Pace Regression (Clustering)
• Artificial Neural Networks
• M5 Model Tree

Specific examples of MDA applications:
• Principal Component Analysis (PCA) to obtain low dimensional microstructure quantification
• Ensemble of two-phase microstructure
• Two-point statistics of microstructure
• Feature extraction and representation
• Atomic structure classification
• Combining approach (e.g., two-point statistics and PCA) for EBSD (electron backscatter diffraction) data
• Metamodels for microstructure analysis
• Phase field modeling and finite element simulation


3. Informatics for/from Electronic Structure to Microstructure
Krishna Rajan, University at Buffalo

Topics discussed:
• Materials data for design
• Learning from databases
• ICME (Integrated Computational Materials Engineering) framework – limitations to present approach
• MDA for “knowledge bases” vs. “data bases”
• Data intensive framework for materials functionality

Specific examples of MDA applications:
• Discovering new data: Prediction of lattice constants of zeolites using Partial Least Squares (PLS) model generated from secondary descriptors
• Discovering reference data: Molten salts data discovery using PLS and PCA models
• Optimizing data volume using artificial neural network
• Ranking similarity of nanoparticle adjuvants to pathogen by MDA tools such as PCA
• Discovering systematics in databases: MDA tools to describe elemental-property parameters
• Tracking statistical impact of descriptors using MDA
• Discovering surface structure-chemistry relationships in catalytic nanoparticles
• Discovering chemical pathways for microstructure sensitive properties

4. Data Analytics for High-Throughput Experimentation (HTE)
Gilad Kusne, NIST*

Topics discussed:
• HTE Informatics vs. Materials Informatics
• MDA challenge: Data pre-processing, sample representation, feature identification/extraction, high-speed metrics

Specific examples of MDA applications:
• Latent Variable Analysis such as PCA, multi-dimensional data scaling
• Symbolic regression for functional property clustering
• Bayesian methods for functional property regression
• Hyperspectral analysis, hierarchical cluster analysis, and/or multi-dimensional data scaling for phase diagram determination
• Mean Shift Theory Clustering: search for rare-earth free permanent magnets
• Non-negative Matrix Factorization (NMF) for phase identification
• Constraint Programming (CP) optimization for phase diagram determination and pure phase ID
• Constrained LASSO algorithm and Bayesian method for hyperspectral microscopy

* Data Analytics for High-Throughput Experimentation (HTE) was not delivered at the workshop, but the presentation slides are available at http://www.asminternational.org/web/cmdnetwork/resources.
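
Several of the applications summarized in Table 1 combine two-point statistics with PCA to obtain a low-dimensional description of microstructure. The sketch below illustrates that general idea only and is not taken from any of the presentations. It assumes periodic boundary conditions, uses a synthetic ensemble of random two-phase images in place of real micrographs, and the function names `two_point_autocorrelation` and `pca_scores` are hypothetical.

```python
import numpy as np

def two_point_autocorrelation(phase_map):
    """Periodic two-point autocorrelation of a binary (0/1) phase map, computed with FFTs."""
    f = np.fft.fftn(phase_map)
    return np.fft.ifftn(f * np.conj(f)).real / phase_map.size

def pca_scores(features, n_components=2):
    """Project the rows of `features` onto their leading principal components via SVD."""
    centered = features - features.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)  # fraction of variance carried by each component
    return centered @ vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
# Synthetic stand-in for measured data: 10 images at 20% and 10 images at 60% volume fraction.
ensemble = [(rng.random((32, 32)) < vf).astype(float) for vf in [0.2] * 10 + [0.6] * 10]

# One row of two-point statistics per microstructure, then reduce to two PCA scores each.
stats = np.array([two_point_autocorrelation(m).ravel() for m in ensemble])
scores, explained = pca_scores(stats)
print(scores[:, 0].round(2))  # the first score separates the two volume-fraction groups
```

In this toy ensemble the first principal component mostly tracks volume fraction. With real micrographs the leading components would be expected to capture morphological differences as well, but that depends on the data.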


IV. Challenges to Advancing MDA

To effectively advance MDA, it is first necessary to understand the major challenges to its development and use. Workshop participants identified the most difficult challenges to advancing MDA and prioritized five which, if addressed, will give MDA the greatest potential to achieve the MGI goals. Refer to Table 2 for a complete list of challenges and suggested ways to address them.

Top Five Challenges to Advancing MDA

• Understanding uncertainty in data and models
• Lack of data/knowledge sharing
• Complexity of multiscale optimization
• Limited decision-support resources
• Extracting knowledge from literature-based resources

Understanding uncertainty in data and models

Accounting for uncertainty is one of the most important but complicated challenges to advancing MDA. Sources of uncertainty must be identified and quantified not only in the data and models, but also in the experimental equipment, processes, and measurements. Moreover, uncertainty impacts MDA results in different ways, complicating the development of error propagation models and stochastic models for uncertainty quantification.

Lack of data/knowledge sharing

Data and knowledge are not shared in a consistent way in the materials data community. As a result, information generated through materials development efforts is not well leveraged. Developing and using data/knowledge-sharing mechanisms will be key to a successful application of MDA. This includes cultivating a collaborative environment for capturing and sharing data and data provenance. The term “data provenance” generally refers to a description of the origins of data and the processes for its inclusion in a database (footnote 2).

Complexity of multiscale optimization

The ability to integrate process, structure, and property information is critical in enhancing materials performance. However, one of the main challenges in merging and optimizing such information arises from the lack of computational tools that can handle voluminous and heterogeneous materials data. In this regard, it is essential to develop MDA methodologies to effectively fuse scientific information, particularly complex performance data stemming from characterizations at different length scales, to gain insight into optimal materials design.

Data Fusion: A process of associating, correlating, and combining data and information from single and multiple sources for effective decision making. (From K. Chiang, Data mining, Data Fusion, and Libraries, 31st Annual IATUL Conference, June, 2010.)

2 P. Buneman, S. Khanna, and W. Tan, Data Provenance: Some Basic Issues, in Proc. of FSTTCS, 2000.
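
As a concrete, deliberately simple illustration of the data fusion idea defined above, the sketch below combines several estimates of the same quantity by inverse-variance weighting. This is one standard statistical technique, not a method the workshop prescribed. It assumes the sources are independent and unbiased, and the function name `fuse_measurements` and the yield-strength values are hypothetical.

```python
import math

def fuse_measurements(values, std_devs):
    """Return the inverse-variance weighted mean and its standard deviation.

    Assumes independent, unbiased estimates of one quantity, each with a known
    standard deviation. More precise sources receive proportionally more weight.
    """
    if not values or len(values) != len(std_devs):
        raise ValueError("values and std_devs must be non-empty and of equal length")
    weights = [1.0 / (s * s) for s in std_devs]
    total = sum(weights)
    fused = sum(w * v for w, v in zip(weights, values)) / total
    return fused, math.sqrt(1.0 / total)

# Hypothetical yield-strength estimates (MPa) for one alloy from three sources.
values = [350.0, 362.0, 340.0]
std_devs = [5.0, 10.0, 20.0]
fused, fused_std = fuse_measurements(values, std_devs)
print(f"fused estimate: {fused:.1f} MPa, standard deviation: {fused_std:.1f} MPa")
```

Under these assumptions the fused estimate is never less certain than the best individual source. If the sources share a systematic error, that guarantee no longer holds.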


Limited decision-support resources

Materials design is a series of decision-making processes from the fundamental study of crystal/micro structures to materials selection for real-world applications. Each decision needs to be supported by physics-based principles and a wide range of existing synthesis recipes. Decision makers can also benefit greatly from real-time feedback from other materials designers. Such resources are currently scarce, calling for the development of centralized experimental and computational innovation hubs.

Extracting knowledge from literature-based resources

Although there have been a number of research papers and studies exploring advanced MDA concepts, the information they contain often appears in the form of unstructured non-image or numerical data. It is thus difficult to extract and convert the knowledge for subsequent MDA work.

Table 2. Challenges to Advancing MDA (high-priority challenges are labeled; potential ways to address each challenge are listed beneath it)

Understanding uncertainty in data and models (high priority; 10 votes)
• Quantify uncertainty in data (e.g., unstructured non-image/numerical data)
• Quantify uncertainty in models
• Conduct an impact assessment of uncertainty on MDA results

Lack of data/knowledge sharing (high priority; 8 votes)
• Develop data/knowledge sharing mechanisms

Complexity of multiscale optimization (high priority; 7 votes)
• Develop computational tools for property optimization

Limited decision-support resources (high priority; 7 votes)
• Centralize experimental and computational innovation hubs for experiment design

Extracting knowledge from literature-based resources (high priority; 7 votes)
• Tools to extract knowledge by connecting concepts in literature and by parsing unstructured non-image/numerical data
• Tools to represent known science in MDA
• Conversion of current knowledge to MDA ready

Increases in data size from high-throughput approaches (6 votes)
• Approaches for scaling MDA

Lack of information fusion (3 votes)
• Models for different physical mechanisms and enhanced fidelity
• Tools to combine models and experiments

Detection of outliers (3 votes)
• Automated tools for outlier detection

Data quality with respect to provenance and variance (2 votes)
• Tools for data quality controls

Identification of applications to acquire new knowledge (2 votes)
• Identification of specific examples of MDA

Lack of visualization (1 vote)
• Tools to visualize materials problems
• Tools to visualize outputs (or structure underlying data) of MDA

Lack of (image) data processing tools (0 votes)
• Tools for scalar, vector, temporal, or multiple images
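
The top-ranked challenge in Table 2 calls for quantifying uncertainty in data and models and assessing its impact on MDA results. One common way to do this, sketched below, is Monte Carlo propagation: sample the uncertain inputs, run the model on each sample, and summarize the spread of the output. The Hall-Petch relation is used here purely as a familiar stand-in for a structure-property model. All parameter values, the assumption of independent normal input distributions, and the function names are hypothetical.

```python
import numpy as np

def hall_petch(sigma0, k, d):
    """Yield strength from the Hall-Petch relation: sigma_y = sigma0 + k / sqrt(d)."""
    return sigma0 + k / np.sqrt(d)

def propagate_uncertainty(n_samples=100_000, seed=0):
    """Monte Carlo estimate of the mean and standard deviation of the model output."""
    rng = np.random.default_rng(seed)
    # Hypothetical inputs, each treated as an independent normal distribution.
    sigma0 = rng.normal(70.0, 5.0, n_samples)   # friction stress, MPa
    k = rng.normal(0.74, 0.05, n_samples)       # strengthening coefficient, MPa*m^0.5
    d = rng.normal(20e-6, 2e-6, n_samples)      # measured grain size, m
    d = np.clip(d, 1e-7, None)                  # guard against non-physical grain sizes
    sigma_y = hall_petch(sigma0, k, d)
    return sigma_y.mean(), sigma_y.std(ddof=1)

mean_strength, std_strength = propagate_uncertainty()
print(f"predicted yield strength: {mean_strength:.1f} +/- {std_strength:.1f} MPa")
```

Repeating the run with one input's spread set to zero at a time gives a rough impact assessment of which source of uncertainty dominates the result.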


V. Priority Applications and Opportunities to best leverage MDA Tools

MDA holds great promise for accelerating the discovery, development, and deployment of new materials and achieving the goals of the MGI. As part of the workshop, participants were asked to identify and prioritize specific materials and technical applications where MDA tools would have the greatest impact if sufficiently developed and made available to the materials community. Participants also identified opportunities to improve and advance MDA tools with high-priority materials and target applications in mind.

Priority Materials and Technical Applications for MDA Tools

The MGI Strategic Plan identifies national objectives over a range of sectors, including security, human health and welfare, clean energy systems, and infrastructure and consumer products. New, advanced materials and more efficient ways of developing and producing them are critical for achieving these objectives and are greatly aided by the use of MDA. Workshop participants identified and prioritized the materials and technical applications that MDA has the greatest potential to facilitate.

Highest Priority Materials Applications for MDA Tools*

• High-temperature structural alloys (6)
• High entropy alloys or metallic glasses (2)
• Biomedical additive manufacturing parts (2)
• Conducting polymers (1)
• Cost-effective solar materials (1)

* Number of votes in parentheses. Miscellaneous materials applications for MDA tools include catalysts for CO2 conversion, water splitting, turbine/engine components, thermoelectrics, windmill blades, and so on.

Highest Priority Technical Applications for MDA Tools*

• Combinatorial materials development (8)
• Semi-autonomous experimentation (8)
• Outlier/error detection in existing datasets (7)
• Hyperspectral microscopy (1)
• Consensus analysis for experiments, simulations, and theories (1)

* Number of votes in parentheses. Miscellaneous technical applications for MDA tools include real-time data collections and analyses for synchrotron experiments, advanced microscopy techniques, and so on.


Priority Opportunities for Advancing MDA Tools

To fully realize the benefits of MDA, regardless of where and how it is applied, the tools themselves must be further developed along with the infrastructure around them. According to workshop participants, tool improvement efforts should focus on three areas: in-service data infrastructure (making existing data available to MDA), materials discovery and design, and engineering design and manufacturing certification. For each area, participants identified several opportunities for improvement then ranked them according to their impact potential. Table 3 contains a complete list of opportunities, highlighting those with the greatest potential to accelerate materials development and deployment.

Table 3. Opportunities for Advancing MDA Tools (high-priority opportunities are labeled; votes in parentheses)

In-service data infrastructure (making existing data available to MDA)
• Automate mining and curation of legacy data (high priority; 19)
• Develop support resources for decisions (maintenance through data fusion) (high priority; 11)
• Construct digital databases (1)
• Effectively map literature using MDA (0)

Materials discovery and design
• Integrate the design and discovery cycle using feedback from materials in use and their application to development (high priority; 10)
• Discovery and design of new conceptual ideas (3)
• Redundancy creation for accuracy and reliability of materials design (3)
• Understanding physical mechanisms (e.g., defect formation in composites) (2)
• Identification of under-explored materials systems (1)

Engineering design and manufacturing certification
• Establish quantitative engineering standards and materials certification (high priority; 7)
• Integrate quality control with reliability engineering (high priority; 4)
• Critical analysis of safety factors (1)
• Accelerated materials identification (0)


VI. Near-Term Pathways for MDA Development

After identifying and ranking challenges, opportunities, and applications with high impact potential, workshop participants were asked to come up with a plan to accelerate the development of MDA tools and techniques and help the MGI community achieve its goals. Working in small groups, the participants identified near-term pathways offering the greatest potential for improvement. Each group focused on a specific opportunity, developing a detailed action plan that includes key tasks, time-based milestones, expected outcomes, and required resources. The action plans are outlined on the following pages.


Establish quantitative engineering standards and materials certification

Challenge addressed: Understanding uncertainty in data and models

Materials standards are currently written based on minimum property values obtained by testing at different length scales. But they could just as easily be based on properties associated with processing inputs and parameters. As properties and performance are generally functions of (micro)structure, it makes more sense going forward to base these standards on structure rather than derived quantities.

Table 4. Action Plan: Establish quantitative engineering standards and materials certification

Key tasks:
• Collaborative platforms
• Determination of appropriate representation of microstructures
• High quality inverse maps between structure and property
• High quality maps for evolution of structure during processing
• Visualization tools that show above complex relationships
• In-process MDA and characterization tools for data capture & extraction

Time-based milestones:
• Databases of process, structure, and property
• Forward/reverse models for process-structure-property linkages
• Merging all aspects to create certifications for target values
• Development of regulatory bodies (1 year)
• Protocols for certification (1 year)

Primary outcomes:
• Identification of most important inputs to a manufacturing process and real-time monitoring
• Minimize need for destructive physical testing of manufactured components
• Smaller samples of test sizes
• Rapid certification of multiple manufacturing routes
• Real-time QA (quality assurance) during manufacturing

Required resources:
• Large collaboration of precompetitive data
• Industry and regulatory professionals' willingness to completely overhaul the existing system
• Professional societies to develop structure/process based standards
• Development of high property/process models
• Standardized parts and processes for calibration
• Regulatory edicts / Compliance with new process

Roles and responsibilities:

Academia
• Personnel
• NLP, OCR, and machine learning
• Input for organization

Government
• Repose and coordinate
• License exemptions
• Open access
• Funding
• Convening and collaboration

Industry
• License agreements
• Data contributions

Professional Society (Non-Profit)
• Analytics
• License agreements
• Convening and collaboration


Establish in-service data infrastructure for MDA

Challenge addressed: Lack of data/knowledge sharing

In-service data infrastructure would advance MDA by allowing it to rapidly inform materials discovery and design processes with the iterative coupling of in-service materials performance (i.e., collective properties) data with related information, such as process and property data.

Table 5. Action Plan: Establish in-service data infrastructure for MDA

Key tasks:
• Collaboration and buy-in from user community for access to data
• Formats and data/metadata standards
• Data collection and curation
• System design enabling in-service data collection
• Infrastructure for efficient data analysis
• Tools and software model for analysis
• Workflow definition and education

Time-based milestones:
• ID systems that are ready for MDA (1~2 years)
• Business case analysis justifying investment (3~5 years)

Primary outcomes:
• Improvement of existing models
• Fill-in information gaps
• Develop new models (e.g., predictive modeling)
• Reduced uncertainty in component design
• Enhanced performance of components
• Reduced in-service cost

Required resources:
• Adequately educated workforce
• Infrastructure for long-term employment
• Tools and software development (targeted as needed)
• Collaborative environments between materials science and computer science
• Recognize materials science may need computer science support
• Proprietary and legal barriers to sharing will be an impediment

Roles and responsibilities:

Academia
• Faculty with competence in both physics and mathematics
• Development of interdisciplinary curricula

Government
• Funding
• Facilitate identification of high impact areas
• Facilitate program development and execution
• Share knowledge to replicate approach and convey lessons learned

Industry
• ID where feedback loop will have highest payoff
• Collaborate on sharing data and building infrastructure

Professional Society (Non-Profit)
• Knowledge dissemination
• Facilitate sharing resources


Advance combinatorial materials science

Challenge addressed: Complexity of multiscale optimization

Combinatorial experimentation provides opportunities for knowledge extraction from a vast amount of data generation and high-throughput experimentation designed to automate execution and dynamic control. It is highly desirable to incorporate MDA into combinatorial materials science to systematically improve materials performance through optimization of these large data sources.

Table 6. Action Plan: Advance combinatorial materials science

Key tasks:
• Automated capture of data
• Automated process monitoring
• Dynamic design of experiments
• Better tools/usage of tools for knowledge extraction from experimental data

Time-based milestones:
• Milestones were not outlined and require further development

Primary outcomes:
• Rapid and efficient exploration of materials and design spaces without data analysis bottlenecks

Required resources:
• Creation of an MDA proficient workforce
• Funding for the creation of user tools

Roles and responsibilities:

Academia
• Train students in the skills required for MDA (e.g., programming skills, data analytics tools, statistics, etc.)
• Development of user tools

Government
• Mandate for open data formats and APIs
• Funding explicitly for MDA activities

Industry
• Creation of open data formats and APIs (application programming interfaces) for controlling equipment

Professional Society (Non-Profit)
• MDA “Boot Camps” that would cover the skills also being promoted by academia


Develop support resources for decisions (maintenance through data fusion)

Challenge addressed: Limited decision-support resources

Rational decision-making at each step of the materials design process will help make the MGI goals a reality by reducing the huge design space and variables. Enhancing intelligent support to offset the lack of decision-making resources can be accomplished through MDA-based information fusion, which requires complete use of experimental and computational resources.

Table 7. Action Plan: Develop support resources for decisions (maintenance through data fusion)

Key tasks:
• Continuous and automated data capture and data storage
• Sensors (optional)
• Metadata capture and storage as data infrastructure
• Data mining and fusion
• Make use of existing physics-based models

Time-based milestones*:
• Automotive applications (1 year)
• Aerospace applications (5 years)
• Nuclear power plant applications (>5 years)

Primary outcomes:
• Cost savings on maintenance/down time
• Fewer accidents/failures
• Number of adoptions

Required resources:
• Collaborative team (IT, domain experts, data scientists, MDA tool developers, etc.)
• Computer systems and support

Roles and responsibilities:

Academia
• Development of MDA algorithms and tools
• Modeling (degradation modeling)

Government
• Fund academia
• Public safety

Industry
• Business case

Professional Society (Non-Profit)
• Provide conceptual services

* Security issues could delay outcomes.


Automate mining and curation of legacy data

Challenge addressed: Extracting knowledge from literature-based resources

When developing new materials or advanced devices, it is essential to have access to a wide range of data-rich information, including results from experiments, simulations, and theoretical calculations that may have been conducted. Using automated techniques to mine and curate such legacy data is critical for the advancement of MDA. The most pressing need is the ability to rapidly populate materials data stores through an information retrieval process aided by machine automation.

Table 8. Action Plan: Automate mining and curation of legacy data*

Key tasks:
• Identify sources of data and scope for collection
• Collect, digitize, and make accessible the legacy data
• Curate: scientific literature and patent literature
• Comprehend and organize
• Make data discoverable

Time-based milestones:
• Access to currently digitized/open materials (1 year)
• Identify hard-to-access, high value materials (1 year)
• Digitize all easy and high-value data (5 years)
• Tag data for organization and discoverability (10 years)

Primary outcomes:
• Tagged, discoverable dataset encompassing all legacy literature
• Synchronization with ontologies from other fields
• Materials-specific natural language processing (NLP)

Required resources:
• Professional (non-profit) society: 1) license agreements, 2) data, and 3) convening and collaboration
• The literature including value assessment
• NLP and optical character recognition (OCR) technology
• Infrastructure: computer and people
• Open access for literature
• Thermodynamic resource center as a possible model

Roles and responsibilities:

Academia
• Development/testing of models
• Development of appropriate mathematical descriptions

Government
• Curation of data
• Regulatory authority
• Financial support for industry
• Development of standards

Industry
• Adapt materials characterization in support of models
• Train employees
• Support certification

Professional Society (Non-Profit)
• Development of standards databases
• Education and training

* Some data may be rejected, even though it may be correct, if the majority of datasets agree on a wrong result.
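The information retrieval step described in this action plan can be illustrated with a small sketch. The example below is not a method from the workshop; it assumes that the legacy text has already been digitized (for example, by OCR) and uses a simple regular expression to pull melting-temperature values into structured records. The pattern, the function name `extract_records`, the sample sentence, and the source label are all hypothetical.

```python
import re

# Hypothetical pattern: a property phrase, up to 40 non-digit characters,
# then a number followed by a temperature unit (K or °C).
PATTERN = re.compile(
    r"(?P<prop>melting (?:point|temperature))\D{0,40}?"
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>K|°C)\b",
    re.IGNORECASE,
)


def extract_records(text, source):
    """Return one record per melting-temperature mention found in text."""
    records = []
    for match in PATTERN.finditer(text):
        value = float(match.group("value"))
        unit = match.group("unit").upper()
        # Normalize every value to kelvin so records are comparable.
        kelvin = value + 273.15 if unit == "°C" else value
        records.append(
            {
                "property": "melting temperature",
                "value_K": round(kelvin, 2),
                "source": source,  # keep provenance for later curation
            }
        )
    return records


# Invented sample text standing in for a digitized legacy document.
sample = (
    "The alloy has a melting point of 1455 °C. "
    "Its melting temperature was later reported as 1728 K."
)
records = extract_records(sample, source="hypothetical-legacy-report")
for record in records:
    print(record)
```

Materials-specific NLP would need to go well beyond a fixed pattern like this one, but the sketch shows why provenance and unit normalization matter when records from many sources are later compared.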

Materials Data Analytics: A Path-Finding Workshop 16

VII. Supporting Needs

The action plans outlined in Tables 4-8 require cross-sector collaboration with each stakeholder playing a role in data management, education, and community development. With that in mind, participants identified several supporting activities that will benefit all action plans outlined above.

To help: Materials community, including industry
Supporting activities:
• Willingness to share data rather than protect it
• Develop open literature repositories to enable NLP and data mining
• Data management (shared storage, data accessible to analysis tools, data annotation tools, and discovery tools for federated data)
• Shared digital data products from MSE research
• Elimination of closed data formats
• Metadata standards (micro- and macro data collection and interoperability at reasonable scope and scale)
• Standard test sets for comparing algorithm performance
• Interpretable digital descriptions of materials data (e.g., vocabularies, formats, metadata standards, and/or ontologies)
• Spinning off tools and data from MDA for specific applications
• Exploration of business cases (e.g., where MDA was required)
• MDA applications on the web pilot projects with untraditional collaborators
• Professional societies and Congress to put a time limit (e.g., 15 years) on the copyright of MDA-related scientific publications
• Recognize the value of data (incentives: credit, citation, promotion, tenure, awards, or prizes)
• A hotline for real-time communication
• A cloud collaborative environment for computing, sourcing, and contributions
• Create a DOI (digital object identifier) for every prior publication (especially old publications)
• Open access catalog of major available MDA tools including their range of applicability and limitations
• Open-source discussion boards specific to MDA

To help: Students
Supporting activities:
• Closer interaction between materials science and computer science (e.g., joint courses, joint appointments, cross-listing, joint workshops/tutorials)
• Identification of the range of applicability for various MDA techniques
• Materials Science and Engineering curricula including MDA/programming (e.g., Introduction to MDA in undergraduate coursework or design of experiments)
• Hands-on workshops/hackathons to establish the domain basics
• Collaborative multidisciplinary work spaces
• New committees combining materials research with computer/data sciences

To help: Current workforce
Supporting activities:
• Education resources/programs on data management
• Redesigned workshops/symposia for solving a specific problem
• Repurposed workforce (veterans, retirees, or people with disabilities) to support data curation
• Training courses on MDA by professional society with government funds


VIII. Concluding Remarks and Next Steps

The workshop results provide important guidance that will enable the materials data community to make strides in advancing and implementing MDA. These efforts will involve the development of algorithms that account for a clear understanding of uncertainty in materials data, information to solve individual materials problems, and mechanisms and an environment for sharing these resources across sectors. A coordinated and concerted effort from academia, industry, government, and professional societies to continue defining the state-of-the-art and pursuing promising opportunities will ensure the advancement of MDA and the overall acceleration of new materials development.


Appendix A. List of Participants

Organizers and speakers

Robert Hanisch, National Institute of Standards and Technology (opening remarks on behalf of NIST)
Charles Ward, Air Force Research Laboratory (speaker)
Surya Kalidindi, Georgia Institute of Technology (speaker and organizer)
Krishna Rajan, SUNY - University at Buffalo (speaker)
Gilad Kusne, National Institute of Standards and Technology (speaker and organizer)
David Williams, Ohio State University (speaker and organizer)
Ankit Agrawal, Northwestern University (organizer)
Rudy Buchheit, Ohio State University (organizer)

Participants

Laura Bartolo, Northwestern University
Ilias Bilionis, Purdue University
Kathryn Dannemann, Southwest Research Institute
Brian DeCost, Carnegie Mellon University
Jeffrey Ellis, Battelle
Jason Hattrick-Simpers, University of South Carolina
Barry Hindin, Battelle
Jeremy Knopp, Air Force Research Laboratory
Ruoqian Liu, Northwestern University
Zi-Kui Liu, Penn State University
Will Marsden, Granta Design Ltd
Debbie Mies, Granta Design Ltd
Michael Mills, Ohio State University
Stephen Niezgoda, Ohio State University
Richard Otis, Pennsylvania State University
Arindam Paul, Northwestern University
Amra Peles, United Technologies Research Center
John Perkins, National Renewable Energy Laboratory
Andrew Reid, National Institute of Standards and Technology
Vyacheslav Romanov, National Energy Technology Laboratory
Dongwon Shin, Oak Ridge National Laboratory
Logan Ward, Northwestern University
J.-C. Zhao, Ohio State University

ASM International

Scott Henry
Larry Berardinis

Nexight Group

Ross Brindle (workshop facilitator)
Changwon Suh


Appendix B. Workshop Agenda

Thursday, October 8

12:00-12:45 PM  Registration
12:45-1:00 PM  Opening Remarks
  • Welcome and Introductions, Workshop Purpose - Scott Henry, ASM International
  • Remarks - “Improving Discoverability and Access to Materials Science Data at NIST,” Robert Hanisch, NIST
1:00-1:30 PM
  • “Translational Data Analytics @ Ohio State University: Accessibility, Integration, and Co-Development,” Philip Payne, Director, Translational Data Analytics, Ohio State
1:30-3:00 PM  Materials Data Analytics: Perspectives
  • Presentation 1 - “Materials Genome Initiative Overview,” Charles Ward, AFRL
  • Presentation 2 - “Hierarchical Materials Informatics,” Surya Kalidindi, Georgia Tech
3:00-3:15 PM  Break
3:15-4:45 PM  Materials Data Analytics: Perspectives (cont’d)
  • Presentation 3 - “Informatics for Electronic Structure to Microstructure,” Krishna Rajan, SUNY-University at Buffalo
  • Presentation 4 - “Data Analytics for High-Throughput Experimentation,” Gilad Kusne, NIST
4:45-5:00 PM  October 9 Workshop Plan and Process - Ross Brindle, Nexight Group, facilitator
5:45 PM  Joint reception w/ Translational Data Analytics (TDA@OSU) Meeting

Friday, October 9

8:00-8:30 AM  Registration and Continental Breakfast
8:30-10:00 AM  Group Discussion: MDA and Their State of Development
  • ASM to review results of pre-workshop analysis of status
  • Group identifies “state of readiness”
  • Group prioritizes areas for further development
10:00-10:30 AM  Break
10:30 AM-Noon  Group Discussion: Applications and Opportunities for MDA Tools
  • Group identifies applications and opportunities for high priority areas for development
  • Group begins to identify details around development efforts including potential timing, desired participants, and expected outcomes
12:00-1:00 PM  Lunch
1:00-2:30 PM  Group Discussion: Supporting Needs-Data Management, Education, and Community Development
  • Group brainstorms supporting activities needed to advance MDA
2:30-3:00 PM  Closing Session
  • Review of next steps and action items
  • Brief closing comment from each participant
3:00 PM  Adjourn


Appendix C. Summary of Prior Work

Background

Materials data analytics (MDA), an emerging discipline within the fields of materials science and engineering, physics, applied mathematics, and information/computer science, is essential to achieving the goals of the Materials Genome Initiative (MGI). By helping researchers extract knowledge and insights stemming from process-structure-property-performance relationships hidden in materials data³ (Figure C1), MDA will play a critical role in accelerating the development and deployment of new materials and predicting how they will function in specific applications.⁴

Figure C1. The role of MDA in the materials genome approach⁵

[Figure: diagram relating Processing, Structure, Property, and Performance]

MDA algorithms

As practiced, MDA relies on a variety of information analysis techniques, including data mining (or statistical learning), machine learning, and the application of numerous ad-hoc algorithms (Figure C2). Data mining, in the broadest sense, includes all the activities of data organization, warehousing, and mining to retrieve hidden patterns in data sets. Machine learning, on the other hand, focuses more on algorithm development, and the algorithms are the primary means by which researchers extract patterns, rules, and associations (i.e., materials information) that would otherwise remain hidden in the underlying process-structure-property-performance (PSPP) relationships.

3 S.R. Kalidindi and M. De Graef, Materials Data Science: Current Status and Future Outlook, Annu. Rev. Mater. Res., 45, 2015, p. 171, DOI: 10.1146/annurev-matsci-070214-020844.
4 Materials Genome Initiative Strategic Plan (December 2014), https://www.whitehouse.gov/sites/default/files/microsites/ostp/NSTC/mgi_strategic_plan_-_dec_2014.pdf.
5 Infogineering, “The Differences between Data, Information and Knowledge,” http://www.infogineering.net/data-information-knowledge.htm.

Figure C2. Common materials data and MDA algorithms used today

Current applications of MDA to materials engineering

One of the most powerful attributes of MDA is that it gives materials designers a holistic perspective, allowing them to see, for example, how processing affects mechanical, structural, chemical, optical, magnetic, and other properties. This is achieved through multivariate data analysis, a common form of which, dimensionality reduction, is used in feature extraction. Multivariate data analysis and visualization of high-dimensional data are indispensable, as most of the data MDA probes include multiple variables, such as process (synthesis) parameters, chemistries, (micro) structures, and various physical (electric, magnetic, optical, or mechanical) properties (see Figure C2). Other ways in which researchers are currently leveraging the strengths of MDA include the use of kernel methods for nonlinear modeling, visual mining for pattern recognition, and association rule mining for finding governing rules in materials systems. Table C1 outlines some review papers describing general MDA applications to materials engineering, while Tables C2, C3, and C4 summarize the current state of MDA applied to specific materials issues.
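As a minimal sketch of the dimensionality reduction mentioned above, the example below applies principal component analysis (PCA) to a small synthetic data set using NumPy's singular value decomposition. It is an illustration only, not a procedure taken from the cited studies: the function name `pca`, the four synthetic variables, and the random seed are all assumptions, and the data are generated rather than measured.

```python
import numpy as np


def pca(X, n_components):
    """Project standardized data onto its leading principal components."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    Z = (X - X.mean(axis=0)) / std
    # Rows of Vt are the principal directions; S holds singular values.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = S**2 / np.sum(S**2)  # fraction of variance per component
    return scores, Vt[:n_components], explained[:n_components]


# Synthetic stand-in for multivariate materials data: three variables
# driven by one hidden factor, plus one unrelated noisy variable.
rng = np.random.default_rng(0)
latent = rng.normal(size=50)
X = np.column_stack(
    [
        latent + 0.1 * rng.normal(size=50),
        2.0 * latent + 0.1 * rng.normal(size=50),
        -latent + 0.1 * rng.normal(size=50),
        rng.normal(size=50),
    ]
)

scores, components, explained = pca(X, n_components=2)
print("explained variance fractions:", explained)
```

Because three of the four variables share one hidden factor, most of the variance collapses onto the first component, which is the kind of structure that feature extraction is meant to expose.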


Table C1. Exemplary review papers on the materials engineering applications of MDA


Materials Data Science: Current Status and Future Outlook [2] Key concepts: materials database, materials data management, materials data analytics, materials e-collaboration platform, process-structure-property linkage

Machine Learning in Materials Science: Recent Progress and Emerging Applications [4] Key concepts: data analytics of large volumes of materials data, machine learning, materials sciences application

Materials Informatics: The Materials “Gene” and Big Data [5] Key concepts: uncertainty, statistical inference, information theory, fuzzy logic, rough sets

Data Science and Cyberinfrastructure: Critical Enablers for Accelerated Development of Hierarchical Materials [6] Key concepts: materials informatics, microstructure quantification, process-structure-property linkages, data science, Cyberinfrastructure, metamodels, spatial correlations, reduced-order representations

What is High-Throughput Virtual Screening? A Perspective from Organic Materials Discovery [7] Key concepts: computational materials design, big data, materials screening, data filtering

Big Data and Deep Data in Scanning and Electron Microscopies: Deriving Functionality from Multidimensional Data Sets [8] Key concepts: multivariate statistical analysis, visualization, multidimensional structural and functional data, high-performance computing

Quantitative Structure-Property Relationship Modeling of Diverse Materials Properties [9] Key concepts: descriptors, QSPR (Quantitative Structure-Property Relationship) modeling, nanomaterials, catalysts, polymers, ionic liquids, supercritical carbon dioxide, ceramics

Big-deep-smart Data in Imaging for Guiding Materials Design [10] Key concepts: big data in imaging and simulation, deep data, smart data

Box C1. Abbreviations

ANN: Artificial Neural Network
ARM: Association Rule Mining
DM: Diffusion Map
DT: Decision Tree
FA: Factor Analysis
GA: Genetic Algorithm
GEP: Genetic Programming
GP: Gaussian Process
IT: Information Theory
ICA: Independent Component Analysis
K-NN: K-Nearest Neighbor
NNMF: Non-Negative Matrix Factorization
MCR: Multivariate Curve Resolution
MLR: Multiple Linear Regression
NGT: Network-Graph Theory
NLR: Non-Linear Regression
PCA: Principal Component Analysis
PCR: Principal Component Regression
PLS: Partial Least Squares
OLS: Ordinary Least Squares
SOM: Self-Organizing Map
SVM: Support Vector Machine


Table C2. Examples of MDA for experiments and simulation applied to specific materials problems

Material | Application area | MDA method | Ref.
Functional materials (superconductors and nanocomposites) | Analysis of higher-dimensional data sets (3D to 6D) from electron and scanning probe microscopes | PCA, ICA, Clustering, Bayesian de-mixing, ANN | [8]
Hierarchical materials | Analysis and mining of N-point statistics | PCA, in-house algorithms | [6,11]
Steels | Solid-solid phase transformation mechanisms | ANN | [12]
Earth abundant permanent magnets | Analysis of high-throughput data such as diffraction and structural phase boundary identification | In-house algorithm (uses mean shift theory), NNMF, Clustering | [13,14]
Alloys such as Ni-Cr-Al alloy | Texture analysis in micrographs | In-house algorithm (uses automated segmentation, maximum likelihood estimate) | [15]
Alloys such as Ti- and Ni-systems | Segmentation of features from microstructure images from high-throughput electron microscopes | In-house algorithm (uses posterior marginal segmentation to classify image pixels) | [16]
Alloys such as Ti- and Ni-systems | Analysis of materials image datasets | Bayesian segmentation technique | [17]
Alloys such as Ti- and Ni-systems | Analysis of materials image datasets | IT (image entropy and mutual information) | [18]
Alloys | Isosurface rendering for 3D atom probe data | Rendering algorithm with OpenGL | [19]
Alloys | Improving isotope discrimination in atom probe tomography data | PCA | [20]
Alloys | (Bayesian) uncertainty quantification | Bayesian approach, GP | [21,22]
Functional materials | High-dimensional atomistic potentials | ANN | [23]
Hydrogenated nanocrystalline silicon | Identification of structural features | GEP | [24]
Functional materials | Statistical analysis of compositional data | In-house algorithms for data transformation | [25]
Alloys | Estimating chemical composition from 3D atom probe data | PCA | [26]
Alloys | Statistical assessment of atomic density of precipitates and surrounding matrix regions in reconstructed volumes | K-NN (k=1) | [27]
Alloys | Factorization of high volume of tomographic spectral image | MCR | [28]
Functional materials | Correcting absorption of fluorescent radiation in different phases of an inclusion | PCA, K-means clustering | [29]
Alloys | Determine statistical distributions of first nearest neighbor distances for random solid solution and solute-enriched clusters | K-NN (k=1) | [30]
Functional materials | Effective image processing | Image PCA, in-house algorithms for image registration | [31]
Functional materials such as polymer blends | Data fusion for different characterization techniques | Data fusion of XPS and AFM images | [32]
Functional materials (solar cell components) | Identification of photovoltaic loss mechanisms | N-way PLS | [33]
Functional materials such as heterogeneous catalysts | Visualization of high-dimensional materials data | Parallel coordinates and radial visualization | [34]
Functional materials (solar cell components) | Identification of optimal process conditions | DM | [35]
Polymer nanocomposites | Identification of key microstructure descriptors from vast candidates as potential microstructural design variables | In-house algorithms using image analysis to eliminate redundant microstructure descriptors and identify key descriptors to determine design variables | [36]
Polymer nanocomposites | Identification of a small set of microstructure descriptors to represent morphology features quantitatively | In-house algorithms based on statistical sensitivity analysis and multi-objective optimization | [37]


Table C3. Examples of MDA for materials discovery applied to specific materials problems

Material | Application area | MDA method | Ref.
Anode materials for batteries | Structure and phase prediction | GA | [38]
Detector materials for ionizing radiation | Materials selection | Not specified | [39]
Polymers | Prediction of materials properties such as atomization energy, zero-point energy, isotropic polarizability, heat capacity, HOMO-LUMO gap | NLR (Gaussian kernel ridge regression) | [40]
Ternary compounds | Prediction of thermodynamic stability of arbitrary compositions | Decision tree (Reduced error pruning tree with the rotation forest ensembling technique) | [41]
Molecule-based magnet | Prediction of structural and physical properties of materials | (Sparse) regression, ONR, MLR (with LASSO algorithm), NGT | [42]
Inorganic materials such as ternary oxides | Prediction of crystal structure | In-house algorithms called DMSP: data mining structure predictor (based on mutual information, parameter estimation with maximum likelihood estimate) | [43,44]
All-solid state rechargeable Li-ion batteries | Prediction of the Li-ion hopping energy | PLS | [45]
Inorganic functional materials | New compound discovery process through ionic substitutions | In-house algorithms (based on probabilistic model) | [46]
Photoelectrocatalysts | Identification of new photoelectrocatalysts in the cubic perovskites | Cluster analysis (the propensity and dendrogram analysis) | [47]
Dye-sensitized solar cells | Development of molecular design rules | In-house algorithm for data filtering | [48]
Binary intermetallic compounds | Mapping the relative stability of compounds | IT (Shannon information entropy), classification tree | [49]


Table C4. Examples of MDA for PSPP integration applied to specific materials problems

Material | Application area | MDA method | Ref.
Fullerenes, nanotubes, catalysts, polymers, supercritical CO2 and ceramics | Modeling of materials properties | OLS, MLR, PLS, PCR, SVM | [9]
Superconductors | Prediction of critical temperature | Similarity and connectivity based on network theory | [50]
Heterogeneous catalysis | Estimation of evolution of catalytic reaction for unsynthesized catalysts | GEP | [51]
Light emitting diode phosphors | Study of phosphor luminescence | FA, PCA, and OLS | [52]
Zeolite crystals | Classification of zeolites into different types of minerals and framework | Random forest algorithm | [53]
Organic materials | Prediction of atomization energies of a diverse set of organic molecules | In-house non-linear mapping algorithms (based on a measure of distance in compound space that accounts for stoichiometry and configurational variation) | [54]
Magnetoelastic Fe-Ga alloy | Microstructure optimization | In-house algorithm for random data generation, feature selection and classification, IT (Information Gain), SVM, Statistics (χ2 and F-score), classification tree | [55]
Carbon nanofiber/nanocomposites | Classification of mechanical properties such as yield strength and elastic modulus | SVM classifier, SOM | [56,57]
Carbon nanofiber/nanocomposites | Prediction of viscoelastic properties | SOM and cluster analysis (i.e. fuzzy C-means), PCA | [58]
Lithium superionic conductors | Prediction of low-temperature conductivities | SVM regression | [59]
Austenitic steel | Prediction of rupture life and stress | Classification and regression tree (CART) | [60]
Wide bandgap AB compounds, Rare earth main group RM intermetallics | Mechanical property classification | PCA, decision tree, SVM | [61]
Polymers | Prediction of atomization energy, lattice parameter, spring constant, electron affinity, bandgap, dielectric constant, etc. | NLR (kernel ridge regression) | [62]
Functional materials | Classification/prediction modeling | Association analysis, cluster analysis, materials visualization | [63,64]
Sialon ceramics | Prediction of cold modulus | SVM regression | [65]
Single and binary compounds | Prediction of melting temperature | OLS, PLS, SVM regression, GP regression | [66]
Metal organic frameworks (MOF) for CO2 capture | Classification of CO2 uptake capacity of MOF | SVM classifier | [67]
Functional materials | Development of the function that maps a material structure to its value on some property | In-house algorithm such as the multivariate regression function as a sum of separable functions | [68]
Octet AB solids | Classification of crystal structure, predict melting temperature | SVM classifier | [69]
Steels | Prediction of Martensite start temperature | K-NN | [70]
Hard coating materials | Materials selection and classification | Hierarchical clustering | [71]
Steels | Correlation of steel properties to composition and manufacturing processes | ANN, OLS, MLR | [72]
Organic solar cell materials | Computation of physically meaningful morphology descriptors | In-house algorithm such as a graph-based framework | [73]


Current MDA tools

Table C5. Survey of data analytics tools and development environments

Name | Description | URL | Open (O)
Aabel | Statistical data analysis and dynamic scientific graphing | Gigawiz.com |
Apache Hadoop | Software framework for distributed processing across computer clusters using simple programming models | Hadoop.apache.org | O
Apache Storm | Open-source distributed real-time computation system | Storm.apache.org | O
Eureqa | A straightforward interface for genetic programming | Creativemachines.cornell.edu/Eureqa |
Import io | Website data harvester | import.io |
JMP | Statistical discovery software from SAS | Jmp.com |
KNIME | Data Analytics | Knime.org | O
MATLAB (commercial) | Toolboxes for statistics, optimization, neural networks, signal processing, image processing, and computer vision | Mathworks.com/products/Matlab |
Orange | Open source data visualization and analysis for novice and experts; data mining through visual programming or Python scripting; components for machine learning | Orange.biolab.si | O
OpenCV | Open-source library of programming functions for image recognition and computer vision applications | Opencv.org | O
OpenRefine | Open-source tool for cleaning and transforming messy data, and extending it with web services and links | Openrefine.org | O
Pentaho | Data integration and analytics platform, offering a suite of open-source products for integration, OLAP services, reporting, dashboarding, data mining, and ETL capabilities | Pentaho.com | O
RapidMiner | Integrated environment for machine learning, data mining, text mining, predictive analytics, and business analytics; supports data visualization, validation, and optimization | Rapidminer.com |
Ryft ONE | Provides fast, actionable business insights by analyzing both historical and streaming data at speeds in excess of 10 Gigabytes/second | Ryft.com | O
Schrödinger Materials Science Suite | Chemical simulations, including structure generation, combinatorial library enumeration, model creation, property prediction, data analysis, and decision making | Schrodinger.com/materials |
Scikit-learn | Machine learning in Python | scikit-learn.org | O
SigmaPlot | Software package for scientific graphing, modeling, and data analysis from basic statistics to advanced mathematical calculations | Sigmaplot.com |
SPSS | Suite of predictive analytics software from IBM for survey, market, and business researchers | IBM.com/SPSS |


Additional MDA-related focus areas

1. Data management

MDA can extract the most useful information from well-organized data sets. Materials data sets are heterogeneous due to the many different ways of synthesizing materials and of modeling and measuring properties with different time and length scales. Integrated, well-organized data sets should serve as a guideline for designing effective data querying systems and for the construction of a schema that makes it possible to query across databases, exchange information, and distribute data in materials design.
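One way to picture such a schema is a single record type to which heterogeneous sources are mapped before querying. The sketch below is a hypothetical illustration, not a schema proposed by the workshop: the `PropertyRecord` fields, the two source layouts, the field names, and the numerical values are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertyRecord:
    """Hypothetical common schema shared by all data sources."""
    material: str
    prop: str
    value: float
    unit: str
    source: str


def from_experiment_db(row):
    # Assumed layout: modulus stored in MPa under the key "E_MPa".
    return PropertyRecord(
        row["sample"], "elastic modulus", row["E_MPa"] / 1000.0, "GPa", "experiment"
    )


def from_simulation_db(row):
    # Assumed layout: modulus already in GPa under a different key.
    return PropertyRecord(
        row["formula"], "elastic modulus", row["youngs_modulus_GPa"], "GPa", "simulation"
    )


def query(records, material, prop):
    """Return all records for one material and property, across sources."""
    return [r for r in records if r.material == material and r.prop == prop]


# Invented rows standing in for two differently organized databases.
experiment_rows = [{"sample": "Al-6061", "E_MPa": 68900.0}]
simulation_rows = [{"formula": "Al-6061", "youngs_modulus_GPa": 70.2}]

records = [from_experiment_db(r) for r in experiment_rows]
records += [from_simulation_db(r) for r in simulation_rows]

hits = query(records, "Al-6061", "elastic modulus")
for hit in hits:
    print(hit)
```

Mapping each source to one record type at load time keeps the query logic independent of how any individual database names its fields or chooses its units.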

2. Education

Materials engineering students need to learn about the design principles that can enhance or develop functional materials to meet requirements such as fracture toughness or corrosion resistance. However, materials design is a complex process that requires the determination of optimal combinations of material chemistry, processing routes, and synthetic parameters to robustly meet specific requirements. MDA can significantly improve teaching and learning processes in Science, Technology, Engineering and Math (STEM) education. It affords future materials designers the opportunity to learn materials taxonomy within a computational framework that consists of grouping, classification, association, and correlations for a defined set of engineering data. MDA can provide students with a basic foundation to understand various materials’ behaviors (e.g., delivering the concept of materials design by introducing classifications or associations of materials in terms of electronic properties, such as band gap energy). A pedagogical strategy that leverages MDA would be far different from conventional education in STEM, since it would use scientific engineering databases and materials information to introduce physics-based laws and principles that serve as materials design rules. While general resources for MDA education do exist, including a well-known teaching web page [74] and a materials education symposium [75], a pedagogical tool that utilizes MDA would be unique in its focus on uncovering the process-structure-property-performance relationships with the aid of materials databases and scientific information.

3. Community development

The materials community needs to build expertise in MDA in order to effectively enable data analysis, materials discovery, and process-structure-property-performance integration. Establishing a community that combines experimentalists, theorists, applied materials informaticians, and algorithm developers is critical to cultivating this deeper understanding. One example of such community development is the set of activities that recently took place in the Columbus, Ohio area, which include:

• Formation of the Columbus Collaboratory, an industry-driven effort to leverage data analytics for business data among local businesses.
• Establishment of a Data Analytics Initiative at Ohio State University, which includes the hiring of 50-60 new faculty and the development of an undergraduate major in Data Analytics.


References

[1] Materials Genome Initiative Strategic Plan, December 2014, https://www.whitehouse.gov/sites/default/files/microsites/ostp/NSTC/mgi_strategic_plan_-_dec_2014.pdf.
[2] S.R. Kalidindi and M. De Graef, Materials Data Science: Current Status and Future Outlook, Annu. Rev. Mater. Res., 45, 2015, p. 171, DOI: 10.1146/annurev-matsci-070214-020844.
[3] Infogineering, “The Differences between Data, Information and Knowledge,” http://www.infogineering.net/data-information-knowledge.htm.
[4] T. Mueller, A.G. Kusne, and R. Ramprasad, Machine Learning in Materials Science: Recent Progress and Emerging Applications, Rev. Comput. Chem. (Accepted for publication).
[5] K. Rajan, Materials Informatics: The Materials “Gene” and Big Data, Annu. Rev. Mater. Res., 45, 2015, p. 153, DOI: 10.1146/annurev-matsci-070214-021132.
[6] S.R. Kalidindi, Data Science and Cyberinfrastructure: Critical Enablers for Accelerated Development of Hierarchical Materials, Int. Mater. Rev., 60, 2015, p. 150, DOI: 10.1179/1743280414Y.0000000043.
[7] E.O. Pyzer-Knapp, C. Suh, R. Gomez-Bombarelli, J. Aguilera-Iparraguirre, and A. Aspuru-Guzik, What is High-Throughput Virtual Screening? A Perspective from Organic Materials Discovery, Annu. Rev. Mater. Res., 45, 2015, p. 195, DOI: 10.1146/annurev-matsci-070214-020823.
[8] A. Belianinov et al., Big Data and Deep Data in Scanning and Electron Microscopies: Deriving Functionality from Multidimensional Data Sets, Adv. Struct. Chem. Imaging, 1, 2015, p. 6, DOI: 10.1186/s40679-015-0006-6.
[9] T. Le et al., Quantitative Structure-Property Relationship Modeling of Diverse Materials Properties, Chem. Rev., 112, 2012, p. 2889, DOI: 10.1021/cr200066h.
[10] S.V. Kalinin, B.G. Sumpter, and R.K. Archibald, Big-deep-smart Data in Imaging for Guiding Materials Design, Nat. Mater., 14, 2015, p. 973, DOI: 10.1038/NMAT4395.
[11] S.R. Kalidindi, S.R. Niezgoda, and A.A. Salem, Microstructure Informatics using Higher-order Statistics and Efficient Data-mining Protocols, JOM, 2011, p. 34, DOI: 10.1007/s11837-011-0057-7.
[12] C. Capdevila, Neural Networks Modeling of Phase Transformations in Steels, Diffusionless Transformations, High Strength Steels, Modelling and Advanced Analytical Techniques, Vol.2 in Woodhead Publishing Series in Metals and Surface Engineering, 2012, p. 464, DOI: 10.1533/9780857096111.3.464.
[13] A.G. Kusne et al., On-the-fly Machine-learning for High-throughput Experiments: Search for Rare-Earth-Free Permanent Magnets, Sci. Rep., 4, 2014, p. 6367, DOI: 10.1038/srep06367.
[14] C.J. Long, D. Bunker, X. Li, V.L. Karen, and I. Takeuchi, Rapid Identification of Structural Phases in Combinatorial Thin-film Libraries using X-ray Diffraction and Non-negative Matrix Factorization, Rev. Sci. Instrum., 80, 2009, p. 103902, DOI: 10.1063/1.3216809.
[15] L. Huffman, J. Simmons, M. de Graef, and I. Pollak, Shape Priors for MAP Segmentation of Alloy Micrographs using Graph Cuts, Proc. IEEE Stat. Signal. Process. Workshop, 19, 2011, p. 661, DOI: 10.1109/SSP.2011.5967788.


[16] J.P. Simmons, P. Chuang, M. Comer, J.E. Spowart, M.D. Uchic, and M. de Graef, Application and Further Development of Advanced Image Processing Algorithms for Automated Analysis of Serial Section Image Data, Model. Simul. Mat. Sci., 17, 2009, p. 025002, DOI: 10.1088/0965-0393/17/2/025002.
[17] M. Comer, C.A. Bouman, M. De Graef, J.P. Simmons, Bayesian Methods for Image Segmentation, JOM, 63, 2011, p. 55, DOI: 10.1007/s11837-011-0113-3.
[18] E.B. Gulsoy, J.P. Simmons, M. De Graef, Application of joint histogram and mutual information to registration and data fusion problems in serial sectioning microstructure studies, Scr. Mater., 60, 2009, p. 381, DOI: 10.1016/j.scriptamat.2008.11.004.
[19] A. Bryden, S. Broderick, S.K. Suram, and K. Kaluskar, Interactive Visualization of APT Data at Full Fidelity, Ultramicroscopy, 132, 2013, p. 129, DOI: 10.1016/j.ultramic.2012.12.006.
[20] S. Broderick, A. Bryden, S.K. Suram, and K. Rajan, Data Mining for Isotope Discrimination in Atom Probe Tomography, Ultramicroscopy, 132, 2013, p. 121, DOI: 10.1016/j.ultramic.2013.02.001.
[21] I. Bilionis, N. Zabaras, B.A. Konomi, and G. Lin, Multi-output Separable Gaussian Process: Towards an Efficient, Fully Bayesian Paradigm for Uncertainty Quantification, J. Comp. Phys., 241, 2013, p. 212, DOI: 10.1016/j.jcp.2013.01.011.
[22] I. Bilionis and N. Zabaras, Multi-output Local Gaussian Process Regression: Applications to Uncertainty Quantification, J. Comp. Phys., 231, 2012, p. 5718, DOI: 10.1016/j.jcp.2012.04.047.
[23] J. Behler, Constructing High-dimensional Neural Network Potentials: A Tutorial Review, Int. J. Quan. Chem., 115, 2015, p. 1032, DOI: 10.1002/qua.24890.
[24] T. Mueller, E. Johlin, and J.C. Grossman, Origins of Hole Traps in Hydrogenated Nanocrystalline and Amorphous Silicon Revealed Through Machine Learning, PRB, 89, 2014, p. 115202, DOI: 10.1103/PhysRevB.89.115202.
[25] M.Z. Pesenson, S.K. Suram, and J.M. Gregoire, Statistical Analysis and Interpolation of Compositional Data in Materials Science, ACS Comb. Sci., 17, 2015, p. 130, DOI: 10.1021/co5001458.
[26] M.R. Keenan et al., Atomic-Scale Phase Composition through Multivariate Statistical Analysis of Atom Probe Tomography Data, Microsc. Microanal., 17, 2011, p. 418, DOI: 10.1017/S1431927611000353.
[27] T. Philippe, M. Gruber, F. Vurpillot, and D. Blavette, Clustering and Local Magnification Effects in Atom Probe Tomography: A Statistical Approach, Microsc. Microanal., 16, 2010, p. 643, DOI: 10.1017/S1431927610000449.
[28] P.G. Kotula, M.R. Keenan, and J.R. Michael, Tomographic Spectral Imaging with Multivariate Statistical Analysis: Comprehensive 3D Microanalysis, Microsc. Microanal., 12, 2006, p. 36, DOI: 10.1017/S1431927606060193.
[29] B. Vekemans, L. Vincze, F.E. Brenker, and F. Adams, Processing of Three-dimensional Microscopic X-ray Fluorescence Data, J. Anal. At. Spectrom., 19, 2004, p. 1302, DOI: 10.1039/B404300F.
[30] T. Philippe et al., Clustering and Nearest Neighbour Distance in Atom-Probe Tomography, Ultramicroscopy, 109, 2009, p. 1304, DOI: 10.1016/j.ultramic.2009.06.007.

[31] K. Artyushkova, J. Fenton, J. Farrar and J.E. Fulghum, Multitechnique Fusion of Imaging Data for Heterogeneous Materials, in Image Fusion and Its Applications, Y. Zheng (Ed.), 2011, DOI: 10.5772/16903.
[32] K. Artyushkova, J.O. Farrar and J.E. Fulghum, Data Fusion of XPS and AFM Images for Chemical Phase Identification in Polymer Blends, Surf. Interface Anal., 41, 2009, p. 119, DOI: 10.1002/sia.2968.
[33] D. Biagioni, R.L. Graham, D.S. Albin, W.B. Jones, and C. Suh, Analysis of Governing Factors for Photovoltaic Loss Mechanism of n-CdS/p-CdTe Heterojunction via Multi-way Data Decomposition, Prog. Photovolt: Res. Appl., 23, 2015, p. 49, DOI: 10.1002/pip.2394.
[34] C. Suh et al., Visualization of High-Dimensional Combinatorial Catalysis Data, J. Comb. Chem., 11, 2009, p. 385, DOI: 10.1021/cc800194j.
[35] C. Suh et al., Exploring High-Dimensional Data Space: Identifying Optimal Process Conditions in Photovoltaics, the 37th IEEE PVSC paper, 2011, NREL/CP-2C00-50693, DOI: 10.1109/PVSC.2011.6186065.
[36] H. Xu, A. Choudhary, R. Liu, W. Chen, A Machine Learning-based Design Representation Method for Designing Heterogeneous Microstructures, J. Mech. Des., 137, 2015, p. 051403, DOI: 10.1115/1.4029768.
[37] H. Xu, X. Liu, C. Brinson, and W. Chen, A Descriptor-based Design Methodology and Materials Informatics for Developing Heterogeneous Microstructural Materials System, J. Mech. Des., 136, 2014, p. 051007, DOI: 10.1115/1.4026649.
[38] W.W. Tipton, C. R. Bealing, K. Mathew, and R.G. Hennig, Structures, Phase Stabilities, and Electrical Potentials of Li-Si Battery Anode Materials, PRB, 87, 2013, p. 184114, DOI: 10.1103/PhysRevB.87.184114.
[39] C. Ortiz, O. Eriksson, and M. Klintenberg, Data Mining and Accelerated Electronic Structure Theory as a Tool in the Search for New Functional Materials, Comp. Mat. Sci., 44, 2009, p. 1042, DOI: 10.1016/j.commatsci.2008.07.016.
[40] T. D. Huan, A. Mannodi-Kanakkithodi, and R. Ramprasad, Accelerated Materials Property Predictions and Design Using Motif-based Fingerprints, PRB, 92, 2015, p. 014106, DOI: 10.1103/PhysRevB.92.014106.
[41] B. Meredig, A. Agrawal, S. Kirklin, J.E. Saal, J.W. Doak, A. Thompson, K. Zhang, A. Choudhary, and C. Wolverton, Combinatorial Screening for New Materials in Unconstrained Composition Space with Machine Learning, PRB, 89, 2014, p. 094104, DOI: 10.1103/PhysRevB.89.094104.
[42] H. Dam, T. Pham, T. Ho, A. Nguyen, and V. Nguyen, Data Mining for Materials Design: A Computational Study of Single Molecule Magnet, J. Chem. Phys., 140, 2014, p. 044101, DOI: 10.1063/1.4862156.
[43] C.C. Fischer, K.J. Tibbetts, D. Morgan, and G. Ceder, Predicting Crystal Structure by Merging Data Mining with Quantum Mechanics, Nat. Mater., 5, 2006, p. 641, DOI: 10.1038/nmat1691.
[44] G. Hautier, C.C. Fischer, A. Jain, T. Mueller, G. Ceder, Finding Nature’s Missing Ternary Oxide Compounds Using Machine Learning and Density Functional Theory, Chem. Mat., 22, 2010, p. 3762, DOI: 10.1021/cm100795d.

[45] R. Jalem, T. Aoyama, M. Nakayama, and M. Nogami, Multivariate Method-Assisted Ab Initio Study of Olivine-Type LiMXO4 (Main Group M2+-X5+ and M3+-X4+) Compositions as Potential Solid Electrolytes, Chem. Mater., 24, 2012, p. 1357, DOI: 10.1021/cm3000427.
[46] G. Hautier, C. Fischer, V. Ehrlacher, A. Jain and G. Ceder, Data Mined Ionic Substitutions for the Discovery of New Compounds, Inorg. Chem., 50, 2011, p. 656, DOI: 10.1021/ic102031h.
[47] I.E. Castelli and K.W. Jacobsen, Designing Rules and Probabilistic Weighting for Fast Materials Discovery in the Perovskite Structure, Modelling Simul. Mater. Sci. Eng., 22, 2014, p. 055007, DOI: 10.1088/0965-0393/22/5/055007.
[48] J.M. Cole et al., Data Mining with Molecular Design Rules Identifies New Class of Dyes for Dye-Sensitised Solar Cells, Phys. Chem. Chem. Phys., 16, 2014, p. 26684, DOI: 10.1039/C4CP02645D.
[49] C. Kong, P. Villars, S. Iwata, and K. Rajan, Mapping the ‘materials gene’ for Binary Intermetallic Compounds – A Visualization Schema for Crystallographic Databases, Comp. Sci. Disc., 5, 2012, p. 015004, DOI: 10.1088/1749-4699/5/1/015004.
[50] O. Isayev, D. Fourches, E. N. Muratov, C. Oses, K. Rasch, A. Tropsha, and S. Curtarolo, Materials Cartography: Representing and Mining Materials Space using Structural and Electronic Fingerprints, Chem. Mater., 27, 2015, p. 735, DOI: 10.1021/cm503507h.
[51] L.A. Baumes et al., Using Genetic Programming for an Advanced Performance Assessment of Industrially Relevant Heterogeneous Catalysts, Mat. Manufac. Proc., 24, 2009, p. 282, DOI: 10.1080/10426910802679196.
[52] W. Park, S. Singh, M. Kim, and K. Sohn, Phosphor Informatics Based on Confirmatory Factor Analysis, ACS Comb. Sci., 17, 2015, p. 317, DOI: 10.1021/acscombsci.5b00017.
[53] D. A. Carr, M. Lach-hab, S. Yang, I.I. Vaisman, E. Blaisten-Barojas, Machine Learning Approach for Structure-based Zeolite Classification, Micropor. Mesopor. Mat., 117, 2009, p. 339, DOI: 10.1016/j.micromeso.2008.07.027.
[54] M. Rupp, A. Tkatchenko, K. Muller, and O. A. von Lilienfeld, Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning, PRL, 108, 2012, p. 058301, DOI: 10.1103/PhysRevLett.108.058301.
[55] R. Liu, A. Kumar, Z. Chen, A. Agrawal, V. Sundararaghavan and A. Choudhary, A Predictive Machine Learning Approach for Microstructure Optimization and Materials Design, Sci. Rep., 5, 2015, p. 11551, DOI: 10.1038/srep11551.
[56] O. Abuomar, S. Nouranian, R. King, T.M. Ricks, T.E. Lacy, Comprehensive Mechanical Property Classification of Vapor-Grown Carbon Nanofiber/Vinyl Ester Nanocomposites using Support Vector Machines, Comp. Mat. Sci., 99, 2015, p. 316, DOI: 10.1016/j.commatsci.2014.12.029.
[57] O. Abuomar, S. Nouranian, R. King, and T. Lacy, On Materials Informatics and Knowledge Discovery: Mechanical Characterization of Vapor-Grown Carbon Nanofiber/Vinyl Ester Nanocomposites, Shechtman Int. Symp. 2014.
[58] O. Abuomar et al., Data Mining and Knowledge Discovery in Materials Science and Engineering: A Polymer Nanocomposites Case Study, Adv. Eng. Infor., 27, 2013, p. 615, DOI: 10.1016/j.aei.2013.08.002.

[59] K. Fujimura et al., Accelerated Design of Lithium Superionic Conductors Based on First-Principles Calculations and Machine Learning Algorithms, Adv. Energy. Mater., 3, 2013, p. 980, DOI: 10.1002/aenm.201300060.
[60] Y. Li, Predicting Materials Properties and Behavior Using Classification and Regression Trees, Mat. Sci. Eng. A, 433, 2006, p. 261, DOI: 10.1016/j.msea.2006.06.100.
[61] P.V. Balachandran, J. Theiler, J.M. Rondinelli, and T. Lookman, Materials Prediction via Classification Learning, Sci. Rep., 5, 2015, p. 13285, DOI: 10.1038/srep13285.
[62] G. Pilania, C. Wang, X. Jiang, S. Rajasekaran, R. Ramprasad, Accelerating Materials Property Predictions Using Machine Learning, Sci. Rep., 3, 2013, p. 2810, DOI: 10.1038/srep02810.
[63] Doreswamy, K.S. Hemanth, C. M. Vastrad, and S. Nagaraju, Data Mining Technique for Knowledge Discovery from Engineering Materials Data Sets, in Advances in Computer Science and Information Technology, CCIS, 131, 2011, p. 512, DOI: 10.1007/978-3-642-17857-3_50.
[64] Doreswamy and K.S. Hemanth, Mining Knowledge from Engineering Materials Database for Data Analysis, SocPros, the series Advances in Intelligent Systems and Computing, 236, 2012, p. 1217, DOI: 10.1007/978-81-322-1602-5_127.
[65] L. Xu, L. Wencong, J. Shenli, L. Yawei, C. Nianyi, Support Vector Regression Applied to Materials Optimization of Sialon Ceramics, Chem. Int. Lab. Sys., 82, 2006, p. 84, DOI: 10.1016/j.chemolab.2005.08.011.
[66] A. Seko, T. Maekawa, K. Tsuda, and I. Tanaka, Machine Learning with Systematic Density-functional Theory Calculations: Application to Melting Temperatures of Single- and Binary-component Solids, PRB, 89, 2014, p. 054303, DOI: 10.1103/PhysRevB.89.054303.
[67] M. Fernandez, P.G. Boyd, T.D. Daff, M.Z. Aghaji, and T.K. Woo, Rapid and Accurate Machine Learning Recognition of High Performing Metal Organic Frameworks for CO2 Capture, J. Phys. Chem. Lett., 5, 2014, p. 3056, DOI: 10.1021/jz501331m.
[68] M. d’Avezac, R. Botts, M.J. Mohlenkamp, and A. Zunger, Learning to Predict Physical Properties Using Sums of Separable Functions, SIAM J. Sci. Comput., 33, 2011, p. 3381, DOI: 10.1137/100805959.
[69] G. Pilania, J.E. Gubernatis, and T. Lookman, Structure Classification and Melting Temperature Prediction in Octet AB Solids via Machine Learning, PRB, 91, 2015, p. 214302, DOI: 10.1103/PhysRevB.91.214302.
[70] E. Bélisle, Z. Huang, and A. Gheribi, Scalable Gaussian Process Regression for Prediction of Materials Properties, H. Wang and M.A. Sharaf (Eds.), ADC 2014, LNCS 8506, 2014, p. 38, DOI: 10.1007/978-3-319-08608-8_4.
[71] A. Chauhan and R. Vaish, Hard Coating Materials Selection Using Multi-Criteria Decision Making, Mater. Design, 44, 2013, p. 240, DOI: 10.1016/j.matdes.2012.08.003.
[72] P.D. Deshpande, B. P. Gautham, A. Cecen, S. Kalidindi, A. Agrawal, A. Choudhary, Application of Statistical and Machine Learning Techniques for Correlating Properties to Composition and Manufacturing Processes of Steels, 2nd World Congress on Integrated Computational Materials Engineering Proceeding, 2013, DOI: 10.1002/9781118767061.ch25.

[73] O. Wodo, S. Tirthapura, S. Chaudhary, B. Ganapathysubramanian, A Graph-based Formulation for Computational Characterization of Bulk Heterojunction Morphology, Org. Electron., 13, 2012, p. 1105, DOI: 10.1016/j.orgel.2012.03.007.
[74] http://www.matter.org.uk/
[75] http://www.materials-education.com
