CHEMICAL SCIENCE SYMPOSIUM 2020

How can machine learning and autonomy accelerate chemistry?

29 – 30 September 2020
Online event

Fundamental questions
Elemental answers

Meeting Information

Chemical Science Symposium 2020: How can machine learning and autonomy accelerate chemistry? is organised and hosted online by the Royal Society of Chemistry.

This e-book contains abstracts of the posters presented at the Chemical Science Symposium 2020. All abstracts are produced directly from typescripts supplied by authors. Copyright reserved.

All sessions, including the posters, are available to access via the virtual lobby. Further information on how to join the meeting and best practice for an online event is detailed in the joining instructions.

Networking sessions

There will be regular breaks throughout the meeting for socialising, networking and continuing discussions started during the scientific sessions. During the networking sessions you will be able to join existing networking rooms or initiate one-to-one chats. Existing networking rooms will be visible from the virtual lobby.

To create a one-to-one chat, simply click on the name of the person you would like to speak to and select whether you would like a private or public conversation. Other delegates can join a public conversation.

On the web version, you can only be in one session at a time (this includes networking rooms).

Posters

Posters have been numbered consecutively.

The posters will be available to view throughout the discussion by clicking on the link in the virtual lobby. The dedicated poster sessions will take place on:

29 September 16:00 – 17:30
30 September 13:20 – 14:05

During these times, the presenters will be available for live chat. Outside of these times, you can send a direct message to the authors, who will respond when available.

Networking at the Chemical Science Symposium

One of the primary aims of the Chemical Science symposia series is to act as a platform to bring researchers together with the intention of facilitating the sharing of knowledge and experience and to provide opportunities for delegates to both strengthen existing connections and forge new ones. We strongly encourage you to connect throughout the symposium. In addition to general networking opportunities, we have also created some optional discussion sessions and breakout rooms where you can meet and interact with our speakers, editors and members of RSC staff.

Meet our speakers – a chance for general networking with our speakers

Meet our Chemical Science Editors – a chance for networking with some of the Chemical Science Editorial Board and team in the Editorial Office. This would be a great opportunity to talk with our team about what they look for in a paper, the peer review process or to discuss publishing in general.

RSC Digital Futures – In September 2019, we held our first Strategic Advisory Forum, inviting 14 experts from different scientific fields and sectors to discuss the long-term promise of and concerns about the use of data and digital technologies for scientific discovery. The output of this Forum was published earlier this year as our Digital Futures Report.

We are looking to continue this discussion with you at the Chemical Science symposium. A breakout room will be open throughout the meeting where we are asking you to share your thoughts on the tools and techniques that can be used within the chemical sciences to solve new problems and what the future challenges may be. Your comments will be summarised during the closing remarks and will feature in a blog post following the meeting.

We will also hold a chat-based discussion session, hosted by our Co-Chairs Andrew Cooper and Alán Aspuru-Guzik, where we invite you to share your experiences.

RSC Technology: Author Experience Project

Get to know more about our exclusive products and services. Discover more about how our manuscript tracker and submission system can make the publication process smoother for you. A breakout room will be open throughout the meeting where you can watch a short presentation on these author tools. We also ask delegates to give feedback on these tools via some interactive polls.

Some of the team behind the development of these products and services – our User Experience Researcher Sara Braganca and User Experience Designer Ian Parry – will also host a discussion session during the symposium. Discover more about how the RSC develops the solutions that are available to you and let us know about any ideas or feedback you may have for us. We really want to hear your thoughts on your experiences with the publishing process.

Finally, help shape new and exclusive RSC products and services. Join our one-to-one sessions and speak to Sara and Ian about our new ideas. You will be among the first to see new features and comment on them. To sign up for a one-to-one session, please email them in advance of the meeting at userexperience@rsc.org or connect with Sara Braganca or Ian Parry during the meeting in the InEvent conferencing software to arrange a suitable time. For reference, the schedule of these networking sessions is as follows:

Networking sessions, 29th September

13:00-17:30 RSC Digital Futures: Breakout Room & Discussion Session
13:00-17:30 RSC Technology: Author Experience Breakout Room
13:00-17:30 Poster Session 1
16:00-16:30 Meet the Speakers: G. Day, K. Thurow & J. Schrier
16:00-16:30 Meet the Industry Speakers: J. Becker & M. J. Nieves Remacha
16:30-17:00 Meet our Chemical Science Editors

Networking Sessions, 30th September

12:00-14:05 RSC Digital Futures: Breakout Room
12:00-14:05 RSC Technology: Author Experience Breakout Room & Discussion Session
13:35-14:05 Meet the Speakers: L. Cronin, J. Cole & Y. Jung
12:00-15:45 Poster Session 2

Introduction

Dear Colleagues,

A warm welcome to our Chemical Science Symposium for 2020, the second in our symposia series. Under the banner of our flagship journal, Chemical Science, we are shining a spotlight on cutting-edge chemistry research and giving it the sort of attention that drives scientific progress.

While the ongoing COVID-19 pandemic has moved this symposium from our historical home in London, Burlington House, to a virtual setting, our connection with the Chemical Science community remains as strong as always. As with the previous iteration of this symposium, we encourage strong participation from early career scientists and future leaders, who will be the future of this discipline. So we are dedicating significant time in the programme to our poster and networking breaks.

The Chemical Science Symposium provides a way for our wider community to regularly stay in touch with the journal editors and fellow researchers across a broad range of topics in the chemical sciences. It is also a great way of staying up to date with what’s happening in the journal, from technical developments to the most exciting research coming your way in the future, and we’ll be providing more details through breakout sessions in the networking breaks.

The focus of this Symposium is the acceleration of chemistry through automation and machine learning. This topic was the focus of the Royal Society of Chemistry’s first Strategic Advisory Forum, held in September 2019, the results of which were recently published as our Digital Futures report (http://rsc.li/digital-futures). Advances in artificial intelligence, robotics and automation technologies, machine learning, and modelling and simulation are revolutionising scientific discovery, allowing scientists to advance their research at an unprecedented rate. Using these techniques and methods, researchers are able to gather results more quickly and analyse complex data more efficiently.

Each of the speakers in this symposium is an international expert in their area of research, with the topics covered demonstrating the applicability of this topic to a diverse range of problems in the chemical sciences. The speakers will describe the application of machine learning and automation to materials design and the discovery of new materials and drugs, the role of self-driving laboratories for molecular synthesis and the utilisation of these techniques within industry.

We hope that the lectures and poster presentations will stimulate the exchange of ideas and experiences between all participants, setting a strong platform for discussion. As such we strongly encourage delegates to raise questions to our speakers and poster presenters during the discussion sessions and throughout our dedicated poster and networking breaks as well. We’d like to thank each of the speakers, poster presenters and participants for all their contributions.

Again, welcome to what promises to be an exciting symposium. We hope that this event will act as a springboard for future activities and that it will help in fostering new research collaborations.

Professor Alán Aspuru-Guzik, University of Toronto, Associate Editor, Chemical Science
Professor Andrew Cooper, University of Liverpool, Editor-in-Chief, Chemical Science
Dr May Copsey, Royal Society of Chemistry, Executive Editor, Chemical Science

Committee

Alán Aspuru-Guzik (Co-Chair), University of Toronto, Canada
Andrew Cooper (Co-Chair), University of Liverpool, United Kingdom
Luis M Campos, Columbia University, United States
Kim E Jelfs, Imperial College London, United Kingdom
Andrei Yudin, University of Toronto, Canada

Invited Speakers

Jill Becker, Kebotix, United States
Jacqueline Cole, University of Cambridge, United Kingdom
Lee Cronin, University of Glasgow, United Kingdom
Graeme Day, University of Southampton, United Kingdom
Yousung Jung, KAIST, South Korea
María José Nieves Remacha, Eli Lilly and Company, Spain
Joshua Schrier, Fordham University, United States
Kerstin Thurow, University of Rostock, Germany

Programme

Day 1 (Times given for UK BST)

13:00 Welcome and introduction to day 1

Session 1: Chair - Alán Aspuru-Guzik, University of Toronto, Canada

13:10 Inv. 1 Efficient exploration of solid state chemical space using machine learning
Yousung Jung, KAIST, South Korea
13:30 Inv. 2 Building a computational engine to guide the autonomous discovery of molecular materials
Graeme Day, University of Southampton, UK

13:50 Session 1 discussion
14:20 Break

Session 2: Chair - Luis Campos, Columbia University, USA

14:35 Inv. 3 Autonomous materials discovery: promise, pitfalls, and progress
Joshua Schrier, Fordham University, USA
14:55 Inv. 4 Accelerating materials innovation: discovery of electrochromic materials for smart windows
Jill Becker, Kebotix, USA

15:15 Session 2 discussion
15:45 Break

Poster Session & Networking Rooms

16:00 – 17:30 Poster session 1 and General networking
16:00 Networking Sessions 1
16:30 Networking Sessions 2
17:30 Close

Day 2 (Times given for UK BST)

12:00 Welcome and introduction to day 2

Session 3: Chair - Kim Jelfs, Imperial College London, UK

12:10 Inv. 5 Accelerating materials discovery with data mining and machine learning
Jacqueline Cole, University of Cambridge, UK
12:30 Inv. 6 The chemical oracle
Lee Cronin, University of Glasgow, UK

12:50 Session 3 discussion
13:20 Break

Poster Session & Networking Rooms

13:20 – 14:05 Poster session 2 and General networking
13:35 Networking Sessions 3

Session 4: Chair - Andrei Yudin, University of Toronto, Canada

14:20 Inv. 7 Suitable automation systems for accelerating chemical research
Kerstin Thurow, University of Rostock, Germany
14:40 Inv. 8 Autonomous chemical synthesis in flow for drug discovery
María José Nieves Remacha, Eli Lilly and Company, Spain

15:00 Session 4 discussion
15:30 Closing remarks and poster awards
15:45 Close of symposium

Registered charity number: 207890

Invited speakers

Graeme Day University of Southampton, United Kingdom

Graeme Day is Professor of Chemical Modelling at the University of Southampton. His research concerns the development of computational methods for modelling the organic molecular solid state. A key focus of this work is the prediction of crystal structures from first principles; his research group applies these methods in a range of applications, including pharmaceutical solid form screening, NMR crystallography and computer-guided discovery of functional materials.

After a PhD in computational chemistry at University College London, he spent 10 years at the University of Cambridge, where he held a Royal Society University Research Fellowship working mainly on modelling pharmaceutical materials and computational interpretation of terahertz spectroscopy. He moved to the University of Southampton in 2012, at which time he was awarded a European Research Council Starting Grant for the 'Accelerated design and discovery of novel molecular materials via global lattice energy minimisation' (ANGLE). This grant shifted the focus of his research to functional materials, including porous crystals and organic electronics. In 2020, he was awarded an ERC Synergy grant 'Autonomous Discovery of Advanced Materials' (ADAM) with Andrew Cooper (Liverpool) and Kerstin Thurow (Rostock) to integrate computational predictions and chemical space exploration with automation in the materials discovery lab.

Graeme has served on the editorial board of CrystEngComm and on the advisory board of Molecular Systems Design & Engineering (MSDE).

Lee Cronin University of Glasgow, United Kingdom

Leroy (Lee) Cronin FRSE is the Regius Professor of Chemistry in Glasgow. Prizes include 2019 Japan Society of Coordination Chemistry International Prize, 2018 ACS Inorganic Lectureship, 2018 RSC Interdisciplinary Prize, 2015 RSC Tilden Prize, 2013 BP/RSE Hutton Prize, 2012 RSC Corday Morgan, 2011, Election to the Royal Society of Edinburgh in 2009. His research has four main aims: 1) the construction of an artificial life form / working out how inorganic chemistry transitioned to biology / searching for new life forms; 2) the digitization of chemistry; 3) the use of artificial intelligence in chemistry, including the construction of ‘wet’ chemical computers; and 4) the exploration of complexity and information in chemistry. He runs a team of around 60 people funded by grants from the UK EPSRC, US DARPA, Templeton, Google, BAe, JM.

Kerstin Thurow University of Rostock, Germany

Kerstin Thurow (Member, IEEE) received the Habilitation degree in automation and control from the University of Rostock, Rostock, Germany, in 1999. Since 2003, she has been the CEO of the Center for Life Science Automation, University of Rostock, where she has held the Chair of Automation Technologies/Life Science Automation since 2004. She has authored more than 190 papers in journals and conferences. Her major research interests include life science automation, medical automation, mobile robotics, and automated analytical measurement.

Yousung Jung KAIST, South Korea

Yousung Jung is a Professor of Chemical and Biomolecular Engineering at KAIST. He received his Ph.D. in Theoretical Chemistry from the University of California, Berkeley with Martin Head-Gordon. After postdoctoral work at Caltech with Rudy Marcus, he joined the faculty at KAIST in 2009. His research interests involve electronic structure theory, statistical modeling, and machine learning to develop efficient methods for fast and accurate simulations of complex molecular and materials systems, and their applications to understanding and inverse design problems in chemistry and materials science. He is the recipient of the Pople Medal (2018, Asia-Pacific Association of Theoretical and Computational Chemists), Korean Young Physical Chemist Award (2017), Chemical Society of Japan Distinguished Lectureship Award (2015), and KCS-Wiley Young Chemist Award (2013).

Joshua Schrier Fordham University, United States

Joshua Schrier is the Kim B. and Stephen E. Bepler Professor of Chemistry at Fordham University in New York City. The central theme of his research is the use of computers to accelerate the discovery of new materials, using a combination of physics-based simulations, cheminformatics, machine learning, and automated experimentation; current projects focus on halide perovskites and amine-templated metal oxides as example materials. He is also deeply committed to undergraduate chemistry education and is the author of the textbook "Introduction to Computational Physical Chemistry" (2017).

Prof. Schrier received his doctoral degree in theoretical physical chemistry from the University of California, Berkeley (with K. Birgitta Whaley), and was the Alvarez Computational Sciences Postdoctoral Fellow at Lawrence Berkeley National Laboratory (with Lin-Wang Wang). Prior to joining Fordham in 2018, he was on the faculty at Haverford College, where he served as Chemistry Department Chair and coordinator of the Scientific Computing program. He has received awards including the Dreyfus Teacher-Scholar award (2014) and U.S. Department of Energy Visiting Faculty Award (2017).

Jacqueline Cole University of Cambridge, United Kingdom

Professor Jacqueline Cole holds the Royal Academy of Engineering Research Professorship in Materials Physics at the University of Cambridge, where she is Head of Molecular Engineering. She concurrently holds the BASF / Royal Academy of Engineering Research Chair in Data-driven Molecular Engineering of Functional Materials. This is partly funded by the ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Oxfordshire, UK, with whom she holds a joint appointment. At Cambridge, she carries a joint appointment between the Physics Department (Cavendish Laboratory) and the Department of Chemical Engineering and Biotechnology.

Her research combines artificial intelligence with data science, computational methods and experimental research to afford a 'design-to-device' pipeline for data-driven materials discovery.

Her research has been recognised by the Royal Society Clifford Paterson Medal and Lecture 2020; the BASF / Royal Academy of Engineering Research Chair and Senior Research Fellowship in Data-driven Molecular Engineering of Functional Materials (2018-2023); the 1851 Royal Commission 2014 Fellowship in Design (2015-8); a Fulbright Award (all disciplines Scholar, 2013-4); an ICAM Senior Scientist Fellowship (2013-4); the Vice-Chancellor's Research Chair, University of New Brunswick, Canada (2008-2013); a Royal Society University Research Fellowship (2001-11); a Senior Research Fellowship (2002-2009) and Junior Research Fellowship (1999-2002) from St Catharine’s College, Cambridge, UK; the Royal Society of Chemistry SAC Silver Medal and Lecture (2009); the Brian Mercer Feasibility Award (2007); the 18th Franco-British Science prize (2006); and the first British Crystallographic Association Chemical Crystallography Prize (2000).

María José Nieves Remacha Eli Lilly and Company, Spain

María José Nieves Remacha received her PhD (2014) and MS (2009) from Massachusetts Institute of Technology, where she worked with Klavs F. Jensen in scaling up multiphase continuous flow chemistries from micro to milli scales, from both an experimental and computational fluid dynamics perspective. After that she worked for The Dow Chemical Company in Core R&D (Freeport, TX), providing technical expertise in reaction engineering and revealing non-obvious process insights through modeling, optimization, and statistics.

In 2016, she joined Eli Lilly and Company (Spain) to work in the Discovery Flow Chemistry Group and later in the Innovation and Technology Group implementing novel technologies, smart experimentation, and data analysis strategies to optimize processes in chemical synthesis. Her research interests are at the interface of computer science, engineering and chemistry, including accelerating drug discovery through artificial intelligence and building automated laboratory systems with increasing levels of autonomy.

Jill Becker Kebotix, United States

Building a computational engine to guide the autonomous discovery of molecular materials

Graeme M. Day University of Southampton, United Kingdom

The talk will describe the vision for the development of a computational framework to guide the discovery of functional crystalline molecular materials. Many properties of interest, such as porosity or charge transport, are strongly influenced by the arrangement of molecules in the solid state. However, the structures of molecular crystals are often determined by a delicate balance of weak, competing interactions. Therefore, small changes in molecular structure can lead to large changes in crystal packing, and empirical rules for predicting the arrangement of molecules often fail. Thus, our approach to the computational discovery of materials is built around the core technology of crystal structure prediction (CSP), which has developed into a reliable tool for exploring the likely crystal structures associated with a given molecule. When targeting a given function, we assess the relevant properties for this landscape of potential crystal structures, providing what we call an energy-structure-function (ESF) map [1].

After discussing recent examples of how CSP and ESF maps have been used to guide experimental programmes for materials discovery, I will describe how we have started to integrate CSP with methods for chemical space exploration, principally using population-based evolutionary approaches [2]. This presents several challenges before we can fully automate the exploration of the joint chemical-crystal structure space, including the need to accelerate the CSP methods and to automate the interpretation of ESF maps for identifying the best target molecules for synthesis and characterisation. These are areas where we have made use of supervised and unsupervised machine learning [3,4], which will be described.

References
1. Functional materials discovery using energy–structure–function maps, A. Pulido et al., Nature 2017, 543, 657.
2. Evolutionary chemical space exploration for functional materials: computational organic semiconductor discovery, C. Y. Cheng, J. E. Campbell and G. M. Day, Chem. Sci. 2020, 11, 4922-4933.
3. Machine learning for the structure–energy–property landscapes of molecular crystals, F. Musil, S. De, J. Yang, J. E. Campbell, G. M. Day and M. Ceriotti, Chem. Sci. 2018, 9, 1289-1300.
4. Machine-Learned Fragment-Based Energies for Crystal Structure Prediction, D. McDonagh, C.-K. Skylaris and G. M. Day, J. Chem. Theory Comput. 2019, 15, 2743–2758; Multi-fidelity Statistical Machine Learning for Molecular Crystal Structure Prediction, O. Egorova, R. Hafizi, D. C. Woods and G. M. Day, ChemRxiv preprint 2020, https://doi.org/10.26434/chemrxiv.12407831.v1.

The chemical oracle

Lee Cronin University of Glasgow, United Kingdom

We outline robotic systems, driven by chemical intelligence algorithms, designed to search for new reactivity, reactions, and molecules. A programmable robotic discovery system has been built that can run the reactions and analysis. To program the robot, we will exploit the world’s first domain-specific chemical programming language (XDL) to produce the code. To ensure wide usability, the XDL will be automatically generated for this architecture using a natural language recognition system that converts the design of experiments from batch chemical protocols. This system will handle the programming of robots so expert chemists can focus on generating discovery experiments, allowing the exploration of chemical space. ChemTrek will achieve autonomy by combining the discovery language, modular hardware and sensors including portable NMR, mass spectrometry, and UV/vis, IR and Raman spectroscopy. These sensors will give real-time feedback from the reaction outputs to select new inputs, including new process conditions. Correlations between the input chemicals and process conditions will be developed using statistical methods, allowing bias-free exploration, aiming for new discoveries using a minimum set of 'complexity-reagents', that is, a universal reaction system aiming to cover as much chemical space as is practical while minimizing the need for a large number of inputs. New findings will be translated into code and validated before conversion into reaction rules so the new discoveries can generate novel methodologies. In addition, ChemTrek will aim to set a new standard for chemical-information-interchange using the newly established XDL language. This will allow chemists to version their discoveries and share verified reaction protocols, as well as failed experiments. We aim to explore organic, inorganic, and supramolecular chemistry, focussing on novel transformations, new properties and insights.

Suitable automation systems for accelerating chemical research

Kerstin Thurow University of Rostock, Germany

The development of new chemical materials or chemical reactions is often a very time-consuming process.

Machine learning methods that enable either a prediction of material properties or an adaptation of the reactions to be carried out depending on different parameters are innovative instruments to accelerate and intensify the development processes.

To exploit the resulting potential fully, suitable automation strategies are required. The automation of chemical reactions is still not state of the art. Semi-automated synthesis systems and proprietary automation systems currently dominate.

The lecture describes difficulties and challenges in the automation of chemical syntheses as well as requirements for suitable automation concepts. In addition to the implementation of the material flow within the automation system, the connection between the automation system and machine learning software is also discussed.

Efficient exploration of solid state chemical space using machine learning

Yousung Jung Department of Chemical and Biomolecular Engineering, KAIST, South Korea

Discovery of a new material with desired properties is the ultimate goal of materials research. To date, a generally successful strategy has been to use chemical intuition and empirical rules to design new materials, but these conventional approaches require a significant amount of time and cost due to the almost unlimited combinatorial possibilities of inorganic materials in chemical space. A promising way to significantly accelerate this process is to incorporate all available knowledge and data to plan the synthesis of the next material. In this talk, I will present several directions for using informatics to efficiently explore materials chemical space. I will first describe machine learning methods for fast and reliable predictions of materials properties that can replace density functional calculations, an essential component needed for large scale materials design. With these tools in place for property evaluation, I will next present a few initial frameworks that we have recently developed to allow generative inverse design of inorganic crystals with optimal target properties, either in the compositional space or structural space. I will finally discuss several challenges and opportunities that lie ahead for further development of accelerated materials platforms, including synthesizability of inorganic crystals.

Autonomous materials discovery: promise, pitfalls, and progress

Joshua Schrier Fordham University, United States

Automated materials synthesis and characterization, in which algorithms use robots as “hands” and “eyes” in the laboratory, may sound like a science fiction story. We’ve been working to turn this dream into reality for organic-inorganic hybrid materials synthesis. In this talk, I will describe our Robotic-Accelerated Perovskite Investigation and Discovery (RAPID) system. The first generation of RAPID uses inverse temperature crystallization (ITC) to grow halide perovskite single crystals for X-ray structure determination and bulk characterization using commercial liquid handling robots. The second iteration of RAPID uses antisolvent vapor diffusion, expanding the types of chemical processes we can study. Experiment plans for the syntheses are contributed remotely, by both human scientists and algorithms trained on the reaction data, facilitated by our ESCALATE (Experiment Specification, Capture and Laboratory Automation Technology) data management software, which captures and processes the comprehensive data and metadata collected during experiments into a form suitable for machine learning. Incoming data collected by ESCALATE is used to automatically train machine learning models, evaluate model performance and feature influence, and quantify reproducibility. A live web dashboard communicates these insights to the scientist and management in visual form, improving transparency and completeness of reported results. I will describe several case studies about how we have used this system to extract new scientific insights, enhance the replicability of our scientific work, and reduce the time to discovery, as well as describe some of the obstacles we have encountered.

Accelerating materials discovery with data mining and machine learning

Jacqueline Cole University of Cambridge, United Kingdom

Large-scale data-mining workflows are increasingly able to successfully predict new chemicals that possess a targeted functionality. The success of such materials discovery approaches is nonetheless contingent upon having the right data source to mine, adequate supercomputing facilities and machine-learning workflows to calculate or sample a large range of materials, and algorithms that suitably encode structure-function relationships as data-mining workflows which progressively shortlist data toward the prediction of a lead material for experimental validation.

This talk shows how to meet these data-science requirements via 'chemistry-aware' natural language processing, image recognition and machine learning developments, using case studies to showcase their successful application to data-driven materials discovery.

Autonomous Chemical Synthesis in Flow for Drug Discovery

María José Nieves Remacha Eli Lilly and Company, Spain

Chemical synthesis for drug discovery has traditionally been carried out by chemists in batch reactors. In recent years, flow chemistry has been adopted by the pharmaceutical industry as an efficient and safe technology enabling forbidden chemistries and improving reaction outcomes. Most recently, flow chemistry platforms developed at research institutes and universities have been progressively incorporating higher levels of automation and autonomy, with feedback loops and self-optimization algorithms, with the aim of reducing the level of human intervention and enabling chemists to focus on more valuable activities.

In this presentation we describe Eli Lilly Discovery Chemistry's journey with automated flow chemistry platforms, discussing elements and challenges involved in building an autonomous chemical synthesis platform in flow. In addition, we present examples where using design-of-experiments to generate data and multivariate regression to produce multidimensional maps helped maximize product yields. We also present the use of dynamic (transient) multivariate experiments as an alternative to the traditional steady-state approach to increase data throughput per experiment, and illustrate the use of reinforcement learning to guide the optimization of reaction conditions.

Poster presentations

P60 Mechanistic insights into the metal catalyst deactivation from biogenic impurities in integrated bio and chemo catalytic processes Haseena KV Indian Institute of Technology Delhi, India

P01 Olympus: a benchmarking framework for noisy optimisation and experiment planning Matteo Aldeghi Vector Institute for Artificial Intelligence, Canada

P02 Predicting structure zone diagrams for thin film synthesis by generative machine learning Lars Banko Ruhr-University Bochum, Germany

P03 An active-learning strategy to search for the optimum dopant distribution and spin multiplicity in nanoparticles – the case of Ce(13-x)NixO26† Lizandra Barrios Herrera University of Calgary, Canada

P04 Machine learning and high-throughput methodology for robust design of P3HT-CNT composites for high electrical conductivity Daniil Bash National University of Singapore, Singapore

P05 Machine learning for establishing the bridge between formulation design and performance in glioblastoma: do outcomes differ? João Basso University of Coimbra, Portugal

P06 Predicting the synthetic accessibility of organic materials Steven Bennett Imperial College London, UK

P07 Efficient and automated construction of system-focused atomistic models Christoph Brunken ETH Zurich, Switzerland

P08 A mobile robotic researcher Benjamin Burger University of Liverpool, UK

P09 An interactive web application for sharing and visualizing chemical datasets Yu Che University of Liverpool, UK

P10 Standard error evaluations on the formation energy and energy bandgap of ABX3 perovskite structures using machine learning approach Ericsson Chenebuah University of Ottawa, Canada

P11 Evolutionary chemical space exploration for functional materials Chi Yang Cheng University of Southampton, UK

P12 Scaling relations and machine learning-driven design for molecular water splitting catalysts Michael Craig Trinity College Dublin, Ireland

P13 Mining predicted crystal structure landscapes with high throughput crystallisation: old molecules, new insights Peng Cui University of Liverpool, UK

P14 Autonomous optimization of nonaqueous battery electrolytes Adarsh Dave Carnegie Mellon University, USA

P15 Structure solution of triformylbenzene using the pycrystga genetic algorithm Lewis Farrar University of Liverpool, UK

P16 Summit: benchmarking machine learning for reaction optimisation Kobi Felton University of Cambridge, UK

P17 Machine-learning assisted modelling of the Jacobsen epoxidation process: can random forests help optimize catalyst efficiency? José Ferraz-Caetano REQUIMTE-LAQV, University of Porto, Portugal

P18 Predicting molecular similarity and toxicity of mycotoxins using machine learning as a playground Cláudia Filipa Ferreira University of Coimbra, Portugal

P19 Evaluate organic corrosion inhibitors through machine learning Tiago Galvão CICECO, University of Aveiro, Portugal

P20 Can we synthesize molecules proposed by generative models? Wenhao Gao Massachusetts Institute of Technology, USA

P21 Improving VAE molecular representations by tailoring them to predict docking poses and scores Miguel Garcia Ortegon University of Cambridge, UK

P22 Combining phonon accuracy with high transferability in machine-learned interatomic potentials Janine George Université catholique de Louvain, Belgium

P23 Accelerating chemical discoveries by automated reaction space exploration Stephanie A. Grimmel ETH Zurich, Switzerland

P24 Machine learned transition pathways between molecular crystal structures Roohollah Hafizi University of Southampton, UK

P25 Developing rapid powder diffraction analysis for efficient characterisation of new materials Sophie Hodgkiss University of Liverpool, UK

P26 Towards the application of machine learning for MoS2 global optimization problem Jiri Hostas University of Calgary, Canada

P27 Using collective knowledge to assign oxidation states Kevin Maik Jablonka EPFL, Switzerland

P28 A machine learning regression model for natural product repurposing as potential SARS-CoV-2 protease inhibitors Jose Isagani Janairo De La Salle University, Philippines

P29 Uniform quantitative predictive modelling for route design Kjell Jorner AstraZeneca Macclesfield, UK

P30 Driving spectroscopic discovery with deep learning Kelvin Lee Massachusetts Institute of Technology, USA

P31 Machine learning applied to a large library of organic molecules: identifying molecular photocatalysts for hydrogen evolution Xiaobo Li Liverpool University, UK

P32 Benchmarking the performance of Bayesian optimization across multiple experimental domains Qiaohao Liang MIT, USA

P33 Metabolite translator: a transformer-based tool for predicting drug metabolites Eleni Litsa Rice University, USA

P34 Accelerating development of natural porous materials assisted by statistical machine learning Giulia Lo Dico Fundation IMDEA Materials, Spain

P35 Evaluations of sequential pre-processing strategy on spectral data for forensic ink analysis purpose Lee Loong Chuen Universiti Kebangsaan Malaysia, Malaysia

P36 Bayesian optimization for structural elucidation of atomic clusters Maicon Lourenço Universidade Federal do Espírito Santo, Brazil

P37 The effect of descriptor choice in machine learning models for ionic liquid melting point prediction Kaycee Low Monash University, Australia

P38 The automation of powder diffraction Amy Lunt University of Liverpool, UK

P39 Using quantum atomics and machine learning to advance picotechnology Preston MacDougall MTSU, USA

P40 Crystallography companion agent for high-throughput materials discovery Phillip Maffettone Brookhaven National Laboratory, USA

P41 Accurate many-body repulsive potentials for density-functional tight binding from deep tensor neural networks Leonardo Medrano Sandonas University of Luxembourg, Luxembourg

P42 Molecular design using GraphINVENT Rocío Mercado AstraZeneca, Sweden

P43 On the importance of structural diversity in metal-organic framework databases Seyed Mohamad Moosavi EPFL, Switzerland

P44 3D printed fluidics for reaction monitoring Adam Price Loughborough University, UK

P45 When does remeasuring accelerate catalyst discovery? Fuzhan Rahmanian Helmholtz Institute Ulm at the Karlsruhe Institute of Technology, Germany

P46 Prediction of chemical reaction yields using reaction transformers Philippe Schwaller IBM Research – Europe/University of Bern, Switzerland

P47 A new approach to expanding the chemical space accessible by self-driving labs Martin Seifrid University of Toronto, Canada

P48 Navigating the design space of perovskite alloys using physics-informed sequential learning Shijing Sun Massachusetts Institute of Technology, USA

P49 Evolutionary discovery of large porous organic icosahedra Filip Szczypiński Imperial College London, UK

P50 Prediction of the properties of organic molecules using Wigner-Ville distribution and deep convolutional neural networks Alain Tchagang National Research Council, Canada

P51 Tandem random forests - Monte Carlo optimization of nitro-arene catalytic reduction. Filipe Teixeira LAQV-REQUIMTE University of Porto, Portugal

P52 Molecular-level virtual screening of nutraceuticals against COVID-19 using artificial intelligence and machine learning Jalala V. K University of Calicut, India

P53 Machine learning based design with small data sets: when the average model knows best Danny Vanpoucke Maastricht University, Netherlands

P54 Autonomous cloud-based platform for the AI-driven synthesis of molecules Alain Vaucher IBM Research Europe, Switzerland

P55 Looking for appropriate descriptors of heterogeneous catalysts Aline Villarreal SEDEMA, Mexico

P56 Machine learning for polymer swelling in liquids Qisong Xu National University of Singapore, Singapore

P57 Development of a global lattice energy landscape sampling method Shiyue Yang University of Southampton, UK

P58 Automated calculation of reaction energy profiles Tom Young University of Oxford, UK

P59 Imputation of missing data for polymer membrane gas separation with machine learning Qi Yuan Imperial College London, UK

P60 Identifying degradation patterns of Li-ion batteries from impedance spectroscopy using machine learning Yunwei Zhang University of Cambridge, UK

Olympus: a benchmarking framework for noisy optimisation and experiment planning

Matteo Aldeghi* 1,2,3, Florian Häse* 1,2,3,4, Riley J. Hickman 1,2, Loïc M. Roch 1,2,3, Melodie Christensen 5,6, Elena Liles 6, Jason E. Hein 6, and Alán Aspuru-Guzik 1,2,3,4 1Department of Chemistry, University of Toronto, Canada, 2Department of Computer Science, University of Toronto, Canada, 3Vector Institute for Artificial Intelligence, Toronto, Canada, 4Department of Chemistry and Chemical Biology, Harvard University, Cambridge, USA, 5Process Research and Development, Merck & Co., Inc., Rahway, USA, 6Department of Chemistry, University of British Columbia, Vancouver, Canada * These authors contributed equally to this work

Substantial advances in automation, robotics, and machine learning have sparked the rise of autonomous experimentation as a next-generation approach to scientific discovery [1-6]. When the discovery process can be approached as an optimisation problem, off-the-shelf algorithms can be directly applied as part of the closed-loop procedure to achieve the desired optimisation goals in as few experimental iterations as possible. However, the most suitable optimisation strategy is a priori unknown, and rigorous comparison of different strategies is highly time and resource demanding. As new algorithms are typically benchmarked on low-dimensional synthetic functions, it is unclear how their performance would translate to noisy, higher-dimensional experimental problems. To obviate these issues, we introduce Olympus, a software package that provides a consistent and easy-to-use framework for benchmarking optimisation algorithms against experiments emulated via probabilistic deep-learning models. Olympus includes a collection of benchmark sets from chemistry and materials science and a suite of experiment planning strategies that can be easily accessed via a user-friendly Python interface. Furthermore, Olympus allows the facile integration, testing, and sharing of custom algorithms and user-defined datasets. In brief, Olympus lowers the barriers associated with benchmarking optimisation algorithms on realistic experimental scenarios, promoting data sharing and the creation of a standard framework for evaluating the performance of experiment planning strategies.
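The kind of comparison described above can be illustrated with a minimal sketch. It does not use the Olympus interface: the noisy surface, the two planners and the `benchmark` helper are all hypothetical stand-ins, chosen only to show how repeated runs on a noisy emulated experiment allow planning strategies to be compared on equal terms.

```python
import random
import statistics

def noisy_surface(x, y, rng, noise=0.05):
    """Hypothetical emulated experiment: smooth response plus Gaussian noise."""
    clean = (x - 0.3) ** 2 + (y - 0.7) ** 2  # noise-free minimum of 0 at (0.3, 0.7)
    return clean + rng.gauss(0.0, noise)

def random_search(budget, rng):
    """Sample conditions uniformly; return the best-so-far trace."""
    best, trace = float("inf"), []
    for _ in range(budget):
        best = min(best, noisy_surface(rng.random(), rng.random(), rng))
        trace.append(best)
    return trace

def grid_search(budget, rng):
    """Visit a regular grid of conditions; return the best-so-far trace."""
    side = max(1, int(budget ** 0.5))
    points = [((i + 0.5) / side, (j + 0.5) / side)
              for i in range(side) for j in range(side)]
    best, trace = float("inf"), []
    for x, y in points[:budget]:
        best = min(best, noisy_surface(x, y, rng))
        trace.append(best)
    return trace

def benchmark(planners, budget=25, repeats=20):
    """Run every planner with the same seeds and summarise the final best value."""
    summary = {}
    for name, planner in planners.items():
        finals = [planner(budget, random.Random(seed))[-1] for seed in range(repeats)]
        summary[name] = (statistics.mean(finals), statistics.stdev(finals))
    return summary

results = benchmark({"random": random_search, "grid": grid_search})
for name, (mean, spread) in results.items():
    print(f"{name:>6}: final best = {mean:.3f} +/- {spread:.3f}")
```

Reporting the spread over repeated seeded runs, rather than a single run, is what makes the comparison meaningful when the measurements are noisy.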

References

1. Häse, F., Roch, L. M. & Aspuru-Guzik, A. Next-Generation Experimentation with Self-Driving Laboratories. Trends Chem. 1, 282–291 (2019).

2. Gromski, P. S., Henson, A. B., Granda, J. M. & Cronin, L. How to explore chemical space using algorithms and automation. Nat. Rev. Chem. 3, 119–128 (2019).

3. Burger, B. et al. A mobile robotic chemist. Nature 583, 237–241 (2020).

4. MacLeod, B. P. et al. Self-driving laboratory for accelerated discovery of thin-film materials. Sci. Adv. 6, eaaz8867 (2020).

5. Langner, S. et al. Beyond Ternary OPV: High-Throughput Experimentation and Self-Driving Laboratories Optimize Multicomponent Systems. Adv. Mater. 32, 1907801 (2020).

6. Grizou, J., Points, L. J., Sharma, A. & Cronin, L. A curious formulation robot enables the discovery of a novel protocell behavior. Sci. Adv. 6, (2020).

P01 © The Author(s), 2020

Predicting structure zone diagrams for thin film synthesis by generative machine learning

Lars Banko1, Yury Lysogorskiy2, Ralf Drautz2,3, Alfred Ludwig1,3 1Chair for Materials Discovery and Interfaces, Institute for Materials, Ruhr-Universität, Bochum, Germany, 2Interdisciplinary Centre for Advanced Materials Simulation (ICAMS), Ruhr-Universität, Bochum, Germany, 3Materials Research Department, Ruhr-Universität, Bochum, Germany

Microstructure optimization is of crucial importance for the development of hard, protective thin film coatings. Structure zone diagrams (SZD) are the state of the art for estimating the microstructure of PVD-grown thin films based on a few synthesis parameters. The predictive power of classical SZD is limited by the complex synthesis-microstructure relationship of thin films. Furthermore, the detailed interplay of synthesis parameters and compositional complexity hinders generalisation to other conditions and materials. Several refined SZD have been proposed. These classical SZD have in common that they are based on a small number of observations. The underlying trends were extracted based on the scientists' expertise and abstracted into a diagram representation of microstructural features. Emerging developments in combinatorial thin film synthesis and high-throughput characterization allow for a fast, high-quality acquisition of microstructure data.

Inspired by the recent success of generative machine learning models, a new approach to microstructure prediction comes within reach. Here, we propose to reduce the cost of microstructure design by joining combinatorial experimentation with generative deep learning models to extract synthesis-composition-microstructure relations [1].

First, we train a variational autoencoder (VAE) on a dataset of augmented SEM surface image data. We make use of the VAE's recognition model to cluster a complex dataset of 120 thin film SEM surface images of the Cr-Al-O-N material system. The observed latent space representation is used to visually identify synthesis-composition-microstructure relationships across a six-parameter synthesis-composition space. Although this approach is promising, the generative capabilities of the VAE fell short of our expectations.

In a second approach, a conditional generative adversarial network (cGAN) is applied to generate microstructure images from process parameters and chemical composition as input. We validate the predictions on particle size distributions and find that the generated microstructure images are in good agreement with images from the experimental dataset. Next, we generate structure zone diagrams by sampling from the process-parameter-composition space. We observe that the cGAN captures the fundamental trends in the dataset and that the predictions generalize reasonably to unobserved conditions. We demonstrate the ability to generate structure zone diagrams (gSZD) on demand by varying two variables of the synthesis-composition space and holding the other four constant. The cGAN and a classification model are joined to complement the gSZD with a probabilistic structure zone map that can be used to identify regions of interest for a targeted synthesis.

References 1. Banko, L. et al. Predicting structure zone diagrams for thin film synthesis by generative machine learning. Commun. Mater. 1, 15 (2020). https://doi.org/10.1038/s43246-020-0017-2

P02 © The Author(s), 2020

An active-learning strategy to search for the optimum dopant distribution and spin multiplicity in nanoparticles – the case of Ce(13-x)NixO26†

Lizandra Barrios Herrera1, Dennis R. Salahub1, Jiri Hostas1, Alain Tchagang2, Cleyton de Souza Oliveira3, *Maicon P. Lourenço3 1Department of Chemistry, CMS Centre for Molecular Simulation, IQST Institute for Quantum Science and Technology, Quantum Alberta, University of Calgary, Canada, 2Digital Technologies Research Centre, National Research Council of Canada, Canada, 3Departamento de Química e Física, Centro de Ciências Exatas, Naturais e da Saúde (CCENS), Universidade Federal do Espírito Santo, Brazil * Address correspondence to: [email protected] (MPL)

Recent studies have shown the stability and high activity of Ni-CeO2 catalysts for ethanol steam reforming and the water gas shift (WGS) reactions [1, 2]. Knowing that the catalytic activity of ceria may increase by a factor of two when the size of its crystallites is decreased to the nanometer scale, we are interested in exploring Ni-CeO2 nanoparticles (NPs) as catalysts for water splitting and WGS reactions. Therefore, elucidating how different properties of the Ni-CeO2 NPs, such as size, composition, electronic structure, and catalytic sites, may affect the reactivity is of crucial importance to the eventual design of improved nanocatalysts.

This work focuses on the determination of the most promising catalytic sites resulting from the distribution of the Ni dopants in a ceria NP and how it affects the spin multiplicity. Modelling this kind of problem from first-principles calculations represents a significant challenge since there is a combinatorial number of possibilities of substituting the Ce by Ni. On top of that, we found that it is particularly important to investigate the possible spin multiplicities of the molecular system along with the Ni distributions in the cluster. As a result, the search space increases dramatically.

To approach this problem, we have developed an Active-Learning (AL) method to search for the optimum Ni distribution and its spin multiplicity in Ce13O26 from DFT calculations. To accomplish that, we have proposed a descriptor that accounts for both the Ni distribution in Ce10Ni3O26 and the spin multiplicity, and implemented it in the QMLMaterial software [3]. The progress of the on-the-fly AL method implemented in QMLMaterial and its interface with the deMon2k software for this problem will be presented.

References

1. Pastor-Pérez, L.; Buitrago-Sierra, R.; and Sepúlveda-Escribano, A.; CeO2-promoted Ni/activated carbon catalysts for the water-gas shift (WGS) reaction. Int. J. Hydrogen Energy, 39, 31, 17589 (2014).

2. Lustemberg, P. G.; Feria, L.; Ganduglia-Pirovano, M. V.; Single Ni sites supported on CeO2(111) reveal cooperative effects in the water-gas shift reaction. J. Phys. Chem. C, 123, 13, 7749 (2019).

3. Lourenço, M. P.; Anastácio, A. S.; Rosa. L. Andreia; Frauenheim, T.; da Silva, M. C; An Adaptive Design approach for defects distribution modeling in materials from first-principle calculations, J. Mol. Model., 26, 187 (2020).

†Work supported by (1) the National Research Council of Canada, Artificial Intelligence for Design program, and the Natural Sciences and Engineering Research Council of Canada, Discovery Grant (RGPIN-2019-03976), and (2) the Brazilian agencies: Fundação de Amparo à Pesquisa do Espírito Santo (FAPES), project CNPq/FAPES PPP 22/2018.

P03 © The Author(s), 2020

Machine learning and high-throughput methodology for robust design of P3HT-CNT composites for high electrical conductivity

Bash Daniil*1,2, Cai Yongqiang*1, Chellappan Vijila*2, Wong Swee Liang2, Yang Xu2, Kumar Pawan2, Tan Jin Da1,2, Abutaha Anas2, Cheng Jayce2, Lim Yee Fun2, Ren Danny Zekun1,4, Tian Siyu1,4, Kumar Jatin2, Khan Saif1, Li Qianxiao1,3#, Buonassisi Tonio4#, Hippalgaonkar Kedar2,5# 1National University of Singapore, Singapore, 2Institute of Materials Research and Engineering, A-STAR, Singapore, 3Institute of High Performance Computing, A-STAR, Singapore, 4Singapore-MIT Alliance for Research and Technology, Singapore, 5Nanyang Technological University, Singapore * These authors contributed equally # Corresponding authors

Combining high-throughput experimentation with machine learning enables rapid exploration of parameter space and accelerates the optimization of experiments aimed at achieving target properties. In addition to such optimization, we demonstrate in this work that machine learning, deployed on curated experimental datasets, can also be used for hypothesis testing. We introduce an automated flow system with high-throughput drop-casting for thin film preparation, followed by fast characterization of optical and electrical properties. We combine regio-regular poly-3-hexylthiophene with various carbon nanotubes to achieve electrical conductivities as high as 1200 S/cm, which we subsequently explain with high-fidelity optical characterization. With this approach we are able to complete one cycle of fabrication, characterization, labeling and training of a new model for ~160 samples in a single day. Dataset resampling strategies, graph-based regressions, and artificial ground-truth models that account for the acquisition cost of experimental data allow validation of the hypothesis linking charge delocalization, measured via hyperspectral imaging, to electrical conductivity. We therefore present a robust machine-learning driven high-throughput experimental scheme, suitable both for optimization towards targets and for understanding the underlying properties of hybrid organic-inorganic composite materials.

P04 © The Author(s), 2020

Machine learning for establishing the bridge between formulation design and performance in glioblastoma: do outcomes differ?

João Basso1,2,3*, Maria Mendes1,2,3*, Jessica Silva1,3*, Tânia Cova2, Alberto Pais2, Carla Vitorino1,2,3# 1Faculty of Pharmacy, University of Coimbra, Portugal, 2Coimbra Chemistry Centre, Department of Chemistry, University of Coimbra, Portugal, 3Centre for Neurosciences and Cell Biology (CNC), University of Coimbra, Portugal

Cationic compounds have been described to readily penetrate cell membranes. Assigning positive charge to nanosystems, e.g. lipid nanoparticles, has been identified as a key feature to promote electrostatic binding and to design ligand-based constructs for tumour targeting. However, their intrinsic high cytotoxicity has hampered their biomedical application. Here, we seek to establish which cationic compounds and which properties are compelling for interface modulation, in order to improve the design of tumour-targeted nanoparticles against glioblastoma. How can intrinsic features (e.g. nature, structure, conformation) shape efficacy outcomes?

In the quest for safer alternative cationic compounds, we evaluate the effects of two novel glycerol-based lipids, GLY1 and GLY2, on the architecture and performance of nanostructured lipid carriers (NLCs). These two molecules, composed of two alkylated chains and a glycerol backbone, differ only in their polar head and proved to be efficient in reversing the zeta potential of the nanosystems to positive values.

In combination with the experimental development of NLCs, unsupervised and supervised machine learning methods are used to further characterize the key factors governing the effectiveness of the formulations. With the formulations deconstructed into their individual components, clustering, principal component analysis and partial least squares regression uncover the hidden patterns that govern the performance of these drug delivery systems. Furthermore, neural networks modelling particle size, polydispersity index, zeta potential and cytotoxicity at 24 h were introduced in order to combine classical interpretative methods with machine learning, thus conveying information from various methods with different interpretative power. Rather than expressing the same information, these methods provide a complementary outlook concerning the nanosystems.

The knowledge gap between formulation composition and performance is herein bridged, with unsupervised and supervised machine learning techniques unravelling the structural characteristics of these lipids: in spite of their similarity, GLY1 showed a better potential in increasing zeta potential and cytotoxicity, while decreasing particle size. Furthermore, NLCs containing GLY1 showed a favourable hemocompatible profile, as well as improved uptake by tumour cells. In summary, GLY1 circumvents the intrinsic cytotoxicity of a common surfactant, CTAB, is effective at increasing glioblastoma uptake, and exhibits encouraging anticancer activity. Overall, the use of neural networks provides a holistic overview and, owing to the enhanced fit, a more robust and trustworthy picture of the behaviour of the data than traditional methods. The use of machine learning is therefore strongly encouraged for formulation design and optimization.

P05 © The Author(s), 2020

Predicting the synthetic accessibility of organic materials

Steven Bennett, Filip T. Szczypiński, Kim E. Jelfs Department of Chemistry, Imperial College London, UK

Porous organic cages (POCs) have been discovered as a possible alternative material for molecular separations, catalysis and sensing applications.1 However, due to the diverse number of potential precursors and reactions that can be used to form a novel POC, it can be time-consuming and computationally expensive to explore a large number of possible candidate molecules.2 Despite being able to predict materials with exceptional properties, it is often difficult to predict whether it is possible to synthetically realise a potential candidate compound. Additionally, molecules that are more difficult to synthesise could result in extremely useful properties. In the field of drug discovery, machine learning techniques have been able to readily distinguish between synthesisable and unsynthesisable molecules, accelerating the drug discovery process.3

Using data-driven synthetic accessibility scoring techniques, we aimed to incorporate a filter for synthetically viable molecules into our existing high-throughput POC screening workflow. We used the previously reported materials design software, stk,4 to explore the synthesisable chemical space of possible candidate compounds. Precursors from a diverse chemical database were combined in pre-defined topological configurations, and the porosity of the lowest-energy conformations was analysed. Cages with favourable properties were identified, including those exhibiting highly symmetrical windows and a large permanent internal cavity.

We found that incorporating a synthetic accessibility scoring function into the precursor selection process favoured less complex, synthetically accessible precursors, bridging the gap between computational screening and experimental synthesis of POCs. Furthermore, we tried to reproduce the chemical intuition of an expert organic chemist, attempting to model the decision process used when selecting potential precursor molecules. By redefining synthetic accessibility as a classification problem, we were able to create a random forest model that classified molecules as synthetically accessible or inaccessible. This screening workflow is also not limited to POCs, as it can be extended to other materials designed using a similar bottom-up, modular approach.

References

1. T. Hasell and A. I. Cooper, Nat. Rev. Mater., 2016, 1, 16053

2. Gómez-Bombarelli, R.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Duvenaud, D.; Maclaurin, D.; Blood-Forsythe, M. A.; Chae, H. S.; Einzinger, M.; Ha, D. G.; Wu, T.; et al. Design of Efficient Molecular Organic Light-Emitting Diodes by a High-Throughput Virtual Screening and Experimental Approach. Nat. Mater., 2016, 15 (10), 1120–1127

3. C. W. Coley, L. Rogers, W. H. Green and K. F. Jensen, SCScore: Synthetic Complexity Learned from a Reaction Corpus, J. Chem. Inf. Model., 2018, 58, 252–261

4. L. Turcani, E. Berardo and K. E. Jelfs, stk: A python toolkit for supramolecular assembly, J. Comput. Chem., 2018, 39, 1931–1942

P06 © The Author(s), 2020

Efficient and automated construction of system-focused atomistic models

Christoph Brunken and Markus Reiher ETH Zurich, Switzerland

Computational studies of chemical reactions in complex environments such as proteins or metal-organic frameworks require accurate and at the same time efficient atomistic models applicable to the nanometer scale. As an accurate parametrization of the atomistic entities will not be available for arbitrary system classes, it remains a tedious task to generate such models manually or to extend existing parametrizations. We accelerate this process by introducing a fully automated system-focused parametrization procedure [1], which is quickly applicable, reliable, flexible, and reproducible. To this end, we combine an automatically parametrizable, quantum chemically derived molecular mechanics model with machine-learned corrections under uncertainty quantification. Our approach first generates an accurate, physically motivated model from a minimum energy structure and its corresponding Hessian matrix by a partial Hessian fitting procedure [2] of the force constants. This model can be applied to generate a large number of configurations (e.g., via molecular dynamics) for which additional reference data can be calculated on the fly. A Δ-machine learning model is trained on these data to provide a correction to energies and forces, including uncertainty estimates. The parametrization of large systems is enabled by an autonomous fragmentation approach, which is demonstrated using the example of the copper-containing protein plastocyanin. Our approach can also be employed for the generation of system-focused electrostatic molecular mechanics embedding environments in a quantum-mechanical/molecular-mechanical hybrid model for arbitrary atomistic structures at the . In this context, we developed a novel approach for the selection of the composition of the quantum-mechanical region in a systematic and fully automated manner based on first principles.
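The Δ-machine learning idea mentioned above, a learned correction added on top of a cheap physical model, can be sketched in one dimension. This is an illustrative toy under stated assumptions, not the authors' implementation: the Morse "reference", the harmonic "baseline" and the Gaussian-kernel (Nadaraya-Watson) regressor are all hypothetical choices, and the uncertainty estimates described in the abstract are omitted.

```python
import math

def reference_energy(r):
    """Stand-in for the expensive reference method: Morse potential (D = a = r0 = 1)."""
    return (1.0 - math.exp(-(r - 1.0))) ** 2

def baseline_energy(r):
    """Stand-in for the cheap physical model: harmonic expansion of the same well."""
    return (r - 1.0) ** 2

def fit_delta(train_r, width=0.1):
    """Learn the correction E_ref - E_base by Gaussian-kernel weighted averaging."""
    residuals = [reference_energy(r) - baseline_energy(r) for r in train_r]

    def delta(r):
        weights = [math.exp(-0.5 * ((r - t) / width) ** 2) for t in train_r]
        return sum(w * d for w, d in zip(weights, residuals)) / sum(weights)

    return delta

# Training configurations (0.6 to 2.0) and held-out midpoints between them.
train_r = [0.6 + 0.1 * i for i in range(15)]
test_r = [0.65 + 0.1 * i for i in range(14)]
delta = fit_delta(train_r)

def mean_abs_error(model):
    """Mean absolute deviation from the reference on the held-out points."""
    return sum(abs(model(r) - reference_energy(r)) for r in test_r) / len(test_r)

baseline_mae = mean_abs_error(baseline_energy)
corrected_mae = mean_abs_error(lambda r: baseline_energy(r) + delta(r))
print(f"baseline MAE:  {baseline_mae:.4f}")
print(f"corrected MAE: {corrected_mae:.4f}")
```

Because the baseline already captures most of the physics, the correction only has to model a small, smooth residual, which is why a simple regressor trained on few points can be sufficient in this toy setting.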

References 1. Brunken, C.; Reiher, M., J. Chem. Theory Comput. 2020, 16, 1646-1665. 2. Wang, R.; Ozhgibesov, M.; Hirao, H., J. Comput. Chem. 2016, 37, 2349-2359.

P07 © The Author(s), 2020

A mobile robotic researcher

Benjamin Burger, Phillip M. Maffettone, Vladimir V. Gusev, Catherine M. Aitchison, Yang Bai, Xiaoyan Wang, Xiaobo Li, Ben M. Alston, Buyi Li, Rob Clowes*, Nicola Rankin, Brandon Harris, Reiner Sebastian Sprick and Andrew I. Cooper Leverhulme Centre for Functional Materials Design, Materials Innovation Factory and Department of Chemistry, University of Liverpool, UK

This work describes the development of an autonomous system aimed at photocatalysis research. Recent advancements in enabling technologies (e.g. collaborative robots) allow for novel autonomous approaches in materials science. The introduction of mobile robots in a laboratory environment [1], along with recent advancements in AI-driven searches and a drift towards automation in various analytical equipment, has led to various autonomous research systems [2-5]. Here, we aim to combine automated experiments [6, 7] with machine learning [8] to automate the researcher completely during the experiment. By introducing this concept into the field of photocatalysis, we have reduced human error in the measurements and reduced the labor required for performing these experiments. To this end, we have developed modular stations, each performing one atomic operation of the experiment, such as solid dispensing, capping, or analysis, operated using a mobile robot. The mobile robot was programmed to handle vials, cartridges filled with solids, and racks. With the modular workflow, machine learning was used to generate new candidates based on previous experiments in an active learning paradigm. We show that the KUKA Mobile Robot (KMR) can operate each step of the workflow using modular stations. Finally, we formulated five chemical hypotheses to improve a hydrogen-evolving catalyst formulation. Each hypothesis selects one or two compounds, and together they define a chemical formulation space of 11 components. We show how Bayesian Optimization can evaluate the hypotheses within this search space to ultimately improve the catalytic performance by a factor of six.

References

1. Abdulla, A.A., et al., Multiple Mobile Robot Management System for Transportation Tasks in Automated laboratories Environment, 2018.

2. MacLeod, B.P., et al., Self-driving laboratory for accelerated discovery of thin-film materials. arXiv preprint arXiv:1906.05398, 2019.

3. Langner, S., et al., Beyond Ternary OPV: High-Throughput Experimentation and Self-Driving Laboratories Optimize Multi-Component Systems. arXiv preprint arXiv:1909.03511, 2019.

4. Coley, C.W., et al., A robotic platform for flow synthesis of organic compounds informed by AI planning. Science, 2019, 365, 557.

5. Nikolaev, P., et al., Autonomy in materials research: a case study in carbon nanotube growth. npj Computational Materials, 2016, 2, 16031.

6. Ren, F., et al., Accelerated discovery of metallic glasses through iteration of machine learning and high-throughput experiments. Science Advances, 2018, 4(4), eaaq1566.

7. Fleischer, H. and K. Thurow, Automation Solutions for Analytical Measurements: Concepts and Applications, John Wiley & Sons, 2017.

8. Roch, L.M., et al., ChemOS: An Orchestration Software to Democratize Autonomous Discovery. ChemRxiv preprint: https://doi.org/10.26434/chemrxiv.5953606.v1, 2018.

P08 © The Author(s), 2020 An interactive web application for sharing and visualizing chemical datasets

Yu Che, Linjiang Chen, Chengxi Zhao and Andrew I. Cooper Leverhulme Research Centre for Functional Materials Design, Materials Innovation Factory and Department of Chemistry, University of Liverpool, UK

A major challenge in applying experimental or computational high-throughput (HT) methods is finding an efficient way to visualize and share the large datasets they generate. To improve the interpretability and accessibility of HT results, a web-based, interactive visualization application was developed. Figure 1a is a snapshot of one web-based application developed for exploring energy–structure–function maps [1], incorporating a 2D scatter plot and a 3D crystal structure viewer. In the 2D scatter plot, variables for the X-axis, Y-axis and colour-code of the points on the graph are selected from dropdown lists. The 2D plot can also be switched to other maps (Figure 1b), in which crystal structures are clustered and mapped onto a 2D space by unsupervised machine learning algorithms. Upon selecting points in the scatter plot, their corresponding 3D crystal structures are displayed in structure viewers, where the structure may be rotated and zoomed in or out for visualization; multiple structures can be viewed at the same time in separate structure viewers. The visualization applications developed here make use of the Pandas, Plotly, Flask and Dash Python packages and are rendered for web browsing and hosted on Heroku servers, with a static domain facilitating public access on the Internet (https://www.interactive-esf-maps.app/).

Figure 1 Snapshots of an interactive web application: (a) energy–structure–function maps, (b) 2D embeddings of the energy– structure–function maps.

References 1. Pulido, A.; Chen, L.; Kaczorowski, T.; Holden, D.; Little, M. A.; Chong, S. Y.; Slater, B. J.; McMahon, D. P.; Bonillo, B.; Stackhouse, C. J.; Stephenson, A.; Kane, C. M.; Clowes, R.; Hasell, T.; Cooper, A. I.; Day, G. M., Functional materials discovery using energy–structure–function maps. Nature 2017, 543, 657.

P09 © The Author(s), 2020 Standard error evaluations on the formation energy and energy bandgap of ABX3 perovskite structures using machine learning approach

*Ericsson T. Chenebuah1, Alain B. Tchagang2, Michel Nganbe1 1Department of Mechanical Engineering, University of Ottawa, Canada, 2Digital Technologies Research Centre, National Research Council of Canada, Canada * Address correspondence to: [email protected]

The high cost of high-throughput screening and ab-initio Density Functional Theory (DFT) calculations has led materials scientists and engineers to develop new material discovery methods based on Machine Learning (ML) principles. This involves designing and applying appropriate ML models to predict target properties on well-prepared datasets and subsequently evaluating whether the standard errors fall within an acceptable range. On this frontier, error evaluations serve as standardized references for ML designs, yet they remain understudied.

In this research, ten different ML models are used to predict the formation energy and energy bandgap of 837 uniquely proven ABX3 ternary perovskites from 30 robust input features. The 30 features were carefully selected on the basis of the chemical, physical and stability properties associated with the perovskite crystal structure. The objective was to determine which input features contributed strongly or negligibly to the models in order to achieve optimized error metrics. Feature importance and permutation importance analyses were essential in eliminating redundant input descriptors and identifying which features carried more weight during training.
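The idea behind permutation importance can be illustrated with a minimal sketch: shuffle one input column at a time and record how much the prediction error grows. The toy model, the data and all function names below are hypothetical placeholders and do not reproduce the authors' models or their 30 features.

```python
import random


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Average increase in MAE when a single feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mae(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving every other column untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mae(y, [predict(r) for r in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical toy model: the target depends on feature 0 and ignores feature 1.
def toy_model(row):
    return 2.0 * row[0]


X = [[float(i), float((7 * i) % 5)] for i in range(20)]
y = [toy_model(row) for row in X]
scores = permutation_importance(toy_model, X, y)
```

In this toy case the ignored feature receives an importance of exactly zero, which is the signature of a redundant descriptor that could be dropped.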

Preliminary results reveal that the B-site and X-site ionic properties jointly contribute the most in deciding the outcome of the formation energy, whereas the B-site chemical and stability properties were vital in deciding the outcome for the energy bandgap. Furthermore, standard metrics such as Root Mean Square Error (RMSE), Mean Absolute Error (MAE) and R-square values were used to define the accuracy of the predictions of all ML models. Overall, the Support Vector Regressor (SVR) model performed best in predicting the formation energy, with RMSE and MAE scores on the test set of 0.24 eV and 0.16 eV respectively, while the Gradient Boosting Regressor (GBR) model outperformed the other nine models in predicting the energy bandgap, with RMSE and MAE scores on the test set of 0.85 eV and 0.52 eV respectively. This study advances the discussion on potential predictive improvements in delivering accurate target properties, especially in other forms of perovskites such as hybrid, double-ionic-site and organic structures.
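For readers less familiar with the three error metrics named above, the following short sketch computes them from their standard definitions. The numerical values are hypothetical and are not taken from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical reference and predicted energies in eV (illustration only).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

print(rmse(y_true, y_pred), mae(y_true, y_pred), r_squared(y_true, y_pred))
```

RMSE penalises large individual errors more strongly than MAE, which is why the two are usually reported together.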

References 1. J. Bartel, et al., New tolerance factor to predict the stability of perovskite oxides and halides, Sci. Adv. 5(2), (2019), eaav0693. 2. Lu, Q. Zhou, Y. Ouyang, Y. Guo, Q. Li, J. Wang, Accelerated discovery of stable lead-free hybrid organic-inorganic perovskites via machine learning, Nat. Com., 9(1), (2018), 3405. 3. Sharma, P. Kumar, P. Dev, G. Pilania, Machine learning substitutional defect formation energies in ABO3 perovskites, J. Appl. Phys. 128, 034902 (2020); doi: 10.1063/5.0015538 4. Li, et al., A progressive learning method for predicting the band gap of ABO3 perovskites using an instrumental variable, J. Mater. Chem. C, (2020), 8, 3127.

P10 © The Author(s), 2020 Evolutionary chemical space exploration for functional materials

Chi Y. Cheng, Josh E. Campbell and Graeme M. Day Computational Systems Chemistry, School of Chemistry, University of Southampton, UK

With a near infinite number of molecules that could be theoretically proposed, any computational screening procedure will need to reduce the chemical space to a manageable size whilst ensuring that certain molecules with favourable properties are included. In a recent paper [1] we suggested a new method to do this using a fragment-based evolutionary algorithm, which allows us to define and then search the chemical space that we are interested in. As a proof of concept, we applied this to the computational screening of high-performance organic semiconductors. Using this program, we were able to restrict the search to a chemical space of aza-substituted PAHs, which contains molecules with favourable properties. For example, the aza-substitutions allow for: an increased electron affinity of the material, which can reduce the barrier for the injection of an electron; and hydrogen bonding between molecules, which can promote more pi-stacked structures that increase electron mobilities. From the defined chemical space, the evolutionary algorithm then suggests molecules for the target application. For organic semiconductors we then feed this information into a crystal structure prediction [2] process to re-evaluate the suggestions. Together, the evolutionary algorithm and crystal structure prediction therefore describe a complete computational screening process from molecular fragments to a solid-state material.

References 1. “Evolutionary chemical space exploration for functional materials: computational organic semiconductor discovery” Chemical Science, 11, 4922-4933 (2020). 2. “Convergence properties of crystal structure prediction by quasi-random sampling” Journal of Chemical Theory and Computation, 12, 910-924 (2016).

P11 © The Author(s), 2020

Scaling relations and machine learning-driven design for molecular water splitting catalysts

Michael John Craig, Max García-Melchor Trinity College Dublin, Ireland

The oxygen evolution reaction (OER) is a reaction of fundamental importance to future renewable energy applications.[1] It is currently the bottleneck in the carbon-free production of hydrogen since commercial polymer electrolyte electrocatalysts are oxides based on precious metals iridium and ruthenium.[2] A promising avenue towards earth-abundant OER is to design stable homogeneous catalysts, which have shown orders of magnitude higher turnover frequencies with respect to heterogeneous catalysts.[3-5] In this poster, insights into the required mechanism to achieve such high activity are shown[6] and the high-throughput evaluation of this mechanism via density functional theory will be described.

The set of molecules to evaluate is generated by a combinatorial creation of hypothetical catalysts from individual mono-, bi-, tri- and tetradentate ligands. Then, the water nucleophilic attack mechanism is simulated. With these data, the method-independent universal scaling relationship for molecular water oxidation catalysts is demonstrated in more depth, and the ascendancy of Ru catalysts in this domain is justified. Our high-throughput approach identifies novel molecular catalysts with low predicted overpotential based only on abundant elements Cr and Fe as promising leads for experimental realisation. With the computed Gibbs energy of all the OH intermediates, we developed a Gaussian process regression model which allows for the prediction of OH binding energies with a mean absolute error of 0.08 eV. Two distinct binding energies determine the catalyst efficiency in a cursory manner in either base or acidic media. Initial Bayesian optimisation experiments using these binding energies have shown promising results, which will also be discussed. Altogether, this poster provides stimulating details and directions for OER while outlining the first steps towards enabling inverse design for molecular OER catalysis.
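To make the link between binding energies and predicted overpotential concrete, the sketch below uses the widely adopted four-step thermodynamic picture of water oxidation, in which the overpotential is set by the largest of the four proton-coupled electron transfer free energies. This convention, the function name and the example energies are assumptions chosen for illustration; the abstract does not spell out the exact expression used in the poster.

```python
# Equilibrium potential for water oxidation (V). With one electron per step,
# free energies in eV and potentials in V are numerically interchangeable.
EQUILIBRIUM_POTENTIAL = 1.23


def oer_overpotential(dg_oh, dg_o, dg_ooh):
    """Thermodynamic overpotential (V) from intermediate free energies (eV).

    dg_oh, dg_o and dg_ooh are the formation free energies of the OH, O and
    OOH intermediates relative to water (assumed convention).
    """
    steps = [
        dg_oh,                               # H2O -> OH
        dg_o - dg_oh,                        # OH -> O
        dg_ooh - dg_o,                       # O + H2O -> OOH
        4 * EQUILIBRIUM_POTENTIAL - dg_ooh,  # OOH -> O2
    ]
    return max(steps) - EQUILIBRIUM_POTENTIAL


# Hypothetical intermediate energies in eV (not values from the poster).
eta = oer_overpotential(dg_oh=1.0, dg_o=2.6, dg_ooh=4.1)
```

If the OOH energy scales linearly with the OH energy, as a scaling relationship implies, only two of the three inputs remain independent, which is one way to read the statement that two binding energies determine the efficiency.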

References 1. C. Acar and I. Dincer, Int. J. Hydrogen Energy, 2014, 39, 1–12. 2. C. C. L. McCrory, S. Jung, I. M. Ferrer, S. M. Chatman, J. C. Peters and T. F. Jaramillo, J. Am. Chem. Soc., 2015, 137, 4347–4357. 3. L. Duan, F. Bozoglian, S. Mandal, B. Stewart, T. Privalov, A. Llobet and L. Sun, Nat. Chem., 2012, 4, 418–423. 4. L. Duan, C. M. Araujo, M. S. G. Ahlquist and L. Sun, Proc. Natl. Acad. Sci. USA, 2012, 109, 15584–15588. 5. R. Matheu, M. Z. Ertem, J. Benet-Buchholz, E. Coronado, V. S. Batista, X. Sala and A. Llobet, J. Am. Chem. Soc., 2015, 137, 10786–10795. 6. M. J. Craig, G. Coulter, E. Dolan, J. Soriano-López, E. Mates-Torres, W. Schmitt and M. García-Melchor, Nat. Commun., 2019, 10, 4993.

P12 © The Author(s), 2020 Mining predicted crystal structure landscapes with high throughput crystallisation: old molecules, new insights

Peng Cui,1 David P. McMahon,2 Peter R. Spackman,2,3 Ben M. Alston,1,3 Marc A. Little,1 Graeme M. Day2 and Andrew I. Cooper 1,3 1Department of Chemistry and Materials Innovation Factory, University of Liverpool, UK, 2Computational Systems Chemistry, School of Chemistry, University of Southampton, UK, 3Leverhulme Research Centre for Functional Materials Design, Department of Chemistry and Materials Innovation Factory, University of Liverpool, UK

Organic molecules tend to close pack to form dense structures when they are crystallised from organic solvents. Porous molecular crystals defy this rule: they contain open space, which is typically stabilised by inclusion of solvent in the interconnected pores during crystallisation. The design and discovery of such structures is often challenging and time consuming, in part because it is difficult to predict solvent effects on crystal form stability. Here, we combine crystal structure prediction (CSP) with a robotic crystallisation screen to accelerate the discovery of stable hydrogen-bonded frameworks. We exemplify this strategy by finding new phases of two well-studied molecules in a computationally targeted way. Specifically, we find a new ‘hidden’ porous polymorph of trimesic acid, δ-TMA, that has a guest-free hexagonal pore structure, as well as three new solvent-stabilized diamondoid frameworks of adamantane-1,3,5,7-tetracarboxylic acid (ADTA). Beyond porous solids, this hybrid computational-experimental approach could be applied to a wide range of materials problems, such as organic electronics and drug formulation.

References 1. Nature, 2017, 543, 657-664. 2. Chem. Sci., 2019, 10, 9988-9997.

P13 © The Author(s), 2020 Autonomous optimization of nonaqueous battery electrolytes

Adarsh Dave [1], Jared Mitchell [2], Sven Burke [2], Kirthevasan Kandasamy [3], Biswajit Paria [4], Barnabas Poczos [4], Jay Whitacre [2], Venkat Viswanathan [1] [1] Carnegie Mellon University, Department of Mechanical Engineering, [2] Carnegie Mellon University, Department of Materials Science, [3] University of California - Berkeley, Department of Electrical Engineering and Computer Science, [4] Carnegie Mellon University, Department of Machine Learning

Electrolyte composition is crucial to a range of battery performance targets, including cycle-life and fast-charging. Designing an electrolyte entails choosing a precise combination of dozens of commonly used solvents and salts [1-5]. The choice is constrained by a host of unknown physical properties, including liquid window, viscosity, wetting behavior, vapor pressure... as well as cost.

Previously, our team has successfully automated the measurement of transport and electrochemical properties of aqueous electrolytes, integrated with an autonomous machine-learning loop [6-8]. Current work focuses on automation inside a glovebox for nonaqueous chemistries.

We have designed an automated test-stand that mixes a nonaqueous electrolyte and characterizes a range of properties important to battery electrolyte performance in a repeatable, unsupervised manner. We capture conductivity, viscosity at various temperatures, density, and dielectric constant. Future work will also focus on thermal properties, surface tension, and vapor pressure.

Exhaustive measurement of all types of electrolyte mixtures would be cost-prohibitive and time-prohibitive. We use machine-learning methods to run the test-stand autonomously, given simple objective functions like "optimize for conductivity" and constrained design spaces.

References 1. Petibon, R. et al. Electrolyte System for High Voltage Li-Ion Cells. J. Electrochem. Soc. 163, A2571 (2016). 2. Logan, E. R. et al. A Study of the Physical Properties of Li-Ion Battery Electrolytes Containing Esters. J. Electrochem. Soc. 165, A21 (2018). 3. Xiong, D. J. et al. Some Physical Properties of Ethylene Carbonate-Free Electrolytes. J. Electrochem. Soc. 165, A126 (2018). 4. Logan, E. R. et al. Ester-Based Electrolytes for Fast Charging of Energy Dense Lithium-Ion Batteries. J. Phys. Chem. C 124, 12269–12280 (2020). 5. Louli, A. J. et al. Diagnosing and correcting anode-free cell failure via electrolyte and morphological analysis. Nature Energy 1–10 (2020) doi:10.1038/s41560-020-0668-8. 6. Dave, A., Gering, K. L., Mitchell, J. M., Whitacre, J. & Viswanathan, V. Benchmarking Conductivity Predictions of the Advanced Electrolyte Model (AEM) for Aqueous Systems. J. Electrochem. Soc. 167, 013514 (2019). 7. Dave, A. et al. Autonomous discovery of battery electrolytes with robotic experimentation and machine-learning. arXiv:2001.09938 [physics] (2019). 8. Whitacre, J. F. et al. An Autonomous Electrochemical Test Stand for Machine Learning Informed Electrolyte Optimization. J. Electrochem. Soc. 166, A4181 (2019).

P14 © The Author(s), 2020 Structure solution of triformylbenzene using the PyCrystGA genetic algorithm

L. Farrar (a, b), L. Chen (a, b) and S. Y. Chong (b)* (a) Leverhulme Research Centre for Functional Materials Design, 51 Oxford St, Liverpool, L7 3NY. (b) Department of Chemistry and Materials Innovation Factory, University of Liverpool, 51 Oxford St, Liverpool, L7 3NY.

Powder X-ray diffraction (PXRD) is a powerful analytical technique used to elucidate structural features of crystalline materials and atomic packing arrangements, allowing for structure–property relations to be rationalised. Solving crystal structures from PXRD can be achieved through direct-space methods (DSM) or reciprocal-space methods. We utilise DSM for the investigation of molecular organic crystals, as they circumvent the issue of peak overlap and allow for the utilisation of chemical information during the structure solution process. DSM involve the optimisation of a structural model using an appropriate algorithm such as simulated annealing (SA) [1] or evolutionary computation [2].

Previously, we have reported the structure solution of several flexible thiazolide molecular crystals using SA, and the subsequent structure determination using both TOPAS-6 [3] and EXPO2014 [4]. Due to the complex nature of these crystal systems, this process was time-consuming. Thus, more efficient methods for structure solution are sought to aid the discovery of new functional materials.

Herein, we report the ongoing development of a Genetic Algorithm (PyCrystGA) for the structure solution of molecular crystals from PXRD, written in the Python programming language and integrated with TOPAS-6 [3]. PyCrystGA initialises a population of crystal structures, evaluates their fitness, carries out crossover and mutation, and selects the best solutions to take forward. It tracks detailed fitness statistics associated with each generation. Moreover, the genetic information corresponding to individual crystal structures within the population can easily be extracted, such as the variable torsion angles, molecular position, and molecular orientation. We also report the proof-of-concept application of PyCrystGA to solve the crystal structure of triformylbenzene, a simple yet conformationally flexible organic molecule. PyCrystGA should afford tangible advantages versus SA when applied to complex optimisation problems, as PyCrystGA employs a population-based optimisation approach backed by reliable Rietveld code, versus the single-solution approach found in SA.
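The generic loop described above (initialise, evaluate, crossover, mutate, select) can be sketched in a few lines. This is not PyCrystGA itself: the genes stand in for normalised torsion angles, positions and orientations, and the hypothetical `toy_misfit` function replaces the agreement between calculated and observed PXRD patterns that a Rietveld code would supply.

```python
import random


def run_ga(fitness, n_genes, pop_size=40, generations=60, mutation_rate=0.2, seed=0):
    """Minimise `fitness` over gene vectors in [0, 1] with a simple elitist GA."""
    rng = random.Random(seed)
    population = [[rng.uniform(0.0, 1.0) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    history = []  # best fitness found in each generation
    for _ in range(generations):
        ranked = sorted(population, key=fitness)
        history.append(fitness(ranked[0]))
        parents = ranked[: pop_size // 2]  # selection: keep the better half
        children = []
        while len(children) < pop_size - len(parents):
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)  # single-point crossover
            child = mother[:cut] + father[cut:]
            for i in range(n_genes):  # Gaussian mutation, clipped to [0, 1]
                if rng.random() < mutation_rate:
                    child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0.0, 0.05)))
            children.append(child)
        population = parents + children
    best = min(population, key=fitness)
    return best, history


# Hypothetical target parameters standing in for the true structure.
TARGET = [0.1, 0.9, 0.5, 0.3, 0.7, 0.2]


def toy_misfit(genes):
    """Squared distance to the target; a stand-in for a PXRD misfit."""
    return sum((g - t) ** 2 for g, t in zip(genes, TARGET))


best, history = run_ga(toy_misfit, n_genes=len(TARGET))
```

Because the better half of each generation is carried forward unchanged, the best fitness recorded in `history` can never get worse, which makes per-generation statistics easy to interpret.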

References 1. S. Kirkpatrick et al., Science, 1983, 220, 671–680. 2. H. J. Bremermann et al., in Self-Organizing Systems, Spartan Books, Washington D. C., 1962, pp. 93–106. 3. A. Coelho, J. Appl. Crystallogr., 2018, 51, 210–218. 4. A. Altomare et al., J. Appl. Crystallogr., 2013, 46, 1231–1235.

P15 © The Author(s), 2020 Summit: benchmarking machine learning for reaction optimisation

Kobi C. Felton1*, Jan G. Rittig2*, Alexei Lapkin1† 1Department of Chemical Engineering, University of Cambridge, Cambridge, UK 2RWTH Aachen University, Process Systems Engineering (AVT.SVT), Aachen 52074, Germany * Contributed equally to this work. † Corresponding author. Email: [email protected]

The development of novel and efficient chemical processes is essential to meeting grand challenges in healthcare, energy and sustainability. Reaction screening and optimisation is one key step to developing these chemical processes, but finding optimal conditions can be extremely time- and labor-intensive, especially when intuition is used. Promisingly, recent work has shown that machine learning (ML) can reduce the number of experiments required to discover optimal reaction conditions [1–5]. These ML strategies iteratively suggest new experiments based on past data, converging on optimal reaction conditions often within tens of experiments. However, there is a lack of standardisation in this emerging field. ML strategies have been applied to problems of varying difficulty, so it is challenging to compare the strategies. Additionally, there is no unified software framework for setting up ML reaction optimisation.

In this poster, we present two chemically-motivated virtual benchmarks for reaction optimisation and compare six ML strategies on these benchmarks. The benchmarks are either kinetic models of reactions with added noise or predictive models trained on experimental data, so they represent the difficulty of optimisation problems faced in the chemistry laboratory. Furthermore, the benchmarks and ML strategies are encompassed in a new open source framework written in Python named Summit. This framework can be used to create new benchmarks, compare new strategies, and apply the strategies to real reaction optimisation case studies.

Figure 1. Comparison of six ML strategies on a benchmark representing the optimisation of a nucleophilic substitution reaction in a flow reactor. The objectives of this benchmark are to maximise space-time yield (STY) and minimise E-factor (the ratio of the mass of waste to the mass of product). (a) The performance of each strategy is measured using the hypervolume, which always improves as more optimal trade-offs between STY and E-factor are determined [6]. The best performance is seen with Bayesian optimisation strategies, particularly TSEMO [5]. However, this comes at a greater computational cost. (b) GRYFFIN [3] quickly improves its hypervolume trajectory in less than ten iterations, but TSEMO has the best terminal performance. The mean is plotted with the 95% confidence interval.

Using Summit, we are able to conduct over 50,000 virtual experiments across different ML strategies. Our results show that Bayesian optimisation strategies perform very well on both benchmarks (e.g., Figure 1), while many strategies previously used in reaction optimisation fail to find optimal solutions. We also compare various methods for optimising across categorical variables (e.g., choosing catalysts, bases, etc.) and multiobjective optimisation problems. Overall, our work is a first step towards standardising ML for reaction optimisation and offering a toolkit for researchers to use in new problems.
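The hypervolume indicator used to score the strategies in Figure 1 can be illustrated for two objectives with a short sketch. It is written independently of Summit and does not use its API; the reference point and the experimental values are hypothetical. STY is negated so that both objectives are minimised.

```python
def hypervolume_2d(points, reference):
    """Area dominated by `points` and bounded by `reference`.

    Both objectives are minimised; each point is an (f1, f2) pair.
    """
    # Only points that improve on the reference in both objectives contribute.
    candidates = sorted(p for p in points
                        if p[0] < reference[0] and p[1] < reference[1])
    area = 0.0
    best_f2 = reference[1]
    for f1, f2 in candidates:
        if f2 < best_f2:  # non-dominated so far: adds a new horizontal strip
            area += (reference[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


def to_minimisation(sty, e_factor):
    """Map (maximise STY, minimise E-factor) onto two minimised objectives."""
    return (-sty, e_factor)


# Hypothetical (STY, E-factor) results from three virtual experiments.
experiments = [(3.0, 3.0), (2.0, 2.0), (1.0, 1.0)]
front = [to_minimisation(sty, e) for sty, e in experiments]

# Hypothetical reference point: STY of 0 and E-factor of 4.
reference = to_minimisation(0.0, 4.0)
hv = hypervolume_2d(front, reference)
```

Adding an experiment that is dominated by existing results leaves the hypervolume unchanged, while any new trade-off that is not dominated increases it, which is why the indicator never decreases over an optimisation run.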

References 1. Cortés-Borda et al., Org. Process Res. Dev., 2016, 20, 1979–1987. 2. Z. Zhou et al., ACS Cent. Sci., 2017, 3, 1337–1344. 3. W. Huyer et al., ACM Trans. Math. Softw., 2008, 35, 1–25. 4. F. Häse et al., 2020, arXiv:2003.1212732. 5. A. M. Schweidtmann et al., Chem. Eng. J., 2018, 352, 277–282. 6. E. Zitzler et al., in Evolutionary Multi-Criterion Optimization, Springer-Verlag, 2007, vol. 4403, pp. 862–876.

P16 © The Author(s), 2020 Machine-Learning assisted modelling of the Jacobsen epoxidation process: can Random Forests help optimize catalyst efficiency?

José Ferraz-Caetano REQUIMTE-LAQV - Faculty of Sciences, University of Porto, Portugal

Supervised and unsupervised Machine Learning (ML) tools have recently been used to assist in the development of reaction models for several chemical processes [1,2]. ML correlates a chemical property to molecular features by means of an algorithm. This leads to the creation of models able to predict activities for molecules possessing similar descriptor values within the specific chemical space, thus generating new chemical entities. Random Forests (RF), one of the most widespread machine learning algorithms, have been widely applied in chemistry research [3]. As they can work with very large datasets, they are quite popular among chemists who use ML models to handle data from complex reaction systems. However, their applications in catalysis have not yet been well studied.

The asymmetric epoxidation process is one of the most widely applied methods for the synthesis of enantiomerically stable organic compounds. Jacobsen and co-workers developed an established method that uses chiral Mn(salen) catalysts for the epoxidation of unfunctionalized olefins [4,5]. This reaction is predominantly used today in the production of intermediates for various chemical compounds, such as fine pharmaceuticals and agrochemicals. However, this methodology still has limitations for several substrates, which makes its further development a priority in industrial chemistry research.

In this communication, we present a methodology for the implementation of Random Forest algorithms for catalysis research in the Jacobsen epoxidation process. We highlight how this data science technique can help to address pressing issues and accelerate the progress of catalytic epoxidation. This work combines experimental and theoretical data to assess the efficiency of RF models in modeling catalytic efficiency, thus improving catalyst design. Preliminary models based on this methodology were successfully trained. The model was further tested using data from other known catalyst/substrate combinations and was shown to accurately predict reaction yields and enantiomeric excesses. These results are encouraging as to the applicability of this methodology to other catalytic environments, in combination with molecular modeling techniques.

Acknowledgements This work was supported by Fundação para a Ciência e a Tecnologia (FCT/MEC) through national funds and co-financed by FEDER, under the partnership agreement PT2020 (Projects UID/QUI/50006/2013 and POCI/01/0145/FEDER/007265). Further funding was also received from FCT/MEC under project REALM – Reactive Learning Machines (PTDC/QUI-QIN/30649/2017).

References 1. Butler, K. T., Davies, D. W., Cartwright, H., Isayev, O., Walsh, A. “Machine Learning for Molecular and Materials Science” Nature. 2018, 559, 547–555. 2. Philomena, S. L., Winther, K., Torres, J. A. G., Streibel, V., Zhao, M., Bajdich, M., Abild-Pedersen, F., Bligaard, T. “Machine Learning for Computational Heterogeneous Catalysis” ChemCatChem. 2019, 11, 3581–3601. 3. Yang, W., Fidelis, T., Sun, W. “Machine Learning in Catalysis, From Proposal to Practicing” ACS Omega 2020, 5 (1), 83–88. 4. Jacobsen, E. in “Catalytic Asymmetric Synthesis”, Ed. Ojima, I. VCH Publishers Inc., New York, 1993, chap. 4.2. 5. Johnson, R., Sharpless, K. in “Catalytic Asymmetric Synthesis”, Ed. Ojima, I. VCH Publishers Inc., New York, 1993, chap. 4.1.

P17 © The Author(s), 2020 Predicting molecular similarity and toxicity of mycotoxins using machine learning as a playground

Cláudia F. Ferreira*, Tânia F. G. G. Cova, Alberto A. C. C. Pais Coimbra Chemistry Center, Department of Chemistry, Faculty of Sciences and Technology, University of Coimbra, Rua Larga, 3004-535 Coimbra, Portugal *[email protected]

AIM. This study aims at developing predictive models for molecular similarity and toxicity of mycotoxins, using big data technologies and machine learning approaches. An efficient chemical data mining over 30 selected molecules described by 287 molecular fingerprints is proposed for virtual mycotoxin screening and toxicity prediction.

NEED. Mycotoxins are highly toxic secondary metabolites produced by fungi, and contamination can occur in a large number of agro-food products. Due to their toxic effects, mycotoxin contamination in food represents considerable health risks for humans and animals, compromising food security and sustainability worldwide. [1] As a result of climate change, it is estimated that mycotoxin contamination may trigger serious socio-economic problems on a global scale. [2] Water contamination by mycotoxins is an issue that was previously ignored but is now of increasing concern, and for which there is no definite solution.

The design of cost-effective strategies for simultaneously identifying and eliminating these fungal metabolites requires knowledge about relevant molecular fingerprints, interaction patterns, co-occurrence, synergistic effects, and fungal sources, as it can be a gateway to an effective answer to these societal obstacles.

METHODS. Aflatoxin B1 is selected as a lead compound. Hierarchical cluster analysis (HCA) and k-means are used for chemical structure analysis, and for understanding the molecular similarity between mycotoxins using their molecular fingerprints so that relevant properties can be predicted through the clustering map. [3] Principal Component Analysis (PCA) is employed for correlating mycotoxins’ distribution in the clusters with the molecular descriptors and building a model for classifying new structures and identifying target properties. [3] The power of artificial neural networks in predicting mycotoxins' toxicity and thereby classifying into different types and degrees of toxicity is also investigated. A Deep Neural Network is combined with the molecular fingerprints to improve knowledge over the selected mycotoxins and to develop models for predicting the respective toxicity profiles.
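As an illustration of how k-means groups molecules by fingerprint similarity, the following minimal sketch clusters a handful of binary fingerprints. The six-bit fingerprints are invented for the example and are far smaller than the 287 fingerprints used in the study; the function names are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, n_iter=50, seed=0):
    """Plain k-means: alternate between assigning points and updating centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each vector joins its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical binary fingerprints for six molecules from two structural families.
fingerprints = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
]
labels, centroids = k_means(fingerprints, k=2)
```

Molecules that share most of their set bits end up with the same label, which is the basis for reading family membership from the cluster map.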

RESULTS. Hierarchical cluster analysis and k-means clustering reveal five natural clusters consistent with the known mycotoxin families. The PCA results show that discrimination between mycotoxins is governed by the MDEC 33, ATSc3, C3SP2, BCUTp.1l, khs.dsCH, and ATSc2 descriptors. The optimal network structure comprises a two-layer model with ten and five nodes in the first and second layers, respectively. All models fit the data well, both during their construction (training) and when applied to unused data (validation).

CONCLUSIONS. This study proved successful in identifying mycotoxins that enable a prediction of their behavior and toxicity, as well as validating the clustering, and characterizing those molecules not so well described in the literature. This creates a gateway for subsequent classification, identification, and rapid and efficient characterization of possible new and unknown mycotoxins. Establishing the bridge between multivariate chemical data and the ability of models to predict and deal with relevant mycotoxin-related phenomena, such as co-occurrence and molecular recognition, and also the development of improved classification and remediation procedures is still a challenge, often limited by the experimental information available.

References 1. Anater, A., et al. (2016). Aquaculture, 451, 1-10. 2. Medina, A., et al. (2017). Fungal Biology Reviews, 31(3), 143-154. 3. Cova, T. F., & Pais, A. A. (2019). Frontiers in Chemistry, 7, 809.

P18 © The Author(s), 2020 Evaluate organic corrosion inhibitors through machine learning

Tiago L. P. Galvão1, Gerard Novell-Leruth1,2, Alena Kuznetsova1, João Tedim1, José R. B. Gomes2 1 CICECO-Aveiro Institute of Materials, Department of Materials and Ceramic Engineering, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal 2 CICECO-Aveiro Institute of Materials, Department of Chemistry, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal

Organic corrosion inhibitors play a crucial role in replacing traditional protective technologies, which have associated acute toxicity problems. However, why some organic compounds inhibit corrosion and others do not is still not well understood. Therefore, different machine learning [1] and interactive exploratory data approaches, using the R programming framework, are being used to distinguish efficient corrosion inhibitors for aluminium alloys commonly used in aeronautical applications. Herein, we will show the first steps on how these approaches can greatly contribute to automating the search for new and more efficient protective solutions in the future, having the potential to significantly accelerate research in the field by serving as a tool to perform an initial virtual screen of the molecules before more detailed experimental tests are performed.

Acknowledgements This work was developed within the scope of the project CICECO-Aveiro Institute of Materials, UIDB/50011/2020 & UIDP/50011/2020, financed by national funds through the FCT/MEC and when appropriate co-financed by FEDER under the PT2020 Partnership Agreement, as well as project DataCor (refs. POCI-01-0145-FEDER-030256 and PTDC/QUI-QFI/30256/2017, datacorproject.wixsite.com/datacor).

References 1. Elucidating Structure–Property Relationships in Aluminium Alloy Corrosion Inhibitors by Machine Learning, T.L.P. Galvão, G. Novell-Leruth, A. Kuznetsova, J. Tedim, J.R.B. Gomes, J. Phys. Chem. C 124 (2020) 5624–5635 (https://doi.org/10.1021/acs.jpcc.9b09538).

P19 © The Author(s), 2020 Can we synthesize molecules proposed by generative models?

Wenhao Gao and Connor Coley Massachusetts Institute of Technology, USA

The discovery of functional molecules is an expensive and time-consuming process, exemplified by the rising costs of small molecule therapeutic discovery. One class of techniques of growing interest for early stage drug discovery is de novo molecular generation and optimization, catalyzed by the development of new deep learning approaches. These techniques can suggest novel molecular structures intended to maximize a multiobjective function, e.g., suitability as a therapeutic against a particular target, without relying on brute-force exploration of a chemical space. However, the utility of these approaches is stymied by ignorance of synthesizability. To highlight the severity of this issue, we use a data-driven computer-aided synthesis planning program to quantify how often molecules proposed by state-of-the-art generative models cannot be readily synthesized. Our analysis demonstrates that there are several tasks for which these models generate unrealistic molecular structures despite performing well on popular quantitative benchmarks. Synthetic complexity heuristics can successfully bias generation toward synthetically tractable chemical space, although doing so necessarily detracts from the primary objective. This analysis suggests that to improve the utility of these models in real discovery workflows, new algorithm development is warranted.

References 1. Gao, Wenhao, and Connor W. Coley. "The synthesizability of molecules proposed by generative models." Journal of Chemical Information and Modeling (2020).

P20 © The Author(s), 2020 Improving VAE molecular representations by tailoring them to predict docking poses and scores

Miguel García-Ortegón1,2,3, Andreas Bender2, Carl E. Rasmussen3, Hiroshi Kajino4, Sergio Bacallado1 1. Department of Pure Mathematics and Mathematical Statistics, Center for the Mathematical Sciences, University of Cambridge, UK, 2. Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, UK, 3. Department of Engineering, University of Cambridge, UK, 4. MIT-IBM Watson AI Lab; IBM Research, Tokyo, Japan

In this work, we aim to improve latent vector representations from variational autoencoders (VAE) for the purpose of predicting docking scores against a target protein of interest. We computed docking scores using AutoDock Vina and chose the neural architecture of the molecular-hypergraph-grammar VAE. As a baseline, we tried to predict docking scores from the latent vectors of a VAE that had been trained in an entirely unsupervised manner. Then, we trained another model using a "train with predict" strategy, such that the encoder and decoder components of the VAE were tuned together with prediction modules and a prediction error term was included in the loss. We then predicted docking scores from these customized latent vectors and compared the performance with the baseline. We experimented with two prediction modules: one for the docking score and one for the docking pose, which was represented through voxelization. We evaluated our method against the dopamine receptor D2, using a combination of unlabelled molecules from ZINC and labelled molecules from ExCAPE as our training set.
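The "train with predict" objective described above can be sketched as a sum of three terms. This is only an illustration under stated assumptions: the function names, the squared-error prediction term and the `weight` parameter are hypothetical, and the abstract does not specify the actual form or weighting of the loss.

```python
import math

def kl_divergence(mu, log_var):
    """KL divergence between N(mu, diag(exp(log_var))) and a standard normal prior."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

def joint_loss(reconstruction_loss, mu, log_var, predicted_score, docking_score, weight=1.0):
    """Standard VAE loss plus a weighted prediction-error term (assumed squared error)."""
    prediction_error = (predicted_score - docking_score) ** 2
    return reconstruction_loss + kl_divergence(mu, log_var) + weight * prediction_error
```

Setting `weight` to zero recovers the purely unsupervised baseline, so the same sketch covers both training regimes compared in the abstract.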

References 1. Diederik P Kingma, & Max Welling. (2013). Auto-Encoding Variational Bayes. 2. Trott, O., & Olson, A. J. (2010). AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of Computational Chemistry, 31(2), 455–461. https://doi.org/10.1002/jcc.21334 3. Hiroshi Kajino (2019). Molecular Hypergraph Grammar with Its Application to Molecular Optimization. In Proceedings of the 36th International Conference on Machine Learning (pp. 3183–3191). PMLR. 4. Wang, B. (2018). Structure of the D2 dopamine receptor bound to the atypical antipsychotic drug risperidone. Nature, 555(7695), 269–273. 5. Sterling, J. (2015). ZINC 15 – Ligand Discovery for Everyone. Journal of Chemical Information and Modeling, 55(11), 2324–2337. 6. Sun, H. (2017). ExCAPE-DB: an integrated large scale dataset facilitating Big Data analysis in chemogenomics. Journal of Cheminformatics, 9(1), 17.

P21 © The Author(s), 2020 Combining phonon accuracy with high transferability in machine-learned interatomic potentials

J. George,1 G. Hautier,1 A. P. Bartók,2 G. Csányi,3 V. L. Deringer4 1Université catholique de Louvain, Belgium, 2University of Warwick, United Kingdom, 3University of Cambridge, United Kingdom, 4University of Oxford, United Kingdom

Vibrational properties are of high importance for inorganic materials and their applications. For example, thermoelectric materials require low thermal conductivities. [1] Thermal conductivity and other phonon properties can be computed with density functional theory (DFT), but this is typically very slow. DFT-based computations of thermal conductivity are therefore not suitable for high-throughput searches, and such direct computations are typically circumvented. [2] In this poster, we will explore how machine-learned interatomic potentials can be used to compute phonon properties in an accelerated way. [3] The resulting phonon band structures are in excellent agreement with DFT reference data (agreement of phonon frequencies within 0.1 – 0.2 THz). Moreover, the new machine-learned interatomic potential also arrives at excellent predictions for amorphous materials.

References 1. G. J. Snyder, E. S. Toberer, Nat. Mater. 2008, 7, 105. 2. C. Toher, J. J. Plata, O. Levy, M. de Jong, M. Asta, M. B. Nardelli, S. Curtarolo, Phys. Rev. B, 2014, 90, 174107. A. Seko, A. Togo, H. Hayashi, K. Tsuda, L. Chaput, I. Tanaka, Phys. Rev. Lett. 2015, 115, 205901. 3. J. George, G. Hautier, A. P. Bartók, G. Csányi, V. L. Deringer, J. Chem. Phys., 2020, 153, 044104.

P22 © The Author(s), 2020

Accelerating chemical discoveries by automated reaction space exploration

Stephanie A. Grimmel and Markus Reiher ETH Zurich, Switzerland

Imagine anyone could provide a computer program with a list of chemical substances and reaction conditions and, without any further user intervention or prior knowledge required, predict the reaction mechanism and products based on the laws of quantum physics. Novel chemical phenomena could be discovered after a very limited investment of human time and, consequently, at an unprecedented rate. In this endeavor, we develop the software Chemoton1 that allows for the automated exploration of chemical reaction networks.2,3

The decisive feature of any algorithm for chemical space exploration is how the considered elementary steps are selected. Programs that apply graph-based chemical rules profit from large savings in computational time but suffer from being limited to existing mechanistic knowledge. On the contrary, brute-force explorations might promise detailed insights but are hampered by the immense computational effort resulting from the complexity of chemical space.

In Chemoton we aim to get the best of both worlds by applying first-principles heuristics:4 The exploitation of physical properties of the involved molecules enables us to establish algorithms that are agnostic regarding the limitations of preexisting chemical knowledge while still reducing the computational cost. As an example, we present the possibility of analyzing the molecular electrostatic potential for predicting protonation sites that we applied to the exploration of the mechanism of proton reduction catalyzed by a hydrogenase model complex.5

References 1. Simm, G. N.; Reiher, M. J. Chem. Theory Comput. 2017, 13, 6108-6119. 2. Simm, G. N.; Vaucher, A. C.; Reiher, M. J. Phys. Chem. A 2019, 123, 385-399. 3. Unsleber, J. P.; Reiher, M. Annu. Rev. Phys. Chem. 2020, 71, 121-142. 4. Bergeler, M.; Simm, G. N.; Proppe, J.; Reiher, M. J. Chem. Theory Comput. 2015, 11, 5712-5722. 5. Grimmel, S. A.; Reiher, M. Faraday Discuss. 2019, 220, 443-463.

P23 © The Author(s), 2020 Machine learned transition pathways between molecular crystal structures

Roohollah Hafizi and Graeme Day University of Southampton, United Kingdom

Finding transition pathways that connect the minima of a chemical system is essential for understanding the kinetics of polymorphic transformations and for calculating the associated activation energies and, therefore, transition rates. However, locating the transition states can be challenging, mainly because 1) identifying the reaction coordinate is not straightforward, 2) the energy function along the reaction coordinate is often not explicitly known, and 3) even when the energy surface is available, the configurational space is prohibitively high dimensional for a general search. As a successful strategy to obtain minimum energy pathways (MEPs), many chain-of-states methods have been developed in the past two decades. Following the idea of using a line integral representation of a discretized path for optimization, nudged elastic band (NEB) methods were developed by connecting the atoms of structures (replicas) in the chain by springs and relaxing the chain by including only the perpendicular component of the true force and the parallel component of the spring force. Despite being a successful methodology, the convergence of chain-of-states methods can be quite slow, mainly because of the high number of degrees of freedom of a "chain", which typically requires hundreds of geometry optimization steps to acquire an optimized path. Therefore, generating reference reaction paths that closely approximate a valid MEP will increase the efficiency of pathfinding by increasing the convergence rate.

Among the various, mostly statistical, sampling methods, such as umbrella sampling, metadynamics, Transition Path Sampling (TPS) and Dominant Reaction Path (DRP), linear interpolation is arguably the most widely applied method for generating reference reaction paths. In this method, the lattice vectors and atoms of the initial state are mapped onto the equivalent lattice vectors and atoms in the final state, and the chain-of-states is constructed by linear interpolation of the lattice vectors and atomic coordinates. A proper choice of atom mapping, which is essential in this procedure, is usually not obvious. When dealing with a molecular crystal, the situation is even more complicated: linear interpolation of atomic positions will not work because there are intra-molecular constraints on the relative positions of atoms, so molecular positions and orientations should be interpolated instead of atomic positions. Except for straightforward cases, this type of interpolation often results in the clashing of molecules in the chain-of-states, which makes NEB much harder, sometimes impossible, to converge.
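The linear-interpolation baseline described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `linear_chain` and the tuple-based coordinate format are assumptions, the atom mapping is taken as already chosen, and lattice vectors, periodic wrapping and molecular rigidity are ignored.

```python
def linear_chain(initial, final, n_images):
    """Return n_images structures linearly interpolated from initial to final.

    initial, final: lists of (x, y, z) tuples with atoms in the same (mapped) order.
    Both end points are included, so n_images must be at least 2.
    """
    if len(initial) != len(final):
        raise ValueError("structures must contain the same number of atoms")
    if n_images < 2:
        raise ValueError("need at least the two end points")
    chain = []
    for k in range(n_images):
        t = k / (n_images - 1)  # 0.0 at the initial state, 1.0 at the final state
        image = [
            tuple(a + t * (b - a) for a, b in zip(p, q))
            for p, q in zip(initial, final)
        ]
        chain.append(image)
    return chain
```

Because every atom moves along a straight line, nothing in this construction prevents atoms or molecules from overlapping in the intermediate images, which is the clashing problem noted in the text.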

In this work, we target approximating valid transition pathways between energy landscape minima of molecular crystals, not statistically but by uniform sampling of all structures connecting two end points. We use a spherical harmonic transform of the Hirshfeld surface of atoms as the structural descriptor to measure the similarity of sampled points. Based on the similarities, a connected neighborhood graph of the structures is constructed, by which the connection between different pairs of sampled points, including the start and end points, can be found. While no energy evaluation is involved in predicting the pathways, by testing our methodology on finding the pathways between experimental minima of urea and oxalic acid, we show how physical the pathways can be.
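The idea of finding a path through a neighborhood graph of sampled structures can be sketched as follows. This is a hedged illustration only: the descriptors are stand-in numeric vectors compared by Euclidean distance (the abstract uses a spherical harmonic transform of the Hirshfeld surface), and the k-nearest-neighbour construction and Dijkstra search are assumed choices, not necessarily those of the authors.

```python
import heapq
import math

def knn_graph(points, k):
    """Connect each point to its k nearest neighbours (undirected, distance-weighted)."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph

def shortest_path(graph, start, end):
    """Dijkstra's algorithm; returns node indices from start to end, or None."""
    best = {start: 0.0}
    previous = {}
    queue = [(0.0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == end:
            path = [node]
            while node in previous:
                node = previous[node]
                path.append(node)
            return path[::-1]
        if cost > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbour, weight in graph[node].items():
            new_cost = cost + weight
            if new_cost < best.get(neighbour, math.inf):
                best[neighbour] = new_cost
                previous[neighbour] = node
                heapq.heappush(queue, (new_cost, neighbour))
    return None
```

A return value of `None` signals that the two end points lie in disconnected parts of the graph, which in this sketch would mean the sampling is too sparse or `k` is too small.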

P24 © The Author(s), 2020 Developing rapid powder diffraction analysis for efficient characterisation of new materials

S. Hodgkissa, b, S. Y. Chongb* a Leverhulme Research Centre for Functional Materials Design, UK, b Department of Chemistry and Materials Innovation Factory, University of Liverpool, UK *[email protected], [email protected]

High-throughput (HT) synthesis is an exciting approach that can accelerate the discovery of new materials. One factor that currently frustrates the use of such methods is the need to analyse large volumes of data for structural characterisation of multiple compounds. Powder X-ray diffraction (PXRD) allows identification of the crystal structure of a material, which directly influences its functional properties. PXRD is compatible with a variety of high-throughput synthetic techniques, and many samples can be measured within short time frames. For example, PXRD has previously been incorporated alongside HT synthesis ‘on-the-fly’ due to the efficiency and ease of data collection [1].

The discovery of new organic materials is key in many areas from pharmaceuticals to molecular separations [2]. Organic crystals, however, tend to generate PXRD patterns which are more challenging to characterise than those of higher symmetry inorganic solids, due to lower intensity scattering and a higher degree of overlap of the diffraction peaks. Computational packages have previously been developed to address the restriction on the rapid analysis of large numbers of datasets, but these tend not to be designed to tackle the challenges posed by organic PXRD patterns. [1,3] Therefore, by incorporating machine learning techniques designed to tackle these challenges alongside HT synthesis, the efficiency of the discovery of new materials could significantly increase. Work within the group aims to develop methods implemented as a collection of Python scripts to perform rapid analysis of PXRD patterns. To enhance HT methods, highly crystalline patterns can be identified, analysed and clustered together based on their structural information, in order to guide future HT screens towards new functional materials.

References 1. A. G. Kusne, T. Gao, A. Mehta, L. Ke, M. C. Nguyen, K.-M. Ho, V. Antropov, C.-Z. Wang, M. J. Kramer, C. Long and I. Takeuchi, Sci. Rep., 2014, 4, 6367. 2. L. Chen, P. S. Reiss, S. Y. Chong, D. Holden, K. E. Jelfs, T. Hasell, M. A. Little, A. Kewley, M. E. Briggs, A. Stephenson, K. M. Thomas, J. A. Armstrong, J. Bell, J. Busto, R. Noel, J. Liu, D. M. Strachan, P. K. Thallapally and A. I. Cooper, Nat. Mater., 2014, 13, 954. 3. D. Lau, D. G Hay, M. R Hill, B. W Muir, S. Furman and D. Kennedy, Comb. Chem. High Throughput Screen., 2010, 14, 28–35.

P25 © The Author(s), 2020 Towards the application of machine learning for MoS2 global optimization problem

Jiri Hostas(1)*, Alain B. Tchagang(2), Maicon P. Lourenço(3), Dennis R. Salahub(1) (1) Department of Chemistry, CMS - Centre for Molecular Simulation, IQST - Institute for Quantum Science and Technology, Quantum Alberta, University of Calgary, 2500 University Drive NW, Calgary, AB, T2N 1N4, Canada, (2) Digital Technologies Research Centre, National Research Council of Canada, 1200 Montreal Road, Ottawa Ontario K1A 0R6, Canada, (3) Departamento de Química e Física, Centro de Ciências Exatas, Naturais e da Saúde (CCENS), Universidade Federal do Espírito Santo, 29500-000, Alegre, Espírito Santo, Brazil.

The study of materials in the nanoscale regime has important applications for catalytic reactions, the energy industry and medicine. Molybdenum disulfide (MoS2), a relatively unreactive material, has been shown to possess a rich variety of catalytic properties ranging from hydrodesulfurization, to the water gas reaction to hydrogenation. These applications often require the catalyst to be prepared in situ in the form of nano-particles (NPs). However, the atomistic resolution of the catalyst still remains an unresolved problem in such complex reaction mixtures in most cases.

We have performed exploratory density functional theory (DFT) calculations for Mo8S16 and Mo16S32 NPs using two global optimization methods, Minima Hopping and Born-Oppenheimer Molecular dynamics in the deMon2k software.[1] We found NPs with lower-lying energies than those of locally optimized crystal geometries, which is in agreement with the suspected phase transition from a tetragonal to a rhombohedral/hexagonal lattice arrangement during the formation of these small MoS2 NPs. The final goal here would be to describe 1-5nm NPs with several hundreds of atoms and perform analysis of their catalytic properties such as reaction activation barriers for the main competing reactions. However, the computational cost of these calculations drastically increases with the system size.

Machine learning (ML) interatomic potentials open the door to speeding up calculations on large systems while retaining the accuracy of DFT calculations. We have successfully parametrized several initial models based on Behler and Parrinello atom-centered Gaussian functions combined with either neural networks (NN) or kernel ridge regression using the Amp Python package.[2,3] Our most accurate potential is routinely used for total energy predictions as well as local optimization of small MoS2 NPs. Furthermore, a Bayesian optimizer was used to infer the optimum Gaussian cutoff and width in the development of ML interatomic NN potentials. We will present our initial results and discuss the advantages and practical aspects of the bootstrap technique, which provides uncertainty intervals for the predictions. Depending on the progress of the project, we will also present the prediction results using other descriptors such as the Coulomb matrix and the transferability of the parametrized model from smaller to larger system sizes in the global optimization regime.
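To make the role of the Gaussian width and cutoff concrete, the sketch below evaluates a radial atom-centered symmetry function of the Behler–Parrinello type for one atom. It is an illustration written from the published functional form, not code from the Amp package; the function and parameter names (`eta`, `r_cut`, `r_shift`) are chosen here for readability.

```python
import math

def cosine_cutoff(r, r_cut):
    """Smoothly decay from 1 at r = 0 to 0 at r = r_cut (zero beyond the cutoff)."""
    if r >= r_cut:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_cut) + 1.0)

def radial_symmetry_function(positions, centre, eta, r_cut, r_shift=0.0):
    """Sum of cutoff-damped Gaussians over all neighbours of atom `centre`.

    eta controls the Gaussian width and r_cut the cutoff radius; these are the
    two hyperparameters a Bayesian optimizer could tune.
    """
    total = 0.0
    for j, pos in enumerate(positions):
        if j == centre:
            continue
        r = math.dist(positions[centre], pos)
        total += math.exp(-eta * (r - r_shift) ** 2) * cosine_cutoff(r, r_cut)
    return total
```

In practice a descriptor vector is built from several such functions with different `eta` and `r_shift` values, per element pair, and angular functions are usually added as well.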

References 1. A.M. Koster et al., deMon2k, Version 5, The deMon developers, Cinvestav, Mexico City (2018). 2. J. Behler and M. Parrinello: Generalized Neural-Network Representation of High-Dimensional Potential-Energy Surfaces, Phys. Rev. Lett. 2007, 98, 146401 3. Khorshidi A. and Peterson A.A.: Amp: A modular approach to machine learning in atomistic simulations, Comput. Phys. Commun. 2016, 207, 310-324.

Work supported by the National Research Council of Canada, Artificial Intelligence for Design program and by the Natural Sciences and Engineering Research Council of Canada, Discovery Grant (RGPIN-2019-03976).

P26 © The Author(s), 2020 Using collective knowledge to assign oxidation states

Kevin Maik Jablonka, Daniele Ongari, Seyed Mohamad Moosavi and Berend Smit Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingenierie Chimiques (ISIC), Ecole Polytechnique Fédérale de Lausanne (EPFL), Sion, VS, Switzerland

Knowledge of the oxidation state of a metal center in a material is essential to understand its properties and a key ingredient of chemical reasoning. Chemists have developed several theories to predict the oxidation state on the basis of the chemical formula or on the electron donation count of the ligands. These methods are quite successful for simple compounds but often fail to describe the oxidation states of more complex systems, such as metal-organic frameworks. But still, experienced chemists often have a good intuition for the oxidation state of a metal center.

In this presentation, we introduce a data-driven approach that harvests this intuition to automatically assign oxidation states.[1] To do so, we trained a classifier on the assignments by chemists encoded in the chemical names in the Cambridge Structural Database (CSD). [2, 3] We found our algorithm to be robust to most of the experimental uncertainties in the structures (like incorrect protonation or unbound solvents). It has excellent accuracy (> 98%), and also gives a measure of confidence in its predictions.

By investigating the cases in which the high-confidence predictions of our model disagree with the assignment in the CSD, we could identify a large number of incorrect assignments, many of which were already confirmed by the authors.

Notably, the predictions of our model follow chemical intuition and are often interpretable in terms of the local coordination geometry.

This work nicely illustrates how powerful the collective knowledge of chemists actually is. Machine learning can harvest this knowledge and convert it into a useful tool for chemists.

References 1. The tool is available at go.epfl.ch/oximachine. 2. Groom, C. R.; Bruno, I. J.; Lightfoot, M. P.; Ward, S. C. Acta Crystallogr B Struct Sci Cryst Eng Mater 2016, 72 (2), 171–179. 3. Moghadam, P. Z.; Li, A.; Wiggin, S. B.; Tao, A.; Maloney, A. G. P.; Wood, P. A.; Ward, S. C.; Fairen-Jimenez, D. Chem. Mater. 2017, 29 (7), 2618–2625.

P27 © The Author(s), 2020

A machine learning regression model for natural product repurposing as potential SARS-CoV-2 protease inhibitors

Jose Isagani Janairo De La Salle University, Philippines

Coronavirus disease 2019 (COVID-19), caused by the 2019 novel coronavirus (SARS-CoV-2) that originated in Wuhan, China, has already reached pandemic levels. This global health concern has led to intensified efforts to find effective therapeutic strategies to manage the disease. A viable strategy for the development of SARS-CoV-2 antivirals is searching for natural products that can inhibit key processes in the viral life cycle. Natural products are ideal sources to be considered since they are readily available, and looking into their activities against the virus can help identify promising leads much faster. Moreover, promising natural products may be derivatized into more potent antiviral agents, thereby shortening the design phase in the drug discovery process [1]. In an effort to accelerate the screening and drug discovery workflow for potential SARS-CoV-2 protease inhibitors, a multiple linear regression model that can predict the binding free energies of compounds to the SARS-CoV-2 main protease is presented. The regression model, which was trained and tested on 83 compounds [2], demonstrates reliable prediction performance ($r^2_{\mathrm{test}} = 0.91$, $\mathrm{RMSE}_{\mathrm{test}} = 0.501$), while requiring only three topological descriptors (second order weighted path, topological surface area efficiency, vertex adjacency information). The formulated regression model was externally validated using compounds from two independent studies [3,4], wherein the resulting predictive performance remained satisfactory ($r^2_{\mathrm{validation}} = 0.81$, $\mathrm{RMSE}_{\mathrm{validation}} = 0.822$), highlighting the robustness of the model. A previously reported QSAR model for SARS-CoV-2 protease inhibitors required the topological surface area, molecular weight, XLogP, hydrogen bond donor and hydrogen bond acceptor descriptors and exhibited $r^2_{\mathrm{test}} = 0.753$ [5]. Thus, the presented regression model introduces new variables that can lead to better prediction of the binding interaction of compounds with the viral enzyme. The parsimony and reliability of the formulated regression model can accelerate the discovery and development of natural products against SARS-CoV-2. It can help conserve resources by limiting biological assays to those compounds that yielded favorable outcomes from the model. Moreover, since the outcome of the model is an estimate of the binding free energy, systematic molecular modifications can be carried out in order to increase the affinity of the candidate compounds to the target protein. The emergence of highly infectious diseases will always be a threat to human health and development, which is why the development of computational tools for rapid response is important.
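The two performance statistics quoted for the regression model can be computed as in the sketch below. It assumes that r² denotes the coefficient of determination (the abstract does not say whether this or the squared correlation coefficient was used), and the function names are illustrative.

```python
import math

def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

Applied separately to the test set and to the external validation set, these functions would yield the pairs of numbers reported above; RMSE carries the units of the predicted binding free energy.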

References 1. G. Rastelli, F. Pellati, L. Pinzi and M. C. Gamberini, Molecules, 2020, 25, 1154. 2. Y.-M. Yan, X. Shen, Y.-K. Cao, J.-J. Zhang, Y. Wang and Y.-X. Cheng, Preprints, 2020, 2020020254. 3. S. Khaerunnisa, H. Kurniawan, R. Awaluddin, S. Suhartati and S. Soetjipto, Preprints, 2020, DOI:10.20944/preprints202003.0226.v1. 4. S. Farabi, R. Saha and A. Khan, ChemRxiv, 2020, DOI:https://doi.org/10.26434/chemrxiv.12440024.v1. 5. R. Islam, R. Parves, A. S. Paul, N. Uddin, M. S. Rahman, A. Al Mamun, M. N. Hossain, M. A. Ali and M. A. Halim, J. Biomol. Struct. Dyn., 2020, 1–20.

P28 © The Author(s), 2020 Uniform quantitative predictive modelling for route design

Kjell Jorner1, Tore Brinck2, Per-Ola Norrby3 and David Buttar1 1Early Chemical Development, Pharmaceutical Sciences, R&D, AstraZeneca Macclesfield, UK, 2Department of Chemistry, KTH Royal Institute of Technology, Stockholm, Sweden, 3Data Science and Modelling, Pharmaceutical Sciences, R&D, AstraZeneca Gothenburg, Sweden

Synthesizing drug molecules is a challenge tackled by specialists in synthetic organic chemistry. To decide which route to take, and which synthetic approaches to try, synthetic chemists rely on qualitative reactivity principles and knowledge of the chemical literature. Predictive computer tools are now emerging to aid in this task, driven by recent advances in machine learning and increased access to chemical data. Retrosynthesis tools trained on patent data, commercial reaction databases and internal company data can quickly suggest routes to complex drug molecules. Reaction prediction tools can then indicate how likely it is that suggested transformations will work. The vision is that predictive tools will be incorporated in automated artificially intelligent reaction discovery platforms, where drug candidates are designed in the computer and then synthesized and tested on the fly.

We develop reaction prediction tools for use in drug development, where accuracy with respect to targets like reaction yield and selectivity is paramount. These tools are based on a combination of quantum-chemical simulations of the reaction mechanism and machine learning based on high-quality experimental data. We are primarily targeting the nucleophilic aromatic substitution reaction (SNAr), which comprises ca. 9% of all reactions carried out in the pharmaceutical industry. In the SNAr reaction, a group on an aromatic ring (the leaving group) is replaced by another group (the nucleophile). We show that machine learning models with chemical accuracy can be built using ca. 150-200 rate constants from experiment. These models use descriptors derived from quantum-mechanical calculations both for the reactants and the transition states. We achieve a mean absolute error lower than 1 kcal/mol for predicting absolute activation energies. These models are sufficiently accurate to guide experimental work and will be deployed internally in AstraZeneca route design and development projects.
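To put the 1 kcal/mol figure in context, the sketch below relates a rate constant to an activation free energy through the Eyring equation of transition-state theory. This conversion is an assumption made for illustration: the abstract does not state how the experimental rate constants were turned into activation energies, and the function names are hypothetical.

```python
import math

R_KCAL = 1.98720425864e-3  # gas constant, kcal mol^-1 K^-1
K_B = 1.380649e-23         # Boltzmann constant, J K^-1
H = 6.62607015e-34         # Planck constant, J s

def activation_free_energy(rate_constant, temperature=298.15):
    """Eyring equation solved for the barrier (kcal/mol); rate constant in s^-1."""
    return -R_KCAL * temperature * math.log(rate_constant * H / (K_B * temperature))

def rate_ratio(energy_error, temperature=298.15):
    """Factor by which the predicted rate changes for a given barrier error (kcal/mol)."""
    return math.exp(energy_error / (R_KCAL * temperature))
```

Under these assumptions, an error of 1 kcal/mol at room temperature corresponds to roughly a five-fold error in the predicted rate, which is one way to read the "chemical accuracy" threshold. For bimolecular rate constants a standard-state concentration would also have to be specified.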

P29 © The Author(s), 2020

Driving spectroscopic discovery with deep learning

Kin Long Kelvin Leea,b, Alexander Macleodc, P. Brandon Carrollb, Kyle N. Crabtreed, Brett McGuirea, Michael M. McCarthyb aMassachusetts Institute of Technology, Cambridge MA, USA, bCenter for Astrophysics | Harvard & Smithsonian, Cambridge MA, USA, cUniversity of Massachusetts, Lowell, Lowell MA, USA, dUniversity of California, Davis, Davis CA, USA

In recent times, rotational spectroscopy has received significant technological improvements through the widespread availability of high speed, high bandwidth electronics: signal generators, amplifiers, and receivers. Consequently, we see a significant increase in the quality, resolution, and of course volume of rotational spectra: modern chirped-pulse Fourier transform spectrometers obtain over an octave (>10 GHz) of spectra (1,2) overnight. In the case of complex discharge mixtures, these spectra contain hundreds to thousands of spectral features, many of which correspond to completely unknown molecules. As an example, we've recently demonstrated the resolving and discovery power of broadband microwave spectroscopy on discharge mixtures, where we were able to untangle over 200 unique species, with 50 previously unreported in the literature (3).

The rise in the quality and volume of spectroscopic data, of course, comes at the cost of analysis. Generally speaking, analysing high resolution spectra is a manual and arduous task requiring a large amount of expertise. In the case of rotational spectroscopy, there are two main aspects: first, assigning lines to molecular constituents, and second, identifying unknown molecules based on their spectroscopic parameters. Given the substantial increase in data volume, deep learning, and more broadly machine learning, is well suited to automating a significant amount of spectroscopic analysis. In this poster, I will detail the efforts we have undertaken over the last year to develop probabilistic deep learning models that assist rotational spectroscopy analysis, both in identifying unknown molecules based on spectroscopic constants (4), and in using sequence models combined with deep reinforcement learning for spectroscopic complex mixture analysis.

References 1. Brown, G. G.; Dian, B. C.; Douglass, K. O.; Geyer, S. M.; Shipman, S. T.; Pate, B. H. A Broadband Fourier Transform Microwave Spectrometer Based on Chirped Pulse Excitation. Review of Scientific Instruments 2008, 79 (5), 053103. https://doi.org/10.1063/1.2919120. 2. Park, G. B.; Field, R. W. Perspective: The First Ten Years of Broadband Chirped Pulse Fourier Transform Microwave Spectroscopy. J. Chem. Phys. 2016, 144 (20), 200901. https://doi.org/10.1063/1.4952762. 3. McCarthy, M. C.; Lee, K. L. K.; Carroll, P. B.; Porterfield, J. P.; Changala, P. B.; Thorpe, J. H.; Stanton, J. F. Exhaustive Product Analysis of Three Benzene Discharges by Microwave Spectroscopy. J. Phys. Chem. A 2020, 124 (25), 5170–5181. https://doi.org/10.1021/acs.jpca.0c02919. 4. McCarthy, M.; Lee, K. L. K. Molecule Identification with Rotational Spectroscopy and Probabilistic Deep Learning. J. Phys. Chem. A 2020, 124 (15), 3002–3017. https://doi.org/10.1021/acs.jpca.0c01376.

P30 © The Author(s), 2020

Evaluations of sequential pre-processing strategy on spectral data for forensic ink analysis purpose

Loong Chuen Lee Universiti Kebangsaan Malaysia, Malaysia

Data pre-processing is an essential aspect of the spectral data modelling pipeline. Data collected via instrumental techniques, especially spectroscopic techniques such as Attenuated Total Reflectance-Fourier Transform Infrared (ATR-FTIR) spectroscopy, can seldom be interpreted accurately without first being pre-processed. Various data pre-processing (DP) methods are available for diverse purposes, e.g. scattering correction and the elimination of noisy signals. The aim of this work is to evaluate the merits and pitfalls of sequential DP procedures in the statistical modelling of ATR-FTIR spectral data of white copy paper. The practical purpose is to construct a prediction model to determine the source (i.e. paper manufacturer) of white copy paper collected from a crime scene. The 150 spectra can be classified into three different classes (i.e. paper manufacturers). The raw data were pre-processed with three different DP methods, i.e. normalization to sum (NS), vector normalization and asymmetric least squares (AsLS), in both individual and sequential settings. To assess the impact of the spectral window, the pre-processing procedures were also performed on a sub-region derived from the global region. In total, 14 sub-datasets, including the raw ones, were prepared and modelled via principal component analysis-linear discriminant analysis. Each sub-dataset was repeatedly validated via 999 pairs of training and test sets to derive the respective internal and external error rates, as well as the corresponding standard errors. Results showed that the best DP strategies for the global region and the sub-region of the spectral data are AsLS-NS and NS, respectively. In fact, the sub-region pre-processed via NS is the only one achieving zero error rates in both internal and external validations. In conclusion, a sequential DP strategy is much more promising than a single DP approach, but the advantage could be diminished when the right sub-region is used for the modelling.
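Two of the pre-processing steps named above, and the idea of chaining them sequentially, can be sketched as follows. The definitions used here are common textbook ones and are assumptions: normalization to sum divides by the total intensity, vector normalization divides by the Euclidean norm (some implementations mean-centre first), and the AsLS baseline correction is omitted because it needs a penalized least-squares solver. All function names are illustrative.

```python
import math

def normalize_to_sum(spectrum):
    """Scale intensities so that they sum to 1 (assumed definition of NS)."""
    total = sum(spectrum)
    return [x / total for x in spectrum]

def vector_normalize(spectrum):
    """Scale intensities so that the spectrum has unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in spectrum))
    return [x / norm for x in spectrum]

def apply_sequence(spectrum, steps):
    """Apply pre-processing functions one after another, in the given order."""
    for step in steps:
        spectrum = step(spectrum)
    return spectrum
```

A sequential strategy such as AsLS-NS would correspond to `apply_sequence(spectrum, [baseline_correct, normalize_to_sum])`, where `baseline_correct` stands for a hypothetical AsLS implementation supplied separately.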

References 1. A. Rinnan, F. van den Berg, S.B. Engelsen, 2009. Review of the most common pre-processing techniques for near-infrared spectra. TrAC Trends in Analytical Chemistry 28: 1201-1222. 2. K. Banas, A. Banas, M. Gajda, B. Pawlicki, W.M. Kwiatek, M.B.H. Breese, 2015. Pre-processing of Fourier transform infrared spectra by means of multivariate analysis implemented in the R environment. 140: 2810-2814. 3. L.C. Lee, C.Y. Liong, A.A. Jemain, 2017. A contemporary review on Data Preprocessing (DP) practice strategy in ATR-FTIR spectrum. Chemometrics and Intelligent Laboratory Systems 163: 64-75. 4. K.L. Andrew Chan, S.G. Kazarian, 2016. Attenuated total reflection Fourier transform infrared (ATR-FTIR) imaging of tissues and live cells. 45: 1850-1864. 5. N. Legner, C. Meinen, R. Rauber, 2018. Root differentiation of agricultural plant cultivars and proveniences using FTIR spectroscopy. Frontiers in Plant Science 9: 748. 6. L.C. Lee, A.A. Jemain, 2019. Predictive modelling of colossal ATR-FTIR spectral data using PLS-DA: Empirical differences between PLS1-DA and PLS2-DA algorithms. Analyst 144: 2670-2678. 7. L.C. Lee, C.Y. Liong, K. Osman, A.A. Jemain, 2016. Comparison of several variants of principal component analysis (PCA) on forensic analysis of paper based on IR spectrum. Advances in Industrial and Applied Mathematics: Proceedings of 23rd Malaysian National Symposium of Mathematical Sciences, SKSM 2015. American Institute of Physics Inc., 1750: 060012. 8. K.H. Liland, T. Almoy, B.-H. Mevik, 2010. Optimal choice of baseline correction for multivariate calibration of spectra. Applied Spectroscopy 64: 1007-1016.

P31 © The Author(s), 2020

Machine learning applied to a large library of organic molecules: identifying molecular photocatalysts for hydrogen evolution

Xiaobo Li,a Phillip M. Maffettone,a,b Yu Che,a,c Tao Liu,a Linjiang Chen,a,c and Andrew I. Coopera,c a Department of Chemistry and Materials Innovation Factory, University of Liverpool, Liverpool, UK, b National Synchrotron Light Source II, Brookhaven National Laboratory, USA, c Leverhulme Research Centre for Functional Materials Design, Materials Innovation Factory and Department of Chemistry, University of Liverpool, Liverpool, UK E-mail: [email protected]

Whilst molecular catalysts are a promising avenue for photocatalytic water splitting, reliance on serendipity has restricted most searches to narrow areas of the enormous chemical space of organic molecules. Integration of experiment, computation and machine learning is an emerging approach to addressing this challenge. Here, a library of 572 organic molecules, with diverse compositions and structures, was experimentally established for photocatalytic hydrogen evolution. An interactive web application was developed for exploring our library (https://www.molecular-photocatalysts-library.app). Then, we applied quantum chemical calculations and machine learning to visualize, interpret and predict the photocatalytic activities of the molecules. By applying unsupervised learning to the molecular structures, we identified structural features that are common in high-activity molecules. Further analysis using calculated molecular descriptors in a suite of supervised classification algorithms revealed that exciton electron affinity, exciton binding energy, and singlet–triplet energy gap had major effects on the performance of the molecular photocatalysts. Moreover, the learned predictive models are robust against producing false positives (more than 95% of ‘low’ performers, ≤ 1.0 µmol H2/h, are correctly labelled by all the models) and hence are readily useful for identifying molecular catalysts for photocatalytic water splitting in the future. This work advances the realization of “predictive design” of organic photocatalysts.

P32 © The Author(s), 2020

Benchmarking the performance of Bayesian Optimization across multiple experimental domains

Harry Qiaohao Liang1, Aldair E. Gongora2, Zekun Ren3, Zhe Liu1, Armi Tiihonen1, Shijing Sun1, Flore C.L. Mekki-Berrada4, Saif A. Khan4, Daniil Bash5, Kedar Hippalgaonkar5, Keith A. Brown2, John Fisher1, Tonio Buonassisi1 1Massachusetts Institute of Technology, Cambridge, MA 02139, USA, 2Boston University, Boston, MA 02215, USA, 3Singapore-MIT Alliance for Research and Technology, Singapore 138602, Singapore, 4National University of Singapore, Singapore 119077, Singapore, 5Agency for Science, Technology and Research (A*STAR), Singapore 138632, Singapore

High-throughput autonomous research has recently emerged as the new frontier of accelerated materials discovery. This success has sparked efforts to improve not only lab automation infrastructure but also optimization algorithms. In materials optimization campaigns, sequential learning approaches such as Bayesian Optimization (BO) have gained great popularity due to their ability to improve parameters of merit with fewer experiment cycles than brute-force approaches. Previous attempts to quantitatively evaluate the acceleration of research through sequential learning have been limited to specific materials systems [1]. Therefore, two open questions remain: (a) how does the performance of BO in sequential learning approaches vary across multiple experimental domains and manifold complexities, and (b) how could we improve upon current optimization algorithms and attend to the needs of a broader materials science community?

In this work, we benchmark the performance of sequential learning approaches within the BO framework across five experimental domains. Five experimental datasets, each focusing on a different materials system and varying in size, dimensionality and manifold complexity, are used to build representations of ground truths, from which we draw to simulate optimization campaigns. We compare the performance of BO algorithms employing one of two surrogate models with multiple myopic acquisition functions. The performance of each surrogate model and acquisition function pair is benchmarked against random acquisition, and further quantified via learning rate metrics designed to measure different research objectives.

It is observed that BO consistently reduces the number of experiment cycles necessary to reach superior materials response by factors of 5 to 15 across all experimental systems. A random forest surrogate model paired with a lower confidence bound acquisition function with balanced weighting of mean and standard deviation shows higher-ranking performance for research tasks such as finding any or all of the candidates with top 5% response. Interestingly, in the case of surrogate models, random forests outclass Gaussian processes in optimization tasks when using the same acquisition functions across all datasets, likely because they make fewer structural assumptions about manifold landscapes.
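The simulated-campaign idea can be sketched as a pool-based loop: a fixed dataset stands in for the ground truth, and a policy chooses which pool entry to "measure" next. The sketch below is an illustration under stated assumptions, not the benchmark code. The surrogate is a small bootstrap ensemble of 1-nearest-neighbour predictors standing in for a random forest, the objective is minimized, the pool is a synthetic one-dimensional function, and all names (`run_campaign`, `ensemble_predict`, `kappa`) are hypothetical.

```python
import random
import statistics


def nearest_value(x, xs, ys):
    """Predict with the observed point closest to x (1-nearest-neighbour)."""
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
    return ys[i]


def ensemble_predict(x, xs, ys, rng, n_members=10):
    """Mean and spread over bootstrap-resampled 1-NN predictors.

    This ensemble is a lightweight stand-in for a random forest surrogate.
    """
    preds = []
    for _ in range(n_members):
        idx = [rng.randrange(len(xs)) for _ in xs]
        preds.append(nearest_value(x, [xs[i] for i in idx], [ys[i] for i in idx]))
    return statistics.fmean(preds), statistics.pstdev(preds)


def run_campaign(pool_x, pool_y, policy, seed=0, n_init=3, kappa=1.0):
    """Return how many experiments are needed to find the lowest y in the pool."""
    rng = random.Random(seed)
    target = min(pool_y)
    order = list(range(len(pool_x)))
    rng.shuffle(order)
    seen, unseen = order[:n_init], order[n_init:]
    while min(pool_y[i] for i in seen) > target:
        if policy == "random":
            pick = rng.choice(unseen)
        else:  # "lcb": lower confidence bound, mean - kappa * std, minimized
            xs = [pool_x[i] for i in seen]
            ys = [pool_y[i] for i in seen]

            def lcb(i):
                mean, std = ensemble_predict(pool_x[i], xs, ys, rng)
                return mean - kappa * std

            pick = min(unseen, key=lcb)
        seen.append(pick)
        unseen.remove(pick)
    return len(seen)


# Synthetic one-dimensional "ground truth" pool (an assumption for illustration).
pool_x = [i / 29 for i in range(30)]
pool_y = [(x - 0.7) ** 2 for x in pool_x]

seeds = range(5)
random_cycles = statistics.fmean([run_campaign(pool_x, pool_y, "random", s) for s in seeds])
lcb_cycles = statistics.fmean([run_campaign(pool_x, pool_y, "lcb", s) for s in seeds])
print("mean experiments, random:", random_cycles)
print("mean experiments, LCB:", lcb_cycles)
print("acceleration factor:", random_cycles / lcb_cycles)
```

The ratio printed at the end mirrors the notion of an acceleration factor relative to random acquisition. Its value on this toy pool says nothing about the factors reported for the experimental datasets.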

Complementary to our findings, our benchmarking effort has yielded many useful insights on how experimentalists interested in novel materials systems could select the most suitable BO methods prior to embarking on an optimization campaign. These can be utilized by a broader materials science community in further progressing the field of autonomous research.

References 1. Rohr, B., Stein, H.S., Guevarra, D., Wang, Y., Haber, J.A., Aykol, M., Suram, S.K. and Gregoire, J.M., 2020. Benchmarking the acceleration of materials discovery by sequential learning. Chemical Science, 11(10), pp.2696-2706.

P33 © The Author(s), 2020

Metabolite translator: a transformer-based tool for predicting drug metabolites

Eleni E. Litsa1, Payel Das2,3, Lydia E. Kavraki1 1Department of Computer Science, Rice University, Houston, TX, USA, 2IBM Research AI, IBM Thomas J. Watson Research Center, NY, USA, 3Applied Physics and Applied Mathematics, Columbia University, NY, USA

The structure of a drug may be altered in the human body through metabolic reactions. Potential toxicity of the formed metabolites is investigated during the drug development process, while complications due to active metabolites may be identified even after a drug has been released onto the market. Computational approaches have been developed for the prediction of possible drug metabolites in an effort to assist the resource-demanding experimental route [1]. Current methodologies are based upon metabolic transformation rules which encode the action of certain enzymes. Most approaches cover metabolism through the major cytochrome P450 enzyme family of phase I metabolism, while a few have been extended to cover phase II metabolism. The transformation rules are often manually derived, raising scalability issues. Additionally, they restrict generalisation to a variety of substrates, as a rule is applied only when there is an exact match between the substrate and the reaction rule pattern.

We present a rule-free, end-to-end learning-based approach for predicting human metabolites of small molecules including drugs. We approach the metabolite prediction problem as a sequence translation problem representing molecules as sequences using the SMILES notation. Due to the limited amount of available human metabolism data, we applied transfer learning on a Transformer model pre-trained on general chemical reactions for predicting the reaction outcome. We fine-tuned the Transformer model on a dataset of human metabolic reactions that we gathered from freely accessible metabolic databases including metabolism of xenobiotics and endogenous compounds. We further built an ensemble model to account for multiple and diverse metabolites. We assessed the ability of the model to predict human metabolites of drugs using as reference three existing tools [2,3,4].
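Framing metabolite prediction as sequence translation requires molecules to be split into tokens before they reach the model. The abstract does not describe the tokenization used, so the following is only a minimal sketch of one common approach, a regular-expression SMILES tokenizer. The pattern, the function name `tokenize` and the example pair (aspirin and its hydrolysis product salicylic acid) are illustrative assumptions and are not taken from the authors' dataset or code.

```python
import re

# Simplified SMILES token pattern (an assumption; real tokenizers cover more cases).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"             # bracket atoms such as [NH3+] or [C@@H]
    r"|Br|Cl"                 # two-letter atoms of the organic subset
    r"|%\d{2}"                # two-digit ring-closure labels
    r"|[BCNOSPFIbcnosp]"      # one-letter atoms (lowercase = aromatic)
    r"|[=#\-+\\/:~@?>*$().]"  # bonds, branches and other symbols
    r"|\d"                    # single-digit ring closures
)


def tokenize(smiles):
    """Split a SMILES string into tokens, refusing unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in SMILES: {smiles!r}")
    return tokens


# Illustrative source/target pair for a translation model.
source = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
target = tokenize("O=C(O)c1ccccc1O")        # salicylic acid
print(" ".join(source))
print(" ".join(target))
```

Checking that the joined tokens reproduce the input guards against silently dropping characters the pattern does not cover.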

Our analysis showed that the large diversity of the training dataset allows the model to predict metabolites through any enzyme without compromising performance on the major enzyme families in drug metabolism. When compared to existing methods, Metabolite Translator retrieved a number of metabolites comparable to that of methods that have been specifically developed for phase I and phase II drug metabolism [2,3]. It additionally surpassed a model which, similar to Metabolite Translator, covers the broader human metabolism [4]. The proposed methodology addresses the problems of limited scalability and lack of generalisation of existing rule-based approaches. As more data on drug metabolism become available, the performance of this approach can be further improved, encouraging the adoption of such tools in drug discovery for accelerating and enhancing safety studies.

References 1. Kazmi S.R., Jun R., Yu M.-S., Jung C., Na D., In silico approaches and tools for the prediction of drug metabolism and fate: A review, Comput. Biol. Med., 106: 54-64, 2019. 2. de Bruyn Kops C., Šícho M., Mazzolari A., Kirchmair J., GLORYx: Prediction of the Metabolites Resulting from Phase 1 and Phase 2 Biotransformations of Xenobiotics, Chem. Res. Toxicol., Aug 2020. 3. Ridder, L., Wagener, M., SyGMa: Combining Expert Knowledge and Empirical Scoring in the Prediction of Metabolites. ChemMedChem, 3: 821-832, 2008. 4. Djoumbou-Feunang, Y., Fiamoncini, J., Gil-de-la-Fuente, A. et al. BioTransformer: a comprehensive computational tool for small molecule metabolism prediction and metabolite identification. J. Cheminform. 11, 2019.

P34 © The Author(s), 2020

Accelerating development of natural porous materials assisted by statistical machine learning

Giulia Lo Dico1,2,3, Verónica Carcelén3, Maciej Haranczyk1 1 IMDEA Materials Institute, C/Eric Kandel 2, 28906 Getafe, Madrid, Spain, 2 Department of Materials Science and Engineering and Chemical Engineering, Universidad Carlos III de Madrid, Getafe, Spain, 3 Tolsa Group, Carretera de Madrid a Rivas Jarama, 35, Madrid, Spain

Machine learning provides predictive tools for optimization, accelerating the discovery of novel materials [1]. In parallel, statistical algorithms are suitable for knowledge extraction and representation. Data quality and the identification of an efficient vector space are the keys to successful implementation of machine learning models [2,3]. The power of tree-based ensemble algorithms has been exploited to recognize correlation paths in vector space for the design of hierarchical porous materials [3]. Natural clay-based porous materials are green and low-cost adsorbents and catalysts. The key factors determining their performance in these applications are the pore morphology and surface activity [4]. The latter can be modified and tuned to specific applications through further processing and/or chemical treatment. Typically, the material, raw or processed, is characterized experimentally, which is costly, especially when tuning properties towards specific requirements involves numerous experiments. In this work, we present an application of statistical machine learning algorithms trained on experimental datasets to accelerate characterization and design optimization. The accuracy of the resulting models (R² from 0.78 to 0.99) allows reliable prediction of the outcomes of experimental characterization of processed materials, and their high throughput enables exploration of correlations between processing parameters and properties. The knowledge extracted by feature importance analysis and the proposed design functions helps to pinpoint optimal thresholds of processing conditions leading to the desired properties. Multi-objective optimization has been employed to extract information about the most favorable clay minerals and their processing for the design of promising micro- or nano-adsorbents and acid or basic nano-catalysts.

References 1. Yao, Z.; Sanchez-Lengeling, B.; Bobbitt, N. S.; Bucior, B. J.; Govind, S.; Kumar, H.; Collins, S. P.; Burns, T.; Woo, T. K.; Farha, O.; Snurr, R. Q.; Aspuru-Guzik, A. Inverse Design of Nanoporous Crystalline Reticular Materials with Deep Generative Models. ChemRxiv 2020. https://doi.org/10.26434/chemrxiv.12186681.v1. 2. Raccuglia, P.; Elbert, K. C.; Adler, P. D. F.; Falk, C.; Wenny, M. B.; Mollo, A.; Zeller, M.; Friedler, S. A.; Schrier, J.; Norquist, A. J. Machine-Learning-Assisted Materials Discovery Using Failed Experiments. Nature 2016, 533 (7601), 73–76. https://doi.org/10.1038/nature17439. 3. Jablonka, K. M.; Ongari, D.; Moosavi, S. M.; Smit, B. Big-Data Science in Porous Materials: Materials Genomics and Machine Learning. 2020, 1–173. 4. Ongari, D.; Boyd, P. G.; Barthel, S.; Witman, M.; Haranczyk, M.; Smit, B. Accurate Characterization of the Pore Volume in Microporous Crystalline Materials. Langmuir 2017, 33 (51), 14529–14538. https://doi.org/10.1021/acs.langmuir.7b01682.

P35 © The Author(s), 2020

Bayesian Optimization for structural elucidation of atomic clusters

*Maicon P. Lourenço(1), Cleyton de Souza Oliveira(1), Breno Galvão(2), Lizandra Barrios Herrera(3), Jiri Hostas(3), Alain Tchagang(4), Dennis R. Salahub(3). (1)Departamento de Química e Física, Centro de Ciências Exatas, Naturais e da Saúde (CCENS), Universidade Federal do Espírito Santo, 29500-000, Alegre, Espírito Santo, Brazil. (2) Centro Federal de Educação Tecnológica de Minas Gerais, CEFET-MG, Av. Amazonas 5253, 30421-169, Belo Horizonte, MG, Brazil.(3) Department of Chemistry, CMS Centre for Molecular Simulation, IQST Institute for Quantum Science and Technology, Quantum Alberta, University of Calgary, 2500 University Drive NW, Calgary, AB, T2N 1N4, Canada.(4) Digital Technologies Research Centre, National Research Council of Canada, 1200 Montréal Road, Ottawa, ON, K1A 0R6 Canada. * Address correspondence to: [email protected]

Bayesian Optimization (BO) has been successfully applied in chemistry [1, 2] and is known as an effective method when the search space is discrete [3]. However, many important chemistry problems are represented in continuous space, as in global optimization (GO) of atomic clusters. In this case, the challenge lies in properly sampling the virtual (not yet computed) space [3] in order to apply the exploitation-exploration tradeoff [2].

As the search space to apply the Bayesian inference is continuous, it is expected to have a huge dimension. Therefore, it is necessary to use an efficient sampling strategy to create the virtual space in order to maximize the chance of finding the Global Minimum with few computations. This can be accomplished from a sample of previously computed clusters by applying the Plane-Cut-Splice (PCS) operator [4] to create the structures of the virtual clusters. A similar strategy is the Markov Chain Monte Carlo method.

BO for clusters with PCS sampling was implemented in the QMLMaterial software [2]. It provides an automatic framework to employ Gaussian Process regression (the uncertainties are directly estimated) or a Neural Network (the uncertainty is obtained from K-Fold cross-validation). The utility function that guides the choice of new clusters in the virtual space to be computed is the Expected Improvement [2]. The clusters are represented by the MBTR descriptor [5]. The progress of the BO in QMLMaterial will be presented and applied to the GO of Al4Si7 and Na20 using DFTB and, depending on the progress, of Mo4S8 using DFT.
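For reference, the Expected Improvement utility for a minimization problem such as a global-minimum search can be written in closed form from a surrogate's predicted mean and standard deviation. The sketch below shows that standard formula only. It is not taken from QMLMaterial, and the function name, the candidate labels and the energy values are hypothetical.

```python
import math


def expected_improvement(mean, std, best_so_far):
    """Expected Improvement for minimization.

    mean, std: surrogate prediction and its uncertainty for one candidate.
    best_so_far: lowest value (e.g. energy) computed so far.
    """
    if std <= 0.0:
        # No uncertainty: the improvement is deterministic.
        return max(best_so_far - mean, 0.0)
    z = (best_so_far - mean) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best_so_far - mean) * cdf + std * pdf


# Hypothetical virtual clusters: (label, predicted energy, uncertainty).
candidates = [("A", -10.2, 0.1), ("B", -10.0, 0.8), ("C", -9.5, 0.05)]
best_energy = -10.1
ranked = sorted(
    candidates,
    key=lambda c: expected_improvement(c[1], c[2], best_energy),
    reverse=True,
)
print([label for label, _, _ in ranked])
```

The first term rewards candidates predicted to lie below the current best (exploitation) and the second rewards uncertain candidates (exploration), which is how the utility encodes the exploitation-exploration tradeoff.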

References 1. Lookman, T.; Balachandran, P. V.; Xue, X.; Yuan, R.; Active Learning in Materials Science with Emphasis on Adaptive Sampling Using Uncertainties for Targeted Design, npj Com. Mat., 5, 21 (2019). 2. Lourenço, M. P.; Anastácio, A. S.; Rosa. L. Andreia; Frauenheim, T.; da Silva, M. C; An Adaptive Design approach for defects distribution modeling in materials from first-principle calculations, J. Mol Mod., 26, 187 (2020). 3. Nomura, M.; Abe K.; A Simple Heuristic for Bayesian Optimization with a Low Budget. arXiv: 1911.07790. 4. Deaven, D. M.; Ho, K. M. Molecular Geometry Optimization with a Genetic Algorithm, Phys. Rev. Lett., 75, 288 (1995). 5. Himanen, L.; et al.; DScribe: Library of Descriptors for Machine Learning in Materials Science, Comp. Phys. Com., 247, 106949 (2020).

†Work supported by: the (1) National Research Council of Canada, Artificial Intelligence for Design program and by the Natural Sciences and Engineering Research Council of Canada, Discovery Grant ( RGPIN-2019-03976). (2) The support of the Brazilian agencies: Fundação de Amparo à Pesquisa do Espírito Santo (FAPES)—project CNPq/FAPES PPP 22/2018.

P36 © The Author(s), 2020

The effect of descriptor choice in machine learning models for ionic liquid melting point prediction

Kaycee Low1, Rika Kobayashi2, and Ekaterina I. Izgorodina1 1Monash University, Australia

The characterisation of an ionic liquid's properties based on structural information is a longstanding goal of computational chemistry which has received much focus from ab initio and molecular dynamics simulations.1,2 One of the most promising uses of ionic liquids is as high-conductivity, low-volatility electrolytes, which requires properties such as low viscosity and low melting temperature for practical application.3 Machine learning models hold great promise in their ability to predict the melting point of an ionic liquid without the need for synthesis, allowing for rapid screening of combinatorial databases. Thus far, a multitude of descriptor types have been used in the literature, mostly based on semi-empirical or group-contribution type methods.4,5

In this work, we build kernel ridge regression models based on an experimental dataset of 2,212 ionic liquid melting point values with various ion types. Structural descriptors, which have been shown to predict quantum mechanical properties of small neutral molecules within chemical accuracy,6 are here shown to benefit from the addition of high-level first-principles data related to the target property (molecular orbital energy, charge density profile, and interaction energy based on the geometry of a single ion pair) when predicting the melting point of ionic liquids. For the two chosen structural descriptors, ECFP4 circular fingerprints and the Coulomb matrix, the addition of molecular orbital energies and charge profiles, respectively, increases the accuracy of surrogate models for melting point prediction compared to the use of structural descriptors alone. The best model, based on ECFP4 and molecular orbital energies, predicts ionic liquid melting points with an MAE of 29 K and, unlike group contribution methods which have achieved similar results, is applicable to any type of ionic liquid.
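As a concrete picture of one of the structural descriptors named above, the sketch below builds a Coulomb matrix following the definition in reference 6 (diagonal 0.5·Z^2.4, off-diagonal Z_i·Z_j divided by the interatomic distance, in atomic units) and computes a mean absolute error. It is an illustration only: the function names are hypothetical, the H2 geometry is approximate, the melting point numbers are made up, and the sorting or padding needed to compare molecules of different size is not shown.

```python
import math


def coulomb_matrix(charges, coords):
    """Coulomb matrix for nuclear charges Z and Cartesian coordinates (bohr)."""
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal: fit to the energy of the free atom.
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal: Coulomb repulsion between nuclei i and j.
                matrix[i][j] = charges[i] * charges[j] / math.dist(coords[i], coords[j])
    return matrix


def mean_absolute_error(y_true, y_pred):
    """MAE between reference and predicted values (e.g. melting points in K)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# H2 with an approximate bond length of 1.4 bohr.
h2 = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)])
print(h2)

# Made-up melting points, only to show the error metric.
print(mean_absolute_error([300.0, 350.0], [310.0, 330.0]))
```

In a kernel ridge regression setting such a matrix would typically be flattened into a vector, optionally concatenated with additional features, before a kernel is evaluated between pairs of ionic liquids.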

References 1. E. I. Izgorodina, Z. L. Seeger, D. L. Scarborough, and S. Y. Tan, “Quantum chemical methods for the prediction of energetic, physical, and spectroscopic properties of ionic liquids,” Chemical reviews 117, 6696–6754 (2017). 2. B. Kirchner, O. Hollóczki, J. N. Canongia Lopes, and A. A. Pádua, “Multiresolution calculation of ionic liquids,” Wiley Interdisciplinary Reviews: Computational Molecular Science 5, 202–214 (2015). 3. M. Watanabe, M. L. Thomas, S. Zhang, K. Ueno, T. Yasuda, and K. Dokko, “Application of ionic liquids to energy storage and conversion materials and devices,” Chemical reviews 117, 7190–7239 (2017). 4. F. Gharagheizi, P. Ilani-Kashkouli, and A. H. Mohammadi, “Computation of normal melting temperature of ionic liquids using a group contribution method,” Fluid phase equilibria 329, 1–7 (2012). 5. V. Venkatraman, S. Evjen, H. K. Knuutila, A. Fiksdahl, and B. K. Alsberg, “Predicting ionic liquid melting points using machine learning,” Journal of Molecular Liquids 264, 318–326 (2018). 6. M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. Von Lilienfeld, “Fast and accurate modeling of molecular atomization energies with machine learning,” Physical review letters 108, 058301 (2012).

P37 © The Author(s), 2020

The automation of powder diffraction

A. Lunta, b, S. Y. Chongb, A. I. Coopera,b* a Leverhulme Research Centre for Functional Materials Design, UK, b Department of Chemistry and Materials Innovation Factory, University of Liverpool, UK *[email protected], s.chong@liverpool.ac.uk, [email protected]

Accelerating materials research in academia and industry is of huge importance for our ever-developing society to meet needs of emerging and future technologies.[1] Fundamental to this is understanding structure-function relationships, which can be elucidated from the crystal structure of a material. One method to determine such information about crystal structure is powder X-ray diffraction (PXRD).[2]

Of the thousands of crystal structures discovered, fewer than 1% have been determined using PXRD[3], largely due to the complex nature of the data produced. The group is currently working to develop methods to perform rapid analysis of PXRD patterns computationally, for example using Genetic Algorithms and techniques such as similarity analysis. With the analysis of PXRD data being difficult (usually requiring computational assistance) and time consuming,[4] there is a clear case for integrating automation to carry out the more routine aspects of crystallisation and data acquisition. The motivation behind the automation of materials experiments includes lowering per-experiment costs, eliminating human error and enabling learning-driven experiments that not only explore the materials parameter space but also identify promising regions within it to search for new materials.

Here, we report our progress in creating a robotic workflow that carries out the automated preparation of candidate functional materials and their characterisation using PXRD analysis. We have previously demonstrated an integrated workflow for the synthesis and characterisation of new photocatalysts using a KUKA mobile robot[5]; this work will develop a new workflow which carries out crystallisation experiments under variable conditions, such as solvent screening and cocrystallisation screening. The samples will then be transferred to an X-ray diffractometer using the KUKA robot and the hardware controlled by an interface.

Herein, we report our current progress in developing a crystallisation workflow that uses generic laboratory glassware, which can be used in conjunction with a variety of automated liquid handlers. The workflow incorporates a grinding protocol for PXRD sample preparation and solid transfer step to a sample holder compatible with PXRD data collection in transmission geometry.

This progress is significant in developing a unique, autonomous materials workflow that also integrates solid state structural characterisation. The intention is not only to automate the analysis of crystalline powders with PXRD, but also to be able to screen and highlight suitable candidates in new materials discovery.

References 1. E. A. Kabova, C. D. Blundell and K. Shankland, J. Pharm. Sci., 2018, 107, 2042-2047. 2. W. I. F. David and K. Shankland, Acta Crystallogr. Sect. A Found. Crystallogr, 2008, 64, 52-64. 3. K. Shankland, M. J. Spillman, E. A. Kabova, D. S. Edgeley and N. Shankland, Acta Crystallogr. Sect. C Cryst. Struct. Commun., 2013, 69, 1251-1259. 4. A. A. Coelho, J. Appl. Crystallogr, 2018, 51, 210-218. 5. Automation: Chemistry shoots for the Moon, Nature, April 2019

P38 © The Author(s), 2020

Using quantum atomics and machine learning to advance picotechnology

Preston MacDougall and Kiran Donthula MTSU, USA

With an eye toward the future, when databases of transferable atomic fragments1 are routinely available from quantum crystallography,2 the ability of machine learning to predict the spectroscopic and chemical properties of a carbonyl group based on a few key topological properties of a single atom is explored for 225 small molecules, and a cluster model of an enzyme’s active site.

References 1. Bader, R.F.W. "Atoms in Molecules: A Quantum Theory", Clarendon Press, Oxford, 1990. 2. "Quantum Crystallography: Current Developments and Future Perspectives", Alessandro Genoni, Lukas Bučinský, Nicolas Claiser, Julia Contreras-García, Birger Dittrich, Paulina M. Dominiak, Enrique Espinosa, Carlo Gatti, Paolo Giannozzi, Jean-Michel Gillet, Dylan Jayatilaka, Piero Macchi, Anders Ø. Madsen, Lou Massa, Chérif F. Matta, Kenneth M. Merz Jr., Philip N. H. Nakashima, Holger Ott, Ulf Ryde, Karlheinz Schwarz, Marek Sierka, Simon Grabowsky, Angew. Chem. Int. Ed. 2018.

P39 © The Author(s), 2020

Crystallography companion agent for high-throughput materials discovery

Phillip M. Maffettone1,2, Lars Banko3, Peng Cui2, Yury Lysogorskiy4, Marc A. Little2, Daniel Olds1, Alfred Ludwig3 & Andrew I. Cooper2 1National Synchrotron Light Source II, Brookhaven National Laboratory, Upton, New York 11973, USA 2Department of Chemistry and Materials Innovation Factory, University of Liverpool, Crown Street, Liverpool L69 7ZD, U.K.3Institute for Materials, Faculty of Mechanical Engineering, Ruhr University Bochum, 44801 Bochum, Germany 4Interdisciplinary Centre for Advanced Materials Simulation (ICAMS), Ruhr University, 44801 Bochum, Germany

The discovery of new structural and functional materials is driven by phase identification, often using X-ray diffraction (XRD). Automation has accelerated the rate of XRD measurements, greatly outpacing XRD analysis techniques that remain manual, time consuming, error prone, and impossible to scale. With the advent of autonomous robotic scientists or self-driving labs, contemporary techniques prohibit the integration of XRD. Here, we describe a computer program for the autonomous characterization of XRD data, driven by artificial intelligence (AI), for the discovery of new materials [1]. Starting from structural databases, we train an ensemble model using a physically accurate synthetic dataset. The ensemble outputs probabilistic classifications, rather than absolutes, to overcome the overconfidence of traditional neural networks. This AI agent behaves as a companion to the researcher, improving accuracy and offering unprecedented time savings, and is demonstrated on a diverse set of organic and inorganic materials challenges. This innovation is directly applicable to inverse design approaches and robotic discovery systems, and can be immediately considered for other forms of characterization such as spectroscopy and the pair distribution function.
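One simple way an ensemble can yield probabilistic rather than absolute classifications is to average the class probabilities of its members, so that disagreement between members lowers the reported confidence. The sketch below illustrates that general idea only. How the described agent combines its members is not stated in the abstract, and the function names, logits and two-phase setting here are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_probabilities(member_logits):
    """Average the per-class probabilities over all ensemble members."""
    probs = [softmax(logits) for logits in member_logits]
    n_members = len(probs)
    n_classes = len(probs[0])
    return [sum(p[k] for p in probs) / n_members for k in range(n_classes)]


# Three hypothetical members scoring one pattern against two candidate phases.
agreeing = ensemble_probabilities([[4.0, 0.0], [3.5, 0.2], [4.2, -0.1]])
disagreeing = ensemble_probabilities([[4.0, 0.0], [0.0, 4.0], [2.0, 2.0]])
print(agreeing)
print(disagreeing)
```

When members agree, the averaged probability stays close to each member's own confident output. When they disagree, the average moves towards an even split, which flags the pattern for closer inspection by the researcher.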

References 1. Maffettone, P. M. XCA. https://github.com/maffettone/xca

P40 © The Author(s), 2020

Accurate many-body repulsive potentials for density-functional tight binding from deep tensor neural networks

Leonardo Medrano Sandonas, Martin Stöhr, and Alexandre Tkatchenko Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg, Luxembourg.

Machine learning (ML) has proven to be an extremely valuable tool for simulations with ab initio accuracy at a computational cost between that of classical interatomic potentials and density-functional approximations. Similar efficiency can only be achieved by semi-empirical methods, such as density-functional tight-binding (DFTB). One of the limiting factors in terms of the accuracy and transferability of DFTB parametrizations is the so-called repulsive potential, which plays a considerable role in the prediction of energetic, structural, and dynamical properties. A few attempts to use ML techniques to address this issue have been proposed recently [1] but, up to now, evidence of transferability and scalability is still scarce.

Hence, we combine density-functional tight-binding (DFTB) with deep tensor neural networks (DTNN) to maximize the strengths of both approaches in predicting structural, energetic, and vibrational molecular properties. The DTNN is used to construct a non-linear model for the localized many-body interatomic repulsive energy, which so far has been treated in an atom-pairwise manner in DFTB. Substantially improving upon standard DFTB and DTNN, the resulting DFTB-NNrep model yields accurate predictions of atomization and isomerization energies, equilibrium geometries, vibrational frequencies and dihedral rotation profiles for a large variety of organic molecules compared to the hybrid DFT-PBE0 functional. Our results highlight the high potential of combining semi-empirical electronic-structure methods with physically-motivated machine learning approaches for predicting localized many-body interactions. We conclude by discussing future advancements of the DFTB-NNrep approach that could enable chemically accurate electronic-structure calculations for systems with tens of thousands of atoms.

References 1. J. J. Kranz, M. Kubillus, R. Ramakrishnan, O. A. von Lilienfeld, and M. Elstner, J. Chem. Theory Comput., 14, 2341-2352, (2018). 2. M. Stöhr, L. Medrano Sandonas, A. Tkatchenko, J. Phys. Chem. Lett., 11, 6835-6843, (2020).

P41 © The Author(s), 2020

Molecular design using GraphINVENT

Rocío Mercado,1 Tobias Rastemo,1,2 Edvard Lindelöf,1,2 Günter Klambauer,3 Ola Engkvist,1 Hongming Chen,4 Esben J. Bjerrum.1 1Hit Discovery, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Gothenburg, Sweden, 2Chalmers University of Technology, Gothenburg, Sweden, 3Institute of Bioinformatics, Johannes Kepler University, Linz, Austria, 4Centre of Chemistry and Chemical Biology, Guangzhou Regenerative Medicine and Health – Guangdong Laboratory, Guangzhou, China

Graphs are widespread mathematical structures that can be used to describe an assortment of relational information, and are natural choices for describing molecular structures. Recently, there has been an increase in the use of graph neural networks (GNNs) for modeling patterns in graph-structured data, including graph-based molecular generation for pharmaceutical drug discovery.[1-3] The guiding principle behind graph-based molecular design can be boiled down to generating graphs which meet all the criteria of desirable drug-like molecules.

Here, we apply GNNs to the task of molecular generation and introduce GraphINVENT, a Python program written for graph-based molecular design using message passing neural networks (MPNNs) and a tiered feed-forward network structure to probabilistically generate new molecules one atom/bond at a time. The models can quickly learn the underlying distribution of properties in training set molecules without any explicit writing of chemical rules. The proposed models perform well for molecular generative tasks when benchmarked using MOSES. Our work illustrates how deep learning methods can enhance drug design and shows that graph-based generative models merit further exploration for molecular graph generation.

References 1. Y. Li, L. Zhang, and Z. Liu, “Multi-objective de novo drug design with conditional graph generative model,” J. Cheminform., vol. 10, no. 1, pp. 1–24, 2018. 2. Y. Li, O. Vinyals, C. Dyer, R. Pascanu, and P. Battaglia, “Learning Deep Generative Models of Graphs,” ICLR, pp. 1–16, 2018. 3. W. Jin, R. Barzilay, and T. Jaakkola, “Hierarchical Generation of Molecular Graphs using Structural Motifs,” arXiv:2002.03230v2, 2020.

P42 © The Author(s), 2020

On the importance of structural diversity in metal-organic framework databases

Seyed Mohamad Moosavi,1,2 Aditya Nandy,2 Kevin M. Jablonka,1 Daniele Ongari, Jon Paul Janet,2 Peter G.Boyd,1 Yongjin Lee,3 Heather J Kulik,2 and Berend Smit1 1École Polytechnique Fédérale de Lausanne (EPFL), Sion, Switzerland, 2Massachusetts Institute of Technology (MIT), Cambridge, United States, 3ShanghaiTech University, Shanghai, China

By combining metal nodes and organic linkers, one can make millions of distinct metal-organic frameworks (MOFs). At present over 90'000 MOFs have been synthesized and there are databases with over 500'000 predicted structures. This raises the question of whether a new experimental or predicted structure adds new information or is only a small variation of the existing structures. We all have a notion that chemical diversity in the material databases is important; for example, one would like to avoid experimentally or computationally screening a large number of chemically similar structures. For MOF chemists, the chemical design space is a combination of pore geometry, metal nodes, organic linkers, and functional groups, but we did not have a formalism to quantify optimal coverage of this chemical design space.

In this presentation, we introduce a machine learning-based method to quantify this diversity. We will show that by projecting a MOF structure on a set of relevant descriptors characterizing these four domains of MOF chemistry, we can quantify different aspects that are related to the chemical diversity of the different databases. We quantify the diversity of a database in terms of known diversity metrics, which provides us with a simple, powerful, and practical guideline to evaluate whether a set of new structures will have the potential for new insights or constitute relatively small variation of existing structures. We conclude our presentation with showing that our diversity analysis identifies outstanding biases in the current MOF databases that have led to incorrect conclusions in the past as well as not transferable machine learning models.
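As an illustration of how a database's diversity might be scored, the sketch below computes two common diversity measures from cluster assignments in descriptor space. It is a minimal sketch under stated assumptions, not the authors' method: the abstract does not name its metrics, so `variety`, `balance` (Pielou evenness), the cluster labels, and the two toy databases are all hypothetical choices.

```python
import math
from collections import Counter


def variety(cluster_labels, n_clusters):
    """Fraction of descriptor-space clusters that hold at least one structure."""
    return len(set(cluster_labels)) / n_clusters


def balance(cluster_labels, n_clusters):
    """Pielou evenness: Shannon entropy of cluster occupancies over log(n_clusters)."""
    counts = Counter(cluster_labels)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(n_clusters)


# Hypothetical toy databases: each entry is the cluster a structure falls into.
even_db = [0, 1, 2, 3] * 25          # structures spread evenly over 4 clusters
biased_db = [0] * 97 + [1, 2, 3]     # nearly all structures in one cluster

print(variety(even_db, 4), balance(even_db, 4))
print(variety(biased_db, 4), balance(biased_db, 4))
```

Both toy databases occupy every cluster, so variety alone cannot separate them; the balance score is what exposes the bias of the second one.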

References 1. Moosavi, S. M.; Nandy, A.; Jablonka, K. M.; Ongari, D.; Janet, J. P.; Boyd, P. G.; Lee, Y.; Smit, B.; Kulik, H. Understanding the Diversity of the Metal-Organic. Nature Communications, in press. Available on ChemRxiv preprint 2020, DOI: 10.26434/chemrxiv.12251186.v1

P43 © The Author(s), 2020

3D printed fluidics for reaction monitoring

Adam J.N Price, R. Lee, Steven D.R Christie Department of Chemistry, Loughborough University, LE11 3TU, United Kingdom

Chemical synthesis and material discovery consist of many time-consuming, labor-intensive procedures such as lengthy additions, work-ups, and purification. In industry, these processes are often performed by automated machines on a much larger scale. Continuous flow processing in the laboratory offers a small-scale method of automating chemical synthesis and therefore frees up much of the researchers’ time.

Reaction optimization is an area of chemistry that can particularly benefit from process automation. It is achieved by repeating multitudes of reaction steps whilst varying parameters and measuring the response output. Reaction ‘self-optimization’ relies heavily on the real time feedback of reaction data. Numerous algorithms have been written that are able to process this data to determine how progression can be made towards achieving a set of reaction conditions that produce the optimum target response (e.g. highest yield)1.

One of the main drawbacks of self-optimization in flow is the cost of the equipment used to monitor reaction progression. 3D printing offers a method of designing and producing much cheaper alternatives to commercial reaction monitoring equipment to make automated synthesis much more accessible to many research groups2.

Here we present the design for a 3D printable flow cell that allows for easy integration of a standard ATR-IR spectrometer with continuous flow process configurations. The flow cell, and accompanying MATLAB data collection script, allow for near real-time monitoring (< 3 s) of continuous flow processes. This feedback of data may be used to quantify reaction products at steady state to autonomously drive reaction optimization algorithms towards an optimum set of reaction parameters.

References 1. A. D. Clayton, J. Manson, C. J. Taylor, T. W. Chamberlain, B. Taylor, G. Clemens and R. Bourne, React. Chem. Eng., DOI: 10.1039/c9re00209j. 2. A. J. Capel, A. Wright, M. J. Harding, G. W. Weaver, Y. Li, R. A. Harris, S. Edmondson, R. D. Goodridge and S. D. R. Christie, Beilstein J. Org. Chem., 2017, 13, 111–119.

P44 © The Author(s), 2020

When does remeasuring accelerate catalyst discovery?

Fuzhan Rahmanian1,2, Helge Sören Stein1,2,* 1: Applied Electrochemistry, Helmholtz Institute Ulm, Helmholtz 11, 89081 Ulm, Germany 2: Applied Electrochemistry, Institute for Physical Chemistry, Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany *: Corresponding author

Heteroscedastic error is ubiquitous in experimental research but often neglected, resulting in unaccounted error propagation and amplification in sequential learning. Herein we study the effect of actively considering measurement and model error (i.e. aleatoric and epistemic uncertainty) through incorporating measurement error in a dynamic remeasuring decision process for sequential learning. Benchmarking in uncertain measurement conditions was performed by artificially noising an OER catalyst dataset. This study discusses acceleration and deceleration when all uncertainty sources are considered and motivates dynamic allocation of robotic resources to sequential learning campaigns.

P45 © The Author(s), 2020

Prediction of chemical reaction yields using reaction transformers

Philippe Schwaller1,2, Alain C. Vaucher1, Teodoro Laino1, Jean-Louis Reymond2 1IBM Research -- Europe, 2University of Bern, Switzerland

Artificial intelligence is driving one of the most important revolutions in organic chemistry.

Multiple platforms, including tools for reaction prediction and synthesis planning based on machine learning, have successfully become part of organic chemists' daily laboratory work, assisting in domain-specific synthetic problems [1]. Unlike reaction prediction and retrosynthetic models, reaction yield models have been less investigated, despite the enormous potential of accurate yield prediction. Reaction yield models, which predict the percentage of the reactants that is converted to the desired products, could guide chemists and help them select high-yielding reactions and score synthesis routes, reducing the number of attempts [2]. So far, yield predictions have been predominantly performed for high-throughput experiments using a categorical (one-hot) encoding of reactants, concatenated molecular fingerprints [3], or computed chemical descriptors [4]. Here, we extend the application of natural language processing architectures to predict reaction properties given a text-based representation of the reaction, using an encoder transformer model [5,6] combined with a regression layer. We demonstrate outstanding prediction performance on two high-throughput experiment reaction sets [4,7]. An analysis of the yields reported in the open-source USPTO data set [8] shows that their distribution differs depending on the mass scale, limiting the applicability of the data set to reaction yield prediction.

References 1. IBM RXN for Chemistry, https://rxn.res.ibm.com 2. Schwaller, P. et al. Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy. Chem. Sci. 11, 3316–3325 (2020). 3. Sandfort, F., Strieth-Kalthoff, F., Kühnemund, M., Beecks, C. & Glorius, F. A structure-based platform for predicting chemical reactivity. Chem (2020). 4. Ahneman, D. T., Estrada, J. G., Lin, S., Dreher, S. D. & Doyle, A. G. Predicting reaction performance in C–N cross-coupling using machine learning. Science 360, 186–190 (2018). 5. Vaswani, A. et al. Attention is all you need. In Advances in neural information processing systems, 5998–6008 (2017). 6. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1). 2019. 7. Perera, D. et al. A platform for automated nanomole-scale reaction screening and micromole-scale synthesis in flow. Science 359, 429–434 (2018). 8. Lowe, D. M. Extraction of chemical structures and reactions from the literature. Ph.D. thesis, University of Cambridge (2012). Preprint: 10.26434/chemrxiv.12758474

P46 © The Author(s), 2020

A new approach to expanding the chemical space accessible by self-driving labs

Martin Seifrid1, Théophile Gaudin2, 3, Alán Aspuru-Guzik1, 2, 4, 5 1Department of Chemistry, University of Toronto, Toronto, ON M5S 3H6, Canada, 2Department of Computer Science, University of Toronto, Toronto, ON M5S 3H6, Canada, 3IBM Research Zürich, 8803 Rüschlikon, Zürich, Switzerland, 4Vector Institute for Artificial Intelligence, Toronto, ON M5S 1M1, Canada, 5Canadian Institute for Advanced Research (CIFAR) Senior Fellow, Toronto, ON M5S 1M1, Canada

Discovering and designing new molecules is a significant challenge in many areas of research because it requires exploring vast chemical spaces.1 Self-driving labs have the potential to make faster, more efficient progress by “closing” the chemical discovery loop: integrating property prediction, synthesis, analysis, characterization and experiment planning.2 Automated synthesis platforms (ASPs) are a crucial part of self-driving labs.3-5 However, the chemical reactions that are executable by ASPs are very limited compared to the scope of reactions available to human chemists. At this time, the capabilities of different ASPs are also very diverse, with limited overlap. Understanding the areas of chemical space accessible to ASPs is important for planning molecular discovery campaigns, and for understanding the limitations of self-driving labs.

Here, we present a method to evaluate the merits of combining manual and automated synthesis, an approach that can be likened to a public transit system. Manual reactions are used to effect smaller, but more challenging changes in molecular structure, similar to walking to or from a subway station. ASPs are used to rapidly traverse larger areas of chemical space, which can be compared to riding the subway. In this way, self-driving labs can reach previously inaccessible complex molecules from simple, cheap building blocks while requiring minimal human intervention.

Quantifying synthetic accessibility is an ongoing challenge in many areas of research including synthetic chemistry and cheminformatics. Currently, these metrics do not address how to combine manual and automated synthesis. Our approach evaluates a synthetic route in terms of “distance traveled through chemical space” as a function of “cost in time, money and materials,” with different acceleration factors to account for differences in the relative “speed” of manual and automated chemistry.
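To make the “distance versus cost” idea concrete, the sketch below accumulates distance and effective cost along a mixed manual/automated route. It is only an illustrative sketch: the abstract gives no formula, so the `Step` fields, the `ACCELERATION` factors, and the rule of dividing cost by the acceleration factor are all hypothetical assumptions rather than the authors' metric.

```python
from dataclasses import dataclass


@dataclass
class Step:
    distance: float   # structural change achieved, arbitrary units (hypothetical)
    cost: float       # time, money and materials folded into one number (hypothetical)
    automated: bool   # True if the step runs on an automated synthesis platform


# Hypothetical acceleration factors: automated steps are assumed 5x "faster".
ACCELERATION = {True: 5.0, False: 1.0}


def route_profile(steps):
    """Return cumulative (effective cost, distance) points along a route."""
    total_cost = 0.0
    total_distance = 0.0
    profile = [(0.0, 0.0)]
    for step in steps:
        total_cost += step.cost / ACCELERATION[step.automated]
        total_distance += step.distance
        profile.append((total_cost, total_distance))
    return profile


# "Walk" to the station, "ride the subway", then "walk" to the target.
route = [
    Step(distance=1.0, cost=4.0, automated=False),
    Step(distance=6.0, cost=10.0, automated=True),
    Step(distance=1.0, cost=4.0, automated=False),
]
profile = route_profile(route)
print(profile)
```

Plotting distance against effective cost for competing routes would then show where the automated leg pays off, under whatever acceleration factors one assumes.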

Such a combined approach may allow self-driving labs to access a larger portion of chemical space. In the future, the “subway” synthesis evaluation could be integrated into an ML-based experiment orchestration platform, and used alongside the predicted properties of a molecule to determine if the additional complexity of a target molecule requiring manual synthetic steps is worth the anticipated benefits in performance or properties.

References 1. Reymond, J.-L. The Chemical Space Project. Acc. Chem. Res. 48, 722–730 (2015). 2. Tabor, D. P. et al. Accelerating the discovery of materials for clean energy in the era of smart automation. Nat. Rev. Mater. 3, 5–20 (2018). 3. Li, J. et al. Synthesis of many different types of organic small molecules using one automated process. Science 347, 1221–1226 (2015). 4. W. Coley, et al. A robotic platform for flow synthesis of organic compounds informed by AI planning. Science. 365 (2019). 5. Steiner, et al. Organic synthesis in a modular robotic system driven by a chemical programming language. Science. 363 (2019).

P47 © The Author(s), 2020

Navigating the design space of perovskite alloys using physics-informed sequential learning

Shijing Sun1, Armi Tiihonen1, Felipe Oviedo1, Zhe Liu1, Janak Thapa1, Yicheng Zhao2, 2a, Noor Titan P. Hartono1, Anuj Goyal3, Thomas Heumueller2, 2a, Clio Batali1, Alex Encinas1, Jason J. Yoo1, Ruipeng Li4, Zekun Ren5, I. Marius Peters2, Christoph J. Brabec2, 2a, Moungi G. Bawendi1, Vladan Stevanovic3, John Fisher III1, Tonio Buonassisi1,5 1Massachusetts Institute of Technology, Cambridge, MA 02139, USA, 2Helmholtz-Institute Erlangen-Nürnberg (HI-ERN), Erlangen, 91058, Germany, 2aInstitute of Materials for Electronics and Energy Technology (i-MEET), FAU, Erlangen, 91058, Germany, 3Colorado School of Mines, Golden, CO 80401, USA, 4Brookhaven National Laboratory, Upton, NY 11970, USA, 5Singapore-MIT Alliance for Research and Technology, 138602, Singapore

An outstanding challenge in creating environmentally stable organic-inorganic hybrid perovskite solar cells is resource-efficient optimisation in vast compositional spaces.[1] Here, we fuse data from high-throughput degradation tests and density functional theory modelling of phase thermodynamics into an end-to-end Bayesian optimisation framework using probabilistic constraints. By sampling just 1.8% of the discretized CsxMAyFA1-x-yPbI3 (MA = methylammonium, FA = formamidinium) compositional space, we identify compositions centred at Cs0.17MA0.03FA0.80PbI3 with minimal optical change under elevated temperature, moisture, and illumination.[2] This results in thin films with 17-fold better stability than the MAPbI3 end-point, translating into enhanced solar cell stability compared to the state-of-the-art, more complex Cs0.05(MA0.17FA0.83)0.95Pb(I0.83Br0.17)3 without compromising conversion efficiency. The improved environmental stability is achieved using fewer elements and 8% or less MA, realising co-modulation of chemical decomposition and minority phase formation. Our findings demonstrate the benefits of a data-fusion approach to efficiently navigate multinary systems.

References 1. J-P. Correa-Baena et al., Science, 2019, 363, 627-631 2. S. Sun et al., ChemRxiv preprints, https://doi.org/10.26434/chemrxiv.12601997.v1

P48 © The Author(s), 2020

Evolutionary discovery of large porous organic icosahedra

Filip Szczypiński and Kim Jelfs Imperial College London, United Kingdom

Porous materials play an important role in catalysis, sensing, and encapsulation of dangerous or valuable compounds which are otherwise unstable or difficult to separate.1,2 Chemists strive to design more complex and beautiful architectures, which encode specific functions and give rise to applications. Porous organic cages (POCs) are organic molecules with cavities that can be manipulated in solution but also pack together via weak non-bonding molecular interactions to form permanently porous solids. Large and shape-persistent structures are of particular interest due to their ability to encapsulate multiple guests, thus acting as molecular flasks and leading to the discovery of new properties and reactions, in a manner akin to cellular compartmentalisation and preorganisation of enzymatic pockets common in natural systems.3,4

Synthetic discovery of POCs is extremely time- and resource-consuming, given the vast library of available building blocks, all of which can form various topologies depending on their structure, stoichiometry, and experimental conditions. To narrow the enormous number of possible targets and prevent wasted synthetic efforts, our group has developed computational methods that aid discovery of new porous materials.5,6 We have compiled a library of building blocks comprising over 10,000 multifunctional aldehydes and amines, which can be used for the synthesis of millions of imine-based icosahedral POCs comprising many hundreds of atoms each. Brute-force computational screening of such a vast library of large molecules using existing techniques would take years, and hence efficient algorithms are needed to explore the resulting chemical space.

Herein, we present our efforts towards the use of evolutionary algorithms to discover new molecular flasks that can be targeted experimentally. We believe that the tools and newly identified synthetic targets can guide researchers in the wider field of supramolecular chemistry and will accelerate the discovery of new functional molecules.

References 1. T. Hasell and A. I. Cooper, Nat. Rev. Mater., 2016, 1, 16053. 2. A. G. Slater and A. I. Cooper, Science, 2015, 348, aaa8075–aaa8075. 3. M. Yoshizawa, J. K. Klosterman and M. Fujita, Angew. Chemie Int. Ed., 2009, 48, 3418–3438. 4. M. Otte, ACS Catal., 2016, 6, 6491–6510. 5. E. Berardo, L. Turcani, M. Miklitz and K. E. Jelfs, Chem. Sci., 2018, 1–39. 6. M. Miklitz and K. E. Jelfs, J. Chem. Inf. Model., 2018, 58, 2387–2391.

P49 © The Author(s), 2020

Prediction of the properties of organic molecules using Wigner-Ville distribution and deep convolutional neural networks

Alain Tchagang and Julio Valdés National Research Council, Canada

Accumulation of molecular data obtained from quantum mechanics (QM) theories such as density functional theory (DFT) makes it possible for machine learning (ML) to accelerate the discovery of new molecules, drugs, and materials. Models that combine QM with ML (QM↔ML) have been very effective in delivering the precision of QM at the high speed of ML.

In this study, we show that by integrating well-known signal processing techniques such as the Wigner-Ville distribution (WVD) in the QM↔ML pipeline, we obtain a powerful methodology that can be used for representation, visualization and characterization of the properties of molecules. More precisely, we show that the WVD representation of molecules encodes their structural, geometric, energetic, electronic and thermodynamic properties. This is demonstrated by using the new representation in the forward design loop as input to a deep convolutional neural network trained on DFT calculations, which outputs the properties of the molecules.
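For readers unfamiliar with the Wigner-Ville distribution, the sketch below computes a discrete WVD of a one-dimensional signal with NumPy. It is a minimal sketch, assuming the standard lag-product definition; the abstract does not say how a molecule is turned into a signal, so the function `wigner_ville` and the toy complex tone used as input are illustrative assumptions only.

```python
import numpy as np


def wigner_ville(x):
    """Discrete Wigner-Ville distribution of a 1-D (analytic) signal.

    Row n holds the FFT over lag m of x[n + m] * conj(x[n - m]).
    Frequency bin k corresponds to k / (2 * N) cycles per sample.
    """
    x = np.asarray(x, dtype=complex)
    n_samples = x.size
    wvd = np.zeros((n_samples, n_samples))
    for n in range(n_samples):
        # Largest lag that keeps both n + m and n - m inside the signal.
        max_lag = min(n, n_samples - 1 - n)
        lags = np.arange(-max_lag, max_lag + 1)
        kernel = np.zeros(n_samples, dtype=complex)
        kernel[lags % n_samples] = x[n + lags] * np.conj(x[n - lags])
        # The lag product is conjugate-symmetric, so its FFT is real.
        wvd[n] = np.fft.fft(kernel).real
    return wvd


# Toy input: a complex tone at 0.125 cycles per sample, 64 samples long.
n_samples = 64
tone = np.exp(2j * np.pi * 0.125 * np.arange(n_samples))
wvd = wigner_ville(tone)
peak_bin = int(np.argmax(wvd[n_samples // 2]))
print(wvd.shape, peak_bin)
```

The resulting two-dimensional array is image-like, which is what makes it a natural input for a convolutional network.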

Tested on the QM9 dataset (composed of 133,855 organic molecules and 19 properties), the new model is able to predict the properties of molecules with a mean absolute error (MAE) below acceptable chemical accuracy (i.e. MAE < 1 kcal/mol for total energies and MAE < 0.1 eV for orbital energies). Furthermore, the new approach performs similarly to or better than other state-of-the-art ML techniques described in the literature. More precisely, the proposed model outperforms several of these techniques on the prediction of 14 properties and was able to predict 16 out of 19 properties of the QM9 dataset with MAEs below chemical accuracy.

This study shows that the new model is a powerful technique for the forward design of organic molecules. Preliminary results of the inverse design of molecules using WVD and generative models will also be discussed.

References 1. Rupp, “Machine learning for quantum mechanics in a nutshell,” International Journal of Quantum Chemistry, vol. 115, no. 16, pp. 1058–1073, Apr. 2015. 2. C. Blum and J.-L. Reymond, “970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13,” Journal of the American Chemical Society, vol. 131, no. 25, pp. 8732–8733, 2009. 3. A. B. Tchagang, A. H. Tewfik and J. J. Valdés, “Molecular Design Using Signal Processing and Machine Learning: Time-Frequency-like Representation and Forward Design”, arXiv preprint arXiv:2004.10091, 2020.

Acknowledgements This work is supported by the National Research Council of Canada, Artificial Intelligence for Design program.

P50 © The Author(s), 2020

Tandem Random Forests - Monte Carlo optimization of Nitro-arene catalytic reduction

Filipe Teixeira, Joaquín M. Santelices, M. N. D. S. Cordeiro LAQV-REQUIMTE - Faculty of Sciences of the University of Porto, Portugal

Nitro-arene (NA) reduction has attracted considerable attention in recent years as a viable route to transform highly pollutant industrial effluents into valuable fine chemicals[1]. This interest has prompted the development of new catalysts for NA reduction, usually by trial and error, which is a long and expensive process, often requiring the use of expensive chemicals and characterization methods.

In this work, we employ a Random Forest (RF) model[2] to model the catalytic efficiency of a diverse set of recently published catalysts. In order to build this model, the information regarding each catalyst had to be codified, together with the available data concerning the reaction conditions used in the catalytic assays carried out by different groups. The hyper-parameters of the Random Forest algorithm were optimized using a standard 5-fold cross-validation routine, and the final model was trained using a 60:40 split between training and testing data. The resulting regression model achieved good predictive performance, with the outlier cases apparently related to large deviations in the protocol detailed in the experimental reports (i.e. experimental protocols that did not follow the same overall procedure observed in most).

An assessment of the variable importance and partial dependence plots for the RF model highlighted some interesting tendencies regarding the experimental data and the diverse protocols used to generate it. Moreover, it allowed some qualitative insights into the desirable characteristics when designing a new catalyst for NA reduction, as well as into the conditions allowing for its optimum performance. In order to further explore such desirable features, a Monte Carlo (MC) optimization protocol was employed to optimize the deployment conditions of a number of catalysts. The results from this procedure strongly suggest that the experimental conditions used in the catalytic assessments are generally sub-optimal, although usually close to the predicted points of maximum efficiency.
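The Monte Carlo step can be pictured as a random walk over reaction conditions that is scored by the surrogate model. The sketch below is a minimal illustration under stated assumptions: the abstract does not detail the MC protocol, so the accept-if-better rule, the two condition variables, their bounds and step sizes, and the analytic `predicted_efficiency` function (standing in for the trained Random Forest) are all hypothetical.

```python
import random


def predicted_efficiency(temperature, loading):
    """Hypothetical stand-in for the surrogate model's predicted efficiency."""
    return 1.0 - ((temperature - 60.0) / 40.0) ** 2 - ((loading - 2.0) / 2.0) ** 2


# Hypothetical search bounds and random-walk step sizes per condition.
BOUNDS = {"temperature": (20.0, 100.0), "loading": (0.5, 5.0)}
STEP = {"temperature": 2.0, "loading": 0.1}


def monte_carlo_optimize(start, n_steps=5000, seed=0):
    """Random-walk search that keeps a trial move only if the score improves."""
    rng = random.Random(seed)
    current = dict(start)
    current_score = predicted_efficiency(**current)
    for _ in range(n_steps):
        trial = {}
        for name, value in current.items():
            low, high = BOUNDS[name]
            moved = value + rng.gauss(0.0, STEP[name])
            trial[name] = min(high, max(low, moved))  # clip to the bounds
        trial_score = predicted_efficiency(**trial)
        if trial_score > current_score:
            current, current_score = trial, trial_score
    return current, current_score


start = {"temperature": 25.0, "loading": 1.0}
best, best_score = monte_carlo_optimize(start)
print(best, best_score)
```

Comparing `best` with the conditions actually reported for a catalyst is one way to judge how far an experiment sat from the predicted optimum.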

Acknowledgements This work was supported by Fundação para a Ciência e a Tecnologia (FCT/MEC) through national funds and co-financed by FEDER, under the partnership agreement PT2020 (Projects UID/QUI/50006/2013 and POCI/01/0145/FEDER/007265). Further funding was also received from FCT/MEC under project REALM – Reactive Learning Machines (PTDC/QUI-QIN/30649/2017).

References 1. B. Jarrais, A. Guedes, C. Freire, Heteroatom-doped carbon nanomaterials as metal-free catalysts for the reduction of 4-nitrophenol, ChemCatChem, 2017, 9, 1737-1748 2. Yang, W., Fidelis, T., Sun, W. “Machine Learning in Catalysis, From Proposal to Practicing” ACS Omega 2020, 5 (1), 83-88

P51 © The Author(s), 2020

Molecular-level virtual screening of nutraceuticals against COVID-19 using artificial intelligence and machine learning

Jalala V. K and K Muraleedharan University of Calicut, India

SARS-CoV-2 is a new strain of coronavirus (2019-nCoV) that has spread all over the globe in a very short time, becoming a pandemic. In the last 10 years, nutraceuticals have grown in interest to researchers, industry, and consumers and are now familiar in the collective imagination as a tool for preventing the onset of a disease. In the present study, we will perform an in-silico study of the SARS-CoV-2 structure with different herbal compounds of medicinal importance. We selected a key viral protein of SARS-CoV-2, the main protease (Mpro) (PDB ID: 6LU7). The aim of this proposed work is to identify potential lead candidates for the development of low-cost nutraceuticals that can be used against the SARS-CoV-2 virus, by combining a molecular chemistry approach with machine learning and AI.

References 1. Nutraceuticals Inspiring the Current Therapy for Lifestyle Diseases, Silpi Chanda, Raj Kumar Tiwari, Arun Kumar, and Kuldeep Singh 2. From pharmaceuticals to nutraceuticals: bridging disease prevention and management, Patricia Daliu, Antonello Santini & Ettore Novellino 3. Ayurvedic treatment of COVID-19/SARS-CoV-2: A case report, P.L.T. Girija, Nithya Sivan 4. Potential Inhibitor of COVID-19 Main Protease (Mpro) from Several Medicinal Plant Compounds by Molecular Docking Study, Siti Khaerunnisa, Hendra Kurniawan, Rizki Awaluddin, Suhartati Suhartati, Soetjipto Soetjipto

P52 © The Author(s), 2020

Machine Learning based design with small data sets: when the average model knows best

Danny E. P. Vanpoucke,1 Onno S. J. Van Knippenberg,2 Ko Hermans,2 Siamak Mehrkanoon,3 and Katrien V. Bernaerts1 1Maastricht University, Aachen-Maastricht Institute for Biobased Materials (AMIBM), The Netherlands, 2CCL Olympic B.V., The Netherlands, 3Maastricht University, Department of Data Science and Knowledge Engineering, The Netherlands

Machine Learning plays an ever more important role in modern materials design and discovery. New discoveries and applications of machine learning are presented at a steadily increasing pace. Unfortunately, these achievements are generally rooted in access to a suitably large data set. Although such big data sets are becoming more commonplace, they are generally not representative of the day-to-day work performed by materials researchers, where large numbers of samples are often unfeasible due to production cost or time, or the availability of raw materials.

Due to the success of Machine Learning within the context of large data sets, there is a natural interest in applying these methods in the context of small data sets and reaping their rewards there as well. The use of artificial intelligence and Machine Learning in these cases is generally aimed at improved design of experiments for materials optimization, often in combination with robotic automation.

Figure 1: Modelling small data sets. (a) schematic representation of the problem. (b) and (c) heatmaps of ensembles of 1000 model instances for a linear and non-linear data set of 20 data points. (from [1])

In this work, we present a critical investigation of the role of small (< 25 data points) data sets in Machine Learning based regression analysis. We highlight the strong dependence of the model quality on the considered data points as an important limitation of Machine Learning in this context. Using both synthetic and experimental data sets we show that the model instances of an ensemble are distributed around the model average (cf. Fig. 1).[1]

Figure 2: Estimated MAE values for the Best, Worst, and Average model instances for two experimental data sets. (from [1])

This result appears to be independent of the underlying model. More interestingly, we find that this ensemble average presents a model quality on par with that of the best available model instance in the ensemble for the data set (cf. Fig. 2). We therefore propose to construct a model instance that is equivalent to the ensemble average, but presents a much lower computational cost for evaluation and storage. This mitigates the observed limitation of Machine Learning for small data sets, and also makes it accessible in the context of day-to-day small-scale materials projects.
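For models that are linear in their parameters, a single instance equivalent to the ensemble average can be built by averaging the fitted coefficients. The sketch below illustrates this with straight-line fits in NumPy. It is a minimal sketch under assumptions: the synthetic data, the choice of 15-point training subsets, and the use of `np.polyfit` are illustrative and not taken from the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic small data set: 20 noisy points on a straight line (assumed).
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 1.0 + rng.normal(scale=0.2, size=x.size)

# Ensemble of 1000 model instances, each fitted on a random 15-point subset.
coefs = []
for _ in range(1000):
    train = rng.choice(x.size, size=15, replace=False)
    coefs.append(np.polyfit(x[train], y[train], deg=1))
coefs = np.array(coefs)  # shape (1000, 2): slope and intercept per instance

# Single "average" model: one slope and one intercept to store and evaluate.
mean_coef = coefs.mean(axis=0)

x_new = np.linspace(0.0, 1.0, 5)
ensemble_prediction = np.mean([np.polyval(c, x_new) for c in coefs], axis=0)
single_prediction = np.polyval(mean_coef, x_new)
print(np.allclose(ensemble_prediction, single_prediction))
```

The equivalence holds here because the prediction is linear in the coefficients; for models that are non-linear in their parameters, averaging parameters would generally not reproduce the ensemble average.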

References 1. “Small Data Materials Design with Machine Learning: When the Average Model Knows Best”, Danny E. P. Vanpoucke et al., Journal of Applied Physics (2020), DOI: https://doi.org/10.1063/5.0012285

P53 © The Author(s), 2020

Autonomous cloud-based platform for the AI-driven synthesis of molecules

Alain C. Vaucher, Antonio Cardinale, Joppe Geluykens, Matteo Manica, Philippe Schwaller, Aleksandros Sobczyk, Alessandra Toniato, Federico Zipoli, Teodoro Laino IBM Research Europe, Switzerland

The process of synthesizing novel compounds involves multiple steps including 1) suggesting an adequate synthetic route, 2) specifying reaction conditions and parameters for each reaction step, and 3) executing the synthesis in the laboratory. Typically, these steps are guided by the knowledge and experience matured by chemists in decades of practice. Still, they remain complex and time-consuming.

Recent advances in automation and artificial intelligence (AI) offer new possibilities to accelerate this process. Over the last few years, our group has followed the goal to integrate all steps for the synthesis of novel compounds in a unified platform combining the advantages of AI and automation. The platform can be steered remotely and can be compared to a cloud-based laboratory that supports chemists in the design of a chemical synthesis and then executes it autonomously, possibly thousands of kilometres away.

To achieve this, our platform combines different algorithms recently published by our group. They are applied at different stages in the setup of a remote synthesis. First, a target molecule drawn by a chemist in the web browser undergoes a retrosynthetic analysis, after which possible synthetic routes are suggested [1] and can be reviewed by the chemist. The result is a full retrosynthetic tree going back to commercially available compounds. Second, for each reaction step a sequence of synthesis actions is generated [2,3]. The synthesis actions thus obtained represent common operations executed in wet laboratories and are hardware-independent. In a subsequent step, the actions undergo a conversion to instructions specific to the installed chemical robot and can be submitted for execution. As long as the required chemicals are available, the robot can execute the synthesis autonomously and without human input. In this manner, we were able to demonstrate the successful execution of multiple chemical syntheses.

References 1. Schwaller, P.; Petraglia, R.; Zullo, V.; Nair, V. H.; Haeuselmann, R. A.; Pisoni, R.; Bekas, C.; Iuliano, A.; Laino, T., Chem. Sci. 2020, 11, 3316–3325. 2. Vaucher, A. C.; Zipoli, F.; Geluykens, J.; Nair, V. H.; Schwaller, P.; Laino, T., Nat. Commun. 2020, 11, 3601. 3. Vaucher, A. C. et al., in preparation.

P54 © The Author(s), 2020

Looking for appropriate descriptors of heterogeneous catalysts

Aline Villarreal SEDEMA, Mexico

Heterogeneous catalysts are of fundamental importance in industry, to accelerate the use of renewable energy, and to transition towards more sustainable processes1. However, the discovery of better catalysts is a time- and resource-consuming endeavor2,3. The use of machine learning algorithms promises to decrease the amount of experimental data needed to find patterns in catalyst performance and predict the composition of better materials4,5. Nevertheless, the accuracy of the algorithms depends on the quality and pertinence of the input features (composition, support, surface termination, particle size, particle morphology, and atomic coordination environment) and the appropriate choice of the adjustment parameters of the chosen model2,6.

The availability and quality of input features are limited because experimental catalysis data is often hard to obtain. In some systems, the catalyst morphology changes under reaction conditions, and such data are even scarcer7,8. Quantum mechanics (QM) simulations are often useful to fill in the gaps of information and are less expensive, though the results cannot be extrapolated entirely to real catalysts2.

To make this work easier, properties that are less expensive to obtain (experimentally and computationally) are used as descriptors. These descriptors can be useful as inputs or outputs and must be easily related to catalyst activity. The most common descriptor in heterogeneous catalysis is the energy of the d-band center (with respect to the Fermi level), because it is connected to the interaction between the adsorbate valence orbitals and the d-orbitals of the adsorbent when the latter is a transition metal6,9,10. Another chemically meaningful descriptor is the atomic coordination number11–13.

In this work, we surveyed the literature looking for the use of descriptors and their accuracy and generalizability when used in machine learning models. We found that composition and, in particular, stepwise variations in the composition of promoters are used together with supervised-learning algorithms and high throughput experimentation to extract information from these experimental datasets4,14.

Other descriptors used together with DFT calculations are the electronegativity of the surface, the number of d-valence electrons, and the adjusted work function. In general, the examined works prefer to use chemically meaningful concepts, and none of them used aggregate descriptors found autonomously by classification algorithms9,12,13.

In conclusion, experimental approaches prefer to use descriptors related only to synthesis parameters, such as precursor compositions, coupled with high-throughput experimentation, perhaps because of the difficulty of obtaining rapid characterizations that yield information about the structure of the catalyst. On the other hand, theoretical approaches use a variety of descriptors based on chemical knowledge.

References 1. Corma, Angew. Chem. Int. Ed., 2016, 55, 6112–6113. 2. R. Goldsmith, et al., AIChE J., 2018, 64, 2311–2323. 3. Williams, K., et al., Chem. Mater., 2020, 32, 157–165. 4. Moliner, et al., Acc. Chem. Res., 2019, 52, 2971–2980. 5. F. de Almeida, et al., Nat. Rev. Chem., 2019, 3, 589–604. 6. M. Ghiringhelli, et al., Phys. Rev. Lett., 2015, 114, 105503. 7. Topsøe, J. Catal., 2003, 216, 155–164. 8. Zhao, Y., et al., ChemCatChem, 2015, 7, 3683–3691. 9. J. H. Jacobsen, et al., J. Am. Chem. Soc., 2001, 123, 8404–8405. 10. J. Medford, et al., J. Catal., 2015, 328, 36–42. 11. F. Calle-Vallejo, et al., Science, 2015, 350, 185–189. 12. Ma and H. Xin, Phys. Rev. Lett., 2017, 118, 036101. 13. Zhang and C. Ling, npj Comput. Mater., 2018, 4, 25. 14. A. Corma, et al., ChemPhysChem, 2002, 3, 939–945.

P55 © The Author(s), 2020

Machine learning for polymer swelling in liquids

Qisong Xu and Jianwen Jiang National University of Singapore, Singapore * Corresponding Author: [email protected]

Polymer swelling in liquids remains of utmost importance in a wide range of technological applications. Owing to the complex nature of this phenomenon, it has been extensively investigated through theory, experiment and simulation. In this study, we develop a machine learning (ML) methodology to quantify and characterize polymer swelling in liquids. As illustrations, the methodology is applied to the swelling of organic solvent nanofiltration (OSN) membranes and of polydimethylsiloxane (PDMS). Specifically, chemically intuitive descriptors such as solubility parameters and solvent properties are selected, and a molecular representation based on a sum-of-fragments approach is proposed for modelling. Using kernel ridge regression, the model based on the solubility parameters of polymer and solvent offered the best quantitative prediction of the swelling degree and revealed multimodal swelling behaviour in OSN membranes. For PDMS swelling, the solvent solubility parameter and geometry were found to be the key properties. The proposed sum-of-fragments representation also demonstrated remarkable predictive power. With appropriate data augmentation, excellent out-of-sample predictions were achieved for PEI swelling in nine solvents and for PDMS swelling in substituted aromatic solvents. Principal component analysis was applied to the sum-of-fragments representation to explore its suitability as a molecular representation and to map the chemical space of polymer swelling. Relationships between molecular fragments and swelling were also determined from Pearson correlation. This ML study demonstrates how chemically intuitive descriptors can be developed and used to construct models that predict polymer swelling accurately and reveal chemical insights into it. The methodology can also be extended to other polymer properties in liquids, thereby expanding its scope of potential applications.
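To make the regression step concrete, the sketch below fits a kernel ridge regression model with a Gaussian kernel to pairs of polymer and solvent solubility parameters. It is a minimal illustration, not the authors' implementation: the data are synthetic (swelling is generated from an assumed toy function of the solubility parameter difference), and the kernel width, regularisation strength and all function names are hypothetical choices.

```python
import math

def rbf_kernel(a, b, gamma):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def krr_fit(X, y, gamma, lam):
    """Return the dual coefficients alpha = (K + lam * I)^-1 y."""
    K = [[rbf_kernel(xi, xj, gamma) + (lam if i == j else 0.0)
          for j, xj in enumerate(X)] for i, xi in enumerate(X)]
    return solve(K, y)

def krr_predict(X_train, alpha, x_new, gamma):
    """Predict as a kernel-weighted sum over the training points."""
    return sum(a * rbf_kernel(xt, x_new, gamma) for a, xt in zip(alpha, X_train))

# Synthetic data (assumption): features are (polymer, solvent) solubility
# parameters; the "swelling degree" peaks when the two values match.
GAMMA, LAM = 0.05, 1e-3
X = [(dp, float(ds)) for dp in (16.0, 19.0, 22.0) for ds in range(14, 28, 2)]
y = [math.exp(-((dp - ds) ** 2) / 20.0) for dp, ds in X]

alpha = krr_fit(X, y, GAMMA, LAM)
prediction = krr_predict(X, alpha, (19.0, 21.0), GAMMA)  # held-out solvent
```

In practice the kernel width and regularisation strength would be selected by cross-validation, and a library implementation would replace the hand-written linear solver.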

P56 © The Author(s), 2020

Development of a global lattice energy landscape sampling method

Shiyue Yang, Graeme M. Day University of Southampton, United Kingdom

Organic crystal structure prediction (CSP) methods aim to predict experimentally viable crystal structures of organic molecules and are important for the computer-guided discovery of functional materials. CSP is made difficult by the high-dimensional configurational space and the need to locate all possible low-energy structures. A quasi-random search method was previously developed in our group to identify candidate crystal structures. To improve its efficiency in locating important low-energy structures while maintaining effective sampling of higher-energy crystal structures, the Monte Carlo Basin Hopping (MCBH) algorithm was implemented, starting from structures generated by the quasi-random search method. Two different trial-trial communication strategies were introduced: independent trials, and on-the-fly clustering, which truncates a trial if the sampled minimum has already been located by other trials. The hybrid quasi-random basin hopping (QRBH) sampling method was applied to both single-component crystals and co-crystal systems in various space groups, and different parameters, such as temperature, perturbation step size and computational allocation, were explored to maximize the sampling efficiency. We find that the new algorithm has advantages in locating low-energy structures, especially on complex energy landscapes such as those of co-crystals, while maintaining the ability to sample high-energy experimental structures. The on-the-fly clustering QRBH method had a better sampling efficiency with a large perturbation step size, but the optimal temperature differs among space groups.
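The basin hopping loop described above (perturb, locally minimise, accept or reject with a Metropolis test) can be sketched as follows. This is an illustrative toy, not the QRBH code: it uses a hypothetical one-dimensional energy function in place of a lattice energy, a fixed starting point in place of a quasi-random structure, and a single trial without on-the-fly clustering. The temperature and step size are arbitrary.

```python
import math
import random

def energy(x):
    """Toy 1D energy surface with many local minima (illustrative only)."""
    return 0.1 * x * x + math.sin(3.0 * x)

def gradient(x):
    """Analytic derivative of the toy energy."""
    return 0.2 * x + 3.0 * math.cos(3.0 * x)

def local_minimise(x, step=0.05, iters=500):
    """Plain gradient descent to the bottom of the current basin."""
    for _ in range(iters):
        x -= step * gradient(x)
    return x

def basin_hopping(x0, n_steps=200, step_size=1.5, temperature=0.5, seed=0):
    """Monte Carlo basin hopping: perturb, minimise, Metropolis accept."""
    rng = random.Random(seed)
    x = local_minimise(x0)
    e = energy(x)
    best_x, best_e = x, e
    minima = {round(x, 3)}  # distinct minima visited so far
    for _ in range(n_steps):
        trial = local_minimise(x + rng.uniform(-step_size, step_size))
        e_trial = energy(trial)
        minima.add(round(trial, 3))
        # The Metropolis test compares minimised energies, not raw ones
        if e_trial <= e or rng.random() < math.exp(-(e_trial - e) / temperature):
            x, e = trial, e_trial
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e, sorted(minima)

best_x, best_e, minima = basin_hopping(x0=5.0)
```

The set of visited minima is the kind of record that an on-the-fly clustering scheme could share between trials; here it is only collected for inspection.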

References
1. Frisch, M. J. et al. (2009). Gaussian 09, Revision D.01.
2. Coombes, D. S. et al. (1996). J. Phys. Chem., 100(18), 7352.
3. Anthony, L. S. (2009). Acta Crystallogr. D, 65(2), 148.
4. Eamonn, K. & Chotirat, A. R. (2005). Knowl. Inf. Syst., 7(3), 358.
5. Case, D. H. et al. (2016). J. Chem. Theory Comput., 12(2), 910.
6. Wales, D. J. & Doye, J. P. (1997). J. Phys. Chem. A, 101(28), 5111.

P57 © The Author(s), 2020

Automated calculation of reaction energy profiles

Tom A. Young‡, Joseph Silcock‡, Alistair J. Sterling‡ and Fernanda Duarte° ‡Physical & Theoretical Chemistry Laboratory, University of Oxford, OX1 3QZ °Chemistry Research Laboratory, University of Oxford, OX1 3TA

Mechanistic hypotheses in organic and organometallic chemistry are now routinely accompanied by computed reaction profiles. Constructing these profiles is, however, highly time-consuming and non-systematic, even though the process is algorithmic in nature.1 We have therefore developed autodE,2 a Python package that automates transition state (TS) location and conformational sampling of minima and TSs. From SMILES input, autodE generates reaction profiles that would take days of human effort even for experienced computational chemists. We present industrially and synthetically relevant examples, including metal-catalysed hydroformylation and an Ireland–Claisen cascade reaction.3,4

References
1. G. N. Simm, A. C. Vaucher and M. Reiher. J. Phys. Chem. A, 123(2):385–399, 2018.
2. T. Young and J. Silcock. autode v. 1.0.0a. https://github.com/duartegroup/autodE, 2020.
3. T. Kégl. RSC Adv., 5(6):4304–4327, 2015.
4. G. P. Petrova, A. Patel, K. Morokuma, K. N. Houk, C. Whan Lee, B. L. H. Taylor and B. M. Stoltz. J. Am. Chem. Soc., 141(17):6995–7004, 2019.

P58 © The Author(s), 2020

Imputation of missing data for polymer membrane gas separation with machine learning

Qi Yuan1, Mariagiulia Longo2, Aaron Thornton3, Neil B. McKeown4, John Jansen2, and Kim E. Jelfs1* 1Department of Chemistry, Molecular Sciences Research Hub, White City Campus, Imperial College London, Wood Lane, London, UK, 2Institute on Membrane Technology, National Research Council (CNR-ITM), Via Pietro Bucci 17/C, 87036 Rende (CS), Italy, 3Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia, 4 EaStCHEM, School of Chemistry, University of Edinburgh, Joseph Black Building, David Brewster Road, Edinburgh, Scotland, UK

Polymer membranes with gas selectivity can be used for energy-efficient industrial gas separations, and an open-source database of such polymers would benefit the discovery of gas-selective polymers. The Membrane Society of Australasia (https://membrane-australasia.org/) hosts a database of the gas permeability of polymers, collected from publications from 1950 to 2018. However, the database contains missing values, which makes it difficult to generalize quantitative relationships among the permeabilities of different gases. If the missing values can be filled accurately, one can not only retrieve candidates with good gas selectivity that were not measured at the time of publication, but also obtain a more complete database for future experimental and theoretical study.

In this study, missing values in the database were filled using machine learning (ML). The ML model was validated with published gas permeability data that are not recorded in the database. The root mean squared error (RMSE) between the ML predictions and the experimental reports of the logarithm of the gas permeability was between 0.06 and 0.13 for polymers of intrinsic microporosity (PIMs) and polyimides. In addition, the ML model could predict whether the gas selectivity of polymers in the test set was above the Robeson upper bound with an accuracy greater than 0.8.
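The validation metric quoted above can be written down explicitly. The sketch below computes the RMSE between predicted and measured permeabilities on a base-10 logarithmic scale, together with the ideal selectivity as a ratio of two permeabilities. It assumes a base-10 logarithm (the abstract does not state the base), and the function names and numerical values are illustrative rather than taken from the database.

```python
import math

def rmse_log10(predicted, measured):
    """RMSE between log10 of predicted and measured permeabilities."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(math.log10(p) - math.log10(m)) ** 2
               for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))

def ideal_selectivity(perm_a, perm_b):
    """Ideal selectivity of gas A over gas B: ratio of permeabilities."""
    return perm_a / perm_b

# Illustrative permeabilities (made-up numbers, arbitrary consistent units)
measured_co2 = [100.0, 10.0, 1.0]
predicted_co2 = [110.0, 9.0, 1.0]
error = rmse_log10(predicted_co2, measured_co2)
alpha_co2_n2 = ideal_selectivity(100.0, 4.0)
```

Working on the logarithmic scale keeps the error comparable across polymers whose permeabilities differ by orders of magnitude.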

By filling in the missing data, we can reanalyse historical polymers and suggest potential “missed” candidates with the desired gas selectivity. ML with sparse features was also performed, and we suggest that the permeability of He, H2, O2, N2 and CH4 can be estimated quantitatively from the gas permeability of O2 and/or CO2. Initial insight into the gas permeability of polymers can thus be gained at an early stage of experimental measurements, and our model has the potential to rapidly identify polymer membranes worth further investigation.

References
1. M. Galizia, W.S. Chi, Z.P. Smith, T.C. Merkel, R.W. Baker, B.D. Freeman, 50th Anniversary Perspective: Polymers and Mixed Matrix Membranes for Gas and Vapor Separation: A Review and Prospective Opportunities, Macromolecules. 50 (2017) 7809–7843. https://doi.org/10.1021/acs.macromol.7b01718.
2. A.W. Thornton, B.D. Freeman, L.M. Robeson, Polymer Gas Separation Membrane Database, (2012). https://membrane-australasia.org/.
3. S. van Buuren, K. Groothuis-Oudshoorn, mice: Multivariate imputation by chained equations in R, J. Stat. Softw. (2010) 1–68.

P59 © The Author(s), 2020

Identifying degradation patterns of Li-ion batteries from impedance spectroscopy using machine learning

Yunwei Zhang1, Qiaochu Tang2, Yao Zhang3, Jiabin Wang2, Ulrich Stimming2, Alpha Lee1 1Cavendish Laboratory, University of Cambridge, Cambridge CB3 0HE, UK, 2Chemistry – School of Natural and Environmental Sciences, Newcastle University, NE1 7RU Newcastle upon Tyne, UK, 3Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge CB3 0WA, UK

Forecasting the state of health and remaining useful life of Li-ion batteries is an unsolved challenge that limits technologies such as consumer electronics and electric vehicles. Here, we build an accurate battery forecasting system by combining electrochemical impedance spectroscopy (EIS), a real-time, non-invasive and information-rich measurement that has hitherto been underused in battery diagnosis, with Gaussian process machine learning. Over 20,000 EIS spectra of commercial Li-ion batteries were collected at different states of health, states of charge and temperatures; to our knowledge, this is the largest dataset of its kind. Our Gaussian process model takes the entire spectrum as input, without further feature engineering, and automatically determines which spectral features predict degradation. Our model accurately predicts the remaining useful life, even without complete knowledge of the past operating conditions of the battery. Our results demonstrate the value of EIS signals in battery management systems [1].
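A minimal sketch of Gaussian process regression on whole spectra is given below. It is illustrative only and is not the authors' model: it uses a single shared length scale rather than per-feature relevance weights, the three-point "spectra" and capacity values are invented, and all names and hyperparameters are hypothetical.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential kernel with unit prior variance."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * d2 / length ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def forward_sub(L, b):
    """Solve L z = b for lower-triangular L."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z

def backward_sub(L, z):
    """Solve L^T x = z for lower-triangular L."""
    n = len(z)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_predict(X, y, x_new, noise=1e-2, length=1.0):
    """Posterior mean and latent variance at x_new (constant mean = mean(y))."""
    y_mean = sum(y) / len(y)
    centred = [yi - y_mean for yi in y]
    K = [[rbf(a, b, length) + (noise if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    L = cholesky(K)
    alpha = backward_sub(L, forward_sub(L, centred))
    k_star = [rbf(a, x_new, length) for a in X]
    mean = y_mean + sum(k * a for k, a in zip(k_star, alpha))
    v = forward_sub(L, k_star)
    var = rbf(x_new, x_new, length) - sum(vi * vi for vi in v)
    return mean, max(var, 0.0)

# Invented data: each row is a whole "spectrum" (impedance at three
# frequencies); the target is the remaining capacity fraction.
NOISE = 1e-2
X = [[1.0, 0.5, 0.2], [1.2, 0.7, 0.3], [1.5, 1.0, 0.5], [1.9, 1.4, 0.8]]
y = [1.00, 0.95, 0.88, 0.80]
mean, var = gp_predict(X, y, [1.35, 0.85, 0.40], noise=NOISE)
```

The predictive variance grows for spectra far from the training data, which is what allows a forecasting system of this kind to report its own uncertainty.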

References
1. Yunwei Zhang, et al. "Identifying degradation patterns of lithium ion batteries from impedance spectroscopy using machine learning." Nature Communications 11.1 (2020): 1-6.

P60 © The Author(s), 2020


Royal Society of Chemistry
www.rsc.org
Registered charity number: 207890
© Royal Society of Chemistry 2019

Thomas Graham House, Science Park, Milton Road, Cambridge, CB4 0WF, UK
T +44 (0)1223 420066

Burlington House, Piccadilly, London W1J 0BA, UK
T +44 (0)20 7437 8656

International offices: Beijing, China; Shanghai, China; Berlin, Germany; Bangalore, India; Tokyo, Japan; Philadelphia, USA; Washington, USA