
A unified machine-learning protocol for asymmetric catalysis as a proof of concept demonstration using asymmetric hydrogenation

Sukriti Singh (a,1), Monika Pareek (a,1), Avtar Changotra (a,1), Sayan Banerjee (a), Bangaru Bhaskararao (a), P. Balamurugan (b,2), and Raghavan B. Sunoj (a,2)

(a) Department of Chemistry, Indian Institute of Technology Bombay, Powai, 400076 Mumbai, India; and (b) Industrial Engineering and Operations Research, Indian Institute of Technology Bombay, Powai, 400076 Mumbai, India

Edited by Michael W. Deem, Rice University, Houston, TX, and accepted by Editorial Board Member Tobin J. Marks December 10, 2019 (received for review September 24, 2019)

Design of asymmetric catalysts generally involves time- and resource-intensive heuristic endeavors. In view of the steady increase in interest toward efficient catalytic asymmetric reactions and the rapid growth in the field of machine learning (ML) in recent years, we envisaged dovetailing these two important domains. We selected a set of quantum chemically derived molecular descriptors from five different asymmetric binaphthyl-derived catalyst families with the propensity to impact the enantioselectivity of asymmetric hydrogenation of alkenes and imines. The predictive power of the random forest (RF) built using the molecular parameters of a set of 368 substrate–catalyst combinations is found to be impressive, with a root-mean-square error (rmse) in the predicted enantiomeric excess (%ee) of about 8.4 ± 1.8 compared to the experimentally known values. The accuracy of RF is found to be superior to other ML methods such as convolutional neural network, decision tree, and eXtreme gradient boosting as well as stepwise linear regression. The proposed method is expected to provide a leap forward in the design of catalysts for asymmetric transformations.

asymmetric catalysis | machine learning | computational chemistry

Significance
Development of suitable machine-learning (ML) approaches by using molecular descriptors can provide significant impetus to current efforts in asymmetric catalysis, wherein one strives to make a desired stereoisomer of a given handedness (enantiomer) in a highly selective manner. The proposed approach provides a sustainable model that trains on known catalysts and helps to predict the efficacy of additional catalysts for asymmetric synthesis, thereby expediting the discovery with lesser cost as compared to traditional empirical methods. Training ML algorithms first using the available known examples, predicting additional catalysts using such algorithms for subsequent in-line experimental validation, and re-training by augmenting with the additional data thus generated can provide a superior train–predict–train cycle suitable for accelerated reaction discovery.

Author contributions: P.B. and R.B.S. designed research; S.S., M.P., A.C., S.B., and B.B. performed research; S.S., M.P., A.C., and S.B. analyzed data; and S.S., M.P., A.C., S.B., P.B., and R.B.S. wrote the paper.
The authors declare no competing interest.
This article is a PNAS Direct Submission. M.W.D. is a guest editor invited by the Editorial Board.
Published under the PNAS license.
Data deposition: The data reported in this paper have been deposited in GitHub, https://github.com/Sunojlab/ML-for-Asymmetric-Catalysis.
1 S.S., M.P., and A.C. contributed equally to this work.
2 To whom correspondence may be addressed. Email: [email protected] or [email protected].
This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1916392117/-/DCSupplemental.
First published January 8, 2020.

The ever-growing requirement for high-purity chiral compounds in therapeutic applications serves as an impetus for continued development of asymmetric catalysts (1, 2). Creation of new catalysts often involves tedious trial and error cycles driven by chemical intuition. Molecular insights on stereocontrol in transition states, obtained using modern electronic structure computations, have also contributed to this goal, although such methods are generally resource intensive (3, 4). Therefore, development of faster reliable catalyst design techniques that can guide future experiments is of high contemporary interest. Among the leading efforts, the application of mathematical modeling using multivariate stepwise linear regression (MLR) has recently emerged as a promising tool (5, 6). In particular, a fairly large number of molecular parameters are considered to identify the best regression equation that correlates important steric and electronic parameters to the stereochemical outcome. The tuning of the equations to obtain a compatible model in MLR can be challenging, as can incorporation of higher-order nonlinear terms. We believe that the burgeoning advances in the domain of machine learning could be exploited as a complementary optimization tool.

Diverse domains of modern science have already embraced the benefits of machine learning (ML) with an impressive degree of success (7–11). There have also been some important applications of ML in chemistry at large and catalytic reactions in particular (12–27), such as for predicting selectivities (28–36). We intend to design a practically useful ML protocol to be deployed for asymmetric hydrogenation of substrates bearing C=C and C=N bonds using axially chiral binaphthyl-derived catalyst families. Our ML approach makes use of molecular parameters of the catalysts and substrates that could have higher propensity to impact the enantioselectivity of the reaction. The chosen parameters are then used in random forest (RF) as well as other ML algorithms, which are well known for their predictive capabilities (37, 38). This study is a comprehensive application of a range of ML methods in asymmetric hydrogenation catalysis using enantiomeric excess (%ee) as the target value.

Choice of Reaction Type and Asymmetric Catalyst Families
A total of 368 known asymmetric hydrogenation reactions catalyzed by five different axially chiral binaphthyl catalyst families (encompassing BINOL-phosphite [L1], BINOL-phosphoramidite [L2], BINAP [L3], BINAP-O [L4], and BINOL-phosphoric acid [L5]) and a series of alkenes and imines were considered to examine the applicability of ML (Fig. 1A) (39–42) (see SI Appendix, Tables S1–S4 for more details of catalyst and substrate families). The chemical diversity of catalysts and substrates considered in this study can be gleaned from Fig. 1. Both the reaction type and the catalyst families have been widely applied (43, 44) in enantioselective synthesis of important pharmaceutical and

www.pnas.org/cgi/doi/10.1073/pnas.1916392117 | PNAS | January 21, 2020 | vol. 117 | no. 3 | 1339–1345

[Figure 1 graphic omitted. Panel A (catalysts and substrates): generalized structures of catalyst families with the number of reactions in parentheses: L1 (39), L2 (120), L3 (42), L4 (52), L5 (115); inset showing a representative illustration of sterimol parameters (L1, B1, B5), rotational constants (RCx, RCy, RCz), and electrostatic potential. Catalysts: L1 [20], L2 [14], L3 [6], L4 [7], L5 [11] = 58. Substrates: alkene [96], imine [94] = 190. Reactions (catalyst–substrate combinations) = 368. Panel B: diversity of substituents on the catalysts. Panel C: alkenes. Panel D: imines.]

Fig. 1. Catalysts and substrates. (A) A generalized representation of catalyst families and substrates used in catalytic asymmetric hydrogenation. The number of reactions in each catalyst family is shown in parentheses. The number of catalysts and substrates in each case is given in square brackets. The atom-numbering scheme is shown for a representative member of the L1 family. A representative set of global molecular descriptors (e.g., sterimol parameters [L1, B1, B5] and rotational constants [RCx, RCy, RCz]) is shown. (B–D) Various substituents in (B) each catalyst family, (C) alkenes, and (D) imines are shown.

agrochemical compounds (45–47). Although the number of samples appears rather modest from a ML perspective, the molecular parameters derived from each catalyst and reactant of this library proved sufficient to train the ML algorithms (see below). Recent density functional theory studies have explored the role of noncovalent interactions in the stereocontrolling transition states of several reactions (48), including organo-catalyzed and transition metal-catalyzed reactions of the binaphthyl family (4). Such molecular insights have highlighted the influence of geometric and electronic features of a catalyst in dictating the formation of the preferred enantiomer in asymmetric transformations. However, deconvolution of the subtle interdependencies of these features remains a major challenge.

Selection of Parameters for Machine Learning
We have chosen a set of molecular parameters for each catalyst and substrate from the respective minimum energy geometries obtained at the M06-2X/6-31G** level of theory in the condensed phase (49) (full details in SI Appendix, section V). These parameters include bond lengths (BL), bond angles (BA), dihedral angles (DA), distance between nonbonded atoms (NBL), and sterimol parameters that capture molecular dimension. The vibrational frequencies (VF) and the corresponding vibrational intensities (VI) of certain normal modes of vibration (5), chemical shifts (NMR), and charges (q) of various atoms obtained using the natural population analysis (NPA) were chosen as the electronic parameters. In addition to site-specific molecular properties, we have also included 22 global descriptors that represent certain overall molecular properties (e.g., HOMO and LUMO energies, dipole moment, polar surface area, etc.; SI Appendix, Table S5). The nature of the chosen parameters is critical to the quality of the model and its ability to predict the reaction outcome (14). Since we aim to develop a ML model suitable across five different axially chiral catalyst families, most parameters were chosen in such a way that they share equivalent/common core regions. The differences, such as those arising due to substituents, are treated using local parameters like sterimol (SI Appendix, section IV). A step-by-step tutorial for using the python codes for various ML algorithms employed in this study is provided in SI Appendix, section XVII.

Subset Details and Application of RF Method
The parameters mentioned above formed the necessary dataset for the construction of a random number of trees using the RF algorithm. The output of the RF is the desired quantity of interest, i.e., %ee. RF is an ensemble technique where multiple decision trees are combined to get better predictive performance. We built random forest regressors, with a split of 20% samples in the test and 80% in the training sets (shown as step 1 in Fig. 2A). To ensure the desirable technical rigor,

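[Figure 2 graphic omitted. Panels A and B: procedure and subsets; panel C: deviations for 100 runs; panel D: zoomed-out view of 1 run; panel E: bar chart of rmse for RF, DT, GB, and CNN, with extracted values 8.4, 9.2, 9.6, and 11.7.]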

Fig. 2. Subset details and performance of unified random forest. (A) A general representation of the common procedure for calculating the final rmse. (B) Different catalyst–substrate subsets and the corresponding rmse. (C) Absolute deviation between the experimental and predicted %ee for all of the 100 runs. Color shades of green, yellow, and red respectively depict the superior, moderate, and inferior quantitative agreements between the experimental values and those predicted by the RF. The best run, characterized by the lowest rmse, is shown within a white border. Absolute deviations between the experimental and RF predicted %ee in the range of 0 to 80 output values for all of the 100 runs are separately provided in SI Appendix, section XVII. (D) A zoomed-out representation of the best run to convey how many predictions are close (darker green boxes) to the actual values. (E) rmse comparison across various ML methods such as RF, DT, eXtreme GB, and CNN.
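The evaluation procedure summarized in Fig. 2A (repeated random 80/20 splits, with the test rmse reported as mean ± SD over 100 runs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the data are synthetic placeholders, the function names `repeated_holdout_rmse` and `mean_baseline` are hypothetical, and the stand-in model simply predicts the training-set mean. The fivefold cross-validation used for hyperparameter selection is omitted. A random forest regressor (for example, from scikit-learn) could be wrapped in the same `fit_predict(X_train, y_train, X_test)` signature.

```python
import math
import random
import statistics


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mean_baseline(X_train, y_train, X_test):
    """Stand-in model (assumption): predict the training-set mean for every test sample."""
    mu = statistics.fmean(y_train)
    return [mu] * len(X_test)


def repeated_holdout_rmse(X, y, fit_predict, n_runs=100, test_fraction=0.2, seed=0):
    """Average the test rmse over n_runs random train/test splits."""
    rng = random.Random(seed)
    n_test = int(len(y) * test_fraction)  # 368 samples -> 73 test samples
    scores = []
    for _ in range(n_runs):
        idx = list(range(len(y)))
        rng.shuffle(idx)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        y_pred = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        scores.append(rmse([y[i] for i in test_idx], y_pred))
    return statistics.fmean(scores), statistics.stdev(scores), scores


# Synthetic placeholder data (not from the paper): 368 samples, 5 random descriptors.
data_rng = random.Random(1)
X = [[data_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(368)]
y = [min(99.0, max(0.0, 80.0 + 10.0 * row[0] + data_rng.gauss(0.0, 5.0))) for row in X]

mean_rmse, sd_rmse, scores = repeated_holdout_rmse(X, y, mean_baseline)
print(f"test rmse = {mean_rmse:.1f} ± {sd_rmse:.1f} over {len(scores)} runs")
```

Because each run draws its test set independently, a given sample may land in the test set in several runs, which is why a substrate–catalyst combination can be predicted once or multiple times under this scheme.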

100 different test–train splits were constructed using random selection, and a fivefold cross-validation (see SI Appendix, section VI.2 for more details of cross-validation) for each of the 100 different training sets was carried out to identify the best hyperparameter combination (step 2 in Fig. 2A). The RF model built using this best hyperparameter combination was then used for the prediction on the out-of-bag samples from the test sets in each of the 100 runs corresponding, respectively, to the 100 test–train splits (step 3 in Fig. 2A). The average performance is measured and reported in terms of the root-mean-square error (rmse) over these 100 runs (see SI Appendix, Tables S16 and S17 for more details). The %ee of each substrate–catalyst combination is therefore predicted once or multiple times in this approach. The quality of a trained RF model was examined by comparing the predicted %ee with that of the experimental values for the samples in the test set, which were not part of the model building.

First, independent RF models were constructed for each of the five catalyst families. For instance, in the RF model for catalyst L1, only reactions corresponding to that family were considered for the formation of the training and test sets (Fig. 2B). The results of application of the RF models were encouraging for each of the five catalyst families, in that they showed a consistently low average deviation of just 5.4 in the predicted %ee for external samples. This indicates the capability of the RF to decipher the intricate relationship between the molecular parameters of the substrate–catalyst combination and the enantioselectivity. The trained RF models were efficiently able to capture the substrate diversity within each catalyst family. The rmses in the predicted %ee for the test sets compared to the experimental values were found to be 6.3 ± 2.4 (L1), 6.5 ± 1.3 (L2), 9.2 ± 3.3 (L3), 8.2 ± 4.6 (L4), and 7.1 ± 1.2 (L5) (Fig. 2B), which engender good confidence in the RF models for various catalyst–substrate combinations within a catalyst family (see SI Appendix, Table S18 for more details).

Thus far, we have assessed the competence of our trained RF model to predict from within a set of experimentally known catalysts. In other words, the predicted %ee for a set of randomly selected catalyst–substrate combinations showed very good agreement with the reported experimental values. It is important to note that %ee for the test set was unknown to the RF. This situation, in spirit, is comparable to prediction ahead of experimental verification. Along these lines, a key question pertaining to the utility of RF in asymmetric catalysis is worth considering. This is to examine whether a unified RF for all five catalyst families could be developed to predict the %ee for any catalyst–substrate combination.

Generalization to Diverse Catalyst–Substrate Combinations Belonging to Different Catalyst Families
To develop a unified model, we considered catalyst–substrate combinations of different catalyst families toward building an additional set of RF models. The trained RF is then used in the prediction of %ee for an external test set drawn from more than one catalyst family. On the basis of the broad structural similarity (Fig. 2B) between L1 and L2, these families were bundled together first. Interestingly, in quantitative terms, an rmse of 8.5 ± 3.2 (for L1–L2) was noticed between the predicted and experimental %ees, when the test set consists of a randomized mix of catalyst–substrate samples drawn from L1 and L2 families. The results here are more promising compared to our first approach where a different RF was developed for different catalyst families. The vital clue that a RF could be successful in dealing with such diversity in the catalyst–substrate combinations prompted us to probe the potential of our approach further.

Next, a more diverse set of substrate–catalyst combinations consisting of L1, L2, and L5 was considered for developing an additional RF model. The rmse in %ee for external predictions is 6.8 ± 0.8. Another RF model was built, by including examples from the BINAP-O catalyst family (L4), which offers a visible diversity compared to structurally similar L1, L2, and L5. This expanded training set also provided high accuracy in predictions with an rmse of 8.6 ± 2.0. In the final RF model, all 368 reactions were considered. This unified RF comprises 58 catalysts drawn from five different families and 190 diverse collections of alkenes and imines as the substrates. Gratifyingly, predictions were good with an rmse of only 8.4 ± 1.8 (Fig. 2 C and D). In every run, prediction for 73 samples was made, resulting in several thousand predictions over 100 runs (the same samples might have naturally been predicted multiple times). Most of these predictions are in excellent agreement with the experimental %ees, leading to a dominance of the green-colored pixels in Fig. 2C. In the best run (Fig. 2D), only 6 of 73 predictions showed an error in %ee of 15 or higher. Alternatively, an rmse of 8.4 ± 1.8 implies that %ees of an overwhelming majority of 65 samples of 73 in a typical run are within 10 units of the actual %ee. It is important to note that despite small sample size, our models performed very well for individual subsets and their combinations, wherein the sample size varies from as small as 39 to as big as 368.

In any endeavor in asymmetric catalysis, one strives to accomplish superior %ees, resulting in an acute shortage of lower target values, as the latter are considered inferior results by experimental organic chemists. To address such an inevitable class imbalance issue, we have also retrained our ML algorithms by additional inclusion of synthetic data, generated using the synthetic minority oversampling technique (SMOTE). New minority data (corresponding to the output %ee range where many fewer samples were initially present) are created between the existing minority data points (for more details, see SI Appendix, section VI.1). Interestingly, we could obtain excellent test and train rmses while using a pure dataset or with the inclusion of synthetic data with the original data. Nearly overlapping rmses for the test and train sets, for the top four best-performing methods such as the random forest, are evidence for minimal overfitting (SI Appendix, Fig. S4 and Tables S16 and S17). Further, the difference in rmse across different methods while using real and synthetic data together did not show any large variation compared to the use of only real data. This is encouraging in that for the current study, the class imbalance has not led to any notable issue in our predictive capabilities.

Next, a comparison of our RF model with other commonly used ML techniques was carried out to assess their relative performance (SI Appendix, Tables S6 and S17 and Figs. S1–S4). Comparable rmses of 9.2 ± 1.9, 9.6 ± 1.9, and 11.6 ± 2.8 were obtained, respectively, for decision tree (DT), extreme gradient boosting (GB), and convolutional neural networks (CNN) (Fig. 2E). These results suggest that the complex ML algorithms are able to decipher the interdependencies between the chemical descriptors of the diverse catalyst–substrate combinations and the enantioselectivity in asymmetric hydrogenation reactions.

In the developmental phase of asymmetric catalysis, it is desirable to have higher enantioselectivities, which are typically measured in terms of the area under the curve obtained using a suitable chiral column in a high-performance liquid chromatography run. Subsequent and more involved methods (e.g., X-ray crystallography) might be required to assign the absolute configuration of the newly generated stereogenic centers. We examined whether different ML approaches described in the previous sections to predict the extent of enantioselectivity could be used for predicting the absolute sense of stereoinduction as well. Indeed, a good accuracy of 84.29 ± 3.6% with RF was achieved. In the best run, an overwhelming majority of 66 samples of 69 were predicted correctly (SI Appendix, section XVIII).

In an effort to assess the importance of such chemical descriptors in asymmetric hydrogenation, we employed a two-pronged approach. First, we randomly shuffled the chemical descriptors between samples to generate another dataset wherein none of the

true descriptors of a given sample is associated with the sample itself (SI Appendix, section XIII.2). In other words, all of the chemical descriptors of a given sample actually belong to some other randomized sample. In another approach, we generated arbitrary descriptors that follow a normal distribution. None of the numerical values generated through this approach has any chemical significance or resemblance to the actual chemical descriptors (SI Appendix, section XIII). In both these approaches, substantially poorer performance was obtained compared to the originally employed chemically meaningful descriptors, thus serving as evidence of the importance of chemical descriptors that we have employed in this study (for more details, see SI Appendix, Tables S25 and S26). (An interesting example on the importance of chemical descriptors in a random forest study can be found in ref. 50.) We also performed correlation analysis on the feature matrix to identify the potential interdependencies between various features. After removing the correlated features, we trained another RF model with 60 features and the test rmse was found to be only marginally higher compared to the performance obtained using 101 features (for more details, see SI Appendix, Tables S21 and S23).

We believe that a synergistic combination of experiments and ML-based analysis can be exploited toward accelerating the ongoing developments in asymmetric catalysis. It would be interesting to probe whether the trained ML models are capable of predicting for unseen samples drawn from different axially chiral catalysts. To this end, we have chosen two additional catalyst–substrate combinations that resemble the original L2 and L3 series (SI Appendix, Table S28). One of these two groups of catalysts contains a binaphthyl backbone with fused saturated rings (51) and the other is a biphenyl (52) (denoted, respectively, as L2′ and L3′ in Fig. 3A). The RF is now trained using all 368 samples (full dataset used in the previous calculations) and the trained RF model is used for predicting on 43 unseen samples, which form the additional test set (SI Appendix, Tables S29 and S30). As shown in Fig. 3B, very good predictions with an rmse of 8.5 ± 0.0 could be obtained for these sets of catalysts. In addition, we examined our ML model for across-catalyst class applicability by predicting for L4–L5 catalysts (SI Appendix, Tables S30 and S31) by using training on L1–L2–L3 catalysts. Although lower predictive performance is noted in across-catalyst class trials, the origin of inferior predictions could be traced to certain outliers, which in turn are due to class imbalance in the data distribution. Interestingly, outlier removal (as low as 1 of 39 predictions and as high as 13 of 167 predictions) resulted in significantly improved predictions in the across-catalyst class as well. These results further endorse the general applicability of our ML model for axially chiral asymmetric hydrogenation catalysts.

[Figure 3 graphic omitted. Panel A: generalized structures of the test-set catalysts and substrates. Panel B: deviation values for the 43 samples: 3.0, 1.7, 4.6, 6.6, 5.7, 9.2, 9.4, 1.6, 3.4, 7.5, 7.4, 12.4, 7.1, 9.0, 8.6, 1.7, 6.2, 7.9, 9.1, 8.4, 18.0, 0.8, 9.3, 10.0, 20.2, 8.6, 18.0, 6.2, 1.6, 4.9, 8.7, 3.0, 5.6, 2.0, 9.7, 6.1, 8.9, 7.2, 5.1, 15.4, 3.3, 5.4, 8.1.]

Fig. 3. Predictions on 43 out-of-bag samples using the random forest algorithm trained on the full dataset of 368 samples. (A) A generalized representation of catalysts and substrates used in the test set. (B) The difference between the reported experimentally observed and predicted percentage of enantiomeric excess across all test sets. Color shades of green, yellow, and red respectively depict the superior, moderate, and inferior quantitative agreements between the experimentally reported %ee and that predicted by the RF.

[Figure 4 graphic omitted. Decision tree mapping the parameter set to enantioselectivity (%ee). Split nodes and thresholds recoverable from the graphic: VI12-16-S (14.17), NMR11 (156.6628), DA6-5-12-11 (99.855), VI5-12 (34.48), VI5-12 (0.785), dm-S (6.775), BA5-6-34 (118.655), BA12-11-16 (129.275), Volume (389.0). Leaf %ee values: 9.0, 29.5, 42.6, 69.6, 72.0, 91.3, 73.0, 84.3, 92.8, 85.9. Legend: VI = vibrational intensity; BA = bond angle; DA = dihedral angle; dm-S = dipole moment-substrate.]

Fig. 4. Decision tree analysis for 368 reactions. The discriminating attribute at the higher level in a decision tree has a more pronounced impact on the outcome while the lower attributes tend to exert differing influences depending on the preceding set of attributes in that branch. The paths shown in red convey the combination of descriptors for the most promising substrate–catalyst combinations. Refer to Fig. 1A for atom numbering. The parameter values are in their respective standard units.

Identification of Chemically Relevant Patterns Using Decision Tree
In view of the inherent black-box-like aspects in ML, a DT-based analysis (53, 54) was performed on all 368 reactions to examine whether we could derive better chemical insights. The rationale, which deserves special attention, is that enantioselectivity is affected by the interactions between the catalyst and substrate, irrespective of the mechanism of the reaction. The catalyst–substrate interactions are captured primarily through the geometric and electronic descriptors used in this study. Certain combinations of geometric and electronic features of the catalyst and/or substrate might have a higher impact on the %ee. This analysis can also provide logical guidelines on how subtle variations in the molecular features can help fine-tune the %ee. Following the standard procedure, 20% of samples were kept aside as the holdout set in each run wherein a critical hyperparameter such as "max_depth" was varied to identify the best tree (SI Appendix, Table S9). The best DT obtained from 100 independent runs is presented in Fig. 4.

Certain interesting details emerged from the DT analysis. The appearance of vibrational intensity of the substrate (VI12-16-S) as the root node conveys the critical importance of the electronic parameter on the %ee. This parameter can be varied by way of introducing suitable substituents on the alkene/imine moiety. Another aspect of this DT study relates to the path shown using the red line in Fig. 4, which is intended to highlight how %ee of an out-of-bag sample could be predicted by examining its molecular parameters. For instance, if a substrate has its C=C or C=N stretching intensity (VI12-16-S) greater than 14.1 and the 13C chemical shift (NMR11) less than 39.6, it is likely to yield high %ee (>92%). The next important decision node is the biaryl dihedral angle (DA6-5-12-11) followed by (VI5-12); both these features home in on the region of axial chirality of the binaphthyl core. More importantly, dihedral angles such as (DA6-5-12-11)

can be tuned through suitable substituents to modulate the extent of enantioselectivity (48). An interesting bond angle (BA5-6-34) is identified that offers high %ee when its values are greater than 118.6. This is a very important feature considering that the present study encompasses five different catalysts. The last two nodes in the decision tree are features such as the bond angle (BA12-11-16) and volume, both belonging to the catalyst. The emergence of the electronic parameters of the catalyst ((NMR11), (VI5-12), (BA12-11-16) and (volume)) in conjunction with that of the substrate ((VI12-16-S), (dipole-moment S)) can be regarded as an indication of the importance of catalyst–substrate interactions in the stereocontrolling transition states (55). We also performed the DT analysis with 60 features obtained after the correlation analysis. It is interesting to note that all these features appearing as important features in the DT are not correlated to the original 101 features (SI Appendix, Table S24). Other analyses to gauge the relative importance of various molecular parameters by using parameter ranking (SI Appendix, Tables S19 and S20 and Figs. S5–S13) and partial least-squares (PLS) analysis identified almost the same set of parameters as the high-ranked ones that appeared as important nodes in DT (SI Appendix, Fig. S22) (DTs for subsets are provided in SI Appendix, Figs. S14–S21). While we note that the interpretation of important features and their implications in %ee, mentioned above, is broadly useful, the identification of the most appropriate combination of features can become an insurmountable task for a given system. Hence, we acknowledge that the lack of explicit inclusion of a causative feature might even result in judging another feature, which is correlated to the causative feature, as important. In such situations, physical interpretation of those correlated features would be convoluted.

Outlook

We have demonstrated a proof-of-concept application of machine learning in the domain of asymmetric catalysis. Harnessing this high-throughput method, the discovery of asymmetric catalysts could be accelerated by using ML tools like random forest and decision tree. The RF was able to make accurate predictions of enantioselectivities with an rmse in %ee of just 8.4 ± 1.8 compared to the experimental values, indicating its practical utility toward identifying lead candidates for catalytic asymmetric applications. The trained RF has been able to predict correctly a whole range of %ees in complex situations when both substrate and catalyst are from outside the training set, thus indicating a scope for a potential breakthrough in the discovery of asymmetric catalysts as well as in making an informed choice of substrates for a particular catalyst. The RF approach could have far-reaching implications in studying and expanding the asymmetric catalyst and substrate libraries. We believe that the leads emerging through machine learning on what combination of catalysts and substrate(s) has a better propensity to be successful could be coupled with automated experimental protocols. Our approach can also be exploited in a broad range of asymmetric reactions and thus can open up promising avenues toward cost-effective and efficient design of asymmetric catalysts.

Data Availability. All data discussed in this paper are available to readers.

ACKNOWLEDGMENTS. Generous computing time from SpaceTime supercomputing facility at Indian Institute of Technology (IIT) Bombay is acknowledged. M.P. is grateful to Council of Scientific and Industrial Research, New Delhi and A.C. and B.B. acknowledge University Grants Commission, New Delhi for Senior Research Fellowships. We acknowledge Prof. Preethi Jyothi (Department of Computer Science and Engineering, IIT Bombay) and Soumi Tribedi (Department of Chemistry, IIT Bombay) for valuable discussions and automation of parameter extraction, respectively, during the course of this project.
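As an illustration of two ideas used in the discussion, namely pruning mutually correlated descriptors before a decision-tree analysis and scoring predicted %ee by rmse, a minimal sketch is given below. It is not the procedure used in this work: the greedy filter, the 0.9 correlation cutoff, the descriptor names, and the toy values are all assumptions chosen only for illustration.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0  # a constant column carries no correlation information
    return cov / (sx * sy)


def drop_correlated(features: Dict[str, List[float]],
                    threshold: float = 0.9) -> List[str]:
    """Greedily keep a feature only if |r| with every kept feature is below threshold.

    The cutoff of 0.9 is an assumed value, not one taken from the paper.
    """
    kept: List[str] = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


def rmse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root-mean-square error between predicted and observed values (e.g. %ee)."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


# Hypothetical descriptors for five catalyst-substrate pairs (toy numbers).
toy = {
    "bond_angle": [117.0, 118.0, 119.0, 120.0, 121.0],
    # Exactly twice bond_angle, so it is perfectly correlated with it.
    "bond_angle_scaled": [234.0, 236.0, 238.0, 240.0, 242.0],
    "volume": [310.0, 295.0, 330.0, 300.0, 320.0],
}

kept = drop_correlated(toy)
print(kept)
print(rmse([90.0, 80.0], [92.0, 84.0]))
```

In this toy case the scaled bond angle is discarded because it duplicates the information in the first descriptor. This also shows the caveat raised above: had the filter seen the scaled column first, it would have kept that one instead, and a later importance analysis would credit whichever correlated descriptor happened to survive.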

1. P. J. Walsh, M. C. Kozlowski, Fundamentals of Asymmetric Catalysis (University Science Books, 2008).
2. M. S. Taylor, E. N. Jacobsen, Asymmetric catalysis in complex target synthesis. Proc. Natl. Acad. Sci. U.S.A. 101, 5368–5373 (2004).
3. M. S. Sigman, K. C. Harper, E. N. Bess, A. Milo, The development of multidimensional analysis tools for asymmetric catalysis and beyond. Acc. Chem. Res. 49, 1292–1301 (2016).
4. Y. H. Lam, M. N. Grayson, M. C. Holland, A. Simon, K. N. Houk, Theory and modeling of asymmetric catalytic reactions. Acc. Chem. Res. 49, 750–762 (2016).
5. J. P. Reid, M. S. Sigman, Comparing quantitative prediction methods for the discovery of small-molecule chiral catalysts. Nat. Rev. Chem. 2, 290–305 (2018).
6. A. Milo, A. J. Neel, F. D. Toste, M. S. Sigman, Organic chemistry. A data-intensive approach to mechanistic elucidation applied to chiral anion catalysis. Science 347, 737–743 (2015).
7. Z. W. Ulissi et al., Machine-learning methods enable exhaustive searches for active bimetallic facets and reveal active site motifs for CO2 reduction. ACS Catal. 7, 6600–6608 (2017).
8. K. Tran, Z. W. Ulissi, Active learning across intermetallics to guide discovery of electrocatalysts for CO2 reduction and H2 evolution. Nat. Catal. 1, 696–703 (2018).
9. B. Sanchez-Lengeling, A. Aspuru-Guzik, Inverse molecular design using machine learning: Generative models for matter engineering. Science 361, 360–365 (2018).
10. S. M. Moosavi et al., Capturing chemical intuition in synthesis of metal-organic frameworks. Nat. Commun. 10, 539 (2019).
11. P. V. Balachandran, B. Kowalski, A. Sehirlioglu, T. Lookman, Experimental search for high-temperature ferroelectric perovskites guided by two-step machine learning. Nat. Commun. 9, 1668 (2018).
12. D. T. Ahneman, J. G. Estrada, S. Lin, S. D. Dreher, A. G. Doyle, Predicting reaction performance in C-N cross-coupling using machine learning. Science 360, 186–190 (2018).
13. J. M. Granda, L. Donina, V. Dragone, D. L. Long, L. Cronin, Controlling an organic synthesis robot with machine learning to search for new reactivity. Nature 559, 377–381 (2018).
14. G. Skoraczyński et al., Predicting the outcomes of organic reactions via machine learning: Are current descriptors sufficient? Sci. Rep. 7, 3582 (2017).
15. Y. Zhuo, A. Mansouri Tehrani, A. O. Oliynyk, A. C. Duke, J. Brgoch, Identifying an efficient, thermally robust inorganic phosphor host via machine learning. Nat. Commun. 9, 4377 (2018).
16. J. N. Wei, D. Duvenaud, A. Aspuru-Guzik, Neural networks for the prediction of organic chemistry reactions. ACS Cent. Sci. 2, 725–732 (2016).
17. Z. W. Ulissi, A. J. Medford, T. Bligaard, J. K. Nørskov, To address surface reaction network complexity using scaling relations machine learning and DFT calculations. Nat. Commun. 8, 14621 (2017).
18. J. R. Kitchin, Machine learning in catalysis. Nat. Catal. 1, 230–232 (2018).
19. M. I. Jordan, T. M. Mitchell, Machine learning: Trends, perspectives, and prospects. Science 349, 255–260 (2015).
20. K. T. Butler, D. W. Davies, H. Cartwright, O. Isayev, A. Walsh, Machine learning for molecular and materials science. Nature 559, 547–555 (2018).
21. F. Brockherde et al., Bypassing the Kohn-Sham equations with machine learning. Nat. Commun. 8, 872 (2017).
22. Z. Zhou, X. Li, R. N. Zare, Optimizing chemical reactions with deep reinforcement learning. ACS Cent. Sci. 3, 1337–1344 (2017).
23. R. Gómez-Bombarelli et al., Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).
24. S. Szymkuć et al., Computer-assisted synthetic planning: The end of the beginning. Angew. Chem. Int. Ed. Engl. 55, 5904–5937 (2016).
25. B. Liu et al., Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Cent. Sci. 3, 1103–1113 (2017).
26. M. H. S. Segler, M. Preuss, M. P. Waller, Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555, 604–610 (2018).
27. P. S. Gromski, A. B. Henson, J. M. Granda, L. Cronin, How to explore chemical space using algorithms and automation. Nat. Rev. Chem. 3, 119–128 (2019).
28. A. F. Zahrt et al., Prediction of higher-selectivity catalysts by computer-driven workflow and machine learning. Science 363, eaau5631 (2019).
29. A. Tomberg, M. J. Johansson, P. O. Norrby, A predictive tool for electrophilic aromatic substitutions using machine learning. J. Org. Chem. 84, 4695–4703 (2019).
30. J. Aires-de-Sousa, J. Gasteiger, New description of molecular chirality and its application to the prediction of the preferred enantiomer in stereoselective reactions. J. Chem. Inf. Comput. Sci. 41, 369–375 (2001).
31. J. Aires-de-Sousa, J. Gasteiger, Prediction of enantiomeric excess in a combinatorial library of catalytic enantioselective reactions. J. Comb. Chem. 7, 298–301 (2005).
32. J. Chen, W. Jiwu, L. Mingzong, T. You, Calculation on enantiomeric excess of catalytic asymmetric reactions of diethylzinc addition to aldehydes with topological indices and artificial neural network. J. Mol. Catal. A Chem. 258, 191–197 (2006).
33. Q. Y. Zhang, D. D. Zhang, J. Y. Li, H. L. Long, L. Xu, Prediction of enantiomeric excess in a catalytic process: A chemoinformatics approach using chirality codes. MATCH Commun. Math. Comput. Chem. 67, 773–786 (2012).
34. P. J. Donoghue, P. Helquist, P. O. Norrby, O. Wiest, Prediction of enantioselectivity in catalyzed hydrogenations. J. Am. Chem. Soc. 131, 410–411 (2009).
35. W. Beker, E. P. Gajewska, T. Badowski, B. A. Grzybowski, Prediction of major regio-, site-, and diastereoisomers in Diels–Alder reactions by using machine-learning: The importance of physically meaningful descriptors. Angew. Chem. Int. Ed. Engl. 58, 4515–4519 (2019).
36. A. F. Zahrt, S. E. Denmark, Evaluating continuous chirality measure as a 3D descriptor in chemoinformatics applied to asymmetric catalysis. Tetrahedron 75, 1841–1851 (2019).
37. Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521, 436–444 (2015).

38. V. Svetnik et al., Random forest: A classification and regression tool for compound classification and QSAR modeling. J. Chem. Inf. Comput. Sci. 43, 1947–1958 (2003).
39. W. Tang, X. Zhang, New chiral phosphorus ligands for enantioselective hydrogenation. Chem. Rev. 103, 3029–3070 (2003).
40. J. P. Reid, L. Simón, J. M. Goodman, A practical guide for predicting the stereochemistry of bifunctional phosphoric acid catalyzed reactions of imines. Acc. Chem. Res. 49, 1029–1041 (2016).
41. M. T. Reetz, G. Mehler, Highly enantioselective Rh-catalyzed hydrogenation reactions based on chiral monophosphite ligands. Angew. Chem. Int. Ed. Engl. 39, 3889–3890 (2000).
42. A. J. Minnaard, B. L. Feringa, L. Lefort, J. G. de Vries, Asymmetric hydrogenation using monodentate phosphoramidite ligands. Acc. Chem. Res. 40, 1267–1277 (2007).
43. J. F. Teichert, B. L. Feringa, Phosphoramidites: Privileged ligands in asymmetric catalysis. Angew. Chem. Int. Ed. Engl. 49, 2486–2528 (2010).
44. J. A. F. Boogers et al., A mixed-ligand approach enables the asymmetric hydrogenation of an α-isopropylcinnamic acid en route to the renin inhibitor aliskiren. Org. Process Res. Dev. 11, 585–591 (2007).
45. P. Etayo, A. Vidal-Ferran, Rhodium-catalysed asymmetric hydrogenation as a valuable synthetic tool for the preparation of chiral drugs. Chem. Soc. Rev. 42, 728–754 (2013).
46. D. J. Ager, A. H. M. de Vries, J. G. de Vries, Asymmetric homogeneous hydrogenations at scale. Chem. Soc. Rev. 41, 3340–3380 (2012).
47. P. C. J. Kamer, P. W. N. M. van Leeuwen, Phosphorus(III) Catalysts in Homogeneous Catalysis: Design and Synthesis (Wiley-VCH, 2012).
48. S. E. Wheeler, T. J. Seguin, Y. Guan, A. C. Doney, Noncovalent interactions in organocatalysis and the prospect of computational catalyst design. Acc. Chem. Res. 49, 1061–1069 (2016).
49. M. J. Frisch et al., Gaussian 09, D.01 (Gaussian, Wallingford, CT, 2009).
50. J. G. Estrada, D. T. Ahneman, R. P. Sheridan, S. D. Dreher, A. G. Doyle, Response to Comment on “Predicting reaction performance in C–N cross-coupling using machine learning.” Science 360, 186–190 (2018).
51. D. J. Nelson, R. Li, C. Brammer, Using correlations to compare additions to alkenes: Homogeneous hydrogenation by using Wilkinson’s catalyst. J. Org. Chem. 70, 761–767 (2005).
52. T. Morimoto, K. Yoshikawa, M. Murata, N. Yamamoto, K. Achiwa, Preparation of axially chiral biphenyl diphosphine ligands and their application in asymmetric hydrogenation. Chem. Pharm. Bull. (Tokyo) 52, 1445–1450 (2004).
53. P. Raccuglia et al., Machine-learning-assisted materials discovery using failed experiments. Nature 533, 73–76 (2016).
54. L. Breiman, Random forests. Mach. Learn. 45, 5–32 (2001).
55. T. Lu, S. E. Wheeler, Organic chemistry. Harnessing weak interactions for enantioselective catalysis. Science 347, 719–720 (2015).

Singh et al. PNAS | January 21, 2020 | vol. 117 | no. 3 | 1345