Automated Analysis and Validation of Open Chemical Data Nicholas Elliot

Automated Analysis and Validation of Open Chemical Data

Nicholas Elliot Day
Emmanuel College

A dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy

Unilever Centre for Molecular Science Informatics
Department of Chemistry
Lensfield Road, Cambridge, CB2 1EW, United Kingdom

November 20, 2008

Disclaimer

This thesis is the result of my own work and includes nothing which is the outcome of work done in collaboration, except where specifically indicated. This thesis does not exceed the specified word limit (60000) as defined by the Chemistry Degree Committee. This thesis has been typeset in 12pt font using LaTeX2ε according to the specifications defined by the Board of Graduate Studies and the Chemistry Degree Committee.

Abstract

Methods to automatically extract Open Data from the chemical literature, validate it, and use it to validate theory are examined.

Chemical identifiers which assist the automatic location of chemical structures using commercial Web search engines are investigated. The IUPAC International Chemical Identifier (InChI) gives almost 100% recall and precision, though it is shown to be too long for present search engines. A combination of InChI and InChIKey, a shorter, fixed-length hash of the InChI string, is concluded to be the best current method of identifying structures.

The proportion of published, Open Crystallographic Information Files (CIFs) that are valid with respect to the specification is shown to be improving, and is around 99% in 2007. The error rate in the conversion of valid CIFs to Chemical Markup Language (CML) is less than 0.2%. The machine generation of connection tables from CIFs requires many heuristics, and in some cases it is impossible to deduce the exact connection table.

CrystalEye, a fully-automated system for the reformulation of the fragmented crystallographic Web into a structured XML-based repository, is described. Published, Open CIFs can be located and aggregated programmatically with almost 100% recall. It is shown that, by converting CIF data to CML, software can be created to use the latest Web standards and technologies to enhance the ability of Web users to browse, find, keep updated, download and reuse the latest published crystallography.

A workflow for the high-throughput calculation of solid-state geometry using a semi-empirical method is described. A wide range of organic and inorganic systems provided by CrystalEye is used to test both the data and the method. Several errors in the method are discovered, many of which can be attributed to the parameterization process.

An Open NMR experiment to perform high-throughput prediction of 13C chemical shifts using a GIAO protocol is described. The data and analysis were provided on publicly-available webpages to enable crowdsourcing, which assisted in discovering an error rate of 6.1% in the starting data. The protocol was refined during the work and shown to have an average unsigned error of 2.24 ppm for 13C nuclei of small, rigid molecules, comparable to the errors observed elsewhere for general structures using HOSE and Neural Network methods.

Acknowledgments

I would like to thank Dr Peter Murray-Rust for his advice and guidance throughout this work. Dr Joe Townsend, Jim Downing, Dr Yong Zhang, Dr Andrew Walkingshaw, Dr Nico Adams and Dr Simon Tyrrell are thanked for useful discussions and advice throughout the course of my PhD. I would also like to thank the UCC computing staff, particularly Dr Charlotte Bolton, without whom this work would not have been possible. The EPSRC is thanked for funding.

Cheers also to Dan, Andy, Rob and Jon for providing copious amounts of tea and distracting 'activities'. As always, thanks most to Mum, Dad, Vick and Anna.
Contents

Disclaimer
Abstract
Acknowledgements
Table of contents
List of tables
List of figures
Glossary

1 Introduction
  1.1 Chemical identification
  1.2 Open Data
  1.3 Data and Metadata
  1.4 Machine-understandable chemical data
    1.4.1 The Semantic Web
  1.5 Open Source software
  1.6 e-Science
  1.7 Aims

2 Locating chemical data on the Web
  2.1 Introduction
  2.2 Web search-engines
    2.2.1 Sitemaps
  2.3 Representing chemical entities as strings
  2.4 The IUPAC International Chemical Identifier
    2.4.1 InChI creation and structure
    2.4.2 Searching for InChIs on the Web
    2.4.3 The Google-InChI Web Service
    2.4.4 Staurosporine - an InChI case study
  2.5 Conclusions

3 Creating and deriving data in CML from CIF
  3.1 Introduction
  3.2 Why convert CIF to CML?
  3.3 The Crystallographic Information Framework
    3.3.1 STAR file concepts and syntax
    3.3.2 CIF syntax
    3.3.3 CIF dictionaries
    3.3.4 checkCIF
  3.4 CIFXML: converting CIF to XML
    3.4.1 Conformance to CIF
    3.4.2 CIFXML-J functionality
    3.4.3 Representation of CIFs in XML
    3.4.4 CIFXML-J architecture
    3.4.5 Using CIFXML-J
    3.4.6 CIF to CIFXML conversion statistics
  3.5 CIFConverter: converting CIF to CML
    3.5.1 Handling multiple datablocks
    3.5.2 Building the CML
    3.5.3 Adding symmetry elements
    3.5.4 Dictionary validation
    3.5.5 Using CIFConverter
    3.5.6 CIFXML to CML conversion statistics
  3.6 Enhancing the CML
    3.6.1 Creating the connection table
    3.6.2 Adding canonical identifiers
  3.7 Current usage
  3.8 Conclusions

4 Automating the aggregation, creation and dissemination of semantic crystallography
  4.1 Introduction
  4.2 Implementing a workflow system
    4.2.1 The Taverna Workbench
    4.2.2 Component development and workflows
    4.2.3 Developing under Taverna
  4.3 CrystalEye
    4.3.1 Implementation
    4.3.2 Aggregation
    4.3.3 Processing the data
    4.3.4 The website
    4.3.5 Services
    4.3.6 Database distribution
    4.3.7 CrystalEye data in RDF
    4.3.8 Further work
  4.4 Conclusions

5 High-throughput prediction of solid-state geometry using semi-empirical methods
  5.1 Introduction
  5.2 MOPAC-CML
    5.2.1 FoX
    5.2.2 MOPAC and FoX
    5.2.3 MOPAC-CML and Jmol
  5.3 High-throughput computing
    5.3.1 Condor
    5.3.2 CamGrid
  5.4 The calculation workflow
    5.4.1 Structure selection
    5.4.2 MOPAC input
    5.4.3 Condor input
    5.4.4 Job submission and retrieval
    5.4.5 The overall workflow
  5.5 The first protocol
    5.5.1 Organic structures
    5.5.2 Inorganic structures
  5.6 The second protocol
    5.6.1 The data
    5.6.2 Inorganic structures
    5.6.3 Organic structures
  5.7 Conclusions

6 High-throughput prediction of 13C NMR chemical shifts by quantum-mechanical GIAO calculations
  6.1 Introduction
  6.2 Nuclear Magnetic Resonance
  6.3 Computational NMR
    6.3.1 The Rzepa Protocol
    6.3.2 NMRShiftDB
  6.4 Calculations
    6.4.1 Structure selection
    6.4.2 Gaussian03 input
    6.4.3 Condor input
    6.4.4 TMS
  6.5 Open Computational NMR
  6.6 Analysis
    6.6.1 Preparing the output
    6.6.2 Initial results
    6.6.3 Sources of error
    6.6.4 Cleaning the dataset
    6.6.5 Conformational issues
    6.6.6 HSR1
    6.6.7 Conclusions

7 Conclusions

A Analysis of the efficacy of Web search engines for chemical search
  A.1 Search-engines and strategy
  A.2 Search terms and metrics
  A.3 Searching for CAS numbers
  A.4 Searching for InChIs
    A.4.1 The InChI architecture and implications
    A.4.2 Results
  A.5 Searching for SMILES
  A.6 Searching for InChI strings from the KEGG collection using Google

B MOPAC calculation references
  B.1 Second-protocol inorganic calculations
    B.1.1 Short atom-atom distances
    B.1.2 Silicas
    B.1.3 Errors in the data
    B.1.4 Errors in modeling
  B.2 Organic calculations
    B.2.1 Calculations that terminated with controlled errors
    B.2.2 Calculations containing radicals that converged successful
    B.2.3 Changes in connection table
    B.2.4 Density change outliers

C Published Work

Bibliography

List of Tables

3.1 Example of applying the canonicalization algorithm
3.2 Proportion of error occurrences in CIF to CML conversion
5.1 Table showing the density change mean and standard deviation for structures with various minimum translation vectors
6.1 Comparison of the linear fitting statistics
