
Hastings et al. J Cheminform (2021) 13:23
https://doi.org/10.1186/s13321-021-00500-8
Journal of Cheminformatics

RESEARCH ARTICLE Open Access

Learning chemistry: exploring the suitability of machine learning for the task of structure-based chemical ontology classification

Janna Hastings*, Martin Glauer, Adel Memariani, Fabian Neuhaus and Till Mossakowski

Abstract

Chemical data is increasingly openly available in databases such as PubChem, which contains approximately 110 million compound entries as of February 2021. With the availability of data at such scale, the burden has shifted to organisation, analysis and interpretation. Chemical ontologies provide structured classifications of chemical entities that can be used for navigation and filtering of the large chemical space. ChEBI is a prominent example of a chemical ontology, widely used in life science contexts. However, ChEBI is manually maintained and as such cannot easily scale to the full scope of public chemical data. There is a need for tools that are able to automatically classify chemical data into chemical ontologies, which can be framed as a hierarchical multi-class classification problem. In this paper we evaluate machine learning approaches for this task, comparing different learning frameworks including logistic regression, decision trees and long short-term memory artificial neural networks, and different encoding approaches for the chemical structures, including cheminformatics fingerprints and character-based encoding from chemical line notation representations. We find that classical learning approaches such as logistic regression perform well with sets of relatively specific, disjoint chemical classes, while the neural network is able to handle larger sets of overlapping classes but needs more examples per class to learn from, and is not able to make a class prediction for every molecule.
Future work will explore hybrid and ensemble approaches, as well as alternative network architectures including neuro-symbolic approaches.

Keywords: Chemical ontology, Automated classification, Machine learning, LSTM

*Correspondence: [email protected]
Department of Computer Science, Otto-von-Guericke University of Magdeburg, Magdeburg, Germany

© The Author(s) 2021. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Introduction

In the last decades, significant progress has been made within the life sciences in bringing chemical data into the public domain in open databases such as PubChem [1]. These resources are massive in scale: as of February 2021, PubChem contains approximately 110 million structurally distinct entries. This presents both opportunities and challenges; the annotation, interpretation and organisation of such huge datasets at scale becomes ever more important. Classification into meaningful groupings or classes enables effective downstream filtering, selection, analysis and interpretation [2]. Chemical ontologies provide structured classifications of chemical entities into hierarchically arranged and clearly defined chemical classes. To address the challenge of scale, it would be beneficial if structurally described molecular entities could be automatically and efficiently classified into chemical ontologies [2–4].

Machine learning has a long history of applications in computational chemistry. For example, it is used for the prediction of various chemical and biological properties from chemical structures (e.g. [5–7]). Classical multivariate statistics and machine learning approaches include logistic regression, support vector machines, decision trees and Bayesian networks. For these classical approaches, relevant features need to be specified in advance. With recent advances in algorithms, data availability and computational processing capability, multi-layer artificial neural networks, which are able to learn features directly from raw data, have begun to be used in chemistry applications [8–10].

For the purpose of machine learning, the problem of automated classification of a structurally defined molecular entity into a chemical ontology can be transformed into a multi-class prediction problem: given the molecular structure (and associated features) corresponding to a molecular entity, a model can be trained that aims to automatically predict the class or classes into which that molecular entity should be classified. The desiderata for an ontology class prediction for a molecular entity are that it should be (a) correct, i.e. it should be a class to which the molecular entity does belong; and (b) as specific as possible, i.e. there should ideally be no sub-classes of the predicted class in the ontology to which the molecule also belongs.

In this contribution, we evaluate several machine learning approaches for their applicability to the problem of classifying novel molecular entities into the ChEBI chemical ontology [11] based on their chemical structures. This is to our knowledge the first systematic and broad evaluation of machine learning for the problem of structure-based chemical ontology classification as applied to an existing ontology of the scope of ChEBI. There are challenges with the transformation of ChEBI into a form that can be used for this task, which we discuss below. We evaluate both classical machine learning approaches, which learn to predict a single “best match” class for an input molecule, and artificial neural networks, which learn to predict a likelihood of class membership for every class that the network knows about, given an input molecule. We use input encodings based on chemical fingerprints for the classical classifiers, and on the SMILES character-based line notation [12] for the artificial neural networks. The overall objective of this work is to assess how suitable machine learning is for the task of automatically predicting chemical ontology classes for novel chemical structures. We also explore whether there are performance differences between different families of machine learning approaches for this problem, and if so, whether these differences are uniform or interact with different branches of the [...] by a section describing our methods. Thereafter, we present and discuss our results.

Background

Chemical ontologies

Chemical ontologies provide a standardised and shared classification of chemical entities into chemical classes. One prominent example of a chemical ontology is ChEBI [11, 13], a publicly available and manually annotated ontology, containing approximately 58,700 fully annotated entities, and over 100,000 preliminary (partially annotated) entities, as of the last release (February 2021). This includes both molecules and classes of molecules. ChEBI offers separate ontology hierarchies for the classification of molecular entities based on features of their associated chemical structures (atoms, bonds and overall configuration) and based on their functions or how they are used. For the purpose of this paper we only use the structure-based branch of the ontology.

ChEBI has been widely adopted throughout the life sciences, and can be considered the “gold standard” chemical ontology in the public domain. It has been applied for multiple purposes, including in support of the bioinformatics and systems biology of metabolism [14], biological data interpretation [15, 16], natural language processing [17], and as a chemistry component for the semantic web (e.g. [18, 19]). However, ChEBI is manually maintained, which creates an obvious bottleneck that hinders the utility of ChEBI and its chemical classification. With growth primarily limited by the manual annotation process, ChEBI is not able to scale to encompass the full scope of the publicly available chemical datasets such as are included in PubChem. Moreover, ChEBI cannot address use cases that arise in the context of novel molecular discovery, e.g. in the pharmaceutical domain where ontologies are used in the management of integrated private and public large-scale datasets as input to early drug discovery pipelines [20] for which it is important that part of the data be kept private. Moreover, it hinders applications in the context of investigations into large-scale molecular systems such as whole-genome metabolism, for which it is important that the knowledge base be as complete as possible [21].

Automated structure-based classification in chemical ontologies

Chemical ontologies are typically implemented using logic-based semantic formalisms, including the W3C standard Web Ontology Language (OWL) [22]. Based on
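
The two desiderata stated in the Introduction, that a predicted class should be (a) correct and (b) as specific as possible, can be made concrete with a small sketch. The hierarchy below is a hypothetical toy example chosen for illustration and is not taken from ChEBI; the function names are likewise assumptions and do not come from the paper.

```python
# Toy is-a hierarchy: class -> set of direct superclasses.
# Hypothetical illustration only; these entries are NOT ChEBI data.
PARENTS = {
    "alpha-amino acid": {"amino acid"},
    "amino acid": {"carboxylic acid"},
    "carboxylic acid": {"organic acid"},
    "organic acid": {"organic molecular entity"},
    "organic molecular entity": set(),
}


def ancestors(cls, parents=PARENTS):
    """Return all transitive superclasses of cls in the toy hierarchy."""
    seen = set()
    stack = list(parents.get(cls, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def is_correct(predicted, true_classes):
    """Desideratum (a): the molecule really belongs to the predicted class."""
    return predicted in true_classes


def is_most_specific(predicted, true_classes, parents=PARENTS):
    """Desideratum (b): no other class of the molecule is a subclass of predicted."""
    return not any(
        predicted in ancestors(other, parents)
        for other in true_classes
        if other != predicted
    )


# All classes a hypothetical molecule belongs to (a leaf plus its superclasses).
true_classes = {"alpha-amino acid"} | ancestors("alpha-amino acid")
```

With this toy data, predicting "amino acid" satisfies (a) but not (b), because "alpha-amino acid" is a more specific class to which the molecule also belongs; predicting "alpha-amino acid" satisfies both.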