Structure-Based Classification and Ontology in Chemistry
Journal Article (ETH Zurich Research Collection)
Author(s): Hastings, Janna; Magka, Despoina; Batchelor, Colin; Duan, Lian; Stevens, Robert; Ennis, Marcus; Steinbeck, Christoph
Publication Date: 2012-04-05
Permanent Link: https://doi.org/10.3929/ethz-b-000049483
Originally published in: Journal of Cheminformatics 4(1), http://doi.org/10.1186/1758-2946-4-8
Rights / License: Creative Commons Attribution 2.0 Generic

Hastings et al. Journal of Cheminformatics 2012, 4:8
http://www.jcheminf.com/content/4/1/8

RESEARCH ARTICLE, Open Access

Janna Hastings1,2*, Despoina Magka3, Colin Batchelor4, Lian Duan1,5, Robert Stevens6, Marcus Ennis1 and Christoph Steinbeck1

Abstract

Background: Recent years have seen an explosion in the availability of data in the chemistry domain. With this information explosion, however, retrieving relevant results from the available information, and organising those results, become even harder problems. Computational processing is essential to filter and organise the available resources so as to better facilitate the work of scientists. Ontologies encode expert domain knowledge in a hierarchically organised machine-processable format. One such ontology for the chemical domain is ChEBI. ChEBI provides a classification of chemicals based on their structural features and a role or activity-based classification. An example of a structure-based class is ‘pentacyclic compound’ (compounds containing five-ring structures), while an example of a role-based class is ‘analgesic’, since many different chemicals can act as analgesics without sharing structural features. Structure-based classification in chemistry exploits elegant regularities and symmetries in the underlying chemical domain.
As yet, there has been neither a systematic analysis of the types of structural classification in use in chemistry nor a comparison to the capabilities of available technologies.

Results: We analyze the different categories of structural classes in chemistry, presenting a list of patterns for features found in class definitions. We compare these patterns of class definition to tools which allow for automation of hierarchy construction within cheminformatics and within logic-based ontology technology, going into detail in the latter case with respect to the expressive capabilities of the Web Ontology Language and recent extensions for modelling structured objects. Finally we discuss the relationships and interactions between cheminformatics approaches and logic-based approaches.

Conclusion: Systems that perform intelligent reasoning tasks on chemistry data require a diverse set of underlying computational utilities including algorithmic, statistical and logic-based tools. For the task of automatic structure-based classification of chemical entities, essential to managing the vast swathes of chemical data being brought online, systems which are capable of hybrid reasoning combining several different approaches are crucial. We provide a thorough review of the available tools and methodologies, and identify areas of open research.

* Correspondence: [email protected]
1Cheminformatics and Metabolism, European Bioinformatics Institute, Hinxton, UK
Full list of author information is available at the end of the article

© 2012 Hastings et al; licensee Chemistry Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Recent years have seen an explosion in the availability of data throughout the natural sciences. Availability of data facilitates research through complex data-mining and knowledge discovery methods. However, with the information explosion, retrieving relevant information from these data has become much more difficult. Computational processing is essential to filter, retrieve and organise such data. Traditional large-scale data management methods in chemistry include chemical structure-based algorithmic and statistical methods for the construction of hierarchies and similarity landscapes. These techniques are essential not only for human consumption of data in the form of effective browsing and searching but also in scientific methods for interpreting underlying biological mechanisms and detecting bioactivity patterns associated with chemical structure [1].

In biomedicine and the natural sciences more generally, hierarchical organisation and large-scale data management are being facilitated by formal ontologies: machine-understandable encodings of human domain knowledge. Such ontologies are used in several different ways [2-4]. Firstly, they ensure standardisation of terminology and identification across all entities in a domain so that multiple sources of data can be aggregated through comparable reference terms. Secondly, they provide hierarchical organisation so that such aggregation can be performed at different levels for novel data-driven scientific discovery. Thirdly, they facilitate browsing and searching in an easily accessible fashion. They also allow for logic-based intelligent applications that are able to perform complex reasoning tasks such as checking for errors and inconsistencies and deriving logical inferences. Logic-based knowledge representation (where ontologies serve as knowledge engineering artefacts) can be contrasted with algorithmic ‘knowledge representation’, in which software algorithms procedurally define outputs based on stated inputs, and with statistical ‘knowledge representation’, in which complex statistical models are trained to produce outputs based on a given set of inputs by learning weights for a complex set of internal parameters. An advantage of logic-based knowledge representation is that it allows the knowledge to be explicitly expressed as knowledge, i.e. as statements that are comprehensible, true and self-contained, and available for modification by persons without a computational background such as domain experts; this is in contrast to statistical methods that operate as black boxes and to procedural methods that require a programmer in order to manipulate or extend them.

Bio-ontologies have enjoyed increasing success in addressing the large-scale data integration requirement emerging from the recent increase in data volume [4]. One example of such a successful bio-ontology is the Gene Ontology (GO) [5], which is used inter alia to unify annotations between disparate biological databases and for the statistical analysis of large-scale genetic data to identify genes that are significantly enriched for specific functions. For the domain of biologically interesting chemistry, the Chemical Entities of Biological Interest ontology (ChEBI) [6] provides a classification of chemical entities such as atoms, molecules and ions. ChEBI organises chemical entities according to shared struc-

chemistry underlying biological ontologies [11]; semantic similarity [12]; and metabolome prediction [13].

With the large-scale availability of chemical data through projects such as PubChem [14], making sense of the data and mapping between different internal and external collections has become one of the most pressing challenges facing chemical integration into modern biomedical science. Such mappings are facilitated by the spiderweb of annotations and cross-references attached to each entity in a chemical ontology such as ChEBI: the mappings to other chemical identifiers (such as InChI, PubChem, KEGG, DrugBank, ChEMBL, Reaxys and, where publicly available, CAS), and the annotations that use the ontology identifiers to identify chemical entities in biological databases such as pathway databases, protein interaction databases, systems biology modeling databases, biochemical reaction databases and many more. The availability of such a growing dictionary of cross-references in the public domain that operates at a broader level than only that of fully-specified chemical structures (as InChI does) allows mapping to be extended to classes of chemical entities that may behave similarly and therefore be described in one reference in a reaction database, for example.

Similarly to GO, ChEBI is manually maintained by a team of expert curators. Historically, bio-ontologies such as GO and ChEBI have been developed as Directed Acyclic Graphs (DAGs), a deliberately simplified ontology format which allowed domain experts (non-logicians) to directly participate in ontology engineering at a time when tools that supported more sophisticated semantics were rather difficult for non-technical persons to use. However, with the increasing availability of supporting tools and widespread adoption, there is a growing trend of evolution of bio-ontologies towards the greater expressive power provided by the Web Ontology Language (OWL) [15] and its extensions, which provides a sophisticated suite of logic-based constructs to support eloquent knowledge representation and automated reasoning in real-world domains [16]. ChEBI is an ideal ontology to take advantage of increasing formalisation,