Structure-Based Classification and Ontology in Chemistry


Research Collection: Journal Article
Author(s): Hastings, Janna; Magka, Despoina; Batchelor, Colin; Duan, Lian; Stevens, Robert; Ennis, Marcus; Steinbeck, Christoph
Publication Date: 2012-04-05
Permanent Link: https://doi.org/10.3929/ethz-b-000049483
Originally published in: Journal of Cheminformatics 4(1), http://doi.org/10.1186/1758-2946-4-8
Rights / License: Creative Commons Attribution 2.0 Generic

Hastings et al. Journal of Cheminformatics 2012, 4:8
http://www.jcheminf.com/content/4/1/8

RESEARCH ARTICLE (Open Access)

Janna Hastings1,2*, Despoina Magka3, Colin Batchelor4, Lian Duan1,5, Robert Stevens6, Marcus Ennis1 and Christoph Steinbeck1

Abstract

Background: Recent years have seen an explosion in the availability of data in the chemistry domain. With this information explosion, however, retrieving relevant results from the available information, and organising those results, become even harder problems. Computational processing is essential to filter and organise the available resources so as to better facilitate the work of scientists. Ontologies encode expert domain knowledge in a hierarchically organised machine-processable format. One such ontology for the chemical domain is ChEBI. ChEBI provides a classification of chemicals based on their structural features and a role or activity-based classification. An example of a structure-based class is ‘pentacyclic compound’ (compounds containing five-ring structures), while an example of a role-based class is ‘analgesic’, since many different chemicals can act as analgesics without sharing structural features.
Structure-based classification in chemistry exploits elegant regularities and symmetries in the underlying chemical domain. As yet, there has been neither a systematic analysis of the types of structural classification in use in chemistry nor a comparison to the capabilities of available technologies.

Results: We analyze the different categories of structural classes in chemistry, presenting a list of patterns for features found in class definitions. We compare these patterns of class definition to tools which allow for automation of hierarchy construction within cheminformatics and within logic-based ontology technology, going into detail in the latter case with respect to the expressive capabilities of the Web Ontology Language and recent extensions for modelling structured objects. Finally we discuss the relationships and interactions between cheminformatics approaches and logic-based approaches.

Conclusion: Systems that perform intelligent reasoning tasks on chemistry data require a diverse set of underlying computational utilities including algorithmic, statistical and logic-based tools. For the task of automatic structure-based classification of chemical entities, essential to managing the vast swathes of chemical data being brought online, systems which are capable of hybrid reasoning combining several different approaches are crucial. We provide a thorough review of the available tools and methodologies, and identify areas of open research.

* Correspondence: [email protected]
1 Cheminformatics and Metabolism, European Bioinformatics Institute, Hinxton, UK
Full list of author information is available at the end of the article.

© 2012 Hastings et al; licensee Chemistry Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Recent years have seen an explosion in the availability of data throughout the natural sciences. Availability of data facilitates research through complex data-mining and knowledge discovery methods. However, with the information explosion, retrieving relevant information from these data has become much more difficult. Computational processing is essential to filter, retrieve and organise such data. Traditional large-scale data management methods in chemistry include chemical structure-based algorithmic and statistical methods for the construction of hierarchies and similarity landscapes. These techniques are essential not only for human consumption of data in the form of effective browsing and searching but also in scientific methods for interpreting underlying biological mechanisms and detecting bioactivity patterns associated with chemical structure [1].

In biomedicine and the natural sciences more generally, hierarchical organisation and large-scale data management are being facilitated by formal ontologies: machine-understandable encodings of human domain knowledge. Such ontologies are used in several different ways [2-4]. Firstly, they ensure standardisation of terminology and identification across all entities in a domain so that multiple sources of data can be aggregated through comparable reference terms. Secondly, they provide hierarchical organisation so that such aggregation can be performed at different levels for novel data-driven scientific discovery. Thirdly, they facilitate browsing and searching in an easily accessible fashion. They also allow for logic-based intelligent applications that are able to perform complex reasoning tasks such as checking for errors and inconsistencies and deriving logical inferences.

Logic-based knowledge representation (where ontologies serve as knowledge engineering artefacts) can be contrasted with algorithmic ‘knowledge representation’, in which software algorithms procedurally define outputs based on stated inputs, and with statistical ‘knowledge representation’, in which complex statistical models are trained to produce outputs based on a given set of inputs by learning weights for a complex set of internal parameters. An advantage of logic-based knowledge representation is that it allows the knowledge to be explicitly expressed as knowledge, i.e. as statements that are comprehensible, true and self-contained, and available for modification by persons without a computational background such as domain experts; this is in contrast to statistical methods that operate as black boxes and to procedural methods that require a programmer in order to manipulate or extend them.

Bio-ontologies have enjoyed increasing success in addressing the large-scale data integration requirement emerging from the recent increase in data volume [4]. One example of such a successful bio-ontology is the Gene Ontology (GO) [5], which is used inter alia to unify annotations between disparate biological databases and for the statistical analysis of large-scale genetic data to identify genes that are significantly enriched for specific functions. For the domain of biologically interesting chemistry, the Chemical Entities of Biological Interest ontology (ChEBI) [6] provides a classification of chemical entities such as atoms, molecules and ions. ChEBI organises chemical entities according to shared struc- […]

[…] chemistry underlying biological ontologies [11]; semantic similarity [12]; and metabolome prediction [13].

With the large-scale availability of chemical data through projects such as PubChem [14], making sense of the data and mapping between different internal and external collections has become one of the most pressing challenges facing chemical integration into modern biomedical science. Such mappings are facilitated by the spiderweb of annotations and cross-references attached to each entity in a chemical ontology such as ChEBI: the mappings to other chemical identifiers (such as InChI, PubChem, KEGG, DrugBank, Chembl, Reaxys and, where publicly available, CAS), and the annotations that use the ontology identifiers to identify chemical entities in biological databases such as pathway databases, protein interaction databases, systems biology modeling databases, biochemical reaction databases and many more. The availability of such a growing dictionary of cross-references in the public domain that operates at a broader level than only that of fully-specified chemical structures (as InChI does) allows mapping to be extended to classes of chemical entities that may behave similarly and therefore be described in one reference in a reaction database, for example.

Similarly to GO, ChEBI is manually maintained by a team of expert curators. Historically, bio-ontologies such as GO and ChEBI have been developed as Directed Acyclic Graphs (DAGs), a deliberately simplified ontology format which allowed domain experts (non-logicians) to directly participate in ontology engineering at a time when tools that supported more sophisticated semantics were rather difficult for non-technical persons to use. However, with the increasing availability of supporting tools and widespread adoption, there is a growing trend of evolution of bio-ontologies towards the greater expressive power provided by the Web Ontology Language (OWL) [15] and its extensions, which provides a sophisticated suite of logic-based constructs to support eloquent knowledge representation and automated reasoning in real-world domains [16]. ChEBI is an ideal ontology to take advantage of increasing formalisation, […]
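
To make the DAG format and the idea of aggregating data at different levels of a hierarchy more concrete, the following is a minimal Python sketch. It is not taken from the article and does not use ChEBI data or any ChEBI API: the is_a edges, the annotation counts, and the function names `ancestors` and `roll_up` are all illustrative assumptions.

```python
from collections import defaultdict

# Toy is_a edges (child -> set of parents). Illustrative assumption only;
# this hierarchy is NOT copied from ChEBI.
IS_A = {
    "morphine": {"pentacyclic compound", "alkaloid"},
    "codeine": {"pentacyclic compound", "alkaloid"},
    "pentacyclic compound": {"polycyclic compound"},
    "polycyclic compound": {"molecular entity"},
    "alkaloid": {"molecular entity"},
    "molecular entity": set(),
}


def ancestors(term, is_a=IS_A):
    """Return every class reachable from `term` via is_a edges (term excluded)."""
    seen = set()
    stack = list(is_a.get(term, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, ()))
    return seen


def roll_up(counts, is_a=IS_A):
    """Add each term's annotation count to the term itself and to all its ancestors."""
    totals = defaultdict(int)
    for term, n in counts.items():
        # The set union means a class reached by several paths is counted once.
        for cls in {term} | ancestors(term, is_a):
            totals[cls] += n
    return dict(totals)


# Hypothetical annotation counts per fully specified entity.
annotation_counts = {"morphine": 3, "codeine": 2}
totals = roll_up(annotation_counts)
```

Because a DAG allows a class to have several parents, 'molecular entity' is reachable from 'morphine' along two paths; collecting ancestors in a set keeps such a class from being counted twice when annotations are rolled up.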