Regular Expressions

Total Page:16

File Type:pdf, Size:1020Kb

Regular Expressions Biology Chemistry Informatics Welcome! Mass Spectrometry meets ChemInformatics WCMC Metabolomics Course 2013 Tobias Kind Course 1: General Introduction http://fiehnlab.ucdavis.edu/staff/kind 1 What is ChemInformatics? Chemistry Mathematics Statistics Informatics Chemometrics est. 1975 Cheminformatics est. 1998 2 Who uses Cheminformatics? All parts of chemistry heavily depend on cheminformatics. Life sciences, biochemistry, drug industries use cheminformatics. 20 years ago: 80% in lab – 20% in front of computer Now: 20% in lab - 70% in front of computer (*) Examples: • Organic chemistry – automated reaction planning, Beilstein search • Physical chemistry – modeling of structure properties (boiling points) • Inorganic chemistry – ligand bond interactions • Analytical chemistry – structure elucidation of small compounds • Biochemistry – protein/small molecule interaction networks (*) 10% fixing and installing new programs 3 PhD Motivation for Mass Spectrometry meets ChemInformatics To be a master of spectra you need to be a master of structures in the first place. 765 100 OH N NH O O 50 N 807 747 O 705 O N O HO O O O 676 723 604 265 353 395 455 513 538 636 0 260 310 360 410 460 510 560 610 660 710 760 810 (nist_msms) Vincristine Complex MS data interpretations only possible with software MS data obtained by hyphenated techniques (GC-MS, LC-MS) Mass spectral database search and structure search routinely are used Mass spectrometers deliver multidimensional data 4 Computer Illiteracy – a threat to your research Your computer is your friend You don’t have a computer? You don’t have a friend (just kidding) PDP-11 www.bell-labs.com • Assume you have a computer: Please step forward name: CPU, speed, memory, hard disk, OS • You are a chemist, biochemist, biologist: Please step forward name: Computer language or DB you know 5 OS = operating system; DB = database, CPU = central processing unit Fighting Computer Illiteracy - name your PC CPU INTEL,AMD,IBM,HP Pentium, Opteron, Xeon 12-20 Core Memory DDR, DDR2 GEIL, KINGSTON 16-128 GByte Hard disk SEAGATE, WD Raptor, Barracuda, Cheetah 100-1000 GByte OS MICROSOFT, LINUX Windows, Linux, OSX, Virtual OS Language C, Basic, Perl, JAVA Bit < Byte < kByte < MByte < GByte < TByte Single Core < Dual Core < QuadCore < MultiCore MFLOP/s < GFLOP/s < TFLOP/s < PFLOP/s Cray 2 in rot, Nixdorfmuseum, 2004, 1 Thread < Dual Thread < MultiThreaded 6 The free lunch is over – multithreading needed Can your metabolomics software use multiple CPUs? NO YES 7 Herb Sutter (MS): http://www.gotw.ca/publications/concurrency-ddj.htm The free lunch is over – multithreading needed Course example MZMINE alignment (7 files -18 min LC-MS) Single core vs. multi-core 50 seconds Mors certa, hora incerta! 3:29 minutes 8 Herb Sutter (MS): http://www.gotw.ca/publications/concurrency-ddj.htm Best recommendation ever for slow computers Install an SSD! Single hard disk SSD RAID10 RAMDISK Seagate 750 GB Samsung 830 (2 TB) OSFMount SSDs and Ramdisks have 200 to 1000-fold(!) 4k speed. 4k speed matters. 9 Computer Illiteracy – learn a programming language Why should you? 20% lab time – 80% computer time Mass spectrometers deliver data – not results Why shouldn't you? (fake reasons) You are too old to learn… You are not good with computers… Your have more important research to do… You are so rich you have programmers who work for you… 10 Picture Source: WIKI James Manners from Genova, Italia Computer Illiteracy – learn a programming language • Learn any language which has a large code and user base (JAVA, Perl, Visual Basic) • Use IDEs with automatic code completion like MS Visual Express or Eclipse • Don’t re-invent code - use (and document) code search engines like http://code.ohloh.net/ (formerly koders now ohloh); google.com/codesearch http://krugle.org moOMoOMoOMoOMoOmoOMoOMoOMoOMoOMoOMoOMoOMoOMoOMo >>++++++++[<++++>-] OMMMmoOMMMMoOMoOMoOMoOMoOMoO >++++++++++++++[<+++++++>-] MoOMoOMoOMoOMoOMoOMoOMoOMoOMoOMoOMoOMoOMoOMoOMo OMMMmoOMMMommMoOMoOMoOMoOMoO +>+++++++++++[<++++++++++>-] MoOMoOMoOMoOMoOMoOMoOMMMmoOMMMMoOMoOMMMmoOMMM ++>+++++++++++++++++++[<++++++>-] MoOMoOMoOMoOMoOMoOMoOMoOMoOMoO ++>+++++++++++++++++++[<++++++>-] MoOMoOMoOMoOMoOMoOMoOMoOMoOMoOMoOMoOMoOMoOMoOMo >++++++++++++[<+++++++++>-] OMoOMoOMoOMoOMoOMoOMoOMoOMoO Language “cow” Language “brainfuck” Do *not* learn these working but esoteric languages 11 There are 1123 programming languages http://99-bottles-of-beer.net/ Program development – Eclipse for JAVA example JAVA or C code Projects Text output 12 Your computer Illiteracy – your emergency helpers Regular expressions; SQL database requests; EXCEL VBA scripts or Perl scripts are special tools for data handling (Swiss army knifes) Regular expressions (RegEx) are used for finding and replacing text [0-9] – represents all numbers Examples: \n\n – find double empty lines [a-z] – represents all small letters find \t replace with spaces “ “ \n – represents new line (CR/LF) find two numbers in brackets ([0-9][0-9]) \t – represents TAB Learn about RegEx SQL is used for programming databases Large Database Table SQL query Result yr subject winner yr subject winner 1901 Chemistry Jacobus H. van 't Hoff SELECT yr, subject, winner 1909 Chemistry Wilhelm Ostwald 1902 Chemistry Emil Fischer 1903 Chemistry Svante Arrhenius FROM nobel 1904 Chemistry Sir William Ramsay WHERE yr = 1909 and 1905 Chemistry Adolf von Baeyer subject = 'chemistry' 1906 Chemistry Henri Moissan 1907 Chemistry Eduard Buchner 1908 Chemistry Ernest Rutherford 1909 Chemistry Wilhelm Ostwald 1910 Chemistry Otto Wallach Visit the SQL Zoo 1913 … 13 Regular Expressions – example MS data Task: create a list of 4 columns with names, formulas, CAS numbers and peaks Problem: 24,000 lines of mass spectral data (*.msp) Program: Textpad (WIN), Smultron (Mac) 104 100 O N O 50 189 78 51 14 28 39 63 89 117 131 160 0 10 30 50 70 90 110 130 150 170 190 (mainlib) 2,5-Pyrrolidinedione, 1-methyl-3-phenyl- (m/z - intensity pair) Enter (CR/LF) in gray Number of lines in text 14 Regular Expressions – example MS data Solution: replace Enter (\n) with TAB (\t) and use Replace ALL 1 2 Result: Metadata in one line 3 15 Regular Expressions – example MS data Solution: copy only lines of interest (Mark ALL – Copy Bookmarked Lines) 16 Regular Expressions – Result for MS data Solution: Replace redundant code with nothing, copy tab separated file to EXCEL Result: 1:30 min for RegEx job (1 hour manually?) Average spectrum size: 70 peaks Minimum size: 5 peaks Maximum size: 439 peaks Most spectra have 35 and 45 peaks 17 Be prepared – visualize your structures Try Marvin Space via Webstart 18 Calculation of tetrahedral and double bond stereoisomers How many stereoisomers can you expect from glucose (KEGG)? Example: separation of species with ion mobility MS (FAIMS) O OH HO HO OH OH Glucose Example calculated with MarvinView (via JAVA Webstart) 19 Computation of resonance forms (electron shifts) What are possible resonant structures? Important for mass spectral interpretation (electron impact, electrospray) OH Phenol Example calculated with MarvinView Start via WebStart 20 Generation of tautomers using MSketch How many tautomers can you expect? Important for mass spectral interpretations and LC-MS. O CH3 H3C O Methyl acetate Example calculated with MarvinView Start via WebStart Derivatization in GC-MS and LC-MS solves the tautomer problem Common tautomerisms: Enol/Keto, Lactams, Amines/Imines, Amides/Imides 21 Property calculations on chemicalize.org 22 Mass spectral database search – know what exists How many mass spectra with formula C11H8O3 in NIST DB? Result: 19 for C11H8O3 in NIST05 DB Download NIST-MS-Search 23 Mass spectral interpretation Assign structural elements to mass spectral peaks Download Mass Spectrum Interpreter Version 2 24 Mass Spec Scissors (ACDLabs Free) Q: What is peak m/z 281 in negative mode? http://www.hmdb.ca/metabolites/HMDB09837 25 Molecular Weight Calculator Calculate isotopic masses Find formulas from masses Calculate isotopic patterns 100.0 80.0 60.0 40.0 20.0 Download MWTWIN 0.0 522.00 524.00 526.00 528.00 530.00 532.00 26 Structure search – know what could be possible How many compounds (isomer structures) are found in public databases? Result: 272 for C11H8O3 http://www.chemspider.com/ 27 Stay tuned – new mass spectrometry publications via Yahoo Pipes [LINK] [RSS] 28 Be open minded – NMR can do some things better 2D-NMR needed for de-novo structure elucidation NMR metabolic profiling is highly reproducible with low variance 29 ChenomX Profiler – with 312 pH and frequency tuned reference spectra NMR prediction with ChemAxon Msketch 30 The Last Page - What is important to remember: Learn about CPU type, memory, hard disks, bits and bytes; shock you colleagues with random questions about their computer Think about automation, thinks you would like to do (even if you can’t) shock you colleagues with a small computer script Use regular expressions for stupid or boring jobs you delete/replace data more than 3x - remember RegEx, RegEx, Regex Use scripting languages for small problems (EXCEL VBA, PERL) steal some small examples and color your EXCEL data in rainbow color Generate yourself a collection of programs and databases for MS try such programs in a Virtual Machine without messing up your system 31 .
Recommended publications
  • Semantic Web Integration of Cheminformatics Resources with the SADI Framework Leonid L Chepelev1* and Michel Dumontier1,2,3
    Chepelev and Dumontier Journal of Cheminformatics 2011, 3:16 http://www.jcheminf.com/content/3/1/16 METHODOLOGY Open Access Semantic Web integration of Cheminformatics resources with the SADI framework Leonid L Chepelev1* and Michel Dumontier1,2,3 Abstract Background: The diversity and the largely independent nature of chemical research efforts over the past half century are, most likely, the major contributors to the current poor state of chemical computational resource and database interoperability. While open software for chemical format interconversion and database entry cross-linking have partially addressed database interoperability, computational resource integration is hindered by the great diversity of software interfaces, languages, access methods, and platforms, among others. This has, in turn, translated into limited reproducibility of computational experiments and the need for application-specific computational workflow construction and semi-automated enactment by human experts, especially where emerging interdisciplinary fields, such as systems chemistry, are pursued. Fortunately, the advent of the Semantic Web, and the very recent introduction of RESTful Semantic Web Services (SWS) may present an opportunity to integrate all of the existing computational and database resources in chemistry into a machine-understandable, unified system that draws on the entirety of the Semantic Web. Results: We have created a prototype framework of Semantic Automated Discovery and Integration (SADI) framework SWS that exposes the QSAR descriptor functionality of the Chemistry Development Kit. Since each of these services has formal ontology-defined input and output classes, and each service consumes and produces RDF graphs, clients can automatically reason about the services and available reference information necessary to complete a given overall computational task specified through a simple SPARQL query.
    [Show full text]
  • Joelib Tutorial
    JOELib Tutorial A Java based cheminformatics/computational chemistry package Dipl. Chem. Jörg K. Wegner JOELib Tutorial: A Java based cheminformatics/computational chemistry package by Dipl. Chem. Jörg K. Wegner Published $Date: 2004/03/16 09:16:14 $ Copyright © 2002, 2003, 2004 Dept. Computer Architecture, University of Tübingen, GermanyJörg K. Wegner Updated $Date: 2004/03/16 09:16:14 $ License This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation version 2 of the License. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. Documents PS (JOELibTutorial.ps), PDF (JOELibTutorial.pdf), RTF (JOELibTutorial.rtf) versions of this tutorial are available. Plucker E-Book (http://www.plkr.org) versions: HiRes-color (JOELib-HiRes-color.pdb), HiRes-grayscale (JOELib-HiRes-grayscale.pdb) (recommended), HiRes-black/white (JOELib-HiRes-bw.pdb), color (JOELib-color.pdb), grayscale (JOELib-grayscale.pdb), black/white (JOELib-bw.pdb) Revision History Revision $Revision: 1.5 $ $Date: 2004/03/16 09:16:14 $ $Id: JOELibTutorial.sgml,v 1.5 2004/03/16 09:16:14 wegner Exp $ Table of Contents Preface ........................................................................................................................................................i 1. Installing JOELib
    [Show full text]
  • Cheminformatics and Chemical Information
    Cheminformatics and Chemical Information Matt Sundling Advisor: Prof. Curt Breneman Department of Chemistry and Chemical Biology/Center for Biotechnology and Interdisciplinary Studies Rensselaer Polytechnic Institute Rensselaer Exploratory Center for Cheminformatics Research http://reccr.chem.rpi.edu/ Many thanks to Theresa Hepburn! Upcoming lecture plug: Prof. Curt Breneman DSES Department November 15th Advances in Cheminformatics: Applications in Biotechnology, Drug Design and Bioseparations Cheminformatics is about collecting, storing, and analyzing [usually large amounts of] chemical data. •Pharmaceutical research •Materials design •Computational/Automated techniques for analysis •Virtual high-throughput screening (VHTS) QSAR - quantitative structure-activity relationship Molecular Model Activity Structures O N N Cl AAACCTCATAGGAAGCA TACCAGGAATTACATCA … Molecular Model Activity Structures O N N Cl AAACCTCATAGGAAGCATACCA GGAATTACATCA… Structural Descriptors Physiochemical Descriptors Topological Descriptors Geometrical Descriptors Molecular Descriptors Model Activity Structures Acitivty: bioactivity, ADME/Tox evaluation, hERG channel effects, p-456 isozyme inhibition, anti- malarial efficacy, etc… Structural Descriptors Physiochemical Descriptors Topological Descriptors Molecule D1 D2 … Activity (IC50) molecule #1 21 0.1 Geometrical Descriptors = molecule #2 33 2.1 molecule #3 10 0.9 + … Activity Molecular Descriptors Model Activity Structures Goal: Minimize the error: ∑((yi −f xi)) f1(x) f2(x) Regression Models: linear, multi-linear,
    [Show full text]
  • Committee on Publications and Cheminformatics Data Standards (CPCDS)
    Committee on Publications and Cheminformatics Data Standards (CPCDS) Report to the Bureau, April 2020 Leah McEwen, Committee Chair I. Executive Summary The IUPAC Committee on Publications and Cheminformatics Data Standards (CPCDS) advises on issues related to dissemination of information, primarily of IUPAC outputs. The portfolio is expanding and diversifying, in types of content, modes of publication and potential readership. Publication of IUPAC recommendations, technical reports and other information resources remains at the core of IUPAC dissemination activity and that of CPCDS. Increasingly the committee also focuses on the development and dissemination of chemical data and information standards to facilitate robust communication in the digital environment. Further details on these two priority areas are provided in Appendix A (Subcommittee on Publications: Report to CPCDS) and Appendix B (CPCDS Chair’s Statement: Towards a “Digital IUPAC”). The 2018-2019 biennium brought stability to key systems and workflows including those supporting IUPAC’s flagship journal Pure and Applied Chemistry (PAC) and online access to the Compendium of Chemical Terminology (a.k.a., the “Gold Book”). A number of collaborative symposia and workshops with strategic partners including CODATA and the GO FAIR Chemistry Implementation Network (ChIN) surfaced use cases and infrastructure needs for digital science among diverse community stakeholders. These activities lay the groundwork for developing a robust and systematic program for Digital IUPAC and “the creation of a consistent and interoperable global framework for human and machine-readable chemical information,” as articulated in the CPCDS Terms of Reference. This vision will be a critical component of success for IUPAC’s contribution towards the United Nations Sustainable Development Goals.
    [Show full text]
  • Scalable Analysis of Large Datasets in Life Sciences
    Scalable Analysis of Large Datasets in Life Sciences LAEEQ AHMED Doctoral Thesis in Computer Science Stockholm, Sweden 2019 Division of Computational Science and Technology School of Electrical Engineerning and Computer Science KTH Royal Institute of Technology TRITA-EECS-AVL-2019:69 SE-100 44 Stockholm ISBN: 978-91-7873-309-5 SWEDEN Akademisk avhandling som med tillstånd av Kungl Tekniska högskolan framlägges till offentlig granskning för avläggande av teknologie doktorsexamen i elektrotek- nik och datavetenskap tisdag den 3:e Dec 2019, klockan 10.00 i sal Kollegiesalen, Brinellvägen 8, KTH-huset, Kungliga Tekniska Högskolan. © Laeeq Ahmed, 2019 Tryck: Universitetsservice US AB To my respected parents, my beloved wife and my lovely brother and sister Abstract We are experiencing a deluge of data in all fields of scientific and business research, particularly in the life sciences, due to the development of better instrumentation and the rapid advancements that have occurred in informa- tion technology in recent times. There are major challenges when it comes to handling such large amounts of data. These range from the practicalities of managing these large volumes of data, to understanding the meaning and practical implications of the data. In this thesis, I present parallel methods to efficiently manage, process, analyse and visualize large sets of data from several life sciences fields at a rapid rate, while building and utilizing various machine learning techniques in a novel way. Most of the work is centred on applying the latest Big Data Analytics frameworks for creating efficient virtual screening strategies while working with large datasets. Virtual screening is a method in cheminformatics used for Drug discovery by searching large libraries of molecule structures.
    [Show full text]
  • Cheminformatics for Genome-Scale Metabolic Reconstructions
    CHEMINFORMATICS FOR GENOME-SCALE METABOLIC RECONSTRUCTIONS John W. May European Molecular Biology Laboratory European Bioinformatics Institute University of Cambridge Homerton College A thesis submitted for the degree of Doctor of Philosophy June 2014 Declaration This thesis is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text. This dissertation is not substantially the same as any I have submitted for a degree, diploma or other qualification at any other university, and no part has already been, or is currently being submitted for any degree, diploma or other qualification. This dissertation does not exceed the specified length limit of 60,000 words as defined by the Biology Degree Committee. This dissertation has been typeset using LATEX in 11 pt Palatino, one and half spaced, according to the specifications defined by the Board of Graduate Studies and the Biology Degree Committee. June 2014 John W. May to Róisín Acknowledgements This work was carried out in the Cheminformatics and Metabolism Group at the European Bioinformatics Institute (EMBL-EBI). The project was fund- ed by Unilever, the Biotechnology and Biological Sciences Research Coun- cil [BB/I532153/1], and the European Molecular Biology Laboratory. I would like to thank my supervisor, Christoph Steinbeck for his guidance and providing intellectual freedom. I am also thankful to each member of my thesis advisory committee: Gordon James, Julio Saez-Rodriguez, Kiran Patil, and Gos Micklem who gave their time, advice, and guidance. I am thankful to all members of the Cheminformatics and Metabolism Group.
    [Show full text]
  • Print Special Issue Flyer
    IMPACT FACTOR 4.411 an Open Access Journal by MDPI Cheminformatics, Past, Present, and Future: From Chemistry to Nanotechnology and Complex Systems Guest Editors: Message from the Guest Editors Prof. Dr. Humbert González- Cheminformatics techniques form have been interesting Díaz tools in the effort to explore large chemical spaces, [email protected] reducing at the same time animal testing as well as costs in Dr. Aliuska Duardo-Sanchez terms of materials and human resources on drug discovery [email protected] processes in medicinal chemistry. In addition, with the emergence of nanotechnologies, cheminformatics needs Prof. Dr. Alejandro Pazos to deal with drug–nanoparticle systems used as drug Sierra [email protected] delivery systems, drug co-therapy systems, etc. This brings to mind the application of cheminformatics in areas beyond drug discovery such as the fuel industry, polymer sciences, materials science, and biomedical engineering. Deadline for manuscript submissions: In this context, we propose to open this new issue to closed (15 November 2020) discuss with all colleagues worldwide all the past, present, and future challenges of cheminformatics. The present Special Issue is also associated with MOL2NET-05, the International Conference on Multidisciplinary Sciences, ISSN: 2624-5078, MDPI SciForum, Basel, Switzerland, 2019. The link of the conference: https://mol2net- 05.sciforum.net/. We especially encourage submissions of papers from colleagues worldwide to the conference (short communications) and complete versions (full papers) to the present Special Issue. mdpi.com/si/31770 SpeciaIslsue IMPACT FACTOR 4.411 an Open Access Journal by MDPI Editor-in-Chief Message from the Editor-in-Chief Prof.
    [Show full text]
  • Trust, but Verify: on the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research
    J. Chem. Inf. Model. XXXX, xxx, 000 A Trust, But Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research Denis Fourches,† Eugene Muratov,†,‡ and Alexander Tropsha*,† Laboratory for Molecular Modeling, Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, North Carolina 27599, and Laboratory of Theoretical Chemistry, Department of Molecular Structure, A.V. Bogatsky Physical-Chemical Institute NAS of Ukraine, Odessa, 65080, Ukraine Received May 5, 2010 1. INTRODUCTION to the prediction performances of the derivative QSAR models. They also presented several illustrative examples With the recent advent of high-throughput technologies of incorrect structures generated from either correct or for both compound synthesis and biological screening, there incorrect SMILES. The main conclusions of the study were is no shortage of publicly or commercially available data that small structural errors within a data set could lead to sets and databases1 that can be used for computational drug 2 significant losses of predictive ability of QSAR models. The discovery applications (reviewed recently in Williams et al. ). authors further demonstrated that manual curation of struc- Rapid growth of large, publicly available databases (such 3 4 tural data leads to substantial increase in the model predic- as PubChem or ChemSpider containing more than 20 tivity. This conclusion becomes especially important in light million molecular records each) enabled by experimental of the aforementioned study of Oprea et al.6 that cited a projects such as NIH’s Molecular Libraries and Imaging significant error rate in medicinal chemistry literature. Initiative5 provides new opportunities for the development Alarmed by these conclusions, we have examined several of cheminformatics methodologies and their application to popular public databases of bioactive molecules to assess knowledge discovery in molecular databases.
    [Show full text]
  • Chemoinformatics for Green Chemistry
    Contents LIST OF PAPERS .................................................................................... 3 LISTS OF ACRONYMS AND ABBREVIATIONS ............................... 4 INTRODUCTION .................................................................................. 7 BACKGROUND .............................................................................................7 AIMS OF THE THESIS ..................................................................................9 METHODOLOGY ............................................................................... 11 QUANTITATIVE STRUCTURE-PROPERTY RELATIONSHIPS (QSPR)......11 Linear Free Energy Relationships (LFER)...............................................11 Linear Solvation Energy Relationship (LSER).........................................12 Reversed-Phase Liquid Chromatography (RPLC).....................................14 Quantitative Structure-Retention Relationships (QSRR) .........................14 Global Linear Solvation Energy Relationship (LSER)..............................15 Molecular Descriptors ...............................................................................16 The Definition of the Applicability Domain ..............................................17 The Distance-Based Approach Using PLS.......................................17 The Williams Plot .............................................................................18 Associative Neural Networks (ASNN)..............................................19 DATA ANALYSIS ........................................................................................20
    [Show full text]
  • The Literature of Chemoinformatics: 1978–2018
    International Journal of Molecular Sciences Editorial The Literature of Chemoinformatics: 1978–2018 Peter Willett Information School, University of Sheffield, Sheffield S1 4DP, UK; p.willett@sheffield.ac.uk Received: 17 July 2020; Accepted: 31 July 2020; Published: 4 August 2020 Abstract: This article presents a study of the literature of chemoinformatics, updating and building upon an analogous bibliometric investigation that was published in 2008. Data on outputs in the field, and citations to those outputs, were obtained by means of topic searches of the Web of Science Core Collection. The searches demonstrate that chemoinformatics is by now a well-defined sub-discipline of chemistry, and one that forms an essential part of the chemical educational curriculum. There are three core journals for the subject: The Journal of Chemical Information and Modeling, the Journal of Cheminformatics, and Molecular Informatics, and, having established itself, chemoinformatics is now starting to export knowledge to disciplines outside of chemistry. Keywords: bibliometrics; cheminformatics; chemoinformatics; scientometrics 1. Introduction Increasing use is being made of bibliometric methods to analyze the published academic literature, with studies focusing on, e.g., author productivity, the articles appearing in a specific journal, the characteristics of bibliographic frequency distributions, new metrics for research evaluation, and the citations to publications in a specific subject area inter alia (e.g., [1–6]). There have been many bibliometric studies of various aspects of chemistry over the years, with probably the earliest such study being the famous 1926 paper by Lotka in which he discussed author productivity based in part on an analysis of publications in Chemical Abstracts [7].
    [Show full text]
  • Cheminformatics-Aided Pharmacovigilance: Application To
    Low Y. et al. J Am Med Inform Assoc 2015;0:1–11. doi:10.1093/jamia/ocv127, Research and Applications Journal of the American Medical Informatics Association Advance Access published October 24, 2015 RECEIVED 27 March 2015 Cheminformatics-aided pharmacovigilance: REVISED 6 July 2015 ACCEPTED 11 July 2015 application to Stevens-Johnson Syndrome Yen S Low1,2, Ola Caster3,4, Tomas Bergvall3, Denis Fourches1, Xiaoling Zang1, G Niklas Nore´n3,5, Ivan Rusyn2, Ralph Edwards3, Alexander Tropsha1 ABSTRACT .................................................................................................................................................... Objective Quantitative Structure-Activity Relationship (QSAR) models can predict adverse drug reactions (ADRs), and thus provide early warnings of potential hazards. Timely identification of potential safety concerns could protect patients and aid early diagnosis of ADRs among the exposed. Our objective was to determine whether global spontaneous reporting patterns might allow chemical substructures associated with Stevens- Johnson Syndrome (SJS) to be identified and utilized for ADR prediction by QSAR models. RESEARCH AND APPLICATIONS Materials and Methods Using a reference set of 364 drugs having positive or negative reporting correlations with SJS in the VigiBase global repo- sitory of individual case safety reports (Uppsala Monitoring Center, Uppsala, Sweden), chemical descriptors were computed from drug molecular structures. Random Forest and Support Vector Machines methods were used to develop QSAR models, which were validated by external 5-fold cross validation. Models were employed for virtual screening of DrugBank to predict SJS actives and inactives, which were corroborated using Downloaded from knowledge bases like VigiBase, ChemoText, and MicroMedex (Truven Health Analytics Inc, Ann Arbor, Michigan). Results We developed QSAR models that could accurately predict if drugs were associated with SJS (area under the curve of 75%–81%).
    [Show full text]
  • The Tonnabytes Big Data Challenge
    The Tonnabytes Big Data Challenge: Transforming Science and Education Kirk Borne George Mason University Ever since we first began to explore our world… … humans have asked questions and … … have collected evidence (data) to help answer those questions. Astronomy: the world’s second oldest profession ! Characteristics of Big Data • Big quantities of data are acquired everywhere now. But… • What do we mean by “big”? • Gigabytes? Terabytes? Petabytes? Exabytes? • The meaning of “big” is domain-specific and resource- dependent (data storage, I/O bandwidth, computation cycles, communication costs) • I say … we all are dealing with our own “tonnabytes” • There are 4 dimensions to the Big Data challenge: 1. Volume (tonnabytes data challenge) 2. Complexity (variety, curse of dimensionality) 3. Rate of data and information flowing to us (velocity) 4. Verification (verifying inference-based models from data) • Therefore, we need something better to cope with the data tsunami … Data Science – Informatics – Data Mining Examples of Recommendations**: Inference from Massive or Complex Data • Advances in fundamental mathematics and statistics are needed to provide the language, structure, and tools for many needed methodologies of data-enabled scientific inference. – Example : Machine learning in massive data sets • Algorithmic advances in handling massive and complex data are crucial. • Visualization (visual analytics) and citizen science (human computation or data processing) will play key roles. • ** From the NSF report: Data-Enabled Science in the Mathematical and Physical Sciences, (2010) http://www.cra.org/ccc/docs/reports/DES-report_final.pdf This graphic says it all … • Clustering – examine the data and find the data clusters (clouds), without considering what the items are = Characterization ! • Classification – for each new data item, try to place it within a known class (i.e., a known category or cluster) = Classify ! • Outlier Detection – identify those data items that don’t fit into the known classes or clusters = Surprise ! Graphic provided by Professor S.
    [Show full text]