Synthetic Data in Machine Learning for Medicine and Healthcare


The proliferation of synthetic data in artificial intelligence for medicine and healthcare raises concerns about the vulnerabilities of the software and the challenges of current policy.

Richard J. Chen, Ming Y. Lu, Tiffany Y. Chen, Drew F. K. Williamson and Faisal Mahmood

As artificial intelligence (AI) for applications in medicine and healthcare undergoes increased regulatory analysis and clinical adoption, the data used to train the algorithms are undergoing increasing scrutiny. Scrutiny of the training data is central to understanding algorithmic biases and pitfalls. These can arise from datasets with sample-selection biases — for example, from a hospital that admits patients with certain socioeconomic backgrounds, or medical images acquired with one particular type of equipment or camera model. Algorithms trained with biases in sample selection typically fail when deployed in settings sufficiently different from those in which the training data were acquired1. Biases can also arise owing to class imbalances — as is typical of data associated with rare diseases — which degrade the performance of trained AI models for diagnosis and prognosis. And AI-driven diagnostic assistance tools relying on historical data would not typically detect new phenotypes, such as those of patients with stroke or cancer presenting with symptoms of coronavirus disease 2019 (COVID-19)2. Because the utility of AI algorithms for healthcare applications hinges on the exhaustive curation of medical data with ground-truth labels, the algorithms are only as effective and as robust as the data they are supplied with.

Therefore, large datasets that are diverse and representative (of the heterogeneity of phenotypes in the gender, ethnicity and geography of the individuals or patients, and in the healthcare systems, workflows and equipment used) are necessary to develop and refine best practices in evidence-based medicine involving AI3. To overcome the paucity of annotated medical data in real-world settings, synthetic data are being increasingly used. Synthetic data can be created from perturbations using accurate forward models (that is, models that simulate outcomes given specific inputs), physical simulations or AI-driven generative models. As with the development of computer vision algorithms for self-driving cars to emulate scenarios such as road accidents and harsh driving environments for which collecting data can be challenging4, in medicine and healthcare accurate synthetic data can be used to increase diversity in datasets and to increase the robustness and adaptability of AI models. However, synthetic data can also be used maliciously, as exemplified by fake impersonation videos (also known as deepfakes), which can propagate misinformation and fool facial recognition software5.
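Of the three creation routes mentioned above, the perturbation-based one is the simplest to picture: apply label-preserving transformations to real scans to create plausible variants. The sketch below is a minimal, hypothetical illustration in NumPy; none of this code is from the Comment, and the function and variable names are invented for illustration:

```python
import numpy as np

def perturb(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Create one synthetic variant of an image via label-preserving perturbations."""
    out = image.astype(np.float32)
    if rng.random() < 0.5:                          # random horizontal flip
        out = out[:, ::-1]
    out = np.rot90(out, k=int(rng.integers(0, 4)))  # random 90-degree rotation
    out = out + rng.normal(0.0, 0.01, size=out.shape)  # mild simulated acquisition noise
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
real_scan = rng.random((128, 128))                  # stand-in for a normalized grayscale image
synthetic_variants = [perturb(real_scan, rng) for _ in range(8)]
```

A realistic pipeline would replace these generic perturbations with physics-based forward models of the imaging process; the point here is only the shape of the workflow: one real sample in, many plausible synthetic samples out.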
The United States Food and Drug Administration (FDA) has put forward an approval pathway for AI-based software as a medical device (AI-SaMD). An increasing number of AI algorithms are being approved, with uses ranging from the detection of atrial fibrillation to the clinical grading of pathology slides6,7. By incorporating synthetic data emulating the phenotypes of underrepresented conditions and individuals, AI algorithms can make better medical decisions in a wider range of real-world environments. In fact, the use of synthetic data has attracted mainstream attention as a potential path forward for greater reproducibility in research and for implementing differential privacy for protected health information (PHI)8,9. With the increasing digitization of health data and the market size of AI for healthcare expected to reach US$45 billion by 2026 (ref. 8), the role of synthetic data in the health information economy needs to be precisely delineated in order to develop fault-tolerant and patient-facing health systems10. How do synthetic data fit within existing regulatory frameworks for modifying AI algorithms in healthcare? To what ends can synthetic data be used to protect or exploit patient privacy, and to improve medical decision-making? In this Comment, we examine the proliferation of synthetic data in medical AI and discuss the associated vulnerabilities and necessary policy challenges.

Fidelity tests

Beyond improved image classification and natural language processing, one promise of AI involves learning algorithms, also known as deep generative models, that can emulate how data are generated in the real world11,12. Generative adversarial networks (GANs) are a type of generative model that learn probability distributions of how high-dimensional data are likely to be distributed. GANs consist of two neural networks — a generator and a discriminator — that compete in a minimax game (that is, a game of minimizing the maximum possible loss) to fool each other. For instance, in a GAN being trained to produce paintings in the style of Claude Monet, the generator would be a neural network that aims to produce a Monet counterfeit that fools a critic discriminator network attempting to distinguish real Monet paintings from counterfeits. As the game progresses, the generator learns from the poor counterfeits caught by the discriminator, progressively creating more realistic counterfeits.
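In the standard formulation from the GAN literature (the Comment itself states the game only in words), the two networks optimize the value function

$$\min_G \max_D \; V(D,G) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big],$$

where the discriminator $D$ pushes $D(x)$ toward 1 on real samples and $D(G(z))$ toward 0 on counterfeits, while the generator $G$ pulls $D(G(z))$ back toward 1. At the equilibrium of this game, the distribution of $G(z)$ matches $p_{\text{data}}$, which is what makes samples drawn from the generator usable as synthetic data.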
GANs have shown promise in a variety of applications, ranging from synthesizing paintings of modern landscapes in the style of Claude Monet to generating realistic images of skin lesions13 (Fig. 1, top), pathology slides14, colon mucosa15 and chest X-rays16–18 (Fig. 1, top), and in a range of imaging modalities19–22. Advancements in computer graphics and in information theory have driven progress in generative modelling from the generation of 28 × 28 (pixels per side) black-and-white images of handwritten digits to simulating life-like human faces with 1,024 × 1,024 high-fidelity images23. To show the capabilities of the AI-driven generation of synthetic medical data, we produced images of three histological subtypes of renal cell carcinoma (chromophobe, clear cell and papillary carcinoma) by training a GAN with 10,000 real images of each subtype, and then compared the performance of the model with another model trained using both real and synthetic data (Fig. 1, middle and bottom). The synthetic images generated by the GAN finely mimic the characteristic thin-walled 'chicken wire' vasculature of the clear-cell carcinoma subtype and the unique features of the other two subtypes. The GAN also improves the accuracy of classification. By closely mimicking real-world observational data, synthetic […]

Fig. 1 | Synthetic medical data in action. Top: synthetic and real images of skin lesions and of frontal chest X-rays. Middle: synthetic and real histology images of three subtypes of renal cell carcinoma (chromophobe, clear cell and papillary; scale bars, 40 μm). Bottom: areas under the receiver operating characteristic curve (AUC) for the classification performance on an independent dataset of the histology images by a deep-learning model trained with 10,000 real images of each subtype and by the same model trained with the real-image dataset augmented by 10,000 synthetic images of each subtype. Methodology and videos are available as Supplementary Information.

Challenges in adoption

The generation of synthetic data has garnered significant attention in medicine and healthcare13,14,17,32–34 because it can improve existing AI algorithms through data augmentation. For instance, among renal cell carcinomas, the chromophobe subtype is rare and accounts for merely 5% of all renal cell carcinoma cases35. By providing synthetic histology images of renal cell carcinoma as additional training input to a convolutional neural network, the detection accuracy of the subtype can be improved (Fig. 1, bottom).

However, the wider roles of synthetic data in AI systems in healthcare remain unclear. Unlike traditional medical devices, the function of AI-SaMDs may need to be adaptive to data streams that evolve over time, as is the case for health data from smartphone sensors36,37. Researchers may be tempted to use synthetic data as a stopgap for the fine-tuning of algorithms; however, policymakers may find it troubling that there are not always clinical-quality measures and evaluation metrics for synthetic data. In a proposed FDA regulatory framework for software modifications in adaptive AI-SaMDs, guidance for updating algorithms would mandate reference standards and quality assurance of any new data sources6. However, when generating synthetic data for rare or new disease conditions, there may not even be sufficient samples to establish clinical reference standards. As with other data-driven deep-learning algorithms, generative models are constrained by the size and quality of the training dataset used to model the data distribution, and models trained with biased datasets would still be biased toward overrepresented conditions. How can we assess whether synthetic data are emulating the correct phenotype and are free from artefacts that would bias the deployed AI-SaMDs? Current quantitative […]

[…] diversity in adolescents with de novo mutations makes the weights of the trained GAN model publicly available, the GAN could be used by a third party to synthesize […]
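The real-versus-augmented comparison summarized in Fig. 1 (bottom) reduces to a simple experimental template: train the same classifier once on real images only and once on real plus synthetic images, then score both on a held-out set of real images. The scikit-learn sketch below is a hypothetical stand-in (random arrays in place of histology images, and a linear model in place of a convolutional network), not the authors' pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Stand-ins for flattened images (binary label: rare subtype vs other).
X_real,  y_real  = rng.random((1000, 64)), rng.integers(0, 2, 1000)
X_synth, y_synth = rng.random((1000, 64)), rng.integers(0, 2, 1000)  # GAN output
X_test,  y_test  = rng.random((500, 64)),  rng.integers(0, 2, 500)   # real, held out

def held_out_auc(X, y):
    """Fit a classifier on (X, y) and report its AUC on the held-out real set."""
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

print("real only:       ", held_out_auc(X_real, y_real))
print("real + synthetic:", held_out_auc(np.vstack([X_real, X_synth]),
                                        np.concatenate([y_real, y_synth])))
```

The one design choice that matters, and that the figure follows, is that synthetic images appear only on the training side; the AUC is always measured on independent real data.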
Recommended publications
  • Human Detection of Machine Manipulated Media (arXiv:1907.05276)
    Human detection of machine manipulated media Matthew Groh1, Ziv Epstein1, Nick Obradovich1,2, Manuel Cebrian∗1,2, and Iyad Rahwan∗1,2 1Media Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA 2Center for Humans & Machines, Max Planck Institute for Human Development, Berlin, Germany Abstract Recent advances in neural networks for content generation enable artificial intelligence (AI) models to generate high-quality media manipulations. Here we report on a randomized experiment designed to study the effect of exposure to media manipulations on over 15,000 individuals’ ability to discern machine-manipulated media. We engineer a neural network to plausibly and automatically remove objects from images, and we deploy this neural network online with a randomized experiment where participants can guess which image out of a pair of images has been manipulated. The system provides participants feedback on the accuracy of each guess. In the experiment, we randomize the order in which images are presented, allowing causal identification of the learning curve surrounding participants’ ability to detect fake content. We find sizable and robust evidence that individuals learn to detect fake content through exposure to manipulated media when provided iterative feedback on their detection attempts. Over a succession of only ten images, participants increase their rating accuracy by over ten percentage points. Our study provides initial evidence that human ability to detect fake, machine-generated content may increase alongside the prevalence of such media online. Introduction The recent emergence of artificial intelligence (AI) powered media manipulations has widespread societal implications for journalism and democracy [1], national security [2], and art [3]. AI models have the potential to scale misinformation to unprecedented levels by creating various forms of synthetic media [4].
  • Editorial Algorithms
    March 26, 2021 Editorial Algorithms The next AI frontier in News and Information National Press Club Journalism Institute Housekeeping - You will receive slides after the presentation - Interactive polls throughout the session - Use the Q&A tool on Zoom to ask questions - If you still have questions, reach out to Francesco directly [email protected] or connect on twitter @fpmarconi What will you learn during today's session 1. How AI is being used and will be used by journalists and communicators, and what that means for how we work 2. The ethics of AI, particularly related to diversity, equity and representation 3. The role of AI in the spread of misinformation and 'fake news' 4. How journalists and communications professionals can avoid pitfalls while taking advantage of the possibilities AI provides to develop new ways of telling stories and connecting with audiences Poll time! What's your overall feeling about AI and news? A. It's the future of news and communications! B. I'm cautiously optimistic C. I'm cautiously pessimistic D. It's bad news for journalism Journalism was designed to solve information scarcity Newspapers were established to provide increasingly literate audiences with information. Journalism was originally designed to solve the problem that information was not widely available. The news industry went through a process of standardization so it could reliably source and produce information more efficiently. This led to the development of very structured ways of writing such as the "inverted pyramid" method, to organizing workflows around beats and to planning coverage based on news cycles. A new reality: information abundance The flood of information that humanity is now exposed to presents a new challenge that cannot be solved exclusively with the traditional methods of journalism.
  • AI for Health: Examples from Industry, Government, and Academia
    guidehouse.com. AI for Health: Examples from industry, government, and academia. Table of Contents: Abstract; Introduction to artificial intelligence; Introduction to health data; Molecules: AI for understanding biology and developing therapies (Fundamental research; Drug development); Patients: AI for predicting disease and improving care (Diagnostics and outcome risks; Personalized medicine); Groups: AI for optimizing clinical trials and safeguarding public health (Clinical trials; Public health); Conclusion and outlook. Abstract: This paper is directed toward a health-informed reader who is curious about the developments and potential of artificial intelligence (AI) in the health space, but could equally be read by AI practitioners curious about how their knowledge and methods are being used to advance human health. We present a brief, equation-free introduction to AI and its major subfields in order to provide a framework for understanding the technical context of the examples that follow. We discuss the various data sources available for questions of health and life sciences, as well as the unique challenges inherent to these fields. We then consider recent (past five years) applications of AI that have already had tangible, measurable impact to the advancement of biomedical knowledge and the development of new and improved treatments. These examples are organized by scale, ranging from the molecule (fundamental research and drug development) to the patient (diagnostics, risk-scoring, and personalized medicine) to the group (clinical trials and public health). Finally, we conclude with a brief summary and our outlook for the future of AI for health.
  • Pre-Training of Deep Bidirectional Protein Sequence Representations with Structural Information
    Pre-training of Deep Bidirectional Protein Sequence Representations with Structural Information. SEONWOO MIN1,2, SEUNGHYUN PARK3, SIWON KIM1, HYUN-SOO CHOI4, BYUNGHAN LEE5 (Member, IEEE), and SUNGROH YOON1,6 (Senior Member, IEEE). 1Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, South Korea; 2LG AI Research, Seoul 07796, South Korea; 3Clova AI Research, NAVER Corp., Seongnam 13561, South Korea; 4Department of Computer Science and Engineering, Kangwon National University, Chuncheon 24341, South Korea; 5Department of Electronic and IT Media Engineering, Seoul National University of Science and Technology, Seoul 01811, South Korea; 6Interdisciplinary Program in Artificial Intelligence, ASRI, INMC, and Institute of Engineering Research, Seoul National University, Seoul 08826, South Korea. Corresponding author: Byunghan Lee ([email protected]) or Sungroh Yoon ([email protected]). This research was supported by the National Research Foundation (NRF) of Korea grants funded by the Ministry of Science and ICT (2018R1A2B3001628 (S.Y.), 2014M3C9A3063541 (S.Y.), 2019R1G1A1003253 (B.L.)), the Ministry of Agriculture, Food and Rural Affairs (918013-4 (S.Y.)), and the Brain Korea 21 Plus Project in 2021 (S.Y.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. ABSTRACT Bridging the exponentially growing gap between the numbers of unlabeled and labeled protein sequences, several studies adopted semi-supervised learning for protein sequence modeling. In these studies, models were pre-trained with a substantial amount of unlabeled data, and the representations were transferred to various downstream tasks.
  • Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences
    bioRxiv preprint doi: https://doi.org/10.1101/622803; this version posted May 29, 2019, under a CC-BY-NC-ND 4.0 International license. Alexander Rives, Siddharth Goyal, Joshua Meier, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, Rob Fergus. ABSTRACT In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In biology, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Learning the natural distribution of evolutionary protein sequence variation is a logical step toward predictive and generative modeling for biology. To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million sequences spanning evolutionary diversity. The resulting model maps raw sequences to representations of biological properties without labels or prior domain knowledge. The learned representation space organizes sequences at multiple levels of biological granularity from the biochemical to proteomic levels. Unsupervised learning recovers information about protein structure: secondary structure and residue-residue contacts can be identified by linear projections from the learned representations. Training language models on full sequence diversity rather than individual protein families increases recoverable information about secondary structure.
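The claim that secondary structure "can be identified by linear projections from the learned representations" is what is usually called a linear probe: freeze the per-residue embeddings and fit a purely linear classifier on top. A schematic sketch follows, with random arrays standing in for the model's embeddings; nothing below uses the actual ESM code, and the dimensions are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
embeddings = rng.random((5000, 768))   # stand-in: one frozen embedding per residue
labels = rng.integers(0, 3, 5000)      # stand-in: helix / strand / coil classes

# A linear model only: if it scores well, the information is linearly decodable
# from the representations, which is the sense of "linear projections" above.
probe = LogisticRegression(max_iter=1000).fit(embeddings[:4000], labels[:4000])
print("probe accuracy:", probe.score(embeddings[4000:], labels[4000:]))
```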
  • Deepfaked Online Content Is Highly Effective in Manipulating People's Attitudes and Intentions
    Deepfaked online content is highly effective in manipulating people's attitudes and intentions. Sean Hughes, Ohad Fried, Melissa Ferguson, Ciaran Hughes, Rian Hughes, Xinwei Yao, & Ian Hussey. In recent times, disinformation has spread rapidly through social media and news sites, biasing our (moral) judgements of other people and groups. "Deepfakes", a new type of AI-generated media, represent a powerful new tool for spreading disinformation online. Although Deepfaked images, videos, and audio may appear genuine, they are actually hyper-realistic fabrications that enable one to digitally control what another person says or does. Given the recent emergence of this technology, we set out to examine the psychological impact of Deepfaked online content on viewers. Across seven preregistered studies (N = 2558) we exposed participants to either genuine or Deepfaked content, and then measured its impact on their explicit (self-reported) and implicit (unintentional) attitudes as well as behavioral intentions. Results indicated that Deepfaked videos and audio have a strong psychological impact on the viewer, and are just as effective in biasing their attitudes and intentions as genuine content. Many people are unaware that Deepfaking is possible; find it difficult to detect when they are being exposed to it; and most importantly, neither awareness nor detection serves to protect people from its influence. All preregistrations, data and code available at osf.io/f6ajb. The proliferation of social media, dating apps, news and gossip sites has brought with it the ability to learn about a person's moral character without ever having to interact with them in real life. […] copy of a person that can be manipulated into doing or saying anything (2). Deepfaking has quickly become a tool of […]
  • Machine Learning for Speech Recognition by Alice Coucke, Head of Machine Learning Research
    Machine Learning for Speech Recognition, by Alice Coucke, Head of Machine Learning Research. @alicecoucke [email protected] Outline: 1. Recent advances in machine learning 2. From physics to machine learning (AI in general; speech & NLP; startup) 3. Working at Snips (now Sonos). A lot of applications in the real world, quite a difference from theoretical work; one of the rare scientific fields where theory and practice are so intertwined. Recent Advances in Applied Machine Learning. Reinforcement learning: learning goal-oriented behavior within simulated environments. Go: AlphaGo (DeepMind, 2016); Starcraft II: AlphaStar (DeepMind); Dota 2: OpenAI Five (OpenAI). Play-driven learning for robots: sim-to-real dexterity learning (Google Brain); Project BLUE (UC Berkeley). Machine Learning for Life Sciences: deep learning applied to biology and medicine. Protein folding and structure prediction: AlphaFold (DeepMind); cardiac arrhythmia prediction from ECGs (Stanford); eye disease diagnosis (NHS, UCL, DeepMind); reconstructing speech from neural activity (UCSF); limb control restoration (Battelle, Ohio State Univ). Computer vision: high-level understanding of digital images or videos. GANs for image generation (Heriot-Watt Univ, DeepMind); 'common sense' understanding of actions in videos (TwentyBN, DeepMind, MIT, IBM…); GANs for artificial video dubbing (Synthesia); GAN for full-body synthesis (DataGrid). From physics to machine learning and back: a surge of interest from the physics community
  • Artificial Intelligence and Cybersecurity
    CEPS Task Force Report: Artificial Intelligence and Cybersecurity: Technology, Governance and Policy Challenges. Final Report of a CEPS Task Force. Rapporteurs: Lorenzo Pupillo, Stefano Fantin, Afonso Ferreira, Carolina Polito. Centre for European Policy Studies (CEPS), Brussels, May 2021. The Centre for European Policy Studies (CEPS) is an independent policy research institute based in Brussels. Its mission is to produce sound analytical research leading to constructive solutions to the challenges facing Europe today. Lorenzo Pupillo is CEPS Associate Senior Research Fellow and Head of the Cybersecurity@CEPS Initiative. Stefano Fantin is Legal Researcher at the Center for IT and IP Law, KU Leuven. Afonso Ferreira is Director of Research at CNRS. Carolina Polito is CEPS Research Assistant at the GRID unit, Cybersecurity@CEPS Initiative. ISBN 978-94-6138-785-1. © Copyright 2021, CEPS. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means – electronic, mechanical, photocopying, recording or otherwise – without the prior permission of the Centre for European Policy Studies. CEPS, Place du Congrès 1, B-1000 Brussels. Tel: 32 (0) 2 229.39.11; e-mail: [email protected]; internet: www.ceps.eu
  • The Nooscope Manifested: AI As Instrument of Knowledge Extractivism
    AI & SOCIETY, https://doi.org/10.1007/s00146-020-01097-6. ORIGINAL ARTICLE. The Nooscope manifested: AI as instrument of knowledge extractivism. Matteo Pasquinelli, Vladan Joler. Received: 27 March 2020 / Accepted: 14 October 2020. © The Author(s) 2020. Abstract: Some enlightenment regarding the project to mechanise reason. The assembly line of machine learning: data, algorithm, model. The training dataset: the social origins of machine intelligence. The history of AI as the automation of perception. The learning algorithm: compressing the world into a statistical model. All models are wrong, but some are useful. World to vector: the society of classification and prediction bots. Faults of a statistical instrument: the undetection of the new. Adversarial intelligence vs. statistical intelligence: labour in the age of AI. Keywords: Nooscope, Political economy, Mechanised knowledge, Information compression, Ethical machine learning. 1 Some enlightenment regarding the project to mechanise reason. The Nooscope is a cartography of the limits of artificial intelligence, intended as a provocation to both computer science and the humanities. Any map is a partial perspective, a way to provoke debate. Similarly, this map is a manifesto — of AI dissidents. The purpose of the Nooscope map is to secularize AI from the ideological status of 'intelligent machine' to one of knowledge instruments. Rather than evoking legends of alien cognition, it is more reasonable to consider machine learning as an instrument of knowledge magnification that helps to perceive features, patterns, and correlations through vast spaces of data beyond human reach. In the history of science and technology, this is no news; it has already been pursued
  • Overview of Current State of Research on the Application of Artificial Intelligence Techniques for COVID-19
    Overview of current state of research on the application of artificial intelligence techniques for COVID-19. Vijay Kumar1, Dilbag Singh2, Manjit Kaur2 and Robertas Damaševičius3,4. 1 Computer Science and Engineering Department, National Institute of Technology, Hamirpur, Himachal Pradesh, India; 2 School of Engineering and Applied Sciences, Bennett University, Greater Noida, India; 3 Faculty of Applied Mathematics, Silesian University of Technology, Gliwice, Poland; 4 Department of Applied Informatics, Vytautas Magnus University, Kaunas, Lithuania. ABSTRACT Background: To date, only a limited number of resources are available to predict and diagnose COVID-19 disease. The design of novel drug–drug interactions for COVID-19 patients is an open area of research, and the development of COVID-19 rapid testing kits remains a challenging task. Methodology: This review focuses on two prime challenges of the COVID-19 pandemic: the development of COVID-19 classification tools and of drug-discovery models for infected patients, with the help of artificial intelligence (AI) techniques such as machine learning and deep learning models. Results: In this paper, various AI-based techniques are studied and evaluated by means of applying them to the prediction and diagnosis of COVID-19 disease. This study provides recommendations for future research and facilitates knowledge collection and formation on the application of AI techniques for dealing with the COVID-19 epidemic and its consequences. Conclusions: AI techniques can be an effective tool for tackling the epidemic caused by COVID-19. They may be utilized in four main fields: prediction, diagnosis, drug design, and the analysis of social implications for COVID-19 infected patients.
  • TorsionNet: A Reinforcement Learning Approach to Sequential Conformer Search
    TorsionNet: A Reinforcement Learning Approach to Sequential Conformer Search. Tarun Gogineni1, Ziping Xu2, Exequiel Punzalan3, Runxuan Jiang1, Joshua Kammeraad2,3, Ambuj Tewari2, Paul Zimmerman3. 1Department of EECS, University of Michigan; 2Department of Statistics, University of Michigan; 3Department of Chemistry, University of Michigan. {tgog,zipingxu,epunzal,runxuanj,joshkamm,tewaria,paulzim}@umich.edu. Abstract: Molecular geometry prediction of flexible molecules, or conformer search, is a long-standing challenge in computational chemistry. This task is of great importance for predicting structure-activity relationships for a wide variety of substances ranging from biomolecules to ubiquitous materials. Substantial computational resources are invested in Monte Carlo and Molecular Dynamics methods to generate diverse and representative conformer sets for medium to large molecules, which are yet intractable to chemoinformatic conformer search methods. We present TorsionNet, an efficient sequential conformer search technique based on reinforcement learning under the rigid rotor approximation. The model is trained via curriculum learning, whose theoretical benefit is explored in detail, to maximize a novel metric grounded in thermodynamics called the Gibbs Score. Our experimental results show that TorsionNet outperforms the highest-scoring chemoinformatics method by 4x on large branched alkanes, and by several orders of magnitude on the previously unexplored biopolymer lignin, with applications in renewable energy. TorsionNet also outperforms the far more exhaustive but computationally intensive Self-Guided Molecular Dynamics sampling method. 1 Introduction. Accurate prediction of likely 3D geometries of flexible molecules is a long-standing goal of computational chemistry, with broad implications for drug design, biopolymer research, and QSAR analysis. However, this is a very difficult problem due to the exponential growth of possible stable physical structures, or conformers, as a function of the size of a molecule.
  • Accurate Protein Structure Prediction by Embeddings and Deep Learning Representations (arXiv:1911.05531)
    Accurate Protein Structure Prediction by Embeddings and Deep Learning Representations. Iddo Drori1,2, Darshan Thaker1, Arjun Srivatsa1, Daniel Jeong1, Yueqi Wang1, Linyong Nan1, Fan Wu1, Dimitri Leggas1, Jinhao Lei1, Weiyi Lu1, Weilong Fu1, Yuan Gao1, Sashank Karri1, Anand Kannan1, Antonio Moretti1, Mohammed AlQuraishi3, Chen Keasar4, and Itsik Pe'er1. 1 Columbia University, Department of Computer Science, New York, NY 10027; 2 Cornell University, School of Operations Research and Information Engineering, Ithaca, NY 14853; 3 Harvard University, Department of Systems Biology, Harvard Medical School, Boston, MA 02115; 4 Ben-Gurion University, Department of Computer Science, Israel, 8410501. Abstract. Proteins are the major building blocks of life, and actuators of almost all chemical and biophysical events in living organisms. Their native structures in turn enable their biological functions, which have a fundamental role in drug design. This motivates predicting the structure of a protein from its sequence of amino acids, a fundamental problem in computational biology. In this work, we demonstrate state-of-the-art protein structure prediction (PSP) results using embeddings and deep learning models for prediction of backbone atom distance matrices and torsion angles. We recover 3D coordinates of backbone atoms and reconstruct the full-atom protein by optimization. We create a new gold standard dataset of proteins which is comprehensive and easy to use. Our dataset consists of amino acid sequences, Q8 secondary structures, position specific scoring matrices, multiple sequence alignment co-evolutionary features, backbone atom distance matrices, torsion angles, and 3D coordinates. We evaluate the quality of our structure prediction by RMSD on the latest Critical Assessment of Techniques for Protein Structure Prediction (CASP) test data and demonstrate competitive results with the winning teams and AlphaFold in CASP13, and supersede the results of the winning teams in CASP12.