Artificial Intelligence (Clefs #69, November 2019)


Clefs #69, November 2019
ARTIFICIAL INTELLIGENCE: CONTEXT − TECHNOLOGIES − APPLICATION DOMAINS − PERSPECTIVES

POINT OF VIEW
AI, the programmer's new "holy grail"?
By Rodolphe Gelin, AI expert (Renault Paris)

"If the computing Cro-Magnon of the 1990s could tell himself that Moore's law would solve his performance problems, today's geek can no longer count on it much: the physics of microprocessors has already practically reached its limits."

The return to favor of artificial intelligence, and above all of the new techniques based on learning, was made possible by the explosion in the amount of available data and the exponential growth of computing power. These new orders of magnitude make it possible to put to work algorithms and methods that were unthinkable some thirty years ago. What was called a combinatorial explosion at the end of the 1980s now looks like no more than a damp combinatorial squib!

One of the domains where the evolution has been most spectacular is image processing. Compare the same exercise carried out by a student 30 years apart: recognizing the presence of a car in a photo. At the end of the 20th century, the student's program searched explicitly, among the pixels making up the image, for elements characteristic of a car: edges (boundaries between areas of different colors), roughly circular contours (wheels), then straighter ones (roof, doors, windows...), considering every possible orientation and position in the image, and then checking their geometric consistency. After several tens of minutes of computation, the machine delivered its verdict on whether or not the image contained a car.

Today, a learning-based image recognition system starts from an already-labeled database of 100,000 photos, each of which comes with a description of its content. From these, it learns by itself what characterizes the presence of a car in an image, trying every possible pattern over a patch of pixels. During the learning phase, the system uses 80,000 of these images to train a network made up of hundreds of thousands of neurons. The network's parameters (the "weights" of the neurons) are initialized to arbitrary values; it processes the images and reports, for each one, what it has recognized. Since the parameters were chosen at random, its answers are mostly wrong. The average of the errors made over these 80,000 examples is then computed and used to correct the values of the weights, and the process is rerun with the new parameter values; after a dozen or so iterations, the network correctly recognizes the 80,000 examples. The performance of this freshly trained network is then tested on the 20,000 remaining images. Since the "ground truth" is known for these images, the quality of the learning is evaluated automatically: if the network recognizes 99% of them correctly, it can be put to real work. This level of performance is generally not reached on the first attempt, though, so training is restarted under the same conditions, possibly a dozen times over, all in under 10 minutes thanks to today's computing resources. The 2019 student carried out in a few minutes more computation than the 1990 student could have done in a lifetime (had technology stopped there).
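In outline, the loop Gelin describes (split the labeled data, train by repeated weight corrections, evaluate on held-out images, retrain if needed) looks like the following sketch, here in Python with scikit-learn. The 80,000/20,000 split and the 99% target come from the article; the data and the model are illustrative placeholders, not Renault's system:

```python
# Minimal sketch of the train/evaluate cycle described above.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-in for 100,000 labeled photos: random features, car / no-car labels.
X = np.random.rand(100_000, 64)
y = np.random.randint(0, 2, size=100_000)

# 80,000 images to train on, 20,000 held out as "ground truth" test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=80_000, test_size=20_000, random_state=0)

# Weights start at arbitrary values; each pass averages the error over the
# training examples and corrects the weights (a dozen passes, as in the text).
model = MLPClassifier(hidden_layer_sizes=(512,), max_iter=10)
model.fit(X_train, y_train)

# Evaluate on the 20,000 held-out images; restart training if this is too low.
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.1%}")  # aim for ~99% before deployment
```

On random placeholder data the accuracy hovers around chance; the point is the shape of the workflow, not the number it prints.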
So would the work of the 21st-century computer scientist be much easier than that of his 20th-century colleague? Perhaps... But new industrial constraints bring new problems: the protection of data, or the "embeddability" of software that has to run, for example, on a car that does not always have fast access to remote computing resources, or the resources to run those computations locally. If the computing Cro-Magnon of the 1990s could tell himself that Moore's law would solve his performance problems, today's geek can no longer count on it much, since the physics of microprocessors has already practically reached its limits. Hence the emergence of new kinds of expertise while we wait for the next technological breakthrough. Like those who, in earlier times, could code assembly routines by hand to "save cycle time", the experts in neural network architecture design and in hardware-software co-design are being put to a severe test so that AI can become a reality in our daily lives. Yesterday as today, being a computer scientist remains a fine but hard profession!

CONTENTS
IN THIS ISSUE:
THE POINT OF VIEW OF RODOLPHE GELIN
CONTENTS
DEFINITION
CONTEXT
THE TECHNOLOGIES: The infrastructures; Embedded AI; Trustworthy AI; Algorithmics
THE APPLICATION DOMAINS: Introduction; Energy; Fundamental research (Biology / health; Climate and environment; Astrophysics; Nuclear physics); Computing
PERSPECTIVES: 4 questions for Yann LeCun, by Étienne Klein

DEFINITION
By Alexei Grinbaum (Fundamental Research Division)
Alexei Grinbaum is a physicist and philosopher. He works at the Research Laboratory on the Sciences of Matter (Institute for Research on the Fundamental Laws of the Universe, CEA).

Did you (really) say artificial intelligence?

According to the founders of cybernetics, among them John McCarthy and Marvin Minsky, the term "artificial intelligence" designates behavior produced by a machine which, had it been the fruit of a human action, could reasonably be judged to have required intelligence on the part of the agent concerned. The salient point to retain is that this definition rests on a comparison between the machine and the human. Indeed, well before chess-playing computers or machine translators, Alan Turing was already stressing that the very concept of an "intelligent machine" could only be defined through its confrontation with human behavior.

This definition covers a very broad spectrum: it includes, for example, the ability to find spelling errors in a text, something that now strikes us as entirely automatable. As usual, the development of digital technology constantly forces us to revise historical definitions, including that of AI. In a first sense, "artificial intelligence" designates a field of research on machines endowed with a capacity to learn and whose complex behavior can be neither fully described nor understood by their human designer. The operation of such a system cannot be reduced to choosing an action from a catalog written in advance, however long that catalog may be. From the standpoint of the history of computing, machine learning is only one AI tool among others but, in practice, the two terms are more and more often used as synonyms.

Three types of algorithms underpin AI systems: so-called supervised learning, unsupervised learning, and reinforcement learning. Each of these methods can be used on its own, but so-called deep learning algorithms employ them at different levels within a single system. This interlocking contributes further to making inconceivable, at least to this day, any rigorous mathematical description of what happens during deep learning.

The supervised learning technique presupposes that computer systems develop their behavior by following rules or indications dictated, or "labeled", by humans. Conversely, the unsupervised technique lets the machine explore its data without any "reading grid" being imposed on it. It often finds regularities there that bear little resemblance to notions familiar to humans: this is the mark of a non-human element in the behavior of these machines, which we nonetheless always measure against the human. It is also the element that gives AI systems their incredible effectiveness. In the case of unsupervised learning, that effectiveness can go as far as leaving the user unable to tell the difference: in 2019, unsupervised text generation proved capable of writing several paragraphs entirely indistinguishable from human production. In the visual domain, the recourse to the non-human and the non-explainable is even more fundamental: image recognition is much more effective when the operating rules are not dictated from the outset by humans but "discovered" by the system.
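The difference between the first two regimes can be made concrete in a few lines. A minimal sketch with scikit-learn, on synthetic placeholder data: the supervised model fits labels dictated by a human, while the unsupervised one is handed the same observations with no "reading grid" at all:

```python
# Sketch: the same data handled with and without human-supplied labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))            # observations
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # human-dictated "labels"

# Supervised: the machine fits a rule to categories imposed by people.
clf = LogisticRegression().fit(X, y)

# Unsupervised: no reading grid; the machine finds regularities of its own,
# whose cluster indices need not correspond to any human category.
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)
```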
The third learning method, so-called "reinforcement" learning, consists in identifying, through a series of successive evaluation steps, and then establishing with increasing strength, relevant correlations between the data. The latest research shows that the success of this method, widespread in the domain of games, often depends on the machine's "curiosity": its capacity to give substantial weight to the exploration of unknown scenarios or…

[Photo: Alan Mathison Turing (1912-1954)]

3. Instability of learning
Current techniques do not stand up well to several types of adversarial attacks. For good protection of AI systems, further research and new technical solutions remain necessary.
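The fragility the sidebar above points to can be illustrated with the fast gradient sign method (FGSM), one of the standard adversarial attacks: a perturbation too small for a human to notice is built from the model's own gradients and can flip its prediction. A sketch in PyTorch, where the trained classifier, the correctly classified input, and its label are all hypothetical stand-ins:

```python
# Sketch of the fast gradient sign method (FGSM). `model`, `image`
# (a batched tensor of pixels in [0, 1]) and `label` are assumed to
# exist: a trained classifier and an input it classifies correctly.
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.01):
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Step in the direction that most increases the loss: a tiny,
    # imperceptible change that can nonetheless flip the prediction.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()
```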
Recommended publications
• AI for Health: Examples from Industry, Government, and Academia
Guidehouse (guidehouse.com)
Abstract: This paper is directed toward a health-informed reader who is curious about the developments and potential of artificial intelligence (AI) in the health space, but could equally be read by AI practitioners curious about how their knowledge and methods are being used to advance human health. We present a brief, equation-free introduction to AI and its major subfields in order to provide a framework for understanding the technical context of the examples that follow. We discuss the various data sources available for questions of health and life sciences, as well as the unique challenges inherent to these fields. We then consider recent (past five years) applications of AI that have already had tangible, measurable impact on the advancement of biomedical knowledge and the development of new and improved treatments. These examples are organized by scale, ranging from the molecule (fundamental research and drug development) to the patient (diagnostics, risk-scoring, and personalized medicine) to the group (clinical trials and public health). Finally, we conclude with a brief summary and our outlook for the future of AI for health.
• Pre-training of Deep Bidirectional Protein Sequence Representations with Structural Information
Seonwoo Min, Seunghyun Park, Siwon Kim, Hyun-Soo Choi, Byunghan Lee, and Sungroh Yoon
Abstract: Bridging the exponentially growing gap between the numbers of unlabeled and labeled protein sequences, several studies adopted semi-supervised learning for protein sequence modeling. In these studies, models were pre-trained with a substantial amount of unlabeled data, and the representations were transferred to various downstream tasks.
• Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences
Alexander Rives, Siddharth Goyal, Joshua Meier, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus (bioRxiv preprint, https://doi.org/10.1101/622803)
Abstract: In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In biology, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Learning the natural distribution of evolutionary protein sequence variation is a logical step toward predictive and generative modeling for biology. To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million sequences spanning evolutionary diversity. The resulting model maps raw sequences to representations of biological properties without labels or prior domain knowledge. The learned representation space organizes sequences at multiple levels of biological granularity, from the biochemical to the proteomic level. Unsupervised learning recovers information about protein structure: secondary structure and residue-residue contacts can be identified by linear projections from the learned representations. Training language models on full sequence diversity rather than individual protein families increases recoverable information about secondary structure.
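"Identified by linear projections" means that a probe as simple as a logistic regression, trained on the frozen embeddings, can read off secondary-structure classes. A hypothetical sketch of such a linear probe, where the arrays stand in for real per-residue embeddings and labels and the sizes are illustrative:

```python
# Sketch of a linear probe over frozen per-residue embeddings, the kind
# of test behind "identified by linear projections". The arrays below
# are random placeholders for real embeddings and structure labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

n_residues, dim = 5_000, 1280                          # one vector per residue
embeddings = np.random.rand(n_residues, dim)           # placeholder embeddings
ss_labels = np.random.randint(0, 3, size=n_residues)   # helix / strand / coil

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, ss_labels, test_size=0.2)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))      # ~chance on random data
```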
• Machine Learning for Speech Recognition
By Alice Coucke, Head of Machine Learning Research at Snips (now Sonos), @alicecoucke
Outline: 1. Recent advances in machine learning; 2. From physics to machine learning and back; 3. Working at Snips (now Sonos). Slide notes (translated from French): AI in general, speech and NLP, startup life; one of the rare scientific fields where theory and practice are so entangled; a lot of applications in the real world, quite a difference from theoretical work.
Topics covered: reinforcement learning of goal-oriented behavior in simulated environments (AlphaGo for Go and AlphaStar for StarCraft II by DeepMind, OpenAI Five for Dota 2); play-driven learning for robots (sim-to-real dexterity learning at Google Brain, Project BLUE at UC Berkeley); machine learning for the life sciences (protein folding and structure prediction with DeepMind's AlphaFold, cardiac arrhythmia prediction from ECGs at Stanford, eye disease diagnosis by the NHS, UCL and DeepMind, reconstructing speech from neural activity at UCSF, limb control restoration by Battelle and Ohio State University); computer vision, i.e. high-level understanding of digital images and videos (GANs for image generation by Heriot-Watt University and DeepMind, artificial video dubbing by Synthesia, full-body synthesis by DataGrid, "common sense" understanding of actions in videos by TwentyBN, DeepMind, MIT, IBM and others); and a surge of interest in machine learning from the physics community.
• Artificial Intelligence and Cybersecurity: Technology, Governance and Policy Challenges
Final report of a CEPS Task Force. Rapporteurs: Lorenzo Pupillo, Stefano Fantin, Afonso Ferreira, and Carolina Polito. Centre for European Policy Studies (CEPS), Brussels, May 2021. ISBN 978-94-6138-785-1.
The Centre for European Policy Studies (CEPS) is an independent policy research institute based in Brussels. Its mission is to produce sound analytical research leading to constructive solutions to the challenges facing Europe today. Lorenzo Pupillo is CEPS Associate Senior Research Fellow and Head of the Cybersecurity@CEPS Initiative. Stefano Fantin is Legal Researcher at the Center for IT and IP Law, KU Leuven. Afonso Ferreira is Director of Research at CNRS. Carolina Polito is CEPS Research Assistant at the GRID unit, Cybersecurity@CEPS Initiative.
• Overview of Current State of Research on the Application of Artificial Intelligence Techniques for COVID-19
Vijay Kumar (National Institute of Technology, Hamirpur), Dilbag Singh and Manjit Kaur (Bennett University), and Robertas Damaševičius (Silesian University of Technology; Vytautas Magnus University)
Abstract: Background: To date, only a limited number of resources are available to predict and diagnose COVID-19 disease. The design of novel drug-drug interactions for COVID-19 patients is an open area of research, and the development of rapid COVID-19 testing kits remains a challenging task. Methodology: This review focuses on two prime challenges arising from the urgent need to address the COVID-19 pandemic, namely the development of COVID-19 classification tools and of drug discovery models for infected patients, with the help of artificial intelligence (AI) techniques such as machine learning and deep learning models. Results: Various AI-based techniques are studied and evaluated by applying them to the prediction and diagnosis of COVID-19 disease. The study provides recommendations for future research and facilitates knowledge collection and formation on the application of AI techniques for dealing with the COVID-19 epidemic and its consequences. Conclusions: AI techniques can be an effective tool for tackling the epidemic caused by COVID-19. They may be utilized in four main fields: prediction, diagnosis, drug design, and analysis of social implications for COVID-19 infected patients.
• TorsionNet: A Reinforcement Learning Approach to Sequential Conformer Search
Tarun Gogineni, Ziping Xu, Exequiel Punzalan, Runxuan Jiang, Joshua Kammeraad, Ambuj Tewari, and Paul Zimmerman (University of Michigan)
Abstract: Molecular geometry prediction of flexible molecules, or conformer search, is a long-standing challenge in computational chemistry. This task is of great importance for predicting structure-activity relationships for a wide variety of substances ranging from biomolecules to ubiquitous materials. Substantial computational resources are invested in Monte Carlo and molecular dynamics methods to generate diverse and representative conformer sets for medium to large molecules, which are yet intractable to chemoinformatic conformer search methods. We present TorsionNet, an efficient sequential conformer search technique based on reinforcement learning under the rigid rotor approximation. The model is trained via curriculum learning, whose theoretical benefit is explored in detail, to maximize a novel metric grounded in thermodynamics called the Gibbs Score. Our experimental results show that TorsionNet outperforms the highest-scoring chemoinformatics method by 4x on large branched alkanes, and by several orders of magnitude on the previously unexplored biopolymer lignin, with applications in renewable energy. TorsionNet also outperforms the far more exhaustive but computationally intensive self-guided molecular dynamics sampling method.
• Accurate Protein Structure Prediction by Embeddings and Deep Learning Representations (arXiv:1911.05531 [q-bio.BM], 9 Nov 2019)
Iddo Drori, Darshan Thaker, Arjun Srivatsa, Daniel Jeong, Yueqi Wang, Linyong Nan, Fan Wu, Dimitri Leggas, Jinhao Lei, Weiyi Lu, Weilong Fu, Yuan Gao, Sashank Karri, Anand Kannan, Antonio Moretti, Mohammed AlQuraishi, Chen Keasar, and Itsik Pe'er (Columbia University; Cornell University; Harvard Medical School; Ben-Gurion University)
Abstract: Proteins are the major building blocks of life, and actuators of almost all chemical and biophysical events in living organisms. Their native structures in turn enable their biological functions, which have a fundamental role in drug design. This motivates predicting the structure of a protein from its sequence of amino acids, a fundamental problem in computational biology. In this work, we demonstrate state-of-the-art protein structure prediction (PSP) results using embeddings and deep learning models for the prediction of backbone atom distance matrices and torsion angles. We recover 3D coordinates of backbone atoms and reconstruct the full-atom protein by optimization. We create a new gold standard dataset of proteins which is comprehensive and easy to use. Our dataset consists of amino acid sequences, Q8 secondary structures, position-specific scoring matrices, multiple sequence alignment co-evolutionary features, backbone atom distance matrices, torsion angles, and 3D coordinates. We evaluate the quality of our structure prediction by RMSD on the latest Critical Assessment of Techniques for Protein Structure Prediction (CASP) test data and demonstrate competitive results with the winning teams and AlphaFold in CASP13, and supersede the results of the winning teams in CASP12.
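One way to see how 3D coordinates can be recovered from a pairwise distance matrix, the step this abstract mentions, is classical multidimensional scaling, which solves the noise-free case in closed form. The paper itself recovers coordinates by optimization against noisy predicted distances, so the following is only an illustrative sketch of the underlying geometry:

```python
# Sketch: classical multidimensional scaling (MDS) recovers coordinates
# (up to rotation/reflection) from an exact pairwise distance matrix D.
# Real predicted matrices are noisy, so actual pipelines optimize instead.
import numpy as np

def coords_from_distances(D, dim=3):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # Gram matrix of centered coordinates
    eigval, eigvec = np.linalg.eigh(B)
    idx = np.argsort(eigval)[::-1][:dim]  # keep the top `dim` eigenpairs
    return eigvec[:, idx] * np.sqrt(np.maximum(eigval[idx], 0.0))

true_xyz = np.random.rand(10, 3)
D = np.linalg.norm(true_xyz[:, None] - true_xyz[None, :], axis=-1)
recovered = coords_from_distances(D)      # matches true_xyz up to rigid motion
```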
• Protein Data Analysis
Group 7: Arjun Dharma, Thomas Waldschmidt, Rahil Dedhia. Advisor: Dr. Peter Rose, Director of the Structural Bioinformatics Laboratory at SDSC. June 5, 2020.
Abstract: Deep learning transformer models such as Bidirectional Encoder Representations from Transformers (BERT) have been widely successful in a variety of natural language based tasks. Recently, BERT has been applied to protein sequences and has shown some success in protein prediction tasks relevant to biologists, such as secondary structure, fluorescence, and stability. To continue the investigation into BERT, we examined a new prediction task known as subcellular location, first described in DeepLoc (2017). Using BERT embeddings from a UC Berkeley research project titled Tasks Assessing Protein Embeddings (TAPE) as features for downstream modeling, we achieved a 67% test set accuracy using a support vector classifier for the 10-class classification task, and 89% using a Keras deep neural network for the binary classification task (membrane-bound vs. water-soluble protein). Next, we created a containerized Flask app using Docker which is deployable to AWS EC2 with the ability to run on a GPU. This service allows for embedding protein sequences using pretrained models, as well as providing an interface for visualizing the embedding space using principal component analysis and plotly.
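The downstream-modeling recipe the report describes (frozen protein embeddings in, a conventional classifier on top) is compact. A sketch with scikit-learn, where the arrays are placeholders for TAPE/BERT embeddings and the 10 DeepLoc location classes, not the report's actual data or API:

```python
# Sketch of the downstream recipe: frozen protein embeddings as features
# for a support vector classifier. Arrays are random placeholders for
# TAPE/BERT embeddings and the 10 subcellular-location classes.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X = np.random.rand(2_000, 768)             # one embedding vector per protein
y = np.random.randint(0, 10, size=2_000)   # 10 subcellular locations

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
print("10-class accuracy:", clf.score(X_te, y_te))
```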
• EBM-Fold: Fully-Differentiable Protein Folding Powered by Energy-Based Models (arXiv:2105.04771 [cs.LG])
Jiaxiang Wu, Shitong Luo, Tao Shen, Haidong Lan, Sheng Wang, and Junzhou Huang (Tencent AI Lab; Peking University)
Abstract: Accurate protein structure prediction from amino-acid sequences is critical to better understanding proteins' function. Recent advances in this area largely benefit from more precise inter-residue distance and orientation predictions, powered by deep neural networks. However, the structure optimization procedure is still dominated by traditional tools, e.g. Rosetta, where the structure is solved via minimizing a pre-defined statistical energy function (with optional prediction-based restraints). Such an energy function may not be optimal in formulating the whole conformation space of proteins. In this paper, we propose a fully-differentiable approach for protein structure optimization, guided by a data-driven generative network. This network is trained in a denoising manner, attempting to predict the correction signal from corrupted distance matrices between Cα atoms. Once the network is well trained, Langevin dynamics based sampling is adopted to gradually optimize structures from random initialization. Extensive experiments demonstrate that our EBM-Fold approach can efficiently produce high-quality decoys, compared against traditional Rosetta-based structure optimization routines.
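Langevin-dynamics sampling, the generation step this abstract relies on, has a compact generic form: repeatedly nudge a random initialization along the model's predicted correction signal plus Gaussian noise. A sketch in PyTorch, where score_fn is a hypothetical stand-in for the trained denoising network, not EBM-Fold's actual interface:

```python
# Generic sketch of Langevin-dynamics sampling as used in score/energy
# based generation; `score_fn` is a hypothetical stand-in for a trained
# network that predicts the correction signal for a corrupted input.
import torch

def langevin_sample(score_fn, shape, steps=1000, step_size=1e-4):
    x = torch.randn(shape)                 # random initialization
    for _ in range(steps):
        noise = torch.randn_like(x)
        # Gradient step toward higher model probability, plus injected noise.
        x = x + step_size * score_fn(x) + (2 * step_size) ** 0.5 * noise
    return x
```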
• Deep Learning and Generative Methods in Cheminformatics and Chemical Biology: Navigating Small Molecule Space Intelligently
Douglas B. Kell, Soumitra Samanta, and Neil Swainston. Biochemical Journal (2020) 477, 4559-4580. https://doi.org/10.1042/BCJ20200781
Abstract: The number of 'small' molecules that may be of interest to chemical biologists (chemical space) is enormous, but the fraction that have ever been made is tiny. Most strategies are discriminative, i.e. have involved 'forward' problems (have molecule, establish properties). However, we normally wish to solve the much harder generative or inverse problem (describe desired properties, find molecule). 'Deep' (machine) learning based on large-scale neural networks underpins technologies such as computer vision, natural language processing, driverless cars, and world-leading performance in games such as Go; it can also be applied to the solution of inverse problems in chemical biology. In particular, recent developments in deep learning admit the in silico generation of candidate molecular structures and the prediction of their properties, thereby allowing one to navigate (bio)chemical space intelligently. These methods are revolutionary but require an understanding of both (bio)chemistry and computer science to be exploited to best advantage. We give a high-level (non-mathematical) background to the deep learning revolution, and set out the crucial issue for chemical biology and informatics as a two-way mapping from the discrete nature of individual molecules to the continuous but high-dimensional latent representation that may best reflect chemical space.
• Deep Learning-Based Advances in Protein Structure Prediction
Subash C. Pakhrin, Bikash Shrestha, Badri Adhikari, and Dukka B. KC (Wichita State University; University of Missouri-St. Louis). International Journal of Molecular Sciences, Review.
Abstract: Obtaining an accurate description of protein structure is a fundamental step toward understanding the underpinnings of biology. Although recent advances in experimental approaches have greatly enhanced our capabilities to experimentally determine protein structures, the gap between the number of protein sequences and known protein structures is ever increasing. Computational protein structure prediction is one of the ways to fill this gap. Recently, the protein structure prediction field has witnessed a lot of advances due to deep learning (DL)-based approaches, as evidenced by the success of AlphaFold2 in the most recent Critical Assessment of protein Structure Prediction (CASP14). In this article, we highlight important milestones and progress in the field of protein structure prediction due to DL-based methods, as observed in CASP experiments. We describe advances in various steps of the protein structure prediction pipeline: protein contact map prediction, protein distogram prediction, protein real-valued distance prediction, and quality assessment/refinement. We also highlight some end-to-end DL-based approaches for protein structure prediction. Additionally, as there have been some recent DL-based advances in protein structure determination using cryo-electron microscopy (cryo-EM), we also highlight some of the important progress…