Opportunities and Obstacles for Deep Learning in Biology and Medicine

A DOI-citable preprint of this manuscript is available at https://doi.org/10.1101/142760.

This manuscript was automatically generated from greenelab/deep-review@b3b57d3 on January 19, 2018.

Authors

Travers Ching1, Daniel S. Himmelstein2, Brett K. Beaulieu-Jones3, Alexandr A. Kalinin4, Brian T. Do5, Gregory P. Way2, Enrico Ferrero6, Paul-Michael Agapow7, Michael Zietz2, Michael M. Hoffman8,9,10, Wei Xie11, Gail L. Rosen12, Benjamin J. Lengerich13, Johnny Israeli14, Jack Lanchantin15, Stephen Woloszynek12, Anne E. Carpenter16, Avanti Shrikumar17, Jinbo Xu18, Evan M. Cofer19,20, Christopher A. Lavender21, Srinivas C. Turaga22, Amr M. Alexandari17, Zhiyong Lu23, David J. Harris24, Dave DeCaprio25, Yanjun Qi15, Anshul Kundaje17,26, Yifan Peng23, Laura K. Wiley27, Marwin H.S. Segler28, Simina M. Boca29, S. Joshua Swamidass30, Austin Huang31, Anthony Gitter32,33,†, Casey S. Greene2,†

Author order was determined with a randomized algorithm.
†: To whom correspondence should be addressed: [email protected] (A.G.) and [email protected] (C.S.G.)

1. Molecular Biosciences and Bioengineering Graduate Program, University of Hawaii at Manoa, Honolulu, HI
2. Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA
3. Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA
4. Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI
5. Harvard Medical School, Boston, MA
6. Computational Biology and Stats, Target Sciences, GlaxoSmithKline, Stevenage, United Kingdom
7. Data Science Institute, Imperial College London, London, United Kingdom
8. Princess Margaret Cancer Centre, Toronto, ON, Canada
9. Department of Medical Biophysics, Toronto, ON, Canada
10. Department of Computer Science, Toronto, ON, Canada
11. Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN
12. Ecological and Evolutionary Signal-processing and Informatics Laboratory, Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA
13. Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
14. Biophysics Program, Stanford University, Stanford, CA
15. Department of Computer Science, University of Virginia, Charlottesville, VA
16. Imaging Platform, Broad Institute of Harvard and MIT, Cambridge, MA
17. Department of Computer Science, Stanford University, Stanford, CA
18. Toyota Technological Institute at Chicago, Chicago, IL
19. Department of Computer Science, Trinity University, San Antonio, TX
20. Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ
21. Integrative Bioinformatics, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC
22. Howard Hughes Medical Institute, Janelia Research Campus, Ashburn, VA
23. National Center for Biotechnology Information and National Library of Medicine, National Institutes of Health, Bethesda, MD
24. Department of Wildlife Ecology and Conservation, University of Florida, Gainesville, FL
25. ClosedLoop.ai, Austin, TX
26. Department of Genetics, Stanford University, Stanford, CA
27. Division of Biomedical Informatics and Personalized Medicine, University of Colorado School of Medicine, Aurora, CO
28. Institute of Organic Chemistry, Westfälische Wilhelms-Universität Münster, Münster, Germany
29. Innovation Center for Biomedical Informatics, Georgetown University Medical Center, Washington, DC
30. Department of Pathology and Immunology, Washington University in Saint Louis, Saint Louis, MO
31. Department of Medicine, Brown University, Providence, RI
32. Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI
33. Morgridge Institute for Research, Madison, WI

Abstract

Deep learning, which describes a class of machine learning algorithms, has recently shown impressive results across a variety of domains. Biology and medicine are data rich, but the data are complex and often ill-understood. Problems of this nature may be particularly well-suited to deep learning techniques. We examine applications of deep learning to a variety of biomedical problems (patient classification, fundamental biological processes, and treatment of patients) and discuss whether deep learning will transform these tasks or whether the biomedical sphere poses unique challenges.
We find that deep learning has yet to revolutionize or definitively resolve any of these problems, but promising advances have been made on the prior state of the art. Even when improvement over a previous baseline has been modest, we have seen signs that deep learning methods may speed or aid human investigation. More work is needed to address concerns related to interpretability and how to best model each problem. Furthermore, the limited amount of labeled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning powering changes at both bench and bedside with the potential to transform several areas of biology and medicine.

Introduction to deep learning

Biology and medicine are rapidly becoming data-intensive. A recent comparison of genomics with social media, online videos, and other data-intensive disciplines suggests that genomics alone will equal or surpass other fields in data generation and analysis within the next decade [1]. The volume and complexity of these data present new opportunities, but also pose new challenges. Automated algorithms that extract meaningful patterns could lead to actionable knowledge and change how we develop treatments, categorize patients, or study diseases, all within privacy-critical environments.

The term deep learning has come to refer to a collection of new techniques that, together, have demonstrated breakthrough gains over existing best-in-class machine learning algorithms across several fields. For example, over the past five years these methods have revolutionized image classification and speech recognition due to their flexibility and high accuracy [2]. More recently, deep learning algorithms have shown promise in fields as diverse as high-energy physics [3], dermatology [4], and translation among written languages [5].
Across fields, “off-the-shelf” implementations of these algorithms have produced comparable or higher accuracy than previous best-in-class methods that required years of extensive customization, and specialized implementations are now being used at industrial scales.

Deep learning approaches grew from research in neural networks, which were first proposed in 1943 [6] as a model for how our brains process information. The history of neural networks is interesting in its own right [7]. In neural networks, inputs are fed into the input layer, which feeds into one or more hidden layers, which eventually link to an output layer. A layer consists of a set of nodes, sometimes called “features” or “units,” which are connected via edges to the immediately earlier and the immediately deeper layers. In some special neural network architectures, nodes can connect to themselves with a delay. The nodes of the input layer generally consist of the variables being measured in the dataset of interest; for example, each node could represent the intensity value of a specific pixel in an image or the expression level of a gene in a specific transcriptomic experiment.

The neural networks used for deep learning have multiple hidden layers. Each layer essentially constructs features from the outputs of the layer before it. The training process used often allows layers deeper in the network to contribute to the refinement of earlier layers. For this reason, these algorithms can automatically engineer features that are suitable for many tasks and customize those features for one or more specific tasks.

Deep learning does many of the same things as more familiar machine learning approaches. In particular, deep learning approaches can be used both in supervised applications, where the goal is to accurately predict one or more labels or outcomes
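
The layered structure described in this introduction (an input layer feeding one or more hidden layers, which link to an output layer) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the method of any work cited here: the layer sizes, the random weights, the ReLU and sigmoid activation functions, and the input values are all hypothetical choices made for the example, and no training is performed.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Create one fully connected layer: one weight per edge, one bias per node."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def relu(x):
    """Hidden-layer activation (an assumption for this sketch)."""
    return max(0.0, x)


def sigmoid(x):
    """Output activation that maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, layers):
    """Feed inputs through each layer in turn; return the activations of every layer."""
    activations = [list(inputs)]
    for index, (weights, biases) in enumerate(layers):
        is_output = index == len(layers) - 1
        activation_fn = sigmoid if is_output else relu
        previous = activations[-1]
        # Each node sums its weighted incoming edges, adds a bias, then applies
        # the activation function.
        current = [
            activation_fn(sum(w * a for w, a in zip(node_weights, previous)) + b)
            for node_weights, b in zip(weights, biases)
        ]
        activations.append(current)
    return activations


rng = random.Random(0)  # fixed seed so the sketch is reproducible
layer_sizes = [4, 3, 3, 1]  # input layer, two hidden layers, output layer
layers = [
    init_layer(n_in, n_out, rng)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]

# Hypothetical input: expression levels of four genes in one experiment.
expression_levels = [0.2, 1.5, 0.0, 0.7]
activations = forward(expression_levels, layers)
prediction = activations[-1][0]
```

Here each input node holds one measured variable, each hidden layer builds new features from the outputs of the layer before it, and the single output node yields a score between 0 and 1. A real application would learn the weights from labeled data rather than drawing them at random.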