Ewan Birney (Tweetable Until the Last Bit – I’ll Point It Out)

Information in the Lifesciences… The Science of the 21st Century
…and why it needs infrastructure

Ewan Birney (tweetable until the last bit – I’ll point it out)

EBI is an Outstation of the European Molecular Biology Laboratory.

Outline of the talk
• Who am I?
• Genomics 2000-2012
• HapMap, 1000 Genomes, GWAS, ENCODE
• Route into medicine
• Why we need an infrastructure
• (Some more whimsical uses of DNA…)

Who am I?
• Associate Director at European Bioinformatics Institute (EBI)
• Involved in genomics since I was 19 (almost 20 years!)
• Trained as a biochemist – most people think I am CS
• Analysed – sometimes led – human/mouse/rat/platypus etc genomes
• Lead the analysis of ENCODE
EBI is in Hinxton, South Cambridgeshire. EBI is part of EMBL, ~like CERN for molecular biology.

Molecular Biology
• The study of how life works – at a molecular level
• Key molecules:
  • DNA – Information store (Disk)
  • RNA – Key information transformer, also does stuff (RAM)
  • Proteins – The business end of life (Chip, robotic arms)
  • Metabolites – Fuel and signalling molecules (electricity)
• Theories of how these interact – no theories to predict what they are
• Instead we determine attributes of molecules and store them in globally accessible, open databases

Crash Course in DNA
DNA is a covalently linked polymer, nearly always found in anti-parallel, non-covalent pairs
We represent it as strings, not worrying about one of the pair of polymers

>6 dna:chromosome chromosome:GRCh37:6:133017695:133161157:1
GCAGCAAGACAGAAGTGACTCATACATACAAGGGATCCCCAATAAGATTATCGGCAGATT
TCTCATCAATAACTTTGGAGACCACAAAGCATTGAGCTGATATATTTAAAGTACTGAAAG
AAAAAAAAATCTGACAACCAAGAATTCTATATCCATCAGAACTGCCCTTCAAAAGGGAGG
GAGAAATGAAGACATTCTCAGATTTGAGAAGAAAGGAAAGAGAGAAGGGAGGGGAGGGGA
GAGGAGGGGAGGGGAGGAGAGGAGAGGAGAGGGCACAGTGGCTCACGCCTGTAATCCTAG
CACTTTGCAAGACTGAGGCCAGTGGAACACCTGAGGTCAGGAGATCGAGACCATCCTGGC
TAACACGGTGAAACCCCGTCTCCACTAAAAATACAAAAAATTAGCCAGGCGTGGTGGCAG
GTGCCTGTAGTTCCAGCTACTCAGGAGGCTGAGGCAGCAGAATGGCGTGAACTCGGGAGG
TGGAGCTTGCAGTGAGCTGAGATTGCGCCCCTGCACTCCAGCCTGGGTGACAGAGTGAGA
CTCTGTCTCAAAAAAATAAAAAGTTTAAAAATATTTTAAAAAAAGAAAGAAAGAAGGGAG

1 monomer is called a “base pair” – bp

We can routinely determine small parts of DNA
1977-1990 – 500 bp, manual tracking
1990-2000 – 500 bp, computational tracking, 1D, “capillary”
2005-2012 – 20-100 bp, 2D systems (“2nd Generation” or NGS)
2012 - ?? – >5 kb, real time (“3rd Generation”)
Fred Sanger, inventor of terminator DNA sequencing

Costs have come exponentially down
Volumes have gone up!
[Figure: bases (log scale, 1E+05 to 1E+14) against date (1980 to 2015); series: capillary reads, assembled sequences, next gen. reads. Annotations: similar explosion of image based methods; see blog posts about da specific compression]

A genome is all our DNA
Every cell has two copies of 3e9 bp (one from mum, one from dad) in 24 polymers (“chromosomes”)
E. coli: 4e6, Yeast: 12e6, Medaka: 0.9e9, White Pine: 20e9

Human Genome project
• 1989 – 2000 – sequencing the human genome
• Just 1 “individual” – actually a mosaic of about 24 individuals but as if it was one
• Old school technologies
• A bit epic
• Now
  • Same data volume generated in ~3 mins in a current large scale centre
  • It’s all about the analysis

What happened next?
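As an aside on the crash course slides, the string representation can be made concrete. The sketch below is an illustration only and is not taken from the talk: it reads the header and first sequence line of the FASTA record shown on the slide, then derives the anti-parallel partner strand, which is why only one of the two polymers needs to be written down. The helper names `parse_fasta` and `reverse_complement` are hypothetical, and the code assumes upper-case A, C, G and T only.

```python
# Complement table for the four DNA bases (upper-case only in this sketch).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the partner strand, read in its own 5'->3' direction."""
    # Complement every base, then reverse, because the strands are anti-parallel.
    return seq.translate(COMPLEMENT)[::-1]


def parse_fasta(text: str):
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")
    # Drop the leading '>' and join the wrapped sequence lines into one string.
    return lines[0][1:], "".join(lines[1:])


# Header and first sequence line of the record shown on the slide.
record = """>6 dna:chromosome chromosome:GRCh37:6:133017695:133161157:1
GCAGCAAGACAGAAGTGACTCATACATACAAGGGATCCCCAATAAGATTATCGGCAGATT
"""

header, seq = parse_fasta(record)
first_bases = seq[:10]                           # "GCAGCAAGAC"
other_strand = reverse_complement(first_bases)   # "GTCTTGCTGC"
print(header)
print(first_bases, other_strand)
```

Applying `reverse_complement` twice returns the original string, so storing one strand loses no information about the pair.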
We looked into human variation
3 in 10,000 bases between any two individuals are different (a bit more between Africans)
The similarity of a European to an African (any population) is only marginally smaller than European to European (2 or 3%).
Only a minute amount of DNA is unique to any population

… and associate this with traits or disease
(You can infer the majority of the genome by knowing a base about 1 every 5,000 to 10,000 bases – the experiments to look at this density are far cheaper than sequencing)

ENCODE Dimensions
• Cells: 182 cell lines/tissues
• Methods/Factors: 164 assays (114 different ChIP)
• 3,010 experiments
• 5 TeraBases, 1716x of the Human Genome

ENCODE Uniform Analysis Pipeline
Anshul Kundaje, Qunhua Li, Michael Hoffman, Jason Ernst, Joel Rozowsky, Pouya Kheradpour
[Figure: pipeline diagram with the following components]
• Mapped reads from production (BAM)
• Uniform Peak Calling Pipeline (SPP, PeakSeq)
• Signal Generation (read extension and mappability correction)
• IDR Processing, QC and Blacklist Filtering (Rep1 against Rep2: good reproducibility, poor reproducibility)
• Segmentation (ChromHMM/Segway)
• Self Organising Maps
• Motif Discovery
• Signal Aggregation over peaks
• Stats, GSC enrichments, etc.

Discovering functional genome segments
Michael Hoffman, Jason Ernst, Bill Noble, Manolis Kellis
Well understood: TSS, Gene Start, Gene Bodies
Reassuringly Interesting: “Enhancers” (2 states), Insulators
Definitely There, Unexpected: Specific Gene End
~7 Major flavours of genome
Sub-classification of Repeats
25 “elaborations”
1,000s of details

Impact on Medicine
3 big areas of impact for medicine
• Germ line risk to disease
• “Precision” cancer medicine
• Pathogens + hospital acquired infections

Germ Line impact
• Everyone has differential risk of disease
• But the shift in risk is small
• Perhaps 1 to 2% have a striking change in risk to a serious disease (>10 fold) which is “actionable”
• This goes up to 3-4% if you count some less clinically worrying diseases
1:500 people have HCM
1:500 people have FH

Precision cancer diagnosis
• Cancer is a genomic disease
• By sequencing a cancer you can understand its molecular form better
• Particular molecular forms respond to particular drugs

Pathogens
• Sequencing provides a clear cut diagnosis of pathogens
• Can also be used to sequence environments (eg, hospitals)
• Immune systems for hospitals

Why we need an infrastructure…

Infrastructure components for ENCODE
• ~300 TB of space
  • (small compared to 1,000 genomes/Cancer)
• >5 years of CPU
  • (EBI 10,000 core farm critical in turn around times)
• High bandwidth connectivity + Aspera
  • Easier to connect Seattle and Santa Cruz via EBI (!)

Infrastructures are critical…
But we only notice them when they go wrong

Biology already needs an information infrastructure
• For the human genome
  • (…and the mouse, and the rat, and… x 150 now, 1000 in the future!) – Ensembl
• For the function of genes and proteins
  • For all genes, in text and computational – UniProt and GO
• For all 3D structures
  • To understand how proteins work – PDBe
• For where things are expressed
  • The differences and functionality of cells – Atlas

..But this keeps on going…
• We have to scale across all of (interesting) life
  • There are a lot of species out there!
• We have to handle new areas, in particular medicine
  • A set of European haplotypes for good imputation
  • A set of actionable variants in germline and cancers
• We have to improve our chemical understanding
  • Of biological chemicals
  • Of chemicals which interfere with Biology

How?
Fully Centralised
Pros: Stability, reuse, learning ease
Cons: Hard to concentrate expertise across life science; geographic, language placement; bottlenecks and lack of diversity
Fully Distributed
Pros: Responsive; geographic, language responsive
Cons: Internal communication overhead; harder for end users to learn; harder to provide multi-decade stability

How?
Robust network with a strong hub
[Figure: network diagram; labels: Node, “Domain Specific hub”, “National”, General Hub (EBI)]

ELIXIR’s mission
To build a sustainable European infrastructure for biological information, supporting life science research and its translation to:
• medicine
• environment
• bioindustries
• society

Overlapping Networks

Other infrastructures needed for biology
• EuroBioImaging
  • Cellular and whole organism Imaging
• BioBanks (BBMRI)
  • We need numbers – European populations – in particular for rare diseases, but also for specific sub types of common disease
• Mouse models and phenotypes (Infrafrontier)
  • A baseline set of knockouts and phenotypes in our most tractable mammalian model
  • (it’s hard to prove something in human)
• Robust molecular assays in a clinical setting (EATRIS)
  • The ability to reliably use state of the art molecular techniques in a clinical research setting

And… just for fun… (and don’t tweet this bit)

Over a beer…
Ha! At some point all the data we store is going to be DNA…
Of course, the cost effective way to store this would be as DNA…
1 g == 1 PB (with redundancy)
Scaleable? Cost effective?

ENCODE Authors
Ian Dunham,
*** University of Albany SUNY Group ***
Scott A. Tenenbaum (5), Luiz O. Penalva (84), Francis Doyle (5).
*** HudsonAlpha Institute, Caltech, Stanford Group ***
Devin M. Absher (14), Henry Amrhein (59), Michael Anaya (42), Anita Bansal (14), Serafim Batzoglou (2), Kevin M. Bowling (14), Marie K. Cross (14), Nicholas S. Davis (14), Tracy Eggleston (14), Clarke Gasper (42), Jason Gertz (14), DeSalvo Gilberto (59), Chris Gunter (14), Preti Jain (14), Brandon King (59), Anshul Kundaje (2), Shawn E. Levy (14), Max W. Libbrecht (2), Georgi K. Marinov (59), Kenneth McCue (59), Sarah K. Meadows (14), Ali Mortazavi (30), Michael A. Muratet (14), Richard M. Myers (14), Amy S. Nesmith (14), J. Scott Newberry (14), Kimberly M. Newberry (14), Stephanie L. Parker (14), E. Christopher Partridge (14), Florencia Pauli (14), Shirley Pepke (60), Barbara Pusey (14), Timothy E. Reddy (14), Arend
*** University of Chicago, Stanford Group ***
Subhradip Karmakar (41), Stephen G. Landt (13), Raj R. Bhanvadia (41), Alina