Challenges in Funding and Developing Genomic Software: Roots and Remedies Adam Siepel

Total Page:16

File Type:pdf, Size:1020Kb

Challenges in Funding and Developing Genomic Software: Roots and Remedies Adam Siepel Siepel Genome Biology (2019) 20:147 https://doi.org/10.1186/s13059-019-1763-7 OPINION Open Access Challenges in funding and developing genomic software: roots and remedies Adam Siepel in 2019 [6]. There were no formal mechanisms at the time Abstract for grant applications for scientific research. Instead, The computer software used for genomic analysis has William Herschel simply approached the King directly become a crucial component of the infrastructure for with a request for royal patronage. The 40-ft tele- life sciences. However, genomic software is still scope is one of the earliest examples of government typically developed in an ad hoc manner, with investment in the infrastructure for scientific research, inadequate funding, and by academic researchers not to enable a project that simply would not have been trained in software development, at substantial costs possible with private funds alone. to the research community. I examine the roots of the The model of government investment in scientific in- incongruity between the importance of and the frastructure became increasingly well-established degree of investment in genomic software, and I throughout the 19th and 20th centuries, culminating in suggest several potential remedies for current the “Big Science” of the World War II and Post-War problems. As genomics continues to grow, new eras. Science in modern times has been dominated, in strategies for funding and developing the software many ways, by these massive public investments. Prom- that powers the field will become increasingly inent examples include the Manhattan project (equiva- essential. lent to $22 billion in 2016 [7, 8]), the Apollo program (equivalent to $107 billion [9]), the Space Shuttle pro- gram (equivalent to $219 billion [10]), the Large Hadron A traveler in late-eighteenth-century England who passed Collider (equivalent $4.8 billion [11]) and, more pertin- through the town of Slough—located just west of London ent to this article, the Human Genome Project (equiva- and not far from present-day Heathrow Airport—might lent to $5.0 billion [12]; Fig. 1b). have come upon a massive 40-ft-long telescope, sus- Indeed, we now live in a world where much of the day- pended in a wooden frame more than 50 ft tall (Fig. 1a). to-day work in science depends on a publicly funded in- The telescope was located at the home of William frastructure. In particular, many working in genomics rely Herschel and his sister Caroline, two of the greatest heavily on data sets such as those generated by the astronomers of their day. It was the largest telescope ENCODE, Roadmap Epigenomics, 1000 Genomes, in the world until it was dismantled in 1839. Weigh- Genotype-Tissue Expression (GTEx), The Cancer Genome ing over 1000 lbs., the “40-ft telescope”,asitwas Atlas (TCGA), Genetic European Variation in Disease known, was sufficiently impressive to the general pub- (GEUVADIS), and most recently, Human Cell Atlas pro- lic to emerge as a regional tourist attraction. Its auda- jects. We store and search sequence data using GenBank, cious scale inspired prominent thinkers and writers of EMBL-Bank, DDBJ, UniProt, and Pfam, examine three- the time, including Erasmus Darwin and William dimensional protein structures in PDB or EMDB, scour Blake [1, 5]. the literature using PubMed, and view genomic annota- The 40-ft telescope took 5 years to build and was paid tions using the UCSC Genome Browser and Ensembl for by a grant of £4000 from King George III, who was Browser. All these resources have been maintained for de- strongly committed to scientific research throughout his cades either directly by government agencies or through reign. This grant represented a substantial sum at the long-term public funding to universities and research time, roughly equivalent to £600,000 (about US$800,000) institutes. Correspondence: [email protected] The computer software on which millions of scientists Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold rely for genomic analysis is no less an essential part of the Spring Harbor, NY 11724, USA © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Siepel Genome Biology (2019) 20:147 Page 2 of 14 Fig. 1 a William and Caroline Herschel’s 40-foot telescope in Slough, England, 1789 [1]. b Two of the teams of scientists that contributed to the Human Genome Project: (top) Sanger Center, Hinxton, UK [2]; (bottom) Washington University Genome Sequencing Center, St. Louis, USA, both circa 2000. c Some relics of the pre-Internet world: (clockwise from top left) the author learning to program in BASIC on a Commodore 64 in a cold upstate New York basement, 1983 (credit: Virginia Siepel), as one of the millions of children who were introduced to home computers during the 1980s and 1990s, some of whom would go on to write much of the software that powers genomics today; floppy disk for PAUP version 3.1.1,©1993; Sun Microsystems SPARCstation 1 with Mosaic web browser faintly visible on screen, 1994 [3]; screen shot from the MASE alignment program [4]. d Prof. David Haussler of UC Santa Cruz with the original Dell computer cluster that his team used to assemble the human genome, 2000. Photo (c) UC Santa Cruz, used with permission infrastructure of biological research than large shared data arrive at this odd situation? Why is scientific software sup- sets or public databases, yet the model for funding and de- ported differently from other forms of scientific infrastruc- veloping computer software differs substantially. Most ture? Why are adequate funds not set aside for this widely used genomic software is developed by independ- important work? ent investigators working in academic or not-for-profit in- In this article, I offer my perspective on the unique stitutions with support from government grants. This problem of funding and developing software for genom- software is generally freely available to the community, ics, based on my 25 years in the field—as a developer typically with no subscription or licensing fees and nonre- and user of software, a professional programmer and strictive terms of use. At the same time, it is often mea- principal investigator, an applicant for and reviewer of gerly funded, unreliable, hard-to-use, poorly documented, grant proposals, and an employee of government, uni- and/or poorly supported. How did we, as a community, versity, and private research institutions. I first examine Siepel Genome Biology (2019) 20:147 Page 3 of 14 what makes genomic software development unusual and suggesting approximately $16 billion is spent on basic how the field has come to be the way it is. Overall, I life sciences research [153]). If even 10% of these funds argue that, despite some important strengths of our are devoted to projects that rely in part on genomics current model for software development, we as a com- and genomic software, which seems plausible, then this munity have “painted ourselves into a corner” in terms software would be instrumental in supporting more than of developing robust, well-engineered software and are a billion dollars per year in research. Furthermore, total paying for it; we are, in a sense, addicted to free soft- R&D expenditures in the US are estimated at about four ware. Finally, I suggest some possible remedies that at- times those of the federal government, and scaling up to tempt to strike a balance between addressing important worldwide R&D expenditures requires about another deficiencies in the current model and maintaining its factor of three. (Total R&D expenditures in the US, in- core strengths. My discussion of these topics necessarily cluding those in the private sector and at other govern- have a US bias, but I believe that many of my points are mental levels, are estimated at about $500 billion internationally valid. Also, although this article focuses annually. The US leads the world in spending on sci- on genomics, similar trends occur in other areas of com- ence, but China is not far behind, and several other putational biology, such as structural biology and prote- countries—including Japan, Germany, South Korea, omics, as well as in some other areas of scientific France, and the UK—also account for substantial software development. amounts. Together, the top ten countries spend about $1.5 trillion per year on R&D [154]). Therefore, a rough Software for genomics is critical to the research calculation suggests that the worldwide research that de- infrastructure for the life sciences pends, at least in part, on genomic software is likely to During the past 25 years, genomic software development cost tens of billions of dollars annually. has grown from an obscure cottage industry to an essen- tial part of the infrastructure of biological research. Re- Software for genomics lacks a sustainable model searchers across the globe rely on computational tools for development and maintenance for read mapping, genome assembly, multiple alignment, Despite the overwhelming importance of genomic soft- phylogenetics, population genomics, and visualization of ware, there is broad agreement among practitioners that genomic data, among many other applications. Import- the current model for its development has serious flaws. antly, these tools are no longer used only by genomic As noted above, most genomic software is developed by specialists, but across all the life sciences, including dis- academic groups and funded by government grants, yet parate fields such as ecology and evolution, molecular there are relatively few dedicated granting opportunities and cell biology, clinical genetics, plant breeding, bio- for genomic software development, and those that exist physics, and bioengineering.
Recommended publications
  • Evidence of Selection at the Ramosa1 Locus During Maize Domestication
    Molecular Ecology (2010) 19, 1296–1311 doi: 10.1111/j.1365-294X.2010.04562.x Evidence of selection at the ramosa1 locus during maize domestication BRANDI SIGMON and ERIK VOLLBRECHT Department of Genetics, Development, and Cell Biology, Iowa State University, Ames, IA 50011, USA Abstract Modern maize was domesticated from Zea mays parviglumis, a teosinte, about 9000 years ago in Mexico. Genes thought to have been selected upon during the domestication of crops are commonly known as domestication loci. The ramosa1 (ra1) gene encodes a putative transcription factor that controls branching architecture in the maize tassel and ear. Previous work demonstrated reduced nucleotide diversity in a segment of the ra1 gene in a survey of modern maize inbreds, indicating that positive selection occurred at some point in time since maize diverged from its common ancestor with the sister species Tripsacum dactyloides and prompting the hypothesis that ra1 may be a domestication gene. To investigate this hypothesis, we examined ear phenotypes resulting from minor changes in ra1 activity and sampled nucleotide diversity of ra1 across the phylogenetic spectrum between tripsacum and maize, including a broad panel of teosintes and unimproved maize landraces. Weak mutant alleles of ra1 showed subtle effects in the ear, including crooked rows of kernels due to the occasional formation of extra spikelets, correlating a plausible, selected trait with subtle variations in gene activity. Nucleotide diversity was significantly reduced for maize landraces but not for teosintes, and statistical tests implied directional selection on ra1 consistent with the hypothesis that ra1 is a domestication locus. In maize landraces, a noncoding 3¢-segment contained almost no genetic diversity and 5¢-flanking diversity was greatly reduced, suggesting that a regulatory element may have been a target of selection.
    [Show full text]
  • Computational Methods Addressing Genetic Variation In
    COMPUTATIONAL METHODS ADDRESSING GENETIC VARIATION IN NEXT-GENERATION SEQUENCING DATA by Charlotte A. Darby A dissertation submitted to Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy Baltimore, Maryland June 2020 © 2020 Charlotte A. Darby All rights reserved Abstract Computational genomics involves the development and application of computational meth- ods for whole-genome-scale datasets to gain biological insight into the composition and func- tion of genomes, including how genetic variation mediates molecular phenotypes and disease. New biotechnologies such as next-generation sequencing generate genomic data on a massive scale and have transformed the field thanks to simultaneous advances in the analysis toolkit. In this thesis, I present three computational methods that use next-generation sequencing data, each of which addresses the genetic variations within and between human individuals in a different way. First, Samovar is a software tool for performing single-sample mosaic single-nucleotide variant calling on whole genome sequencing linked read data. Using haplotype assembly of heterozygous germline variants, uniquely made possible by linked reads, Samovar identifies variations in different cells that make up a bulk sequencing sample. We apply it to 13cancer samples in collaboration with researchers at Nationwide Childrens Hospital. Second, scHLAcount is a software pipeline that computes allele-specific molecule counts for the HLA genes from single-cell gene expression data. We use a personalized reference genome based on the individual’s genotypes to reveal allele-specific and cell type-specific gene expression patterns. Even given technology-specific biases of single-cell gene expression data, we can resolve allele-specific expression for these genes since the alleles are often quite different between the two haplotypes of an individual.
    [Show full text]
  • Big Data, Moocs, and ... (PDF)
    HHMI Constellation Studios for Science Education November 13-15, 2015 | HHMI Headquarters | Chevy Chase, MD Big Data, MOOCs, and Quantitative Education for Biologists Co-Chairs Pavel Pevzner, University of California- San Diego Sarah Elgin, Washington University Studio Objectives Discuss existing challenges in bioinformatics education with experts in computational biology and quantitative biology education, Evaluate best practices in teaching quantitative and computational biology, and Collaborate with scientist educators to develop instructional modules to support a biology curriculum that includes quantitative approaches. Friday | November 13 4:00 pm Arrival Registration Desk 5:30 – 6:00 pm Reception Great Hall 6:00 – 7:00 pm Dinner Dining Room 7:00 – 7:15 pm Welcome K202 David Asai, HHMI Cynthia Bauerle, HHMI Pavel Pevzner, University of California-San Diego Sarah Elgin, Washington University Alex Hartemink, Duke University 7:15 – 8:00 pm How to Maximize Interaction and Feedback During the Studio K202 Cynthia Bauerle and Sarah Simmons, HHMI 8:00 – 9:00 pm Keynote Presentation K202 "Computing + Biology = Discovery" Speakers: Ran Libeskind-Hadas, Harvey Mudd College Eliot Bush, Harvey Mudd College 9:00 – 11:00 pm Social The Pilot Saturday | November 14 7:30 – 8:15 am Breakfast Dining Room 8:30 – 10:00 am Lecture session 1 K202 Moderator: Pavel Pevzner 834a-854a “How is body fat regulated?” Laurie Heyer, Davidson College 856a-916a “How can we find mutations that cause cancer?” Ben Raphael, Brown University “How does a tumor evolve over time?” 918a-938a Russell Schwartz, Carnegie Mellon University “How fast do ribosomes move?” 940a-1000a Carl Kingsford, Carnegie Mellon University 10:05 – 10:55 am Breakout working groups Rooms: S221, (coffee available in each room) N238, N241, N140 1.
    [Show full text]
  • DNA Sequencing
    Contig Assembly ATCGATGCGTAGCAGACTACCGTTACGATGCCTT… TAGCTACGCATCGTCTGATGGCAATGCTACGGAA.. C T AG AGCAGA TAGCTACGCATCGT GT CTACCG GC TT AT CG GTTACGATGCCTT AT David Wishart, Ath 3-41 [email protected] DNA Sequencing 1 Principles of DNA Sequencing Primer DNA fragment Amp PBR322 Tet Ori Denature with Klenow + ddNTP heat to produce + dNTP + primers ssDNA The Secret to Sanger Sequencing 2 Principles of DNA Sequencing 5’ G C A T G C 3’ Template 5’ Primer dATP dATP dATP dATP dCTP dCTP dCTP dCTP dGTP dGTP dGTP dGTP dTTP dTTP dTTP dTTP ddCTP ddATP ddTTP ddGTP GddC GCddA GCAddT ddG GCATGddC GCATddG Principles of DNA Sequencing G T _ _ short C A G C A T G C + + long 3 Capillary Electrophoresis Separation by Electro-osmotic Flow Multiplexed CE with Fluorescent detection ABI 3700 96x700 bases 4 High Throughput DNA Sequencing Large Scale Sequencing • Goal is to determine the nucleic acid sequence of molecules ranging in size from a few hundred bp to >109 bp • The methodology requires an extensive computational analysis of raw data to yield the final sequence result 5 Shotgun Sequencing • High throughput sequencing method that employs automated sequencing of random DNA fragments • Automated DNA sequencing yields sequences of 500 to 1000 bp in length • To determine longer sequences you obtain fragmentary sequences and then join them together by overlapping • Overlapping is an alignment problem, but different from those we have discussed up to now Shotgun Sequencing Isolate ShearDNA Clone into Chromosome into Fragments Seq. Vectors Sequence 6 Shotgun Sequencing Sequence Send to Computer Assembled Chromatogram Sequence Analogy • You have 10 copies of a movie • The film has been cut into short pieces with about 240 frames per piece (10 seconds of film), at random • Reconstruct the film 7 Multi-alignment & Contig Assembly ATCGATGCGTAGCAGACTACCGTTACGATGCCTT… TAGCTACGCATCGTCTGATGGCAATGCTACGGAA.
    [Show full text]
  • Cloud Computing and the DNA Data Race Michael Schatz
    Cloud Computing and the DNA Data Race Michael Schatz April 14, 2011 Data-Intensive Analysis, Analytics, and Informatics Outline 1. Genome Assembly by Analogy 2. DNA Sequencing and Genomics 3. Large Scale Sequence Analysis 1. Mapping & Genotyping 2. Genome Assembly Shredded Book Reconstruction • Dickens accidentally shreds the first printing of A Tale of Two Cities – Text printed on 5 long spools It was theIt was best the of besttimes, of times, it was it wasthe worstthe worst of of times, times, it it was was the the ageage of of wisdom, wisdom, it itwas was the agethe ofage foolishness, of foolishness, … … It was theIt was best the bestof times, of times, it was it was the the worst of times, it was the theage ageof wisdom, of wisdom, it was it thewas age the of foolishness,age of foolishness, … It was theIt was best the bestof times, of times, it was it wasthe the worst worst of times,of times, it it was the age of wisdom, it wasit was the the age age of offoolishness, foolishness, … … It was It thewas best the ofbest times, of times, it wasit was the the worst worst of times,of times, itit waswas thethe ageage ofof wisdom,wisdom, it wasit was the the age age of foolishness,of foolishness, … … It wasIt thewas best the bestof times, of times, it wasit was the the worst worst of of times, it was the age of ofwisdom, wisdom, it wasit was the the age ofage foolishness, of foolishness, … … • How can he reconstruct the text? – 5 copies x 138, 656 words / 5 words per fragment = 138k fragments – The short fragments from every copy are mixed
    [Show full text]
  • BENG181/CSE 181/BIMM 181 Molecular Sequence Analysis ​ ​ ​ ​ ​ ​ Instructor: Pavel Pevzner ​
    COURSE ANNOUNCEMENT FOR WINTER 2021 BENG181/CSE 181/BIMM 181 Molecular Sequence Analysis https://sites.google.com/site/ucsdcse181 ​ ​ ​ ​ ​ ​ Instructor: Pavel Pevzner ​ ● phone: (858) 822-4365 ● e.mail: [email protected] ​ ● web site: bioalgorithms.ucsd.edu Teaching Assistants: ● Andrey Bzikadze ([email protected]) ​ ​ ● Hsuan-lin (Charlene) Her ([email protected]) ​ ​ Time: 6:30-7:50 Mon/Wed, Place: online (seminar Friday 4:00-4:50 online) ​ ​ ​ ​ ​ Zoom link for the class: https://ucsd.zoom.us/j/99782745100 ​ Zoom link for the seminar: https://ucsd.zoom.us/j/96805484881 ​ Office hours: PP: (Th 3-5 online), TAs (online Tue 1-2 PM and 4-5 PM or by appointment ​ online) PP zoom link: https://ucsd.zoom.us/j/96986851791 ​ Andrey Bzikadze zoom link: https://ucsd.zoom.us/j/94881347266 ​ Hsuan-lin (Charlene) Her zoom link: https://ucsd.zoom.us/j/95134947264 ​ Prerequisites: The course assumes some prior background in biology, some algorithmic ​ culture (CSE 101 course on algorithms as a prerequisite), and some programming skills. Flipped online class. Starting in 2014, the Innovative Learning Technology Initiative (ILTI) ​ ​ ​ at University of California encourages professors to transform their classes into online ​ ​ offerings available across various UC campuses. Dr. Pevzner is funded by the ILTI and NIH to develop new online approaches to bioinformatics education at UCSD. Since 2014, well before the COVID-19 pandemic, all lectures in this class are available online rather than presented in the classroom. Multi-university class. This class closely follows the textbook Bioinformatics Algorithms: ​ ​ an Active Learning Approach that has now been adopted by 140+ instructors from 40+ ​ ​ countries.
    [Show full text]
  • New Softwares for Automated Microsatellite Marker Development
    Published online February 21, 2006 Nucleic Acids Research, 2006, Vol. 34, No. 4 e31 doi:10.1093/nar/gnj030 New softwares for automated microsatellite marker development Wellington Martins, Daniel de Sousa1, Karina Proite2, Patrı´cia Guimara˜es2, Marcio Moretzsohn2 and David Bertioli3 Department of Computer Science, Catholic University of Goia´s, Brazil, 1Department of Computer Science, Catholic University of Rio de Janeiro, Brazil, 2Embrapa Genetic Resources and Biotechnology, Brası´lia, Brazil and 3Genomic Sciences and Biotechnology, Catholic University of Brası´lia, Brazil Received November 23, 2005; Revised and Accepted January 31, 2006 ABSTRACT whose unit of repetition is between 1 and 6 bp. They are highly abundant in the genomes of eukaryotes, polymorphic and Microsatellites are repeated small sequence motifs usually co-dominant and transferable between different map- that are highly polymorphic and abundant in the ping populations. Microsatellite markers can also be used in genomes of eukaryotes. Often they are the molecular automated genotyping techniques. Thus, they have become markers of choice. To aid the development of micro- one of the most useful molecular markers for a large number satellite markers we have developed a module that of organisms. integrates a program for the detection of microsatel- Researchers working on the development of microsatellite lites (TROLL), with the sequence assembly and markers need an efficient way to go from usually hundreds or analysis software, the Staden Package. The module thousands of trace and/or text sequence files to the identifi- has easily adjustable parameters for microsatellite cation of new potential markers. However, the softwares lengths and base pair quality control.
    [Show full text]
  • Steven L. Salzberg
    Steven L. Salzberg McKusick-Nathans Institute of Genetic Medicine Johns Hopkins School of Medicine, MRB 459, 733 North Broadway, Baltimore, MD 20742 Phone: 410-614-6112 Email: [email protected] Education Ph.D. Computer Science 1989, Harvard University, Cambridge, MA M.Phil. 1984, M.S. 1982, Computer Science, Yale University, New Haven, CT B.A. cum laude English 1980, Yale University Research Areas: Genomics, bioinformatics, gene finding, genome assembly, sequence analysis. Academic and Professional Experience 2011-present Professor, Department of Medicine and the McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University. Joint appointments as Professor in the Department of Biostatistics, Bloomberg School of Public Health, and in the Department of Computer Science, Whiting School of Engineering. 2012-present Director, Center for Computational Biology, Johns Hopkins University. 2005-2011 Director, Center for Bioinformatics and Computational Biology, University of Maryland Institute for Advanced Computer Studies 2005-2011 Horvitz Professor, Department of Computer Science, University of Maryland. (On leave of absence 2011-2012.) 1997-2005 Senior Director of Bioinformatics (2000-2005), Director of Bioinformatics (1998-2000), and Investigator (1997-2005), The Institute for Genomic Research (TIGR). 1999-2006 Research Professor, Departments of Computer Science and Biology, Johns Hopkins University 1989-1999 Associate Professor (1996-1999), Assistant Professor (1989-1996), Department of Computer Science, Johns Hopkins University. On leave 1997-99. 1988-1989 Associate in Research, Graduate School of Business Administration, Harvard University. Consultant to Ford Motor Co. of Europe and to N.V. Bekaert (Kortrijk, Belgium). 1985-1987 Research Scientist and Senior Knowledge Engineer, Applied Expert Systems, Inc., Cambridge, MA. Designed expert systems for financial services companies.
    [Show full text]
  • A Tool for Detecting Base Mis-Calls in Multiple Sequence Alignments by Semi-Automatic Chromatogram Inspection
    ChromatoGate: A Tool for Detecting Base Mis-Calls in Multiple Sequence Alignments by Semi-Automatic Chromatogram Inspection Nikolaos Alachiotis Emmanouella Vogiatzi∗ Scientific Computing Group Institute of Marine Biology and Genetics HITS gGmbH HCMR Heidelberg, Germany Heraklion Crete, Greece [email protected] [email protected] Pavlos Pavlidis Alexandros Stamatakis Scientific Computing Group Scientific Computing Group HITS gGmbH HITS gGmbH Heidelberg, Germany Heidelberg, Germany [email protected] [email protected] ∗ Affiliated also with the Department of Genetics and Molecular Biology of the Democritian University of Thrace at Alexandroupolis, Greece. Corresponding author: Nikolaos Alachiotis Keywords: chromatograms, software, mis-calls Abstract Automated DNA sequencers generate chromatograms that contain raw sequencing data. They also generate data that translates the chromatograms into molecular sequences of A, C, G, T, or N (undetermined) characters. Since chromatogram translation programs frequently introduce errors, a manual inspection of the generated sequence data is required. As sequence numbers and lengths increase, visual inspection and manual correction of chromatograms and corresponding sequences on a per-peak and per-nucleotide basis becomes an error-prone, time-consuming, and tedious process. Here, we introduce ChromatoGate (CG), an open-source software that accelerates and partially automates the inspection of chromatograms and the detection of sequencing errors for bidirectional sequencing runs. To provide users full control over the error correction process, a fully automated error correction algorithm has not been implemented. Initially, the program scans a given multiple sequence alignment (MSA) for potential sequencing errors, assuming that each polymorphic site in the alignment may be attributed to a sequencing error with a certain probability.
    [Show full text]
  • Annual Scientific Report 2011 Annual Scientific Report 2011 Designed and Produced by Pickeringhutchins Ltd
    European Bioinformatics Institute EMBL-EBI Annual Scientific Report 2011 Annual Scientific Report 2011 Designed and Produced by PickeringHutchins Ltd www.pickeringhutchins.com EMBL member states: Austria, Croatia, Denmark, Finland, France, Germany, Greece, Iceland, Ireland, Israel, Italy, Luxembourg, the Netherlands, Norway, Portugal, Spain, Sweden, Switzerland, United Kingdom. Associate member state: Australia EMBL-EBI is a part of the European Molecular Biology Laboratory (EMBL) EMBL-EBI EMBL-EBI EMBL-EBI EMBL-European Bioinformatics Institute Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SD United Kingdom Tel. +44 (0)1223 494 444, Fax +44 (0)1223 494 468 www.ebi.ac.uk EMBL Heidelberg Meyerhofstraße 1 69117 Heidelberg Germany Tel. +49 (0)6221 3870, Fax +49 (0)6221 387 8306 www.embl.org [email protected] EMBL Grenoble 6, rue Jules Horowitz, BP181 38042 Grenoble, Cedex 9 France Tel. +33 (0)476 20 7269, Fax +33 (0)476 20 2199 EMBL Hamburg c/o DESY Notkestraße 85 22603 Hamburg Germany Tel. +49 (0)4089 902 110, Fax +49 (0)4089 902 149 EMBL Monterotondo Adriano Buzzati-Traverso Campus Via Ramarini, 32 00015 Monterotondo (Rome) Italy Tel. +39 (0)6900 91402, Fax +39 (0)6900 91406 © 2012 EMBL-European Bioinformatics Institute All texts written by EBI-EMBL Group and Team Leaders. This publication was produced by the EBI’s Outreach and Training Programme. Contents Introduction Foreword 2 Major Achievements 2011 4 Services Rolf Apweiler and Ewan Birney: Protein and nucleotide data 10 Guy Cochrane: The European Nucleotide Archive 14 Paul Flicek:
    [Show full text]
  • THE BIG CHALLENGES of BIG DATA As They Grapple with Increasingly Large Data Sets, Biologists and Computer Scientists Uncork New Bottlenecks
    TECHNOLOGY FEATURE THE BIG CHALLENGES OF BIG DATA As they grapple with increasingly large data sets, biologists and computer scientists uncork new bottlenecks. EMBL–EBI Extremely powerful computers are needed to help biologists to handle big-data traffic jams. BY VIVIEN MARX and how the genetic make-up of different can- year, particle-collision events in CERN’s Large cers influences how cancer patients fare2. The Hadron Collider generate around 15 petabytes iologists are joining the big-data club. European Bioinformatics Institute (EBI) in of data — the equivalent of about 4 million With the advent of high-throughput Hinxton, UK, part of the European Molecular high-definition feature-length films. But the genomics, life scientists are starting to Biology Laboratory and one of the world’s larg- EBI and institutes like it face similar data- Bgrapple with massive data sets, encountering est biology-data repositories, currently stores wrangling challenges to those at CERN, says challenges with handling, processing and mov- 20 petabytes (1 petabyte is 1015 bytes) of data Ewan Birney, associate director of the EBI. He ing information that were once the domain of and back-ups about genes, proteins and small and his colleagues now regularly meet with astronomers and high-energy physicists1. molecules. Genomic data account for 2 peta- organizations such as CERN and the European With every passing year, they turn more bytes of that, a number that more than doubles Space Agency (ESA) in Paris to swap lessons often to big data to probe everything from every year3 (see ‘Data explosion’). about data storage, analysis and sharing.
    [Show full text]
  • UNIVERSITY of CALIFORNIA RIVERSIDE RNA-Seq
    UNIVERSITY OF CALIFORNIA RIVERSIDE RNA-Seq Based Transcriptome Assembly: Sparsity, Bias Correction and Multiple Sample Comparison A Dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science by Wei Li September 2012 Dissertation Committee: Dr. Tao Jiang , Chairperson Dr. Stefano Lonardi Dr. Marek Chrobak Dr. Thomas Girke Copyright by Wei Li 2012 The Dissertation of Wei Li is approved: Committee Chairperson University of California, Riverside Acknowledgments The completion of this dissertation would have been impossible without help from many people. First and foremost, I would like to thank my advisor, Dr. Tao Jiang, for his guidance and supervision during the four years of my Ph.D. He offered invaluable advice and support on almost every aspect of my study and research in UCR. He gave me the freedom in choosing a research problem I’m interested in, helped me do research and write high quality papers, Not only a great academic advisor, he is also a sincere and true friend of mine. I am always feeling appreciated and fortunate to be one of his students. Many thanks to all committee members of my dissertation: Dr. Stefano Lonardi, Dr. Marek Chrobak, and Dr. Thomas Girke. I will be greatly appreciated by the advice they offered on the dissertation. I would also like to thank Jianxing Feng, Prof. James Borneman and Paul Ruegger for their collaboration in publishing several papers. Thanks to the support from Vivien Chan, Jianjun Yu and other bioinformatics group members during my internship in the Novartis Institutes for Biomedical Research.
    [Show full text]