Reducing the Complexity of OMICS Data Analysis

Julius-Maximilians-Universität Würzburg

Reducing the complexity of OMICS data analysis

Dissertation for the award of the doctoral degree in natural sciences at the Julius-Maximilians-Universität Würzburg

Submitted by Beat Wolf from Fribourg, CH, 2017

Submitted on: 5 April 2017 to the Faculty of Mathematics and Computer Science
1st reviewer: Prof. Dr. Thomas Dandekar
2nd reviewer: Prof. Dr. Pierre Kuonen
Date of the oral examination: 31 August 2017

Summary

The field of genetics faces many challenges and opportunities in both research and diagnostics due to the rise of next generation sequencing (NGS), a technology that allows DNA to be sequenced increasingly quickly and cheaply. NGS is used to analyze not only DNA but also RNA, a very similar molecule also present in the cell, and in both cases it produces large amounts of data. This large amount of data raises both infrastructure and usability problems, as powerful computing infrastructures are required and the data analysis involves many manual steps that are complicated to execute. Both problems limit the use of NGS in the clinic and in research by creating a bottleneck, both computationally and in terms of manpower, since geneticists lack the computing skills required for many analyses.

Over the course of this thesis we investigated how computer science can help to improve this situation by reducing the complexity of this type of analysis. We looked at how to make the analysis more accessible in order to increase the number of people who can perform OMICS data analysis (OMICS groups various genomics data sources). To approach this problem, we developed, in close collaboration with the Human Genetics Department at the University of Würzburg, a graphical NGS data analysis pipeline aimed at a diagnostics environment while remaining useful in research.
The pipeline has been used in various research papers covering a range of subjects, including works with direct author participation in genomics, transcriptomics and epigenomics. To further validate the graphical pipeline, a user survey was carried out, which confirmed that it lowers the complexity of OMICS data analysis.

We also studied how the data analysis can be improved in terms of computing infrastructure by improving the performance of certain analysis steps. We did this both through speed improvements on a single computer (notably, variant calling became up to 18 times faster) and through distributed computing, to make better use of an existing infrastructure. The improvements were integrated into the previously described graphical pipeline, which itself was also focused on low resource usage.

As a major contribution, and to help with the future development of parallel and distributed applications, for use in genetics or otherwise, we also looked at how to make it easier to develop such applications. Based on the parallel object programming model (POP), we created a Java language extension called POP-Java, which allows for easy and transparent distribution of objects. Through this development, we brought the POP model to the cloud and to Hadoop clusters, and we present a new collaborative distributed computing model called FriendComputing.

The advances made in the different domains of this thesis have been published in various works specified in this document.

Zusammenfassung

The field of genetics faces many challenges, in both research and diagnostics, due to next generation sequencing (NGS), a technology that sequences DNA ever faster and more cheaply. NGS is used to analyze not only DNA but also RNA, a molecule very similar to DNA, and in both cases large amounts of data are produced. The large amount of data creates infrastructure and usability problems, since powerful computing infrastructures are required and the data analysis involves many manual steps that are complicated to carry out. Both problems limit the use of NGS in the clinic and in research, because they create a bottleneck in both computing power and personnel, as geneticists lack the computing skills required for many analyses.

In this thesis we investigated how computer science can help to improve this situation by reducing the complexity of this type of analysis. We looked at how the analysis can be made more accessible in order to increase the number of people who can perform OMICS data analyses (OMICS groups various genetic data sources). In close collaboration with the Institute of Human Genetics at the University of Würzburg, a graphical NGS data analysis pipeline was created to address this question. The graphical pipeline was developed for the diagnostics field without losing sight of research. The pipeline has therefore been used in various research areas, including publications with direct author participation in genomics, transcriptomics and epigenomics. The pipeline was also validated through a user survey, which confirmed that our graphical pipeline reduces the complexity of OMICS data analysis.

We also investigated how the performance of the data analysis can be improved so that the required infrastructure becomes more accessible. This was approached both by optimizing the available methods (where, for example, variant analysis became up to 18 times faster) and with distributed computing, in order to make better use of an existing infrastructure. The improvements were integrated into the previously described graphical pipeline, with low resource usage being a general focus.

To support the future development of parallel and distributed applications, in genetics or elsewhere, we looked at how to make it easier to develop such applications. This led to an important computer science result: based on the parallel object programming (POP) model, we developed an extension of the Java language called POP-Java, which allows a simple and transparent distribution of objects. Through this development we brought the POP model to the cloud and to Hadoop clusters, and we present a new model for collaborative distributed computing called FriendComputing. The various published parts of this dissertation are specifically listed and discussed.

Acknowledgment

For making this thesis possible and helping me to complete it, I have to thank numerous people and institutions. First and foremost I would like to thank Prof. Pierre Kuonen, not only for giving me the opportunity to write this dissertation, but also for encouraging me to do so and giving me the best environment possible. I would also like to thank Prof. Thomas Dandekar for supervising my thesis and giving me precious advice and guidance in the field of bioinformatics. A big thanks also goes to Dr. David Atlan, who gave me the opportunity to carry out this thesis with a very practically oriented approach, making it possible for much of my work to be used in real laboratories across Europe. Having my work used on a daily basis in a diagnostics environment was a major motivational force throughout the thesis. I would also like to thank Prof. Clemens Müller Reible and Prof. Simone Rost of the Institute of Human Genetics in Würzburg for following my thesis with so much interest, for giving me advice and, most importantly, for their trust in my work, introducing it in their laboratory to be used for the regular data analysis.
I would like to thank the co-authors with whom I had the opportunity to write various papers, through which I could learn a lot and become familiar with many topics. Without them, much of my work would be theoretical, with no practical implications. I especially want to thank my girlfriend Gaëlle Kolly, who supported me throughout the thesis. A special thanks also goes to my parents, who made it possible for me to follow a research career.

Last but not least, I would also like to thank the University of Würzburg and the University of Applied Sciences and Arts Western Switzerland for accepting me for my PhD. I’m grateful for having had the opportunity to do my PhD through a collaboration between two universities, one more focused on the academic side and the other on the practical side.

Contents

1. Introduction
  1.1. Motivation and scope
  1.2. Contributions
  1.3. Thesis outline

I. Foundations

2. Genetics
  2.1. Introduction
    2.1.1. Genetic code
    2.1.2. Next generation sequencing
  2.2. Summary
3. OMICs data analysis
  3.1. Genomics
    3.1.1. State of the art
  3.2. Transcriptomics
    3.2.1. State of the art
  3.3. Epigenomics
    3.3.1. State of the art
  3.4. File-formats
  3.5. Summary
4. Diagnostics
  4.1. Introduction
  4.2. Genetic disorders
  4.3. Software requirements
  4.4. Summary
5. Parallel & distributed computing
  5.1. Introduction
  5.2. History
  5.3. State of the art
    5.3.1. CPU
    5.3.2. GPGPU
    5.3.3. Distributed computing
  5.4. Summary

II. Methods

6. Graphical pipeline
  6.1. Introduction
  6.2. Prototype
  6.3. Methods
  6.4. User interface
  6.5. Project management
  6.6. Annotations
  6.7. Data analysis
    6.7.1. Quality control
    6.7.2. Sequence alignment
    6.7.3. Coverage analysis
    6.7.4. Variant analysis
    6.7.5. Variant comparator
    6.7.6. Copy number variations
    6.7.7. Distribution
  6.8. Discussion
7. Data analysis
  7.1. Sequence alignment
    7.1.1. Introduction
    7.1.2. State of the art
    7.1.3. Methods
    7.1.4. Results
    7.1.5. Summary
  7.2. Meta-Alignment
    7.2.1. Introduction
    7.2.2. Method
    7.2.3. Results
    7.2.4. Summary
  7.3. Variant calling
    7.3.1. Introduction
    7.3.2.