Statistical Methods for the Analysis of the Genetics of Gene Expression

Total Page:16

File Type:pdf, Size:1020Kb

Statistical Methods for the Analysis of the Genetics of Gene Expression Statistical methods for the analysis of the genetics of gene expression Matthias Alexander Heinig April 13, 2011 Dissertation zur Erlangung des Grades eines Doktors der Naturwissenschaften (Dr. rer. nat.) am Fachbereich Mathematik und Informatik der Freien Universit¨at Berlin Gutachter: Prof. Dr. Martin Vingron Prof. Dr. Norbert H¨ubner 1. Referent: Prof. Dr. Martin Vingron 2. Referent: Prof. Dr. Norbert H¨ubner Tag der Promotion: 13. Dezember 2010 Preface The work that led to this thesis is part of several collaborative projects in which I participated. For the sake of clarity this thesis presents all results that led to the conclusions drawn and individual contributions will be detailed here. For each project I mention publications, highlight my contributions and acknowledge the work of my collaborators. Star project The work presented in section 3.3.1 was published in Nature Genetics [1] by the Star Consortium. It was a large EU funded project initiated by Norbert H¨ubner and led by Kathrin Saar aiming to assess genetic variability among rat strains and to provide haplotype and genetic maps. My contribution was the design of the strategy for the construction of the maps and its implementation and application to three different populations of rats. Moreover I performed the comparative analysis of these maps. I would like to acknowledge Herbert Schulz, Kathrin Saar, Norbert H¨ubner, Edwin Cuppen, Richard Mott, Domenique Gaugier and Mari´e-Th´er`ese Bihoreau for discussions and guidance and Denise Brocklebank for discussion and contributions to the implementation of the analysis. I would like to thank Oliver Hummel and Herbert Schulz for the preprocessing of genotype data. Ephx2 project Analysis and results presented in sections 3.4.1, 4.1 and 4.2.3 were pulished in Nature Genetics [2]. This project was designed by Judith Fischer, Jan Monti and Norbert H¨ubner and carried out in the lab of Norbert H¨ubner mainly by Judith Fischer and Svetlana Paskas. My contribution to this project was the eQTL analysis, analysis of the genetic effect on the arachidonic acid pathway (section 4.1) and the transcription factor binding site analysis of the Ephx2 promoter (section 4.2.3). In addition I participated in the analysis of the physiological data together with Judith Fischer. I would like to thank Judith Fischer for leading the project and producing all the data. I would like to thank Norbert H¨ubner for discussions and giving me the opportunity to work on this data. Finally I would like to acknowledge Martin Vingron for his ideas on the pathway analysis (section 4.1). I have presented the methodology of this part of the project as a talk at the International conference on Genetics 2008 in Berlin. Euratools project The work presented in sections 3.3 and 4.3 was published in Nature [3]. The data were generated as part of the Euratools consortium – an EU funded large integrated project. I contributed the eQTL analysis across seven tissues (section 3.3) and the design of the transcription factor and co-expression analysis in sections 4.3.1 and 4.3.2 which form the basis of this study. Together with Maxim Rotival and Enrico Petretto, I performed the co- expression analysis in the human eQTL data set (section 4.3.6). Analysis of the influence of the human chromosome 13 locus on the network (section 4.3.7) was carried out in parallel by myself, Anja Bauerfeind and Leonardo Bottolo. Analysis of the EBI2 cis-eQTL (section 4.3.7) was performed in parallel by myself and Anja Bauerfeind. Translation to human T1D i data (4.3.8) was performed in collaboration with Chris Wallace, David Clayton and John Todd with contributions of Anja Bauerfeind and myself. In particular I would like to acknowledge that Chris Wallace developed the theory for the extensions of enrichment analysis to account for confounding factors (sections 2.2.3 and 4.3.8). I am grateful to the Euratools consortium for providing me with a short term fellowship that got me started on this project during my stay in London. I thank Norbert H¨ubner, Enrico Petretto and Stuart Cook for initiating this project and finding the right collaboration partners to make it a success. I would like to thank Enrico Petretto for his constant support and advice in this project. I thank Helge Roider for his assistance with the PASTAA analysis. I am grateful to Carola Rintisch and Svetlana Paskas who performed the experiment to confirm my TFBS predictions. My special thanks go to Anja Bauerfeind for help and advice for the human eQTL analysis and genotype imputation. I would like to thank Sarah Langley who performed the cell type specificity analysis in section 4.3.3 and Leonardo Bottolo for the analysis in section 4.3.4. I would like to thank Stefan Blankeberg and his coworkers for providing us access to eQTL data from the Gutenberg Heart Study, Francois Cambien on behalf of the Cardiogenics consortium for access to eQTL data from the Cardiogenics study and John Todd for access to the human T1D data. The analysis presented in section 4.4 has not been published yet, but a manuscript is currently being submitted. The data analysed there are the same as presented above. I have designed and implemented the analysis. I would like to thank Enrico Petretto and Mario Falchi for stimulating discussions over coffee that shaped my ideas. I would like to thank Martin Vingron for his ideas and advice on writing the manuscript. I acknowledge numerous discussions about linear models with Hughes Richard. Finally I would like to thank Norbert H¨ubner who prompted me to evaluate phenotype data in this context. sTRAP project The method presented in section 4.2 was published in Human Mutation [4]. The project was a close collaboration between Thomas Manke and myself. We designed and performed the analysis together. I would like to thank Helge Roider for providing me the C-code of TRAP which allowed me to implement the tRap R-package and the sTRAP webserver. Acknowledgements I would like to thank both my supervisors Norbert H¨ubner and Martin Vingron for offering me the opportunity to get the best of both worlds, experimental and com- putational by co-supervising my thesis. I am grateful for the many fruitful discussions, the ideas and support I have enjoyed from both during my thesis. I would like to thank Norbert H¨ubner for embedding me into the very stimulating environment of the Euratools consortium which allowed me to participate in many collaborations. I would like to thank Martin Vingron for cre- ating an extraordinary working atmosphere in his group. I am very grateful to all members of the Vingron and H¨ubner groups and all PhD students of both institutes many of whom became friends. Last but not least I would like to thank my parents for their support and Kristina for her patience. Berlin, September 2010 Matthias Heinig ii Contents 1 Introduction 1 1.1 Objective and structure of the thesis ......................... 4 1.2 Genetic mapping of quantitative trait loci (QTL) .................. 5 1.3 Genome wide association studies ........................... 8 1.4 The BXH/HXB recombinant inbred strains ..................... 8 1.5 The F2 intercross between SHHF and SHRSP .................... 9 2 Statistical tools 11 2.1 Linear models ...................................... 11 2.1.1 Estimation of parameters ........................... 11 2.1.2 Hypothesis testing ............................... 12 2.1.3 The multiple linear regression model ..................... 12 2.1.4 Model selection via lasso ............................ 13 2.2 Functional enrichment analysis ............................ 13 2.2.1 Exact test on a contingency table ....................... 14 2.2.2 Gene set enrichment analysis ......................... 14 2.2.3 Extensions to gene set enrichment analysis ................. 15 2.2.4 The iterated hypergeometric test ....................... 17 3 Expression quantitative trait loci (eQTL) 19 3.1 Gene expression as quantitative trait ......................... 19 3.2 Measuring gene expression with microarrays ..................... 20 3.2.1 Gene expression microarrays ......................... 20 3.2.2 Normalisation .................................. 20 3.3 Mapping of eQTLs in the BXH/HXB RI strains .................. 24 3.3.1 Construction of a high density SNP map ................... 24 3.3.2 Gene expression data .............................. 25 3.3.3 Expression QTL mapping ........................... 25 3.4 Integrated analysis of eQTLs and physiological data ................ 31 3.4.1 Identification of a risk factor for heart failure ................ 32 3.4.2 Identification of a candidate gene for systolic blood pressure ........ 33 4 eQTL genes in gene expression networks 37 4.1 Extension of gene set enrichment analysis for genetic mapping ........... 37 4.1.1 Linking the arachidonic acid pathway to heart failure ........... 38 4.2 Sequence variation in transcription factor binding sites ............... 41 4.2.1 Modelling transcription factor binding site affinities ............ 41 iii Contents 4.2.2 sTRAP: A framework to rank affinity changes ................ 43 4.2.3 Identification of the regulator of Ephx2 ................... 50 4.2.4 Conclusions ................................... 52 4.3 The role of transcription factors in clusters of trans-eQTLs ............ 52 4.3.1 Identification of genetically regulated TF-networks ............. 53 4.3.2 Extension of TF-networks by co-expression analysis ............ 58 4.3.3 Cross species cell type enrichment analysis ................. 58 4.3.4 Genetic mapping of the expanded TF-network ............... 60 4.3.5 Identification of a candidate regulatory factor ................ 60 4.3.6 Comparative co-expression analysis with humans .............. 62 4.3.7 Analysis of the human regulatory locus ................... 63 4.3.8 Translation to human GWAS data ...................... 65 4.3.9 Conclusions ................................... 68 4.4 Co-expression as quantitative trait .......................... 68 4.4.1 A linear model for the mapping of co-expression as a quantitative trait .
Recommended publications
  • A Method to Infer Changed Activity of Metabolic Function from Transcript Profiles
    ModeScore: A Method to Infer Changed Activity of Metabolic Function from Transcript Profiles Andreas Hoppe and Hermann-Georg Holzhütter Charité University Medicine Berlin, Institute for Biochemistry, Computational Systems Biochemistry Group [email protected] Abstract Genome-wide transcript profiles are often the only available quantitative data for a particular perturbation of a cellular system and their interpretation with respect to the metabolism is a major challenge in systems biology, especially beyond on/off distinction of genes. We present a method that predicts activity changes of metabolic functions by scoring reference flux distributions based on relative transcript profiles, providing a ranked list of most regulated functions. Then, for each metabolic function, the involved genes are ranked upon how much they represent a specific regulation pattern. Compared with the naïve pathway-based approach, the reference modes can be chosen freely, and they represent full metabolic functions, thus, directly provide testable hypotheses for the metabolic study. In conclusion, the novel method provides promising functions for subsequent experimental elucidation together with outstanding associated genes, solely based on transcript profiles. 1998 ACM Subject Classification J.3 Life and Medical Sciences Keywords and phrases Metabolic network, expression profile, metabolic function Digital Object Identifier 10.4230/OASIcs.GCB.2012.1 1 Background The comprehensive study of the cell’s metabolism would include measuring metabolite concentrations, reaction fluxes, and enzyme activities on a large scale. Measuring fluxes is the most difficult part in this, for a recent assessment of techniques, see [31]. Although mass spectrometry allows to assess metabolite concentrations in a more comprehensive way, the larger the set of potential metabolites, the more difficult [8].
    [Show full text]
  • The Kyoto Encyclopedia of Genes and Genomes (KEGG)
    Kyoto Encyclopedia of Genes and Genome Minoru Kanehisa Institute for Chemical Research, Kyoto University HFSPO Workshop, Strasbourg, November 18, 2016 The KEGG Databases Category Database Content PATHWAY KEGG pathway maps Systems information BRITE BRITE functional hierarchies MODULE KEGG modules KO (KEGG ORTHOLOGY) KO groups for functional orthologs Genomic information GENOME KEGG organisms, viruses and addendum GENES / SSDB Genes and proteins / sequence similarity COMPOUND Chemical compounds GLYCAN Glycans Chemical information REACTION / RCLASS Reactions / reaction classes ENZYME Enzyme nomenclature DISEASE Human diseases DRUG / DGROUP Drugs / drug groups Health information ENVIRON Health-related substances (KEGG MEDICUS) JAPIC Japanese drug labels DailyMed FDA drug labels 12 manually curated original DBs 3 DBs taken from outside sources and given original annotations (GENOME, GENES, ENZYME) 1 computationally generated DB (SSDB) 2 outside DBs (JAPIC, DailyMed) KEGG is widely used for functional interpretation and practical application of genome sequences and other high-throughput data KO PATHWAY GENOME BRITE DISEASE GENES MODULE DRUG Genome Molecular High-level Practical Metagenome functions functions applications Transcriptome etc. Metabolome Glycome etc. COMPOUND GLYCAN REACTION Funding Annual budget Period Funding source (USD) 1995-2010 Supported by 10+ grants from Ministry of Education, >2 M Japan Society for Promotion of Science (JSPS) and Japan Science and Technology Agency (JST) 2011-2013 Supported by National Bioscience Database Center 0.8 M (NBDC) of JST 2014-2016 Supported by NBDC 0.5 M 2017- ? 1995 KEGG website made freely available 1997 KEGG FTP site made freely available 2011 Plea to support KEGG KEGG FTP academic subscription introduced 1998 First commercial licensing Contingency Plan 1999 Pathway Solutions Inc.
    [Show full text]
  • 3 13437143.Pdf
    Title Integrative Annotation of 21,037 Human Genes Validated by Full-Length cDNA Clones Imanishi, Tadashi; Itoh, Takeshi; Suzuki, Yutaka; O'Donovan, Claire; Fukuchi, Satoshi; Koyanagi, Kanako O.; Barrero, Roberto A.; Tamura, Takuro; Yamaguchi-Kabata, Yumi; Tanino, Motohiko; Yura, Kei; Miyazaki, Satoru; Ikeo, Kazuho; Homma, Keiichi; Kasprzyk, Arek; Nishikawa, Tetsuo; Hirakawa, Mika; Thierry-Mieg, Jean; Thierry-Mieg, Danielle; Ashurst, Jennifer; Jia, Libin; Nakao, Mitsuteru; Thomas, Michael A.; Mulder, Nicola; Karavidopoulou, Youla; Jin, Lihua; Kim, Sangsoo; Yasuda, Tomohiro; Lenhard, Boris; Eveno, Eric; Suzuki, Yoshiyuki; Yamasaki, Chisato; Takeda, Jun-ichi; Gough, Craig; Hilton, Phillip; Fujii, Yasuyuki; Sakai, Hiroaki; Tanaka, Susumu; Amid, Clara; Bellgard, Matthew; Bonaldo, Maria de Fatima; Bono, Hidemasa; Bromberg, Susan K.; Brookes, Anthony J.; Bruford, Elspeth; Carninci, Piero; Chelala, Claude; Couillault, Christine; Souza, Sandro J. de; Debily, Marie-Anne; Devignes, Marie-Dominique; Dubchak, Inna; Endo, Toshinori; Estreicher, Anne; Eyras, Eduardo; Fukami-Kobayashi, Kaoru; R. Gopinath, Gopal; Graudens, Esther; Hahn, Yoonsoo; Han, Michael; Han, Ze-Guang; Hanada, Kousuke; Hanaoka, Hideki; Harada, Erimi; Hashimoto, Katsuyuki; Hinz, Ursula; Hirai, Momoki; Hishiki, Teruyoshi; Hopkinson, Ian; Imbeaud, Sandrine; Inoko, Hidetoshi; Kanapin, Alexander; Kaneko, Yayoi; Kasukawa, Takeya; Kelso, Janet; Kersey, Author(s) Paul; Kikuno, Reiko; Kimura, Kouichi; Korn, Bernhard; Kuryshev, Vladimir; Makalowska, Izabela; Makino, Takashi; Mano, Shuhei;
    [Show full text]
  • Ontology-Based Methods for Analyzing Life Science Data
    Habilitation a` Diriger des Recherches pr´esent´ee par Olivier Dameron Ontology-based methods for analyzing life science data Soutenue publiquement le 11 janvier 2016 devant le jury compos´ede Anita Burgun Professeur, Universit´eRen´eDescartes Paris Examinatrice Marie-Dominique Devignes Charg´eede recherches CNRS, LORIA Nancy Examinatrice Michel Dumontier Associate professor, Stanford University USA Rapporteur Christine Froidevaux Professeur, Universit´eParis Sud Rapporteure Fabien Gandon Directeur de recherches, Inria Sophia-Antipolis Rapporteur Anne Siegel Directrice de recherches CNRS, IRISA Rennes Examinatrice Alexandre Termier Professeur, Universit´ede Rennes 1 Examinateur 2 Contents 1 Introduction 9 1.1 Context ......................................... 10 1.2 Challenges . 11 1.3 Summary of the contributions . 14 1.4 Organization of the manuscript . 18 2 Reasoning based on hierarchies 21 2.1 Principle......................................... 21 2.1.1 RDF for describing data . 21 2.1.2 RDFS for describing types . 24 2.1.3 RDFS entailments . 26 2.1.4 Typical uses of RDFS entailments in life science . 26 2.1.5 Synthesis . 30 2.2 Case study: integrating diseases and pathways . 31 2.2.1 Context . 31 2.2.2 Objective . 32 2.2.3 Linking pathways and diseases using GO, KO and SNOMED-CT . 32 2.2.4 Querying associated diseases and pathways . 33 2.3 Methodology: Web services composition . 39 2.3.1 Context . 39 2.3.2 Objective . 40 2.3.3 Semantic compatibility of services parameters . 40 2.3.4 Algorithm for pairing services parameters . 40 2.4 Application: ontology-based query expansion with GO2PUB . 43 2.4.1 Context . 43 2.4.2 Objective .
    [Show full text]
  • Genbank by Walter B
    GenBank by Walter B. Goad o understand the significance of the information stored in memory—DNA—to the stuff of activity—proteins. The idea that GenBank, you need to know a little about molecular somehow the bases in DNA determine the amino acids in proteins genetics. What that field deals with is self-replication—the had been around for some time. In fact, George Gamow suggested in T process unique to life—and mutation and 1954, after learning about the structure proposed for DNA, that a recombination—the processes responsible for evolution-at the triplet of bases corresponded to an amino acid. That suggestion was fundamental level of the genes in DNA. This approach of working shown to be true, and by 1965 most of the genetic code had been from the blueprint, so to speak, of a living system is very powerful, deciphered. Also worked out in the ’60s were many details of what and studies of many other aspects of life—the process of learning, for Crick called the central dogma of molecular genetics—the now example—are now utilizing molecular genetics. firmly established fact that DNA is not translated directly to proteins Molecular genetics began in the early ’40s and was at first but is first transcribed to messenger RNA. This molecule, a nucleic controversial because many of the people involved had been trained acid like DNA, then serves as the template for protein synthesis. in the physical sciences rather than the biological sciences, and yet These great advances prompted a very distinguished molecular they were answering questions that biologists had been asking for geneticist to predict, in 1969, that biology was just about to end since years.
    [Show full text]
  • Causal Diagrams for Interference Elizabeth L
    STATISTICAL SCIENCE Volume 29, Number 4 November 2014 Special Issue on Semiparametrics and Causal Inference Causal Etiology of the Research of James M. Robins ..............................................Thomas S. Richardson and Andrea Rotnitzky 459 Doubly Robust Policy Evaluation and Optimization ..........................Miroslav Dudík, Dumitru Erhan, John Langford and Lihong Li 485 Statistics,CausalityandBell’sTheorem.....................................Richard D. Gill 512 Standardization and Control for Confounding in Observational Studies: A Historical Perspective..............................................Niels Keiding and David Clayton 529 CausalDiagramsforInterference............Elizabeth L. Ogburn and Tyler J. VanderWeele 559 External Validity: From do-calculus to Transportability Across Populations .........................................................Judea Pearl and Elias Bareinboim 579 Nonparametric Bounds and Sensitivity Analysis of Treatment Effects .................Amy Richardson, Michael G. Hudgens, Peter B. Gilbert and Jason P. Fine 596 The Bayesian Analysis of Complex, High-Dimensional Models: Can It Be CODA? .......................................Y.Ritov,P.J.Bickel,A.C.GamstandB.J.K.Kleijn 619 Q-andA-learning Methods for Estimating Optimal Dynamic Treatment Regimes .............. Phillip J. Schulte, Anastasios A. Tsiatis, Eric B. Laber and Marie Davidian 640 A Uniformly Consistent Estimator of Causal Effects under the k-Triangle-Faithfulness Assumption...................................................Peter Spirtes
    [Show full text]
  • Research Report 2006 Max Planck Institute for Molecular Genetics, Berlin Imprint | Research Report 2006
    MAX PLANCK INSTITUTE FOR MOLECULAR GENETICS Research Report 2006 Max Planck Institute for Molecular Genetics, Berlin Imprint | Research Report 2006 Published by the Max Planck Institute for Molecular Genetics (MPIMG), Berlin, Germany, August 2006 Editorial Board Bernhard Herrmann, Hans Lehrach, H.-Hilger Ropers, Martin Vingron Coordination Claudia Falter, Ingrid Stark Design & Production UNICOM Werbeagentur GmbH, Berlin Number of copies: 1,500 Photos Katrin Ullrich, MPIMG; David Ausserhofer Contact Max Planck Institute for Molecular Genetics Ihnestr. 63–73 14195 Berlin, Germany Phone: +49 (0)30-8413 - 0 Fax: +49 (0)30-8413 - 1207 Email: [email protected] For further information about the MPIMG please see our website: www.molgen.mpg.de MPI for Molecular Genetics Research Report 2006 Table of Contents The Max Planck Institute for Molecular Genetics . 4 • Organisational Structure. 4 • MPIMG – Mission, Development of the Institute, Research Concept. .5 Department of Developmental Genetics (Bernhard Herrmann) . 7 • Transmission ratio distortion (Hermann Bauer) . .11 • Signal Transduction in Embryogenesis and Tumor Progression (Markus Morkel). 14 • Development of Endodermal Organs (Heiner Schrewe) . 16 • Gene Expression and 3D-Reconstruction (Ralf Spörle). 18 • Somitogenesis (Lars Wittler). 21 Department of Vertebrate Genomics (Hans Lehrach) . 25 • Molecular Embryology and Aging (James Adjaye). .31 • Protein Expression and Protein Structure (Konrad Büssow). .34 • Mass Spectrometry (Johan Gobom). 37 • Bioinformatics (Ralf Herwig). .40 • Comparative and Functional Genomics (Heinz Himmelbauer). 44 • Genetic Variation (Margret Hoehe). 48 • Cell Arrays/Oligofingerprinting (Michal Janitz). .52 • Kinetic Modeling (Edda Klipp) . .56 • In Vitro Ligand Screening (Zoltán Konthur). .60 • Neurodegenerative Disorders (Sylvia Krobitsch). .64 • Protein Complexes & Cell Organelle Assembly/ USN (Bodo Lange/Thorsten Mielke). .67 • Automation & Technology Development (Hans Lehrach).
    [Show full text]
  • Integrative Annotation of 21,037 Human Genes Validated by Full-Length Cdna Clones
    Title Integrative Annotation of 21,037 Human Genes Validated by Full-Length cDNA Clones Imanishi, Tadashi; Itoh, Takeshi; Suzuki, Yutaka; O'Donovan, Claire; Fukuchi, Satoshi; Koyanagi, Kanako O.; Barrero, Roberto A.; Tamura, Takuro; Yamaguchi-Kabata, Yumi; Tanino, Motohiko; Yura, Kei; Miyazaki, Satoru; Ikeo, Kazuho; Homma, Keiichi; Kasprzyk, Arek; Nishikawa, Tetsuo; Hirakawa, Mika; Thierry-Mieg, Jean; Thierry-Mieg, Danielle; Ashurst, Jennifer; Jia, Libin; Nakao, Mitsuteru; Thomas, Michael A.; Mulder, Nicola; Karavidopoulou, Youla; Jin, Lihua; Kim, Sangsoo; Yasuda, Tomohiro; Lenhard, Boris; Eveno, Eric; Suzuki, Yoshiyuki; Yamasaki, Chisato; Takeda, Jun-ichi; Gough, Craig; Hilton, Phillip; Fujii, Yasuyuki; Sakai, Hiroaki; Tanaka, Susumu; Amid, Clara; Bellgard, Matthew; Bonaldo, Maria de Fatima; Bono, Hidemasa; Bromberg, Susan K.; Brookes, Anthony J.; Bruford, Elspeth; Carninci, Piero; Chelala, Claude; Couillault, Christine; Souza, Sandro J. de; Debily, Marie-Anne; Devignes, Marie-Dominique; Dubchak, Inna; Endo, Toshinori; Estreicher, Anne; Eyras, Eduardo; Fukami-Kobayashi, Kaoru; R. Gopinath, Gopal; Graudens, Esther; Hahn, Yoonsoo; Han, Michael; Han, Ze-Guang; Hanada, Kousuke; Hanaoka, Hideki; Harada, Erimi; Hashimoto, Katsuyuki; Hinz, Ursula; Hirai, Momoki; Hishiki, Teruyoshi; Hopkinson, Ian; Imbeaud, Sandrine; Inoko, Hidetoshi; Kanapin, Alexander; Kaneko, Yayoi; Kasukawa, Takeya; Kelso, Janet; Kersey, Author(s) Paul; Kikuno, Reiko; Kimura, Kouichi; Korn, Bernhard; Kuryshev, Vladimir; Makalowska, Izabela; Makino, Takashi; Mano, Shuhei;
    [Show full text]
  • Next Generation Machine Learning Prediction of Protein Cellular Sorting
    TECHNISCHE UNIVERSITÄT MÜNCHEN Lehrstuhl für Bioinformatik Next Generation Machine Learning Prediction of Protein Cellular Sorting Tatyana Goldberg Vollständiger Abdruck der von der Fakultät für Informatik der Technischen Universität München zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften genehmigten Dissertation. Vorsitzender: Univ.-Prof. Dr. Daniel Cremers Prüfer der Dissertation: 1. Univ.-Prof. Dr. Burkhard Rost 2. Prof. Dr. Yana Bromberg, The State University of New Jersey/USA 3. Univ.-Prof. Dr. Iris Antes Die Dissertation wurde am 15.12.2015 bei der Technischen Universität München eingereicht und durch die Fakultät für Informatik am 22.03.2016 angenommen. ii Table of Contents Abstract ............................................................................................................................................. v Zusammenfassung ........................................................................................................................... vii Acknowledgments ............................................................................................................................ ix List of publications ........................................................................................................................... xi List of Figures and Tables ............................................................................................................... xiii 1. Introduction ...............................................................................................................................
    [Show full text]
  • First Analysis Steps
    FirstFirst analysisanalysis stepssteps o quality control and optimization o calibration and error modeling o data transformations Wolfgang Huber Dep. of Molecular Genome Analysis (A. Poustka) DKFZ Heidelberg Acknowledgements Anja von Heydebreck Günther Sawitzki Holger Sültmann, Klaus Steiner, Markus Vogt, Jörg Schneider, Frank Bergmann, Florian Haller, Katharina Finis, Stephanie Süß, Anke Schroth, Friederike Wilmer, Judith Boer, Martin Vingron, Annemarie Poustka Sandrine Dudoit, Robert Gentleman, Rafael Irizarry and Yee Hwa Yang: Bioconductor short course, summer 2002 and many others a microarray slide Slide: 25x75 mm Spot-to-spot: ca. 150-350 µm 4 x 4 or 8x4 sectors 17...38 rows and columns per sector ca. 4600…46000 probes/array sector: corresponds to one print-tip Terminology sample: RNA (cDNA) hybridized to the array, aka target, mobile substrate. probe: DNA spotted on the array, aka spot, immobile substrate. sector: rectangular matrix of spots printed using the same print-tip (or pin), aka print-tip-group plate: set of 384 (768) spots printed with DNA from the same microtitre plate of clones slide, array channel: data from one color (Cy3 = cyanine 3 = green, Cy5 = cyanine 5 = red). batch: collection of microarrays with the same probe layout. Raw data scanner signal resolution: 5 or 10 mm spatial, 16 bit (65536) dynamical per channel ca. 30-50 pixels per probe (60 µm spot size) 40 MB per array Raw data scanner signal resolution: 5 or 10 mm spatial, 16 bit (65536) dynamical per channel ca. 30-50 pixels per probe (60 µm spot size) 40 MB per array Image Analysis Raw data scanner signal resolution: 5 or 10 mm spatial, 16 bit (65536) dynamical per channel ca.
    [Show full text]
  • Probabilistic Models for Gene Silencing Data
    Probabilistic Models for Gene Silencing Data Florian Markowetz Dezember 2005 Dissertation zur Erlangung des Grades eines Doktors der Naturwissenschaften (Dr. rer. nat.) am Fachbereich Mathematik und Informatik der Freien Universit¨atBerlin 1. Referent: Prof. Dr. Martin Vingron 2. Referent: Prof. Dr. Klaus-Robert M¨uller Tag der Promotion: 26. April 2006 Preface Acknowledgements This work was carried out in the Computational Diagnostics group of the Department of Computational Molecular Biology at the Max Planck Institute for Molecular Genetics in Berlin. I thank all past and present colleagues for the good working atmosphere and the scientific—and sometimes maybe not so scientific—discussions. Especially, I am grateful to my supervisor Rainer Spang for suggesting the topic, his scientific support, and the opportunity to write this thesis under his guidance. I thank Michael Boutros for providing the expression data and for introducing me to the world of RNAi when I visited his lab at the DKFZ in Heidelberg. I thank Anja von Heydebreck, J¨org Schultz, and Martin Vingron for their advice and counsel as members of my PhD commitee. During the time I worked on this thesis, I enjoyed fruitful discussions with many people. In particular, I gratefully acknowledge Jacques Bloch, Steffen Grossmann, Achim Tresch, and Chen-Hsiang Yeang for their contributions. Special thanks go to Viola Gesellchen, Britta Koch, Stefanie Scheid, Stefan Bentink, Stefan Haas, Dennis Kostka, and Stefan R¨opcke, who read drafts of this thesis and greatly improved it by their comments. Publications Parts of this thesis have been published before. Chapter 2 grew out of lectures I gave in 2005 at the Instiute for Theoretical Physics and Mathematics (IPM) in Tehran, Iran, and at the German Conference on Bioinformatics (GCB), Hamburg, Germany.
    [Show full text]
  • The Analysis of RNA-Seq Experiments Using Approximate Likelihood
    The analysis of RNA-Seq experiments using approximate likelihood Daniel C. Jones A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy University of Washington 2020 Reading Committee: Walter L. Ruzzo, Chair Sreeram Kannan Georg Seelig Program Authorized to Offer Degree: Paul G. Allen School of Computer Science & Engineering ©Copyright 2020 Daniel C. Jones University of Washington Abstract The analysis of RNA-Seq experiments using approximate likelihood Daniel C. Jones Chair of the Supervisory Committee: Walter L. Ruzzo Paul G. Allen School of Computer Science & Engineering The analysis of mRNA transcript abundance with RNA-Seq is a central tool in molecular biology research, but often analyses fail to account for the uncertainty in these estimates, which can be significant, especially when trying to disentangle isoforms or duplicated genes. Preserving uncertainty necessitates a full probabilistic model of the all the sequencing reads, which quickly becomes intractable, as experiments can consist of billions of reads. To overcome these limitations, we propose a new method of approximat- ing the likelihood function of a sparse mixture model, using a technique we call the Pólya tree transformation. We demonstrate that substituting this approximation for the real thing achieves most of the benefits with a fraction of the computational costs, leading to more accurate detection of differential transcript expression. 3 4 For Maddie. Let’s have some more normal ones. Contents 1 the analysis
    [Show full text]