Multi-Omics Data Integration and Data Model Optimization in Proteomicsdb

Total Page:16

File Type:pdf, Size:1020Kb

Multi-Omics Data Integration and Data Model Optimization in Proteomicsdb TECHNISCHE UNIVERSITÄT MÜNCHEN Lehrstuhl für Proteomik und Bioanalytik Multi-omics data integration and data model optimization in ProteomicsDB Patroklos E. Samaras Vollständiger Abdruck der von der Fakultät Wissenschaftszentrum Weihenstephan für Ernährung, Landnutzung und Umwelt der Technischen Universität München zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften genehmigten Dissertation. Vorsitzender: Prof. Dr. Dmitrij Frishman Prüfer der Dissertation: 1. Prof. Dr. Bernhard Küster 2. Prof. Dr. Julien Gagneur 3. Priv.-Doz. Dr. Martin Eisenacher Die Dissertation wurde am 20.04.2020 bei der Technischen Universität München eingereicht und durch die Fakultät Wissenschaftszentrum Weihenstephan für Ernährung, Landnutzung und Umwelt am 30.09.2020 angenommen. To Ermina to my parents Vangelis and Ntina and to my brothers Thanasis and Tasos Abstract Proteomics, transcriptomics, genomics, and phenomics are well-established scientific fields that contribute in their own way to the research of life and disease and in particular, cancer research. However, studying the data and results from only one omics-type will only reveal part of the bigger picture of what is happening inside a cell. Several methods have been developed for multi-omics data integration, however, they are either computationally expensive standalone tools or libraries that require programming expertise for their usage and the incorporation of results into further analyses. This cumulative thesis consists of two original publications that address the need for an online resource that supports scientists and clinicians with real-time multi-omics data integration, data analysis and visualization tools. The first publication presents the initial status of ProteomicsDB, a quantitative proteomics database, and describes the extension of the platform to support new omics-types. These data required the design of new data models to solve the challenge of storing and connecting information from multiple omics types and data sources using different identifier schemes. As a result, two generic data models were implemented, one for the storage of quantitative omics expression data and one for the storage of cell viability studies. To overcome the identifier mapping challenge and to enable combined data analysis, a third model was implemented allowing inter-connections with the proteomics data already existing in the database. The data model consisted of a composition of so-called triplestores along with tables that store metadata for each identifier. The generality of this data model allowed to capture any kind of relations between identifiers, such as protein-protein interaction networks as well as pathway data. The second publication extended the data wealth of the platform with additional proteomic, cell viability, drug-target and meltome data. In addition, the available protein properties were expanded with results from a new cellular assay containing protein turnover data. To support the interpretation of the new data types and assays, the user interface of the platform was extended with suitable visualization tools. The vast amount of data stored in ProteomicsDB enabled the development of integrative online tools, like the mRNA-guided missing value imputation method for protein expression data and the machine learning-based prediction of cell viability upon drug treatment based on protein expression data. The presented version of ProteomicsDB also enables users to upload custom expression datasets and analyse them alone or in comparison to stored data, using the analytics toolbox of the platform. Finally, all functionalities of the platform were extended to support data from any organism, transforming ProteomicsDB into a multi-omics and multi-organism resource for life science research. i | P a g e Zusammenfassung Proteomik, Transkriptomik, Genomik und Phänomik sind etablierte Forschungsbereiche, die zur Erforschung von Krankheiten und speziell zur Krebsforschung beitragen. Daten und Resultate eines einzigen Omics-Typs offenbaren jedoch nur einen Bruchteil der Vorgänge auf zellulärer Ebene. Um Daten mehrerer Omics-Typen, sogenannte Multiomik-Daten, zu integrieren, wurde eine Vielzahl von Methoden entwickelt. Diese teilen sich auf in rechenintensive Stand-alone Programme und Programmbibliotheken, welche für die Nutzung und weitere Verwertung der Ergebnisse Programmierkenntnisse voraussetzen. Diese kumulative Arbeit besteht aus zwei Veröffentlichungen, die auf den Bedarf von Wissenschaftlern und Klinikern an einer Online- Ressource eingehen, welche Echtzeit-Multiomik-Datenintegration sowie Datenanalyse- und Visualisierungstools unterstützt. Die erste Publikation präsentiert den initialen Status von ProteomicsDB, einer quantitativen proteomischen Datenbank und beschreibt die Erweiterung der Plattform um neue Omics-Typen. Für die herausfordernde Speicherung und Verknüpfung verschiedener Omics Datentypen aus unterschiedlichen Quellen mit verschiedenen Idenfikationsbezeichnugen wurden neue Datenmodelle nötig. Als Lösung wurden zwei Datenmodelle implementiert, eines für die Speicherung quantitativer Omics-Expressionsdaten und eines für die Hinterlegung von Zellviabilitätsassays. Als Lösung für die Problematik des Abgleichs von Idenfikationsbezeichnugen wurde ein drittes Datenmodell implementiert, welches nun deren kombinierte Analyse und die Verknüpfung zu den bereits gespeicherten proteomischen Daten ermöglicht. Die verwendete Modellierung bestand aus der Kombination von sogenannten “Triplestores” und Tabellen, welche Metadaten für jeden Eintrag enthalten. Die resultierende allgemeine Anwendbarkeit erlaubt die Speicherung von Beziehungen jeglicher Art zwischen den Einträgen, wie zum Beispiel Protein- Protein-Interaktionsnetzwerke oder die Abbildung von Signalwegen. Die zweite Publikation erweiterte die Datenmenge der Datenbank um zusätzliche proteomische Daten, Zellviabilität-, Protein-Wirkstoffinteraktions- und Meltomedaten. Die gespeicherten Proteincharakteristiken wurden um die Ergebnisse eines neuen zellulären Proteinumsatzassays erweitert. Weiterhin wurden die Visualisierungsoptionen der Plattform um zusätzliche Darstellungen für die neuen Datentypen und Assays erweitert. Die riesige in ProteomicsDB gespeicherte Datenmenge ermöglichte die Implementierung von omic-übergreifenden Onlinetools, wie eine mRNA-basierte Methode zur Imputation fehlender Proteinexpressionswerte und die von maschinellem Lernen gestützte Vorhersage der Zellviabilität nach Medikamentenbehandlung, die basierend auf Expressionsdaten von Proteinen realisiert wurde. Die beschriebene Version von ProteomicsDB ermöglicht es außerdem den Nutzern eigene Expressionsdatensätze hochzuladen und diese allein oder im Vergleich mit den hinterlegten Daten mittels der bereitgestellten Werkzeuge zu analysieren. Desweiteren wurden alle Funktionalitäten der Plattform erweitert um Daten von anderen Organismen zu unterstützen, was ProteomicsDB in eine omics- und organismus übergreifende Plattform für die Biowissenschaften transformiert. iii | P a g e Table of contents Abstract ............................................................................................................................................. i Zusammenfassung ........................................................................................................................... iii Chapter 1 General Introduction ....................................................................................................... 1 Chapter 2 General Methods ........................................................................................................... 41 Chapter 3 Publication 1 .................................................................................................................. 59 Chapter 4 Publication 2 .................................................................................................................. 63 Chapter 5 General discussion and outlook..................................................................................... 67 Publication record ............................................................................................................................. I Acknowledgements ......................................................................................................................... III Appendix ........................................................................................................................................... V v | P a g e Chapter 1 General Introduction Contents 1 From genes to proteins to disease ................................................................................................ 3 2 Multi-omics technologies .............................................................................................................. 5 2.1 Transcriptomics ...................................................................................................................... 5 2.2 Proteomics .............................................................................................................................. 7 2.3 Phenomics ............................................................................................................................ 12 2.4 Multi-omics data integration ................................................................................................ 14 3 Bioinformatics.............................................................................................................................. 16 3.1 Public repositories
Recommended publications
  • A Computational Approach for Defining a Signature of Β-Cell Golgi Stress in Diabetes Mellitus
    Page 1 of 781 Diabetes A Computational Approach for Defining a Signature of β-Cell Golgi Stress in Diabetes Mellitus Robert N. Bone1,6,7, Olufunmilola Oyebamiji2, Sayali Talware2, Sharmila Selvaraj2, Preethi Krishnan3,6, Farooq Syed1,6,7, Huanmei Wu2, Carmella Evans-Molina 1,3,4,5,6,7,8* Departments of 1Pediatrics, 3Medicine, 4Anatomy, Cell Biology & Physiology, 5Biochemistry & Molecular Biology, the 6Center for Diabetes & Metabolic Diseases, and the 7Herman B. Wells Center for Pediatric Research, Indiana University School of Medicine, Indianapolis, IN 46202; 2Department of BioHealth Informatics, Indiana University-Purdue University Indianapolis, Indianapolis, IN, 46202; 8Roudebush VA Medical Center, Indianapolis, IN 46202. *Corresponding Author(s): Carmella Evans-Molina, MD, PhD ([email protected]) Indiana University School of Medicine, 635 Barnhill Drive, MS 2031A, Indianapolis, IN 46202, Telephone: (317) 274-4145, Fax (317) 274-4107 Running Title: Golgi Stress Response in Diabetes Word Count: 4358 Number of Figures: 6 Keywords: Golgi apparatus stress, Islets, β cell, Type 1 diabetes, Type 2 diabetes 1 Diabetes Publish Ahead of Print, published online August 20, 2020 Diabetes Page 2 of 781 ABSTRACT The Golgi apparatus (GA) is an important site of insulin processing and granule maturation, but whether GA organelle dysfunction and GA stress are present in the diabetic β-cell has not been tested. We utilized an informatics-based approach to develop a transcriptional signature of β-cell GA stress using existing RNA sequencing and microarray datasets generated using human islets from donors with diabetes and islets where type 1(T1D) and type 2 diabetes (T2D) had been modeled ex vivo. To narrow our results to GA-specific genes, we applied a filter set of 1,030 genes accepted as GA associated.
    [Show full text]
  • Genetic and Genomic Analysis of Hyperlipidemia, Obesity and Diabetes Using (C57BL/6J × TALLYHO/Jngj) F2 Mice
    University of Tennessee, Knoxville TRACE: Tennessee Research and Creative Exchange Nutrition Publications and Other Works Nutrition 12-19-2010 Genetic and genomic analysis of hyperlipidemia, obesity and diabetes using (C57BL/6J × TALLYHO/JngJ) F2 mice Taryn P. Stewart Marshall University Hyoung Y. Kim University of Tennessee - Knoxville, [email protected] Arnold M. Saxton University of Tennessee - Knoxville, [email protected] Jung H. Kim Marshall University Follow this and additional works at: https://trace.tennessee.edu/utk_nutrpubs Part of the Animal Sciences Commons, and the Nutrition Commons Recommended Citation BMC Genomics 2010, 11:713 doi:10.1186/1471-2164-11-713 This Article is brought to you for free and open access by the Nutrition at TRACE: Tennessee Research and Creative Exchange. It has been accepted for inclusion in Nutrition Publications and Other Works by an authorized administrator of TRACE: Tennessee Research and Creative Exchange. For more information, please contact [email protected]. Stewart et al. BMC Genomics 2010, 11:713 http://www.biomedcentral.com/1471-2164/11/713 RESEARCH ARTICLE Open Access Genetic and genomic analysis of hyperlipidemia, obesity and diabetes using (C57BL/6J × TALLYHO/JngJ) F2 mice Taryn P Stewart1, Hyoung Yon Kim2, Arnold M Saxton3, Jung Han Kim1* Abstract Background: Type 2 diabetes (T2D) is the most common form of diabetes in humans and is closely associated with dyslipidemia and obesity that magnifies the mortality and morbidity related to T2D. The genetic contribution to human T2D and related metabolic disorders is evident, and mostly follows polygenic inheritance. The TALLYHO/ JngJ (TH) mice are a polygenic model for T2D characterized by obesity, hyperinsulinemia, impaired glucose uptake and tolerance, hyperlipidemia, and hyperglycemia.
    [Show full text]
  • Variation in Protein Coding Genes Identifies Information
    bioRxiv preprint doi: https://doi.org/10.1101/679456; this version posted June 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Animal complexity and information flow 1 1 2 3 4 5 Variation in protein coding genes identifies information flow as a contributor to 6 animal complexity 7 8 Jack Dean, Daniela Lopes Cardoso and Colin Sharpe* 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Institute of Biological and Biomedical Sciences 25 School of Biological Science 26 University of Portsmouth, 27 Portsmouth, UK 28 PO16 7YH 29 30 * Author for correspondence 31 [email protected] 32 33 Orcid numbers: 34 DLC: 0000-0003-2683-1745 35 CS: 0000-0002-5022-0840 36 37 38 39 40 41 42 43 44 45 46 47 48 49 Abstract bioRxiv preprint doi: https://doi.org/10.1101/679456; this version posted June 21, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Animal complexity and information flow 2 1 Across the metazoans there is a trend towards greater organismal complexity. How 2 complexity is generated, however, is uncertain. Since C.elegans and humans have 3 approximately the same number of genes, the explanation will depend on how genes are 4 used, rather than their absolute number.
    [Show full text]
  • 1 Title 1 Loss of PABPC1 Is Compensated by Elevated PABPC4
    bioRxiv preprint doi: https://doi.org/10.1101/2021.02.07.430165; this version posted February 15, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 1 1 Title 2 Loss of PABPC1 is compensated by elevated PABPC4 and correlates with transcriptome 3 changes 4 5 Jingwei Xie1, 2, Xiaoyu Wei1, Yu Chen1 6 7 1 Department of Biochemistry and Groupe de recherche axé sur la structure des 8 protéines, McGill University, Montreal, Quebec H3G 0B1, Canada 9 10 2 To whom correspondence should be addressed: Dept. of Biochemistry, McGill 11 University, Montreal, QC H3G 0B1, Canada. E-mail: [email protected]. 12 13 14 15 Abstract 16 Cytoplasmic poly(A) binding protein (PABP) is an essential translation factor that binds to 17 the 3' tail of mRNAs to promote translation and regulate mRNA stability. PABPC1 is the 18 most abundant of several PABP isoforms that exist in mammals. Here, we used the 19 CRISPR/Cas genome editing system to shift the isoform composition in HEK293 cells. 20 Disruption of PABPC1 elevated PABPC4 levels. Transcriptome analysis revealed that the 21 shift in the dominant PABP isoform was correlated with changes in key transcriptional 22 regulators. This study provides insight into understanding the role of PABP isoforms in 23 development and differentiation. 24 Keywords 25 PABPC1, PABPC4, c-Myc 26 bioRxiv preprint doi: https://doi.org/10.1101/2021.02.07.430165; this version posted February 15, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder.
    [Show full text]
  • RAB9 Antibody (R32125)
    RAB9 Antibody (R32125) Catalog No. Formulation Size R32125 0.5mg/ml if reconstituted with 0.2ml sterile DI water 100 ug Bulk quote request Availability 1-3 business days Species Reactivity Human, Mouse, Rat Format Antigen affinity purified Clonality Polyclonal (rabbit origin) Isotype Rabbit IgG Purity Antigen affinity Buffer Lyophilized from 1X PBS with 2.5% BSA and 0.025% sodium azide UniProt P51151 Applications Western blot : 0.1-0.5ug/ml Limitations This RAB9 antibody is available for research use only. Western blot testing of 1) rat brain, 2) mouse brain, 3) human HeLa and 4) human MCF7 lysate with RAB9 antibody. Expected/observed molecular weight: ~23 kDa. Description Ras-related protein Rab-9A, also called RAB9, is a protein that in humans is encoded by the RAB9A gene. This gene is mapped to Xp22.2. RAB9 has been localized to components of the endocytic/exocytic pathway and has been implicated in recycling of membrane receptors. It has been found that downregulation of RAB9A gene expression in HeLa cells induced severe cell vacuolation. RAB9A GTPase is directly bound by TIP47 in its active, GTP-bound conformation. Moreover, RAB9A increases the affinity of TIP47 for its cargo. What’s more, this gene may involved in the transport of proteins between the endosomes and the trans Golgi network. RAB9A has been shown to interact with RABEPK and TIP47. Application Notes Optimal dilution of the RAB9 antibody should be determined by the researcher. Immunogen Amino acids KDATNVAAAFEEAVRRVLATEDRSDHLIQTDTVNLHRK of human RAB9A were used as the immunogen for the RAB9 antibody. Storage After reconstitution, the RAB9 antibody can be stored for up to one month at 4oC.
    [Show full text]
  • Bioinformatics Analysis for the Identification of Differentially Expressed Genes and Related Signaling Pathways in H
    Bioinformatics analysis for the identification of differentially expressed genes and related signaling pathways in H. pylori-CagA transfected gastric cancer cells Dingyu Chen*, Chao Li, Yan Zhao, Jianjiang Zhou, Qinrong Wang and Yuan Xie* Key Laboratory of Endemic and Ethnic Diseases , Ministry of Education, Guizhou Medical University, Guiyang, China * These authors contributed equally to this work. ABSTRACT Aim. Helicobacter pylori cytotoxin-associated protein A (CagA) is an important vir- ulence factor known to induce gastric cancer development. However, the cause and the underlying molecular events of CagA induction remain unclear. Here, we applied integrated bioinformatics to identify the key genes involved in the process of CagA- induced gastric epithelial cell inflammation and can ceration to comprehend the potential molecular mechanisms involved. Materials and Methods. AGS cells were transected with pcDNA3.1 and pcDNA3.1::CagA for 24 h. The transfected cells were subjected to transcriptome sequencing to obtain the expressed genes. Differentially expressed genes (DEG) with adjusted P value < 0.05, | logFC |> 2 were screened, and the R package was applied for gene ontology (GO) enrichment and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis. The differential gene protein–protein interaction (PPI) network was constructed using the STRING Cytoscape application, which conducted visual analysis to create the key function networks and identify the key genes. Next, the Submitted 20 August 2020 Kaplan–Meier plotter survival analysis tool was employed to analyze the survival of the Accepted 11 March 2021 key genes derived from the PPI network. Further analysis of the key gene expressions Published 15 April 2021 in gastric cancer and normal tissues were performed based on The Cancer Genome Corresponding author Atlas (TCGA) database and RT-qPCR verification.
    [Show full text]
  • Human Induced Pluripotent Stem Cell–Derived Podocytes Mature Into Vascularized Glomeruli Upon Experimental Transplantation
    BASIC RESEARCH www.jasn.org Human Induced Pluripotent Stem Cell–Derived Podocytes Mature into Vascularized Glomeruli upon Experimental Transplantation † Sazia Sharmin,* Atsuhiro Taguchi,* Yusuke Kaku,* Yasuhiro Yoshimura,* Tomoko Ohmori,* ‡ † ‡ Tetsushi Sakuma, Masashi Mukoyama, Takashi Yamamoto, Hidetake Kurihara,§ and | Ryuichi Nishinakamura* *Department of Kidney Development, Institute of Molecular Embryology and Genetics, and †Department of Nephrology, Faculty of Life Sciences, Kumamoto University, Kumamoto, Japan; ‡Department of Mathematical and Life Sciences, Graduate School of Science, Hiroshima University, Hiroshima, Japan; §Division of Anatomy, Juntendo University School of Medicine, Tokyo, Japan; and |Japan Science and Technology Agency, CREST, Kumamoto, Japan ABSTRACT Glomerular podocytes express proteins, such as nephrin, that constitute the slit diaphragm, thereby contributing to the filtration process in the kidney. Glomerular development has been analyzed mainly in mice, whereas analysis of human kidney development has been minimal because of limited access to embryonic kidneys. We previously reported the induction of three-dimensional primordial glomeruli from human induced pluripotent stem (iPS) cells. Here, using transcription activator–like effector nuclease-mediated homologous recombination, we generated human iPS cell lines that express green fluorescent protein (GFP) in the NPHS1 locus, which encodes nephrin, and we show that GFP expression facilitated accurate visualization of nephrin-positive podocyte formation in
    [Show full text]
  • Mapping of Craniofacial Traits in Outbred Mice Identifies Major Developmental Genes Involved in Shape Determination
    Mapping of craniofacial traits in outbred mice identifies major developmental genes involved in shape determination Luisa F Pallares1, Peter Carbonetto2,3, Shyam Gopalakrishnan2,4, Clarissa C Parker2,5, Cheryl L Ackert-Bicknell6, Abraham A Palmer2,7, Diethard Tautz1 # 1Max Planck Institute for Evolutionary Biology, Plön, Germany 2University of Chicago, Chicago, Illinois, USA 3AncestryDNA, San Francisco, California, USA 4Museum of Natural History, Copenhagen University, Copenhagen, Denmark 5Middlebury College, Department of Psychology and Program in Neuroscience, Middlebury VT, USA 6Center for Musculoskeletal Research, University of Rochester, Rochester, NY USA 7University of California San Diego, La Jolla, CA, USA # corresponding author: [email protected] short title: craniofacial shape mapping Abstract The vertebrate cranium is a prime example of the high evolvability of complex traits. While evidence of genes and developmental pathways underlying craniofacial shape determination 1 is accumulating, we are still far from understanding how such variation at the genetic level is translated into craniofacial shape variation. Here we used 3D geometric morphometrics to map genes involved in shape determination in a population of outbred mice (Carworth Farms White, or CFW). We defined shape traits via principal component analysis of 3D skull and mandible measurements. We mapped genetic loci associated with shape traits at ~80,000 candidate single nucleotide polymorphisms in ~700 male mice. We found that craniofacial shape and size are highly heritable, polygenic traits. Despite the polygenic nature of the traits, we identified 17 loci that explain variation in skull shape, and 8 loci associated with variation in mandible shape. Together, the associated variants account for 11.4% of skull and 4.4% of mandible shape variation, however, the total additive genetic variance associated with phenotypic variation was estimated in ~45%.
    [Show full text]
  • INTS14 Form a Functional Module of Integrator That Binds Nucleic Acids and the Cleavage Module
    ARTICLE https://doi.org/10.1038/s41467-020-17232-2 OPEN INTS10–INTS13–INTS14 form a functional module of Integrator that binds nucleic acids and the cleavage module Kevin Sabath1, Melanie L. Stäubli1, Sabrina Marti1, Alexander Leitner 2, Murielle Moes 1 & ✉ Stefanie Jonas 1 ′ 1234567890():,; The Integrator complex processes 3 -ends of spliceosomal small nuclear RNAs (snRNAs). Furthermore, it regulates transcription of protein coding genes by terminating transcription after unstable pausing. The molecular basis for Integrator’s functions remains obscure. Here, we show that INTS10, Asunder/INTS13 and INTS14 form a separable, functional Integrator module. The structure of INTS13-INTS14 reveals a strongly entwined complex with a unique chain interlink. Unexpected structural homology to the Ku70-Ku80 DNA repair complex suggests nucleic acid affinity. Indeed, the module displays affinity for DNA and RNA but prefers RNA hairpins. While the module plays an accessory role in snRNA maturation, it has a stronger influence on transcription termination after pausing. Asunder/INTS13 directly binds Integrator’s cleavage module via a conserved C-terminal motif that is involved in snRNA processing and required for spermatogenesis. Collectively, our data establish INTS10-INTS13- INTS14 as a nucleic acid-binding module and suggest that it brings cleavage module and target transcripts into proximity. 1 Institute of Molecular Biology and Biophysics, ETH Zurich, Otto-Stern-Weg 5, CH-8093 Zurich, Switzerland. 2 Institute of Molecular Systems Biology, ETH ✉ Zurich, Zurich, Switzerland. email: [email protected] NATURE COMMUNICATIONS | (2020) 11:3422 | https://doi.org/10.1038/s41467-020-17232-2 | www.nature.com/naturecommunications 1 ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-17232-2 NA polymerase II (RNAPII) requires essential processing development in Drosophila and Xenopus and shown to be factors to terminate transcription and to release its primary required for correct mitosis in human cells36,38,39.
    [Show full text]
  • Understanding and Improving the Identification of Somatic Variants
    Understanding and Improving the Identification of Somatic Variants Vinaya Vijayan Dissertation submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy In Genetics, Bioinformatics and Computational Biology Chair: Liqing Zhang Christopher Franck Lenwood S. Heath Xiaowei Wu August 12th, 2016 Blacksburg, VA, USA Keywords: Somatic variants, Somatic variant callers, Somatic point mutations, Short tandem repeat variation, Lung squamous cell carcinoma Understanding and Improving the Identification of Somatic Variants Vinaya Vijayan ABSTRACT It is important to understand the entire spectrum of somatic variants to gain more insight into mutations that occur in different cancers for development of better diagnostic, prognostic and therapeutic tools. This thesis outlines our work in understanding somatic variant calling, improving the identification of somatic variants from whole genome and whole exome platforms and identification of biomarkers for lung cancer. Integrating somatic variants from whole genome and whole exome platforms poses a challenge as variants identified in the exonic regions of the whole genome platform may not be identified on the whole exome platform and vice-versa. Taking a simple union or intersection of the somatic variants from both platforms would lead to inclusion of many false positives (through union) and exclusion of many true variants (through intersection). We develop the first framework to improve the identification of somatic variants on whole genome and exome platforms using a machine learning approach by combining the results from two popular somatic variant callers. Testing on simulated and real data sets shows that our framework identifies variants more accurately than using only one somatic variant caller or using variants from only one platform.
    [Show full text]
  • Mouse Rabepk Conditional Knockout Project (CRISPR/Cas9)
    https://www.alphaknockout.com Mouse Rabepk Conditional Knockout Project (CRISPR/Cas9) Objective: To create a Rabepk conditional knockout Mouse model (C57BL/6J) by CRISPR/Cas-mediated genome engineering. Strategy summary: The Rabepk gene (NCBI Reference Sequence: NM_145522 ; Ensembl: ENSMUSG00000070953 ) is located on Mouse chromosome 2. 8 exons are identified, with the ATG start codon in exon 1 and the TAA stop codon in exon 8 (Transcript: ENSMUST00000145903). Exon 3 will be selected as conditional knockout region (cKO region). Deletion of this region should result in the loss of function of the Mouse Rabepk gene. To engineer the targeting vector, homologous arms and cKO region will be generated by PCR using BAC clone RP23-347M2 as template. Cas9, gRNA and targeting vector will be co-injected into fertilized eggs for cKO Mouse production. The pups will be genotyped by PCR followed by sequencing analysis. Note: Exon 3 starts from about 6.84% of the coding region. The knockout of Exon 3 will result in frameshift of the gene. The size of intron 2 for 5'-loxP site insertion: 2045 bp, and the size of intron 3 for 3'-loxP site insertion: 4516 bp. The size of effective cKO region: ~658 bp. The cKO region does not have any other known gene. Page 1 of 8 https://www.alphaknockout.com Overview of the Targeting Strategy Wildtype allele gRNA region 5' gRNA region 3' 1 2 3 8 Targeting vector Targeted allele Constitutive KO allele (After Cre recombination) Legends Exon of mouse Rabepk Homology arm cKO region loxP site Page 2 of 8 https://www.alphaknockout.com Overview of the Dot Plot Window size: 10 bp Forward Reverse Complement Sequence 12 Note: The sequence of homologous arms and cKO region is aligned with itself to determine if there are tandem repeats.
    [Show full text]
  • Proteomicsdb: a Multi-Omics and Multi-Organism Resource for Life Science
    Published online 30 October 2019 Nucleic Acids Research, 2020, Vol. 48, Database issue D1153–D1163 doi: 10.1093/nar/gkz974 ProteomicsDB: a multi-omics and multi-organism resource for life science research Patroklos Samaras 1, Tobias Schmidt 1,MartinFrejno1, Siegfried Gessulat1,2, Maria Reinecke1,3,4, Anna Jarzab1, Jana Zecha1, Julia Mergner1, Piero Giansanti1, Hans-Christian Ehrlich2, Stephan Aiche2, Johannes Rank5,6, Harald Kienegger5,6, Helmut Krcmar5,6, Bernhard Kuster1,7,* and Mathias Wilhelm1,* 1Chair of Proteomics and Bioanalytics, Technical University of Munich (TUM), Freising, Bavaria, Germany, 2Innovation Center Network, SAP SE, Potsdam, Germany, 3German Cancer Consortium (DKTK), Partner Site Munich, Munich, Germany, 4German Cancer Research Center (DKFZ), Heidelberg, Germany, 5Chair for Information Systems, Technical University of Munich (TUM), Garching, Germany, 6SAP University Competence Center, Technical University of Munich (TUM), Garching, Germany and 7Bavarian Biomolecular Mass Spectrometry Center (BayBioMS), Technical University of Munich (TUM), Freising, Bavaria, Germany Received September 14, 2019; Revised October 11, 2019; Editorial Decision October 11, 2019; Accepted October 15, 2019 ABSTRACT INTRODUCTION ProteomicsDB (https://www.ProteomicsDB.org) ProteomicsDB (https://www.ProteomicsDB.org) is an in- started as a protein-centric in-memory database for memory database initially developed for the exploration of the exploration of large collections of quantitative large quantities of quantitative human mass spectrometry- mass spectrometry-based proteomics data. The based proteomics data including the first draft of the hu- data types and contents grew over time to include man proteome (1). Among many features, it allows the real- time exploration and retrieval of protein abundance values RNA-Seq expression data, drug-target interactions across different tissues, cell lines, and body fluids via inter- and cell line viability data.
    [Show full text]