Computational Identification of Genes: Ab Initio

Total Page:16

File Type:pdf, Size:1020Kb

Computational Identification of Genes: Ab Initio Computational identification of genes: ab initio and comparative approaches Genís Parra Farré PhD thesis Barcelona, December 2004 The Figure in the cover shows a representation of geneid predictions (Figure 1, Parra et al. (2000)). Computational identification of genes: ab initio and comparative approaches Genís Parra Farré Memòria presentada per optar al grau de Doctor en Biologia per la Universitat Pompeu Fabra. Aquesta Tesi Doctoral ha estat realitzada sota la direcció del Dr. Roderic Guigó Serra al Departament de Ciències Experimentals i de la Salut de la Universitat Pompeu Fabra Roderic Guigó Serra Genís Parra Farré Barcelona, December 2004 The research in this thesis has been carried out at the Genome Bioinformatics Lab (GBL) within the Grup de Recerca en Informàtica Biomèdica (GRIB) at the Parc de Recerca Biomèdica de Barcelona (PRBB), a consortium of the Institut Municipal d’Investigació Mèdica (IMIM), Universitat Pompeu Fabra (UPF) and Centre de Regulació Genòmica (CRG). The research carried out in this thesis has been supported by grants from Ministerio de Ciencia y Tecnología to R. Guigó. To my parents To my brother and sisters Dipòsit legal: B.30805-2005 ISBN: 84-689-2750-3 Motivation It is clear that we are living a really important period in the development and the knowl- edge of life sciences. Fifty years after the description of the structure of the double helix, we have moved from the analysis of a single gene to the systematic mass sequencing of entire genomes. At the time of writing this dissertation, whole genome sequencing projects for hundreds of organisms (bacteria, archea and eukaryota, as well as many viruses and organelles) are either complete or underway. All the information we are gath- ering today will probably modify the way we will understand life, science and medicine. But, before the best use can be made of this data, the identification and the precise location of the functional regions of the genomic sequences must be determined. The most impor- tant things to realize about the “book of life”, is that we know almost nothing about the language in which it is written, and that raw genomic sequences are mainly useless for the scientific community. The challenge ahead is to extract relevant information encoded within the billions of nucleotides stored in our databases. In a very simplistic description, the first step in the functional annotation of a genome would be to find the collection of genes encoded in the nucleic acid sequences. The next step would be to assign a function to each protein, where the three dimensional structure of the proteins will play a key role. Then, using microarrays technology, it will be feasible to obtain the spatial and temporal expression pattern of each gene at any developmental stage or specific condition. Finally, the last step would be to establish the network of interactions and regulations among all the proteins of a complete genome. This thesis focuses on the first step of any genome analysis: to find where genes are. The motivation of this thesis, thus, is to give a little insight in how genes are encoded and recognized by the cell machinery and to use this information to find genes in unannotated genomic sequences. The complexity of gene prediction differs substantially in prokary- otic and eukaryotic genomes. While prokaryotic genes are encoded in single continuous open reading frames, usually adjacent to each other, eukaryotic genes are separated by long stretches of intergenic regions, and their coding sequences can be interrupted by large non coding sequences. One of the objectives of this thesis is the development of tools to identify eukaryotic genes through the modeling and recognition of their intrinsic signals and properties. This thesis addresses another significant open problem of this field: how the sequence of related genomes can contribute to the identification of genes. The value of comparative genomics is illustrated by the sequencing of the mouse genome for the purpose of anno- tating the human genome. The availability of closely related genomes makes it possible to carry out genome-wide comparisons and analysis of syntenic regions. Recently, compar- i ii Motivation ative gene predictions programs exploit this data under the assumption that conserved regions between related species correspond to functional regions (coding genes among them). Thus, the second part of this thesis describes a gene prediction program that com- bines ab initio gene prediction with comparative information between two genomes to improve the accuracy of the predictions. Nowadays computational analysis is a major, integral part of genomics. It would not be an exaggeration to claim that genomics analysis can only be made with computational tools. Only by using computational methods and statistical models we can try to find out how genes are encoded and try to accurately predict their location in complete genomic sequences. Thus, the work described in this dissertation is essentially interdisciplinary; this means that while the basic subject of matter is biological and the obtained results are of biological interest, techniques from other fields have been extensively used. Statistical approaches have been used to create models of genomic features to be able to recognize sequence mo- tifs and reproduce the underneath biological process, while computational programming has been applied to include these models into efficient bioinformatic tools. Genís Parra Barcelona, December 2004 Acknowledgments It is not just to follow convention that I first acknowledge my PhD advisor, Roderic Guigó. Quite simply, if not for him, my academic career would have finished with my bachelor degree. It was he who saw past my sub-optimal scores to someone able to work on re- search. I am indebted to him for letting me start what I hope will be a long career in research. Other people thath I would like to thanl are: Pankaj Agarwall, for the stage in the GlaxoSmithKline, the Dyctiostelium annotation group: Karol Szafranski, Gernot Glökner and Mathias Platzer, the people from the University of Geneva: Manolis Dermitzakis, Alexandre Reymond and Stylianos Antonarakis and Michael Brent and all the people of his lab for the scientific collaboration and the invitation to Saint Louis. I would also like to mention the people who may not had a direct impact on this thesis, but have influenced it indirectly by molding me into the person I am today. All the people who have given of their time, talents, and expertise to help me on this project have enriched my life. For their special friendships and assistance, I am most grateful: To Mercè for being there in the darkest years of my PhD (and life). For encouraging me when I was giving everything up. For your trust in friendship. For the years we lived together and for all the amazing things you taught me. For all the incredible journeys we did. For all the love you gave me. To Sergi for those Sunday afternoons in la filmoteca. For your way of living life and science. For la Passió d’Esparraguera, Eric Sardinas and for your comics. For those nights in New York. Special thanks for your cocktails, for listening to me and for your wise advises. To Cristina for your enthusiasm and your energy. For your tiramisu, for your pesto and for your profiteroles. For all your tenderness and comprehension. For Patrizio, Gur- dieff and Dilan Dog. For going with me to fill my bottle of water every day. For the Ravenna mosaics. For bringing happiness and joy in our every day life. To Robert for your thesis template, for your whiskey and Risk sessions. For the par- ties on your roof. For being our volleyball captain. For your pictures of California (that decided my future). For your wise statements. To Pep for sharing a fraction of all your infinite wisdom with me. I learned (or at least I tried) from you to try to do the things the best one can. To Enrique for programing geneid. I learned a lot while working with you. For your patience with my geneid problems. For all the course we teached and the moments we shared. iii iv Acknowledgments To Fabien and Isabel for those roller hockey nights. For your penguin. For your strength and courage. For your friendship. To Xavi for being our mentor in the early days. I will say nothing about your home directory. For being a destroyer. Quin payo !! To Bet for massaging my breast. For gifts you give. For your sympathy and friendship. For your complicity. To Rut for the swimming mornings, for the theater, for your smile. Remember me when you become a famous actress. To Noura for being just like you are. For your voice and for that night in the karaoke. To Ramon for all those amazing gadgets you have. For your outdoor activities. For your true friendship, for your music, for your stories and for el Pilar. To Peppolino for your hugs, for running with me, for the musica pertarda, and for the Arena sessions. To Citlali for all the pushes and shoves playing hockey. For Valencia and Blanes. For how I feel being with you. For your guacamole. To Moisès for installing Slackware on my first hard disk and for your long discussions, for your spontaneity and freshness. To Charles for incite, for your comics and your sense of humor. To Oscar for all the conversations we had while other people were dancing. For the amazing physical properties of liquids falling inside glasses. To my students. Specially to my first group: Jimena, Encarni, Bet and Jordi, who show me how difficult is to be a teacher. Just kidding !! You were the best group I ever had. To Gus. To Jan-Jaap for all the effort you did in the correction of this thesis, it was really a lot of work and I really appreciate it !!! To Queviures Murgadella for all the food I shared with my friends in the lab.
Recommended publications
  • Large-Scale Opening of Utrophints Tandem Calponin Homology (CH
    Large-scale opening of utrophin’s tandem calponin homology (CH) domains upon actin binding by an induced-fit mechanism Ava Y. Lin, Ewa Prochniewicz, Zachary M. James, Bengt Svensson, and David D. Thomas1 Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, Minneapolis, MN 55455 Edited by James A. Spudich, Stanford University School of Medicine, Stanford, CA, and approved June 20, 2011 (received for review April 21, 2011) We have used site-directed spin labeling and pulsed electron has prevented the development of a reliable structural model for paramagnetic resonance to resolve a controversy concerning the any of these complexes. A major unresolved question concerns structure of the utrophin–actin complex, with implications for the the relative disposition of the tandem CH domains (CH1 and pathophysiology of muscular dystrophy. Utrophin is a homolog of CH2) (9, 10). Crystal structures of the tandem CH domains dystrophin, the defective protein in Duchenne and Becker muscular showed a closed conformation for fimbrin (11) and α-actinin (12), dystrophies, and therapeutic utrophin derivatives are currently but an open conformation for both utrophin (Utr261) (Fig. 1A) being developed. Both proteins have a pair of N-terminal calponin and dystrophin (Dys246) (16). The crystal structure of Utr261 homology (CH) domains that are important for actin binding. suggests that the central helical region connecting CH1 and CH2 Although there is a crystal structure of the utrophin actin-binding is highly flexible. Even for α-actinin, which has a closed crystal domain, electron microscopy of the actin-bound complexes has structure, computational analysis suggests the potential for a high produced two very different structural models, in which the CH do- degree of dynamic flexibility that facilitates actin binding (17).
    [Show full text]
  • Functional Aspects and Genomic Analysis
    Discussion IV./ Discussion IV.1. The discovery of UEV protein and its role in different cellular processes In our studies of cell differentiation and cell cycle control, we have isolated a new gene that is downregulated upon cell differentiation. We have demonstrated that this gene, previously called CROC1 and considered a transcriptional activator of c-fos (Rothofsky and Lin, 1997), is highly conserved in phylogeny, and constitutes, by sequence relationship, a novel subfamily of the ubiquitin-conjugating, or E2, enzymes. We have demonstrated that these proteins are very conserved in all eukaryotic organisms and that they are very similar to the E2 enzymes in sequence and structure, but these proteins lack a conserved cysteine residue responsible for the catalytic activity of the E2 enzymes (Chen et al., 1993; Jentsch, 1992a). We have given these proteins the name UEV (ubiquitin-conjugating E2 enzyme variant). Work by other laboratories has shown that experimental mutagenesis of this cysteine in the catalytic center leads to the inactivation of the E2 enzyme activity, and the mutated protein can behave as a dominant negative variant (Banerjee et al., 1995; Sung et al., 1990; Madura et al., 1993). However, in our initial experiments with recombinant proteins, UEV did not behave as a negative regulator of ubiquitination (Sancho et al., 1998). We have demonstrated also the existence of at least two different human UEV genes, one coding for UEV1/CROC-1, and the other coding for UEV2. The second protein has also been given different names by others, DDVit-1 (Fritsche et al., 1997). 30 Discussion The transcripts from the two human UEV genes differ in their 3’untranslated regions, and produce almost identical proteins.
    [Show full text]
  • Sequence Alignment/Map) Is a Text Format for Storing Sequence Alignment Data in a Series of Tab Delimited ASCII Columns
    NGS FILE FORMATS SEQUENCE FILE FORMATS FASTA FORMAT FASTA Single sequence example: >HWI-ST398_0092:1:1:5372:2486#0/1 TTTTTCGTTCTTTTCATGTACCGCTTTTTGTTCGGTTAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGAT ACGTAGCAGCAGCATCAGTACGACTACGACGACTAGCACATGCGACGATCGATGCTAGCTGACTATCGATG Multiple sequence example: >Sequence Name 1 TTTTTCGTTCTTTTCATGTACCGCTTTTTGTTCGGTTAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGAT ACGTAGCAGCAGCATCAGTACGACTACGACGACTAGCACATGCGACGATCGATGCTAGCTGACTATCGATG >Sequence Name 2 ACGTAGACACGACTAGCATCAGCTACGCATCGATCAGCATCGACTAGCATCACACATCGATCAGCATCACGACTAGCAT AGCATCGACTACACTACGACTACGATCCACGTACGACTAGCATGCTAGCGCTAGCTAGCTAGCTAGTCGATCGATGAGT AGCTAGCTAGCTAGC >Sequence Name 3 ACTCAGCATGCATCAGCATCGACTACGACTACGACATCGACTAGCATCAGCAT SEQUENCE FILE FORMATS FASTQ FORMAT FASTQ Text based format for storing sequence data and corresponding quality scores for each base. To enable a one-one correspondence between the base sequence and the quality score the score is stored as a single one letter/number code using an offset of the standard ASCII code. Quality scores range from 0 to 40 and represent a log10 score for the probability of being wrong. E.g. score of 30 => 1:1000 chance of error SEQUENCE FILE FORMATS FASTQ FORMAT FASTQ Each fastq file contain multiple entries and each entry consists of 4 lines: 1. header line beginning with “@“ and sequence name 2. sequence line 3. header line beginning with “+” which can have the name but rarely does 4. quality score line SEQUENCE FILE FORMATS FASTQ FORMAT FASTQ @HWI-ST398_0092:6:73:5372:2486#0/1 TTTTTCGTTCTTTTCATGTACCGCTTTTTGTTCGGTTAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGAT
    [Show full text]
  • The Roles of Actin-Binding Domains 1 and 2 in the Calcium-Dependent Regulation of Actin Filament Bundling by Human Plastins
    Article The Roles of Actin-Binding Domains 1 and 2 in the Calcium-Dependent Regulation of Actin Filament Bundling by Human Plastins Christopher L. Schwebach 1,2, Richa Agrawal 1, Steffen Lindert 1, Elena Kudryashova 1 and Dmitri S. Kudryashov 1,2 1 - Department of Chemistry and Biochemistry, The Ohio State University, Columbus, OH 43210, USA 2 - Molecular, Cellular, and Developmental Biology Program, The Ohio State University, Columbus, OH 43210, USA Correspondence to Dmitri S. Kudryashov: Department of Chemistry and Biochemistry, The Ohio State University, 484 W 12th Ave, 728 Biosciences Building, Columbus, OH 43210, USA. [email protected] http://dx.doi.org/10.1016/j.jmb.2017.06.021 Edited by James Sellers Abstract The actin cytoskeleton is a complex network controlled by a vast array of intricately regulated actin-binding proteins. Human plastins (PLS1, PLS2, and PLS3) are evolutionary conserved proteins that non-covalently crosslink actin filaments into tight bundles. Through stabilization of such bundles, plastins contribute, in an isoform-specific manner, to the formation of kidney and intestinal microvilli, inner ear stereocilia, immune synapses, endocytic patches, adhesion contacts, and invadosomes of immune and cancer cells. All plastins comprise an N-terminal Ca2+-binding regulatory headpiece domain followed by two actin-binding domains (ABD1 and ABD2). Actin bundling occurs due to simultaneous binding of both ABDs to separate actin filaments. Bundling is negatively regulated by Ca2+, but the mechanism of this inhibition remains unknown. In 2+ this study, we found that the bundling abilities of PLS1 and PLS2 were similarly sensitive to Ca (pCa50 ~6.4), whereas PLS3 was less sensitive (pCa50 ~5.9).
    [Show full text]
  • Gene Prediction Using Deep Learning
    FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO Gene prediction using Deep Learning Pedro Vieira Lamares Martins Mestrado Integrado em Engenharia Informática e Computação Supervisor: Rui Camacho (FEUP) Second Supervisor: Nuno Fonseca (EBI-Cambridge, UK) July 22, 2018 Gene prediction using Deep Learning Pedro Vieira Lamares Martins Mestrado Integrado em Engenharia Informática e Computação Approved in oral examination by the committee: Chair: Doctor Jorge Alves da Silva External Examiner: Doctor Carlos Manuel Abreu Gomes Ferreira Supervisor: Doctor Rui Carlos Camacho de Sousa Ferreira da Silva July 22, 2018 Abstract Every living being has in their cells complex molecules called Deoxyribonucleic Acid (or DNA) which are responsible for all their biological features. This DNA molecule is condensed into larger structures called chromosomes, which all together compose the individual’s genome. Genes are size varying DNA sequences which contain a code that are often used to synthesize proteins. Proteins are very large molecules which have a multitude of purposes within the individual’s body. Only a very small portion of the DNA has gene sequences. There is no accurate number on the total number of genes that exist in the human genome, but current estimations place that number between 20000 and 25000. Ever since the entire human genome has been sequenced, there has been an effort to consistently try to identify the gene sequences. The number was initially thought to be much higher, but it has since been furthered down following improvements in gene finding techniques. Computational prediction of genes is among these improvements, and is nowadays an area of deep interest in bioinformatics as new tools focused on the theme are developed.
    [Show full text]
  • The Actin Binding Protein Plastin-3 Is Involved in the Pathogenesis of Acute Myeloid Leukemia
    cancers Article The Actin Binding Protein Plastin-3 Is Involved in the Pathogenesis of Acute Myeloid Leukemia Arne Velthaus 1, Kerstin Cornils 2,3, Jan K. Hennigs 1, Saskia Grüb 4, Hauke Stamm 1, Daniel Wicklein 5, Carsten Bokemeyer 1, Michael Heuser 6, Sabine Windhorst 4, Walter Fiedler 1 and Jasmin Wellbrock 1,* 1 Department of Oncology, Hematology and Bone Marrow Transplantation with Division of Pneumology, Hubertus Wald University Cancer Center, University Medical Center Hamburg-Eppendorf, 20246 Hamburg, Germany; [email protected] (A.V.); [email protected] (J.K.H.); [email protected] (H.S.); [email protected] (C.B.); fi[email protected] (W.F.) 2 Department of Pediatric Hematology and Oncology, Division of Pediatric Stem Cell Transplantation and Immunology, University Medical Center Hamburg-Eppendorf, 20246 Hamburg, Germany; [email protected] 3 Research Institute Children’s Cancer Center Hamburg, 20246 Hamburg, Germany 4 Center for Experimental Medicine, Institute of Biochemistry and Signal Transduction, University Medical Center Hamburg-Eppendorf, 20246 Hamburg, Germany; [email protected] (S.G.); [email protected] (S.W.) 5 Department of Anatomy and Experimental Morphology, University Cancer Center, University Medical Center Hamburg-Eppendorf, 20246 Hamburg, Germany; [email protected] 6 Hematology, Hemostasis, Oncology and Stem Cell Transplantation, Hannover Medical School, 20246 Hannover, Germany; [email protected] * Correspondence: [email protected]; Tel.: +49-40-7410-55606 Received: 29 September 2019; Accepted: 25 October 2019; Published: 26 October 2019 Abstract: Leukemia-initiating cells reside within the bone marrow in specialized niches where they undergo complex interactions with their surrounding stromal cells.
    [Show full text]
  • Complete Genome Sequence of the Hyperthermophilic Bacteria- Thermotoga Sp
    COMPLETE GENOME SEQUENCE OF THE HYPERTHERMOPHILIC BACTERIA- THERMOTOGA SP. STRAIN RQ7 Rutika Puranik A Thesis Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE May 2015 Committee: Zhaohui Xu, Advisor Scott Rogers George Bullerjahn © 2015 Rutika Puranik All Rights Reserved iii ABSTRACT Zhaohui Xu, Advisor The genus Thermotoga is one of the deep-rooted genus in the phylogenetic tree of life and has been studied for its thermostable enzymes and the property of hydrogen production at higher temperatures. The current study focuses on the complete genome sequencing of T. sp. strain RQ7 to understand and identify the conserved as well as variable properties between the strains and its genus with the approach of comparative genomics. A pipeline was developed to assemble the complete genome based on the next generation sequencing (NGS) data. The pipeline successfully combined computational approaches with wet lab experiments to deliver a completed genome of T. sp. strain RQ7 that has the genome size of 1,851,618 bp with a GC content of 47.1%. The genome is submitted to Genbank with accession CP07633. Comparative genomic analysis of this genome with three other strains of Thermotoga, helped identifying putative natural transformation and competence protein coding genes in addition to the absence of TneDI restriction- modification system in T. sp. strain RQ7. Genome analysis also assisted in recognizing the unique genes in T. sp. strain RQ7 and CRISPR/Cas system. This strain has 8 CRISPR loci and an array of Cas coding genes in the entire genome.
    [Show full text]
  • Alternate-Locus Aware Variant Calling in Whole Genome Sequencing Marten Jäger1,2, Max Schubach1, Tomasz Zemojtel1,Knutreinert3, Deanna M
    Jäger et al. Genome Medicine (2016) 8:130 DOI 10.1186/s13073-016-0383-z RESEARCH Open Access Alternate-locus aware variant calling in whole genome sequencing Marten Jäger1,2, Max Schubach1, Tomasz Zemojtel1,KnutReinert3, Deanna M. Church4 and Peter N. Robinson1,2,3,5,6* Abstract Background: The last two human genome assemblies have extended the previous linear golden-path paradigm of the human genome to a graph-like model to better represent regions with a high degree of structural variability. The new model offers opportunities to improve the technical validity of variant calling in whole-genome sequencing (WGS). Methods: We developed an algorithm that analyzes the patterns of variant calls in the 178 structurally variable regions of the GRCh38 genome assembly, and infers whether a given sample is most likely to contain sequences from the primary assembly, an alternate locus, or their heterozygous combination at each of these 178 regions. We investigate 121 in-house WGS datasets that have been aligned to the GRCh37 and GRCh38 assemblies. Results: We show that stretches of sequences that are largely but not entirely identical between the primary assembly and an alternate locus can result in multiple variant calls against regions of the primary assembly. In WGS analysis, this results in characteristic and recognizable patterns of variant calls at positions that we term alignable scaffold-discrepant positions (ASDPs). In 121 in-house genomes, on average 51.8 ± 3.8 of the 178 regions were found to correspond best to an alternate locus rather than the primary assembly sequence, and filtering these genomes with our algorithm led to the identification of 7863 variant calls per genome that colocalized with ASDPs.
    [Show full text]
  • The Role of Actin Binding Proteins in Cell Motility Elizabeth Ojukwu University of Connecticut - Storrs, [email protected]
    University of Connecticut OpenCommons@UConn Honors Scholar Theses Honors Scholar Program Spring 5-6-2012 The Role of Actin Binding Proteins in Cell Motility Elizabeth Ojukwu University of Connecticut - Storrs, [email protected] Follow this and additional works at: https://opencommons.uconn.edu/srhonors_theses Part of the Biology Commons, and the Cell and Developmental Biology Commons Recommended Citation Ojukwu, Elizabeth, "The Role of Actin Binding Proteins in Cell Motility" (2012). Honors Scholar Theses. 271. https://opencommons.uconn.edu/srhonors_theses/271 Ojukwu The Role of Actin Binding Proteins in Cell Motility Elizabeth Ojukwu 1 Ojukwu 2 Ojukwu The Role of Actin Binding Proteins in Cell Motility Elizabeth Ojukwu University Scholar Thesis May 2012 Major Advisor- Dr. David Knecht Honors Advisor- Dr. Adam Zweifach University Scholar Advisors- Dr. Victoria Robinson and Dr. Juliet Lee Department of Molecular and Cell Biology University of Connecticut 3 Ojukwu Table of Contents Abstract . 4 Introduction . 5 I. Overview of Cell Motility . 5 II. Dictyostelium Discoideum as a Model Organism . 6 III. The Role of the Actin Cytoskeleton During Cell Migration . 8 IV. Actin Binding Proteins Regulate Actin Dynamics . 9 V. My Project . 12 Chapter 1. The Effect of Actin Binding Protein Over-Expression on Cell Motility . 15 I. Material and Methods . 15 II. Results . 19 III. Discussion and Future Directions . 22 Chapter 2 Generation and Analysis of Fimbrin Double and Triple Null Mutants . 26 I. Material and Methods . 26 II. Results . 32
    [Show full text]
  • Galaxy Platform for NGS Data Analyses
    Galaxy Platform For NGS Data Analyses Weihong Yan [email protected] Collaboratory Web Site http://qcb.ucla.edu/collaboratory Collaboratory Workshops Workshop Outline ü Day 1 § UCLA galaxy and user account § Galaxy web interface and management § Tools for NGS analyses and their application § Data formats § Build/share workflow and history § Q and A ü Day 2 § Galaxy Tools for RNA-seq analysis § Galaxy Tools for ChIP-seq analysis § Galaxy Tools for annotation. § Q and A *** Published datasets/results will be used in the tutorial UCLA Galaxy http://galaxy.hoffman2.idre.ucla.edu ü Hardware – Headnode (1) 96Gb memory, 12 core – Computing nodes (8) 48Gb memory, 12 core – Storage 100 Tb disk space ü Galaxy Resource Management - Hoffman2 grid engine Default: 1 core/job bowtie, bwa, tophat, cuffdiff, cufflinks, gatk programs: 4 core/job UCLA Galaxy http://galaxy.hoffman2.idre.ucla.edu ü galaxy login account: login: your email associated with ucla ü Disk quota: 1 Tb/user Galaxy Account Management Installed tools Launch analysis and view result History of execu7on and results Raw Reads *_qseq.txt, *.fastq Upload to Galaxy File transfer protocol (ftp) deMultiplex Barcode splitter, deMultiplex workflow fastqc, compute quality statistics, Quality Assessment draw quality score boxplot, draw nuclotides distribution Process Reads Trim sequences, sickle, scythe Alignment to bwa, bowtie, bowtie2, tophat Reference Format Conversion Text manipulation toolkit, BEDTools, SAM Results (sam/bam) Tools, java genomics toolkit, picard toolkit Downstream Analyses BS-Seeker2, cufflinks, cuffdiff, macs, macs2, GATK, CEAS Visualization Genome browser, IGV Repositories of Galaxy Tools https://toolshed.g2.bx.psu.edu ü History panel contains all datasets that are uploaded and results derived from certain analyses ü A history can be organized, annotated, and managed as a project ü History is sharable.
    [Show full text]
  • Cep-2020-00633.Pdf
    Clin Exp Pediatr Vol. 64, No. 5, 208–222, 2021 Review article CEP https://doi.org/10.3345/cep.2020.00633 Understanding the genetics of systemic lupus erythematosus using Bayesian statistics and gene network analysis Seoung Wan Nam, MD, PhD1,*, Kwang Seob Lee, MD2,*, Jae Won Yang, MD, PhD3,*, Younhee Ko, PhD4, Michael Eisenhut, MD, FRCP, FRCPCH, DTM&H5, Keum Hwa Lee, MD, MS6,7,8, Jae Il Shin, MD, PhD6,7,8, Andreas Kronbichler, MD, PhD9 1Department of Rheumatology, Wonju Severance Christian Hospital, Yonsei University Wonju College of Medicine, Wonju, Korea; 2Severance Hospital, Yonsei University College of Medicine, Seoul, Korea; 3Department of Nephrology, Yonsei University Wonju College of Medicine, Wonju, Korea; 4Division of Biomedical Engineering, Hankuk University of Foreign Studies, Yongin, Korea; 5Department of Pediatrics, Luton & Dunstable University Hospital NHS Foundation Trust, Luton, UK; 6Department of Pediatrics, Yonsei University College of Medicine, Seoul, Korea; 7Division of Pediatric Nephrology, Severance Children’s Hospital, Seoul, Korea; 8Institute of Kidney Disease Research, Yonsei University College of Medicine, Seoul, Korea; 9Department of Internal Medicine IV (Nephrology and Hypertension), Medical University Innsbruck, Innsbruck, Austria 1,3) The publication of genetic epidemiology meta-analyses has analyses have redundant duplicate topics and many errors. increased rapidly, but it has been suggested that many of the Although there has been an impressive increase in meta-analyses statistically significant results are false positive. In addition, from China, particularly those on genetic associa tions, most most such meta-analyses have been redundant, duplicate, and claimed candidate gene associations are likely false-positives, erroneous, leading to research waste. In addition, since most suggesting an urgent global need to incorporate genome-wide claimed candidate gene associations were false-positives, cor- data and state-of-the art statistical inferences to avoid a flood of rectly interpreting the published results is important.
    [Show full text]
  • Bioinformatics: a Practical Guide to the Analysis of Genes and Proteins, Second Edition Andreas D
    BIOINFORMATICS A Practical Guide to the Analysis of Genes and Proteins SECOND EDITION Andreas D. Baxevanis Genome Technology Branch National Human Genome Research Institute National Institutes of Health Bethesda, Maryland USA B. F. Francis Ouellette Centre for Molecular Medicine and Therapeutics Children’s and Women’s Health Centre of British Columbia University of British Columbia Vancouver, British Columbia Canada A JOHN WILEY & SONS, INC., PUBLICATION New York • Chichester • Weinheim • Brisbane • Singapore • Toronto BIOINFORMATICS SECOND EDITION METHODS OF BIOCHEMICAL ANALYSIS Volume 43 BIOINFORMATICS A Practical Guide to the Analysis of Genes and Proteins SECOND EDITION Andreas D. Baxevanis Genome Technology Branch National Human Genome Research Institute National Institutes of Health Bethesda, Maryland USA B. F. Francis Ouellette Centre for Molecular Medicine and Therapeutics Children’s and Women’s Health Centre of British Columbia University of British Columbia Vancouver, British Columbia Canada A JOHN WILEY & SONS, INC., PUBLICATION New York • Chichester • Weinheim • Brisbane • Singapore • Toronto Designations used by companies to distinguish their products are often claimed as trademarks. In all instances where John Wiley & Sons, Inc., is aware of a claim, the product names appear in initial capital or ALL CAPITAL LETTERS. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration. Copyright ᭧ 2001 by John Wiley & Sons, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic or mechanical, including uploading, downloading, printing, decompiling, recording or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher.
    [Show full text]