Introduction to Differential Gene Expression Analysis Using RNA-Seq

Total Page:16

File Type:pdf, Size:1020Kb

Introduction to Differential Gene Expression Analysis Using RNA-Seq Introduction to differential gene expression analysis using RNA-seq Written by Friederike D¨undar, Luce Skrabanek, Paul Zumbo September 2015 updated November 14, 2019 1 Introduction to RNA-seq 5 1.1 RNA extraction . .5 1.1.1 Quality control of RNA preparation (RIN) . .6 1.2 Library preparation methods . .6 1.3 Sequencing (Illumina) . 10 1.4 Experimental Design . 11 1.4.1 Capturing enough variability . 12 1.4.2 Avoiding bias . 13 2 Raw Data (Sequencing Reads) 15 2.1 Download from ENA . 15 2.2 Storing sequencing reads: FASTQ format . 16 2.3 Quality control of raw sequencing data . 20 3 Mapping reads with and without alignment 23 3.1 Reference genomes and annotation . 26 3.1.1 File formats for defining genomic regions . 26 3.2 Aligning reads using STAR ...................................... 30 3.3 Storing aligned reads: SAM/BAM file format . 31 3.3.1 The SAM file header section . 32 3.3.2 The SAM file alignment section . 34 3.3.3 Manipulating SAM/BAM files . 37 3.4 Quality control of aligned reads . 39 3.4.1 Basic alignment assessments . 39 3.4.2 Bias identification . 44 3.4.3 Quality control with QoRTs . 47 3.4.4 Summarizing the results of different QC tools with MultiQC ............... 47 4 Read Quantification 49 4.1 Gene-based read counting . 49 4.2 Isoform counting methods (transcript-level quantification) . 50 5 Normalizing and Transforming Read Counts 53 5.1 Normalization for sequencing depth differences . 53 5.1.1 DESeq2's specialized data set object . 54 5.2 Transformation of sequencing-depth-normalized read counts . 56 5.2.1 Log2 transformation of read counts . 56 5.2.2 Visually exploring normalized read counts . 57 5.2.3 Transformation of read counts including variance shrinkage . 58 5.3 Exploring global read count patterns . 59 5.3.1 Pairwise correlation . 59 5.3.2 Hierarchical clustering . 59 5.3.3 Principal Components Analysis (PCA) . 61 6 Differential Gene Expression Analysis (DGE) 63 © 2015-2019 Applied Bioinformatics Core j Weill Cornell Medical College Page 1 of 97 6.1 Estimating the difference between read counts for a given gene . 64 6.2 Testing the null hypothesis . 65 6.3 Running DGE analysis tools . 65 6.3.1 DESeq2 workflow . 65 6.3.2 Exploratory plots following DGE analysis . 66 6.3.3 Exercise suggestions . 69 6.3.4 edgeR . 69 6.3.5 limma-voom .......................................... 71 6.4 Judging DGE results . 73 6.5 Example downstream analyses . 75 7 Appendix 78 7.1 Improved alignment . 78 7.2 Additional tables . 79 7.3 Installing bioinformatics tools on a UNIX server . 89 Page 2 of 97 © 2015-2019 Applied Bioinformatics Core j Weill Cornell Medical College List of Tables List of Tables 1 Sequencing depth recommendations. 12 2 Examples for technical and biological replicates. 12 3 Illumina's different base call quality score schemes. 19 4 The fields of the alignment section of SAM files. 34 5 The FLAG field of SAM files. 35 6 Comparison of classification and clustering techniques. 60 7 Programs for DGE. 63 8 Biases and artifacts of Illumina sequencing data. 79 9 FASTQC test modules. 80 10 Optional entries in the header section of SAM files. 82 11 Overview of RSeQC scripts. 83 12 Overview of QoRTs QC functions. 85 13 Normalizing read counts between different conditions. 87 14 Normalizing read counts within the same sample. 88 List of Figures 1 RNA integrity assessment (RIN). .6 2 Overview of general RNA library preparation steps. .7 3 Size selection steps during common RNA library preparations. .9 4 The different steps of sequencing by synthesis. 10 5 Sequence data repositories. 15 6 Schema of an Illumina flowcell. 18 7 Phred score ranges. 19 8 Typical bioinformatics workflow of differential gene expression analysis. 20 9 Scheme of raw read QC. 21 10 Classic sequence alignment example. 23 11 RNA-seq read alignment. 24 12 Alignment-free sequence comparison. 25 13 Kallisto's strategy. 25 14 Schematic representation of a SAM file. 32 15 CIGAR strings of aligned reads. 37 16 Graphical summary of STAR's log files for 96 samples. 41 17 Different modes of counting read-transcript overlaps. 49 18 Schema of RSEM-based transcrip quantification. 51 19 Effects of different read count normalization methods. 53 20 Comparison of the read distribution plots for untransformed and log2-transformed values. 57 21 Comparison of log2- and rlog-transformed read counts. 58 22 Dendrogram of rlog-transformed read counts. 61 23 PCA on raw counts and rlog-transformed read counts. 62 24 Regression models and DGE. 64 25 Histogram and MA plot after DGE analysis. 66 26 Heatmaps of log-transformed read counts. 67 27 Read counts for two genes in two conditions. 69 28 Schematic overview of different DE analysis approaches. 71 29 Example plots to judge DGE analysis results. 74 30 Example results for ORA and GSEA. 77 © 2015-2019 Applied Bioinformatics Core j Weill Cornell Medical College Page 3 of 97 Technical Prerequisites Command-line interface The first steps of the analyses { that are the most computationally demanding { will be performed directly on our servers. The interaction with our servers is completely text-based, i.e., there will be no graphical user interface. We will instead be communicating entirely via the command line using the UNIX shell scripting language bash. You can find a good introduction into the shell basics at http://linuxcommand.org/lc3_learning_the_shell.php (for our course, chapters 2, 3, 5, 7, and 8 are probably most relevant). To start using the command line, Mac users should use the App called Terminal. Windows users need to install putty, a Terminal emulator (http://www.chiark.greenend.org.uk/~sgtatham/putty/download. html). You probably want the bits under the A Windows installer for everything except PuTTYtel heading. Putty will allow you to establish a connection with a UNIX server and interact with it. Programs that we will be using via the command line: FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ featureCounts http://bioinf.wehi.edu.au/subread-package/) MultiQC http://multiqc.info/docs/ QoRTs https://hartleys.github.io/QoRTs/ RSeQC http://rseqc.sourceforge.net/ samtools http://www.htslib.org/ STAR https://github.com/alexdobin/STAR UCSC tools.
Recommended publications
  • Genome-Wide Association Identifies Candidate Genes That Influence The
    Genome-wide association identifies candidate genes that influence the human electroencephalogram Colin A. Hodgkinsona,1, Mary-Anne Enocha, Vibhuti Srivastavaa, Justine S. Cummins-Omana, Cherisse Ferriera, Polina Iarikovaa, Sriram Sankararamanb, Goli Yaminia, Qiaoping Yuana, Zhifeng Zhoua, Bernard Albaughc, Kenneth V. Whitea, Pei-Hong Shena, and David Goldmana aLaboratory of Neurogenetics, National Institute on Alcohol Abuse and Alcoholism, Rockville, MD 20852; bComputer Science Department, University of California, Berkeley, CA 94720; and cCenter for Human Behavior Studies, Weatherford, OK 73096 Edited* by Raymond L. White, University of California, Emeryville, CA, and approved March 31, 2010 (received for review July 23, 2009) Complex psychiatric disorders are resistant to whole-genome reflects rhythmic electrical activity of the brain. EEG patterns analysis due to genetic and etiological heterogeneity. Variation in dynamically and quantitatively index cortical activation, cognitive resting electroencephalogram (EEG) is associated with common, function, and state of consciousness. EEG traits were among the complex psychiatric diseases including alcoholism, schizophrenia, original intermediate phenotypes in neuropsychiatry, having been and anxiety disorders, although not diagnostic for any of them. EEG first recorded in humans in 1924 by Hans Berger, who documented traits for an individual are stable, variable between individuals, and the α rhythm, seen maximally during states of relaxation with eyes moderately to highly heritable. Such intermediate phenotypes closed, and supplanted by faster β waves during mental activity. appear to be closer to underlying molecular processes than are EEG can be used clinically for the evaluation and differential di- clinical symptoms, and represent an alternative approach for the agnosis of epilepsy and sleep disorders, differentiation of en- identification of genetic variation that underlies complex psychiat- cephalopathy from catatonia, assessment of depth of anesthesia, ric disorders.
    [Show full text]
  • State Notation Language and Sequencer Users' Guide
    State Notation Language and Sequencer Users’ Guide Release 2.0.99 William Lupton ([email protected]) Benjamin Franksen ([email protected]) June 16, 2010 CONTENTS 1 Introduction 3 1.1 About.................................................3 1.2 Acknowledgements.........................................3 1.3 Copyright...............................................3 1.4 Note on Versions...........................................4 1.5 Notes on Release 2.1.........................................4 1.6 Notes on Releases 2.0.0 to 2.0.12..................................7 1.7 Notes on Release 2.0.........................................9 1.8 Notes on Release 1.9......................................... 12 2 Installation 13 2.1 Prerequisites............................................. 13 2.2 Download............................................... 13 2.3 Unpack................................................ 14 2.4 Configure and Build......................................... 14 2.5 Building the Manual......................................... 14 2.6 Test.................................................. 15 2.7 Use.................................................. 16 2.8 Report Bugs............................................. 16 2.9 Contribute.............................................. 16 3 Tutorial 19 3.1 The State Transition Diagram.................................... 19 3.2 Elements of the State Notation Language.............................. 19 3.3 A Complete State Program...................................... 20 3.4 Adding a
    [Show full text]
  • A Rapid, Sensitive, Scalable Method for Precision Run-On Sequencing
    bioRxiv preprint doi: https://doi.org/10.1101/2020.05.18.102277; this version posted May 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. 1 A rapid, sensitive, scalable method for 2 Precision Run-On sequencing (PRO-seq) 3 4 Julius Judd1*, Luke A. Wojenski2*, Lauren M. Wainman2*, Nathaniel D. Tippens1, Edward J. 5 Rice3, Alexis Dziubek1,2, Geno J. Villafano2, Erin M. Wissink1, Philip Versluis1, Lina Bagepalli1, 6 Sagar R. Shah1, Dig B. Mahat1#a, Jacob M. Tome1#b, Charles G. Danko3,4, John T. Lis1†, 7 Leighton J. Core2† 8 9 *Equal contribution 10 1Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853, USA 11 2Department of Molecular and Cell Biology, Institute of Systems Genomics, University of 12 Connecticut, Storrs, CT 06269, USA 13 3Baker Institute for Animal Health, College of Veterinary Medicine, Cornell University, Ithaca, 14 NY 14853, USA 15 4Department of Biomedical Sciences, College of Veterinary Medicine, Cornell University, 16 Ithaca, NY 14853, USA 17 #aPresent address: Koch Institute for Integrative Cancer Research, Massachusetts Institute of 18 Technology, Cambridge, MA 02139, USA 19 #bPresent address: Department of Genome Sciences, University of Washington, Seattle, WA, 20 USA 21 †Correspondence to: [email protected] or [email protected] 22 1 bioRxiv preprint doi: https://doi.org/10.1101/2020.05.18.102277; this version posted May 19, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.
    [Show full text]
  • RNA Isolation and Purification for Every Sample, RNA Type, and Application Isolate and Purify RNA with Confidence
    RNA isolation and purification For every sample, RNA type, and application Isolate and purify RNA with confidence RNA isolation is a crucial step in your quest to understand gene expression levels. With all the solutions that Thermo Fisher Scientific has to offer, you can be confident that you’re getting started on the right foot. Go ahead and push the limits of your research. We’ll be there to support you with robust RNA purification kits, trusted RNA tools, and experienced technical support, all backed by nearly 30 years of leadership and innovation in RNA technologies. • Isolate from any sample type, for any application • Obtain high-purity, intact RNA • Achieve high yields, even from small sample quantities Contents Methods of RNA isolation 4 RNA and sample types 5 RNA applications 6 Organic RNA extraction 8 Spin column RNA extraction 10 Lysate-based RNA extraction 11 Automated RNA purification 12 Transcriptome purification 14 mRNA purification 15 MicroRNA and small RNA purification 16 Viral RNA purification 18 RNA from FFPE samples 20 RNA isolation technology guide 22 Avoiding RNA degradation 26 Tips for handling RNA 27 RNA quantitation 28 Gene expression solutions 30 Gene expression research considerations 31 Superior cDNA synthesis for any application 32 Complete kit with flexible priming options 33 Which instrument fits your needs? 34 Which thermal cycler or PCR instrument fits your needs? 35 RNA technical resources 36 Services and support 37 Ordering information 38 Methods of RNA isolation For every application, sample, and RNA type For over three decades, our scientists have developed and professional support. Our RNA isolation products innovative and robust RNA isolation products designed include organic reagents, columns, sample lysate, and to make your job as a scientist easier.
    [Show full text]
  • Specification of the "Sequ" Command
    1 Specification of the "sequ" command 2 Copyright © 2013 Bart Massey 3 Revision 0, 1 October 2013 4 This specification describes the "universal sequence" command sequ. The sequ command is a 5 backward-compatible set of extensions to the UNIX [seq] 6 (http://www.gnu.org/software/coreutils/manual/html_node/seq-invo 7 cation.html) command. There are many implementations of seq out there: this 8 specification is built on the seq supplied with GNU Coreutils version 8.21. 9 The seq command emits a monotonically increasing sequence of numbers. It is most commonly 10 used in shell scripting: 11 TOTAL=0 12 for i in `seq 1 10` 13 do 14 TOTAL=`expr $i + $TOTAL` 15 done 16 echo $TOTAL 17 prints 55 on standard output. The full sequ command does this basic counting operation, plus 18 much more. 19 This specification of sequ is in several stages, known as compliance levels. Each compliance 20 level adds required functionality to the sequ specification. Level 1 compliance is equivalent to 21 the Coreutils seq command. 22 The usual specification language applies to this document: MAY, SHOULD, MUST (and their 23 negations) are used in the standard fashion. 24 Wherever the specification indicates an error, a conforming sequ implementation MUST 25 immediately issue appropriate error message specific to the problem. The implementation then 26 MUST exit, with a status indicating failure to the invoking process or system. On UNIX systems, 27 the error MUST be indicated by exiting with status code 1. 28 When a conforming sequ implementation successfully completes its output, it MUST 29 immediately exit, with a status indicating success to the invoking process or systems.
    [Show full text]
  • Virus Purification, Rna Extraction, and Targeted
    VIRUS PURIFICATION, RNA EXTRACTION, AND TARGETED GENOME CAPTURE IN ONE CHIP Miyako Niimi1*, Taisuke Masuda1, Kunihiro Kaihatsu2, Nobuo Kato2, and Fumihito Arai1 1Nagoya University, JAPAN and 2Osaka University, JAPAN ABSTRACT In this research, we demonstrated a microfluidic chip to pretreat the samples for viral genome assay. The microfluidic chip has the following three functions; (1) Virus purification and enrichment, (2) Viral RNA extraction, and (3) Capture of the targeted virus genome. (1) Hydroxyapatite chromatography, Boom method, and PNA (2) (Peptide Nucleic Acid) were used for the above three (3) functions, respectively. These three functions were integrated in one chip. Furthermore PNA immobilized on the glass can detect the targeted virus genome so that in situ virus detection would be possible by anybody, anywhere, anytime. KEYWORDS: Virus purification, RNA extraction and detection, Infectious disease diagnosis INTRODUCTION For the purpose of diagnosing the infectious diseases Figure 1. Concept of the microfluidic chip. The microflu- quickly and accurately, DNA sequencers for gene analysis idic chip consists of the three parts: (1) hydroxyapatite- of infectious viruses have been developed rapidly. The packed microcolumn for virus purification, (2) silica- latest DNA sequencers can treat the massive numbers of packed microcolumn for viral RNA extraction, and (3) samples such as saliva and nasal at one time. However, it PNA immobilized glass for capture of the targeted virus is necessary to purify and enrich the virus and extract the genome. viral RNA in the sample as the pretreatments before gene (1) (2) analysis. Hydroxyapatite chromatography[1] have been Sample Elution Buffer used extensively for purification and fractionation of Hydroxy- various biochemical substances such as protein and virus.
    [Show full text]
  • Zinc Fingers and a Green Thumb: Manipulating Gene Expression in Plants Segal, Stege and Barbas 165
    163 Zinc fingers and a green thumb: manipulating gene expression in plants David J Segaly, Justin T Stegez and Carlos F Barbas IIIç Artificial transcription factors can be rapidly constructed A variety of techniques have been developed to manip- from predefined zinc-finger modules to regulate virtually any ulate gene expression in plants. Increased expression of gene. Stable, heritable up- and downregulation of endogenous genes is most commonly achieved through endogenous genes has been demonstrated in transgenic transgene overexpression [1]. The introduction of tissue- plants. These advances promise new approaches for creating specific and inducible promoters has improved the utility functional knockouts and conditional overexpression, and of this approach, and efficient and robust plant transforma- for other gene discovery and manipulation applications in tion techniques have made the construction of transgenes plants. a relatively routine task. However, variable expression and co-suppression of transgenes often complicate this process. Addresses Furthermore, transgenes cannot accommodate alternative ÃThe Skaggs Institute for Chemical Biology and the Department of splicing, which may be important for the appropriate Molecular Biology, The Scripps Research Institute, La Jolla, function of some transgenes [2]. California 92037, USA yDepartment of Pharmacology and Toxicology, University of Arizona, Tucson, Arizona 85721, USA Reducing or eliminating the expression of a gene in plants zDiversa Corporation, San Diego, California 92121, USA is not as simple as overexpressing a gene. Gene disruption §The Scripps Research Institute, BCC-550, North Torrey Pines Road, by homologous recombination, tDNA insertions and che- La Jolla, California 92037, USA mical mutagenesis has been used successfully, but these e-mail: [email protected] Correspondence: Carlos F Barbas III approaches are inefficient and time-consuming technolo- gies.
    [Show full text]
  • Transcriptome-Wide M a Detection with DART-Seq
    Transcriptome-wide m6A detection with DART-Seq Kate D. Meyer ( [email protected] ) Duke University School of Medicine https://orcid.org/0000-0001-7197-4054 Method Article Keywords: RNA, epitranscriptome, RNA methylation, m6A, methyladenosine, DART-Seq Posted Date: September 23rd, 2019 DOI: https://doi.org/10.21203/rs.2.12189/v1 License: This work is licensed under a Creative Commons Attribution 4.0 International License. Read Full License Page 1/7 Abstract m6A is the most abundant internal mRNA modication and plays diverse roles in gene expression regulation. Much of our current knowledge about m6A has been driven by recent advances in the ability to detect this mark transcriptome-wide. Antibody-based approaches have been the method of choice for global m6A mapping studies. These methods rely on m6A antibodies to immunoprecipitate methylated RNAs, followed by next-generation sequencing to identify m6A-containing transcripts1,2. While these methods enabled the rst identication of m6A sites transcriptome-wide and have dramatically improved our ability to study m6A, they suffer from several limitations. These include requirements for high amounts of input RNA, costly and time-consuming library preparation, high variability across studies, and m6A antibody cross-reactivity with other modications. Here, we describe DART-Seq (deamination adjacent to RNA modication targets), an antibody-free method for global m6A detection. In DART-Seq, the C to U deaminating enzyme, APOBEC1, is fused to the m6A-binding YTH domain. This fusion protein is then introduced to cellular RNA either through overexpression in cells or with in vitro assays, and subsequent deamination of m6A-adjacent cytidines is then detected by RNA sequencing to identify m6A sites.
    [Show full text]
  • Contributions to Biostatistics: Categorical Data Analysis, Data Modeling and Statistical Inference Mathieu Emily
    Contributions to biostatistics: categorical data analysis, data modeling and statistical inference Mathieu Emily To cite this version: Mathieu Emily. Contributions to biostatistics: categorical data analysis, data modeling and statistical inference. Mathematics [math]. Université de Rennes 1, 2016. tel-01439264 HAL Id: tel-01439264 https://hal.archives-ouvertes.fr/tel-01439264 Submitted on 18 Jan 2017 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. HABILITATION À DIRIGER DES RECHERCHES Université de Rennes 1 Contributions to biostatistics: categorical data analysis, data modeling and statistical inference Mathieu Emily November 15th, 2016 Jury: Christophe Ambroise Professeur, Université d’Evry Val d’Essonne, France Rapporteur Gérard Biau Professeur, Université Pierre et Marie Curie, France Président David Causeur Professeur, Agrocampus Ouest, France Examinateur Heather Cordell Professor, University of Newcastle upon Tyne, United Kingdom Rapporteur Jean-Michel Marin Professeur, Université de Montpellier, France Rapporteur Valérie Monbet Professeur, Université de Rennes 1, France Examinateur Korbinian Strimmer Professor, Imperial College London, United Kingdom Examinateur Jean-Philippe Vert Directeur de recherche, Mines ParisTech, France Examinateur Pour M.H.A.M. Remerciements En tout premier lieu je tiens à adresser mes remerciements à l’ensemble des membres du jury qui m’ont fait l’honneur d’évaluer mes travaux.
    [Show full text]
  • Sequence Analysis Instructions
    Sequence Analysis Instructions In order to predict your drug metabolizing phenotype from your CYP2D6 gene sequence, you must determine: 1) The assembled sequence from your two opposing sequencing reactions 2) If your PCR product even represents the human CYP2D6 gene, 3) The location of your sequence within the CYP2D6 gene, 4) Whether differences between your alleles and CYP2D6*1 sequence represents sequencing errors or polymorphisms in your sequence, and 5) The effect(s) of any polymorphisms on CYP2D6 protein sequence. As you analyze your gene sequence, copy and paste your analyses and results into a text file for your final lab report. If some factor, like the quality of your sequence, prevents you from carrying out the complete analysis, your grade will not be penalized—just complete as many of the steps below as possible, and include an explanation of why you could not complete the analysis in your final report. Font appearing in bold green italics describes questions to answer and items to include in your final report. Part 1: Assembling your CYP2D6 sequence from both directions The sequencing reaction only produces 700-800 bases of good sequence. To get most of the sequence of the 1.2 Kbp PCR product, sequencing reactions were performed from both directions on the PCR product. These need to be assembled into one sequence using the overlap of the two sequences. In order to assure that it is going in the forward direction, it is best to derive the reverse complement sequence from the reverse sequencing file. Take the text file and paste it into a program like http://www.bioinformatics.org/sms/rev_comp.html.
    [Show full text]
  • Chapter 12 Gene Expression and Regulation
    PYF12 3/21/05 8:04 PM Page 191 Chapter 12 Gene expression and regulation Bacterial genomes usually contain several thousand different genes. Some of the gene products are required by the cell under all growth conditions and are called house- keeping genes. These include the genes that encode such proteins as DNA poly- merase, RNA polymerase, and DNA gyrase. Many other gene products are required under specific growth conditions. These include enzymes that synthesize amino acids, break down specific sugars, or respond to a specific environmental condition such as DNA damage. Housekeeping genes must be expressed at some level all of the time. Frequently, as the cell grows faster, more of the housekeeping gene products are needed. Even under very slow growth, some of each housekeeping gene product is made. The gene prod- ucts required for specific growth conditions are not needed all of the time. These genes are frequently expressed at extremely low levels, or not expressed at all when they are not needed and yet made when they are needed. This chapter will examine gene regulation or how bacteria regulate the expression of their genes so that the genes that are being expressed meet the needs of the cell for a specific growth condition. Gene regulation can occur at three possible places in the production of an active gene product. First, the transcription of the gene can be regulated. This is known as transcriptional regulation. When the gene is transcribed and how much it is transcribed influences the amount of gene product that is made. Second, if the gene encodes a protein, it can be regulated at the translational level.
    [Show full text]
  • Agenda: Gene Prediction by Cross-Species Sequence Comparison Leila Taher1, Oliver Rinner2,3, Saurabh Garg1, Alexander Sczyrba4 and Burkhard Morgenstern5,*
    Nucleic Acids Research, 2004, Vol. 32, Web Server issue W305–W308 DOI: 10.1093/nar/gkh386 AGenDA: gene prediction by cross-species sequence comparison Leila Taher1, Oliver Rinner2,3, Saurabh Garg1, Alexander Sczyrba4 and Burkhard Morgenstern5,* 1International Graduate School for Bioinformatics and Genome Research, University of Bielefeld, Postfach 10 01 31, 33501 Bielefeld, Germany, 2GSF Research Center, MIPS/Institute of Bioinformatics, Ingolsta¨dter Landstrasse 1, 85764 Neuherberg, Germany, 3Brain Research Institute, ETH Zuurich,€ Winterthurerstrasse 190, CH-8057 Zuurich,€ Switzerland, 4Faculty of Technology, Research Group in Practical Computer Science, University of Bielefeld, Postfach 10 01 31, 33501 Bielefeld, Germany and 5Institute of Microbiology and Genetics, University of Go¨ttingen, Goldschmidtstrasse 1, 37077 Go¨ttingen, Germany Received March 5, 2004; Revised and Accepted March 24, 2004 ABSTRACT Thus, with the huge amount of data produced by sequencing projects, the development of high-quality gene-prediction Automatic gene prediction is one of the major tools has become a priority in bioinformatics. For eukaryotic challenges in computational sequence analysis. organisms, gene prediction is particularly challenging, since Traditional approaches to gene finding rely on statis- protein-coding exons are usually separated by introns of vari- tical models derived from previously known genes. By able length. Most traditional methods for gene prediction are contrast, a new class of comparative methods relies intrinsic methods; they use hidden Markov models (HMMs) or on comparing genomic sequences from evolutionary other stochastic approaches describing the statistical composi- related organisms to each other. These methods are tion of introns, exons, etc. Such models are trained with already based on the concept of phylogenetic footprinting: known genes from the same or a closely related species; the they exploit the fact that functionally important most popular of these tools is GenScan (1).
    [Show full text]