What Is a Gene? Prokaryotic Gene Finding Reading Frames Open

• Tues, Nov 30: Gene Finding 1
  Online FCEs: Thru Dec 10
• Thurs, Dec 2: Gene Finding 2, PS5 due
• Tues, Dec 7: Project presentations 1
• Thurs, Dec 9: Project presentations 2
  Final papers due
• Tues, Dec 14: DD: Extended office hours: 2:30pm – 5:30pm, MI 650
• Wed, Dec 15: NS: office hours. DH 1321, noon – 2pm.
• Friday, Dec 17: 8:30am Final Exam, Room: TBA

(Course overview diagram: Pairwise sequence alignment (global and local); Multiple sequence alignment; Substitution matrices; Database searching; global, local, BLAST; Sequence statistics; Prokaryotic Gene Finding; Evolutionary tree reconstruction; Eukaryotic Gene Finding)

What is a Gene?
Snyder and Gerstein, Science 2003
• Something that encodes a heritable trait
• One gene, one enzyme
• One gene, one polypeptide
• One gene, one product (include RNA products)
• “a complete chromosomal segment responsible for making a functional product”
  – coding region
  – regulatory region
  – expressed product
  – functional product

Prokaryotic Gene Finding
• Identify Open Reading Frames (ORFs)
• Coding Statistics
• Identify individual gene architecture features
• Assemble an integrated gene description
• Homology

Reading Frames
  Frame 1:  ACG TAA CTG ACT AGG TGA AT
  Frame 2:  A CGT AAC TGA CTA GGT GAA T
  Frame 3:  AC GTA ACT GAC TAG GTG AAT
• Each grouping of the nucleotides into consecutive triplets constitutes a reading frame.
• Three reading frames in the 5'->3' direction
• Three in the reverse direction on the opposite strand.

Open Reading Frames
An ORF is a contiguous set of codons, each specifying an amino acid (starting with ATG).
  GGGAGC ATG GTG CAC CTG ACT CCT GAG GTG ACT TAG AC
          M   V   H   L   T   P   E   V   T  Stop
All coding sequences are ORFs, but not all ORFs encode proteins.
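The reading-frame and ORF definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular gene finder: the function names are hypothetical, the stop codons TAA, TAG, and TGA of the standard genetic code are assumed, and only ATG is treated as a start codon, as on the slide.

```python
# Minimal six-frame ORF scan (illustrative sketch; all names are hypothetical).
STOPS = {"TAA", "TAG", "TGA"}          # standard stop codons (assumption)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset, min_codons=1):
    """Yield (start, end) for each ATG...stop ORF in one reading frame.

    `end` is exclusive and includes the stop codon. `min_codons` counts
    the codons from ATG up to, but not including, the stop codon.
    """
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i                   # first ATG opens the ORF
        elif start is not None and codon in STOPS:
            if (i - start) // 3 >= min_codons:
                yield start, i + 3
            start = None                # look for the next ORF in this frame


def six_frame_orfs(seq, min_codons=1):
    """Scan three frames on each strand; return (strand, frame, orf) tuples."""
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            for start, end in orfs_in_frame(s, offset, min_codons):
                found.append((strand, offset, s[start:end]))
    return found


if __name__ == "__main__":
    # The example sequence from the Open Reading Frames slide.
    print(six_frame_orfs("GGGAGCATGGTGCACCTGACTCCTGAGGTGACTTAGAC"))
```

On the slide's example sequence, this sketch should report the forward-strand ORF that begins at ATG and ends at TAG. In practice `min_codons` would be set much higher, since short ORFs occur frequently by chance.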
Coding Statistics
Fickett and Tung, 1992; Guigo and Fickett, 1995 (Electronic reserves)
• Codon usage (species specific)
  – Determine codon (triplet) frequencies in known coding regions
  – Compare with codon frequencies in sliding window
• Codon pair preference (species specific)
• Amino acid usage (species specific)
• Amino acid pair preference (species specific)
• Correlations in third base position (any organism)
  – 3rd base tends to be the same much more often than chance
• CG content

         Gly Val Ala Val Cys Phe Ser Ser
  ccgcct ggc gtc gcg gtt tgt ttt tca tct ctcttcatctgca
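The codon usage statistic above (estimate codon frequencies from known coding regions, then compare against the codons seen in a sliding window) might be sketched as follows. This is an assumption-laden illustration rather than the measure from the cited papers: the function names, the pseudocount, and the choice of a uniform 1/64 background for the log-odds are all hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"


def codon_frequencies(coding_seqs, pseudocount=1.0):
    """Relative codon frequencies from in-frame coding sequences.

    The pseudocount (an assumption) keeps unseen codons from getting
    probability zero.
    """
    counts = Counter({a + b + c: pseudocount
                      for a in BASES for b in BASES for c in BASES})
    for seq in coding_seqs:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in counts:         # skip codons with ambiguous bases
                counts[codon] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def window_scores(seq, freqs, window_codons=10, frame=0):
    """Mean log-odds per codon (trained frequency vs. uniform 1/64)
    for each sliding window of `window_codons` codons in one frame."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    log_odds = [math.log(freqs.get(c, 1 / 64) * 64) for c in codons]
    return [sum(log_odds[i:i + window_codons]) / window_codons
            for i in range(len(log_odds) - window_codons + 1)]
```

Windows that score above zero use codons that are more common in the training genes than a uniform background would predict. Because codon usage is species specific, the training sequences would need to come from the genome being annotated or a close relative.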
Coding Statistics, continued
Fickett and Tung, 1992; Guigo and Fickett, 1995 (Electronic reserves)

CG content (species specific)
In E. coli:
• Coding regions are embedded in segments of uniform G+C content (53%), about 1000 bases long
• Non-coding regions are embedded in segments of uniform G+C content (46%), about 500 bases long
• aa, at, ta, tt occur more frequently than expected in coding regions
Look for variations in these measures in coding and non-coding regions (intergenic and intragenic).

Coding Statistics
• Codon usage (species specific)
• Codon pair preference (species specific)
• Amino acid usage (species specific)
• Amino acid pair preference (species specific)
• Third position (any organism)
• CG content (species specific)

  tgccgcctggcgtcgcggtttctttttcatctctcttcatctg
  acggcggaccgcagcgccaaagaaaaagtagagagaagtagacc

DNA PATTERNS IN THE E. coli lexA GENE

  Pattern                   Consensus           In lexA
  Repressor binding site    CTGNNNNNNNNNNCAG
  Promoter sequence, -35    TTGACA              TTCCAA
  Promoter sequence, -10    TATAAT              TATACT
  Ribosomal binding site    GGAGG               GGGGG (mRNA start, +10)
  Open reading frame        ATG…TAA

    1 gaattcgataaatctctggtttattgtgcagtttatggttccaaaatcgccttttgctgt
   61 atatactcacagcataactgtatatacacccagggggcggaatgaaagcgttaacggcca
  121 ggcaacaagaggtgtttgatctcatccgtgatcacatcagccagacaggtatgccgccga
  181 cgcgtgcggaaatcgcgcagcgtttggggttccgttccccaaacgcggctgaagaacatc
  241 tgaaggcgctggcacgcaaaggcgttattgaaattgtttccggcgcatcacgcgggattc
  301 gtctgttgcaggaagaggaagaagggttgccgctggtaggtcgtgtggctgccggtgaac
  361 cacttctggcgcaacagcatattgaaggtcattatcaggtcgatccttccttattcaagc
  421 cgaatgctgatttcctgctgcgcgtcagcgggatgtcgatgaaagatatcggcattatgg
  481 atggtgacttgctggcagtgcataaaactcaggatgtacgtaacggtcaggtcgttgtcg
  541 cacgtattgatgacgaagttaccgttaagcgcctgaaaaaacagggcaataaagtcgaac
  601 tgttgccagaaaatagcgagtttaaaccaattgtcgttgaccttcgtcagcagagcttca
  661 ccattgaagggctggcggttggggttattcgcaacggcgactggctgtaacatatctctg
  721 agaccgcgatgccgcctggcgtcgcggtttgtttttcatctctcttcatcaggcttgtct
  781 gcatggcattcctcacttcatctgataaagcactctggcatctcgccttacccatgattt
  841 tctccaatatcaccgttccgttgctgggactggtcgatacggcggtaattggtcatcttg
  901 atagcccggtttatttgggcggcgtggcggttggcgcaacggcggaccagct

Homology
Salzberg, Nature 2003

Prokaryotic Gene Finding
• Genome length: 0.5 Mbp – 10 Mbp
• Coding density: ~90%
• Long ORFs are usually real genes
Early approaches
  – Identify ORFs
  – Score windows with coding statistics
  – Identify gene structure elements
• Parse into a coherent gene model surrounded by intergenic DNA.

Gene Finding Questions
• Identify protein coding region
• Identify Open Reading Frame
• Predict mRNA (including UTRs)
• Predict intron/exon structure (Eukaryotes only)
• Regulatory signals
• Protein sequence

Prokaryotic gene model
(Figure, 5' to 3': Open Reading Frame; Untranslated regions (UTRs); Promoter region; Ribosome binding site; Termination sequence; Start codon/Stop codon; Repressor site)

An HMM that finds genes in E. coli
Krogh et al., 1995 (Electronic reserves)
(Figure: intergene model; start codons; 61 triplet models, AAA, AAC, …, TTT, using observed frequencies for E. coli genes; stop codons; intergene model)

Codon models account for sequencing errors
(Figure: example codon model for TTT, with states labeled d and i)

Intergene model
(Figure: Begin; stop codons; start codons; End)

Refinements
(Figure: coding region model, using observed frequencies for E. coli genes; overlap model; long intergene model; short intergene model; start codons; stop codons)

Parameter estimation
Krogh et al., 1995 (Electronic reserves)
• Data: 429 E. coli contigs
• Trained intergenic models with non-coding DNA
• Transitions into coding model were observed codon frequencies in coding regions

               Training     Test
  Contigs      300          129
  Base pairs   1,271,528    324,684
  Genes        1007         251
  Av length    1008         1015

Results
• Exact locations of ~80% of known genes
• Approximate locations of ~10% of known genes
• About half of the false negatives were genes with unusual codon usage.
• Predicted genes: 286
  About 150 were similar to known genes

Performance measures
(Figure: reality vs. prediction for three cases. Perfect: atg and taa shown for both; Almost perfect: labeled <10; Partly: labeled >50% or >60 bp)

Outstanding Problems
• Model cannot account for drift in CG content
• Does not take position dependencies into account
• Solution:
  – kth order Markov chain
  – looks back k positions

First-order Markov chain
Example: transmembrane region model
H: hydrophobic
L: hydrophilic
Transition matrix: P[i, j], with entries $P(x_t = i \mid x_{t-1} = j)$

Second-order Markov chain
Example: transmembrane region model
H: hydrophobic
L: hydrophilic
Transition matrix: P[i, j, k], with entries $P(x_t = i \mid x_{t-1} = j, x_{t-2} = k)$

A second-order Markov chain can be expressed as a first-order Markov chain with more states and transitions
States: HH, LH, HL, LL
Transitions: $P(x_t = (ij) \mid x_{t-1} = (jk))$

Glimmer
Salzberg et al., 1998
• Prokaryotic gene finder
• Finds 98% of all genes in a bacterial genome
• Genome independent
  – Uses all large, non-overlapping ORFs as training data
• kth order Markov chain
  – (looks back k positions)
• Higher order Markov models require more training data
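The kth order Markov chain idea from the slides above (condition each symbol on the k symbols before it) can be sketched as below. This is only an illustration under stated assumptions and is not Glimmer's actual model: the function names, the pseudocount, and the uniform fallback for unseen contexts are all hypothetical choices. The alphabet defaults to DNA but also works for the H/L transmembrane example.

```python
import math
from collections import defaultdict


def train_markov(seqs, k, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(x_t | previous k symbols) from training sequences.

    Returns {context: {symbol: probability}} for every context seen.
    """
    counts = defaultdict(lambda: {a: pseudocount for a in alphabet})
    for seq in seqs:
        for t in range(k, len(seq)):
            context, symbol = seq[t - k:t], seq[t]
            if symbol in alphabet and all(c in alphabet for c in context):
                counts[context][symbol] += 1
    model = {}
    for context, row in counts.items():
        total = sum(row.values())
        model[context] = {a: n / total for a, n in row.items()}
    return model


def log_prob(seq, model, k, alphabet="ACGT"):
    """Log-probability of seq[k:] given its first k symbols.

    Contexts never seen in training fall back to a uniform distribution
    (an assumption made for this sketch).
    """
    uniform = 1.0 / len(alphabet)
    total = 0.0
    for t in range(k, len(seq)):
        row = model.get(seq[t - k:t])
        p = row.get(seq[t], uniform) if row else uniform
        total += math.log(p)
    return total


if __name__ == "__main__":
    # Second-order chain over the hydrophobic/hydrophilic alphabet.
    model = train_markov(["HHLLHHLLHHLL"], k=2, alphabet="HL")
    print(model["HH"])
    print(log_prob("HHLL", model, k=2, alphabet="HL"))
```

The number of contexts grows as the alphabet size raised to the power k (4^k for DNA), which is one way to see why higher order models need more training data: each context's row has to be estimated from the positions that follow it in the training set.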