Maintaining Protein Localization, Structure, and Functional

Total Page:16

File Type:pdf, Size:1020Kb

Maintaining Protein Localization, Structure, and Functional University of Tennessee, Knoxville TRACE: Tennessee Research and Creative Exchange Doctoral Dissertations Graduate School 8-2020 Maintaining protein localization, structure, and functional interactions via codon usage and coevolution of gene expression: Combining evolutionary bioinformatics with omics-scale data to test hypotheses related to protein function Alexander Cope University of Tennessee Follow this and additional works at: https://trace.tennessee.edu/utk_graddiss Recommended Citation Cope, Alexander, "Maintaining protein localization, structure, and functional interactions via codon usage and coevolution of gene expression: Combining evolutionary bioinformatics with omics-scale data to test hypotheses related to protein function. " PhD diss., University of Tennessee, 2020. https://trace.tennessee.edu/utk_graddiss/6796 This Dissertation is brought to you for free and open access by the Graduate School at TRACE: Tennessee Research and Creative Exchange. It has been accepted for inclusion in Doctoral Dissertations by an authorized administrator of TRACE: Tennessee Research and Creative Exchange. For more information, please contact [email protected]. To the Graduate Council: I am submitting herewith a dissertation written by Alexander Cope entitled "Maintaining protein localization, structure, and functional interactions via codon usage and coevolution of gene expression: Combining evolutionary bioinformatics with omics-scale data to test hypotheses related to protein function." I have examined the final electronic copy of this dissertation for form and content and recommend that it be accepted in partial fulfillment of the equirr ements for the degree of Doctor of Philosophy, with a major in Life Sciences. Michael Gilchrist, Robert Hettich, Major Professor We have read this dissertation and recommend its acceptance: Albrecht von Arnim, Steven Abel Accepted for the Council: Dixie L. Thompson Vice Provost and Dean of the Graduate School (Original signatures are on file with official studentecor r ds.) Maintaining Protein Localization, Structure, and Functional Interactions via Codon Usage and Coevolution of Gene Expression: Combining Evolutionary Bioinformatics with Omics-scale Data to Test Hypotheses Related to Protein Function A Dissertation Presented for the Doctor of Philosophy Degree The University of Tennessee, Knoxville Alexander Lloyd Cope August 2020 Copyright c by Alexander Lloyd Cope, 2020 All Rights Reserved. ii This is dedicated to my parents, William and Sharleen Cope, for supporting and loving me throughout my academic career. iii Acknowledgments First, I would like to thank my advisors, Dr. Michael Gilchrist and Dr. Robert Hettich. When I proposed this collaboration 5 years ago, I think all of us viewed it as a \high risk, high reward" scenario. I could not be more pleased with how it turned out. Both of you never failed to challenge me intellectually, especially when it came to contextualizing my thoughts in terms of biology, which was undoubtedly my weakest area coming into graduate school. You both also helped me strike a good balance of exploring new ideas, while also giving enough attention to projects that needed to be completed. I would also like to thank my committee, Dr. Albrecht von Arnim and Dr. Steven Abel for their support and feedback over the years. I would also like to extend a special thanks to Dr. Brian O'Meara for serving as a sounding board for many of my ideas over the years. I'm also grateful for the financial support provided by Genome Science and Technology, NIMBioS, the University of Tennessee, Oak Ridge National Laboratory. I would like to thank all of the friends I've made over the past few years. They have been a source of sanity during graduate school, especially when things were not going well with research. Grabbing dinner and drinks together are among my fondest memories of graduate school. I would like to thank my parents, William and Sharleen Cope, and my brother, Austen Cope. They have never failed to love and support me during the past 27 years. They have made many personal sacrifices in order to help put me in a position to succeed. I would likely not be pursuing a PhD if not for them. Finally, I would like to thank Elena for her constant love and support over the course of graduate school. Without her, I'm not sure I would have had the mental or emotional iv strength to make it through to the end. Although she always supported me in my work, she also made sure I would stop long enough to enjoy life, especially during times of stress. v Abstract A major challenge of the omics-era is identifying how a protein functions, both in terms of its specific function and within the context of the various biological processes necessary for the cell's survival. Key elements necessary for a protein to perform its function are efficient and accurate protein localization, protein folding, and interactions with other proteins. Previous work implicated codon usage as a means to modulate protein localization and folding. Using a mechanistic model rooted in population genetics, I examine potential selective differences in codon usage in signal peptides (localization) and protein secondary structures. Although previous work argued signal peptides were under selection for increased translation inefficiency, I find selection is generally consistent with the 5'-regions of non- secreted proteins. I also find that previous work was likely confounded by biases in signal peptide amino acid usage and gene expression. Although the direction of selection on codon usage is mostly consistent between protein secondary structures, the strength of this selection does vary for certain codons. After successful folding and localization of a protein, it must be able to function within the context of other proteins in the cell, often through protein-protein interactions of metabolic pathways. Previous work suggests proteins which are part of the same functional processes within a cell are co-expressed across time and environmental conditions. Using the concept of guilt-by-association, I combine empirical protein abundances (measured via mass spectrometry) with sequence homology based function prediction tools to identify potential functions of proteins of unknown function in C. thermocellum. Building upon the concept that functionally-related genes are co- expressed within a species, I demonstrate how phylogenetic comparative methods can be used to detect signals of gene expression coevolution across species while accounting for the shared ancestry of the species in question. vi Table of Contents 1 Introduction to the relationship between codon usage, gene expression, and protein function1 1.1 Codon usage bias.................................2 1.1.1 Evidence for selection for translation efficiency.............2 1.1.2 Evidence for selection for translation accuracy.............6 1.1.3 Quantifying codon usage bias......................8 1.1.4 Codon usage and protein biogenesis................... 10 1.2 Functional Omics: using transcriptomics and proteomics to detect functionally- related genes................................... 15 1.2.1 Detecting signals of functional-relatedness across species....... 17 1.3 Summary..................................... 20 2 Review of methods 21 3 Quantifying codon usage in signal peptides: Gene expression and amino acid usage explain apparent selection for inefficient codons 29 3.1 Introduction.................................... 29 3.2 Materials and Methods.............................. 33 3.2.1 Signal Peptide Prediction........................ 33 3.2.2 ROC-SEMPPR.............................. 33 3.2.3 CAI and tAI................................ 36 3.2.4 Generating Datasets........................... 36 3.2.5 Analysis of Codon Usage with CAI, tAI, and ROC-SEMPPR..... 39 vii 3.3 Results....................................... 40 3.4 Discussion..................................... 58 3.5 Conclusions.................................... 61 3.6 Supporting Information.............................. 61 3.6.1 Assessing ROC-SEMPPR Model Adequacy............... 61 3.6.2 Maximum-Likelihood Model-II Regression Approach......... 62 4 Quantifying shifts in natural selection on codon usage between protein regions: A population genetics approach 64 4.1 Introduction.................................... 64 4.2 Materials and Methods.............................. 66 4.2.1 Protein secondary structures....................... 66 4.2.2 Analysis with ROC-SEMPPR...................... 67 4.3 Results....................................... 68 4.3.1 Comparing selection on codon usage between secondary structures.. 72 4.3.2 Intrinsically-disordered regions have a minor impact on results.... 77 4.3.3 Simulations suggest differences in codon usage reflect real biological signals................................... 80 4.4 Discussion..................................... 81 4.5 Conclusions.................................... 84 5 An integrated guilt-by-association approach to characterize proteins of unknown function (PUFs) in Clostridium thermocellum DSM 1313 as potential genetic engineering targets 85 5.1 Introduction.................................... 86 5.2 Materials and Methods.............................. 88 5.2.1 Bacterial Strains and Culture Conditions................ 88 5.2.2 Proteome analyses using LC-MS/MS.................. 88 5.2.3 MS database searching, data analysis and interpretation....... 89 5.2.4 Generation of co-expression networks.................. 89 5.2.5 Sequence-based functional predictions.................
Recommended publications
  • Improving the Prediction of Transcription Factor Binding Sites To
    Improving the prediction of transcription factor binding sites to aid the interpretation of non-coding single nucleotide variants. Narayan Jayaram Research Department of Structural and Molecular Biology University College London A thesis submitted to University College London for the degree of Doctor of Philosophy 1 Declaration I, Narayan Jayaram confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis. Narayan Jayaram 2 Abstract Single nucleotide variants (SNVs) that occur in transcription factor binding sites (TFBSs) can disrupt the binding of transcription factors and alter gene expression which can cause inherited diseases and act as driver SNVs in cancer. The identification of SNVs in TFBSs has historically been challenging given the limited number of experimentally characterised TFBSs. The recent ENCODE project has resulted in the availability of ChIP-Seq data that provides genome wide sets of regions bound by transcription factors. These data have the potential to improve the identification of SNVs in TFBSs. However, as the ChIP-Seq data identify a broader range of DNA in which a transcription factor binds, computational prediction is required to identify the precise TFBS. Prediction of TFBSs involves scanning a DNA sequence with a Position Weight Matrix (PWM) using a pattern matching tool. This thesis focusses on the prediction of TFBSs by: (a) evaluating a set of locally-installable pattern-matching tools and identifying the best performing tool (FIMO), (b) using the ENCODE ChIP-Seq data to evaluate a set of de novo motif discovery tools that are used to derive PWMs which can handle large volumes of data, (c) identifying the best performing tool (rGADEM), (d) using rGADEM to generate a set of PWMs from the ENCODE ChIP-Seq data and (e) by finally checking that the selection of the best pattern matching tool is not unduly influenced by the choice of PWMs.
    [Show full text]
  • Applied Category Theory for Genomics – an Initiative
    Applied Category Theory for Genomics { An Initiative Yanying Wu1,2 1Centre for Neural Circuits and Behaviour, University of Oxford, UK 2Department of Physiology, Anatomy and Genetics, University of Oxford, UK 06 Sept, 2020 Abstract The ultimate secret of all lives on earth is hidden in their genomes { a totality of DNA sequences. We currently know the whole genome sequence of many organisms, while our understanding of the genome architecture on a systematic level remains rudimentary. Applied category theory opens a promising way to integrate the humongous amount of heterogeneous informations in genomics, to advance our knowledge regarding genome organization, and to provide us with a deep and holistic view of our own genomes. In this work we explain why applied category theory carries such a hope, and we move on to show how it could actually do so, albeit in baby steps. The manuscript intends to be readable to both mathematicians and biologists, therefore no prior knowledge is required from either side. arXiv:2009.02822v1 [q-bio.GN] 6 Sep 2020 1 Introduction DNA, the genetic material of all living beings on this planet, holds the secret of life. The complete set of DNA sequences in an organism constitutes its genome { the blueprint and instruction manual of that organism, be it a human or fly [1]. Therefore, genomics, which studies the contents and meaning of genomes, has been standing in the central stage of scientific research since its birth. The twentieth century witnessed three milestones of genomics research [1]. It began with the discovery of Mendel's laws of inheritance [2], sparked a climax in the middle with the reveal of DNA double helix structure [3], and ended with the accomplishment of a first draft of complete human genome sequences [4].
    [Show full text]
  • A Community Proposal to Integrate Structural
    F1000Research 2020, 9(ELIXIR):278 Last updated: 11 JUN 2020 OPINION ARTICLE A community proposal to integrate structural bioinformatics activities in ELIXIR (3D-Bioinfo Community) [version 1; peer review: 1 approved, 3 approved with reservations] Christine Orengo1, Sameer Velankar2, Shoshana Wodak3, Vincent Zoete4, Alexandre M.J.J. Bonvin 5, Arne Elofsson 6, K. Anton Feenstra 7, Dietland L. Gerloff8, Thomas Hamelryck9, John M. Hancock 10, Manuela Helmer-Citterich11, Adam Hospital12, Modesto Orozco12, Anastassis Perrakis 13, Matthias Rarey14, Claudio Soares15, Joel L. Sussman16, Janet M. Thornton17, Pierre Tuffery 18, Gabor Tusnady19, Rikkert Wierenga20, Tiina Salminen21, Bohdan Schneider 22 1Structural and Molecular Biology Department, University College, London, UK 2Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, CB10 1SD, UK 3VIB-VUB Center for Structural Biology, Brussels, Belgium 4Department of Oncology, Lausanne University, Swiss Institute of Bioinformatics, Lausanne, Switzerland 5Bijvoet Center, Faculty of Science – Chemistry, Utrecht University, Utrecht, 3584CH, The Netherlands 6Science for Life Laboratory, Stockholm University, Solna, S-17121, Sweden 7Dept. Computer Science, Center for Integrative Bioinformatics VU (IBIVU), Vrije Universiteit, Amsterdam, 1081 HV, The Netherlands 8Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Belvaux, L-4367, Luxembourg 9Bioinformatics center, Department of Biology, University of Copenhagen, Copenhagen, DK-2200,
    [Show full text]
  • SD Gross BFI0403
    Janet Thornton Bioinformatician avant la lettre Michael Gross B ioinformatics is very much a buzzword of our time, with new courses and institutes dedicated to it sprouting up almost everywhere. Most significantly, the flood of genome data has raised the gen- eral awareness of the need to deve-lop new computational approaches to make sense of all the raw information collected. Professor Janet Thornton, the current director of the European Bioinformatics Institute (EBI), an EMBL outpost based at the Hinxton campus near Cambridge, has been in the field even before there was a word for it. Coming to structural biology with a physics degree from the University of Nottingham, she was already involved with computer-generated structural im- ages in the 1970s, when personal comput- ers and user-friendly programs had yet to be invented. The Early Years larities. Within 15 minutes, the software From there to the EBI, her remarkable Janet Thornton can check all 2.4 billion possible re- career appears to be organised in lationships and pick the ones relevant decades. During the 1970s, she did doc- software to compare structures to each to the question at hand. In comparison toral and post-doctoral research at other, recognise known folds and spot to publicly available bioinformatics the Molecular Biophysics Laboratory in new ones. Such work provides both packages such as Blast or Psiblast, Oxford and at the National Institute for fundamental insights into the workings Biopendium can provide an additional Medical Research in Mill Hill, near Lon- of evolution on a molecular level, and 30 % of annotation, according to Inphar- don.
    [Show full text]
  • ISMB 2008 Toronto
    ISMB 2008 Toronto The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters Citation Linial, Michal, Jill P. Mesirov, B. J. Morrison McKay, and Burkhard Rost. 2008. ISMB 2008 Toronto. PLoS Computational Biology 4(6): e1000094. Published Version doi:10.1371/journal.pcbi.1000094 Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:11213310 Terms of Use This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http:// nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of- use#LAA Message from ISCB ISMB 2008 Toronto Michal Linial1,2, Jill P. Mesirov1,3, BJ Morrison McKay1*, Burkhard Rost1,4 1 International Society for Computational Biology (ISCB), University of California San Diego, La Jolla, California, United States of America, 2 Sudarsky Center, The Hebrew University of Jerusalem, Jerusalem, Israel, 3 Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America, 4 Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York, United States of America the integration of students, and for the of ISMB. One meeting in South Asia support of young leaders in the field. (InCoB; http://incob.binfo.org.tw/) has ISMB has also become a forum for already been sponsored by ISCB, and reviewing the state of the art in the many another one in North Asia is going to fields of this growing discipline, for follow. ISMB itself has also been held in introducing new directions, and for an- Australia (2003) and Brazil (2006).
    [Show full text]
  • ISCB Ebola Award for Important Future Research on the Computational Biology of Ebola Virus
    ISCB Ebola Award for Important Future Research on the Computational Biology of Ebola Virus Journal Information Article/Issue Information Journal ID (nlm-ta): F1000Res Self URI: f1000research-4-6464.pdf Journal ID (iso-abbrev): F1000Res Date accepted: 13 January 2015 Journal ID (pmc): F1000Research Publication date (electronic): 15 January 2015 Title: F1000Research Publication date (collection): 2015 ISSN (electronic): 2046-1402 Volume: 4 Publisher: F1000Research (London, UK) Electronic Location Identifier: 12 Article Id (accession): PMC4457108 Article Id (pmcid): PMC4457108 Article Id (pmc-uid): 4457108 PubMed ID: 26097686 DOI: 10.12688/f1000research.6038.1 Funding: The author(s) declared that no grants were involved in supporting this work. Categories Subject: Editorial Categories Subject: Articles Subject: Bioinformatics Subject: Theory & Simulation Subject: Tropical & Travel-Associated Diseases Subject: Viral Infections (without HIV) Subject: Virology ISCB Ebola Award for Important Future Research on the Computational Biology of Ebola Virus v1; ref status: not peer reviewed Peter D. Karp Bonnie Berger Diane Kovats Thomas Lengauer Michal Linial Pardis Sabeti Winston Hide Burkhard Rost 1 1International Society for Computational Biology, La Jolla, CA, USA 2 2SRI International, Menlo Park, CA, USA 3 3Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA 4 4Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarbruecken, Germany 5 5Hebrew University & Institute of Advanced
    [Show full text]
  • From DNA Sequence to Chromatin Dynamics: Computational Analysis of Transcriptional Regulation
    From DNA Sequence to Chromatin Dynamics: Computational Analysis of Transcriptional Regulation Thesis submitted for the degree of “Doctor of Philosophy” by Tommy Kaplan Submitted to the Senate of the Hebrew University May 2008 This work was carried out under the supervision of Prof. Nir Friedman and Prof. Hanah Margalit Abstract All cells of a living organism share the same DNA. Yet, they differ in structure, activities and interactions. These differences arise through a tight regulatory system which activates different genes and pathways to fit the cell’s specialization, condition, and requirements. Deciphering the regulatory mechanisms underlying a living cell is one of the fundamental challenges in biology. Such knowledge will allow us to better understand how cells work, how they respond to external stimuli, what goes wrong in diseases like cancer (which often involves disruption of gene regulation), and how it can be fought. In my PhD, I focus on regulation of gene expression from three perspectives. First, I present an innovative algorithm for identifying the target genes of novel transcription factors, based on their protein sequence (Chapter 1). Second, I consider how several transcription factors cooperate to process external stimuli and alter the behavior of the cell (Chapter 2). Finally, I study how the genomic position of nucleosomes and their covalent modifications modulate the accessibility of DNA to transcription factors, thus adding a fascinating dimension to transcriptional regulation (Chapters 3 and 4). To understand transcriptional regulation, one should first reconstruct the architecture of the cell’s regulatory map, thus identifying which genes are regulated by which transcription factors (TFs).
    [Show full text]
  • Integrative Annotation of 21,037 Human Genes Validated by Full-Length Cdna Clones
    PLoS BIOLOGY Integrative Annotation of 21,037 Human Genes Validated by Full-Length cDNA Clones Tadashi Imanishi1, Takeshi Itoh1,2, Yutaka Suzuki3,68, Claire O’Donovan4, Satoshi Fukuchi5, Kanako O. Koyanagi6, Roberto A. Barrero5, Takuro Tamura7,8, Yumi Yamaguchi-Kabata1, Motohiko Tanino1,7, Kei Yura9, Satoru Miyazaki5, Kazuho Ikeo5, Keiichi Homma5, Arek Kasprzyk4, Tetsuo Nishikawa10,11, Mika Hirakawa12, Jean Thierry-Mieg13,14, Danielle Thierry-Mieg13,14, Jennifer Ashurst15, Libin Jia16, Mitsuteru Nakao3, Michael A. Thomas17, Nicola Mulder4, Youla Karavidopoulou4, Lihua Jin5, Sangsoo Kim18, Tomohiro Yasuda11, Boris Lenhard19, Eric Eveno20,21, Yoshiyuki Suzuki5, Chisato Yamasaki1, Jun-ichi Takeda1, Craig Gough1,7, Phillip Hilton1,7, Yasuyuki Fujii1,7, Hiroaki Sakai1,7,22, Susumu Tanaka1,7, Clara Amid23, Matthew Bellgard24, Maria de Fatima Bonaldo25, Hidemasa Bono26, Susan K. Bromberg27, Anthony J. Brookes19, Elspeth Bruford28, Piero Carninci29, Claude Chelala20, Christine Couillault20,21, Sandro J. de Souza30, Marie-Anne Debily20, Marie-Dominique Devignes31, Inna Dubchak32, Toshinori Endo33, Anne Estreicher34, Eduardo Eyras15, Kaoru Fukami-Kobayashi35, Gopal R. Gopinath36, Esther Graudens20,21, Yoonsoo Hahn18, Michael Han23, Ze-Guang Han21,37, Kousuke Hanada5, Hideki Hanaoka1, Erimi Harada1,7, Katsuyuki Hashimoto38, Ursula Hinz34, Momoki Hirai39, Teruyoshi Hishiki40, Ian Hopkinson41,42, Sandrine Imbeaud20,21, Hidetoshi Inoko1,7,43, Alexander Kanapin4, Yayoi Kaneko1,7, Takeya Kasukawa26, Janet Kelso44, Paul Kersey4, Reiko Kikuno45, Kouichi
    [Show full text]
  • Takashi Gojobori Date of Birth 24 October 1951 Place Fukuoka (Japan) Nomination 5 September 2007 Field Evolutionary Genomics Title Professor
    Takashi Gojobori Date of Birth 24 October 1951 Place Fukuoka (Japan) Nomination 5 September 2007 Field Evolutionary Genomics Title Professor Most important awards, prizes and academies Society Prize (Kihara Medal), The Genetic Society of Japan (2005); Society Prize (Motoo Kimura Medal), The Society of Evolutionary Studies, Japan (2004); The Gaetano Salvatore Gold Medal, Stazione Zoologica, Anton Dohrn, Italy (2004); Science Award from Japan Science and Technology Corporation (JST) (1997); Science Award from Hitoshi Kihara Memorial Foundation, Japan (1995); Promotion Award from Japanese Society of Genetics (1987). Fellowships: Visiting Research Fellow at Imperial Cancer Research Fund (ICRF) in London (1989); Fellow of American Association for the Advancement of Science (AAAS) (2006). Academies and Professional organisations: Foreign Honorary Member of the American Academy of Arts and Sciences (2006); Member of the Pontifical Academy of Sciences (2007); Vice-Chairman, The International Society of Molecular Evolution; Member, The American Society of Genetics; Member, The Society of Molecular Biology and Evolution; Member, The Japanese Society of Genetics; Member, The Japanese Society of Evolutionary Studies. Summary of scientific research He has worked extensively on the rates of synonymous and nonsynonymous substitutions, positive selection, horizontal gene transfer, viral evolution, genomic evolution, and comparative gene expressionics. In recent years he has focused on the evolution of the brain and of the Central Nervous System. He has contributed to the DDBJ/GenBank/EMBL international nucleotide sequence database construction as well as the H-Invitational human gene database. Main publications Jung Shan, H., Ohyanagi, H., Hayakawa, S., Osato, N., Nishimiya-Fujisawa, C., Ikeo, K., David, C., Fujisawa, T. and Gojobori, T.
    [Show full text]
  • Assigning Folds to the Proteins Encoded by the Genome of Mycoplasma Genitalium (Protein Fold Recognition͞computer Analysis of Genome Sequences)
    Proc. Natl. Acad. Sci. USA Vol. 94, pp. 11929–11934, October 1997 Biophysics Assigning folds to the proteins encoded by the genome of Mycoplasma genitalium (protein fold recognitionycomputer analysis of genome sequences) DANIEL FISCHER* AND DAVID EISENBERG University of California, Los Angeles–Department of Energy Laboratory of Structural Biology and Molecular Medicine, Molecular Biology Institute, University of California, Los Angeles, Box 951570, Los Angeles, CA 90095-1570 Contributed by David Eisenberg, August 8, 1997 ABSTRACT A crucial step in exploiting the information genitalium (MG) (10), as a test of the capabilities of our inherent in genome sequences is to assign to each protein automatic fold recognition server and as a case study to sequence its three-dimensional fold and biological function. identify the difficulties facing automated fold assignment. Here we describe fold assignment for the proteins encoded by the small genome of Mycoplasma genitalium. The assignment MATERIALS AND METHODS was carried out by our computer server (http:yywww.doe- mbi.ucla.eduypeopleyfrsvryfrsvr.html), which assigns folds to The MG Sequences. The 468 MG sequences were obtained amino acid sequences by comparing sequence-derived predic- from The Institute for Genome Research (TIGR) through its tions with known structures. Of the total of 468 protein ORFs, Web address: http:yywww.tigr.orgytdbymdbymgdbymgd- 103 (22%) can be assigned a known protein fold with high b.html. Three types of annotation (based on searches in the confidence, as cross-validated with tests on known structures. sequence database) accompany each TIGR sequence (10): (i) Of these sequences, 75 (16%) show enough sequence similarity functional assignment—a clear sequence similarity with a to proteins of known structure that they can also be detected protein of known function from another organism was found by traditional sequence–sequence comparison methods.
    [Show full text]
  • Practical Structure-Sequence Alignment of Pseudoknotted Rnas Wei Wang
    Practical structure-sequence alignment of pseudoknotted RNAs Wei Wang To cite this version: Wei Wang. Practical structure-sequence alignment of pseudoknotted RNAs. Bioinformatics [q- bio.QM]. Université Paris Saclay (COmUE), 2017. English. NNT : 2017SACLS563. tel-01697889 HAL Id: tel-01697889 https://tel.archives-ouvertes.fr/tel-01697889 Submitted on 31 Jan 2018 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. 1 NNT : 2017SACLS563 Thèse de doctorat de l’Université Paris-Saclay préparée à L’Université Paris-Sud Ecole doctorale n◦580 (STIC) Sciences et Technologies de l’Information et de la Communication Spécialité de doctorat : Informatique par M. Wei WANG Alignement pratique de structure-séquence d’ARN avec pseudonœuds Thèse présentée et soutenue à Orsay, le 18 Décembre 2017. Composition du Jury : Mme Hélène TOUZET Directrice de Recherche (Présidente) CNRS, Université Lille 1 M. Guillaume FERTIN Professeur (Rapporteur) Université de Nantes M. Jan GORODKIN Professeur (Rapporteur) University of Copenhagen Mme Johanne COHEN Directrice de Recherche (Examinatrice)
    [Show full text]
  • ACCEPTED PAPERS – ABSTRACTS (As of May 5, 2006) Comparative
    ACCEPTED PAPERS – ABSTRACTS (as of May 5, 2006) Comparative Genomics Hairpins in a haystack: recognizing microrna precursors in comparative genomics data Author(s): Jana Hertel, Peter F. Stadler Recently, genome wide surveys for non-coding RNAs have provided evidence for tens of thousands of previously undescribed evolutionary conserved RNAs with distinctive secondary structures. The annotation of these putative ncRNAs, however, remains a difficult problem. Here we describe an SVM-based approach that, in conjunction with a non-stringent filter for consensus secondary structures, is capable of efficiently recognizing microRNA precursors in multiple sequence alignments. The software was applied to recent genome-wide RNAz surveys of mammals, urochordates, and nematodes. Keywords: miRNA, support vector machine, non-coding RNA Comparative Genomics Comparative genomics reveals unusually long motifs in mammalian genomes Author(s): Neil Jones, Pavel Pevzner Motivation: The recent discovery of the first small modulatory RNA (smRNA) presents the challenge of finding other molecules of similar length and conservation level. Unlike short interfering RNA (siRNA) and micro-RNA (miRNA), effective computational and experimental screening methods are not currently known for this species of RNA molecule, and the discovery of the one known example was partly fortuitous because it happened to be complementary to a well-studied DNA binding motif (the Neuron Restrictive Silencer Element). Results: The existing comparative genomics approaches (e.g., phylogenetic footprinting) rely on alignments of orthologous regions across multiple genomes. This approach, while extremely valuable, is not suitable for finding motifs with highly diverged ``non-alignable'' flanking regions. Here we show that several unusually long and well conserved motifs can be discovered de novo through a comparative genomics approach that does not require an alignment of orthologous upstream regions.
    [Show full text]