Genomic Data Mining for Functional Annotation of Human Long Noncoding Rnas Brian L

Clemson University TigerPrints All Dissertations Dissertations 5-2018 Genomic Data Mining for Functional Annotation of Human Long Noncoding RNAs Brian L. Gudenas Clemson University, [email protected] Follow this and additional works at: https://tigerprints.clemson.edu/all_dissertations Recommended Citation Gudenas, Brian L., "Genomic Data Mining for Functional Annotation of Human Long Noncoding RNAs" (2018). All Dissertations. 2146. https://tigerprints.clemson.edu/all_dissertations/2146 This Dissertation is brought to you for free and open access by the Dissertations at TigerPrints. It has been accepted for inclusion in All Dissertations by an authorized administrator of TigerPrints. For more information, please contact [email protected]. GENOMIC DATA MINING FOR FUNCTIONAL ANNOTATION OF HUMAN LONG NONCODING RNAs A Dissertation Presented to the Graduate School of Clemson University In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy Genetics by Brian L. Gudenas May 2018 Accepted by: Dr. Liangjiang Wang, Committee Chair Dr. Hong Luo Dr. Anand Srivastava Dr. Rajandeep Sekhon ABSTRACT Life may have begun in an RNA world, which is supported by the increasingly vital role that RNA has been shown to perform in biological systems. To understand how the genome encodes life, one must look to the transcriptome, the set of all RNA molecules in a cell. The transcriptome illustrates which RNA transcripts are expressed at what times and this orchestrated network of gene expression is responsible for multicellular development. In humans, most genes are noncoding RNAs, meaning that they do not encode proteins. The largest class of noncoding genes are long noncoding RNAs (lncRNAs), RNA transcripts greater in length than 200 nucleotides which lack protein-coding capacity. Some lncRNAs have been shown to be key regulators; however, most lncRNAs are uncharacterized. Therefore, we developed genomic data mining methodologies for lncRNA functional annotation. Many lncRNAs are brain-specific and their dysregulation is suspected to be involved in neurodevelopmental disorders. Two prevalent brain disorders are intellectual disability (ID) and autism spectrum disorder (ASD), which are genetically heterogeneous with unidentified genetic risk factors. In this study, we created brain developmental gene coexpression networks, for ID and ASD, to identify lncRNAs associated with known disease genes. We found lncRNAs highly co-expressed with ID genes which harbored ID- associated copy number variants (CNVs). To find ASD-associated lncRNAs we identified lncRNAs differentially expressed in the ASD brain and then refined these candidates by filtering for associations with ASD risk genes in a human brain developmental coexpression network. These candidate-ASD associated lncRNAs were associated with the ii synaptic transmission and immune response pathways, in addition to residing within ASD- associated CNVs at a high frequency. The mechanism by which lncRNAs function is partly determined by functional motifs in the RNA transcript sequence. To identify lncRNA motifs, we developed a genetic algorithm capable of finding long motifs and found a motif associated with lncRNA nuclear localization. LncRNA functions are compartmentalized within the cell; therefore, knowledge of lncRNA subcellular localization provides insight into their biological function. We developed a deep learning model that predicts lncRNA subcellular localization from lncRNA transcript sequences. This model obtained high prediction accuracy on lncRNAs with known localizations suggesting that sequence motifs are involved in subcellular localization. In summary, we developed genomic data mining methods for the functional characterization of lncRNAs based on their expression patterns and transcript sequences. iii DEDICATION For my mom and dad, without their lifelong love and support I would have never made it this far. iv ACKNOWLEDGMENTS I want to sincerely thank my research advisor Dr. Wang for his excellent mentorship on bioinformatics, genetics and science. I would not be the scientist I am today without his guidance. I also want to extend gratitude to my graduate committee members, Dr. Luo, Dr. Srivastava and Dr. Sekhon. They have been incredible resources of scientific wisdom which have greatly helped my academic development. v TABLE OF CONTENTS Page TITLE PAGE .................................................................................................................... i ABSTRACT ..................................................................................................................... ii DEDICATION ................................................................................................................ iv ACKNOWLEDGMENTS ............................................................................................... v LIST OF TABLES ........................................................................................................ viii LIST OF FIGURES ........................................................................................................ ix CHAPTER I. LITERATURE REVIEW OF THE FUNCTIONAL ANNOTATION OF LONG NONCODING RNAS THROUGH GENOMIC DATA MINING ... 1 Introduction .............................................................................................. 1 Biological function of lncRNAs .............................................................. 3 Subcellular localization of lncRNAs ....................................................... 5 lncRNAs in human disease ...................................................................... 8 Genomic data mining ............................................................................. 10 Expression-based lncRNA functional annotation .................................. 14 Sequence-based lncRNA functional annotation .................................... 17 Concluding remarks ............................................................................... 21 II. GENE COEXPRESSION NETWORKS IN HUMAN BRAIN DEVELOPMENTAL TRANSCRIPTOMES IMPLICATE THE ASSOCIATION OF LONG NONCODING RNAS WITH INTELLECTUAL DISABILITY ............................................................................................... 22 Introduction ............................................................................................ 23 Methods.................................................................................................. 25 Results and Discussion .......................................................................... 29 Conclusion ............................................................................................. 38 III. INTEGRATIVE GENOMIC ANALYSES FOR IDENTIFICATION AND PRIORITIZATION OF LONG NONCODING RNAS ASSOCIATED WITH AUTISM ...................................................................................................... 40 vi Introduction ............................................................................................ 41 Results and Discussion .......................................................................... 44 Conclusions ............................................................................................ 55 Methods.................................................................................................. 56 IV. A GENETIC ALGORITHM FOR FINDING DISCRIMINATIVE FUNCTIONAL MOTIFS IN LONG NONCODING RNA ........................ 62 Introduction ............................................................................................ 63 Methods.................................................................................................. 65 Results .................................................................................................... 68 Conclusion ............................................................................................. 71 V. PREDICTION OF LNCRNA SUBCELLULAR LOCALIZATION WITH DEEP LEARNING FROM SEQUENCE FEATURES ............................... 73 Introduction ............................................................................................ 74 Methods.................................................................................................. 77 Results .................................................................................................... 83 Conclusion ............................................................................................. 91 V. CONCLUSION ............................................................................................ 93 APPENDICES ............................................................................................................... 98 A: Additional files............................................................................................. 99 B: Supplementary figures ............................................................................... 100 REFERENCES ............................................................................................................ 103 vii LIST OF TABLES Table Page 2.1 Prioritized list of ID-associated lncRNA candidates ................................... 37 5.1 Performance metrics on the test set ............................................................. 86 viii LIST OF FIGURES Figure Page 1.1 Functional themes of lncRNAs ...................................................................... 5 1.2 LncRNA cellular functions ............................................................................ 7 1.3 Overview of coexpression network analysis...............................................

Genomic Data Mining for Functional Annotation of Human Long Noncoding Rnas Brian L

ARTICLE Doi:10.1038/Nature10523

Experimental Eye Research 129 (2014) 93E106

Impact of Disseminated Neuroblastoma Cells on the Identification of the Relapse-Seeding Clone

L1 Retrotransposition Is a Common Feature of Mammalian Hepatocarcinogenesis

A High Throughput, Functional Screen of Human Body Mass Index GWAS Loci Using Tissue-Specific Rnai Drosophila Melanogaster Crosses Thomas J

Characterizing Genomic Duplication in Autism Spectrum Disorder by Edward James Higginbotham a Thesis Submitted in Conformity

Variation in Protein Coding Genes Identifies Information Flow

The Role of Lamin Associated Domains in Global Chromatin Organization and Nuclear Architecture

Table S1. 103 Ferroptosis-Related Genes Retrieved from the Genecards

Molecular Sciences High-Resolution Chromosome Ideogram Representation of Currently Recognized Genes for Autism Spectrum Disorder

L1 Retrotransposition Is a Common Feature of Mammalian Hepatocarcinogenesis

Supplementary Table 1. Loci Associated with Heifer Conception Rate at First Breeding BTA1 UMD 3.1 Position2 SNP ID3,4 P-Value5 M