I Hide and Seq: Novel Approaches to RNA-Sequencing

Hide and Seq: Novel approaches to RNA-sequencing data with a focus on neurodegenerative disease by Marko Melnick Sc.B., University of Colorado at Denver, 2011 Masters in Integrative Physiology, University of Colorado at Boulder, 2018 A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirement for the degree of Doctor of Science Department of Integrative Physiology 2021 Committee Members Dr. Marissa Ehringer, Ph.D., Committee Chair Dr. Christopher Link, Ph.D. Dr. Robin Dowell, Ph.D. Dr. Tanya Alderete, Ph.D. Dr. Jerry A. Stitzel, Ph.D. i Hide and Seq: Novel approaches to RNA-sequencing data with a focus on neurodegenerative disease Thesis directed by Dr. Christopher Link, Ph.D. and Dr. Robin Dowell, Ph.D. Humanity is still at the early stages of understanding the highly complex molecular mechanisms of neurodegeneration. Along with the mechanisms, how neurodegeneration starts remains largely unknown for many of these diseases. My research focuses on using RNA- sequencing data to understand these mechanisms as well as gain insights into how these diseases might form. We first found that with heat shock, Caenorhabditis elegans worms have nuclear double stranded RNA (dsRNA) foci that appear similar to foci in worms with a knockout of tdp-1, a worm homolog of the Amyotrophic Lateral Sclerosis (ALS)-related protein TDP-43. Next, we performed RNA-sequencing of immunoprecipitated dsRNA and noticed that transcription of dsRNA is enriched downstream of some genes. We then created a novel algorithm called Dogcatcher which can capture and quantify downstream of gene regions (DoG) genome-wide. We then biologically validate a DoG identified by Dogcatcher using fluorescence imaging. ii We next developed an algorithm called Mystery Miner that uses RNA-seq data to look for the presence of pathogens that might contribute to neurodegeneration. Specifically, we take often discarded non-host reads from RNA-sequencing data and identify microbes present in samples as well as quantify microbes between groups. We then apply Mystery Miner to our novel ALS dataset that consists of over 120 patients from four patient classes (three ALS related, one control). Although we find no convincing evidence for the presence of microbes or a set of microbes that might contribute to neurodegeneration, we do find and biologically confirm the presence of a novel RNA-dependent RNA polymerase-like sequence present in our dataset. Additionally, we apply Mystery Miner to other ALS related datasets in the field and perform a meta-analysis looking for any microbes that might contribute to ALS. Despite multiple types of analysis, we find no convincing evidence for the presence of a microbe or set of microbes in ALS related patient classes for any of the datasets analyzed. Finally, I create an algorithm called MaDDoG that segments DoGs by partitioning loci at points of change (using the mean count) with subsequent quantification of segments between groups of samples. I first generate synthetic data as a proof of concept for MaDDoG, then apply it genome-wide on a real dataset. I then highlight the many applications of MaDDoG such as distinguishing between genes that have truncated or absent control DoGs, identifying regions at the end of DoGs that are likely transcriptional noise, its flexibility with regards to data with high or low variance of mean count, and look for intron retention or alternative exon usage in genic regions. iii I hope any biological insights and algorithms created in my research will help patients suffering from neurodegeneration. Additionally, my algorithms can be applied outside of neurodegeneration and I hope that researchers from highly diverse fields will use my algorithms to gain further insights into many disease processes. iv Acknowledgements I feel truly lucky to be in this position. I have personally seen many graduate students suffer from sexism, mental illness, and inept professors (and this was all pre-pandemic!!!). I am eternally grateful to Dr. Chris Link (Captain Patience) for keeping his “dope slapper” close by during our many conversations and ensuring that some of my “spaghetti on the wall” ideas came to fruition. I would like to thank my bioinformatic mentor Patrick Gonzales, who has been a tremendous teacher with exceptional patience. I would also like to thank Dr. Robin Dowell- Deen, your kindness and mentorship has been the ultimate sanity check. I feel blessed to have made so many friends in the Link and Dowell lab, I will truly miss the daily exciting ideas and belly laughs so please don’t hesitate to contact me no matter the time passed. I am also sorry (not sorry) for asking as many stupid questions as possible. Aside from academia, my family has been an unshakeable foundation of support. I would like to thank my late Grandmother for her boundless kindness and easy-going attitude, my mother for her almost gag-inducing positivity, my stepmom for her support and love, and my aunt Dr. Deborah for her “go-getter” attitude that inspires me regularly. Everyone else in my family has been equally awesome and I know that I could not have done this without my great support network. …And we can’t forget my pup, the best buddy and worst headache in the world Kosmo! v Dedication I would like to dedicate this thesis to my late father Rob Melnick, who always told me to “observe my world” and “get your ass in the chair and do the work!”. He was my greatest supporter and never blinked when I told him my crazy plans. You are missed greatly. “When the power of love overcomes the love of power. The world will know peace.” – Jimmy Hendrix vi Chapters I. Introduction ................................................................................................................ 1 Summary 1 II. Background ................................................................................................................. 3 Common sequencing technology and bioinformatic applications 3 Introduction 3 Differential gene expression 8 Alternative Splicing 12 Repetitive elements 16 RNA Editing 18 Conclusion 22 Neurodegeneration 23 Introduction 23 Alzheimer’s disease 25 Amyotrophic Lateral Sclerosis 28 Next generation sequencing in the clinic 32 Conclusion 33 Downstream of gene transcription 34 Introduction 34 Transcription termination background 34 Stressors induce downstream of gene transcription 36 Mechanisms of downstream of gene induction 37 Putative downstream of gene functions 38 Conclusion 39 Biomes and bioprospecting 40 Introduction 40 Brief history of microbiomes 41 Sequencing technologies for microbiome analysis 42 Pathogens in neurodegenerative diseases 46 Conclusion 48 III. Dogcatcher: Heat shock in C. elegans induces downstream of gene transcription and accumulation of double-stranded RNA ................................................................................. 49 Contributions from fellow researchers 49 Citation 49 Introduction 49 Results 51 Heat shocK induces nuclear dsRNA foci in C. elegans 51 Recovery of dsRNA by J2 immunoprecipitation 54 Measurement of antisense gene transcripts after heat shocK 55 Comparison of dsRNAs identified in worms heat shocked or deleted for tdp-1 58 J2 Enrichment of Repetitive elements 61 Enrichment of transcripts downstream of genes in the J2 pool 63 Additional non-annotated transcripts are minimally enriched in the J2 pool after heat shock or tdp-1 deletion 70 Increased antisense transcription over genes associated with DoGS and ADoGS 73 vii Antisense read through into eif-3.B in heat shocked worms 73 Discussion 77 Materials and methods 81 Caenorhabditis elegans culturing and strains 81 Heat stress treatment 81 RNA isolation, cDNA library preparation, and RNA Sequencing 81 Immunohistochemistry and Fluorescence in situ Hybridization (FISH) 82 Microscopy 84 Quantification of coincidence of J2 and HSF-1 foci over time and J2 foci in mutant strains 84 Sequencing data analysis 85 IV. Mystery Miner: Application of a bioinformatic pipeline to RNA-seq data identifies novel virus-like sequence in human blood ............................................................................ 87 Contributions from fellow researchers 87 Citation 87 Introduction 87 Results 91 Pipeline description 91 Validating Mystery Miner on datasets with known bacterial or viral infection 94 Validating Mystery Miner on a synthetic minibiome 96 Effects of library pulldown or globin depletion in RNA-seq datasets 99 Analysis of Our Study 102 Analysis of published ALS datasets 108 Meta-analysis between datasets 113 Discussion 115 Materials and methods 119 Blood Collection and RNA Extraction 119 Globin Depletion 120 Quantitative RT-PCR for blood RNA samples 121 Transcriptomics 122 Statistical Analysis 122 Data availability 122 V. MaDDoG ................................................................................................................. 123 Introduction 123 Results 124 Pipeline description 124 Toy example of Bayesian change point and BIC model selection 126 Application of MaDDoG to select genes from Vilborg et al., 2015 135 Using MaDDoG for intron retention or alternative exon discovery 137 Global results of MaDDoG from Vilborg et al., 2015 141 Discussion 143 Materials and methods 144 VI. Summary and Conclusion ........................................................................................ 146 Summary of major findings and future experiments 146 viii Mystery Miner and potential pathogens in ALS 146 Heat shocK in C. elegans, Dogcatcher, and MaDDoG 148

I Hide and Seq: Novel Approaches to RNA-Sequencing

Table S5. the Information of the Bacteria Annotated in the Soil Community at Species Level

UNIVERSITY of CALIFORNIA, SAN DIEGO the Comparative Genomics

A Genomic Journey Through a Genus of Large DNA Viruses

Isolation, Identification and Investigation of Their Bioactive Potential Inês Filipa Coelho Ribeiro

D 6.1 EMBRIC Showcases

Systematic Research on Actinomycetes Selected According

An Introduction to Actinobacteria

UNIVERSITY of CALIFORNIA, SAN DIEGO The

Description of Unrecorded Bacterial Species Belonging to the Phylum Actinobacteria in Korea

Struktur Und Dynamik Heterotropher Bakteriengemeinschaften Im Wattenmeer Und Der Deutschen Bucht

The Marine Air-Water, Located Between the Atmosphere and The

Sequedex Documentation Release 1.0-Rc1