Signature Redacted Author
Total Page:16
File Type:pdf, Size:1020Kb
Automated, highly scalable RNA-seq analysis ARCHNES M ASSA HUSETS INS ITUTE by Rory Kirchner RSEP 24 2015 B.S., Rochester Institute of Technology (1999) LIBRARIES Submitted to the Department of Health Sciences and Technology in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Health Sciences and Technology at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY September 2015 D Massachusetts Institute of Technology 2015. All rights reserved. Signature redacted Author. Department of Health S ences and Technology Septem 2015 Signature redacted Certified by... Martha Constantine-Paton Professor of E rain and Cognitive Science Thesis Supervisor Signature redacted Acrented by ...... ........ Emery N. Brown Director, Harvard- Program in Health Sciences and Technology Professor of Computational Neuroscience and Health Sciences and Technology F Automated, highly scalable RNA-seq analysis by Rory Kirchner Submitted to the Department of Health Sciences and Technology on September 1, 2015, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Health Sciences and Technology Abstract RNA-sequencing is a sensitive method for inferring gene expression and provides ad- ditional information regarding splice variants, polymorphisms and novel genes and isoforms. Using this extra information greatly increases the complexity of an analysis and prevents novice investigators from analyzing their own data. The first chapter of this work introduces a solution to this issue. It describes a community-curated, scal- able RNA-seq analysis framework for performing differential transcriptome expres- sion, transcriptome assembly, variant and RNA-editing calling. It handles the entire stack of an analysis, from downloading and installing hundreds of tools, libraries and genomes to running an analysis that is able to be scaled to handle thousands of samples simultaneously. It can be run on a local machine, any high performance cluster or on the cloud and new tools can be plugged in at will. The second chapter of this work uses this software to examine transcriptome changes in the cortex of a mouse model of tuberous sclerosis with a neuron-specific knockout of Tscl. We show that upregulation of the serotonin receptor Htr2c causes aberrant calcium spiking in the Tscl knockout mouse, and implicate it as a novel therapeutic target for tuberous sclerosis. The third chapter of this work investigates transcriptome regulation in the superior colliculus with prolonged eye closure. We show that while the colliculus undergoes long term anatomical changes with light deprivation, the gene expression in the colliculus is unchanged, barring a module of genes involved in energy produc- tion. We use the gene expression data to resolve a long-standing debate regarding the expression of dopamine receptors in the superior colliculus and found a striking segregation of the Drdl and Drd2 dopamine receptors into distinct functional zones. Thesis Supervisor: Martha Constantine-Paton 2 Title: Professor of Brain and Cognitive Science 3 Acknowledgments Thank you to the members of the Constantine-Paton Lab, Robert and Ellen Kirchner, my parents, Dr. Melcher for her unwavering support and to the Imaginary Bridges Group. 4 Contents 1 Automated RNA-seq analysis 11 1.1 Background . .. ............ ........... 12 1.1.1 RNA-seq ........ ............................. 12 1.1.2 RNA-seq processing challenges ................. 13 1.1.3 RNA-seq analysis pipelines ....... ........... 15 1.1.4 Scalability ....... ............ .......... 17 1.1.5 Reproducibility .......................... 23 1.1.6 Configuration .... ....................... 24 1.1.7 Q uantifiable ...... ............ .......... 25 1.1.8 RNA-seq implementation .......... .......... 31 1.1.9 Variant calling and RNA editing .... ........... 45 1.2 D iscussion ....................... .......... 47 1.3 Future development ............ ............ .... 48 2 Transcriptome defects in a mouse model of tuberous sclerosis 50 2.1 Background ................................ 50 2.1.1 Tuberous sclerosis ........................ 50 2.1.2 Tuberous sclerosis is a tractible autism model ......... 59 5 2.1.3 The second order effects of mTor activation are important 61 2.2 Methods ........................... ...... 63 2.2.1 Cortex collection .......... ........ .... .. 63 2.2.2 Library preparation and sequencing ........ ...... 64 2.2.3 Informatics analysis .... ............ .. .. .. 65 2.2.4 Calcium Imaging .................. .... .. 67 2.3 R esults ........ .......... .......... .. ... 68 2.3.1 Differential expression ............... .. .. 68 2.3.2 Pathway analysis ......... ......... .. ... 75 2.3.3 Isoform differential expression .......... .. ... 75 2.3.4 Disease association .......... ....... .. ... 76 2.3.5 RNA editing . .................. .. ... 80 2.3.6 Syn1-Tsc1-/- mice have aberrant calcium signaling. .... 81 2.3.7 Htr2c blocker halts aberrant calcium signaling .. .. ... 83 2.4 Discussion ...... .......... .......... .. ... 83 3 Transcriptome independent retained plasticity of the corticocollic- ular projection of the mouse 87 3.1 Background .... .. ... ......... .............. 87 3.1.1 Superior colliculus ........... ............. 87 3.1.2 Retinotopographic map formation in the colliculus ...... 91 3.2 M ethods ... ............. ............ ...... 100 3.2.1 Eye closure manipulation and colliculus dissection ....... 100 3.2.2 Total RNA isolation and quality control ...... ...... 101 3.2.3 cDNA Library creation and sequencing ....... ...... 104 3.2.4 Informatics analysis ..... ........ ....... ... 105 6 3.2.5 Differential expression ...................... 105 3.2.6 Corticollicular projection mapping and quantitation ...... 106 3.3 R esults .. ............ ............ ......... 107 3.3.1 Corticocollicular projection remodelling .... ........ 107 3.3.2 Sequencing quality control .. .................. 108 3.3.3 Differential gene expression .. ........... ...... 108 3.3.4 Possible X-linked cofactors in LHON ...... ........ 113 3.3.5 Differential exon expression ... ..... ...... ..... 114 3.3.6 Dopamine receptor expression in the superior colliculus .. 114 3.4 Discussion ....... ............ ............. 116 7 List of Figures 1-1 collectl benchmarks of hourly memory, disk and CPU usage during a RNA-seq analysis. ..... ....................... 19 1-2 Schematic of parallelization abstraction provided by ipython-cluster- h elp er. .................................. 20 1-3 The Poisson distribution is overdispersed for RNA-seq count data. 30 1-4 RNA-seq differential expression concordance calculation from two sim- ulated experiments. .......... .................. 32 1-5 Schematic of RNA-seq analysis. ................ ..... 34 1-6 Tuning adapter trimming with cutadapt ...... .......... 36 1-7 Gene level quantification can introduce errors. ........ ..... 39 1-8 RNA-seq improves the rat transcriptome annotation. ..... .... 44 2-1 CNS manifestations of tuberous sclerosis. ................ 51 2-2 The mTor pathway is affected in many disorders with autism as a phenotype. ..... ....................... .... 54 2-3 The laminar structure of cortex in Syn1-Tscl-/- mice is undisturbed. 58 2-4 MA-plot of gene expression in the cortex of wild type vs. Synl-Tscl-- litterm ates. ................................ 69 8 2-5 Culurued Syn1-Tsc1-/- neurons have synchronized calcium flux. .. 82 2-6 Blocking the Htr2c receptor halts aberrant calcium spiking in Syn1- Tscl-/- neurons. ..................... ......... 84 3-1 Cell types of the superior colliculus .... ............... 90 3-2 Eye opening refines the corticollicular projection. ........... 99 3-3 Schematic of superior colliculus dissection. ............... 102 3-4 Effect of eye closure state on corticocollicular arbor development. .. 109 3-5 Trimmed mean of M-values (TMM) normalization effectively normal- izes gene expression. ........................... 110 3-6 Clustering of gene expression in the rat colliculus. ........... 111 3-7 MA-plot (mean expression vs log2 fold change) between eyes open and closed rats. ......... ....................... 112 3-8 MA-plot of differential exon analysis. ....... ........... 115 3-9 Dopamine receptor subtypes are segregated into distinct layers of the superior colliculus. ................... ......... 117 9 List of Tables 1.1 Cleaning the raw Cufflinks reference-guided transcriptome assembly improves the false positive rate. ................. .... 43 2.1 Differentially expressed genes in Synl-Tscl-/~ mice. ...... .... 75 2.2 KEGG pathway analysis of differentially regulated genes ....... 76 2.3 Gene Ontology (GO) term analysis of differentially regulated genes. 77 2.4 mTor signaling is differentially regulated in the Syn1-Tsc-/- mouse. 78 2.5 Autism, intellectual disability and epilepsy genes differentially regu- lated in the Syn1-TscEl-/- mouse ............ ......... 79 2.6 RNA editing events found in Synl-Tsc-/- and WT mice. ... .... 80 3.1 Power estimation of the eyes open vs. eyes closed experiment at a range of fold changes .................. ......... 106 3.2 Differentially expressed genes with FDR < 0.05 ....... ..... 113 Chapter 1 Automated RNA-seq analysis RNA-sequencing (RNA-seq) has largely supplanted array-based methods for infer- ring differential expression at the gene, isoform and event level as RNA-seq is more sensitive and produces more information than array-based methods at a similar cost. However processing RNA-seq data and making use of the extra information that comes from having sequence data is complex