Detecting and Quantifying the Translated Transcriptome with Ribo-Seq Data
Total Page:16
File Type:pdf, Size:1020Kb
Detecting and quantifying the translated transcriptome with Ribo-seq data D i s s e r t a t i o n zur Erlangung des akademischen Grades Ph.D. bzw. Doctor of Philosophy im Fach Biologie eingereicht an der Lebenswissenschaftlichen Fakultät der Humboldt-Universität zu Berlin von MSc Lorenzo Calviello Präsidentin der Humboldt-Universität zu Berlin Prof. Dr.-Ing. Dr. Sabine Kunst Dekan der Lebenswissenschaftlichen Fakultät der Humboldt-Universität zu Berlin Prof. Dr. Bernhard Grimm Gutachter/innen: 1 – Prof. Dr. Uwe Ohler 2 – Prof. Dr. Nils Blüthgen 3 – Prof. Dr. Markus Landthaler Tag der mündlichten Prüfung 08.11.2017 Contents 1 Introduction ____________________________________________________________ 8 1.1 Thesis outline _______________________________________________________ 9 2 Background ___________________________________________________________ 10 2.1 The Molecular Biology of RNA processing _______________________________ 10 2.1.1 Life and the central dogma ________________________________________ 10 2.1.2 A multitude of RNA species _______________________________________ 13 2.1.3 Nuclear processing ______________________________________________ 17 2.1.4 The cytoplasmic fates of an RNA molecule ___________________________ 19 2.1.5 The Translation process __________________________________________ 20 2.1.6 Translation regulation ____________________________________________ 23 2.1.7 Translation and RNA decay _______________________________________ 26 2.2 Omics techniques to understand RNA biology ____________________________ 29 2.2.1 Next-generation sequencing _______________________________________ 29 2.2.2 RNA-seq applications ____________________________________________ 31 2.2.3 Ribosome Profiling ______________________________________________ 34 2.2.4 Proteomics approaches ___________________________________________ 36 2.3 Computational analysis of -omics data ___________________________________ 40 2.3.1 Genomes and transcriptomes ______________________________________ 40 2.3.2 NGS data pre-processing & mapping ________________________________ 41 2.3.3 Quantification and normalization strategies ___________________________ 42 2.3.4 Beyond count-based methods ______________________________________ 44 2.3.5 The Fourier transform and the Multitaper method ______________________ 45 2.3.6 Ribosome profiling data analysis ___________________________________ 50 2.3.7 Evolutionary signatures on genomic regions __________________________ 59 2.3.8 Shotgun proteomics data analysis ___________________________________ 61 3 Results _______________________________________________________________ 64 3.1 A novel approach to Ribo-seq data analysis _______________________________ 64 3.1.1 Spectral analysis of P-sites profiles __________________________________ 64 3.1.2 On sensitivity and specificity ______________________________________ 67 3.1.3 The RiboTaper strategy to identify translated ORFs ____________________ 69 3.2 Identification of actively translated ORFs in a human cell line ________________ 73 3.2.1 Known and novel ORFs across a wide expression range. _________________ 73 3.2.2 Distinct evolutionary conservation patterns in different ORF categories _____ 77 3.2.3 The de novo identified translatome as a proxy for the cellular proteome. ____ 80 i 3.3 ORF detection with an improved protocol in Arabidopsis thaliana ____________ 82 3.3.1 Analysis of an improved, high-resolution Ribo-seq protocol ______________ 82 3.3.2 New, ultra-conserved ORFs in non-coding genes _______________________ 86 3.4 Annotating and quantifying the translated transcriptome _____________________ 89 3.4.1 The SaTAnn strategy _____________________________________________ 89 3.4.2 Validating translation quantification _________________________________ 98 3.4.3 Translation on degraded RNA isoforms _____________________________ 100 4 Discussion ___________________________________________________________ 104 5 References ___________________________________________________________ 112 Appendix A: The Slepian Sequences and the multitaper F-test ______________________ 133 Appendix B: Supplementary Materials _________________________________________ 136 B.1: The Ribo-seq protocol in HEK293 ______________________________________ 136 B.2: Ribo-seq and RNA-seq data processing. __________________________________ 137 B.3: Supplementary Figure 1: Metagene analysis for different Ribo-seq datasets ______ 137 B.4: multitaper analysis ___________________________________________________ 139 B.5: QTI-seq analysis ____________________________________________________ 139 B.6: Supplementary Figure 2: The toddler ncORF ______________________________ 139 B.7: Supplementary Table 1: RiboTaper-detected ORFs in Danio rerio. _____________ 140 B.8: Evolutionary conservation analysis ______________________________________ 140 B.9: Mass spectrometry preparation and data analysis. __________________________ 141 B.10: Supplementary Figure 3: Additional statistics about Ribotaper- and Uniprot-only identified peptides. ______________________________________________________ 142 B.11: Ribo-seq data processing in Arabidopsis Thaliana _________________________ 142 B.12: Supplementary Table 2: Mapping statistics for the different libraries analyzed in Arabidopsis thaliana. ____________________________________________________ 143 B.13: Supplementary Table 3: Read lengths and cutoffs used to infer P-sites position in Arabidopsis thaliana. ____________________________________________________ 144 B.14: Protein Alignments _________________________________________________ 144 B.15: Polysome profiling, nuclear-cytoplasmic comparison and 5’end sequencing. ____ 145 B.16: Supplementary Figure 4: Translation quantification on different transcript biotypes. ______________________________________________________________________ 145 Appendix C: List of main software used in this study. _____________________________ 146 List of Publications ________________________________________________________ 147 Acknowledgments _________________________________________________________ 148 ii List of main figures Figure 1: The genetic code ....................................................................................................... 12 Figure 2: The central dogma and its main molecular actors .................................................... 13 Figure 3: An overview of the human transcriptome ................................................................ 16 Figure 4: mRNA nuclear processing. ....................................................................................... 18 Figure 5: Different RNA cytoplasmic fates. ............................................................................ 19 Figure 6: The main steps of the translation process. ................................................................ 21 Figure 7: Ribosomal translocation. .......................................................................................... 23 Figure 8: Functional heterogeneity of the alternative transcriptome. ...................................... 27 Figure 9: Illumina sequencing-by-synthesis approach. ............................................................ 30 Figure 10: The Ribo-seq protocol. ........................................................................................... 34 Figure 11: Shotgun proteomics example workflow. ................................................................ 37 Figure 12: Example from the GENCODE 19 GTF file. .......................................................... 40 Figure 13: A schematic of the Fourier transform. .................................................................... 47 Figure 14: Different window functions and their spectral leakage. ......................................... 48 Figure 15: Example of multitaper PSD estimation. ................................................................. 48 Figure 16: Example of Slepian sequences. ............................................................................... 49 Figure 17: Sub-codon resolution in Ribo-seq data. .................................................................. 51 Figure 18: Evolutionary signatures of genomic regions. ......................................................... 60 Figure 19: Analysis workflow for MS/MS data. ...................................................................... 62 Figure 20: Metagene analysis of 29 nt Ribo-seq reads in HEK293. ........................................ 65 Figure 21: Spectral analysis of individual exonic P-sites profiles. .......................................... 66 Figure 22: Sensitivity and specificity of the multitaper method. ............................................. 67 Figure 23: Comparison between the multitaper method and other windowing approaches. ... 69 Figure 24: The RiboTaper workflow. ...................................................................................... 70 Figure 25: RiboTaper de novo ORF-finding strategy. ............................................................. 71 Figure 26: Schematics of RiboTaper ORFs annotation. .......................................................... 72 Figure 27: RiboTaper-detected ORFs in HEK293, across gene biotypes and expression values. .................................................................................................................................................. 73 Figure 28: ORFs categories identified by RiboTaper. ............................................................. 74 Figure 29: QTI-seq comparison ad between-samples reproducibility.