Bioinformatic Analysis of Microrna Genes in Free-Living and Parasitic Nematodes
Total Page:16
File Type:pdf, Size:1020Kb
Bioinformatic Analysis of microRNA Genes in Free-Living and Parasitic Nematodes Rina Ahmed November 2014 DISSERTATION zur Erlangung des akademischen Grades der Doktorin der Naturwissenschaften (Dr. rer. nat.) eingereicht im Fachbereich Mathematik und Informatik der Freien Universit¨at Berlin Begutachtet von: Prof. Dr. Martin Vingron Prof. Dr. Kris Gunsalus 1. Gutachter: Prof. Dr. Martin Vingron 2. Gutachterin: Prof. Dr. Kris Gunsalus Disputation: 26. Februar 2015 Preface The work that led to this thesis is part of two collaborative projects in which I par- ticipated. This thesis presents results from both projects. A panel of different bioin- formatics and statistical methods suitable to analyze small RNA deep sequencing data were identified and developed. Individual contributions for each project will be detailed here: Flexbar Project The work presented in Chapter3 was published in the special issue \Next-Generation Sequencing Approaches in Biology" in the journal Biology 1. The Flexible Barcode and Adapter Remover (FLEXBAR) originated from the Flexible Adapter Remover (FAR) and has been developed by Matthias Dodt in the bioinformat- ics group of Dr. Christoph Dieterich. As part of this project, I developed the adapter removal feature for SOLiD color space reads and focused on the application of small RNA-seq in letter and color space. Additionally, I was involved in the design of FAR and in the development of specific features of the subsequently added barcode detec- tion function for demultiplexing. The final version of FLEXBAR (paper version) has been extensively revised and enhanced by Johannes R¨ohr through the introduction of novel and extended features, a cleanup in the source code, redesigned command-line interface, and optimized parameter settings. miRNA Project The bioinformatics workflow and the analysis and results presented in Chapter2 and4 were published in Genome Biology and Evolution 2. As part of this collaborative project, I designed and performed all computational experiments. The experimental data sets were generated in the group of Dr. Christoph Dieterich at the Berlin Institute for Medical Systems Biology (BIMSB) which is part of the Max-Delbruck-Center¨ for Molecular Medicine (MDC). The total RNA of the parasite samples (Strongyloides ratti) were kindly provided by our collaborator PD. Dr. Norbert W. Brattig from the Bernhard Nocht Institute for Tropical Medicine in Hamburg. All iii next-generation sequencing was performed in the group of Dr. Wei Chen at BIMSB. Acknowledgements The research presented in this thesis was funded by the MDC- NYU Exchange Program and was carried out at BIMSB in the group of Dr. Christoph Dieterich and the Center for Genomics and Systems Biology at the New York Univer- sity (NYU) in the group of Prof. Dr. Kris Gunsalus. In the following, I would like to thank all people who have supported and helped me throughout my PhD studies: First of all, I would like to thank my supervisor Christoph Dieterich for giving me the opportunity to pursue this research in his lab and introducing me to the fascinating world of next-generation sequencing. I am very grateful for his ideas, support, and fruitful discussions throughout the years and for giving me the opportunity to attend international conferences. I especially want to thank my co-supervisor Kris Gunsalus for giving me the chance to work in a very stimulating research environment and tightly connected with the wet lab group of Fabio Piano. Kris provided a creative and open minded working environment and I am very grateful for her dedication and support inside and outside the lab. I would also like to thank Martin Vingron for taking the time to supervise my PhD thesis as my University advisor. Furthermore, I am very grateful to all present and former members of the Dietrich, Gunsalus, and Piano groups and all people from BIMSB for creating this inspiring working atmosphere with plenty of joyful coffee breaks. Special thanks goes to Nikolaus Rajewsky and Jutta Steink¨otter for their guidance and dedication and for providing this excellent research program. Moreover, I have to thank Jennifer Stewart and Sabrina Deter for helping me with all organizational issues and loads of paper work. Last but not least, I would like to thank my friends and family for their tremendous support and patience. iv Contents 1 Introduction1 1.1 Objectives and Thesis Structure . .1 1.2 The Animal Models . .2 1.2.1 Caenorhabditis elegans ........................2 1.2.2 Pristionchus pacificus ........................5 1.2.3 Relationship with Parasitic Nematodes . .5 1.3 Post-transcriptional Regulation of Gene Expression . .7 1.3.1 microRNA Genes . .8 1.4 Next-Generation Sequencing . 11 1.4.1 Illumina/Solexa System . 15 1.4.2 ABI SOLiDTM System . 17 1.4.3 Small RNA Sequencing . 19 2 Materials and Methods 23 2.1 Small RNA Sequencing . 23 2.1.1 Nematode Strains and Culture . 24 2.1.2 Total RNA Isolation and Small RNA Library Generation . 25 2.2 Data Sets . 25 2.2.1 Small RNA Sequencing Data . 25 2.2.2 Publicly Available Data . 26 2.3 Bioinformatics Methods . 27 2.3.1 Preprocessing of Small RNA Sequencing Data . 27 2.3.1.1 Quality Filtering . 28 2.3.1.2 Barcode Detection . 29 2.3.1.3 Adapter Removal . 30 2.3.2 Mapping of Short Sequencing Reads to a Reference . 32 2.3.3 Identification of microRNA Genes from Small RNA-Seq Data . 34 v CONTENTS 2.3.3.1 Quantification of microRNA Expression Levels . 35 2.3.3.2 Identification of Novel microRNA Genes . 35 2.3.4 Differential Expression Analysis . 38 2.3.4.1 Normalizing microRNA Sequencing Data . 38 2.3.4.2 Defining Differential Expression . 40 2.3.4.3 Correction for Multiple Hypothesis Testing . 42 2.3.5 Inference of microRNA Gene Families and Phylogeny . 43 2.3.5.1 Grouping of microRNA Gene Families . 44 2.3.5.2 Multiple Sequence-structure Alignments of RNA . 45 2.3.5.3 Building Phylogenetic Trees . 46 2.3.5.4 Performance Evaluation . 49 2.3.6 Single-Mutation Seed Network . 50 3 FLEXBAR - Flexible Barcode and Adapter Processing for Next- Generation Sequencing 51 3.1 Background . 52 3.2 Program Features . 52 3.2.1 Algorithmic Implementation . 54 3.2.2 Trim-end Modes . 55 3.2.3 Quality Clipping and Read Filtering . 56 3.3 Program Usage . 56 3.4 Program Validation . 58 3.4.1 Adapter Removal from microRNA Short Reads in Color Space . 59 4 Conserved microRNAs are Candidate Post-Transcriptional Regula- tors of Developmental Arrest in Free-Living and Parasitic Nematodes 63 4.1 Background . 64 4.2 Sequencing of microRNAs from Three Nematodes . 65 4.3 Unbiased Identification of Novel microRNA Genes . 67 4.4 Most microRNA Genes Are Not Conserved among Distantly Related Nematodes . 69 4.5 Evaluation of microRNA Homology Assignment . 72 4.6 microRNA Expression Changes from Sequencing Data Agree with Pub- lished qRT-PCR Results . 75 4.7 Differential Expression Analysis Identifies Cross-Species Candidate Reg- ulators . 80 vi CONTENTS 4.8 P. pacificus miR-34 Seed Neighbors are Upregulated in Dauer Larvae . 85 5 Discussion 89 5.1 FLEXBAR - Leading Solution in Barcode and Adapter Processing . 90 5.2 Comprehensive Bioinformatic Analysis Identifies Cross-Species Candi- date Regulators in Nematodes . 91 5.3 Future Directions . 96 5.4 Concluding Remarks . 97 References 99 Abbreviations 119 Summary 123 Zusammenfassung 125 Appendix A - Supplemental Material 129 Appendix B - Supplemental CD 133 Curriculum Vitae 135 Selbst¨andigkeitserkl¨arung 139 vii List of Figures 1.1 Major taxonomic groups of the phylum Nematoda . .3 1.2 Life cycle of free-living and parasitic nematodes . .4 1.3 miRNA biogenesis pathway . 10 1.4 Workflow of Sanger versus next-generation sequencing . 14 1.5 Illumina/Solexa sequencing system . 15 1.6 ABI SOLiD sequencing system . 21 2.1 Experimental setup of microRNA gene profiling . 24 2.2 Bioinformatics analysis workflow for small RNA-seq data . 28 2.3 Processing strategy of adapter recognition and removal in color space RIGHT trim mode . 32 2.4 miRDeep2 prediction strategy . 37 2.5 MA-plots before and after normalization . 41 3.1 FLEXBAR's internal workflow . 53 3.2 Sequence tag recognition with a dynamic programming matrix . 61 3.3 Graphical representation of FLEXBAR's sequence trim-end modes . 62 3.4 Benchmark V - Comparison of FLEXBAR and CUTADAPT . 62 4.1 Bioinformatic analysis workflow for 10 small RNA data sets . 66 4.2 Identified miRNA genes by miRDeep2 . 68 4.3 miRNA gene complement in C. elegans, P. pacificus, and S. ratti .... 69 4.4 miRNA homology and seed conservation . 71 4.5 miRNA graph of 51 gene families . 73 4.6 Phylogenetic tree of let-7 family miRNAs from eight animal clades and shuffled precursors . 75 4.7 Small RNA-seq expression profiles in C. elegans agree with qRT-PCR data 80 4.8 Proportion of expression changes between developmentally arrested stages 81 ix LIST OF FIGURES 4.9 mir-71 family miRNAs as cross-species candidate regulators in develop- mental arrest . 83 4.10 mir-34 family miRNAs as cross-species candidate regulators in develop- mental arrest . 84 4.11 MSA of mir-34 family miRNAs from seven animal clades . 85 4.12 Properties of single-mutation seed network. 86 4.13 Expression conservation of miR-34 seed neighbors . 87 A.1 MSA of let-7 family miRNAs from eight animal clades and five random generated precursors . 130 B.1 Multiple alignments of miRNA seed groups . 133 B.2 Multiple alignments of miRNA families . 133 B.3 Expression fold changes grouped by miRNA families . 133 x List of Tables 1.1 Comparison of Sanger and next-sequencing platforms . 13 2.1 Deep sequencing small RNA data sets profiled in nematodes .