Open Xinwei Han Thesis.Pdf
Total Page:16
File Type:pdf, Size:1020Kb
The Pennsylvania State University The Graduate School Intercollege Graduate Program in Genetics UNDERSTANDING GENE EXPRESSION AND GENETIC RECOMBINATION BY NEXT GENERATION SEQUENCING A Dissertation in Genetics by Xinwei Han 2012 Xinwei Han Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy May 2012 The dissertation of Xinwei Han was reviewed and approved* by the following: Stephen W. Schaeffer Professor of Biology Chair of Committee Hong Ma Distinguished Professor of Biology Dissertation Advisor Naomi S. Altman Professor of Statistics Gong Chen Associate Professor of Biology Anton Nekrutenko Associate Professor of Biochemistry and Molecular Biology Robert F. Paulson Associate Professor of Veterinary and Biomedical Sciences Chair of Intercollege Graduate Degree Program in Genetics *Signatures are on file in the Graduate School iii ABSTRACT The introduction of next-generation sequencing technologies has been changing the landscape of biological research. The plummeting cost of massive sequencing not only leads to the flourishing of various genome projects, but also opens up many opportunities in previously uncharted areas, two remarkable examples of which are sequencing different individuals of the same species and sequencing transcriptomes. With the availability of multiple genome sequences from a population, it is possible to systematically catalog natural variations and, more importantly, investigate their genome-wide distribution to deduce functional elements through conservation or selection. Moreover, as genomic changes reflect evolution at work, the comprehensive map of natural variations serves as a basis for studying the properties and effects of evolutionary forces. For transcriptomics, sequencing has become a revolutionary way to characterize gene expression. It not only offers high replicability and unparalleled accuracy, but also requires less input material, enabling transcriptome study in tiny structures, and no prior knowledge of gene structure, allowing detection of unknown transcripts. Here, three sequencing-based studies are presented. One study explored natural variations between two Arabidopsis ecotypes, Col and Ler, and then investigated one crucial evolutionary force to shape variations, meiotic recombination. The other two studies investigated the small RNA transcriptome in Arabidopsis meiocytes and mRNA transcriptomes in developing mouse cortex. In the first study, the sequencing and comparison between Col and Ler uncovered 249,171 SNPs, 58,085 small and 2,315 large indels, with highly correlated genome-wide distributions of SNPs and small indels. Disease resistance genes contain significantly more variations, suggesting adaptation to specific environmental niches. These variations were then used as markers to investigate meiotic recombinations, crossovers and gene conversions, in two tetrads, detecting 18 crossovers, 6 of which had an associated gene conversion event, and 4 iv independent gene conversions. The number and length of identified recombination events suggest that Arabidopsis gene conversions are likely fewer and with shorter tracts than those in yeast. In addition, the analysis of variations in offspring plants showed meiosis provided a rapid mechanism to generate copy number variations (CNVs) by reshuffling existing variations. In the second study, a recently developed method was applied to collect Arabidopsis meiocytes, a limited number of cells undergoing meiosis, and then small RNAs were profiled using SOLiD sequencing. 97 of 266 known miRNAs show expression in meiocytes. Interestingly, five miRNAs were found to account for more than half of the total miRNA expression in meiocytes, among which miR158a takes up about one third. The target genes of these five miRNAs have little or low expression in meiocytes. One putative novel miRNA was identified, which shows conservation with rice and maize. Analysis of longer reads provided clues for possible long ncRNAs in meiocytes. The mouse transcritpome study uncovered 3,758 differentially expressed genes between two critical stages of cortex development, embryonic day 18 (E18) and postnatal day (P7). Neurogenesis-related genes, such as Sox4 and Sox11, were more highly expressed at E18 than at P7. In contrast, the genes encoding synaptic proteins were up-regulated from E18 to P7, suggesting cortex development changes focus from neuron generation to synapse formation. In addition, approximately 500 genes with unknown function show dramatic change in expression level, serving as a blueprint for further experimental studies. Thousands of novel splice variants from 2,930 genes were identified, providing clues about another layer of dynamic expression. These sequencing-based studies pushed the limit of previous work. The characterization of meiotic recombination reached single base-pair resolution. The mouse transcriptome work is one of several early studies utilizing RNA-seq. The small RNA transcriptome in meiocytes, single type cells at microscopic scale, was profiled for the first time. All these studies provided unprecedented information about complicated biological processes. v TABLE OF CONTENTS LIST OF FIGURES ................................................................................................................. vii LIST OF TABLES ................................................................................................................... ix ACKNOWLEDGEMENTS ..................................................................................................... x CHAPTER 1 INTRODUCTION ............................................................................................ 1 1.1 Overview of Next Generation Sequencing................................................................. 2 1.1.1 Brief history of DNA sequencing .................................................................... 2 1.1.2 Application of next generation sequencing ..................................................... 6 1.1.3 The development of read mapping tools ......................................................... 10 1.2 Arabidopsis Natural Variation and Meiotic Recombination as One Crucial Shaping Force ........................................................................................................... 12 1.3 Brief Overview of Small RNAs Studies in Arabidopsis ............................................ 15 1.4 Transcriptome Studies in Mouse and Brief Introduction of Early Cortex Development ............................................................................................................ 18 CHAPTER 2 ANALYSIS OF ARABIDOPSIS GENOME-WIDE VARIATIONS BEFORE AND AFTER MEIOSIS AND MEIOTIC RECOMBINATION BY RE- SEQUENCING LANDSBERG ERECTA AND ALL FOUR PRODUCTS OF A SINGLE MEIOSIS .......................................................................................................... 20 2.1 Summary .................................................................................................................... 21 2.2 Introduction ................................................................................................................ 22 2.3 Material and Methods ................................................................................................ 25 2.3.1 Plant material and growth conditions .............................................................. 25 2.3.2 DNA isolation, genotyping, and genome re-sequencing ................................. 25 2.3.3 Calling of small polymorphisms ..................................................................... 26 2.3.4 Identification of larger indels .......................................................................... 26 2.3.5 Further bioinformatic analysis of SNP/indel affected genes ........................... 27 2.3.6 Identification of crossovers and non-crossovers ............................................. 28 2.3.7 Meiosis simulation and statistical analysis ...................................................... 29 2.3.8 Data access ...................................................................................................... 29 2.4 Results ........................................................................................................................ 30 2.4.1 Sequencing of the Ler genome uncovered numerous SNPs with functional Implications ...................................................................................................... 30 2.4.2 Numerous small indels with similar distribution patterns to those of SNPs ... 33 2.4.3 Detection of large indels and CNVs ................................................................ 35 2.4.4 Generating and sequencing “tetrads” of meiotic progeny plants .................... 38 2.4.5 Single-base resolution analysis of COs and NCOs/GCs ................................. 38 2.4.6 Redistribution of genome variations after meiosis .......................................... 43 2.4.7 CNVs due to meiotic reshuffling of structural variants................................... 44 2.5 Discussion .................................................................................................................. 46 2.5.1 Genetic variation and phenotypic variation ..................................................... 46 2.5.2 Possible relationship between frequency of CO, genome size and length of synaptonemal complex ................................................................................. 47 vi 2.5.3 The low frequency of detected NCOs and possible short GC