Statistical Power for RNA-Seq Data to Detect Two Epigenetic Phenomena
Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By
Dao-Peng Chen, B.B.A., M.S.
Graduate Program in Statistics

The Ohio State University
2013

Dissertation Committee:
Prof. Shili Lin, Advisor
Prof. Dennis K. Pearl
Prof. Asuman Turkmen

© Copyright by Dao-Peng Chen 2013

Abstract

Epigenetics is the study of heritable changes in gene expression or cellular phenotype caused by mechanisms other than changes to the underlying DNA sequence. Two epigenetic phenomena, genomic imprinting and allelic expression imbalance (AEI), are discussed in this dissertation. Genomic imprinting is an epigenetically regulated process by which imprinted genes are expressed in a parent-of-origin-specific manner, and AEI refers to asymmetric expression of two different alleles at the same locus.

Many analysis tools have been used to investigate these two phenomena. Among these tools, RNA-seq is a powerful new technology for mapping and quantifying transcriptomes using ultra-high-throughput next-generation sequencing technologies. Using RNA-seq, a study can investigate genomic imprinting and AEI genome-wide without prior knowledge of genes or coding regions. Compared to microarray hybridization-based methods, RNA-seq does not have background noise due to hybridization. Data from RNA-seq experiments are digital, so they do not have a limited range of signal. Compared to traditional sequence-based methods, RNA-seq is more economical, which makes genome-wide mapping feasible. Nevertheless, RNA-seq has its own limitations, such as errors in base calling, uncertainty in mapping reads to the genome, and biases from transcript length and sequence base composition.

In this dissertation, we investigate how sequencing parameters may affect the power of tests for detecting imprinting and/or AEI, and whether the current technology can provide sufficient power for such an endeavor for mouse and human data. Since existing methods in the literature are not amenable to detecting such effects, and since these two effects may be confounded with one another, we also propose a joint test for simultaneous detection of imprinting and AEI. For mouse data, the reciprocal cross design and two definitions of informative reads based on the binomial distribution are used throughout this dissertation. The proposed joint test and the two-chi-squares test in the literature are used for power calculation and a simulation study, and their results are compared and contrasted. The results show that the joint test is not only applicable for simultaneous detection of the two epigenetic effects, but is also more powerful than the two-chi-squares test. Furthermore, we note that the formula for power calculation in terms of sequencing parameters (sequencing depth and read length) and sequence divergence is applicable to other tests, not just the joint test.

We provide theoretical power under some combinations of parameters. If an informative read is defined as covering at least one SNP for the mouse reciprocal cross design, sequencing depth E(T) = 40 is necessary to achieve sufficient power under read length l = 100 and sequence divergence d = 1%, especially when imprinting and AEI effects (p1, p2) are moderate. Fixing E(T) = 30 and d = 1%, the power does not improve much when l > 100. This suggests that increasing sequencing depth, not read length, is the key to improving power, although it can be expensive.
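As a rough illustration of how sequencing depth, read length and sequence divergence interact under the first definition of an informative read, the following Python sketch may help. It rests on a simplifying assumption that is not taken from the dissertation: each base in a read is a SNP independently with probability d, so a read of length l covers at least one SNP with probability q = 1 - (1 - d)^l, and the number of informative reads among a fixed number of reads is Binomial with success probability q. The function names are hypothetical, and the sketch is not the dissertation's power formula.

```python
from math import comb


def prob_read_informative(read_length: int, divergence: float) -> float:
    """P(read covers at least one SNP).

    ASSUMPTION (illustrative only): every base is a SNP independently
    with probability `divergence`.
    """
    return 1.0 - (1.0 - divergence) ** read_length


def informative_read_pmf(k: int, total_reads: int, q: float) -> float:
    """Binomial(total_reads, q) probability of exactly k informative reads."""
    return comb(total_reads, k) * q**k * (1.0 - q) ** (total_reads - k)


def expected_informative_reads(depth: float, read_length: int,
                               divergence: float) -> float:
    """Expected number of informative reads at a given expected depth E(T)."""
    return depth * prob_read_informative(read_length, divergence)


if __name__ == "__main__":
    d = 0.01  # sequence divergence of 1%, as in the abstract
    for depth, length in [(30, 100), (30, 200), (40, 100)]:
        q = prob_read_informative(length, d)
        mean = expected_informative_reads(depth, length, d)
        print(f"E(T)={depth}, l={length}: q={q:.3f}, "
              f"expected informative reads={mean:.1f}")
```

Under this assumed model, doubling l from 100 to 200 at d = 1% raises q from about 0.63 to about 0.87, and q can never exceed 1, so the expected number of informative reads is capped by the depth itself. That is one way to see why further increases in read length bring diminishing returns while additional depth does not.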
On the other hand, if an informative read is defined as covering a particular SNP, then E(T) of at least 130 and l of at least 250 are necessary to achieve sufficient power under d = 2%, even when imprinting and AEI effects are strong. Because increasing read length may lead to a higher base-calling error rate with currently available technology, we suggest increasing sequencing depth as a more reliable, albeit more expensive, alternative. Note that the two definitions of "an informative read" may be interpreted as corresponding to a genome-wide study and a candidate-gene-based study, respectively.

As for human data, we discuss which trio structures are informative, and we again use the joint test to detect imprinting and AEI. In the theoretical power calculation, in addition to the effects of sequencing parameters and sequence divergence, the number of families in a random sample of N trios is also considered. If an informative read is defined as covering at least one SNP, sequencing depth E(T) = 8 is necessary to achieve reasonable power under the setting of N = 50, l = 100 and d = 1%, especially when imprinting and AEI effects are moderate. Fixing E(T) = 4 and d = 1%, the power does not improve much when l > 100. As for the number of families N, under E(T) = 4, d = 1% and l = 100, even N = 20 leads to sufficient power for detecting strong imprinting and AEI effects (such as when one of the p_i's is 0.1). However, a larger sample such as N = 100 is necessary for more moderate effects. If an informative read is defined as covering a particular SNP, E(T) of at least 10 and l of at least 250 are necessary to achieve sufficient power under N = 100 and d = 2%, even when imprinting and/or AEI effects are strong. As for the effect of N, under E(T) = 10, d = 2% and l = 250, N = 200 is necessary to achieve sufficient power even for strong imprinting and AEI effects.

Dedicated to my parents, sister and fiancee.

Acknowledgments

I sincerely thank my adviser Dr.
Shili Lin for her guidance and patience through these years. This dissertation would never have been accomplished without her. I would like to thank Dr. Dennis K. Pearl and Dr. Asuman Turkmen for serving on my dissertation committee, and Dr. Hong Zhu for serving on my Ph.D. candidacy exam committee.

Vita

June 28, 1980: Born, Taipei, Taiwan
2002: B.B.A. Statistics, National Chengchi University, Taiwan
2004: M.S. Statistics, National Tsing Hua University, Taiwan
2008-present: Graduate Research/Teaching Associate, Department of Statistics, The Ohio State University

Fields of Study

Major Field: Statistics
Studies in:
RNA-seq data analysis, Prof. Shili Lin
Statistical Genetics, Prof. Shili Lin

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction
1.1 Epigenetics and RNA-seq
1.2 Genomic imprinting
1.3 Allelic expression imbalance
1.4 Connection among the previous two topics
1.5 Contribution and organization of this dissertation

2. Statistical power for detecting imprinting and AEI in mouse data
2.1 Experimental design for mouse
2.2 Distribution of number of informative reads
2.2.1 A read covers at least one SNP
2.2.2 A read covers a particular SNP
2.3 The joint test
2.3.1 Currently available tests for detecting only imprinting or AEI
2.3.2 Rationale for the joint test
2.3.3 Simulation study
2.3.4 Real data study
2.4 Theoretic power for joint test
2.4.1 A read covers at least one SNP
2.4.2 A read covers a particular SNP

3. Statistical power for detecting imprinting and AEI in human data
3.1 Informative trio structures
3.2 Joint test and parameter setting
3.3 Theoretic power for joint test
3.3.1 A read covers at least one SNP
3.3.2 A read covers a particular SNP

4. Summary and Discussion
4.1 Summary
4.2 Discussion and future extensions
4.2.1 Arguable assumptions in this dissertation
4.2.2 Discussion in human data
4.2.3 Allele specific methylation
4.2.4 Testing for imprinting, AEI and ASM sequentially
4.2.5 A confidence region for p1 and p2

Bibliography

List of Tables

2.1 2 x 2 table in a reciprocal cross design
2.2 Simulation setting under H1
2.3 Summary of power in simulated data
2.4 Imprinted genes in mouse brain
2.5 Imprinted genes in the non-brain tissues of the mouse
3.1 The six types of informative trio structures

List of Figures

2.1 Rejection regions (black) of the three tests when n1=30 and n2=40
2.2 Four groups according to the underlying values of p1 and p2
2.3 The nine subsets of H1
2.4 Counts of total SNPs under H0, and counts of rejection
2.5 Counts of total SNPs under H1.A, and counts of rejection
2.6 Counts of total SNPs under H1.B, and counts of rejection
2.7 Counts of total SNPs under H1.C1 and H1.C2, and counts of rejection
2.8 Counts of total SNPs under H1.C3, and counts of rejection
2.9 Counts of total SNPs under H1.C4, and counts of rejection
2.10 Testing result of a SNP (UCSC id: uc009kou.1 2) in gene Cd81
2.11 Power image plots for different E(T), where an informative read is defined as covering at least one SNP
2.12 Power curve plots for different E(T), where an informative read is defined as covering at least one SNP
2.13 Power image plots for different d, where an informative read is defined as covering at least one SNP