Statistical power for RNA-seq data to detect two epigenetic phenomena

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Dao-Peng Chen, B.B.A., M.S.

Graduate Program in Statistics

The Ohio State University

2013

Dissertation Committee:

Prof. Shili Lin, Advisor

Prof. Dennis K. Pearl

Prof. Asuman Turkmen

© Copyright by

Dao-Peng Chen

2013

Abstract

Epigenetics is the study of heritable changes in gene expression or cellular phenotype caused by mechanisms other than changes in the underlying DNA sequence. Two epigenetic phenomena, genomic imprinting and allelic expression imbalance (AEI), are discussed in this dissertation.

Genomic imprinting is an epigenetically regulated process by which imprinted genes are expressed in a parent-of-origin-specific manner, and AEI refers to asymmetric expression of two different alleles at the same locus.

Many analysis tools have been used to investigate these two phenomena. Among these tools, RNA-seq is a powerful new technology for mapping and quantifying transcriptomes using ultra-high-throughput next-generation sequencing technologies. Using RNA-seq, a study can investigate genomic imprinting and AEI genome-wide without prior knowledge of genes or coding regions. Compared to microarray hybridization-based methods, RNA-seq does not have background noise due to hybridization. Data from RNA-seq experiments are digital, so they do not have a limited range of signals. Compared to the traditional sequence-based methods, RNA-seq is more economical, which makes genome-wide mapping feasible. Nevertheless, RNA-seq has its own limitations, such as errors in base calling, read mapping uncertainty in the genome, and biases from transcript length and sequence base composition.

In this dissertation, we focus on investigating how sequencing parameters may affect the power of tests for detecting imprinting and/or AEI, and whether the current technology can provide sufficient power for such an endeavor for mouse and human data. Since existing methods in the literature are not amenable to detecting such effects and since these two effects may be confounded with one another, we also propose a joint test for simultaneous detection of imprinting and AEI. For mouse data, the reciprocal cross design and two definitions of informative reads based on the binomial distribution are used throughout this dissertation. The proposed joint test and the two-chi-squares test in the literature are used for power calculation and a simulation study, and their results are compared and contrasted. The results show that the joint test is not only applicable for simultaneous detection of the two epigenetic effects, but is also more powerful than the two-chi-squares test. Furthermore, we note that the formula for power calculation in terms of sequencing parameters (sequencing depth and read length) and sequence divergence is applicable to other tests, not just the joint test.

We provide theoretical power under some combinations of parameters. If an informative read is defined as covering at least one SNP for the mouse reciprocal cross design, sequencing depth E(T) = 40 is necessary to achieve sufficient power under read length l = 100 and sequence divergence d = 1%, especially when imprinting and AEI effects (p1, p2) are moderate. Fixing E(T) = 30 and d = 1%, the power does not improve much when l > 100. This suggests that increasing sequencing depth, not read length, is the key to improving power, although it can be expensive. On the other hand, if an informative read is defined as covering a particular SNP, then E(T) of at least 130 and l of at least 250 are necessary to achieve sufficient power under d = 2%, even when imprinting and AEI effects are strong. Because increasing read length may lead to a higher base-calling error rate with currently available technology, we suggest increasing sequencing depth as a more reliable, albeit more expensive, alternative. Note that the two definitions of “an informative read” may be interpreted as corresponding to a genome-wide study or a candidate-gene-based study, respectively.

As for human data, we discuss which trio structures are informative, and we again use the joint test to detect imprinting and AEI. In the theoretical power calculation, in addition to the effects of sequencing parameters and sequence divergence, the number of families in a random sample of N trios is also considered. If an informative read is defined as covering at least one SNP, sequencing depth E(T) = 8 is necessary to achieve reasonable power under the setting of N = 50, l = 100 and d = 1%, especially when imprinting and AEI effects are moderate. Fixing E(T) = 4 and d = 1%, the power does not improve much when l > 100. As for the number of families N, under E(T) = 4, d = 1% and l = 100, even N = 20 leads to sufficient power for detecting strong imprinting and AEI effects (for example, when one of the pi's is 0.1). However, a larger sample size such as N = 100 is necessary for more moderate effects. If an informative read is defined as covering a particular SNP, E(T) of at least 10 and l of at least 250 are necessary to achieve sufficient power under N = 100 and d = 2%, even when imprinting and/or AEI effects are strong. As for the effect of N, under E(T) = 10, d = 2% and l = 250, N = 200 is necessary to achieve sufficient power even for strong imprinting and AEI effects.

Dedicated to my parents, sister and fiancee.

Acknowledgments

I sincerely thank my advisor Dr. Shili Lin for her guidance and patience over these years. This dissertation would never have been accomplished without her.

I would like to thank Dr. Dennis K. Pearl and Dr. Asuman Turkmen for serving on my dissertation committee, and Dr. Hong Zhu for serving on my Ph.D. candidacy exam committee.

Vita

June 28, 1980 ...... Born - Taipei, Taiwan

2002 ...... B.B.A. Statistics, National Chengchi University, Taiwan

2004 ...... M.S. Statistics, National Tsing Hua University, Taiwan

2008-present ...... Graduate Research/Teaching Associate, Department of Statistics, The Ohio State University

Publications

Research Publications

Fields of Study

Major Field: Statistics

Studies in:

RNA-seq data analysis ...... Prof. Shili Lin

Statistical Genetics ...... Prof. Shili Lin

Table of Contents

Abstract

Dedication

Acknowledgments

Vita

List of Tables

List of Figures

1. Introduction
   1.1 Epigenetics and RNA-seq
   1.2 Genomic imprinting
   1.3 Allelic expression imbalance
   1.4 Connection among the previous two topics
   1.5 Contribution and organization of this dissertation

2. Statistical power for detecting imprinting and AEI in mouse data
   2.1 Experimental design for mouse
   2.2 Distribution of number of informative reads
       2.2.1 A read covers at least one SNP
       2.2.2 A read covers a particular SNP
   2.3 The joint test
       2.3.1 Currently available tests for detecting only imprinting or AEI
       2.3.2 Rationale for the joint test
       2.3.3 Simulation study
       2.3.4 Real data study
   2.4 Theoretic power for joint test
       2.4.1 A read covers at least one SNP
       2.4.2 A read covers a particular SNP

3. Statistical power for detecting imprinting and AEI in human data
   3.1 Informative trio structures
   3.2 Joint test and parameter setting
   3.3 Theoretic power for joint test
       3.3.1 A read covers at least one SNP
       3.3.2 A read covers a particular SNP

4. Summary and Discussion
   4.1 Summary
   4.2 Discussion and future extensions
       4.2.1 Arguable assumptions in this dissertation
       4.2.2 Discussion in human data
       4.2.3 Allele specific methylation
       4.2.4 Testing for imprinting, AEI and ASM sequentially
       4.2.5 A confidence region for p1 and p2

Bibliography

List of Tables

2.1 2 x 2 table in a reciprocal cross design

2.2 Simulation setting under H1

2.3 Summary of power in simulated data

2.4 Imprinted genes in mouse brain

2.5 Imprinted genes in the non-brain tissues of the mouse

3.1 The six types of informative trio structures

List of Figures

2.1 Rejection regions (black) of the three tests when n1=30 and n2=40

2.2 Four groups according to the underlying values of p1 and p2

2.3 The nine subsets of H1

2.4 Counts of total SNPs under H0, and counts of rejection

2.5 Counts of total SNPs under H1.A, and counts of rejection

2.6 Counts of total SNPs under H1.B, and counts of rejection

2.7 Counts of total SNPs under H1.C1 and H1.C2, and counts of rejection

2.8 Counts of total SNPs under H1.C3, and counts of rejection

2.9 Counts of total SNPs under H1.C4, and counts of rejection

2.10 Testing result of a SNP (UCSC id: uc009kou.1 2) in gene Cd81

2.11 Power image plots for different E(T), where an informative read is defined as covering at least one SNP

2.12 Power curve plots for different E(T), where an informative read is defined as covering at least one SNP

2.13 Power image plots for different d, where an informative read is defined as covering at least one SNP

2.14 Power curve plots for different d, where an informative read is defined as covering at least one SNP

2.15 Power image plots for different l, where an informative read is defined as covering at least one SNP

2.16 Power curve plots for different l, where an informative read is defined as covering at least one SNP

2.17 Power image plots for different E(T), where an informative read is defined as covering a particular SNP

2.18 Power curve plots for different E(T), where an informative read is defined as covering a particular SNP

2.19 Power image plots for different d, where an informative read is defined as covering a particular SNP

2.20 Power curve plots for different d, where an informative read is defined as covering a particular SNP

2.21 Power image plots for different l, where an informative read is defined as covering a particular SNP

2.22 Power curve plots for different l, where an informative read is defined as covering a particular SNP

3.1 Power image plots for different E(T), where N = 50 and an informative read is defined as covering at least one SNP

3.2 Power curve plots for different E(T), where N = 50 and an informative read is defined as covering at least one SNP

3.3 Power image plots for different d, where N = 50 and an informative read is defined as covering at least one SNP

3.4 Power curve plots for different d, where N = 50 and an informative read is defined as covering at least one SNP

3.5 Power image plots for different l, where N = 50 and an informative read is defined as covering at least one SNP

3.6 Power curve plots for different l, where N = 50 and an informative read is defined as covering at least one SNP

3.7 Power curve plots for different N, where an informative read is defined as covering at least one SNP

3.8 Power image plots for different E(T), where N = 100 and an informative read is defined as covering a particular SNP

3.9 Power curve plots for different E(T), where N = 100 and an informative read is defined as covering a particular SNP

3.10 Power image plots for different d, where N = 100 and an informative read is defined as covering a particular SNP

3.11 Power curve plots for different d, where N = 100 and an informative read is defined as covering a particular SNP

3.12 Power image plots for different l, where N = 100 and an informative read is defined as covering a particular SNP

3.13 Power curve plots for different l, where N = 100 and an informative read is defined as covering a particular SNP

3.14 Power curve plots for different N, where an informative read is defined as covering a particular SNP

4.1 Flowchart of testing imprinting, AEI and ASM sequentially

4.2 A 99% confidence region (white) for p1 and p2 of a SNP

Chapter 1: Introduction

1.1 Epigenetics and RNA-seq

Traditionally, epigenetics was used to describe phenomena that could not be understood by genetic principles (Goldberg et al., 2007). Nowadays, in general, epigenetics is the study of heritable changes in gene expression or cellular phenotype caused by mechanisms that do not change the underlying DNA sequence (Goldberg et al., 2007). For example, genomic imprinting is an epigenetic phenomenon that may cause genes to be expressed in a parent-of-origin-specific manner without altering the gene sequence (Ferguson-Smith, 2011), or may lead to allelic expression imbalance.

Weinhold (2006) states that scientists need more advanced high-throughput and analytical technologies to study epigenetics, such as chromatin immunoprecipitation microarray analysis (ChIP-chip) and ChIP-sequencing. Among these tools, RNA-seq is a recently developed approach for transcriptome profiling that uses next-generation sequencing (NGS) technologies. This method can help investigate the transcription process of genes and splicing patterns (Wang et al., 2009). Using deep sequencing, gene expression levels of all transcripts can be quantified digitally: the expression of a gene can be estimated using the total number of reads mapped to that gene. In this dissertation, we focus on how to construct a statistical test to detect two epigenetic phenomena and, equally importantly, show how sequencing parameters affect the power of such a test.

Compared to traditional biotechnologies, RNA-seq has some advantages. In contrast to traditional microarray hybridization-based approaches, RNA-seq does not share their limitations, including the requirement of prior knowledge about genes or coding regions and background noise owing to cross-hybridization. RNA-seq experimental data are digital, and therefore they avoid some problems of the analog signals of the microarray approach, such as a limited dynamic range of detection due to background and saturation of signals (Hurd and Nelson, 2009).

In contrast to some traditional sequence-based approaches, RNA-seq also shows some competitive advantages. The traditional Sanger sequencing of cDNA is relatively expensive, low throughput, and generally qualitative. Tag-based methods, such as serial analysis of gene expression (Velculescu et al., 1995), were designed to overcome these limitations. Generally, tag-based sequencing methods produce digital gene expression results and are high throughput. However, most are still expensive since they are based on Sanger sequencing, and a portion of the tags cannot be mapped uniquely to the reference genome (Wang et al., 2009). These drawbacks limit the use of traditional sequencing technology to study epigenetics.

To sum up, RNA-seq provides a comprehensive investigation of genomes without prior knowledge about genes or coding regions. The digital data format makes comparison between datasets simpler, and permits an unlimited quantitative range compared to analog signals. Finally, RNA-seq is more economical, which makes genome-wide mapping of multiple features feasible (Hurd and Nelson, 2009). These features make RNA-seq a better tool to study epigenetics.

Nevertheless, RNA-seq data have their own limitations. Read mapping uncertainty may confound results from RNA-seq experiments. Read mapping uncertainty across different isoforms (Trapnell et al., 2010) and different genes (Li et al., 2010) has been addressed in a rigorous statistical framework by recent methods. In addition to mapping uncertainty, RNA-seq data have some potential biases. Oshlack and Wakefield (2009) find transcript length biases in RNA-seq data, since longer transcripts tend to have more read counts. Sequence base composition is another source of bias; for example, Dohm et al. (2008) demonstrate a strong relationship between read counts and the GC content along the genome from Solexa DNA sequencing experiments. Hansen et al. (2010) show non-uniform patterns in the distribution of reads along bases.

In the following sections 1.2 and 1.3, we briefly introduce two epigenetic topics: genomic imprinting and allelic expression imbalance. In section 1.4, the connection between these two topics is discussed, and section 1.5 provides a brief description of the contribution of this dissertation.

1.2 Genomic imprinting

Genomic imprinting is an epigenetically regulated process by which imprinted genes are expressed in a parent-of-origin-specific manner (Ferguson-Smith, 2011). This phenomenon results in unequal expression of the same allele from different parental origins. For imprinted genes, maternally and paternally inherited alleles with identical DNA sequences function differently in the offspring. The gene is completely imprinted if the less expressed allele is totally deactivated or silenced; otherwise, the gene is partially imprinted. Moreover, paternal (maternal) imprinting means that the allele inherited from the father (mother) is expressed less than the allele inherited from the mother (father).

Morison et al. (2001) construct an imprinted-gene database, which contains 61 records for human genes so far. Currently, approximately 100 imprinted genes have been reported in mammals (Bartolomei and Ferguson-Smith, 2011). However, scientists believe that approximately 0.1% to 1% of all mammalian genes are imprinted (Pfeifer, 2000).

In the literature, there have been some studies on detecting genomic imprinting in different types of data. For qualitative trait loci data, tests like the parental-asymmetry test (PAT) are simple and powerful for detecting parent-of-origin effects using human pedigree data when there is no maternal effect (Weinberg, 1999; Zhou et al., 2009). For quantitative trait loci data, there are several methods for detecting parent-of-origin effects. Allele-sharing methods, such as variance-components approaches and Haseman-Elston regression, have been extended to test for parent-of-origin effects (Hanson et al., 2001; Shete et al., 2003). These approaches assume the normality of traits, and also require sampling siblings or extended pedigrees in the analysis. He et al. (2011) propose several PAT-type tests for detecting parent-of-origin effects in complete and incomplete nuclear families, with no distributional restriction on the quantitative traits. Feng et al. (2011) use a maximum likelihood test to evaluate the parent-of-origin effects of SNPs on quantitative phenotypes in general family studies. Their method incorporates haplotype distribution to take advantage of inter-marker linkage disequilibrium information in genome-wide association studies.

Studies using RNA-seq data to investigate genomic imprinting in different organisms have also been reported. For mice, Wang et al. (2008) perform quantitative assessments of transcripts from reciprocal crosses of two strains. Babak et al. (2008) and Gregg et al. (2010) use similar experimental designs. These designs of two reciprocal crosses have not only verified previously identified imprinted genes, but have also found some new imprinted genes. Nevertheless, many identified putative imprinted genes have not been validated by other studies (Ferguson-Smith, 2011). The reciprocal cross design for mice can be used in other organisms. For plants, Gehring et al. (2011) perform RNA-seq on embryo and endosperm derived from reciprocal crosses between two Arabidopsis thaliana accessions, and identify more than 200 loci that exhibit parent-of-origin effects on gene expression. As for humans, a reciprocal cross design is impossible, but methods for quantitative traits can be applied to RNA-seq data by regarding read counts as quantitative traits.

1.3 Allelic expression imbalance

Allelic expression imbalance (AEI) refers to asymmetric expression of two different alleles at the same locus, and it provides direct evidence of cis- or trans-acting differences. When an expression quantitative trait locus (eQTL) affecting a gene’s transcription maps close to the affected gene, it can be classified as cis-acting, while an eQTL that maps further away on the same chromosome, or to another chromosome, can be classified as trans-acting (Fontanillas et al., 2010). Several studies have used expression arrays to measure mRNA levels and coupled this with genome-wide SNP analyses (Sadee, 2009). mRNA levels can then serve as quantitative phenotypes, and associations can be found with eQTLs that are either cis- or trans-acting. Ge et al. (2009) state that eQTL mapping and expression arrays give information about cis- and trans-acting variants, and this can be compared with information from cis-eQTL mapping and allelic expression measurements to determine which variants are cis-acting. By directly modeling the total number of reads mapped to genes using discrete distributions, Sun (2012) provides a statistical framework for eQTL mapping using RNA-seq data. Moreover, by combining the information from total read count and allele-specific expression, Sun (2012) can computationally distinguish cis- and trans-eQTL and further improve the power of cis-eQTL mapping.

In the literature, some microarray hybridization-based approaches have been used to detect AEI (Cheung et al., 2005; Kwan et al., 2008; Daelemans et al., 2010). For example, Mei et al. (2000) use a SNP array design to detect loss of heterozygosity, a common form of allelic imbalance. A gene has AEI when it has a heterozygous SNP site with a significantly higher proportion of expression from one allele than from the other. For RNA-seq data, several high-throughput AEI assays based on polymerase chain reaction (PCR) and NGS technology have also been described recently (Zhang et al., 2009). For example, Xu et al. (2011) describe the use of an efficient PCR/next-generation DNA sequencing-based assay to analyze allele-specific differences in mRNA expression for candidate neuropsychiatric disorder genes in human brain. Fontanillas et al. (2010) provide a binomial test to detect AEI in RNA-seq experiments, and discuss the relationship between sequencing parameters and statistical power for detecting AEI.

1.4 Connection among the previous two topics

From the previous sections, the similarity between genomic imprinting and AEI can be observed: both present different patterns of expression for the alleles of a gene. The difference is as follows. For a gene that is imprinted but has no AEI, its mRNA expression depends only on the parental origin, irrespective of the allelic types. On the other hand, for a gene that is not imprinted but has AEI, the expression depends only on the types of alleles, regardless of whether these alleles come from the mother or the father. However, the expression will depend on both parental origin and allelic types if a gene is imprinted and has AEI. For example, let D and d be two alleles of a genetic locus that regulates the expression of a gene. Suppose the gene is not imprinted but has AEI, and that the allele d is expressed normally whereas the allele D has reduced expression. Then we will see that the allele D is expressed less than the allele d, regardless of their parental origins. On the other hand, suppose the gene is imprinted but has no AEI, and that D is maternally imprinted. Then we will see that the allele D is expressed less only when D comes from the mother. If D comes from the father, it is still expressed normally.

As we detail in later chapters, the effects of imprinting and AEI can be confounded under certain experimental protocols. For example, if an experiment to detect AEI is not designed correctly (e.g., without reciprocal crosses), then false positives (due to imprinting) may result.

Besides the similarity, some causal relationships can be found among these epigenetic phenomena. For example, in both plants and mammals, two major mechanisms, DNA methylation and allele-specific histone modifications, are involved in genomic imprinting (Ferguson-Smith, 2011). Li et al. (1993) show that a moderate level of DNA methylation is necessary to control differential expression of the paternal and maternal alleles of imprinted genes. These studies indicate that genomic imprinting is a consequence of DNA methylation. Similarly, DNA methylation causes AEI if one specific allele is methylated and the other is not (Reynard et al., 2011). We can therefore divide these epigenetic phenomena into two groups: epigenetic causes and effects. DNA methylation and histone modifications belong to the causes, while genomic imprinting and AEI are the effects.

1.5 Contribution and organization of this dissertation

Currently there are several statistical tests for detecting some epigenetic phenomena, but these tests do not consider the setting and key parameters of an RNA-seq experiment. Fontanillas et al. (2010) use mathematical modeling and computer simulations to identify four key parameters (introduced in section 2.2) affecting measurements of allelic expression and the detection of AEI with high-throughput sequencing. Hence the questions are: 1) How should RNA-seq experiments be designed to detect these two phenomena in mouse and human? 2) What are suitable statistical test procedures to detect these phenomena? 3) How do these key parameters affect the power of statistical tests?

In this dissertation, experimental designs and one statistical test are developed for joint detection of genomic imprinting and AEI in RNA-seq experiments for mouse and human. Statistical tests are provided for these phenomena, and several key parameters of the RNA-seq experiment are incorporated to evaluate how sequencing parameters affect the power of a reasonable test, and whether the current technology is sufficiently advanced for detecting imprinting and AEI genome-wide. Answers to such questions can help scientists design experiments, perform statistical tests, and choose appropriate sequencing parameters. For example, under a fixed cost, one can determine whether longer read length or larger sequencing depth will achieve better power.

This dissertation is organized as follows. In chapter 2, an experimental design for mouse is discussed, and a new test is provided to detect imprinting and AEI simultaneously. We then compare our test with current methods by simulation and in a real data study. Furthermore, we discuss a formula for power calculations in terms of sequencing parameters, and provide theoretical power calculations under some combinations of parameters.

In chapter 3, we discuss which types of data are informative for detecting imprinting in human data. In the theoretical power calculation, in addition to the parameters of the RNA-seq experiment, the number of families in a sample is also considered.

In chapter 4, the proposed methods are summarized, and future research directions are discussed.

Chapter 2: Statistical power for detecting imprinting and AEI in mouse data

2.1 Experimental design for mouse

One important advantage of RNA-seq is its ability to measure allele-specific expression: we can determine how many reads are mapped to a single nucleotide polymorphism (SNP) site. Using this idea to study genomic imprinting in mice, Wang et al. (2008) performed RNA-seq experiments on reciprocal cross progeny of the AKR/J and PWD/PhJ mouse strains. Total RNA was extracted from postnatal day 2 F1 female mouse whole brains, where F1 means that the parents are from pure strains. Sequence data were obtained from the PWD x AKR cross (listing maternal strain first) and from the AKR x PWD cross. To investigate candidate imprinted genes, Wang et al. (2008) identified high quality reads that contain SNPs for the two respective reciprocal crosses, where in this chapter SNPs mean all sequence variations between the two mouse strains. Since the parents are from pure strains, for a specific heterozygous SNP, the origin of the two different alleles can be clearly identified, and the relative expression level of the two parental alleles can be quantified from the counts of the AKR and PWD SNP alleles in the read data.

We first introduce some notation. The two mouse strains are denoted by S1 and S2. For a specific heterozygous SNP, we have a 2 x 2 table (table 2.1), where p1 is the expected percentage of counts of the S1 allele in a S2 x S1 cross (listing maternal strain first), and p2 is the expected percentage of counts of the S1 allele in a S1 x S2 cross. Further, n1 = a + b and n2 = c + d are the total numbers of reads that contain this SNP for the two crosses, respectively. We also use ni, i = 1, 2, to denote the number of informative reads. Here, p1 and p2 are the parameters of interest. Hypotheses about these two parameters constitute imprinting and/or AEI effects, which will be tested based on observed data from the reciprocal crosses.

Table 2.1: 2 x 2 table in a reciprocal cross design

cross      S1 allele counts   S2 allele counts   proportion
S2 x S1    a                  b                  p̂1 = a/(a + b)
S1 x S2    c                  d                  p̂2 = c/(c + d)
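As a small illustration of the notation in table 2.1, the following Python sketch computes n1, n2 and the sample proportions p̂1 and p̂2 from the four allele counts. The function name and the example counts are hypothetical and are not taken from the data analyzed in this dissertation.

```python
def allele_proportions(a, b, c, d):
    """Return (n1, n2, p1_hat, p2_hat) for the 2 x 2 reciprocal cross table.

    a, b: S1 and S2 allele counts in the S2 x S1 cross (maternal strain first)
    c, d: S1 and S2 allele counts in the S1 x S2 cross
    Note: d here is a read count, not the sequence divergence.
    """
    n1 = a + b
    n2 = c + d
    if n1 == 0 or n2 == 0:
        raise ValueError("each cross needs at least one informative read")
    return n1, n2, a / n1, c / n2


# Hypothetical counts, for illustration only.
n1, n2, p1_hat, p2_hat = allele_proportions(a=12, b=18, c=26, d=14)
print(n1, n2, p1_hat, p2_hat)
```

The two proportions estimate p1 and p2, respectively; the tests discussed later operate on these quantities together with n1 and n2.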

In section 2.2, the distribution of ni is discussed. In section 2.3, we provide a test for detecting genomic imprinting and AEI simultaneously. The theoretical power formula and calculations under different sequencing parameters are given in section 2.4.

2.2 Distribution of number of informative reads

Based on a 2 x 2 table as in table 2.1, we can set up a test for detecting genomic imprinting and AEI simultaneously, where the parameters are pi (expected percentage of counts from one allele) and ni (number of informative reads), with i = 1, 2, n1 = a + b, and n2 = c + d. Among these parameters, n1 and n2 are not parameters of interest, but they are key in considering the power of a test. If the distribution of ni is available, the power can be averaged over ni by the law of total probability. In this section, we define informative reads from two perspectives, and two kinds of binomial distributions for ni arise from these two definitions.
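The averaging step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the power formula of this dissertation: it assumes n1 and n2 are independent binomial(t, q) variables with a common number of mapped reads t and a common probability q that a read is informative, and `cond_power` is a hypothetical placeholder for the power of any test at fixed (n1, n2).

```python
from math import comb


def binom_pmf(n, t, q):
    """P(N = n) for N ~ binomial(t, q)."""
    return comb(t, n) * q**n * (1 - q) ** (t - n)


def average_power(cond_power, t, q):
    """Average cond_power(n1, n2) over n1, n2 ~ binomial(t, q).

    Assumes n1 and n2 are independent (an assumption of this sketch).
    cond_power is a placeholder for the power of a test conditional
    on the numbers of informative reads in the two crosses.
    """
    pmf = [binom_pmf(n, t, q) for n in range(t + 1)]
    return sum(
        pmf[n1] * pmf[n2] * cond_power(n1, n2)
        for n1 in range(t + 1)
        for n2 in range(t + 1)
    )
```

By the law of total probability, the returned value is the unconditional power, that is, the sum over (n1, n2) of the conditional power weighted by the probability of observing that pair of read counts.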

Besides ni and pi, there are other important parameters in RNA-seq experiments. Through mathematical modeling and computer simulation, Fontanillas et al. (2010) identify some critical parameters affecting measurements of allelic expression and the detection of AEI with high-throughput sequencing. They show that the statistical power of their method depends on four crucial parameters: 1) the relative transcript abundance (p1 and p2), 2) sequence divergence between alleles (denoted by d), 3) the read length (denoted by l), and 4) sequencing depth (i.e. average number of reads per gene, denoted by a particular value t for a specific gene). The first two parameters affect the proportion of reads per gene that are informative for allelic expression (i.e. contain one or more SNPs that allow reads to be uniquely assigned to an allele). The last two parameters affect the number of reads mapping to each gene. Based on these parameters, we discuss how to construct two binomial distributions for ni in the following two subsections.

2.2.1 A read covers at least one SNP

We define sequence divergence (d) as the probability of observing a SNP at each nucleotide site. The distribution of ni is affected by d and read length (l). For a fixed l, a larger d leads to a greater probability of including at least one SNP in a read. For a fixed d, a longer l increases the probability of sequencing a SNP site. By assuming that each nucleotide can be a SNP independently of other nucleotides and that d is constant across the genome, Fontanillas et al. (2010) construct a binomial variable as follows.

Let Y be the number of SNPs covered by a read with length l and divergence d, and assume Y follows a binomial(l, d). Hence, the probability of sampling at least y SNPs in a read with length l and divergence d is:

\[ P(Y \ge y \mid d, l) = \sum_{k=y}^{l} \binom{l}{k} d^{k} (1 - d)^{l-k}, \quad y = 1, \cdots, l. \]

If only one SNP is required to uniquely map reads, we get:

\[ P(Y \ge 1 \mid d, l) = 1 - (1 - d)^{l}. \]

If we define an informative read as a read in which at least one SNP is observed, the probability that a read is informative is 1 − (1 − d)^l. Suppose there are t reads mapped to a heterozygous SNP in a gene; then the number of informative reads ni follows binomial(t, 1 − (1 − d)^l).
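The first definition can be illustrated with a short calculation. The following is a minimal sketch, not the code used in the dissertation; the function names are hypothetical, and the values d = 1%, l = 100 and t = 30 are chosen only as an example.

```python
from math import comb


def prob_informative(d, l):
    """P(Y >= 1 | d, l) = 1 - (1 - d)^l for Y ~ binomial(l, d)."""
    return 1.0 - (1.0 - d) ** l


def binom_pmf(k, n, p):
    """Probability that a binomial(n, p) variable equals k."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def informative_read_pmf(t, d, l):
    """pmf of n_i ~ binomial(t, 1 - (1 - d)^l), for n_i = 0, ..., t."""
    q = prob_informative(d, l)
    return [binom_pmf(n, t, q) for n in range(t + 1)]


# Example values only: d = 1%, l = 100 bp, t = 30 reads mapped to the gene.
q = prob_informative(d=0.01, l=100)
pmf = informative_read_pmf(t=30, d=0.01, l=100)
expected_n = sum(n * w for n, w in enumerate(pmf))
print(f"P(informative) = {q:.3f}, E(n_i) = {expected_n:.2f}")
```

With these example values, roughly 63% of the reads are informative, so about 19 of the 30 reads are expected to be informative.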

2.2.2 A read covers a particular SNP

The previous subsection considers a read covering at least one SNP. If scientists

plan to study a particular SNP, we can consider another definition for informative

reads: a read is informative if it covers a particular SNP. Assume that the gene containing a particular SNP has length G; then the expected number of SNPs in this gene is Gd. Let Y be the number of SNPs covered by a read with length l, where Y follows a binomial(l, d). Let A denote the event that a particular SNP in a gene with length G is covered by a read with length l, and let s = ⌈Gd⌉, where ⌈x⌉ returns the smallest integer not less than x. If s ≤ l,

\[
\begin{aligned}
P(A \mid G, l, d) &= \sum_{y=0}^{l} P(A \mid G, l, d, Y = y) P(Y = y) \\
&= \sum_{y=0}^{s-1} P(A \mid G, l, d, Y = y) P(Y = y) + \sum_{y=s}^{l} P(A \mid G, l, d, Y = y) P(Y = y).
\end{aligned}
\]

In the first term, we observe y SNPs in this read while s SNPs are expected in this gene, so P(A|G, l, d, Y = y) can be approximated by y/s, the ratio of the number of SNPs in the read to that in the gene. In the second term, since the number of SNPs in the read is at least the expected number of SNPs in the gene, P(A|G, l, d, Y = y) is close to 1. Therefore we get

\[ P(A \mid G, l, d) \approx \sum_{y=0}^{s-1} \frac{y}{s} P(Y = y) + \sum_{y=s}^{l} P(Y = y). \]

When s > l, \( P(A \mid G, l, d) = \sum_{y=0}^{l} \frac{y}{l} P(Y = y) \).

This formula depends on G, which is not a parameter of interest, so P(A|G, l, d)

can be averaged over G. The database of the Jackson Laboratory (http://www.informatics.jax.org/genes.shtml) provides mouse gene lengths, and 22,372 coding genes are used. Let m = 22,372, and let {g1, g2, ··· , gm} be the gene lengths. If we assume that every gene has the same chance to be selected in a study, then

P (G = gj) = 1/m, j = 1, ··· , m.

Hence, ni follows a binomial(t, P (A|d, l)), where

\[ P(A \mid d, l) = \frac{1}{m} \sum_{j=1}^{m} P(A \mid g_j, l, d). \]

If we compare the two probabilities that a read is informative, under the same d and l, P(A|d, l) is much less than 1 − (1 − d)^l. This is reasonable because in

this subsection a stricter definition of an informative read is used. Since there are

fewer informative reads under the second definition, we expect smaller power. Hence,

if we plan to have the same power under these two scenarios, longer read length

and/or greater sequencing depth are necessary under the second scenario, which would

certainly be more costly. Note that the two definitions for informative reads may be

interpreted as a genome-wide or a candidate gene based study, respectively.
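The second definition can be sketched numerically as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the five gene lengths are made up (the dissertation averages over 22,372 mouse coding genes), and the branch for s > l uses the weight y/l, which is how that formula reads in the text.

```python
from math import ceil, comb


def binom_pmf(k, n, p):
    """Probability that a binomial(n, p) variable equals k."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def prob_cover_snp(G, l, d):
    """Approximate P(A | G, l, d), with s = ceil(G * d) SNPs expected in the gene."""
    # round() guards against floating-point noise such as 7.000000000000001.
    s = ceil(round(G * d, 9))
    pmf = [binom_pmf(y, l, d) for y in range(l + 1)]
    if s <= l:
        # y < s: the read holds a fraction y/s of the gene's SNPs.
        # y >= s: the conditional probability is taken to be 1.
        return (sum((y / s) * pmf[y] for y in range(s))
                + sum(pmf[y] for y in range(s, l + 1)))
    # s > l: weight y/l, following the formula as read from the text (assumption).
    return sum((y / l) * pmf[y] for y in range(l + 1))


def prob_informative_particular(gene_lengths, l, d):
    """Average P(A | g_j, l, d) over genes, each selected with probability 1/m."""
    return sum(prob_cover_snp(g, l, d) for g in gene_lengths) / len(gene_lengths)


# Hypothetical gene lengths in bp, for illustration only.
gene_lengths = [50, 1500, 8000, 25000, 120000]
p_particular = prob_informative_particular(gene_lengths, l=100, d=0.01)
p_at_least_one = 1.0 - (1.0 - 0.01) ** 100
print(p_particular < p_at_least_one)
```

When s = 1 the first sum vanishes and the probability reduces to 1 − (1 − d)^l, so the two definitions coincide for very short genes; for longer genes the second definition gives a much smaller probability.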

2.3 The joint test

2.3.1 Currently available tests for detecting only imprinting or AEI

For the reciprocal cross design in mouse data, there are two statistical tests for detecting genomic imprinting in the literature. One of them is based on the comparison of two independent binomial proportions, for which there exist a number of methods, including Fisher’s exact test, the exact unconditional test of Suissa and Shuster (1985), Liddell’s exact test (Liddell, 1978), the Storer-Kim test (Storer and Kim, 1990), the chi-square statistic with the Pirie and Hamdan (1972) continuity correction, and the chi-square statistic with the Yates (1934) continuity correction. Since the Storer-Kim test was shown to be generally powerful and computationally easy, with a true size that rarely exceeds the nominal size (Storer and Kim, 1990), it was adopted by Wang et al. (2008) to compare p1 with p2 in their mouse reciprocal cross design.

For a gene with a specific SNP, we can test whether this gene is imprinted by using data in a 2 x 2 table (table 2.1). Recall that p1 and p2 are the parameters of interest.

If this gene is not expressed in a parent-of-origin-specific manner, p1 should equal p2. Therefore, the hypotheses are:

H0 : p1 = p2 vs. H1 : p1 ≠ p2.

Let X and Y be the S1 allele counts from the two crosses, and let x and y be the realizations of X and Y , respectively. Then X follows a binomial(n1, p1) and Y follows a binomial(n2, p2).

Next, we can define the calculated p-value as

\[ T = \sum_{i=0}^{n_1} \sum_{j=0}^{n_2} b(i, n_1, \hat{p})\, b(j, n_2, \hat{p})\, I\big(|Z(i, j, n_1, n_2)| \ge |Z(x, y, n_1, n_2)|\big), \]

where b(i, n1, p̂) is the probability mass function of binomial(n1, p̂), p̂ = (x + y)/(n1 + n2), I(·) is the indicator function, and

\[ Z(i, j, n_1, n_2) = \frac{\frac{i}{n_1} - \frac{j}{n_2}}{\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right) \hat{p} (1 - \hat{p})}}. \]

H0 is rejected if T ≤ α. Given n1, n2, p1, p2 and α, the rejection region is

\[ \left| \frac{X}{n_1} - \frac{Y}{n_2} \right| \ge f(n_1, n_2, p_1, p_2, \alpha), \]

where f is a function of n1, n2, p1, p2 and α. The left graph in figure 2.1 shows the rejection region of the Storer-Kim test, which is located in the top left and lower right triangles. That is, a SNP is significantly imprinted if |X/n1 − Y/n2| is larger than a constant.
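The p-value T can be computed by direct enumeration. The sketch below follows the definition as written in this section, with p̂ = (x + y)/(n1 + n2) fixed at its observed value; the function names are hypothetical. Because Z(i, j, n1, n2) and Z(x, y, n1, n2) then share the same denominator, the indicator reduces to a comparison of |i/n1 − j/n2| with |x/n1 − y/n2|, which is done here in exact integer arithmetic.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability that a binomial(n, p) variable equals k."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def storer_kim_pvalue(x, n1, y, n2):
    """T = sum of b(i, n1, p_hat) b(j, n2, p_hat) over pairs at least as extreme as (x, y)."""
    p_hat = (x + y) / (n1 + n2)
    # |i/n1 - j/n2| >= |x/n1 - y/n2|  <=>  |i*n2 - j*n1| >= |x*n2 - y*n1|
    observed = abs(x * n2 - y * n1)
    total = 0.0
    for i in range(n1 + 1):
        b_i = binom_pmf(i, n1, p_hat)
        for j in range(n2 + 1):
            if abs(i * n2 - j * n1) >= observed:
                total += b_i * binom_pmf(j, n2, p_hat)
    return total


# Example with n1 = 30 and n2 = 40, the sample sizes used in figure 2.1.
print(storer_kim_pvalue(15, 30, 20, 40))  # equal observed proportions
print(storer_kim_pvalue(28, 30, 5, 40))   # very different observed proportions
```

When the two observed proportions are equal, every pair (i, j) is at least as extreme as the observation, so T equals 1 up to rounding.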

Figure 2.1: Rejection regions (black) of the three tests when n1=30 and n2=40

Although not discussed in Wang et al. (2008), this test ignores the fact that imprinting may be confounded with AEI. Therefore, rejecting the null hypothesis does not necessarily mean that there is only imprinting, since AEI may exist as well if neither p1 nor p2 equals 0.5.

The second test for detecting genomic imprinting is provided by Gregg et al.

(2010), and in this dissertation we call it the two-chi-squares test, which uses two

separate chi-square tests to test whether p1 and p2 are both equal to 0.5. The two

sets of hypotheses are:

H01 : p1 = 0.5 vs. H11 : p1 ≠ 0.5;

H02 : p2 = 0.5 vs. H12 : p2 ≠ 0.5.

H01 is rejected if x < k1 or x > n1 − k1, where k1 is the largest integer k satisfying

\[ \sum_{x=0}^{k} P(X = x \mid p_1 = 0.5) + \sum_{x=n_1-k}^{n_1} P(X = x \mid p_1 = 0.5) \le \alpha_1. \]

Similarly, H02 is rejected if y < k2 or y > n2 − k2, and k2 is chosen in the same way.

A gene is concluded to be imprinted if both null hypotheses are rejected for at least one SNP, and this SNP has one of the p̂i’s > 0.5 and the other p̂i < 0.5. The middle graph in figure 2.1 shows the rejection region of the two-chi-squares test, which is located in the top left and lower right rectangles. The conclusion for imprinting can be problematic for this procedure as well, again due to confounding with AEI; p1 ≠ p2 with both less than 0.5 can be the consequence of both imprinting and AEI, and this case will lead to a false negative result.

To detect AEI, Fontanillas et al. (2010) use a binomial test to test whether p1 is equal to 0.5. Note that this test uses only one of the crosses, not both; that is, it is not based on a reciprocal cross design, and thus both false positives and false negatives may occur due to the imprinting effect.

2.3.2 Rationale for the joint test

As discussed above, the existing methods run into difficulties when both imprinting and AEI occur. Here we propose a new test, called the joint test, to detect imprinting and AEI simultaneously. The hypotheses of the joint test are:

H0 : p1 = 0.5 and p2 = 0.5 (No imprinting and No AEI)

H1 : p1 ≠ 0.5 or p2 ≠ 0.5 (Imprinting or AEI)

H0 is rejected if one of the p̂i’s is far from 0.5. To determine the rejection region, let X and Y be the S1 allele counts from the two reciprocal crosses, respectively. Then X follows a binomial(n1, p1) and Y follows a binomial(n2, p2), and X and Y are independent since these two crosses are performed separately using different mice. Hence, H0 is rejected if X (Y) is far from the center n1/2 (n2/2).

Let Pxy be the probability that X = x and Y = y under H0:

\[ P_{xy} = \binom{n_1}{x} 0.5^{n_1} \binom{n_2}{y} 0.5^{n_2}, \]

so Pxy is larger if x (y) is closer to n1/2 (n2/2). To find the rejection region, let P(k) be the sorted Pxy’s from smallest to largest, k = 1, 2, ··· , (n1 + 1)(n2 + 1). Given a significance level α, let s be the largest integer satisfying \( \sum_{k=1}^{s} P_{(k)} \le \alpha \). Therefore, the rejection region RR is the collection of (x, y) corresponding to k = 1, 2, ··· , s. The power of the joint test is then

\[ P(\text{Reject } H_0 \mid p_1 \ne 0.5, p_2 \ne 0.5, n_1, n_2) = \sum_{(x,y) \in RR} \binom{n_1}{x} p_1^{x} (1 - p_1)^{n_1 - x} \binom{n_2}{y} p_2^{y} (1 - p_2)^{n_2 - y}. \]

Note that due to the discrete nature of the rejection region, the joint test is a conservative one and the type I error rate may be less than the nominal value α.
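The construction of the rejection region and the power calculation can be sketched as follows. This is a minimal illustration rather than the implementation used in the dissertation: the function names are hypothetical, α = 0.05 is an example value, and cells with equal null probability are ordered by (x, y), an arbitrary tie-breaking rule that the text does not specify. The null probabilities are handled as integer weights C(n1, x)C(n2, y), which equal Pxy multiplied by 2^(n1+n2), so that sorting and accumulation are exact.

```python
from fractions import Fraction
from math import comb


def binom_pmf(k, n, p):
    """Probability that a binomial(n, p) variable equals k."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def joint_rejection_region(n1, n2, alpha):
    """Cells (x, y) with the smallest null probabilities whose total stays <= alpha."""
    limit = Fraction(str(alpha)) * 2 ** (n1 + n2)
    # Sort by integer weight; ties are broken by (x, y), an arbitrary choice.
    cells = sorted((comb(n1, x) * comb(n2, y), x, y)
                   for x in range(n1 + 1) for y in range(n2 + 1))
    region, cumulative = set(), 0
    for weight, x, y in cells:
        if cumulative + weight > limit:
            break
        cumulative += weight
        region.add((x, y))
    return region


def joint_power(region, n1, n2, p1, p2):
    """P(reject H0 | p1, p2, n1, n2): sum of the binomial products over the region."""
    return sum(binom_pmf(x, n1, p1) * binom_pmf(y, n2, p2) for x, y in region)


# Example with n1 = 30 and n2 = 40, the sample sizes used in figure 2.1.
region = joint_rejection_region(30, 40, alpha=0.05)
size = joint_power(region, 30, 40, 0.5, 0.5)
print(f"attained type I error rate: {size:.4f}")
print(f"power at p1 = p2 = 0.3:     {joint_power(region, 30, 40, 0.3, 0.3):.4f}")
```

Evaluating `joint_power` at p1 = p2 = 0.5 gives the attained type I error rate, which by construction cannot exceed α.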

The right graph of figure 2.1 shows the rejection region of the joint test. The rejection region is the area outside of the central ellipse, signifying that H0 is rejected if X or Y is far from the center.

For the joint test, rejecting H0 means that a SNP is imprinted or has AEI. In order to further discuss whether a SNP is imprinted and whether it has AEI, all SNPs can be divided into four groups, as in figure 2.2, according to the underlying values of p1 and p2:

H0: p1 = 0.5 and p2 = 0.5. No imprinting and no AEI.

H1.A: p1 = p2 and both are not equal to 0.5. No imprinting but AEI.

H1.B: One of pi = 0.5 and the other is not. Imprinting but no AEI.

H1.C: p1 ≠ p2, p1 ≠ 0.5 and p2 ≠ 0.5. Imprinting and AEI.

Figure 2.2: Four groups according to the underlying values of p1 and p2

In figure 2.2, the two axes are p1 and p2 from 0.1 to 0.9. Figure 2.3 can be divided into 9 × 9 = 81 squares, which have centers (i ∗ 0.1, j ∗ 0.1), i, j = 1, 2, ··· , 9. Therefore, given p̂1 and p̂2 of a SNP, we will make our conclusion based on which square it belongs to if H0 is rejected.

2.3.3 Simulation study

In this section, we compare the joint test with the two-chi-squares test (Gregg et al., 2010) using a simulated dataset modelled after their data. Here we compare our method only with the two-chi-squares test because the simulation setting is based on the data of Gregg et al. (2010). We use the same notation X and Y to denote the S1 allele counts from the two crosses, respectively; then X follows a binomial(n1, p1) and Y follows a binomial(n2, p2). To simulate X and Y for each SNP, the values of n1 and n2 are

needed first. Here we sample (n1, n2) from a dataset in Gregg et al. (2010). Gregg

et al. (2010) perform three sets of reciprocal crosses in three mouse brain samples:

murine embryonic day 15 (E15) brain, adult cortex, and hypothalamus. Each dataset

has around 200,000 SNPs after quality control, and each SNP contains the observed

(x, y) and (n1, n2).

To compare these two tests, we need to know the overall significance level of the

two-chi-squares test. We set each chi-square test with α = 0.05, that is,

P (X < k1|p1 = 0.5) = P (X > n1 − k1|p1 = 0.5) = 0.025,

P (Y < k2|p2 = 0.5) = P (Y > n2 − k2|p2 = 0.5) = 0.025.

Therefore, the area of the rejection region in the lower right rectangle in the middle

graph of figure 2.1 is

\[ P(X > n_1 - k_1, Y < k_2 \mid p_1 = p_2 = 0.5) = P(X > n_1 - k_1 \mid p_1 = 0.5) P(Y < k_2 \mid p_2 = 0.5) = 0.025^{2} \]

since X and Y are independent. Thus, the significance level is 0.025^2 ∗ 2 = 0.00125.

Table 2.2: Simulation setting under H1

label     characterization          range of p1 and p2     number of SNPs
H1.A      No imprinting but AEI     p1 = p2 ≠ 0.5          1,000
H1.B1     Imprinting but no AEI     p1 = 0.5, p2 ≠ 0.5     160
H1.B2     Imprinting but no AEI     p1 ≠ 0.5, p2 = 0.5     440
H1.C1     Imprinting and AEI        p1 > 0.5, p2 < 0.5     200
H1.C2     Imprinting and AEI        p1 < 0.5, p2 > 0.5     100
H1.C31    Imprinting and AEI        p2 < p1 < 0.5          440
H1.C32    Imprinting and AEI        p1 < p2 < 0.5          860
H1.C41    Imprinting and AEI        p1 > p2 > 0.5          600
H1.C42    Imprinting and AEI        p2 > p1 > 0.5          200

To base our simulation on the data of Gregg et al. (2010), we carried out a preliminary analysis using the joint test on the E15 brain data. By the joint test, around 20% of SNPs in E15 are significant at α = 0.00125. Based on the data and the results from the joint test, 20,000 pairs of (n1, n2) in E15 are randomly selected, of which 16,000 pairs (80%) are used to generate X and Y under H0 : p1 = p2 = 0.5, and the other 4,000 pairs (20%) are used to generate X and Y under H1 : p1 ≠ 0.5 or p2 ≠ 0.5. We divide H1 into nine subsets as in table 2.2 and figure 2.3. The distribution of SNPs across these nine subsets roughly follows their inference results in the joint analysis of the E15 data, with the specific values shown in the last column of table 2.2.
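The data-generating step can be sketched as below. Everything specific in this sketch is an assumption made for illustration: the (n1, n2) pairs are drawn uniformly between 10 and 60 instead of being sampled from the E15 data, only 1,000 SNPs are generated, the seed is arbitrary, and a single pair (p1, p2) = (0.3, 0.3) stands in for the alternative subset H1.A.

```python
import random


def draw_counts(n1, n2, p1, p2, rng):
    """Draw X ~ binomial(n1, p1) and Y ~ binomial(n2, p2) independently."""
    x = sum(rng.random() < p1 for _ in range(n1))
    y = sum(rng.random() < p2 for _ in range(n2))
    return x, y


rng = random.Random(2013)  # arbitrary fixed seed for reproducibility

# Hypothetical informative-read counts; the dissertation samples them from E15.
pairs = [(rng.randint(10, 60), rng.randint(10, 60)) for _ in range(1000)]

simulated = []
for index, (n1, n2) in enumerate(pairs):
    if index < 800:
        label, p1, p2 = "H0", 0.5, 0.5      # 80% of SNPs under the null
    else:
        label, p1, p2 = "H1.A", 0.3, 0.3    # 20% under one example alternative
    x, y = draw_counts(n1, n2, p1, p2, rng)
    simulated.append((label, n1, n2, x, y))

print(simulated[0], simulated[-1])
```

Each simulated record keeps its true label, which plays the role of the gold standard when the rejections of a test are tallied.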

Figure 2.3: The nine subsets of H1

Given the simulated data, we can now compare the performance of the joint test and the two-chi-squares test with a “gold standard”. Figure 2.4 shows the comparison under H0. The top left graph in figure 2.4 presents the counts of SNPs under H0, and the total is 16,000. The two axes are p1 and p2, with each row (column) having width 0.1, except for the central row (column), which is split for a practical reason due to the nature of the two-chi-squares test. In that test, a gene is imprinted if it has at least one significant SNP with one of the p̂i’s > 0.5 and the other p̂i < 0.5. Therefore, the splitting of the central row (column) is to accommodate the inference. For the rest, the centers of the first four rows (columns) are from 0.1 to 0.4, and those of the last four rows (columns) are from 0.6 to 0.9. The intervals of the central two rows (columns) are (0.45, 0.5) and (0.5, 0.55), respectively. In the observed counts, although we set p1 = p2 = 0.5, the distribution covers a wide range given random variation. Hence, many SNPs are located outside the H0 region (red).

The top right graph in figure 2.4 presents the counts of significant SNPs inferred by the joint test. Because the counts for some simulated SNPs are far from the center (H0), we get 16 significant SNPs. This gives an estimated type I error rate of 16/16,000 = 0.001, quite close to the nominal value of 0.00125. The lower left graph in figure 2.4 presents the counts of significant SNPs inferred by the two-chi-squares test, and we get 18 significant SNPs, which is close to the joint test’s result. Note that all significant SNPs for this test are always in the top-left or lower-right quadrants.

Figure 2.5 shows the comparison under H1.A with the total number of SNPs being 1,000. In the top left graph, we can see that most of the counts are located in the diagonal p1 = p2 region (yellow). Because H1.A means SNPs are not imprinted but have AEI, the two-chi-squares test will not be able to make any significant inference. The joint test detects 703 significant SNPs having AEI, giving an estimated power of 703/1,000 = 70.3% with standard deviation (SD) 1.4%.

Figure 2.6 shows the comparison under H1.B (imprinting but no AEI) with the total number of SNPs being 600. In the top left graph, we can see that most of the counts are located in the central rows and columns (green). We get 11 significant SNPs for the two-chi-squares test, and 297 for the joint test. These give estimated powers of 297/600 = 49.5% with SD 2% and 11/600 = 1.8% with SD 0.5% for the joint test and the two-chi-squares test, respectively.

Figure 2.7 shows the comparison under H1.C1 and H1.C2 (imprinting and AEI) with the total number of SNPs being 300. We get 175 significant SNPs for the two-chi-squares test and 227 for the joint test. We were expecting the performance of these two tests to be close since most of the counts are located in the top left (H1.C2) or lower right (H1.C1), in which inferences can be made by the two-chi-squares test. Although the performances are comparable, the joint test, with an estimated power of 227/300 = 75.7% with SD 2.5%, still does better than the two-chi-squares test, with an estimated power of 175/300 = 58.3% with SD 2.8%.

Figure 2.4: Counts of total SNPs under H0, and counts of rejection

Figure 2.5: Counts of total SNPs under H1.A, and counts of rejection

Figure 2.6: Counts of total SNPs under H1.B, and counts of rejection

Figure 2.7: Counts of total SNPs under H1.C1 and H1.C2, and counts of rejection

Table 2.3: Summary of power in simulated data

label         range of p1 and p2                             power in joint    power in 2-chi-sq
H1.A          p1 = p2 ≠ 0.5                                  0.703±0.014       0
H1.B          One of the pi’s = 0.5, the other pi ≠ 0.5      0.495±0.02        0.018±0.005
H1.C1 & C2    One of the pi’s > 0.5, the other pi < 0.5      0.757±0.025       0.583±0.028
H1.C3         p2 < p1 < 0.5 or p1 < p2 < 0.5                 0.773±0.012       0
H1.C4         p1 > p2 > 0.5 or p2 > p1 > 0.5                 0.775±0.015       0

The SNPs simulated under H1.C3 and H1.C4 both indicate imprinting and AEI. As expected, the two-chi-squares test does much worse. Figure 2.8 shows the comparison under H1.C3 with the total number of SNPs being 1,300. In the top left graph, we can see that most of the counts are located in the lower left quadrant except for the diagonal p1 = p2. We get no significant SNPs for the two-chi-squares test, and 1,005 for the joint test, with an estimated power of 1,005/1,300 = 77.3% with SD 1.2%. Figure 2.9 shows the comparison under H1.C4 with the total number of SNPs being 800. In the top left graph, we can see that most of the counts are located in the top right quadrant except for the diagonal p1 = p2. We get no significant SNPs for the two-chi-squares test, and 620 for the joint test, with an estimated power of 620/800 = 77.5% with SD 1.5%.

Table 2.3 is a summary of the power estimated in the simulated data, where in the last two columns the number after ± is the standard deviation. Through simulation under several alternative scenarios of imprinting and AEI, the joint test is shown to outperform the two-chi-squares test with a type I error rate closely matching the nominal level.
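The text does not state how the standard deviations were obtained. As a hedged check, the binomial standard error sqrt(p̂(1 − p̂)/n) of an estimated proportion reproduces the reported values from the counts given above; the sketch below assumes this formula, and the function name is hypothetical.

```python
from math import sqrt


def estimated_power(significant, total):
    """Estimated power and its binomial standard error sqrt(p(1 - p)/n)."""
    power = significant / total
    sd = sqrt(power * (1.0 - power) / total)
    return power, sd


# Joint-test counts of significant SNPs and subset sizes reported in this section.
joint_counts = {
    "H1.A": (703, 1000),
    "H1.B": (297, 600),
    "H1.C1 & C2": (227, 300),
    "H1.C3": (1005, 1300),
    "H1.C4": (620, 800),
}

for label, (significant, total) in joint_counts.items():
    power, sd = estimated_power(significant, total)
    print(f"{label}: {power:.3f} +/- {sd:.3f}")
```

The same function applied to the two-chi-squares counts, 11/600 and 175/300, gives 0.005 and 0.028 after rounding, in agreement with the table.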

Figure 2.8: Counts of total SNPs under H1.C3, and counts of rejection

Figure 2.9: Counts of total SNPs under H1.C4, and counts of rejection

2.3.4 Real data study

In this section, we compare the performance of the joint test with the two-chi-squares test in the E15 sample in Gregg et al. (2010). For the joint test we still set α = 0.00125, and for the two-chi-squares test, each individual test uses α = 0.05. From around 200,000 SNPs, 37,258 SNPs are inferred to be significant by the joint test. As for the two-chi-squares test, 1,915 SNPs are inferred to be significant, nearly twenty times fewer than the number of significant SNPs by the joint test.

To compare the results from these two methods, we use the database of confirmed imprinted genes (Morison et al., 2001) as the reference. That is, only the SNPs within the confirmed imprinted genes are used, and we infer whether a significant SNP is maternally or paternally imprinted by comparing p̂1 with p̂2.

E15 is a sample from mouse brain, so we investigate the imprinted genes in two sets: brain or non-brain tissues. In Morison’s database, there are 49 imprinted genes described in mouse brain. After quality control, 23 of these 49 genes are found in the E15 dataset. Table 2.4 shows the testing results and attributes of these 23 imprinted genes. A gene is called significant if it has at least one significant SNP. Twenty of these 23 genes are significant by the joint test (marked with “x” in the third column) and 18 genes are significant by the two-chi-squares test (marked with “x” in the fourth column). If all significant SNPs in a gene show maternal (paternal) imprinting, this gene is M (P) in the fifth and sixth columns for the joint and the two-chi-squares test, respectively. If one gene has both maternally and paternally significant SNPs, this gene is M/P or P/M, where the letter before “/” signifies the majority of the significant SNPs. The last column shows the maternal or paternal imprinting classification in Morison’s database. Almost all significant genes show the same direction (maternal or paternal imprinting) in the tests and in Morison’s database; only 3 genes contain significant SNPs imprinted in both directions. Detailed descriptions are given in the caption of the table.

As for the non-brain tissues, there are 71 imprinted genes in Morison’s database. After quality control, 23 of these 71 genes are found in the E15 dataset. Table 2.5 shows the testing results and attributes of these 23 imprinted genes. Sixteen of these 23 genes are significant in the joint test (the third column) and 15 genes are significant in the two-chi-squares test (the fourth column). It is reasonable that the number of significant genes for non-brain tissue is smaller than the number of significant genes in the brain, because E15 is a brain tissue sample. All significant genes show the same direction of imprinting effects as indicated in Morison’s database, with one exception where SNPs are significantly imprinted in both directions, as indicated in the caption.

In tables 2.4 and 2.5, three genes are significant in the joint test but not significant in the two-chi-squares test. To further understand the properties of these two tests, the testing result of one of the SNPs (UCSC id: uc009kou.1 2) in gene Cd81 is shown in figure 2.10. Here n1 = 56 and n2 = 67. The blue dot is the observed (x, y) = (13, 21). The green region is the rejection region of the joint test. The red region is the rejection region of the two-chi-squares test. The yellow region is the overlapping rejection region of both tests, that is, the region that falls into both the rejection region of the joint test and the rejection region of the two-chi-squares test. Since this SNP is in the green region, it is significant in the joint test but not significant in the two-chi-squares test. This SNP likely has both AEI and imprinting, an alternative that is not detectable using the two-chi-squares test.

Table 2.4: Imprinted genes in mouse brain

                                                        Mat. / Pat. imprinting
NO   Gene symbol   Sig. in joint   Sig. in 2-chi-sq     Joint     2-chi-sq   Morison
1    Grb10         x               x                    P         P          P
2    Kcnk9         x               x                    P         P          P
3    H13           x               x                    P/M a     P/M a      P
4    Mcts2         x               x                    M         M          M
5    Blcap         x               x                    P         P          P
6    Calcr         x               x                    P         P          P
7    Sgce          x               x                    M         M          M
8    Peg10         x               x                    M         M          M
9    Copg2         x               x                    P         P          P
10   Nap1l5        x               x                    M         M          M
11   Zim5          x               x                    P         P          P
12   Usp29         x               x                    M         M          M
13   Ube3a         x               x                    P/M b     P/M b      P
14   Ndn           x               x                    M         M          M
15   Mkrn3         x               x                    M         M          M
16   Lgf2          x               x                    M         M          M
17   Rasgrf1       x               x                    M         M          M
18   Xlr3b         x               x                    P         P          P
19   Ppp1r9a       x                                    M/P c                P
20   Cntn3         x                                    P                    P
21   Sfmbt2                                                                  M
22   Gatm                                                                    P
23   Xlr4c                                                                   P

a For both joint test and two-chi-squares test, there are 55 significant SNPs in H13. Fifty-three SNPs show paternal imprinting, and only 2 SNPs show maternal imprinting.
b For both joint test and two-chi-squares test, there are 8 significant SNPs in Ube3a. Three SNPs show paternal imprinting, 2 SNPs show maternal imprinting, and 2 SNPs show AEI but no imprinting effect.
c There are 12 significant SNPs in Ppp1r9a. Four SNPs show paternal imprinting, and 8 SNPs show maternal imprinting.

Table 2.5: Imprinted genes in the non-brain tissues of the mouse

                                                        Mat. / Pat. imprinting
NO   Gene symbol   Sig. in joint   Sig. in 2-chi-sq     Joint     2-chi-sq   Morison
1    Plagl1        x               x                    M         M          M
2    Zrsr1         x               x                    M         M          M
3    Dlk1          x               x                    M         M          M
4    Dlk1          x               x                    M         M          M
5    Igf2r         x               x                    P         P          P
6    Impact        x               x                    M         M          M
7    Gnas          x               x                    P a       P a        P
8    Asb4          x               x                    P         P          P
9    Asb4          x               x                    M         M          M
10   Peg3          x               x                    M         M          M
11   Peg3          x               x                    M         M          M
12   Snrpn         x               x                    M         M          M
13   Peg12         x               x                    M         M          M
14   H19           x               x                    P         P          P
15   Cdkn1c        x               x                    P         P          P
16   Cd81          x                                    M/P b                M
17   Ddc                                                                     M
18   Htr2a                                                                   P
19   Axl                                                                     M
20   Tssc4                                                                   P
21   Tnfrsf23                                                                P
22   Tnfrsf23                                                                P
23   Th                                                                      P

a For both joint test and two-chi-squares test, there are 13 significant SNPs in Gnas. Six SNPs show paternal imprinting, and 7 SNPs show AEI but no imprinting effect.
b There are 4 significant SNPs in Cd81. One SNP shows paternal imprinting, and 3 SNPs show maternal imprinting.

Figure 2.10: Testing result of a SNP (UCSC id: uc009kou.1 2) in gene Cd81

2.4 Theoretic power for joint test

The focus of this dissertation is on the investigation of whether the current sequencing technology can provide sufficient power for studying imprinting and AEI. Thus, in this section, the power of the joint test in the reciprocal design of mouse experiments is calculated in terms of two sequencing parameters, sequencing depth (E(T), the average number of reads across all genes) and read length (l), together with sequence divergence (d). In the power calculation, we fix two of the parameters and change the other one to investigate the effect on power. The significance level α is set to be 5% throughout.

To determine the plausible range of values of these parameters, we consider all existing and commonly used platforms. Illumina (http://www.illumina.com) has read lengths from 35 bp to 150 bp; SOLiD (http://solid.appliedbiosystems.com) has read lengths from 35 bp to 75 bp; the read lengths for 454 Sequencing (http://www.454.com) are up to 1000 bp. Although a longer read has a higher chance of covering more SNPs, it typically has a higher error rate (Shendure and Ji, 2008). Hence, in our calculation we use l = 35, 100, 150, and 250. For sequencing depth, the main concern is cost, and we use different levels of E(T) in the following subsection to achieve sufficient power. As for sequence divergence, it depends on which mouse strains are used in the experiment, so in the calculation below, we use d = 0.1%, 0.5%, 1%, and 2% to cover a wide range of possibilities.

2.4.1 A read covers at least one SNP

In this subsection we define an informative read as one that covers at least one

SNP, so the distribution of ni is as given in subsection 2.2.1. Let R be the event of rejecting H0 based on the joint test. To average power over ni, the power of detecting imprinting and/or AEI given particular values of p1, p2, t, d, and l is

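\[
\begin{aligned}
P(R \mid p_1, p_2, t, d, l) &= \sum_{n_1=0}^{t} \sum_{n_2=0}^{t} P(R \mid n_1, n_2, p_1, p_2, t, d, l) P(n_1 \mid p_1, p_2, t, d, l) P(n_2 \mid p_1, p_2, t, d, l) \\
&= \sum_{n_1=0}^{t} \sum_{n_2=0}^{t} P(R \mid n_1, n_2, p_1, p_2) P(n_1 \mid t, d, l) P(n_2 \mid t, d, l).
\end{aligned}
\]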
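A minimal sketch of this averaging step is given below for the first definition of an informative read, where n1 and n2 independently follow binomial(t, 1 − (1 − d)^l). The function names are hypothetical, the rejection region is built as described in subsection 2.3.2 with ties ordered by (x, y) (an arbitrary choice), and t = 20 is used only to keep the example fast.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb


def binom_pmf(k, n, p):
    """Probability that a binomial(n, p) variable equals k."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


@lru_cache(maxsize=None)
def rejection_region(n1, n2, alpha):
    """Joint-test rejection region for given n1 and n2 (ties ordered by (x, y))."""
    limit = Fraction(str(alpha)) * 2 ** (n1 + n2)
    cells = sorted((comb(n1, x) * comb(n2, y), x, y)
                   for x in range(n1 + 1) for y in range(n2 + 1))
    region, cumulative = [], 0
    for weight, x, y in cells:
        if cumulative + weight > limit:
            break
        cumulative += weight
        region.append((x, y))
    return tuple(region)


def power_given_n(n1, n2, p1, p2, alpha):
    """P(R | n1, n2, p1, p2)."""
    return sum(binom_pmf(x, n1, p1) * binom_pmf(y, n2, p2)
               for x, y in rejection_region(n1, n2, alpha))


def power_given_t(t, p1, p2, d, l, alpha=0.05):
    """P(R | p1, p2, t, d, l): average of P(R | n1, n2, p1, p2) over n1 and n2."""
    q = 1.0 - (1.0 - d) ** l                      # P(a read is informative)
    w = [binom_pmf(n, t, q) for n in range(t + 1)]
    return sum(w[n1] * w[n2] * power_given_n(n1, n2, p1, p2, alpha)
               for n1 in range(t + 1) for n2 in range(t + 1))


# Example values only: t = 20 reads, d = 1%, l = 100.
print(power_given_t(20, 0.5, 0.5, d=0.01, l=100))   # attained type I error rate
print(power_given_t(20, 0.1, 0.1, d=0.01, l=100))   # strong AEI, no imprinting
```

Caching the rejection regions matters in practice because the same (n1, n2) pairs recur for every (p1, p2) and every t.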

Power can be averaged over t as well since t is different across genes. Let T be the random variable of read numbers across all genes; then, the distribution of T can be empirically determined or approximated by either discrete decay or power law functions (e.g. Ogasawara et al. 2003). Assuming a geometric distribution with expectation E(T), given p1, p2, E(T), l and d, we can rewrite power as:

\[ P(R \mid p_1, p_2, E(T), d, l) = \sum_{t=0}^{\infty} P(R \mid p_1, p_2, t, d, l) P(T = t \mid E(T)), \]

where \( P(T = t \mid E(T)) = \frac{1}{E(T)} \left(1 - \frac{1}{E(T)}\right)^{t-1} \), t = 1, 2, ···. In the calculation, it is impossible to have t = ∞, so we let the maximal t be a large number such that at least 99% of the distribution of T is covered, i.e.,

\[ \sum_{t=1}^{\text{max}.t} P(T = t \mid E(T)) \ge 0.99. \]
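The truncation rule can be sketched as follows, assuming the geometric distribution on t = 1, 2, ··· with success probability 1/E(T). The function names are hypothetical, and `power_given_t` stands for any function returning the power at a fixed t; a constant function is used below only to show how much probability mass the truncated sum retains.

```python
def geometric_pmf(t, mean_t):
    """P(T = t | E(T)) = (1/E(T)) (1 - 1/E(T))^(t - 1), for t = 1, 2, ..."""
    q = 1.0 / mean_t
    return q * (1.0 - q) ** (t - 1)


def max_t(mean_t, coverage=0.99):
    """Smallest t whose cumulative probability reaches the required coverage."""
    t, cdf = 0, 0.0
    while cdf < coverage:
        t += 1
        cdf += geometric_pmf(t, mean_t)
    return t


def average_over_depth(power_given_t, mean_t, coverage=0.99):
    """Truncated version of the sum over t of P(R | ..., t) P(T = t | E(T))."""
    top = max_t(mean_t, coverage)
    return sum(power_given_t(t) * geometric_pmf(t, mean_t)
               for t in range(1, top + 1))


# Example: E(T) = 30; the constant power function is a placeholder.
print(max_t(30))
print(average_over_depth(lambda t: 1.0, 30))
```

Because the terms beyond max.t are dropped rather than renormalized in this sketch, the truncated sum is a lower bound that understates the untruncated value by less than 0.01.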

Using the previous formulas, the theoretical power can be calculated. By a property of the binomial distribution, the power is symmetric about p1 = 0.5, p2 = 0.5, p1 = p2, and p1 + p2 = 1. In other words, for 0 < k1 < 1 and 0 < k2 < 1, we have

P (R|p1 = k1, p2,E(T ), d, l) = P (R|p1 = 1 − k1, p2,E(T ), d, l),

P (R|p1, p2 = k2,E(T ), d, l) = P (R|p1, p2 = 1 − k2,E(T ), d, l),

P (R|p1 = k1, p2 = k2,E(T ), d, l) = P (R|p1 = k2, p2 = k1,E(T ), d, l), and

P (R|p1 = k1, p2 = k2,E(T ), d, l) = P (R|p1 = 1 − k1, p2 = 1 − k2,E(T ), d, l).

Hence, we do not need to calculate power for all possible combinations of (p1, p2); we only need to calculate power for p1, p2 = 0.1, 0.2, ··· , 0.5, and p1 < p2.

We use two kinds of graphs to show the theoretical power in this and the next chapter. To study the effects of parameters, we fix two of (E(T), d, l) and change the value of the third parameter. In this subsection, E(T) is set to 10, 20, 30, and 40, and the fixed values of the parameters are E(T) = 30, d = 1%, and l = 100.

Figure 2.11 shows the power image plots for different values of E(T) but fixed d and l. The parameter values are in the titles of the graphs, and P(Y > 0) = 1 − (1 − d)^l is the probability that a read covers at least one SNP, i.e., a read is informative. The two axes are p1 and p2 from 0.1 to 0.5, so this image plot is divided into 25 combinations of p1 and p2. Since the power is symmetric about the diagonal line p1 = p2, we only show the power values in the upper triangle, and power values are displayed in different colors. The upper right cell is for p1 = p2 = 0.5, so it represents the type I error rate. As mentioned in subsection 2.3.2, it is always less than or equal to 5% due to the discrete nature of the rejection region.

Figure 2.12 shows the second kind of graph, power curve plots, for different values of E(T) but fixed d and l. The top left plot presents the power curves when p2 is 0.5 and p1 goes from 0.5 to 0.1. This is the case in which there is imprinting but no AEI. The top right plot presents the power curves when p1 = p2 goes from 0.5 to 0.1. This is the case of AEI but no imprinting. The lower left plot presents the case of both imprinting and AEI, where the x-axis contains seven combinations of p1 and p2. Since power is larger when (p1, p2) is farther from (0.5, 0.5), these seven combinations of p1 and p2 are ordered by the Euclidean distance between them and (0.5, 0.5). Finally, the parameters are in the lower right corner. From these figures, we can see that power improves as E(T) increases, and power improves most when E(T) goes from 10 to 20. The top power (around 81.5%) is for the extreme case when there is a very strong imprinting and/or AEI effect. When the effects are more moderate, the power can be much lower, and a sequencing depth of E(T) = 40 appears to be necessary to have a realistic chance of detecting such effects.

Figure 2.11: Power image plots for different E(T), where an informative read is defined as covering at least one SNP

Figure 2.12: Power curve plots for different E(T), where an informative read is defined as covering at least one SNP

Figures 2.13 and 2.14 show the power image and curve plots for different values of d but fixed E(T) and l. The power improves when d increases, and power improves most from d = 0.1% to 0.5%, but the improvement is incremental for larger sequence divergence. Figures 2.15 and 2.16 show the power image and curve plots for different values of l but fixed E(T) and d. There is a marked improvement from l = 35 to 100, but the improvement is incremental for larger read lengths. The top power (around 82.3%) is for the extreme case when there is a very strong imprinting and/or AEI effect. Therefore, if the sequence divergence is at most 1% and E(T) is 30 or lower, then the power is moderate at best, especially if the effects are not strong. To increase the power, increasing the coverage is the key, but that can be expensive.

2.4.2 A read covers a particular SNP

In this subsection the formula of power is the same as in subsection 2.4.1. However, we define a read as informative if it covers a particular SNP, so the distribution of ni is as given in subsection 2.2.2. Given the same d and l, the probability that a read is informative under this scenario is much smaller than that under the much less stringent definition of subsection 2.4.1. In order to achieve reasonable power, longer reads and higher coverage are needed, which can prove to be prohibitively expensive. We also use a larger sequence divergence, which is probably larger than that of most organisms and is interpreted as an upper limit. In this subsection, E(T) is set to 40, 70, 100, and 130, and the fixed values of the parameters are E(T) = 130, d = 2%, and l = 250.

Figures 2.17 and 2.18 show the power image and curve plots for different values of E(T) but fixed d and l. The parameter values are in the titles of the graphs, and P(A|d, l) is the probability that a read covers a particular SNP. We can see that the power improves as E(T) increases. Note that the y-axis is from 0 to 0.6, and the same range is used in all the power curve plots in this subsection. The top power (around 60%) is for the extreme case when there is a very strong imprinting and/or AEI effect. When the effects are more moderate, the power can be much lower. Therefore a sequencing depth E(T) of at least 130 appears to be necessary to have a realistic chance of detecting such effects, and an even larger E(T) may be needed.

Figure 2.13: Power image plots for different d, where an informative read is defined as covering at least one SNP

Figure 2.14: Power curve plots for different d, where an informative read is defined as covering at least one SNP

Figure 2.15: Power image plots for different l, where an informative read is defined as covering at least one SNP

Figure 2.16: Power curve plots for different l, where an informative read is defined as covering at least one SNP

Figures 2.19 and 2.20 show the power image and curve plots for different values of d but fixed E(T) and l. In the titles of figure 2.19 we can see that P(A|d, l) = 0.042, 0.069, 0.069, and 0.073 at the four values of d. Because P(A|d, l) increases little once the sequence divergence d exceeds 0.5%, the power does not improve much beyond d = 0.5% either. Figures 2.21 and 2.22 show the power image and curve plots for different values of l but fixed E(T) and d. In contrast to subsection 2.4.1, the power improves steadily as l increases. This is not surprising: the chance of covering a particular SNP is much smaller, and increasing the read length directly improves that chance. However, the overall power remains low even in the most optimistic setting.

To sum up, if an informative read is defined as covering at least one SNP (genome-wide study), a sequencing depth of E(T) = 40 is necessary to achieve sufficient power under read length l = 100 and sequence divergence d = 1%, especially when the imprinting and AEI effects (p1, p2) are moderate. Fixing E(T) = 30 and d = 1%, the power does not improve much when l > 100. This suggests that increasing sequencing depth, not read length, is the key to improving power, although it can be expensive. On the other hand, if an informative read is defined as covering a particular SNP (candidate gene based study), then E(T) of at least 130 and l of at least 250 are necessary to achieve sufficient power under d = 2%, even when imprinting and AEI effects are strong. These large sequencing parameters render it unrealistic to carry out such a study. Because longer reads may have a higher base-calling error rate with currently available technology, we suggest increasing sequencing depth as a more reliable approach, although it can be more expensive.

Figure 2.17: Power image plots for different E(T), where an informative read is defined as covering a particular SNP

Figure 2.18: Power curve plots for different E(T), where an informative read is defined as covering a particular SNP

Figure 2.19: Power image plots for different d, where an informative read is defined as covering a particular SNP

Figure 2.20: Power curve plots for different d, where an informative read is defined as covering a particular SNP

Figure 2.21: Power image plots for different l, where an informative read is defined as covering a particular SNP

Figure 2.22: Power curve plots for different l, where an informative read is defined as covering a particular SNP

Chapter 3: Statistical power for detecting imprinting and AEI in human data

3.1 Informative trio structures

For humans, a reciprocal cross design is clearly infeasible. Therefore we turn to the trio design, which is known to be useful for detecting imprinting (Weinberg, 1999). Here we observe a sample of families, each with exactly one father, one mother, and one child. Consider a SNP with two alleles A and a, where A denotes the variant allele, the allele of interest. Following the notation in He et al. (2011), let F, M and C be the number of copies of allele A in the father, mother and child, respectively. Then the possible values of F, M and C are 0, 1 and 2, which stand for the genotypes aa, Aa, and AA, respectively. Hence, FMC represents the genotype structure of a trio at a SNP site. There are only 15 genetically possible values of FMC: 212, 122, 211, 121, 112, 111, 110, 101, 011, 100, 010, 222, 201, 021, and 000. For example, 202 is impossible since a child cannot be AA if the mother is aa.

Not all of these combinations are informative for detecting imprinting and/or AEI, though. Only children who are heterozygous at the SNP (i.e., C = 1) are considered, because for them the parental source of each inherited allele can be unambiguously determined, unless both parents are also heterozygous. Therefore, only six types of FMC are informative: 201, 211, 101, 021, 121, and 011. A child gets allele A from the father in the former three types, and from the mother in the latter three types, as indicated in table 3.1.

Table 3.1: The six types of informative trio structures

type          1    2    3    4    5    6
FMC           201  211  101  021  121  011
origin of A   from father    from mother
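The enumeration of trio structures can be checked mechanically. The following is a minimal sketch, assuming simple Mendelian transmission at a biallelic SNP; the function names are illustrative and not part of the original analysis.

```python
from itertools import product

def child_genotypes(f, m):
    """Possible numbers of A alleles in a child whose parents carry f and m copies."""
    # A parent with 0 copies transmits a, with 2 copies transmits A, with 1 copy either.
    transmit = {0: (0,), 1: (0, 1), 2: (1,)}
    return sorted({a + b for a in transmit[f] for b in transmit[m]})

# All genetically possible (F, M, C) trios.
feasible = [(f, m, c) for f, m in product(range(3), repeat=2)
            for c in child_genotypes(f, m)]

# Informative trios: heterozygous child, parents not both heterozygous.
informative = [(f, m, c) for f, m, c in feasible
               if c == 1 and not (f == 1 and m == 1)]

def origin_of_A(f, m):
    """Parent who transmitted A to the heterozygous child of an informative trio."""
    # If F > M the father must have passed A (and the mother a), and vice versa.
    return "father" if f > m else "mother"

print(len(feasible), "feasible trios")
for f, m, c in informative:
    print(f"{f}{m}{c}", "A from", origin_of_A(f, m))
```

The sketch lists the feasible trios and the informative ones together with the parental origin of A, which can be compared against table 3.1.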

3.2 Joint test and parameter setting

To study the effects of imprinting and AEI, we first introduce the parameters of interest: p1 and p2. For a specific SNP with two alleles A and a, let p1 be the proportion of reads containing A from the former three types of families in table 3.1 (A inherited from the father), and p2 be the proportion of reads containing A from the latter three types (A inherited from the mother). Hypotheses about these two parameters characterize the imprinting and AEI effects.

Before RNA-seq experiments, we need to figure out how many families are expected in each of the six types in a random sample of trios from the population. We first introduce some notation. Let N be the total number of families in a sample, and let fi be the probability of a type i family in the population given that the child has a heterozygous genotype and the parents are not both heterozygous, i = 1, ···, 6. Let vi be the expected number of type i families in the sample. Then vi = N × fi, which will be rounded to an integer in our power calculation.

Under the assumption of Hardy-Weinberg equilibrium (HWE), we have

\begin{align*}
f_1 &= P(F = 2, M = 0 \mid C = 1,\ F \text{ and } M \text{ not both } 1) \\
    &\propto P(C = 1,\ F \text{ and } M \text{ not both } 1 \mid F = 2, M = 0)\, P(F = 2, M = 0) \\
    &= P(F = 2, M = 0) = P(F = 2)\, P(M = 0).
\end{align*}

Similarly, we can find the probabilities for the other five types:

\begin{align*}
f_2 &\propto P(F = 2)\, P(M = 1)/2, & f_3 &\propto P(F = 1)\, P(M = 0)/2, \\
f_4 &\propto P(F = 0)\, P(M = 2),   & f_5 &\propto P(F = 1)\, P(M = 2)/2, \\
f_6 &\propto P(F = 0)\, P(M = 1)/2.
\end{align*}

Furthermore, given HWE, the probabilities can be expressed in terms of allele frequencies. Note that

\begin{align*}
P(F = 2) &= P(M = 2) = P(A)^2, \\
P(F = 1) &= P(M = 1) = 2 P(A) P(a), \\
P(F = 0) &= P(M = 0) = P(a)^2.
\end{align*}

In the calculation below for investigating theoretical power, we set P(A) = 0.7, so P(a) = 0.3, although we do not expect the results regarding sequencing parameters to change qualitatively for different allele frequencies.
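As a worked illustration of the formulas above, the following sketch computes the normalized fi and the rounded vi. It assumes HWE and independent parental genotypes as stated; the function names and the use of simple rounding are assumptions made for illustration.

```python
def genotype_probs(p_A):
    """HWE genotype probabilities indexed by the number of A alleles (0, 1, 2)."""
    p_a = 1.0 - p_A
    return {0: p_a ** 2, 1: 2 * p_A * p_a, 2: p_A ** 2}

def family_type_probs(p_A):
    """Normalized f_1, ..., f_6 for the informative trio types of table 3.1."""
    g = genotype_probs(p_A)
    types = [(2, 0), (2, 1), (1, 0), (0, 2), (1, 2), (0, 1)]  # (F, M) for types 1..6
    # A heterozygous parent transmits the required allele with probability 1/2.
    weights = [g[f] * g[m] * (0.5 if 1 in (f, m) else 1.0) for f, m in types]
    total = sum(weights)
    return [w / total for w in weights]

def expected_counts(n_families, p_A):
    """v_i = N * f_i with simple rounding (the rounding rule is an assumption)."""
    return [round(n_families * f) for f in family_type_probs(p_A)]

print(expected_counts(100, 0.7))
```

With N = 100 and P(A) = 0.7 the counts sum to N. Simple rounding need not preserve the total in general; for instance, at N = 50 it yields 16 for types 2 and 5, so a convention that keeps the total at N can differ by one family in some cells.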

After determining the number of families of each type, we discuss the number of informative reads in each type. Let ui be the number of reads containing the SNP in one family of type i, i = 1, 2, ···, 6. Using the idea in section 2.2, we assume ui follows a binomial(t, f(d, l)) distribution, where f(d, l) depends on the definition of informative reads. We further assume the ui to be independent of one another in our power calculation.

From the above discussion, we see that ui × vi is the total number of informative reads containing the SNP for type i families. For the six types of informative families, the former three types have the information for p1, and the latter three types have the information for p2. Therefore, let Qi be the total number of reads covering allele A in type i families, i = 1, 2, ···, 6. Then conditional on the values of ui and vi,

\begin{align*}
Q_i &\sim \mathrm{binomial}(u_i v_i,\ p_1) \quad \text{for } i = 1, 2, 3, \\
Q_i &\sim \mathrm{binomial}(u_i v_i,\ p_2) \quad \text{for } i = 4, 5, 6.
\end{align*}

Let X be the total number of reads covering allele A from the father, and Y be the total number of reads covering allele A from the mother. By assuming that Qi, i = 1, 2, 3, are independent of one another and that Qi, i = 4, 5, 6, are independent of one another, we have

\begin{align*}
X &= \sum_{i=1}^{3} Q_i \sim \mathrm{binomial}\Big(\sum_{i=1}^{3} u_i v_i \equiv n_1,\ p_1\Big), \\
Y &= \sum_{i=4}^{6} Q_i \sim \mathrm{binomial}\Big(\sum_{i=4}^{6} u_i v_i \equiv n_2,\ p_2\Big).
\end{align*}

Now the joint test can be used directly since we have turned the problem into a setting analogous to mouse data.
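The pooling of reads into (n1, X) and (n2, Y) can be sketched as follows. The family counts v are those reported for N = 100 and P(A) = 0.7; the per-family read counts u and the values of p1 and p2 are hypothetical, and the helper names are illustrative only.

```python
import random

def read_totals(u, v):
    """n1 and n2: informative reads pooled over types 1-3 and types 4-6."""
    per_type = [ui * vi for ui, vi in zip(u, v)]
    return sum(per_type[:3]), sum(per_type[3:])

def simulate_xy(u, v, p1, p2, rng):
    """Draw X ~ binomial(n1, p1) and Y ~ binomial(n2, p2) read by read."""
    n1, n2 = read_totals(u, v)
    x = sum(rng.random() < p1 for _ in range(n1))
    y = sum(rng.random() < p2 for _ in range(n2))
    return x, y

v = (13, 31, 6, 13, 31, 6)   # family counts for N = 100, P(A) = 0.7
u = (1, 0, 2, 1, 1, 0)       # hypothetical informative reads per family, by type
n1, n2 = read_totals(u, v)
x, y = simulate_xy(u, v, p1=0.3, p2=0.5, rng=random.Random(0))
print(n1, n2, x, y)
```

Once (n1, X) and (n2, Y) are formed, the data have the same two-binomial structure as in the mouse reciprocal cross.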

3.3 Theoretic power for joint test

Let R be the event of rejecting H0. Suppose the numbers of families of each type, vi, i = 1, 2, ···, 6, are fixed. Given p1, p2 and ui, i = 1, 2, ···, 6, the power is then P(R|p1, p2, ui, vi, i = 1, 2, ···, 6). Averaging over the ui, the power given p1, p2, t, l and d is

\begin{align*}
P(R \mid p_1, p_2, t, l, d)
&= \sum_{u_1=0}^{t} \sum_{u_2=0}^{t} \sum_{u_3=0}^{t} \sum_{u_4=0}^{t} \sum_{u_5=0}^{t} \sum_{u_6=0}^{t}
   \Big\{ P(R \mid p_1, p_2, t, l, d, u_i, v_i, i = 1, 2, \ldots, 6) \times \prod_{i=1}^{6} P(u_i \mid p_1, p_2, t, l, d) \Big\} \\
&= \sum_{u_1=0}^{t} \sum_{u_2=0}^{t} \sum_{u_3=0}^{t} \sum_{u_4=0}^{t} \sum_{u_5=0}^{t} \sum_{u_6=0}^{t}
   \Big\{ P(R \mid p_1, p_2, u_i, v_i, i = 1, 2, \ldots, 6) \times \prod_{i=1}^{6} P(u_i \mid t, l, d) \Big\}.
\end{align*}

This power can be averaged over t in the same way as before for mouse data:

\[
P(R \mid p_1, p_2, E(T), d, l) = \sum_{t=1}^{\infty} P(R \mid p_1, p_2, t, l, d)\, P(T = t \mid E(T)),
\]

where

\[
P(T = t \mid E(T)) = \frac{1}{E(T)} \Big(1 - \frac{1}{E(T)}\Big)^{t-1}, \quad t = 1, 2, \ldots.
\]
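The averaging over sequencing depth can be sketched numerically by truncating the infinite sum once the remaining geometric tail mass is negligible. In the sketch below, the truncation tolerance and the function `toy_power`, which stands in for P(R|p1, p2, t, l, d), are assumptions made purely for illustration.

```python
def depth_pmf(t, mean_depth):
    """P(T = t | E(T)) for the geometric distribution on t = 1, 2, ..."""
    q = 1.0 / mean_depth
    return q * (1.0 - q) ** (t - 1)

def average_over_depth(cond_power, mean_depth, tol=1e-12):
    """Sum cond_power(t) * P(T = t) until the remaining tail mass is below tol."""
    q = 1.0 / mean_depth
    total, t, tail = 0.0, 1, 1.0
    while tail > tol:
        total += cond_power(t) * depth_pmf(t, mean_depth)
        tail *= 1.0 - q      # tail now equals P(T > t)
        t += 1
    return total

def toy_power(t):
    """Hypothetical conditional power that saturates as the depth t grows."""
    return 1.0 - 0.8 ** t

print(average_over_depth(toy_power, mean_depth=4))
```

Because the conditional power is bounded by 1, the truncation error of the averaged power is at most the tolerance.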

Based on commonly used read lengths in the current technologies, we again set l = 35, 100, 150, or 250 for human data, and set d = 0.1%, 0.5%, 1%, or 2% to cover a wide range of possibilities. As for sequencing depth E(T), because there are N families in human data but only one family in mouse data, E(T) for human is set to be much smaller than E(T) for mouse. Besides the effects of sequencing parameters, for human data we also investigate the effect of changing N (the number of families in the sample). Image plots and power curve plots are used to present the effects of the parameters.

3.3.1 A read covers at least one SNP

In this subsection we define an informative read as one that covers at least one SNP, so the distribution of ni is as in subsection 2.2.1. E(T) is set to 2, 4, 6, or 8, and the values of the parameters that are held fixed (i.e., when only one parameter is varied) are E(T) = 4, d = 1%, and l = 100.

We set N = 50 in figures 3.1 to 3.6, which leads to the numbers of the six types of families being v = (v1, v2, ···, v6) = (7, 15, 3, 7, 15, 3) when P(A) = 0.7, after rounding to the nearest integer. The power is symmetric about p1 = 0.5, p2 = 0.5, p1 = p2, and p1 + p2 = 1, as in section 2.4. Therefore in the image plots, we only show the power in the upper triangle.

Figures 3.1 and 3.2 show the power image and curve plots for different values of E(T) but fixed d and l. The parameter values are in the titles of the graphs, and P(Y > 0) is the probability that a read covers at least one SNP. From these figures, we can see that the power increases as E(T) increases, and that the power improves most when E(T) moves from 2 to 4. The top power (around 86%) is for the extreme case when there is a very strong imprinting and/or AEI effect. When the effects are more moderate, the power can be much lower; a sequencing depth of E(T) = 8 appears to be necessary to have a realistic chance of detecting such effects.

In figure 3.2, we investigate the relationship between (p1, p2) and power. For

detecting only imprinting or only AEI (the upper two plots), the power increases

steadily as (p1, p2) is further from (0.5,0.5). However, for detecting both imprinting

and AEI (the lower left plot), the power is quite large for (p1, p2)=(0.3,0.4), and

does not improve much as we move from there toward (p1, p2)=(0.1,0.2). The same

phenomenon can be observed in the other power curve plots in this subsection.

Figures 3.3 and 3.4 show the power image and curve plots for different values of d but fixed E(T) and l. The power increases as d increases, and improves most when d moves from 0.1% to 0.5%. Figures 3.5 and 3.6 show the power image and curve plots for different values of l but fixed E(T) and d. There is a marked improvement from l = 35 to 100, but the improvement is only incremental for longer reads. These phenomena also occur in mouse data when an informative read is defined as covering at least one SNP (subsection 2.4.1). Therefore, if there is at most 1% sequence divergence and E(T) is 4 or lower, then the power is moderate at best, especially if the effects are not strong. In general, to increase the power, increasing the coverage is still the key.

Figure 3.7 shows the effect of the number of families N. We use N = 20, 30, 50, or 100. Note that the range of the y-axis is from 0 to 0.8. In each plot, the curves appear close to each other toward the right side of the x-axis. However, there is still an appreciable difference between the curves for different N, although the differences are smaller toward the right (strong effects) than in the middle (moderate effects). That is, if p1 or p2 is very far from 0.5, such as 0.1 (strong effects), N has a small effect on the power. On the other hand, if p1 and p2 are not far from 0.5 (moderate effects), a larger N is necessary to improve power.

3.3.2 A read covers a particular SNP

In this subsection we define an informative read as one that covers a particular SNP, so the distribution of ni is as in subsection 2.2.2. To find P(A|l, d), genome-wide human gene lengths are needed. The Ensembl project (http://useast.ensembl.org/info/data/ftp) provides such information; the current version is Human release 69, which includes 21,980 protein coding genes.

Similar to subsection 2.4.2, the probability that a read is informative under this scenario is much smaller than that under the much less stringent scenario above, given the same d and l. In order to achieve reasonable power, longer reads and higher sequencing depth are both needed. We also use larger sequence divergence as an illustration, although we caution that the higher values are probably larger than the reality in human data and should be interpreted as an upper limit. In this subsection, E(T) is set to 4, 6, 8, or 10, and the fixed values of the parameters are E(T) = 10, d = 2%, and l = 250.

Figure 3.1: Power image plots for different E(T), where N = 50 and an informative read is defined as covering at least one SNP

Figure 3.2: Power curve plots for different E(T), where N = 50 and an informative read is defined as covering at least one SNP

Figure 3.3: Power image plots for different d, where N = 50 and an informative read is defined as covering at least one SNP

Figure 3.4: Power curve plots for different d, where N = 50 and an informative read is defined as covering at least one SNP

Figure 3.5: Power image plots for different l, where N = 50 and an informative read is defined as covering at least one SNP

Figure 3.6: Power curve plots for different l, where N = 50 and an informative read is defined as covering at least one SNP

Figure 3.7: Power curve plots for different N, where an informative read is defined as covering at least one SNP

We set N = 100 in figures 3.8 to 3.13, which gives, after rounding, the numbers of the six types of families v = (13, 31, 6, 13, 31, 6) when P(A) = 0.7 and HWE is assumed.

Figures 3.8 and 3.9 show the power image and curve plots, for different values of

E(T ) but fixed d and l. The parameter values are in the titles of the graphs, and

P (A|d, l) is the probability that a read covers a particular SNP. In figure 3.9, the

power increases steadily as E(T ) increases, and the difference among curves increases

as (p1, p2) is further from (0.5,0.5). Note that the range of the y-axis is from 0 to 0.7.

The top power (59%) is for the extreme case when there is a very strong imprinting and/or AEI effect. When the effects are more moderate, the power can be much lower, so a sequencing depth E(T) of at least 10 appears to be necessary to have a realistic chance of detecting such effects.

Figures 3.10 and 3.11 show the power image and curve plots for different values of d but fixed E(T) and l. In the titles of figure 3.10 we can see that P(A|d, l) = 0.048, 0.049, 0.051, and 0.057 at the four values of d. The values of P(A|d, l) are very close for the d's that are at most 1%, so in figure 3.11 the curves for d = 0.1% and 0.5% almost overlap, and the power improves most when d moves from 1% to 2%. We can also see that the power for (p1, p2) = (0.2, 0.3) is, surprisingly, larger than the power for (p1, p2) = (0.1, 0.4), which may be due to the discrete nature of the joint test.

Figure 3.8: Power image plots for different E(T), where N = 100 and an informative read is defined as covering a particular SNP

Figure 3.9: Power curve plots for different E(T), where N = 100 and an informative read is defined as covering a particular SNP

Compared to P(A|d, l) for the mouse data in figure 2.19, under the same d and l, P(A|d, l) for mouse is generally larger than P(A|d, l) for human. The reason is that human genes are generally longer than mouse genes. The median gene length for mouse is 15,210, and for human is 21,550. Recall that in subsection 2.2.2,

\[
P(A \mid G, l, d) \approx \sum_{y=0}^{s-1} \frac{y}{s}\, P(Y = y) + \sum_{y=s}^{l} P(Y = y),
\]

where G is the gene length; s = \lceil Gd \rceil is the expected number of SNPs in this gene; y is the actual number of SNPs in this gene. The coefficient of P(Y = y) in the first term is y/s, which is not greater than 1, and the coefficient in the second term is 1.

Under the same d and l, since s for human is generally larger than s for mouse, human has a higher proportion of y's in the first term than mouse. Therefore, P(A|d, l) for mouse is generally larger than P(A|d, l) for human.
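The effect of gene length on this approximation can be illustrated for a single gene of median length. The sketch below assumes Y follows a binomial(l, d) distribution, in line with the binomial model of subsection 2.2.1; it evaluates the formula at one gene length only, whereas the values of P(A|d, l) reported in the figure titles involve the genome-wide gene lengths, so the numbers are not directly comparable. The function names are illustrative.

```python
import math

def binom_pmf(y, n, p):
    """Binomial(n, p) probability of exactly y successes."""
    return math.comb(n, y) * p ** y * (1.0 - p) ** (n - y)

def prob_cover_snp(gene_len, read_len, d):
    """Approximate P(A | G, l, d), assuming Y ~ binomial(l, d)."""
    s = math.ceil(round(gene_len * d, 9))   # s = ceil(G d); rounding guards float noise
    first = sum((y / s) * binom_pmf(y, read_len, d)
                for y in range(0, min(s - 1, read_len) + 1))
    second = sum(binom_pmf(y, read_len, d) for y in range(s, read_len + 1))
    return first + second

for name, gene_len in [("mouse", 15210), ("human", 21550)]:
    print(name, prob_cover_snp(gene_len, read_len=250, d=0.02))
```

When s exceeds l, as it does here, the second sum is empty and the approximation reduces to E(Y)/s = ld/s, which makes the inverse dependence on gene length explicit.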

Figures 3.12 and 3.13 show the power image and curve plots for different values of l but fixed E(T) and d. The power increases steadily as l increases, as we can see for each effect size in the plots. The top power (59%) is for the extreme case when there is a very strong imprinting and/or AEI effect. When the effects are more moderate, the power can be much lower, so read lengths of at least 250 appear to be necessary to have a realistic chance of detecting such effects. However, a longer read has a higher chance of containing wrong base calls, so increasing sequencing depth is still the key to improving power.

Figure 3.14 shows the effect of the number of families N. We use N = 50, 100, or 200, and fix E(T) = 10, d = 2%, l = 250. Note that the range of the y-axis is from 0 to 0.8. The power increases steadily as N increases for any (p1, p2), and the top power is 67% for the extreme case. This suggests that increasing the number of families can lead to a substantial increase in power, especially when the effects are moderate. When N = 100 or 200, we can see that the power for (p1, p2) = (0.2, 0.3) is larger than the power for (p1, p2) = (0.1, 0.4), which may once again be due to the discrete nature of the joint test.

Figure 3.10: Power image plots for different d, where N = 100 and an informative read is defined as covering a particular SNP

Figure 3.11: Power curve plots for different d, where N = 100 and an informative read is defined as covering a particular SNP

Figure 3.12: Power image plots for different l, where N = 100 and an informative read is defined as covering a particular SNP

Figure 3.13: Power curve plots for different l, where N = 100 and an informative read is defined as covering a particular SNP

Figure 3.14: Power curve plots for different N, where an informative read is defined as covering a particular SNP

In summary, if an informative read is defined as covering at least one SNP (genome-wide study), a sequencing depth of E(T) = 8 is necessary to achieve reasonable power under the setting of N = 50, l = 100 and d = 1%, especially when imprinting and AEI effects are moderate. Fixing E(T) = 4 and d = 1%, the power does not improve much when l > 100. As for the number of families N, under E(T) = 4, d = 1% and l = 100, even just N = 20 leads to sufficient power for detecting strong imprinting and AEI effects (such as when one of the pi's is 0.1). However, a larger sample such as N = 100 is necessary for more moderate effects. On the other hand, if an informative read is defined as covering a particular SNP (candidate gene based study), an E(T) of at least 10 and l of at least 250 are necessary to achieve sufficient power for N = 100 families and d = 2%, even when imprinting and/or AEI effects are strong. As for the effect of N, under E(T) = 10, d = 2% and l = 250, N = 200 is necessary to achieve sufficient power even for strong imprinting and AEI effects.

Chapter 4: Summary and Discussion

4.1 Summary

Epigenetics is the study of heritable changes in gene expression or cellular phenotype caused by mechanisms other than changing the underlying DNA sequence (Goldberg et al., 2007). Two epigenetic phenomena, genomic imprinting and AEI, are discussed in this dissertation. Genomic imprinting is an epigenetically regulated process by which imprinted genes are expressed in a parent-of-origin-specific manner, and AEI refers to asymmetric expression of two different alleles at the same locus.

As we discussed in sections 1.2 and 1.3, many analysis tools have been used to investigate these two phenomena. Among these tools, RNA-seq is a powerful new technology for mapping and quantifying transcriptomes using ultra high throughput NGS technologies. Using RNA-seq, a genome-wide study can investigate genomic imprinting and AEI without prior knowledge of genes or coding regions. Compared to microarray hybridization-based methods, RNA-seq does not have background noise due to hybridization. Data from RNA-seq experiments are digital, so they do not have a limited range of signals (Hurd and Nelson, 2009). Compared to the traditional sequence-based methods, RNA-seq is also more economical, which makes genome-wide mapping feasible. Nevertheless, RNA-seq has its own limitations, such as errors in base calling, read mapping uncertainty in the genome (Li et al., 2010), and biases from transcript length (Oshlack and Wakefield, 2009) and sequence base composition (Dohm et al., 2008).

In this dissertation, we focus on how sequencing parameters may affect the power of tests for detecting imprinting and/or AEI, and whether the current technology can provide sufficient power for such an endeavor for mouse and human data. Since existing methods in the literature are not amenable to detecting such effects, and since these two effects may be confounded with one another, we also propose a joint test for simultaneous detection of imprinting and AEI. In chapter 2, the reciprocal cross design for mouse is discussed, and two definitions of informative reads based on the binomial distribution are introduced and used throughout this dissertation. The proposed joint test and the two-chi-squares test in the literature are investigated in simulation and in a real data study, and their results are compared and contrasted. The results show that the joint test is not only applicable for simultaneous detection of the two epigenetic effects, but is also more powerful than the two-chi-squares test. Furthermore, we note that the formula for power calculation in terms of sequencing parameters (sequencing depth and read length) and sequence divergence is applicable to other tests, not just to the joint test.

We provide theoretical power under some combinations of parameters. If an informative read is defined as covering at least one SNP for the mouse reciprocal cross design, a sequencing depth of E(T) = 40 is necessary to achieve sufficient power under read length l = 100 and sequence divergence d = 1%, especially when imprinting and AEI effects (p1, p2) are moderate. Fixing E(T) = 30 and d = 1%, the power does not improve much when l > 100. This suggests that increasing sequencing depth, not read length, is the key to improving power, although it can be expensive. On the other hand, if an informative read is defined as covering a particular SNP, then E(T) of at least 130 and l of at least 250 are necessary to achieve sufficient power under d = 2%, even when imprinting and AEI effects are strong. Because longer reads may have a higher base-calling error rate with currently available technology, we suggest increasing sequencing depth as a more reliable, albeit more expensive, alternative. Note that the two definitions of “an informative read” may be interpreted as a genome-wide or a candidate gene based study, respectively.

Chapter 3 is on detecting imprinting and AEI for human data. We discuss which trio structures are informative, and we again use the joint test to detect imprinting and AEI. In the theoretical power calculation, in addition to the effects of sequencing parameters and sequence divergence, the number of families N in a random sample of trios is also considered. If an informative read is defined as covering at least one SNP, a sequencing depth of E(T) = 8 is necessary to achieve reasonable power under the setting of N = 50, l = 100 and d = 1%, especially when imprinting and AEI effects are only moderate. Fixing E(T) = 4 and d = 1%, the power does not improve much when l > 100. As for the number of families N, under E(T) = 4, d = 1% and l = 100, even N = 20 leads to sufficient power for detecting strong imprinting and AEI effects (such as when one of the pi's is 0.1). However, a larger sample such as N = 100 is necessary for more moderate effects.

If an informative read is defined as covering a particular SNP, E(T) of at least 10 and l of at least 250 are necessary to achieve sufficient power when N = 100 and d = 2%, even when imprinting and/or AEI effects are strong. As for the effect of N, under E(T) = 10, d = 2% and l = 250, N = 200 is necessary to achieve sufficient power even for strong imprinting and AEI effects.

4.2 Discussion and future extensions

4.2.1 Arguable assumptions in this dissertation

We stress that the theoretical power presented in this dissertation should be treated as a guideline only, because a number of assumptions of our model can be violated with real data. In the following, we discuss several such issues, their potential consequences, and possible remedies, which will be explored in future research.

We modelled the distribution of the number of reads in a gene (T) using a geometric distribution, but the distribution in real organisms may not fit any geometric distribution well. In particular, deviation in the tail is observed for mouse data, signifying that there are many more genes with extremely high expression levels than expected from a geometric distribution. We will explore the use of a mixture model to deal with the tail deviation issue. We will also consider directly using the empirical distribution of the number of reads in genes if adequate data are available.

By assuming that each nucleotide can be a SNP independently of other nucleotides and that the sequence divergence (d) is constant across the genome, the binomial distribution was used to model the number of SNPs in a read (subsection 2.2.1). This assumption was made for modelling and computational convenience. However, in reality, SNPs are not distributed randomly in a genome. For instance, it is known that SNPs occur at greater frequencies in 5' or 3' UTRs than at nonsynonymous sites (Andolfatto, 2005). In general, regions of transcripts with greater sequence variation are more likely to produce informative reads than regions with fewer polymorphic sites. To do away with the assumptions of independent nucleotides and constant sequence divergence, we will consider the use of a non-homogeneous Poisson process to model the number of informative reads in SNPs, in which the rate may change across the genome, depending on the density of SNPs.

Another assumption that is likely to be violated in real data is the uniform distribution of reads in genes. For example, Dohm et al. (2008) demonstrated a strong relationship between read counts and GC content along the genome based on Solexa DNA sequencing experiments. Hansen et al. (2010) show non-uniform patterns in the distribution of reads along bases.

4.2.2 Discussion in human data

In section 3.2, we use the idea of population-based sampling to determine the numbers of families vi of each type in the sample, which are later treated as constants in the power calculation. Without population-based sampling, we can consider the vi as random variables and find their distributions. We emphasize here that the population-based sampling assumption is certainly invalid in the typical setting of disease-gene association studies based on a retrospective design. Under that scenario, the reads may be treated as quantitative traits, and their association in the presence of imprinting may be tested using a method such as that proposed in He et al. (2011).

Also in section 3.2, we treat every read as equally important; therefore we combine all reads covering allele A from the father and denote this number by X, and combine all reads covering allele A from the mother and denote this number by Y, that is,

\[
X = \sum_{i=1}^{3} Q_i, \qquad Y = \sum_{i=4}^{6} Q_i.
\]

This is simply for computational convenience. Alternatively, we may also put different weights on different Qi's, either to take the types of families into account or to incorporate other covariate information. He et al. (2011) propose a test for detecting imprinting for quantitative trait data, and discuss that covariates, such as physiologic and environmental variables, may also influence the quantitative trait. Hence, they regress the quantitative trait on the covariates, and regard the residuals as the new quantitative trait. We may consider our observed reads from each trio as a quantitative trait and follow He et al. (2011), but note that the residuals are no longer count data. Hence, binomial inference cannot be used, and other models and methods are needed.

4.2.3 Allele specific methylation

DNA methylation is another epigenetic mechanism that plays a direct role in transcriptional regulation. DNA methylation patterns are tissue-specific. Embryonic stem cells undergoing differentiation show significant changes in DNA methylation patterns (Shoemaker et al., 2010). In addition to DNA methylation pattern differences between cell lines, DNA methylation can also be allele-specific within a cell line and is thus linked to allele-specific gene expression. Allele specific methylation (ASM) means that DNA methylation almost always occurs only on one specific allele. For a heterozygous SNP site, it is said to have ASM if the reads contain only one type of allele. In other words, pi is close to 1 or 0. Thus, ASM may be considered as an extreme case of AEI since pi is far from 0.5. The binomial test can then be used to test whether pi is close to 1 or 0.

4.2.4 Testing for imprinting, AEI and ASM sequentially

In this dissertation, we use the joint test for detecting imprinting and AEI simultaneously. Another idea is to test imprinting, AEI and ASM sequentially, and figure 4.1 is an example flowchart of such a sequential procedure. Testing for imprinting is the first step; the Storer-Kim test can be utilized to detect this effect. A gene is called imprinted if it contains a SNP with p1 ≠ p2. The second step is to test for AEI. The null hypothesis depends on the outcome of the first step, although the binomial test can be used in either scenario. If a gene is not imprinted then p1 = p2 = p, say. Hence the null hypothesis is simply p = 0.5, and the data from both crosses are pooled to test this single-parameter hypothesis. Otherwise, the hypothesis is composite; tests for p1 and p2 may be carried out separately, but this can be problematic, as discussed in the next paragraph. Finally, testing for ASM is the last step. When p1 = p2 (no imprinting) and p ≠ 0.5 (AEI), we can use the binomial test to see whether p is close to 0 or 1. To simplify the null hypothesis, here we assume p is the proportion of counts from the allele that has fewer counts, so ASM implies that p is close to 0. Therefore c in figure 4.1 is close to 0. We do not test for ASM under p1 ≠ p2 (imprinting) since this will cause inconsistency.
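The pooled test in the second step, for the case where imprinting is not detected, can be sketched with an exact two-sided binomial test. The read counts below are hypothetical, and the particular two-sided convention (summing the probabilities of all outcomes no more likely than the observed one) is an assumption; the first step (Storer-Kim test) is not implemented here.

```python
import math

def binom_pmf(k, n, p):
    """Binomial(n, p) probability of exactly k successes."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def binom_test_two_sided(x, n, p=0.5):
    """Exact two-sided p-value: total probability of outcomes no likelier than x."""
    observed = binom_pmf(x, n, p)
    total = sum(pk for pk in (binom_pmf(k, n, p) for k in range(n + 1))
                if pk <= observed * (1.0 + 1e-9))   # small slack for float ties
    return min(1.0, total)

# Hypothetical allele-A read counts and read totals from the two crosses.
x1, n1 = 30, 100
x2, n2 = 34, 100
# With p1 = p2 = p (no imprinting detected), pool the reads and test p = 0.5.
p_value = binom_test_two_sided(x1 + x2, n1 + n2)
print(p_value)
```

A small p-value here would indicate AEI under the pooled model; the composite case p1 ≠ p2 requires the more careful treatment discussed next.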

Figure 4.1: Flowchart of testing imprinting, AEI and ASM sequentially

This sequence of tests can deal with imprinting, AEI and ASM, but there are potential problems. For example, when the null hypothesis of the first step is rejected (i.e., p1 ≠ p2), setting up the hypotheses in the second step is non-standard, and determining the rejection region can be challenging. The null hypothesis H0 : p1 = 0.5 or p2 = 0.5 is in fact conditional on p1 ≠ p2. Therefore, a test for H0 cannot simply be carried out for p1 and p2 separately. To see this, let X follow a binomial(n1, p1) and Y follow a binomial(n2, p2); then the rejection region is {X ≤ k1 or X ≥ n1 − k1} and {Y ≤ k2 or Y ≥ n2 − k2}. In other words, the rejection region is located in the four corners of the XY plane. However, since we already know p1 ≠ p2, the points (x, y) along the diagonal of the XY plane cannot occur, including part of the lower-left and top-right corners. Further study is needed to evaluate the practical feasibility of this sequential testing procedure and to design appropriate tests, especially for the second and third steps, as their hypotheses are conditional on the outcomes of the previous steps.
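A small enumeration may help to see the overlap between the four-corner rejection region and the diagonal. The sample sizes and cut-offs below (n1 = n2 = 10, k1 = k2 = 1) are hypothetical values chosen only to keep the output short, and "on the diagonal" is simplified to exact equality of the two sample proportions; in practice the excluded set would be a band around the diagonal determined by the first-step test.

```python
def corner_rejection_region(n1, k1, n2, k2):
    """Points (x, y) with both X and Y in the tails of their own tests."""
    return [
        (x, y)
        for x in range(n1 + 1)
        for y in range(n2 + 1)
        if (x <= k1 or x >= n1 - k1) and (y <= k2 or y >= n2 - k2)
    ]


def on_diagonal(x, n1, y, n2):
    """True when the sample proportions coincide, i.e. x/n1 == y/n2."""
    return x * n2 == y * n1  # integer arithmetic avoids rounding issues


# Hypothetical sample sizes and cut-offs, for illustration only.
region = corner_rejection_region(10, 1, 10, 1)
diagonal_points = [(x, y) for x, y in region if on_diagonal(x, 10, y, 10)]

print(len(region))       # 16 points, four in each corner
print(diagonal_points)   # [(0, 0), (1, 1), (9, 9), (10, 10)]
```

The diagonal points fall in the lower-left and top-right corners only, which is the part of the rejection region that conflicts with conditioning on p1 ≠ p2.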

4.2.5 A confidence region for p1 and p2

Instead of drawing a conclusion from hypothesis testing, one may construct a 100(1 − α)% confidence region for p1 and p2 centered around the observed p̂1 and p̂2. Suppose that X follows a binomial(n1, p̂1) and Y follows a binomial(n2, p̂2), and X and Y are independent. Let P̂xy be the probability that X = x and Y = y:

\[ \hat{P}_{xy} = \binom{n_1}{x} \hat{p}_1^{\,x} (1 - \hat{p}_1)^{n_1 - x} \binom{n_2}{y} \hat{p}_2^{\,y} (1 - \hat{p}_2)^{n_2 - y}. \]

To find the confidence region, let P̂(k) be the P̂xy's sorted from smallest to largest, k = 1, 2, ···, (n1 + 1)(n2 + 1). Given a significance level α, let s be the largest integer satisfying $\sum_{k=1}^{s} \hat{P}_{(k)} \le \alpha$. The confidence region is then the collection of (x, y) corresponding to k = s + 1, ···, (n1 + 1)(n2 + 1).

Figure 4.2 shows a 99% confidence region (in white) for p1 and p2 of a SNP (UCSC id: uc007iaz.1 831) in Gregg's data (Gregg et al., 2010), where n1 = 110, n2 = 106, p̂1 = 0.6 and p̂2 = 0.26. The center of this confidence region is (p1, p2) = (p̂1, p̂2), and the region does not include (p1, p2) = (0.5, 0.5). This implies that the null hypothesis for this SNP will be rejected by the joint test with α = 0.01. Indeed, the p-value of the joint test for this SNP is 5.53E-07.
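The construction of the confidence region can be sketched as follows, using the values quoted for the example SNP. This is an illustrative sketch rather than the code behind figure 4.2: the function names are hypothetical, and the point (p1, p2) = (0.5, 0.5) is mapped to counts (x, y) = (0.5 n1, 0.5 n2) = (55, 53) on the assumption that the region is read on the proportion scale via x/n1 and y/n2.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes under binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def confidence_region(n1, p1_hat, n2, p2_hat, alpha):
    """Return the set of (x, y) kept after trimming the least likely outcomes.

    Outcomes are sorted by probability from smallest to largest; the
    longest initial run whose total probability is at most alpha is
    discarded, and everything else forms the region.
    """
    px = [binom_pmf(x, n1, p1_hat) for x in range(n1 + 1)]
    py = [binom_pmf(y, n2, p2_hat) for y in range(n2 + 1)]
    cells = sorted(
        (px[x] * py[y], x, y) for x in range(n1 + 1) for y in range(n2 + 1)
    )
    region = set()
    cumulative = 0.0
    for prob, x, y in cells:
        # Probabilities are ascending, so once one outcome no longer
        # fits within alpha, no later outcome can fit either.
        if cumulative + prob <= alpha:
            cumulative += prob
        else:
            region.add((x, y))
    return region


# Values quoted in the text for the example SNP.
region = confidence_region(110, 0.6, 106, 0.26, alpha=0.01)
print((55, 53) in region)  # False: (0.5, 0.5) lies outside the 99% region
print((66, 28) in region)  # True: counts near the observed proportions
```

Because outcomes are discarded from the least likely upward, the retained set carries probability of at least 1 − α under the fitted binomial model.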

For the joint test, rejecting H0 means that a SNP is either imprinted, or has AEI, or both. To further determine whether this SNP is imprinted and whether it has AEI, we divide H1 into three parts (H1.A, H1.B and H1.C in figure 2.2, also shown in figure 4.2), and then draw our conclusion based on which part includes (p̂1, p̂2). However, this procedure is ad hoc. On the other hand, the confidence region constructed as described above may lend itself to further delineating the outcome. Our 99% confidence region in the example includes areas in both H1.B and H1.C but not in H1.A, indicating with high confidence that there is paternal imprinting, but there may not be AEI. However, further study is needed to evaluate the utility of the confidence region procedure.

Figure 4.2: A 99% confidence region (white) for p1 and p2 of a SNP

Bibliography

Andolfatto, P. Adaptive evolution of non-coding DNA in Drosophila. Nature, 437

(7062):1149–1152, 2005.

Babak, T., DeVeale, B., Armour, C., Raymond, C., Cleary, M. A., van der Kooy, D., Johnson, J. M., and Lim, L. P. Global survey of genomic imprinting by transcriptome sequencing. Current Biology, 18(22):1735–1741, 2008.

Bartolomei, M. S. and Ferguson-Smith, A. C. Mammalian genomic imprinting. Cold Spring Harbor Perspectives in Biology, 3(7), 2011.

Cheung, V., Spielman, R., Ewens, K., Weber, T., Morley, M., and Burdick, J. Mapping determinants of human gene expression by regional and genome-wide association. Nature, 437(7063):1365–1369, 2005.

Daelemans, C., Ritchie, M. E., Smits, G., Abu-Amero, S., Sudbery, I. M., Forrest,

M. S., Campino, S., Clark, T. G., Stanier, P., Kwiatkowski, D., Deloukas, P.,

Dermitzakis, E. T., Tavare, S., Moore, G. E., and Dunham, I. High-throughput

analysis of candidate imprinted genes and allele-specific gene expression in the

human term placenta. BMC Genetics, 11, 2010.

Dohm, J. C., Lottaz, C., Borodina, T., and Himmelbauer, H. Substantial biases in

ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids

Research, 36(16), 2008.

Feng, R., Wu, Y., Jang, G. H., Ordovas, J. M., and Arnett, D. A powerful test of

parent-of-origin effects for quantitative traits using haplotypes. PLoS ONE, 6(12),

2011.

Ferguson-Smith, A. C. Genomic imprinting: the emergence of an epigenetic paradigm.

Nature Reviews Genetics, 12(8):565–575, 2011.

Fontanillas, P., Landry, C. R., Wittkopp, P. J., Russ, C., Gruber, J. D., Nusbaum, C.,

and Hartl, D. L. Key considerations for measuring allelic expression on a genomic

scale using high-throughput sequencing. Molecular Ecology, 19(1):212–227, 2010.

Ge, B., Pokholok, D. K., Kwan, T., Grundberg, E., Morcos, L., Verlaan, D. J., Le, J.,

Koka, V., Lam, K. C. L., Gagne, V., Dias, J., Hoberman, R., Montpetit, A., Joly,

M.-M., Harvey, E. J., Sinnett, D., Beaulieu, P., Hamon, R., Graziani, A., Dewar,

K., Harmsen, E., Majewski, J., Goering, H. H. H., Naumova, A. K., Blanchette,

M., Gunderson, K. L., and Pastinen, T. Global patterns of cis variation in human

cells revealed by high-density allelic expression analysis. Nature Genetics, 41(11):

1216–U78, 2009.

Gehring, M., Missirian, V., and Henikoff, S. Genomic analysis of parent-of-origin

allelic expression in Arabidopsis thaliana seeds. PLoS ONE, 6(8), 2011.

Goldberg, A. D., Allis, C. D., and Bernstein, E. Epigenetics: a landscape takes shape.

Cell, 128(4):635–638, 2007.

Gregg, C., Zhang, J., Weissbourd, B., Luo, S., Schroth, G. P., Haig, D., and Dulac,

C. High-resolution analysis of parent-of-origin allelic expression in the mouse brain.

Science, 329(5992):643–648, 2010.

Hansen, K. D., Brenner, S. E., and Dudoit, S. Biases in Illumina transcriptome

sequencing caused by random hexamer priming. Nucleic Acids Research, 38(12),

2010.

Hanson, R., Kobes, S., Lindsay, R., and Knowler, W. Assessment of parent-of-

origin effects in linkage analysis of quantitative traits. American Journal of Human

Genetics, 68(4):951–962, 2001.

He, F., Zhou, J.-Y., Hu, Y.-Q., Sun, F., Yang, J., Lin, S., and Fung, W. K. Detection

of parent-of-origin effects for quantitative traits in complete and incomplete nuclear

families with multiple children. American Journal of Epidemiology, 174(2):226–233,

2011.

Hurd, P. and Nelson, C. Advantages of next-generation sequencing versus the microarray in epigenetic research. Briefings in Functional Genomics and Proteomics, 8(3):174–183, 2009.

Kwan, T., Benovoy, D., Dias, C., Gurd, S., Provencher, C., Beaulieu, P., Hudson,

T. J., Sladek, R., and Majewski, J. Genome-wide analysis of transcript isoform

variation in humans. Nature Genetics, 40(2):225–231, 2008.

Li, B., Ruotti, V., Stewart, R. M., Thomson, J. A., and Dewey, C. N. RNA-Seq

gene expression estimation with read mapping uncertainty. Bioinformatics, 26(4):

493–500, 2010.

Li, E., Beard, C., and Jaenisch, R. Role for DNA methylation in genomic imprinting.

Nature, 366(6453):362–365, 1993.

Liddell, D. Practical Tests of 2 x 2 Contingency Tables. The Statistician, 25:295–304, 1978.

Mei, R., Galipeau, P., Prass, C., Berno, A., Ghandour, G., Patil, N., Wolff, R., Chee,

M., Reid, B., and Lockhart, D. Genome-wide detection of allelic imbalance using

human SNPs and high-density DNA arrays. Genome Research, 10(8):1126–1137,

2000.

Morison, I., Paton, C., and Cleverley, S. The imprinted gene and parent-of-origin

effect database. Nucleic Acids Research, 29(1):275–276, 2001.

Oshlack, A. and Wakefield, M. J. Transcript length bias in RNA-seq data confounds

systems biology. Biology Direct, 4, 2009.

Pfeifer, K. Mechanisms of genomic imprinting. American Journal of Human Genetics,

67(4):777–787, 2000.

Pirie, W. R. and Hamdan, M. A. Some Revised Continuity Corrections for Discrete Distributions. Biometrics, 28:693–701, 1972.

Reynard, L. N., Bui, C., Canty-Laird, E. G., Young, D. A., and Loughlin, J. Expression of the osteoarthritis-associated gene GDF5 is modulated epigenetically by DNA methylation. Human Molecular Genetics, 20(17):3450–3460, 2011.

Sadee, W. Measuring cis-acting regulatory variants genome-wide: new insights into

expression genetics and disease susceptibility. Genome Medicine, 1(12):116, 2009.

Shendure, J. and Ji, H. Next-generation DNA sequencing. Nature Biotechnology, 26

(10):1135–1145, 2008.

Shete, S., Zhou, X., and Amos, C. Genomic imprinting and linkage test for quantitative trait loci in extended pedigrees. American Journal of Human Genetics, 73(5), 2003.

Shoemaker, R., Deng, L., Wang, W., and Zhang, K. Allele-specific methylation is prevalent and is contributed by CpG-SNPs in the human genome. Genome Research, 20(7):883–889, 2010.

Storer, B. and Kim, C. Exact properties of some exact test statistics for comparing

2 binomial proportions. Journal of the American Statistical Association, 85(409):

146–155, 1990.

Suissa, S. and Shuster, J. J. Exact Unconditional Sample Sizes for the 2 x 2 Binomial Trial. Journal of the Royal Statistical Society, Ser. A, 148:317–327, 1985.

Sun, W. A statistical framework for eQTL mapping using RNA-seq data. Biometrics,

68(1):1–11, 2012.

Trapnell, C., Williams, B. A., Pertea, G., Mortazavi, A., Kwan, G., van Baren,

M. J., Salzberg, S. L., Wold, B. J., and Pachter, L. Transcript assembly and

quantification by RNA-Seq reveals unannotated transcripts and isoform switching

during cell differentiation. Nature Biotechnology, 28(5):511–U174, 2010.

Velculescu, V., Zhang, L., Vogelstein, B., and Kinzler, K. Serial analysis of gene

expression. Science, 270(5235):484–487, 1995.

Wang, X., Sun, Q., McGrath, S. D., Mardis, E. R., Soloway, P. D., and Clark, A. G.

Transcriptome-wide identification of novel imprinted genes in neonatal mouse brain.

PLoS ONE, 3(12), 2008.

Wang, Z., Gerstein, M., and Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics, 10(1):57–63, 2009.

Weinberg, C. Methods for detection of parent-of-origin effects in genetic studies of

case-parents triads. American Journal of Human Genetics, 65(1):229–235, 1999.

Weinhold, B. Epigenetics - the science of change. Environmental Health Perspectives,

114(3):A160–A167, 2006.

Xu, X., Wang, H., Zhu, M., Sun, Y., Tao, Y., He, Q., Wang, J., Chen, L., and Saffen, D. Next-generation DNA sequencing-based assay for measuring allelic expression imbalance (AEI) of candidate neuropsychiatric disorder genes in human brain. BMC Genomics, 12, 2011.

Yates, F. Contingency tables involving small numbers and the chi-square test. Journal of the Royal Statistical Society, 1:217–235, 1934.

Zhang, K., Li, J. B., Gao, Y., Egli, D., Xie, B., Deng, J., Li, Z., Lee, J.-H., Aach, J.,

Leproust, E. M., Eggan, K., and Church, G. M. Digital RNA allelotyping reveals

tissue-specific and allele-specific gene expression in human. Nature Methods, 6(8):

613–8, 2009.

Zhou, J.-Y., Hu, Y.-Q., Lin, S., and Fung, W. K. Detection of parent-of-origin effects

based on complete and incomplete nuclear families with multiple affected children.

Human Heredity, 67(1):1–12, 2009.
