The Genetics of Variation in Gene Expression
Chris J. Cotsapas
A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy
University of New South Wales
2005

Abstract
The majority of genetic differences between species and individuals have been hypothesised to impact on the regulation, rather than the structure, of genes. As the details of genetic variation are uncovered by the various genome sequencing projects, understanding the functional effects on gene regulation will be key to uncovering the molecular mechanisms underlying the genesis and inheritance of common phenotypes, such as complex human disease and commercially important traits in plants and animals. Unlike coding sequence polymorphisms, genetic variants affecting gene expression will reside in the transcriptional machinery and its regulatory inputs. As these are largely specific to cell– or tissue–types, we would expect that regulatory variants will also affect final mRNA levels in a tissue specific manner. Genetic variation between individuals may therefore be more complex than the sum total of sequence differences between them. Demonstrating this hypothesis is the main focus of this thesis. We use microarrays to measure mRNA levels of approximately 22,000 transcripts in inbred and recombinant inbred strains of mice, and present compelling evidence that the genetic influences on these levels are tissue–specific in at least 85% of cases. We uncover two loci which apparently influence transcript levels of multiple genes in a tissue–specific manner. We also present evidence that failure of microarray data normalisation may cause spurious linkage of expression phenotypes leading to erroneous biological conclusions, and detail a novel, extensible mathematical framework for performing tailored normalisation which can remove such systematic bias. The wider context of these results is then discussed.
Contents

1 Introduction
  1.1 Summary
  1.2 Detection strategies
    1.2.1 Allelic discrimination
    1.2.2 Genetical genomics
  1.3 Genetics of regulatory variation
    1.3.1 Extent of genetic effects
    1.3.2 cis, trans, and master regulators
    1.3.3 Heritability, epistasis, and the number of determinants
    1.3.4 Tissue specificity
  1.4 Biological implications
  1.5 Outline

2 Microarray normalisation for genetical genomics
  2.1 Introduction
    2.1.1 A note on microarray data visualisation
  2.2 Normalisation – mathematical bias removal
    2.2.1 Scaling
    2.2.2 Analysis of Variance
    2.2.3 Principal Components Analysis
    2.2.4 Intensity–dependent smoothing
  2.3 Correcting multiple non–linear biases in microarray data
    2.3.1 Non–linear artefacts
    2.3.2 Additive model normalisation
  2.4 Failure of normalisation in genetical genomics experiments
    2.4.1 Experimental design
    2.4.2 Lack of agreement between normalisation results
  2.5 Conclusions
  2.6 Materials and Methods
    2.6.1 Sample handling
    2.6.2 Expression profiling
    2.6.3 Normalisation
    2.6.4 Linkage analysis

3 Genetic influence on mRNA levels is tissue specific
  3.1 Introduction
  3.2 Experimental design
  3.3 Tissue specificity of influences on gene expression
    3.3.1 The majority of genetic influences are tissue specific
    3.3.2 Expression levels do not reflect complexity of influences
  3.4 Functional bias in influenced transcripts
    3.4.1 Over–representation of functional themes
  3.5 Discussion
  3.6 Materials and Methods
    3.6.1 RNA preparation
    3.6.2 Microarray hybridisation and washing
    3.6.3 Data processing
    3.6.4 Overlap analysis
    3.6.5 Detecting changes in expression between strains
    3.6.6 Gene Ontology analysis

4 Dissection of genetic influences on mRNA levels in a Recombinant Inbred panel
  4.1 Introduction
  4.2 Experimental design
    4.2.1 Moderated linkage statistics
  4.3 Independent tissue analysis
    4.3.1 Linkage complexity
  4.4 Expression level correlation analysis detects biological themes under genetic influence
    4.4.1 Correlation analysis of genetically variant expression levels identifies biological pathways
    4.4.2 Correlated clusters have common genetic determinants
  4.5 Discussion
  4.6 Materials and Methods
    4.6.1 RNA preparation
    4.6.2 Microarray hybridisation and washing
    4.6.3 Data processing
    4.6.4 Linkage analysis

5 Resolvable genetic determinants of mRNA levels are tissue specific
  5.1 Introduction
  5.2 Experimental design
  5.3 Tissue specificity of parentally influenced genes
  5.4 Transgressive segregation of mRNA levels
  5.5 Some loci influence multiple transcript mRNA levels
    5.5.1 A region on chromosome 1 influences multiple transcripts in brain
    5.5.2 A region on chromosome 8 influences transcripts in all tissues
  5.6 Discussion
  5.7 Materials and Methods
    5.7.1 RNA preparation
    5.7.2 Microarray hybridisation and washing
    5.7.3 Data processing
    5.7.4 Linkage analysis
    5.7.5 Overlap analysis

6 Discussion
  6.1 Results summary
  6.2 Defining regulatory circuits
    6.2.1 Regulatory interactions as networks
    6.2.2 Tissue specific regulatory interactions
    6.2.3 Mapping regulatory metaphenotypes
  6.3 The implications of tissue specificity
    6.3.1 Understanding relationships between individuals

Literature cited

A Significant linkages in the BxD panel

B Over–represented GO terms in correlation clusters
List of Tables

1.1 Summary of genetic influences on gene expression levels

2.1 Effect of normalisation on gene identification
2.2 Effect of normalisation on locus identification

3.1 Genetic influences on mRNA levels in each tissue
3.2 Expression of genetically influenced genes across tissues
3.3 Extrapolating genetic influence between tissues
3.4 Enriched Gene Ontology terms for genetically influenced genes

4.1 Linkage results in three tissues
4.2 Complexity of expression variation
4.3 Linkage aggregation in RI brain
4.4 Linkage aggregation in RI kidney
4.5 Linkage aggregation in RI liver

5.1 Linkage results for genetically influenced genes
5.2 cis and trans effects on gene expression
5.3 Genetic dissection of transgressive mRNA levels
5.4 cis and trans effects on transgressively segregating genes
5.5 A locus affecting multiple transcripts in brain
5.6 Loci affecting transcript levels in multiple tissues
5.7 Linkage significances for genes affected by chromosome 8

A.1 Gene linkages in brain
A.2 Gene linkages in kidney
A.3 Gene linkages in liver

B.1 Over–represented GO terms in BxD brain
B.2 Over–represented GO terms in BxD kidney
B.3 Over–represented GO terms in BxD liver
List of Figures

2.1 Basic microarray visualisation
2.2 Non–linear biases in microarray data
2.3 Systematic biases in microarrays
2.4 Systematic biases in microarrays
2.5 Deposition order biases in each channel

3.1 Tissue–specific genetic influence on mRNA levels

4.1 Significance comparisons for linkage analysis

5.1 Overlap of linkage results for genetically influenced genes
5.2 Overlap for transgressive genes
5.3 Effects mapping to chromosome 1
5.4 Linkage scan showing effects in all tissues
Acknowledgments
This thesis is the culmination of four years of work with Professor Peter Little, without whom it would have been impossible. I owe him a great debt of gratitude for years of support and friendship, but most of all for the patience he exhibited whilst I transformed myself from dilettante to scientist. The trust and friendship he has extended to me have been an honour and a privilege, and I can only hope they have not been misplaced.

Dr. Rohan Williams has played a central role in this work, as collaborator, office–mate, food buddy, aesthetic sounding–board and purveyor of exotic quantitative methods. His broad knowledge and burning curiosity made our wide–ranging scientific conversations fascinating, and he has taught me much. The Gene Ontology analyses presented in Chapter 3 are his work, and many points in the Discussion have arisen as a consequence of brain–storming sessions.

The statistical aspects of this work are the fruit of a collaboration with the School of Mathematics, UNSW, with Professors William Dunsmuir and Matt Wand, and especially Dr. David Nott. David has gently tutored me into at least passing competence with elementary statistics; the B–statistic modifications in Chapter 3 are his work, and he has had a major influence on the normalisation methods presented in Chapter 2. He has kindly proof–read and corrected some parts of this work. The additive model normalisation procedure in Chapter 2 was devised by Professor Matt Wand, who also wrote an early implementation.

The material presented here rests, directly or indirectly, on the work of other members of our laboratory. The expression data in Chapter 3 were generated by Jeremy Pulvers as part of his Honours project; Eva Chan kindly provided genetic map data; and Mark Cowley has contributed program code for several analyses. They have provided a stimulating environment to work in, and I am fortunate to count them as friends and colleagues.
Bronwyn Robertson and Geoff Kornfeld of the Ramaciotti Centre for Gene Function Analysis, UNSW, provided microarrays and facilities for the experiments described here. A/Prof. Russell Standish of the High Performance Computing Support Unit at UNSW wrote an optimised implementation of the bootstrapped Student’s t–test used in Chapter 4, and kindly arranged for compute time on the Barossa cluster. John Schimenti (Jackson Laboratories) and Maja Bucan (University of Pennsylvania) and their laboratories kindly assisted in obtaining BxD mouse strains. They all have my sincere thanks.

On a personal note, I would never have embarked on so ambitious a move without the support of my family. They had always taught me that everything was possible and, although I suspect this is not what they had in mind, were directly responsible for me moving hemispheres on a whimsical decision to follow science. Friends I have made in Australia made me welcome and provided respite and refreshment over the last four years: Neil Saunders, Stephen Harrop, and Greg Tyrelle as Team Linux; Kai and Kerry Schindlmayr for endless friendship and hospitality; Julie Lim, Emma Collinson, and Yael Azriel were my female consciences; Jo Gibson forbore to judge, and Laura Coleman taught me I was not alone. All have my thanks and love.
Chapter 1
Introduction
1.1 Summary
The majority of genetic differences between species and individuals impact on the regulation, rather than the structure, of genes [King and Wilson, 1975; Paigen, 1979]. As the details of genetic variation are uncovered by the various genome sequencing projects, understanding the functional effects on gene regulation will be key to uncovering the molecular mechanisms underlying the genesis and inheritance of common phenotypes, such as complex human disease [Gabellini et al., 2004; Theuns and Van Broeckhoven, 2000] and commercially important traits in plants and animals.

The transcriptome is presently the most amenable level at which to study gene regulation and its genetics at a global level, although we still know little of the regulatory forces which shape these processes. This is due in part to the difficulty in predicting regulatory variants from sequence, and in part to the absence until recently of high–throughput quantitative methods of assaying gene expression [Brenner et al., 2000; Schena et al., 1995; Velculescu et al., 1995]. As the latter become available, established methods and genetic resources may be used to dissect the genetics of regulation, by treating expression levels as quantitative traits with genetic determinants.

Unlike coding sequence polymorphisms, genetic variants affecting gene expression will reside in the transcriptional machinery and its regulatory inputs. As these are largely specific to cell– or tissue–types [reviewed by Wray et al., 2003], we would expect that regulatory variants will also affect expression levels in a tissue specific manner. It is therefore probable that the effects of genetic variation between individuals are more complex than the sum total of sequence differences would imply. Demonstrating the validity of this hypothesis is the main focus of this thesis.
In this chapter we review the methods and results of large–scale quantitative studies of genetic effects on gene expression levels, followed by a consideration of their phenotypic consequences and tissue specificity. We then describe the experimental designs used throughout this work. In the following chapters we present a series of experiments using mRNA abundances from inbred and recombinant inbred mice as molecular phenotypes in genetic mapping experiments, demonstrating the tissue specificity of heritable modulations on gene expression. We also describe a new method for normalising microarray data which removes biases capable of generating artefactual genetic signal in these experiments. We conclude by arguing that this evidence supports rethinking genetic differences between individuals as a highly contextual measure of functional change, as opposed to a simple proportion of sequence divergence, and by suggesting that this concept may also be useful in characterising between–species differences.
1.2 Detection strategies
A genetic variant affecting transcriptional regulation will alter either the effective concentration of mRNA transcribed from a gene, or the spatio–temporal distribution of message. Stable differences of gene expression can therefore be used to infer the presence of such variants. This has been done either by looking for differences in transcription between alleles in heterozygous individuals (allelic discrimination), or by treating overall expression levels as heritable traits in genetically distinct individuals in a population (genetical genomics [Jansen and Nap, 2001]). In the latter case, a natural population may be used to “phenotype” for expression levels, and a segregating population to map their genetic determinants in the genome. This approach is equivalent to quantitative trait locus (QTL) analysis, from which analytical methods have been extensively borrowed.
1.2.1 Allelic discrimination
Polymorphisms within expressed sequence can be used to quantitate the expression levels of alleles in either heterozygous individuals [Cowles et al., 2002; Pastinen et al., 2004; Wittkopp et al., 2004; Yan et al., 2002], polyploids [Schaart et al., 2005], or natural populations [Lo et al., 2003], by adapting genotyping technologies such as single–base primer extension [Cowles et al., 2002; Pastinen et al., 2004; Yan et al., 2002], pyrosequencing [Schaart et al., 2005; Wittkopp et al., 2004], genotyping arrays [Lo et al., 2003], and mass spectrometry [Knight et al., 2003] for use with cDNA.

The method lends itself to detection of effects in cis (for example, variants in promoter binding sequences), which will be in linkage disequilibrium with the expressed marker. However, effects in trans can only be formally excluded in obligatory heterozygotes, such as intercross [Cowles et al., 2002] or interspecific [Wittkopp et al., 2004] individuals derived from homozygous backgrounds. They can be inferred by comparison to the parental strains [Wittkopp et al., 2004]: an allelic imbalance between the parentals, but not within the F1 generation, is indicative of a trans effect. In outbred populations such as the CEPH human reference pedigrees [Dausset et al., 1990], where individuals are not necessarily heterozygous at all loci, the assumption that effects are due to cis variants cannot be made without testing for either association or haplotype transmission within pedigrees [Pastinen et al. 2004; Yan et al. 2002, c.f. Lo et al. 2003].

These techniques generally allow resolution of small relative differences between allele abundances: Wittkopp et al. [2004] resolve 1.1–fold or better differences in 29 genes in interspecific Drosophila melanogaster × D. simulans crosses; Knight et al. [2003] measure a 1.3–fold allelic expression imbalance for LTA in humans; and Cowles et al. [2002] measure 1.5–fold or better differences for 69 genes in F1 mouse inbred strain crosses. Lo et al. [2003] sacrifice sensitivity for throughput, screening over 1000 polymorphisms in human tissues using array–based formats, but resolving only 2–fold or greater changes in expression. This tradeoff becomes a recurrent theme, as the other, more sensitive approaches have limited throughput due to format constraints.

The main limitation of the allelic discrimination approach is the reliance on expressed polymorphisms, which implies knowledge of coding sequence variations within the study populations.
Whilst such data are being generated for humans [The International HapMap Consortium, 2003] and mice [Wade et al., 2002; Wiltshire et al., 2003], they are not yet available for many other organisms, and are unlikely ever to be for those not used as genetic models. This raises the problem of de novo identification of suitable polymorphisms for each gene under study, which can be expensive and time consuming. Furthermore, cis–acting variants may be in elements some distance from the expressed polymorphism, and recombinations in this interval will reduce power in detecting effects. As each gene must be tested separately, massively–parallel execution is also problematic. Further, the approach does not distinguish between genetic and epigenetic effects, so that prior knowledge, such as imprinting status, is also required. These problems make allelic discrimination cumbersome at the whole–transcriptome level, but the high specificity and sensitivity make these methods ideal for focussing on particular sets of genes.
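The parental versus F1 comparison used to separate cis from trans effects can be sketched in a few lines. This is a minimal illustration only, not a procedure used in this thesis: the function name `classify_regulatory_effect`, its arguments, and the fixed 1.5–fold threshold are assumptions made for the example, and a real analysis would replace the threshold with a replicate–based statistical test. The sketch also does not attempt to separate combined cis and trans contributions.

```python
import math


def classify_regulatory_effect(parent_a, parent_b, f1_allele_a, f1_allele_b,
                               min_fold=1.5):
    """Label a gene 'cis', 'trans' or 'none' from expression measurements.

    parent_a, parent_b: expression of the gene in the two homozygous parents.
    f1_allele_a, f1_allele_b: allele-specific expression within the F1 hybrid.
    min_fold: hypothetical fold-change cut-off (an assumption of this sketch).
    """
    threshold = math.log2(min_fold)
    parental_imbalance = abs(math.log2(parent_a / parent_b)) >= threshold
    f1_imbalance = abs(math.log2(f1_allele_a / f1_allele_b)) >= threshold

    if f1_imbalance:
        # Both alleles share one cellular environment in the F1, so an
        # imbalance that persists there points to a cis-acting variant.
        return "cis"
    if parental_imbalance:
        # Difference between parents that vanishes in the F1: trans effect.
        return "trans"
    return "none"


print(classify_regulatory_effect(200.0, 100.0, 150.0, 75.0))   # cis
print(classify_regulatory_effect(200.0, 100.0, 100.0, 100.0))  # trans
```

Working on the log scale treats a two–fold increase and a two–fold decrease symmetrically, which is why the comparison is made on log ratios rather than raw differences.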
1.2.2 Genetical genomics
The second approach is to treat overall expression levels in a sample as a quantitative trait, without differentiating between alleles. There is no requirement for prior knowledge of the genes under consideration, so generic expressed sequence or cDNA resources may be used in high–throughput expression formats, such as microarrays [Schena et al., 1995], SAGE [Velculescu et al., 1995] or MPSS [Brenner et al., 2000], widening the scope to include non–model organisms. Extensive genetic and physical maps are already available for common model organisms, as is genotypic information for stable populations such as recombinant inbred mouse strains [Bailey, 1971] and CEPH family cell lines [Dausset et al., 1990]. The availability of high–throughput genotyping platforms also makes these experiments feasible in transient intercross or backcross populations [e.g. Schadt et al., 2003]. As most or all the transcriptome may be queried, there is no bias from preferential target gene selection.

Two experimental designs are possible: comparing individuals within a population [Jin et al., 2001; Kluger et al., 2004; Oleksiak et al., 2002] to determine heritability of differences in gene expression levels, or mapping such differences in a segregating population [Brem et al., 2002; Morley et al., 2004; Schadt et al., 2003; Yvert et al., 2003] as QTLs (eQTL [Schadt et al., 2003]). Sampling from an outbred population captures a large proportion of the natural variation of an organism. However, the absence of defined
relationships between individuals, and potential stratification effects, allow little resolving power beyond estimating the proportion of affected genes and a rough measure of heritability [Falconer, 1989]. A segregating population offers the advantage of dissecting the number and relative contribution of determinants, at the expense of decreasing the proportion of total genetic variation sampled. The two strategies may be used in combination, for example to limit mapping attempts to transcripts with highly heritable expression levels [Morley et al., 2004]. Whilst this decreases the multiple testing involved in mapping thousands of expression phenotypes, it will exclude the apparently common transgressive segregation events [e.g. Brem and Kruglyak, 2005; Chesler et al., 2005].

High–throughput expression assays allow us to look for effects at the level of molecular processes, rather than of single genes, which may be more biologically relevant. Clustering [Eisen et al., 1998; Gasch and Eisen, 2002], dimensionality reduction [Alter et al., 2000, 2003], and information theory [Basso et al., 2005], amongst other computational approaches [Deng et al., 2005; Lipan and Wong, 2005; Perrin et al., 2003; Qian et al., 2003; Tavazoie et al., 1999], have already been applied to large expression datasets to detect correlations in patterns of gene expression across conditions indicative of group co–regulation and functional relationship. These methods allow us to detect novel links between groups of genes, which is particularly valuable for the majority of transcripts in large expressed collections, for which little information is available. In a genetical genomics setting, Yvert et al. [2003] use mean expression levels of groups identified by hierarchical clustering [Eisen et al., 1998] to represent overall trends in co–regulated genes; and Kluger et al.
[2004] use principal components to collapse expression of genes in biological pathways from KEGG [Kanehisa et al., 2004] and BioCarta, rather than interrogate expression levels per se. Downstream applications of such approaches [Bing and Hoeschele, 2005; Li et al., 2005] may reveal any genetic perturbations of transcriptome subnetworks, which may be more informative in terms of phenotypic consequences than changes to individual genes. At a methodological level, mapping trends in groups of genes rather than individual gene expression profiles will substantially
reduce the multiple testing problem [Brem et al., 2002; Yvert et al., 2003] inherent in genetical genomics.

The primary limitation of segregating populations is the obligatory use of limited genetic backgrounds, so that only a proportion of the variants present in a species may be assayed. Solutions to address this problem, such as recombinant inbred panels derived from multiple genetic backgrounds [Churchill et al., 2004], are being developed, although these are likely to be limited to common model organisms for the foreseeable future and are inapplicable to humans.

Several aspects of methodology are far from resolved. Data normalisation to remove systematic artefacts, particularly for microarrays, has mostly relied on simplistic data transformations such as mean standardising or scaling, which are known to be inadequate even for simple outlier detection experiments [Williams et al., 2006, in press]. Residual structure within the data is therefore often mapped, in the mistaken belief that it represents large–scale biological phenomena. This artefactual signal detection is exacerbated by use of established linkage statistics without a thorough examination of their applicability to such datasets. The use of non–significant “best” likelihood ratio statistic (LRS) scores [Churchill and Doerge, 1994; Doerge and Churchill, 1996] by Chesler et al. [2005] as a measure of effects in trans (discussed in the next section), for example, promotes the overestimation of such phenomena. Determination of significance may also be inaccurate when using nominal p–values for certain statistical tests, which ignore the small sample size and non–normal nature of the underlying data. This may be particularly damaging in combination with the universal lack of minimum intensity data thresholding based on negative control sequences present on arrays, as the high variance observed at low intensities will inflate test statistics [Smyth, 2004].
These issues are discussed in detail in Chapter 2, where we present an extensible normalisation framework and associated methodology to address these problems.
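To make the idea of mapping an expression level as a quantitative trait concrete, the following sketch scans one transcript against a set of markers and derives an empirical genome–wide threshold by permutation, in the spirit of Churchill and Doerge [1994]. It is an illustration under stated assumptions rather than the analysis used in this thesis: genotypes are assumed to fall into two homozygous classes coded 0 and 1 (as in recombinant inbred strains), the score is a plain absolute difference in class means standing in for an LRS, and the function names `marker_score`, `genome_scan` and `permutation_threshold` are hypothetical.

```python
import random
from statistics import mean


def marker_score(expression, genotypes):
    """Absolute difference in mean expression between genotype classes 0 and 1."""
    class_0 = [e for e, g in zip(expression, genotypes) if g == 0]
    class_1 = [e for e, g in zip(expression, genotypes) if g == 1]
    if not class_0 or not class_1:
        return 0.0  # marker is uninformative in this sample
    return abs(mean(class_0) - mean(class_1))


def genome_scan(expression, genotype_matrix):
    """Score one transcript against every marker.

    genotype_matrix[m][i] is the genotype (0 or 1) of strain i at marker m.
    """
    return [marker_score(expression, marker) for marker in genotype_matrix]


def permutation_threshold(expression, genotype_matrix,
                          n_perm=1000, alpha=0.05, seed=0):
    """Empirical threshold: the (1 - alpha) quantile of the best score
    obtained anywhere in the genome after shuffling expression values
    across strains, which breaks any true genotype-phenotype link."""
    rng = random.Random(seed)
    shuffled = list(expression)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        maxima.append(max(genome_scan(shuffled, genotype_matrix)))
    maxima.sort()
    index = min(len(maxima) - 1, int((1 - alpha) * len(maxima)))
    return maxima[index]


# Toy data: eight strains, three markers (all values are made up).
genotypes = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
]
expression = [5.1, 4.9, 5.0, 5.2, 7.0, 7.1, 6.9, 7.2]
scores = genome_scan(expression, genotypes)
threshold = permutation_threshold(expression, genotypes, n_perm=500)
```

Because the threshold is taken from the maximum score across all markers in each permutation, it accounts for the number of markers tested. With very small panels, even a clean separation between genotype classes may fail to exceed it, which echoes the small sample size concern raised above.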
1.3 Genetics of regulatory variation
The methods described above allow us to begin dissecting the genetic component of gene expression levels, and interpreting these results will inform our understanding of transcriptome regulation. A summary of the current literature is given in Table 1.1. Results from Wittkopp et al. [2004] are included for completeness, although it should be noted that they compared two species of drosophilids, D. melanogaster and D. simulans.
1.3.1 Extent of genetic effects
The primary observation is the proportion of transcripts whose expression is under clear genetic control. There is a striking difference between results from allelic discrimination and genetical genomics approaches, the former indicating that more than half the genes investigated are influenced. Whilst this may be due to the increased sensitivity of the underlying experimental methods, these studies generally examine a small number of carefully selected genes, and will therefore tend to overestimate such effects. This notion is reinforced by the observation that, with the exception of the results of Lo et al. [2003], there appears to be a decreasing trend in the proportion of affected genes within species, as the number of genes surveyed increases (Table 1.1).

The more inclusive genetical genomics studies suggest that 10 – 28% of transcripts are subject to genetic modulation. The false discovery rates (FDR) vary between these reports, however, as does the method used (if any) to filter transcripts considered for analysis. Monks et al. [2004], for example, consider genes expressed at reliable detection levels; Morley et al. [2004] require differential expression in grandparents of CEPH families before considering parent–children heritability; and Chesler et al. [2005] restrict themselves to transcripts with a certain heritability. A rough estimation from all these results would therefore indicate that one third to one half of transcripts will be under some genetic expression influence, subject to the caveats of tissue specificity and temporal modulation of these effects. A comparison to the number of genes with heritable expression levels, however,
Species†       Genes     Affected      FDR    cis         trans         Refs
(Modality‡)    studied   (%)           %      (%)         (%)
Dm (AD)⋆♮      29        29 (100)      N/A    28 (97)     16 (55)       Wittkopp et al., 2004
Hs (AD)        13        6 (46)        N/A    6 (100)     –             Yan et al., 2002
Hs (AD)        15        7 (47)        N/A    7 (100)     –             Bray et al., 2003
Hs (AD)§       602       326 (54)      N/A    –           –             Lo et al., 2003
Hs (AD)        129       23 (18)       N/A    23 (100)    –             Pastinen et al., 2004
Mm (AD)        69        4 (6)         N/A    4 (100)     –             Cowles et al., 2002
Mm (NP)        7,169     73 (1)               –           –             Sandberg et al., 2000
Dm (NP)        3,931     ()                   –           –             Jin et al., 2001
Fh (NP)        907       161 (18)             –           –             Oleksiak et al., 2002
Sc (NP)        5,908     433 (7)              –           –             Townsend et al., 2003
Fh (NP)        192       92 (48)              –           –             Whitehead and Crawford, 2005
Hs (SP)        7,861     2,123 (27)           ()          ()            Schadt et al., 2003
Hs (SP)⋆       3,554     142 (4)              32 (23)     115 (81)      Morley et al., 2004
Hs (SP)        3,554     984 (28)             ?? ()       ?? ()         Morley et al., 2004
Hs (SP)        2,340     762 (31)      5      8 (26)      12 (39)       Monks et al., 2004
Sc (SP)        1,528     570 (37)             205 (36)    365 (64)      Brem et al., 2002
Sc (SP)        6,215     1716 (28)            (20)        992? (75)     Yvert et al., 2003
Mm (SP)        12,422    1,218 (10)           162 (13)    1,056 (87)    Bystrykh et al., 2005
Mm (SP)        608       101 (17)      25     92 (91)     9 (9)         Chesler et al., 2005
Rn1 (SP)◦      15,923    1,833 (12)    75     622 (34)    1,211 (66)    Hubner et al., 2005
Rn2 (SP)◦      15,923    2,051 (13)    64     800 (39)    1,251 (61)    Hubner et al., 2005

Table 1.1: Study summary of significant genetic influences on gene expression within a population. Only effects demonstrating linkage are reported for genetical genomics studies.
† Fh: Fundulus heteroclitus; Dm: Drosophila melanogaster; Hs: Homo sapiens; Mm: Mus musculus; Rn: Rattus norvegicus; Sc: Saccharomyces cerevisiae.
‡ AD: allelic discrimination; NP: natural population; SP: segregating population.
⋆ These studies report both cis and trans influences on transcript levels, so that the sum of effects is greater than the number of genes influenced.
♮ This is an inter–, rather than intra–, specific population, between D. melanogaster and D. simulans.
§ Lo et al. [2003] cannot distinguish between cis and trans effects.
◦ Hubner et al. [2005] provide results on fat tissue (Rn1) and kidney (Rn2) at several genome–wide thresholds: P < 0.05 is used here for consistency with other studies [Hubner et al., 2005].
suggests that this is still an underestimate (discussed in section 1.3.3). Transgressive segregation [Brem and Kruglyak, 2005; Chesler et al., 2005] appears common for gene expression levels. In yeast, Brem and Kruglyak
[2005] estimate that up to 59% of transcripts display expression differences in segregants of a cross between laboratory and wild strains, and Chesler et al. [2005] report that the phenomenon is common in laboratory mouse crosses. This suggests that multiple determinants of gene expression are common, so that expression levels may be regarded as complex quantitative traits.
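Transgressive segregation is usually taken to mean that progeny show phenotypes outside the range spanned by the parental strains. A minimal sketch of how such transcripts might be flagged follows. It is not the procedure of the studies cited above: the function name `transgressive_strains`, the dictionary layout, and the fixed `margin` are assumptions for illustration, and a real analysis would use a statistical test that accounts for measurement error.

```python
def transgressive_strains(parent_values, progeny_values, margin=0.0):
    """Return the progeny strains whose expression lies outside the
    parental range, widened on both sides by `margin`.

    parent_values: expression of one transcript in each parental strain.
    progeny_values: mapping of progeny strain name -> expression level.
    margin: hypothetical tolerance for measurement noise (an assumption).
    """
    low = min(parent_values) - margin
    high = max(parent_values) + margin
    return sorted(name for name, value in progeny_values.items()
                  if value < low or value > high)


# Made-up log-scale expression values for one transcript.
parents = [8.0, 8.4]
progeny = {"RI_1": 7.1, "RI_2": 8.2, "RI_3": 9.6}
print(transgressive_strains(parents, progeny, margin=0.5))  # ['RI_1', 'RI_3']
```

Progeny can only exceed both parents if each parent carries alleles acting in opposite directions at different loci, which is why the phenomenon is read as evidence for multiple determinants.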
1.3.2 cis, trans, and master regulators
In segregating population experiments, we can dissect the aetiology of regulatory variation by estimating the proportions of variants genetically in cis and in trans to the affected genes. The former will reside in regulatory elements, such as promoter/enhancer binding sites or transcript stability motifs [Cartegni et al., 2002]; trans–acting factors are generally thought of as transcription factor alleles [Chesler et al., 2005], although they may be co–factors, accessory proteins, or reside in more indirect regulatory components [Yvert et al., 2003], for example altering propagation efficiency in a signalling cascade resulting in transcriptional changes.

Of nine studies, eight report 60 – 90% trans–acting determinants; Chesler et al. [2005] alone report 91% cis effects, which may be due to the comparatively small number of genes passing their filtering process. This suggests that the most common mechanism for expression level variation is through regulatory factors affecting transcription, rather than changes to the control sequences of a transcript. Since multiple transcripts will be modulated by the same regulatory machinery, the number of cis and trans variants may be the same, the discrepancy in effect frequency being attributable to the pleiotropy of the latter.

Pleiotropic effects can be detected by examining the map locations of trans effects on all transcripts, with those coincident identifying loci containing “master regulators” of transcription [Morley et al., 2004]. Chesler et al. [2005] then go on to show that these loci also give “best” (i.e. non–significant) linkage scores for a large number of other transcripts irrespective of their heritability, suggesting that up to 10% of the transcriptome may be modulated by any one of these loci. Schadt et al. [2003] indicate seven such
linkage “hotspots” containing more than 1% of approximately 4,300 QTLs detected. In contrast, Morley et al. [2004] find two hotspots affecting more than six transcripts each: plausible indications of, for example, a variant transcription factor. Monks et al. [2004] find 12 such loci, none influencing more than six transcripts, in fifteen human pedigrees. The concept of master regulators influencing thousands of transcripts must therefore be treated with scepticism, particularly where a lack of significance coupled with low heritability may suggest that many of these linkages are spurious. Observations from our laboratory further suggest that even highly significant linkage may be artefactual: failure of normalisation allows subtle data structure to remain in expression value matrices, which will by chance be described by some genotype pattern in the genetic map and therefore produce “linkage” signal [Chapter 2 and Williams et al., 2006, in press]. This possibility would also explain the inconsistency of locations for these master regulators between experiments (e.g. Schadt et al. [2003] c.f. Chesler et al. [2005]).
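Identifying hotspots amounts to counting how many linkages fall in each genomic interval and reporting the intervals that hold more than a chosen share of the total. The sketch below illustrates this under several assumptions that are not taken from the studies cited: positions are given in centimorgans, intervals are fixed 10 cM bins, and the names `find_hotspots`, `bin_size_cm` and `min_fraction` are hypothetical. The 1% default mirrors the criterion quoted above for Schadt et al. [2003].

```python
from collections import Counter


def find_hotspots(linkages, bin_size_cm=10.0, min_fraction=0.01):
    """Return {(chromosome, bin_index): count} for bins holding more than
    `min_fraction` of all linkages.

    linkages: iterable of (transcript_id, chromosome, position_cm) tuples,
    one per detected linkage.
    """
    linkages = list(linkages)
    counts = Counter(
        (chromosome, int(position // bin_size_cm))
        for _, chromosome, position in linkages
    )
    cutoff = min_fraction * len(linkages)
    return {bin_key: n for bin_key, n in counts.items() if n > cutoff}


# Made-up linkages: three transcripts map to the same 10 cM bin on chromosome 1.
example = [
    ("t1", "1", 12.0),
    ("t2", "1", 15.5),
    ("t3", "1", 18.0),
    ("t4", "8", 40.0),
    ("t5", "X", 3.0),
]
print(find_hotspots(example, min_fraction=0.5))  # {('1', 1): 3}
```

A count of this kind says nothing about whether the underlying linkages are genuine: if residual structure in the expression matrix happens to match a genotype pattern, many spurious linkages will pile up in the same bin.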
1.3.3 Heritability, epistasis, and the number of determinants
The ability to map determinants of a quantitative trait is dependent on both its heritability (the proportion of variance attributable to genetic factors), and its complexity (the number of contributing factors) [Falconer, 1989]. Classical QTLs [Korstanje and Paigen, 2002] are thought to follow an exponential distribution (the so–called –model [Morton, 1998]) rather than Fisher’s original proposition of an infinitesimal model [Fisher, 1930], such that a few loci of large effect account for the majority of genetic variance, with tens or hundreds supplying the balance [reviewed in Farrall, 2004]. A complication occurs if two loci interact to produce an overall effect, in a process called epistasis [Falconer, 1989; Lynch and Walsh, 1998]. Since the effects are, by definition, non–additive, each locus will, when considered in isolation, contribute only a small fraction of the heritable variance. QTL of this nature will tend to follow an infinitesimal model, and will not show statistically significant linkage.
The observation that a substantial proportion of highly heritable expression levels do not show robust evidence of linkage may therefore be explained, after lack of power considerations, by either an infinitesimal or an epistatic argument. Chesler et al. [2005] indicate that only 17% of highly heritable transcripts show linkage in mice, Monks et al. [2004] report the same number for humans, and Morley et al. [2004] 4% in humans. Other studies either do not calculate heritability, or report similarly low figures. In an explicit study of the properties of eQTL, Brem and Kruglyak [2005] show that approximately half the expression levels in yeast are best explained by models containing more than five additive loci; through simulation, they show that up to 61% may be accounted for by ten–locus models.
We can therefore conclude that a substantial proportion of heritable transcription levels are either influenced by many loci, perhaps approaching an infinitesimal, rather than exponential, model of effect; or that they may be explained by a relatively small number of epistatic interactions between causative variants. A recent study in yeast [Storey et al., 2005] indicates that 14% of transcription level differences are explainable by epistasis between two loci. However, the majority of studies report a small percentage of multiple linkages, which implies that at least some expression levels are the result of additive effects following an exponential model. Whether these represent extremes of a continuum of complexity for expression traits, or two discrete modes of regulatory variation, is still unclear.
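The point that epistatic loci can be invisible when considered one at a time may be illustrated with a deliberately extreme toy example. The sketch below is an assumption for illustration only, not an analysis from any of the cited studies: two loci with alleles coded -1/+1, four equally frequent genotype classes, no noise, and a trait equal to the product of the two allele codes.

```python
from itertools import product

# Four equally frequent two-locus genotype classes, alleles coded -1 / +1
# (a hypothetical, noise-free population chosen purely for illustration).
genotypes = list(product([-1, 1], repeat=2))

# Purely epistatic trait: the effect of one locus changes sign with the other.
phenotype = {(a, b): float(a * b) for a, b in genotypes}


def marginal_effect(locus):
    """Difference in mean phenotype between the two alleles at one locus."""
    high = [phenotype[g] for g in genotypes if g[locus] == 1]
    low = [phenotype[g] for g in genotypes if g[locus] == -1]
    return sum(high) / len(high) - sum(low) / len(low)


# Interaction contrast: mean of phenotype * allele_a * allele_b.
interaction = sum(phenotype[(a, b)] * a * b for a, b in genotypes) / len(genotypes)

effect_locus_1 = marginal_effect(0)
effect_locus_2 = marginal_effect(1)
```

In this idealised case both single–locus contrasts are exactly zero while the interaction contrast accounts for the whole trait, so a locus–by–locus linkage scan would detect nothing. Real epistatic pairs would usually retain some marginal signal.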
1.3.4 Tissue specificity
Despite several studies reporting results from multiple tissues in the same populations, there has, as yet, been no systematic comparison between cell types beyond observations of low overlaps in affected genes. Hubner et al. [2005] report 7 – 22% overlap of effects in recombinant inbred rat fat and kidney tissues, with far fewer effects in trans detected at more stringent thresholds. Chesler et al. [2005] and Bystrykh et al. [2005] report 50% cis effects shared between brain and hematopoietic stem cells from recombinant inbred mice, and only four corresponding trans effects, where the number of influenced genes is not reported. It is not clear from these reports whether
the proportion of overlap is due to the absence of expression, or variants in tissue specific regulatory components. Given the predominantly tissue–specific nature of the gene regulation machinery [Wray et al., 2003], the latter is likely to be a significant contributor of tissue–specific effects.
Contrary to published claims [Chesler et al., 2005], the low proportion of overlaps makes extrapolating regulatory effects between tissues untenable. It further suggests that the full spectrum of variation between individuals can only be understood by extensive tissue sampling. An interesting implication is that genetic similarity is a tissue–specific quantity, which would add further complexity to population structure, genetic epidemiology, and species evolution.
1.4 Biological implications
The first demonstration of functional regulatory variation between individuals came over fifty years ago [Law et al., 1952], and the general importance of regulatory variants both between and within species was already an established concept by the 1970s [reviewed in King and Wilson, 1975; Paigen, 1979]. The field of evolutionary developmental biology has since coalesced around the theory that changes to gene regulation during development are responsible for differences in body plans, and ultimately speciation [Carroll, 2005; Levine and Tjian, 2003]. Unsurprisingly, the consequences of more subtle gene regulatory differences between individuals are as yet unclear, as these differences are themselves only now being elucidated.
Gene expression changes in perturbed systems or disease states have been used as molecular phenotype surrogates in complex disease dissection [Eaves et al., 2002; Karp et al., 2000], and as a signature in cancer diagnosis [Sorlie et al., 2001]. In this context, there is no requirement for them to be causative: they must simply be indicative of a state alteration to be useful.
Genetical genomics experiments are now being used in a similar fashion to relate heritable differences in gene expression levels to variation in physiological traits. Hubner et al. [2005] find that cis eQTLs in a recombinant inbred rat panel derived from spontaneously hypertensive and normal progenitor
strains correspond to previously mapped hypertension–related trait QTLs (“pQTLs”). These results not only suggest that at least some physiological trait differences are caused by changes in expression, but also the identity of causative genes, a major stumbling block in classical quantitative trait analysis [Nadeau and Frankel, 2000]. Schadt et al. [2005] use genes differentially expressed between lean and obese F2 intercross mice fed an atherogenic diet [Drake et al., 2001; Schadt et al., 2003] to identify causative, rather than resultant, expression differences with genetic modelling methods capable of assigning directionality to a correlation.
The examples above attempt to identify candidate causative genes for a trait, by focussing on a particular study population chosen for maximal trait differences. Whilst of obvious utility in dissecting known phenotypes, this approach ignores other dimensions of variation between individuals, such as the possibility that differences in the topology of the transcriptome may be important. The network of interactions between gene products displays a hierarchical structure of subnetworks connected by hubs, which are themselves connected together [Han et al., 2004]. These hub proteins often have regulatory effects on their nodes, particularly at the transcriptional level, so that interactome topology will reflect the structure of the transcriptome. We may speculate that changes to the structure or dynamics of these subnetworks induced by regulatory variants could have pervasive effects on the gross structure of the transcriptome. Such subtle changes will either be undetected or their scale will not be reflected in gene–by–gene approaches.
The phenotypic consequences of such second order effects are presently unknown, but may yield unexpected insights into molecular phenotypes, particularly little–understood phenomena such as background modifier activity [Nadeau, 2001] or nuclear spatial organisation [Casolari et al., 2005; Parada et al., 2004]. A recent study [Denver et al., 2005] suggests that mutational changes to the transcriptome accumulate rapidly; it is tempting to speculate that new combinations of alleles brought together at each meiosis may have a similar, if smaller, effect on the transcriptome: a constant revelation of “cryptic” variation [Gibson and Dworkin, 2004] between individuals.
1.5 Outline
We investigate genetic influences on gene expression using two inbred mouse strains and a panel of thirty–one recombinant inbred strains derived from them. We use microarrays to measure expression levels for approximately 22,000 transcripts, representing the majority of the murine gene complement.
In Chapter 2 we present a novel, extensible framework for microarray data normalisation using additive models to eliminate subtle systematic biases. We show that this method is appropriate to genetical genomics applications and describe the effects of failure of normalisation.
In Chapter 3, we compare mRNA levels in three tissues of two inbred mouse strains, and conclude that the majority of differences are tissue–specific. We further show that genes whose products are involved in transcription and its regulation are particularly susceptible to differences in mRNA levels across genetic backgrounds, suggesting that many variants influencing these levels may reside in molecules only indirectly associated with transcription. We conclude that extrapolation between tissues of genetic effects on mRNA levels is not possible, and that the influence of genetic variation on gene regulation is tissue–specific.
In Chapter 4, we measure mRNA levels in a panel of thirty–one recombinant inbred strains, and map genetic determinants of these molecular phenotypes by assessing linkage to a dense genetic map. We show that only approximately 10% of genes with altered expression in the parental strains have resolvable determinants, but many genes not found altered in the parentals exhibit genetic linkage. We also show that two loci influence mRNA levels in a tissue–specific manner, and conclude that gene regulation is genetically complex and largely tissue–specific.
In Chapter 5, we discuss the implications of these findings, particularly the notion that alterations to the regulation of a core set of genes may be responsible for tissue–specificity, rather than tissue–specific suites of regulators.
Chapter 2
Microarray normalisation for genetical genomics
2.1 Introduction
The elimination of random and systematic noise from data prior to analysis is a key step in experimental science. As microarray technology has matured, a number of sources of noise have been identified [Yang et al., 2002], and both experimental and theoretical methods to account for them have been proposed. These sources may be divided into three broad categories:
(i) Manufacturing practices: non–random spacing of gene probes in an array leading to areas of high/low signal; time–dependent chemical differences in slide surface.
(ii) Pre–processing methods: biases introduced by feature selection and background estimation algorithms.
(iii) Experimental artefacts: background signal generated by the hybridisation process; differences in fluorescence properties between dyes; differential brightness–induced scale and range differences.
Progress has been made in addressing all three sources of bias with a combination of practical and theoretical approaches. It is notable that the latter has focused heavily on borrowing methodology from other fields, particularly applications of multivariate statistics.
The first efforts to improve microarray signal revolved around changes to manufacturing practices and experimental protocols. Manufacturing was improved by clone and control selection [Loftus et al., 1999; Schuchhardt et al., 2000], particularly after so–called housekeeping genes were shown to fluctuate dramatically in expression across cell types [Lee et al., 2002]; selection of appropriate oligonucleotide sequences representing transcripts to minimise cross-hybridisation and provide cleaner signal; and the improvement of slide–coating substrates to minimise background fluorescence. A significant improvement to experimental protocol was the adoption of indirect fluorescent labelling: here, amino allyl–modified nucleotides are incorporated during cDNA synthesis followed by dye coupling, rather than direct labelling with dye–nucleotide complexes.
These can have markedly different steric properties leading to differential incorporation rates and hence strong channel intensity bias, which is decreased with indirect labelling. The experiments described in this thesis have been carried out with this labelling strategy.
Advances in image processing software have also been made, particularly in the estimation of background intensities for each spot. The process of spot identification in a scan image (segmentation) has evolved from overlaying grids of fixed–diameter perfect circles, to adaptive fitting algorithms such as seeded region growing, capable of accounting for deviant spot morphology and thus more accurately capturing real signal [Smyth et al., 2003]. Background fluorescence estimation has progressed from subtracting an average for the whole slide, to local sampling of the inter–spot spaces, to two–dimensional smooth imputation of background under the spot itself by the techniques of morphological opening [Smyth et al., 2003; Soille, 1999].
Whilst these improvements have greatly increased the quality of microarray data being collected, further biases remain, and must be removed post hoc by mathematical manipulation: this process is generally referred to as normalisation [Smyth and Speed, 2003]. The nature of the remaining biases may vary across laboratories, so investigation and tailoring of methods is warranted. This tailoring is especially pertinent in genetical genomics, where the goal is not to identify large expression level ratios as in typical microarray experiments, but to use them as quantitative traits. Biases must therefore be carefully removed to avoid spurious inference of genetic influence, a phenomenon more prevalent than previously thought (discussed below in section 2.4). The relatively modest magnitude of heritable expression changes makes this data treatment all the more important.
2.1.1 A note on microarray data visualisation
Visual inspection is key to exploratory analysis [Cleveland, 1993, 1994; Tufte, 1990, 1997], but direct plots of microarray intensities are rarely informative (Figure 2.1A), even when log2 transformed to decrease the scale (Figure 2.1B). Mean difference plots [Bland and Altman, 1986, 1999] are generally used to expose more subtle data structure by contrasting the ratio of measurements to the average measurement (Figure 2.1C). For two–colour microarray data,
the notation proposed by Yang et al. [2002] is often used: intensity ratio $M = \log_2(R/G)$; and geometric mean intensity $A = \log_2\sqrt{RG}$, where R and G are the background subtracted red and green channel spot intensities, respectively. I shall use this notation throughout this chapter, and refer to the mean difference plot as the MA plot.
[Figure 2.1, three panels. A: Raw (Green against Red); B: Logged (log2 Green against log2 Red); C: Mean difference (mean difference M against mean intensity A).]
Figure 2.1: basic visualisation of microarray data. An array representing approximately 15,200 transcript cDNAs is represented as A: background subtracted intensities; B: log2–transformed background subtracted intensities; C: mean difference [Bland and Altman, 1986, 1999], or MA [Yang et al., 2002], plot. Subtle differences, such as a tendency towards higher expression ratios M at low mean intensity A, are more obvious in the latter.
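As a minimal sketch of how the MA quantities are obtained from spot intensities, the following assumes the definitions of Yang et al. [2002], with M the log2 ratio and A the log2 of the geometric mean of the two channels. The function name and the handling of non–positive intensities (returned as None) are illustrative choices rather than part of the published notation.

```python
import math


def ma_values(red, green):
    """Return (M, A) lists for paired background-subtracted intensities.

    Spots where either channel is non-positive yield None in both lists,
    because the logarithm is undefined there (an illustrative convention).
    """
    m_vals, a_vals = [], []
    for r, g in zip(red, green):
        if r <= 0 or g <= 0:
            m_vals.append(None)
            a_vals.append(None)
            continue
        m_vals.append(math.log2(r / g))        # log ratio M
        a_vals.append(0.5 * math.log2(r * g))  # log2 of the geometric mean, A
    return m_vals, a_vals


# Three hypothetical spots; the third has a negative red intensity.
m, a = ma_values([1024.0, 512.0, -3.0], [256.0, 512.0, 40.0])
```

Plotting M against A for every spot on a slide gives the MA plot; spots with undefined values are the missing values that later complicate rank–based normalisation.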
2.2 Normalisation – mathematical bias removal
A bewildering number of statistical methods have been developed to remove bias from microarray data, for both single–channel and dual channel platforms. The main approaches for the latter can be divided into four broad categories, reviewed below. With the exception of methods based on smoothing, all assume that bias is linear in microarray data, a flawed assumption, as shown below. Microarray data is assumed log–normal, so logarithmic transformation, usually to base 2, is universal in microarray analysis.
2.2.1 Scaling
Scale adjustments by linear transformation are the simplest normalisation strategy used, the goal being to make M comparable across slides which may differ in scale and/or range of values. They do not, however, address within–slide biases. Dividing by slide–wise mean or median M [Hughes et al., 2000; Monks et al., 2004; Schadt et al., 2003], and other forms of standardisation have been used. Yang et al. [2002] suggest scaling by the median absolute deviation (MAD), which is robust to outlying values.
Bolstad et al. [2003] and Yang and Thorne [2003] have suggested an alternative to these transformations, dubbed quantile normalisation. Average values are computed for each of N genes in a dataset, and N quantiles are calculated for the average distribution. For each slide, genes are then ranked, and their values replaced with the N quantile values in ascending order. So, the gene with the lowest/most negative M value in each slide is assigned the first quantile value, the second lowest/most negative gene assigned the second quantile value, and so on, irrespective of gene identity. Thus each slide has exactly the same scale, since the expression values are now drawn from the average distribution. The process is analogous to using ranks of genes rather than expression levels; the quantiles, however, mirror any density changes in the average distribution, whereas ranks are of necessity integers.
The greatest weakness of this method is the inability to account for missing values (caused by e.g. background subtraction resulting in negative intensities): during the substitution process, some quantile values will have to be ignored, but how these are to be selected is arbitrary. As the number of missing values grows large, the shape of the quantile distribution will change, negating the aim of the method.
One solution is to impute the original value of the gene from other slides in the dataset, for which several methods have been proposed [Kim et al., 2005; Ouyang et al., 2004; Troyanskaya et al., 2001].
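The quantile substitution described above can be sketched in a few lines. This is an illustrative implementation that follows the description in the text (per–gene averages across slides, sorted to give the N reference quantiles); it assumes complete data with no missing values, and the function name is hypothetical.

```python
def quantile_normalise(slides):
    """Quantile-normalise a list of slides (equal-length lists of M values).

    The reference distribution is built as described in the text: average each
    gene across slides, then sort those averages. Assumes no missing values.
    """
    n_genes = len(slides[0])
    gene_means = [sum(s[g] for s in slides) / len(slides) for g in range(n_genes)]
    reference = sorted(gene_means)  # the N quantile values, ascending

    normalised = []
    for s in slides:
        # Gene indices ordered from lowest to highest value on this slide.
        order = sorted(range(n_genes), key=lambda g: s[g])
        out = [0.0] * n_genes
        for rank, g in enumerate(order):
            out[g] = reference[rank]  # rank-th lowest gene gets rank-th quantile
        normalised.append(out)
    return normalised


# Two hypothetical slides of three genes each.
slides = [[2.0, -1.0, 0.5], [1.0, 0.0, 3.0]]
normalised = quantile_normalise(slides)
```

After substitution every slide contains exactly the same set of values, differing only in which gene carries which value. A missing entry on any slide would leave one reference quantile unassigned, which is the arbitrariness noted above.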
2.2.2 Analysis of Variance
Kerr et al. [2000] construct an analysis of variance (ANOVA) model of microarray data of the form
\[ \log(y_{ijkg}) = \mu + A_i + D_j + V_k + G_g + (AG)_{ig} + (VG)_{kg} + \epsilon_{ijkg} \]
where $y_{ijkg}$ is the measurement for array i, dye j, variety k and gene g. $\mu$ is the overall average signal; $A_i$ accounts for gross array differences such as hybridisation success; $D_j$ for dye–specific effects such as differential incorporation rates; $V_k$ for sample (variety) effects; $G_g$ for gene effects; and $\epsilon_{ijkg}$ is the error term. The array × gene interaction term $(AG)_{ig}$ represents bias such as spatial inconsistencies or deformations on a particular slide. The $(VG)_{kg}$ term is the quantity of interest, as this represents alterations to expression associated with a particular variety, that is, genuine biological signal.
The ANOVA approach unifies normalisation and analysis into one procedure, and the model–based approach allows the addition of terms describing other artefacts as appropriate. However, the current model does not account for non–linear bias within a slide, and it is difficult to see how this could be achieved without positing complex interaction terms between parameters. Degrees of freedom soon become limiting in linear model approaches where the number of samples is small, as in most microarray experimental designs. There is therefore a limit to the number of terms one can include in the model, particularly if several degrees of freedom are to be retained for error estimation, as is common [Kerr et al., 2000; Simonoff, 1996].
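The remark about degrees of freedom can be made concrete with a simple count. The sketch below is an illustration under stated assumptions, not a calculation taken from Kerr et al. [2000]: it assumes one measurement per array, dye and gene, the usual sum–to–zero constraints, and that every term in the model is estimable in the design.

```python
def anova_residual_df(n_arrays, n_varieties, n_genes, n_dyes=2):
    """Count residual degrees of freedom for the model above.

    Assumes one observation per (array, dye, gene) and that all terms are
    estimable; both are simplifying assumptions made for illustration.
    """
    n_obs = n_arrays * n_dyes * n_genes
    terms = {
        "mu": 1,
        "A": n_arrays - 1,
        "D": n_dyes - 1,
        "V": n_varieties - 1,
        "G": n_genes - 1,
        "AG": (n_arrays - 1) * (n_genes - 1),
        "VG": (n_varieties - 1) * (n_genes - 1),
    }
    return n_obs - sum(terms.values()), terms


# Hypothetical design: two arrays, two varieties, 1,000 genes.
residual_df, terms = anova_residual_df(n_arrays=2, n_varieties=2, n_genes=1000)
```

For this small design 4,000 observations leave 999 residual degrees of freedom, fewer than one per gene, so adding even one further gene–specific interaction term would exhaust them.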
2.2.3 Principal Components Analysis
Alter et al. [2000] use principal components analysis (PCA) calculated by the singular value decomposition (SVD) to analyse a microarray time–series experiment charting approximately one period of the Saccharomyces cerevisiae cell cycle at 30 minute intervals for 390 minutes [Spellman et al., 1998]. PCA is a dimensionality reduction technique: its aim is to capture the maximum amount of variance in a dataset by transforming to a small
number of new, uncorrelated variables, or principal components (PCs). It is thus essentially a remapping of data along a limited number of axes of variance [Jolliffe, 2002]. Vectors describing these new axes in terms of the original dataspace axes are termed eigenvectors, and can be calculated in a number of ways, SVD being a common choice [Jolliffe, 2002]. The decomposition can be reversed to reconstitute the original data. An important implication of the independence of the PCs is that they capture different variance trends. Any one of these can be removed by eliminating the relevant component (eigenvector): this collapses the data in that dimension, removing that variance trend, but not altering the data on other axes. We thus have a mechanism for removing given variance trends (e.g. artefactual signal) from a dataset without affecting other trends.
Alter et al. [2000] use just this strategy, eliminating an eigenvector describing an upward tendency inconsistent with the expected periodicity of cell cycle phenomena. They assume that this tendency is therefore artefactual, and normalise their data by reconstituting without that eigenvector. Other eigenvectors are then found to describe two periodic trends across the data correlating with cell cycle phase changes, and these are interpreted as expression “signatures” of biological origin, corresponding to periodicities in the cell cycle. In later work, Alter et al. [2003] use a generalised version of this approach to make comparisons between the yeast and human cell cycles.
The main limitation of PCA is one of interpretation. Eigenvectors do not necessarily correspond to discrete physical processes: they simply describe trends of variance in data. There is thus no guarantee that a single eigenvector captures a single experimental variable. In the experiment described by Alter et al.
[2000], biological variance is expected to be periodic, due to regular changes in expression at different points of the cell cycle. They therefore assume that non–periodic variance trends can be dismissed as artefact. In a less well–defined experiment, there would be no a priori method of distinguishing eigenvectors describing biological signal from those capturing artefact. Interpretation would then have to proceed by correlating eigenvectors back to experimental variables in order to deduce their provenance, and so decide if they should be removed.
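The mechanics of removing one variance trend while leaving the others intact can be sketched as follows. This is not the procedure of Alter et al. [2000]: for brevity it finds only the dominant singular component, using power iteration in place of a full SVD, and subtracts the corresponding rank–one term. The toy matrix (three genes measured at four time points, with a strong upward trend plus a weak alternating signal) and all function names are assumptions for illustration.

```python
import math


def dominant_component(x, n_iter=100):
    """Leading singular triplet (u, s, v) of x (a list of rows) by power iteration."""
    n_rows, n_cols = len(x), len(x[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # starting guess (must not be orthogonal to the answer)
    for _ in range(n_iter):
        u = [sum(row[j] * v[j] for j in range(n_cols)) for row in x]             # u = X v
        w = [sum(x[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # w = X^T u
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    u = [sum(row[j] * v[j] for j in range(n_cols)) for row in x]
    s = math.sqrt(sum(c * c for c in u))    # singular value
    u = [c / s for c in u]
    return u, s, v


def remove_component(x, u, s, v):
    """Subtract the rank-one trend s * u * v^T; other axes are left untouched."""
    return [[x[i][j] - s * u[i] * v[j] for j in range(len(v))] for i in range(len(x))]


# Toy data: each gene is a mix of an upward trend and a weak alternating signal.
trend, period = [1.0, 2.0, 3.0, 4.0], [1.0, -1.0, 1.0, -1.0]
x = [[lt * t + lp * p for t, p in zip(trend, period)]
     for lt, lp in [(3.0, 0.5), (2.0, -0.5), (4.0, 0.2)]]

u, s, v = dominant_component(x)
cleaned = remove_component(x, u, s, v)
```

After removal the cleaned data have no remaining projection on the discarded axis. Whether that axis was artefact or biology is exactly the interpretive question raised above, and the code cannot answer it.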
2.2.4 Intensity–dependent smoothing
Dudoit et al. [2000] re–analysed a microarray dataset comparing ApoA1 knockout and SR-BI transgenic mice to inbred controls in an investigation of low HDL cholesterol models [Callow et al., 2000]. They found a non–linear dependence of log ratio M on mean channel intensity A; in other words, expression ratio changes as a function of mean intensity, and this function is not linear (shown for our own data in the next section). Dudoit et al. [2000] and later Yang et al. [2002] and Smyth and Speed [2003] point out that scaling of one channel to another, a linear transformation, cannot adequately remove such non–linear bias.
These authors use a robust local regression method, loess [Cleveland and Devlin, 1988; Cleveland et al., 1992], to account for the non–linearity, adjusting log ratio M by the local fit residuals. The procedure fits a locally linear regression to the data over a sliding window of defined width [Simonoff, 1996], which amounts to fitting a smooth curve through the data. Normalisation by taking residuals of this function effectively alters the data such that the smooth function is now a straight line y = 0.
All these authors further report that this intensity–dependence can be different for each subgrid of spots, deposited by a single print–tip. Microarray slides are robotically manufactured in a process where hollow metallic pins dip into microtitre plates containing oligonucleotide solutions, which are then deposited onto treated–surface glass slides. Thus, each pin will deposit a sub–grid or block of spots in the same area of the microarray, which is in fact a grid of spot grids. Since each pin has subtly different proportions, the capillary and surface tension forces which draw up and deposit oligonucleotide solution will vary slightly, leading to changes in spot morphology.
The proposed solution is to normalise data from each subgrid independently by fitting print–tip specific loess curves; however, this solution, like any smooth fit, may be hindered by overfitting the function [Simonoff, 1996]. Wilson et al. [2003] offer a slightly different approach to this problem, by spatially smoothing the residuals of the loess fit to adjust for variations in median intensity across a slide. This eliminates the awkward transition of
values between subgrids, but is susceptible to overfitting caused by abrupt changes in the smoothness of data [Simonoff, 1996].
Analogous approaches are offered by Finkelstein et al. [2001], who iteratively apply linear regression for each subgrid, removing outliers until the regression stabilises, giving a linear data transformation; Sapir and Churchill [2000], who use the orthogonal residuals from a robust regression of red intensity R on green intensity G in place of the intensities themselves; and Kepler et al. [2002], who attempt to find a core set of invariant genes, by iteratively reweighted least–squares, against which they calculate a normalisation constant. All these approaches, like the ANOVA described above, assume that the majority of genes have invariant expression across samples, so that M → 0 and hence the overall relationship of the two channels should be linear.
A particular strength is the ability to differentially weight spots in all these schemes, so that, for instance, constant control spots can be exploited. However, the sparsity of controls in most microarrays (of the order of several per subgrid) usually precludes their exclusive use for normalisation. Smooth approaches can be applied iteratively to other non–linear biases, so the data is progressively smoothed for multiple effects. However, the interaction between such iterative applications, if any, is not clear, and may generate new biases.
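The intensity–dependent normalisation described in this section can be sketched with a simplified local regression. The code below is an assumption–laden illustration, not the loess implementation used by the cited authors: it performs a single pass of tricube–weighted local linear fitting with a nearest–neighbour span, and omits the robustness iterations of the full loess procedure. Function names and the toy data are hypothetical.

```python
def local_linear_fit(x, y, span=0.4):
    """Fitted values from tricube-weighted local linear regression (single pass)."""
    n = len(x)
    k = min(n, max(3, int(round(span * n))))  # neighbours used in each local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        d_max = dists[k - 1] or 1.0           # bandwidth: distance to k-th nearest point
        sw = swx = swy = swxx = swxy = 0.0
        for xi, yi in zip(x, y):
            d = abs(xi - x0) / d_max
            if d >= 1.0:
                continue
            w = (1.0 - d ** 3) ** 3           # tricube weight
            sw += w
            swx += w * xi
            swy += w * yi
            swxx += w * xi * xi
            swxy += w * xi * yi
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:                # degenerate window: fall back to weighted mean
            fitted.append(swy / sw)
        else:
            slope = (sw * swxy - swx * swy) / denom
            intercept = (swy - slope * swx) / sw
            fitted.append(intercept + slope * x0)
    return fitted


def loess_normalise(a_vals, m_vals, span=0.4):
    """Normalised M: residuals of M from the smooth curve fitted against A."""
    fit = local_linear_fit(a_vals, m_vals, span)
    return [m - f for m, f in zip(m_vals, fit)]


# Toy slide: a curved intensity-dependent bias and no genuine differential expression.
a_vals = [6.0 + 0.5 * i for i in range(20)]
m_vals = [(a - 11.0) ** 2 / 25.0 for a in a_vals]
normalised = loess_normalise(a_vals, m_vals)
```

Print–tip normalisation amounts to calling the same function separately on the spots of each subgrid. The span controls how closely the curve follows the data, which is where the risk of overfitting noted above enters.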
2.3 Correcting multiple non–linear biases in microarray data
As discussed above, a non–linear systematic relationship between mean intensity and log ratio [Smyth and Speed, 2003; Yang et al., 2002] may exist in two–colour microarray data. At least one other non–linear bias has also been reported, where a change in intensity correlates with the order in which spots within a subgrid are deposited onto the slide surface. Balázsi et al. [2003] show a time–dependent trend associated with atmospheric exposure during the slide printing process, and Mary-Huard et al. [2004] report a periodic bias in expression ratios, consistent with a spot deposition order bias. Investigation of other possible sources of systematic bias is therefore warranted, as is the development of a normalisation method capable of handling multiple non–linear effects.
In this section we shall show that both an intensity and a deposition order dependence exist in data generated in our laboratory, and present a novel method for removing them, based on Generalised Additive Models (GAMs) [Ruppert et al., 2003; Simonoff, 1996].
2.3.1 Non–linear artefacts
Figure 2.2A shows a mean–difference (MA) plot for a typical microarray experiment from our laboratory. The array contains approximately 15,200 cDNAs from the NIA 15k mouse clone set [Kargul et al., 2001]. The smooth line describes a loess function after Yang et al. [2002], above, illustrating a tendency for the value of M to change according to the magnitude of A. This tendency is obviously not constant, which would produce a linear relationship; it is non–linear, resulting in a smooth curve. Figure 2.2B shows the same plot, but with a total of thirty–two loess lines fitted, each corresponding to a subgrid of spots on the array as described above. As previously reported [Smyth and Speed, 2003; Yang et al., 2002], the bias in each subgrid has slightly different properties, suggesting a subgrid–specific procedure is appropriate for normalisation.
This bias appears to be due to inherent dye properties. It is present in self–self hybridisations, where the same RNA sample is used in both channels of the array [Dudoit et al., 2002; Pulvers, 2004]. There should therefore be no difference in efficiency of any step of the process, apart from cDNA labelling and channel–specific scanning parameters. The use of alternatives to cyanine based fluorophores does not completely remove the problem [Pulvers, 2004], suggesting that the bias may actually be due to fluor–specific physical properties.
Figure 2.2C shows the same M data, plotted in order of spot deposition. There is a clear trend for spots laid down towards the end of the printing
process to have slightly higher M values than those deposited earlier. The relationship is non–linear, and Figure 2.2D shows that it can vary across subgrids in the same way as intensity–dependence. It corresponds to the deposition–order artefact reported by Balázsi et al. [2003], where they show that the order of deposition of spots within a subgrid affects intensities in a time–dependent fashion. Microarrays are produced over a series of days, during which the slides are exposed to fluctuations in light, temperature, and humidity; this exposure appears to create differences in signal intensity. A similar bias is described by Smyth and Speed [2003], who find much sharper gradations to the bias: these authors attribute the differences in expression ratios to differences in the quality of the cDNA libraries used to assemble the arrays they examine [Callow et al., 2000], and use scale adjustment to correct the bias.
We therefore have two non–linear biases in this data, which must be accounted for prior to downstream analysis. Since none of the normalisation methods reviewed in the previous section are capable of dealing with more than one non–linearity, a new method is called for.
2.3.2 Additive model normalisation
We may incorporate the non–linearities described above, along with any linear effects, into an Additive Model (AM) [Hastie and Tibshirani, 1990; Simonoff, 1996]. AMs are extensions of linear models, where at least one of the terms in the mean is expressed as a smooth function of the predictor [Hastie and Tibshirani, 1990; Ruppert et al., 2003]. We can therefore use this framework to integrate multiple non–linear biases into a single expression. The residuals of the model, once the non–linear mean has been accounted for, will then be the normalised expression values.
We can describe gene g’s observed expression ratio $M_g$ as
\[ M_g = \mu + f_1(A_g) + f_2(D_g) + \epsilon_g \]
where $\mu$ is an intercept term, $f_1(A_g)$ and $f_2(D_g)$ are (smooth) functions of mean intensity and deposition order, respectively, and $\epsilon_g$ is the error term.
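One classical way to fit a model of this additive form is backfitting [Hastie and Tibshirani, 1990], in which each smooth term is re–estimated in turn from the partial residuals of the others. The sketch below is a substitute technique shown only to make the structure of the model concrete: it uses a crude running–mean smoother in place of penalised regression splines, a fixed window rather than a data–driven smoothing parameter, and synthetic data with known sinusoidal biases. All names and values are hypothetical.

```python
import math


def running_mean(key, values, half_width):
    """Smooth `values` against `key` with a moving average over sorted neighbours."""
    n = len(values)
    idx = sorted(range(n), key=lambda i: key[i])
    smoothed = [0.0] * n
    for pos, i in enumerate(idx):
        lo, hi = max(0, pos - half_width), min(n, pos + half_width + 1)
        window = [values[idx[p]] for p in range(lo, hi)]
        smoothed[i] = sum(window) / len(window)
    return smoothed


def backfit(m, a, d, half_width=7, n_iter=20):
    """Fit M = mu + f1(A) + f2(D) + error; return (mu, f1, f2, residuals)."""
    n = len(m)
    mu = sum(m) / n
    f1, f2 = [0.0] * n, [0.0] * n
    for _ in range(n_iter):
        # Update f1 from the partial residuals that exclude f2, then centre it.
        f1 = running_mean(a, [m[g] - mu - f2[g] for g in range(n)], half_width)
        c1 = sum(f1) / n
        f1 = [v - c1 for v in f1]
        # Update f2 from the partial residuals that exclude f1, then centre it.
        f2 = running_mean(d, [m[g] - mu - f1[g] for g in range(n)], half_width)
        c2 = sum(f2) / n
        f2 = [v - c2 for v in f2]
    residuals = [m[g] - mu - f1[g] - f2[g] for g in range(n)]
    return mu, f1, f2, residuals


# Synthetic slide: 200 spots, intensity bias plus a periodic deposition-order bias.
n = 200
d = list(range(n))                                      # deposition order
a = [6.0 + ((37 * g) % 200) / 20.0 for g in range(n)]   # mean intensity, scrambled w.r.t. order
m = [0.1 + 0.5 * math.sin(a[g]) + 0.3 * math.cos(2 * math.pi * d[g] / 50) for g in range(n)]

mu, f1, f2, normalised = backfit(m, a, d)
```

The residuals returned by the fit play the role of the normalised expression values. Both biases are removed in a single model, rather than by successive independent smooths whose interaction is unclear.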
[Figure 2.2, four panels: A, B, C and D; vertical axis M in each panel.]
Figure 2.2: systematic non–linearities in microarray data from our laboratory, revealed by loess smooth curve fitting. Top: expression ratios are dependent on intensity (left), and this relationship can vary for each subgrid on an array (right). Bottom: expression ratios are also subject to spot deposition order bias (left), which can vary between subgrids (right).
An immediate problem is the possibility of over–fitting the model: if the smooth functions are over–sensitive to data fluctuations, they will tend to fit too closely to the local changes in data, resulting in a very “wiggly” smooth curve. In contrast, under–sensitivity will fail to capture local trends in data, negating the utility of using a curve, rather than a straight line, to capture local data variations. Smoothness is controlled by penalising roughness in the fitting procedure, which can be done in a variety of ways. In this case, the smooth functions used are penalised regression splines. Over– and under–fitting are controlled by adjusting the level of penalisation using a smoothing parameter. Calculating this parameter is problematic, as we are trading off smoothness against accuracy of fit. If the parameter is too small, we overfit, and if too large, the model is insensitive to data fluctuation.
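The trade–off between smoothness and accuracy of fit can be made concrete with a small sketch. The Python code below is not a penalised regression spline: it substitutes a simpler discrete penalised least–squares smoother (a Whittaker–type smoother), which minimises the residual sum of squares plus a smoothing parameter times the sum of squared second differences of the fitted values. The simulated data, the parameter values and all function names are assumptions for illustration.

```python
import math
import random


def second_difference_penalty(n):
    """Return the n x n matrix D'D, where D takes second differences."""
    P = [[0.0] * n for _ in range(n)]
    for r in range(n - 2):
        coeffs = {r: 1.0, r + 1: -2.0, r + 2: 1.0}
        for i, ci in coeffs.items():
            for j, cj in coeffs.items():
                P[i][j] += ci * cj
    return P


def solve(matrix, rhs):
    """Solve a dense linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / aug[r][r]
    return x


def penalised_smooth(y, lam):
    """Minimise sum((y - z)^2) + lam * sum(second differences of z squared)."""
    n = len(y)
    P = second_difference_penalty(n)
    system = [[lam * P[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
              for i in range(n)]
    return solve(system, y)


def roughness(z):
    """Sum of squared second differences: large for a 'wiggly' curve."""
    return sum((z[i] - 2 * z[i + 1] + z[i + 2]) ** 2 for i in range(len(z) - 2))


def residual_sum_of_squares(y, z):
    return sum((a - b) ** 2 for a, b in zip(y, z))


rng = random.Random(0)
y = [math.sin(2 * math.pi * i / 30) + rng.gauss(0.0, 0.2) for i in range(60)]
for lam in (0.1, 10.0, 1000.0):
    z = penalised_smooth(y, lam)
    print(lam, roughness(z), residual_sum_of_squares(y, z))
```

A small smoothing parameter follows the noise closely (high roughness, low residual error), whereas a large one flattens the curve and misses the underlying cycle, which is the dilemma described above.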
Here, smoothing parameters are chosen using generalised cross–validation (GCV). Cross–validation is a leave–one–out procedure, in which each data point is left out of the smoothing fit in turn, and the squared error of the fit at the omitted point is computed. The objective is then to minimise the sum of the squared errors. The generalised case reweights the cross–validation terms so that every point is treated as having the same average leverage, which avoids refitting the model for each omitted point. Once again, this non–parametric approach allows us to avoid external assumptions about the form of the smooth function, which may compromise the solution. Taking the residuals of Mg from this model then provides normalised expression values.
Figure 2.3 shows the spline fits from this model for the slide used above. The top panel describes intensity dependence, and is similar to the trend revealed by loess (cf. Figure ??A): the effect varies over approximately 2.5 M units, or 34% of the total range of M. The lower panel shows the bias due to deposition order, of the order of 0.3 M units (∼5% of the range of raw M). There is a clear periodicity in M, corresponding to the four days of the printing process. It would appear that as printing progresses on each day, there is a commensurate increase in M. This pattern is common to all slides in the data set: another example is shown in Figure 2.4.
The source of this artefact is uncertain, but the raw foreground and background intensities provide a clue: the periodicity exists in both foreground estimates and the Cy5 (red) background (top three panels, Figure 2.5), but not in the Cy3 (green) background intensity (bottom panel, Figure 2.5). These trends exist in the other slides in this dataset (data not shown), and suggest that signal, rather than background, may be increasing. There seems to be a cycling over the four days of printing, which is performed at near–ambient temperature (25 °C) but high humidity (> 50%). Martinez et al. [2003] show a channel–specific bias, abrogated by exposure of slides to ambient conditions prior to printing; this demonstrates that atmospheric conditions may affect measured intensity signal. It appears that a similar effect occurs in our microarrays, such that increased hydration of the slide surface during prolonged exposure to high humidity tends to increase the signal in some fashion. This may be due to increased efficiency of oligonucleotide binding to the slide surface at elevated hydration levels, which would increase foreground signal but not background levels. Irrespective of source, this is a clear data artefact, and should be removed.
Figure 2.3: Penalised regression spline fits for two non–linear biases in a typical microarray experiment, modelled as a GAM. Top: intensity dependence. Bottom: spot deposition order dependence.
Figure 2.4: Penalised regression spline fits for two non–linear biases in a typical microarray experiment, modelled as a GAM. Top: intensity dependence. Bottom: spot deposition order dependence.
Figure 2.5: Deposition order biases in red and green (A, B) foreground and (C, D) background intensities for a typical microarray. Curves obtained by fitting a GAM to each set of intensities with deposition order as the single smooth predictor. All but the green background would appear to have an embedded periodicity corresponding to the diurnal cycles of the printing process.
The detection of this time–dependent periodicity in our data is a good example of the possibility of underfitting [Ruppert et al., 2003; Simonoff, 1996]. The lower panels of Figure 2.2 demonstrate that a loess function with a default parameter set fails to adequately describe the periodicity subsequently uncovered through GAM spline fits, which are parameterised from the data using GCV. A smooth function requires, amongst other parameters, a local area span to be specified. This controls the area of data around which local fits are made when estimating the smooth curve (best visualised as the size of a scrolling window across the data). The loess default of 2/3 is simply too large to allow detection of a four–cycle effect, leading to an inadequately fitted data model. This observation does not reflect the superiority of one smooth function over another; rather, it reveals the importance of data–driven parameter estimation, with other properties (such as robustness to outliers) being of secondary importance. Cross–validation is a powerful method for arriving at such estimates [Wood, 2004], making the overall process of normalisation using GAMs more sensitive to data trends.
Generalised Additive Models are therefore a robust framework for removing multiple non–linear biases from microarray data. Multiple such trends, of which at least two have been reported, can be accommodated as described, and the application of cross–validation techniques to parametrise the smooth terms within the GAM provides sensitivity to data alterations. This methodology can be applied to the data from each subgrid of an array in the same way as the print–tip loess procedure. It provides a strong alternative to current methods, which fail to adequately account for the multiple systematic biases in microarray data.
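The effect of the span on the detection of a four–cycle trend can be sketched with ordinary leave–one–out cross–validation. The Python code below is a deliberately simplified stand–in: it uses a nearest–neighbour local mean rather than loess or a spline, plain leave–one–out error rather than GCV, and simulated data; the candidate spans and all names are assumptions for illustration.

```python
import math
import random


def local_mean_loo(y, i, span):
    """Predict y[i] from the mean of its nearest neighbours, leaving y[i] out."""
    n = len(y)
    k = min(n - 1, max(2, int(span * n)))  # number of neighbours in the window
    neighbours = sorted((j for j in range(n) if j != i),
                        key=lambda j: abs(j - i))[:k]
    return sum(y[j] for j in neighbours) / k


def loo_cv_error(y, span):
    """Mean squared leave-one-out prediction error for a given span."""
    n = len(y)
    return sum((y[i] - local_mean_loo(y, i, span)) ** 2 for i in range(n)) / n


def choose_span(y, candidate_spans):
    """Return the span with the smallest leave-one-out error, and all errors."""
    errors = {span: loo_cv_error(y, span) for span in candidate_spans}
    best = min(errors, key=errors.get)
    return best, errors


def simulate_deposition_trend(n=200, seed=2):
    """Hypothetical M values with four cycles across the deposition order."""
    rng = random.Random(seed)
    return [0.15 * math.sin(2 * math.pi * 4 * i / n) + rng.gauss(0.0, 0.05)
            for i in range(n)]


y = simulate_deposition_trend()
best_span, errors = choose_span(y, [2 / 3, 0.3, 0.1, 0.05])
print("span chosen by cross-validation:", best_span)
```

With a span of 2/3 the window covers several cycles at once, so the local mean is close to flat and the prediction error is large; the data–driven choice selects a much narrower window.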
2.4 Failure of normalisation in genetical genomics experiments
The artefacts described above appear to pervade microarray data. Perhaps surprisingly, there has not yet been a systematic investigation of the effect of normalisation methods on expression level linkage results. This is possibly due to an implicit assumption that incidental systematic noise from any one slide will not be able to generate artefacts which will affect linkage analysis. In this section, we explore the effects of normalisation method on linkage results from a genetical genomics experiment using a panel of sixteen BxD Recombinant Inbred (RI) strains [Bailey, 1971]. We compare linkage results obtained with raw data, and after median adjustment, whole–slide and subgrid loess, and the GAM normalisation described above.
Studies using two–colour microarrays have generally used linear transformations to scale between arrays: Brem et al. [2002] and Yvert et al. [2003] average ratios, assuming log–normality (after Fazzio et al. [2001]), although Brem and Kruglyak [2005] report that linkage results from these experiments in a yeast segregating population are robust whether scaling or ANOVA normalisation is used; and Schadt et al. [2003] and Monks et al. [2004] scale channel intensities by mean intensity division prior to ratio calculation (after Hughes et al. [2000]). Other investigators, using single channel Affymetrix GeneChips, generally use the default trimmed median scaling provided in the Affymetrix MAS data analysis suite [for example, Bystrykh et al., 2005; Hubner et al., 2005]. Chesler et al. [2005] report that their linkage results from BxD RI mouse lines are similar when either the default normalisation method or a robust variant, RMA [Irizarry et al., 2003], is used; this conclusion is based on the visual inspection of summary plots for the whole genome. It is, however, misleading, as closer examination shows that only 35% of transcripts are identified in data from both methods as having significant genetic determinants in the genome [RBH Williams, CJ Cotsapas, et al, submitted].
2.4.1 Experimental design
Each of sixteen RI strains was represented by three age– and sex–matched individuals. Pooled total RNA from each strain was reverse transcribed and co–hybridised with a common reference to microarrays containing the NIA 15K set [Kargul et al., 2001], giving a dataset of sixteen slides. After image extraction, background correction and removal of control spot values, expression ratios were calculated as $M = \log_2(R) - \log_2(G)$ as described in section 2.1.1, where R is the intensity of the red channel (RI sample), and G that of the green channel (reference sample). The data was normalised in each of the following ways: no normalisation, median scaling, loess smoothing applied to the whole slide [Smyth and Speed, 2003], loess smoothing applied separately to each subgrid [Smyth and Speed, 2003; Yang et al., 2002], and GAM smoothing applied to each subgrid as described above. The latter three methods are followed by median absolute deviation (MAD) between–array scaling: global loess–treated data is scaled as whole slides; the other two treatments are scaled per subgrid, as described in the Materials and Methods section for this chapter, and Yang et al. [2002]. I shall abbreviate these treatments to Raw, Median, Loess, Print–tip, and Gam, respectively. The sixteen M values across the panel for each gene are then used as expression phenotypes in a linkage analysis to determine genetic influences on gene expression.
Linkage, to a genetic map comprising 387 markers spanning all autosomes and the X chromosome, was assessed for each of the five datasets using a bootstrapped t–test. This is equivalent to the more common regression–based methods for linkage analysis with two genotypes: RI lines are obligatory homozygotes at all loci, so only the two homozygous genotypes are considered. Significance was defined as either P ≤ 0.0013 or P ≤ 0.000025 (genome–wide Bonferroni corrected p ≤ 0.05 and p ≤ 0.01, respectively) for association to any marker.
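The simpler steps of this pipeline, namely the log ratio, median adjustment and MAD between–array scaling, can be sketched in a few lines of Python. The sketch assumes that the scaling step rescales every array to a common MAD (here the geometric mean of the per–array MADs); this is an assumption about the general approach, not a reproduction of the limma implementation, and the intensity values and function names are hypothetical.

```python
import math
import statistics


def log_ratio(red, green):
    """Expression ratio M = log2(R) - log2(G) for one spot."""
    return math.log2(red) - math.log2(green)


def median_centre(m_values):
    """First-order adjustment: shift one array so that its median M is zero."""
    med = statistics.median(m_values)
    return [m - med for m in m_values]


def mad(values):
    """Median absolute deviation about the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def mad_scale(arrays):
    """Rescale each array so that all share a common MAD.

    Assumes every array has a non-zero MAD; the common target is the
    geometric mean of the per-array MADs.
    """
    mads = [mad(a) for a in arrays]
    target = math.exp(sum(math.log(m) for m in mads) / len(mads))
    return [[v * target / m for v in a] for a, m in zip(arrays, mads)]


# Hypothetical (red, green) intensities for two small arrays.
slide_1 = [log_ratio(r, g) for r, g in
           [(800, 200), (300, 300), (100, 400), (640, 160), (50, 100)]]
slide_2 = [log_ratio(r, g) for r, g in
           [(400, 200), (150, 150), (100, 200), (90, 180), (500, 250)]]
centred = [median_centre(slide_1), median_centre(slide_2)]
scaled = mad_scale(centred)
print([mad(a) for a in centred], [mad(a) for a in scaled])
```

Both operations are linear, so neither can remove the intensity– or deposition–dependent curvature that the loess and GAM treatments target.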
2.4.2 Lack of agreement between normalisation results
If more conservative normalisation methods simply remove artefactual linkage signal, we would expect to see a gradual decrease in the number of genes identified with progressively more conservative treatments. There should, however, be many genes in common between analyses, which are presumably under genuine genetic influence. Contrary to this prediction, Table 2.1 shows that the number of genes identified increases, but there is very little agreement across all the data treatments. This is true for both genome–wide corrected p ≤ 0.05 and p ≤ 0.01.
Perhaps unsurprisingly, there is strong agreement between results from untreated and median scaled data. Since median scaling is a first–order adjustment, any spurious linkage due to the non–linear biases discussed above will not be removed. The lack of concordance with data treatments capable of removing second–order structure suggests that virtually all linkage identified with the more permissive methods is spurious. The somewhat higher agreement between the three conservative methods (L, P, and G in Table 2.1) indicates that increasingly sensitive removal of subtle bias begins to stabilise linkage results. These two groups of overlap, within but not between first– and second–order data corrections, suggest that these results are not due to low power, but to fundamental changes in internal data structure.
Top: p ≤ 0.05
     R (%)      M (%)      L (%)      P (%)      Total
R    –          –          –          –          295
M    248 (70)   –          –          –          308
L    16 (2)     20 (3)     –          –          400
P    12 (2)     13 (2)     198 (35)   –          366
G    9 (1)      13 (2)     97 (14)    105 (16)   409

Bottom: p ≤ 0.01
     R (%)      M (%)      L (%)      P (%)      Total
R    –          –          –          –          60
M    38 (46)    –          –          –          60
L    4 (2)      4 (2)      –          –          110
P    4 (3)      5 (4)      33 (21)    –          82
G    4 (3)      3 (2)      17 (10)    18 (13)    77

Table 2.1: Effect of normalisation method on the identification of genes under genetic influence (R, Raw; M, Median; L, Loess; P, Print–tip; G, Gam). Linkage significant at top: p ≤ 0.05; and bottom: p ≤ 0.01. Proportions are calculated as the common fraction of unique genes in two analyses (i.e. intersect/union).
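The intersect/union proportion used in the table can be checked directly from the counts. The short Python sketch below computes it either from the two totals and the number of shared genes, or from two sets of identifiers; the function names and the example gene identifiers are hypothetical.

```python
def overlap_percent(total_a, total_b, n_common):
    """Percentage overlap as intersect/union, from counts alone."""
    union = total_a + total_b - n_common
    return 100.0 * n_common / union


def overlap_percent_from_sets(genes_a, genes_b):
    """The same proportion computed from two collections of identifiers."""
    a, b = set(genes_a), set(genes_b)
    return 100.0 * len(a & b) / len(a | b)


# Raw versus Median at the less stringent threshold: 295 and 308 genes, 248 shared.
print(round(overlap_percent(295, 308, 248)))
# Loess versus Print-tip at the same threshold: 400 and 366 genes, 198 shared.
print(round(overlap_percent(400, 366, 198)))
# Hypothetical identifiers: two of four unique genes are shared.
print(overlap_percent_from_sets(["g1", "g2", "g3"], ["g2", "g3", "g4"]))
```

Because the denominator is the union, a pair of analyses that each identify several hundred genes needs a large shared set to reach even a moderate percentage.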
It should be noted, however, that the false discovery rate due to multiple testing in this experiment is crippling: at p ≤ 0.05, we expect $15206 \times 0.05 \simeq 760$ genes by chance, and at p ≤ 0.01, we expect $15206 \times 0.01 \simeq 150$. The overlaps within the two groups (none/permissive, and conservative) of treatments suggest that at least some of the identifications reflect signal within the data, rather than false positives. To test this suggestion, we have compared the overlap in loci being identified as exerting genetic influence in the five analyses. If complex artefacts have no significant effect on linkage results, but study power is low, we might expect that there will be little overlap in the individual genes identified (as above), but the loci identified would be reasonably similar. Furthermore, we can ameliorate the false discovery rate by only looking at loci appearing to influence multiple genes. Such loci are biologically interesting as they would indicate that the affected genes are functionally related (“regulons” [Cotsapas et al., 2003]). The expected false positive number of loci appearing to influence a single gene at p ≤ 0.05 is still $15206 \times 0.05 \simeq 760$; however, the number of loci influencing n genes by chance is now $15206 \times 0.05^n$, which is approximately 2 for n = 3. Similarly, at p ≤ 0.01 we would expect 152 and 0.02 for n = 1 and n = 3, respectively.
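The arithmetic behind these expectations is reproduced in the Python sketch below. It simply evaluates the number of transcripts multiplied by the significance level raised to the power n, as in the text, which treats the n associations as independent; the function name is hypothetical.

```python
def expected_by_chance(n_transcripts, alpha, n_genes=1):
    """Expected chance count: n_transcripts * alpha ** n_genes."""
    return n_transcripts * alpha ** n_genes


for alpha in (0.05, 0.01):
    for n_genes in (1, 3):
        value = expected_by_chance(15206, alpha, n_genes)
        print(f"alpha={alpha}, n={n_genes}: {value:.3f}")
```

Requiring three or more transcripts per locus therefore reduces the expected chance count by more than two orders of magnitude at either threshold.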
Top: p ≤ 0.05
     R (%)      M (%)      L (%)      P (%)      Total
R    –          –          –          –          16
M    15 (79)    –          –          –          18
L    6 (14)     7 (16)     –          –          32
P    6 (12)     6 (12)     23 (47)    –          40
G    8 (12)     9 (14)     25 (39)    27 (39)    57

Bottom: p ≤ 0.01
     R (%)      M (%)      L (%)      P (%)      Total
R    –          –          –          –          7
M    6 (62)     –          –          –          6
L    2 (22)     2 (25)     –          –          4
P    2 (17)     3 (18)     6 (57)     –          7
G    2 (18)     2 (20)     5 (67)     5 (62)     6

Table 2.2: Effect of normalisation on identification of loci influencing at least three transcript levels. Top: p ≤ 0.05; bottom: p ≤ 0.01. Proportions are calculated as the common fraction of unique loci in two analyses (i.e. intersect/union).
Table 2.2 summarises the overlaps between loci apparently influencing at least three transcripts at the two genome–wide significance levels. The same pattern of overlaps as seen previously emerges: there is a clear correspondence within, but not between, the two groups of data treatments. Since the numbers of loci detected are much higher than those expected by chance, we may conclude that these results reflect structure, of either biological or systematic origin, in the underlying data. The agreement between the three smoothing techniques, particularly at the more stringent significance level, shows that these approaches are capable of identifying at least some of the same effects. In contrast, the lack of agreement with the two permissive treatments indicates that the majority of linkage identified in data processed with the latter is not robust to removal of known data bias, and therefore should be regarded with suspicion.
2.5 Conclusions
We have shown that multiple non–linear biases may exist in microarray data, and described an extensible mathematical framework to remove them, based on generalised additive models. This approach benefits from robust methods for parameter estimation, ensuring that the models describe the data as accurately as possible. We have further shown that normalisation method can have profound effects on linkage analyses performed with the resulting expression measurements.
The implication here is inescapable: microarray data contains complex systematic biases which can generate spurious signal in linkage analyses. Failure to remove these biases seems to generate linkage signal which appears biologically plausible, but which must be regarded as artefactual, since it is not robust to further bias removal. A normalisation strategy tailored to each dataset is therefore a prerequisite for avoiding spurious inference of genetic influences on gene expression levels.
Other evidence also suggests that this problem exists in studies on larger populations and on different platforms [RBH Williams, CJ Cotsapas, et al, submitted]: we have shown that a similar lack of concordance between data treatments pertains to an independent study of thirty–two BxD strains [Chesler et al., 2005], so such effects are not unique to the data presented, or a consequence of small population size. These results lead to the uncomfortable conclusion that many of the observations reported in the literature are suspect, and may be artefactual. However, tailoring normalisation methods should resolve any such problems, so that re–analysis of previous results is a viable option.
2.6 Materials and Methods
2.6.1 Sample handling
Three eight–week–old males from each of BxD strains 1, 2, 6, 9, 11, 12, 13, 14, 16, 18, 19, 21, 24, 29, 31, and 32 were obtained from the Jackson Laboratory, Bar Harbor, Maine. Animals were housed in standard conditions for one week to acclimatise, and then sacrificed by cervical dislocation. Whole brains, livers, kidneys, spleens and testes were harvested immediately and snap frozen in liquid nitrogen. Ten C57BL/6J males were processed in the same way, to provide a reference sample.
Total RNA was extracted from whole brains with TRIzol reagent (Invitrogen, Carlsbad, CA) as per the manufacturer’s protocols. Quality of RNA was assessed by spectrophotometry (A260/A280 absorbance ratios of > 2) and electrophoresis (rRNA bands visible on 1% agarose gels). Pools for each strain were then created by mixing equal amounts of RNA from each individual.
2.6.2 Expression profiling
50 µg RNA from each strain pool was reverse transcribed and indirectly labelled using a commercial kit (Invitrogen, Sydney, Australia) as per the manufacturer’s instructions. BxD samples were labelled with Cyanine 5 dye (Invitrogen, Sydney, Australia), and the C57BL/6J reference samples with Cyanine 3 dye. Each sample/reference pair was concentrated to 2–3 µl and resuspended in 50 µl DIG Easy buffer (Roche, Paris, France) containing 5 µl each of 10 mg/ml yeast tRNA and 10 mg/ml calf thymus DNA (Sigma, Sydney, Australia). This mixture was then applied to microarrays printed with the NIA 15K set (Clive and Vera Ramaciotti Centre, UNSW, Sydney, Australia) and hybridised under a coverslip at 37 °C for 15 hours. Coverslips were removed by immersion in 1× SSC; slides were then washed three times for 15 mins at 50 °C with 1× SSC, 0.1% SDS, rinsed three times in 1× SSC, and dried by centrifugation. Slides were scanned in an ArrayWorx (Applied Precision) microarray scanner for 0.4 s (Cy3) and 0.5 s (Cy5). The resultant TIFF images were then extracted with Spot v. 2 (CSIRO, Australia; http://experimental.act.cmis.csiro.au/Spot/index.php), and expression ratios were calculated after control removal and morphological opening background subtraction.
2.6.3 Normalisation
Median, global loess, and print–tip loess normalisations were carried out as implemented in the limma v. 1.8.6 package of Bioconductor [Gentleman et al., 2004], as described in Smyth [2004]. Generalised Additive Models were fitted using the mgcv package [Wood, 2001] for the R programming language [R Development Core Team, 2005], using default parameters. A GAM is fitted to each subgrid of each array, with smooth functions of mean intensity A and deposition order D used as predictors for expression ratio M. The residuals of the model are then taken as normalised M values. For all three smoothing–based methods, scaling is then performed using the limma library.
2.6.4 Linkage analysis
A genetic map of 387 informative mouse markers spanning all autosomes and the X chromosome was compiled from publicly available information on the BXD strains at http://www.nervenet.org [Eva Chan, UNSW, pers. comm.]. Markers with missing genotypes were excluded, as were those with redundant Strain Distribution Patterns (genotype strings across the panel; SDPs). Those with fewer than two of either genotype were considered uninformative, and also excluded.
For each marker, a Student’s t–test was calculated for each gene by separating the 16 expression ratios into two groups by genotype. Expression ratios were then randomly resampled with replacement to create new groups of the same numbers of observations, from which a new t–test was calculated. This process was repeated 15,000 times to give a distribution of permuted t–tests for each gene. The P value is then the proportion of permuted t–tests greater in magnitude than the observed statistic for the gene. These P values are therefore limited to a minimum value of 1/10000, or $1 \times 10^{-5}$.
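The resampling procedure for a single gene and marker can be sketched as follows. This Python code is an illustration under stated assumptions rather than the code used for the analysis: the function names, the example genotype labels and expression ratios, the handling of degenerate resamples and the reduced number of resamples are all hypothetical choices.

```python
import math
import random
import statistics


def student_t(group_a, group_b):
    """Two-sample Student's t statistic with a pooled variance estimate."""
    na, nb = len(group_a), len(group_b)
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    pooled = ((na - 1) * statistics.variance(group_a)
              + (nb - 1) * statistics.variance(group_b)) / (na + nb - 2)
    if pooled == 0.0:
        # Degenerate resample: no variation within either group.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled * (1.0 / na + 1.0 / nb))


def resampled_p_value(ratios, genotypes, n_resamples=2000, seed=0):
    """Proportion of resampled |t| values exceeding the observed |t|.

    Assumes exactly two genotype labels, each carried by at least two strains.
    """
    rng = random.Random(seed)
    labels = sorted(set(genotypes))
    group_a = [m for m, g in zip(ratios, genotypes) if g == labels[0]]
    group_b = [m for m, g in zip(ratios, genotypes) if g == labels[1]]
    observed = abs(student_t(group_a, group_b))
    exceed = 0
    for _ in range(n_resamples):
        # Draw new groups of the original sizes, with replacement.
        new_a = rng.choices(ratios, k=len(group_a))
        new_b = rng.choices(ratios, k=len(group_b))
        if abs(student_t(new_a, new_b)) > observed:
            exceed += 1
    return exceed / n_resamples


# Hypothetical expression ratios for sixteen strains.
ratios = [0.02, -0.05, 0.01, 0.04, -0.03, 0.00, 0.03, -0.02,
          1.01, 0.97, 1.05, 0.99, 1.02, 0.96, 1.03, 1.00]
linked_marker = ["B"] * 8 + ["D"] * 8   # genotype tracks expression
unlinked_marker = ["B", "D"] * 8        # genotype unrelated to expression
print(resampled_p_value(ratios, linked_marker))
print(resampled_p_value(ratios, unlinked_marker))
```

The smallest non–zero P value such a procedure can return is the reciprocal of the number of resamples, so the number of resamples bounds the significance levels that can be resolved.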
Chapter 3

Genetic influence on mRNA levels is tissue specific
3.1 Introduction
One of the more significant insights in modern genetics has been the realisation that phenotypic diversity does not necessarily reflect a commensurate level of genetic difference. Thus, genetically closely related species can in fact differ substantially in morphology, behaviour and cognition, and biochemistry [for example, Gompel et al., 2005; Hunter et al., 2005]. It has been suggested that there is therefore insufficient genetic variation to explain such divergence in terms of coding sequence polymorphisms that lead to alterations in gene product activity, and therefore variants which alter gene regulation may have significant roles in the generation of diversity [King and Wilson, 1975].
The general principle of phenotype:genotype variation imbalance can also be applied to the differences between individuals of a species. The recent demonstrations of substantial heritability in, and genetic influences on, many mRNA levels in yeast [Brem et al., 2002; Yvert et al., 2003], rodents [Bystrykh et al., 2005; Chesler et al., 2005; Hubner et al., 2005], and humans [Monks et al., 2004; Morley et al., 2004; Schadt et al., 2003] reinforce this notion, suggesting that regulatory variants are a major mechanism of phenotypic variation between individuals.
By analogy to sequence polymorphisms, it has generally been assumed that a significant proportion of regulatory polymorphisms between individuals can be detected by extrapolation from a single tissue or state [Chesler et al., 2005]. Since such polymorphisms must reside in regulatory mechanism components, this would imply that these components are common between tissues, despite extensive evidence that both cis–acting functional elements and trans–acting transcription effectors are tissue specific [Wray et al., 2003], as are at least some expression differences between individuals [Bystrykh et al., 2005; Chesler et al., 2005; Cowles et al., 2002].
The total incidence of regulatory variation is therefore unknown, even within the well characterised genetic environments of segregating experimental populations.
Here, we present an experiment using microarrays to measure differences in 22,000 gene expression levels between three tissues in two common inbred strains of mice, C57BL/6J and DBA/2J. The experiment is performed using pools of total RNA from many individuals raised in a constant environment, so that any differences between strains may be considered of genetic origin. The strains are amongst the oldest extant mouse strains [Beck et al., 2000], having been developed at the beginning of the last century [Festing, 1998]. They have a significant number of strain–specific polymorphisms, and differ in many physiological and behavioural phenotypes, such as responses to ethanol, sugar preference, and stress–related behaviours [Festing, 1998].
In these experiments, we show that >95% of expression level differences are tissue–specific in transcripts detectable in all three tissues, and that these are mostly genes involved in transcription and associated metabolic pathways. Finally, we demonstrate that extrapolation between tissues misidentifies the vast majority of effects in the predicted tissue, a result which has implications for experimental design, particularly in human genetics.
3.2 Experimental design
We wished to identify genes whose mRNA levels are influenced by genetic variation between the two inbred mouse strains, and ascertain whether these influences occur across multiple tissues or are tissue–specific. However, not all genes on our arrays are expressed in any or all of the three tissues: we therefore define expressed genes as those reliably detected in each tissue, i.e. having a mean intensity greater than the 95th percentile of negative control values. Our experiment compares multiple age– and sex–matched individuals from the two strains, raised in identical environmental conditions and processed in the same way. We therefore considered significant changes in mRNA levels of reliably detected genes to be the result of genetic variation, and term such genes genetically influenced. Genes can therefore be classified as: