The Influence of Genetic Variation in Gene Expression

Eva King-Fan Chan

A thesis submitted in fulfilment of the requirements for the degree of

Doctor of Philosophy

2007

School of Biotechnology and Biomolecular Sciences

University of New South Wales

Certificate of originality

I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.

______Eva Chan 18th July 2007

Abstract

Variations in gene expression have long been hypothesised to be the major cause of individual differences. An initial focus of this research thesis is to elucidate the genetic regulatory architecture of gene expression. Expression quantitative trait locus (eQTL) mapping analyses have been performed on expression levels of over 22,000 mRNAs from three tissues of a panel of recombinant inbred mice. These analyses are “single-locus”, where “linkage” (i.e. significant correlation) between an expression trait and a putative eQTL is considered independently of other loci. Major conclusions from these analyses are: 1. Gene expression is mainly influenced by genetic (sequence) variations that act in trans rather than in cis; 2. Subsets of genes are controlled by master regulators that influence multiple genes; 3. Gene expression is a polygenic trait with multiple regulators.

Single-locus mapping analyses are not designed for detecting multiple regulators of gene expression, and so the observation of multiple-linkages (i.e. one expression trait mapped to multiple eQTLs) formed the basis of the second objective of this research project: to investigate the relationship between multiple-linkages and genotype pattern-association. A locus-pair is said to have associated genotype patterns if the two loci have similar inheritance patterns across a panel of individuals, and these are attributed to one of four sources: 1. linkage disequilibrium between loci located on the same chromosome; 2. non-syntenic association; 3. random association; 4. un-associated.

To understand the validity of multiple-linkages observed in single-locus mapping studies, a newly developed method, bqtl.twolocus, is applied to confirm two-locus effects for a total of 898 out of 1,233 multiple-linkages identified from the three studies mentioned above as well as from seven publicly available eQTL-mapping studies. Combining these results with information on genotype pattern-association, a subset of 478 multiple-linkages has been identified as real with high confidence.
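As an illustration only, the following minimal Python sketch (not the mapping software used in this thesis) treats single-locus linkage as the Pearson correlation between an expression trait and each marker's genotype pattern, and measures genotype pattern similarity as the Pearson correlation between two markers. The marker names, genotype codes, and trait values are hypothetical, and no significance test or multiple-testing correction is applied.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical genotype patterns across eight recombinant inbred lines
# (0 = allele from one parental strain, 1 = allele from the other).
markers = {
    "M1": [0, 0, 0, 0, 1, 1, 1, 1],
    "M2": [0, 0, 0, 1, 1, 1, 1, 1],  # differs from M1 in one line only
    "M3": [0, 1, 0, 1, 0, 1, 0, 1],  # unrelated to M1
}

# Hypothetical expression trait (one value per line) that follows M1.
trait = [1.0, 1.2, 0.9, 1.1, 2.0, 2.1, 1.9, 2.2]

# Single-locus scan: each marker is tested independently of the others.
scan = {name: pearson(genotypes, trait) for name, genotypes in markers.items()}

# Genotype pattern similarity between two markers.
similarity = pearson(markers["M1"], markers["M2"])

for name, r in scan.items():
    print(f"{name}: r = {r:.3f}")
print(f"M1 vs M2 genotype pattern similarity: r = {similarity:.3f}")
```

In this toy example the trait correlates strongly with M1 and, because M2 shares most of M1's genotype pattern, also with M2. This is the kind of multiple-linkage whose validity the second part of the thesis examines.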

Acknowledgements

Three-and-a-half years and it is now over. It had been intellectually challenging, stressful at times (most of the time), frustrating (still can’t believe I have not yet thrown anything at the computer despite my many threats to do so), but all-in-all it had been a thoroughly enjoyable experience.

Needless to say, the person I would like to thank most is my supervisor, Peter Little, from whom I learnt the true meaning of “research”. Peter is an exceptional mentor and I am truly grateful for his guidance and the privilege to share in his passion.

Three very special people I would very much like to acknowledge are Rohan Williams, Mark Cowley, and Chris Cotsapas, with whom I have shared the lab over the past several years. It is near impossible to count all that I have learnt from Rohan and I’d like to thank him for his advice, encouragement, and friendship. Many thanks to Mark for being my IT guru and for showing me scientists need not be overly eccentric (and that it is okay to listen to pop). Thanks also to Chris for his endless efforts to stimulate the unwary mind with everything and anything.

Much appreciation to David Nott, our resident Statistician, who has on so many occasions helped simplify a complicated problem to an elegant Bayesian model.

Acknowledgements also to many students that have contributed directly or indirectly to my work: Jeremy Pulvers, Michael Liu, Oscar Luo, and Andrew Liu. Thanks to my internal reviewers Ian Dawes and Andrew Brown.

Publications

Published papers

Rohan B H Williams, Chris J Cotsapas, Mark J Cowley, Eva Chan, David J Nott, Peter F R Little. 2006 Normalization procedures and detection of linkage signal in genetical-genomics experiments. Nature Genetics 38: 855-856.

Chris J. Cotsapas, Rohan B.H. Williams, Jeremy N. Pulvers, David J. Nott, Eva K.F. Chan, Mark J. 2006 Genetic dissection of gene regulation in multiple mouse tissues. Mammalian Genome 17 (6): 490-495.

David J. Nott, Zeming Yua, Eva Chan, Chris Cotsapas, Mark Cowley, Jeremy Pulvers, Rohan Williams and Peter Little. Hierarchical Bayes variable selection and microarray experiments. Journal of Multivariate Analysis (accepted)

Papers in preparation

Eva KF Chan, Mark J Cowley, Rohan BH Williams, Chris J Cotsapas, David J Nott, Peter FR Little. 2006 Multiple linkages in eQTL studies of mice, rats, and yeast. (in preparation)

Rohan BH Williams, Eva KF Chan, Mark J Cowley, Peter FR Little. The influence of genetic variation on gene expression. Genome Research (invited review).

Mark J Cowley, Chris J Cotsapas, Rohan BH Williams, Eva KF Chan, Jeremy N Pulvers, Michael Y Liu, David J Nott, Peter FR Little. The effects of genetic variation on gene expression in multiple tissues. (submitted)

Refereed Published Conference Proceedings

Cotsapas C, Chan E, Kirk M, Tanaka M, Little P. 2003 Genetic variation and the control of transcription. In Cold Spring Harbor Symposia on Quantitative Biology, Symposium 68, pp. 109-114.

Table of Contents

Certificate of originality
Abstract
Acknowledgements
Publications
Published papers
Papers in preparation
Refereed Published Conference Proceedings
Table of Contents
Table of Figures
Table of Tables
1. Introduction
1.1. Natural variation
1.2. Gene expression regulatory variation
1.2.1. Cis-acting regulatory elements
1.2.2. Trans-acting regulatory factors
1.3. Mapping expression quantitative trait loci
1.3.1. eQTL mapping approaches
1.3.2. Segregating populations
1.4. Microarrays and expression traits
1.4.1. Normalisation of expression data
1.5. Marker genotype patterns
1.6. Single-locus mapping
1.6.1. Significance test and multiple testing
1.7. Multi-locus mapping
1.8. Aim and hypotheses
1.8.1. Influence of genetic variation on mRNA levels (Chapter 3)
1.8.2. Multi-locus influence of gene expression variation (Chapter 4)
1.8.3. Influence of genotype pattern association in multiple linkage (Chapter 5)
2. Materials and Methods
2.1. Brain, kidney, and liver BXD eQTL-mapping studies
2.1.1. BXD RI mice
2.1.2. Marker genotype data
2.1.3. Expression microarray data
2.1.4. eQTL-mapping
2.1.5. Defining “expressed” genes
2.2. Meta-analysis
2.2.1. Experimental crosses
2.2.2. Marker genotype data
2.2.3. Expression data
2.2.4. eQTL-mapping
2.3. Simulation studies
2.3.1. sim0B/K/L
2.3.2. simnorm
2.3.3. sim(0)/(1)/(2)/(3)
2.3.4. permB/K/L
2.4. Genotype pattern similarity
2.4.1. Simulation study: testing inter-chromosomal genotype pattern similarity
2.5. remove.LD
2.5.1. Algorithm
2.5.2. R-script
2.6. bqtl.twolocus test
2.7. R-scripts to eliminate redundant linkages
2.7.1. Method: support interval of 2 LOD unit
2.7.2. Method: monotonic decay
2.7.3. Method: HUBNER et al. (2005)
3. Linkage analysis of expression traits in three tissues of recombinant inbred mice
3.1. Experimental design
3.1.1. Single-locus eQTL mapping
3.1.2. Simulation studies of the null hypothesis: sim0B/K/L
3.2. Expression quantitative trait loci
3.2.1. Defining Linkage Significance
3.2.2. Extent of Mappable Expression Traits
3.2.3. Artefacts of residual inter-array variation
3.2.4. Artefacts of low level signal in microarray data
3.2.5. Genetic influence of expressed and non-expressed genes
3.2.6. Expression trait-variance
3.3. Multiple regulators of gene expression
3.4. Cis- and Trans- acting modulators of gene expression
3.5. Master regulators of gene expression
3.6. Tissue-specificity
3.6.1. Tissue-specific gene expression
3.6.2. Tissue-specific genetic influence of gene expression
3.7. Chapter summary and discussions
3.7.1. Facts and figures
3.7.2. Quality control of microarray eQTL-mapping studies
3.7.3. False positive and false negative rates
3.7.4. Involvements of genotype patterns in eQTL mapping studies
4. Validation and analysis of two-locus influence on mRNA levels
4.1. Experimental design
4.1.1. Genotype pattern-association and removal of redundant linkages due to linkage disequilibrium
4.1.2. Validation of two-locus effects on expression traits: bqtl.twolocus
4.2. Identification of unique multiple linkages in the BXD brain, kidney, and liver data
4.2.1. Methods for eliminating redundant linkages
4.2.2. Identifying unique multiple linkages
4.3. Verification of two-locus interactions in the BXD brain, kidney, and liver data
4.4. Step-wise search for two-locus effects and its dependency on sample size
4.5. Meta-analysis of multiple linkages
4.5.1. Identifying transcripts linked to at least one eQTL
4.5.2. Identifying transcripts linked to two or more unique eQTLs
4.5.3. Verification of two-locus interactions in ten eQTL datasets
4.6. Chapter summary and discussions
4.6.1. Facts and figures
4.6.2. Linkage disequilibrium and remove.LD
4.6.3. Confirming two-locus effects and bqtl.twolocus
4.6.4. Prelude to Chapter 5: genotype pattern-association
5. Genotype pattern association and multiple-linkages
5.1. Experimental design
5.1.1. Classifying syntenic association
5.1.2. Classifying non-syntenic association
5.1.3. Classifying random genotype pattern association
5.1.4. Classifying unassociated genotype patterns
5.2. Genotype pattern similarity can cause multiple-linkages
5.3. Residual linkage disequilibrium as a source of multiple-linkages
5.4. Non-syntenic association as a source of multiple-linkages
5.5. Random genotype pattern association as a source of multiple-linkages
5.6. Unassociated multiple-linkages
5.7. Chapter summary and discussions
5.7.1. The pipeline
5.7.2. Sources of genotype pattern association
5.7.3. Multiple-linkages and polygenic expression traits
6. Discussions
6.1. eQTL-mapping pipeline
6.2. Gene expression is an oligogenic trait
6.2.1. Solution: multi-locus mapping?
6.2.2. Influence of genotype pattern association in mapping multi-locus effects
6.3. Regulation of gene expression is pleiotropic
6.3.1. Mechanism of control of gene expression
Appendix
A.1. Quantile-normalised data
A.1.1. Brain, kidney, and liver BXD studies
A.1.2. sim(0)/(1)/(2)/(3)
A.2. Number of linkages per transcript
A.3. cis-linkages
A.4. Linkage hotspots
A.4.2. Analysis of potential master regulators in the BXD brain (CotB), kidney (CotK), and liver (CotL) studies
A.5. Null transcripts
A.6. Linkages at D8Mit189
A.7. bqtl.twolocus results
A.7.1. From the BXD brain (CotB), kidney (CotK), and liver (CotL) studies
A.8. Genotype pattern and multiple linkages
A.9. Effect of LD on joint two-locus effects
References
Referenced Websites

Table of Figures

Figure 1-1 Generation of recombinant inbred lines.
Figure 1-2 Density distribution of background-corrected and log2-transformed hybridization intensities of the red and green channel.
Figure 1-3 Genotype patterns.
Figure 2-1 Process of sampling sim0 data.
Figure 2-2 Process of sampling expression data from a normal distribution.
Figure 2-3 Process of sampling sim(1).
Figure 2-4 Process of permuting permB/K/L.
Figure 2-5 Process of randomizing across a panel of RI lines.
Figure 3-1 Example genome scan.
Figure 3-2 Distributions of expression data in brain samples of 31 BXD RI mice, before (top) and after (bottom) quantile-normalisation.
Figure 3-3 Density distribution of standard deviations per expression trait from sets of five-replicates of four simulated expression datasets.
Figure 3-4 Distribution of signal levels of empty spots (black), blank spots (red), and negative ScoreCards (green) in the BXD brain (left), kidney (middle), and liver (right) datasets.
Figure 3-5 Distributions of A-values of actual transcripts on the arrays (black) and all 875 negative controls (red) in brain (left), kidney (middle), and liver (right) BXD dataset.
Figure 3-6 Behaviours of trait-variances, average expression levels (A-value), and linkage significance (Praw), of all significant linkages (Praw ≤ PBON) in the BXD brain (left column), kidney (middle column), and liver (right column) studies.
Figure 3-7 Proportions of transcripts linked to 1-5, or >5 marker loci, prior to removal of potentially LD-induced multiple linkages.
Figure 3-8 Physical map of the 779 markers used in the BXD eQTL mapping analyses.
Figure 3-9 Genome scan of Ppp4c (protein phosphatase 4, catalytic subunit) transcript in the BXD brain study.
Figure 3-10 Proportion of transcripts linked to 1-5, or >5 unique linkages, following application of remove.LD to eliminate potentially LD-induced multiple linkages.
Figure 3-11 Example cis-linkage of a RIKEN transcript (1810029E06Rik; AK007639) from BXD liver study.
Figure 3-12 Example transcript (pax3: paired-box 3) from our BXD brain data with both cis and trans linkages.
Figure 3-13 Percentages of cis-linkages defined by different ‘cis-windows’.
Figure 3-14 Distribution of the 779 genetic markers across the 20 chromosomes.
Figure 3-15 Percentage cis-linkages at different Praw thresholds for defining significant linkage.
Figure 3-16 Percentage syntenic linkages at different Praw thresholds.
Figure 3-17 Linkage hotspots in the BXD brain, kidney, and liver data.
Figure 3-18 Linkage hotspots in the sim0B, sim0K, and sim0L studies.
Figure 3-19 Linkage hotspots of negative controls in the BXD brain, kidney, and liver studies.
Figure 3-20 Proportions of tissue-specific and tissue-non-independent expressions.
Figure 3-21 Extent of null expressed (completely unexpressed) transcripts in one, two, or all three tissues.
Figure 3-22 Extent of transcripts mapped uniquely to and commonly between tissues.
Figure 3-23 Average numbers of linkages at markers with different numbers of B-genotypes.
Figure 4-1 Heat map representation of all pair-wise genotype pattern similarities of 779 markers across 31 BXD RI strains.
Figure 4-2 Pearson’s correlation between D1Mit169 and other markers.
Figure 4-3 Probability distributions of inter- (black) and intra- (red) chromosomal marker genotype pattern correlations.
Figure 4-4 Genome-scan of a Shoc2 (suppressor of clear-2 homolog) transcript.
Figure 4-5 Proportions of transcripts linked to 1-5 or >5 eQTLs before and after application of remove.LD in the BXD, sim0B/K/L, and sim0norm studies.
Figure 4-6 Significance of LOD scores from two-locus linkage analyses in our 31 BXD Brain (left), Kidney (middle), and Liver (right) data.
Figure 4-7 M-value distributions of expression data from our real BXD brain (blue), kidney (red), and liver (green) studies, and from the five simnorm replicate studies (black).
Figure 4-8 Estimated false positive rates and posterior probabilities of primary locus and joint two-locus effects in datasets using brain tissues of 31 BXD RI mice (A); 31 haploid yeast (B); and 86 haploid yeast.
Figure 4-9 Proportions of transcripts with multiple linkages before and after remove.LD is applied to remove potentially LD-induced linkages.
Figure 4-10 Physical maps of markers used in the four experimental crosses.
Figure 4-11 Proportions of transcripts linked to 1, 2, 3, 4, or 5 unique eQTLs after applying remove.LD.
Figure 4-12 Proportions of transcripts with multiply-linked eQTLs showing evidence of two-locus interactions and two-locus epistatic interactions.
Figure 4-13 Boxplot summary of the LOD scores from two-locus linkage validation tests across the ten eQTL-mapping data.
Figure 4-14 Boxplot summary of expression variations (standard deviations of expression values across all individuals) of transcripts linked at multiple eQTLs.
Figure 4-15 Average expression levels in HubK (left) and HubFC (right) of all 15,866 transcripts (black); 404 (HubK) or 227 (HubFC) transcripts with multiple linkages (red); and Affymetrix BioB hybridisation controls (green).
Figure 4-16 Number of the 404 and 227 transcripts, in HubK (top) and HubFC (bottom), respectively, with multiple linkages, linked at each marker locus (x-axis).
Figure 5-1 Diagram exemplifying how trait y can significantly link to both marker loci L1 and L2.
Figure 5-2 Non-syntenic association.
Figure 5-3 Schematic diagram outlining the pipeline (decision tree) for determining real (circled) and spurious (squared) multiple-linkages.
Figure 5-4 Distribution of genotype pattern correlations between all pair-wise inter-chromosomal markers from the marker maps of different panels of individuals.
Figure 5-5 Genome scan and correlation between genotype pattern similarity and linkage of Foxa3.
Figure 5-6 Linkages to genotype pattern inverts.
Figure 5-7 Similar genotype patterns between syntenically associated marker loci.
Figure 5-8 Percentages of transcripts linked to multiple eQTLs that are in synteny.
Figure 5-9 Percentages of transcripts linked to syntenically-associated eQTLs that are confirmed for two-locus and two-locus epistatic effects.
Figure 5-10 Example of NSA multiple-linkage.
Figure 5-11 Example of non-NSA multiple-linkage.
Figure 5-12 Percentages of transcripts linked to multiple eQTLs that are non-syntenically associated.
Figure 5-13 Percentages of transcripts linked to NSA eQTLs that are confirmed for two-locus and two-locus epistatic effects.
Figure 5-14 Percentage of transcripts linked to multiple eQTLs that are randomly associating (defined using tail-50% criterion).
Figure 5-15 Percentages of transcripts linked to randomly associated eQTLs (using tail-50% criterion) that are confirmed for two-locus and two-locus epistatic effects.
Figure 5-16 Percentages of transcripts linked to multiple eQTLs that are randomly associating (using tail-10% criterion).
Figure 5-17 Percentages of transcripts linked to randomly associated eQTLs (using tail-10% criterion) that are confirmed for two-locus and two-locus epistatic effects.
Figure 5-18 Percentages of transcripts linked to multiple eQTLs that are unassociated, following defining random association with the tail-50% criterion (Figure 5-14).
Figure 5-19 Percentages of transcripts linked to unassociated eQTLs (following defining random association with the tail-50% criterion) that are confirmed for two-locus and two-locus epistatic effects.
Figure 5-20 Percentages of transcripts linked to multiple eQTLs that are unassociated, following defining random association using tail-10% criterion (Figure 5-16).
Figure 5-21 Percentages of transcripts linked to unassociated eQTLs (following defining random association using tail-10% criterion) that are confirmed for two-locus and two-locus epistatic effects.
Figure 5-22 Lewontin’s D’ measure of LD between locus-pairs identified from single-locus mapping analyses of the CotB/K/L data, against joint effects estimated from bqtl.twolocus.
Figure 5-23 Comparisons between Pearson’s correlation coefficient with D’ (LEWONTIN and KOJIMA 1960) and R2 (HILL and ROBERTSON 1968) as measures of LD.
Figure 5-24 Summary result from pipeline for determining real and spurious multiple-linkages.
Figure 6-1 Pipeline for analysing the genetic regulatory architecture of gene expression using eQTL-mapping studies.
Figure 6-2 Expression correlation between potential master regulators and their target genes.
Figure 6-3 Schematic diagram of inter-dependent master regulation of gene expression.
Figure A-1 Boxplot representations of M-value distributions of the brain BXD (CotB) data before (top) and after (bottom) quantile-normalisation.
Figure A-2 Boxplot representations of M-value distributions of the kidney BXD (CotK) data before (top) and after (bottom) quantile-normalisation.
Figure A-3 Boxplot representations of M-value distributions of the liver BXD (CotL) data before (top) and after (bottom) quantile-normalisation.
Figure A-4 Boxplot representations of M-value distributions of five (a-e) sim(0) datasets before (top) and after (bottom) quantile-normalisation.
Figure A-5 Boxplot representations of M-value distributions of five (a-e) sim(1) datasets before (top) and after (bottom) quantile-normalisation.
Figure A-6 Boxplot representations of M-value distributions of five (a-e) sim(2) datasets before (top) and after (bottom) quantile-normalisation.
Figure A-7 Boxplot representations of M-value distributions of five (a-e) sim(3) datasets before (top) and after (bottom) quantile-normalisation.
Figure A-8 Linkage hotspots in ChesB (top), BysHSC (middle), and SchL (bottom).
Figure A-9 Linkage hotspots in HubK (top) and HubFC (bottom).
Figure A-10 Linkage hotspots in YveY3 (top), and YveY5 (bottom).
Figure A-11 Pearson’s correlation coefficient measure of linkage disequilibrium between locus-pairs identified from single-locus mapping analyses of the CotB/K/L data, against estimated joint two-locus effects on the corresponding expression traits using bqtl.twolocus.
Figure A-12 Pearson’s correlation coefficient measure of linkage disequilibrium between locus-pairs identified from single-locus mapping analyses of the ChesB, BysHSC, and SchL data, against estimated joint two-locus effects on the corresponding expression traits using bqtl.twolocus.
Figure A-13 Pearson’s correlation coefficient measure of linkage disequilibrium between locus-pairs identified from single-locus mapping analyses of the HubFC/K data, against estimated joint two-locus effects on the corresponding expression traits using bqtl.twolocus.
Figure A-14 Pearson’s correlation coefficient measure of linkage disequilibrium between locus-pairs identified from single-locus mapping analyses of the YveY3/Y5 data, against estimated joint two-locus effects on the corresponding expression traits using bqtl.twolocus.

Table of Tables

Table 2-1 Experimental crosses used in seven currently available eQTL-mapping studies.
Table 3-1 Summary estimate of cis-acting eQTLs in eight recent eQTL-mapping studies.
Table 3-2A eQTL-mapping results using PBON for defining linkage significance.
Table 3-2B eQTL-mapping results using PFULL for defining linkage significance.
Table 3-2C eQTL-mapping results using PRED for defining linkage significance.
Table 3-3 Summary linkage results on simulation studies: sim(0)/(1)/(2)/(3). Results are averages of five simulation studies.
Table 3-4 Significant linkages per transcript before (middle two columns) and after (unique; last two columns) application of remove.LD (Section 4.2.2) to eliminate potentially redundant linkages due to linkage disequilibrium.
Table 3-5 Extent of transcripts with known genomic location linked in cis and/or in trans at at least one unique eQTL.
Table 3-6 Numbers and percentages of expected and observed syntenic linkages in the BXD Brain, Kidney, and Liver data.
Table 3-7 Linkage hotspots in the BXD brain, kidney, and liver eQTL studies.
Table 3-8 Number of transcripts that are null in one tissue (rows) but expressed in all 31 BXD strains in another (columns).
Table 3-9 Transcripts mappable in all three tissues.
Table 4-1 Number (%) eQTLs linked to a transcript for the 2,996 mappable expression traits in the BXD brain study.
Table 4-2 Number (%) transcripts with at least one significant eQTL (top row), with multiple potentially redundant linkages (middle row) and multiple unique linkages (bottom row), in our BXD brain, kidney, and liver studies.
Table 4-3 Summary of two-locus linkages verified by bqtl.twolocus in the BXD brain, kidney, and liver data.
Table 4-4A Transcripts mapped to locus-pairs showing significant interaction in the BXD brain study.
Table 4-4B Transcripts mapped to locus-pairs showing significant interaction in the BXD kidney study.
Table 4-4C Transcripts mapped to locus-pairs showing significant interaction in the BXD liver study.
Table 4-5 Average numbers of multiple linkages and confirmed two-locus effects in the permB, permK, and permL studies.
Table 4-6 Average numbers (%) of multiple linkages and confirmed two-locus effects in the sim0B, sim0K, and sim0L studies.
Table 4-7 Summary of eQTL-mapping datasets used in meta-analysis.
Table 4-8 Extent of linked transcripts across the ten eQTL-mapping analyses.
Table 5-1 Proportions of all pair-wise inter-chromosomal genotype pattern similarities, as measured by Pearson’s correlation coefficient, using an original panel of 32 BXD RI lines and replicate sets of simulated 32 RI lines.
Table 5-2 NSA locus-pairs where one of the eQTLs is located on the same chromosome as the expression trait, in the CotB/K/L studies.
Table 6-1 Over-represented transcription factor (TF) binding-motifs identified in more than 10% of 372 transcripts mapped to the Chr1 linkage hotspot.
Table A-1 Numbers of transcripts linked to 1-5 or >5 eQTLs in the brain, kidney, and liver BXD studies.
Table A-2 Numbers of simulated genes linked to 1-5 or >5 eQTLs in the sim0B studies.
Table A-3 Numbers of simulated genes linked to 1-5 or >5 eQTLs in the sim0K studies.
Table A-4 Numbers of simulated genes linked to 1-5 or >5 eQTLs in the sim0L studies.
Table A-5 Number of simulated genes linked to 1-5 or >5 eQTLs in the sim0norm studies.
Table A-6 Numbers of transcripts linked to 1-5 or >5 eQTLs in the ten eQTL-mapping studies analysed in this thesis project.
Table A-7 Number (%) of linkages defined as cis- or trans- acting using different “cis-window” sizes in the BXD brain, kidney, and liver studies.
Table A-8 Number (%) of linkages at different Praw that are defined as cis- or trans- acting in the BXD brain, kidney, and liver studies.
Table A-9 Master regulator analysis for CotB, CotK, and CotL.
Table A-10 Transcripts that are “null expressed” in one tissue but “completely expressed” in another.
Table A-11 Transcripts linked to the Chr8: 10-46Mb region at Praw ≤ PRED, in the BXD brain, kidney, and liver studies.
Table A-12 bqtl.twolocus results for 72 transcripts linked to two eQTLs in the brain 31 BXD (CotB) study.
Table A-13 bqtl.twolocus results for 17 transcripts linked to two eQTLs in the kidney 31 BXD (CotK) study.
Table A-14 bqtl.twolocus results for 35 transcripts linked to two or three eQTLs in the liver 31 BXD (CotL) study.
Table A-15 Permuted expression phenotypes that passed the bqtl.twolocus test in the permB/K/L studies.
Table A-16 Number and percentages of transcripts with multiple unique linkages that are confirmed for two-locus and two-locus epistatic effects in the ten eQTL-mapping studies analysed in this thesis project.
Table A-17 Genotype pattern-association classification and bqtl.twolocus results of multiple-linkage from the ten eQTL-mapping studies.
Table A-18 Alternate version of Table A-17 where random association is defined using the “tail-10%” criterion (Section 5.5).

1. Introduction

1.1. Natural variation

The importance of natural variation is central to biology. In evolutionary genetics, natural variation is believed to be a major factor contributing to the survival of a population and subsequent speciation (DARWIN 1859). In medicine and health science, natural variation may cause a person to be resilient, or conversely predisposed, to certain diseases, pathological invasions, and other environmental assaults (CRAWFORD et al. 2005). In pharmacogenetics and pharmacogenomics, natural variation influences an individual's sensitivity and tolerance to drugs and toxins (MCLEOD and EVANS 2001; SADÉE and DAI 2005). In agriculture, natural variation has long been exploited in the cultivation and breeding of plants and animals for more desirable qualities (DEKKERS and HOSPITAL 2002; MORGANTE and SALAMINI 2003; ZAMIR 2001).

More than 30 years ago, KING and WILSON (1975) proposed the hypothesis that variation between individuals and species is predominantly driven by variation in the control of gene expression rather than in protein coding sequences. Since then, findings supporting the view that protein sequence variation is not the major cause of individual variation include:
1) There is a high degree of sequence identity between various evolutionarily related species (CLARK et al. 2003; GIBBS et al. 2004; KING and WILSON 1975; LANDER et al. 2001; LINDBLAD-TOH et al. 2005; THE CHIMPANZEE SEQUENCING AND ANALYSIS CONSORTIUM 2005; VENTER et al. 2001; WATERSTON et al. 2002);
2) There is >99.9% sequence identity within the human population (LANDER et al. 2001);
3) Only ~4.2% of SNPs (single nucleotide polymorphisms) are found within human exonic regions (SACHIDANANDAM et al. 2001); and
4) Less than 1% of all SNPs lead to variation in proteins (VENTER et al. 2001).

To demonstrate the involvement of genetic variation in the control of gene expression, recent studies have successfully shown differences in gene expression between individuals within genetically distinct populations of yeast (CAVALIERI et al. 2000), flies (JIN et al. 2001), and mice (COTSAPAS et al. 2003; COTSAPAS et al. 2006; SANDBERG et al. 2000). Gene expression has also been shown to vary among individuals of the same segregating population in fish (OLEKSIAK et al. 2002; WHITEHEAD and CRAWFORD 2005), maize (SCHADT et al. 2003), mice (COTSAPAS 2005; COTSAPAS et al. 2003; SCHADT et al. 2003), and man (WHITNEY et al. 2003). Taken together, these studies strongly suggest a genetic basis to gene expression variation.

Further evidence supporting King and Wilson’s hypothesis comes from several recent studies confirming that gene expression is heritable; that is, gene expression variation is attributable to genotypic differences. BREM et al. (2002) estimated that 84% of expression variation within a population of 40 haploid yeast is heritable. SCHADT et al. (2003) estimated that 29% of genes that are differentially expressed within 8 of 16 founders from four human CEPH (Centre d’Etude du Polymorphisme Humain) families have a detectable genetic component. Similarly, using 15 CEPH families, MONKS et al. (2004) estimated that 31% of the genes differentially expressed between at least half the children have a median heritability of 34%. In another study, CHEUNG et al. (2003) inferred heritability of gene expression by showing increasing levels of differential expression from genetically identical individuals (monozygotic twins), to individuals with ~50% identical genetic background (sibships), to unrelated individuals.


1.2. Gene expression regulatory variation

In eukaryotes, gene expression is a complicated process that converts a DNA sequence into a functional gene-product (protein or non-coding RNA), and involves many steps: 1) DNA modification, including methylation of DNA and acetylation of histones; 2) DNA transcription, including assembly of transcriptional complexes and transcription initiation; 3) RNA processing, including alternate pre-mRNA splicing and RNA stability; 4) RNA transport out of the nucleus and localisation in the cytosol; 5) Translation of mRNA into proteins; 6) mRNA degradation and stability; and 7) Post-translation modification of gene product, including protein folding and stability.

Differential regulation at any one or more of these steps may be a source of individual differences. Comparing the endogenous levels of mRNAs between any two individuals, using high-throughput technologies such as microarrays, provides a snapshot of the differences in gene expression between these individuals, and an opportunity to elucidate how this variation contributes to individual differences.

Fundamental to the above steps of gene expression is the binding of regulatory protein factors (trans-acting factors) to regulatory DNA/RNA sequence elements (cis-acting elements). Polymorphisms in either or both cis-acting and trans-acting regulators may, therefore, alter the expression level(s) of one or many genes. These two classes of effects have many different influences on genes and these are discussed below.


1.2.1. Cis-acting regulatory elements

A cis-acting element is a DNA sequence that serves as a recognition site for the binding of regulatory proteins. These sequences may be located relatively close to the target gene; for example, the well-known TATA-box generally resides 25-30bp upstream of the cognate gene’s transcription start site and acts as a binding site for RNA polymerase II (STRACHAN and READ 1999). Cis-elements may also be situated remotely from the target gene; for example, a regulatory element, ZRS, involved in the initiation and spatially specific expression of the sonic hedgehog (Shh) gene, resides approximately 1Mb from the gene (LETTICE et al. 2003).

These sequences may function to promote or repress transcription, influence RNA stability, initiate RNA degradation, promote alternate splicing, or signal RNA transportation. Thousands of cis-acting elements have been identified (e.g. MATYS et al. 2006) and each is specific for a different regulatory protein or a family of regulatory proteins. Cis-elements can be classified into several overlapping groups depending on their function, including:
• Locus control regions (LCR) are involved in long-distance temporal and tissue-specific regulation of gene expression via major chromatin remodelling events.
• Promoters act as binding sites for the assembly of transcriptional machineries to promote transcription. They are usually located immediately upstream of their target genes (within 200bp of transcriptional start site).
• Enhancers/silencers serve to increase/decrease the basal level of transcription, and include both ubiquitous and tissue-specific elements. Their function is independent of their orientation and distance from the gene they regulate.
• Intron/exon boundary elements, such as 3’ and 5’ splice sites and exonic elements, serve to regulate alternate pre-mRNA splicing.
• Untranslated regions (UTRs) of both the 3’ and 5’ ends control transcription initiation, mRNA processing, mRNA stability, and mRNA degradation.

Polymorphisms in cis-acting regulatory elements explain at least 25%-30% of inter-individual variations in gene expression, some of which have important medical implications (PASTINEN and HUDSON 2004; WITTKOPP 2005). A classic example is the well-studied β-globin LCR, which is necessary for orchestrating the developmental expression of the globin gene (for review see LI et al. 1999). A 100kb and a 30kb deletion within this locus have been associated with a Dutch (VAN DER PLOEG et al. 1980) and a Hispanic (DRISCOLL et al. 1989) thalassemia, respectively.

1.2.2. Trans-acting regulatory factors

Trans-acting factors are gene-products (usually proteins) that serve to regulate one or more processes of gene expression. These factors are brought into physical proximity of their target genes via the complementary recognitions and interactions between structural motifs of the regulatory factors and cis-acting elements of the target genes. Different classes of structural motifs may bind ubiquitously or selectively to regulate basal or specific (temporally or spatially) gene expression. Once bound, an alternate domain on the trans-acting factor may be activated to regulate gene expression in several ways, and we classify many trans-acting factors based on their functions:
• General (basal) transcription factors are bound to the promoters to support the assembly of transcription machinery. There are only a few factors within this class, but each is present in high abundance.
• Transcriptional activators/repressors bind to enhancers/silencers to directly or indirectly activate/inhibit transcription. Indirect actions include recruiting additional regulatory proteins, providing additional binding sites, and stabilising transition states.
• Spliceosomes and other factors function to convert a pre-mRNA to many different mature mRNAs, and consequently different proteins, via the removal of specific introns and joining of specific exons.
• Signal transduction proteins bind to other regulatory proteins to activate their function or to stabilise their binding to target genes.
• Inducible transcription factors are essentially transcriptional activators/repressors but must be induced by external stimuli.
• Co-factors are non-protein substances that function to initiate or stabilise binding of trans-acting regulators to the cis-acting elements.
• Non-coding RNAs are RNA molecules (i.e. not proteins) that regulate gene expression through the modulation of chromatin structure, transcription efficiency, and mRNA stability.
• Poly-A binding proteins bind to the poly-A tails of mRNAs to influence mRNA maturation, transportation, stability, and degradation.

Polymorphisms in many trans-acting factors, particularly in transcription factors and co-activators, have been linked to a variety of human diseases and inter-individual variations in drug response (SADÉE and DAI 2005; VILLARD 2004). A classic example of the importance of trans-acting variants in human disease is from the maturity onset diabetes of youth (MODY) where variations of up to four transcription factors have been linked to four different MODY subtypes (WINTER and SILVERSTEIN 2000).

While the action of a cis-acting regulatory element is specific to its cognate gene, a trans-acting factor often has a more pleiotropic effect: a single trans-acting factor may regulate the expression of many genes via its interaction with a common cis-acting element contained within, or near, all target genes. Variations in trans-acting factors are therefore able to influence multiple genes, in contrast to cis-acting variants. In the current literature, however, reports on the extent to which cis and trans variants influence gene expression are contradictory (BYSTRYKH et al. 2005; CHESLER et al. 2005; COTSAPAS 2005; HUBNER et al. 2005; MONKS et al. 2004; MORLEY et al. 2004; OLEKSIAK et al. 2002; SCHADT et al. 2003; WITTKOPP 2005; YVERT et al. 2003; Table 3-1), probably because of differences between these studies, including the type of individuals used, the experimental design and platform, and the analytical methods (ROCKMAN and KRUGLYAK 2006). One common finding amongst many studies, though, is that cis-acting influences are easier to detect than trans-acting variations (HUBNER et al. 2005; MONKS et al. 2004; SCHADT et al. 2003).


1.3. Mapping expression quantitative trait loci

To dissect the genetic control of gene expression variation, the approach of ‘genetical genomics’ or ‘eQTL mapping’ (JANSEN and NAP 2001) has been adopted by us and others. The term ‘genetical genomics’ was coined for the two pieces of information involved: marker genotype data (genetics) and gene expression data of a near genomic scale (genomics). The term ‘eQTL’ (expression quantitative trait locus) refers to the genomic region to which an expression trait is statistically significantly linked and which is thus inferred to be responsible for the variation in the trait’s expression level. This term evolved from the principles and techniques used to identify eQTLs; that is, from classical QTL mapping studies, with the exceptions that gene expression levels are used as quantitative traits (commonly known as expression traits) and the number of traits being tested is several orders of magnitude greater than in traditional QTL mapping studies.

1.3.1. eQTL mapping approaches

The aim of expression quantitative trait locus (eQTL) mapping is to identify genomic regions controlling gene expression variation. The underlying theory is: if a genetic locus harbours a variant involved in the regulation of a gene’s expression level, then the region and the gene’s expression level would co-segregate more often than expected by chance, resulting in ‘linkage’. But because the causative variations, the eQTLs, are unknown, marker genotype information is used to infer linkage. Based on the assumption of linkage disequilibrium between closely spaced markers (LD; co-segregation of nearby DNA sequences), a significant correlation (i.e. linkage) between a gene’s expression level (Section 1.4) and a marker’s genotypes (Section 1.5) is indicative of the presence of an eQTL located near the marker.


Tests for linkage can be classified into single-locus and multi-locus mapping approaches. As discussed in Section 1.2, many steps are involved in the regulation of gene expression and changes at any one or more stages may alter gene expression level. The choice between these two eQTL-mapping approaches is dependent on whether gene expression is affected by a single variation or by multiple variations, or eQTLs.

A single-locus approach assesses a gene for linkage at each of many marker loci independently along the genome (Section 1.6). A multi-locus mapping approach assesses an expression trait for linkage at multiple marker loci simultaneously (Section 1.7). Being a quantitative trait, gene expression is likely complex in origin, where its variation is influenced by multiple eQTLs each exerting small effects, or oligogenic in origin, where its variation is influenced by a few eQTLs exerting large effects and many other eQTLs exerting smaller effects.

A single-locus mapping approach focuses on identifying the one eQTL with the predominant effect, while a multi-locus approach attempts to identify several of the loci with major effects, and possibly those with minor effects. While a multi-locus mapping approach may provide better insight into gene expression regulation, a single-locus mapping approach is often preferred because it is statistically less complex and computationally more tractable than multi-locus mapping.

Details of single-locus mapping and multi-locus mapping are discussed in Section 1.6 and Section 1.7, respectively. But first, the two necessary types of data are reviewed: gene expression data in Section 1.4, and marker genotype data in Section 1.5. The types of individuals from which these data are obtained are reviewed below in Section 1.3.2.


1.3.2. Segregating populations

Any eQTL mapping analysis requires a segregating population, one where different alleles at any locus segregate randomly among different individuals. eQTL mapping studies rely on the expectation that an expression phenotype will co-segregate with its genetic regulatory element, consequently resulting in a significant statistical correlation between the gene’s expression level and the genotypes corresponding to the genomic locus harbouring the genetic regulatory element.

To avoid confusion, this manuscript refers to ‘an individual’ of a segregating panel as one or several animals with a unique genetic background. So in a natural population where each animal, plant, or person has a unique genetic background, an individual is one organism. But in an experimental cross where a strain or line of organisms is bred to homozygosity, an individual may imply any number of organisms of the same genotype.

The types of segregating populations that have been used in eQTL mapping studies include the BXD mouse RI (recombinant inbred) lines (CHESLER et al. 2005; COTSAPAS 2005; COTSAPAS et al. 2003; HUBNER et al. 2005), the BXH/HXB RI rat (BYSTRYKH et al. 2005), a mouse intercross (SCHADT et al. 2003), a maize intercross (SCHADT et al. 2003), a eucalyptus backcross (KIRST et al. 2004), a population of haploid yeast segregants (BREM et al. 2002; YVERT et al. 2003), and human pedigrees (MONKS et al. 2004; MORLEY et al. 2004; SCHADT et al. 2003).

Because the majority of the eQTL studies, including those from our laboratory (COTSAPAS 2005; COTSAPAS et al. 2003), use RI lines, the generation and importance of this type of experimental cross are discussed here. These animals are inbred homozygous, derived from brother-sister matings from the F1 generation of an initial cross between a pair of homozygous inbred founder strains (Figure 1-1). For example, the BXD RI mice (PEIRCE et al. 2004; TAYLOR 1978; TAYLOR et al. 1999) were derived from an initial cross between the diverged strains C57BL/6J and DBA/2J.

As a result of inbreeding for over 120 generations (>F120; of the BXD RI mice available at The Jackson Laboratory, Bar Harbor, Maine) and meiotic recombination, all animals of an RI panel are homozygous inbred and the genome of each line is essentially a mosaic pattern of the two parental genomes (BAILEY 1971). That is, at any given locus, an RI line has a homozygous allele originating from either of the two parents (Figure 1-1). For example, each locus of a BXD strain will have genotype BB or DD (shortened to B and D in this thesis). It should be emphasised that, as a result of inbreeding, all animals belonging to a strain are genetically identical and will breed true regardless of environmental or temporal variations.


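[Figure 1-1 schematic: founder strains B and D are crossed to give the F1, followed by the F2 and >F20 generations, producing the BXD lines; the lines shown are BXD1, BXD2, BXD3, BXD29, BXD30 and BXD31.]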

Figure 1-1 Generation of recombinant inbred lines. BXD RI mice are derived from continuous brother-sister mating from initial pairs of F1s, which are themselves derived from a cross between two divergent inbred mouse strains, C57BL/6J and DBA/2J. Only one pair of chromosomes is represented here, with purple denoting the genomic background of the B mice and green denoting the genomic background of the D mice. After 20 generations (>F20), each locus of an RI line is likely to have reached homozygosity and the genomic composition of each RI strain would have become a mosaic pattern of the two founder genomes. At present there are 88 BXD lines available (PEIRCE et al. 2004), but only 31 are used in our eQTL studies (COTSAPAS 2005).


1.4. Microarrays and expression traits

A microarray is a 2-dimensional array on which hundreds to tens of thousands of DNA molecules are deposited or synthesized at predefined spatial locations. Generally, each DNA molecule represents a unique gene or transcript, and depending on the microarray platform either cDNAs, long oligonucleotides (65mer), or short oligonucleotides (25mer) are used. In the case of cDNA arrays, an entire transcript of a gene is represented, but in the case of an oligonucleotide array, only part of a transcript of a gene is represented. On many microarray platforms, only one transcript (full or part of a splice variant) of a gene is represented. Thus, technically, conclusions drawn from a microarray experiment are only relevant to the DNA molecules that are represented on the arrays and not necessarily to all splice variants of the corresponding genes. However, for simplicity and unless otherwise specified, each DNA molecule on an array will be referred to as a ‘transcript’ or a ‘gene’ throughout this manuscript.

Aside from differences in the type of DNA sequences used, microarrays are also commonly classified into single- or dual-colour platforms. In a single-colour microarray platform, only one sample is hybridised onto an array after fluorescent or biotin labelling and the level of expression for each gene is directly correlated to the level of fluorescence or biotin detected at the corresponding feature. In a dual-colour platform, hybridisation is competitive: the sample of interest is co-hybridised onto an array with a reference sample and the expression level of each gene is represented as a ratio of the two channels (i.e. relative expression level).

In this thesis project, analyses are mainly performed on expression data generated from our laboratory (COTSAPAS 2005; COTSAPAS et al. 2003) using a dual-colour oligonucleotide microarray platform and an indirect experimental design. An indirect experimental design is one where each test sample is independently co-hybridised to an array with a common reference. In eQTL mapping studies, gene expression variations between multiple individuals are of interest but it is impossible to compare mRNA samples from all individuals directly. A common reference experimental design allows separate microarray experiments, corresponding to different individuals, to be standardised and subsequently compared.

To distinguish between the test and reference samples on each array, the mRNAs belonging to the two samples are labelled with different fluorescent dyes; e.g. all test samples with red fluorescent dye and the common reference with green. Following hybridisation, the two colour channels of each array are scanned and the detected intensities are taken as direct measurements of mRNA levels following background correction (removal of estimated background signal due to non-specific hybridisation). These background-corrected intensity levels often have dynamic ranges >65,000 pixels where the majority of signal is <5,000 pixels, resulting in non-normal distributions (Figure 1-2). To bring the intensity values closer to normal distributions, and so allow the application of the many test statistics that assume normality, these red and green values are generally log2-transformed, reducing the intensity ranges to ~16 log2-units (Figure 1-2). Thus, expression microarray data for an eQTL mapping study consist of an N × n matrix of R-values (log2(red)) and an N × n matrix of G-values (log2(green)), where N is the number of genes on the arrays and n is the number of individuals.


Figure 1-2 Density distribution of background-corrected and log2-transformed hybridization intensities of the red and green channel. The expression data used is obtained from 31 BXD recombinant inbred mice, each measured using a 65mer oligonucleotide dual-colour microarray platform. The dynamic range of the un-logged data is ~65,000 with the majority of the data having intensity levels < 2,000. Following log2-transformation, the distributions are much more normal, with dynamic range of ~5 to 16.

As these are indirect experimental designs, test samples (R-values) are standardised to the common reference (G-values) prior to comparisons between arrays. Values from the two channels are combined through the A- and M-values, converting the R and G matrices to A and M matrices:

A-value = (R + G)/2
M-value = R - G = log2(red/green)

An A-value, for each gene on an array, is the average fluorescent signal between the test and reference samples. It provides an estimate of the average expression level of the corresponding gene on the array. An M-value is the relative signal between the test and reference samples. Because a common reference is used in all n arrays, an M-value provides a standardised measure of a gene’s expression value, thus allowing M-values of the same gene to be compared across arrays. M-values are therefore used as surrogates for expression levels in eQTL mapping analyses. The term “expression trait” is used to refer to the set of M-values of a given gene across the n individuals.
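As a minimal sketch of the A- and M-value definitions above, the following Python function converts background-corrected red and green intensities for one array into A- and M-values. The function name `ma_values` and the example intensities are illustrative assumptions and are not taken from the thesis data.

```python
import math

def ma_values(red, green):
    """Return (A, M) lists from background-corrected red/green intensities.

    Hypothetical helper: `red` and `green` are equal-length sequences of
    positive intensities for the genes on a single array.
    """
    if len(red) != len(green):
        raise ValueError("red and green must have the same length")
    R = [math.log2(r) for r in red]            # R-values: log2(red)
    G = [math.log2(g) for g in green]          # G-values: log2(green)
    A = [(r + g) / 2 for r, g in zip(R, G)]    # average log2 signal
    M = [r - g for r, g in zip(R, G)]          # log2(red/green)
    return A, M

# Made-up intensities for three genes on one array
A, M = ma_values([1024, 256, 4096], [256, 256, 1024])
```

In this toy input, an M-value of 2 corresponds to a four-fold higher signal in the test sample than in the reference, and an M-value of 0 to no difference.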


1.4.1. Normalisation of expression data

Data extracted from microarray experiments must be appropriately normalised prior to their use in eQTL mapping analyses. Normalisation is necessary to adjust for a wide range of variations inherent to microarray experiments, including differences in array quality, differences in initial RNA quantities, differences in labelling efficiency, differences between fluorescent dye characteristics, and differences between microarray hybridisation such as the uniformity at which mRNA samples are hybridised onto different arrays. All, or at least a majority, of these variations must be eliminated if real signals are to be extracted with accuracy and precision.

There is a large collection of normalisation methods for microarrays (e.g. BOLSTAD et al. 2003; QUACKENBUSH 2002; SMYTH and SPEED 2003; YANG et al. 2002), and the choice of method is dependent on the distributional attributes of the data, the microarray platform, and the experimental design. In general, for eQTL mapping studies, normalisations are performed within each array to remove systematic (technical) variations, and between arrays to allow unbiased comparisons between arrays (individuals). Again focusing upon a dual-colour microarray platform, two normalisation methods that have been applied to the expression data generated from our laboratory, loess print-tip and quantile-normalisation, are reviewed.

Loess, or Lowess (locally weighted linear regression), print-tip normalisation (YANG et al. 2002) corrects M-values for both intensity- and spatial-based variations. The “loess” part of this method corrects for the intensity-dependent biases: increased variability of M-values with the decrease in A-values. Loess normalisation essentially performs local regressions on subsets of M-values corresponding to continuous sliding windows of A-values. Residuals from these regressions are taken as the normalised M-values.


The “print-tip” part of a loess print-tip normalisation corrects for spatial- (or print-tip group) dependent biases. Spotted arrays are synthesised in blocks where the spots in each block are spotted by the same spotting (or printing) pin, and so a print-tip group refers to the block of DNA molecules spotted by the same pin (a grid location on the array). A print-tip loess normalisation is basically a series of loess normalisations performed separately on each print-tip group.
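The idea of print-tip loess normalisation can be illustrated with the sketch below. It is a simplified stand-in, not the implementation of YANG et al. (2002): it uses a single pass of tricube-weighted local linear regression without robustness iterations, and the function names, the `span` parameter, and the toy data are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def local_linear_fit(x, y, span=0.4):
    """Fitted values from a tricube-weighted local linear regression of y on x."""
    n = len(x)
    k = min(n, max(3, math.ceil(span * n)))      # points per sliding window
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12                # window half-width
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def print_tip_loess(A, M, tip, span=0.4):
    """Normalised M-values: residuals of a separate local fit per print-tip group."""
    groups = defaultdict(list)
    for i, t in enumerate(tip):
        groups[t].append(i)
    normalised = [0.0] * len(M)
    for idx in groups.values():
        fit = local_linear_fit([A[i] for i in idx], [M[i] for i in idx], span)
        for i, f in zip(idx, fit):
            normalised[i] = M[i] - f             # residual = normalised M-value
    return normalised

# Toy data: two print-tip groups with different intensity-dependent biases
A_vals = [6, 7, 8, 9, 10, 11, 12] * 2
tips = [1] * 7 + [2] * 7
M_vals = [0.5 * a - 3 for a in A_vals[:7]] + [-0.2 * a + 1 for a in A_vals[7:]]
M_norm = print_tip_loess(A_vals, M_vals, tips, span=0.7)
```

Because each group is fitted separately, the two different trends in the toy data are both removed, leaving normalised M-values close to zero.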

Quantile-normalisation (BOLSTAD et al. 2003; SMYTH and SPEED 2003) corrects for differences between arrays. In eQTL mapping studies, we are interested in the differences between individuals, but as mRNA levels for each individual are measured with an independent array, any variations between the arrays may resemble variations between individuals. Quantile-normalisation coerces the distribution of M-values (or A-values) across a set of arrays to the same distribution, by first ranking the values of each array, then taking the average at each rank across all arrays, followed by replacing the original value at each rank with the averaged value of the corresponding rank.
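The three steps just described (rank, average at each rank, replace) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `quantile_normalise` is hypothetical, all arrays are assumed to hold the same number of values with none missing, and ties are broken by position rather than averaged.

```python
def quantile_normalise(arrays):
    """Quantile-normalise a list of arrays (one list of values per array)."""
    n_values = len(arrays[0])
    if any(len(a) != n_values for a in arrays):
        raise ValueError("all arrays must have the same number of values")
    # Step 1: rank the values within each array (indices in ascending order)
    orders = [sorted(range(n_values), key=arr.__getitem__) for arr in arrays]
    # Step 2: average the values found at each rank across all arrays
    rank_means = [
        sum(arr[order[r]] for arr, order in zip(arrays, orders)) / len(arrays)
        for r in range(n_values)
    ]
    # Step 3: replace each original value with the mean of its rank
    normalised = []
    for order in orders:
        out = [0.0] * n_values
        for r, i in enumerate(order):
            out[i] = rank_means[r]
        normalised.append(out)
    return normalised

# Two made-up arrays of three values each
qn = quantile_normalise([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

After normalisation both arrays contain the same set of values, so any remaining differences for a given gene reflect only its rank within each array.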

Performing at least these two normalisation methods prior to application of subsequent tests and analyses is recommended to eliminate the majority of systematic variations. In the event of replication (multiple microarrays per individual), an average is generally taken following normalisation such that each expression pattern consists of only one expression value per individual.


1.5. Marker genotype patterns

A genotype pattern is a categorical string of genotypes at a specific genomic region, or genetic marker, across the same panel of segregating individuals from which the expression traits are measured. Hundreds to thousands of genetic markers spaced along the genome are used in eQTL mappings: a statistically significant correlation between an expression trait and a genetic marker suggests a genetic variation within the region of the marker is potentially influencing the gene expression levels.

For genetically immortalized cell lines (e.g. CEPH lymphoblastoid cell lines) and homozygous inbred laboratory crosses (e.g. RI lines), markers need only be typed once and so a large amount of genotypic data is readily available for these samples (e.g. JIROUT et al. 2003; KONG et al. 2004; PRAVENEC et al. 1999; TAYLOR et al. 1999; THE INTERNATIONAL HAPMAP 2005; WILLIAMS et al. 2001). For less-well typed populations or to create a finer genetic map, high throughput SNP (single nucleotide polymorphism) typing has become available in recent years for easy and rapid typing of thousands of genetic SNP-markers (e.g. MATSUZAKI et al. 2004; STEEMERS et al. 2006).

For simplicity, many eQTL studies use segregating populations that are bi-allelic. For some crosses, such as RILs and backcrosses, there exist only two possible genotypes at any given locus, and so each genotype pattern is essentially a dichotomous (binary) string. For example (Figure 1-3), in the BXD RI mouse panel, a strain is either homozygous BB (B for short) or homozygous DD (D for short) at each position on the genome. So, given a panel of, say, six BXD strains, if the first, second and fourth strains (BXD1, BXD2, BXD4) have genotype B at marker M1 and the other three BXDs (BXD3, BXD5, BXD6) have genotype D, then the genotype pattern at M1 is BBDBDD.


In contrast, for intercrosses and outbred pedigrees, a total of three genotypes are possible at each genomic position, corresponding to the two homozygous genotypes and the heterozygous genotype, thus resulting in trichotomous genotype patterns. For instance, suppose an F2 intercross has genotypes BB, BD, and DD, represented by B, H, and D; then the genotype pattern at any marker locus can be any combination of these three characters, e.g. BHBDDBHBDH across ten F2 individuals.

      BXD1  BXD2  BXD3  BXD4  BXD5  BXD6
M1    B     B     D     B     D     D
M2    B     B     B     B     D     D
M3    B     D     D     D     B     B

Figure 1-3 Genotype patterns. This example considers one chromatid in 6 imaginary BXD RI lines. The genetic background of each RI strain consists of an assortment of the genomes of the two founder strains, C57BL/6J (genome represented by purple) and DBA/2J (green). Chaining together the six genotypes from the six BXDs at marker M1 produces the genotype pattern BBDBDD. Similarly, M2 has genotype pattern of BBBBDD and M3 has genotype pattern of BDDDBB.

The implication of categorical genotype patterns is that they can become associated. Let us consider the three marker loci in Figure 1-3. The patterns at M1 and M2 are almost identical, differing at only the third position (BXD3), but the patterns at M1 and M3 are much more dissimilar, with differing genotypes in BXDs 2, 4, 5 and 6. There are several reasons for genotype pattern similarity. In Figure 1-3, M1 and M2 are associated because of linkage disequilibrium between neighbouring genomic regions (Section 5.3). Linkage disequilibrium between non-neighbouring loci can also occur, and can likewise result in genotype pattern association (Section 5.4). Finally, genotype patterns can become associated by chance (Section 5.5) because there are n (the number of individuals) possible single changes that convert any marker genotype pattern into a different pattern, implying each pattern has n 1-degree relatives (genotype patterns differing at only one position). Similarly, each genotype pattern has n(n-1)/2 2-degree relatives, n(n-1)(n-2)/(2×3) 3-degree relatives, and so on.
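The pattern comparisons and relative counts described above can be checked with a short sketch. The function names are hypothetical, and the count of k-degree relatives assumes dichotomous (two-genotype) patterns such as those of RI lines, for which it equals the binomial coefficient "n choose k".

```python
from math import comb

def pattern_distance(p, q):
    """Number of individuals at which two genotype patterns differ."""
    if len(p) != len(q):
        raise ValueError("patterns must cover the same individuals")
    return sum(a != b for a, b in zip(p, q))

def n_relatives(n, k):
    """Number of k-degree relatives of a dichotomous pattern over n individuals."""
    return comb(n, k)

# Genotype patterns of the three markers in Figure 1-3
M1, M2, M3 = "BBDBDD", "BBBBDD", "BDDDBB"
d12 = pattern_distance(M1, M2)   # M1 and M2 differ only at BXD3
d13 = pattern_distance(M1, M3)   # M1 and M3 differ at BXDs 2, 4, 5 and 6
relatives = [n_relatives(6, k) for k in (1, 2, 3)]
```

For six individuals, `n_relatives` evaluates n, n(n-1)/2 and n(n-1)(n-2)/(2×3) for k = 1, 2 and 3.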

Association between genotype patterns can consequently confound eQTL mapping analyses because linkage is assessed on the strength of correlation between the levels of an expression trait and genotypes at a marker locus. Given a pair of markers with similar genotype patterns, significant correlation (i.e. linkage) between an expression trait and the first marker will imply similar correlation significance between the trait and the second marker. The influence of genotype pattern association on eQTL mapping studies and multiple-linkage effects is a major focus of this research thesis and is mainly discussed in Section 3.3, Chapter 4, and Chapter 5.


1.6. Single-locus mapping

In a single-locus mapping approach (also known as a genome scan), the null hypothesis of no linkage is tested for each gene expression trait at all putative eQTLs positioned along the genome. A putative eQTL may be a marker locus or any locus on the genome whose genotypic values can be estimated from the set of known marker genotypes.

A wide range of statistical methods have been used in single-locus eQTL mapping studies, including Welch’s t-test (COTSAPAS 2005; COTSAPAS et al. 2003), simple regression (BYSTRYKH et al. 2005; CHESLER et al. 2005; COTSAPAS 2005; HUBNER et al. 2005), Wilcoxon-Mann-Whitney test (BREM et al. 2002; YVERT et al. 2003), interval mapping (SCHADT et al. 2003), composite interval mapping (KIRST et al. 2004), Haseman-Elston regression (MORLEY et al. 2004), variance component analysis (MONKS et al. 2004; SCHADT et al. 2003), and mixture model analysis (MONKS et al. 2004; SCHADT et al. 2003).

In this thesis project ten eQTL mapping expression datasets are analysed: three mouse BXD RIL studies (BYSTRYKH et al. 2005; CHESLER et al. 2005; COTSAPAS 2005; HUBNER et al. 2005), a rat BXH/HXB RIL study (HUBNER et al. 2005), a mouse F2 study (SCHADT et al. 2003), and a haploid yeast study (BREM et al. 2002; YVERT et al. 2003). Although various statistical methods were used by the corresponding investigators in each study, in this thesis project all have been re-analysed using a simple regression method implemented in QTLReaper, developed by MANLY and WANG (2005; http://sourceforge.net/projects/qtlreaper/).

Simple linear regression is a statistical tool for determining the relationship between a continuous response variable (expression values) and a predictor variable (genotypic values). For each gene and marker pair, expression values are regressed onto the genotypic values according to the linear model:

$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i $$

Equation 1

yi = expression value for the i-th individual, where i = 1..n
β0 = mean expression value across the n individuals
β1 = marker effect (i.e. the slope of regression line)
xi = predictor variable for the marker genotypes (see below)
εi = residual error; distribution is assumed normal

In Equation 1, the unknown parameters β0 and β1 (and hence the residuals εi) are estimated using the least-squares (LS) method, which finds the intercept and slope that return the minimum residual sum of squares between the fitted and observed expression values, yi=1..n. The slope of the regression line is then used to compare the null hypothesis, H0: β1 = 0, against the alternate hypothesis, H1: β1 ≠ 0, using the likelihood ratio (LR) statistic.

In the presence of a dense marker map, xi acts as an indicator variable for the genotypes at a given locus. For example, in the BXD RI cross, the two marker genotypes BB and DD may be coded as follows:

$$x_i = \begin{cases} 0 & \text{if BB} \\ 1 & \text{if DD} \end{cases}$$

And for a BXD F2 cross:

$$x_i = \begin{cases} 0 & \text{if BB} \\ 1 & \text{if BD} \\ 2 & \text{if DD} \end{cases}$$
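As an illustration only, the single-marker regression of Equation 1 can be sketched in Python using the 0/1 coding for an RI cross. The function name and the toy data are hypothetical, and the LR is computed here under the normal-error model as n·ln(RSS0/RSS1); this is an assumption about the form of the statistic and not a description of the QTL Reaper implementation.

```python
import math

def single_marker_lr(y, x):
    """Least-squares fit of y_i = b0 + b1*x_i + e_i; returns (b0, b1, LR)."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # slope (marker effect)
    b0 = mean_y - b1 * mean_x      # intercept
    # Residual sums of squares under H0 (b1 = 0) and H1 (b1 != 0)
    rss0 = sum((yi - mean_y) ** 2 for yi in y)
    rss1 = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    lr = n * math.log(rss0 / rss1)  # equals 2*(lnL1 - lnL0) for normal errors
    return b0, b1, lr

# Toy data (hypothetical): genotypes coded 0 = BB, 1 = DD
genotypes = [0, 0, 0, 0, 1, 1, 1, 1]
expression = [0.1, -0.2, 0.0, 0.1, 1.1, 0.9, 1.2, 0.8]
b0, b1, lr = single_marker_lr(expression, genotypes)
```

With this coding the fitted slope is the difference between the DD and BB group means, so a large LR corresponds to well-separated genotype groups.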

In the presence of a dense marker map a single-marker or single-point analysis is employed such that only marker loci are considered as putative locations for eQTLs because testing between markers will not yield additional information (CARLBORG et al. 2005; COFFMAN et al. 2003).


However, given a sparse marker map, where there could be several recombination breakpoints between adjacent markers, a multi-point analysis, such as interval mapping (LANDER and BOTSTEIN 1989), may be adopted. Instead of actual marker loci, hypothetical eQTLs spaced at regular intervals along the genome are assessed for linkage. Expression values are regressed onto the genotypic probabilities, xi, of the putative eQTL, which are determined based on its position relative to known markers and their genotypic information.

Single-point analyses have been adopted for the genome scans performed in this thesis project because the marker maps are relatively dense and genotype data is relatively complete (not much missing data).

1.6.1. Significance test and multiple testing

A LR (likelihood ratio) statistic is a measure of the relative probability that there is association between a gene and marker locus compared to no association:

$$LR = 2(\ln L_1 - \ln L_0)$$

Equation 2

where lnL0 and lnL1 are the log-likelihoods corresponding to the null (H0) and alternate (H1) hypotheses.

A large LR rejects the null hypothesis, indicating linkage. For a single-hypothesis test, the significance of a LR score is dependent on the acceptable rate of false positives (the type I error rate), α: the acceptable rate at which a true null hypothesis is rejected. In multiple-hypothesis testing, the per-test threshold is adjusted such that the probability of making one or more type I errors amongst all hypotheses (family-wise error rate; FWER) remains below α. In eQTL mapping studies there are two levels of multiple testing: 1) Linkage is tested at multiple marker loci for each expression trait; 2) Multiple genome scans are performed: one for each trait.


Tests at multiple marker loci are often corrected using a genome-wide P-value threshold (PGWS), while correction for testing multiple genes is performed using a false discovery rate (FDR) method.

A P-value is the probability of observing a certain LR under random expectation; it is a measure of LR significance. P-values for LRs may be estimated from a Chi-square (χ2) distribution or from permutation tests.

When sample size (n = the number of individuals) is large, LRs are expected to follow a χ2 distribution. This is useful because the probability of scoring a given LR statistic under random expectation (i.e. the nominal or raw P-value: Praw) can easily be estimated from the χ2 distribution. In an eQTL study of N genes and M markers there is a total of N×M LR scores and Praw values, each corresponding to a gene and marker pair. To adjust for the first level of multiple testing, a simple Bonferroni correction is employed for eQTL mapping analyses in this thesis project. For each gene, a critical value, to control for a genome-wide type I error rate of α, is set at α/M, such that, for each gene-marker pair, H0 is rejected if Praw ≤ α/M. Although simple, Bonferroni correction is the most conservative method: it reduces false positives (type I errors) at the cost of more false negatives (type II errors).
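The Bonferroni rule described above can be sketched as follows. This is a minimal illustration and not the code used in this thesis: the function names and LR values are hypothetical, α = 0.05 is an assumed value, and the χ2 upper-tail probability at one degree of freedom is obtained from the identity P(χ2 ≥ x) = erfc(√(x/2)).

```python
import math

def chi2_sf_1df(lr):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(lr / 2.0))

def bonferroni_linked(lr_scores, alpha=0.05):
    """Indices of markers with P_raw <= alpha / M for a single gene."""
    m = len(lr_scores)
    threshold = alpha / m          # genome-wide alpha spread over M markers
    return [i for i, lr in enumerate(lr_scores)
            if chi2_sf_1df(lr) <= threshold]

# Hypothetical LR scores for one gene at five markers
lr_scores = [0.5, 3.84, 12.0, 1.2, 20.8]
linked = bonferroni_linked(lr_scores, alpha=0.05)
```

Note that the marker with LR = 3.84 has a nominal Praw of about 0.05 and would pass a single test, but it does not pass the corrected threshold of 0.05/5 = 0.01.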

Undesirable increase in type II errors is particularly prominent when sample size, n, is small, as is often the case in eQTL studies. To avoid this, less conservative methods, such as the Bonferroni step-down method (HOLM 1979) and the Benjamini and Hochberg FDR method (BENJAMINI and HOCHBERG 1995), may be used instead of a standard Bonferroni correction. Alternatively, exact P-values may be calculated using permutation testing (CHURCHILL and DOERGE 1994; DOERGE and CHURCHILL 1996), as follows:

For each gene,

1) Calculate the LR at each of the M marker loci: $L_i^0$, for i = 1…M

2) Permute the expression values by randomly re-assigning the values to different individuals

3) Calculate the LR at each of the M marker loci as before and record the maximum LR score, $L'_1$

4) Repeat (2) – (3) x times, where x is the number of permutations, to obtain x maximum LR scores, $L'_k$ for k = 1…x

5) Determine the empirical probability for each $L_i^0$, using $L'_k$ as the empirical LR statistic distribution: $P_i$ for i = 1…M

6) Declare linkage between the gene and the ith marker if $P_i$ ≤ α

7) Repeat (1) – (6) for all genes
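Steps (1) to (5) of the permutation procedure can be sketched for a single gene as below. The function names, the toy genotypes and expression values, and the choice of 500 permutations are all hypothetical, and the LR is again the normal-error regression form n·ln(RSS0/RSS1), which is an assumption rather than the QTL Reaper computation.

```python
import math
import random

def lr_scan(y, marker_genotypes):
    """LR score for one trait at each marker (normal-error regression)."""
    n = len(y)
    mean_y = sum(y) / n
    rss0 = sum((v - mean_y) ** 2 for v in y)
    scores = []
    for x in marker_genotypes:
        mean_x = sum(x) / n
        sxx = sum((a - mean_x) ** 2 for a in x)
        sxy = sum((a - mean_x) * (v - mean_y) for a, v in zip(x, y))
        rss1 = rss0 - sxy ** 2 / sxx       # RSS after fitting the slope
        scores.append(n * math.log(rss0 / rss1))
    return scores

def permutation_pvalues(y, marker_genotypes, n_perm=500, seed=0):
    """Empirical P-value per marker from the distribution of permuted maxima."""
    rng = random.Random(seed)
    observed = lr_scan(y, marker_genotypes)                # step 1
    shuffled = list(y)
    max_null = []
    for _ in range(n_perm):                                # step 4
        rng.shuffle(shuffled)                              # step 2
        max_null.append(max(lr_scan(shuffled, marker_genotypes)))  # step 3
    # step 5: fraction of permuted maxima at least as large as each observed LR
    return [sum(m >= obs for m in max_null) / n_perm for obs in observed]

# Toy data (hypothetical): 12 strains, 3 markers coded 0 = BB, 1 = DD
markers = [[0] * 6 + [1] * 6, [0, 1] * 6, [0, 0, 1, 1] * 3]
trait = [0.1, -0.2, 0.0, 0.1, 0.2, -0.1, 1.1, 0.9, 1.2, 0.8, 1.0, 1.3]
pvals = permutation_pvalues(trait, markers)
```

Because the maximum LR across all markers is recorded for each permutation, the resulting P-values are already adjusted for testing at multiple loci within that gene.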

This procedure outlines one method of permutation testing whereby the empirical probability is gene-specific and therefore less conservative compared with the Bonferroni correction whilst retaining a similar genome-wide α (WESTFALL and YOUNG 1993). Empirical P-values obtained from permutation testing generally provide a better estimate of LR significance than nominal P-values, but while permutation testing is conceptually simple, the computational time required is undesirably demanding. This is because P-value precision is dependent on the number of permutations, and the computational load per permutation scales with the number of genes and number of markers.

The second level of multiple testing (linkage tests for multiple expression traits) may similarly be adjusted using the Bonferroni correction or a permutation test as above. For each trait, a genome-wide threshold of α is equivalent to a point-wise threshold of α/M. Similarly, an experiment-wise threshold of α equates to a point-wise threshold of α/(N×M). A Bonferroni correction can also be applied following a permutation test to correct for both types of multiple testing: in step (6) of the permutation test, above, linkage is declared between a gene and marker locus if Pi ≤ α/N. The problem with applying a second multiple-testing correction is that the critical value becomes extremely stringent, often resulting in high levels of type II errors.

Alternatively, and more commonly, the method of false discovery rate (FDR) is used. FDR is the proportion of all rejected null hypotheses that are actually true nulls; i.e.,

$$\mathrm{FDR} = \frac{\text{number of false linkages}}{\text{total number of significant linkages}}$$

It differs from the false positive rate, α, which is the rate at which true negatives are called positives. The idea of FDR is to estimate the number of false linkages at a given genome-wide significance level (e.g. α/M) by estimating the q-value. The q-value of a test is the minimum FDR at which that test would be called significant, and it can be determined by empirically estimating the prior probability of no linkage (STOREY and TIBSHIRANI 2003).
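As a simple illustration of FDR control, the sketch below computes Benjamini and Hochberg adjusted P-values. This is a deliberately simpler stand-in for the q-value method of STOREY and TIBSHIRANI: it corresponds to assuming that the prior probability of no linkage equals one. The function name, the example P-values and the 0.05 cut-off are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical P-values for five linkage tests
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
adj = benjamini_hochberg(pvals)
significant = [i for i, q in enumerate(adj) if q <= 0.05]
```

Estimating the prior probability of no linkage from the data, as in the q-value method, would scale these adjusted values down and so can only increase the number of linkages declared.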


1.7. Multi-locus mapping

Gene expression is a quantitative trait and is likely to be complex; i.e., controlled by multiple loci. Several eQTL mapping studies have noticed transcripts showing linkage to more than one eQTL (BREM et al. 2005; BREM et al. 2002; CHESLER et al. 2005; COTSAPAS 2005; HUBNER et al. 2005; MONKS et al. 2004; SCHADT et al. 2003; YVERT et al. 2003), implying gene expression may be influenced by multiple eQTLs that may act:

Additively: each locus contributes to the gene's expression independently such that the total level of gene expression is the sum of all locus effects; or

Epistatically (or interactively): the effect of one locus is dependent on another such that total gene expression is attributed to the combined effects of two or more loci.

Interestingly, with the exception of BREM et al. (2005), single-locus approaches were used in each of the eQTL studies demonstrating multiple linkages. Single-locus mapping techniques assume the expression of a gene is predominantly controlled by a single eQTL. If a gene is controlled by more than one locus, the use of these techniques may prevent eQTLs from being detected (false negatives) or lead to false detections (false positives) because:

• Linkage disequilibrium between neighbouring marker loci may lead to “ghost” eQTLs (MARTINEZ and CURNOW 1992) that resemble multiple linkage; and

• eQTLs with opposite effects may counteract each other, thus preventing their detection.

These problems may be solved by using a multi-locus mapping approach which models multiple eQTLs simultaneously, and by doing so it:

• Has greater power to detect eQTLs as it appropriately considers the effect of each individual locus as well as their joint effect;

• Better separates linked eQTLs, particularly if multiple eQTLs are located on the same chromosome; and

• Allows the identification of interacting (non-additive) eQTLs.

At present there are only two proposed methods for multi-locus mapping in a genetical genomics setting: a step-wise forward regression approach (STOREY, AKEY, and KRUGLYAK 2005) and a mixture over markers approach (KENDZIORSKI et al. 2006). The former method searches for the strongest associating eQTL in a stepwise manner, for successive eQTLs, for each expression trait. The latter approach extends beyond searching for multiple genetic loci for each expression trait simultaneously to searching for multiple eQTLs: it attempts to reduce the dimensionality of transcriptomic data by using a multi-trait approach.

For this thesis project, the main objective is to assess the validity of multiple eQTLs identified from single-locus eQTL mapping studies. An obvious approach for achieving this objective is to perform a multi-locus mapping analysis and to compare the results to those from single-locus mapping studies. However, having performed an initial multi-locus mapping analysis (Section 4.4) using the step-wise forward regression method (STOREY, AKEY, and KRUGLYAK 2005), we found that a majority of the currently published genetical genomic studies, where sample size is relatively small (<100), lack the statistical power to detect any multi-locus effects. Thus, a two-locus validation method, bqtl.twolocus (Section 4.1.2), was developed to assess multiple linkages identified from single-locus mapping analyses.

To understand why and how bqtl.twolocus was developed, the concept of multi-locus mapping will be described here. As with single-locus mapping, there are two broad classes of multi-locus mappings: single-point and multi-point. Both consider the effects of multiple putative QTLs on a quantitative trait, with the major difference being the location at which the putative QTL is tested. In a single-point analysis, a putative QTL is simply a marker locus at which the genotype is known across the mapping panel (reviewed in BROMAN 2001). In a multi-point analysis, a putative QTL is a locus whose genotypic probability is estimated from the known genotypes of its pair of flanking markers. Examples of multi-point multi-locus mapping approaches include: composite interval mapping (JANSEN 1993; ZENG 1993), multiple interval mapping (KAO et al. 1999; ZENG, KAO, and BASTEN 1999), and, more recently, haplotype-based methods (SCHAID et al. 2002; SCHAID 2004).

Since a single-point single-locus approach was chosen for our expression traits, and to avoid statistical complication, we focus on a single-point multi-locus approach. Let us consider, for simplicity: 1) The presence of a dense marker map such that only marker loci are considered for eQTLs; 2) Only a co-dominant effect for segregating populations with heterozygous genotypes; and 3) A maximum of digenic epistasis (gene expression is influenced by two dependent loci, at most). Then, given M markers, the expression level of a gene in the ith individual can be represented as:

$$y_i = \beta_0 + \sum_{j=1}^{M} \beta_{1j} x_{ij} + \sum_{j \neq k}^{M} \beta_{2jk} \delta_{jk} x_{ij} x_{ik} + \varepsilon_i$$

Equation 3

β0 = average gene expression across all individuals

β1j = additive effect of the jth marker

xij = indicator genotype of the jth marker in the ith individual (as in Equation 1; e.g. xij = 0 if the ith individual has genotype BB at marker j)

β2jk = digenic epistatic effect between the jth and kth markers

δjk = indicator variable for epistasis: value of one if there is interaction between the jth and kth markers and zero otherwise

εi = error estimate that is assumed normal, N(0, σ²)

The fundamental aim of a multi-locus mapping is to identify the set of markers that best explain the variance of an expression trait. These markers may act independently or interact epistatically, thus we are interested in those markers for which β1j ≠ 0 and/or β2jk ≠ 0.

The search for this set of affecting markers for an expression trait is a multi-dimensional search problem. In single-locus mapping, each marker locus is systematically assessed for linkage: one-dimensional search. In multi-locus mapping the search space is exponentially increased as all combinations of different numbers of marker loci are scrutinized for linkage, for each trait. This problem is further magnified in genetical genomics studies as each multi-dimensional search is repeated for tens of thousands of expression traits. Thus, unsurprisingly, as a consequence of the large computational demand and loss of statistical power from performing such a large number of tests, multi-locus mapping is not routinely performed on large-scale expression data.
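To give a sense of scale, the number of marker combinations can be counted directly. The sketch below uses the marker and trait counts reported for the BXD data in Section 2.1.4 (M = 779, N = 23,040) and counts only unordered pairs and triples of markers, which is a simplifying assumption about how the search would be organised.

```python
import math

M = 779      # number of markers
N = 23040    # number of expression traits

one_d = M                        # single-locus scan: one test per marker
two_d = math.comb(M, 2)          # unordered marker pairs: 303,031
three_d = math.comb(M, 3)        # unordered marker triples: 78,485,029
total_pair_tests = N * two_d     # pairwise scan repeated for every trait
```

Even when restricted to pairs, a full scan across all traits involves roughly 7 × 10^9 tests, compared with about 1.8 × 10^7 for the single-locus scan.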

As with single-locus mapping, a LR (likelihood ratio) score is used to assess the relative probability of HA to H0, where

H0 is the null hypothesis that an expression trait is not concurrently influenced by a given set of marker loci; and

HA is the alternate hypothesis that an expression trait is concurrently influenced by a given set of marker loci

H0 is much more difficult to define here than the equivalent null hypothesis in single-locus mapping because in multi-locus mapping there is a multitude of scenarios of H0 for each HA. For example, to test whether a gene is influenced by two loci, L1 and L2, HA is:

gene = L1 + L2 + L1:L2
The gene’s expression is genetically controlled by the additive (+) and interactive (:) effects of both loci.

In this case, the possible H0 are:

gene = 1
The gene is not influenced by either L1 or L2; or

gene = L1 or gene = L2
The gene is only controlled by one of the loci: a single-locus effect; or

gene = L1 + L2 or gene = L1:L2
The gene is influenced by both loci, but their actions are completely independent (additive) or completely dependent (i.e., the effect of the variant at L1 is only present if the corresponding variant at L2 is also present).

Rejecting any of the null cases while HA is truly false will result in a false positive. Thus H0 is important in multi-locus mapping because it provides the assumptions behind the null distribution for assessing the significance of LR scores and so may influence both type I and type II error rates. If LR exceeds a critical value representing a desired level of significance, HA is accepted. This critical value can be determined using a method that models the null distribution:

1) A simulation study: it is computationally manageable but is highly dependent on the model parameters used; or

2) A permutation study: it is non-parametric but can be highly computationally demanding.
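The dependence of the LR score on the choice of H0 can be illustrated with a small least-squares sketch. This is not bqtl.twolocus or the method of STOREY, AKEY, and KRUGLYAK: the helper names and toy data are hypothetical, genotypes are coded 0/1 as for an RI cross, and the LR is taken as n·ln(RSS_null/RSS_alt) under an assumed normal-error model.

```python
import math

def rss(y, columns):
    """Residual sum of squares from regressing y on the given predictor
    columns plus an intercept, solved through the normal equations."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y]
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * v for b, v in zip(beta, X[i]))) ** 2
               for i in range(n))

def lr(y, null_cols, alt_cols):
    """LR = n * ln(RSS_null / RSS_alt) for nested normal-error models."""
    return len(y) * math.log(rss(y, null_cols) / rss(y, alt_cols))

# Toy data (hypothetical): 12 strains, two loci and their interaction term
l1 = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
l2 = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
inter = [a * b for a, b in zip(l1, l2)]
y = [0.1, -0.1, 0.0, 1.0, 1.1, 0.9, 0.5, 0.6, 0.4, 1.9, 1.7, 1.8]

full = [l1, l2, inter]                   # HA: gene = L1 + L2 + L1:L2
lr_vs_none = lr(y, [], full)             # H0: gene = 1
lr_vs_l1 = lr(y, [l1], full)             # H0: gene = L1
lr_vs_additive = lr(y, [l1, l2], full)   # H0: gene = L1 + L2
```

For the same HA, the LR shrinks as H0 absorbs more of the signal, which is why the critical value has to be derived under the specific null model being tested.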

As discussed in the previous chapter, there are two levels of multiple testing in a genetical genomics study: tests of linkage at multiple loci and for multiple traits. The first level of multiple testing can be addressed using a simulation or permutation method to set a significance threshold. For the second level of multiple testing, multiple testing corrections or the concept of false discovery rate (FDR) may be adopted as in single-locus mapping.

STOREY, AKEY, and KRUGLYAK (2005) described a new multi-locus mapping method that incorporates Bayesian techniques with FDR to overcome the problem of multiple null hypotheses as well as the problem of performing multiple tests for tens of thousands of expression traits. Using this method, BREM et al. (2005) identified 225 transcripts with significant 2-locus effects, of which 65% have significant interaction between the two loci. Interestingly, using a single-locus mapping approach on the same dataset, they identified 547 genes linked to two marker loci, but significant interaction between the locus-pairs was found in only 13% of these.

In many genetical genomics studies where a single-locus mapping analysis was performed, many transcripts have demonstrated linkage at more than one eQTL (BREM et al. 2005; BREM et al. 2002; CHESLER et al. 2005; COTSAPAS et al. 2006; HUBNER et al. 2005; MONKS et al. 2004; SCHADT et al. 2003; YVERT et al. 2003). These observations are interesting for two reasons. Firstly, if these multiple linkages are real then it implies that, in these studies, there is sufficient statistical power to detect multiple eQTLs. Unfortunately, this hypothesis is unlikely, as BREM et al. (2005) have demonstrated that ~67% of secondary loci failed to be detected when a single-locus analysis was used. Secondly, if these observations are false positives, it is important to determine why multiple loci are linked to those gene expression traits as this may provide insight into whether any of the eQTLs and which of the eQTLs are true positives or true negatives.

Unfortunately, performing a full-scale multi-locus mapping analysis on each eQTL dataset is extremely computationally intensive, and is estimated to take approximately one month. For the purpose of this thesis project, we are interested in determining whether multiple linkages identified from single-locus mapping analyses are real or not. Thus, based on the idea of a traditional multi-locus mapping analysis, as described here, a two-locus validation method, bqtl.twolocus, is developed for this purpose (Chapter 4; Section 2.6). This method recognises, but is not concerned with, the multi-locus effects that are likely to have been missed from single-locus mapping analyses; rather, it focuses on determining the potential spurious multiple linkages identified from one-dimensional genome-scans.


Additional note: Since the completion of this thesis, three other multi-locus mapping approaches for expression traits at a transcriptomic level have emerged: a Bayesian clustering method that simultaneously searches for transcripts that are jointly influenced by multiple markers and markers that influence multiple traits (JIA and XU 2007); the extension of the renowned multiple interval mapping for gene expression QTL analysis (ZOU and ZENG unpublished); and a pattern discovery-based approach for assessing pattern-trait association (LI, FLORATOS, WANG, and CALIFANO unpublished).


1.8. Aim and hypotheses

The focus of this thesis is to elucidate the influence of genetic variation on gene expression, as well as the extent, validity, and cause of multiple linkages identified from single-locus eQTL mapping analyses.

1.8.1. Influence of genetic variation on mRNA levels (Chapter 3)

Hypothesis: mRNA abundance is influenced by genetic variations

Aim: To determine the extent of gene expression variation that is controlled in cis and the extent that is controlled in trans

Using mRNA samples from brain, kidney, and liver tissues of 31 BXD RI mice, single-locus mapping analyses are performed to identify the extent and identity of transcripts whose expression variations can be linked to at least one eQTL. By examining the relative location of the transcripts and linked eQTLs, the extent of cis- and trans- acting variations are deduced.

1.8.2. Multi-locus influence of gene expression variation (Chapter 4)

Hypothesis: Multiple eQTLs shown as linked to the same transcript using single-locus mapping analyses are real and have concurrent influence on the transcript’s expression levels

Aim: To find evidence for additive and/or epistatic association amongst each set of multiply linked eQTLs identified from single-locus mapping analyses

For each transcript linked to two or more eQTLs, two-locus effects, between any pairs of eQTLs within the multiply linked eQTL set, are validated using bqtl.twolocus. This analysis is extended to seven other datasets (BYSTRYKH et al. 2005; CHESLER et al. 2005; HUBNER et al. 2005; SCHADT et al. 2003; YVERT et al. 2003).

1.8.3. Influence of genotype pattern association in multiple linkage (Chapter 5)

Hypothesis: Genotype pattern association is a major source of spurious multiple linkages

Aim: To determine the sources of genotype pattern associations that will result in spurious multiple linkage

Using our brain, kidney, and liver BXD data and seven other publicly available eQTL datasets, the types of relationship between genotype patterns of multiply linked eQTLs identified from 1-D genome scans are determined and used to study the effect of genotype pattern-association on eQTL mapping and to help prioritise the sets of multiple linkages.


2. Materials and Methods

2.1. Brain, kidney, and liver BXD eQTL-mapping studies

2.1.1. BXD RI mice

For our eQTL-mapping studies, a panel of 31 BXD RI (recombinant inbred) mice is used. Three eight-week-old male mice from each of the strains BXD-1, 2, 5, 6, 8, 9, 11, 12, 13, 14, 15, 16, 18, 19, 20, 21, 22, 23, 24, 27, 28, 29, 31, 32, 33, 34, 36, 38, 39, 40, 42 are purchased from the Jackson Laboratory. Because these mice are age- and sex-matched and also kept under identical living conditions, any variations between strains are assumed to be purely genetic in origin. At the time of purchase, these mice are, on average, bred to more than 117 generations (>F117), implying absolute homozygosity at every locus, and so genetics are not confounded by heterozygosity.

♣ RNAs are extracted from whole brain, kidney, and liver of these mice. Equal amounts of RNAs from the three mice of a given strain are pooled, resulting in a total of 93 RNA samples, corresponding to three tissues per strain. Pooling between individual mice of the same strain dilutes potential non-genetic individual differences. RNAs from the corresponding tissues are also obtained and pooled from ten C57BL/6J mice, which is one of the founder strains of the BXD panel and is used in our microarray experiments as the common reference.

2.1.2. Marker genotype data

A major advantage of using the BXD RI mice is that the founder strains (C57BL/6J and DBA/2J) are very well studied and their genomes have been completely sequenced. In addition, a dense set of genotyped markers is readily available for these mice from the Mouse Genome Informatics (http://www.informatics.jax.org/) and the GeneNetwork (a.k.a. WebQTL; http://www.genenetwork.org/) databases.

♣ Chris Cotsapas carried out all bench-work, including animal and sample handling

We downloaded genotype data of 779 markers for our 31 BXD mice from the GeneNetwork database prior to July 2005.

2.1.3. Expression microarray data

We used two-colour long-oligonucleotide spotted microarrays printed with the Compugen 22k mouse library from the Clive and Vera Ramaciotti Centre, University of New South Wales, Sydney, Australia. All RNA samples corresponding to the BXD strains have been labelled with Cy5 (red) dye (Invitrogen, Sydney, Australia) and co-hybridised with the Cy3 X (green) labelled C57BL/6J common reference.

There are a total of 23,040 probes on each array, containing 21,765 mouse oligonucleotides, 384 blank spots (spots not printed with any probes), 475 empty spots (spots printed with spotting buffer), 232 GAPDH transcripts (NM_008084) as a house-keeping positive control, and 184 Lucidea Universal ScoreCard controls (Amersham Biosciences).

Following hybridisation, arrays were scanned using an ArrayWorx scanner (Applied Sciences) and images extracted¥ using Spot v.2 (CSIRO, Australia).

For each array, red and green intensity measures are background-corrected: median morphological opening (background) values are subtracted from median foreground measures. Intensity levels are then log2-transformed: R = log2(red) and G = log2(green); followed by calculations of M and A values: A = ½(R + G) and M = R − G.

X Chris Cotsapas performed all microarray experiments, including hybridization and scanning

¥ Mark Cowley performed all image extractions.
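For reference, the M and A values can be computed from a pair of background-corrected intensities as in the short sketch below, where R = log2(red) and G = log2(green), so that M = R − G and A = ½(R + G). The function name and the example intensities are hypothetical.

```python
import math

def ma_values(red, green):
    """Return (M, A) for background-corrected red and green intensities."""
    r = math.log2(red)
    g = math.log2(green)
    m = r - g              # log2 ratio of red to green
    a = 0.5 * (r + g)      # average log2 intensity
    return m, a

# Hypothetical spot: red is four times brighter than green
m, a = ma_values(red=2048.0, green=512.0)
```

An M-value of 2 here corresponds to a four-fold higher signal in the Cy5-labelled BXD sample than in the Cy3-labelled reference.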

For the 23,040 M-values on each array, loess-print-tip normalisation is applied to remove within-array biases. For the 31 arrays per tissue, quantile-normalisation is performed on both M- and A-values to remove potential between-array variations. See Section 1.4.1 for a description of microarray normalisation.
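The between-array step can be illustrated with a minimal quantile-normalisation sketch. It is not the implementation used for the thesis data: the function name is hypothetical, ties are broken by position, and the reference distribution is assumed to be the mean of the sorted values across arrays.

```python
def quantile_normalise(arrays):
    """Quantile-normalise a list of equal-length arrays (one list per array)."""
    n_probes = len(arrays[0])
    sorted_cols = [sorted(col) for col in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays
    reference = [sum(col[k] for col in sorted_cols) / len(arrays)
                 for k in range(n_probes)]
    normalised = []
    for col in arrays:
        order = sorted(range(n_probes), key=lambda i: col[i])
        out = [0.0] * n_probes
        for rank, i in enumerate(order):
            out[i] = reference[rank]   # replace each value by its rank's reference
        normalised.append(out)
    return normalised

# Three hypothetical arrays of three probes each
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0], [3.0, 4.0, 8.0]]
norm = quantile_normalise(arrays)
```

After normalisation every array contains the same set of values, with only their positions differing between arrays.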

2.1.4. eQTL-mapping

Single-locus mapping analyses presented in this manuscript are performed using regression analysis implemented via the software QTL Reaper, release 1.1.1, developed by KF Manly and J Wang, (http://sourceforge.net/projects/qtlreaper/). This software, written with the programming languages C and Python, assesses linkage at a specific marker locus for a given set of expression traits. A wrapper, scripted in BASH and Python by Mark Cowley, is used to assess for linkage systematically at all marker loci via QTL Reaper.

QTL Reaper is used only to perform regression analyses and to calculate “additive” and LR scores; its permutation function is not used, that is, the number of permutations was set to zero. Instead, Praw-values have been calculated directly from a χ2-distribution with one degree of freedom, using the LR scores from QTL Reaper, via the R statistical software (R DEVELOPMENT CORE TEAM 2006). We have two reasons for not using the QTL Reaper built-in permutation function: first, the permutation is marker-specific but not trait-specific; and secondly, it would have been too time-consuming to perform permutations at all marker loci.

Performing linkage for all expression traits at all marker loci results in three N×M matrices of LR scores, Praw, and “additive” scores (see below), where N = number of expression traits = 23,040 and M = number of markers = 779. Three P-value thresholds were explored for declaring linkage significance: Praw ≤ PBON; Praw ≤ PFULL; and Praw ≤ PRED (Section 3.2.1).

The “additive” scores also returned by QTL Reaper are the regression slopes from the regression analyses. These values are proportional to the trait-variance (Section 3.2.6 and Figure 3-6):

trait-variance = 2 × “additive”

A positive trait-variance implies up-regulation of gene expression in the strains with a D-genotype at the marker locus of interest compared with strains with a B-genotype, and vice versa for negative trait-variances. In Section 3.2.6 trait-variance is represented as fold-changes, which is two raised to the power of the trait-variance:

fold-change = 2^(trait-variance)

This is because M-values, which are in log2-units, are used in the regression analyses.
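The conversion from an “additive” score to a fold-change, as defined above, amounts to the following. The function name and the example slopes are hypothetical.

```python
def fold_change(additive):
    """Fold-change implied by a regression slope fitted to log2 M-values."""
    trait_variance = 2.0 * additive   # trait-variance = 2 x "additive"
    return 2.0 ** trait_variance      # back-transform from log2 units

fc_up = fold_change(0.5)      # trait-variance of 1: higher in D-genotype strains
fc_down = fold_change(-0.5)   # trait-variance of -1: lower in D-genotype strains
```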

Note that other single-locus mapping statistical methodologies and software, including Student’s two-sample t-test implemented in the R statistical software (R DEVELOPMENT CORE TEAM 2006) and the R/qtl package (BROMAN et al. 2003), have been considered. However, because this code and software have not been optimised to run batch analyses, the computational time required was too great.

2.1.5. Defining “expressed” genes

An expression trait is defined as “expressed” in a particular tissue if at least half of the expression values across a panel of individuals (i.e. >15 out of 31 BXD strains) are greater than the 90th percentile of all the negative controls (NC) on the 31 arrays (Section 3.2.5). That is, a gene with A-values across the 31 BXD strains (A1, A2, …, An) is “expressed” if

$$\sum_{i=1}^{n} I(A_i > A_{NC}) > 15$$

where I(·) equals one when the condition holds and zero otherwise, and ANC = 90th percentile of the distribution of all 875 NCs from 31 arrays. The 875 NCs consist of 384 blank spots, 475 empty spots, and 16 negative ScoreCard controls (See Section 2.1.3 and Section 3.2.4).

The 90th rather than the 100th percentile is chosen as a lower limit of expression to avoid under-estimating the number of expressed genes due to outlier measurements in these spots.

The reason for defining a gene to be expressed when at least half rather than all of the strains are expressed above ANC is to catch any null genotypes. If one of the two genotypes (B or D) suppresses the expression of a gene, then the expression values in those strains with the repressor-genotype are expected to be low, but expression values from the strains with the alternate genotype should not be. Thus, a lack of expression in a fraction of the strains does not necessarily imply the gene is endogenously unexpressed. However, setting this fraction too low may retain too many low signals, thus introducing too much noise into the linkage tests. The choice of “at least half” is a good compromise because an even partition of expression values into the two genotype-groups is expected to have optimal statistical power compared with an uneven genotype-group sample size.
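The “expressed” criterion can be sketched as below. The function names and the toy values are hypothetical, and the percentile is computed by linear interpolation between order statistics, which is an assumption since the percentile method is not specified here.

```python
def percentile(values, q):
    """q-th percentile (0-100) by linear interpolation between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def is_expressed(a_values, a_nc, min_count=15):
    """True if more than min_count A-values exceed the negative-control threshold."""
    return sum(a > a_nc for a in a_values) > min_count

# Hypothetical negative-control A-values and two genes across 31 strains
negative_controls = [6.0 + 0.01 * k for k in range(101)]
a_nc = percentile(negative_controls, 90)

gene_a = [8.0] * 16 + [5.0] * 15   # 16 strains above threshold
gene_b = [8.0] * 15 + [5.0] * 16   # 15 strains above threshold
expressed_a = is_expressed(gene_a, a_nc)
expressed_b = is_expressed(gene_b, a_nc)
```

Here gene_a passes because 16 of its 31 values exceed the threshold, whereas gene_b, with exactly 15, does not.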


2.2. Meta-analysis

A summary of the seven other eQTL mapping studies is provided in Table 4-7.

2.2.1. Experimental crosses

Table 2-1 Experimental crosses used in seven currently available eQTL-mapping studies.

Dataset   Type              Founders               Number of individuals   Tissue
ChesB     BXD RI mice       C57BL/6J, DBA/2J       32                      Forebrain
BysHSC    BXD RI mice       C57BL/6J, DBA/2J       22                      Hematopoietic stem cell
HubK      BXH/HXB RI rat    BN.Lx/Cub, SHR/Ola     30                      Kidney
HubFC     BXH/HXB RI rat    BN.Lx/Cub, SHR/Ola     30                      Fat cell
YveY3     Haploid yeast     BY, RM                 86                      Whole yeast
YveY5     Haploid yeast     BY, RM                 86                      Whole yeast
SchL      F2 mice           C57BL/6J, DBA/2J       111                     Liver

The 22 BXD strains used in BysHSC (BYSTRYKH et al. 2005; NOTE: 30 strains were surveyed by authors but data for only 22 strains are available from GEO) are all included in the 31 BXD strains in our three-tissue study (CotB/K/L) and also included in the 32 BXD strains in ChesB (CHESLER et al. 2005). All 31 BXDs used in CotB/K/L are also in ChesB; the extra strain in ChesB but not CotB/K/L is BXD-29.

2.2.2. Marker genotype data

ChesB and BysHSC: identical 779 markers as in CotB/K/L (Section 2.1.2) downloaded from GeneNetwork (http://www.genenetwork.org/) prior to July 2005

HubK/FC: 558 markers downloaded from GeneNetwork (http://www.genenetwork.org/) prior to July 2005

YveY3/Y5: 526 markers and genotypes obtained directly from corresponding authors†

SchL: 134 markers and genotypes obtained directly from corresponding authors‡

Refer to Figure 4-10 for a representation of the marker maps used in these studies.

2.2.3. Expression data

ChesB: Authors used a single-colour Affymetrix platform. A total of 88 Affymetrix .CEL files were obtained from GeneNetwork. Arrays have been RMA-normalised using the Bioconductor/affy package (BOLSTAD et al. 2003). Resulting expression levels are averages across replicates for each RI strain (generally three replicates per strain).

BysHSC: Authors used a single-colour Affymetrix platform. A total of 44 Affymetrix .CEL files were downloaded from the Gene Expression Omnibus (GEO; ftp://ftp.ncbi.nlm.nih.gov/pub/geo/data/geo/). Arrays have been RMA-normalised and final expression values are averages across duplicates for each strain.

HubK/FC: Authors used the Affymetrix platform. A total of 240 raw expression data files in .XML format were downloaded from ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) with the accession number E-AFMX-7. These files have been transformed to .CEL files and RMA-normalised for each of HubK and HubFC. Resultant expression values are averages across replicates for each strain in each tissue (typically four per strain).

† I would like to thank Rachel B. Brem for providing this data.

‡ I would like to thank Eric E. Schadt for providing this data.


YveY3/Y5: Duplicates of 86 raw expression arrays were downloaded from GEO (ftp://ftp.ncbi.nlm.nih.gov/pub/geo/data/geo/). Print-tip loess normalisation (YANG et al., 2002) has been performed on each array to remove dye-specific systematic biases. All 172 slides (along with 8 arrays of direct parental comparisons) have then been scale-normalised to adjust for inter-slide variations.

SchL: A normalised dataset of 111 arrays was obtained directly from the corresponding author§. Quantile-normalisation has further been applied across the 111 arrays to remove residual inter-array biases.

2.2.4. eQTL-mapping

The methodology, software, and scripts that have been used for our three-tissue BXD data (CotB/K/L) have also been applied to these seven datasets.

§ I would like to thank Eric E. Schadt for providing this data.


2.3. Simulation studies

2.3.1. sim0B/K/L

Expression data are generated based on the CotB/K/L expression datasets. For each simulation, begin with the 23,040-by-31 (row-by-column) quantile-normalised expression matrix (e.g. for sim0B, begin with the brain expression dataset of the 31 BXD strains), and then obtain expression values for each column (i.e. array) of the new simulated expression matrix by sampling, without replacement, from the corresponding column of the original real expression matrix (Figure 2-1). Repeat this five times for each of the three tissues, generating a total of 15 simulated expression datasets.

Figure 2-1 Process of sampling sim0 data. The original expression data (genes g1..gN in rows, strains s1..sn in columns, entries Mij) are randomised within each column to give the new simulated expression data (entries M•j). For each new (simulated) array (i.e. column), expression values are sampled from the corresponding column of the original expression matrix. gi = ith gene for i=1..N; sj = jth strain for j=1..n; Mij = M-value of the ith gene in the jth strain; M•j = M-value of an unknown (randomised) gene in the jth strain.
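The sampling step in Figure 2-1 can be illustrated with a minimal R sketch. This is not the script used for the analyses: the matrix `expr` and the function name `make_sim0` are hypothetical, and a toy 6-by-4 matrix stands in for the 23,040-by-31 data.

```r
# Toy stand-in for a genes-by-arrays matrix of normalised M-values
# (hypothetical data; the real matrix is 23,040-by-31).
set.seed(1)
expr <- matrix(rnorm(6 * 4), nrow = 6,
               dimnames = list(paste0("g", 1:6), paste0("s", 1:4)))

# Shuffle the values within each column independently (sampling without
# replacement). Each array keeps its own set of values, but the assignment
# of values to genes is broken.
make_sim0 <- function(expr) {
  sim <- apply(expr, 2, sample)
  dimnames(sim) <- dimnames(expr)   # restore gene and strain labels
  sim
}

# Five simulated datasets for one tissue
sims <- replicate(5, make_sim0(expr), simplify = FALSE)
```

Because sampling is done per column, any between-array differences in the distribution of M-values are retained in the simulated data.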


2.3.2. simnorm

For each simulated array, 23,040 expression values are sampled from the normal distribution with mean of zero and standard deviation of one. This is repeated 31 times for the 31 arrays of each simulated expression matrix (Figure 2-2). Quantile-normalisation is performed across the 31 new simulated arrays, as is done for real data. This will coerce the 31 arrays to have identical sets of values, though not necessarily at the same positions.

This process is repeated five times to generate five simnorm datasets.

Figure 2-2 Process of sampling expression data from a normal distribution. Expression values for each new array are sampled from N(0,1). Notations are the same as in Figure 2-1.
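As a rough illustration of the simnorm procedure, the sketch below samples arrays from N(0,1) and applies a simple base-R quantile normalisation. The function `quantile_normalise` is a hypothetical stand-in written for this example (the normalisation software used for the real data is not restated here), and 1,000 genes are used instead of 23,040 to keep the example small.

```r
# Simple quantile normalisation (illustrative implementation): every column
# is given the same set of values, namely the row means of the sorted
# columns, placed according to the original within-column ranks.
quantile_normalise <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")
  target <- rowMeans(apply(x, 2, sort))
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(x)
  out
}

set.seed(1)
n_genes  <- 1000   # 23,040 in the thesis
n_arrays <- 31
raw     <- matrix(rnorm(n_genes * n_arrays), nrow = n_genes)
simnorm <- quantile_normalise(raw)

# After normalisation all arrays share one set of values, although a given
# value generally sits at a different row (gene) in each array.
sorted_cols <- apply(simnorm, 2, sort)
```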


2.3.3. sim(0)/(1)/(2)/(3)

Expression values for sim(0) are generated identically to simnorm (Figure 2-2).

For sim(1), sim(2), and sim(3), the process is the same, except that one, two, or three randomly chosen arrays are sampled from a normal distribution with double the standard deviation; i.e. N(0,2) instead of N(0,1). For example (Figure 2-3), for sim(1), expression values for 30 of the 31 arrays are sampled as with sim(0), but for a randomly chosen array, the expression values are sampled from N(0,2). For each simulated expression matrix, quantile-normalisation is applied across the 31 arrays. Each of these simulations is repeated five times with different arrays sampled from N(0,2) (Appendix A.1.2).

Figure 2-3 Process of sampling sim(1). Expression values for n-1 arrays are sampled from N(0,1), indicated by the black density plot. However, expression values for a randomly chosen array, j, are sampled from a normal distribution with a larger standard deviation, N(0,2), indicated by the blue density plot. Notations are as in Figure 2-1.
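The raw sampling step for sim(0) to sim(3) might be sketched in R as follows. The function `simulate_raw` and its arguments are hypothetical names chosen for this example, 5,000 genes are used instead of 23,040, and N(0,2) is read as a standard deviation of two, as stated in the text. Quantile-normalisation across the arrays would then be applied to the returned matrix, as for real data.

```r
# Draw a genes-by-arrays matrix in which k randomly chosen arrays come from
# N(0, sd = 2) and the remaining arrays from N(0, sd = 1).
simulate_raw <- function(n_genes, n_arrays, k) {
  wide <- sample(n_arrays, k)      # indices of the arrays with doubled sd
  sds  <- rep(1, n_arrays)
  sds[wide] <- 2
  x <- sapply(sds, function(s) rnorm(n_genes, mean = 0, sd = s))
  list(x = x, wide = wide)
}

set.seed(2)
sim1 <- simulate_raw(n_genes = 5000, n_arrays = 31, k = 1)   # sim(1)
sim3 <- simulate_raw(n_genes = 5000, n_arrays = 31, k = 3)   # sim(3)

# Per-array standard deviations before normalisation
array_sd <- apply(sim1$x, 2, sd)
```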


2.3.4. permB/K/L

This set of simulation studies is specific to the transcripts with multiple linkages (Table 4-3; Section 4.3). For the 72, 17, and 35 transcripts from the brain, kidney, and liver studies, the expression values of each transcript are permuted across strains (Figure 2-4). For example, for the 1st of 72 transcripts in permB, the 31 values are sampled from the 31 values of the 1st transcript in the real brain data, without replacement. This is repeated for all transcripts and repeated five times for each of permB, permK, and permL.

Figure 2-4 Process of permuting permB/K/L. For each set of transcripts with multiple linkages, expression values per transcript are permuted (randomised within each row) such that the n values for the ith gene, corresponding to the n strains, are reordered in the perm sample. Notations are the same as in Figure 2-1, and Mi• = M-value of the ith gene from a random strain.
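In contrast to sim0, where values are shuffled within arrays, the perm datasets shuffle values within transcripts. A minimal R sketch, with a hypothetical function name `perm_rows` and a toy 4-by-6 matrix in place of the real transcript subsets:

```r
# Toy transcripts-by-strains matrix (hypothetical data).
set.seed(3)
expr <- matrix(rnorm(4 * 6), nrow = 4,
               dimnames = list(paste0("g", 1:4), paste0("s", 1:6)))

# Permute each row independently across strains (without replacement).
# apply() over rows returns the results as columns, hence the transpose.
perm_rows <- function(expr) {
  out <- t(apply(expr, 1, sample))
  dimnames(out) <- dimnames(expr)
  out
}

perm <- perm_rows(expr)
```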


2.4. Genotype pattern similarity

Section 4.1.1 provides a detailed explanation of genotype pattern similarity. In the analyses presented in this manuscript, genotype pattern similarity is measured using Pearson’s correlation coefficient, r.

Several other association measures have also been considered. Initially, the phi-coefficient was considered because genotype patterns from RI lines are dichotomous variables. To extend to other diallelic populations where a total of three genotypes are possible at each locus (two homozygous and one heterozygous), association measures such as Pearson’s correlation coefficient (r), Spearman’s rho (ρ), and Kendall’s tau (τ) were then considered. However, all these measures are found to be identical for measuring genotype pattern similarity because: 1) there are never more than three genotypes; 2) there can never be outliers (genotypes are not continuous measures); 3) when coding genotypes as numerals, intervals between genotypes are constant: e.g. BB=0, BD=0.5, DD=1, or BB=-1, BD=0, DD=1. Pearson’s correlation coefficient was finally chosen for its simplicity and convenience, as it is implemented in most statistical software.

For each marker map with M markers, Pearson’s correlation coefficient is computed for all pair-wise marker genotype patterns, resulting in an MxM matrix of coefficient values (e.g. Figure 4-1). In instances of missing genotype values, the corresponding strains are ignored, such that correlation is between a pair of shorter strings. For example, given n strains and the marker-pair M1 and M2, if all n genotypes are known for M1 but the genotype of the ith strain at M2 is missing, then the ith value of the M1 genotype pattern is removed and correlation is only computed for the pair of n-1 strings.
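The pair-wise handling of missing genotypes described above corresponds to the `use = "pairwise.complete.obs"` option of R's `cor` function. The sketch below uses made-up genotypes for eight strains at three markers (coded B=0, D=1); the object names are illustrative only.

```r
# Hypothetical genotype patterns: rows are strains, columns are markers.
# The genotype of the 8th strain at M2 is missing.
geno <- cbind(M1 = c(0, 0, 1, 1, 0, 1, 0, 1),
              M2 = c(0, 0, 1, 1, 0, 1, 1, NA),
              M3 = c(1, 0, 0, 1, 1, 0, 0, 1))

# M-by-M matrix of Pearson correlations; for each marker pair, strains with
# a missing genotype at either marker are dropped.
C <- cor(geno, use = "pairwise.complete.obs")

# Recoding the genotypes with constant intervals (here B=-1, D=1) leaves
# the coefficients unchanged.
C_recoded <- cor(2 * geno - 1, use = "pairwise.complete.obs")

# For dichotomous genotypes, Spearman's rho equals Pearson's r.
r_pearson  <- cor(geno[, "M1"], geno[, "M3"], method = "pearson")
r_spearman <- cor(geno[, "M1"], geno[, "M3"], method = "spearman")
```

Here `C["M1", "M2"]` is computed from the first seven strains only, whereas `C["M1", "M3"]` uses all eight.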


2.4.1. Simulation study: testing inter-chromosomal genotype pattern similarity

To test whether the extent of physically unlinked allelic association (NSA; Section 5.4) in recombinant inbred (RI) animals is significantly larger than chance expectation, we want to assess the overall strength of inter-chromosomal genotype pattern similarity from a panel of 31 BXD RI lines against a set of simulated panels. To fully simulate an RI panel is not feasible because the vagaries of the inbreeding process and the potential extent of shared ancestry between these mice are near impossible to estimate. Instead, working under the assumption and null hypothesis that chromosomes segregate independently, we can simulate a new panel of RI lines by mixing the chromosomes across a given panel of mice (Figure 2-5). For our purpose, this process was performed ten times: five for random sampling without replacement and five with replacement. Randomising chromosomes with replacement means that more than one simulated animal may share the same chromosome; e.g. the original Chr1 of RI-1 becomes the Chr1 of simulated RIs 2 and 3.

Figure 2-5 Process of randomising chromosomes across a panel of RI lines. The chromosomes (Chr1 to Chr4 shown) of the original BXD panel (BXD 1 to 6) are randomised across RIs to give the simulated BXD panel. Randomisation is on a per chromosome basis.
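One way to sketch the per-chromosome randomisation in R is shown below. The function `shuffle_chromosomes`, the strains-by-markers layout of `geno`, and the toy panel of six strains are assumptions made for this illustration.

```r
# For each chromosome, every simulated strain receives that chromosome's
# genotype block from a randomly drawn donor strain. With replace = FALSE
# each donor is used exactly once per chromosome; with replace = TRUE
# several simulated strains may share the same donor chromosome.
shuffle_chromosomes <- function(geno, chr, replace = FALSE) {
  sim <- geno
  for (ch in unique(chr)) {
    cols   <- which(chr == ch)
    donors <- sample(nrow(geno), replace = replace)
    sim[, cols] <- geno[donors, cols, drop = FALSE]
  }
  sim
}

# Hypothetical panel: 6 strains, 8 markers on 3 chromosomes (B=0, D=1).
set.seed(5)
geno <- matrix(rbinom(6 * 8, 1, 0.5), nrow = 6,
               dimnames = list(paste0("BXD", 1:6), paste0("m", 1:8)))
chr <- c(1, 1, 1, 2, 2, 3, 3, 3)

sim_without <- shuffle_chromosomes(geno, chr, replace = FALSE)
sim_with    <- shuffle_chromosomes(geno, chr, replace = TRUE)
```

Marker order and genotype patterns within a chromosome are preserved, so intra-chromosomal correlation is unaffected; only the pairing of chromosomes within an animal is randomised.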


2.5. remove.LD

2.5.1. Algorithm

This algorithm is designed for removing redundant linkage results potentially induced by LD (linkage disequilibrium) between neighbouring markers.

Given a set of M genetic markers, first calculate the M-by-M matrix, C, of genotype pattern correlation coefficients, such that C(i,j) is the genotype correlation between ith and jth markers. Then, from those coefficients calculated between different chromosomal marker-pairs, determine cmax, the 95th percentile (top 5%) of the resultant distribution, to be used as the correlation LD-threshold (e.g. Figure 4-3).
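The derivation of the LD-threshold can be sketched in R as follows. The function name `ld_threshold` and the random toy genotypes are hypothetical, and the sketch takes the percentile of the signed coefficients as written above; whether absolute values should be used instead is not specified here.

```r
# Toy genotype data: 31 strains by 12 markers on 4 chromosomes (B=0, D=1).
set.seed(6)
geno <- matrix(rbinom(31 * 12, 1, 0.5), nrow = 31)
chr  <- rep(1:4, each = 3)

# M-by-M matrix of genotype pattern correlations
C <- cor(geno, use = "pairwise.complete.obs")

# 95th percentile of the coefficients between markers on different
# chromosomes, counting each marker pair once (upper triangle only).
ld_threshold <- function(C, chr, prob = 0.95) {
  inter <- outer(chr, chr, "!=") & upper.tri(C)
  quantile(C[inter], probs = prob, na.rm = TRUE, names = FALSE)
}

cmax <- ld_threshold(C, chr)
```

The resulting value would be passed as `cor.thresh` to the remove.LD function.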

For each genome scan, which is essentially a list of M Praw-values (Pi where i=1,...,M) corresponding to the linkage significance at the M markers, apply the following algorithm to remove redundant linkage results potentially induced by LD:

1) Identify marker Mx with the most significant linkage (minimum Praw), Px.

2) Walk upstream of Mx to the next marker My, where y=(x+1) and Mx and My lie on the same chromosome. If C(x,y)>cmax, then remove Py. Repeat for the next markers My={x+2, x+3, ...} until C(x,y) ≤ cmax.

3) Similarly for markers downstream of Mx, My={x-1, x-2, ...}, if C(x,y)>cmax, then remove Py, until C(x,y) ≤ cmax.

4) Repeat from (1) until all markers have been examined.


2.5.2. R-script

This algorithm is implemented in R (R DEVELOPMENT CORE TEAM 2006) as follows:

    remove.LD <- function(pvals, chrs, geno.cor, cor.thresh) {
      ### A function to remove LD-induced linkages. Returns
      ### a matrix of P-values with LD-induced linkages
      ### removed (NA'ed).
      ###
      ### pvals:      a matrix of genes by markers linkage p-values;
      ###             with row and column names corresponding to
      ###             gene and marker IDs
      ### chrs:       chromosome of markers; expected to be in
      ###             same order as markers (columns) in pvals
      ### geno.cor:   square matrix of genotype pattern
      ###             correlation; both rows and columns are
      ###             expected to be in same order as pvals columns
      ### cor.thresh: single numeric value for LD-threshold

      pvals <- as.matrix(pvals)
      geno.cor <- as.matrix(geno.cor)

      n <- nrow(pvals)   # number of genes
      m <- ncol(pvals)   # number of markers

      ## initialise a matrix for storing results;
      ## same dimension as original pvals matrix
      res <- matrix(NA, ncol=ncol(pvals), nrow=nrow(pvals))

      ### safety check
      if(ncol(geno.cor) != m | nrow(geno.cor) != m)
        stop("Marker numbers in geno.cor <> pvals \n")

      ### *** the function ***
      for(i in 1:nrow(pvals)) {            # for each gene
        oldp <- pvals[i,]                  # original P
        newp <- rep(NA, length(oldp))
        # iterate thru each P until oldp is all NA'ed
        while(sum(!is.na(oldp))>0) {
          min.ind <- which.min(oldp)       # current minP
          newp[min.ind] <- oldp[min.ind]
          oldp[min.ind] <- NA
          # RHS of current peak
          j <- 1
          while( (min.ind+j <= m) &&
                 (!is.na(oldp[min.ind+j])) &&
                 (chrs[min.ind] == chrs[min.ind+j]) &&
                 (geno.cor[min.ind,min.ind+j]>cor.thresh) ) {
            newp[min.ind+j] <- oldp[min.ind+j] <- NA
            j <- j+1
          }
          # LHS of current peak
          j <- -1
          while( (min.ind+j >= 1) &&
                 (!is.na(oldp[min.ind+j])) &&
                 (chrs[min.ind] == chrs[min.ind+j]) &&
                 (geno.cor[min.ind,min.ind+j]>cor.thresh) ) {
            newp[min.ind+j] <- oldp[min.ind+j] <- NA
            j <- j-1
          }
        } # close while-loop
        res[i,] <- newp
      } # close for-loop

      ### annotate res
      colnames(res) <- colnames(pvals)
      rownames(res) <- rownames(pvals)

      ### return res
      res
    } # end of function


2.6. bqtl.twolocus test

The concept underlying this test is discussed in Section 4.1.2. The function, which is written in R and uses the R/bqtl package (BERRY 2005) for performing multiple regression analyses, is provided below.

    bqtl.twolocus <- function (multi.peaks, ana.obj) {
      ### A function to perform two-locus linkages on each
      ### of the expression traits at all pair-wise
      ### combinations of the set of corresponding eQTLs
      ### provided in multi.peaks
      ###
      ### multi.peaks: a list of numeric vectors: each object
      ###              in the list corresponds to a transcript,
      ###              which contains the indices of the markers
      ###              at which the transcript is linked; names
      ###              of these numeric vectors should be the
      ###              corresponding marker IDs
      ### ana.obj:     analysis object from the make.analysis.obj
      ###              function from R/BQTL. Contains expression
      ###              data (ana.obj$data), genotype data, and
      ###              marker map info

      library(bqtl)   # requires R/bqtl package

      genes <- names(multi.peaks)   # genes w/ multiple eQTLs

      ## initialise data.frame for storing results
      res <- data.frame(gene=NA, primary=NA, secondary=NA,
                        LOD=NA, p=NA, epi.coef=NA, coef.p=NA)

      ## check: nec. expr. data provided
      ## (the remainder of this check and the opening of the loop
      ## over genes are missing from the source text; the two
      ## statements below are reconstructed from context)
      if(sum(is.element(genes, colnames(ana.obj$data))) < length(genes))
        stop("Expression data missing for some genes \n")

      for(g in 1:length(genes)) {             # for each gene

        peaks <- multi.peaks[[g]]             # linked eQTLs
        npeaks <- length(peaks)               # num. eQTLs

        for(i in 1:npeaks) {                  # for each eQTL

          ## single locus model
          single.bqtl <- bqtl(as.formula(paste(genes[g], "~",
                              names(peaks)[i])), ana.obj)

          ## two-locus model at all marker loci
          ## with i-th marker as primary eQTL
          full.bqtls <- bqtl(as.formula(paste(genes[g], "~",
                             names(peaks)[i],"*locus(all)")), ana.obj)

          ## LOD scores from comparing the two models
          lods <- loglik(full.bqtls) - loglik(single.bqtl)

          ## coefficient measuring the epistasis
          ## (joint) interaction of the primary and
          ## secondary loci
          coefs <- unlist(lapply(lapply(full.bqtls,
                          "coefficients"),"[",4))

          for(j in (1:npeaks)[-i]) {          # two-locus effects
                                              # for each of the
                                              # eQTLs in the set
            lod <- lods[peaks[j]]
            p <- (sum((lods>=lod),na.rm=TRUE)/length(lods))
            coe <- coefs[peaks[j]]
            coe.p <- ( sum((abs(coefs)>=abs(coe)),na.rm=TRUE)
                       /length(coefs) )
            res <- rbind(res, c(genes[g], names(peaks)[i],
                                names(peaks)[j], lod, p, coe, coe.p))

          } # end 3rd for-loop
        } # end 2nd for-loop
      } # end 1st for-loop

      res <- res[-1,]
      res

    } # end function


2.7. R-scripts to eliminate redundant linkages

2.7.1. Method: support interval of 2 LOD unit

    width_2lod <- function( p, stat, lod=FALSE ) {
      ### A script to iteratively eliminate markers that
      ### are within 2-LOD unit from the main linkage
      ### peak
      ###
      ### p:    matrix of P-values
      ### stat: matrix of test statistics
      ### lod:  logical indicating whether stat=lod scores
      if( !lod ) stat <- stat/(2*log(10))
      n <- nrow(p)
      m <- ncol(p)
      res <- matrix(NA,ncol=m,nrow=n,dimnames=dimnames(p))
      for( i in 1:n ) {
        cur.stat <- as.numeric(stat[i,])
        cur.pval <- as.numeric(p[i,])
        while( sum(is.na(cur.pval))!=m ) {
          cur.peak <- which.min(cur.pval)
          res[i,cur.peak] <- cur.pval[cur.peak]
          # up stream
          u <- cur.peak-1
          while( u>0 && !is.na(cur.pval[u]) &&
                 (cur.stat[cur.peak]-cur.stat[u])<2 ) {
            cur.pval[u] <- NA
            u <- u-1
          }
          # down stream
          u <- cur.peak+1
          while( u<=m && !is.na(cur.pval[u]) &&
                 (cur.stat[cur.peak]-cur.stat[u])<2 ) {
            cur.pval[u] <- NA
            u <- u+1
          }
          cur.pval[cur.peak] <- NA
        }
      }
      res
    }


2.7.2. Method: monotonic decay

    mono_decay <- function(p) {
      ### To remove markers whose linkage scores are in
      ### monotonic decay from main linkage peak.
      ###
      ### p: matrix of P-values
      n <- nrow(p)
      m <- ncol(p)
      res <- matrix( NA, ncol=m, nrow=n, dimnames=dimnames(p) )
      for( i in 1:n ) {
        cur.pval <- as.numeric(p[i,])
        while( sum(is.na(cur.pval))!=m ) {
          cur.peak <- which.min(cur.pval)
          res[i,cur.peak] <- cur.pval[cur.peak]
          # up stream
          u <- cur.peak-1
          prev.p <- cur.pval[cur.peak]
          while( u>0 && !is.na(cur.pval[u]) &&
                 cur.pval[u]>=prev.p ) {
            prev.p <- cur.pval[u]
            cur.pval[u] <- NA
            u <- u-1
          }
          # down stream
          u <- cur.peak+1
          prev.p <- cur.pval[cur.peak]
          while( u<=m && !is.na(cur.pval[u]) &&
                 cur.pval[u]>=prev.p ) {
            prev.p <- cur.pval[u]
            cur.pval[u] <- NA
            u <- u+1
          }
          cur.pval[cur.peak] <- NA
        }
      }
      res
    }


2.7.3. Method: HUBNER et al. (2005)

    hubner <- function(p, marker.info) {
      ### To remove potentially redundant linkage due to LD
      ### using method described in Hubner et al. (2005)
      ###
      ### p:           matrix of P-values
      ### marker.info: list of two vectors: chr & lox
      n <- nrow(p)
      m <- ncol(p)
      chrs <- marker.info[[1]]
      lox <- marker.info[[2]]
      res <- matrix(NA,ncol=m,nrow=n,dimnames=dimnames(p))
      for( i in 1:n ) {
        cur.pval <- as.numeric(p[i,])
        while( sum(is.na(cur.pval))!=m ) {
          cur.peak <- which.min(cur.pval)
          cur.chr <- chrs[cur.peak]
          cur.lox <- lox[cur.peak]
          res[i,cur.peak] <- cur.pval[cur.peak]
          # up stream
          u <- cur.peak-1
          while( u>0 && !is.na(cur.pval[u]) &&
                 cur.pval[u]>=cur.pval[cur.peak] &&
                 chrs[u]==cur.chr &&
                 abs(cur.lox-lox[u])<=5 ) {
            cur.pval[u] <- NA
            u <- u-1
          }
          # down stream
          u <- cur.peak+1
          while( u<=m && !is.na(cur.pval[u]) &&
                 cur.pval[u]>=cur.pval[cur.peak] &&
                 chrs[u]==cur.chr &&
                 abs(cur.lox-lox[u])<=5 ) {
            cur.pval[u] <- NA
            u <- u+1
          }
          cur.pval[cur.peak] <- NA
        }
      }
      res
    }

57 Linkage analysis of expression traits in three tissues of RI mice

3. Linkage analysis of expression traits in three tissues of recombinant inbred mice

This chapter aims to elucidate the genetic components influencing transcript abundance. For statistical and computational simplicity, we used a single-locus mapping approach to map genetic regulators with major influences on gene expression. Using similar approaches, several recent studies have shown different extents of genetic influence on gene expression variation (BREM et al. 2002; BYSTRYKH et al. 2005; CHESLER et al. 2005; COTSAPAS et al. 2006; HUBNER et al. 2005; MONKS et al. 2004; MORLEY et al. 2004; SCHADT et al. 2003; YVERT et al. 2003).

We performed linkage analyses on mRNA levels of approximately 22 thousand transcripts using brain, kidney, and liver tissues from a panel of 31 recombinant inbred (RI) mouse strains. Because actual causative variations influencing gene expressions are unknown, the purpose of eQTL mapping analyses is to identify regions of the genome (eQTLs represented by genetic markers) harbouring these variations rather than to identify the causative variations themselves. The presence and involvement of an eQTL in the variation of a gene’s expression level is inferred if the genotypes corresponding to the eQTL correlate significantly with the gene’s expression levels. The gene is then said to be “linked” or “mapped” to the eQTL or genetic marker.

Five objectives are addressed in this chapter. The first and most obvious is to determine the proportion of genes whose variation in expression levels has a mappable genetic component. Due to intrinsic variations associated with microarrays, such as intensity- and spatially-dependent biases (YANG et al. 2002), measurement reliability decreases as expression level decreases, and eQTLs mapped for genes expressed at low levels are potentially artefacts of microarray variation. To avoid this type of false linkage, we may want to consider only the subset of ‘expressed’ genes, where an empirical threshold is used to define ‘expression’. Clearly, low expression level does not necessarily imply the absence of genetic influence, and so excluding lowly expressed genes in an eQTL analysis may potentially increase the rate of false negatives. We consider the expected proportion of linkages using negative controls and simulation studies.

Our second objective is to examine the number of eQTLs mapped to each transcript. Expression traits are complex and likely to be influenced by multiple eQTLs. Support for this assumption is suggested by several eQTL mapping studies in which the number of linkages often exceeds the number of transcripts that are actually linked (BREM and KRUGLYAK 2005; BREM et al. 2002; CHESLER et al. 2005; HUBNER et al. 2005; MONKS et al. 2004; SCHADT et al. 2003; YVERT et al. 2003). Interestingly, these observations are from single-locus mapping approaches that are only capable of identifying multiple eQTLs with independent (additive) effects. Multiple eQTLs that interact dependently (epistatic interaction) would likely have been missed in these studies. The validity of, and reasons for, multiple linkages are discussed in much more detail in Chapter 4.

Our third objective is to determine the type of regulatory mechanisms influencing transcript abundance. Gene expression level may be regulated by one or more cis-acting or trans-acting elements (Section 1.2). A cis-acting regulatory element is a sequence element such as a transcription effector-binding site that affects transcript level of a nearby gene (Section 1.2.1). A trans-acting regulatory factor is a gene-product that may directly or indirectly influence the transcript level of one or more genes (Section 1.2.2). Therefore, determining the proportions of genes influenced by cis- or trans-acting variations will provide insight into whether transcriptional control is mainly gene-specific (the variation influences only one gene) or pleiotropic (the variation influences multiple genes).

In the current literature, estimates of these proportions are highly variable; 19% - 100% of mappable genes (genes with at least one linked eQTL) with known genomic locations have been defined as under cis-acting regulation (Table 3-1). Discrepancies between the studies may be due to differences in their experimental designs, including differences in the type and size of segregating population, the analytical techniques employed, the critical values chosen to define linkage, and the relative genomic distance used to define cis-regulation (see review by ROCKMAN and KRUGLYAK, 2006). Despite these differences, a common hypothesis from some of these studies is that, because cis-acting variants are gene-specific, they tend to have larger effects than trans-acting variants, whose effects tend to be second-order and pleiotropic. Consequently, cis-acting variants are likely to be more detectable at stringent linkage significance thresholds than trans-acting variants.

Table 3-1 Summary estimate of cis-acting eQTLs in eight recent eQTL-mapping studies. % cis eQTLs = (# genes with cis-linkage)/(# genes with at least one eQTL); PGWS: genome-wide significance level; Praw: point-wise significance level (raw P-value).

References                Samples             % cis eQTLs
(BREM et al. 2005)        40 haploid yeast    32.5% @ Praw<5x10^-5
(YVERT et al. 2003)       86 haploid yeast    25% @ Praw<3.4x10^-5
(SCHADT et al. 2003)      111 F2 mice         34% @ PGWS<0.05
                                              71% @ PGWS<10^-4
                          76 F2 maize         80% @ PGWS<10^-4
(CHESLER et al. 2005)     32 BXD mice         94% @ PGWS<0.02
(BYSTRYKH et al. 2005)    22 BXD mice         34% @ PGWS<0.05
(HUBNER et al. 2005)      30 BXH/HXB rat      35%-40% @ PGWS<0.05
                                              80%-100% @ PGWS<10^-4
(MONKS et al. 2004)       15 CEPH families    20% @ Praw<5x10^-4
                                              40% @ Praw<5x10^-6
(MORLEY et al. 2004)      14 CEPH families    19% @ PGWS<0.001


Due to the pleiotropic effects of trans-acting factors, it is possible that only a few trans-acting variants are necessary to instigate multiple individual phenotypic differences. Thus, although statistical power may be limited in the detection of trans-effects, co-regulated genes may concurrently map to the same eQTL, resulting in a linkage “hotspot” and implying the presence of a “master regulator” (BREM et al. 2002; CHESLER et al. 2005; MORLEY et al. 2004; SCHADT et al. 2003; YVERT et al. 2003). Similarly, it is known that in gene-rich regions, co-regulation may be achieved via a common regulatory element (e.g., SRIVASTAVA et al. 2000); thus master regulators may be cis-acting as well as trans-acting. Evidence for linkage hotspots is also inconsistent in the current eQTL literature. In one example, over 1.5 thousand genes are found to link to the same genomic region (CHESLER et al. 2005), yet in another study no more than 6 genes are found to link to a common locus (MONKS et al. 2004). The fourth aim of this study is to ascertain the evidence for the presence or absence of linkage hotspots.

The final aim of this analysis is to compare linkage results between three tissues: brain, kidney, and liver. Tissue-specific gene expression is regulated by tissue-specific trans-acting factors and/or tissue-specific cis-regulatory elements. It has been shown in rat that the majority (~85%) of eQTLs are tissue-specific and a large portion of these act in trans (HUBNER et al. 2005). By comparing the results of our three-tissue eQTL studies, we want to determine the ratio of ubiquitous to tissue-specific gene expression as well as the regulatory mechanisms underlying tissue-specific gene expression.


3.1. Experimental design

3.1.1. Single-locus eQTL mapping

To understand the genetic regulation of gene expression, we performed linkage analysis on 23,040 expression traits (the total number of probes on each microarray) in three different tissues using a panel of 31 BXD recombinant inbred mice (Section 2.1). Expression levels were measured using a two-coloured microarray platform, resulting in 93 sets (three tissues of 31 strains) of A-values (average expression of strain and reference) and M-values (relative expression of strain to reference).

Although A-values represent the actual average intensity levels of the probes on an array, we are comparing between microarrays and have therefore adopted the common reference microarray experimental design, in which the sample from each individual is hybridised against a common reference. In such an experimental design the relative expression levels (M-values) allow comparison of gene expression across arrays. Therefore, analyses have been performed using M-values, and the terms “expression values” and “expression levels” are generally reserved for M-values in this manuscript. On rare occasions where these terms are used to refer to A-values, such usage will be stated explicitly. Please see Section 1.4 and Section 2.1.3 for a full explanation of the use of microarray data in eQTL studies.

For each tissue, eQTL mapping is performed on normalised M-values (expression values) using simple regression. For each linkage test, the 31 normalised M-values of an expression trait are divided into either the B or D genotype group corresponding to the strain’s genotype (homozygous BB or DD) at the marker. Regression lines are then fitted through the data, and the slope of the line that best fits the data (i.e., the line that results in the smallest residual sum of squares) is used to calculate the likelihood ratio (LR). LR is a measure of the likelihood that the transcript and marker are linked. This process is repeated for each of the 23,040 probes on the microarray against each of the 779 genetic markers.
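A single linkage test of this kind might be sketched in R as follows. This is an illustration under stated assumptions rather than the mapping software actually used: the LR statistic is taken as n·ln(RSS0/RSS1), which is consistent with the LOD conversion stat/(2*log(10)) in Section 2.7.1, and the P-value is taken from a Chi-square distribution with one degree of freedom. The function name `single_locus_lr` and the simulated data are hypothetical.

```r
# Likelihood-ratio test of a marker effect on one expression trait.
# expr: M-values of one transcript across strains
# geno: marker genotypes coded B = 0, D = 1 (NA if missing)
single_locus_lr <- function(expr, geno) {
  ok <- !is.na(expr) & !is.na(geno)
  y  <- expr[ok]
  g  <- geno[ok]
  rss0 <- sum((y - mean(y))^2)            # null model: no marker effect
  rss1 <- sum(residuals(lm(y ~ g))^2)     # regression on genotype
  lr   <- length(y) * log(rss0 / rss1)    # assumed form of the LR statistic
  c(LR = lr, Praw = pchisq(lr, df = 1, lower.tail = FALSE))
}

# Simulated example: 31 strains with a strong genotype effect
set.seed(3)
geno <- rep(c(0, 1), length.out = 31)
expr <- 2 * geno + rnorm(31, sd = 0.5)

fit <- single_locus_lr(expr, geno)
lod <- fit[["LR"]] / (2 * log(10))        # equivalent LOD score
```

A full genome scan would apply such a test to every trait-marker pair, giving a traits-by-markers matrix of Praw values.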

To determine the significance of these LR scores, we first obtain, from a Chi-square (χ2) distribution, the point-wise nominal P-value (Praw) corresponding to each LR score. Three methods of multiple-testing correction are then explored for defining linkage significance; i.e. for determining Praw thresholds:

1) A genome-wide Bonferroni-corrected PBON corresponding to the conventional 95% confidence level is used: given there are M=779 markers, a gene is linked to a marker if its Praw ≤ PBON = 0.05/779 ≈ 6.4x10^-5.

2) P-value thresholds calculated from simulated data (Section 3.1.2) and corresponding to a false positive rate of 5% are used. That is, the average Praw, from five datasets simulated under the null hypothesis of no linkage, corresponding to the 5th percentile of each full dataset is used: PFULL=8.6x10^-5 for brain, PFULL=7.8x10^-5 for kidney, and PFULL=9.6x10^-5 for liver.

3) Analogous to the second method, but taking marker-dependence into account. The null Praw distributions are generated after redundant linkages caused by linkage disequilibrium between neighbouring markers are first removed using an algorithm, remove.LD, described in Chapter 4. That is, the null Praw distribution is generated from a REDuced set of Praw. This results in much less conservative, but more appropriate, Praw thresholds: PRED=1.4x10^-4 for brain, PRED=1.2x10^-4 for kidney, and PRED=1.5x10^-4 for liver.
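The first two thresholds can be illustrated with a short R sketch. The Bonferroni value follows directly from the text. For the empirical threshold, the sketch assumes one possible reading of the procedure, namely that the threshold is the 5th percentile of the per-trait minimum Praw in each null dataset, averaged over the five simulations; this reading, the function name `empirical_threshold`, and the uniform toy P-values (2,000 traits by 50 independent markers) are all assumptions made for illustration.

```r
# 1) Bonferroni threshold for M = 779 markers
p_bon <- 0.05 / 779

# 2) Empirical threshold from null simulations (assumed procedure):
#    take the minimum P-value of each trait across markers, then the
#    alpha-quantile of these minima, so that a fraction alpha of null
#    traits shows at least one linkage.
empirical_threshold <- function(p_null, alpha = 0.05) {
  min_p <- apply(p_null, 1, min, na.rm = TRUE)
  quantile(min_p, probs = alpha, names = FALSE)
}

# Five toy null datasets of independent uniform P-values
set.seed(4)
null_sets <- replicate(5, matrix(runif(2000 * 50), nrow = 2000),
                       simplify = FALSE)
p_full <- mean(sapply(null_sets, empirical_threshold))
```

With independent markers, as in this toy example, the empirical threshold lies close to the Bonferroni value for the same number of markers (0.05/50). For the third method, the same calculation would be applied after passing each null P-value matrix through remove.LD.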


3.1.2. Simulation studies of the null hypothesis: sim0B/K/L

To understand the null distribution and to assess the rate of false linkages, three simulation studies (sim0B, sim0K, and sim0L) corresponding to the brain, kidney, and liver BXD studies have been conducted. The aim of these studies is to explore the behaviour of our three BXD eQTL-mapping studies under the null hypothesis of no linkage. That is, we want to break any real relationship between expression traits and marker genotypes but retain any residual between-array variations that may cause spurious linkages.

To achieve this, for each sim0 dataset, expression values (M-values) for each of 31 microarrays (individuals) are sampled from the corresponding array of the corresponding BXD dataset (Section 2.3.1): e.g. M-values of the first array in sim0B are sampled from the normalised M-values of the first array of the BXD brain study. The new randomised datasets are then subjected to eQTL linkage analyses as performed for the real data. This entire process is repeated five times for each of sim0B, sim0K, and sim0L. The results presented in this chapter are averages of the five simulations.


3.2. Expression quantitative trait loci

A genome scan (e.g. Figure 3-1), or expression quantitative trait locus (eQTL) mapping, is a process that searches the genome for regions, or eQTLs, potentially influencing an expression trait (i.e. expression level of a transcript/gene/mRNA of interest). Genetic linkage, or association, between an expression trait and an eQTL is said to be detected when a correlation test between expression trait values and marker genotypes is significant, i.e. when it exceeds some pre-defined statistical threshold (e.g. PBON, PFULL, or PRED). The expression trait is then said to be mapped or linked to the marker or eQTL.

Figure 3-1 Example genome scan (x-axis: chromosomes 1 to X; y-axis: -log10(Praw); peak labelled D8Mit189). Transcript AK007639 (corresponding to the RIKEN clone: 1810029E06Rik) is linked to marker D8Mit189 on Chr8, at Praw ≤ PBON (equivalent to -log10(Praw) ≥ 4.2; dotted line).


3.2.1. Defining Linkage Significance

Genetic linkage significance, or the criterion for calling a correlation between an expression trait and a marker locus significant, can be defined using several methods. Three are explored in this thesis (Section 3.1.1):

1. Bonferroni correction (PBON);

2. Empirical P calculated using simulation studies and assuming marker-independence; i.e. where the null distribution of nominal Praw values is obtained using the full dataset (PFULL);

3. Empirical P calculated using simulation studies and correcting for marker-dependence; i.e. where the null distribution of nominal Praw values is obtained using a “reduced” dataset (PRED).

With all three methods, an acceptable false positive rate of 5% is used. Summaries of the results using each of these three methods are presented in Table 3-2A, B, and C, respectively.

The Bonferroni correction for multiple testing is an obvious choice for its simplicity, whereby the point-wise Praw threshold for each trait is simply PBON=α/M, where α is the acceptable false positive rate and M is the number of putative eQTLs being tested (i.e. the number of markers for a single-locus, single-point mapping). Thus, allowing an expression trait to show random linkage 5% of the time, PBON=0.05/779≈6.4x10^-5 is set for the brain, kidney, and liver BXD eQTL-mapping analyses.

An underlying assumption of the Bonferroni correction is that of independence between tests: the test of association for each marker-trait pair is assumed independent of other markers and expression traits. To test whether this assumption holds in eQTL-mapping, a series of five simulations (5×sim0B, 5×sim0K, and 5×sim0L) has been performed for each of the three tissue datasets, to model the null hypothesis of no linkage (Section 3.1.2). If the assumption of independence truly holds, then the extent of linkages in these sim0 data should average 5%. Contrary to this expectation, a false positive rate of less than 5% is consistently observed across all simulation studies (last column of Table 3-2A), suggesting the presence of marker and/or trait dependence.
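Why dependence between markers makes the Bonferroni threshold conservative can be illustrated with a toy simulation. This is not the sim0 procedure of Section 3.1.2: it assumes, purely for illustration, that markers fall into blocks of perfectly correlated loci, so that each block contributes one uniform null P-value per trait. The function names, block size and number of simulated traits are hypothetical. The sketch also computes an empirical threshold as the 5th percentile of the per-trait minimum P-values, which is one common construction and is not claimed to be the exact definition of PFULL or PRED.

```python
import random

def null_min_p(n_traits, n_markers, block_size, rng):
    """Per-trait minimum P-value under a toy null.

    Markers are grouped into blocks of `block_size` perfectly correlated
    loci, so each block contributes a single Uniform(0, 1) P-value.
    block_size=1 corresponds to fully independent markers.
    """
    n_blocks = -(-n_markers // block_size)  # ceiling division
    return [min(rng.random() for _ in range(n_blocks))
            for _ in range(n_traits)]

def false_positive_rate(min_p_values, threshold):
    """Fraction of null traits with at least one marker at P <= threshold."""
    return sum(p <= threshold for p in min_p_values) / len(min_p_values)

def empirical_threshold(min_p_values, alpha=0.05):
    """Order-statistic estimate of the alpha-quantile of the minimum P-values."""
    ordered = sorted(min_p_values)
    return ordered[max(int(alpha * len(ordered)) - 1, 0)]

rng = random.Random(1)  # fixed seed so the run is repeatable
n_traits, n_markers = 2000, 779
p_bon = 0.05 / n_markers

independent = null_min_p(n_traits, n_markers, 1, rng)
correlated = null_min_p(n_traits, n_markers, 10, rng)

fp_independent = false_positive_rate(independent, p_bon)
fp_correlated = false_positive_rate(correlated, p_bon)
thr_independent = empirical_threshold(independent)
thr_correlated = empirical_threshold(correlated)

print(f"Bonferroni FP rate, independent markers: {fp_independent:.3f}")
print(f"Bonferroni FP rate, correlated markers:  {fp_correlated:.3f}")
print(f"Empirical 5% threshold, independent: {thr_independent:.1e}")
print(f"Empirical 5% threshold, correlated:  {thr_correlated:.1e}")
```

Under these assumptions the Bonferroni threshold gives close to the nominal 5% only when markers are independent. With correlated markers the effective number of tests is smaller, so the realised false positive rate falls below 5% and a less stringent empirical threshold would suffice.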

The inappropriateness of the Bonferroni adjustment points to the need for empirical P-value thresholds for defining linkage significance. Empirical Praw cutoffs can be determined using the sim0 studies: for an acceptable 5% false positive rate, an average PFULL corresponding to the 5th percentile of the null distributions across the five simulations is calculated for each of the brain, kidney, and liver data independently (Section 3.1.1). Interestingly, these PFULL (~10^-5) are not much different to PBON, hinting that the extent of linkages using these thresholds would not be dramatically different to when PBON was applied.

The design of the sim0 studies is such that any dependency between expression traits is essentially broken because the expression values are re-sampled independently for each individual animal (i.e. microarray). Thus, if the conservativeness of the Bonferroni correction is due to the presence of trait-dependency then the extent of linkages observed in the sim0 studies, using these PFULL, should be 5%. Again, this expectation is not met (Table 3-2B last column): the observed extents fall below 5% and are comparable to those observed when PBON was applied, implicating marker-dependence as the confounding factor.

If two putative eQTLs, L1 and L2, are correlated, possibly due to linkage disequilibrium (LD), then, statistically, the strength of linkage for an expression trait at L1 will be similar to that at L2. Consequently, marker-correlation can lead to multiple eQTLs per expression trait and hence the illusion of multiple genetic control on the gene’s expression level (Section 3.3). To address this problem, a novel method (remove.LD), described in Chapter 4, for eliminating redundant linkages caused by potential LD, is applied prior to calculating PRED.


As with PFULL, PRED is the average Praw corresponding to the 5th percentile of the null Praw distribution, where the null is generated by performing single-locus eQTL-mapping analyses on the sim0 datasets followed by application of remove.LD. The resulting PRED (~10^-4) for all three datasets (Brain, Kidney, and Liver) are notably less conservative than the thresholds from the previous two multiple-testing correction methods (PFULL and PBON).

Using these new empirical PRED, the extents of observable linkages in the sim0 studies finally fall within expectation (Table 3-2C last column): averages of 4.7%, 4.8% and 5% of simulated expression traits are significantly mapped to at least one eQTL in the sim0B, sim0K, and sim0L data respectively. Of the three methods examined, this is the most appropriate for correcting for multiple testing and defining linkage significance; thus, from here onwards, PRED are used.


Table 3-2A eQTL-mapping results using PBON for defining linkage significance. Results for each of the BXD BRAIN, KIDNEY, and LIVER studies are shown. “Expressed” transcripts are those for which at least half of the mice have expression values greater than the 90th percentile of the negative control (NC) values. Unique linkages are the subset of total linkages remaining after application of remove.LD to remove potentially redundant linkages due to linkage disequilibrium. Note that the number of linkages does not equate to the number of mappable transcripts because an expression trait may be linked to more than one eQTL. The last column shows the average results of the five sim0 studies for the corresponding tissue (sim0B, sim0K, or sim0L).

Tissue  Measure                      All transcripts  “Expressed” transcripts  NC          Average sim0
BRAIN   Total # transcripts          21,765           17,706 (81.4%)           875         23,040
        # transcripts with linkages  1,781 (8.2%)     1,539 (8.7%)             58 (6.6%)   538 (2.3%)
        # linkages                   3,484            3,024                    85          859
        # unique linkages            1,853            1,603                    60          554
KIDNEY  Total # transcripts          21,765           9,237 (42.4%)            875         23,040
        # transcripts with linkages  670 (3.1%)       309 (3.3%)               44 (5.0%)   599 (2.6%)
        # linkages                   1,062            483                      60          951
        # unique linkages            687              320                      46          616
LIVER   Total # transcripts          21,765           10,728 (49.3%)           875         23,040
        # transcripts with linkages  855 (3.9%)       440 (4.1%)               60 (6.9%)   467 (2.0%)
        # linkages                   1,459            789                      98          745
        # unique linkages            891              459                      65          476


Table 3-2B eQTL-mapping results using PFULL for defining linkage significance. See Table 3-2A for legend.

Tissue  Measure                      All transcripts  “Expressed” transcripts  NC          Average sim0
BRAIN   Total # transcripts          21,765           17,706 (81.4%)           875         23,040
        # transcripts with linkages  2,185 (10.0%)    1,878 (10.6%)            79 (9.0%)   712 (3.1%)
        # linkages                   4,365            3,774                    117         1,155
        # unique linkages            2,293            1,972                    21          735
KIDNEY  Total # transcripts          21,765           9,237 (42.4%)            875         23,040
        # transcripts with linkages  783 (3.6%)       361 (3.9%)               51 (5.8%)   714 (3.1%)
        # linkages                   1,252            567                      74          1,151
        # unique linkages            809              337                      54          735
LIVER   Total # transcripts          21,765           10,728 (49.3%)           875         23,040
        # transcripts with linkages  1,246 (5.7%)     640 (6.0%)               83 (9.5%)   703 (3.1%)
        # linkages                   2,125            1,121                    147         1,153
        # unique linkages            1,308            678                      89          722


Table 3-2C eQTL-mapping results using PRED for defining linkage significance. See Table 3-2A for legend.

Tissue  Measure                      All transcripts  “Expressed” transcripts  NC           Average sim0
BRAIN   Total # transcripts          21,765           17,706 (81.4%)           875          23,040
        # transcripts with linkages  2,996 (13.8%)    2,560 (14.5%)            113 (12.9%)  1,100 (4.8%)
        # linkages                   6,281            5,385                    190          1,849
        # unique linkages            3,217            2,745                    118          1,150
KIDNEY  Total # transcripts          21,765           9,237 (42.4%)            875          23,040
        # transcripts with linkages  1,195 (5.5%)     518 (5.6%)               67 (7.7%)    1,101 (4.8%)
        # linkages                   2,020            879                      116          1,853
        # unique linkages            1,258            551                      74           1,150
LIVER   Total # transcripts          21,765           10,728 (49.3%)           875          23,040
        # transcripts with linkages  1,821 (8.4%)     940 (8.8%)               113 (12.9%)  1,105 (4.8%)
        # linkages                   3,255            1,709                    213          1,890
        # unique linkages            1,931            999                      123          1,152




3.2.2. Extent of Mappable Expression Traits

Of a total of 23,040 probes represented on the microarrays, ~5.5% are controls; i.e. non-DNA molecules or non-mouse DNA molecules. Excluding these controls, 13.8% of all 21,765 actual mouse transcripts are significantly mapped to at least one eQTL (marker locus) in the brain, 5.5% are mapped in the kidney, and 8.4% are mapped in the liver study (Table 3-2C column “All transcripts”).

With an expected false positive rate of 5%, the relatively high proportions of transcripts demonstrating linkage in the brain and liver studies are promising and imply up to 14% of transcripts are under genetic influence. Although the proportion of eQTLs identified in the kidney data just exceeds random expectation, there may still be real significant linkages embedded within the “noise”. These “real” signals are expected to yield more replicable results (PEIRCE et al. 2006), and so are likely to be those with the best ranking P-values.

It has been suggested that post hoc analyses incorporating other sources of information, such as the relative location of the transcript and eQTL, can be used to make more informed inferences and to identify those linkages that are more likely to be real (CARLBORG et al. 2005). The effects of noise and systematic variation on eQTL-mapping are discussed in the following subsections (Section 3.2.3 and Section 3.2.4); tissue-specificity is discussed in Section 3.6; and cis- and trans-linkages (relative location between transcript and eQTL) are analysed and discussed in Section 3.4.

3.2.3. Artefacts of residual inter-array variation

In an eQTL-mapping experiment, each microarray measures the expression levels of tens of thousands of transcripts in one individual (or strain). For example, in the brain BXD eQTL data, 31 brain mRNA samples corresponding to 31 BXD RI mice are each hybridised to one of 31 independent microarrays. A potential problem of such an experimental design is that any disparity between the microarrays will resemble individual variation. Consequently, if inter-array variation causes the values of a transcript to fluctuate in such a way that they correlate significantly with a marker genotype pattern, the transcript will appear to be under the influence of, and thus show linkage to, that marker locus. More seriously, if inter-array variation influences only a subset of genes, then this subset of transcripts will appear to be influenced by a common eQTL, i.e. a master regulator (see Section 3.5).

In anticipation of this problem, some process of between-array normalisation, including scale-, median-, and quantile-adjustments, is generally performed on expression microarray data. Quantile-normalisation, possibly the most robust inter-array normalisation method (WILLIAMS et al. 2006), coerces expression values across a set of arrays (31 arrays in our case) to an identical distribution (SMYTH and SPEED 2003; e.g. Figure 3-2). This method has been applied to our BXD data, following loess print-tip normalisation to remove potential intra-array variations (see Section 1.4.1).

Because of the massively parallel nature of microarray experiments, residual inter-array variations may be retained even following quantile-normalisation. To test this concern, we simulated four expression datasets, sim(0), sim(1), sim(2) and sim(3) (Section 2.3.3), to assess the effects of residual inter-array variation on trait variances. In sim(0), expression values in each of 31 arrays are sampled from a normal distribution with mean μ=0 and standard deviation SD=1. In the other three datasets, the same is performed for 30 (sim(1)), 29 (sim(2)), and 28 (sim(3)) arrays, respectively. For the remaining 1 (sim(1)), 2 (sim(2)), and 3 (sim(3)) arrays, expression values are sampled from a normal distribution with the same mean (μ=0) but twice the standard deviation (SD=2, i.e. four times the variance). That is, we have randomly increased the variance of 1 to 3 arrays, from an otherwise identical set of expression data (see Appendix A.1.2 for boxplot representations of the data).


Figure 3-2 Distributions of expression data in brain samples of 31 BXD RI mice, before (top) and after (bottom) quantile-normalisation. Boxplot representations of M-values (y-axis) of each of the 31 arrays (x-axis) are shown. The top panel shows the distributions following loess print-tip normalisation, which coerces all arrays to have a mean of M=0. The bottom panel shows the distributions following quantile-normalisation, which coerces all the arrays to have identical distributions. See Appendix A.1.1 for similar plots with the kidney and liver data.


Following quantile-normalisation (Appendix A.1.2), the SD per transcript is examined in each of the four datasets and the distributions plotted in Figure 3-3: sim(0) in black, sim(1) in red, sim(2) in green and sim(3) in blue. To ensure these observations are not chance events, these simulations have been repeated five times, with consistent results (Figure 3-3). Since quantile-normalisation coerces all arrays in a dataset to an identical distribution, trait-variances in all four datasets are expected to revert to SD=1. But this is not observed: as the number of “variable” arrays (arrays with increased variance) increases, trait-variance also increases, thus confirming our suspicion that residual inter-array variation may remain even following quantile-normalisation.
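The behaviour described above can be reproduced with a small sketch. It is an illustration under stated assumptions, not the code used for the simulations in Section 2.3.3: the quantile-normalisation shown is the textbook form (each array is mapped by rank onto the mean of the sorted arrays), the number of simulated transcripts (2,000) is arbitrary, the "variable" arrays are simply placed last, and each dataset is drawn afresh rather than derived from a shared base dataset. All function names are hypothetical.

```python
import random
import statistics

def simulate(n_transcripts, n_arrays, n_variable, rng):
    """Rows are transcripts, columns are arrays.
    The last n_variable arrays are drawn with SD=2, the rest with SD=1."""
    sds = [1.0] * (n_arrays - n_variable) + [2.0] * n_variable
    return [[rng.gauss(0.0, sd) for sd in sds] for _ in range(n_transcripts)]

def quantile_normalise(data):
    """Map every array (column) by rank onto one reference distribution:
    the mean of the sorted columns."""
    n_rows, n_cols = len(data), len(data[0])
    columns = [[row[j] for row in data] for j in range(n_cols)]
    # orders[j][r] is the row index holding the r-th smallest value of array j
    orders = [sorted(range(n_rows), key=col.__getitem__) for col in columns]
    reference = [
        sum(columns[j][orders[j][r]] for j in range(n_cols)) / n_cols
        for r in range(n_rows)
    ]
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        for r, i in enumerate(orders[j]):
            out[i][j] = reference[r]
    return out

def mean_trait_sd(data):
    """Average, over transcripts, of the SD across arrays."""
    return statistics.fmean(statistics.stdev(row) for row in data)

rng = random.Random(0)  # fixed seed so the run is repeatable
mean_sd = {}
for n_variable in range(4):  # sim(0) .. sim(3)
    raw = simulate(2000, 31, n_variable, rng)
    mean_sd[n_variable] = mean_trait_sd(quantile_normalise(raw))
    print(f"sim({n_variable}): mean per-transcript SD = {mean_sd[n_variable]:.3f}")
```

In this sketch the increase arises because the reference distribution is the average of the sorted arrays: every wider array stretches the reference, and the stretched values are then assigned back to all arrays, so the per-transcript SD grows with the number of variable arrays.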

Figure 3-3 Density distribution of standard deviations per expression trait from sets of five-replicates of four simulated expression datasets. In sim(0), sim(1), sim(2), and sim(3), expression values of 0 (black), 1 (red), 2 (green), or 3 (blue) of 31 arrays have been sampled from a normal distribution with SD=2 rather than SD=1.


A well-known problem associated with microarray data is that of spatial variation: transcripts spotted (or synthesised) on different physical sections of an array may be influenced by different technical/systematic variations, attributable either to array synthesis or to the hybridisation process. Consequently, various subsets of transcripts may be affected by different levels of “noise” for a given microarray, and under such a scenario, global array-wise normalisation procedures may have unforeseeable effects on individual transcripts.

To explore the effects of residual inter-array variation on eQTL-mapping, single-locus mapping tests were performed on these simulated datasets. Significant linkages are defined using a PRED threshold (Section 3.1.1) corresponding to an average false positive rate of 5% calculated from the sim(0) data, after accounting for potential redundant linkages due to LD (i.e. following application of remove.LD; see Section 4.1.1).

Table 3-3 Summary linkage results on simulation studies: sim(0)/(1)/(2)/(3). Results are averages of five simulation studies. Unique linkages are deduced from the algorithm remove.LD (Chapter 4) to remove potentially redundant linkages due to LD.

                                                        sim(0)        sim(1)        sim(2)        sim(3)
Average number of simulated transcripts with linkage    967.8 (4.8%)  965.4 (4.8%)  986.4 (4.9%)  977.8 (4.9%)
Average number of unique linkages                       1,020         995           1,013.4       1,006

Notably, the average percentages of simulated genes significantly mapped to at least one eQTL are not different to the expected 5% (Table 3-3). While this may suggest the different inter-array variations associated with these sim(1)/(2)/(3) datasets do not affect linkage, it must be emphasised that these simulation studies are relatively simple; the nature and effect of inter-array variations are expected to be much more complex in reality.


Two obvious over-simplifications of these simulation studies are that (1) at most, only three of 31 arrays have been manipulated to have increased variance; and (2) the results from five studies have been averaged. In practice, the number of variable arrays may be larger and, more problematically, the pattern of variable arrays and the extent of their variation may randomly correlate with genotype patterns of specific marker loci. In these simulation studies, a pattern of 'spiked' arrays may in fact (partly) correlate with the genotype pattern of a marker locus, but because the overall extent of linkages has been averaged across the five studies, any such confounding may have been disguised. When such a correlation between inter-array variation and marker genotype pattern occurs, the illusion of linkage may arise.

In conjunction with false linkages, inter-array variation may also increase the rate of false negatives. For those transcripts whose trait-variance is relatively large, normalisation within and/or between arrays can potentially deflate the true extent of real trait-variance. In extreme cases, this may cause a true linkage to be missed altogether. Such a phenomenon may be the reason for the comparable extent of mappable simulated transcripts across the sim(0), sim(1), sim(2), and sim(3) studies: the amounts of false positives and false negatives may have potentially been balanced out.

Although obvious support for false linkages cannot be demonstrated with the simulation studies presented here, it is hoped that this section has highlighted the importance of appropriate normalisation in microarray and eQTL-mapping studies. This is further demonstrated by WILLIAMS et al. (2006), who illustrate the influence of various normalisation methods on single- and dual-colour microarray data in eQTL-mapping analyses.


3.2.4. Artefacts of low level signal in microarray data

The distribution of expression values in each microarray generally follows a quasi-inverse-gamma distribution with a dynamic range of >65,000 pixels (A-values ≈ 2^16). The significance of this is that the majority of transcripts are expressed at relatively low levels while a subset of transcripts is expressed at extremely high levels. In order to include those highly expressed transcripts, the dynamic range of microarray measurements is necessarily large, to avoid saturating too many highly expressed transcripts. The consequence of this is that measurements at low levels become more sensitive to systematic variations. That is, the reliability with which microarrays measure mRNA level is expected to decrease as expression level decreases. If this is true, linkage analyses using transcripts expressed at low levels may be influenced by systematic variations. This hypothesis is supported by a recent study that demonstrated a positive correlation between eQTL replication and expression level (PEIRCE et al. 2006).

To assess this hypothesis in our own data, we performed eQTL-linkage analyses on a set of negative controls (NC). Our reasoning is that these control spots should have low, background, “signal”, and so should not demonstrate linkage to any marker. Any evidence of linkage would have been driven by variations associated with low signal level inherent to microarrays.

On the microarrays used in our BXD eQTL-mapping studies, NCs consist of 384 blank spots (spots on the arrays where nothing has been spotted), 475 empty spots (spots on the arrays where only buffer has been spotted), and 16 NCs from the Lucidea Universal ScoreCards (Amersham Biosciences). These three sets of negative controls have different distributions of signal levels (Figure 3-4), corresponding to differences in the absorption and emission wavelengths inherent to the glass slides of the microarrays (blank spots), the spotting buffer (empty spots), and non-specific DNA hybridization (NC ScoreCards). As a group, these 875 NCs are expressed at lower levels than real transcripts (Figure 3-5).

Figure 3-4 Distribution of signal levels of empty spots (black), blank spots (red), and negative ScoreCards (green) in the BXD brain (left), kidney (middle), and liver (right) datasets. There are 384 blank spots with nothing spotted, 475 empty spots containing only spotting buffer, and 16 negative Lucidea Universal ScoreCards.

Figure 3-5 Distributions of A-values of actual transcripts on the arrays (black) and all 875 negative controls (red) in the brain (left), kidney (middle), and liver (right) BXD datasets. The bi- and tri-modal distributions of the NCs are due to the different types of NCs (Figure 3-4). Vertical lines indicate the 90th percentile of all NCs: the threshold that the A-values of a transcript must exceed in at least half of the 31 BXD RI lines for the transcript to be defined as “expressed”.


To examine the influence of systematic variations associated with low microarray signal, eQTL-linkage analyses are performed using all 875 NCs in each of the three BXD studies, as was conducted with the 21,765 real transcripts. The extents of mappable NCs are comparable to (brain) or greater than (kidney and liver) those observed in real data (Table 3-2A, B, and C), irrespective of the method used for defining significant linkage (PBON, PFULL, or PRED; Section 3.1.1; Section 3.2.1). This implies that NCs are prone to spurious linkages, and that transcripts expressed at low levels are subject to spurious linkages due to systematic noise associated with low microarray signal level. Interestingly, the extents of NCs demonstrating linkage compared to actual transcripts are more pronounced in the kidney and liver studies, suggesting systematic noise at low detection level is more prominent in these two datasets.

3.2.5. Genetic influence of expressed and non-expressed genes

Based on the studies reported above, we wanted to determine the extent to which linkage is associated with variations inherent to low expression levels. We therefore divided our data into an ‘expressed’ and an ‘unexpressed’ group, by defining a trait as expressed if at least half of its expression values (>15 of the 31 BXD strains) are greater than the 90th percentile of all negative control values (Figure 3-5 and Section 2.1.5). With this definition, more than half of the transcripts in kidney (~58%) and liver (~51%) are unexpressed (Table 3-2 columns “Expressed transcripts”). Given that the brain is perceived as the most complex organ, it is perhaps not surprising that much more transcriptional activity (>81% expressed) is observed in the brain compared with kidney or liver.
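The “expressed” rule stated above can be written down compactly. The following is a minimal sketch with hypothetical function names and toy values; the interpolation rule used for the 90th percentile (Python's "inclusive" method) is an assumption, as the thesis does not specify one here.

```python
import statistics

def expression_threshold(nc_values):
    """90th percentile of all negative-control (NC) values.
    quantiles(n=10) returns the nine decile cut points; the last
    one is the 90th percentile."""
    return statistics.quantiles(nc_values, n=10, method="inclusive")[-1]

def is_expressed(trait_values, threshold):
    """True if at least half of the trait's values exceed the threshold."""
    n_above = sum(v > threshold for v in trait_values)
    return 2 * n_above >= len(trait_values)

# Toy negative-control values and two toy traits measured in 31 strains.
nc_values = [float(v) for v in range(1, 101)]
threshold = expression_threshold(nc_values)

trait_high = [95.0] * 16 + [50.0] * 15   # 16 of 31 strains above threshold
trait_low = [95.0] * 15 + [50.0] * 16    # only 15 of 31 strains above threshold

print(is_expressed(trait_high, threshold), is_expressed(trait_low, threshold))
```

With 31 strains, "at least half" means 16 or more values above the threshold, which matches the ">15 of the 31 BXD strains" wording.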

Of the transcripts significantly linked to at least one eQTL, 85.4%, 43.3%, and 51.6% are defined as “expressed” in brain, kidney, and liver, respectively (Table 3-2). That is, in kidney and liver, approximately half of all mappable transcripts would have been called “unexpressed”. This is particularly important as systematic variation associated with low detection signal has been shown to be more pronounced in the kidney and liver data (Section 3.2.4).

Lowly expressed transcripts are not necessarily unimportant, but classifying transcripts based on their level of expression allows eQTL linkage results to be prioritised. We have therefore applied this filtering method post hoc, and have performed all subsequent analyses using the full set of transcripts, followed by removal of “unexpressed” transcripts only where appropriate. This decision was also made in anticipation of comparing eQTL mapping results from these three studies against other eQTL mapping studies where filtering for “expressed” genes is not always possible (see Section 4.5 and Chapter 5 for meta-analysis).

3.2.6. Expression trait-variance

The amount of mapped difference in expression levels between the two genotype-groups is known as trait-variance, and is expressed here as a fold-change. In the three BXD studies, the majority of mapped trait-variances are less than 2-fold, with an average of 1.6 in brain, 1.4 in kidney, and 1.3 in liver. Overall, only between 0.2% and 1.3% of linkages have trait-variances of >3-fold, with a maximum observed trait-variance of ~7.8.
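As an illustration of how such a fold-change might be computed, the sketch below assumes that expression values are on a log2 scale and that trait-variance is the ratio of the mean expression in D-genotype strains to that in B-genotype strains (so values above one indicate higher expression with the D genotype, as in Figure 3-6). Both assumptions, the function names and the toy values are illustrative; the exact definition used in this thesis is given in the methods.

```python
from statistics import fmean

def fold_change(values, genotypes):
    """Ratio of mean expression in D-genotype strains to that in
    B-genotype strains, assuming `values` are log2-scale measurements."""
    b = [v for v, g in zip(values, genotypes) if g == "B"]
    d = [v for v, g in zip(values, genotypes) if g == "D"]
    if not b or not d:
        raise ValueError("both genotype groups must be non-empty")
    return 2.0 ** (fmean(d) - fmean(b))

def fold_magnitude(ratio):
    """Direction-free magnitude: a ratio of 0.5 and a ratio of 2 are both 2-fold."""
    return max(ratio, 1.0 / ratio)

# Toy trait: six strains, three with each genotype at the marker.
values = [8.0, 8.2, 7.8, 9.0, 9.2, 8.8]
genotypes = ["B", "B", "B", "D", "D", "D"]
ratio = fold_change(values, genotypes)
print(f"D/B ratio = {ratio:.2f}, magnitude = {fold_magnitude(ratio):.2f}-fold")
```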

To understand the relationship between trait-variance, expression level, and linkage test statistic, we compare them in turn (Figure 3-6). For each significant linkage, we first compared the magnitude and direction of trait-variance with the trait’s average expression level (average A-values) (top panel of Figure 3-6). The first thing to notice from these plots is the absence of data points at a fold-change of ~1 (i.e. a ratio between the two genotype-groups of ~1, which is equivalent to no change). This observation indicates that linkages with extremely small trait-variances have failed the PRED threshold and are not declared significant. This is true irrespective of the traits’ overall average expression levels (average A-values along the x-axis).

Figure 3-6 Behaviours of trait-variances, average expression levels (A-value), and linkage significance (Praw), of all significant linkages (Praw ≤ PBON) in the BXD brain (left column), kidney (middle column), and liver (right column) studies. Top row shows the direction and magnitude of trait-variance (y-axis) at varying A-values (x-axis). Data points above one indicate the trait is expressed at a higher level in animals with the D genotype at the eQTL than in those with the B genotype, and vice versa. The blue line indicates the cut-off for “expression” (90th percentile of negative controls) in each of the three tissue datasets. The middle row compares Praw with average A-values. The bottom row compares Praw with trait-variance. Blue data points indicate those traits with more than half their values below the 90th percentile of the negative controls (i.e. considered as “unexpressed”). Red data points in the bottom panels indicate cis-linkages.

More generally, the magnitude of trait-variance tends to increase (deviate from one) with A-values. This is because systematic variation is anti-correlated with expression level: higher expression gives a higher signal-to-noise ratio and thus an improved estimate of trait-variance. Evidently, low expression does not necessarily mean small trait-variance, as we noted that changes >2-fold are being mapped for moderately expressed traits (A-value ≈ 10-12; moderate compared with “unexpressed” traits with A-values ≈ 9-10: blue lines in Figure 3-6).

In addition to the magnitudes, the directions of trait-variances are also indicated on these plots. In each linkage test, a trait’s expression values are divided into two genotype-groups, B and D, corresponding to the genotypes (homozygous BB and DD) at the marker locus to be assessed, and the null hypothesis of no difference between the two groups is tested. Thus, the resulting trait-variance can either be B-directed, that is, the trait is more highly expressed in strains with a B genotype at the corresponding eQTL than in strains with a D genotype; or D-directed, where the reverse is true. In the BXD data, many traits appear to be B-directed. This is consistent with previous observations that many transcripts tend to have higher expression levels in the C57BL/6J parental strain than in the DBA/2J strain (COTSAPAS et al. 2006), further supporting the view that variation in transcript abundance is heritable.

Alternatively, because the oligonucleotide library spotted on the microarrays used in these studies was based on the genome of the C57BL/6J inbred mouse strain, it has been suggested that transcripts from this strain may hybridise more tightly than those from other strains if the oligonucleotides contain polymorphisms between the strains (DOSS et al. 2005; PEIRCE et al. 2006). If this is true, then one would observe more cis-linkages that are B-directed than D-directed (see Section 3.4 for analysis of cis- and trans-linkages). This appears to be the case (Figure 3-6 bottom panel, red data points), particularly in the brain study. Clearly, a more detailed examination of the corresponding sequences would be needed to confirm or refute this speculation. However, if these B-directed cis-linkages are real, then a direct comparison between the parental strains of the oligonucleotide sequences spotted onto the microarrays would provide invaluable insight into the potential functional variants influencing the corresponding expression traits. Furthermore, copy number differences have been shown for some genomic segments between these two parental strains (LI et al. 2004). Thus, segments with copy number variations would appear as cis-acting eQTLs, assuming copy number is correlated with transcript abundance (PEIRCE et al. 2006).

Closer examination of unexpressed transcripts (traits expressed below the 90th percentile of NCs; Section 2.1.5; blue lines in Figure 3-6) shows that they appear to map with relatively large trait-variances (>2-fold). However, when we compare linkage significance (-log10(Praw); point-wise P-value) against average A-values (Figure 3-6 middle row), we see that the Praw of “unexpressed” traits are less significant than those of “expressed” traits. This is because the linkage test statistic depends on both effect size (i.e. trait-variance) and within-group variance (i.e. non-genetic variance, including environmental and technical variances): although some traits expressed at low levels may have large trait-variances, their associated within-genotype variation is relatively large and so their final Praw remain less significant than for traits expressed at higher levels.
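The dependence of the test statistic on both effect size and within-group variance can be made concrete with a two-sample comparison. The sketch below assumes a pooled-variance Student's t statistic between the B and D genotype-groups; this choice is an assumption for illustration, and the function name and toy values are hypothetical.

```python
import math
from statistics import fmean, variance

def pooled_t(group_b, group_d):
    """Student's two-sample t statistic with pooled variance
    (positive when the D group has the higher mean)."""
    n_b, n_d = len(group_b), len(group_d)
    pooled_var = ((n_b - 1) * variance(group_b)
                  + (n_d - 1) * variance(group_d)) / (n_b + n_d - 2)
    standard_error = math.sqrt(pooled_var * (1 / n_b + 1 / n_d))
    return (fmean(group_d) - fmean(group_b)) / standard_error

# Two toy traits with the same difference in group means (1.0 on a log2
# scale, i.e. 2-fold) but different within-genotype variation.
tight_b, tight_d = [7.9, 8.0, 8.1, 8.0], [8.9, 9.0, 9.1, 9.0]
noisy_b, noisy_d = [7.0, 8.0, 9.0, 8.0], [8.0, 9.0, 10.0, 9.0]

t_tight = pooled_t(tight_b, tight_d)
t_noisy = pooled_t(noisy_b, noisy_d)
print(f"t (low within-group variance)  = {t_tight:.2f}")
print(f"t (high within-group variance) = {t_noisy:.2f}")
```

Both toy traits have the same 2-fold difference between genotype-groups, yet the noisier trait yields a much smaller statistic and hence a less significant P-value.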

Finally, a more direct comparison (Figure 3-6 bottom row) between Praw and trait-variance of all significant linkages (Praw ≤ PRED) summarises the observation that, in general, large trait-variance results in more significant Praw and that unexpressed genes (blue data points) tend to have smaller trait-variances and less significant Praw.


3.3. Multiple regulators of gene expression

Of the mappable transcripts in the brain, kidney, and liver BXD eQTL-mapping studies, between 38% and 48% are significantly mapped to more than one putative eQTL (Table 3-4; Figure 3-7; Table 3-2). In one extreme case, in the liver study, a transcript is significantly linked to 14 eQTLs. Linkages to multiple genetic loci may be indicative of multiple genetic influences on the corresponding expression trait.

To determine whether these observations are likely to be real, we return to our sim0B/K/L studies, previously discussed in Section 3.2. Recall that these studies are designed to model the null hypothesis of no linkage (Section 2.3.1 and Section 3.1.2); thus chance and spurious linkages should be randomly distributed, that is, an average of one linkage per mapped gene is expected. Contrary to this expectation, up to an average of 38.8% of mappable simulated transcripts are mapped to more than one marker locus (Table 3-4). In retrospect, these spurious multiple linkages are not surprising: the presence of marker-correlation, as was demonstrated in Section 3.2.1, is the likely cause of multiple linkages.

Multiple linkages may occur due to multiple genetic regulation or linkage disequilibrium (LD). Because multiple linkages are observed in the sim0B/K/L data, we predict the majority of multiple linkages are spurious and due to LD. The 779 markers used in our linkage analyses are densely but unevenly spaced (Figure 3-8). When two markers (or genes) are located physically close together on the genome, they tend to co-segregate during meiotic recombination, resulting in LD. This causes the majority of individuals in a population to have identical genotypes at nearby genomic regions. Consequently, neighbouring loci will tend to have similar genotype patterns, and so linkage of a transcript at a particular marker locus will often imply similar linkage significance at its adjacent loci. For example (Figure 3-9), a Ppp4c transcript is most significantly linked to a Chr1 marker D1Mit9, but because of LD, the transcript is also significantly linked (Praw < 10^-4) to several markers within 5Mb upstream and downstream of D1Mit9 (see Section 4.2 for a more in-depth discussion on LD-induced multiple linkages).
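To make the idea of LD-induced redundancy concrete, the sketch below collapses significant linkages that lie close to a stronger linkage on the same chromosome. It is explicitly not the remove.LD algorithm, which is described in Chapter 4; it is a simple distance-based stand-in that assumes a fixed 5Mb window. The class and function names, the marker names, and all positions and significance values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Linkage:
    marker: str
    chromosome: str
    position_mb: float
    neg_log10_p: float

def collapse_nearby(linkages, window_mb=5.0):
    """Greedy peak picking: visit linkages from most to least significant and
    keep one only if no already-kept linkage lies on the same chromosome
    within window_mb of it."""
    kept = []
    for link in sorted(linkages, key=lambda x: x.neg_log10_p, reverse=True):
        near_kept_peak = any(
            k.chromosome == link.chromosome
            and abs(k.position_mb - link.position_mb) <= window_mb
            for k in kept
        )
        if not near_kept_peak:
            kept.append(link)
    return kept

# Hypothetical linkages for one transcript: three neighbouring markers on
# chromosome 1 and one unrelated marker on chromosome 7.
linkages = [
    Linkage("markerA", "1", 80.0, 4.5),
    Linkage("markerB", "1", 82.0, 6.3),
    Linkage("markerC", "1", 85.0, 4.2),
    Linkage("markerD", "7", 30.0, 4.8),
]
unique = collapse_nearby(linkages)
print([link.marker for link in unique])
```

A purely distance-based rule such as this cannot handle correlation between non-neighbouring markers, which is one reason a method based on genotype patterns is developed in Chapter 4.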

Occasionally, a gene may be influenced by multiple adjacent loci, but as there is no evidence to suggest this to be a common occurrence, potentially redundant linkages likely caused by LD are eliminated using a novel algorithm: remove.LD (see Chapter 4 for full discussion). Following this procedure, the average proportion of simulated transcripts significantly linked to two or more putative eQTLs is reduced to ≤4.3% (Table 3-4; Figure 3-10). In addition to LD between neighbouring loci, random marker-correlation (genotype pattern association) between non-neighbouring markers can also lead to spurious multiple linkages, thus providing an explanation for the ~4.3% of multiple linkages remaining in the sim0 studies. A more detailed discussion of multiple linkages can be found in Chapter 4, and a detailed discussion of the effects of genotype pattern association is presented in Chapter 5.

Following the application of remove.LD, the extents of multiple linkages are reduced to 5.0-7.2% in our three-tissue BXD studies (Table 3-4; Figure 3-10). Interestingly, even after the removal of potentially redundant linkages, the proportions of transcripts mapping significantly (Praw ≤ PRED) to >1 eQTL in the real datasets are greater than those observed in the sim0B/K/L studies, suggesting genuine multiple genetic regulation of gene expression (see Chapter 4 and Chapter 5 for further discussions).

Table 3-4 Significant linkages per transcript before (middle two columns) and after (unique; last two columns) application of remove.LD (Section 4.2.2) to eliminate potentially redundant linkages due to linkage disequilibrium. First three rows show results from the real transcripts of the BXD Brain, Kidney, and Liver studies, and the last three rows are averages of five simulation studies from the sim0B, sim0K, and sim0L data.

Dataset       Total # mappable   # transcripts   % with     # transcripts with   % with >1
              transcripts        with >1 eQTL    >1 eQTL    >1 unique eQTL       unique eQTL
Brain-BXD     2,996              1,440           48.1       215                  7.2
Kidney-BXD    1,195              455             38.1       60                   5.0
Liver-BXD     1,821              685             37.6       108                  5.9
sim0B         1,100              418             38.0       47                   4.3
sim0K         1,100              414             37.6       48                   4.3
sim0L         1,105              428             38.8       46                   4.2

[Figure 3-7: stacked horizontal bars for BXD-brain, BXD-kidney, BXD-liver, sim0B, sim0K, and sim0L; x-axis: % transcripts linked to n eQTLs (0%-100%); legend: one, two, three, four, five, more than five.]

Figure 3-7 Proportions of transcripts linked to 1-5, or >5 marker loci, prior to removal of potentially LD-induced multiple linkages. Percentages are based on the total number of actual transcripts in the real BXD brain, kidney, and liver studies (BXD), and the number of simulated genes from the sim0B, sim0K, and sim0L studies (Appendix A.2 and Table 3-4).

Figure 3-8 Physical map of the 779 markers used in the BXD eQTL mapping analyses. The physical position (x-axis; location in Mb) of each of the 779 markers is represented by ‘|’, separately for each chromosome (y-axis).

Figure 3-9 Genome scan of the Ppp4c (protein phosphatase 4, catalytic subunit) transcript in the BXD brain study. Each point on the plots indicates the point-wise linkage significance (-log10 Praw) of Ppp4c (y-axis) at each marker along the genome (x-axis). The transcript is linked most significantly to multiple eQTLs on chromosome 1 (highlighted in the inset and enlarged in the main plot). The transcript is linked to 4 other eQTLs (within dotted lines of main plot; between 78Mb-86Mb) within 5Mb upstream and downstream of the most significantly linked eQTL, D1Mit9, at Praw = 10^-6.3.

[Figure 3-10: stacked horizontal bars for BXD-brain, BXD-kidney, BXD-liver, sim0B, sim0K, and sim0L; x-axis: % transcripts linked to n eQTLs (0%-100%); legend: one, two, three, four, five, more than five.]

Figure 3-10 Proportion of transcripts linked to 1-5, or >5 unique linkages, following application of remove.LD to eliminate potentially LD-induced multiple linkages. The proportions are based on the total number of mappable transcripts in each of the Brain, Kidney, and Liver BXD studies, and the sim0B, sim0K, and sim0L studies (Appendix A.2 and Table 3-4).

3.4. Cis- and trans-acting modulators of gene expression

To understand the regulatory mechanism of gene expression, we set out to elucidate the extent of cis and trans linkages. In linkage analyses, the definition of cis-linkage is necessarily arbitrary as we are analysing approximately 22 thousand gene transcripts simultaneously. Here, we define an eQTL as acting in cis if it is co-located within 5Mb of where the target transcript is encoded. For example (Figure 3-11), a RIKEN clone encoded on Chr8 is most significantly linked to marker D8Mit189 that is also located on Chr8. A close inspection of the physical mappings of the RIKEN clone and the eQTL reveals that they are within 5Mb of each other, and so the RIKEN clone is said to be influenced in cis by the D8Mit189 locus.
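The cis/trans definition above can be expressed as a small sketch. It assumes a linkage is described by the chromosome and physical position (Mb) of both the transcript and the eQTL, that the window boundary is inclusive, and that transcripts without a known location are set aside; the function names and the positions used in the tests are hypothetical.

```python
def classify_linkage(transcript_chrom, transcript_mb, eqtl_chrom, eqtl_mb,
                     cis_window_mb=5.0):
    """Label one transcript-eQTL linkage as 'cis', 'trans', or 'unknown'.

    A linkage is cis if the eQTL lies on the same chromosome as the
    transcript and within cis_window_mb of it (inclusive boundary is an
    assumption). Transcripts with no known genomic location are 'unknown'.
    """
    if transcript_chrom is None or transcript_mb is None:
        return "unknown"
    same_chrom = transcript_chrom == eqtl_chrom
    if same_chrom and abs(transcript_mb - eqtl_mb) <= cis_window_mb:
        return "cis"
    return "trans"


def percent_cis(linkages, cis_window_mb=5.0):
    """Percentage of cis-linkages among linkages with known transcript location."""
    labels = [classify_linkage(*link, cis_window_mb=cis_window_mb)
              for link in linkages]
    known = [label for label in labels if label != "unknown"]
    if not known:
        return 0.0
    return 100.0 * known.count("cis") / len(known)
```

Because the window is a parameter, the same function can be re-run with different window sizes to see how the proportion of cis-linkages depends on this arbitrary choice.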

Figure 3-11 Example cis-linkage of a RIKEN transcript (1810029E06Rik; AK007639) from the BXD liver study. This transcript, which is located on Chr8 (red tick mark on the x-axis), is mapped within 5Mb of the eQTL D8Mit189.

This definition is applied to all linked transcripts with known genomic locations. Two important points are to be noted here. First, although we have shown that linkages of transcripts expressed at low levels are potentially artefactual, there is no reason to believe that all linkages of lowly expressed traits are spurious. For this reason, and to allow for later comparison of results against other eQTL linkage analyses (Chapter 4), the full set of expression traits (i.e. “expressed” and “unexpressed” transcripts) is analysed for cis and trans linkages. Secondly, not all transcripts have known genomic locations: while all transcripts from the Compugen 22k mouse arrays (Section 2.1.3) are obtained from known ESTs (expressed sequence tags) or cDNAs, not all have been physically mapped back onto the genome, and so the locations at which they are encoded are unknown. Of the transcripts demonstrating significant linkage to at least one eQTL (Praw ≤ PRED; Table 3-2C), 8.8%-9.7% have no known genomic location.

To avoid confounding results due to LD, cis and trans linkages are assessed for non-redundant linkages; linkages potentially induced by LD are first eliminated using remove.LD (Chapter 4). Of the linkages whose corresponding transcripts have known genomic locations, >98% are in trans, as is a similar proportion of the transcripts (Table 3-5): in the brain data, 2,719 transcripts are linked to 2,925 trans-linkages; in kidney, 1,073 transcripts are linked to 1,131 trans-linkages; and in liver, 1,644 transcripts are linked to 1,744 trans-linkages.

Table 3-5 Extent of transcripts with known genomic location linked in cis and/or in trans to at least one unique eQTL. For ~10% of transcripts with significant linkages, the genomic locations (lox) at which these transcripts are encoded are not known. Unique linkages are defined by remove.LD (Chapter 4). Some transcripts are linked to both cis and trans eQTLs.

                                                    Brain            Kidney           Liver
Total unique linkages                               3,217            1,258            1,931
Unique linkages from transcripts with known lox     2,937 (91.3%)    1,150 (91.4%)    1,762 (91.2%)
Total transcripts with linkage                      2,996            1,195            1,821
Transcripts with known lox                          2,731 (91.2%)    1,092 (91.4%)    1,662 (91.3%)
Transcripts with cis linkage                        12 (0.44%)       19 (1.74%)       18 (1.08%)
Transcripts with trans linkage(s)                   2,719 (99.56%)   1,073 (98.26%)   1,644 (98.92%)
Transcripts with both cis & trans linkages          1 (0.04%)        3 (0.27%)        3 (0.18%)

The number of linkages exceeds the number of mappable transcripts because of multiple linkages (Section 3.3). A more detailed analysis shows that only a very small proportion of transcripts are under both cis and trans regulation (Table 3-5, last row): one transcript in the brain data, two in the kidney, and three in the liver are significantly mapped in cis to one eQTL and in trans to another eQTL. Furthermore, one transcript in the kidney data is significantly mapped in trans to two eQTLs and in cis to another eQTL. Significant linkages of both types may imply that the trait is under the genetic control of both cis- and trans-acting regulators. For example (Figure 3-12), a Pax3 (paired box gene 3) transcript encoded on Chr1 is significantly linked in cis to D1Mit134 on Chr1 and in trans to S08Gnf006.700 on Chr8, suggesting Pax3 is under the regulation of both genomic loci.

Figure 3-12 Example transcript (Pax3: paired-box 3) from our BXD brain data with both cis and trans linkages. Pax3, encoded on Chr1, appears to be influenced by a cis-acting eQTL, D1Mit134, and a trans-acting eQTL, S08Gnf006.700, on Chr8.

The ratio of cis- to trans-linkages is dependent on the criterion used to define cis-linkages, and this criterion is necessarily arbitrary. In currently published mouse eQTL linkage studies alone, windows of between 2cM (SCHADT et al. 2003) and 20Mb (BYSTRYKH et al. 2005) have been used. With a 5Mb ‘cis window’, <2% cis-linkages are observed in our BXD data. By increasing the size of the ‘cis window’, more eQTLs are expected to fall within this window, thus increasing the proportion of cis-linkages. We examined the extent to which this occurs with ‘cis window’ sizes of 10kb, 100kb, 1Mb, 2Mb, 5Mb, 10Mb, 15Mb, and 20Mb (Figure 3-13; see Table A-7 in Appendix A.3 for tabulation of the actual numbers making up this figure).

[Figure 3-13: plot of % cis linkages (y-axis, 0.0%-3.5%) against cis window size in Mb (x-axis, 0-20); series: brain, kidney, liver.]

Figure 3-13 Percentages of cis-linkages defined by different ‘cis windows’. A linkage is, arbitrarily, defined as cis-acting if the eQTL is within x Mb (x-axis) of the transcript. See Table A-7 for tabulation of actual numbers of linkages.

As expected, more linkages are categorised as cis-acting when a more liberal definition of cis-linkage is used. However, even with a 20Mb window, the proportions of cis-linkages remain much lower than those of trans-linkages. Because the distribution of genetic markers and expression traits on the genome is not truly random, it is difficult to determine the expected proportions of cis and trans linkages. However, it is possible to determine the expected number of syntenic linkages: linkages where the mapped eQTL is situated on the same chromosome as the transcript. Thus, given the distribution of our 779 genetic markers across each chromosome (Figure 3-14) and the total observed significant linkages (Table 3-5, first row), the expected proportion of syntenic linkages is estimated at ~5.3% (Table 3-6). Interestingly, while the extent of syntenic linkages found in the brain data is within expectation, the extent of syntenic linkages in the kidney and liver data is higher than expected (Table 3-6). These observations may potentially be attributable to differences in the mechanisms of genetic regulation of gene expression between the different tissues (see Section 3.6 for further discussion on tissue-specificity).
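One way to arrive at an expected proportion of syntenic linkages is sketched below. The exact calculation used for Table 3-6 is not spelled out here, so this is only an assumed null model: the linked marker is drawn at random from all markers, independently of the chromosome on which the transcript is encoded. The function name and the input counts are hypothetical.

```python
def expected_syntenic_fraction(markers_per_chrom, transcripts_per_chrom):
    """Expected fraction of syntenic linkages under a random-linkage null.

    For each chromosome c, the chance that a transcript is encoded on c is
    multiplied by the chance that a randomly chosen marker lies on c; the
    products are summed over chromosomes. Both inputs map chromosome name
    to a count. This null model is an illustrative assumption.
    """
    total_markers = sum(markers_per_chrom.values())
    total_transcripts = sum(transcripts_per_chrom.values())
    return sum(
        (transcripts_per_chrom.get(chrom, 0) / total_transcripts)
        * (n_markers / total_markers)
        for chrom, n_markers in markers_per_chrom.items()
    )
```

With markers and transcripts spread evenly over 20 chromosomes this gives exactly 5%, which is of the same order as the ~5.3% quoted above; uneven distributions shift the value.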

Figure 3-14 Distribution of the 779 genetic markers across the 20 chromosomes.

Table 3-6 Numbers and percentages of expected and observed syntenic linkages in the BXD Brain, Kidney, and Liver data.

            Brain         Kidney       Liver
Expected    115 (5.3%)    61 (5.3%)    93 (5.2%)
Observed    159 (5.4%)    75 (6.5%)    119 (6.8%)

Variable proportions of cis-linkages have also been observed when different critical values are used to define significant linkage (HUBNER et al. 2005; MONKS et al. 2004; SCHADT et al. 2003): much higher proportions of cis-linkages are generally observed at more stringent linkage significance thresholds (Table 3-1). To determine whether this is the case in our three-tissue data, we re-calculated the proportion of cis-linkages at Praw thresholds of 10^-4, 10^-5, 10^-6, and 10^-7 (Figure 3-15; see Table A-8 in Appendix A.3 for tabulation of actual numbers). Concordant with other published data, a definite increase in cis-linkages is observed across all three tissues as the Praw threshold decreases (i.e. becomes more significant), suggesting that cis-acting eQTLs, which follow Mendelian inheritance, are likely to exert larger effects than trans-acting eQTLs, which are likely pleiotropic in nature (DOSS et al. 2005; PEIRCE et al. 2006).

[Figure 3-15: plot of % cis linkages (y-axis, 0%-70%) against Praw threshold (x-axis, 1.E-04 to 1.E-07); series: brain, kidney, liver.]

Figure 3-15 Percentage cis-linkages at different Praw thresholds for defining significant linkage. Note that the x-axis (Praw) is in log-scale. See Table A-8 for tabulation of actual numbers of linkages.

Interestingly, the proportion of cis-linkages in the kidney data increases considerably, from <2% at Praw ≤ PRED to almost 67% at Praw ≤ 10^-7. Compared to the much more modest increases in brain and liver, the kidney results potentially suggest that kidney-specific genetic regulation of gene expression is predominantly cis-acting rather than trans-acting (see Section 3.6 for discussion relating to tissue-specificity). Of course, the kidney results could be a reflection of inherent noise within this dataset: it was noted in Section 3.2.1 (Table 3-2C) that the extent of mappable transcripts in this dataset is no different from chance expectation, implying that the majority, if not all, of the linkages are likely spurious. However, because the various Praw thresholds (10^-4 to 10^-7) are more significant (smaller) than PRED for the kidney data, the detectable linkages in this cis-linkage analysis are likely real: true linkages are expected to be more statistically significant than false positives.

To better understand whether the significantly larger proportion of cis-linkages in the kidney study is spurious or real, we compare the extent of observed and expected syntenic linkages (Figure 3-16). If all linkages in the kidney data are false, the extent of syntenic linkages is expected to remain <25%. Yet, >80% of linkages are found to be syntenic, suggesting that (1) at least some of the observed linkages in the kidney data are real, and (2) the transcripts under the strongest genetic influence (most detectable at low Praw) are mainly under cis-acting genetic effects. These results are consistent with many published results (e.g. CARLBORG et al. 2005; DOSS et al. 2005).

[Figure 3-16: plot of % syntenic linkages (y-axis, 0%-90%) against Praw threshold (x-axis, 1E-4 to 1E-7); series: brain-expected, kidney-expected, liver-expected, brain-observed, kidney-observed, liver-observed.]

Figure 3-16 Percentage syntenic linkages at different Praw thresholds.

3.5. Master regulators of gene expression

Trans-acting regulators are thought to have a more pleiotropic effect than cis-acting regulatory elements. As such, we would expect all transcripts controlled by the same regulator to show significant linkage to the same eQTL. Such “master regulators” or “linkage hotspots” have been observed in most published eQTL-mapping studies (BREM et al. 2002; BYSTRYKH et al. 2005; CHESLER et al. 2005; MORLEY et al. 2004; SCHADT et al. 2003; YVERT et al. 2003), but not all (MONKS et al. 2004). We set out to look for linkage hotspots in our three-tissue data by calculating the proportion of linkages at each of the 779 marker loci.

The presence of linkage hotspots is notable when the proportions of unique linkages at each marker, defined at Praw ≤ PRED, are plotted along the genome (Figure 3-17; also see Figure 6-2 in Section 6.3.1). To date, there is no formal definition of a linkage hotspot; rather, it is often described as a putative eQTL, or genomic region, with an unusual excess of significant linkages. Theoretically, with 779 markers being tested for linkage, each marker locus should contain ~0.13% (1/779) of the total linkages. However, we recognise that the distribution of markers is not uniform along the genome, and so a marker locus is defined as a linkage hotspot if its proportion of linkages is greater than the average expected proportion of linkages for a chromosome: i.e. if the proportion of linkages is greater than Σ(1/m_i)/n, where m_i is the number of markers on the ith chromosome, the sum is taken over all chromosomes, and n is the number of chromosomes with genetic markers. Thus, given the marker distribution across the chromosomes (Figure 3-14), a marker locus is defined as a hotspot if it contains >2.76% of all significant linkages.
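The hotspot threshold described above, the average over chromosomes of 1/m_i, can be computed as in the following sketch. The per-chromosome marker counts and per-marker linkage counts passed to these functions are whatever the user supplies; the function names and the toy values in the tests are hypothetical.

```python
def hotspot_threshold(markers_per_chrom):
    """Average over chromosomes of 1/m_i, with m_i markers on chromosome i."""
    n = len(markers_per_chrom)
    return sum(1.0 / m for m in markers_per_chrom.values()) / n


def find_hotspots(linkages_per_marker, threshold):
    """Return markers whose share of all linkages exceeds the threshold.

    linkages_per_marker maps marker name to its number of unique linkages.
    The returned dict maps each hotspot marker to its proportion of linkages.
    """
    total = sum(linkages_per_marker.values())
    return {
        marker: count / total
        for marker, count in linkages_per_marker.items()
        if count / total > threshold
    }
```

As a rough check against the reported numbers, and assuming the denominator is the 3,217 total unique brain linkages of Table 3-5, the 91 linkages at D19Mit22 (Table 3-7) correspond to about 2.83%, just above the 2.76% threshold.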

With this definition, we found that genetic markers with excess linkages tended to cluster together (Figure 3-17; Table 3-7). There are two plausible reasons for this: first, the region of the cluster may contain multiple master regulators, each influencing subsets of transcripts; alternatively, the clustering is a consequence of linkage disequilibrium, whereby transcripts under the control of a single master regulator map to different, neighbouring markers within the region of the genome harbouring that regulator.

As there is no prior knowledge to suggest these nearby marker loci linked by multiple transcripts are indeed independent and because LD is known to influence QTL linkage analyses, we refined our definition of a linkage hotspot (or master regulator) as a genomic region containing one or more adjacent marker loci each containing >2.76% of all unique linkages. In total, three linkage hotspot clusters have been collapsed into three unique hotspots: 3 adjacent marker loci on Chr1, in brain; a pair of marker loci on Chr8, in brain; and a pair of marker loci on Chr9, in kidney (Table 3-7). In all instances multiple eQTLs making up a linkage hotspot are physically located within 5.3Mb of each other (Table 3-7 4th column).

Table 3-7 Linkage hotspots in the BXD brain, kidney, and liver eQTL studies. A linkage hotspot may consist of more than one eQTL, located within 5.3Mb of each other. The number of linkages at each eQTL is shown in the last column and, for each linkage hotspot, the corresponding eQTLs are indicated in Figure 3-17. Also see Table 6-1 in Section 6.3.

Tissue    Chromosome of       Linked markers    Physical         Number of
          master regulator    (eQTLs)           location (Mb)    linkages
Brain     Chr1                D1Mit9            77.6             116
                              D1Mit216          80.3             569
                              D1Mit134          80.8             562
          Chr8                S08Gnf006.700     9.5              161
                              D8Mit124          14.8             113
          Chr19               D19Mit22          9.9              91
Kidney    Chr8                S08Gnf006.700     9.5              57
          Chr9                D9Mit297          34.1             72
                              D9Mit91           37.3             35
Liver     ChrX                SXGnf055.520      82.4             114

The first notable observation is that, apart from the Chr8 hotspot, there is a lack of overlap between hotspots identified across the three tissues. This lack of overlap is consistent with three other BXD-derived studies (SCHADT et al. 2003; BYSTRYKH et al. 2005; and CHESLER et al. 2005), suggesting tissue-specific trans-regulation of gene expression (DE KONING and HALEY 2005).

Interestingly, in a study where the experimental population largely overlaps with the 31 BXD RI mice used in our own three-tissue studies, CHESLER et al. (2005) identified a linkage hotspot at Chr1: 170-174Mb that was replicated by PEIRCE et al. (2006) using a panel of 56 F2 animals derived from the same parental strains as the BXD lines. The difference in the precise location of the Chr1 hotspot identified in these two studies compared to our BXD brain study could be due to differences in cell type: for our study, whole brain was used, while CHESLER et al. (2005) used forebrain and midbrain, and PEIRCE et al. (2006) used single-hemisphere of whole brain. Because different cell types of the same tissue have different functions, even a slight difference in the proportion of whole-brain tissue could imply different sets of actively co-regulated genes, hence different active master-regulators.

Alternatively, the difference between these three studies (our 31 BXD brain, CHESLER et al. 2005, and PEIRCE et al. 2006) could be indicative of artefactual linkage hotspots caused by technical errors, such as inter-array variations (Section 3.2.5), or environmental factors (DE KONING and HALEY 2005). For example, in an in silico study, PÉREZ-ENCISO (2004) showed that ~20% of spurious eQTLs tend to cluster within five genomic regions.

Figure 3-17 Linkage hotspots in the BXD brain, kidney, and liver data. These plots show the percentages of total linkages (y-axis), defined at Praw ≤ PRED after applying remove.LD to remove potentially LD-induced redundant linkages, at each of the 779 marker loci (x-axis) in the brain (top), kidney (middle), and liver (bottom) data. Marker loci containing >2.76% (horizontal dotted lines) of total linkages are indicated.

To assess whether these linkage hotspots are significant, we return to our previous simulation studies (sim0B, sim0K, and sim0L; Section 3.1.2 and Section 2.3.1). As with the BXD data, we plotted the average proportion of linkages at each marker locus in genomic order for each of the three series of simulated datasets: sim0B, sim0K, and sim0L (Figure 3-18). The proportion of linkages at any one eQTL is <1%. This amount is far below that observed in the BXD data, implying that the rate of false positives is likely small.

As sim0 is a simulation of the null hypothesis, the extent of linkage is expected to be uniform across the 779 marker loci. However, there are a few genomic regions (Figure 3-18; one on Chr7 and two on ChrX) where the number of linkages is evidently greater than at other eQTLs. Remarkably, these linkage hotspots are identical across all three simulated datasets. Upon inspection we find that the genotype patterns corresponding to these marker loci have an unusually biased genotype ratio: for each of these three markers (D7Msw058, DXMit25, and DXMsw076), only 6 of the 31 BXD strains have a D genotype:

D7Msw058   BBBBBDBDBDDBBBBBDBBBBBBDBBBBBBB
DXMit25    BBBDBDBBBBBBBBDBBBBBBBBDBBBDBBD
DXMsw076   BBBDBDBBBBBBBBBBBBBBDBBDBBBDBBD

As a result of this bias, sample sizes between the two genotype groups (i.e. only 6 expression values in the D group) would have become considerably uneven, resulting in loss of statistical power and increased false positives (type I error). There are 11 other markers whose genotype patterns have only 7 D-genotypes, and two of these eQTLs make up the smaller linkage hotspots on Chr4 of the simulated data (Figure 3-18). This set of simulation studies points out one of the weaknesses in existing (eQTL) linkage analyses, and highlights the need to perform critical post-hoc examinations of linkage results. It is likely that this problem is not exclusive to eQTL linkage studies and may be relevant to classical QTL linkage analyses as well. Nevertheless, this problem is exacerbated in eQTL studies because of the massively parallel nature of these experiments, which prohibits a gene-by-gene evaluation.
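A post-hoc check of this kind can be as simple as counting genotypes at each marker and flagging markers whose smaller genotype group falls below some minimum size. The sketch below uses the three genotype patterns listed above; the cutoff of 7 strains and the balanced comparison pattern are hypothetical choices made only for illustration.

```python
def genotype_counts(pattern):
    """Count B and D genotypes in a strain-ordered genotype string."""
    return {"B": pattern.count("B"), "D": pattern.count("D")}


def flag_unbalanced(patterns, min_group_size=7):
    """Return markers whose smaller genotype group has < min_group_size strains.

    patterns maps marker name to its genotype string. The cutoff is an
    assumed, arbitrary value and not a threshold taken from the thesis.
    """
    flagged = {}
    for marker, pattern in patterns.items():
        counts = genotype_counts(pattern)
        if min(counts.values()) < min_group_size:
            flagged[marker] = counts
    return flagged


patterns = {
    "D7Msw058": "BBBBBDBDBDDBBBBBDBBBBBBDBBBBBBB",
    "DXMit25": "BBBDBDBBBBBBBBDBBBBBBBBDBBBDBBD",
    "DXMsw076": "BBBDBDBBBBBBBBBBBBBBDBBDBBBDBBD",
    "balanced_example": "BD" * 15 + "B",  # hypothetical, near-even marker
}
flagged = flag_unbalanced(patterns)
```

Linkages at flagged markers could then be inspected individually rather than accepted at face value.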

Re-examination of our three-tissue BXD data shows that <1% of linkages are found at any of the three marker loci with only 6 D-genotypes. It is unclear whether linkages at marker loci with extremely biased genotype distributions within the population are real or spurious. A biological reason for such biased genotype distributions at a specific genomic region within a population is allelic selection in the breeding process. Either consciously or unknowingly, a particular genotype at a specific locus may confer certain fitness to the organisms, resulting in allelic selection, which may manifest into non-syntenic association (Section 5.4).

Figure 3-18 Linkage hotspots in the sim0B, sim0K, and sim0L studies. Proportions of linkages (y-axis) at each eQTL are plotted along their genomic order (x-axis). Red lines indicate the eQTLs whose corresponding genotype patterns have only 6 of the 31 strains with D-genotype.

To further understand how low expression influences eQTL-linkage analyses, we also searched for linkage hotspots generated by negative controls (NCs; Section 3.2.4) and found that many more eQTLs contain >2.76% of the total linkages of NCs (Figure 3-19). Linkage hotspots identified by NCs are likely caused by systematic variations (Section 3.2.4) and suggest that linkages of lowly expressed transcripts are likely to accumulate at random non-causative marker loci.

Comparisons of linkage results between real transcripts and NCs in the three-tissue BXD data reveal common linkage hotspots. In the brain data, one of the linkage hotspots identified using NCs (Chr8: D8Mit124) is identical to that identified by 113 real transcripts. In liver, the ChrX linkage hotspot (SXGnf055.520) identified by >17% of NCs is identical to that identified by 114 real transcripts. These results imply some linkages within defined master regulators may be false positives due to systematic variations. The rate of false positives is difficult to determine, but all linkages may, and should, be prioritised based on average expression level (Section 3.2.5) and magnitude of mapped trait-variance (Section 3.2.6).

Figure 3-19 Linkage hotspots of negative controls in the BXD brain, kidney, and liver studies. The proportions of total linkages (y-axis) at each marker locus are plotted in genomic order (x-axis). Markers containing >2.76% of total linkages are indicated with a red dot and by their marker names.

3.6. Tissue-specificity

3.6.1. Tissue-specific gene expression

Using our criterion for defining “expression” (Section 3.2.5 and Section 2.1.5), a total of 18,071 transcripts are “expressed” in at least one of the three tissues. Of these, approximately 34% are expressed in a tissue-specific manner and 42% are expressed ubiquitously in all three tissues (Figure 3-20). The brain is commonly accepted as the most transcriptionally complex organ, and in support of this, we found ~32% of the 18,071 transcripts to be expressed only in the brain and only a small proportion to be specifically expressed in kidney (0.58%) or liver (1.20%).

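[Figure 3-20: Venn diagram of expressed transcripts. Set totals: brain 17,706; kidney 9,237; liver 10,728. Regions: brain only 32%; kidney only 0.58%; liver only 1.20%; brain and kidney only 8.4%; brain and liver only 16%; kidney and liver only 0.24%; all three tissues 42%.]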

Figure 3-20 Proportions of tissue-specific and tissue-non-independent expressions. Of a total of 18,071 transcripts expressed in any one of the three tissues, 17,706 are expressed in brain, 9,237 are expressed in kidney, 10,728 are expressed in liver, and 7,579 (42%) are expressed in all three tissues.

To further understand tissue-specific expression, we identified and compared the sets of null expressed transcripts between tissues, where a transcript is defined as null, or completely unexpressed, if its expression values across all 31 strains are below the 90th percentile of all NCs (negative controls). Conversely, a transcript is (completely) expressed if its expression values across all 31 BXDs are above the 90th percentile of the NCs (Section 2.1.5). Null expression is interesting because these transcripts are likely to be completely inactive in the tissue of interest. Comparing the null expressed transcripts between tissues may provide insight into tissue-specific gene-interactions.
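The null/expressed criterion can be sketched as follows. The sketch assumes a single cutoff taken as the 90th percentile of the pooled negative-control values within one tissue (how the percentile is actually computed is described in Section 2.1.5 and may differ), and it adds an "intermediate" label, not used in the text, for transcripts that satisfy neither definition. The function names are hypothetical.

```python
import statistics


def nc_cutoff(nc_values):
    """90th percentile of negative-control (NC) expression values.

    Uses the 'inclusive' interpolation method of statistics.quantiles;
    the choice of method is an assumption made for this sketch.
    """
    return statistics.quantiles(nc_values, n=10, method="inclusive")[8]


def classify_expression(values, nc_values):
    """Classify one transcript from its expression values across all strains.

    'null'         : every strain is below the NC cutoff
    'expressed'    : every strain is above the NC cutoff
    'intermediate' : anything else (a label added here for completeness)
    """
    cutoff = nc_cutoff(nc_values)
    if all(v < cutoff for v in values):
        return "null"
    if all(v > cutoff for v in values):
        return "expressed"
    return "intermediate"
```

Applied tissue by tissue, the resulting labels allow transcripts that are null in one tissue but expressed in another to be picked out directly.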

Based on these criteria, a total of 3,259 transcripts are defined as null expressed in at least one tissue (Figure 3-21). Almost 70% of all null transcripts are only null in either kidney or liver; a result that complements our previous observation that the majority of transcripts are expressed only in the brain.

To ascertain tissue-specific null transcription, we identified transcripts that are null in one tissue but completely expressed in another (Table 3-8). Again, a relatively large number of transcripts that are null in either kidney or liver are completely expressed in the brain.

Examination of the function and cellular localisation of some of these transcripts appears to substantiate our findings. For example, the transcript NM_008777, corresponding to the phenylalanine hydroxylase (Pah) gene, is found to be unexpressed in the brain of all 31 BXD mice but expressed in the kidneys and livers of all 31 BXDs, results that agree with previous findings (HSIEH and BERRY 1979). In another example, the transcript NM_017399, corresponding to the liver-specific fatty acid binding protein 1 (Fabp1), is highly expressed in the liver of all 31 BXD strains and “null expressed” in the brain.

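[Figure 3-21: Venn diagram of null-expressed transcripts. Set totals: brain 390; kidney 1,922; liver 1,995. Regions: brain only 3.0%; kidney only 33.6%; liver only 34.7%; brain and kidney only 2.2%; brain and liver only 3.3%; kidney and liver only 19.7%; all three tissues 3.5%.]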

Figure 3-21 Extent of null expressed (completely unexpressed) transcripts in one, two, or all three tissues. Of a total of 3,259 null-expressed transcripts in at least one tissue, 390 are null in brain, 1,922 in kidney, 1,995 in liver, and 113 (3.5%) in all three tissues. See Table A-10 in Appendix A.5.

Table 3-8 Number of transcripts that are null in one tissue (rows) but expressed in all 31 BXD strains in another (columns). See Table A-10 in Appendix A.5.

                               Expressed in all 31 strains in...
                               Brain    Kidney    Liver
Null expressed in...  Brain    -        1         4
                      Kidney   111      -         4
                      Liver    80       1         -

3.6.2. Tissue-specific genetic influence of gene expression

Considering either all 21,765 transcripts or only the set of “expressed” transcripts, up to almost three times as many transcripts are linked to at least one eQTL in the brain as in either kidney or liver (Table 3-2), suggesting that many more transcripts are active and genetically influenced in the brain. Here, we examine whether any of the genetically influenced transcripts identified in each tissue are common between tissues.

A total of 5,486 unique transcripts are linked to at least one eQTL in at least one tissue (Figure 3-22). Of these, the majority (~91%) demonstrate eQTL linkage in only one tissue, implying tissue-specific genetic regulation of gene expression. That is, generally, a gene that is genetically influenced in one tissue is invariantly expressed in another regardless of genetic background. Thus, it is likely that the eQTLs mapped by these transcripts will contain transcriptional effectors that are only active/functional in certain tissues.
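The overlap of linked transcripts between tissues amounts to counting the seven regions of a three-set Venn diagram. A minimal sketch using set operations is given below; the function names are hypothetical, and each input is simply a collection of transcript identifiers linked to at least one eQTL in the named tissue.

```python
def venn_regions(brain, kidney, liver):
    """Count transcripts in each of the seven regions of a three-set Venn diagram."""
    b, k, l = set(brain), set(kidney), set(liver)
    return {
        "brain only": len(b - k - l),
        "kidney only": len(k - b - l),
        "liver only": len(l - b - k),
        "brain & kidney only": len((b & k) - l),
        "brain & liver only": len((b & l) - k),
        "kidney & liver only": len((k & l) - b),
        "all three": len(b & k & l),
    }


def fraction_single_tissue(regions):
    """Fraction of all linked transcripts that show linkage in exactly one tissue."""
    total = sum(regions.values())
    single = (regions["brain only"] + regions["kidney only"]
              + regions["liver only"])
    return single / total
```

With the single-tissue counts of 2,568, 934 and 1,477 reported in Figure 3-22, out of 5,486 transcripts in total, this fraction is (2,568 + 934 + 1,477) / 5,486, or about 91%.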

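[Figure 3-22: Venn diagram of transcripts with significant linkage. Set totals: brain 2,996; kidney 1,195; liver 1,821. Regions: brain only 2,568 (46.8%); kidney only 934 (17.0%); liver only 1,477 (26.9%); brain and kidney only 163 (3.0%); brain and liver only 246 (4.5%); kidney and liver only 79 (1.4%); all three tissues 19 (0.35%).]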

Figure 3-22 Extent of transcripts mapped uniquely to and commonly between tissues. Of the 5,486 transcripts mapped to at least one eQTL in at least one tissue, 2,996 have linkage in the brain, 1,195 in kidney, 1,821 in liver, and only 19 have significant linkage in all three tissues.

Table 3-9 Transcripts mappable in all three tissues. In cases where the transcripts are mapped to multiple adjacent eQTLs, the most significantly mapped marker locus is indicated. Each eQTL entry gives the linked genomic region followed by the marker locus.

GenBank    Gene symbol    Gene location  Brain eQTL                    Kidney eQTL                   Liver eQTL
AK007639   1810029E06Rik  Chr8: 30Mb     Chr8: 28-39Mb, D8Mit189       Chr8: 10-39Mb, D8Mit189       Chr8: 20-46Mb, D8Mit189
AK016395   4930595O18Rik  Chr6: 101Mb    Chr8: 33-39Mb, D8Mit294       Chr8: 33-39Mb, D8Mit294       Chr8: 20-39Mb, D8Mit189
AB041658   2510049I19Rik  Chr8: 71Mb     Chr8: 70Mb, D8Mit31           Chr8: 62-70Mb, D8Mit31        Chr5: 117-126Mb, D5Mit371
AK015815   4930517J16Rik  Chr4: 45Mb     Chr1: 80Mb, D1Mit216          Chr8: 28Mb, D8Mit289          Chr8: 20-39Mb, D8Mit294
NM_016716  Cul3           Chr1: 81Mb     Chr8: 10-15Mb, S08Gnf006.700  Chr13: 69Mb, D13Mit11         Chr13: 102Mb, D13Mit270
AK016137   Rnf139         Chr15: 60Mb    Chr1: 80-81Mb, D1Mit216       Chr1: 69-78Mb, D1Mit9         Chr12: 42Mb, D12Mit112
NM_023162  Znrd1          Chr17: 35Mb    Chr7: 58-69Mb, D07Msw060      Chr8: 10-23Mb, D8Mit335       Chr8: 20-39Mb, D8Mit335
AF342896   Klrb1d         -              Chr7: 72Mb, D7Mit62           Chr9: 37Mb, D9Mit91           Chr3: 73-108Mb, D3Mit155
NM_018779  Pde3a          Chr6: 142Mb    Chr1: 80Mb, D1Mit216          Chr11: 36-43Mb, D11Mit308     Chr14: 36-39Mb, D14Mit129
AK007998   2010001A14Rik  Chr11: 58Mb    Chr1: 81Mb, D1Mit134          Chr9: 34Mb, D9Mit297          Chr13: 104Mb, D13Mit288
AF154571   Chia           Chr3: 106Mb    Chr1: 80-81Mb, D1Mit134       Chr19: 22Mb, D19Mit45         Chr11: 79Mb, D11Mit34
NM_008938  Prph2          Chr17: 44Mb    Chr15: 3Mb, D15Mit13          Chr6: 102Mb, S06Gnf100.540    ChrX: 85Mb, DXMsw076
NM_016767  Batf           Chr12: 83Mb    Chr15: 88Mb, D15Mit159        Chr11: 83Mb, S11Gnf089.780    Chr18: 11Mb, D18Mit31
NM_009658  Akr1b1         Chr1: 56Mb     Chr1: 80-81Mb, D1Mit134       Chr8: 10Mb, S08Gnf006.700     ChrX: 96Mb, DXMsw087
NM_013484  C2 (w/i H-2S)  Chr17: 33Mb    Chr17: 39Mb, D17Mit11         Chr13: 47Mb, D13Mit94         Chr4: 16Mb, S04Gnf013.290
L08652     Rpl29-ps1      Chr3: 152Mb    Chr1: 81Mb, D1Mit134          Chr5: 97Mb, D5Mit155          Chr13: 56Mb, D13Mit13
AK010377   2410004A20Rik  Chr9: 73Mb     Chr1: 62Mb, S01Gnf059.350     Chr18: 39Mb, D18Mit14         Chr11: 77Mb, D11Mit245
NM_008045  Fshb           Chr2: 107Mb    Chr1: 80-86Mb, D1Mit134       Chr8: 10Mb, S08Gnf006.700     Chr11: 79Mb, S11Gnf085.885
NM_007990  Fau            Chr19: 6Mb     Chr1: 81Mb, D1Mit134          Chr2: 149Mb, D2Mit282         Chr11: 58Mb, D11Mit208

Of the 19 transcripts that are mappable in all three tissues, two are mapped to the same genomic region in all three tissues (Table 3-9). One such transcript, AK007639, corresponding to the RIKEN clone 1810029E06Rik, is significantly linked to an eQTL on Chr8 in all three tissues. Interestingly, this Chr8 region is also the site at which the transcript is encoded, suggesting cis-acting regulation. A BLASTN search (MCGINNIS and MADDEN 2004) using this transcript against the NCBI's Mus musculus nr sequence database (Build 36) suggests it could be the MBII-136 C/D box snoRNA (E-value=9e-42). C/D box snoRNAs are known to be encoded within introns of their targets, and one of the known functions of C/D box snoRNAs is to guide methylation of rRNAs (BACHELLERIE et al. 2002; HUTTENHOFER et al. 2001). As this transcript appears to be influenced in cis, we examined the Chr8 eQTL region and found, of particular interest, the Rbpms (RNA binding protein with multiple splicing) and Rbm13 (RNA binding motif protein 13) genes residing within this region. A possible hypothesis arising from this is that this C/D box snoRNA (AK007639) may be involved in a tissue-independent regulation of RNA-binding proteins (Rbpms and Rbm13) that may in turn regulate transcription of other genes.

If this hypothesis is true, then transcripts linked to this Chr8: 10Mb-46Mb region may potentially be influenced, indirectly, by the C/D box snoRNA, AK007639. Re-examining our single-locus mapping analyses shows that 130 (brain), 106 (kidney), and 50 (liver) transcripts are significantly linked to this locus at Praw ≤ PPRED (see Table A-11 in Appendix A.6 for full lists of these transcripts), but only AK007639 is also encoded within this region.

Interestingly, the second transcript (AK016395) with common linkages in all three tissues, corresponding to the RIKEN clone 4930595O18Rik, is significantly linked to the same Chr8 eQTL as the snoRNA (AK007639) mentioned above, but its regulation appears to be in trans as the transcript is encoded on Chr6. A similar BLAST search as performed above reveals no useful information about the transcript. The function and regulation of this transcript may be associated with the snoRNA or it may be under the genetic regulation of any one of the genes residing within the Chr8 eQTL region. More speculatively, one cannot rule out the possibility of a sequential genetic regulation where AK016395 is controlled in trans by an RNA-binding protein (Rbpms or Rbm13) within the Chr8 region, which is in turn regulated in cis by the putative snoRNA (AK007639).

Of the other 17 transcripts that are mappable in all three tissues, two are linked to the same eQTLs in two of the three tissues (Table 3-9): the RIKEN transcript 2510049I19Rik (AB041658) is linked to Chr8: 62Mb-70Mb in brain and kidney; and the RIKEN transcript 4930517J16Rik (AK015815) is linked to Chr8: 20Mb-39Mb in kidney and liver. Three other transcripts are mapped to the same chromosome, but not the same region, in two of the three tissues (Table 3-9): Cul3 (cullin 3; NM_016716) is mapped to an eQTL on Chr13 in kidney and liver; Rnf139 (ring finger protein 139; AK016137) is mapped to an eQTL on Chr1 in brain and kidney; and Znrd1 (zinc ribbon domain containing, 1; NM_023162) is mapped to an eQTL on Chr8 in kidney and liver. The remaining twelve transcripts are mapped to different chromosomal loci across the three tissues, suggesting tissue-specific genetic regulation of gene expression.

It is plausible that the lack of overlapping mappable transcripts among the three tissues may have been biased by the fact that ~91% of the transcripts demonstrating linkage in the kidney data are potentially spurious (Table 3-2C). However, this does not change the conclusion that genetic regulation of gene expression is tissue-specific because the proportion of expression traits showing significant linkage to at least one eQTL in both brain and liver is only ~5.8% (265/4552).

Multiple Regulators of Gene Expression. Considering all 779 marker loci, ~48% of mappable transcripts in the brain are mapped to >1 eQTL and ~6.3% are mapped to >5 eQTLs (Section 3.3; Table 3-4; Figure 3-10). This is markedly higher than in the kidney and liver data, where no more than 38% of mappable transcripts are linked to >1 eQTL and <3% are linked to >5 eQTLs. However, following removal of potentially redundant linkages due to linkage disequilibrium with remove.LD (Section 4.1.1), the extent of multiple linkages is drastically reduced in all three tissues, with <7.2% of mappable transcripts demonstrating linkage to >1 eQTL and no transcripts linked to >3 eQTLs in any of the three tissues. These results indicate that transcripts in the brain have a higher predisposition to LD-induced linkages than transcripts in the kidney or liver, suggesting that the genetic effects on gene expression are stronger in the brain than in the other two organs.

Cis- and Trans-Regulators of Gene Expression. While ~91% of linkages from the kidney study could be spurious (Table 3-2C), it is expected that the other 9% of potentially real linkages correspond to the most significant, or most highly ranked, test statistics. Thus, cis-linkages identified at extremely significant thresholds (Praw ≤ 10^-7) are likely real (Section 3.4; Figure 3-15; Figure 3-16). Accordingly, the observation that the extent of cis-linkages is much higher in kidney than in the brain or liver study at such a significant threshold implies that the cis-acting effect is much stronger in this tissue. It must be emphasised here that, although a higher proportion of cis-linkages is observed in kidney at low Praw, this does not necessarily imply that genetic control of gene expression acts mainly in cis. On the contrary, it may in fact be an indication that the genetic effect on gene expression is mainly small and in trans, which would explain the small proportion of mappable transcripts in kidney because trans-acting effects are more difficult to detect than cis-effects. This is supported by a recent study (HUBNER et al. 2005) comparing kidney and fat cells of 30 RI rat lines, which found that, of the eQTLs commonly mapped between these two tissues, ~15% are cis-acting. Although we are unable to observe this in our three tissue studies due to the initial lack of overlapping eQTLs, the combined results strongly indicate that tissue-specific regulation is likely trans-acting.

Master Regulators of Gene Expression. False linkages are expected in all three tissues, but because random linkages are not expected to aggregate, linkage hotspots are likely real, provided the effects of systematic variations and genotype ambiguities have been accounted for (Section 3.5). Since different sets of transcripts are influenced differently between tissues, it is unsurprising that the locations of linkage hotspots are also different between tissues (Figure 3-17). The presence of tissue-specific linkage hotspots hints at the existence and involvement of tissue-specific master regulators in the control of gene expression. For instance, in the brain, the Chr1 linkage hotspot (77.6-80.8Mb) may contain a brain-specific transcriptional effector that influences expression levels of 1,247 transcripts. Interestingly, brain and kidney share a common linkage hotspot on Chr8 but the sets of transcripts mapped to this region are different between these two tissues, suggesting a common master regulator that functions in a tissue-specific manner: the same regulator influencing different sets of genes (potentially belonging to different genetic pathways) when expressed in different tissues.

3.7. Chapter summary and discussions

In this chapter, I have used mRNA expression levels from brain, kidney, and liver tissues from a panel of 31 BXD recombinant inbred mice in three separate eQTL studies to demonstrate the heritability of, and to identify candidate eQTLs influencing, variations in mRNA levels at a transcriptomic scale. These results and conclusions are summarised in Section 3.7.1, and issues arising from these studies are further discussed in the following sections.

3.7.1. Facts and figures

• 81.4% of all transcripts are “expressed” in brain, 42.4% in kidney, and 49.3% in liver (Section 3.2.5; Table 3-2)
  → Much of the transcriptome is active and at a detectable level

• 13.8% of all transcripts (“expressed” and “non-expressed”) are linked to at least one eQTL in brain, 5.5% in kidney, and 8.4% in liver (Section 3.2.1; Section 3.2.2; Table 3-2)
  → Variation in mRNA levels is genetically influenced, and the sets of transcripts and their candidate eQTLs can be, and have been, identified

• 7.2% of all transcripts (“expressed” and “non-expressed”) that are linked to at least one eQTL are influenced by more than two unique eQTLs in brain, 5.0% in kidney, and 5.9% in liver (Section 3.3; Table 3-4)
  → Expression variations of some transcripts are genetically influenced by multiple loci that can be identified using single-locus linkage analyses

• 0.44% of all mappable transcripts are influenced in cis in brain, 1.74% in kidney, and 1.08% in liver (Section 3.4; Table 3-5)
  → Expression levels of a majority of genes are influenced in trans by transcriptional effectors

• There are three linkage hotspots in the brain, on Chr1, Chr8, and Chr19; two in kidney, on Chr8 and Chr9; and one in liver, on ChrX (Section 3.5; Table 3-7)
  → Tissue-specific transcriptional effectors involved in the co-regulation of hundreds of genes have been identified

• 83% of all transcripts are “expressed” in at least one of the three tissues; of these, 42% are “expressed” in all three tissues, 32% in only brain, and <3% in either kidney or liver (Section 3.6.1; Figure 3-20)
  → Brain is much more transcriptionally complex than kidney and liver, with a set of transcripts that are exclusively expressed in the brain
  → There is also a set of transcripts that are expressed tissue-independently

• ~91% of all mappable transcripts are genetically influenced in a tissue-specific manner, and only ~0.35% (19 transcripts) are genetically influenced tissue-independently (Section 3.6.2)
  → The regulatory mechanisms governing gene expression variations are different between tissues

3.7.2. Quality control of microarray eQTL-mapping studies

I have shown that systematic variations associated with microarray technology can significantly influence eQTL-mapping studies. While some within-array structures and inter-array differences can partly be eliminated with appropriate within- and between-array normalisations (e.g. BOLSTAD et al. 2003; QUACKENBUSH 2002; SMYTH and SPEED 2003; WILLIAMS et al. 2006; YANG et al. 2002), as I have shown in this chapter, residual variations may remain because of the enormity of simultaneous measurements. Other variations, such as those associated with low microarray detection level, are more difficult to remove because the “noise” level is often too great to be distinguished from real signal. Spatial systematic variations that influence a physical area on the array can simultaneously influence a subset of genes localised to that spatial region, leading to spurious linkage hotspots.

One way to address these issues is to perform repeated eQTL-mapping analyses using several versions of the same expression data that have been normalised with different methods. Linkage results that are robust to, and reproducible under, different normalisation methods are therefore more reliable (WILLIAMS et al. 2006). This method would help reduce many false positives, and so is ideal if the aim is to identify candidate genes and corresponding eQTLs for further in vivo or in vitro studies. However, this method has the potential to increase false negatives because true genetic signals sensitive to some normalisation methods may be filtered out, and so is not ideal if the aim of the study is to determine the true proportion of genes whose expression levels are under genetic influence.
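The filtering idea described above amounts to a set intersection, which can be sketched as follows. This is a minimal illustration only, not the procedure of WILLIAMS et al. (2006): the normalisation labels, trait names, marker names, and linkage calls are all invented for the example, and `robust_linkages` is a hypothetical helper name.

```python
def robust_linkages(calls_by_method):
    """Return the (trait, marker) linkages called under every normalisation."""
    call_sets = [set(calls) for calls in calls_by_method.values()]
    if not call_sets:
        return set()
    return set.intersection(*call_sets)


# Invented linkage calls from the same expression data normalised three ways.
calls_by_method = {
    "normalisation_A": {("trait_1", "marker_a"), ("trait_2", "marker_b"), ("trait_3", "marker_c")},
    "normalisation_B": {("trait_1", "marker_a"), ("trait_2", "marker_b")},
    "normalisation_C": {("trait_1", "marker_a"), ("trait_3", "marker_c")},
}

# Only linkages reproduced under all three normalisations are retained.
robust = robust_linkages(calls_by_method)
```

Linkages seen under only some normalisations are discarded, which illustrates the trade-off noted above: fewer false positives at the cost of potentially more false negatives.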

3.7.3. False positive and false negative rates

The issues of false positive (type I error) and false negative (type II error) rates have not been extensively discussed in this research project, mainly because their estimation has been difficult and, given the focus of this project, their assessment has been partly sidestepped. Estimation of the error rates will not be simple because the rates of type I and II errors are affected by (1) correlation between genetic markers; (2) correlation between expression traits; and (3) the signal spectrum of microarray data.

The first point was addressed and accounted for in Section 3.2.1 using simulation studies and a novel algorithm, remove.LD (Section 4.1.1). Of all transcripts identified as linked to at least one eQTL, ~36% in brain, ~91% in kidney, and ~60% in liver are potentially spurious.

In Section 3.2.1 I also showed that correlation between expression traits is unlikely to contribute significantly to random linkage. However, if it is indeed a confounding factor, trait-association can potentially lead to linkage hotspots. Such a phenomenon is not undesirable as it has major biological and genetic implications. However, devising methods for controlling type I and II errors would then become far more difficult because the nature and extent of trait-association are expected to vary between traits.

Microarray sensitivity can influence type I error because low detection signal is associated with increased noise and so any variation in expression level is potentially a reflection of systematic variation (Section 3.2.4). Amongst the mappable transcripts, ~15% in brain, ~57% in kidney, and ~48% in liver are classified as “unexpressed”; these are expected to be associated with a higher rate of type I error. Concurrently, a higher rate of type II error is expected to be associated with “unexpressed” transcripts not linked to any eQTL. To re-emphasise, many transcripts classified as “unexpressed” are possibly endogenously expressed but at such low levels that their detection becomes less reliable in a microarray system. As a result of the increased “noise” level associated with low detection, many lowly expressed transcripts whose expression levels are truly controlled by one or more eQTLs would be undetected because of the reduced statistical power associated with a reduced signal-to-noise ratio.

From these observations I conclude that the rates of type I and II errors are expected to be different between transcripts classified as “expressed” and “unexpressed”. Thus I suggest that estimates and assessments of these error rates be performed on these two classes of genes separately. Furthermore, while a criterion or threshold is necessary for defining “expression”, it is nonetheless rather arbitrary. Thus it may be insightful to experiment with several different criteria and thresholds and to estimate type I and II errors in each case.

3.7.4. Involvements of genotype patterns in eQTL mapping studies

Genotype patterns are important in eQTL-mapping analyses, as they are one of the two components (the other being expression trait values) that are correlated to determine linkage between an expression trait and a marker locus. The influence of genotype pattern structure on eQTL studies, however, has not been well studied. I have shown in this chapter that biases in the distribution of genotypes within a population at a marker locus and linkage disequilibrium can result in spurious linkages.

Biases in genotype frequencies within a population at different marker loci directly imply sample size biases, and thus different linkage test statistic distributions, between these loci. I have shown in Section 3.5 that extreme imbalance in B-to-D ratio at a marker locus can give rise to artefactual linkages and linkage hotspots. By considering the average number of linkages at each marker with different frequencies of B-genotypes, I further show that more linkages are generally observed at marker loci with obvious imbalance of B:D (i.e. relatively high and relatively low numbers of B-genotypes) in the brain, kidney, and liver BXD eQTL studies (Figure 3-23).

While the effect of LD is well understood, its influence in eQTL-mapping studies has not been extensively examined. Algorithms for eliminating redundant linkages due to LD have been previously proposed, but these methods tend to assume tightness of LD is uniform and/or linear across the genome (Section 4.2.1). In Chapter 4, I propose an alternate method, remove.LD, for defining LD that utilises similarities between pair-wise genotype patterns. Regardless of which method is used, it is clear that failing to apply any algorithm to eliminate LD-induced multiple linkages can lead to a dramatic over-estimation of the proportion of genes that are independently influenced by multiple eQTLs. A more thorough discussion of multiple linkages and LD effects is presented in the following chapters.

Figure 3-23 Average numbers of linkages at markers with different numbers of B-genotypes. The results are from the BXD brain (top), kidney (middle), and liver (bottom) studies. Insets are enlargements of each plot between the range of 0 and 0.5 average number of linkages per marker (y-axis). Of all 779 markers, none have genotype patterns containing <9 or >25 B-genotypes (x-axis). Note: a small number of B implies a large number of D, and vice versa.

4. Validation and analysis of two-locus influence on mRNA levels

Quantitative traits such as transcript abundance are likely complex in origin and controlled by multiple genetic regulators that act in an additive or epistatic manner. A transcript is said to be under the regulation of multiple additive eQTLs if these eQTLs simultaneously, but independently, contribute to the transcript’s expression level. Loci that simultaneously and dependently contribute to a transcript’s abundance are said to have an epistatic effect.

In our three-tissue BXD eQTL linkage studies, between 37.6% and 48.1% of significantly linked transcripts are linked to two or more eQTLs (Table 3-4), suggesting multiple genetic regulation of gene expression. However, these linkages have been identified from single-locus mapping analyses that consider linkage of an expression trait at each genomic region independently of the next. Single-locus mapping analyses are therefore not designed for identifying multi-locus influences and though they may have the ability to detect additive (independent) multi-locus effects, they would not have the ability to detect epistatic (dependent) multi-locus effects.

The aim of this chapter is to determine the extent to which single-locus mapping techniques can be used to detect multi-locus effects. A method for validating two-locus effect, bqtl.twolocus, has been developed and applied to assess and verify the mode of interactions amongst multiple eQTLs identified from several single-locus mapping analyses, including the three-tissue BXD data presented in Chapter 3. That is, for a set of marker loci independently linked to a common expression trait using a 1-D genome-scan, we tested all combinations of marker-pairs from this marker set for two-locus effect using bqtl.twolocus.

We emphasise here that the purpose of this chapter is to examine the extent to which multiple linkages independently identified from single-locus eQTL-mapping analyses are likely real. It is not the intention of this thesis to search for, or develop new methodologies to search for, multiple genetic effectors of gene expression. In the field of classical QTL-mapping, various multi-locus mapping methods already exist for this purpose (see BROMAN 2001 and references therein), which may be adopted for expression traits. More specifically, and more recently, STOREY, AKEY and KRUGLYAK (2005) proposed a step-wise forward regression approach for identifying joint linkages for expression traits (see Section 4.4 for implementation of this method). Regrettably, the search for multiple linkages is often hampered by the intensive computation associated with these methodologies and/or by the lack of statistical power associated with limited sample size; this may explain the lack of available multi-locus eQTL-mapping studies. The work presented in this chapter, and particularly bqtl.twolocus, is for the post hoc analysis of two-locus effects in currently existing single-locus eQTL-mapping studies.

Our expectation is that some multiple linkages will show evidence of additive interactions while evidence of epistatic interaction is less likely. However, any evidence of epistatic interaction would indicate the potential usefulness of single-locus mapping techniques in mapping multi-locus effects.

4.1. Experimental design

The first part of this chapter (Section 4.2 and Section 4.3) is devoted to the analysis of multi-locus effects in our three-tissue BXD studies. Section 4.4 extends this analysis to seven other datasets corresponding to four publicly available eQTL-mapping studies (Table 4-9).

For each dataset, the following procedure is followed:
1) Perform single-locus mapping analysis by simple regression using QTL Reaper (Section 2.1.4) and calculate nominal P-values (Praw) on likelihood ratio (LR) scores (Section 3.1.1);
2) Remove redundant linkages due to linkage disequilibrium, using a custom algorithm remove.LD (Section 4.1.1 and Section 2.5);
3) Identify eQTL(s) for each expression trait using a genome-wide significance (PGWS) of 0.05;
4) Identify those traits mapped to multiple unique eQTLs;
5) For each trait with multiple eQTLs, determine if the trait is influenced simultaneously by at least two of the eQTLs using bqtl.twolocus (Section 4.1.2 and Section 2.6).

Although an empirical P-value threshold (Praw ≤ PPRED) was shown to be more appropriate in Chapter 3, in this chapter the more conservative Bonferroni correction (Praw ≤ PBONF) is used to allow for cross-data comparison (Section 4.5).

4.1.1. Genotype pattern-association and removal of redundant linkages due to linkage disequilibrium

In Chapter 3, particularly Section 3.3, remove.LD was briefly introduced for the removal of potentially redundant linkages due to LD. Here, we describe more clearly the concept behind this algorithm: its purpose is to eliminate redundant linkage peaks due to LD, where LD is assessed based on the extent of genotype pattern similarity (Section 2.4) between neighbouring eQTLs (see Section 2.5 for details of the algorithm).

A genotype pattern is the pattern of genotypes across a panel of a segregating population at a specific genomic region, such as a marker locus (Section 1.5). Let us explain this using the BXD RI panel. In an RI panel, each mouse is homozygous inbred and so, at any given locus, an individual will have a genotype of either BB or DD (denoted, in this thesis, simply as B and D), corresponding to the genetic backgrounds of the founder strains C57BL/6J and DBA/2J (see Section 1.3.2). For example, at the marker S07Gnf047.960, located on Chr7, the first individual of the BXD panel, BXD1, has inherited this region of Chr7 from the C57BL/6J founder strain, and so has a B-genotype at this marker. Conversely, the second individual, BXD2, has inherited this genomic region from the DBA/2J founder and so has a D-genotype at this same marker locus. Stringing the genotypes at this marker locus across the 31 BXD mice results in the genotype pattern:

BDBBBBBDBDDBBBBBDBBBBBDDDBBDBBB

The similarity between a pair of genotype patterns is inversely related to the effort required to change one pattern into the other. Let us consider the genotype pattern at S07Gnf047.960 across only the first six strains, and the genotype patterns of a marker upstream (D7Mit229) and downstream (D07Msw058) of S07Gnf047.960:

D7Mit229: BDBDBB
S07Gnf047.960: BDBBBB
D07Msw058: BBBBBD

Only one change is necessary (at BXD4) to convert the genotype pattern of S07Gnf047.960 to that of D7Mit229, but two are necessary (at BXD2 and BXD6) to convert it to D07Msw058. Thus, D7Mit229 and S07Gnf047.960 are clearly more similar than D07Msw058 and S07Gnf047.960. With six RI lines, changing the genotype at any one of the six strains will result in a different genotype pattern. Thus, given n RI lines, each genotype pattern has n related patterns that differ at only one of n positions, n(n-1)/2 patterns that differ at only 2 positions, and so on. The bottom line here is that each marker genotype pattern can have many associated genotype patterns.
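The comparison of patterns by the number of changes needed can be sketched in a few lines of Python. This is an illustration only and not part of remove.LD: the function name `pattern_distance` is hypothetical, and it simply counts the strains at which two genotype patterns differ, then confirms the counts of patterns lying one or two changes away for n = 6 RI lines.

```python
from math import comb


def pattern_distance(a, b):
    """Number of strains at which two genotype patterns differ."""
    if len(a) != len(b):
        raise ValueError("genotype patterns must cover the same strains")
    return sum(x != y for x, y in zip(a, b))


# Genotype patterns across the first six BXD strains, as given in the text.
patterns = {
    "D7Mit229": "BDBDBB",
    "S07Gnf047.960": "BDBBBB",
    "D07Msw058": "BBBBBD",
}

d_upstream = pattern_distance(patterns["S07Gnf047.960"], patterns["D7Mit229"])
d_downstream = pattern_distance(patterns["S07Gnf047.960"], patterns["D07Msw058"])

n = 6
one_change = comb(n, 1)   # n patterns differ at exactly one position
two_changes = comb(n, 2)  # n(n-1)/2 patterns differ at exactly two positions
```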

We use Pearson’s correlation coefficient to quantify pair-wise genotype pattern similarities (see Section 2.4 for discussion on the choice of Pearson’s correlation coefficient). This is achieved by first re-coding genotypes as digits; e.g. re-coding B=0 and D=1 results in the genotype pattern 0100000101100000100000111001000 for S07Gnf047.960. Pairs of such numeric vectors can then be compared using Pearson’s correlation coefficient. Performing this on a set of M markers results in an M×M matrix of similarity coefficients. A heat map representation of such a matrix using the 779 marker loci across our 31 BXD mice is presented in Figure 4-1.

A pair of genotype patterns can be positively correlated or negatively correlated. Negative, or inverse, correlation occurs when the genotypes at the two marker loci always disagree within an individual. For instance, the pattern of S07Gnf047.960, BDBBBB, is the exact inverse of the pattern DBDDDD corresponding to the marker D4Mit249 on Chr4 across six BXD strains.
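The re-coding and correlation steps described above can be sketched as follows. This is a minimal, standard-library illustration rather than the thesis implementation; the function names `recode`, `pearson`, and `similarity_matrix` are hypothetical, and the six-strain patterns are those quoted in the text.

```python
from math import sqrt


def recode(pattern):
    """Re-code a genotype pattern as digits, B=0 and D=1."""
    code = {"B": 0, "D": 1}
    return [code[g] for g in pattern]


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a pattern with a single genotype")
    return sxy / sqrt(sxx * syy)


def similarity_matrix(patterns):
    """M x M matrix (list of lists) of pair-wise pattern correlations."""
    vectors = [recode(p) for p in patterns.values()]
    return [[pearson(u, v) for v in vectors] for u in vectors]


# The 31-strain pattern of S07Gnf047.960, re-coded as a digit string.
digits = "".join(str(d) for d in recode("BDBBBBBDBDDBBBBBDBBBBBDDDBBDBBB"))

# An exactly inverted pair of patterns is perfectly negatively correlated.
r_inverse = pearson(recode("BDBBBB"), recode("DBDDDD"))

six_strain = {"D7Mit229": "BDBDBB", "S07Gnf047.960": "BDBBBB", "D07Msw058": "BBBBBD"}
matrix = similarity_matrix(six_strain)
```

The resulting matrix is symmetric with a unit diagonal, which is why only one triangle of a heat map of such a matrix carries information.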

Figure 4-1 Heat map representation of all pair-wise genotype pattern similarities of 779 markers across 31 BXD RI strains. Markers are ordered by their physical location on the genome: x- and y-axes. Positive genotype pattern correlations are represented by gradients of orange. Negative (or inverse) correlations are represented by shades of blue. Absence of correlation is represented by white. This matrix is symmetrical about the diagonal and so the upper and lower triangles are identical. The red outlined region contains the inter-chromosomal genotype pattern correlations.

To define LD between neighbouring marker loci, an LD-threshold is set at the 95th percentile of all inter-chromosomal pair-wise genotype pattern correlations (Figure 4-1 red outlined region). An inter-chromosomal marker pair refers to any pair of markers located on different chromosomes. For elimination of potentially LD-induced redundant linkages, this LD-threshold is combined with linkage significance (Praw) as described in Section 2.5.1. For instance, given a pair of neighbouring eQTLs (L1 and L2), the one (L2) linked less significantly to the expression trait of interest is assessed for LD against the better-linked eQTL (L1). If Pearson’s correlation coefficient for this pair of marker loci exceeds the LD-threshold, then the two loci are considered to be in LD and L2 is eliminated.
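The thresholding rule for a single pair of neighbouring eQTLs can be sketched as below. This is a rough sketch under stated assumptions, not the remove.LD algorithm of Section 2.5: the marker names, correlations, and P-values are invented; the nearest-rank percentile definition is an assumption; and the comparison uses the signed coefficient, since how negative correlations are handled is not stated here. All function names are hypothetical.

```python
from math import ceil


def percentile_nearest_rank(values, q):
    """q-th percentile by the nearest-rank definition (an assumed choice)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values supplied")
    rank = max(1, ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def ld_threshold(corr, chrom, q=95):
    """Percentile of correlations over marker pairs on different chromosomes."""
    inter = [r for pair, r in corr.items()
             if len({chrom[m] for m in pair}) == 2]
    return percentile_nearest_rank(inter, q)


def resolve_pair(l1, l2, p_raw, corr, threshold):
    """Keep the better-linked locus; drop the other if the pair is in LD."""
    better, worse = sorted((l1, l2), key=lambda m: p_raw[m])
    in_ld = corr[frozenset((better, worse))] > threshold
    return [better] if in_ld else [better, worse]


# Invented toy data: m1, m2, m5 share a chromosome; m3 and m4 do not.
chrom = {"m1": "Chr1", "m2": "Chr1", "m5": "Chr1", "m3": "Chr2", "m4": "Chr3"}
corr = {
    frozenset(("m1", "m2")): 0.9, frozenset(("m1", "m5")): 0.2,
    frozenset(("m1", "m3")): 0.1, frozenset(("m1", "m4")): -0.2,
    frozenset(("m2", "m3")): 0.3, frozenset(("m2", "m4")): 0.05,
    frozenset(("m3", "m4")): 0.2,
}
p_raw = {"m1": 1e-6, "m2": 1e-4, "m5": 1e-5}

threshold = ld_threshold(corr, chrom)
kept_tight = resolve_pair("m2", "m1", p_raw, corr, threshold)
kept_loose = resolve_pair("m1", "m5", p_raw, corr, threshold)
```

Deriving the threshold from inter-chromosomal pairs means it reflects the level of pattern similarity that arises without physical linkage in the data at hand, rather than a fixed physical distance.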

4.1.2. Validation of two-locus effects on expression traits: bqtl.twolocus

For each pair of eQTLs (L1 and L2) identified as significantly linked to an expression trait, the following null hypothesis is tested:

H0: L1 and L2 do not simultaneously contribute to trait variance

That is, we are explicitly testing whether or not the presence of both loci explains a trait’s variance better than either locus alone, assuming at least one of the loci truly influences the trait.

In principle, when a locus-pair simultaneously influences a trait, the extent to which each locus affects the trait is not necessarily identical. It is therefore important to note which of the pair has the primary (major) effect.

For example, to test the effects of L1 and L2 on an expression trait, y, the following two interactions are considered:

y ~ L1*L2 + ε …(1)

y ~ L2*L1 + ε …(2)

In these two equations, L1*L2 is a shortened form of the full model:

y ~ L1 + L2 + L1:L2 + ε

where

y = expression trait value of an individual

L1 = main and independent effect of the primary locus

L2 = main and independent effect of the secondary locus

L1:L2 = epistatic effect between L1 and L2

ε = residual (non-genetic) effects

Statistically, evidence of two-locus effect is determined by comparing the full two-locus model:

y ~ L1 + L2 + L1:L2 + ε … full model (Mfull)

against the single-locus model:

y ~ L1 + ε … single-locus model (Msingle)

A Bayesian approach, via the R-package R/QTL (BROMAN et al. 2003; Section 2.6), is used to estimate posterior probabilities for assessing various regression fits in these multiple regression models. The likelihood that an expression trait-variance is better explained with the addition of a secondary locus is measured using the logarithm of the odds ratio:

LOD = LR(Mfull) - LR(Msingle)

where

LR(Mfull) = logarithm of the Bayesian posterior probability that both loci influence the trait, and

LR(Msingle) = logarithm of the posterior probability that the primary locus influences the trait variance.
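To make the model comparison concrete, the sketch below computes a classical least-squares LOD score for the full versus single-locus model. This deliberately swaps in a different technique: it does not use Bayesian posterior probabilities as described above, so its values are not comparable with those of bqtl.twolocus. It relies on the fact that, for homozygous B/D genotypes, fitting one mean per two-locus genotype class is equivalent to the full model with interaction. The function names and the expression values are invented for illustration.

```python
from collections import defaultdict
from math import log10


def rss_by_group(y, keys):
    """Residual sum of squares after fitting one mean per group."""
    groups = defaultdict(list)
    for value, key in zip(y, keys):
        groups[key].append(value)
    rss = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        rss += sum((v - mean) ** 2 for v in values)
    return rss


def two_locus_lod(y, l1, l2):
    """Classical LOD of y ~ L1 + L2 + L1:L2 against y ~ L1 (least squares)."""
    n = len(y)
    rss_single = rss_by_group(y, l1)               # one mean per L1 genotype
    rss_full = rss_by_group(y, list(zip(l1, l2)))  # one mean per (L1, L2) class
    if rss_full == 0:
        raise ValueError("full model fits perfectly; LOD is unbounded")
    return (n / 2) * log10(rss_single / rss_full)


# Invented expression values for eight strains with an additive L2 effect.
y = [1.0, 1.2, 2.1, 1.9, 3.0, 3.1, 4.2, 3.9]
l1 = "BBBBDDDD"
l2_linked = "BBDDBBDD"     # secondary locus that tracks the residual variation
l2_unlinked = "BDBDBDBD"   # secondary locus unrelated to the trait

lod_linked = two_locus_lod(y, l1, l2_linked)
lod_unlinked = two_locus_lod(y, l1, l2_unlinked)
```

Because the single-locus model is nested in the full model, this LOD is never negative; a secondary locus that explains residual variation yields a much larger value than an unrelated one.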

Significance (P-value) of each LOD score is calculated empirically by comparing it against a “null” distribution. A null distribution is constructed from all LOD scores derived from the same model comparison, except the secondary locus is replaced by other marker loci on the genome. For example, given there are M markers on the genome, to assess the significance of the full two-locus model (y~L1*L2), its LOD score is compared against the distribution of LODs generated from the two-locus models y~L1*Lm, where m=1...M. Using a conventional 95% confidence interval, a test LOD is declared significant if it is within the upper 5% of the null distribution.

If L1 interacts solely, or most strongly, with L2, then their test statistic (LOD score) should, in principle, be the largest of the M LOD scores (corresponding to the M markers), thus resulting in P = 1/M. This sets a lower limit on the P-values, a limit that is data-specific as it is dependent on M. This limit may be problematic if more precise estimates of P-values are desired, but for our purpose, this is not important as 1/M << 0.05 in all datasets.
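The empirical P-value and its lower limit of 1/M can be illustrated with a short sketch. The function name `empirical_p` is hypothetical and the LOD values are invented; the only assumption carried over from the text is that the null distribution consists of the M LOD scores obtained by pairing the primary locus with every marker, so the test score is itself one of the M values.

```python
def empirical_p(test_lod, null_lods):
    """Proportion of null LOD scores at least as large as the test LOD."""
    if not null_lods:
        raise ValueError("null distribution is empty")
    exceed = sum(1 for lod in null_lods if lod >= test_lod)
    return exceed / len(null_lods)


M = 779  # number of marker loci in the BXD studies

# Invented null LODs for y ~ L1*Lm; the test pair (L1, L2) gives the largest score.
test_lod = 9.0
null_lods = [0.01 * m for m in range(M - 1)] + [test_lod]

p_value = empirical_p(test_lod, null_lods)
significant = p_value <= 0.05  # within the upper 5% of the null distribution
```

With M = 779 the smallest attainable P-value is 1/779, roughly 0.0013, which is well below the 0.05 cut-off.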

A locus-pair may influence an expression trait in an additive (L1+L2) or epistatic (L1:L2) manner, and a significant LOD score may be driven by either the additive or epistatic term in a full two-locus model (Mfull). To assess for epistatic interaction, the following null hypothesis is posed:

H0: L2 does not interact epistatically with L1

The effect of L1:L2 is measured by its regression coefficient, which is declared significant if it lies within the upper 5% of the null distribution. As with the LOD scores, a null distribution of epistatic interaction is derived from the coefficients of L1:Lm, where m=1...M.
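As an illustration of what the L1:L2 coefficient measures, the sketch below computes its ordinary least-squares estimate from the four two-locus class means. This is an assumption-laden stand-in, not the Bayesian regression used by bqtl.twolocus: it assumes genotypes are coded B=0 and D=1, in which case the interaction coefficient equals (mean DD - mean DB) - (mean BD - mean BB). The function names and expression values are invented.

```python
def cell_means(y, l1, l2):
    """Mean trait value for each observed (L1, L2) genotype class."""
    sums, counts = {}, {}
    for value, a, b in zip(y, l1, l2):
        key = (a, b)
        sums[key] = sums.get(key, 0.0) + value
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


def interaction_coefficient(y, l1, l2):
    """Least-squares L1:L2 coefficient under B=0, D=1 coding."""
    means = cell_means(y, l1, l2)
    cells = [("B", "B"), ("B", "D"), ("D", "B"), ("D", "D")]
    if any(cell not in means for cell in cells):
        raise ValueError("all four two-locus genotype classes must be observed")
    effect_of_l2_given_d = means[("D", "D")] - means[("D", "B")]
    effect_of_l2_given_b = means[("B", "D")] - means[("B", "B")]
    return effect_of_l2_given_d - effect_of_l2_given_b


l1 = "BBBBDDDD"
l2 = "BBDDBBDD"

# Invented traits: one roughly additive, one purely epistatic.
y_additive = [1.0, 1.2, 2.1, 1.9, 3.0, 3.1, 4.2, 3.9]
y_epistatic = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 5.0, 5.0]

b_additive = interaction_coefficient(y_additive, l1, l2)
b_epistatic = interaction_coefficient(y_epistatic, l1, l2)
```

A coefficient near zero indicates that the effect of L2 is the same regardless of the genotype at L1 (additive), whereas a large coefficient indicates that the effect of L2 depends on L1 (epistatic).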

The approach (bqtl.twolocus) outlined in this section is designed specifically to assess eQTL interactions amongst pre-defined sets of marker loci. The development of bqtl.twolocus, and hence the assumptions made, is not tailored to search for the best two-locus linkages for each trait. Rather, it is designed to assess whether a pair of eQTLs, identified from a single-locus mapping analysis, is likely to have genuine joint influence on an expression trait. One of the major reasons for using bqtl.twolocus rather than a traditional full two-locus model (Section 1.7) is the reduction in computational burden, as a full exhaustive two-locus search is estimated to take approximately one month on a standard desktop computer. This cannot be easily overcome with the use of alternative approaches, such as step-wise model selection, as statistical power is often constrained by small sample size, as will be demonstrated in Section 4.4.


4.2. Identification of unique multiple linkages in the BXD brain, kidney, and liver data

4.2.1. Methods for eliminating redundant linkages

As a consequence of LD, mapping eQTLs using a dense marker map can potentially over-estimate multiple linkages per expression trait. With traditional QTL-mapping studies, QTLs clustering at tightly linked markers are easily identified and grouped into a single unique linkage peak by simple visual inspection of the corresponding whole-genome scan plot. However, this becomes impractical in eQTL-mapping studies due to the large number of expression traits. As such, automated methods must be used to eliminate redundant linkage peaks caused by LD between closely located markers.

Several existing methods may be automated to achieve this: 1) Group all linkages that show monotonic decay from the main association into a unique eQTL (e.g. ABDALLAH et al. 2003); 2) Group all linkages that are less significant and within 5Mb from the main association into a unique eQTL (HUBNER et al. 2005); 3) Define the width of a linkage peak as 2-LOD units on either side of the main association; i.e. set support interval, SI, at 2-LOD (LANDER and BOTSTEIN 1989).
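To make the second of these methods concrete, the sketch below groups linkages within a fixed distance of a more significant linkage on the same chromosome. It is only an illustration of the idea: the greedy, most-significant-first ordering, the function name, and the tuple layout are assumptions, and it is not taken from HUBNER et al. (2005).

```python
def group_by_window(linkages, window_mb=5.0):
    """Keep one linkage per peak using a fixed distance window.

    linkages: list of (marker, chromosome, position_mb, p_value).
    Starting from the most significant linkage, any less significant
    linkage on the same chromosome within window_mb is dropped.
    """
    remaining = sorted(linkages, key=lambda item: item[3])
    unique = []
    while remaining:
        top = remaining.pop(0)
        unique.append(top)
        remaining = [
            item for item in remaining
            if item[1] != top[1] or abs(item[2] - top[2]) > window_mb
        ]
    return unique
```

Because only distance is used, a marker lying just outside the window is retained even if its genotype pattern is almost identical to that of the main peak.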

With all these methods, linkages are eliminated without actual assessment of LD; these methods assume LD decays with distance. Though logical, it appears this is not always true, as exemplified in Figure 4-2. In this example, the marker D1Mit169, located on Chr1: 24.3Mb, is more strongly correlated to a marker (D1Mit322) 18.2Mb downstream of it than to one (D1Mit171) that is closer (9.8Mb apart).


Figure 4-2 Pearson’s correlation between D1Mit169 and other markers. The bottom panel is a magnification of the correlations on Chr1.

To overcome this, we developed a novel algorithm (remove.LD) that uses genotype pattern association for assessing LD (Section 2.5). While it is understood that LD between two genomic regions is correlated with their distance apart, the extent of LD is not uniform along the genome (reviewed in WALL and PRITCHARD 2003). In light of this, two neighbouring eQTLs are defined as being in LD if their genotype patterns are highly correlated, that is, if their correlation lies within the upper 5% of the inter-chromosomal genotype pattern correlation distribution (Figure 4-3).


Figure 4-3 Probability distributions of inter- (black) and intra- (red) chromosomal marker genotype pattern correlations. Genotype patterns can be positively or negatively (inversely) correlated and are denoted by the signs (+/-) of Pearson’s correlation coefficient. The grey, dotted, vertical line indicates the 95th percentile of the inter-chromosomal marker genotype pattern correlations.


Genotype patterns of intra-chromosomal markers (markers located on the same chromosome) are likely to be more similar than those of inter-chromosomal markers (markers located on different chromosomes) as a result of co-segregation (Figure 4-3). However, not all markers on a chromosome, particularly a large chromosome, are expected to be in LD. In consideration of this, the 95th percentile of the inter-chromosomal genotype pattern correlations (Figure 4-3 vertical line), rather than the 100th percentile, has been chosen as the threshold for defining LD. Implicit in this LD-threshold is the assumption that any pair of intra-chromosomal markers whose genotype patterns have a correlation below this threshold is sufficiently distant from each other to warrant independent consideration. We use this definition of LD to eliminate additional linkages whose corresponding markers are in LD with the most significantly associated marker.
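The LD definition above can be sketched in code as follows. This is a simplified reading of the description and not the remove.LD implementation of Section 2.5: the function names, the 0/1 genotype coding, the use of the signed correlation, and the choice to compare each linkage against every retained linkage (rather than only against neighbours on the same chromosome) are all assumptions.

```python
from math import sqrt
from statistics import quantiles


def pearson(x, y):
    """Pearson's correlation between two non-constant numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def encode(pattern):
    """Code a genotype string such as 'BDBB' as 0 (B) and 1 (D)."""
    return [1 if allele == "D" else 0 for allele in pattern]


def ld_threshold(genotypes, chromosome):
    """95th percentile of correlations between markers on different chromosomes."""
    names = list(genotypes)
    inter = [
        pearson(genotypes[a], genotypes[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if chromosome[a] != chromosome[b]
    ]
    return quantiles(inter, n=20, method="inclusive")[-1]


def remove_ld(linkages, genotypes, threshold):
    """Keep linkages, most significant first, that are not in LD with a kept one.

    linkages: list of (marker, p_value) for one expression trait.
    """
    kept = []
    for marker, p_value in sorted(linkages, key=lambda item: item[1]):
        in_ld = any(
            pearson(genotypes[marker], genotypes[other]) > threshold
            for other, _ in kept
        )
        if not in_ld:
            kept.append((marker, p_value))
    return kept
```

In this sketch the threshold is computed once per marker map and then reused for every expression trait, since it depends only on the genotype patterns.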

The choice of the 95th percentile is rather arbitrary, but it is preferable to making no attempt to remove potentially redundant linkages, and it is more stringent than many existing methods, as exemplified in Figure 4-4 using a Shoc2 transcript (NM_019658; soc-2 homolog). Without any filtering, an automated search for linkage using Praw ≤ PRED (dotted grey lines) results in seven significant eQTLs on Chr3 (top-middle panel), which is clearly untrue. Attempts to remove potentially redundant linkages result in three significant eQTLs (D3Mit76, D3Mit12, D3Mit106), each identified by at least one of the four methods (red points above dotted grey lines).

D3Mit76, which is only identified by the method of HUBNER et al. (2005), appears to be a residual of the peak at D3Mit12 and is likely to be spurious (bottom-left panel). This marker is likely to have been called significant because it is >5Mb from the more significant peak at D3Mit12.

D3Mit106 is called significant by both the HUBNER et al. (2005) method and the method that searches for monotonic decay from the most significant linkage peak (D3Mit12) (top-right panel). At first glance, D3Mit106 and D3Mit12 appear to be two independent linkage peaks, suggesting that Shoc2 expression may be influenced by two Chr3 eQTLs. However, examination of the genotype patterns at these two loci shows that their genotypes differ in only two of the 31 BXD RI mice (Pearson’s correlation coefficient ≈ 0.88):

D3Mit12: BDBBDBDBBBDBBBBDDBBBDBBBDDDDDBB
D3Mit106: BDBBDBDDDBDBBBBDDBBBDBBBDDDDDBB

Thus, it appears that eliminating redundant linkages that are within 2-LOD units of the most significant linkage peak, or using remove.LD, best avoids false positive multiple linkages. Arguments for using remove.LD over a SI-threshold are that: 1) The latter method appears to be highly dependent on the P-value threshold for linkage: if PRED (grey dotted line) had been more lax, two other markers (two red points just below the grey dotted line) would have been called significant even though they are clearly part of the main linkage peak; 2) remove.LD assesses LD based on actual genotype patterns, rather than assuming a direct correlation between LD and linkage scores.
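The comparison of the two genotype patterns can be reproduced with a few lines of code. The genotype strings are those quoted above; coding B as 0 and D as 1 is an arbitrary choice made for this illustration and does not affect the correlation.

```python
from math import sqrt

d3mit12 = "BDBBDBDBBBDBBBBDDBBBDBBBDDDDDBB"
d3mit106 = "BDBBDBDDDBDBBBBDDBBBDBBBDDDDDBB"

# Code the B allele as 0 and the D allele as 1 (the coding is arbitrary).
x = [1 if g == "D" else 0 for g in d3mit12]
y = [1 if g == "D" else 0 for g in d3mit106]

n_strains = len(x)
n_different = sum(1 for a, b in zip(x, y) if a != b)

# Pearson's correlation coefficient computed from first principles.
mean_x, mean_y = sum(x) / n_strains, sum(y) / n_strains
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
var_x = sum((a - mean_x) ** 2 for a in x)
var_y = sum((b - mean_y) ** 2 for b in y)
r = cov / sqrt(var_x * var_y)

print(n_strains, n_different, round(r, 2))  # prints: 31 2 0.88
```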


Figure 4-4 Genome-scan of a Shoc2 (suppressor of clear-2 homolog) transcript. This transcript has linkage peaks on Chr3:95-116Mb and Chr4: 45-49Mb. Four methods are used to remove potentially LD-induced linkages: monotonic decay; method of HUBNER et al. (2005); support interval of 2-LOD units; and remove.LD. Subsequent results for Chr3 are shown. Eliminated linkages (markers) following application of these methods are indicated by open circles, while remaining markers are indicated with solid red points.


We further compare the performance of these methods using the BXD brain data (Table 4-1). In this dataset, 2,996 transcripts are linked significantly to at least one eQTL using Praw ≤ PRED (Table 3-2C). Without any attempt to account for LD between neighbouring markers, >48% of these mappable transcripts are declared as having multiple detectable genetic effects and >9% as having ≥5 detectable genetic effects. By accounting for LD between neighbouring markers, the proportion of mappable transcripts identified as linked to multiple eQTLs drops to ≤25%, suggesting the multiple linkages initially identified are a consequence of LD between neighbouring markers.

Table 4-1 Number (%) of eQTLs linked to a transcript for the 2,996 mappable expression traits in the BXD brain study. Results are shown for the original data (No action); following application of remove.LD; after removal of linkages within 2-LOD units from the main linkage peak (2-LOD width); after removal of linkages that are in monotonic decay from the main linkage peak (Monotonic decay); and after removal of linkages using the method described in HUBNER et al. (2005).

                   Only one eQTL    Multiple eQTLs   ≥5 eQTLs      Maximum number of linkages
No action          1,556 (51.9%)    1,440 (48.1%)    277 (9.2%)    13
remove.LD          2,781 (92.8%)    215 (7.2%)       0             3
2-LOD width        2,622 (87.5%)    374 (12.5%)      0             4
Monotonic decay    2,309 (77.1%)    687 (22.9%)      13 (0.4%)     5
HUBNER             2,244 (74.9%)    752 (25.1%)      1 (<0.1%)     5

As suspected from the Shoc2 example, remove.LD and the 2-LOD width method are the most stringent, compared to the ‘monotonic decay’ method and the method of HUBNER et al. (2005). Following remove.LD, only ~7% of mappable transcripts are detected as mapping to multiple linkages, with a maximum of three detectable linkages per transcript.

Expression traits are complex and are expected to be regulated by multiple genetic and non-genetic effects. However, the genetic effect of each individual eQTL is expected to be small, implying many of the secondary effectors are expected to be undetectable (GIBSON and WEIR 2005). When this problem of small effect-size is compounded with the complexity of LD between neighbouring genomic regions, one must be cautious of multiple linkages. Although remove.LD is a relatively stringent process for accounting for LD, if multiple genetic influences of gene expression are to be identified from single-locus eQTL-mapping studies, it is more prudent to err on the side of false negatives than false positives.

4.2.2. Identifying unique multiple linkages

From this section onwards, the more conservative Bonferroni correction is used (Praw ≤ PBON) so as to allow for cross-study comparisons later (Section 4.5 and Chapter 5).

Without applying any algorithms to remove potentially LD-induced multiple linkages, up to 46% of mappable transcripts are found significantly linked to two or more eQTLs (Table 4-2). With the application of remove.LD, the proportions of transcripts linked to multiple eQTLs are reduced by approximately 10-fold.

The stringent LD-threshold in remove.LD is intended to select for multiple linkages that are most likely to be real. Our argument is that, if evidence for loci-interaction is absent in this select set of multiple eQTLs, then eQTLs more likely to be LD-induced are even less likely to be real.


Of the total 124 transcripts, across all three tissues, identified by remove.LD as having multiple unique linkages (Table 4-2; Table 3-2A), 123 are significantly linked to only two eQTLs. The remaining transcript, NM_017378, from the liver data, is linked to three marker loci: D09Msw003 (Chr9), D11Mit41 (Chr11), and D17Mit49 (Chr17). These results suggest remove.LD is more modest in its assumption of multiple linkages than not accounting for any redundant linkages or using other methods for eliminating redundant linkages (Section 4.2.1).

Table 4-2 Number (%) of transcripts with at least one significant eQTL (top row), with multiple potentially redundant linkages (middle row) and multiple unique linkages (bottom row), in our BXD brain, kidney, and liver studies.

                                            Brain          Kidney         Liver
Total number of transcripts with linkage    1781           670            588
No action                                   829 (46.5%)    223 (33.3%)    294 (34.4%)
remove.LD                                   72 (4.0%)      17 (2.5%)      35 (4.1%)

To determine whether the extent of observed multiple linkages per transcript is likely real, we used simulation studies: sim0B, sim0K, sim0L (Section 2.3.1), and simnorm (Section 2.3.2). The sim0B/K/L studies, used previously in Chapter 3, are designed to model the behaviour of our three-tissue BXD eQTL studies under the null hypothesis of no linkage; thus any linkages identified in these studies are likely due to systematic variations.

The simnorm studies are designed to model the null hypothesis of no linkage in the absolute absence of any systematic variations, by generating the expression values from a normal distribution rather than sampling from the real data. Results from simnorm are expected to follow the absolute random expectation of no linkage. Each of the results from sim0B/K/L and simnorm is an average of five simulations (Figure 4-5; actual numbers are presented in Appendix A.2).

Regardless of whether remove.LD has been applied to eliminate redundant linkages, the proportions of multiple linkages observed in the real BXD studies are similar to those observed in the sim0B/K/L and simnorm studies (Figure 4-5), suggesting multiple linkages may in part arise from non-genetic and non-experimental variations. Had multiple linkages been observed in sim0B/K/L but not in simnorm, spurious multiple linkages would have been attributed to systematic, possibly inter-array, variations (Section 3.2.3 and Section 3.3.4). The fact that artefactual multiple linkages are observed in simnorm implies that variations unrelated to the expression data are driving these linkages, and we suggest random genotype pattern-association as a potential cause (see Chapter 5 for discussion on the influence of genotype pattern-association on multiple linkages).

To allow for comparisons against other datasets later in this chapter, the 124 transcripts linked to unique multiple eQTLs (Table 4-2) were identified from the full set of 21,765 transcripts on the arrays without filtering on expression levels. A closer inspection shows that only 64 (89%), 11 (65%), and 19 (54%) of the transcripts from the brain, kidney, and liver, respectively, are “expressed”, where “expression” is defined using the criterion that at least half of the animals must have a hybridisation level above the 90th percentile of all negative controls (Section 2.1.5 and Section 3.2.5). Clearly, for further experimental validations, these transcripts would have priority over the “non-expressed” sets.
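The “expressed” criterion can be written as a short filter. This is a sketch under stated assumptions: the function name and argument layout are hypothetical, and the exact percentile estimator used in Section 2.1.5 is not specified here, so Python's default quantile method is used as a stand-in.

```python
from statistics import quantiles


def is_expressed(levels, negative_controls):
    """True if at least half of the animals exceed the negative-control cutoff.

    levels: hybridisation levels of one transcript, one value per animal.
    negative_controls: hybridisation levels of all negative-control probes.
    """
    cutoff = quantiles(negative_controls, n=10)[-1]  # 90th percentile
    n_above = sum(1 for value in levels if value > cutoff)
    return n_above >= len(levels) / 2
```

For 31 animals, "at least half" means that 16 or more animals must exceed the cutoff.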


[Figure 4-5: stacked bar chart. Bars: all-Brain-BXD, all-Kidney-BXD, all-Liver-BXD, all-sim-norm, all-sim0B, all-sim0K, all-sim0L, remove.LD-Brain-BXD, remove.LD-Kidney-BXD, remove.LD-Liver-BXD, remove.LD-sim-norm, remove.LD-sim0B, remove.LD-sim0K, remove.LD-sim0L. Axis: 0% to 100%. Legend: one, two, three, four, five, more than 5.]

Figure 4-5 Proportions of transcripts linked to 1-5 or >5 eQTLs before and after application of remove.LD in the BXD, sim0B/K/L, and sim0norm studies. all: all transcripts and linkages (without application of any algorithms to remove potentially LD-induced multiple linkages); remove.LD: after application of the remove.LD algorithm; BXD: results from real BXD data; sim0B/K/L: results from the sim0B/K/L datasets (average of five simulations); sim-norm: average of five results from simnorm. Actual numbers making up this figure are in Appendix A.2.


4.3. Verification of two-locus interactions in the BXD brain, kidney, and liver data

Traditionally, the aim of a multi-locus mapping study is to identify the set of expression traits under multi-locus influence and their corresponding sets of eQTLs (Section 1.7). However, the aim of this section is to examine the validity of multiple linkages accrued from single-locus eQTL-mapping studies. A major motivation for this is the hope that the results can assist in the prioritisation of multiple linkages for potential follow-up analyses. Thus, I have focused this section on validating the presence of any two-locus effects within each of the 124 sets of multiple eQTLs identified from our previous single-locus linkage analyses. For completeness, an attempt to search for two-locus effects for all transcripts, using the method of STOREY, AKEY, and KRUGLYAK (2005), was made, but as shown in the following section (Section 4.4), we lack sufficient sample size for such an ambitious aim.

To validate two-locus effects identified from single-locus eQTL-mapping analyses, a new algorithm was developed: bqtl.twolocus (Section 4.1.2 and Section 2.6). Using this method, two-locus effects can be confirmed for 97%, 88%, and 97% of the transcripts mapped to multiple independent eQTLs in the Brain, Kidney, and Liver data, respectively (Table 4-3). As expected, test significance increases with LOD scores (Figure 4-6). Note that there are actually 144, 34, and 74 data-points on these plots (corresponding to Brain, Kidney, and Liver data), even though there are only 72, 17, and 35 transcripts with multiple linkages. This is because, for each set of eQTLs mapped by a transcript, all ordered pairs of loci are tested for joint influence (Section 4.1.2).
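The numbers of data-points follow directly from counting ordered locus-pairs, as the short sketch below shows. The marker names are taken from the text; the function name is hypothetical.

```python
from itertools import permutations


def ordered_pairs(eqtls):
    """All (primary, secondary) orderings of the eQTLs linked to one transcript."""
    return list(permutations(eqtls, 2))


# A transcript with two eQTLs yields 2 tests; one with three eQTLs yields 6.
pairs_two = ordered_pairs(["D1Mit178", "Hoxd9"])
pairs_three = ordered_pairs(["D09Msw003", "D11Mit41", "D17Mit49"])

brain_tests = 72 * 2                        # 72 transcripts, two eQTLs each
kidney_tests = 17 * 2                       # 17 transcripts, two eQTLs each
liver_tests = 34 * 2 + len(pairs_three)     # 34 with two eQTLs, plus NM_017378
```

The liver total differs from twice the number of transcripts because NM_017378 is linked to three marker loci.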


Table 4-3 Summary of two-locus linkages verified by bqtl.twolocus in the BXD brain, kidney, and liver data. See Appendix A.7.1 for details of results.

                                                                                     Brain         Kidney        Liver
Number of transcripts with multiple linkages                                         72            17            35
Number (and %) of transcripts with evidence of two-locus effects                     70 (97.2%)    15 (88.2%)    34 (97.1%)
Number (and %) of transcripts with evidence of two-locus epistatic interactions      15 (20.8%)    2 (11.8%)     4 (11.4%)

Figure 4-6 Significance of LOD scores from two-locus linkage analyses in our 31 BXD Brain (left), Kidney (middle), and Liver (right) data. Tests where the epistatic terms are also significant (P ≤ 0.05) are highlighted in red. Dotted lines mark P=0.05.


For example, in the brain data, the transcript NM_019952 (cardiotrophin-like cytokine factor 1; Clcf1) is linked independently to markers D1Mit178 (Chr1) and Hoxd9 (Chr2), and so two separate two-locus effects are assessed:

expression level of Clcf1 ~ D1Mit178 * Hoxd9 …(1)

and

expression level of Clcf1 ~ Hoxd9 * D1Mit178 …(2)

In (1), D1Mit178 is treated as the primary locus and the null hypothesis is that the addition of Hoxd9 does not improve the regression fit of Clcf1’s expression values to the marker genotypes (i.e. does not improve the test statistic). In (2) a different null hypothesis is tested: the addition of D1Mit178 to Hoxd9 does not improve the LOD score.
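To illustrate why the ordering of the two loci matters, the sketch below compares a full two-locus model against a primary-locus-only model. Note that this substitutes an ordinary least-squares likelihood ratio for the Bayesian posterior probabilities used by bqtl.twolocus, so it is a stand-in for the idea and not the method itself; the function names and the 0/1 genotype coding are assumptions.

```python
from math import log10
from statistics import mean


def rss_by_group(y, keys):
    """Residual sum of squares after fitting one mean per genotype group."""
    groups = {}
    for value, key in zip(y, keys):
        groups.setdefault(key, []).append(value)
    return sum((v - mean(vals)) ** 2 for vals in groups.values() for v in vals)


def two_locus_lod(y, primary, secondary):
    """Least-squares LOD of y ~ L1*L2 against y ~ L1 (assumes a non-zero RSS).

    With biallelic 0/1 genotypes the full model fits one mean per
    (primary, secondary) genotype combination.
    """
    rss_single = rss_by_group(y, primary)
    rss_full = rss_by_group(y, list(zip(primary, secondary)))
    return (len(y) / 2) * log10(rss_single / rss_full)
```

Swapping the roles of the two loci changes only the single-locus model in the denominator of the comparison, so `two_locus_lod(y, a, b)` and `two_locus_lod(y, b, a)` generally differ, mirroring the distinction between models (1) and (2).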

In a majority of cases, evidence of two-locus linkage is observed regardless of the ordering of marker loci; that is, significance is found when either locus of a pair is treated as the primary effector (See Appendix A.7.1 for complete list of LOD scores and P-values from bqtl.twolocus). Conversely, for six transcripts in the Brain, three in Kidney, and eight in Liver, evidence of two-locus effects is only observed when a specific marker is treated as the primary locus. Using Clcf1 as an example, the joint influence of D1Mit178 and Hoxd9 is only observed if Hoxd9 is treated as the primary locus; i.e. (2) is significant but (1) is not.

In our assessment for two-locus influence, an arbitrary P=0.05 threshold has been adopted (Figure 4-6 dotted lines). However, if a pair of loci is the only or major eQTLs influencing an expression trait, then this locus-pair should, in principle, score the best LOD statistic and hence the most significant P-value. Since P-values are estimated from empirical null distributions (see Section 4.1.2), the most significant P = 1/M, where M is the number of markers on the genome (779 in this study). Examination of the P-values reveals that 45 (62.5%), 5 (29.4%), and 14 (40%) of the transcripts with multiple linkages have the best LOD score in each case (Appendix A.7.1), and so have evidence to suggest that their corresponding locus-pairs, identified from our previous single-locus mapping studies, are more likely to be associating with each other than with other markers.

Because each full two-locus model takes into account both additive and epistatic interactions, a significant LOD score could have been driven solely by the additive effect. To examine whether any locus-pair influences expression traits epistatically, we assessed the coefficient terms (L1:L2) for significance. Of the 119 transcripts with significant two-locus effects, 21 also have significant epistatic terms at P ≤ 0.05 (Table 4-3; Figure 4-6 red circles; Appendix A.7.1).

These locus-pairs are interesting because they suggest that the influences of the two loci on the corresponding expression trait are likely to be dependent on each other. We, therefore, decided to examine these in more detail. Unfortunately, as with all QTL-mapping studies, identifying the causative functional variant is extremely difficult because the region of a mapped eQTL is usually relatively large. The genomic regions spanned by the 21 locus-pairs showing evidence of interaction average 6Mb (Table 4-4). Such a large genomic region may harbour tens to hundreds of genes and potentially thousands of polymorphisms, any of which could be the, or one of many, causative variant(s) influencing the expression trait. Thus, as expected, identifying the genes nearest to the mapped marker loci fails to impart any meaningful or obvious insight into the variants influencing the corresponding transcripts.


Table 4-4A Transcripts mapped to locus-pairs showing significant interaction in the BXD brain study. Loc: location in Mb.

Accession ID | Gene name | Gene symbol | Gene Loc | Primary locus: Marker | Loc | Nearest gene | Symbol | Secondary locus: Marker | Loc | Nearest gene | Symbol
NM_023202 | NADH dehydrogenase (ubiquinone) 1 alpha | Ndufa7 | Chr17: 33.4 | D1Mit134 | Chr1: 80.3-86.0 | cullin 3 | Cul3 | D8Mit3 | Chr8: 19.8-24.0 | a disintegrin and metalloprotease domain 18 | Adam18
AK013280 | PRP18 pre-mRNA processing factor 18 homolog | Prpf18 | Chr2: 4.5 | D1Mit216 | Chr1: 77.6-80.8 | serine (or cysteine) proteinase inhibitor clade | Serpine2 | D8Mit339 | Chr8: 38.5-46.2 | fibroblast growth factor 20 | Fgf20
AK018857 | RIKEN clone: 1700052M18 | 1700052M18Rik | Chr11: 102.5 | D1Mit134 | Chr1: 80.3-86.0 | cullin 3 | Cul3 | D11Mit208 | Chr11: 57.4-61.8 | olfactory receptor 324 | Olfr324
AK020531 | RIKEN clone: 9430099N05 | 9430099N05Rik | Chr12: 110.9 | D1Mit134 | Chr1: 80.3-86.0 | cullin 3 | Cul3 | D19Mit109 | Chr19: 3.4-7.4 | RIKEN clone: 1700123I01 | 1700123I01Rik
AF204174 | blood vessel epicardial substance | Bves | Chr10: 45.0 | D1Mit134 | Chr1: 80.3-86.0 | cullin 3 | Cul3 | D19Mit68 | Chr19: 0.0-6.0 | low density lipoprotein receptor-related protein | Lrp5
AK019601 | inositol polyphosphate-4-phosphatase type I | Inpp4a | Chr1: 37.2 | D1Mit134 | Chr1: 80.3-86.0 | cullin 3 | Cul3 | D19Mit68 | Chr19: 0.0-6.0 | low density lipoprotein receptor-related protein | Lrp5
NM_025362 | Williams-Beuren syndrome critical region 18 | Wbscr18 | Chr15: 135.3 | D1Mit134 | Chr1: 80.3-86.0 | cullin 3 | Cul3 | D19Mit68 | Chr19: 0.0-6.0 | low density lipoprotein receptor-related protein | Lrp5
NM_025436 | sterol-C4-methyl oxidase-like | Sc4mol | Chr8: 67.6 | D1Mit134 | Chr1: 80.3-86.0 | cullin 3 | Cul3 | D19Mit68 | Chr19: 0.0-6.0 | low density lipoprotein receptor-related protein | Lrp5
NM_028889 | EF hand domain containing 1 | Efhd1 | Chr1: 89.1 | D1Mit134 | Chr1: 80.3-86.0 | cullin 3 | Cul3 | D19Mit68 | Chr19: 0.0-6.0 | low density lipoprotein receptor-related protein | Lrp5
X14625 | immunoglobulin kappa chain variable 23 (V23) | Igk-V23 | Chr6: 69.7 | D2Mit6 | Chr2: 19.3-20.9 | Rho GTPase activating protein 21 | Arhgap21 | D8Mit128 | Chr8: 46.2-61.6 | SH3 multiple domains 2 | Sh3rf1
BC006733 | WW C2 and coiled-coil domain containing 1 | Wwc1 | Chr11: 35.7 | D4Mit203 | Chr4: 125.7-131.9 | lymphocyte protein tyrosine kinase | Lck | Mpmv13 | Chr5: 23.9-34.1 | engrailed 2 | En2
AK017208 | PDZ domain containing 3 | Pdzd2 | Chr15: 12.3 | D7Mit281 | Chr7: 97.7-100.0 | flavoprotein oxidoreductase MICAL2 isoform 1 | Mical2 | D14Mit185 | Chr14: 103.2-110.8 | glypican 6 | Gpc6
NM_007956 | estrogen receptor 1 (alpha) | Esr1 | Chr10: 5.3 | D11Mit41 | Chr11: 82.8-89.5 | hypothetical protein LOC217071 | Gm525 | D19Mit68 | Chr19: 0.0-6.0 | low density lipoprotein receptor-related protein | Lrp5
NM_018792 | histone H1-like protein in spermatids 1 | Hils1 | Chr11: 94.8 | D16Mit12 | Chr16: 35.0-42.2 | immunoglobulin superfamily member 11 | Igsf11 | S17Gnf013.500 | Chr17: 11.4-22.4 | plasminogen | Plg
AK014278 | aarF domain containing kinase 1 | Adck1 | Chr12: 88.9 | D19Mit68 | Chr19: 0.0-6.0 | low density lipoprotein receptor-related protein | Lrp5 | DXMsw076 | ChrX: 82.4-95.0 | plasmacytoma expressed transcript 2 | Pet2

Table 4-4B Transcripts mapped to locus-pairs showing significant interaction in the BXD kidney study. Loc: location in Mb.

Accession ID | Gene name | Gene symbol | Gene Loc | Primary locus: Marker | Loc | Nearest gene | Symbol | Secondary locus: Marker | Loc | Nearest gene | Symbol
AK014905 | RIKEN clone: 4921517B04 | 4921517B04Rik | Chr3: 146.9 | D2Mit102 | Chr2: 112.1-117.9 | zinc finger protein 770 | Zfp770 | D9Mit91 | Chr9: 34.0-40.8 | solute carrier family 37 (glycerol-3-phosphate transporter), member 2 | Slc37a2
NM_010864 | myosin Va | | Chr9: 74.9 | D13Mit18 | Chr13: 35.3-42.1 | phenylalanine-tRNA synthetase 2 (mitochondrial) | Fars2 | D17Mit187 | Chr17: 76.9-79.7 | CDC42 effector protein (Rho GTPase binding) 3 | Cdc42ep3

Table 4-4C Transcripts mapped to locus-pairs showing significant interaction in the BXD liver study. Loc: location in Mb.

Accession ID | Gene name | Gene symbol | Gene Loc | Primary locus: Marker | Loc | Nearest gene | Symbol | Secondary locus: Marker | Loc | Nearest gene | Symbol
AK014585 | unc-84 homolog A | Unc84a | Chr5: 139.5 | D1Mit216 | Chr1: 77.6-80.8 | serine (or cysteine) proteinase inhibitor clade | Serpine2 | D11Mit208 | Chr11: 57.4-61.8 | olfactory receptor 324 | Olfr324
NM_008301 | heat shock protein 2 | Hspa2 | Chr12: 77.3 | Src | Chr2: 154.6-159.7 | Rous sarcoma oncogene isoform 1 | Src | S08Gnf076.440 | Chr8: 70.3-81.6 | minichromosome maintenance deficient 5 cell | Mcm5
NM_008051 | fucosyltransferase 1 | Fut1 | Chr7: 45.5 | Mod2 | Chr7: 71.8-78.8 | malic enzyme complex, mitochondrial | Mod2 | D19Mit103 | Chr19: 52.3-54.3 | programmed cell death 4 | Pdcd4
U55459 | anti-DNA antibody immunoglobulin heavy chain IgG 363s.62 | - | | D13Msw109 | Chr13: 107.3-108.3 | kinesin family member 2A | Kif2a | D15Mit29 | Chr15: 69.4-78.9 | brain-specific angiogenesis inhibitor 1 | Bai1


Interestingly, we find in the Brain study that six of the 15 (40%) interacting locus-pairs are identical: Chr1: 80-86Mb and Chr19: 0-7Mb (Table 4-4). More intriguingly, the Chr1 locus is a linkage hotspot, while the Chr19 locus is very near another linkage hotspot in the BXD Brain study (Table 3-7). A speculative, but plausible, explanation for this observation is that there exist two interacting master regulators, harboured within the Chr1 and Chr19 regions, that jointly influence multiple expression traits. The fact that only six such transcripts have been identified here could be explained by the limited ability of the initial single-locus mapping approach to map multiple loci.

To investigate if our confirmation of two-locus effects could have been confounded by the bqtl.twolocus method, namely, by the manner in which we generate and use the null distributions, five simulation studies have been performed for each tissue (permB1…5, permK1…5, permL1…5; Section 2.3.4). Each of the 15 simulations is essentially identical to the real situation with the exception that expression values corresponding to each transcript are randomised across strains. Thus, in the brain data, a randomised set of the same 72 transcripts is used to test the same corresponding sets of eQTLs for two-locus effects; this is repeated five times and also for each of the Kidney and Liver data.
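The permutation design can be sketched as follows. This is an illustration of the randomisation step only, under assumed names and data layout; the actual procedure is described in Section 2.3.4.

```python
import random


def permute_across_strains(expression, rng):
    """Return the expression values of one transcript in a random strain order."""
    shuffled = list(expression)
    rng.shuffle(shuffled)
    return shuffled


def permuted_datasets(traits, n_replicates=5, seed=1):
    """Build n_replicates datasets; each transcript is shuffled independently.

    traits: dict mapping transcript ID to a list of values, one per strain.
    Marker genotypes are left untouched, which breaks any trait-genotype link
    while preserving the distribution of each expression trait.
    """
    rng = random.Random(seed)
    return [
        {tid: permute_across_strains(values, rng) for tid, values in traits.items()}
        for _ in range(n_replicates)
    ]
```

Each permuted dataset would then be passed through the same two-locus tests, using the same sets of eQTLs as in the real data.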

Table 4-5 Average numbers of multiple linkages and confirmed two-locus effects in the permB, permK, and permL studies. Since the simulated data are obtained from the three real BXD studies, the “Number of transcripts with multiple linkages” is identical to that from the real BXD data (see Table 4-3). Also see Appendix A.7.2.

                                                                                          permB         permK         permL
Number of transcripts with multiple linkages                                              72            17            35
Average number (and %) of transcripts with evidence of two-locus effects                  5.0 (7.0%)    1.4 (8.0%)    2.4 (7.0%)
Average number (and %) of transcripts with evidence of two-locus epistatic effects        3.0 (4.0%)    0.6 (3.5%)    1.4 (4.0%)


Because the test for the second locus is dependent on the first, and because the design of these permutation studies is such that no simulated traits are under any genetic control, the extent of significant two-locus effects identified from these studies should reflect the rate at which bqtl.twolocus falsely calls the first and/or second locus real. On average, <5% of the 72, 17, and 35 transcripts from the Brain, Kidney, and Liver data are expected to show false evidence of two-locus effects (Table 4-5). Analysis of the epistatic terms further shows that this term is, on average, significant at P ≤ 0.05 for ≤3% of the transcripts with multiple linkages in each of the Brain, Kidney, and Liver data. These results imply that, for the majority of the 124 transcripts with confirmed two-locus effects, expression is likely to be truly influenced by at least one of the two identified loci.

Recalling that certain markers are prone to linkages, probably because of biases in genotype frequencies within the population (discussed in Section 3.5), we further searched for locus-pairs spuriously, but repeatedly, identified from these simulation studies. No such locus-pair was found, thus implying there is no marker-pair within the 124 marker sets that is unusually prone to false two-locus interactions (see Appendix A.7.2 for a list of permB/K/L results with confirmable two-locus effects).

While these simulations have shown that the 119 locus-pairs verified for additive effects from our BXD studies are likely to be truly influenced by at least one eQTL and that the verification is unlikely to have been driven by spurious marker associations, they do not indicate whether the results could have been driven by only one of the two loci or by other non-genetic factors. Residual inter-array variations can influence multiple linkages because each array corresponds to one individual and so variations between arrays will resemble variations between individuals (Section 3.2.3). To address this we return to the sim0B, sim0K, and sim0L studies (Section 3.1.2 and Section 2.3.1). These studies are designed to model the real BXD data under the null hypothesis of no linkage; hence linkages identified from these studies are likely to have arisen as a result of systematic variation within the experimental system.

Genotype pattern-association, as mentioned in the previous section, can also influence multiple linkages and we explore this using the simnorm data (Section 2.3.2). Expression data in simnorm are sampled from normal distributions and so any confirmed two-locus effects would not have originated from expression trait variations. Instead, they would implicate the only other input parameter of an eQTL-mapping study, namely, marker genotype patterns, as the cause of spurious multiple linkage.

Table 4-6 Average numbers (%) of multiple linkages and confirmed two-locus effects in the sim0B, sim0K, sim0L, and simnorm studies. Results for each study are averages of five simulations.

                                                                                          Sim0B           Sim0K           Sim0L          simnorm
Average number of transcripts with multiple linkages                                      14.1            16.2            8.2            43.6
Average number (and %) of transcripts with confirmed two-locus effects                    12.8 (88.9%)    15.5 (93.8%)    7.2 (87.8%)    41 (94.0%)
Average number (and %) of transcripts with confirmed two-locus epistatic effects          0.6 (4.2%)      2.4 (14.8%)     0.8 (9.8%)     2.6 (6.0%)

Analyses of all 20 simulations (5 from each of sim0B, sim0K, sim0L, and simnorm) are conducted as outlined in Section 4.1. The first notable observation from the results (Table 4-6) is that the number of transcripts linked to ≥2 unique eQTLs, following application of remove.LD, is higher in simnorm than in sim0B/K/L. An explanation for this is that our real expression data are not normally distributed: many expression traits have M-values close to 0 (Figure 4-7), which implies that overall trait-variances are small and, consequently, that statistical power to detect linkage or multiple linkages is reduced.


Figure 4-7 M-value distributions of expression data from our real BXD brain (blue), kidney (red), and liver (green) studies, and from the five simnorm replicate studies (black).

Of the transcripts linked to two or more unique eQTLs, two-locus influence is confirmed at P ≤ 0.05 for 88%-94% of the transcripts in sim0B, sim0K, sim0L, and simnorm, on average (Table 4-6). Such a high rate of confirmed two-locus effects in these simulation studies suggests that many of the transcripts spuriously linked to multiple eQTLs will also be confirmed for two-locus effects using bqtl.twolocus. Because a secondary locus is tested only when the primary locus is already significantly associated with the simulated trait, the high verification rate of joint two-locus effects suggests bqtl.twolocus will tend to call a locus-pair significant if the first locus is strongly associated with the trait. Ideally, one would want to test the performance of bqtl.twolocus on validating the effect of secondary loci alone, but this is difficult without prior knowledge of the mechanism and magnitude of loci-interaction on expression traits.

The fact that false two-locus effects are confirmed in both the sim0 and simperm studies hints at the possibility that these secondary loci are being called significant due to some non-genetic association with the primary loci (i.e. linkage disequilibrium between marker loci that are not necessarily located near each other). If loci L1 and L2 are in association, then a spurious linkage at L1 will also induce a spurious linkage at L2. There are several reasons why L1 and L2 may be associated (i.e. “linked”), and to understand the behaviour of multiple linkages in the presence of loci association, we examine and categorise sets of multiple linkages based on associations between their corresponding genotype patterns in Chapter 5.
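The way an association between L1 and L2 propagates a spurious linkage can be illustrated with a small null simulation. The following Python sketch is not the sim0, simperm, or simnorm procedure used in this thesis. It assumes three hypothetical genotype patterns over 30 strains (L2 differs from L1 in two strains, L3 is nearly uncorrelated with L1), traits drawn from a standard normal distribution, and an arbitrary correlation cut-off (r_cut) standing in for a linkage threshold.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


N_STRAINS = 30
# Hypothetical genotype patterns, coded +1 / -1 for the two parental alleles.
L1 = [1] * 15 + [-1] * 15
L2 = list(L1)            # associated with L1: identical except in two strains
L2[0], L2[-1] = -1, 1
L3 = [1 if i % 2 == 0 else -1 for i in range(N_STRAINS)]  # nearly unassociated


def count_colinkage(n_traits=2000, r_cut=0.4, seed=1):
    """Count null traits 'linked' at L1, and those also linked at L2 or L3."""
    rng = random.Random(seed)
    counts = {"L1": 0, "L1_and_L2": 0, "L1_and_L3": 0}
    for _ in range(n_traits):
        # Null trait: no genetic effect is built in.
        trait = [rng.gauss(0.0, 1.0) for _ in range(N_STRAINS)]
        if abs(pearson(trait, L1)) < r_cut:
            continue
        counts["L1"] += 1
        if abs(pearson(trait, L2)) >= r_cut:
            counts["L1_and_L2"] += 1
        if abs(pearson(trait, L3)) >= r_cut:
            counts["L1_and_L3"] += 1
    return counts


counts = count_colinkage()
print(counts)
```

Under these assumptions, traits that are spuriously correlated with L1 are far more often also correlated with the associated locus L2 than with the unassociated locus L3, even though no trait has any genetic basis.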


4.4. Step-wise search for two-locus effects and its dependency on sample size

Clearly, the most obvious method for assessing whether a pair of eQTLs identified from a single-locus eQTL-mapping analysis is real is to perform a separate two-locus mapping analysis. For mapping QTLs of classical traits, various methods exist for this purpose: composite interval mapping (JANSEN 1993; ZENG 1993) and multiple interval mapping (KAO, ZENG, and TEASDALE 1999). In theory, these methods can be adapted for mapping expression QTLs, where multi-locus mapping is performed on each expression trait independently. Unfortunately, because these methods are already computationally demanding, extending them to tens of thousands of expression traits can be prohibitive.

Recently, methods for multi-locus mapping in genetical genomics settings have begun to emerge (STOREY, AKEY and KRUGLYAK 2005; KENDZIORSKI et al. 2006). The first such proposed method is a step-wise forward regression search that incorporates an FDR criterion for directly assessing the significance of joint linkage (STOREY, AKEY and KRUGLYAK 2005). For our purpose here, we implemented this method¹ to determine, independently of single-locus mapping, the extent of detectable multiple linkages in eQTL-mapping studies.

Running the first dataset, the BXD brain data, through this method resulted in no significant joint two-locus linkages. In fact, the estimated joint posterior probabilities for all transcripts were zero, with corresponding joint q-values equal to one (Figure 4-8 top right). A more detailed investigation of the results shows that, even for the detection of the primary eQTLs, estimated posterior probabilities remained <0.5 and no primary linkages were called significant at a false positive rate better than 20% (Figure 4-8 top left). In the search for secondary eQTLs, the best identified secondary locus for each transcript had an estimated posterior probability of zero (Figure 4-8 top right).

¹ The R scripts for this method were requested and obtained directly from the authors (Prof. J. Storey) via email correspondence in Sept 2006.

In their paper, the authors used a dataset consisting of 112 haploid yeast (STOREY, AKEY and KRUGLYAK 2005). Thus, the most obvious explanation for our negative results is poor statistical power due to limited sample size. To test this, we ran two additional analyses using their yeast data. The first analysis consists of a random subset of 31 haploid individuals and the second of a random subset of 86 haploid individuals from their dataset². The results support our suspicion: all transcripts have an estimated joint posterior probability of zero, with corresponding estimated q-values equal to one, in the n=31 dataset (Figure 4-8 middle panel); but in the n=86 data, joint posterior probabilities greater than zero and joint q-value estimates of less than one can be observed (Figure 4-8 bottom panel). It should be noted that, even in the n=86 scenario, no joint two-locus linkages were called significant at a false positive rate better than 20% (the minimum estimated q-value for joint two-locus linkage is ~0.28).

Together, these results demonstrate that, while the multi-locus mapping method proposed by STOREY, AKEY and KRUGLYAK (2005) is effective for data with relatively large sample size (n ≥ 112), an ab initio search for multiple loci is not feasible for the majority of existing eQTL-mapping data because limited sample size gives insufficient statistical power. This limitation likely explains the low eQTL detection rate in human studies so far (ROCKMAN and KRUGLYAK 2006). Furthermore, it was recently demonstrated that a two-staged two-locus approach is less powerful than an exhaustive pair-wise search for two-locus effects (EVANS et al. 2006). Thus, we conclude that, at present, our proposed method, bqtl.twolocus, may be the best compromise for validating potential multiple linkages identified from existing single-locus mapping analyses. Though it cannot identify multi-locus effects, it allows investigators to quickly prioritise multiple linkages that have a higher likelihood of being real.

² Expression and genotype data corresponding to the 86, of the 112, yeast segregants were obtained through email correspondence between the authors (Dr. R. Brem) and a colleague in July 2004.


Figure 4-8 Estimated false positive rates and posterior probabilities of primary locus and joint two-locus effects in datasets using brain tissues of 31 BXD RI mice (A); 31 haploid yeast (B); and 86 haploid yeast (C). Estimates are determined using the multi-locus mapping method of STOREY, AKEY and KRUGLYAK (2005).


4.5. Meta-analysis of multiple linkages

To further understand whether multiple eQTLs independently identified from single-locus mapping analyses do jointly influence mRNA expression levels, we extended our analyses using seven other expression datasets, corresponding to four publicly available eQTL experiments as detailed in Table 4-7. Henceforth, the ten datasets will be denoted as follows:

CotB/K/L: our original BXD brain, kidney, and liver data (COTSAPAS 2005)
ChesB: mRNAs from forebrain, also from the BXD panel (CHESLER et al. 2005)
BysHSC: mRNAs from hematopoietic stem cells using the BXD panel (BYSTRYKH et al. 2005)
HubK/FC: kidney and fat cell data from the BXH/HXB RI rat panel (HUBNER et al. 2005)
YveY3/5: haploid yeast studies from fluor-reversed microarray experiments (YVERT et al. 2003)
SchL: from an F2 panel derived from the same founders as the BXD RI lines (SCHADT et al. 2003)

Overall, many variations exist between these five studies (Table 4-7), including the type of experimental cross, the number of segregating individuals (sample size) used, sources of mRNAs, microarray platforms, and analytical techniques used by the respective authors to determine eQTL linkage. The aims here are to identify and validate two-locus effects and to examine if and how these variables may influence our findings.

Using this collection of diverse studies, we set out to address two main questions:

1) Is the extent of multiple linkages observed in our BXD datasets similar to that observed in other studies, or could multiple linkages have been driven by some unknown systematic variations within our data?

2) Is the extent of confirmable two-locus interactions in our studies comparable to that observed in other studies?


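Table 4-7 Summary of eQTL-mapping datasets used in meta-analysis. First column lists the symbols used for each dataset. The mouse and rat data are derived from recombinant inbred (RI) experimental crosses. Sample sizes are the number of unique strains (or haploids) used by the corresponding investigators (see Section 2.2 for details).

Study label  Sample size  Experimental cross          Tissue                    Microarray platform                                    Normalisation methods                        Reference
CotB         31           BXD RI mice                 Whole brain               Dual colour spotted oligonucleotide arrays             Loess print-tip then quantile normalisation  COTSAPAS et al. 2006
CotK         31           BXD RI mice                 Kidney                    Dual colour spotted oligonucleotide arrays             Loess print-tip then quantile normalisation  COTSAPAS et al. 2006
CotL         31           BXD RI mice                 Liver                     Dual colour spotted oligonucleotide arrays             Loess print-tip then quantile normalisation  COTSAPAS et al. 2006
ChesB        32           BXD RI mice                 Forebrain                 Affymetrix single-colour short-oligonucleotide arrays  RMA-normalisation                            CHESLER et al. 2005
BysHSC       22           BXD RI mice                 Hematopoietic stem cells  Affymetrix single-colour short-oligonucleotide arrays  RMA-normalisation                            BYSTRYKH et al. 2005
HubK         30           BXH/HXB RI rat              Kidney                    Affymetrix single-colour short-oligonucleotide arrays  RMA-normalisation                            HUBNER et al. 2005
HubFC        30           BXH/HXB RI rat              Fat cells                 Affymetrix single-colour short-oligonucleotide arrays  RMA-normalisation                            HUBNER et al. 2005
YveY3        86           BYxRM haploid yeast         Whole cell                Dual colour spotted cDNA microarray                    Loess print-tip then scale normalisation     YVERT et al. 2003
YveY5        86           BYxRM haploid yeast         Whole cell                Dual colour spotted cDNA microarray                    Loess print-tip then scale normalisation     YVERT et al. 2003
SchL         111          C57BL/6J x DBA/2J F2 mice   Liver                     Agilent dual colour oligonucleotide arrays             Mean-adjustment                              SCHADT et al. 2003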


4.5.1. Identifying transcripts linked to at least one eQTL

Using single-locus mapping analyses, between 3% and 22% of transcripts across the ten datasets are found to link to at least one eQTL (Table 4-8) at PGWS ≤ 0.05 (Section 3.1.1). Differences between these studies may influence the extent of identifiable linkages, and so we examine these in turn.

Experimental cross and sample size

It appears the proportions of mappable transcripts from studies using RI strains (CotB, CotK, CotL, ChesB, BysHSC, HubK, HubFC) are lower than those using either haploid yeasts (YveY3 and YveY5) or F2 mice (SchL). Population sizes of RI panels are generally small owing to the limited numbers of available strains from these experimental crosses. This restriction ultimately limits statistical power because data with small sample sizes are more sensitive to outliers.

Point-wise p-value threshold

Although a common Bonferroni threshold of 5% is used (Praw ≤ PBON), the corresponding point-wise threshold, PBON = 0.05/M, varies across studies because different numbers of markers, M, are used. In the BXD mouse studies, a total of 779 markers are used, corresponding to Praw ≈ 0.00006; in the RI rat study, M = 558 so Praw ≈ 0.00009; in the yeast study, M = 526 so Praw ≈ 0.0001; and in the F2 study, M = 134 so Praw ≈ 0.0004. That is, a slightly more stringent point-wise Praw threshold has been set in the BXD studies, which may have reduced the extent of observed linkages.
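The point-wise thresholds quoted above follow directly from PBON = 0.05/M. A minimal Python sketch of the arithmetic is given below; the dictionary name and the cross labels are illustrative only, while the marker counts are those stated in the text.

```python
# Number of markers, M, used in each experimental cross (values from the text).
MARKERS = {
    "BXD RI mice": 779,
    "BXH/HXB RI rat": 558,
    "BYxRM haploid yeast": 526,
    "F2 mice": 134,
}


def bonferroni_threshold(n_markers, alpha=0.05):
    """Point-wise p-value threshold giving a Bonferroni-corrected level alpha."""
    return alpha / n_markers


thresholds = {cross: bonferroni_threshold(m) for cross, m in MARKERS.items()}

for cross, p in thresholds.items():
    print(f"{cross}: M = {MARKERS[cross]}, point-wise threshold = {p:.1e}")
```

The fewer markers a study uses, the more lenient its point-wise threshold, which is why the F2 study has the largest threshold of the four crosses.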


Microarray platform and pre-processing method

Eight of the ten datasets have been normalised with quantile-normalisation (see Section 1.4.1 for a review; BOLSTAD et al. 2003; SMYTH and SPEED 2003). Quantile-normalisation is a method of standardising between arrays to allow unbiased comparisons between expression levels measured using different arrays. For CotB/K/L and SchL, quantile-normalisation was applied following other within-array normalisation procedures (Section 2.1.3). For all Affymetrix data, pre-processing has been performed using RMA (robust multi-array analysis; IRIZARRY et al. 2003), which also utilises quantile-normalisation. Only in the yeast data (YveY3 and YveY5) was quantile-normalisation not performed. Instead, between-array scale-adjustment has been applied to remove potential inter-array variations. Thus, the higher proportion of transcripts found to be linked in the yeast data could imply that quantile-normalisation is more robust than scale-adjustment as a method of between-array normalisation.
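For readers unfamiliar with the procedure, the core idea of quantile-normalisation can be sketched in a few lines of Python. This is a simplified illustration and not the implementation used for any of the ten datasets: it assumes complete data (no missing values), breaks ties by position rather than averaging them, and the function name and toy values are hypothetical.

```python
def quantile_normalise(arrays):
    """Force every array to share the same empirical distribution.

    arrays: list of equal-length lists, one list of expression values per array.
    Returns a new list of lists in the same layout.
    """
    n_arrays = len(arrays)
    n_probes = len(arrays[0])
    # Reference distribution: mean of the r-th smallest value across arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(col[r] for col in sorted_arrays) / n_arrays for r in range(n_probes)
    ]
    normalised = []
    for a in arrays:
        # Probe indices ordered from lowest to highest expression on this array.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalised.append(out)
    return normalised


toy = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.0, 2.0], [3.0, 4.0, 6.0, 8.0]]
normalised = quantile_normalise(toy)
for row in normalised:
    print(row)
```

After normalisation every array contains exactly the same set of values, so any remaining differences between arrays lie only in which probes occupy which ranks. A scale-adjustment, by contrast, equalises only the spread of each array and leaves the shapes of the distributions free to differ.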


Table 4-8 Extent of linked transcripts across the ten eQTL-mapping analyses. First row shows the numbers and proportions of total transcripts with at least one linkage. Second row shows the numbers and proportions of transcripts with at least one linkage that are linked to two or more eQTLs, prior to the application of remove.LD. The last row shows the numbers and proportions of transcripts with at least one linkage that are linked to two or more eQTLs, after application of remove.LD. The results from this table are also presented in Figure 4-9.

                                  CotB    CotK    CotL    ChesB   BysHSC  HubK    HubFC   YveY3   YveY5   SchL
At least one eQTL                 1781    670     855     407     807     1982    1124    1325    1220    2462
                                  8.2%    3.1%    3.9%    3.3%    6.5%    12.5%   7.1%    21.3%   19.6%   15.5%
Multiple eQTLs before remove.LD   829     223     294     180     381     1057    491     885     833     1241
                                  46.6%   33.3%   34.4%   44.2%   47.2%   53.3%   43.7%   66.8%   68.3%   50.4%
Multiple eQTLs after remove.LD    72      17      35      19      45      402     227     123     93      200
                                  4.0%    2.5%    4.1%    4.7%    5.6%    20.3%   20.2%   9.3%    7.6%    8.1%


4.5.2. Identifying transcripts linked to two or more unique eQTLs

Of the transcripts linked to at least one eQTL, between 30% and 70% are linked to multiple genomic regions (Table 4-8). This amount is reduced to less than 21% after remove.LD is applied to remove potentially LD-induced linkages (Table 4-8; Figure 4-9).

[Figure 4-9: bar chart. y-axis: % transcripts with multiple linkages (0% to 100%); x-axis: CotB, CotK, CotL, ChesB, BysHSC, HubK, HubFC, YveY3, YveY5, SchL; series: before, after.]

Figure 4-9 Proportions of transcripts with multiple linkages before and after remove.LD is applied to remove potentially LD-induced linkages. A tabulated version of the results in this figure is presented in Table 4-8.

In a dense marker map, more occurrences of LD are expected because markers are positioned more closely together, and so eQTL mapping studies that use denser marker maps are anticipated to have more LD-induced multiple linkages. Conversely, studies that use sparser marker maps, such as that used in SchL (F2) (Figure 4-10), are expected to have fewer LD-induced linkages. That is, the reduction in the proportion of transcripts with unique multiple linkages would be expected to be smaller in SchL than in the other nine datasets, but this is not observed (Figure 4-9). This unexpected result may be due to unevenly spaced markers along the genome: many LD-induced multiple linkages in SchL are likely to have occurred at closely located markers (Figure 4-10 red boxes).

Interestingly, the proportion of linkages attributed to LD, and thus removed, in the two rat datasets (HubK and HubFC) is observably lower than in the other datasets. On average, the distribution and density of markers in the RI rat cross are similar to those in the RI mouse cross (Figure 4-10), which means the ~20% of transcripts with unique multiple linkages after application of remove.LD is probably not due to fewer LD-induced linkages, but to more non-LD-induced linkages. Biologically, this could imply that transcripts are more likely to be regulated by multiple eQTLs in the RI rat system. Conversely, it may indicate the presence of some unknown sources of variation causing spurious multiple linkages in these two datasets (further discussed in Subsection: HubK and HubFC at the end of Section 4.5.3).


[Figure 4-10: four panels. Top left: BXD RI mice, location (Mb); top right: BXH/HXB RI rat, location (Mb); bottom left: BYxRM haploid yeast, location (kb); bottom right: F2 mice, location (Mb).]

Figure 4-10 Physical maps of markers used in the four experimental crosses. Marker maps (x-axes) are represented in megabases (Mb) in the three rodent crosses, and in kilobases (kb) in the yeast study. In the BXD data (top left), genome size is ~2,531Mb with 779 markers, thus marker density is ~3.2Mb/marker. In the RI rat data (top right), genome size is ~2,672Mb with 558 markers, thus marker density is ~4.8Mb/marker. In the yeast data (bottom left), genome size is ~11.6Mb with 526 markers, thus marker density is ~22kb/marker. In the F2 study (bottom right), genome size is ~2,320Mb (ChrX is omitted as no markers were typed on ChrX) with 134 markers, thus marker density is ~17Mb/marker. Note: marker map for the BXD panel (top left) is replicated from Figure 3-8.
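The average marker spacings quoted in the caption of Figure 4-10 are simply genome size divided by number of markers. The short Python sketch below reproduces that arithmetic; the variable names are illustrative and the figures are the approximate values stated in the caption.

```python
# (approximate genome size in Mb, number of markers), as given in Figure 4-10.
CROSSES = {
    "BXD RI mice": (2531.0, 779),
    "BXH/HXB RI rat": (2672.0, 558),
    "BYxRM haploid yeast": (11.6, 526),
    "F2 mice": (2320.0, 134),
}


def mb_per_marker(genome_mb, n_markers):
    """Average spacing between markers in megabases, assuming even coverage."""
    return genome_mb / n_markers


spacing = {name: mb_per_marker(size, m) for name, (size, m) in CROSSES.items()}

for name, mb in spacing.items():
    print(f"{name}: {mb:.3f} Mb/marker ({mb * 1000:.0f} kb/marker)")
```

These are averages only; as noted in the text, markers are unevenly spaced, so local density can differ considerably from these values.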

As with CotB/K/L (Section 4.2), a large majority (~91%) of transcripts are defined as linked only to one unique eQTL after applying remove.LD (Figure 4-11; see Table A-6 in Appendix A.2 for tabulated version of this figure). On average, ~8% of transcripts are mapped to two unique eQTLs and less than 1% to three or more eQTLs. A maximum of five unique eQTLs are linked to a transcript in the HubFC study. Again, the difference in the RI rat data is noteworthy.


[Figure 4-11: stacked horizontal bar chart. Bars: CotB, CotK, CotL, ChesB, BysHSC, HubK, HubFC, YveY3, YveY5, SchL; x-axis: % transcripts with at least one linkage (0% to 100%); series: 1 eQTL, 2 eQTLs, 3 eQTLs, 4 eQTLs, 5 eQTLs.]

Figure 4-11 Proportions of transcripts linked to 1, 2, 3, 4, or 5 unique eQTLs after applying remove.LD. A tabulated version of this figure can be found in Table A-6 in Appendix A.2. Note: results for CotB, CotK, and CotL are duplicated from parts of Figure 4-5.


4.5.3. Verification of two-locus interactions in ten eQTL datasets

Biologically, the most obvious explanation for multiple linkages is co-regulation of gene expression. Multiple regulators (or eQTLs) may interact additively (independently) or epistatically (dependently) in their influence on a transcript’s expression level. In principle, linkage of a transcript at one or more such eQTLs may be detected in a single-locus mapping analysis. In this section we use bqtl.twolocus (Section 4.1.2) to examine whether a set of multiple eQTLs independently linked to a transcript is truly involved in the co-regulation of the transcript’s expression level. This subsection presents the results of applying this validation method to the ~9% of transcripts with at least two independent linkages across the ten datasets.
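The distinction between additive and epistatic two-locus influence can be made concrete with a small regression sketch. The Python code below is not bqtl.twolocus (which is described in Section 4.1.2); it is a generic least-squares illustration under stated assumptions: genotypes at two loci are coded -1/+1 as for homozygous lines, the additive model is y = b0 + b1·g1 + b2·g2, the epistatic model adds the term b12·g1·g2, and support for each step is summarised as LOD = (n/2)·log10(RSS_reduced/RSS_full). All function names and the toy data are hypothetical.

```python
import math


def solve(a, b):
    """Solve the small linear system a.x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def rss(columns, y):
    """Residual sum of squares of the least-squares fit of y on the columns."""
    k = len(columns)
    xtx = [[sum(p * q for p, q in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * q for p, q in zip(columns[i], y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(beta[j] * columns[j][i] for j in range(k)) for i in range(len(y))]
    return sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))


def two_locus_lods(y, g1, g2):
    """LOD for the additive model vs no effect, and for epistasis vs additive."""
    n = len(y)
    ones = [1.0] * n
    inter = [a * b for a, b in zip(g1, g2)]
    rss_null = rss([ones], y)
    rss_add = rss([ones, g1, g2], y)
    rss_full = rss([ones, g1, g2, inter], y)
    return {
        "additive_vs_null": (n / 2) * math.log10(rss_null / rss_add),
        "epistatic_vs_additive": (n / 2) * math.log10(rss_add / rss_full),
    }


# Toy data: eight individuals covering all four two-locus genotype classes.
g1 = [-1, -1, -1, -1, 1, 1, 1, 1]
g2 = [-1, -1, 1, 1, -1, -1, 1, 1]
noise = [0.1, -0.1, -0.1, 0.1, 0.1, -0.1, -0.1, 0.1]
y_additive = [a + 0.5 * b + e for a, b, e in zip(g1, g2, noise)]
y_epistatic = [a + 0.5 * b + a * b + e for a, b, e in zip(g1, g2, noise)]

lods_additive = two_locus_lods(y_additive, g1, g2)
lods_epistatic = two_locus_lods(y_epistatic, g1, g2)
print(lods_additive)
print(lods_epistatic)
```

In this toy example the purely additive trait gains nothing from the interaction term, whereas the epistatic trait shows a large improvement when the interaction term is added. How significance is assigned to such statistics in practice is a separate matter that this sketch does not address.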

In most datasets, over 80% of transcripts have validated two-locus linkages (Figure 4-12); the exceptions are HubK and, particularly, HubFC, where this percentage is noticeably smaller (see below in Subsection: HubK and HubFC).


[Figure 4-12: bar chart. y-axis: % transcripts linked to multiple unique eQTLs (0% to 100%); x-axis: CotB, CotK, CotL, ChesB, BysHSC, HubK, HubFC, YveY3, YveY5, SchL; series: two-locus effect, two-locus epistatic effect.]

Figure 4-12 Proportions of transcripts with multiply-linked eQTLs showing evidence of two-locus interactions and two-locus epistatic interactions. See Table A-16 in Appendix A.7.3 for tabulated version of this figure.

Figure 4-13 Boxplot summary of the LOD scores from two-locus linkage validation tests across the ten eQTL-mapping data.


As with CotB/K/L, only a small proportion of validated locus-pairs have verified significant epistatic interaction terms in the other seven datasets (~11.3% on average). No epistatic effects are found for ChesB, which is not surprising as only a small number of transcripts (19) are initially input to bqtl.twolocus. If the ability of single-locus linkage analysis to identify two-locus epistatic effects is ~11.3%, then the expected number of the 19 multiple linkages in ChesB with verified epistatic association is small: 19 × 11.3% ≈ 2.
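The expectation above can be supplemented with the probability of observing no epistatic effects at all. The Python sketch below assumes, purely for illustration, that each of the 19 multiple linkages is independently confirmed for epistasis with probability 0.113; this binomial assumption is not made in the thesis itself and the function names are hypothetical.

```python
def expected_count(n, p):
    """Expected number of successes in n independent trials with probability p."""
    return n * p


def prob_none(n, p):
    """Probability of zero successes in n independent trials with probability p."""
    return (1.0 - p) ** n


N_LINKAGES = 19      # multiple linkages input to bqtl.twolocus for ChesB
EPISTASIS_RATE = 0.113   # average rate across the other datasets (from the text)

expected = expected_count(N_LINKAGES, EPISTASIS_RATE)
p_zero = prob_none(N_LINKAGES, EPISTASIS_RATE)
print(f"expected epistatic locus-pairs: {expected:.2f}")
print(f"probability of observing none:  {p_zero:.3f}")
```

Under this assumption roughly one dataset in ten of this size would show no confirmed epistatic effects by chance, so the absence of epistasis in ChesB is compatible with the average rate seen elsewhere.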

While the extent of validated two-locus effects is comparable between studies, the extent of epistatic influence is higher in the F2 mouse (SchL) study. This is likely a consequence of improved statistical power associated with heterozygosity that is lacking in the other nine studies. Another factor associated with improved statistical power in multi-locus linkage analysis is increased sample size, which is evident in the increased range of LOD scores in the yeast (YveY3 and YveY5) and SchL studies (Figure 4-13). However, the fact that there is not much improvement in the rate of confirmed two-locus epistatic effect in the yeast studies suggests heterozygosity is more effective in improving identification of epistatic effects than increased sample size.

As noted earlier in Table 4-7, the F2 animals used in the SchL data were derived from the same parental strains used to generate the BXD data. We therefore compared the results from SchL, which examined liver tissue, with the BXD liver (CotL) results. Of the four transcripts mapped to confirmed interacting locus-pairs in CotL (Table 4-4C), two (Fut1 and Hspa2) were also represented in the SchL microarray data. However, neither is linked to more than two eQTLs in the SchL data.

Although the results here do not agree between these two studies, PEIRCE et al. (2006) have shown that ~36.2% of eQTLs are replicable between studies using brain tissues from BXD RI lines and BXD F2 lines.


The discrepancies observed here between these two studies may be attributable to a combination of factors, in addition to those outlined in PEIRCE et al. (2006). Firstly, the BXD lines are inbred animals that have undergone intense selection pressure, which may have led to joint selection of unlinked alleles; i.e. obligatory interacting loci (see Section 5.4). Secondly, there is much less meiotic recombination in F2 crosses, which can both prevent precise mapping of eQTLs and underestimate their effect sizes; i.e. eQTLs mapped in BXD lines may be missed in F2 studies. Thirdly, and on a more technical note, different microarray platforms and probe libraries were used in these two studies (Cot and Sch) (Table 4-7); thus, although the same ‘gene’ may be represented on both arrays, the actual transcript may differ.


HubK and HubFC

Many transcripts with multiple independently linked eQTLs (i.e. multiple linkages identified from single-locus analyses) from these two studies cannot be confirmed for two-locus influence using bqtl.twolocus (Figure 4-12). For those 62% (HubK) and 33% (HubFC) of transcripts that do have evidence of two-locus influence (Figure 4-12), their corresponding LOD scores are slightly lower than other datasets (Figure 4-13), suggesting these locus-pairs only explain a relatively small amount of trait-variance. We investigate these unusual results by examining several known variables, including sample size and transcript expression level, and predict the lack of confirmed two-locus effect to be caused by genotype pattern-similarity (Chapter 5).

The number of strains used in these RI rat studies (n = 30) is similar to that used in the RI mouse studies (Table 4-7), so the lack of evidence for two-locus effects is unlikely to be due to limited statistical power associated with small sample size.

The relatively small proportions of confirmed two-locus effects in these two studies are unlikely to have been driven by unduly variable trait data because expression trait values from these datasets are not especially variable when compared with other studies (Figure 4-14). Variation associated with low signal level is also unlikely to be a source of artefactual multiple linkage because the average expression levels of the 402 (HubK) and 227 (HubFC) transcripts with multiple independent linkages are higher than those of other transcripts and of the BioB controls (recommended lower limit of detection†) from the corresponding datasets (Figure 4-15).

† This recommendation was provided by Dr Robert Henke from Millennium Sciences


Figure 4-14 Boxplot summary of expression variations (standard deviations of expression values across all individuals) of transcripts linked at multiple eQTLs.

Figure 4-15 Average expression levels in HubK (left) and HubFC (right) of all 15,866 transcripts (black); 404 (HubK) or 227 (HubFC) transcripts with multiple linkages (red); and Affymetrix BioB hybridisation controls (green). BioB controls are used as the recommended lower limits of detection in Affymetrix data. The abnormally high BioB levels in HubFC suggest the spike-in controls were not used by HUBNER et al. (2005).


Figure 4-16 Number of the 404 and 227 transcripts, in HubK (top) and HubFC (bottom), respectively, with multiple linkages, linked at each marker locus (x-axis). Grey dotted lines mark the chromosome boundaries.

Recall that the extent of multiple linkages identified from single-locus mapping analyses using HubK and HubFC is unusually high (Table 4-8 and Section 4.5.2). Therefore, it is possible that fewer two-locus influences are confirmed because the initial sets of identified multiple linkages are spurious. To address this, we examined the sets of marker loci linked to the 402 and 227 transcripts in HubK and HubFC. Surprisingly, there appears to be an over-representation of transcripts linked to several regions on chromosome 4 (Figure 4-16). Particularly in the HubFC data, 36% of the 227 transcripts with multiple linkages are linked to at least two of the four markers on Chr4 indicated in the bottom plot of Figure 4-16.

If a large number of transcripts are genuinely influenced by an identical set of eQTLs, we would expect all these transcripts to show linkage at each of the marker loci corresponding to this set of eQTLs. We would also expect these eQTLs to demonstrate significant two-locus interactions. However, of the 82 transcripts in HubFC linked to at least two of the four marker loci on Chr4, only two have confirmed two-locus effects.

The lack of confirmed multi-locus effects points to association between genotype patterns, rather than genuine multi-locus interaction, as a source of multiple linkages in single-locus mapping analyses. A pair of marker loci, L1 and L2, with similar genotype patterns has the potential to be linked to the same expression trait because linkage is determined by the strength of correlation between an expression trait and the marker’s genotype pattern. If L1 and L2 have similar genotype patterns then a significant correlation between the trait and L1 will imply similar correlation significance between the trait and L2 (see Section 4.1.1 and Section 2.4 for discussion of genotype pattern similarity). An examination of the genotype patterns corresponding to the four Chr4 marker loci in HubFC reveals considerable similarity (each marker locus in the BXH/HXB rat panel is either homozygous B or homozygous H):

D4Mit19:   BHHHHHBHBHHHBHHHBBHBBBBHHHHHHH
D4Rat240:  BHHHHHBHBHHHBHHHBBHBBBBHHHBHHH
Cacna1s:   BHHHHHBHHHHHBHHHBBHBBBBHHHHHHH
D4Cer7s17: BHHHHHBHHHHHHHHHBBHBBBBHHHHHHH
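The similarity between these four genotype patterns can be quantified directly. The Python sketch below uses the patterns exactly as listed above, codes B as 1 and H as 0 (an arbitrary choice that does not affect the correlation), and reports the number of mismatching strains and the Pearson correlation for every marker pair. The helper names are illustrative.

```python
import math
from itertools import combinations

PATTERNS = {
    "D4Mit19":   "BHHHHHBHBHHHBHHHBBHBBBBHHHHHHH",
    "D4Rat240":  "BHHHHHBHBHHHBHHHBBHBBBBHHHBHHH",
    "Cacna1s":   "BHHHHHBHHHHHBHHHBBHBBBBHHHHHHH",
    "D4Cer7s17": "BHHHHHBHHHHHHHHHBBHBBBBHHHHHHH",
}


def encode(pattern):
    """Code homozygous B as 1 and homozygous H as 0."""
    return [1 if allele == "B" else 0 for allele in pattern]


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mismatches(p, q):
    """Number of strains at which two genotype patterns differ."""
    return sum(a != b for a, b in zip(p, q))


pair_stats = {}
for m1, m2 in combinations(PATTERNS, 2):
    r = pearson(encode(PATTERNS[m1]), encode(PATTERNS[m2]))
    pair_stats[(m1, m2)] = (mismatches(PATTERNS[m1], PATTERNS[m2]), r)
    print(f"{m1} vs {m2}: {pair_stats[(m1, m2)][0]} mismatches, r = {r:.2f}")
```

No pair of these markers differs in more than three of the 30 strains, so a trait correlated with one of them will necessarily be correlated with the others to a similar degree.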

A conclusion from these observations is that genotype pattern-association is a source of multiple linkages. Whether these multiple linkages are spurious or real depends on whether the genotype pattern-association is a random occurrence or a consequence of co-segregation between functionally interacting eQTLs. The influence of genotype pattern-association on multiple linkages is discussed in Chapter 5.


4.6. Chapter summary and discussions

The major outcomes from this chapter are the developments of remove.LD and bqtl.twolocus, the affirmation that single-locus mapping analyses can accurately identify some two-locus additive effects, and the discovery of the influence of genotype pattern-association on multiple linkages.

4.6.1. Facts and figures

• 3% - 20% of transcripts from ten eQTL mapping studies are linked to at least one eQTL
  → Genetic variation influencing gene expression can be detected for up to one-fifth of transcripts

• 2.5% - 20.3% of transcripts showing linkage to at least one eQTL are linked to multiple eQTLs, after application of remove.LD
  → Multiple genetic variations influencing gene expression can be detected for up to one-fifth of heritable and detectable transcripts
  → Omitting remove.LD can potentially over-estimate the extent of transcripts influenced by multiple genetic variations by up to 10-fold

• 33% - 100% of multiple linkages are confirmed for two-locus effects
  → Multiple linkages identified from single-locus mapping studies are likely real with additive effects

• <28% of multiple linkages are confirmed for two-locus effects and show significance in the test for epistasis
  → Many multi-locus epistatic effects on gene expression are unlikely to be detectable using single-locus mapping analyses


4.6.2. Linkage disequilibrium and remove.LD

Linkage disequilibrium (LD) between neighbouring genomic regions is directly related to the distribution of genotypes between these regions, and so the distribution of pair-wise genotype pattern correlations is assumed to provide a more informed basis for defining LD than methods that assume a linear correlation between LD and genomic distance. This is particularly obvious in small samples of RI lines; as exemplified in Figure 4-1, a marker pair can have more similar genotype patterns than a pair that are further apart. Thus, our novel method, remove.LD, for eliminating potentially redundant linkages induced by LD, is designed to assess LD based on genotype pattern similarity.

One of the user-defined parameters for this algorithm is the LD-threshold. LD is defined for a pair of markers commonly linked to a transcript if their genotype pattern correlation exceeds this threshold. In this thesis project I have chosen to use a relatively stringent LD-threshold, set at the 95th percentile of inter-chromosomal marker genotype pattern correlations. This threshold is not claimed to be the gold standard for assessing LD, but serves to provide a set of multiple linkages that are least likely to be influenced by LD.
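The thresholding idea can be sketched as follows. This Python code is not the remove.LD implementation (see Section 4.1.1); it is a minimal illustration under several assumptions: correlations are taken as absolute Pearson correlations, the percentile is computed by the nearest-rank method, and markers linked to a transcript are filtered greedily from strongest to weakest linkage, keeping a marker only if it is not in LD with any marker already kept. The marker names, chromosomes, and genotype patterns are hypothetical toy data.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def encode(pattern):
    """Code the two parental alleles (here 'B' and 'H') as 1 and 0."""
    return [1 if allele == "B" else 0 for allele in pattern]


def ld_threshold(markers, percentile=0.95):
    """Nearest-rank percentile of |r| over marker pairs on different chromosomes.

    markers: dict mapping marker name -> (chromosome, encoded genotype pattern).
    """
    values = sorted(
        abs(pearson(markers[a][1], markers[b][1]))
        for a, b in combinations(markers, 2)
        if markers[a][0] != markers[b][0]
    )
    k = max(1, math.ceil(percentile * len(values)))
    return values[k - 1]


def remove_ld(linked, markers, threshold):
    """Keep linked markers that are not in LD with an already-kept marker.

    linked: marker names ordered from strongest to weakest linkage (assumed).
    """
    kept = []
    for name in linked:
        if all(abs(pearson(markers[name][1], markers[k][1])) <= threshold
               for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy marker map: two markers on chr1, one each on chr2 and chr3.
TOY_MARKERS = {
    "m1a": ("chr1", encode("BBBBHHHH")),
    "m1b": ("chr1", encode("BBBHHHHH")),
    "m2": ("chr2", encode("BHBHBHBH")),
    "m3": ("chr3", encode("BBHHBBHH")),
}

threshold = ld_threshold(TOY_MARKERS)
unique_eqtls = remove_ld(["m1a", "m1b", "m2"], TOY_MARKERS, threshold)
print(f"LD-threshold = {threshold:.3f}; unique eQTLs = {unique_eqtls}")
```

In this toy example the two neighbouring chr1 markers are collapsed into a single linkage while the chr2 marker is retained as a second, unique eQTL. With only a handful of marker pairs the 95th percentile coincides with the maximum inter-chromosomal correlation; in a real marker map it is taken over many thousands of pairs.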

The extent of transcripts defined as linked to multiple unique non-LD-induced eQTLs is highly dependent on this LD-threshold, which is in turn dependent on the type and size of the segregating population as well as the density of the marker maps. Future research to evaluate the behaviour of multiple linkages at different LD-thresholds and in different eQTL-mapping studies will be useful for both the eQTL- and classical QTL-mapping communities.


4.6.3. Confirming two-locus effects and bqtl.twolocus

It is necessary to emphasise that the aim of this chapter is not to determine the extent and identity of genes whose expression level is influenced by multiple eQTLs. Rather, this research and bqtl.twolocus only seek to validate multiple linkages identified from single-locus mapping analyses, which are not designed for detecting multi-locus effects.

Assuming expression traits are complex in origin, a substantial number of multi-locus effects would have been missed by these studies. Future research to further elucidate the regulatory architecture of gene expression would benefit from a complete multi-locus analysis. Multi-locus mapping using expression trait data is not trivial and will likely be much more computationally and statistically complex than single-locus mapping analyses.

4.6.4. Prelude to Chapter 5: genotype pattern-association

bqtl.twolocus confirms two-locus effects but does not provide any insight into the mechanism of interaction between the locus pairs. From our simnorm studies, we find that the proportion of confirmed two-locus effects is as high as that observed in real data. As there is no trait-variance built into these studies, multiple linkages and confirmed two-locus effects are presumed to be artefacts of loci-association.

As we will discuss in the following chapter, not all loci-associations are erroneous. But regardless of the reason underlying the association between a pair of marker loci, their association results in strong correlation between their corresponding genotype patterns, which in turn increases the probability that an expression trait is concurrently linked at both loci. This phenomenon is exemplified in the HubK and HubFC studies, in which a strong similarity between a distal and a proximal marker on Chr4 has led to linkages at both loci by a relatively large subset of genes.


Furthermore, genotype pattern-association may have initiated some of the 33%-100% of confirmed two-locus effects. Our ability to elucidate the sources of genotype pattern-association underlying some of the two-locus influences would improve our estimate of real two-locus effects.

187 Genotype pattern association and multiple-linkages

5. Genotype pattern association and multiple-linkages

In this chapter, we explore the underlying causes of multiple linkages. As we have demonstrated in the previous chapter, secondary eQTLs can be spuriously called significant by bqtl.twolocus even if the primary locus is spurious. This is observed with real, permuted, and simulated data under normality, thus suggesting that artefactual identification and subsequent confirmation of two-locus effects is a consequence of the association between the locus-pairs.

Statistically, linkage of an expression trait at a marker locus is defined by the strength of correlation between the expression values of the trait and genotypes of the eQTL. An expression trait linked at two unique eQTLs would imply strong correlations between the trait values and genotypes at both eQTLs. Thus, by simple deduction, this would imply strong correlation between the genotypes at the two loci. Based on this, this chapter aims to explore the effect of genotype pattern-association on multiple linkages.
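The deduction above can be made quantitative with a standard property of Pearson correlations: because the correlation matrix of three variables must be positive semi-definite, the correlation between two genotype patterns is bounded by their correlations with the trait. The sketch below computes that range; the function name and the example values are illustrative assumptions, not quantities taken from the datasets.

```python
import math

def correlation_bounds(r_y1, r_y2):
    """Range of corr(L1, L2) compatible with corr(y, L1) = r_y1 and
    corr(y, L2) = r_y2, from positive semi-definiteness of the 3x3
    correlation matrix."""
    slack = math.sqrt((1.0 - r_y1 ** 2) * (1.0 - r_y2 ** 2))
    return r_y1 * r_y2 - slack, r_y1 * r_y2 + slack

# Hypothetical trait-genotype correlations at two linked loci.
strong_lower, strong_upper = correlation_bounds(0.9, 0.9)
weak_lower, weak_upper = correlation_bounds(0.6, 0.6)
```

With trait correlations of 0.9 at both loci the genotype patterns must correlate at 0.62 or more, whereas with correlations of 0.6 the lower bound is negative, so the deduction is only binding when both linkages are strong.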

A pair of marker genotype patterns is in “association” if few changes are required to convert one into the other. The concept of genotype pattern association, or similarity, was discussed in Section 4.1.1 and Section 2.4.

To reiterate, let us consider three marker loci, L1, L2, and L3, with the following genotype patterns:

L1: BBDDBB

L2: BBDDBD

L3: BDBDBD

Genotype patterns between L1 and L2 are highly similar because only one genotype, at the last position, is different between the two patterns.


Conversely, genotype patterns between L1 and L3 are highly dissimilar because three of the six genotypes are different between the two patterns.
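The comparison of L1, L2 and L3 can be reproduced by counting mismatched positions (the Hamming distance) between genotype patterns. This is a minimal sketch; the function name is an assumption introduced here.

```python
def mismatches(pattern_a, pattern_b):
    """Number of strains at which two genotype patterns differ."""
    if len(pattern_a) != len(pattern_b):
        raise ValueError("genotype patterns must cover the same strains")
    return sum(a != b for a, b in zip(pattern_a, pattern_b))

L1, L2, L3 = "BBDDBB", "BBDDBD", "BDBDBD"
d12 = mismatches(L1, L2)   # differs only at the last position
d13 = mismatches(L1, L3)   # differs at three of the six positions
```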

Genotype pattern-association is important in eQTL- (and classical QTL-) mapping analyses because it can lead to multiple-linkages. This is particularly true in single-locus mapping studies, where linkage between an expression trait and a marker locus is considered independently of other traits and markers. In this chapter, “multiple-linkage” is used to describe the set of marker loci independently (i.e. from single-locus mapping) linked to a common transcript. Assuming trait y is influenced by, and therefore significantly linked to, the genomic region L1, and that the genotype pattern of L1 is highly similar to that of another (possibly non-causative) locus, L2, then y will also link significantly to L2 (Figure 5-1).

[Figure 5-1: expression level of trait y in six strains, shown above the genotype patterns of L1 (BBDDBB) and L2 (BBDDBD).]

Figure 5-1 Diagram exemplifying how trait y can significantly link to both marker loci L1 and L2. Trait y is significantly linked to L1 because its expression values correlate well with the genotype pattern of L1: strains with a B-genotype at L1 have a high y expression level while strains with a D-genotype have a low expression level. Because L2 has a genotype pattern that is similar to that of L1 (differing only in the last strain), trait y is also linked to L2, regardless of whether L2 influences trait y.

Understanding the effect of genotype pattern-association on multiple linkages is important because genotype patterns can be in association due to (1) a fundamental genetic phenomenon, thus implying multiple linkages have a legitimate genetic basis, or (2) chance, thus implying multiple linkages are spurious.


A biological reason for genotype pattern similarity between a pair of marker loci is co-segregation due to either linkage disequilibrium (LD) or non-syntenic association (NSA). LD occurs between neighbouring genomic regions, which tend to co-segregate because the probability of meiotic recombination between them, being directly proportional to genomic distance, is low. This causes identical genotype inheritance at neighbouring regions, thus resulting in identical (or very similar) genotype patterns. Often, adjacent markers have similar rather than identical genotype patterns, especially if the recombination rate is relatively high or if the number of typed markers is relatively low (i.e. adjacent markers are farther apart). This is because both factors increase the chance of recombination between adjacent markers in one or more individuals (strains).

[Figure 5-2: schematic of protein 1 and protein 2, encoded by loci carrying B or D genotypes, acting on an essential gene.]

Figure 5-2 Non-syntenic association. This essential gene requires the successful interaction between protein 1 and protein 2. These two proteins will only interact if both are wildtypes or both are mutants; i.e. genes encoding both proteins have the same genotype: both are B or both are D. If these two proteins are encoded by different alleles, they are unable to interact, and so this essential gene is not expressed at all. Although this essential gene will be transcribed in either the wildtype or mutant background, the level of expression may differ.


NSA is an extended form of LD, where the corresponding pair of loci is located at distinct regions of the genome, often, but not necessarily, on different chromosomes (non-syntenic). NSA locus-pairs are indicative of co-evolution of alleles that have undergone joint-selection (WILLIAMS et al. 2001; PETKOV et al. 2005; CERVINO et al. 2005). That is, two genomic regions are NSA if they encode a pair of proteins that co-operatively regulate the activation or function of an essential or important gene central to the fitness of the organism (Figure 5-2). This pair of proteins will only interact if they are encoded by complementary alleles: they must co-segregate during meiotic recombination. NSA genes will therefore have identical genotype patterns. However, markers representing these genes may not necessarily have identical genotype patterns because the genes and their markers may not be in complete LD and so the respective markers may not perfectly co-segregate, resulting in similar genotype patterns.

A non-biological reason for genotype pattern similarity is chance, or random, association. Genotype patterns are effectively categorical strings, and in RI panels only two categories exist, corresponding to the two homozygous genotypes. Given n RI strains, for each genotype pattern there exist n other genotype patterns that differ from the initial pattern at only one of n positions (see Section 4.1.1). Similarly, there are n(n-1)/2 other strings differing only at two positions, and so on. Since the actual number of genotype patterns is limited by the extent of recombination, sample size (n), and type of segregating panel, it is not unreasonable to observe random pattern-associations.
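The counts quoted above are binomial coefficients: the number of two-category patterns differing from a given pattern at exactly k of n positions is C(n, k). A short sketch follows, using a panel size of 31 strains purely as an example; the function names are assumptions.

```python
import math

def patterns_at_distance(n_strains, k):
    """Number of genotype patterns differing from a given one at exactly k strains."""
    return math.comb(n_strains, k)

def patterns_within_distance(n_strains, k_max):
    """Number of other patterns differing at between 1 and k_max strains."""
    return sum(math.comb(n_strains, k) for k in range(1, k_max + 1))

n = 31  # example panel size
one_off = patterns_at_distance(n, 1)        # n
two_off = patterns_at_distance(n, 2)        # n(n-1)/2
near = patterns_within_distance(n, 2)
```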

In Section 4.5.3, we saw in the HubFC data that more than one-third of all transcripts with multiple-linkages are linked to a pair of marker loci with highly similar genotype patterns. Furthermore, two-locus effects cannot be confirmed for almost all of these locus-pairs. These results hint at the involvement of genotype pattern-association in (spurious) multiple-linkages.

In this section we are interested in determining the extent to which genotype pattern association influences multiple-linkages, by categorising and examining the sources of genotype pattern-associations between markers within sets of multiple-linkages across the ten eQTL datasets (Table 4-7). Clearly, such associations would not be an issue if some form of multi-locus mapping approach (e.g. STOREY, AKEY, and KRUGLYAK 2005; KENDZIORSKI et al. 2006) was adopted (see Section 1.7 for discussion). However, due to computational and sample size constraints, many of the current eQTL-mapping studies (mainly the ten under analysis in this thesis) have adopted a single-locus mapping approach. In these studies, multiple-linkages can result (BREM et al. 2005 and BREM and KRUGLYAK 2005; also see discussion and references in Section 1.7 and Chapter 4).

Without performing a complete multi-locus analysis, it would be valuable to determine if any multiple-linkages identified from single-locus mapping analyses are indeed real; or at the very least, it would be useful to determine the integrity of these multiply-linked eQTLs as this can help prioritise those that the investigators may like to follow up. We, therefore, developed a pipeline to help prioritise multiple-linkages for more extensive validations, such as double knock-outs of gene-pairs that may potentially influence the expression of a gene in an additive or epistatic manner.


5.1. Experimental design

[Figure 5-3: decision tree starting from transcripts with multiple linkages. Decision nodes: Are the multiple loci in synteny? Are their genotype patterns NSA? Are their genotype patterns randomly associated? Are they confirmed for a two-locus effect? Outcome labels: residual LD, spurious multiple linkage; obligatory syntenic co-segregation; spurious NSA multiple linkage; likely spurious multiple linkage that is wrongfully confirmed for two-locus effects; validated two-locus effects whose genotype patterns are unassociated; spurious multiple linkage.]

Figure 5-3 Schematic diagram outlining the pipeline (decision tree) for determining real (circled) and spurious (squared) multiple-linkages.


The aim here is to understand the relationship between genotype pattern-association and multiple-linkages. To achieve this, the sets of multiple-linkages identified from the ten eQTL-datasets (Table 4-7) are first classified as having one of four genotype pattern-associations (Figure 5-3): 1) Linkage disequilibrium; 2) Non-syntenic association; 3) Random association; and 4) No association.

For the sets of multiple-linkages grouped into each class of genotype pattern-association, the extent of confirmable two-locus effect is then determined using bqtl.twolocus, as in Chapter 4. The pipeline outlined in Figure 5-3 is designed to help determine whether sets of multiple-linkages are likely to have true two-locus influences on expression traits or are likely artefacts of genotype pattern-association.

5.1.1. Classifying syntenic association

Genomic regions belonging to the same chromosome are said to be in synteny. As demonstrated in the previous chapters, LD of neighbouring loci can lead to false multiple-linkages, thus imitating multi-locus influence on gene expression. Syntenic loci have higher probabilities of being in some degree of LD because LD between locus-pairs is dependent on their relative genomic distance to each other. Thus, as well as applying remove.LD (Section 4.1.1; Section 2.5), it is also important to know if eQTLs linked to the same expression trait belong to the same chromosome.

Syntenic eQTLs with confirmed two-locus effects (determined using bqtl.twolocus; Section 4.1.2 and Section 2.6) are indicative of genuine functional interactions between these syntenic loci in their influence on gene expression. Conversely, syntenic eQTLs that cannot be confirmed for two-locus effects are indicative of multiple-linkages caused by residual LD.

5.1.2. Classifying non-syntenic association

In theory, a pair of NSA loci would have identical genotype patterns because inheritance at both loci is expected to be identical in all individuals. However, genetic markers are used to infer actual causative eQTLs, and these markers may not be in complete LD with the eQTLs. Consequently, marker loci representing NSA eQTLs may not always have identical genotype patterns, but are expected to have the most similar genotype patterns to each other. Therefore, a pair of marker loci is defined as being in NSA if they do not belong to the same chromosome but have genotype patterns that are the best matched to each other compared with all other marker genotype patterns. By this definition, there exists, for each marker locus, a corresponding NSA locus.

NSA eQTLs with confirmed two-locus effects are indicative of genuine functional interactions between these NSA loci in their influence on gene expression. Conversely, NSA eQTLs that cannot be confirmed for two-locus effects are indicative of spurious multiple-linkages.
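One way to read the NSA definition in this section is as a reciprocal best match among markers on other chromosomes. The sketch below encodes that reading; the function name, the use of signed Pearson correlations, the handling of ties (first best match wins), and the toy correlation matrix are all assumptions rather than the procedure used in this thesis.

```python
import numpy as np

def is_nsa(corr, chromosomes, a, b):
    """True if markers a and b lie on different chromosomes and each is the
    other's best-correlated marker among all markers on other chromosomes."""
    corr = np.asarray(corr, dtype=float)
    chrom = np.asarray(chromosomes)
    if chrom[a] == chrom[b]:
        return False

    def best_partner(m):
        # Mask out markers on the same chromosome as m (including m itself).
        scores = np.where(chrom != chrom[m], corr[m], -np.inf)
        return int(np.argmax(scores))

    return best_partner(a) == b and best_partner(b) == a

# Hypothetical marker-by-marker correlation matrix for four markers.
toy_corr = [[1.0, 0.9, 0.7, 0.2],
            [0.9, 1.0, 0.3, 0.1],
            [0.7, 0.3, 1.0, 0.8],
            [0.2, 0.1, 0.8, 1.0]]
toy_chrom = [1, 1, 2, 2]
pair_02 = is_nsa(toy_corr, toy_chrom, 0, 2)
pair_03 = is_nsa(toy_corr, toy_chrom, 0, 3)
```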

5.1.3. Classifying random genotype pattern association

A set of eQTLs linked to the same transcript is defined as randomly associated if the eQTLs are not in synteny and not NSA, but their genotype patterns have a Pearson’s correlation coefficient less than the lower 25th percentile or larger than the upper 75th percentile (“tail”-50%) of all inter-chromosomal correlation coefficients (Figure 5-4). The tail 50% of inter-chromosomal correlations is used instead of the upper 50th percentile to include negative correlations (Section 4.1.1). This range is taken to represent the null hypothesis of no correlation between marker-pairs. Thus, if the genotype patterns of a marker-pair correlate with a Pearson’s correlation >75th or <25th percentile of the inter-chromosomal marker correlation distribution (Figure 5-4), then the pair is defined as in random association.

Figure 5-4 Distribution of genotype pattern correlations between all pair-wise inter-chromosomal markers from the marker maps of different panels of individuals. CotB/K/L: used 31 BXD RI mice; ChesB: used 32 BXD RI mice; BysHSC: used 22 BXD RI mice; HubK/FC: used 30 BXH/HXB RI rats; YveY3/Y5: used 86 haploid yeasts; SchL: used 111 F2 mice. Between pairs of dashed lines are the mid-50% of the distributions and between pairs of dotted lines are the mid-90% of the distributions.


The choice of this range is entirely arbitrary and may be altered depending on how stringent one wishes to define a pair of genotype patterns as in random association. See Section 5.5 and Section 5.7 for discussion on how the choice of this range can influence the extent of multiple-linkages classified as real.
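The percentile rule for random association can be sketched as follows, with the two cut-offs exposed as parameters so that the stringency can be varied. The function name and the evenly spaced toy distribution are assumptions for illustration; the synteny and NSA checks that precede this step are not included.

```python
import numpy as np

def is_randomly_associated(r_pair, inter_chrom_r, lower_pct=25.0, upper_pct=75.0):
    """True if a pair's correlation falls in the tails of the distribution of
    inter-chromosomal genotype pattern correlations."""
    lower, upper = np.percentile(inter_chrom_r, [lower_pct, upper_pct])
    return bool(r_pair < lower or r_pair > upper)

# Hypothetical inter-chromosomal correlation distribution.
toy_distribution = np.linspace(-1.0, 1.0, 201)
in_upper_tail = is_randomly_associated(0.6, toy_distribution)
in_middle = is_randomly_associated(0.1, toy_distribution)
in_lower_tail = is_randomly_associated(-0.7, toy_distribution)
```

Passing, for example, `lower_pct=5.0` and `upper_pct=95.0` would restrict the class to the tail 10% of the distribution.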

Randomly associated eQTLs that are concurrently linked to an expression trait are by definition spurious, regardless of whether two-locus effects can be confirmed or not. That is, the bqtl.twolocus test cannot distinguish real two-locus effects from random eQTL associations, and randomly associated eQTLs may sometimes pass the bqtl.twolocus test.

5.1.4. Classifying unassociated genotype patterns

Finally, genotype patterns of multiply-linked eQTLs that are neither syntenically-associated, nor NSA, nor randomly-associated are classified as unassociated. Unassociated eQTLs are, by definition, not confounded by genotype pattern association and so concurrent linkages at these loci are likely real. Furthermore, since multiple-linkages at unassociated eQTLs are real, confirmation of two-locus effects is expected for all sets of multiply-linked and unassociated eQTLs.
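The classification rules of Sections 5.1.1 to 5.1.4 can be summarised as a single function. This is a sketch of the decision logic only: the function name and argument names are assumptions, the inputs are taken to be already-computed yes/no answers, and the returned labels paraphrase the text (the wording for NSA pairs with a confirmed two-locus effect follows Section 5.1.2).

```python
def classify_multiple_linkage(syntenic, nsa, randomly_associated, two_locus_confirmed):
    """Return an outcome label for one set of multiply-linked eQTLs."""
    if syntenic:
        if two_locus_confirmed:
            return "obligatory syntenic co-segregation"
        return "residual LD; spurious multiple linkage"
    if nsa:
        if two_locus_confirmed:
            return "NSA with genuine two-locus effect"
        return "spurious NSA multiple linkage"
    if randomly_associated:
        if two_locus_confirmed:
            return "likely spurious; wrongfully confirmed for two-locus effect"
        return "spurious multiple linkage"
    # Unassociated genotype patterns.
    if two_locus_confirmed:
        return "validated two-locus effect; genotype patterns unassociated"
    return "spurious multiple linkage"

example = classify_multiple_linkage(
    syntenic=False, nsa=False, randomly_associated=True, two_locus_confirmed=True
)
```

The order of the tests matters: synteny is checked first, then NSA, then random association, so a locus-pair is assigned to the first class it satisfies.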


5.2. Genotype pattern similarity can cause multiple-linkages

Results from previous chapters have hinted at the involvement of genotype pattern-association in multiple-linkages. Here we demonstrate that there is indeed a relationship between genotype pattern-similarity and linkage significance, by comparing these two measures directly.

For example, from a single-locus mapping analysis using the CotB data, Foxa3 is most significantly linked to D1Mit216 on Chr1 (Figure 5-5 top panel). Although to a lesser extent, Foxa3 is also linked to two other markers: D1Mit134, also on Chr1, and S08Gnf006.700, on Chr8. To establish whether linkages of Foxa3 at the latter two markers are due to their genotype patterns being similar to that of D1Mit216, we correlated genotype pattern of D1Mit216 with all other markers on the genome (779 in total). The resulting correlation coefficients are then plotted against linkage P-values of Foxa3 at these markers (Figure 5-5 bottom plot).

The trend is obvious: linkage significance increases with genotype pattern similarity. That is, Foxa3 tends to map to markers whose genotype patterns are similar to that of D1Mit216. For instance, D1Mit134, also significantly linked to Foxa3 at Praw ≤ PBON = 0.05/M, has an extremely similar genotype pattern to D1Mit216, with a Pearson’s correlation of ~0.87. The most probable reason for such high similarity between these markers is LD.

This trend between genotype pattern-similarity and linkage Praw also applies to negatively associated genotype patterns (data points on the left of the dotted line). The concept of negative correlation was briefly described in Section 4.1.1. Put simply, if for every occurrence of a B in one genotype pattern there is a D in another pattern, then the two patterns are complete inverts. For instance, the pattern BBBDDD is the complete invert of DDDBBB. As with positively correlated genotype patterns, inversely similar genotype patterns can also lead to multiple-linkages, as demonstrated in Figure 5-6.

Figure 5-5 Genome scan and correlation between genotype pattern similarity and linkage of Foxa3. Upper plot shows the genome scan of a forkhead box A3 (Foxa3) transcript (NM_008260) from the BXD brain study (CotB). This transcript is linked to three markers (horizontal dotted line marks the PBON=0.05/M threshold), with the most significant linkage at D1Mit216 on Chr1. Bottom plot shows the genotype pattern correlations between this marker locus and all other markers on the genome (x-axis) against Foxa3’s linkage significance (Praw) at each corresponding marker locus (y-axis). The vertical dotted line marks zero (i.e. no) correlation. Positive correlations are on the right of the dotted line and negative correlations on the left. The three markers significantly linked to Foxa3 are indicated (above horizontal dotted line). Note: the only value with Pearson’s correlation coefficient=1 (top right data-point) is that of D1Mit216: comparing the genotype pattern of D1Mit216 against itself.
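For reference, the Bonferroni threshold PBON = 0.05/M and its position on a -log10 scale can be computed directly. The sketch below assumes M = 779, the marker count quoted in the text for the CotB map; the function name is an assumption.

```python
import math

def bonferroni_threshold(alpha, n_markers):
    """Bonferroni-corrected P-value threshold and its -log10 value."""
    p_bon = alpha / n_markers
    return p_bon, -math.log10(p_bon)

p_bon, neg_log10_p_bon = bonferroni_threshold(0.05, 779)
```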


[Figure 5-6: expression level of gene y in six strains, shown above the genotype patterns of L1 (BBDDBB) and its invert L2 (DDBBDD).]

Figure 5-6 Linkages to genotype pattern inverts. If expression of gene y is significantly linked to L1, then it will also link to L2 with equal but opposite test statistics.
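The statement that an inverted genotype pattern gives an equal but opposite test statistic can be checked numerically with Pearson’s correlation. In this sketch the expression values are hypothetical, and the helper names and the B=0/D=1 encoding are assumptions.

```python
import numpy as np

def encode(pattern):
    """Encode a genotype pattern numerically (B=0, D=1)."""
    return np.array([0.0 if g == "B" else 1.0 for g in pattern])

def invert(pattern):
    """Swap every B for a D and every D for a B."""
    return pattern.translate(str.maketrans("BD", "DB"))

# Hypothetical expression values of gene y in six strains.
y = np.array([9.1, 8.7, 3.2, 2.9, 8.8, 9.4])
pattern_l1 = "BBDDBB"
pattern_l2 = invert(pattern_l1)

r_l1 = np.corrcoef(y, encode(pattern_l1))[0, 1]
r_l2 = np.corrcoef(y, encode(pattern_l2))[0, 1]   # same magnitude, opposite sign
```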


5.3. Residual linkage disequilibrium as a source of multiple-linkages

Without consideration for LD between neighbouring loci, up to 68% of the transcripts with at least one linkage are defined as having multiple-linkages across the ten eQTL-mapping studies analysed (Section 4.5.2; Table 4-8). This percentage is reduced to less than 20% following the application of remove.LD.

While this algorithm considers linkages at neighbouring marker loci, it does not consider “residual LD”. Residual LD describes LD between genomic regions that are located on the same chromosome (i.e. syntenic regions) but are not necessarily adjacent. Regardless of the length of a chromosome, the genotype pattern at any given locus is always dependent on its adjacent loci and, by chain of dependence, genotype patterns between syntenic markers will always have some degree of dependence. For example, in Figure 5-7, L1 and L2 are clearly in LD because they are immediately adjacent to each other, and so have identical genotype patterns. Although L3 is remotely located from L1, they belong to the same chromosome, and the genotype pattern of L3 is highly similar to that of L1, differing only at the 2nd position. However, similar genotype patterns between syntenic genomic regions are not always due to residual-LD. In some cases, a pair of syntenic loci may share similar genotype patterns because co-segregation at these two loci has been, deliberately or unconsciously, selected for in the breeding process. If co-inheritance of a pair of syntenic regions confers fitness to the individuals, then a majority of the population would have highly associated genotypes at this locus-pair (CERVINO et al. 2005; PETKOV et al. 2005).


[Figure 5-7: one chromosome from each of six BXD strains (1 to 6), showing the positions and genotype patterns of marker loci L1, L2 and L3.]

Figure 5-7 Similar genotype patterns between syntenically associated marker loci. For this example, one chromosome from each of 6 BXD strains is illustrated. The genetic backgrounds of these mice are mosaic patterns of their founders: B (white) and D (black). As expected, since L1 and L2 are adjacent to each other they are in LD and have identical genotype patterns. Although L3 is not immediately next to L1, residual LD or obligatory co-segregation gives these two marker loci highly similar genotype patterns.

In this section we use the knowledge of synteny and results from the bqtl.twolocus test to assess whether multiple-linkages have arisen due to residual-LD or co-segregation of intra-chromosomal regions. For our purpose, we define a set of eQTLs linked to a common transcript as syntenic if they are located on the same chromosome and can be detected even after the application of remove.LD.

The numbers of transcripts linked to multiple syntenic loci vary greatly across the ten eQTL-datasets (Figure 5-8). The noticeably higher proportions of syntenic multi-locus peaks in the RI rat data (HubK and HubFC) support our previous observations in Section 4.5.3 that many transcripts are linked concurrently to multiple marker loci on the same chromosome (Chr4).


[Figure 5-8: bar chart. y-axis: % transcripts linked to multiple eQTLs that are in synteny (0% to 100%); x-axis: the ten eQTL datasets (CotB, CotK, CotL, ChesB, BysHSC, HubK, HubFC, YveY3, YveY5, SchL). Bar labels, not matched to datasets here: 78.85%, 60.20%, 18.50%, 9.68%, 5.88%, 5.26%, 4.17%, 2.44%, 2.22%, 0.00%.]

Figure 5-8 Percentages of transcripts linked to multiple eQTLs that are in synteny. Actual numbers making up these percentages are in Appendix A.8.

[Figure 5-9: bar chart. y-axis: % transcripts linked to multiple eQTLs that are in synteny (0% to 100%); x-axis: the ten eQTL datasets; series: two-locus effect, two-locus epistatic effect.]

Figure 5-9 Percentages of transcripts linked to syntenically-associated eQTLs that are confirmed for two-locus and two-locus epistatic effects. Actual numbers making up these percentages are in Appendix A.8.


If a syntenic genotype pattern-association were a consequence of residual-LD, then we would expect the involved loci to score poorly in the bqtl.twolocus test. Conversely, if co-segregation between a pair of syntenic eQTLs contributes to individual fitness, then a bqtl.twolocus test with this locus-pair should show significance.

In four of the ten datasets (CotK, ChesB, YveY3, and YveY5), all syntenically-associated eQTLs are confirmed for two-locus effects (Figure 5-9), suggesting obligatory, or beneficial, co-segregation of these syntenic regions. This is similarly true for the high proportions of confirmed two-locus effects amongst the sets of syntenically-associated multiple-linkages in ChesB and SchL.

In comparison, of the considerable number of syntenic multiple-linkages in HubK and HubFC (Figure 5-8), relatively few are confirmed for two-locus effects (Figure 5-9). Residual-LD is the most plausible explanation for the multiple-linkages that are syntenic but not confirmable for two-locus effects. Conversely, the lack of confirmed two-locus effects amongst multiple syntenic linkages in CotL and BysHSC is more plausibly due to the extremely low number, or absence, of multiple-linkages that are in synteny than to the effect of residual-LD.

Although there are fewer observable multiple-linkages in HubK and HubFC that are both syntenic and confirmable for two-locus effects, those that do fall under this category show evidence of epistatic influence. The only other study with confirmed two-locus epistatic influence amongst the set of syntenically-associated multiple-linkages is SchL: ~35% of all multiple-linkages with verified two-locus epistatic effects are in synteny. In these cases, the corresponding intra-chromosomal loci, or gene-products from these regions, are predicted to be dependent on each other in their influence on the respective transcripts’ expression levels. That is, by the definition of two-locus epistatic interaction in this thesis project (Section 1.7; Section 4.1.2), the action of one eQTL is dependent on another on the same chromosome, thus increasing the probability of co-segregation between these two regions.


5.4. Non-syntenic association as a source of multiple-linkages

NSA (non-syntenic association) is a source of genotype pattern-association that describes a pair of eQTLs whose genotypes in a population are in disequilibrium. This is potentially caused by preferential selection of certain combinations of alleles at locus-pairs on different chromosomes (WILLIAMS et al. 2001). In the extreme case, as described earlier in this chapter (Figure 5-2), such a locus-pair may be involved in the function of an essential gene, such that non-identical inheritance of this locus-pair would result in lethality.

This phenomenon is especially likely in inbred animals because inbreeding exerts heavy selection pressure and pre-existing allelic combinations are generally favoured over new ones (PETKOV et al. 2005). Identifying NSA loci is important because allelic association can lead to spurious associations/linkages in genome-wide QTL-mapping studies (CERVINO et al. 2005) and because underlying real NSA locus-pairs are genes/genetic variants whose interaction and co-evolution are essential for the survival of an animal. In our case, the action of such NSA locus-pairs is manifested in their joint control of gene expression.

For the purpose of this chapter, NSA is defined for a pair of marker loci linked to the same transcript if they do not belong to the same chromosome and have genotype patterns that match each other better than all other markers on the genome (Section 5.1.2).

Figure 5-10 shows an example of NSA from CotB: from a single-locus linkage analysis (left panel), Ccl1 is linked, at Praw ≤ PBON = 0.05/M, to two eQTLs on Chr7 (S07Gnf033.680) and Chr17 (D17Mit18). Shown on the right-hand panel of this figure is Pearson’s correlation coefficient between the most significantly linked marker (D17Mit18) and all other markers on the genome (x-axis), against linkage significance of the trait at each marker (y-axis). Markers on Chr7 have been coloured red and markers on Chr17 have been coloured green. This figure clearly shows that, excluding intra-chromosomal marker correlations (green data-points), the genotype pattern of D17Mit18 is better matched to S07Gnf033.680 than to all other markers on the genome, scoring a Pearson’s correlation coefficient of ~0.69. The reverse is also true: the genotype pattern of S07Gnf033.680 correlates better with D17Mit18 than with all other markers, excluding those on Chr7 (red data-points). As such, this pair of multiple-linkages is classified as NSA.

Conversely, Figure 5-11 shows an example where the pair of eQTLs to which a RIKEN gene (AK014278) is significantly linked is not NSA. This is because their genotype patterns correlate better with other inter-chromosomal markers (open circles) than with each other (Pearson’s correlation coefficient of ~0.36).


Figure 5-10 Example of NSA multiple-linkage. Left panel shows the genome scan, from the CotB study, of the NM_011329 transcript, corresponding to the chemokine ligand 1 (Ccl1) gene. Ccl1 is significantly linked (Praw ≤ PBON = 0.05/M, i.e. -log10Praw ≥ 4.2) to D17Mit18 on Chr17 and S07Gnf033.680 on Chr7. Right panel shows that these two marker loci are NSA: their genotype patterns are best matched to each other, excluding intra-chromosomal genotype pattern correlations (respective colour points). Markers on Chr7 are coloured red, and markers on Chr17 in green.


Figure 5-11 Example of non-NSA multiple-linkage. Left panel shows the genome scan, from the CotB study, of the transcript AK014278, corresponding to the RIKEN clone 3200001D21. This transcript is significantly linked (Praw ≤ PBON = 0.05/M, i.e. -log10Praw ≥ 4.2) to D19Mit68 on Chr19 and DXMsw076 on ChrX. Right panel shows that these two marker loci are not NSA: their genotype patterns correlate better with other markers not on the same chromosome than with each other. Markers on Chr19 are coloured red, and markers on ChrX in green.


Of all transcripts linked to multiple eQTLs across the ten eQTL datasets, between ~12% and 62% are linked to eQTLs that are NSA (Figure 5-12). To determine whether the observed proportions of NSAs are greater than chance expectation, a simulation study was performed using the BXD lines. If physically unlinked loci with similar genotype patterns, in RI lines, are due solely to joint allelic selection, then the extent of genotype pattern similarity between inter-chromosomal markers would be larger than that based on the assumption that no association between physically unlinked loci exists.

Due to the vagaries of the construction and potential shared ancestries of these mice, it is difficult to simulate a new panel of RI lines. However, assuming independent chromosomal segregation, a simulated RI panel can be obtained by “mixing-and-matching” the chromosomes from a real RI panel (Section 2.4.1), then testing for the extent of inter-chromosomal genotype pattern similarity.
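The mix-and-match idea can be illustrated with a minimal Python sketch. It assumes a hypothetical data layout (a dictionary from chromosome name to one tuple of 0/1 marker genotypes per RI line) and hypothetical function names; the procedure actually used is the one described in Section 2.4.1 and may differ in detail.

```python
import random
from itertools import combinations
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson's correlation of two genotype patterns (0.0 if either is constant)."""
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(x), mean(y)
    return mean((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)


def mix_and_match(panel, rng, replace=False):
    """Build a simulated panel by drawing each chromosome from a random donor line.

    panel maps chromosome -> list with one tuple of marker genotypes per line
    (hypothetical layout). Whole chromosomes are moved between lines, so each
    chromosome's marker haplotypes stay intact while any association between
    different chromosomes is broken.
    """
    n_lines = len(next(iter(panel.values())))
    simulated = {}
    for chrom, lines in panel.items():
        if replace:
            donors = [rng.randrange(n_lines) for _ in range(n_lines)]
        else:
            donors = rng.sample(range(n_lines), n_lines)
        simulated[chrom] = [lines[d] for d in donors]
    return simulated


def inter_chromosomal_fraction(panel, threshold):
    """Fraction of inter-chromosomal marker pairs with |r| >= threshold."""
    patterns = [
        (chrom, [line[m] for line in lines])
        for chrom, lines in panel.items()
        for m in range(len(lines[0]))
    ]
    total = hits = 0
    for (c1, p1), (c2, p2) in combinations(patterns, 2):
        if c1 == c2:
            continue  # skip intra-chromosomal (syntenic) pairs
        total += 1
        hits += abs(pearson(p1, p2)) >= threshold
    return hits / total


# Toy panel (made-up genotypes): 4 lines, 2 markers on "1", 1 marker on "2".
toy_panel = {
    "1": [(0, 0), (1, 1), (0, 1), (1, 0)],
    "2": [(1,), (0,), (1,), (0,)],
}
observed = inter_chromosomal_fraction(toy_panel, 0.5)
simulated = mix_and_match(toy_panel, random.Random(1))
```

Comparing `inter_chromosomal_fraction` on the original panel with its average over several simulated panels gives the kind of contrast summarised in Table 5-1.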

Table 5-1 Proportions of all pair-wise inter-chromosomal genotype pattern similarities, as measured by Pearson’s correlation coefficient, using an original panel of 32 BXD RI lines and replicate sets of simulated 32 RI lines. Results from simulated studies are averages of five simulations where chromosomes were sampled randomly without replacement; results for another five simulations where chromosomes were sampled randomly with replacement (Section 2.4.1) are almost identical.

Inter-chromosomal genotype pattern similarity (Pearson’s correlation coefficient):

| Panel | <±0.5 | ≥±0.6 | ≥±0.7 | ≥±0.8 |
|---|---|---|---|---|
| Original BXD | 99.5% | 0.08% | 0.003% | - |
| Average simulated BXD | 99.5% | 0.08% | 0.005% | <10^-3 |

Results from these simulations (Table 5-1) show that there is no excess of NSA in the original data (32 BXD RI mice: Cot data) compared to a hypothetical RI panel where physically unlinked (non-syntenic) allelic association is absent. It must be emphasised here that, while these results suggest there is no excess of NSA in the BXD RI lines, this does not mean that NSA is absent. It is highly plausible that some loci are jointly selected to compensate for the intensive selection pressure of inbreeding, but the number of such co-evolving loci is not so great that they can be discerned from random genotype association.

Because non-random allelic association is a well-known phenomenon in humans (CERVIN et al. 2005), we continue under the assumption that NSA is present in these RI animals, as well as in the other experimental crosses. A closer examination of the 12%-62% NSA locus-pairs shows that more than 66% have confirmed two-locus effects (Figure 5-13), suggesting that many NSA multiple-linkages have real two-locus effects on their corresponding expression traits. NSA multiple-linkages that fail the bqtl.twolocus test are likely spurious and may have been identified from single-locus linkage analyses because of random genotype pattern-association.

Although epistatic selection of distinct loci during the generation of inbred lines has been proposed as a potential cause of NSA (CERVINO et al. 2005), it is not surprising that no NSA locus-pairs in CotL, ChesB, and BysHSC demonstrate evidence of epistatic influence, because the total numbers of detected and confirmed locus-pairs exerting epistatic effects are small in these studies (Figure 4-12; Appendix A.7.3): four in CotL, zero in ChesB, and two in BysHSC. Overall, the proportions of multiple-linkages with evidence of epistatic interactions are small (Figure 5-13) because single-locus linkage analyses are not designed for detecting epistatic multi-locus effects.

As the parental strains used to generate the F2 mice in SchL and the BXD lines in CotL are the same (Table 4-7), we compared the results from these two liver studies in an attempt to ascertain some of these NSA eQTL-pairs. Unfortunately, of the 14 transcripts mapped to NSA locus-pairs, only 11 are also analysed for eQTL-linkage in the SchL study (i.e. only 11/14 transcripts are represented on both microarray platforms (Table 4-6)), and none of these 11 are linked to more than one eQTL in SchL. In fact, only three of these 11 transcripts are significantly linked to one eQTL in the SchL study: AK010720 to Chr6; NM_010924 to Chr9; AK016223 to Chr3. As discussed in Section 4.5.3, there is a multitude of factors, and combinations of factors, that may explain the discrepancies between these two liver studies. The most relevant of these is the difference in experimental crosses. RI animals have been subjected to intense selection pressure through their inbreeding process; this selection pressure is thought to have caused joint allelic selection of physically unlinked loci (WILLIAMS et al. 2001; CERVINO et al. 2005; PETKOV et al. 2005; SCHADT 2006). Thus, NSA locus-pairs present in BXD RI lines may not necessarily be present in their related F2 cousins.

Similarly, of the 57 NSA locus-pairs identified in SchL, 36 of the corresponding transcripts are represented in the CotB/K/L data, but none are mapped to NSA locus-pairs or even to any eQTLs in CotL. As discussed in Section 4.5.1, almost three times as many eQTLs have been mapped in SchL compared to CotL (Table 4-8) as a consequence of the differences between these studies. In particular, the advantages of larger sample size, heterozygosity, and a less stringent significance threshold that are associated with SchL may have led to more eQTLs being detected.


[Figure 5-12: bar chart. Y-axis: % transcripts linked to multiple eQTLs that are NSA. X-axis: the ten eQTL datasets (CotB, CotK, CotL, ChesB, BysHSC, HubK, HubFC, YveY.Cy3, YveY.Cy5, SchL). Bar values range from 11.9% to 62.2%.]

Figure 5-12 Percentages of transcripts linked to multiple eQTLs that are non-syntenically associated. Actual numbers making up these percentages are in Appendix A.8.

[Figure 5-13: bar chart. Y-axis: % transcripts linked to multiple eQTLs that are NSA. X-axis: the ten eQTL datasets. Legend: two-locus effect; two-locus epistatic effect.]

Figure 5-13 Percentages of transcripts linked to NSA eQTLs that are confirmed for two-locus and two-locus epistatic effects. Actual numbers making up these percentages are in Appendix A.8.


The significance of identifying NSA locus-pairs and confirming their joint effects on their cognate expression traits is that residing at each NSA eQTL-pair are genetic variants whose interaction is likely essential for the survival and evolution of the respective experimental crosses. Thus, having identified these NSA locus-pairs, an obvious next step is to identify the actual causative/functional variant-pairs. Unfortunately, as with all QTL-mapping studies, the difficulty lies in the relatively large genomic region implicated by each eQTL; this makes identifying the functional variant extremely difficult.

One method for easing the search is to focus on NSA locus-pairs where one of the two eQTLs is linked in cis to the expression trait (suggested, for example, by CARLBORG et al. 2005). It is presumed that eQTLs that act in cis are likely to harbour regulatory variants. Thus, for a pair of NSA eQTLs, if one harbours a regulatory variant, then it is plausible that the other may harbour the regulator acting on this cis-variant. Based on this argument, we identified a total of eight NSA locus-pairs in the Cot data (two in CotB, two in CotK, four in CotL) where one of the eQTLs is on the same chromosome as the respective expression trait (Table 5-2).

More interestingly, for two of the NSA locus-pairs in CotL (Table 5-2), one of the eQTLs is located within 5 Mb of the transcript, making it a cis-eQTL by our definition (Section 3.4). Thus, for these two transcripts, nicotinamide N-methyltransferase (Nnmt; NM_010924) and nephronophthisis 4 (Nphp4; AK016223), investigations would focus on: (1) identifying polymorphisms within the regulatory regions of Nnmt and Nphp4 between the founder lines, C57BL/6J and DBA/2J; and (2) identifying genes within the other eQTLs that may interact with or regulate Nnmt and Nphp4.

Also of note is the locus-pair D2Mit62-D9Mit91, to which both Tessp4 (testis serine protease 4; AB047759) in the brain and Nt5e (5' nucleotidase, ecto; AK008787) in the kidney are mapped. Both transcripts are encoded on Chr9 and so are potentially regulated in cis by the D9Mit91 eQTL. Although these genes are located ~22 Mb from each other and the D9Mit91 eQTL is located >50 Mb from both transcripts, it is not inconceivable to hypothesise a long-range regulatory control located at the D9Mit91 locus that influences the expression of Tessp4 and Nt5e. The hypothesis is supported by the fact that both genes have hydrolase activity and so may be regulated by similar processes.


Table 5-2 NSA locus-pairs where one of the eQTLs is located on the same chromosome as the expression trait, in the CotB/K/L studies. For two of the NSA eQTL-pairs, one of the eQTLs is linked in cis to the expression trait (within 5 Mb).

| Study | Transcript accession ID | Transcript Chr | Transcript Mb | Primary locus marker | Primary locus Chr | Primary locus Mb | Secondary locus marker | Secondary locus Chr | Secondary locus Mb |
|---|---|---|---|---|---|---|---|---|---|
| CotB | AK019601 | 1 | 37.6 | D1Mit134 | 1 | 80.8 | D19Mit68 | 19 | 3.4 |
| CotB | AB047759 | 9 | 110.9 | D2Mit62 | 2 | 117.9 | D9Mit91 | 9 | 37.3 |
| CotK | AK008787 | 9 | 88.7 | D2Mit62 | 2 | 117.9 | D9Mit91 | 9 | 37.3 |
| CotK | NM_008712 | 5 | 117.1 | D5Mit155 | 5 | 97.0 | D9Rp2 | 9 | 121.6 |
| CotL | NM_010924 | 9 | 48.6 | D2Mit436 | 2 | 80.0 | D9Mit4 | 9 | 52.3 |
| CotL | AK016223 | 4 | 151.0 | S04Gnf150.225 | 4 | 148.8 | D12Mit235 | 12 | 37.9 |
| CotL | AK017586 | 10 | 95.4 | S10Gnf020.445 | 10 | 22.3 | D12Mit114 | 12 | 65.3 |
| CotL | NM_018733 | 2 | 66.1 | D2Mit396 | 2 | 122.6 | D18Mit149 | 18 | 45.4 |


5.5. Random genotype pattern association as a source of multiple-linkages

Discounting the average 48% of multiple-linkages whose corresponding genotype patterns are either in synteny or are NSA, there remains, on average, 52% of multiple-linkages unaccounted for. Since genotype patterns are essentially categorical strings, it is plausible to encounter genotype patterns that are randomly associated to some extent.

Using the mid-50% range of the distributions of inter-chromosomal genotype pattern correlations as a basis for no association, pairs of genotype patterns correlating at the tail-50% of the distribution are consequently tagged as under random association, provided they are not already defined as syntenic or NSA (Section 5.1.3). With this criterion, up to 77% of multiple-linkages are associating randomly (Figure 5-14).

From the results in Figure 5-14, we can attribute random genotype pattern similarity to two major causes: (1) small sample size: comparing the RI data to the F2 results, random pattern association is clearly more problematic in the smaller studies; and (2) heterozygosity: even under the assumption that random co-segregation is uniform across all diallelic experimental crosses, when considering genotype patterns as categorical strings, having an additional heterozygous genotype can vastly reduce random pattern correlation.

The small proportions of randomly associated multiple-linkages observed in HubK and HubFC are explained by the fact that many have already been attributed to syntenic association.

As demonstrated in the simulation study in the previous section (Section 5.4), NSA cannot always be distinguished from chance associations, so it is possible that some of the locus-pairs classified as randomly associating may in fact be under non-syntenic association; i.e. have a biological basis for their genotype pattern similarity. Truly distinguishing the two is difficult, however, and is likely only achievable with much larger sample sizes.

To determine whether locus-pairs classified as randomly associating in Figure 5-14 are truly random, we tested for their joint effects on the corresponding expression traits. Multiply-linked eQTLs with truly chance similarity of genotype patterns are necessarily spurious by definition and so are not expected to pass the bqtl.twolocus test for confirmation of two-locus effects. However, in eight of the ten eQTL studies, two-locus effects are confirmed for all randomly associated multiple-linkages (Figure 5-15). In the other two studies, a sizeable 71% (HubFC) and 94% (SchL) of randomly associated multiple-linkages also passed the bqtl.twolocus test.

A potential explanation for these results is that the definition of random association may be too liberal. Recall from Section 5.1.3 that a pair of markers is considered randomly associated if their genotype patterns correlate with a Pearson’s correlation coefficient within the tail-50% of all pair-wise inter-chromosomal marker correlations (Figure 5-4). That is, non-associating eQTLs may have been falsely classified as under random association by this criterion. To test this, we set a more stringent tail-10% criterion for random association: random association is only assigned to marker pairs whose genotype patterns correlate at <5th or >95th percentile of the correlation distribution of all pair-wise inter-chromosomal markers.
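The two tail criteria can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, the percentile interpolation rule is an assumption, and the toy correlation distribution is made up. It also omits the prior exclusion of syntenic and NSA pairs required by Section 5.1.3.

```python
from statistics import quantiles


def tail_cutoffs(correlations, tail_percent):
    """Return the (lower, upper) correlation cut-offs for a tail criterion.

    tail_percent=50 gives the 25th and 75th percentiles; tail_percent=10 gives
    the 5th and 95th percentiles of the supplied correlation distribution.
    """
    cuts = quantiles(correlations, n=100, method="inclusive")  # 99 cut points
    half = tail_percent // 2
    return cuts[half - 1], cuts[99 - half]


def in_tail(r, cutoffs):
    """True if correlation r lies outside the central part of the distribution."""
    lower, upper = cutoffs
    return r < lower or r > upper


# Toy distribution of pair-wise inter-chromosomal correlations (made up).
correlations = [i / 100 for i in range(-100, 101)]
liberal = tail_cutoffs(correlations, 50)   # tail-50% criterion
stringent = tail_cutoffs(correlations, 10)  # tail-10% criterion
```

A marker pair whose correlation falls in the tail under the liberal cut-offs but not under the stringent ones is the kind of pair that moves from "randomly associated" to "unassociated" when the criterion is tightened.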


[Figure 5-14: bar chart. Y-axis: % transcripts linked to multiple eQTLs that are randomly associated. X-axis: the ten eQTL datasets. Bar values range from 9.3% to 77.8%.]

Figure 5-14 Percentage of transcripts linked to multiple eQTLs that are randomly associating (defined using tail-50% criterion). Actual numbers making up these percentages are in Appendix A.8.

[Figure 5-15: bar chart. Y-axis: % transcripts linked to multiple eQTLs that are randomly associated. X-axis: the ten eQTL datasets. Legend: two-locus effect; two-locus epistatic effect.]

Figure 5-15 Percentages of transcripts linked to randomly associated eQTLs (using tail-50% criterion) that are confirmed for two-locus and two-locus epistatic effects. Actual numbers making up these percentages are in Appendix A.8.


The extent of randomly associating multiple-linkages predictably decreases after setting a more stringent criterion (Figure 5-16). But the relative proportions confirmed for two-locus effects do not differ from the results observed using the more liberal definition of random association (compare Figure 5-17 with Figure 5-15).

With this we are forced to conclude that multiple-linkages resulting from random genotype pattern association cannot be distinguished from real two-locus effects using bqtl.twolocus. That is, genotype patterns of some marker loci are related in such a way that they can score statistically significant results in both the eQTL mapping analysis and the bqtl.twolocus confirmation test, thus giving the impression that these loci are genuinely influencing gene expression variations.

If this is true, then one possible explanation³ is that these pairs of randomly associating, physically unlinked loci are among many loci harbouring weak variants that distinguish the two progenitor lines. In such a scenario, if a pair of such loci also happens to co-segregate more often than expected by chance, then because both loci will be detected in single-locus mappings (due to their independent effects), they would also be more likely to appear to have joint effects on the expression trait.

³ This idea was provided by an anonymous reviewer of this PhD thesis.


[Figure 5-16: bar chart. Y-axis: % transcripts linked to multiple eQTLs that are randomly associated. X-axis: the ten eQTL datasets. Bar values range from 7.50% to 63.89%.]

Figure 5-16 Percentages of transcripts linked to multiple eQTLs that are randomly associating (using tail-10% criterion). Actual numbers making up these percentages are in Appendix A.8.

[Figure 5-17: bar chart. Y-axis: % transcripts linked to multiple eQTLs that are randomly associated. X-axis: the ten eQTL datasets. Legend: two-locus effect; two-locus epistatic effect.]

Figure 5-17 Percentages of transcripts linked to randomly associated eQTLs (using tail-10% criterion) that are confirmed for two-locus and two-locus epistatic effects. Actual numbers making up these percentages are in Appendix A.8.


5.6. Unassociated multiple-linkages

Finally, any set of multiple eQTLs linked to a common transcript that is neither syntenic, NSA, nor randomly associated (using the tail-50% criterion) is classified as free from genotype pattern association. That is, their concurrent linkages, identified from single-locus linkage tests, are not due to genotype pattern association. These multiple-linkages are expected to be real and should all be confirmed for two-locus effects using the bqtl.twolocus test.

No more than 17% of multiple-linkages are unassociated, and these are found in only four datasets (Figure 5-18): CotL, YveY3, YveY5, and SchL. All of these are confirmed to have two-locus effects on their corresponding expression traits (Figure 5-19), reinforcing our supposition that unassociated multiple-linkages are likely real.

It is worth mentioning that, in CotL, only one transcript (one of a total of 35 transcripts linked to multiple eQTLs: ~3%) is linked to multiple unassociated eQTLs. This transcript, AK014404, corresponding to the RIKEN clone 3732407C23Rik, is linked to the marker loci D5Mit139 on Chr5 and S07Gnf047.960 on Chr7.


[Figure 5-18: bar chart. Y-axis: % transcripts linked to multiple eQTLs that are unassociated. X-axis: the ten eQTL datasets. Bar values: 2.9%, 14.6%, 15.1% and 17.0% for four datasets; 0.0% for the remaining six.]

Figure 5-18 Percentages of transcripts linked to multiple eQTLs that are unassociated, following defining random association with the tail-50% criterion (Figure 5-14). Actual numbers making up these percentages are in Appendix A.8.

[Figure 5-19: bar chart. Y-axis: % transcripts linked to multiple eQTLs that are unassociated. X-axis: the ten eQTL datasets. Legend: two-locus effect; two-locus epistatic effect.]

Figure 5-19 Percentages of transcripts linked to unassociated eQTLs (following defining random association with the tail-50% criterion) that are confirmed for two-locus and two-locus epistatic effects. Actual numbers making up these percentages are in Appendix A.8.


Except for the CotL result, it is interesting to note that, of all ten eQTL datasets, only the two haploid yeast studies and the F2 mouse study have transcripts linked to multiple eQTLs that are unassociated. We ascribe this result to heterozygosity and increased sample size. Sample sizes in the yeast and F2 studies are up to five times larger than in the RI mouse and rat studies, which may have improved the statistical power to detect multi-locus influences in the single-locus linkage analyses. In addition to increased sample size, SchL is the only study that utilised a heterozygous population. An extra heterozygous genotype at each locus allows a better estimate of trait variance and a more accurate and sensitive correlation between trait values and genotype values of the causative locus (loci).

For completeness, alternate results of unassociated multiple-linkages are presented in Figure 5-20 and Figure 5-21. These results are derived from using the tail-10% criterion to define random association (Section 5.5; Figure 5-16 and Figure 5-17). As expected, the extent of multiple-linkages defined as unassociated increases because more multiple-linkages have escaped the test of random association. As with the results from using the tail-50% criterion, more unassociated multiple-linkages are observed in the yeast and F2 mouse studies (Figure 5-20) and almost all are confirmed for two-locus effects (Figure 5-21).

Here we have identified a set of marker-pairs showing co-inheritance with the same expression traits but that are not genotypically correlated: Pearson’s correlation coefficient between the genotype patterns of each of these marker-pairs is approximately zero. Accordingly, these marker-pairs should not be in linkage disequilibrium. LD measures such as Lewontin’s D’ (LEWONTIN and KOJIMA 1960) and $R^2$ (HILL and ROBERTSON 1968) can be used to test this. The results are as expected: of all the marker-pairs identified from single-locus mapping, those that have been defined as unassociated scored a lower LD value (D’ ≈ 0; Figure 5-22; see also Appendix A.9 for similar plots comparing Pearson’s correlation coefficient with joint two-locus linkage significance of the ten eQTL-mapping studies).

[Figure 5-20: bar chart. Y-axis: % transcripts linked to multiple eQTLs that are unassociated. X-axis: the ten eQTL datasets. Bar values range from 0.0% to 45.5%.]

Figure 5-20 Percentages of transcripts linked to multiple eQTLs that are unassociated, following defining random association using tail-10% criterion (Figure 5-16). Actual numbers making up these percentages are in Appendix A.8.

[Figure 5-21: bar chart. Y-axis: % transcripts linked to multiple eQTLs that are unassociated. X-axis: the ten eQTL datasets. Legend: two-locus effect; two-locus epistatic effect.]

Figure 5-21 Percentages of transcripts linked to unassociated eQTLs (following defining random association using tail-10% criterion) that are confirmed for two-locus and two-locus epistatic effects. Actual numbers making up these percentages are in Appendix A.8.


Figure 5-22 Lewontin’s D’ measure of LD between locus-pairs identified from single-locus mapping analyses of the CotB/K/L data, plotted against joint effects estimated from bqtl.twolocus. Indicated in red and green are locus-pairs presumed to be unassociated (see text). See Appendix A.9 for similar plots comparing Pearson’s correlation coefficient with joint two-locus linkage significance.


This is not surprising because Pearson’s correlation is similar to LD measures (Figure 5-23). LD measures such as Lewontin’s D’ assess the strength of linkage between two markers based on their allelic distributions within a mapping population. Pearson’s correlation, more specifically, directly assesses genotype pattern similarity, and since allelic distribution is the ratio of genotypes within a genotype pattern, genotype pattern similarity is essentially a more specific measure of LD. This is supported by the observation that Pearson’s correlation coefficient is more stringent than Lewontin’s D’, and by the fact that a common alternate measure of LD is mathematically identical to the square of Pearson’s coefficient, r: $R^2 = r^2$ (Figure 5-23 right panel). Again, the fact that unassociated marker-pairs are unlinked supports our proposition that they are likely real.
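The relationship between these measures can be checked numerically with a small Python sketch. It assumes the standard textbook definitions of D, D’ and the squared correlation measure of LD, one 0/1-coded haplotype per individual (as in fully inbred or haploid lines), and made-up genotype patterns; the function names are hypothetical.

```python
from math import sqrt


def ld_measures(x, y):
    """Return (D, D', R2) for two 0/1 genotype patterns of equal length.

    Assumes one haplotype per individual and that both markers are polymorphic.
    """
    n = len(x)
    p_a = sum(x) / n                              # frequency of allele 1 at locus A
    p_b = sum(y) / n                              # frequency of allele 1 at locus B
    p_ab = sum(a * b for a, b in zip(x, y)) / n   # frequency of the 1-1 haplotype
    if p_a in (0, 1) or p_b in (0, 1):
        raise ValueError("both markers must be polymorphic")
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


def pearson(x, y):
    """Pearson's correlation coefficient between two genotype patterns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


# Made-up genotype patterns for two markers across eight lines.
marker_a = [0, 0, 1, 1, 0, 1, 0, 1]
marker_b = [0, 0, 1, 1, 1, 1, 0, 0]
d, d_prime, r2 = ld_measures(marker_a, marker_b)
r = pearson(marker_a, marker_b)
```

For 0/1-coded patterns the covariance equals D and each variance equals p(1 - p), which is why the squared Pearson coefficient coincides with the squared-correlation LD measure, while |r| never exceeds D’.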


Figure 5-23 Comparisons of Pearson’s correlation coefficient with D’ (LEWONTIN and KOJIMA 1960) and $R^2$ (HILL and ROBERTSON 1968) as measures of LD.


5.7. Chapter summary and discussions

The major outcome from this thesis chapter is the development of a decision tree (Figure 5-3 and Figure 5-24) to categorise multiple-linkages by their believability using knowledge of genotype pattern association.

5.7.1. The pipeline

The necessity of this pipeline arises from the fact that not all multiple-linkages identified from single-locus mapping analyses are real and not all confirmed two-locus effects from bqtl.twolocus are real. By first delineating multiply-linked eQTLs by their sources of genotype pattern-association and then applying the bqtl.twolocus test, this pipeline allows more informed decisions to be made about the credibility of these multiple-linkages.

A summary of the results from applying this pipeline to 17-402 sets of multiple-linkages, as determined from previous chapters, is presented in Figure 5-24. Without this pipeline and using bqtl.twolocus alone, 90% of multiple-linkages are confirmed for two-locus effects, on average. However, with the additional information of genotype pattern-association, only 38.5% of multiple-linkages are awarded high confidence to be real (circled in Figure 5-24). The other 54.5% are potentially erroneous and will have comparatively lower priority for further investigation.


Transcripts with multiple linkages: 17-402 (2.5%-20.3%)

Are multi-locus in synteny?
  yes: Are multi-locus confirmed for two-locus effect?
    yes: Obligatory syntenic co-segregation; 0%-24.6% (8.5%) [circle]
    no: Residual LD; 0%-60.8% (10.2%) [square]
  no: Are multi-locus genotype patterns NSA?
    yes: Are multi-locus confirmed for two-locus effect?
      yes: NSA; 8.8%-48.9% (25.0%) [circle]
      no: Spurious multiple linkage; 0%-13.3% (4.4%) [square]
    no: Are multi-locus genotype patterns randomly associated (tail-50%)?
      yes: Are multi-locus confirmed for two-locus effect?
        yes: Spurious; 6.6%-77.8% (46.5%) [square]
        no: Spurious; 0%-2.6% (0.5%) [square]
      no: Are multi-locus confirmed for two-locus effect?
        yes: Unarguably real; 0%-17% (5%) [circle]
        no: Spurious; 0% [square]

Figure 5-24 Summary result from pipeline for determining real and spurious multiple-linkages. The results are across the ten eQTL studies. Between 17 and 402 transcripts are linked to more than one eQTL; this is equivalent to a range of 2.5%-20.3% of transcripts linked to at least one eQTL. All other percentages are predicated on the total number of transcripts with multiple-linkages per dataset. At the end of the decision tree, in circles (real multiple-linkage) and squares (spurious multiple-linkage), are the ranges and averages (in brackets) of transcripts satisfying the questions along the tree branches. Actual numbers and percentages corresponding to each study can be found in Appendix A.8.
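The logic of the decision tree can be written down as a short Python sketch. The function name, argument names and category labels are hypothetical; the four inputs are assumed to have been computed beforehand (synteny, NSA and random association from genotype patterns, and two-locus confirmation from bqtl.twolocus).

```python
def classify_multiple_linkage(syntenic, nsa, random_assoc, two_locus_confirmed):
    """Return (category, high_confidence) for one multiple-linkage.

    The questions are asked in the order of the decision tree: synteny first,
    then NSA, then random association (tail-50% criterion).
    """
    if syntenic:
        if two_locus_confirmed:
            return "obligatory syntenic co-segregation", True
        return "residual LD", False
    if nsa:
        if two_locus_confirmed:
            return "NSA", True
        return "spurious multiple-linkage", False
    if random_assoc:
        # Treated as spurious whether or not bqtl.twolocus confirms the pair.
        return "spurious (random association)", False
    if two_locus_confirmed:
        return "unarguably real", True
    return "spurious", False


# Hypothetical example: an inter-chromosomal pair that is neither NSA nor
# randomly associated and passes the two-locus test.
category, high_confidence = classify_multiple_linkage(
    syntenic=False, nsa=False, random_assoc=False, two_locus_confirmed=True
)
```

Only the three branches that return `True` correspond to the circled (high-confidence) leaves of the tree.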


5.7.2. Sources of genotype pattern association

On average, 19% of multiply-linked eQTLs have genotype patterns that are in syntenic association, 29% in non-syntenic association, 47% under random association, and 5% are unassociated. That is, genotype pattern association appears to have caused the majority of multiple-linkages.

Genotype pattern-association is either real or spurious. Real association is due to co-segregation of distinct genomic regions because their interaction is under selection (PETKOV et al. 2005; CERVINO et al. 2005). The mechanism of interaction between these regions is unclear and may range from direct association at a chromosomal level to physical interactions between gene-products encoded in these regions. While regulation of gene expression may require interaction between multiple loci, the reverse is not necessarily true: regions that co-segregate and have associated genotype patterns may not be involved in gene expression. The consequence of this is multiple-linkages that demonstrate NSA but fail the bqtl.twolocus test for two-locus effects.

Spurious genotype pattern association is due either to residual LD between syntenic loci or to random association. The former can be distinguished from true syntenic multi-locus interactions using bqtl.twolocus. The latter is purely random and is expected to be more prevalent in studies with small sample sizes (small n, the number of individuals). This is because, under random expectation, the probability of two loci having identical genotypes across all individuals diminishes as n increases. Because of the arbitrariness with which “randomness” is defined, not all multiple-linkages classified as randomly associating are false. If the definition (i.e. threshold) is too liberal there will be many false negatives, and if the reverse is true many spurious multiple-linkages will be defined as unassociated, resulting in false positives. Research into the optimal method for defining random association is valuable, as 46.5% of all multiple-linkages fall into this category, on average (Figure 5-24).
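The dependence of chance genotype pattern similarity on sample size can be illustrated with a small simulation. This Python sketch is not part of the analyses in this thesis: it assumes independent diallelic loci with equal allele frequencies, arbitrary sample sizes, and an arbitrary correlation threshold of 0.5.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson's correlation of two genotype patterns (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def chance_tail_fraction(n_individuals, n_pairs, threshold, rng):
    """Fraction of independent random marker pairs with |r| >= threshold."""
    hits = 0
    for _ in range(n_pairs):
        x = [rng.randint(0, 1) for _ in range(n_individuals)]
        y = [rng.randint(0, 1) for _ in range(n_individuals)]
        hits += abs(pearson(x, y)) >= threshold
    return hits / n_pairs


rng = random.Random(2007)
# Illustrative sample sizes only; they are not the sizes of the ten studies.
fractions = {n: chance_tail_fraction(n, 1000, 0.5, rng) for n in (16, 32, 128)}
```

Under these assumptions the fraction of unrelated marker pairs that reach a given correlation purely by chance shrinks quickly as the number of individuals grows.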

5.7.3. Multiple-linkages and polygenic expression traits

Like many quantitative traits, transcript abundance is likely complex. This is supported by a recent study that estimated, using simulation studies, that approximately half of the highly heritable expression traits are influenced by at least five additive eQTLs (BREM and KRUGLYAK 2005). In their study, a panel of 112 yeast segregants was used. The fact that they failed to identify eQTLs for ~40% of highly heritable traits even with such a relatively large sample size suggests that many current eQTL-mapping studies simply lack the power to detect eQTLs with small effects.

Clearly, it would be ideal to use multi-locus mapping analyses for dissecting the genetics of complex traits, such as mRNA abundance. But as we have demonstrated in Section 4.4, the experimental system, particularly in relation to sample size, of current eQTL-mapping studies is such that we simply lack the statistical power. While this chapter does not claim to have solved this daunting problem, it does provide a procedure for elucidating and assessing the validity of multiply-linked eQTLs resulting from single-locus eQTL-mapping analyses.

In CARLBORG et al. (2005), the authors described an automated pipeline for assessing significance for each expression trait and its corresponding eQTL(s), using FDR and prior information on repeatability (if replicates are available) or heritability (in the absence of replicates) and the relative location of the gene transcript to the eQTL. Incorporating their strategy with ours would help prioritise eQTLs for further, often expensive, follow-up analyses.


6. Discussions

In studies of expression quantitative trait locus-mapping there are generally two directions of interest. One is to study specific expression phenotypes, in which case we are likely to be primarily interested in the influence of cis-acting variations on the expression of particular genes that correlate with the phenotypes. In contrast, the more general direction is to study the regulatory architecture of gene expression, which places more focus on trans-acting variations whose actions are likely pleiotropic and affect many genes. This thesis project has focused mainly on the latter area and in particular we are interested in the phenomena of one locus influencing many genes and one gene being influenced by many loci.

In this concluding chapter I will first provide a summary of the pipeline generated from this thesis project for guiding eQTL-mapping analyses to investigate the regulatory architecture of gene expression. This will then be followed by discussions of two main conclusions: expression phenotypes are oligogenic traits (Section 6.2) and regulation of gene expression is largely pleiotropic (Section 6.3).


6.1. eQTL-mapping pipeline

This pipeline, provided in Figure 6-1, can be grouped into four main sections. First is the pre-processing of expression data (boxed in red), with the principal aim of eliminating microarray-related variations that can potentially confound analyses and interpretations of eQTL-mapping studies. While the exact choice of normalisation is dependent on the microarray platform and distributional characteristics of the expression data, both within- and between-array normalisations are highly recommended (Section 3.7.2).

There are many statistical methods for eQTL-mapping (Section 1.6), which take appropriately normalised expression data and marker genotypes as inputs and generally return a gene-by-marker matrix of raw P-values as output. Next, this P-matrix is “reduced” (pink box), using remove.LD, to eliminate potentially LD-induced redundant linkages (Section 4.1.1; Section 2.5). remove.LD may be applied after correction for multiple testing that converts raw P-values to adjusted P-values (GE et al. 2003) or q-values (STOREY 2002; STOREY and TIBSHIRANI 2003). Following remove.LD, a threshold (i.e. cut-off for the P-values) is generally applied to deduce the list of significant linkages: a list of expression traits and their complementary eQTL(s).
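The thresholding step can be sketched in Python as follows. The sketch skips the remove.LD reduction entirely, uses made-up trait names, marker names and P-values, and assumes a Bonferroni-style cut-off of 0.05 divided by the number of markers; all function names are hypothetical.

```python
def significant_linkages(p_matrix, markers, threshold):
    """Map each trait to the markers whose P-value passes the threshold.

    p_matrix maps trait -> list of P-values, one per marker, in marker order.
    Traits without any significant linkage are omitted.
    """
    linkages = {}
    for trait, p_values in p_matrix.items():
        linked = [m for m, p in zip(markers, p_values) if p <= threshold]
        if linked:
            linkages[trait] = linked
    return linkages


def multiple_linkages(linkages):
    """Keep only the traits linked to more than one marker."""
    return {trait: ms for trait, ms in linkages.items() if len(ms) > 1}


# Made-up gene-by-marker P-matrix with three traits and three markers.
markers = ["m1", "m2", "m3"]
p_matrix = {
    "traitA": [1e-6, 2e-5, 0.4],
    "traitB": [0.3, 1e-7, 0.9],
    "traitC": [0.2, 0.5, 0.7],
}
threshold = 0.05 / len(markers)  # assumed Bonferroni-style cut-off
linkages = significant_linkages(p_matrix, markers, threshold)
multiple = multiple_linkages(linkages)
```

The traits returned by `multiple_linkages` are the ones that would proceed to the examination of multiple-linkages, while the per-marker counts in `linkages` feed the search for linkage hotspots.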

Equipped with this list of significant linkages, two directions can be taken: one is to study the oligogenic nature of gene expression (green box), and the other is to study the pleiotropic nature of their trans-acting effectors (blue box).

An oligogenic trait is one that is regulated by multiple effectors, with a few exerting relatively larger effects on the trait than the majority of the remaining effectors. Multiple effectors with large influences are detectable using single-locus mapping analyses but those with minor effects often remain undetected (ROCKMAN and KRUGLYAK 2006). Because many effectors of oligogenic and polygenic traits act in trans and contribute only small effects, trans-eQTLs are often more difficult to detect and verify (reviewed in SLADEK and HUDSON 2006). Within the green section of Figure 6-1 is a strategy for identifying and prioritising multiple-linkages (one trait linked to many eQTLs) resulting from single-locus mapping approaches. This process utilises the bqtl.twolocus test (Section 4.1.2) as well as information on sources of genotype pattern association between the multiply-linked eQTLs (Chapter 5).

A trans-acting regulator is pleiotropic if it controls the expression of many genes. The blue-outlined section is a simple method for identifying these master-regulators. We have used the criterion of >2.76% total linkage at a given genomic region for defining “linkage hotspots” (Section 3.5), but this cut-off can be altered, depending on the desired stringency. This process for finding master regulators also suggests treating neighbouring marker loci all with >2.76% of total linkages as a single hotspot because we are always cautious of LD effects between adjacent genomic regions.
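A minimal Python sketch of this hotspot step follows, assuming the linkage counts per marker are already available and ordered along the genome. The function name `find_hotspots`, the input layout, and the toy counts are hypothetical; the merged count is a simple sum over the merged markers.

```python
def find_hotspots(linkage_counts, cutoff=0.0276):
    """Group adjacent markers whose share of all linkages exceeds `cutoff`.

    `linkage_counts` is a list of (chromosome, position_mb, n_linkages)
    tuples ordered along the genome. Returns one
    (chromosome, start_mb, end_mb, n_linkages) tuple per hotspot.
    """
    total = sum(n for _, _, n in linkage_counts)
    hotspots = []
    previous_was_hot = False
    for chrom, pos, n in linkage_counts:
        is_hot = total > 0 and n / total > cutoff
        if is_hot:
            if previous_was_hot and hotspots[-1][0] == chrom:
                # Neighbouring marker on the same chromosome: extend the hotspot.
                c, start, _, count = hotspots[-1]
                hotspots[-1] = (c, start, pos, count + n)
            else:
                hotspots.append((chrom, pos, pos, n))
        previous_was_hot = is_hot
    return hotspots

# Toy counts (made up): two adjacent hot markers on Chr1, one on Chr8.
counts = [("1", 75, 1), ("1", 77, 40), ("1", 81, 35),
          ("1", 90, 2), ("8", 9, 30), ("8", 20, 1)]
print(find_hotspots(counts))
```

Raising or lowering `cutoff` corresponds to changing the stringency mentioned above.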

A process that is not included in this pipeline is the removal of “unexpressed” genes (Section 3.2.4 and Section 3.2.5). Consideration of the overall expression level of a transcript is important as it provides insights into the extent to which observations may be driven by variations associated with low signal detection. Conversely, genes expressed at low levels are not necessarily endogenously unexpressed or uninteresting; low expression only implies that these genes have a higher probability of generating false positives (Section 3.7.3).

Thus, this step is not omitted because it is unimportant; rather, it is because there is no one specific stage at which this filtering step should be applied. One may wish to filter out “unexpressed” genes prior to normalisation or immediately after normalisation, though this should be done with caution; most normalisation methods are highly dependent on the data’s distributional characteristics and so removing parts of the data through removal of “unexpressed” genes can create a skewness that may violate certain assumptions underlying some normalisation methods. Similarly, removal of “unexpressed” genes prior to performing certain linkage tests can also cause statistical biases, especially during the test of significance and calculation of P-values where the null distribution is often based on the test statistics of other genes. A good stage to filter out “unexpressed” genes is perhaps at the point immediately after the list of significant linkages has been identified because any subsequent analyses are not conditional on other expression traits. Such post hoc filter processes have also been adopted for “parsing QTL parameters” (PEIRCE et al. 2006). However, if the application of this filtering process does not appreciably reduce data dimensionality then there is no harm in retaining the set of “unexpressed” genes for all subsequent analyses.


[Figure 6-1 flowchart. Recoverable box labels: raw expression; within-array normalisation; between-array normalisation; normalised expression; marker genotypes; eQTL-mapping (test statistics, e.g. LR scores; trait variance); raw P; multiple testing correction; adjusted P; marker map; determine pair-wise genotype pattern correlation; remove.LD; reduced P; determine significant linkage with P threshold; list of transcripts and their complementary eQTLs. Multiple-linkage branch: identify linkages per transcript; transcripts with multiple linkages; determine source of genotype pattern association (syntenic, NSA, random, or unassociated multiple linkages); bqtl.twolocus (confirmed or unconfirmed 2-locus effects); priority list of multi-locus linkages. Hotspot branch: identify linkages per locus; identify loci with >3% linkages; merge neighbouring loci; master regulators.]

Figure 6-1 Pipeline for analysing the genetic regulatory architecture of gene expression using eQTL-mapping studies.


6.2. Gene expression is an oligogenic trait

It is now widely accepted that gene expression variation is heritable (BREM et al. 2002; CHESLER et al. 2005; CHEUNG et al. 2003; MONKS et al. 2004; SCHADT et al. 2003). However, across the ten eQTL-mapping studies analysed in this thesis project, <22% of all genes have an identifiable genetic component (i.e. “mappable” to at least one eQTL; Table 4-8), suggesting a greater proportion of genes may be under oligogenic or polygenic effects with unmappable genetic determinants (GIBSON and WEIR 2005). This view is supported by BREM and KRUGLYAK (2005) who failed to identify any eQTLs for 40% of highly heritable expression traits. These types of traits are presumably controlled by many regulators, each exerting a very small effect, and so their determinants are often undetectable in genetic analyses. Oligogenic traits differ from polygenic traits in that, in addition to the many minor regulators, they are also controlled by a few major effectors. Thus, mappable genes determined from the ten eQTL-mapping studies are likely oligogenic, where the mapped eQTLs contain the major variants, while the non-mappable genes (>78% of all genes across the ten studies) are likely polygenic, with causative variants that are undetectable.

These considerations suggest transcript abundance is likely influenced by multiple genetic variants and so a single-locus mapping approach is probably inappropriate for studying the genetic regulatory architecture of gene expression. That <20% of mappable genes are linked to two or more eQTLs (i.e. <2.5% of all genes) across the ten studies (Table 4-8) suggests a relatively high rate of false negatives, and that many genetic influences on many genes are missed (GIBSON and WEIR 2005).

6.2.1. Solution: multi-locus mapping?

A multi-locus linkage analysis may be more appropriate for elucidating the regulatory network of gene expression. However, multi-locus mapping analyses are computationally intractable and statistically complex (Section 1.7; Chapter 4). Recently, STOREY, AKEY and KRUGLYAK (2005) developed a step-wise method for mapping multi-locus effects and have successfully applied it to a set of yeast data (BREM et al. 2002; YVERT et al. 2003; BREM et al. 2005). This method reduces the computational intensity associated with a full n-dimensional genome-scan to n lots of 1-D genome scans, where n is the number of loci assumed to be influencing the expression traits. But, even with this method, the computational intensity is still not trivial. One solution is to utilise high performance computing, namely, by implementing STOREY, AKEY and KRUGLYAK’s algorithm to run batches of expression traits in parallel. This is possible because the analyses are trait-based and so expression data can be split and analysed separately. Clearly, this is conditional on having access to the necessary computational facilities.
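To give a rough sense of the saving, the following Python sketch counts the tests per trait for an exhaustive n-locus scan against a step-wise scan, and shows how traits might be split into batches for parallel runs. It assumes, for illustration only, that the exhaustive scan tests every unordered combination of loci and that each step-wise pass skips loci already selected; the helper names `scan_sizes` and `batches`, and the batch size, are hypothetical and not part of the published method.

```python
from math import comb

def scan_sizes(n_markers, n_loci):
    """Tests per trait: exhaustive n-locus scan versus n one-dimensional scans."""
    exhaustive = comb(n_markers, n_loci)
    # Assumption: each step-wise pass skips the loci chosen in earlier passes.
    stepwise = sum(n_markers - k for k in range(n_loci))
    return exhaustive, stepwise

def batches(items, batch_size):
    """Split a list of trait identifiers into consecutive batches."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

exhaustive, stepwise = scan_sizes(779, 2)    # 779 markers, two-locus model
print(exhaustive, stepwise)

trait_ids = list(range(22000))               # illustrative trait count
jobs = batches(trait_ids, 1000)              # each batch can run independently
print(len(jobs))
```

Because each batch touches only its own traits, the batches can be submitted as independent jobs on whatever computing facility is available.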

The most obvious method for assessing the validity of multiply-linked eQTLs resulting from single-locus mapping studies would be to compare them against the results from a multi-locus mapping analysis. Thus, we applied the method of STOREY, AKEY and KRUGLYAK (2005) to our initial 31 BXD data (Section 4.4). Unfortunately, we found that, given our sample size, we do not have sufficient statistical power for multi-locus analyses such as this. Further applying the method to two subsets of the yeast data originally used in the STOREY, AKEY and KRUGLYAK (2005) paper, with n=31 and n=86, we confirmed the presence of a power issue for multi-locus mapping analyses.

Together, these results suggest a desperate need to increase sample size of eQTL-mapping studies, but in the absence of this and for the existing studies, we recommend bqtl.twolocus for prioritising multiply-linked eQTLs.


6.2.2. Influence of genotype pattern association in mapping multi-locus effects

In this thesis project I have shown that genotype pattern association can greatly influence the analysis and interpretation of multiple-linkages observed from single-locus mapping studies (Chapter 5). By the same token, multi-locus mapping studies may be similarly influenced and should be subjected to the same scrutiny.

It is entirely plausible for interacting loci to have similar genotype patterns because they are expected to have similar co-inheritance patterns (GRABER et al. 2006). This is supported by an emerging hypothesis pertaining to the mechanism of transcription (SPILIANAKIS et al. 2005). This hypothesis states that functionally related genes are co-expressed because the chromosomal regions at which these genes are encoded are physically brought together in the nucleus to be transcribed co-ordinately. Under these circumstances, multiple loci identified by multi-locus mappings will demonstrate genotype pattern association and we would consider “inter-chromosomal communication” as a potential mechanism of the gene’s transcriptional control.

Conversely, significant multi-locus mapping results may arise due to random genotype pattern association. There are a couple of solutions for overcoming this problem. One is to increase the number of individuals used. If the association between genotype patterns is truly random, then statistically, the association should decrease with increasing sample size. An alternative method is to use independent replication: given a relatively large panel of individuals, the study can be split into two and analysed independently. Random genotype pattern similarity is not expected to persist in different individuals, thus results due to random genotype pattern association are not expected to replicate. The sets of individuals in the two replicates need not be completely different (i.e. partial overlap is possible), but only different enough to ensure random genotype pattern associations do not persist.
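The expected decline of random association with sample size can be illustrated by simulation. The Python sketch below assumes two unlinked loci whose genotypes are independent and equally likely to carry either parental allele, which is a simplification of a recombinant inbred panel; the function name, the number of simulated locus-pairs, and the seed are all hypothetical choices.

```python
import numpy as np

def mean_abs_random_correlation(n_individuals, n_pairs=2000, seed=0):
    """Mean |Pearson r| between genotype patterns of independent locus-pairs.

    Assumption: genotypes are coded 0/1 and drawn independently with equal
    probability, so any association observed is random by construction.
    """
    rng = np.random.default_rng(seed)
    a = rng.integers(0, 2, size=(n_pairs, n_individuals)).astype(float)
    b = rng.integers(0, 2, size=(n_pairs, n_individuals)).astype(float)
    a -= a.mean(axis=1, keepdims=True)
    b -= b.mean(axis=1, keepdims=True)
    denom = np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))
    valid = denom > 0                    # skip monomorphic simulated loci
    r = (a * b).sum(axis=1)[valid] / denom[valid]
    return float(np.abs(r).mean())

for n in (31, 86, 310):
    print(n, round(mean_abs_random_correlation(n), 3))
```

The average absolute correlation shrinks roughly in proportion to one over the square root of the number of individuals, which is why larger panels are less prone to random genotype pattern association.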


6.3. Regulation of gene expression is pleiotropic

In the previous section, we discussed the phenomenon of one gene being influenced by multiple eQTLs. In this section, we focus on the phenomenon of one eQTL influencing multiple genes. In Section 3.5 we introduced the concept of a “linkage hotspot”, defined as an eQTL with >2.76% of total linkages. Under random expectation, each locus should not have more than 0.13% of total linkages (given 779 markers), thus marker loci with >2.76% of total linkages are indicative of the presence of a master regulator. Recall from Section 3.5 that a total of five potential master regulators were identified in the 31 BXD brain, kidney, and liver tissue studies (Table 3-7; Figure 3-17): Chr1: 77-81Mb (Brain); Chr8: 9-15Mb (Brain and Kidney); Chr9: 34-38Mb (Kidney); Chr19: 9-11Mb (Brain); and ChrX: 75-85Mb (Liver).
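As a worked check of the random expectation quoted above, assuming linkages are spread evenly over the 779 markers (the ratio in the second expression is added here only for illustration):

```latex
\[
\frac{1}{779} \approx 0.00128 \approx 0.13\%,
\qquad
\frac{2.76\%}{0.13\%} \approx 21
\]
```

That is, under this assumption the hotspot cut-off corresponds to roughly 21 times the share of linkages expected at any one marker.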

Extending the analysis described in Section 3.5 to the other seven eQTL-mapping studies examined in this thesis project shows that linkage hotspots are not exclusive to our studies (Appendix A.4.1). These results suggest “linkage hotspots” are not spurious observations and master regulation of gene expression is present in other experimental crosses. We therefore conclude that regulators of gene expression are pleiotropic and a variant in such an effector will influence many target genes.

6.3.1. Mechanism of control of gene expression

The regulation of gene expression is complex and involves many steps and different effectors, including those that are directly involved in transcription and translation, or in mRNA stability and degradation, or in transportation and signalling pathways (Section 1.2). These effectors may act positively to increase the expression of their target genes, or negatively to decrease their expression. For simplicity we propose three mechanisms of action for any effector of gene expression:

1) Expression variation in the effector influences the expression level of its target gene(s): an increase in the concentration of an effector within a cell increases the efficiency of its action, thus altering the level of expression of its target(s);

2) Sequence variation in the effector influences the expression level of its target gene(s): a mutant effector may, for example, bind with less affinity to its target, thus reducing the target’s level of expression;

3) Sequence variation in an effector influences the expression level of another effector, which in turn influences the expression level of its target gene(s).

A major problem with eQTL and classical QTL mapping studies is that a significant linkage only implies the presence of a potential regulator residing within the region of the eQTL, but the region can be relatively large, ranging from a couple to tens of megabases (Mb). The difficulty is then to identify, from within such a large locus, the causative gene (or variant) controlling the target gene’s expression level. This can be achieved by:

1) Searching for sequence variants within the locus that complement the genotypes of the eQTL;

2) Inferring interactions between the target gene and its potential effector gene from within the eQTL through their functions, if such information is available;

3) Searching for potential binding sites near the target gene that may have a complementary domain within one of the potential effectors.

Attempts to identify one effector from a list of many are difficult because any inferred correlation between the target gene and candidate effector may be within random expectation. The beauty of linkage hotspots is that, for each eQTL, there are many target genes. While “hotspots” are not necessarily true “master regulators”, they do suggest segregation of “differences” in pleiotropic effects (ROCKMAN and KRUGLYAK 2006). These differences may be sequence variations or expression variations of one or more gene(s) encoded within a hotspot. Regardless of the mechanism of action, the fact that many expression traits are linked at these hotspots means that any interaction inferred between a candidate effector and many target genes carries more weight than an interaction inferred for a single target gene.

We therefore examined the above three propositions relating to the mechanism of gene expression using linkage hotspot results observed in our three-tissue BXD studies (CotB/K/L). As expected, a search in the NCBI mRNA reference sequence database (RefSeq; http://www.ncbi.nlm.nih.gov/RefSeq/) via the UCSC Mouse Genome Browser (http://genome.ucsc.edu/) shows that there are many genes residing within each of the five hotspots identified in CotB/K/L. By focusing only on the genes that are known and annotated or are represented on our Compugen microarrays, we found 16 genes within the Chr1 hotspot, 48 within the Chr8 hotspot, 35 within the Chr9 hotspot, 34 within the Chr19 hotspot, and 8 within the ChrX hotspot (Table A-9 in Appendix A.4.2).

Of these sets, a total of 39 genes are represented on the same microarrays used to measure gene expression in these three studies: 5 within the Chr1 region, 14 within the Chr8 region, 9 within Chr9, 7 within Chr19, and 4 within ChrX. That is, the expression data for 39 of these potential effector genes are available, allowing the first proposition to be addressed. Significant correlations between the expression pattern of a potential effector and the expression patterns of the target genes (the genes that are linked to the hotspots) would suggest that expression variation of the targets is influenced by the expression variation of the effector. A simple correlation between the expression levels of each of the 39 potential effector genes and their corresponding targets across the 31 BXD strains in the respective tissues shows support for the first proposition for some effectors (Figure 6-2).
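The correlation described above can be sketched in Python as follows, assuming expression values (e.g. M-values) are held as one vector per effector and one row per target across the same strains. The function name and the toy values are hypothetical, and this is only an illustration of a Pearson correlation, not the code used for Figure 6-2.

```python
import numpy as np

def effector_target_correlations(effector, targets):
    """Pearson correlation of one effector profile with each target profile.

    effector: shape (n_strains,); targets: shape (n_targets, n_strains).
    Profiles with zero variance are not handled in this sketch.
    """
    effector = np.asarray(effector, dtype=float)
    targets = np.asarray(targets, dtype=float)
    e = effector - effector.mean()
    t = targets - targets.mean(axis=1, keepdims=True)
    return (t @ e) / (np.sqrt((t ** 2).sum(axis=1)) * np.sqrt((e ** 2).sum()))

# Toy data (made up): 5 strains, 3 targets.
effector = [1, 2, 3, 4, 5]
targets = [
    [2, 4, 6, 8, 10],   # rises with the effector
    [5, 4, 3, 2, 1],    # falls as the effector rises
    [1, 3, 2, 3, 1],    # no linear relationship
]
r = effector_target_correlations(effector, targets)
print(np.round(r, 3))
```

Plotting the distribution of such correlations for every target of a hotspot gives one density curve per potential effector, with peaks shifted towards +1 or -1 when the effector and its targets are co-expressed.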


Figure 6-2 Expression correlation between potential master regulators and their target genes. Each line in each panel is the density distribution of the correlations between the M-values of a potential effector against the M-values of the targets linked to the eQTL of the effector. The top-left panel is of the five potential effectors within the Chr1 linkage hotspot, which we have expression data for, against the 1247 target genes from the brain study. The middle and left panels in the top and middle rows are of the 14 potential effectors within the Chr8 hotspot and the 274 targets in the brain (top row) and 57 targets in the kidney (middle row). Middle-left panel is of the seven Chr19 potential effectors against 91 targets in the brain. Bottom-left panel is of nine Chr9 potential effectors against 107 targets in kidney. Bottom-middle panel is of four ChrX potential effectors against 114 targets in liver.


The results (Figure 6-2) are most striking for the five Chr1 effector genes (top left panel), which show that the expression of four of the five effectors is highly positively (right-shifted peak) and negatively (left-shifted peak) correlated with their 1247 targets, implying expression variations in one or more of these four effectors are influencing expression of the 1247 targets. Conversely, expression of the remaining potential effector, Kcne4 (green line), is uncorrelated with its targets, which may imply either that its influence on the targets is via a sequence variant (satisfying the 2nd proposition above) or that it is not involved in the regulation of the 1247 genes. Similar results are observed in the other datasets, with many potential effectors either more positively or more negatively correlated with their targets, implying a respective activation or repression action. That is, given an effector that activates the expression of x genes but inhibits the expression of y genes, an increase in the concentration of the effector would consequently increase the expression level of the x genes and suppress the expression of the y genes.

If only one of the potential effectors were co-expressed with the target genes, it would be easy to infer a direct manifestation of the first proposition: expression levels of many transcripts are influenced by the transcript abundance of one master regulator. However, Figure 6-2 clearly shows that many of the potential regulators are co-expressed with the targets. There are two possible explanations for this observation. Firstly, there is an emerging theory that gene expression is not independent of gene order: genes within physical clusters are co-expressed (SÉMON and DURET 2006). This suggests the manifestation of hotspots is due to tight clustering of functionally related genes (ROCKMAN and KRUGLYAK 2006), all or few of which may be involved in the regulation of the targets. Thus, amongst the potential effectors co-expressed with the targets, there may exist one (or a few) true master regulator, GX, while the other positional candidates appear co-expressed with the targets only because they are located physically close to, and thus co-transcribed with, GX.


A second possible explanation for the multiple potential effectors showing co-expression with the targets is, of course, the possibility that these effectors jointly influence the targets. For example, within the Chr1 hotspot are two genes, Epha4 (Eph receptor A4) and Pax3 (paired box gene 3), which are both co-expressed with many of the 1247 targets in the brain data. Interestingly, both direct interaction and co-expression between these two proteins/genes have been demonstrated (SWARTZ et al. 2001; BEGUM et al. 2005). In particular, their interaction has been implicated in the regulation of muscle development (SWARTZ et al. 2001). Thus, it is plausible that some or all of the potential co-expressed effectors do indeed jointly influence the expression of many of the mapped target genes.

Although many effectors have expression levels that are correlated with their targets, there are many that do not. In particular, a majority of the nine effectors within the Chr9 hotspot do not appear to correlate with the 107 targets mapped to this region. These results hint at the possibility of the second proposition: it is sequence variations in the effectors that are influencing the expression levels of their targets. As the progenitor strains of the BXD panel are completely sequenced, SNP data are readily available from the NCBI’s dbSNP (http://www.ncbi.nlm.nih.gov/SNP/) and the Ensembl Mouse Transcript SNP View database (http://www.ensembl.org/Mus_musculus/transcriptsnpview). Of the 141 genes that are known and annotated or represented on the Compugen arrays, 54 have at least one SNP between the two founder strains (Table A-9 in Appendix A.4.2). Systematic examination of these SNPs would help reduce the set of candidate effectors.

An interesting observation from our SNP search is that the regions on Chr9 and Chr19 are particularly variable with many SNPs. It is known that many of the laboratory inbred mouse strains were initially derived from a very small number of founders and so share many haplotypes (BECK et al. 2000; IDERAABDULLAH et al. 2004). Thus we predict the Chr9 and Chr19 hotspots to be contained in haplotype blocks with high frequencies of sequence variants. Particularly for the Chr9 hotspot, given that many of the potential effectors are not co-expressed with the 107 target genes, it is likely that expression variation of these target genes is due to sequence variants in one or more of the potential effectors within this Chr9 eQTL.

Interestingly, and referring back to the Chr1 hotspot, we find that within the range of this hotspot region (77-81Mb) there is a lack of sequence variation between the progenitor strains in the 5’ half compared to the 3’ half. If expression levels of the 1247 transcripts mapped to this hotspot are regulated by a sequence variation within this region, then one could quickly reduce the list of positional candidates by concentrating on the 3’ half of this genomic region. This idea is the basis of using haplotype blocks to assist in QTL-mapping and fine mapping analyses (WANG et al. 2004; PARK et al. 2003; GURYEV et al. 2006).

Finally, we address the third proposition that gene expression regulation is hierarchical. In the case where the first proposition is true, that the expression regulation of target genes is controlled by the expression level of an effector, then the effector must itself be regulated. Re-examining eQTL-mapping results for the 37 potential effectors that are represented on the microarray (Chapter 3) shows that only four are significantly linked to an eQTL at Praw≤PBON (Table A-9 in Appendix A.4.2). Cullin 3 (Cul3) within the Chr1 hotspot is mapped to an eQTL on Chr8, which is the Chr8 hotspot identified in both the brain and kidney studies. Furthermore, the other three potential effectors, F7 (coagulation factor VII), Tfdp1 (transcription factor Dp 1), and F10 (coagulation factor X), are all encoded within the Chr8 hotspot and are all significantly linked to a region on Chr1, but only in the brain and not the kidney study. F7 and F10 are both mapped to the Chr1 hotspot, while Tfdp1 is mapped to an eQTL ~7Mb downstream from the Chr1 hotspot.


One possible model inferred from these findings is a potential feedback master regulatory control in the brain (Figure 6-3), where the protein encoded by Cul3 on Chr1 is involved in the regulation of up to 1247 target genes including one or more of the master regulators encoded on Chr8 (F7, F10, and/or Tfdp1). These Chr8 master regulators would in turn regulate the expression of Cul3 in addition to 274 other target genes.


Figure 6-3 Schematic diagram of inter-dependent master regulation of gene expression. Proteins encoded by F7 and/or F10 and/or Tfdp1 may be involved in the regulation of the expression of Cul3, and conversely, the CUL3 protein may regulate the expression of F7, F10, and/or Tfdp1. In turn, each of F7, F10, TFDP1, and CUL3 positively (+) and negatively (–) regulates the expression level of many other target genes.

Finally, in addition to these approaches, one could further narrow down the list of positional candidates by inferring functional relation between one or more potential effectors with all or a subset of the targets. This can be achieved by searching for common binding motifs or common functional annotations amongst the targets then relating them back to the potential effectors. For instance, of the 1247 targets corresponding to the brain Chr1 linkage hotspot, we chose a subset of 372 that are expressed (Section 3.2.5) and searched for common transcription factor binding motifs within 2kb upstream and 2kb downstream of the genes, using Clover (FRITH et al. 2004) and the JASPAR Transcription Factor Binding Profile database (http://jaspar.genereg.net/).⁴ By restricting the Clover score to greater than 10 with a corresponding P-value <0.001, 123 motifs were identified in total. Of these, six are present at least once in >10% of the 372 genes (Table 6-1).

A search for TFs containing these six conserved protein domain motifs in the Ensembl Mouse Genome database (http://www.ensembl.org/Mus_musculus/) did not identify any TFs within the Chr1:77-81Mb region. A possible reason for this lack of finding is that the Chr1 master regulator may not act on the targets directly. Instead, it may be exerting its effects on the target transcripts via one or more TFs harbouring one or more of the six identified over-represented protein domains.

Table 6-1 Over-represented transcription factor (TF) binding-motifs identified in more than 10% of 372 transcripts mapped to the Chr1 linkage hotspot.

JASPAR ID   TF CLASS    MOTIF    # transcripts   % transcripts
MA0073      ZN-C2H2     Rreb1              132            35.5
MA0041      FORKHEAD    Foxd3               76            20.4
MA0042      FORKHEAD    FoxI1               57            15.3
MA0055      bHLH        Myf                 65            17.5
MA0088      ZN-C2H2     Staf                47            12.6
MA0060      CAAT-BOX    Nyf                 41            11.0

Furthermore, we find that Foxd3 (forkhead box D3) and Myf5 (myogenic factor 5; a member of the MYF proteins with bHLH domain) are both mapped at the Chr13 marker D13Mit145 at Praw<0.001 and Praw~0.003, respectively. This suggests Foxd3 and Myf5 may themselves be potentially co-regulated, implying a complicated cascade of genetic regulation for the final control of the >1000 target transcripts.

4 This analysis was performed by Mark Cowley from UNSW.


Clearly, because there are many potential steps at which the expression level of a transcript may be regulated, similar analyses can be performed for different classes of proteins. This is particularly important in light of the emerging theory that TFs are not the major class of effectors influencing expression traits (reviewed in SLADEK and HUDSON 2006). For example, instead of searching for common TF-binding motifs, one could search for common signalling protein-binding motifs, especially since many eQTLs have been shown to be genetic variants influencing signalling and metabolic pathways in yeast (BREM et al. 2002). Additionally, common functional annotations or biological processes amongst and between the targets and potential effectors can similarly be examined. Co-expression and eQTL-mapping analyses can then be performed for any resulting candidates to help build upon, and ultimately dissect, the genetic architecture underlying gene expression variation.

The three propositions presented in this section for the mechanism of gene expression regulation are necessarily simplistic. In reality, the types and combinations of genetic regulatory variations underlying gene expression are numerous (reviewed in ROCKMAN and KRUGLYAK 2006). To fully elucidate the genetics of natural variation, a genetical genomics approach is only an initial step that must be supported by integrating other systems biology approaches such as those suggested in this section as well as those reviewed in ROCKMAN and KRUGLYAK (2006).


Appendix

A.1. Quantile-normalised data

A.1.1. Brain, kidney, and liver BXD studies

Figure A-4 Boxplot representations of M-value distributions of the brain BXD (CotB) data before (top) and after (bottom) quantile-normalisation. In the top panel, within-array normalisation has been performed on all arrays using print-tip loess. In the bottom panel, an extra step of between-array normalisation has been performed using quantile-normalisation. Each boxplot represents the M-value distribution of one microarray experiment; i.e. one BXD strain (x-axis).


Figure A-5 Boxplot representations of M-value distributions of the kidney BXD (CotK) data before (top) and after (bottom) quantile-normalisation. Figure legend as in Table A-1.


Figure A-6 Boxplot representations of M-value distributions of the liver BXD (CotL) data before (top) and after (bottom) quantile-normalisation. Figure legend as in Table A-1.


A.1.2. sim(0)/(1)/(2)/(3)

Figure A-7 Boxplot representations of M-value distributions of five (a-e) sim(0) datasets before (top) and after (bottom) quantile-normalisation. Top panel shows the raw simulated data, while the bottom panel shows the distribution of the data after quantile-normalisation. Each boxplot represents one simulated array, or strain.


Figure A-8 Boxplot representations of M-value distributions of five (a-e) sim(1) datasets before (top) and after (bottom) quantile-normalisation. Legend as in Figure A-4.


Figure A-9 Boxplot representations of M-value distributions of five (a-e) sim(2) datasets before (top) and after (bottom) quantile-normalisation. Legend as in Figure A-4.


Figure A-10 Boxplot representations of M-value distributions of five (a-e) sim(3) datasets before (top) and after (bottom) quantile-normalisation. Legend as in Figure A-4.


A.2. Number of linkages per transcript

Table A-2 Numbers of transcripts linked to 1-5 or >5 eQTLs in the brain, kidney, and liver BXD studies. The first part is for results using the Praw≤PBON criterion for defining linkage and the second part is derived from Praw≤PRED (see Section 3.2.1). all linkages: numbers of linkages prior to removal of potentially LD-induced multiple linkages (Figure 3-7 for Praw≤PRED; Figure 4-5 for Praw≤PBON). remove.LD: numbers of linkages following application of remove.LD (Section 2.5) to eliminate potentially LD-induced linkages (Figure 3-10 for Praw≤PRED; Figure 4-5 for Praw≤PBON).

Praw≤PBON            1       2       3       4       5      >5
all linkages
  brain            952     474     162      58      56      79
  kidney           447     134      46      23       9      11
  liver            561     157      67      33      14      23
remove.LD
  brain           1709      72       0       0       0       0
  kidney           653      17       0       0       0       0
  liver            820      34       1       0       0       0

Praw≤PRED            1       2       3       4       5      >5
all linkages
  brain           1556     726     334     103      89     188
  kidney           740     265     102      42      24      22
  liver           1136     362     152      70      47      54
remove.LD
  brain           2781     209       6       0       0       0
  kidney          1135      57       3       0       0       0
  liver           1713     106       2       0       0       0


Table A-3 Numbers of simulated genes linked to 1-5 or >5 eQTLs in the sim0B studies. Legend as in Table A-2. sim0B.1/2/3/4/5 correspond to the five replicates and sim0B.av is their average. Corresponds to Figure 3-7, Figure 3-10, and Figure 4-5.

Praw≤PBON            1       2       3       4       5      >5
all linkages
  sim0B.1          331     102      44      23       4       7
  sim0B.2          352     131      47      18      11      20
  sim0B.3          348     123      39      16       5       4
  sim0B.4          351      88      57      16       4       5
  sim0B.5          365     116      48      18       4       7
  sim0B.av       349.4     112      47    17.8     5.6     8.6
remove.LD
  sim0B.1          498      13       0       0       0       0
  sim0B.2          551      14       2       0       0       0
  sim0B.3          523      12       0       0       0       0
  sim0B.4          506      13       2       0       0       0
  sim0B.5          542      15       1       0       0       0
  sim0B.av         524    13.4       1       0       0       0

Praw≤PRED            1       2       3       4       5      >5
all linkages
  sim0B.1          678     227      96      48      19      16
  sim0B.2          658     248     126      42      23      20
  sim0B.3          731     229      92      38      21      15
  sim0B.4          686     231     102      44      14      13
  sim0B.5          658     244     105      46      11      18
  sim0B.av       682.2   235.8   104.2    43.6    17.6    16.0
remove.LD
  sim0B.1         1033      50       1       0       0       0
  sim0B.2         1073      39       5       0       0       0
  sim0B.3         1088      38       0       0       0       0
  sim0B.4         1040      46       4       0       0       0
  sim0B.5         1029      50       2       1       0       0
  sim0B.av      1052.6    44.6     2.4     0.2       0       0


Table A-4 Numbers of simulated genes linked to 1-5 or >5 eQTLs in the sim0K studies. Legend as in Table A-2. sim0K.1/2/3/4/5 correspond to the five replicates and sim0K.av is their average. Corresponds to Figure 3-7, Figure 3-10, and Figure 4-5.

Praw0PBON                        1       2       3       4       5      >5
all linkages  sim0K.1          403     114      48      19       8       9
              sim0K.2          393     124      42      18       8      12
              sim0K.3          393     129      51      20       4       7
              sim0K.4          402     110      47      19       9       6
              sim0K.5          386     133      45      20      12       6
              sim0K.av       395.4     122    46.6    19.2     8.2       8
remove.LD     sim0K.1          586      15       0       0       0       0
              sim0K.2          584      13       0       0       0       0
              sim0K.3          586      18       0       0       0       0
              sim0K.4          575      18       0       0       0       0
              sim0K.5          585      17       0       0       0       0
              sim0K.av       583.2    16.2       0       0       0       0

Praw0PRED                        1       2       3       4       5      >5
all linkages  sim0K.1          680     233     102      46      15      23
              sim0K.2          710     236      94      40      16      23
              sim0K.3          690     217     106      50      19      17
              sim0K.4          680     240      78      46      20      16
              sim0K.5          676     250      97      41      23      20
              sim0K.av       687.2   235.2    95.4    44.6    18.6    19.8
remove.LD     sim0K.1         1058      40       1       0       0       0
              sim0K.2         1083      34       2       0       0       0
              sim0K.3         1043      55       1       0       0       0
              sim0K.4         1031      48       1       0       0       0
              sim0K.5         1050      56       1       0       0       0
              sim0K.av      1053.0    46.6     1.2       0       0       0


Table A-5 Numbers of simulated genes linked to 1-5 or >5 eQTLs in the sim0L studies. Legend as in Table A-2. sim0L.1/2/3/4/5 correspond to the five replicates and sim0L.av is their average. Corresponds to Figure 3-7, Figure 3-10, and Figure 4-5.

Praw0PBON                        1       2       3       4       5      >5
all linkages  sim0L.1          344      96      42      11       3       7
              sim0L.2          290     106      36      10       7       2
              sim0L.3          289      86      41      21       6       5
              sim0L.4          284      88      40      14       8       8
              sim0L.5          310     111      38      15      10       6
              sim0L.av       303.4    97.4    39.4    14.2     6.8     5.6
remove.LD     sim0L.1          490      13       0       0       0       0
              sim0L.2          445       6       0       0       0       0
              sim0L.3          445       3       0       0       0       0
              sim0L.4          434       8       0       0       0       0
              sim0L.5          479      10       1       0       0       0
              sim0L.av       458.6       8     0.2       0       0       0

Praw0PRED                        1       2       3       4       5      >5
all linkages  sim0L.1          715     269      95      39      25      17
              sim0L.2          673     225     105      39      22      14
              sim0L.3          683     209     113      53      27      18
              sim0L.4          640     226     126      35      19      21
              sim0L.5          671     247     100      45      29      24
              sim0L.av       676.4   235.2   107.8    42.2    24.4    18.8
remove.LD     sim0L.1         1096      62       2       0       0       0
              sim0L.2         1051      26       1       0       0       0
              sim0L.3         1068      35       0       0       0       0
              sim0L.4         1010      57       0       0       0       0
              sim0L.5         1069      46       1       0       0       0
              sim0L.av      1058.8    45.2     0.8       0       0       0


Table A-6 Numbers of simulated genes linked to 1-5 or >5 eQTLs in the sim0norm studies. Linkages are defined using the criterion Praw0PBON. Legend as in Table A-1. Corresponds to Figure 4-5 in Section 4.2.2.

                                 1       2       3       4       5      >5
all linkages  sim0norm.1       819     236      91      31      21      17
              sim0norm.2       754     260     115      51      23      20
              sim0norm.3       765     269      93      41      12      21
              sim0norm.4       744     251     102      35      24      17
              sim0norm.5       717     263     110      31      13      18
              sim0norm.av    759.8   255.8   102.2    37.8    18.6    18.6
remove.LD     sim0norm.1      1180      34       1       0       0       0
              sim0norm.2      1178      45       0       0       0       0
              sim0norm.3      1152      49       0       0       0       0
              sim0norm.4      1132      37       4       0       0       0
              sim0norm.5      1140      46       2       0       0       0
              sim0norm.av   1156.4    42.2     1.4       0       0       0


Table A-7 Numbers of transcripts linked to 1-5 or >5 eQTLs in the ten eQTL-mapping studies analysed in this thesis project. Numbers are calculated after applying remove.LD to eliminate potentially LD-induced multiple linkages (Section 2.5), and using the criterion Praw0PBON for defining significant linkages (Section 3.2.1). Corresponds to Figure 4-11 in Section 4.5.2. Results for CotB/K/L are replicated from Table A-2.

1 2 3 4 5 >5

CotB 1709 72 0 0 0 0

CotK 653 17 0 0 0 0

CotL 820 34 1 0 0 0

ChesB 388 18 1 0 0 0

BysHSC 762 41 4 0 0 0

HubK 1580 350 44 8 0 0

HubFC 897 208 17 1 1 0

YveY3 1202 119 2 2 0 0

YveY5 1127 89 4 0 0 0

SchL 2262 195 5 0 0 0


A.3. cis-linkages

Table A-8 Number (%) of linkages defined as cis- or trans-acting using different “cis-window” sizes in the BXD brain, kidney, and liver studies. Linkages are defined at Praw0PRED (Section 3.2.1). Following application of remove.LD (Section 2.5), there are a total of 2,937 unique linkages in the brain data, 1,150 in kidney, and 1,765 in liver for which the genomic locations of the corresponding transcripts are known (Section 3.4). Corresponds to Figure 3-13.

Tissue  cis window (Mb)   # cis     % cis   # trans   % trans
Brain              0.01       0     0.00%      2937   100.00%
                    0.1       0     0.00%      2937   100.00%
                      1       1     0.03%      2936    99.97%
                      2       2     0.07%      2935    99.93%
                      5      12     0.41%      2925    99.59%
                     10      34     1.16%      2903    98.84%
                     15      43     1.46%      2894    98.54%
                     20      51     1.74%      2886    98.26%
Kidney             0.01       0     0.00%      1150   100.00%
                    0.1       0     0.00%      1150   100.00%
                      1       3     0.26%      1147    99.74%
                      2      11     0.96%      1139    99.04%
                      5      19     1.65%      1131    98.35%
                     10      28     2.43%      1122    97.57%
                     15      34     2.96%      1116    97.04%
                     20      38     3.30%      1112    96.70%
Liver              0.01       0     0.00%      1762   100.00%
                    0.1       0     0.00%      1762   100.00%
                      1       5     0.28%      1757    99.72%
                      2       6     0.34%      1756    99.66%
                      5      18     1.02%      1744    98.98%
                     10      32     1.82%      1730    98.18%
                     15      41     2.33%      1721    97.67%
                     20      53     3.01%      1709    96.99%
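
The cis/trans split reported in Table A-8 can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure of Section 3.4: a linkage is assumed to be cis when the linked marker is on the same chromosome as the transcript and lies within the window size on either side of it. The function names and the toy coordinates are hypothetical.

```python
def classify_linkage(transcript_chr, transcript_mb, marker_chr, marker_mb, window_mb=5.0):
    """Return 'cis' if the marker is within window_mb of the transcript, else 'trans'."""
    same_chromosome = transcript_chr == marker_chr
    if same_chromosome and abs(transcript_mb - marker_mb) <= window_mb:
        return "cis"
    return "trans"


def summarise(linkages, window_mb):
    """Return (# cis, # trans, % cis) for a list of linkages at one window size."""
    n_cis = sum(
        classify_linkage(*linkage, window_mb=window_mb) == "cis" for linkage in linkages
    )
    n_total = len(linkages)
    return n_cis, n_total - n_cis, 100.0 * n_cis / n_total


# Hypothetical linkages: (transcript chr, transcript Mb, marker chr, marker Mb).
toy_linkages = [
    ("1", 10.0, "1", 12.0),   # same chromosome, 2 Mb apart
    ("1", 10.0, "1", 40.0),   # same chromosome, 30 Mb apart
    ("2", 5.0, "3", 5.0),     # different chromosomes
    ("X", 80.0, "X", 84.5),   # same chromosome, 4.5 Mb apart
]
summary_5mb = summarise(toy_linkages, window_mb=5.0)
summary_1mb = summarise(toy_linkages, window_mb=1.0)
```

Repeating `summarise` over a list of window sizes would give one row per window, in the layout of Table A-8.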


Table A-9 Number (%) of linkages at different Praw thresholds that are defined as cis- or trans-acting in the BXD brain, kidney, and liver studies. The extent of linkage is determined following the application of remove.LD (Section 2.5), and cis-linkages are defined by a cis-window of 5Mb (Section 3.4). Corresponds to Figure 3-15.

Tissue  Praw threshold  # unique linkages   # cis     % cis   # trans   % trans
Brain   PRED=0.00014                 3217      12     0.41%      2925    99.59%
        10^-4                        2579       8     0.34%      2344    99.66%
        10^-5                         483       2     0.45%       442    99.55%
        10^-6                          81       1     1.35%        73    98.65%
        10^-7                           9       1    11.11%         8    88.89%
Kidney  PRED=0.00012                 1258      19     1.65%      1131    98.35%
        10^-4                        1007      16     1.75%       900    98.25%
        10^-5                         122       7     6.42%       102    93.58%
        10^-6                          17       5    33.33%        10    66.67%
        10^-7                           6       4    66.67%         2    33.33%
Liver   PRED=0.00015                 1931      18     1.02%      1744    98.98%
        10^-4                        1357      12     0.97%      1221    99.03%
        10^-5                         182       4     2.38%       164    97.62%
        10^-6                          43       2     4.88%        39    95.12%
        10^-7                          20       1     5.26%        18    94.74%


A.4. Linkage hotspots

A.4.1. In seven publicly available eQTL-mapping studies

Figure A-11 Linkage hotspots in ChesB (top), BysHSC (middle), and SchL (bottom). Plotted are the percentages of total linkages (y-axis) at each marker locus along the genome (x-axis). Linkages are defined at Praw0PBON=0.05/M and a locus is defined as a hotspot if it contains >3% of total linkages (dotted line) after application of remove.LD. All three of these datasets use experimental crosses derived from the same two parental strains (C57BL/6J and DBA/2J), but were studied in different tissues. Cross references to Section 6.3.


Figure A-12 Linkage hotspots in HubK (top) and HubFC (bottom). Plotted are the percentages of total linkages (y-axis) at each marker locus along the genome (x-axis). Linkages are defined at Praw0PBON=0.05/M and a locus is defined as a hotspot if it contains >3% of total linkages (dotted line) after application of remove.LD. Both studies used the BXH/HXB RI rat strains but in different tissues: kidney (HubK) and fat cells (HubFC). Cross references to Section 6.3.


Figure A-13 Linkage hotspots in YveY3 (top), and YveY5 (bottom). Plotted are the percentages of total linkages (y-axis) at each marker locus along the genome (x-axis). Linkages are defined at Praw0PBON=0.05/M and a locus is defined as a hotspot if it contains >3% of total linkages (dotted line) after application of remove.LD. Cross references to Section 6.3.
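
The hotspot rule used in Figures A-11 to A-13 can be illustrated with a short Python sketch. It assumes that M is the number of marker loci tested, and that the input is a list with one marker name per linkage remaining after remove.LD. The function names and the toy marker names are hypothetical and do not reproduce the analysis code of this thesis.

```python
from collections import Counter


def bonferroni_threshold(n_markers, alpha=0.05):
    """Per-test threshold alpha/M for M marker loci (assumed meaning of 0.05/M)."""
    return alpha / n_markers


def find_hotspots(linkage_markers, min_fraction=0.03):
    """Return {marker: fraction of total linkages} for markers above min_fraction."""
    counts = Counter(linkage_markers)
    total = sum(counts.values())
    return {
        marker: count / total
        for marker, count in counts.items()
        if count / total > min_fraction
    }


# Hypothetical toy data: 100 linkages, 10 of them at marker "m1".
toy_markers = ["m1"] * 10 + ["m2"] * 2 + [f"x{i}" for i in range(88)]
hotspots = find_hotspots(toy_markers)
threshold = bonferroni_threshold(n_markers=1000)
```

Here only "m1" exceeds the 3% line; "m2" holds 2% of the linkages and every other marker holds 1%.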


A.4.2. Analysis of potential master regulators in the BXD brain (CotB), kidney (CotK), and liver (CotL) studies

Table A-10 Master regulator analysis for CotB, CotK, and CotL. A total of five linkage hotspots are noted: two in CotB, two in CotK and one in CotL, as indicated in column “Tissue” with the short-hands B, K, and L. The Chr8 linkage hotspot is shared by CotB and CotK. The markers immediately up- and downstream of each linkage hotspot bound the region examined. Known and annotated genes within these regions are considered, as are genes that are also represented on the Compugen (CGEN) microarrays (“yes” in the second-last column). SNPs between the two progenitor strains (C57BL/6J and DBA/2J) are obtained from the Ensembl Mouse database (http://ensembl.org). Genes that are represented on the arrays are tested for linkage and the results are noted in the last column (“eQTL”). Only four genes are linked: one from the Chr1 linkage hotspot in the brain study; three from the Chr8 hotspot from the brain (B) study.

Linkage Hotspot  Tissue  Gene Symbol  Gene Product  GO: Biological Processes  SNPs between founders  on CGen  eQTL

axon guidance; tyrosine kinase Epha4 Eph receptor A4 0 yes 0 signalling pathway regulation of transcription; cell Pax3 paired box gene 3 proliferation and migration; neural 0 yes 0 tube development sphingosine-1-phosphate membrane protein with hydrolase Sgpp2 0 phosphotase 2 activity phenylalanine-tRNA synthetase- Farslb phenylalanyl-tRNA aminoacylation 0 like beta

Brain monoacylglycerol O- cell redox homeostasis; lipid Mogat1 0 acyltransferase 1 metabolism

Chr1: 77-81Mb UTP14 U3 small nucleolar Utp14b meiosis; spermatogenesis 0 ribonucleoprotein acyl-CoA synthetase long-chain Acsl3 fatty acid and lipid metabolism 0 family member 3 potassium voltage-gated channel Kcne4 potassium ion transport 0 yes 0 Isk-related Scg2 secretogranin II intracellular signalling cascade; 0 yes 0


chemotxin; negative regulation of apoptosis; positive regulation of cell proliferation adaptor-related protein complex endocytosis; intracellular protein 1 u/s; 5 int; Ap1s3 AP-1 sigma 3 transcport 1d/s WD repeat and FYVE domain pre-mRNA processing and Wdfy1 2 u/s; 1 int containing 1 cytoskeleton assembly mitochondrial ribosomal protein Mrpl44 RNA processing 1 d/s L44 serine (or cysteine) proteinase cell differentiation; nervous system Serpine2 2 syn; 4 int inhibitor clade development 4 n/s; 1 int; 8 A830043J08Rik hypothetical protein LOC241128 unknown d/s Cul3 cullin 3 cell cycle; ubiquitin cycle 1 splice; 4 int yes B: S08Gnf006.700 9430031J16Rik hypothetical protein LOC241134 unknown 0 Tmem28 transmembrane protein 28 unknown 0 cellular proliferation, dna repair, and Lig4 DNA ligase IV prevent chromosome and single 0 chromatid aberrations aromatic compound metabolic Abhd13 abhydrolase domain containing 13 1 d/s process; proteolysis tumor necrosis factor (ligand) B and T cell proliferation, Tnfsf13b 0 superfamily costimulation, homeostasis Myo16 myosin XVI mysoin motor activity 1 syn; 1u/s Irs2 insulin receptor substrate 2 insulin signaling; cell proliferation 0 yes 0 Col4a1 procollagen type IV alpha 1 cell adhesion; phosphate transport 0 cell adhesion; phosphate transport;

Chr8: 9-15Mb Col4a2 procollagen type IV alpha 2 1 syn

Brain and kidney kidney and Brain negative regulation of angiogenesis RAB20 member RAS oncogene protein transport; small GTPase- Rab20 0 family mediated signal transduction 0710008K08Rik hypothetical protein LOC69225 kinase activity 0 inhibitor of growth family member regulation of cell cycle and Ing1 0 yes 0 1 transcription Ankrd10 ankyrin repeat domain 10 regulation of transcription 0 1700016D06Rik hypothetical protein LOC76413 unknown 0


1700018L24Rik hypothetical protein LOC75528 - Rho guanine nucleotide exchange Rho guanyl-nucleotide exchange Arhgef7 1 n/s factor (GEF7) factor activity EG434280 hypothetical protein LOC434280 unknown 0 establishment and/or maintenance of SRY (sex determining region Y)- Sox1 chromatin architecture; regulation of - yes 0 box 1 transcription 1700094C09Rik hypothetical protein LOC78634 unknown 0 tubulin gamma complex associated microtubule cytoskeleton Tubgcp3 0 protein 3 organization and biogenesis metabolic process; phospholipid Atp11a ATPase class VI type 11A 0 yes 0 transport regulation of Rho protein signal Mcf2l mcf.2 transforming sequence-like 1 int transduction F7 coagulation factor VII blood coagulation; metabolic process 0 yes B: D1Mit134 F10 coagulation factor X blood coagulation; proteolysis 0 yes B: D1Mit216 protein Z vitamin K-dependent Proz blood coagulation; proteolysis 0 plasma DNA replication, recombination, and Pcid2 PCI domain containing 2 0 repair induction of apoptosis by Cul4a cullin 4A intracellular signals; regulation of 0 cell proliferation; DNA repair Grtp1 GH regulated TBC protein 1 regulation of Rab GTPase activity 0 Adprhl1 ADP-ribosylhydrolase like 1 protein amino acid ADP-ribosylation 0 DCN1 defective in cullin Dcun1d2 unknown 0 neddylation 1 domain transmembrane and coiled-coil Tmco3 ion transport; regulation of pH 0 domains 3 hypothetical protein LOC66501 1700029H14Rik unknown 0 yes 0 isoform 2 lysosomal membrane glycoprotein fusion of lysosomes with Lamp1 0 yes 0 1 phagosomes Tfdp1 transcription factor Dp 1 apoptosis; regulation of transcription 0 yes B: D1Mit178 Gas6 growth arrest specific 6 regulation of cell growth 0 yes 0


ATPase H+/K+ transporting beta Atp4b ion transport 0 yes L: S13Gnf043.620 polypeptide G protein-coupled receptpr kinase Grk1 signal transduction 0 1 Rasa3 RAS p21 protein activator 3 intracellular signaling cascade 0 yes 0 D8Ertd457e hypothetical protein LOC101994 unknown 0 2410022L05Rik hypothetical protein LOC66423 unknown 0 yes 0 2610019F03Rik hypothetical protein LOC72148 unknown 0 CDC16 cell division cycle 16 Cdc16 cell division; mitosis 0 homolog UPF3 regulator of nonsense Upf3a response to unfolded protein 0 transcripts homolog AF366264 SUMO-1 specific protease 4 proteolysis 0 Fbxo25 F-box only protein 25 ubiquitin cycle 0 Erich1 glutamate-rich 1 unknown 0 SAP90/PSD-95 associated protein Dlgap2 cell-cell signaling 1 syn; 1 intr 2 Cln8 ceroid-lipofuscinosis neuronal 8 metabolic process 1 int; 1 d/s yes 0 Rho guanine nucleotide exchange 1 n/s; 4 syn; Arhgef10 intracellular signaling cascade factor (GEF) 10 12 int; 5 d/s echinoderm microtubule associated Eml3 unknown 1 int protein like chromatin remodeling; regulation of Mta2 metastasis-associated protein 2 0 transcription Tut1 RNA binding motif protein 21 RNA uridylyltransferase activity 0 eukaryotic translation elongation Eef1g translational elongation 0 factor 1

brain brain Ahnak AHNAK nucleoprotein intracellular signaling cascade 1 int Scgb1a1 secretoglobin family 1A member 1 phospholipase A2 inhibitor activity 0

Chr19: 9-11Mb Asrgl1 asparaginase like 1 glycoprotein catabolic process 0 Incenp inner centromere protein cell cycle; chromosome segregation 1 int yes 0 immune response; iron ion transport; Fth1 ferritin heavy chain 1 0 regulation of cell proliferation Best1 bestrophin ion transport 0


RAB3A interacting protein Rab guanyl-nucleotide exchange Rab3il1 0 (rabin3)-like 1 factor activity Fads3 fatty acid desaturase 3 fatty acid and lipid metabolism 0 fatty acid and lipid metabolism; Fads2 fatty acid desaturase 2 0 electron transport Fads1 delta-5 desaturase fatty acid and lipid metabolism 3 int; 6 d/s flap structure specific Fen1 DNA replication and repair 0 yes 0 endonuclease 1 neural stem cell-derived dendrite Dagla lipid catabolic process - regulator 2 syn; 1 n/s; 5 Syt7 synaptotagmin VII gamma isoform plasma membrane repair; transporter yes 0 int 1 n/s; 4 int; 1 4930579J09Rik IIIG9 protein unknown d/s 0610038F07Rik hypothetical protein LOC66072 unknown 4 d/s pre-mRNA cleavage factor I 59 5730453I16Rik mRNA processing 3 int kDa subunit 2810441K11Rik hypothetical protein LOC68642 unknown 0 Tmem138 hypothetical protein LOC72982 unknown 0 cytochrome b ascorbate dependent Cybasc3 electron transport 0 3 protein dihydroxyacetone kinase 2 Dak glycerol metabolic process 2 n/s homolog damage specific DNA binding Ddb1 ubiquitin cycle 1 splice; 1 int yes 0 protein 1 Pga5 pepsinogen 5 group I v 1 int 2 n/s; 1 int; 10 Vps37c vacuolar protein sorting 37C unknown d/s 2 n/s; 6 syn; T cell costimulation; induction of Cd5 CD5 antigen 11 int; 1 yes 0 apoptosis by extracellular signals splice; 2 d/s 1 syn; 6 int; 1 Cd6 CD6 antigen cell adhesion yes 0 d/s Slc15a3 solute carrier family 15 member 3 oligopeptide transport 0


heat shock 70kDa protein 5 Tmem132a unknown 1 int binding protein 1 Tmem109 transmembrane protein 109 unknown 1 n/s; 2 d/s 2 syn; 1 int; 1 Prpf19 nuclear matrix protein SNEV mRNA processing; RNA splicing u/s; 2 d/s 1 n/s; 9 int; 1 Zp1 zona pellucida glycoprotein 1 single fertilization yes 0 d/s Kirrel3 membrane protein mKirre hemopoiesis 1 n/s ST3 beta-galactoside alpha-2 3- St3gal4 protein amino acid glycosylation sialyltransferase Dcps histidine triad protein member 5 mRNA catabolic process 1 int; 5 u/s Toll-interleukin 1 receptor domain- immune response; cell surface Tirap 0 containing receptor linked signal transduction FAD-dependent oxidoreductase Foxred1 electron transport 1 int domain containing hypothetical protein LOC109229 C030004A17Rik transferase activity 1 int; 1 u/s isoform 2 RNA pseudouridylate synthase 1 int; 1 u/s; 1 Rpusd4 pseudouridylate synthase activity domain containing d/s 1 n/s; 2 syn; 7 signal recognition particle receptor Srpr intracellular protein transport int; 1 splice; 1 yes B: D1Mit134 ('docking protein') u/s; 2 d/s kidney kidney cell adhesion molecule- regulation of myoblast 4 n/s; 4 syn; 4 Chr9: 34-38Mb Cdon related/down-regulated by differentiation;smoothened signaling splice; 14 int oncogenes pathway DEAD (Asp-Glu-Ala-Asp) box 1 syn; 1 Ddx25 hydrolase and helicase activity yes 0 polypeptide 25 splice; 2 int Pus3 pseudouridine synthase 3 tRNA processing 1 n/s; 1 int yes 0 Svs7 seminal vesicle secretory protein 7 unknown 0 Gm846 gene model 846 unknown - D730048I06Rik hypothetical protein LOC68171 unknown 0 yes 0 EG434396 predicted gene, EG434396 unknown - secreted seminal vesicle Ly-6 A630095E13Rik unknown 1 splice; 1 u/s protein 1 Acrv1 acrosomal vesicle protein 1 unknown 0 yes 0


checkpoint kinase 1 homolog (S. DNA repair; protein amino acid Chek1 3 int; 8 d/s yes 0 pombe) phosphorylation 1 syn; 1 plice; Stt3a intergral membrane protein 1 protein amino acid glycosylation 6 int; 2 u/s Ei24 etoposide induced 2.4 unknown 0 yes 0 fasciculation and elongation Fez1 gamma-tubulin binding 1 int protein zeta 1 Pknox2 Pbx/knotted 1 homeobox 2 regulation of transcription 2 int 1810021J13Rik hypothetical protein LOC66279 unknown 1 d/s yes 0 solute carrier family 37 (glycerol- 2 syn; 1 u/s; Slc37a2 3-phosphate transporter), member glycerol-3-phosphate transport 10 int 2 Ccdc15 coiled-coil domain containing 15 unknown 1 n/s; 1int Hepacam hepatocyte cell adhesion molecule regulation of transcription - Robo4 roundabout homolog 4 angiogenesis; cell differentiation 1 n/s BC024479 hypothetical protein LOC235184 unknown 0 endothelial cell-selective adhesion calcium-independent cell-cell Esam1 0 molecule adhesion Nrgn neurogranin protein kinase cascade 0 yes 0 V-set and immunoglobulin domain Vsig2 unknown 0 containing 2 Spa17 sperm autoantigenic protein 17 signal transduction 0 1 syn; 2 u/s; 5 Siae yolk sac gene 2 hydrolase activity int transforming growth factor beta 1 syn; 1 n/s; 1 Tbrg1 cell cycle regulated gene int Panx3 pannexin 3 unknown 0 Tmem47 transmembrane protein 47 unknown 0 4930595M18Rik hypothetical protein LOC245492 ubiquitin cycle - Dmd dystrophin muscular dystrophy striated muscle development 1 int yes 0 mitogen-activated protein kinase

liver Map3k7ip3 kinase kinase 7 interacting protein unknown - 3

ChrX: 75-85Mb carbohydrate and glycerol metabolic Gyk glycerol kinase isoform 1 - yes B: D1Mit216 process


1700072E05Rik hypothetical protein LOC73495 unknown - yes 0 nuclear receptor subfamily 0 group regulation of transcription; sex Nr0b1 - yes 0 B member 1 determination CN716893 hypothetical protein LOC434903 unknown -


A.5. Null transcripts
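
The definitions of “null” and “complete” expression given in the legend of Table A-11 can be sketched in Python as follows. The sketch assumes a single threshold per tissue, taken as the 90th percentile of the pooled negative-control A-values, and uses linear interpolation for the percentile; both choices are assumptions, as are the function names and the “partial” label for transcripts that are neither null nor completely expressed.

```python
from statistics import quantiles


def control_threshold(negative_controls):
    """90th percentile of the negative-control A-values (linear interpolation)."""
    return quantiles(negative_controls, n=10, method="inclusive")[8]


def expression_status(a_values, threshold):
    """Classify one transcript in one tissue from its A-values across all strains."""
    if all(a < threshold for a in a_values):
        return "null"
    if all(a > threshold for a in a_values):
        return "complete"
    return "partial"  # hypothetical label: neither null nor completely expressed


def tissue_contrasts(status_by_tissue):
    """List (null tissue, completely expressed tissue) pairs for one transcript."""
    return [
        (null_tissue, expressed_tissue)
        for null_tissue, null_status in status_by_tissue.items()
        for expressed_tissue, expressed_status in status_by_tissue.items()
        if null_status == "null" and expressed_status == "complete"
    ]


# Hypothetical toy values, not taken from the BXD data.
toy_threshold = control_threshold(list(range(11)))
toy_status = {
    "brain": expression_status([1, 2, 3], toy_threshold),
    "kidney": expression_status([10, 12], toy_threshold),
    "liver": expression_status([5, 12], toy_threshold),
}
toy_pairs = tissue_contrasts(toy_status)
```

Each pair returned by `tissue_contrasts` corresponds to one row of Table A-11, with the first element in the “Null in” column and the second in the “Expressed in all 31 BXDs in” column.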

Table A-11 Transcripts that are “null expressed” in one tissue but “completely expressed” in another. These results refer to the brain, kidney, and liver BXD studies. “Null expression” is defined for a transcript if its level of expression (A-values) across all 31 BXDs in the corresponding tissue is less than the 90th percentile of all negative controls. “Complete expression” is defined for a transcript if its expression level is above the 90th percentile of all negative controls in the corresponding tissue of all 31 BXDs. The gene names are as provided by the Compugen microarray library (Clive and Vera Ramaciotti Centre, University of New South Wales, Australia). Cross references to Figure 3-19 and Table 3-6.

        Expressed
Null    in all 31  GenBank
in …    BXDs in …  accession  Gene Name
Brain   Kidney     NM_008777  Mus musculus phenylalanine hydroxylase (Pah), mRNA
Brain   Liver      NM_017399  Mus musculus fatty acid binding protein 1, liver (Fabp1), mRNA
Kidney  Brain      AK017289   Mus musculus 6 day neonate head cDNA, RIKEN full-length enriched library, clone:5430410E06, full insert sequence
                   NM_009045  Mus musculus avian reticuloendotheliosis viral (v-rel) oncogene homolog A (Rela), mRNA
                   NM_008777  Mus musculus phenylalanine hydroxylase (Pah), mRNA
                   AK005982   Mus musculus adult male testis cDNA, RIKEN full-length enriched library, clone:1700014N06, full insert sequence.
                   AK008957   Mus musculus adult male stomach cDNA, RIKEN full-length enriched library, clone:2210416M21, full insert sequence.
                   NM_017463  Mus musculus pre B-cell leukemia transcription factor 2 (Pbx2), mRNA
                   NM_020014  Mus musculus glial cell line derived neurotrophic factor family receptor alpha 4 (Gfra4), mRNA
                   AK015551   Mus musculus adult male testis cDNA, RIKEN full-length enriched library, clone:4930471G24, full insert sequence.
                   NM_025934  Mus musculus RIKEN cDNA 2010110K24 gene (2010110K24Rik), mRNA.
                   L29479     Mus musculus serine/threonine kinase (sak-a) mRNA, complete cds
                   NM_011954  Mus musculus mitogen regulated protein, proliferin 3 (Mrpplf3), mRNA.
                   AK014853   Mus musculus adult male testis cDNA, RIKEN full-length enriched library, clone:4921509J17, full insert sequence.
                   AK020126   Mus musculus 12 days embryo male wolffian duct includes surrounding region cDNA, RIKEN full-length enriched library, clone:6720454L20, full insert sequence.
                   NM_008887  Mus musculus aristaless homeobox gene homolog (Drosophila) (Arix), mRNA
                   AF282730   Mus musculus tissue inhibitor of metalloproteinases TIMP4 (Timp4) mRNA, complete cds


                   Z12497     M.musculus rearranged T-cell receptor beta chain Vbeta5 repertoire (VDJ).
                   AK009491   Mus musculus adult male tongue cDNA, RIKEN full-length enriched library, clone:2310024D23, full insert sequence.
                   AK017754   Mus musculus 8 days embryo cDNA, RIKEN full-length enriched library, clone:5730507C05, full insert sequence.
                   AK020337   Mus musculus adult male epididymis cDNA, RIKEN full-length enriched library, clone:9230112G11, full insert sequence.
                   AK016002   Mus musculus adult male testis cDNA, RIKEN full-length enriched library, clone:4930539M17, full insert sequence.
                   NM_008901  Mus musculus POU domain, class 3, transcription factor 4 (Pou3f4), mRNA
                   NM_021399  Mus musculus CTIP2 protein (Ctip2), mRNA
                   AB028499   Mus musculus mRNA for Flamingo 1, complete cds
                   NM_020504  Mus musculus claudin 13 (Cldn13), mRNA
                   U21569     Mus musculus immunoglobulin heavy chain productive VDJ rearrangement, anti-bacterial levan mAb VHCBLC235.7, mRNA, partial cds.
                   NM_030678  Mus musculus glycogen synthase 1, muscle (Gys1), mRNA.
                   S65035     {brain-specific TGF beta regulate sequence} [mice, Balb/c, serum-free embryo cells, mRNA Partial, 1658 nt].
                   AF100956   Mus musculus Bing1 (BING1), tapasin (tapasin), RalGDS-like factor (RLF), KE2 (KE2), BING4 (BING4), beta1, 3-galactosyl transferase (beta1,3-galactosyl transferase), ribosomal protein subunit S18 (RPS18), Sacm21 (Sacm21), H2K1(b) (H2-K1(b)), RING1 (RING1),
                   NM_011670  Mus musculus ubiquitin carboxy-terminal hydrolase L1 (Uchl1), mRNA
                   D25499     Mouse mRNA for Ig kappa light chain, variable region (partial cds).
                   U55473     Mus musculus anti-DNA immunoglobulin heavy chain IgM mRNA, antibody 373s.29, partial cds.
                   NM_013648  Mus musculus reticulon 2 (Z-band associated protein) (Rtn2), mRNA
                   Z46845     M.musculus mRNA for preproglucagon
                   AF093867   Mus musculus clone 14S10 T cell receptor alpha V-J region mRNA, partial cds.
                   NM_011404  Mus musculus solute carrier family 7 (cationic amino acid transporter, y+ system), member 5 (Slc7a5), mRNA.
                   AK017643   Mus musculus 8 days embryo cDNA, RIKEN full-length enriched library, clone:5730447C08, full insert sequence.
                   NM_008122  Mus musculus gap junction membrane channel protein alpha 7 (Gja7), mRNA
                   AK010077   Mus musculus adult male tongue cDNA, RIKEN full-length enriched library, clone:2310066K23, full insert sequence.


                   NM_010201  Mus musculus fibroblast growth factor 14 (Fgf14), mRNA
                   NM_010800  Mus musculus muscle, intestine and stomach expression 1 (Mist1), mRNA
                   X05736     Mouse mRNA for T-cell receptor alpha-chain v-d-j region.
                   M57978     Mouse IgK chain mRNA, VJ5 region.
                   AK018070   Mus musculus adult male thymus cDNA, RIKEN full-length enriched library, clone:5830499B15, full insert sequence.
                   AK013000   Mus musculus 10, 11 days embryo cDNA, RIKEN full-length enriched library, clone:2810405K07, full insert sequence.
                   NM_008232  Mus musculus hepatoma-derived growth factor, related protein 1 (Hdgfrp1), mRNA
                   AF290877   Mus musculus WAVE-1 mRNA, complete cds
                   NM_011706  Mus musculus vanilloid receptor-like protein 1 (Vrl1), mRNA
                   M92391     Mouse Ig gamma chain mRNA, V-D-J region from hybridoma MOR33.2.9, partial cds.
                   NM_028218  Mus musculus RIKEN cDNA 2210409E12 gene (2210409E12Rik), mRNA.
                   AK017789   Mus musculus 8 days embryo cDNA, RIKEN full-length enriched library, clone:5730525O22, full insert sequence.
                   AK003823   Mus musculus 18 days embryo cDNA, RIKEN full-length enriched library, clone:1110019K20, full insert sequence.
                   NM_008507  Mus musculus linker of T-cell receptor pathways (Lnk), mRNA.
                   AK015694   Mus musculus adult male testis cDNA, RIKEN full-length enriched library, clone:4930504B16, full insert sequence.
                   U20366     Mus musculus Hoxa11 locus antisense-23A mRNA
                   AF252281   Mus musculus Kelch-like 1 protein (KLHL1) mRNA, complete cds
                   AF302503   Mus musculus pellino 1 (Peli1) mRNA, complete cds
                   AK010963   Mus musculus 13 days embryo liver cDNA, RIKEN full-length enriched library, clone:2510012J08, full insert sequence.
                   NM_008000  Mus musculus fer (fms/fps related) protein kinase, testis specific 2 (Fert2), mRNA
                   NM_016706  Mus musculus coilin (Coil), mRNA
                   AB009674   Mus musculus mRNA for ADAM22, complete cds
                   NM_007709  Mus musculus Cbp/p300-interacting transactivator with Glu/Asp-rich carboxy-terminal domain 1 (Cited1), mRNA
                   NM_011219  Mus musculus protein tyrosine phosphatase, receptor type, Z (Ptprz), mRNA
                   AK015976   Mus musculus adult male testis cDNA, RIKEN full-length enriched library, clone:4930535D10, full insert sequence.
                   AK017174   Mus musculus 11 days pregnant adult female ovary and uterus cDNA, RIKEN full-length enriched library, clone:5033415K03, full insert sequence.
                   AK018845   Mus musculus adult male testis cDNA, RIKEN full-length enriched library, clone:1700041B01, full insert sequence.
                   AF394596   Mus musculus orexin receptor 1 mRNA, partial cds.
                   AF113520   Mus musculus actin-like-7-beta protein (Actl7b) mRNA, complete cds
                   NM_009158  Mus musculus mitogen activated protein kinase 10 (Mapk10), mRNA
                   BC014708   Mus musculus, Similar to mammalian ependymin related protein 1, clone MGC:25915 IMAGE:4223134, mRNA, complete cds.
                   AK015295   Mus musculus adult male testis cDNA, RIKEN full-length enriched library, clone:4930432N10, full insert sequence.
                   M15442     Mouse myelin proteolipid protein mRNA, complete cds
                   AF357345   Mus musculus clone MBI-46 C/D box snoRNA, partial sequence.
                   NM_021382  Mus musculus tachykinin receptor 3 (Tacr3), mRNA
                   AK021035   Mus musculus 10 days neonate medulla oblongata cDNA, RIKEN full-length enriched library, clone:B830049N13, full insert sequence.
                   AK014267   Mus musculus 13 days embryo head cDNA, RIKEN full-length enriched library, clone:3110080O07, full insert sequence.
                   AK008108   Mus musculus adult male small intestine cDNA, RIKEN full-length enriched library, clone:2010004N24, full insert sequence.
                   AK017951   Mus musculus adult male thymus cDNA, RIKEN full-length enriched library, clone:5830427D02, full insert sequence.
                   AF042361   Mus musculus clone OR55-36 putative olfactory receptor mRNA, partial cds
                   Z12550     M.musculus rearranged T-cell receptor beta chain Vbeta8 repertoire (VDJ).
                   AK016822   Mus musculus adult male testis cDNA, RIKEN full-length enriched library, clone:4933415J04, full insert sequence.
                   AK015443   Mus musculus adult male testis cDNA, RIKEN full-length enriched library, clone:4930451G21, full insert sequence.
                   AK017738   Mus musculus 8 days embryo cDNA, RIKEN full-length enriched library, clone:5730497N03, full insert sequence.
                   AJ231251   Mus musculus IgVk ay4 gene.
                   AK013621   Mus musculus adult male hippocampus cDNA, RIKEN full-length enriched library, clone:2900036G11, full insert sequence.
                   Z47780     M.musculus mRNA expressed in islet cells (clone 43)
                   NM_016782  Mus musculus neurexin 4 (contactin associated protein) (Nrxn4), mRNA


                   NM_016967  Mus musculus oligodendrocyte transcription factor 2 (Olig2), mRNA
                   AJ231216   Mus musculus IgVk ah4 gene.
                   BC002232   Mus musculus, clone IMAGE:3491487, mRNA, partial cds.
                   AK021402   Mus musculus 0 day neonate eyeball cDNA, RIKEN full-length enriched library, clone:E130202I16, full insert sequence.
                   BC002306   Mus musculus, Similar to CG11246 gene product, clone MGC:8248 IMAGE:3591968, mRNA, complete cds.
                   AK013487   Mus musculus adult male hippocampus cDNA, RIKEN full-length enriched library, clone:2900006A17, full insert sequence.
                   AK018860   Mus musculus adult male testis cDNA, RIKEN full-length enriched library, clone:1700054K02, full insert sequence.
                   X61397     Mouse mRNA for carbonic anhydrase-related polypeptide
                   NM_013805  Mus musculus claudin 5 (Cldn5), mRNA
                   NM_012013  Mus musculus factor in the germline alpha (Figla), mRNA
                   AF004333   Mus musculus Ig heavy chain variable region, monoclonal IgG1 anti-idiotope antibody 464 specific to mouse monoclonal antibody IIB4, gene, partial cds.
                   AK019108   Mus musculus ES cells cDNA, RIKEN full-length enriched library, clone:2410011O22, full insert sequence.
                   AK010793   Mus musculus ES cells cDNA, RIKEN full-length enriched library, clone:2410133F24, full insert sequence.
                   BC006651   Mus musculus, Similar to hypothetical protein FLJ23153, clone MGC:7814 IMAGE:3500056, mRNA, complete cds.
                   NM_011118  Mus musculus proliferin 2 (Plf2), mRNA.
                   AK016481   Mus musculus adult male testis cDNA, RIKEN full-length enriched library, clone:4931428F04, full insert sequence.
                   AK015539   Mus musculus adult male testis cDNA, RIKEN full-length enriched library, clone:4930470O06, full insert sequence.
                   AK004471   Mus musculus 18 days embryo cDNA, RIKEN full-length enriched library, clone:1190003K10, full insert sequence.
                   NM_016857  Mus musculus exocyst component protein 70 kDa homolog (S. cerevisiae) (Exo70), mRNA
                   NM_013540  Mus musculus glutamate receptor, ionotropic, AMPA2 (alpha 2) (Gria2), mRNA
                   AF316533   Mus musculus JL2/29-3 rearranged immunoglobulin heavy chain variable region mRNA, partial cds.
                   AK019921   Mus musculus adult male pituitary gland cDNA, RIKEN full-length enriched library, clone:5330430C04, full insert sequence.


                   AK006533   Mus musculus adult male testis cDNA, RIKEN full-length enriched library, clone:1700030C20, full insert sequence.
                   AK014039   Mus musculus 13 days embryo head cDNA, RIKEN full-length enriched library, clone:3110010B10, full insert sequence.
                   NM_013911  Mus musculus f-box and leucine-rich repeat protein 12 (Fbxl12), mRNA
                   AK013264   Mus musculus 10, 11 days embryo cDNA, RIKEN full-length enriched library, clone:2810438F06, full insert sequence.
                   AK019360   Mus musculus adult male hippocampus cDNA, RIKEN full-length enriched library, clone:2900083M14, full insert sequence.
                   AJ011107   Mus musculus mRNA for 3'UTR of Clc1 gene
Kidney  Liver      AF327059   Mus musculus apolioprotein A-V (ApoA-V) mRNA, complete cds, alternative trasncript
                   NM_009045  Mus musculus avian reticuloendotheliosis viral (v-rel) oncogene homolog A (Rela), mRNA
                   NM_009512  Mus musculus solute carrier family 27 (fatty acid transporter), member 5 (Slc27a5), mRNA
                   NM_010388  Mus musculus histocompatibility 2, class II, locus Mb2 (H2-DMb2), mRNA
Liver   Brain      NM_013728  Mus musculus olfactory receptor 4 cluster, gene 3 (Olfr4-3), mRNA
                   AK013822   Mus musculus adult male hippocampus cDNA, RIKEN full-length enriched library, clone:2900084O13, full insert sequence.
                   AK008706   Mus musculus adult male stomach cDNA, RIKEN full-length enriched library, clone:2210011G09, full insert sequence.
                   AK015551   Mus musculus adult male testis cDNA, RIKEN full-length enriched library, clone:4930471G24, full insert sequence.
                   NM_025934  Mus musculus RIKEN cDNA 2010110K24 gene (2010110K24Rik), mRNA.
                   U50961     Mus musculus clone TSAP6 p53-induced apoptosis differentially expressed mRNA sequence
                   NM_011954  Mus musculus mitogen regulated protein, proliferin 3 (Mrpplf3), mRNA.
                   AK016071   Mus musculus adult male testis cDNA, RIKEN full-length enriched library, clone:4930548K13, full insert sequence.
                   AK013631   Mus musculus adult male hippocampus cDNA, RIKEN full-length enriched library, clone:2900041A09, full insert sequence.
                   AF282730   Mus musculus tissue inhibitor of metalloproteinases TIMP4 (Timp4) mRNA, complete cds
                   AK014809   Mus musculus adult male testis cDNA, RIKEN full-length enriched library, clone:4921504E14, full insert sequence.
                   AK017754   Mus musculus 8 days embryo cDNA, RIKEN full-length enriched library, clone:5730507C05, full insert sequence.
                   NM_008901  Mus musculus POU domain, class 3, transcription factor 4 (Pou3f4), mRNA


NM_021399  Mus musculus CTIP2 protein (Ctip2), mRNA
NM_019553  Mus musculus DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 21 (RNA helicase II/Gu) (Ddx21), mRNA
NM_020504  Mus musculus claudin 13 (Cldn13), mRNA
U38806  Mus musculus fertilin beta mRNA, complete cds
M16072  Mouse Ig active gamma-2a H-chain V-Dsp2.2-J2-C mRNA (L6 monoclonal antibody).
AK019735  Mus musculus adult male testis cDNA, RIKEN full-length enriched library, clone:4930544G21, full insert sequence.
AF100956  Mus musculus Bing1 (BING1), tapasin (tapasin), RalGDS-like factor (RLF), KE2 (KE2), BING4 (BING4), beta1, 3-galactosyl transferase (beta1,3-galactosyl transferase), ribosomal protein subunit S18 (RPS18), Sacm21 (Sacm21), H2K1(b) (H2-K1(b)), RING1 (RING1),
AK009004  Mus musculus adult male tongue cDNA, RIKEN full-length enriched library, clone:2300002D11, full insert sequence.
AK019567  Mus musculus adult male testis cDNA, RIKEN full-length enriched library, clone:4930405L22, full insert sequence.
NM_015819  Mus musculus heparan sulfate 6-O-sulfotransferase 2 (Hs6st2), mRNA
NM_009021  Mus musculus retinoic acid induced 1 (Rai1), mRNA
NM_013648  Mus musculus reticulon 2 (Z-band associated protein) (Rtn2), mRNA
NM_011598  Mus musculus testis lipid binding protein (Tlbp), mRNA
AK014006  Mus musculus 13 days embryo head cDNA, RIKEN full-length enriched library, clone:3110005O21, full insert sequence.
L37413  Mus musculus (clone 2C8.G11) immunoglobulin light chain mRNA fragment.
BC002191  Mus musculus, Similar to i-beta-1,3-N-acetylglucosaminyltransferase, clone IMAGE:3488094, mRNA, partial cds.
NM_011860  Mus musculus ooplasm (Op1), mRNA
AK020777  Mus musculus 0 day neonate thymus cDNA, RIKEN full-length enriched library, clone:A430106J12, full insert sequence.
NM_007483  Mus musculus aplysia ras-related homolog B (RhoB) (Arhb), mRNA
NM_016677  Mus musculus hippocalcin-like 1 (Hpcal1), mRNA
NM_010201  Mus musculus fibroblast growth factor 14 (Fgf14), mRNA
X05736  Mouse mRNA for T-cell receptor alpha-chain v-d-j region.
M57978  Mouse IgK chain mRNA, VJ5 region.
U39293  Mus musculus rearranged immunoglobulin Vh15A mRNA, partial cds.
AK017349  Mus musculus 6 days neonate head cDNA, RIKEN full-length enriched library, clone:5430427M07, full insert sequence.
NM_010364  Mus musculus general transcription factor IIH, polypeptide 4 (Gtf2h4), mRNA
AF290877  Mus musculus WAVE-1 mRNA, complete cds
J04694  Mus musculus alpha-1 type IV collagen (Col4a-1) mRNA, complete cds
NM_008120  Mus musculus gap junction membrane channel protein alpha 4 (Gja4), mRNA
U28772  Mus musculus C57Bl/6J clone L45 odorant receptor gene, partial cds.
AK017789  Mus musculus 8 days embryo cDNA, RIKEN full-length enriched library, clone:5730525O22, full insert sequence.
AK015694  Mus musculus adult male testis cDNA, RIKEN full-length enriched library, clone:4930504B16, full insert sequence.
U20366  Mus musculus Hoxa11 locus antisense-23A mRNA
NM_029703  Mus musculus RIKEN cDNA 1500026B10 gene (1500026B10Rik), mRNA.
AF072759  Mus musculus fatty acid transport protein 4 mRNA, partial cds
AF302503  Mus musculus pellino 1 (Peli1) mRNA, complete cds
AK005121  Mus musculus adult male cerebellum cDNA, RIKEN full-length enriched library, clone:1500002K03, full insert sequence.
AK021369  Mus musculus 0 day neonate eyeball cDNA, RIKEN full-length enriched library, clone:E130103E02, full insert sequence.
AK019533  Mus musculus adult male testis cDNA, RIKEN full-length enriched library, clone:4921501M06, full insert sequence.
NM_008000  Mus musculus fer (fms/fps related) protein kinase, testis specific 2 (Fert2), mRNA
NM_008065  Mus musculus GA repeat binding protein, alpha (Gabpa), mRNA
NM_011655  Mus musculus tubulin, beta 5 (Tubb5), mRNA
U66835  Mus musculus unknown protein mRNA, partial cds
NM_010288  Mus musculus gap junction membrane channel protein alpha 1 (Gja1), mRNA
AB009674  Mus musculus mRNA for ADAM22, complete cds
NM_007709  Mus musculus Cbp/p300-interacting transactivator with Glu/Asp-rich carboxy-terminal domain 1 (Cited1), mRNA
NM_020614  Mus musculus TATA box binding protein (Tbp)-associated factor, RNA polymerase I, B (Taf1b), mRNA
AK015976  Mus musculus adult male testis cDNA, RIKEN full-length enriched library, clone:4930535D10, full insert sequence.
NM_031870  Mus musculus mutS homolog 4 (E. coli) (Msh4), mRNA.
M15442  Mouse myelin proteolipid protein mRNA, complete cds
NM_030742  Mus musculus pheromone receptor V3R1 (V3R1), mRNA.
AK015215  Mus musculus adult male testis cDNA, RIKEN full-length enriched library, clone:4930428B01, full insert sequence.
NM_008662  Mus musculus myosin VI (Myo6), mRNA
AK017738  Mus musculus 8 days embryo cDNA, RIKEN full-length enriched library, clone:5730497N03, full insert sequence.
AK013621  Mus musculus adult male hippocampus cDNA, RIKEN full-length enriched library, clone:2900036G11, full insert sequence.
NM_007578  Mus musculus calcium channel, P/Q type, alpha 1A (Cacna1a), mRNA
NM_011861  Mus musculus protein kinase C and casein kinase substrate in neurons 1 (Pacsin1), mRNA
NM_011118  Mus musculus proliferin 2 (Plf2), mRNA.
AK007352  Mus musculus 10 day old male pancreas cDNA, RIKEN full-length enriched library, clone:1810006K23, full insert sequence.
NM_011653  Mus musculus tubulin alpha 1 (Tuba1), mRNA
AK021219  Mus musculus ES cells cDNA, RIKEN full-length enriched library, clone:C330022C24, full insert sequence.
NM_013540  Mus musculus glutamate receptor, ionotropic, AMPA2 (alpha 2) (Gria2), mRNA
NM_007540  Mus musculus brain derived neurotrophic factor (Bdnf), mRNA
X02555  Mouse mRNA fragment for G1 kappa immunoglobulin A8/4 light chain (V-J).
AF096868  Mus musculus transcytosis associated protein p115 mRNA, partial cds
D90225  Mouse mRNA for OSF-1
AK018685  Mus musculus adult male cecum cDNA, RIKEN full-length enriched library, clone:9130422G05, full insert sequence.
Liver  Kidney
NM_009258  Mus musculus serine protease inhibitor, Kazal 3 (Spink3), mRNA


A.6. Linkages at D8Mit189

Table A-12 Transcripts linked to the Chr8: 10-46Mb region at Praw0PRED, in the BXD brain, kidney, and liver studies. See Section 3.6.2. Highlighted are the two transcripts that are mapped to this region in all three tissues: AK007639 and AK016395.

Marker  Tissue  Transcript
D8Mit124  Brain  NM_007737, AK013309, AF022078, AF411063, AK020637, NM_021489, AK006812, NM_011342, AF238284, AK018737, AF033565, AF120465, AF139236, D12735, U89431, NM_011704, NM_019455, AK005832, AF118273, AK020339, BC006689, NM_016738, BC006699, AK019596, Z12387, AK006397, AF041876, AK006163, AK010791, NM_010237, AK018196, NM_021606, AK017010, AK007024, NM_008954, NM_011667, NM_013616, AF041875, Z12445, AK015100, M97879, AK015330, S85733, AB010312, NM_010049, NM_011920, NM_013563, NM_011627, NM_007556, U21392, NM_010382, AK011047, AK013935, NM_025747, AK018293, AK020501, NM_008064, NM_011821, NM_021532, Y18276, NM_021895, NM_027334, AK020983, AK017944, AK010301, AK021070, AK017413, L31652, NM_011861, Z12462, AF041867, NM_028605, AK017526, AK017313, AK015795, AK005669, NM_008726, NM_008964, NM_020585, Z12283, NM_031873, L17085, AK006989, AK013730, AK017406, NM_010926, BC007148, AB010309, AF256217, AF041939, AB051827, NM_007515, NM_019788, NM_020332, BC004641, AJ409479, X76684, AK020781, AK018906, AJ278128, AF377871, AB024497, AK003950, NM_009945, AK014147, AK015744, AK015900, NM_025281, NM_008935, NM_007782, AE008684.7, AK007237, NM_025448
D8Mit124  Kidney  AF151730, AJ231223, NM_021305, NM_010812, AK020460, AF073933, NM_010801, NM_011396, NM_007492, M18091, AJ279028, AK006256, AK015976, AK006455, AK013609, AK014804, NM_010223, AK006434, NM_010913, NM_008708, M19440, L20266, AF201683, X74587, NM_009115, U23012
D8Mit124  Liver  AK013649, U66888, NM_026165, NM_024432, Z12496, AK007095, AK007214, AK005591, M14637, NM_008950, NM_018773, AK021050, NM_013463, M92391, NM_011514, X53402, AK018584, AK021326, AK011290, NM_025922, AF343088, L08213
D8Mit335  Brain  AK007165, NM_028375, AJ231255
D8Mit335  Kidney  BC005776, NM_007810, NM_028218, AK016991, AK014278, NM_023162, M99623, U01139
D8Mit335  Liver  AK015086, AK006767, AK017060, NM_010583, NM_010789, NM_023162, X94085
D8Mit3  Brain  BC008529, NM_023202
D8Mit3  Kidney  AK018072, AF245700, BC005669, M79304
D8Mit3  Liver  AK008751, AK010344
D8Mit287  Brain  AK009641, NM_010068, AK018082
D8Mit287  Kidney  AK016903, BC006820, M32071, AF217545, Z12192, AF020191, BC011259, AK005036, NM_013798
D8Mit287  Liver  NM_026584
08.025.220  Brain  NM_007741, NM_007596
08.025.220  Kidney  AK016343, AF193435, U26229, Z12180, NM_011148, NM_007927
08.025.220  Liver  AK015070, AK004294, NM_007795, AK012992
D8Mit289  Brain  -
D8Mit289  Kidney  NM_010451, NM_026374, NM_010411, NM_009613, AK017082, NM_013509, K01925, AF041895, AK002882, S81415, AK015631, AB010307, AF090691, NM_009280, X81716, AK015815, AK021139, NM_016688
D8Mit289  Liver  -
D8Mit189  Brain  AK007639
D8Mit189  Kidney  NM_015765, AK021056, AK012199, NM_026208, AF290179, NM_010583, AK004476, AK021231, AK013224, AK016044, AK007639, L09051, NM_030261, NM_030726
D8Mit189  Liver  AK004476, AK007055, AK015307, AK013224, AK007639, BC006779, AB050982, AK008676, AK016395
D8Mit24  Brain  -
D8Mit24  Kidney  AF274590, AF012149, AK006239, AK015775, AK015140
D8Mit24  Liver  NM_020613, AK004025
D8Mit294  Brain  BC006779, AK016395
D8Mit294  Kidney  NM_011300, AK015921, U55612, NM_021536, AK019567, NM_011242, AK008544, NM_010564, NM_010060, NM_010887, AK016395, AK019388, NM_008367
D8Mit294  Liver  AK015815, AK005671
D8Mit339  Brain  AK015188, AK013280, AK008676
D8Mit339  Kidney  AK003507, NM_008267, NM_008384
D8Mit339  Liver  AK003912


A.7. bqtl.twolocus results

A.7.1. From the BXD brain (CotB), kidney (CotK), and liver (CotL) studies

Table A-13 bqtl.twolocus results for 72 transcripts linked to two eQTLs in the brain 31 BXD (CotB) study. The first three columns are the GenBank accession numbers of these transcripts and their genomic locations; a “-” in the Chr column indicates that the physical location of the transcript is unknown. Each of the two eQTLs is tested in turn as the primary (1º) locus, and the LOD scores and corresponding P-values are shown in columns 10-11. Tests scoring the smallest possible P-value (1/779 ≈ 0.0013) are highlighted in yellow, and non-significant tests (P > 0.05) are highlighted in grey. The coefficient measuring the epistatic effect between the two loci and its P-value are shown in the last two columns. Significant two-locus epistatic effects are highlighted in blue. See Section 4.3.

GenBank accession  Chr  Mb  1º eQTL  1º Chr  1º Mb  2º eQTL  2º Chr  2º Mb  LOD  P  Epistatic coefficient  Epistatic coefficient P
NM_011561  10  82.54  D1Mit83  1  86.03  D13Mit145  13  92.56  10.70  0.0013  0.004  0.9653
NM_011561  10  82.54  D13Mit145  13  92.56  D1Mit83  1  86.03  11.78  0.0026  0.004  0.9615
NM_008260  7  15.88  D1Mit216  1  80.32  S08Gnf006.700  8  9.53  9.36  0.0013  0.035  0.6098
NM_008260  7  15.88  S08Gnf006.700  8  9.53  D1Mit216  1  80.32  11.77  0.0013  0.035  0.6611
Z12268  6  41.27  S08Gnf006.700  8  9.53  D1Mit216  1  80.32  8.78  0.0013  -0.065  0.1951
Z12268  6  41.27  D1Mit216  1  80.32  S08Gnf006.700  8  9.53  10.81  0.0013  -0.065  0.2760
NM_019952  19  4.01  D1Mit178  1  69.13  Hoxd9  2  74.59  2.29  0.1104  0.003  0.9666
NM_019952  19  4.01  Hoxd9  2  74.59  D1Mit178  1  69.13  5.78  0.0013  0.003  0.9538
L39117  -  -  S08Gnf006.700  8  9.53  D3Mit155  3  87.61  13.73  0.0013  -0.042  0.4724
L39117  -  -  D3Mit155  3  87.61  S08Gnf006.700  8  9.53  14.80  0.0013  -0.042  0.3928
NM_024432  17  53.71  S08Gnf006.700  8  9.53  D1Mit134  1  80.78  6.43  0.0013  0.037  0.7574
NM_024432  17  53.71  D1Mit134  1  80.78  S08Gnf006.700  8  9.53  6.74  0.0039  0.037  0.8126
AK017208  15  12.34  D14Mit185  14  109.22  D7Mit281  7  99.48  5.97  0.0180  0.068  0.0436
AK017208  15  12.34  D7Mit281  7  99.48  D14Mit185  14  109.22  7.78  0.0013  0.068  0.0449
NM_019455  6  65.38  D1Mit216  1  80.32  D8Mit124  8  14.76  7.44  0.0051  -0.069  0.1964
NM_019455  6  65.38  D8Mit124  8  14.76  D1Mit216  1  80.32  8.09  0.0013  -0.069  0.2721
AK018857  11  102.49  D1Mit134  1  80.78  D11Mit208  11  58.21  10.28  0.0026  0.141  0.0154
AK018857  11  102.49  D11Mit208  11  58.21  D1Mit134  1  80.78  11.29  0.0013  0.141  0.0180
X70398  18  33.66  D11Mit41  11  88.75  D5Mit309  5  78.27  3.98  0.0501  -0.021  0.7933
X70398  18  33.66  D5Mit309  5  78.27  D11Mit41  11  88.75  15.39  0.0013  -0.021  0.7381
AK020339  9  60.64  D1Mit134  1  80.78  D8Mit124  8  14.76  7.60  0.0090  0.041  0.4262
AK020339  9  60.64  D8Mit124  8  14.76  D1Mit134  1  80.78  8.67  0.0013  0.041  0.4724
NM_009072  12  16.34  Hoxd9  2  74.59  D1Mit216  1  80.32  7.31  0.0013  0.043  0.3517
NM_009072  12  16.34  D1Mit216  1  80.32  Hoxd9  2  74.59  7.89  0.0039  0.043  0.3556
AK019601  1  37.63  D19Mit68  19  3.43  D1Mit134  1  80.78  7.27  0.0013  0.157  0.0013
AK019601  1  37.63  D1Mit134  1  80.78  D19Mit68  19  3.43  9.03  0.0026  0.157  0.0064
NM_021306  1  86.96  S08Gnf006.700  8  9.53  D1Mit216  1  80.32  8.38  0.0013  -0.064  0.3081
NM_021306  1  86.96  D1Mit216  1  80.32  S08Gnf006.700  8  9.53  10.27  0.0026  -0.064  0.3530
AK020531  12  105.95  D19Mit109  19  6.00  D1Mit134  1  80.78  10.02  0.0013  -0.198  0.0154
AK020531  12  105.95  D1Mit134  1  80.78  D19Mit109  19  6.00  10.68  0.0013  -0.198  0.0051
NM_025436  8  63.79  D1Mit134  1  80.78  D19Mit68  19  3.43  7.34  0.0013  -0.177  0.0116
NM_025436  8  63.79  D19Mit68  19  3.43  D1Mit134  1  80.78  9.72  0.0026  -0.177  0.0128
NM_008954  17  54.69  D1Mit134  1  80.78  D8Mit124  8  14.76  6.80  0.0064  0.109  0.5494
NM_008954  17  54.69  D8Mit124  8  14.76  D1Mit134  1  80.78  9.42  0.0013  0.109  0.5404
NM_019658  19  53.61  D3Mit12  3  100.40  D4Mit214  4  44.90  6.68  0.0026  0.014  0.7856
NM_019658  19  53.61  D4Mit214  4  44.90  D3Mit12  3  100.40  10.69  0.0013  0.014  0.8549
AF249870  10  18.78  D1Mit134  1  80.78  Mod2  7  76.78  11.37  0.0013  0.104  0.1489
AF249870  10  18.78  Mod2  7  76.78  D1Mit134  1  80.78  11.67  0.0013  0.104  0.0911
NM_023202  17  31.52  D1Mit134  1  80.78  D8Mit3  8  23.42  7.75  0.0051  0.073  0.0347
NM_023202  17  31.52  D8Mit3  8  23.42  D1Mit134  1  80.78  8.99  0.0013  0.073  0.0462
AK014278  -  -  D19Mit68  19  3.43  DXMsw076  X  84.75  17.50  0.0051  -0.242  0.0090
AK014278  -  -  DXMsw076  X  84.75  D19Mit68  19  3.43  18.95  0.0013  -0.242  0.1142
AK005616  4  118.37  S08Gnf006.700  8  9.53  D1Mit9  1  77.61  7.48  0.0013  0.061  0.3838
AK005616  4  118.37  D1Mit9  1  77.61  S08Gnf006.700  8  9.53  9.44  0.0013  0.061  0.3697
AK020983  X  128.23  D1Mit216  1  80.32  D8Mit124  8  14.76  7.02  0.0039  0.049  0.4018
AK020983  X  128.23  D8Mit124  8  14.76  D1Mit216  1  80.32  9.03  0.0013  0.049  0.4185
AK013280  2  4.54  D1Mit216  1  80.32  D8Mit339  8  39.42  10.89  0.0013  -0.068  0.1155
AK013280  2  4.54  D8Mit339  8  39.42  D1Mit216  1  80.32  12.11  0.0026  -0.068  0.0475
NM_028605  4  85.92  SXGnf055.520  X  82.43  D8Mit124  8  14.76  10.00  0.0039  0.003  0.9628
NM_028605  4  85.92  D8Mit124  8  14.76  SXGnf055.520  X  82.43  10.43  0.0013  0.003  0.9564
BC013849  5  143.96  S08Gnf006.700  8  9.53  D1Mit216  1  80.32  6.92  0.0013  0.061  0.5764
BC013849  5  143.96  D1Mit216  1  80.32  S08Gnf006.700  8  9.53  6.95  0.0039  0.061  0.6303
NM_021442  3  29.65  S08Gnf006.700  8  9.53  D1Mit216  1  80.32  7.73  0.0013  -0.014  0.7843
NM_021442  3  29.65  D1Mit216  1  80.32  S08Gnf006.700  8  9.53  7.93  0.0013  -0.014  0.7728
AJ279835  4  147.64  D1Mit134  1  80.78  D19Mit103  19  53.14  7.59  0.0077  0.031  0.7445
AJ279835  4  147.64  D19Mit103  19  53.14  D1Mit134  1  80.78  8.59  0.0013  0.031  0.7137
NM_009744  -  -  D1Mit216  1  80.32  D2Mit436  2  80.00  5.90  0.0154  -0.015  0.6868
NM_009744  -  -  D2Mit436  2  80.00  D1Mit216  1  80.32  10.48  0.0013  -0.015  0.6778
NM_008726  4  146.48  D1Mit134  1  80.78  D8Mit124  8  14.76  6.46  0.0090  0.059  0.6585
NM_008726  4  146.48  D8Mit124  8  14.76  D1Mit134  1  80.78  8.45  0.0013  0.059  0.6727
BC003914  6  52.69  D1Mit216  1  80.32  S08Gnf006.700  8  9.53  7.99  0.0013  0.008  0.9371
BC003914  6  52.69  S08Gnf006.700  8  9.53  D1Mit216  1  80.32  9.03  0.0013  0.008  0.9320
AF204174  10  45.49  D1Mit134  1  80.78  D19Mit68  19  3.43  7.93  0.0013  -0.347  0.0039
AF204174  10  45.49  D19Mit68  19  3.43  D1Mit134  1  80.78  13.30  0.0013  -0.347  0.0077
NM_019674  7  120.84  D1Mit216  1  80.32  D19Mit103  19  53.14  8.23  0.0051  0.010  0.7908
NM_019674  7  120.84  D19Mit103  19  53.14  D1Mit216  1  80.32  11.85  0.0013  0.010  0.8203
AK006652  -  -  D1Mit134  1  80.78  D19Mit68  19  3.43  5.81  0.0013  -0.144  0.1220
AK006652  -  -  D19Mit68  19  3.43  D1Mit134  1  80.78  6.59  0.0064  -0.144  0.0732
AF357411  1  63.47  D1Mit134  1  80.78  D11Mit34  11  78.88  7.07  0.0013  -0.006  0.9602
AF357411  1  63.47  D11Mit34  11  78.88  D1Mit134  1  80.78  7.86  0.0013  -0.006  0.9461
NM_025362  5  134.08  D19Mit68  19  3.43  D1Mit134  1  80.78  7.24  0.0013  -0.328  0.0013
NM_025362  5  134.08  D1Mit134  1  80.78  D19Mit68  19  3.43  7.29  0.0026  -0.328  0.0026
BC005613  16  23.47  S08Gnf006.700  8  9.53  D1Mit134  1  80.78  7.22  0.0026  -0.045  0.6226
BC005613  16  23.47  D1Mit134  1  80.78  S08Gnf006.700  8  9.53  7.84  0.0013  -0.045  0.6560
AK020050  8  63.81  D19Mit68  19  3.43  D1Mit134  1  80.78  4.95  0.0116  -0.028  0.7677
AK020050  8  63.81  D1Mit134  1  80.78  D19Mit68  19  3.43  6.32  0.0013  -0.028  0.7895
AK012092  15  96.49  D11Mit34  11  78.88  D1Mit216  1  80.32  6.34  0.0064  0.002  0.9615
AK012092  15  96.49  D1Mit216  1  80.32  D11Mit34  11  78.88  6.58  0.0013  0.002  0.9653
NM_010902  2  75.37  D1Mit328  1  68.25  D13Mit145  13  92.56  9.79  0.0013  0.100  0.1258
NM_010902  2  75.37  D13Mit145  13  92.56  D1Mit328  1  68.25  10.27  0.0013  0.100  0.1643
NM_009945  9  79.96  D1Mit216  1  80.32  D8Mit124  8  14.76  7.16  0.0077  0.195  0.3825
NM_009945  9  79.96  D8Mit124  8  14.76  D1Mit216  1  80.32  8.41  0.0013  0.195  0.3607
NM_025974  9  120.59  D1Mit134  1  80.78  D2Mit200  2  179.48  7.39  0.0013  -0.002  0.9756
NM_025974  9  120.59  D2Mit200  2  179.48  D1Mit134  1  80.78  8.46  0.0013  -0.002  0.9743
BC006733  11  35.64  Mpmv13  5  28.60  D4Mit203  4  127.94  8.02  0.0013  -0.057  0.0424
BC006733  11  35.64  D4Mit203  4  127.94  Mpmv13  5  28.60  8.11  0.0013  -0.057  0.0282
NM_025639  15  82.29  D1Mit216  1  80.32  D19Mit103  19  53.14  7.80  0.0013  0.049  0.3967
NM_025639  15  82.29  D19Mit103  19  53.14  D1Mit216  1  80.32  7.82  0.0013  0.049  0.2978
NM_013641  8  82.93  D11Mit34  11  78.88  D1Mit216  1  80.32  7.15  0.0013  -0.017  0.7831
NM_013641  8  82.93  D1Mit216  1  80.32  D11Mit34  11  78.88  8.98  0.0051  -0.017  0.8575
AK015801  9  119.44  D19Mit22  19  9.85  D1Mit134  1  80.78  5.50  0.0154  -0.071  0.3954
AK015801  9  119.44  D1Mit134  1  80.78  D19Mit22  19  9.85  7.00  0.0026  -0.071  0.3017
X14625  6  69.76  D8Mit128  8  60.21  D2Mit6  2  20.95  7.74  0.0090  -0.165  0.0385
X14625  6  69.76  D2Mit6  2  20.95  D8Mit128  8  60.21  7.79  0.0026  -0.165  0.0282
NM_030006  1  83.24  D19Mit22  19  9.85  D12Mit3  12  71.27  5.66  0.0154  0.134  0.5674
NM_030006  1  83.24  D12Mit3  12  71.27  D19Mit22  19  9.85  5.98  0.0026  0.134  0.4377
U55583  6  69.52  D11Mit34  11  78.88  D1Mit134  1  80.78  8.15  0.0026  0.015  0.6701
U55583  6  69.52  D1Mit134  1  80.78  D11Mit34  11  78.88  8.70  0.0026  0.015  0.7214
NM_010068  2  153.14  D1Mit216  1  80.32  D8Mit287  8  23.98  6.74  0.0064  -0.010  0.7766
NM_010068  2  153.14  D8Mit287  8  23.98  D1Mit216  1  80.32  8.29  0.0026  -0.010  0.7856
NM_007515  X  95.68  D8Mit124  8  14.76  D2Mit365  2  27.77  5.11  0.0026  0.019  0.7009
NM_007515  X  95.68  D2Mit365  2  27.77  D8Mit124  8  14.76  5.65  0.0026  0.019  0.7612
NM_028889  1  87.13  D1Mit134  1  80.78  D19Mit68  19  3.43  5.83  0.0090  0.066  0.0424
NM_028889  1  87.13  D19Mit68  19  3.43  D1Mit134  1  80.78  9.00  0.0026  0.066  0.1361
AJ278128  2  112.09  D8Mit124  8  14.76  D1Mit134  1  80.78  6.66  0.0051  0.122  0.4711
AJ278128  2  112.09  D1Mit134  1  80.78  D8Mit124  8  14.76  7.62  0.0026  0.122  0.4737
BC008274  1  74.59  D1Mit134  1  80.78  D2Mit436  2  80.00  6.25  0.0167  0.073  0.2157
BC008274  1  74.59  D2Mit436  2  80.00  D1Mit134  1  80.78  7.37  0.0026  0.073  0.2760
AK020467  X  47.57  D1Mit134  1  80.78  D19Mit68  19  3.43  4.83  0.0051  -0.185  0.1335
AK020467  X  47.57  D19Mit68  19  3.43  D1Mit134  1  80.78  5.80  0.0051  -0.185  0.1630
NM_010028  X  11.53  D1Mit134  1  80.78  Hoxd9  2  74.59  4.63  0.0154  0.021  0.7060
NM_010028  X  11.53  Hoxd9  2  74.59  D1Mit134  1  80.78  4.86  0.0051  0.021  0.7356
NM_011329  11  81.92  D17Mit18  17  7.81  S07Gnf033.680  7  30.49  2.46  0.0822  -0.063  0.0732
NM_011329  11  81.92  S07Gnf033.680  7  30.49  D17Mit18  17  7.81  4.59  0.0064  -0.063  0.1220
NM_007956  10  5.51  D11Mit41  11  88.75  D19Mit68  19  3.43  6.36  0.0077  -0.101  0.0090
NM_007956  10  5.51  D19Mit68  19  3.43  D11Mit41  11  88.75  6.44  0.0064  -0.101  0.0539
NM_018792  11  94.79  S17Gnf013.500  17  14.53  D16Mit12  16  38.98  6.84  0.0154  0.067  0.0231
NM_018792  11  94.79  D16Mit12  16  38.98  S17Gnf013.500  17  14.53  7.64  0.0064  0.067  0.0077
NM_020496  9  24.64  D19Mit68  19  3.43  D19Mit103  19  53.14  5.00  0.0090  0.001  0.9910
NM_020496  9  24.64  D19Mit103  19  53.14  D19Mit68  19  3.43  5.06  0.0218  0.001  0.9884
NM_024444  8  71.16  D2Mit304  2  128.08  D12Mit3  12  71.27  3.42  0.0475  -0.005  0.9101
NM_024444  8  71.16  D12Mit3  12  71.27  D2Mit304  2  128.08  3.86  0.0090  -0.005  0.9191
AK019468  12  66.64  D19Mit28  19  15.18  D7Mit301  7  78.82  4.93  0.0398  0.007  0.7818
AK019468  12  66.64  D7Mit301  7  78.82  D19Mit28  19  15.18  7.79  0.0103  0.007  0.7805
NM_031397  10  70.99  D19Mit10  19  46.50  D19Mit68  19  3.43  4.98  0.0103  -0.026  0.5276
NM_031397  10  70.99  D19Mit68  19  3.43  D19Mit10  19  46.50  5.26  0.0103  -0.026  0.5661
AF367247  X  42.66  D1Mit134  1  80.78  D19Mit68  19  3.43  4.55  0.0193  -0.040  0.6033
AF367247  X  42.66  D19Mit68  19  3.43  D1Mit134  1  80.78  6.63  0.0103  -0.040  0.6290
NM_010321  17  44.24  D7Mit117  7  18.91  D6Mit67  6  98.13  4.17  0.0270  0.000  0.9987
NM_010321  17  44.24  D6Mit67  6  98.13  D7Mit117  7  18.91  4.40  0.0193  0.000  0.9987
AB047759  9  110.86  D2Mit62  2  117.89  D9Mit91  9  37.26  4.07  0.0591  -0.037  0.2054
AB047759  9  110.86  D9Mit91  9  37.26  D2Mit62  2  117.89  5.15  0.0193  -0.037  0.2311
Z12439  -  -  D19Mit103  19  53.14  D1Mit9  1  77.61  3.79  0.0834  -0.086  0.4275
Z12439  -  -  D1Mit9  1  77.61  D19Mit103  19  53.14  4.99  0.0193  -0.086  0.5019
NM_019925  12  108.33  D1Mit328  1  68.25  D2Mit436  2  80.00  3.71  0.0231  0.006  0.8870
NM_019925  12  108.33  D2Mit436  2  80.00  D1Mit328  1  68.25  3.84  0.0359  0.006  0.9178
AK006510  4  134.51  S12Gnf051.120  12  48.03  D1Mit134  1  80.78  3.66  0.0411  -0.028  0.7022
AK006510  4  134.51  D1Mit134  1  80.78  S12Gnf051.120  12  48.03  4.22  0.0270  -0.028  0.6893
AK020063  1  132.33  D1Mit178  1  69.13  D19Mit68  19  3.43  5.27  0.0282  0.012  0.8164
AK020063  1  132.33  D19Mit68  19  3.43  D1Mit178  1  69.13  6.26  0.0591  0.012  0.8742
AL133159.7  17  35.17  S13Gnf003.255  13  6.09  D2Mit5  2  9.48  2.84  0.1181  0.066  0.5404
AL133159.7  17  35.17  D2Mit5  2  9.48  S13Gnf003.255  13  6.09  4.54  0.0757  0.066  0.6367
NM_007544  6  121.33  D15Mit13  15  3.29  S15Gnf039.230  15  41.71  0.25  0.7831  0.000  0.9987
NM_007544  6  121.33  S15Gnf039.230  15  41.71  D15Mit13  15  3.29  2.83  0.0834  0.000  0.9987
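
The P-values in Table A-13 appear to take only discrete values, the smallest being 0.0013. A minimal sketch of a consistency check follows, assuming (this is inferred from the tabulated values, not stated by the bqtl.twolocus procedure itself) that every P-value is an integer count divided by 779 and rounded to four decimal places. The names RESOLUTION, TABLE_P, nearest_count and is_multiple are hypothetical and serve only this illustration.

```python
# Assumption: P-values are k / 779 for an integer k, reported to 4 decimals.
RESOLUTION = 779

# A handful of P-values transcribed from Table A-13.
TABLE_P = [0.0013, 0.0026, 0.0039, 0.0051, 0.0064,
           0.0077, 0.0090, 0.0180, 0.0501, 0.1104]


def nearest_count(p, n=RESOLUTION):
    """Integer k for which k / n lies closest to the reported P-value."""
    return round(p * n)


def is_multiple(p, n=RESOLUTION, digits=4):
    """True if p equals k / n once k / n is rounded to the reported precision."""
    k = nearest_count(p, n)
    return abs(round(k / n, digits) - p) < 1e-9


# Map each reported P-value to (implied count, consistency flag).
CHECKS = {p: (nearest_count(p), is_multiple(p)) for p in TABLE_P}

for p, (k, ok) in CHECKS.items():
    print(f"P = {p:.4f}  implied count = {k:3d}  consistent = {ok}")
```

Under this assumption a reported P of 0.0013 corresponds to a count of one, which is why no smaller value can occur in the table.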


Table A-14 bqtl.twolocus results for 17 transcripts linked to two eQTLs in the kidney 31 BXD (CotK) study. Legend as in Table A-13. See Section 4.3.

GenBank accession  Chr  Mb  1º eQTL  1º Chr  1º Mb  2º eQTL  2º Chr  2º Mb  LOD  P  Epistatic coefficient  Epistatic coefficient P
BC013667  1  191.45  S13Gnf028.450  13  30.61  S01Gnf070.445  1  73.31  10.02  0.0013  -0.077  0.4673
BC013667  1  191.45  S01Gnf070.445  1  73.31  S13Gnf028.450  13  30.61  9.83  0.0026  -0.077  0.4313
NM_008712  5  117.08  D9Rp2  9  121.58  D5Mit155  5  97.01  8.69  0.0013  0.049  0.2567
NM_008712  5  117.08  D5Mit155  5  97.01  D9Rp2  9  121.58  3.24  0.0783  0.049  0.1977
AF090691  2  148.26  D12Mit36  12  55.86  D8Mit289  8  27.63  7.37  0.0013  -0.065  0.3787
AF090691  2  148.26  D8Mit289  8  27.63  D12Mit36  12  55.86  6.67  0.0077  -0.065  0.3081
NM_021442  3  29.65  D2Nds1  2  93.33  D18Mit144  18  86.01  6.30  0.0013  0.016  0.6637
NM_021442  3  29.65  D18Mit144  18  86.01  D2Nds1  2  93.33  5.90  0.0090  0.016  0.5725
AF316988  12  109.38  S08Gnf006.700  8  9.53  D17Mit18  17  7.81  9.73  0.0013  -0.021  0.5212
AF316988  12  109.38  D17Mit18  17  7.81  S08Gnf006.700  8  9.53  8.40  0.0013  -0.021  0.5918
M34197  -  -  D2Mit369  2  40.76  D5Mit233  5  51.33  7.79  0.0026  -0.002  0.9564
M34197  -  -  D5Mit233  5  51.33  D2Mit369  2  40.76  5.21  0.0064  -0.002  0.9397
AK014905  3  145.87  D2Mit102  2  114.08  D9Mit91  9  37.26  5.56  0.0039  -0.069  0.0026
AK014905  3  145.87  D9Mit91  9  37.26  D2Mit102  2  114.08  6.30  0.0077  -0.069  0.0051
AK008787  9  88.73  D9Mit91  9  37.26  D2Mit62  2  117.89  5.31  0.0051  0.069  0.1309
AK008787  9  88.73  D2Mit62  2  117.89  D9Mit91  9  37.26  5.85  0.0077  0.069  0.1990
NM_019545  3  98.29  D3Mit155  3  87.61  D9Mit4  9  52.29  5.63  0.0064  -0.057  0.6367
NM_019545  3  98.29  D9Mit4  9  52.29  D3Mit155  3  87.61  7.44  0.0064  -0.057  0.6945
NM_007544  6  121.33  S10Gnf029.520  10  31.39  D15Mit13  15  3.29  7.25  0.0077  0.026  0.8973
NM_007544  6  121.33  D15Mit13  15  3.29  S10Gnf029.520  10  31.39  5.12  0.0128  0.026  0.8703
AK010807  11  77.74  S11Gnf089.780  11  82.79  D19Mit22  19  9.85  6.51  0.0090  0.170  0.0154
AK010807  11  77.74  D19Mit22  19  9.85  S11Gnf089.780  11  82.79  6.07  0.0154  0.170  0.0706
NM_008659  11  75.40  D13Mit18  13  35.60  D14Mit99  14  13.92  5.04  0.0128  -0.006  0.8460
NM_008659  11  75.40  D14Mit99  14  13.92  D13Mit18  13  35.60  4.21  0.0231  -0.006  0.8614
AK020896  15  87.31  D14Mit99  14  13.92  D13Mit18  13  35.60  4.24  0.0308  -0.013  0.7779
AK020896  15  87.31  D13Mit18  13  35.60  D14Mit99  14  13.92  4.82  0.0334  -0.013  0.7946
AK003507  18  35.94  S04Gnf000.500  4  3.48  D8Mit339  8  39.42  3.57  0.0385  -0.062  0.2080
AK003507  18  35.94  D8Mit339  8  39.42  S04Gnf000.500  4  3.48  3.11  0.0783  -0.062  0.2734
BC003244  19  45.63  D9Mit297  9  34.05  D9Mit146  9  79.64  4.12  0.0398  0.014  0.6765
BC003244  19  45.63  D9Mit146  9  79.64  D9Mit297  9  34.05  4.48  0.0668  0.014  0.7189
NM_010864  9  75.36  D13Mit18  13  35.60  D17Mit187  17  77.78  3.87  0.0578  0.100  0.0026
NM_010864  9  75.36  D17Mit187  17  77.78  D13Mit18  13  35.60  4.15  0.0578  0.100  0.0013
AJ235966  6  69.99  D2Mit286  2  154.58  D9Mit297  9  34.05  3.52  0.0706  0.000  0.9756
AJ235966  6  69.99  D9Mit297  9  34.05  D2Mit286  2  154.58  3.43  0.0783  0.000  0.9884


Table A-15 bqtl.twolocus results for 35 transcripts linked to two or three eQTLs in the liver 31 BXD (CotL) study. Legend as in Table A-13. See Section 4.3.

GenBank accession  Chr  Mb  1º eQTL  1º Chr  1º Mb  2º eQTL  2º Chr  2º Mb  LOD  P  Epistatic coefficient  Epistatic coefficient P
NM_025727  11  100.28  D11Mit318  11  61.82  D12Mit112  12  41.92  7.77  0.0013  -0.011  0.6624
NM_025727  11  100.28  D12Mit112  12  41.92  D11Mit318  11  61.82  7.64  0.0013  -0.011  0.6213
AK014404  12  12.08  D5Mit139  5  122.99  S07Gnf047.960  7  41.05  13.87  0.0013  0.064  0.3094
AK014404  12  12.08  S07Gnf047.960  7  41.05  D5Mit139  5  122.99  14.36  0.0013  0.064  0.2798
NM_008051  7  39.70  Mod2  7  76.78  D19Mit103  19  53.14  7.14  0.0013  -0.082  0.0026
NM_008051  7  39.70  D19Mit103  19  53.14  Mod2  7  76.78  7.07  0.0013  -0.082  0.0244
NM_017378  18  38.51  D09Msw003  9  3.02  D11Mit41  11  88.75  7.91  0.0013  -0.021  0.3543
NM_017378  18  38.51  D11Mit41  11  88.75  D09Msw003  9  3.02  8.08  0.0013  -0.021  0.3941
NM_017378  18  38.51  D17Mit49  17  43.47  D11Mit41  11  88.75  5.97  0.0013  0.046  0.0847
NM_017378  18  38.51  D11Mit41  11  88.75  D17Mit49  17  43.47  5.92  0.0051  0.046  0.0501
NM_017378  18  38.51  D09Msw003  9  3.02  D17Mit49  17  43.47  3.91  0.0154  -0.003  0.8896
NM_017378  18  38.51  D17Mit49  17  43.47  D09Msw003  9  3.02  4.13  0.0244  -0.003  0.8909
U55459  12  109.10  D13Msw109  13  108.34  D15Mit29  15  74.53  10.55  0.0013  -0.041  0.0308
U55459  12  109.10  D15Mit29  15  74.53  D13Msw109  13  108.34  10.18  0.0064  -0.041  0.0231
AK014585  5  138.22  D11Mit208  11  58.21  D1Mit216  1  80.32  9.71  0.0013  -0.088  0.0295
AK014585  5  138.22  D1Mit216  1  80.32  D11Mit208  11  58.21  9.08  0.0051  -0.088  0.0116
BC012209  6  28.88  D7Mit246  7  19.31  D1Mit106  1  162.33  10.03  0.0013  0.023  0.7291
BC012209  6  28.88  D1Mit106  1  162.33  D7Mit246  7  19.31  5.77  0.0090  0.023  0.5725
NM_008301  12  73.26  Src  2  157.54  S08Gnf076.440  8  74.19  13.33  0.0013  0.108  0.0732
NM_008301  12  73.26  S08Gnf076.440  8  74.19  Src  2  157.54  13.14  0.0026  0.108  0.0257
BC011540  9  15.12  D11Mit34  11  78.88  D13Mit30  13  98.18  9.10  0.0013  0.001  0.9795
BC011540  9  15.12  D13Mit30  13  98.18  D11Mit34  11  78.88  8.56  0.0026  0.001  0.9730
NM_007591  8  84.11  D1Mit83  1  86.03  D12Mit231  12  95.21  6.64  0.0013  -0.011  0.7741
NM_007591  8  84.11  D12Mit231  12  95.21  D1Mit83  1  86.03  7.11  0.0051  -0.011  0.7946
NM_009229  8  106.31  S08Gnf104.600  8  102.72  D1Mit309  1  118.09  7.37  0.0013  0.002  0.8729
NM_009229  8  106.31  D1Mit309  1  118.09  S08Gnf104.600  8  102.72  8.98  0.0051  0.002  0.9268
Y18276  3  55.36  D9Mit91  9  37.26  D2Mit58  2  108.15  8.03  0.0013  -0.025  0.2567
Y18276  3  55.36  D2Mit58  2  108.15  D9Mit91  9  37.26  7.30  0.0026  -0.025  0.3389
NM_025822  3  66.67  D4Mit288  4  56.04  D11Mit337  11  118.67  6.22  0.0013  0.000  0.9987
NM_025822  3  66.67  D11Mit337  11  118.67  D4Mit288  4  56.04  7.08  0.0064  0.000  0.9987
S57281  -  -  D11Mit318  11  61.82  D1Mit216  1  80.32  12.01  0.0013  -0.051  0.1091
S57281  -  -  D1Mit216  1  80.32  D11Mit318  11  61.82  10.94  0.0039  -0.051  0.0629
NM_018739  9  22.34  D2Mit286  2  154.58  D9Mit4  9  52.29  8.50  0.0026  0.014  0.7689
NM_018739  9  22.34  D9Mit4  9  52.29  D2Mit286  2  154.58  2.69  0.1386  0.014  0.7882
NM_019926  X  65.98  D15Mit13  15  3.29  D7Mit126  7  88.55  6.82  0.0026  0.002  0.9230
NM_019926  X  65.98  D7Mit126  7  88.55  D15Mit13  15  3.29  6.66  0.0051  0.002  0.9255
NM_008056  15  38.94  S18Gnf008.065  18  11.24  D1Mit113  1  171.88  6.11  0.0026  -0.043  0.6213
NM_008056  15  38.94  D1Mit113  1  171.88  S18Gnf008.065  18  11.24  2.87  0.0783  -0.043  0.6483
NM_011051  -  -  D14Mit121  14  41.30  D9Mit191  9  46.79  7.32  0.0026  0.010  0.6919
NM_011051  -  -  D9Mit191  9  46.79  D14Mit121  14  41.30  5.76  0.0090  0.010  0.7022
AK010399  6  51.73  S17Gnf094.470  17  91.59  D3Mit351  3  140.16  6.04  0.0039  0.003  0.8549
AK010399  6  51.73  D3Mit351  3  140.16  S17Gnf094.470  17  91.59  2.04  0.1489  0.003  0.8819
BC006037  17  78.44  D5Mit352  5  34.06  D4Mit155  4  100.92  6.44  0.0039  -0.007  0.7458
BC006037  17  78.44  D4Mit155  4  100.92  D5Mit352  5  34.06  4.54  0.0167  -0.007  0.7484
NM_008705  -  -  S18Gnf012.470  18  15.61  D4Mit203  4  127.94  8.20  0.0039  0.012  0.7702
NM_008705  -  -  D4Mit203  4  127.94  S18Gnf012.470  18  15.61  6.54  0.0077  0.012  0.7343
AK009442  3  30.46  D10Mit186  10  75.55  S05Gnf048.950  5  53.22  5.94  0.0051  0.035  0.1053
AK009442  3  30.46  S05Gnf048.950  5  53.22  D10Mit186  10  75.55  6.26  0.0077  0.035  0.1515
AK017060  11  112.65  D8Mit335  8  19.78  D5Mit370  5  124.33  7.92  0.0064  -0.026  0.6739
AK017060  11  112.65  D5Mit370  5  124.33  D8Mit335  8  19.78  1.54  0.2875  -0.026  0.5571
NM_010924  9  48.62  D9Mit4  9  52.29  D2Mit436  2  80.00  5.37  0.0077  0.089  0.2798
NM_010924  9  48.62  D2Mit436  2  80.00  D9Mit4  9  52.29  5.03  0.0244  0.089  0.2837
NM_007710  7  16.28  D19Mit28  19  15.18  D8Mit312  8  94.86  6.08  0.0077  0.026  0.7625
NM_007710  7  16.28  D8Mit312  8  94.86  D19Mit28  19  15.18  4.55  0.0257  0.026  0.7099
NM_011186  -  -  S18Gnf008.065  18  11.24  D9Mit4  9  52.29  4.99  0.0090  -0.048  0.0937
NM_011186  -  -  D9Mit4  9  52.29  S18Gnf008.065  18  11.24  4.41  0.0154  -0.048  0.0873
NM_011631  10  86.67  D1Mit328  1  68.25  D3Mit208  3  55.53  7.51  0.0090  -0.053  0.1297
NM_011631  10  86.67  D3Mit208  3  55.53  D1Mit328  1  68.25  7.51  0.0128  -0.053  0.1489
NM_025573  5  114.44  D16Mit167  16  35.04  D7Mit17  7  96.15  4.76  0.0090  -0.026  0.2478
NM_025573  5  114.44  D7Mit17  7  96.15  D16Mit167  16  35.04  4.62  0.0167  -0.026  0.2657
AK017586  10  95.40  D12Mit114  12  65.35  S10Gnf020.445  10  22.27  4.27  0.0167  -0.054  0.3633
AK017586  10  95.40  S10Gnf020.445  10  22.27  D12Mit114  12  65.35  4.63  0.0501  -0.054  0.4621
NM_009271  2  156.93  D4Mit249  4  124.13  S18Gnf012.470  18  15.61  5.54  0.0167  -0.009  0.8716
NM_009271  2  156.93  S18Gnf012.470  18  15.61  D4Mit249  4  124.13  5.21  0.0193  -0.009  0.8922
AK016223  4  150.98  D12Mit235  12  37.92  S04Gnf150.225  4  148.83  3.77  0.0295  -0.056  0.4275
AK016223  4  150.98  S04Gnf150.225  4  148.83  D12Mit235  12  37.92  3.98  0.0372  -0.056  0.4531
M59956  -  -  D9Mit4  9  52.29  D18Mit31  18  11.42  4.05  0.0308  -0.017  0.7253
M59956  -  -  D18Mit31  18  11.42  D9Mit4  9  52.29  2.84  0.0937  -0.017  0.6662
NM_018733  2  66.14  D2Mit396  2  122.58  D18Mit149  18  45.44  4.25  0.0411  0.090  0.0976
NM_018733  2  66.14  D18Mit149  18  45.44  D2Mit396  2  122.58  3.46  0.0937  0.090  0.0950
AK010720  6  147.52  S03Gnf004.460  3  7.47  D1Mit134  1  80.78  4.19  0.0424  0.011  0.8010
AK010720  6  147.52  D1Mit134  1  80.78  S03Gnf004.460  3  7.47  3.31  0.1399  0.011  0.8357
AK006979  11  75.96  D17Mit159  17  69.75  D12Mit63  1  22.91  3.93  0.0513  -0.039  0.3453
AK006979  11  75.96  D12Mit63  1  22.91  D17Mit159  17  69.75  4.58  0.0616  -0.039  0.4031


A.7.2. Significant bqtl.twolocus results from the permB/K/L studies

Table A-16 Permuted expression phenotypes that passed the bqtl.twolocus test in the permB/K/L studies. For each of permB/K/L, five permutation tests were performed (perm #). Listed are the transcripts (GenBank accessions) that continue to pass the bqtl.twolocus test, and are thus confirmed for two-locus effects, even when the expression values have been randomised across the 31 BXDs. The last column indicates whether the permuted expression phenotype also passed the two-locus epistatic test. Corresponds to Table 4-5.

Study  Perm #  Permuted transcript  Multiply-linked eQTLs  sig. epi.?
X14625  D2Mit6; D8Mit128  yes
AK020339  D1Mit134; D8Mit124  yes
NM_021306  D1Mit216; S08Gnf006.700  yes
NM_019658  D3Mit12; D4Mit214  yes
1  AB047759  D2Mit62; D9Mit91  yes
AF249870  D1Mit134; Mod2  yes
AK014278  D19Mit68; DXMsw076  yes
BC006733  D4Mit203; Mpmv13  yes
NM_025436  D1Mit134; D19Mit68  no
NM_019674  D1Mit216; D19Mit103  no
AF249870  D1Mit134; Mod2  yes
NM_009072  D1Mit216; Hoxd9  no
PermB  2  NM_007956  D11Mit41; D19Mit68  no
NM_031397  D19Mit68; D19Mit10  no
AF367247  D1Mit134; D19Mit68  no
X14625  D2Mit6; D8Mit128  yes
NM_009744  D1Mit216; D2Mit436  yes
3  AF357411  D1Mit134; D11Mit34  yes
AK006652  D19Mit68; D1Mit134  no
NM_010902  D1Mit328; D13Mit145  no
AK020050  D1Mit134; D19Mit68  yes
4  NM_028889  D1Mit134; D19Mit68  no
NM_019952  D1Mit178; Hoxd9  yes
5  AJ279835  D1Mit134; D19Mit103  yes
NM_010321  D6Mit67; D7Mit117  no
1  AK014905  D2Mit102; D9Mit91  no
NM_019545  D3Mit155; D9Mit4  yes
2  AK020896  D13Mit18; D14Mit99  no
AF090691  D8Mit289; D12Mit36  yes
PermK  3  AK014905  D2Mit102; D9Mit91  no
AK020896  D13Mit18; D14Mit99  yes
4  BC013667  S01Gnf070.445; S13Gnf028.450  no
PermL  1  NM_011631  D1Mit328; D3Mit208  yes
NM_019926  D7Mit126; D15Mit13  yes
NM_007591  D1Mit83; D12Mit231  no
Y18276  D2Mit58; D9Mit91  no
NM_017378  D09Msw003; D11Mit41; D17Mit49  yes
2  NM_007591  D1Mit83; D12Mit231  yes
NM_011631  D1Mit328; D3Mit208  yes
NM_018733  D2Mit396; D18Mit149  yes
BC012209  D1Mit106; D7Mit246  yes
4  NM_025822  D4Mit288; D11Mit337  no
S57281  D1Mit216; D11Mit318  no
5  AK010720  D1Mit134; S03Gnf004.460  no

A.7.3. Summary of bqtl.twolocus results from the ten eQTL-mapping studies

Table A-17 Number and percentages of transcripts with multiple unique linkages that are confirmed for two-locus and two-locus epistatic effects in the ten eQTL-mapping studies analysed in this thesis project. The transcripts analysed are those that are found to link to more than one marker after applying remove.LD. This table corresponds to Figure 4-12.

| Study | Transcripts with multiple linkages | Confirmed two-locus effects | % | Confirmed two-locus epistatic effects | % |
|---|---|---|---|---|---|
| CotB | 72 | 70 | 97.2 | 15 | 20.8 |
| CotK | 17 | 15 | 88.2 | 2 | 11.8 |
| CotL | 35 | 34 | 97.1 | 4 | 11.4 |
| ChesB | 19 | 18 | 94.7 | 0 | 0.0 |
| BysHSC | 45 | 38 | 84.4 | 2 | 4.4 |
| HubK | 402 | 249 | 61.9 | 42 | 10.4 |
| HubFC | 227 | 76 | 33.5 | 14 | 6.2 |
| YveY.Cy3 | 123 | 123 | 100.0 | 13 | 10.6 |
| YveY.Cy5 | 93 | 93 | 100.0 | 9 | 9.7 |
| SchL | 200 | 184 | 92.0 | 56 | 28.0 |


A.8. Genotype pattern and multiple linkages

Table A-18 Genotype pattern-association classification and bqtl.twolocus results of multiple-linkage from the ten eQTL-mapping studies. Multi-locus linkages are first classified into one of four genotype pattern associations: syntenic, NSA, random, and no association. Random association is defined using the “tail-50%” criterion (Section 5.5). This is followed by confirmation for two-locus effects using bqtl.twolocus. Proportions are based on the total number of transcripts linked to multiple eQTLs (last column). This table corresponds to Figure 5-24.

Column groups give the type of genotype pattern association; "Yes" and "No" indicate whether two-locus effects were confirmed.

| Study | Synteny Yes | Synteny No | NSA Yes | NSA No | Random Yes | Random No | Unassociated Yes | Unassociated No | Total number of transcripts linked to multiple eQTLs |
|---|---|---|---|---|---|---|---|---|---|
| CotB | 2 (2.8%) | 1 (1.4%) | 12 (16.7%) | 1 (1.4%) | 56 (77.8%) | 0 | 0 | 0 | 72 |
| CotK | 1 (5.9%) | 0 | 4 (23.5%) | 2 (11.8%) | 10 (58.8%) | 0 | 0 | 0 | 17 |
| CotL | 0 | 0 | 13 (37.1%) | 1 (2.9%) | 20 (57.1%) | 0 | 1 (2.9%) | 0 | 35 |
| ChesB | 1 (5.3%) | 0 | 9 (47.4%) | 1 (5.3%) | 8 (42.1%) | 0 | 0 | 0 | 19 |
| BysHSC | 0 | 1 (2.2%) | 22 (48.9%) | 6 (13.3%) | 16 (35.6%) | 0 | 0 | 0 | 45 |
| HubK | 99 (24.6%) | 143 (35.6%) | 64 (15.9%) | 10 (2.5%) | 86 (21.4%) | 0 | 0 | 0 | 402 |
| HubFC | 41 (18.1%) | 138 (60.8%) | 20 (8.8%) | 7 (3.1%) | 15 (6.6%) | 6 (2.6%) | 0 | 0 | 227 |
| YveY3 | 3 (2.4%) | 0 | 16 (13.0%) | 0 | 86 (69.9%) | 0 | 18 (14.6%) | 0 | 123 |
| YveY5 | 9 (9.7%) | 0 | 13 (14.0%) | 0 | 57 (61.3%) | 0 | 14 (15.1%) | 0 | 93 |
| SchL | 32 (16.0%) | 5 (2.5%) | 50 (25.0%) | 7 (3.5%) | 68 (34.0%) | 4 (2.0%) | 34 (17.0%) | 0 | 200 |
| Average | 8.5% | 10.2% | 25.0% | 4.4% | 46.5% | 0.5% | 5.0% | 0.0% | |
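As a quick arithmetic check on Table A-18, the short Python sketch below recomputes the Synteny percentages and their Average row. It rests on two assumptions that the thesis does not state explicitly: that each cell pairs a count with its percentage of the study total, and that the Average row is the unweighted mean of the ten per-study percentages. The variable names are illustrative only.

```python
# (study, total transcripts, Synteny "Yes" count, Synteny "No" count),
# as read from Table A-18 under the assumptions stated above.
rows = [
    ("CotB", 72, 2, 1),
    ("CotK", 17, 1, 0),
    ("CotL", 35, 0, 0),
    ("ChesB", 19, 1, 0),
    ("BysHSC", 45, 0, 1),
    ("HubK", 402, 99, 143),
    ("HubFC", 227, 41, 138),
    ("YveY3", 123, 3, 0),
    ("YveY5", 93, 9, 0),
    ("SchL", 200, 32, 5),
]


def pct(count, total):
    """Percentage of a study's multiply-linked transcripts in one class."""
    return 100.0 * count / total


# Unweighted mean of the per-study percentages (assumed meaning of "Average").
avg_yes = sum(pct(yes, total) for _, total, yes, _ in rows) / len(rows)
avg_no = sum(pct(no, total) for _, total, _, no in rows) / len(rows)

print(round(avg_yes, 1), round(avg_no, 1))
```

Under these assumptions the two means round to the 8.5% and 10.2% shown in the Average row, so small studies such as CotK carry the same weight in that row as large ones such as HubK.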


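Table A-19 Alternate version of Table A-18 where random association is defined using the “tail-10%” criterion (Section 5.5).

Column groups give the type of genotype pattern association; "Yes" and "No" indicate whether two-locus effects were confirmed.

| Study | Synteny Yes | Synteny No | NSA Yes | NSA No | Random Yes | Random No | Unassociated Yes | Unassociated No | Total number of transcripts linked to multiple eQTLs |
|---|---|---|---|---|---|---|---|---|---|
| CotB | 2 (2.8%) | 1 (1.4%) | 12 (16.7%) | 1 (1.4%) | 46 (63.9%) | 0 | 10 (13.9%) | 0 | 72 |
| CotK | 1 (5.9%) | 0 | 4 (23.5%) | 2 (11.8%) | 9 (52.9%) | 0 | 1 (5.9%) | 0 | 17 |
| CotL | 0 | 0 | 13 (37.1%) | 1 (2.9%) | 16 (45.7%) | 0 | 5 (14.3%) | 0 | 35 |
| ChesB | 1 (5.3%) | 0 | 9 (47.4%) | 1 (5.3%) | 8 (42.1%) | 0 | 0 | 0 | 19 |
| BysHSC | 0 | 1 (2.2%) | 22 (48.9%) | 6 (13.3%) | 14 (31.1%) | 0 | 2 (4.4%) | 0 | 45 |
| HubK | 99 (24.6%) | 143 (35.6%) | 64 (15.9%) | 10 (2.5%) | 79 (19.7%) | 0 | 7 (1.7%) | 0 | 402 |
| HubFC | 41 (18.1%) | 138 (60.8%) | 20 (8.8%) | 7 (3.1%) | 14 (6.2%) | 5 (2.2%) | 1 (0.4%) | 1 (0.4%) | 227 |
| YveY3 | 3 (2.4%) | 0 | 16 (13.0%) | 0 | 57 (46.3%) | 0 | 47 (38.2%) | 0 | 123 |
| YveY5 | 9 (9.7%) | 0 | 13 (14.0%) | 0 | 37 (39.8%) | 0 | 34 (36.6%) | 0 | 93 |
| SchL | 32 (16.0%) | 5 (2.5%) | 50 (25.0%) | 7 (3.5%) | 14 (7.0%) | 1 (0.5%) | 88 (44.0%) | 3 (1.5%) | 200 |
| Average | 8.5% | 10.2% | 25.0% | 4.4% | 35.5% | 0.3% | 16.0% | 0.2% | |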


A.9. Effect of LD on joint two-locus effects

Figure A-14 Pearson’s correlation coefficient measure of linkage disequilibrium between locus-pairs identified from single-locus mapping analyses of the CotB/K/L data, against estimated joint two-locus effects on the corresponding expression traits using bqtl.twolocus. Locus-pairs presumed to be unassociated are indicated in red and green.
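For readers unfamiliar with using Pearson's correlation as a measure of linkage disequilibrium, the following Python sketch shows the calculation for one locus-pair. It is a minimal illustration under stated assumptions, not the thesis's implementation: the two genotype strings are invented, the 0/1 coding of the B and D parental alleles is an arbitrary choice, and the helper names `encode` and `pearson_r` are hypothetical.

```python
import math


def encode(genotypes):
    """Code each strain's genotype numerically: 'B' -> 0, 'D' -> 1 (assumed coding)."""
    return [0 if g == "B" else 1 for g in genotypes]


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length vectors.

    Undefined (division by zero) if either locus shows no variation
    across the panel.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Invented genotype patterns for two loci across eight strains.
locus_a = encode("BBDDBDBD")
locus_b = encode("BBDDBDDB")

r = pearson_r(locus_a, locus_b)  # 0.5: the two patterns agree in 6 of 8 strains
```

Values of r near 1 or -1 mean the two loci share almost the same inheritance pattern across the panel, in which case their individual effects on an expression trait are hard to separate.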


Figure A-15 Pearson’s correlation coefficient measure of linkage disequilibrium between locus-pairs identified from single-locus mapping analyses of the ChesB, BysHSC, and SchL data, against estimated joint two-locus effects on the corresponding expression traits using bqtl.twolocus. Locus-pairs presumed to be unassociated are indicated in red and green.


Figure A-16 Pearson’s correlation coefficient measure of linkage disequilibrium between locus-pairs identified from single-locus mapping analyses of the HubFC/K data, against estimated joint two-locus effects on the corresponding expression traits using bqtl.twolocus. Locus-pairs presumed to be unassociated are indicated in red and green.


Figure A-17 Pearson’s correlation coefficient measure of linkage disequilibrium between locus-pairs identified from single-locus mapping analyses of the YveY3/Y5 data, against estimated joint two-locus effects on the corresponding expression traits using bqtl.twolocus. Locus-pairs presumed to be unassociated are indicated in red and green.

A-56 References

References

ABDALLAH, J.M., B. GOFFINET, C. CIERCO-AYROLLES, and M. PÉREZ- ENCISO, 2003 Linkage disequilibrium fine mapping of quantitative trait loci: A simulation study Genetics Selection Evolution 35: 513– 532. BACHELLERIE, J.P., J. CAVAILLE and A. HUTTENHOFER, 2002 The expanding snoRNA world. Biochimie 84: 775-790. BAILEY, D., 1971 Recombinant-inbred strains. An aid to finding identity, linkage, and function of histocompatibility and other genes. Transplantation 11: 325-327. BECK, J.A., S. LLOYD,M.HAFEZPARAST,M.LENNON-PIERCE,J.T.EPPIG et al., 2000 Genealogies of mouse inbred strains. Nature Genetics 24: 23-25. BEGUM,S.,N.EMANI,A.CHEUNG,O.WILKINS,S.DER,and P.A.HAMEL, 2005 Cell-type-specific regulation of distinct sets of gene targets by Pax3 and Pax3/FKHR. Oncogene 24(11): 1860-72. BENJAMINI, Y., and Y. HOCHBERG, 1995 Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society B 57: 289 -300. BERRY, C.C., 2005 bqtl: Bayesian QTL mapping toolkit. BOLSTAD, B.M., R.A. IRIZARRY,M.ASTRAND and T.P. SPEED, 2003 A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19: 185-193. BREM, R.B., and L. KRUGLYAK, 2005 The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proceedings National Academy of Sciences U S A 102: 1572-1577. BREM, R.B., J.D. STOREY,J.WHITTLE and L. KRUGLYAK, 2005 Genetic interactions between polymorphisms that affect gene expression in yeast. Nature 436: 701-703. BREM, R.B., G. YVERT,R.CLINTON and L. KRUGLYAK, 2002 Genetic dissection of transcriptional regulation in budding yeast. Science 296: 752-755. BROMAN, K.W., 2001 Review of statistical methods for QTL mapping in experimental crosses. Lab Animal (NY) 30(7): 44-52. BROMAN,K.W.,H.WU,S.SEN and G.A. CHURCHILL, 2003 R/qtl: QTL mapping in experimental crosses. Bioinformatics 19: 889-890. 
BYSTRYKH,L.,E.WEERSING,B.DONTJE,S.SUTTON,M.T.PLETCHER et al., 2005 Uncovering regulatory pathways that affect hematopoietic stem cell function using 'genetical genomics'. Nature Genetics 37: 225-232. CARLBORG,O.,D.J.DE KONING,K.F.MANLY,E.CHESLER,R.W.WILLIAMS et al., 2005 Methodological aspects of the genetic dissection of gene expression. Bioinformatics 21: 2383-2393. CAVALIERI,D.,J.P.TOWNSEND and D.L. HARTL, 2000 Manifold anomalies in gene expression in a vineyard isolate of Saccharomyces

R-1 References

cerevisiae revealed by DNA microarray analysis. Proceedings of the National Academy of Sciences 97: 12369-12374. CERVINO, A.C., G. LI,S.EDWARDS,J.ZHU,C.LAURIE, et al., 2005 Integrating QTL and high-density SNP analyses in mice to identify Insig2 as a susceptibility gene for plasma cholesterol levels. Genomics 86(5): 505-17. CHESLER, E.J., L. LU,S.SHOU,Y.QU,J.GU et al., 2005 Complex trait analysis of gene expression uncovers polygenic and pleiotropic networks that modulate nervous system function. Nature Genetics 37: 233-242. CHEUNG,V.G.,L.K.CONLIN,T.M.WEBER,M.ARCARO,K.Y.JEN et al., 2003 Natural variation in human gene expression assessed in lymphoblastoid cells. Nature Genetics 33: 422-425. CHURCHILL, G.A., and R.W. DOERGE, 1994 Empirical threshold values for quantitative trait mapping. Genetics 138: 963-971. CLARK,A.G.,S.GLANOWSKI,R.NIELSEN,P.D.THOMAS,A.KEJARIWAL et al., 2003 Inferring nonneutral evolution from human-chimp-mouse orthologous gene trios. Science 302: 1960-1963. COFFMAN, C.J., R.W. DOERGE,M.L.WAYNE and L.M. MCINTYRE, 2003 Intersection tests for single marker QTL analysis can be more powerful than two marker QTL analysis. BMC Genetics 4: 10. COTSAPAS, C. J., 2005 The Genetics of variation in gene expression. School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney. COTSAPAS, C.J., E.K. CHAN, M.F. KIRK, M.TANAKA and P.F. LITTLE, 2003 Genetic variation and the control of transcription, pp. 109-114 in Cold Spring Harbor Symposia on Quantitative Biology. Cold Spring Harbor, Cold Spring Harbor, New York. COTSAPAS, C.J., R.B. WILLIAMS,J.N.PULVERS,D.J.NOTT,E.K.CHAN et al., 2006 Genetic dissection of gene regulation in multiple mouse tissues. Mammalian Genome 17: 490-495. CRAWFORD, D.C., D.T. AKEY AND D.A. NICKERSON, 2005 The patterns of natural variation in human genes. Annual Review of Genomics and Human Genetics 6: 287-312. DARWIN, C., 1859 On the origin of species by means of natural selection. J. Murray, London. 
DEKKERS, J.C.M., and F. HOSPITAL, 2002 Multifactorial genetics: the use of molecular genetics in the improvement of agricultural populations. Nature Reviews Genetics 3: 22-32. DE KONING, D.J. and C.S. HALEY 2005 Genetical genomics in humans and model organisms. Trends in Genetics 21(7): 377-381. DOERGE, R.W., and G.A. CHURCHILL, 1996 Permutation tests for multiple loci affecting a quantitative character. Genetics 142: 285-294. DOSS S.,E.E.SCHADT,T.A.DRAKE,andA.J.LUSIS, 2005 Cis-acting expression quantitative trait loci in mice. Genome Research 15(5): 681-91. DRISCOLL, M.C., C.S. DOBKIN and B.P. ALTER, 1989 gamma delta beta - Thalassemia Due to a de novo Mutation Deleting the 5' beta -globin

R-2 References

Gene Activation-Region Hypersensitive Sites. Proceedings of the National Academy of Sciences 86: 7470-7474. EVANS,D.M.,J.MARCHINI,A.P.MORRIS,andL.R.CARDON, 2006 Two- stage two-locus models in genome-wide association. PLoS Genetics 2(9): e157. FRITH,M.C.,Y.FU,L.YU,J.F.CHEN,U.HANSEN, and Z. WENG, 2004 Detection of functional DNA motifs via statistical over- representation. Nucleic Acids Research 32(4): 1372-81. GE,Y.,S.DUDOIT and T.P. SPEED, 2003 Resampling-based multiple testing for microarray data analysis. TEST 12: 1-77. GIBBS,R.A.,G.M.WEINSTOCK,M.L.METZKER,D.M.MUZNY,E.J. SODERGREN et al., 2004 Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428: 493-521. GIBSON, G. and B. WEIR, 2005 The quantitative genetics of transcription. Trends in Genetics 21(11): 616-23. GRABER, J.H., G.A. CHURCHILL,K.J.DIPETRILLO,B.L.KING,P.M.PETKOV et al., 2006 Patterns and mechanisms of genome organization in the mouse. Journal of Experimental Zoology 305A: 683-688. GURYEV,V.,B.M.SMITS,J. VAN DE BELT,M.VERHEUL,N.HUBNER, and E. CUPPEN, 2006 Haplotype block structure is conserved across mammals. PLoS Genetics. 2(7): e121. HILL, W.G. and A. ROBERTSON, 1968. Linkage disequilibrium in finite populations. Theoretical Applied Genetics 33: 54-78. HOLM, S., 1979 A Simple Sequentially Rejective Bonferroni Test Procedure. Scandinavian Journal of Statistics 6: 65 -70. HSIEH, M.C., and H.K. BERRY, 1979 Distribution of phenylalanine hydroxylase (EC 1.14.3.1) in liver and kidney of vertebrates. Journal of Experimental Zoology 208: 161-167. HUBNER,N.,C.A.WALLACE,H.ZIMDAHL,E.PETRETTO,H.SCHULZ et al., 2005 Integrated transcriptional profiling and linkage analysis for identification of genes underlying disease. Nature Genetics 37: 243- 253. HUTTENHOFER,A.,M.KIEFMANN,S.MEIER-EWERT,J.O'BRIEN,H. LEHRACH et al., 2001 RNomics: an experimental approach that identifies 201 candidates for novel, small, non-messenger RNAs in mouse. EMBO Journal 20: 2943-2953. 
IDERAABDULLAH, F.Y., E. DE LA CASA-ESPERON,T.A.BELL,D.A. DETWILER,T.MAGNUSON et al., 2004 Genetic and Haplotype Diversity Among Wild-Derived Mouse Inbred Strains. Genome Research 14: 1880-1887. IRIZARRY, R.A., B. HOBBS,F.COLLIN,Y.D.BEAZER-BARCLAY,K.J. ANTONELLIS et al., 2003 Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4: 249-264. JANSEN, R.C., 1993 Interval mapping of multiple quantitative trait loci. Genetics 135(1): 205-11. JANSEN, R.C., and J.P. NAP, 2001 Genetical genomics: the added value from segregation. Trends in Genetics 17: 388-391.

R-3 References

JIA,Z. AND S. XU, 2007 Mapping Quantitative Trait Loci for Expression Abundance. Genetics 176: 611-623. JIN,W.,R.M.RILEY,R.D.WOLFINGER,K.P.WHITE,G.PASSADOR- GURGEL et al., 2001 The contributions of sex, genotype and age to transcriptional variance in Drosophila melanogaster. Nature Genetics 29: 389-395. JIROUT,M.,D.KRENOVA,V.KREN,L.BREEN,M.PRAVENEC et al., 2003 A new framework marker-based linkage map and SDPs for the rat HXB/BXH strain set. Mammalian Genome 14: 537-546. KAO, C.H., Z.B. ZENG, and R.D. TEASDALE. 1999 Multiple interval mapping for quantitative trait loci. Genetics 152(3): 1203-16. KENDZIORSKI, C.M., M.CHEN,M.YUAN,H.LAN, and A.D. ATTIE, 2006 Statistical methods for expression quantitative trait loci (eQTL) mapping. Biometrics 62(1): 19-27. KING, M.C., and A.C. WILSON, 1975 Evolution at two levels in humans and chimpanzees. Science 188: 107-116. KIRST, M., A.A. MYBURG,J.P.G.DE LEON,M.E.KIRST,J.SCOTT et al., 2004 Coordinated Genetic Regulation of Growth and Lignin Revealed by Quantitative Trait Locus Analysis of cDNA Microarray Data in an Interspecific Backcross of Eucalyptus. Plant Physiology 135: 2368-2378. KONG,X.,K.MURPHY,T.RAJ and T. MATISE, 2004 A Combined Linkage- Physical Map of the . The American Journal of Human Genetics 75: 1143–1148. LANDER, E.S., and D. BOTSTEIN, 1989 Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121: 185-199. LANDER,E.S.,L.M.LINTON,B.BIRREN,C.NUSBAUM,M.C.ZODY et al., 2001 Initial sequencing and analysis of the human genome. Nature 409: 860-921. LETTICE, L.A., S.J.H. HEANEY,L.A.PURDIE,L.LI,P.DE BEER et al., 2003 A long-range Shh enhancer regulates expression in the developing limb and fin and is associated with preaxial polydactyly. Human Molecuar Genetics 12: 1725-1735. LEWONTIN, R.C. and K. KOJIMA, 1960 The evolutionary dynamics of complex polymorphisms. Evolution 14: 458-472. 
LI,Z.,A.FLORATOS,D.WANG,andA.CALIFANO, [unpublished] A Pattern Discovery-Based Method for Detecting Multi-Locus Genetic Association. http://arxiv.org/abs/q-bio/0703038v1 LI,J.,JIANG,T.MAO, J.H., BALMAIN,A.PETERSON, L. et al., 2004 Genomic segmental polymorphisms in inbred mouse strains. Nature Genetics 36: 952-954. LI,Q.,S.HARJU and K. R. PETERSON, 1999 Locus control regions: coming of age at a decade plus. Trends in Genetics 15: 403-408. LINDBLAD-TOH,K.,C.M.WADE,T.S.MIKKELSEN,E.K.KARLSSON,D.B. JAFFE et al., 2005 Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature 438: 803-819.

R-4 References

MARTINEZ, O., and R. N. CURNOW, 1992 Estimating the locations and sizes of effects of quantitative trait loci using flanking markers. Theoretical and Applied Genetics 85: 480-488. MATSUZAKI,H.,S.DONG,H.LOI,X.DI,G.LIU et al., 2004 Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays. Nature Methods 1: 109-111. MATYS, V., O. V. KEL-MARGOULIS,E.FRICKE,I.LIEBICH,S.LAND et al., 2006 TRANSFAC(R) and its module TRANSCompel(R): transcriptional gene regulation in eukaryotes. Nucleic Acids Research 34: D108-110. MCGINNIS,S.,andT.L.MADDEN, 2004 BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Research 32: W20-25. MCLEOD,H.L.,andW.E.EVANS, 2001 Pharmacogenomics: Unlocking the Human Genome for Better Drug Therapy. Annual Review of Pharmacology and Toxicology 41: 101-121. MONKS,S.A.,A.LEONARDSON,H.ZHU,P.CUNDIFF,P.PIETRUSIAK et al., 2004 Genetic inheritance of gene expression in human cell lines. American Journal of Human Genetics 75: 1094-1105. MORGANTE,M.,andF.SALAMINI, 2003 From plant genomics to breeding practice. Current Opinion in Biotechnology 14: 214-219. MORLEY,M.,C.M.MOLONY,T.M.WEBER,J.L.DEVLIN,K.G.EWENS et al., 2004 Genetic analysis of genome-wide variation in human gene expression. Nature 430: 743-747. OLEKSIAK, M.F., G.A. CHURCHILL and D.L. CRAWFORD, 2002 Variation in gene expression within and among natural populations. Nature Genetics 32: 261-266. PARK, Y.G., R. CLIFFORD,K.H.BUETOW, and K.W. HUNTER 2003 Multiple Cross and Inbred Strain Haplotype Mapping of Complex-Trait Candidate Genes. Genome Research 13: 118-121. PASTINEN,T.,andT.J.HUDSON, 2004 Cis-Acting Regulatory Variation in the Human Genome. Science 306: 647-650. PEIRCE,J.L.,H.LI,J.WANG,K.F.MANLY,R.J.HITZEMANN, et al. 2006 How replicable are mRNA expression QTL? Mammalian Genome 17(6): 643-56. PEIRCE,J.L.,L.LU,J.GU,L.M.SILVER and R. W. WILLIAMS, 2004 A new set of BXD recombinant inbred lines from advanced intercross populations in mice. BMC Genetics 5: 7. 
PÉREZ-ENCISO, M., 2004 In Silico Study of Transcriptome Genetic Variation in Outbred Populations. Genetics 166: 547-554. PETKOV P.M.,J.H.GRABER,G.A.CHURCHILL,K.DIPETRILLO,B.LKING, and K. PAIGEN, 2005 Evidence of a large-scale functional organization of mammalian chromosomes. PLoS Genetics 1(3): e33. PRAVENEC,M.,V.KREN,D.KRENOVA,V.BILA,V.ZIDEK et al., 1999 HXB/Ipcv and BXH/Cub recombinant inbred strains of the rat: strain distribution patterns of 632 alleles. Folia Biol (Praha) 45: 203-215.

R-5 References

QUACKENBUSH, J., 2002 Microarray data normalization and transformation. Nature Genetics 32: 496 - 501. RDEVELOPMENT CORE TEAM, 2006 R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ROCKMAN, M.V. and L. KRUGLYAK, 2006 Genetics of global gene expression. Nature Review Genetics. 7(11): 862-72. SACHIDANANDAM,R.,D.WEISSMAN,S.C.SCHMIDT,J.M.KAKOL,L.D. STEIN et al., 2001 A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409: 928-933. SADÉE,W.,andZ.DAI, 2005 Pharmacogenetics/genomics and personalized medicine. Human Molecular Genetics 14: R207-R214. SANDBERG,R.,R.YASUDA,D.G.PANKRATZ,T.A.CARTER,J.A.DEL RIO et al., 2000 Regional and strain-specific gene expression mapping in the adult mouse brain. Proceedings of the National Academy of Sciences U S A 97: 11038-11043. SCHADT,E.E.,S.A.MONKS,T.A.DRAKE,AJ.LUSIS,N.CHE et al., 2003 Genetics of gene expression surveyed in maize, mouse and man. Nature 422: 297-302. SCHAID, D.J., 2002 Relative efficiency of ambiguous vs. directly measured haplotype frequencies. Genetic Epidemiology 23: 426–443. SCHAID, D.J., 2004 Evaluating associations of haplotypes with traits Genetic Epidemiology 27: 348-364. SÉMON M. and L. DURET, 2006 Evolutionary origin and maintenance of coexpressed gene clusters in mammals. Molecular Biology and Evolution 23(9): 1715-23. SLADEK, R. and T.J. HUDSON, 2006 Elucidating cis- and trans-regulatory variation using genetical genomics.Trends in Genetics 22(5): 245- 50. SMYTH,G.K.,andT.SPEED, 2003 Normalization of cDNA microarray data. Methods 31: 265-273. SPILIANAKIS, C.G., M.D. LALIOTI,T.TOWN,G.R.LEE and R.A. FLAVELL, 2005 Interchromosomal associations between alternatively expressed loci. Nature 435: 637-645. SRIVASTAVA,M.,S.HSIEH,A.GRINBERG,L.WILLIAMS-SIMONS,S.-P. 
HUANG et al., 2000 H19 and Igf2 monoallelic expression is regulated in two distinct ways by a shared cis acting regulatory region upstream of H19. Genes and Development 14: 1186-1195. STEEMERS,F.J.,W.CHANG,G.LEE,D.L.BARKER,R.SHEN et al., 2006 Whole-genome genotyping with the single-base extension assay. Nture Methods 3: 31-33. STOREY, J. D., 2002 A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64: 479-498. STOREY,J.D.,J.M.AKEY and L. KRUGLYAK, 2005 Multiple locus linkage analysis of genomewide expression in yeast. PLoS Biology 3: e267.

R-6 References

STOREY,J.D.,andR.TIBSHIRANI, 2003 Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences U S A 100: 9440-9445. STRACHAN, T., and A.P. READ, 1999 Human Molecular Genetics 2. BIOS Scientific Publishers, Oxford, UK. SWARTZ, M.E., J. EBERHART, E.B. PASQUALE, and C.E. KRULL, 2001 EphA4/ephrin-A5 interactions in muscle precursor cell migration in the avian forelimb. Development 128(23): 4669-80. TAYLOR, B.A., 1978 Recombinant inbred strains: Use in gene mapping, pp. 423-438 in Origins of Inbred Mice, edited by H. C. MORSE III. Academic Press, NY. TAYLOR,B.A.,C.WNEK,B.S.KOTLUS,N.ROEMER,T.MACTAGGART et al., 1999 Genotyping new BXD recombinant inbred mouse strains and comparison of BXD and consensus maps. Mammalian Genome 10: 335-348. THE CHIMPANZEE SEQUENCING AND ANALYSIS CONSORTIUM, 2005 Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437: 69-87. THE INTERNATIONAL HAPMAP CONSORTIUM, 2005 A haplotype map of the human genome. Nature 437: 1299-1320. VAN DER PLOEG,L.H.T.,A.KONINGS,M.OORT,D.ROOS,L.BERNINI et al., 1980 "-!-Thalassaemia studies showing that deletion of the "- and -genes influences !-globin gene expression in man. Nature 283: 637-642. VENTER, J.C., M.D. ADAMS,E.W.MYERS,P.W.LI,R.J.MURAL et al., 2001 The Sequence of the Human Genome. Science 291: 1304-1351. VILLARD, J., 2004 Transcription regulation and human diseases. Swiss Medical Weekly 134: 571-579. WANG,X.,R.KORSTANJE,D.HIGGINS, and B. PAIGEN, 2004 Haplotype Analysis in Multiple Crosses to Identify a QTL Gene. Genome Research 14: 1767-1772. WALL, J.D., and J.K. PRITCHARD, 2003 Haplotype block and linkage disequilibrium in the human genome. Nature Reviews Genetics 4: 587-597. WATERSTON,R.H.,K.LINDBLAD-TOH,E.BIRNEY,J.ROGERS,J.F.ABRIL et al., 2002 Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520-562. WESTFALL, P.H., and S.S. YOUNG, 1993 Resampling-based Multiple Testing. Wiley, New York. 
WHITEHEAD,A.,andD.L.CRAWFORD, 2005 Variation in tissue-specific gene expression among natural populations. Genome Biology 6: R13. WHITNEY,A.R.,M.DIEHN,S.J.POPPER,A.A.ALIZADEH,J.C.BOLDRICK et al., 2003 Individuality and variation in gene expression patterns in human blood. Proceedings of the National Academy of Sciences U S A 100: 1896-1901.

R-7 References

WILLIAMS, R.B., C.J. COTSAPAS,M.J.COWLEY,E.CHAN,D.J.NOTT et al., 2006 Normalization procedures and detection of linkage signal in genetical-genomics experiments. Nature Genetics 38: 855-856. WILLIAMS,R.W.,J.GU,S.QI and L. LU, 2001 The genetic structure of recombinant inbred mice: high-resolution consensus maps for complex trait analysis. Genome Biology 2: RESEARCH0046. WINTER, W.E., and J.H. SILVERSTEIN, 2000 Molecular and genetic bases for maturity onset diabetes of youth. Current Opinions in Pediatrics 12: 388-393. WITTKOPP, P.J., 2005 Genomic sources of regulatory variation in cis and in trans. Cellular and Molecular Life Sciences 62: 1779-1783. YANG, Y.H., S. DUDOIT,P.LUU,D.M.LIN,V.PENG et al., 2002 Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research 30: e15. YVERT, G., R.B. BREM,J.WHITTLE,J.M.AKEY,E.FOSS et al., 2003 Trans- acting regulatory variation in Saccharomyces cerevisiae and the role of transcription factors. Nature Genetics 35: 57-64. ZAMIR, D., 2001 Improving plant breeding with exotic genetic libraries. Nature Reviews Genetics 2: 983-989. ZENG, Z.B., 1993 Theoretical basis for separation of multiple linked gene effects in mapping quantitative trait loci. Proceedings of the National Academy of Sciences USA 90(23): 10972-6. ZENG, Z.B., C.H. KAO, and C.J. BASTEN, 1999 Estimating the genetic architecture of quantitative traits. Genetical Research 74: 279-289. ZOU, W. and Z.B. ZENG, [unpublished] Multiple interval mapping for gene expression QTL analysis. http://statgen.ncsu.edu/zeng/MIM- eQTL.pdf


Referenced Websites

ArrayExpress http://www.ebi.ac.uk/arrayexpress/
Ensembl Mouse Genome database http://www.ensembl.org/Mus_musculus/
Ensembl Mouse Transcript SNP View database http://www.ensembl.org/Mus_musculus/transcriptsnpview
GeneNetwork (WebQTL) http://www.genenetwork.org/
JASPAR Transcription Factor Binding Profile database http://jaspar.genereg.net/
NCBI Gene Expression Omnibus (GEO) ftp://ftp.ncbi.nlm.nih.gov/pub/geo/data/geo/
Mouse Genome Informatics http://www.informatics.jax.org/
NCBI’s dbSNP http://www.ncbi.nlm.nih.gov/SNP/
NCBI Reference Sequence (RefSeq) database http://www.ncbi.nlm.nih.gov/RefSeq/
QTL Reaper http://sourceforge.net/projects/qtlreaper/
UCSC Mouse Genome Browser http://genome.ucsc.edu/
