UNIVERSITYOF QUEENSLAND

AUSTRALIA

Characterization of the genetic and environmental factors driving expression variability

Anita Goldinger B.Sc. (Hons)

A submitted for the degree of at The in 2016 University of Queensland Diamantina Institute

1 Abstract

Gene expression variation is a quantitative trait that drives phenotypic diversity across populations. On a cellular level, gene expression is an intermediate between stored genetic information and the functional utilization of this information within the cell. Through Wide Association Studies (GWAS), thousands of genetic polymorphisms associated with numerous diseases have been identified. These have provided many novel insights into the disrupted biological processes that drive the etiology of various health conditions.expression Quantitative Trait Loci (eQTLs) provide an additional layer of biological information about the physiological impact of common genetic variants. Therefore, the study of the genetic regulation of gene expression (eQTL studies) has been useful both in the validation and functional characterisation of GWAS polymorphisms. This has contributed to a better understanding of the precise molecular processes that contribute to the development of disease.

Global transcriptomic analyses have provided as greater insight into the level of complexity that drives biological systems. Transcriptomic data are often comprised of gene regulatory and co-expression networks, an emergent property of transcriptomic and other omic data. These networks within each fields interact with each other to further add layers of complexity that drive biological systems. Variation contained with gene expression datasets can, therefore, provide detail into the flow of infor- mation through these biological systems and how these can be influenced by genetic polymorphisms.

Transcriptome variation is highly influenced by genetic and environmental factors. Genetic regulation of gene expression represents, with some exceptions, fixed regulatory points that strictly control the expression of . Variance attributed to environmental effects, on the other hand, are often bio- logical responses to specific stimuli. The dissection of the genetic and environmental influences on expression levels will help to form a baseline upon which network models can be built to disseminate the biological flow of information in healthy, latent or disease groups.

This thesis will detail both methodological methods to clean data, and statistical approaches analyze the complexities found within the variance of transcriptomic data. The focus of this thesis is the

2 dissection of three major influences of gene expression variability: technical artifacts, environmental and genetic variation. Using statistical and quantitative genetic techniques on array-based and gene expression datasets, this thesis examines:

1. The use of Principal Components Analysis (PCA) to identify and correct for known batch ef- fects

2. Season variation as a pervasive environmental contributor to gene expression variation

3. The genetic contribution driving robust gene co-expression modules

The Brisbane Systems Study (BSGS) is comprised of both unrelated and related individu- als and was used throughout these three studies. The Center for Health Discovery and Well Being (CHDWB) cohort was used as a replication study and the Multiple Tissue Human Expression Re- source (MuTHER) cohort was employed to examine tissue-specific effects.

The first chapter provides a technical methodological analysis of the batch effect correction technique PCA. Batch effects have a large impact on gene expression variability, often creating artificial sys- tematic trends. By decomposing the data in Principal Components (PCs) we were able to quantify the degree and distribution of technical artifacts within gene expression datasets and determine the effectiveness of this correction method.

The second chapter examines the influence of pervasive macro-environmental factors on gene expres- sion datasets and provides a statistical framework to identify seasonal variation. Since datasets are often collected over time, samples may contain seasonal trends in gene expression that are environ- mentally driven and are not regarded as technical artifacts. By using loess decomposition and cosinor regression, 74 transcripts with a significant season trend were identified independently of seasonal variation in blood cell count.

Chapter three examines the genetic contribution to gene expression covariance between transcripts, called Blood Informative Transcripts (BITs) comprising of nine modules that have been previously identified and validated. Using quantitative genetic techniques, the genetic and environmental compo- nents driving phenotypic correlations for BIT transcripts were quantified. When compared to 10,000 bootstrap permutations of random probes the BITS demonstrate significant genetic correlation (aver- age 0.63 across all BITs) and an average genetic contribution to phenotypic correlations of 0.42. The high degree of genetic correlation demonstrates a strong genetic framework regulating the expression of BITs modules. This chapter also examines the presence of this replicated modules in three separate

3 tissue types, identifying several tissue-specific coexpression modules.

Overall, this thesis explores the landscape of gene expression variability. A methodological frame- work for the identification of technical artifacts and seasonal effects is investigated and the genetic architecture driving transcriptomic co-expression is characterized.

4 Declaration by Author

This thesis is composed of my original work, and contains no material previously published or written by another person except where due reference has been made in the text. I have clearly stated the contribution by others to jointly-authored works that I have included in my thesis.

I have clearly stated the contribution of others to my thesis as a whole, including statistical assistance, survey design, data analysis, significant technical procedures, professional editorial advice, and any other original research work used or reported in my thesis. The content of my thesis is the result of work I have carried out since the commencement of my research higher degree candidature and does not include a substantial part of work that has been submitted to qualify for the award of any other degree or diploma in any university or other tertiary institution. I have clearly stated which parts of my thesis, if any, have been submitted to qualify for another award.

I acknowledge that an electronic copy of my thesis must be lodged with the University Library and, subject to the policy and procedures of The University of Queensland, the thesis be made available for research and study in accordance with the Copyright Act 1968 unless a period of embargo has been approved by the Dean of the Graduate School.

I acknowledge that copyright of all material contained in my thesis resides with the copyright holder(s) of that material. Where appropriate I have obtained copyright permission from the copyright holder to reproduce material in this thesis.

5 Publications during candidature

Peer Reviewed Publications:

F. C. McKay, P. N. Gatt, N. Fewings, G. P. Parnell, S. D. Schibeci, M. A. I. Basuki, J. E. Powell, A. Goldinger, M. J. Fabis-Pedrini, A. G. Kermode, T. Burke, S. Vucic, G. J. Stewart, and D. R. Booth, “The low EOMES/TBX21 molecular phenotype in multiple sclerosis reflects CD56+ cell dys- regulation and is affected by immunomodulatory therapies.,” Clin. Immunol., vol. 163, pp. 96-107, Mar. 2016. A. Goldinger, K. Shakhbazov, A. K. Henders, A. F. McRae, G. W. Montgomery, and J. E. Powell, “Seasonal effects on gene expression.,” PLoS One, vol. 10, no. 5, p. e0126995, Jan. 2015.

A. Goldinger, A. K. Henders, A. F. McRae, N. G. Martin, G. Gibson, G. W. Montgomery, P. M. Visscher, and J. E. Powell, “Genetic and nongenetic variation revealed for the principal components of human gene expression.,” Genetics, vol. 195, no. 3, pp. 1117-28, Nov. 2013.

6 Publications included in this thesis

Table 1: Contribution to authorship for Chapter One: A. Goldinger, A. K. Henders, A. F. McRae, N. G. Martin, G. Gibson, G. W. Montgomery, P. M. Visscher, and J. E. Powell, “Genetic and nongenetic variation revealed for the principal components of human gene expression.,” Genetics, vol. 195, no. 3, pp. 1117-28, Nov. 2013.

Contributor Statement of contribution A.G. (Candidate) Designed Experiments (50%) Statistical analysis and interpretation (90%) Wrote the paper (95%) A.K.H Managed recruitment of Subjects and laboratory protocols for the generation of the BSGS dataset (30%) A.F.M Managed recruitment of Subjects and laboratory protocols for the generation of the BSGS dataset (50%) Advice on Statistical analysis and Interpretation of Results (10%) N.G.M Managed recruitment of Subjects and laboratory protocols for the generation of the BSGS dataset (10%) Edited Paper (10%) G.G Provided the CHDWB dataset (100%) G.W.M Managed recruitment of Subjects and laboratory protocols for the generation of the BSGS dataset (10%) P.M.V. Advice on Statistical analysis Interpretation of Results (10%) Edited Paper (20%) J.E.P Designed Experiments (50%) Statistical analysis (10%) Advice on Statistical analysis Interpretation of Results (80%) Wrote (5%) and Edited paper (80%)

7 Table 2: Contribution to authorship for Chapter Two: A. Goldinger, K. Shakhbazov, A. K. Henders, A. F. McRae, G. W. Montgomery, and J. E. Powell, “Seasonal effects on gene expression.,” PLoS One, vol. 10, no. 5, p. e0126995, Jan. 2015.

Contributor Statement of contribution A.G. (Candidate) Designed Experiments (50%) Statistical analysis and interpretation (95%) Wrote the paper (95%) K.S. Statistical analysis (5%) Advice on Statistical analysis Interpretation of Results (10%) A.K.H Managed recruitment of Subjects and laboratory protocols for the generation of the BSGS dataset (30%) A.F.M Managed recruitment of Subjects and laboratory protocols for the generation of the BSGS dataset (50%) Advice on Statistical analysis Interpretation of Results (10%) G.W.M Managed recruitment of Subjects and laboratory protocols for the generation of the BSGS dataset (10%) J.E.P Designed Experiments (50%) Statistical analysis (10%) Advice on Statistical analysis Interpretation of Results (80%) Wrote paper (5%) and Edited paper (80%)

8 Contributions by others to the thesis

Chapter One: Joseph Powell conceived the experiment. Joseph Powell and Peter Visscher (supervi- sors) provided input on the results and text. Data was provided by Anjali Henders, Allan McRae, Nicolas Martin, Grant Montgomery (BSGS dataset) and Greg Gibson (CHDWB dataset).

Chapter Two: Joseph Powell (supervisor) conceived the experiment and provided input on the results and text. Konstantin Shakhbazov provided cell count correction data. Data was provided by Anjali Henders, Allan McRae, Nicolas Martin, Grant Montgomery (BSGS dataset).

Chapter Three: Joseph Powell (supervisor) conceived the experiment with Greg Gibson. Joseph Powell and Peter Visscher (supervisors) provided input on the results and text. Jian Yang provided advice on statistical methods. Data was provided by Anjali Henders, Allan McRae, Nicolas Martin, Grant Montgomery (BSGS dataset).

9 Statement of parts of the thesis submitted to qualify for the award of another degree

None

10 Acknowledgements

I would like to thank Joseph Powell for being an excellent supervisor, for his guidance, support and encouragement, and the opportunities that he has extended to me. I would also like to thank Peter Visscher and Naomi Wray for their direction, advice and guidance.

I would like to thank Michelle Hill, Gethin Thomas, Grant Montgomery who acted as mythesis com- mittee.

There were many people who have been very generous of their time and supportive. I would especially like to thank Allan McRae for insight into statistical analysis, Jian Yang for navigating me through the intricacies of GCTA, Konstantin Shakhbazov for being critical on cell count correction, Gibran Hemani for showing me computer tricks along with other members of CSNG for their helpful insight and humor over the years. I would like to especially thank Zhihong Zhao, Peter Smartt, Rober Maier, Costanza Vallerga and Andrew Bakshi and for enlightening lunchtime discussions. I would also like to thank the administrative staff who have been very helpful: Anjali Henders, Angela Bestard and Sandra Pavey.

The financial support from the Australian Government through the Australian Postgraduate Award and the University of Queensland Diamantina Institute is gratefully acknowledged and very much appreciated

To Alex Gomersall, your patience, understanding and unconditional support have been so vital. This thesis would not have been possible without you.

Thank you all very much.

11 Keywords

Genetics, Statistics, Gene Expression, Seasonal Variation, Genetic Covariance, Genetic Variation, Quantitative Trait Loci

12 Australian and New Zealand Standard Research Classifications (ANZSRC)

ANSRC code: 010402 Biostatistics 50% 060405 Gene Expression 30% 060412 20%

13 Fields of Research (FoR) Classification

FoR code: 0604 Genetics 50% 0104 Statistics 50%

14 “The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ (I found it!) but ‘That’s funny . . . ’”

Isaac Asimov

15 Contents

Abstract 2

Declaration by Author 5

Publications during candidature 6

Publications included in this thesis 7

Contributions by others to the thesis 9

Statement of parts of the thesis submitted to qualify for the award of another degree 9 Acknowledgements ...... 10

Keywords 11

Australian and New Zealand Standard Research Classifications (ANZSRC) 12

Fields of Research (FoR) Classification 13

1 Introduction 29 1.1 Gene expression as a complex trait ...... 29 1.2 Gene expression variability ...... 30 1.3 Genetical ...... 32 1.3.1 Genetic-regulation of gene expression ...... 32 1.3.2 eQTLs ...... 33 1.3.3 Location of regulatory variants ...... 34 1.3.4 Distal regulatory variants ...... 34 1.4 Non-additive genetic variation ...... 35 1.4.1 Dominance ...... 35 1.4.2 Epistasis ...... 35

16 1.4.3 Gene-by-environment interactions ...... 36 1.5 Environmental regulation ...... 39 1.5.1 Differential gene expression ...... 39 1.5.2 Classifying environmental conditions ...... 39 1.5.3 Epigenetic variation ...... 39 1.5.4 Batch Effects ...... 40 1.5.5 Principal Components ...... 41 1.6 Cell type and tissue specificity ...... 42 1.7 Omics ...... 43 1.7.1 Proteomics ...... 44 1.7.2 Metabolomics ...... 45 1.8 Systems Genetics ...... 46 1.8.1 Dimension reduction ...... 47 1.8.2 Coexpression ...... 47 1.9 Network analysis ...... 48 1.10 Clinical applications of gene expression ...... 49

2 Genetic and non-genetic variation revealed for the principal components of human gene expression 50 2.1 Abstract ...... 51 2.2 Introduction ...... 51 2.3 Materials and Methods ...... 52 2.3.1 Samples ...... 52 2.3.2 Replication sample ...... 53 2.3.3 Normalization and batch effect correction ...... 53 2.3.4 Principal Components ...... 55 2.3.5 Association with Batch effects ...... 57 2.3.6 Estimation of ...... 58 2.3.7 Genome-wide association study on PCs ...... 59 2.3.8 eQTL analysis ...... 59 2.3.9 Biological Pathway Analysis ...... 60 2.4 Results ...... 60 2.4.1 Non-genetic contributions to PCs ...... 60 2.4.2 Genetic contribution to PCs ...... 65

17 2.4.3 Biological pathway analysis: ...... 71 2.5 Discussion ...... 73

3 Seasonal effects on gene expression 76 3.1 Abstract ...... 77 3.2 Introduction ...... 77 3.3 Materials and Methods ...... 78 3.3.1 Samples ...... 78 3.3.2 Genotyping ...... 78 3.3.3 Gene expression ...... 78 3.3.4 Cell counts ...... 79 3.3.5 Normalization ...... 79 3.3.6 Cell count correction ...... 80 3.3.7 Conversion to time series ...... 80 3.3.8 Seasonal decomposition ...... 81 3.3.9 Autocorrelation ...... 81 3.3.10 Cosinor regression ...... 81 3.3.11 Measured weather variables ...... 82 3.3.12 Biological enrichment analysis ...... 82 3.3.13 Blood cell specific markers ...... 83 3.4 Results ...... 83 3.4.1 Decomposition of time series data ...... 83 3.4.2 Effect of season on gene expression ...... 84 3.4.3 Cell count seasonality ...... 88 3.4.4 Environmental variables ...... 90 3.4.5 Enrichment analysis ...... 91 3.4.6 Cell-specific mRNA markers ...... 93 3.4.7 Discussion ...... 94 3.4.8 Conclusion ...... 96

4 Common genetic control drives the co-expression of mRNA transcripts 97 4.1 Abstract ...... 97 4.2 Introduction ...... 98 4.3 Materials and Methods ...... 101

18 4.3.1 Brisbane Systems Genetics Study ...... 101 4.3.2 Normalization ...... 102 4.3.3 Multiple Tissue Human Expression Resource ...... 103 4.3.4 Modules ...... 103 4.3.5 Heritability of gene expression ...... 104 4.3.6 Phenotypic Correlations ...... 104 4.3.7 Genetic Correlations ...... 104 4.3.8 Proportion of phenotypic correlation explained by genetic effects ...... 105 4.3.9 eQTL analysis ...... 106 4.3.10 Transcription factor enrichment ...... 107 4.4 Results ...... 107 4.4.1 Defining the presence of modules using phenotypic correlations ...... 107 4.4.2 Genetic contribution ...... 116 4.4.3 Identifying modules across tissues ...... 117 4.4.4 eQTL analysis ...... 119 4.4.5 Transcription factors ...... 124 4.5 Discussion ...... 127

5 Discussion and Conclusion 134 5.1 Transcriptomic variability ...... 134 5.1.1 Gene expression as a quantitative trait ...... 134 5.1.2 Biological impact of common SNPs ...... 135 5.1.3 Additive and non-additive genetic variability ...... 135 5.1.4 Non-genetic variability ...... 136 5.2 Technical variability ...... 137 5.2.1 Batch effects ...... 138 5.2.2 Principal Component correction ...... 138 5.2.3 Environmental variability ...... 141 5.2.3.1 Seasonal variability ...... 141 5.2.3.2 Enrichment ...... 142 5.2.3.3 Causality ...... 142 5.2.3.4 Impact of seasonal effects ...... 142 5.2.4 Physiological variability ...... 143 5.2.4.1 Cell counts ...... 143

19 5.2.4.1.1 Seasonal effects ...... 143 5.2.4.1.2 Cellular heterogeneity ...... 144 5.2.4.2 Tissue specificity ...... 145 5.2.4.2.1 eQTLs ...... 145 5.2.4.2.2 BIT ...... 146 5.3 Genetic covariance ...... 146 5.3.1 BITs ...... 147 5.3.2 Phenotypic correlations ...... 147 5.3.3 Genetic contribution to phenotypic correlations ...... 148 5.3.4 Shared eQTLs ...... 148 5.3.5 Transcription Factors ...... 149 5.3.6 GWAS enrichment ...... 149 5.4 Conclusion ...... 151

Bibliography 151

20 List of Figures

2.1 Principal Variance Components Analysis (PVCA) of gene expression datasets (A)-(D). PVCA calculates the proportion of variance in the entire dataset that is at- tributed to certain batch, sex and age effects. Altogether batch, sex and age effects account for 23% of the variance in the non-corrected dataset (A) and are fully removed in the corrected dataset (B). Batch effects present in PC corrected dataset are repre- sented in (C) PC25 and (D) PC50. The residuals represent the remaining variance in the dataset not attributed to these batch effects...... 61 2.2 Variance explained by PCs. Calculated from the eigenvalues obtained from the Singular Value Decomposition. Variance explained by each PC is plotted in black and cumulative variance in blue. A) The normalized dataset, B) Corrected with linear models, C) PC25 corrected D) PC50 corrected. All variances sum to 1...... 62 2.3 Batch effect associations to PCs. Significant values (after Bonferroni correction p 0.05 2 < 335 ) are shown in red. (A) R from a regression analysis of each PC on batch, sex and age effects in a dataset that is not corrected for any of those factors. (B) R2 from a regression analysis of each PC on the batch, sex and age effects in a dataset that is corrected for these factors. There is no significant association to batch effects present. (C) Association to batch effects in a dataset corrected for the first 25 PCs. (D) Association for batch effects in dataset corrected for the first 50 PCs...... 64 2.4 Narrow-sense heritability estimates of each PC obtained from QTDT. These re- sults indicate that nearly all PCs hold genetic information...... 65

21 2.5 Correlation between additive genetic and common environmental factors.Significant association (p = 8.57e-08 and R2=0.08) between the common environment and ge- netic components estimated in an AC and AE models. The proportion of common environment variance was calculated by dividing the variance attributed to common environment (Vc) by the total phenotypic variance (Vp). The proportion of genetic variability (heritability) was calculated by dividing the additive genetic component (Vg) by the total phenotypic variance (Vp). This result indicates that the heritability estimates obtained are confounded with common environment variance and, there- fore, inflated upwards by common family effects...... 66 2.6 Distribution of heritability estimates for 9,086 probes from four different correc- tion methods.Black - standard normalized, Red - standard normalization with linear model correction for batch effects, Blue - corrected for batch effects using the first 25 PCs, Purple - corrected for batch effects using the first 50 PCs. There is a drop in heritability using the correction methods; however, this is more pronounced in the PC corrected dataset. The heritability estimates are constrained to zero due to the nature of genetic variance component estimating in QTDT...... 67 2.7 Selection of probes driving PCs. 2A) Maximum (blue) and minimum (black) eigen- vector values for each PC. The eigenvector values represent the extent of the correla- tion between a probe and a PC, with 0 indicating no association. Selection of probes driving each PC is based on the optimal number of probes for each PC that have the same eigenvector value cut-off. As multiple probes can contribute a small amount of variance to each PC it is reasonable that a low cut off value can pick up many of the significant probes driving each PC. Probes that have an eigenvector value of greater than 0.02 or less than -0.02 were selected for further biological enrichment analysis as this incorporated all the maximally and minimally expressed probes in this section. (2B) Selection of probes at this cut-off value enabled approximately similar numbers of probes to be selected for each PC. The slightly lower number of probes that are selected in the initial PCs is due to the lower maximum and minimum eigenvector values in these PCs as shown in (2A)...... 69

3.1 Time series decomposition for TRIM23 (ILMN 1752741) using loess decomposi- tion.Original = The raw time-series data for the probe. Seasonal = The regular cyclic component. Trend = The linear drift over time. Remainder = The irregular (error) component that is not explained by the seasonal and trend components...... 84

22 3.2 Manhattan plot of the cosinor seasonal analysis. The -log10(p) of each cosinor regression model is plotted against the chromosomal location of each probe. Bonfer- roni correction significance line is added. A) Not corrected for cell count B) Corrected for cell count. Includes autosomal chromosomes 1-22, X(23), Y(24) and Mitochon- drial(25)...... 86 3.3 Histograms showing the distribution of the variance explained by the model.R2 Blue denotes probes with a statistically significant association of gene expression levels and cyclic variation A) Not corrected for cell count B) Corrected for cell count. 87 3.4 Seasonal variation in cell count. Five cells types that demonstrate significant associ- 0.05 ation to seasonal variation (p < 10 , Bonferroni correction). The black lines represent the fitted values in a cosinor regression. The red lines represent the actual cell values. From these figures, it is evident that the cells follow complex repeating patterns of peaks and troughs throughout the year. However, it can be observed that they show a consistent seasonal trend following one clear peak and trough per year. These val- ues were collected over a three-year period and are plotted in sequential order. The year of sample collection is labeled in the axis as a number (5-8) after the month, corresponding to 2005, 2006, 2007 and 2008 respectively...... 89 3.5 Standardized monthly values for 12 weather conditions. Measured weather vari- ables that exhibit seasonal variation in Brisbane (black dots and connecting lines). The red dots represent the cosine curve with a 12-month repeating cycle...... 90 3.6 KEGG enriched pathway for Antigen Processing and Presentation. Significant seasonal genes in our study are highlighted with red stars...... 92

23 4.1 These graphs are a visual representation of the correlation values between mod- ule probes in the BSGS dataset. Panels A, B, and C are heat maps for the correlation

matrices for A) phenotypic (rP), B) genetic (rA) and C) environmental (rE) value be- tween all pairs of the BIT probes. The BITs are ordered from module one to nine. Red values represent high correlation values (˜1) and blue values represent low correlation values (˜-1). In all of the heat map plots below, the intra-module correlation values are all highly positive, demonstrating their unique correlational structure captured by the BITs. There are many low to moderate intra-module correlations, showing similarity between the modules. These values, however, are not pertinent to the module iden- tity, for reasons described in section 4.4.1. Panel D represents the values from panels A, B, and C as boxplots. Modules are ordered on the x-axis with ”R” representing the empirical null values for the modules, collected from 10,000 randomly sampled transcripts. From D) this is evident that the correlation values in all the modules are

higher than what would be expected by chance. This graph shows that rA values are

the highest for all modules, while the rP and rE approximately the same...... 113 4.2 Stacked histogram comparisons across all modules and tissues. (A) Heritabil- ity (B) Phenotypic correlations, (C) Genetic correlations, (D) Environmental corre-

lations, (E) Proportion of phenotypic correlation (p ˆA) attributed to genetics and (F) Modules that are present. The y-axis of the graph is scaled to 1 so that the contribu- tion of each tissue can be easily visualized. The actual correlation values are labeled on the graph. In (F) 1 means the presence of a module based on the empirical null value on phenotypic correlations (all modules are significant when examining genetic correlations). The parts of the graph that are hashed out represent modules that are not significant using the modules minimum method for that particular measure. In (F) the hashed out values are the modules that are not considered to be present overall - based module minimum genetic correlations...... 115 4.3 eSNP association between BITs. eSNP associations in Module two (A) and Module seven (B) Genes are represented as blue boxes (gene labels above). Black and Purple lines represent two intances of shared eSNPs between BITs in Module two ...... 122

24 4.4 Chromosome location and LD map of the SNPs significantly associated with two probes in module two. The -log10 (fdr) of the association between two SNPs is plotted as dots (y-axes on the left) with the relevant effect size (r2) colored as per the legend in the top right hand corner. The recombination of that section of the chromo- some is represented by the blue line (y-axes on the right). The purple diamond is the selected SNP for which from which the association analysis statistics are performed (r2 and the FDR values). All SNPs are in moderate LD with each other. There appear to be two groups within this set that are in higher LD with each other...... 123

25 List of Tables

1 Contribution to authorship for Chapter One: A. Goldinger, A. K. Henders, A. F. McRae, N. G. Martin, G. Gibson, G. W. Montgomery, P. M. Visscher, and J. E. Pow- ell, “Genetic and nongenetic variation revealed for the principal components of human gene expression.,” Genetics, vol. 195, no. 3, pp. 1117-28, Nov. 2013...... 7 2 Contribution to authorship for Chapter Two: A. Goldinger, K. Shakhbazov, A. K. Henders, A. F. McRae, G. W. Montgomery, and J. E. Powell, “Seasonal effects on gene expression.,” PLoS One, vol. 10, no. 5, p. e0126995, Jan. 2015...... 8

2.1 eQTL results for all four datasets (a-d) in two cohorts. Local regions were defined as +/- 1MB either side of the TSS and distal as elsewhere in the genome. The num- bers of probes with local or distal associations significant at various study-wide FDR thresholds is provided for each of the four correction methods. Study replicated in CHWBD ...... 68 2.2 Results from the GWAS for each PC. Probes that were found to be significant after 0.05 a Bonferroni correction (α = 488,462 ) on each PC are listed in this table. Though none of these are significant after correcting for all PCs, they are significant at an empirical p-value of 0.05 for each PC after 1000 permutations. PC - principal component, CHR - chromosome, SNP - SNP ID, BP - base pair, BETA - regression coefficient, STAT - Coefficient T-statistic, P - Asymptotic p-value for t-statistic, EMP - empirical p-value after 1000 permutations, G LOC - the location (in relation to the gene) of the SNP, GENE - genes nearby/containing the SNP, GENE DESC - description of the gene name. 70 2.3 Pathway analysis for the first 50 PCs. Pathway analysis for PC1-50 was performed using DAVID Bioinformatics Resources 6.7, Functional Annotation Tool. PC = Prin- cipal Component, Term = name of KEGG pathway, Count = count of probes in each hit, % = percentage of all submitted probes that are present within the pathway, P= the p-value calculated using a modified Fisher’s exact test for enrichment, FDR = correction of p-values using the Benjamini-Hochberg FDR method...... 72

26 3.1 Significant probes for the 12 month cosinor regression. Results for 12-month cosi- nor model and autocorrelation on two datasets: not corrected (NC) and corrected (CC) for cell count. Probes = number of significant probes for the 12-month cosinor model. E(R2) and Std(R2) = statistics for the R2 values obtained from probes signif- icant for the 12-month cosinor model. Autocorrelation = The number of probes with significant autocorrelation...... 85 3.2 Table C2C2: Biological enrichment for 12-month cosinor model. Dataset = Datasets corrected (CC) and not-corrected (NC) for cell count. General process = Overall bi- ological function , Terms= GO terms or pathways, Significance = Determined with a Cluster Enrichment Score (CER) or FDR corrected p-values ...... 93 3.3 Gene list enrichment analysis for blood cells. . Enrichment for blood cell signature was found using the userListEnrichment function in the WGCNA R package. . . . . 94

4.1 Table of BIT transcript information. Gene name, Illumina ID, starting base pair (BP) and chromosome (CHR) of the BIT probes ...... 109 4.2 Summary of values for each module in Blood in the BSGS cohort.Values are av- eraged across all transcripts within a module. Information is provided for BIT her- 2 itability (h ), Phenotypic correlations (rP) , Genetic correlations (rA)) and Shared genetic associations. Values provided are the mean or expected value (E), standard error (SE) and p-value (p-val). The column ”Number of probes” shows the number of BIT probes that were available for analysis in the BSBS cohort after Quality Control (QC). The column ”Number of probes with significant h2” shows how many BITs in the BSGS cohort had significant h2 estimates. The column ”eSNPs” lists all the SNPs that were significantly (FDR < 0.05) associated with BIT in this module. The col- umn ”eQTLs” contains the number of independent regions (groups of eSNPs in high LD with each other) that were associated with the BIT in this module. For example, in module nine eSNPs made up one eQTL associated with two transcripts and two eSNPs made up another eQTL associated with three transcripts...... 112

27 4.4 SNPs associated with transcripts in modules two and seven. Information on the significantly shared SNPs and their associated transcripts. ”SNP CHR” is the chromo- some of the associated SNP. ”Probe Gene” is the gene tagged by the Illumina probe used in this study. ”Gene CHR” is the chromosome in which the gene, tagged by the Illumina probe, is located. In module two, nine SNPs were associated with two genes. LD between the nine SNPs ranged between 0.3 and 1, indicating that all of these nine SNPs are linked, forming one eQTL. Two additional SNPs in high LD (R2 = 0.9), comprising one eQTL, were associated with three genes in module two. In module seven, two genes were associated with one SNP (one eQTL)...... 121 4.3 Correlation and heritability values for BIT modules. Mean/expected value (E) 2 and standard error (SE) for heritability (h ), phenotypic correlation rP and genetic

correlation rA values for all the BITs in each modules, within each tissue. The genetic

contribuiton to rP is reported (p ˆA). Transcripts that are labelled ”R” are the randomly selected transcripts used to evaluate the empirical null value...... 125 4.5 Module associated GWAS SNPs. Chromosome, genome location, nearby genes, GWAS associated disease/traits and citations for the shared BIT SNPs (as listed in 4.4) 126 4.6 List of BITs that are transcription factors. ”Probe illumina ID” is identifier the BIT transcripts used in this study, ”Probe” is the gene in which the BIT probe is tagging. ”Tissue” is where this transcript/transcription factor is predominantly expressed in the body...... 127 4.7 BITs Transcription factor enrichment. (a) The number transcription factors found amongst the module probes, (b) 1,381 known transcription factors elsewhere [1], (c) the number of probes within modules that are not known transcription factors and (d) The number of genes that are not known transcriptional factors in whole blood. Details of the transcription factors is given in [1]...... 127

28 Chapter 1

Introduction

1.1 Gene expression as a complex trait

Gene expression is a biological process that drives the formation of cellular via recording information within deoxyribonucleic acid (DNA) into proteins, the building blocks of the cell [2]. This flow of biological information from the DNA to the proteins is a highly-regulated multi-step process that involves many different control points, regulatory molecules, and degradation processes. The precise genomic regulation found in eukaryotes has enabled a significant degree of functional complexity to arise. One feature of this is the generation of a vast array of proteins that result from a relatively small number of genes. In humans, approximately 1% of the genome is comprised of protein-coding genes. This percentage consists of 20,000 protein-coding genes [3] producing just over 91,000 proteins [4]. The rest of genome is devoted to genomic regulation through various control mechanisms such as transcription factors (TFs) or non-coding RNAs (ncRNAs). All these factors create a complex regularity network that is required for maintenance of phenotypes [5], homeostasis [6] and cellular rhythms [7–9].

Studies into the genetic regulation of single genes have been performed since 1962 [10]. The de- velopment of genome-wide microarrays and RNA sequencing has facilitated the profiling of thou- sands of transcripts in a high-throughput manner. Data from these arrays has enabled explorative, hypothesis-generating research into transcript abundance and variation [11] and the identification of the molecular mechanisms underlying the response to perturbations in a biological system. Differ- ential expression of genes can arise from both environmental and genetic factors [12], and has been used as a means of understanding the molecular mechanisms that drive phenotypic differences be- tween cellular traits [13, 14], physiological differences [15] and diseases [16–19].

29 Because multiple genetic and environmental factors influence transcript levels in a population, gene expression phenotypes are complex traits. Transcriptional variation across a segregating population

(VP) is driven by several factors (Equation 1.1):

VP = VG +VE +VGxE (1.1)

These components include the portion of the variance that is attributed to genetics (VG), environment

(VE, described in section 1.5) and gene-by-environment (GxE) interactions (VGxE, described in section

1.4.3). Environmental variance can be described as the remainer of (VP − (VG +VGxE) and therefore incorporated any variance that is not directly attributed to genetic variation. The genetic component of this can be further decomposed into additive (VA, described in section 1.3) and non-additive ge- netic components, which include dominance (VD, described in section 1.4.1) and interaction effects

(VI), otherwise known as gene-by-gene (GxG) interactions or epistasis (described in section 1.4.2) (Equation 1.2).

VG = VA +VD +VI (1.2)

Additive genetic variation is the most interesting part of genetic variation to study as it drives the largest portion of the variance [20] in complex traits and is the genetic variation inherited between generations. The partitioning of additive genetic variance from total phenotypic variance can be achieved using genetic similarities between related [21] and unrelated individuals [22, 23] (Equation 1.1).

In this thesis, we cover three aspects of transcriptional variation. Seasonal variation as a macro- environmental and time-series component (Chapter 3), the genetic variance driving gene expression coexpression (Chapter 4) and technical artifacts (Chapter 2), which are potential confounding factors.

1.2 Gene expression variability

Gene expression variability is one of the fundamental dynamics that drives differences in phenotypic traits among individuals [24]. It occurs even in model that have identical and shared the same environment [25]. There is evidence that innate entropy in gene expression is an

30 essential feature in biological systems [26]. It is also possible significant portion of variation in gene expression could be the result of technical variation arising during the generation of the data [27] as demonstrated in Chapter 2.

Robustness is a universal property of biological systems that enables functionality and homeostasis in spite of internal (e.g. genetic ) and external (environmental) perturbations [28]. One mech- anism that contributes to the robustness of biological systems is the multiple control pathways and redundancies found in cellular gene networks [28]. Robustness is important not only as a buffering mechanism against otherwise harmful/lethal mutations and conditions but is also critical to evolving populations [29]. By neutralizing the effects of non-critical mutations, scores of cryptic mutations can accumulate across a population. These accumulated mutations could facilitate rapid evolution- ary responses to drastic environmental changes [30]. This buffering effect could contribute to the incomplete penetrance observed in certain diseases [31].

Canalization is the process of buffering genetic and environmental perturbations to produce normal phenotypes [32]. Canalization is thought to contribute to the complexity of certain diseases and has been suggested as a mechanism that drives the pleiotropy between genetic polymorphisms found in GWAS for various psychiatric and neurological disorders [33]. The disruption of the trajectory of normal neuronal development (through the expression of genes) outside normal developmental variability, is hypothesized to contribute to a wide a range of clinical disorders can arise depending on the where in the developmental trajectory the derailment occurred and the severity [34–36].

The variance of transcripts can be measured in terms of Coefficient of Variation (CV) (Equation 1.3)

σ CV = (1.3) µ

.

The CV measures of transcripts were able to differentiate between individuals with and Parkinson’s disease [37]. Albeit this study has a small sample size it appears that schizophrenia was associated with decreases in CV and, therefore, increase the robustness of the system whereas responses to environmental perturbations are constrained. Robustness in a biological system is es- sential to maintain homeostasis and in the presence of disruptive environmental forces and mutations. However, too much constraint and the system no longer has the plasticity to adapt to new stressors. The variance was lower in patients with Parkinson’s disease when compared to schizophrenic patients indicates a derailment of normal compensatory mechanisms that maintain phenotypic integrity [37].

31 This work demonstrates that transcript variability plays a key role in defining genetically controlled responses to environmental conditions.

The robustness of a network of transcripts appears to increase with the importance of the pathway to cellular function [37]. Core biological processes, such as signal transduction pathways seem to have lower variance and higher connectivity since variations in this could disrupt the homeostasis of a cell. Genes at the periphery of the signaling pathway (the receptors) demonstrate higher variance and less connectivity as these receptors are often required to respond to a range of biological signals [37]. Evidence from studies on gene regulatory networks in starfish has demonstrated that variability and connectivity change across larval development, with greater robustness in early development [30]. The robustness of the system appears to be in part maintained by molecular systems induced in the presence of . Heat shock proteins present a biological mechanism that is key to maintaining cellular function in response to stress. These proteins provide phenotypic stability in response to heat despite varying degrees of susceptibilities that can naturally arise from genetic variation. Inhibition of heat shock proteins increased the correlation between the genotype and the phenotype in response to heat stress [38].

The dissection of the variance of gene expression has yielded valuable insights into the genetic ar- chitecture of complex traits. Transcriptional variation appears to be an intrinsic feature essential for maintaining phenotypes and contributing to the development of diseases. The incorporation of dif- ferent facets of gene expression during development, over time, in response to specific environmental stimuli and different tissues and cell lines will be imperative to furthering our understanding of the intricate molecular mechanisms that control biological systems [39].

1.3 Genetical genomics

1.3.1 Genetic-regulation of gene expression

Genetic variation driving gene expression has been well established and reported in the literature, with the majority of transcribed genes demonstrating additive genetic variation [2, 40–46], with 82% of array-measured messenger RNA (mRNA) transcripts exhibiting h2>0 [44], 42% of transcripts having additive genetic variance h2>0.3 and ˜5% with h2 > 0.5 [43].

Although proteins are the building blocks of cellular function, protein-coding genes account for

32 around 1-2% [47]. It has been suggested that the remaining ˜98% of the genome is involved in regulatory control mechanisms for gene translation and transcription [48–50]. This concept has been supported by research which estimates that up to three-quarters of the genome is transcribed [51] and by recent discoveries by Encyclopedia of DNA Elements (ENCODE) that 80.4% of the genome has functional regulatory elements present [52].

The regulation of transcription or other biological mechanisms is likely to be the role played by the majority of GWAS Single Nucleotide Polymorphism (SNP) that have been found in gene deserts and intergenic regions [53] HINDORFF et al. 2009). The regulatory role of GWAS SNPs indicates that the driving factors for complex diseases could be predominated by changes in gene regulation instead of protein function. This hypothesis is supported by the finding that a major proportion of significant SNPs from GWAS are located in regions enriched with functional elements [52]. Whole genome methods [54] have found that the largest portion of heritability was from variants located in putative regulatory regions [55].

The large number of GWAS SNPs that are located within regulatory regions of the genome indicates that genomic variants that affect gene expression are the primary processes driving complex traits which are in contrast to Mendelian disorders, which are affected predominantly by protein coding changes [39].

1.3.2 eQTLs eQTL mapping, has been remarkably successful in identifying genetic polymorphisms that regulate gene expression [2, 42, 56–59]. The vast majority of the genetic variation driving transcriptomic expression is additive, with one study showing that 21% of probes with a detectable eQTLs explained up to 80% of the additive genetic variance [44].

Gene expression is often classified as an intermediate trait as it is the next step in which genetic information is turned into the cellular building blocks [2, 60]. Gene expression has therefore been useful in the functional characterization of GWAS hits. Many GWAS SNPs have been identified as eQTLs, demonstrating biological functionality. This both verifies the GWAS association and provides the first molecular step to understanding its mechanism of action [46, 61–64].

33 1.3.3 Location of regulatory variants

The position of a eQTL to its target gene can reveal valuable information about its impact on tran- scriptional regulation. Loci that occur within one megabase from the start or the end of the gene it is associated with are termed local eQTLs while distant eQTLs are loci occurring outside this region and on other chromosomes. Local eQTLs are the most common form of regulatory variant identified in genetical genomic studies so far with up to 80% of expression genes in whole blood having a lo- cal eQTL [65]. A number of these eQTLs were GWAS candidate loci indicating a large impact on downstream phenotypic effects [57].

Local regulatory variants can act in both cis and trans. Cis-acting eQTLs are alleles that are located proximal to the transcript. They can influence a gene in an allele-specific manner, for example by affecting the binding of RNA polymerase and transcription factors, the expression of a gene on that particular transcript [66]. Trans-acting eQTLs, on the other hand, affect both alleles simultaneously through a diffusible factor. Therefore, while local regulatory variants are assumed to act primarily in cis, it is possible that a local eQTL can work in trans by affecting a neighboring gene or via a feedback loop [67,68]. Distal eQTLs therefore exert their allelic effects in trans. This feedback loop can occur by either a in a transcription factor that modifies how it regulates its target genes or by a mutation in the Transcription Start Site (TSS) of a transcription factor that up or downregulates its expression [68].

1.3.4 Distal regulatory variants

Distal eQTLs are trans-acting exerting their allelic effects through intermediate diffusible factors and can influence some genes. Therefore, distant eQTLs have the potential to act as master regulators and control the expression of networks of genes [69]. An example of this is with a GWAS locus associated with type 2 and high-density lipoprotein. This locus cis-regulated the transcription factor KLF14 which was a master-regulator of adipose gene expression [70].

Distant eQTLs are not detected in eQTL studies as frequently as local eQTLs, due mainly to smaller effect sizes [71]. Furthermore, since distant eQTLs are located throughout the genome, their detec- tion is often limited by the increased burden of multiple testing corrections. Adding to the difficultly in their identification is the observation that distant eQTLs are also more variable, showing greater context dependence [72, 73], tissue specificity [74] and are more likely affected by confounders [75].

34 As sample sizes increase, more distant eQTLs are being discovered [45, 57, 76–79] with a number of variants demonstrating large context specificity [76, 80]. Therefore, it is possible that while local eQTLs appear to have direct genetic regulatory effects on complex traits, distal polymorphisms play a more cryptic role of genetically regulating transcript variability in response to environmental expo- sures and, therefore, sit at the interface between heritable genetic and environmental contribution to gene expression variability.

1.4 Non-additive genetic variation

1.4.1 Dominance

Non-additive variation, consisting of dominance and overdominance, were replicated for a handful of transcripts [44]. While having a significant impact on variance on these particular transcripts, dominance effects had minimal impact on total transcriptomic variation [44]. Another way in which the effects of strongly dominant loci can be ablated and even appear additive is when the recessive gene is at a high frequency in a population [20].

Non-additive effects are harder to detect in association studies than additive genetic variation. SNPs in GWAS studies are used as markers for causal variant in GWAS studies. These markers SNPs will often explain less additive variance than the actual causal SNP. The discrepancy between markers and causal SNPs is linearly dependent on the amount of Linkage Disequilibrium (LD) between the two variants. Therefore GWAS relies on enough coverage of the genome to ensure high LD between the genes. For dominance effects, this decrease in explained variance between markers and causal variants is much higher, reducing at a factor of r4 [81]. Therefore in normal coverage, SNP arrays provide a greater coverage of additive instead of non-additive genetic effects.

1.4.2 Epistasis

Genotype-dependent interaction effects, such as GxG, cause the effect size of a variant to be larger than it’s additive components between other genetic loci or environmental conditions. The interaction variances can be described statistically as the combination of two terms not explained by their inde- pendent, additive effects. GxG or epistasis have been of interest in the study of complex traits [82]. Some papers have suggested that redundancy in biological pathways lead to overestimations of ad-

35 ditive variation, thereby giving further weight to epistatic interaction to explaining the remaining variance [83]. Common family environmental effects appear to drive these pathway estimations [84]. Common family environmental effects are environmental effects shared between family members which can appear as additive genetic variation.

Epistatic interactions are tough to identify in human genomic data. Epistatic interactions, like domi- nance effects have a greater loss in explained variance with lower LD between the marker and causal SNPs, reducing by a factor ranging between r4 for additive by additive (AxA) variance, r6 for additive by dominant (AxD) variance and r8 for dominant by dominant (DxD) interactions [81]. Therefore imputed datasets or array with higher SNP densities are required to detect epistatic effects. Another reason for the difficulty in detection the curse of dimensionality, where extremely stringent signifi- cance thresholds, due to the sheer volume of required tests, leads extremely low signal-to-noise ratios in detecting these associations. Therefore large sample sizes are needed to gain sufficient power to detect associations. Other methods to increase power to detect GxG interactions focus on dimension- reduction techniques or filtering out regions of interest [85] or variance eQTLS [86]

Epistatic interactions have been estimated to have little impact of total variability on gene expres- sion [44, 85]. In [87] up to 16% of the variance in their datasets was expected to be driven by GxG interaction effects. The authors extrapolated this results from the total SNP variance obtained from variance eQTLs (eQTLs associated with gene expression variability) that had replicable epistatic ef- fects. Some of these epistatic loci explained as much as 4.3% of phenotypic variance [87]. Altogether, however, few studies have provided evidence of a substantial impact of epistatic variance and loci on the total phenotypic variance of gene expression [44, 85] and other complex traits [81].

1.4.3 Gene-by-environment interactions

GxE interactions, otherwise denoted as genotype-environment interactions (GEIs), are a type genotype- dependent interaction where responses to the environment differ between genomic regions. These ge- netically regulated responses to the environmental could underlie the development of latent diseases and conditions [37]. Interactions between smoking and genetic risk factors for rheumatoid arthritis illustrate GEIs as factors driving disawfrasAWE3Rease susceptibility [88].

There is also evidence that GEIs drive the variance of many complex traits [89]. GEIs are a suspected contributor to the missing heritability found between GWAS loci and common traits [90].

36 Advances in high throughput genetic techniques have prompt an greater number of published stud- ies examining GEIs since 2000 [91]. Numerous studies have identified genomic loci and transcripts that exhibit GEIs for various phenotypes including psychological conditions [92–97], nicotine de- pendence [88, 98], [99–101], immune response [72, 102, 103] and drugs [104]. Genomic loci exhibiting GEIs are often referred to as context-specific Quantitative Trait Loci (QTLs) or, in the case of genetical genomics, eQTLs [79]

The environmental conditions in humans are much harder to control than for model organisms. There- fore when studying GEIs in humans, there is the risk of unmeasured and unforeseen confounders. The presence of these confounders, in addition to improper model fitting [105] in GEIs studies could ex- plain the lack of replication of many context-specific GEIs loci [97]. In order to control environmental conditions more precisely, the use of human cell lines instead [87, 106] or randomized control trails (RCTs) [107] will facilitate more precise environment control GEIs [90]. Genetical genomic studies have identified numerous GEIs loci and several notable features of these loci. The use of model or- ganisms enables the precise environmental control required to characterize the genomic properties of GEIs. In S. cerevisiae more than 5% of the genome had readily detectable GEIs [108]. Additional studies in yeast calculated that GEIs drove around 9% of total transcript variance [109]. Studies both yeast and C. elegans showed that it was predominately distal loci that regulated transcripts with GEIs effects [109,110]. In human epithelial cells transcripts in response to proinflammatory oxidized phospholipids had significant GEIs effects. These epithelial transcripts were also enriched for distal eQTL [106].

Extensive distal regulation of transcripts with GEIs indicates that theses gene may be a part of larger networks or modules. In fact, in C. elegans, a large portion of transcripts exhibiting GEIs fall within functional networks [111]. This enrichment of GEIs in pathways suggests that the effects of GEIs could propagate across a whole network of genes. The network effects of GEI loci may not be easy to detected in systemic studies due to stringent significance thresholds. Network analysis of tran- scripts may enable the detection of GEI regulatory effects. In addition to extensive distal regulation, transcripts with GEIs effects are often non-critical genes [108]. Critical genes, such as housekeeping genes of have shallow regulatory networks and are tightly constrained in their expression levels [89] and are unlikely have many distal regulatory mechanisms. The protection of critical genes could be one mechanism in which robust biological systems buffer against deleterious genetic and environ- mental effects on the cell.

One means of distal regulation is from the regulation of TFs. TFs are proteins which regulate the

37 expression levels of their downstream gene targets. In yeast, the increase in regulatory TFs in a set of transcripts was positivity correlated with gene expression variation in these transcripts [112]. The binding of some TFs is also environment-dependent [113].

In Chapter 4, we characterized nine modules of highly phenotypically correlated genes. These mod- ules showed many characteristics of GEIs effects. Enrichment of TFs and regulation by distal eQTLs indicated network effects across each of the modules extending to other transcripts in the transcrip- tome. The transcriptional responses in each of these modules showed marked responses between indi- viduals to various environmental responses [114, 115]. Another feature shared between the modules and identified transcripts with GEIs is tissue-specificity. In a study of transcripts exhibiting allele- specific expression (ASE) in twins, transcripts that had GEIs were tissue-specific, with GEI effects only found in Lymphoblastoid Cell Line (LCL) and fat samples but not in skin and blood [116].

In usual genetical genomics studies, GEIs may not be detectable in association studies due to the wide variety of environmental effects and genetic backgrounds that are present in population samples. Therefore the study design of genetical genomics and GWAS make it difficult to determine the impact of GEIs in human gene expression variability. One method to identify loci with interaction effects is to prioritize analysis on variance eQTLs. Variance eQTLs exhibit interaction effects that have that largest impact on gene expression transcripts [86]. In a study of 765 twins, 70% of 508 variance- eQTLs had significant GEIs effects [87]. This study demonstrated that a considerable number of loci in the had GEIs [87].

In a study of 8,013 ASE sites in blood, 38-49% of the total variance was attributed to genotype dependent interaction effects such as epistasis (GxG) and GEIs [116]. Since variance estimates for epistatic effects are low [85], these results could be primarily driven by GEIs. Despite observing just a subset of the genome, these estimates of GEI variability on transcripts with ASE demonstrates the potentially large impact that GEIs have on transcript variability.

Context-specific genetic regulation is another level of complexity that regulates transcriptomic re- sponses to environmental perturbations. The prevalence of GEIs in genetical genomic studies high- lights the importance of cellular environment when dissecting the genetic architecture of the transcrip- tome. A comprehensive catalog of GEIs across the genome will enable the prediction of phenotypic effects from genomic information alone.

38 1.5 Environmental regulation

1.5.1 Differential gene expression

Differential gene expression can be used to identify molecular mechanisms that influence response to environmental conditions [73, 117–121]. In addition to GxE interactions, there is evidence that transcriptomic responses to environmental conditions are genetically controlled, through direct inter- actions between the genotype and environment [72] but also through the genetic control of stochas- ticity [122–125]. Indeed, it has been proposed that many disease risk alleles primarily exert their effects under specific environmental conditions. By only being present under certain environmental conditions, these alleles will be, for the most part, invisible in a population and, consequently, avoid negative selection pressure [126].

1.5.2 Classifying environmental conditions

Environmental conditions can be classified as either macro and micro environmental factors. Micro- environmental factors are conditions that are unique to individuals such as diet, exercise, exposure to drugs and [125]. Over a large enough sample size, the effects of microenvironmental effects on a population scale are usually negated. Environmental factors that affect the majority of individ- uals in a population and can be measured (e.g. temperature) are classified as macro-environmental effects [127]. Variation between the months, seasonality, which can encompass a range of macro- environmental factors, shows transcriptomic signatures for DNA repair and binding and cell count fluctuations in erythrocytes, platelets, neutrophils, monocytes and CD19 cells. The impact of season- ality as a macro-environmental factor on the transcriptome is discussed in chapter 3.

1.5.3 Epigenetic variation

Epigenetic DNA modifications are heritable changes in gene function that are independent of direct changes to the DNA sequence [128]. Since these changes have a heritable impact on phenotype, methylation has been suggested an important non-genetic factor contributing to heritable phenotypic variability [129].

The most prevalent form DNA methylation occurs on cytosine located at CpG sites, which are a linear

39 pattern of cytosines followed by a guanine. Adenine nucleotides are also methylated in several species but, for mammals, it has only been observed in mouse embryonic cells [130, 131]. Several enzymes are involved in the DNA methylation process, with the two main types being DNA methyltransferases, involved in adding the methyl group and methyl-CpG binding proteins, which interpret the methyla- tion markers [132]. 70% of human promoters have CpG islands [133] and 40% of all promoters occur nearby CpG islands [134]. Methylation of cytosines has been associated with suppressed [135] or silenced gene expression [136]. In mammals, the majority of cysteines are unmethylated [137]. The high presence of CpG islands in gene promoters and the considerable amount of unmethylated sites in highly expressed genes indicates that they are often present in transcriptionally active sites [132]. Methylation patterns appear to be closely related to gene expression. Across the genome, there is significant genetic covariance between gene expression and methylation patterns [138]. Methylation has also demonstrated tissue-specific effects [139], which is also prevalent in the transcriptome [41].

There are two methods in methylation patterns is inherited. One of these means involves varia- tion in the genome whereby specific loci regulated the amount of methylation that occurs in the genome [140]. The use of genome-wise bisulphite sequencing has enabled methylation to be studied as a quantitative trait. Studies have also revealed that epigenetic changes are genetically regulated. Numerous CpG site across the genome have been identified to be highly heritable and have strong local regulatory elements [141].

The second type of epigenetic inheritance is when there is incomplete erasure of the epigenetic DNA markers during the development of the gamete. This lead to trans-generational methylation patterns [142]. This mechanism could represent an interface between genetic and environmental influences [143] since methylation patterns can be affected by environmental effects [142,144,145]. The effect of environment on the epigenome has been quantified in monozygotic twins, whereby initially identical epigenomic profiles can develop larger differences over time [146]. Epigenetic inheritance can enable transcriptional changes from environmental exposures to be directly passed down generations [143].

1.5.4 Batch Effects

Gene expression transcripts are sensitive to the cellular environment that surrounds them. Slight dif- ferences in technical or environmental conditions within the laboratory can create synthetic changes in expression levels, called ’batch effects’ [147–150]. Batch effects can be severe confounders in gene expression studies. Technical artifacts are ubiquitous in gene expression datasets even in well-

40 designed experiments generated under ideal circumstances [150]. Batch effects can be the most prominent source of variation, even surpassing biological variation [77, 151]. For example, in the BSGS [42] comprising of 335 unrelated individuals, around 25% of the variance the raw gene expres- sion dataset was attributed to batch effects [152]. There are several sources of batch effects, which include type and age of reagents [153], temperature [154] and ozone levels [155] to name a few.

Appropriate normalization of the datasets is required to minimize between sample variation aris- ing from sample quantity and to remove large batch effects driving distributional changes between genes [156]. Careful measures are needed when collecting and measuring samples to reduce the pres- ence of these artifacts. The reduction in batch effects can be best achieved by carefully recording all information on sample processing and handling [157]. These variables can then be included in statistical models to remove any batch effects [154, 158–160]. Failure to correct for batch effects has led to spurious results in downstream analyses [161, 162].

Without processing records, batch effects can be removed from the data by using PCA or Singular Value Decomposition (SVD) [77, 163–165]. These unsupervised methods can be effective at remov- ing batch effects, particularly when a large number of Principal Components (PCs) are used [152]. However, since the selection of PCs is often arbitrary and there is often a mix of genetic and non- genetic factors in each component, there is a risk of removing heritable genetic variation when using these methods [152].

1.5.5 Principal Components

Recently, the removal of PCs has been used to correct for batch effects in large gene expression datasets [77,163]. Within the field of statistical genetics, this approach has previously been employed to remove signals from genes that saturated biological signals of interest. For example, Alter et al. 2000 found that the first PCs were mainly associated with genes related to steady state growth in the cell [166]. Removal of that PC left the genetic signals associated with cell-cycle progression, which was of interest to the investigators. Removal of PCs to correct for confounding became popular in genomic research after it was demonstrated to be an efficient method to correct for population stratification within SNP data [167]. This approach was based on a strong statistical footing [168] but also required manual inspection of the data to observe clustering of different populations. PC correction was also later employed to correct for stratification in gene expression data [169].

The use of PCs in gene expression datasets to correct for batch effects became prevalent following

41 the use Surrogate Variable Analysis (SVA) to correct for expression heterogeneity by removing PCs [170]. However, SVA requires that the primary variable of interest be adjusted for in the dataset via linear model before PC analysis is performed. However, as batch effects constitute a large proportion of the dataset [77, 158] the removal of PCs directly from the whole dataset was later used as the method to correct for these factors [164, 165]. Some studies removed up to 50 PCs [77, 163], which can constitute a large proportion of the total variance within a dataset. These correction methods were even used when suitable technical effects were recorded [171] that could have been used to correct for batch effects using linear models [154] or Bayesian methods [172]. As PCs can pick up multiple sources of variation, there was a concern that such approaches would remove genetic data [157,173].

The clustering of different groups within principal components has been used not only for correction but also to determine differential gene expression observed between specific experimental conditions [174], populations [175] and tissue types [176]. As the changes in variances constitute not only batch effects but also genetic components driving gene expression variability in the population, the arbitrary removal of PCs could delete regulatory information from the datasets. Since batch effects present such a ubiquitous and pervasive source of variation in transcriptomic datasets, PCs provide an ad-hoc means of removing such confounders when no processing dates have been recorded, and no other statistical technique can be otherwise applied [152].

1.6 Cell type and tissue specificity

Many gene expression studies have been performed on LCL, which are purified B-cells that are im- mortalized with Epstein Bar virus. Immortalized cell provide a continuously replenishable and stable source ribonucleic acid (RNA) and DNA from an individual [177] and also a homogeneous envi- ronment between samples, which can minimize environmental variability. There have been some concerns about the use of these cells in expression studies, as there is evidence that the immortal- ization step causes changes in the transcriptional landscape such as random patterns of monoallelic expression [178]. Variables such as the titers of the Epstein-Barr virus can also have a noticeable effect on expression levels obtained [179]. However results from studies done on LCLs have been concordant [56], comparisons with adipose tissue have found over 50% overlap [46], and comparisons with skin have shown overlap in cis-eQTLs of around 70% [180].

Between different cell types and tissues, there are notable differences in the expression levels and ge- netic regulation of transcripts [40,41,71,74,181,182]. One of the mechanisms driving the differences

42 is differential binding of transcriptions factors between the cells [183]. A study of eQTLs found that between fibroblasts, LCLs, and T-cells, only 6.8% of detectable eQTLs are shared between all three tissues, 9.7% in two tissues and up to 83.4% are cell specific [184]. eQTLs shared between tissues were found to cluster tightly around the TSSs, which corresponds to the finding that there is stronger tissue specificity in enhancer elements found further away from the gene [185]. The eQTLs that are shared between tissues also demonstrated much larger effect sizes [102,184]. A comparison between fat, LCL, and skin cells showed 12.3% overlap between all three tissues, 16.2% shared in two tissues and 71.4% found in one tissue, with skin showing the least amount of unique eQTLs [182]. A study of two different tissues, liver, and adipose, found a much higher percentage (70%) of sharing between these tissues [186] as opposed to what was found among specific blood cells. The larger degree of shared regulation between these tissues could be due to the greater number of diverse cells found in tissues drowning out cell-specific effects. Comparisons between brain and whole blood are largely concordant with the exception of some regulatory regions [187].

Examination of regulatory variants between monocytes and B-cells showed that 21.8% variants were shared between cell types, while 32.2% were B-cell specific and 45.9% were monocyte-specific [102]. This study found a that a large number of cis-eQTLs were cell-specific, with some showing differ- ential allelic effects between cells and 4.6% of cell-specific eQTLs being associated with a different gene in another cell type. These cis-eQTLs also exhibited enrichment for known cell-specific GWAS hits accordingly. A large number of shared eQTLs were in trans which have also been identified in another independent population [41]. Other studies have found the opposite whereas cis-eQTLs are the predominant type of polymorphism shared between tissues [40]. Averaged across the genome, the mean genetic control between LCLs and whole blood was close to zero with only core, house- keeping genes showing higher estimates of positive and negative genetic correlation [40, 181]. These studies demonstrate that while blood can provide a useful proxy for other cell types, caution should be exercised if results are used to infer effects in other tissues.

1.7 Omics

Genetical integrated genetic polymorphisms and gene expression revealing an intricate regu- latory system that controls the transcript levels [188].Biological systems are comprised of many tiers, starting with the encoding of genetic information within the genome.

43 1.7.1 Proteomics

Proteomic studies focus on the global quantification of proteins within the body. Functional proteins are the building blocks of the cell and are more direct determinants of cellular function than transcript levels [189]. Proteomic information can also provide additional information on post-translational modifications, such as acetylation, phosphorylation, and ubiquitination, that impact protein function- ality, cell and time specificity and protein-protein interactions and cannot be so easily detected from the genome and transcriptome alone [190]. Of the 20,000 genes annotated in Swiss-Prot and UniProt, the protein-coding characteristics of many proteins are still unknown [191, 192]. While there have been many studies examining various aspects of the proteome, there have only been a few systematic characterizations of the global proteome [190,193,194]. The low number of global proteome studies are primarily due technological problems in data accuracy, collection, and distribution. The number of proteins detected in a proteomic study using mass-spectrometry is between 16,000-17,000. 70% of detect proteins are expressed ubiquitously across cell types and represent regulatory proteins that have core cellular functions. The number of core proteins is estimated to be around 10,000 [195,196]. These core proteins are enriched for metabolic processes, with over 60% of all metabolic enzymes being present in all tissues [196]. Between tissues, however, proteins can have up to five orders of magnitude changes in abundance. Proteomic studies have even identified transcripts produced by long intergenic non-coding RNA (lincRNAs) [190,197] indicating the utility of the proteome in uncovering downstream functionality of genes.

Correlation between mRNA levels and proteins have yielded conflicting results in model organisms [198]. However, they have been relatively concordance ranging between 0.41-0.55 in different tissues [190].The ratio between protein and transcriptomic abundance between cell lines has been shown to be a more stable measure of the relationship between transcripts and proteins [199]. The differences between transcript and protein levels can reflect the translational speed of a transcript, which remains relatively constant in the cells [200]. Therefore, the ratio between transcripts and proteins can enable the prediction of protein abundance from mRNA levels alone [190].

There have been notable differences found between eQTLs and protein Quantitative Trait Loci (pQTLs) [201]. While some eQTLs are also pQTLs, the majority demonstrated zero to low attenuated effects between the two levels of regulation [202]. Quite a number of pQTLs that only demonstrates an effect on protein levels [203]. One study in LCL cells demonstrated that 90% of ribosome occupancy Quan- titative Trait Loci (rQTLs) were eQTLs, however this was not the case with pQTL. The attenuation

44 between eQTLs and pQTLs appeared to be occuring not at the translational level, since rQTLs had little impact on the pQTLs. Instead, there was evidence that the pQTLs work independently through protein degradation mechanisms, futher supported by the enrichment of pQTLs near acetylation sites and the prominent role of lysine degradation in protein degradation [204].

One of the possible reasons for the lack of concordance could be technical difficulties and differences between the array-based measurement of mRNAs and the isolation and quantification of proteins using mass spectrometry. Another potential problem is that global quantification of mRNA transcripts only represents a steady-state measurement of the expression levels of probes, which are in a constant state of flux between transcription and decay [205]. In fact, a study in LCLs identified 195 RNA Decay Quantitative Trait Loci (rdQTLs) associated with RNA decay rates and also found that 10% of genes had a significant correlation between steady-state levels and variation in decay rates [206]. Finally, transcripts also undergo phenotypic buffering, which a rate-limiting step during translation that leads to the degradation of transcripts [207]. The effect of a phenotypic buffering has been documented for multisubunit proteins, which are rate-limited by the least abundant protein subunit. Transcript levels for homopolymers demonstrate a much higher degree of correlation between protein levels and transcripts than multisubunit proteins [201]. One mechanism that can lead to such buffering is the greater evolutionary constraints placed on proteins as opposed to transcripts [208].

Integrative omics between the transcriptome and the proteome has highlighted the importance of the metabolic processes. Genome-scale metabolic maps have identified notable differences between the genetic profile and metabolic reactions that occur. These studies demonstrate that there is considerable variability in the enzymes used in catalyzing these reactions [196]. Since these enzymes are critical to the core functioning of the cell, the incorporation of enzymatic protein abundance, as well as metabolomic information, will enable the creation of a global image of biological processes.

1.7.2 Metabolomics

Metabolomics involves the quantification and study of metabolic substrates within these bodies. These metabolic substrates are a part of vast networks of complex chemical reactions that make up the of a cell. These can involve the recycling of nucleotides, lipids and protein, pro- cessing of waste products and the chemical reactions required to produce energy within the cell. Advances in Nuclear Magnetic Resonance (NMR) and Mass Spectrometry (MS) technologies have facilitated the profiling of hundreds of metabolites simultaneously. The chemical snapshots obtained

45 from these techniques give a readout of the current physiological state of the cell [209]. Metabolic readouts, unlike gene expression or proteins, provide a direct measurement of cellular function that cannot become attenuated from post-translational regulations. These chemical measures have also demonstrated great utility in discovering new gene functionality [210–213]. There is also additional evidence that metabolites can act as signaling molecules that can regulate gene expression [214]. This further increases the complexity of interactions between the different tiers of omics data.

Correlation values between enzyme transcripts and their associated transcripts are largely similar, particularly when viewing these correlations on the scale of entire metabolic pathways [214]. Many metabolites demonstrate high heritability with some studies showing up to 40% of measured metabo- lites having a heritability of > 0.6 [215]. The effect sizes for GWAS hits in metabolite concentration levels are larger than for phenotypes [214], with effect sizes ranging between 10-60% [214]. Results have also largely been concordant with the metabolic pathways of the metabolites, and their asso- ciated GWAS hits in particular enzymes [216, 217]. Metabolic markers could also be the results of microbial metabolism [218]. While appearing to be an environmental effect, microbial metabolism is similar between dizygotic and monozygotic twins [219]. Microbial biomasses play a critical function in the body, ranging from digestion to regulating the immune system [220].

When integrated with transcriptomic and genetic date, metabolomics had provided greater biological context and information to the molecular process that drives cellular systems. While metabolomics has been considered the apogee of omics data [209], there are still a growing number of omics fields which are involved in the collection of global information on specific biological properties. These include the methylome (discussed in section 1.5.3), the lysine acetylome (the post-translational mod- ification of proteins) [221], lipidomics (the study of lipids) [222] and even microbiomics (collection of metagenomes of the bacteria cells living within organisms) [223]. Information of the baseline state of an individual can provide valuable information into the various cellular process. Further under- standing the processes occurring in active biological systems can be achieved by examining how the internal physiological state changes in response to the environment or within other actively changing physiological states such as growth [224].

1.8 Systems Genetics

Systems genetics [39] incorporates molecular phenotypes (section 1.7), to study the flow of infor- mation from the genome to the phenotype of interest [39]. Using whole-genome transcriptomic,

46 proteomic and other -omic datasets, complex networks of interactions can be studied. The increasing levels of organization from the genome to multitudes of intermediate phenotypes gives rise to emer- gent properties, new features and capabilities that would not otherwise be observed if examining the expression of a single gene [225]. The advent of technologies that can cheaply and quickly mea- sure intermediate phenotypes such as whole-genome gene expression, metabolites, and proteins have facilitated the opening of the black box between genetic information and phenotypic traits.

1.8.1 Dimension reduction

Gene expression demonstrates a highly networked structure, with many genes showing a high degree of co-expression. Co-expression can occur when genes interact with each other, either as a part of a biological network or a protein complex. This feature of -omics data enables the data to be reduced down to smaller sets of highly correlated sets of variables. Cutting down the number of variables reduces the dimensionality of a dataset while increasing the power to detect meaningful associations otherwise lost due to the stringent multiple testing significance thresholds.

1.8.2 Coexpression

Modules of correlated transcripts can also provide a great deal of functional information about the genes. These covariance structures, also called modules, have been found by decomposing vari- ance using principal components analyses [226, 227], hierarchical clustering [228], Weighted Cor- relation Network Analysis (WGCNA) [229], and Modulated Modularity Clustering (MMC) [230]. These modules are topological, sub-networks within a much larger network of interactions and rep- resent functional units within a much bigger network that have hierarchical and scaled-free proper- ties [231]. The significant enrichment in Gene Ontology (GO) terms, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, TF binding sites and other biological interactions [232] indicates that biological plausibility of these modules. Regulatory polymorphisms governing gene modules have been identified in yeast [226] and humans [233].

Modules are often identified based on the phenotypic variation of the transcripts [114]. Phenotypic variance can be attributed to both genetic and environmental factors. Chapter 4 uses Identity-By- Decent (IBD) methods to partition the genetic and environmental variances driving modules identi- fied a previous study [114]. Chapter 4 illustrates the extent that genetic and environmental factors influence the co-variation between transcripts clustered within a co-expression modules.

47 1.9 Network analysis

The vast majority of Genetics of Gene Expression (GOGE) studies have focused on the analysis of single gene targets. However, gene products within a cell often do not function alone, but instead exert their effects through complex networks ranging from intracellular signaling and metabolic function [110,179]. Perturbations across multiple regulatory genes in a network can have a much a larger effect on disease than single gene changes [234]. Analysis of variants affecting gene networks has elucidated interesting processes with regards to complex diseases such as [46], regulatory networks in stem cells [235] and immune response [236]. Analyzing systems affected by mutations can also give biological insight into the function of identified eQTLs and can even identify novel functions. For example, a study of obesity susceptibility discovered novel interactions between TFG-β and genes outside its known signaling pathway [237].

Groups of correlated genes are likely to respond environmental perturbations together [238], replicate across different microarray platforms [228] and can be highly conserved in populations [114] These networks of interacting genes could also be a driving factor behind phenotypic stability and robust- ness [234], by providing genetic redundancies. Transcripts within these modules are associated with disease [228, 232] and different environmental conditions [114, 115]. The ability of the modules to provide information on both genetic and environmental states individuals demonstrates their utility as indicators of disease susceptibility and physiological state. This particular information based on different conditions is the basis of personalized medicine. This vision may be possible through the integration of various -omics data (as discussed in 1.7) by providing highly detailed biological profiles and physiological responses of individuals [239].

Additional uses of gene networks include searching for epistatic interactions among a small number of correlated genes, analyzing pleiotropic effects, identifying the function of predicted genes from ’guilt by association’ and determining differences between tissues [179]. Being able to test for these factors can enhance our understanding of mechanisms maintaining quantitative genetic variation. The integration of eQTL data alongside gene modules can facilitate the identification of causal variants driving gene expression and can increase our understanding of network functionality by providing directionality to the flow of information from the genetic polymorphisms to endophenotypes [179].

48 1.10 Clinical applications of gene expression

Thus, the study of global gene expression can provide information on the functionality, constraints, and controls of biological systems that give rise to higher order phenotypes, such as disease suscepti- bility. These statistical models of gene expression regulation can have significant applications in the clinical setting. Gene expression can be utilized as a clinical diagnostic tool to identify individuals with high disease risk. The incorporation of genetic information plus information on gene expression responses to environmental factors can aid in the development of personalized medicine [240–243]. Hypothesis-free, predictive medicine can shift diagnostic focus to the identification and treatment of non-symptomatic, early stages of the disease and will focus primarily on maintenance of health [11]. An example of the diagnostic potential of gene expression was with the classification of healthy individuals and patients with latent and pulmonary tuberculosis [114]. Chapter 4 continues this study by examining the genetic architecture of the gene expression transcripts used in this predic- tive model [114].

49 Chapter 2

Genetic and non-genetic variation revealed for the principal components of hu- man gene expression

The work presented in this chapter has been published in:

Goldinger A, Henders AK, McRae AF, Martin NG, Gibson G, Montgomery GW, Visscher PM, Powell JE: Genetic and non-genetic variation revealed for the principal components of human gene expression. Genetics 2013, 195:1117-28

50 2.1 Abstract

PCA has been employed in gene expression studies to correct for population substructure, batch, and environmental effects. This method typically involves the removal of variation contained in as many as 50 PCs, which can constitute a large proportion of total variation present in the data. Each PC, however, can detect many sources of variation including gene expression networks and genetic variation influencing transcript levels. We demonstrate that PCs generated from gene expression data can simultaneously contain both genetic and non-genetic factors. From heritability estimates, we show that all PCs contain a considerable portion of genetic variation while non-genetic artifacts such as batch effects were associated to varying degrees with the first 60 PCs. These PCs demonstrate an enrichment of biological pathways including core immune function and metabolic pathways. The use of PC correction in two independent datasets resulted in a reduction in the number of local- and distal-eQTLs detected. Comparisons of PC and linear model correction revealed that PC correction was not as efficient at removing known batch effects and had a higher penalty on genetic variation. Therefore, this study highlights the danger of eliminating biologically relevant data when employing PC correction in gene expression data.

2.2 Introduction

Microarray technology can simultaneously capture the expression of thousands of transcripts within an individual. However these arrays are sensitive to environmental or experimental perturbations, for example due to different laboratory technicians and reagents [153, 244] microarray chip and chip po- sition [159], temperature [154], and even ozone levels [155]. These effects can constitute a substantial proportion of variance within a dataset [158].

Normalization strategies have become standard in gene expression studies to correct for non-normal distributions and inconsistencies between arrays [245]. However, normalization techniques do not control for batch effects caused by technical artifacts. These batch effects require additional correction techniques [154] and failure to do so has led to spurious associations [161, 162].

Many different correction and normalization techniques are currently used in gene expression studies (for review see: [157, 160]). PCA is one method that has been used for the correction of widespread batch effects [77, 157, 164, 170]. PCA determines linear combinations of variables and projects them

51 into orthogonal vectors that are ranked on variance explained [246]. PCA can be used for a variety of analytical approaches such as; dimension reduction [247], identifying gene correlations [248], de- termining co-expressed networks [249], classifying gene expression modules [248], analyzing time- series data [250], and determining differentially expressed genes across tissues.

PCA is a powerful tool to analyze and understand high dimension gene expression data. However, while PCA is commonly used to decompose variation into common axes, the components of variation contributing to each PCs are often unknown. Information on the components of variation captured by PCs is important if the correct inferences of PCA results are to be made. A clear example is in the use of PCs to correct for batch effects, where variation in the first 1-50 PCs is removed from the dataset by fitting these PCs as covariates in a linear model and using the residuals as a corrected phenotype [77, 157, 163, 165, 169, 170]. Though this method has been demonstrated to increase the power to detect eSNPs by eliminating confounding batch and environmental effects [77], there is a risk of the indirect removal of biologically relevant information due to the large amount of variance held in the first few PCs [157].

In this study, we investigated the sources of variation driving PCs generated on a gene expression dataset produced from whole blood. This dataset is comprised of 860 individuals from 314 families from the BSGS [42]. We combine the power of pedigree and SNP-based designs to quantify and dissect the total genetic variance of PCs. From these results, we demonstrate that the PC correction methods used in published literature (for example in [77, 163]) simultaneously remove both genetic variation and batch effects. We also show that with careful analysis, the information contained within PCs can be leveraged to provide an understanding of gene expression pathways underlying complex disease. We use these results to emphasize the importance of using other correction methods when applicable instead of removing PCs to correct for batch effects.

2.3 Materials and Methods

2.3.1 Samples

A total of 335 unrelated individuals were selected from BSGS, a cohort comprising of 860 individuals from 314 families [42]. Individuals were selected on two criteria; the first was to obtain the maximum number of unrelated individuals by choosing parents (n=165) and a single individual from families with no parents (n = 170). Unrelated individuals were chosen from this dataset to exclude family

52 effects from being present within the PCs. All individuals were genotyped on Illumina 610-Quad Beadchip (Illumina Inc, San Diego, USA) by the Scientific Services Division at deCODE Genetics, Iceland. A total of 488,462 SNPs were present after appropriate quality control (see GWAS section below).

RNA was obtained from whole blood samples that were collected in a PAXgeneTM tube (QIAGEN, Valencia, CA), analyzed for purity on an Agilent Bioanalyzer, converted to cDNA, amplified and pu- rified using the Ambion Illumina TotalPrep RNA Amplification Kit (Ambion). The expression levels were quantified on an Illumina HumanHT-12 v4.0 Beadchip (Illumina Inc, San Diego, USA). Sam- ples were randomized across the chip to minimize batch effects due to families, sex and generation. Quality control methods for selecting highly expressed probes in the samples was described in detail in [42]. After appropriate QC, the probes were further filtered for expression in 100% of samples. Probe names starting with HS, KIAA, LOC, were removed from the dataset, as they did not map to characterized ref-seq genes. After filtering, a total of 9,086 probes remained.

2.3.2 Replication sample

The CHDWB study is a population-based cohort consisting of 139 individuals collected in Atlanta USA [115]. Gene expression profiles were generated using Illumina HT-12 V3.0 arrays from whole blood collected in Tempus tubes that preserve RNA. Whole genome genotypes were measured using Illumina Omni Quad arrays.

2.3.3 Normalization and batch effect correction

The gene expression levels of the 9,086 probes were normalized using the quantile, log2, and z-score transformation. Correction methods were then applied to create four different datasets that were used for further analysis, these are; (a) standard normalization (log, quantile and z-score transformation) (b) standard normalization with linear model correction for batch effects (c) standard normalization with correction for 1-25 PCs and (d) standard normalization with correction for 1-50 PCs. These datasets/correction procedures will be referred throughout the paper as datasets (a)-(d).

Batch effects arise from various sources during the generation of the data, and processing date records can provide a useful estimate of differences between subsets of groups [157]. In previous work [42], we showed that chip identifier (ID) and chip position comprehensively account for batch effects.

53 Dataset (b) was created using the residuals from a linear model with the batch effects of chip ID and chip position as well as sex and age as covariates. This model was fitted as follows:

y = Xβ + e (2.1)

Where:

  y1    .  y =  .    yn is a vector of probe values from dataset (a) with n = 335 individual values.

  1 x1 ··· x1c   . . .. .  X = . . . .    1 xn ... xnc is a matrix with c = 86 covariates: chip position (11 levels), chip ID (73 levels) and sex and generation, which were coded as dichotomous variables.

  µ     β1 β =    .   .    βc is a vector of parameters for the model. The parameter µ represents the mean expression level across all the individuals. The vector:

  e1    .  e =  .    en holds the model residuals. The parameters in β are estimated by:

54 β = (XT X)−1XTY (2.2)

Datasets (c) and (d) were created from the residuals obtained after correcting for the first 25 and 50 PCs using Equation 2.1, where y was the normalized dataset (a) and the PC scores values (Equation 2.3, see below) obtained from dataset (a), were fitted as covariates in X. The first 50 PCs were selected to follow the procedure used in previously published gene expression papers [77, 163].

2.3.4 Principal Components

PCs were calculated on dataset (a) to facilitate PC correction, which was used to generate datasets (c) and (d). PCs were also calculated on all four datasets (a)-(d) to analyze the extent of batch effect asso- ciations. To follow the methods used in previous gene expression studies [77,163, 170], the principal components used in this study were generated using SVD. SVD decomposes the high dimensional dataset:

  m11 ··· m1p    . .. .  M =  . . .    mn1 ··· mnp with n= 335 (samples) and p = 9,085 (probes), into a set of uncorrelated, orthogonal vectors:

M = UΣV T (2.3)

Where:

  u11 ··· u1n    . .. .  U =  . . .    un1 ··· unn is a matrix with scores values for each PC (columns), which gives the correlation values between the samples (rows).

55   e1    .  Σ =  .    en is a diagonal matrix containing the eigenvalues, which represent the amount of variance each PC explains.

  v11 ··· v1n    . .. .  V =  . . .    vp1 ··· vpn are the eigenvectors that hold the correlation values for each probe (rows) to the PC (columns) [251].

Dataset (a) demonstrates a homogenous variance structure within the scores plots with no clear popu- lation stratification or substructure comprising of independent clusters of individuals. As this dataset contains the same information as (b)-(d), this indicates the uniform nature of the samples within the BSGS dataset.

The eigenvectors values in V (Equation 2.3) represent the correlation between the probes and the principal component. As multiple probes are correlated with each PC, the selection was based on the eigenvector values for each probe. The minimum eigenvector demonstrates the largest negative correction, while the maximum eigenvector shows the largest positive correlation. Across all PCs the maximum eigenvector values ranged between 0.02 and 0.08 and the minimum values ranged between -0.02 and -0.08 (Figure 2.7A). The PCs that exhibited the smallest maximum and minimum values were the first 1-20 PCs. As the number of probes selected for each PC can impact later enrichment analysis, we wanted to select a consistent number of probes between all PCs. Therefore, a cut-off eigenvector value was used to select probes as this produced the most consistent results and provided a standard approach to use for all PCs. This cut-off value, chosen to be less than -0.02 or greater than 0.02, incorporated the maximum and minimum values for all PCs to ensure that only the most highly correlated probes for each PC were selected. This eigenvector cut-off value chose a consistent number of probes (approx. 550) across all the PCs (Figure 2.7B). The eigenvalues for the first 10 PCs were confirmed via linear regression with the relationship tested using an F-test and p -values corrected for multiple testing using a Bonferroni adjustment.

56 2.3.5 Association with Batch effects

PVCA was used to quantify the extent of the batch, age and sex effects within datasets (a)-(d). This method has been described fully previously (see ref [252]). PCs are generated by an eigenvalue decomposition of a covariance matrix, and the batch effects are quantified with a linear mixed model using the batch effect terms as covariates. The variance components in each model are estimated using Restricted Maximum Likelihood and are scaled by the eigenvalues obtained for each PC. The variance attributed to each factor is divided by the variance (determined from the eigenvalue) of the corresponding PC and then standardized across all factors.

A similar method was used to quantify the extent of batch effects on the principal components ob- tained from SVD. Scores values obtained from the SVD of datasets (a)-(d) were tested for batch effect association using a multiple linear regression (Equation 2.1 and Equation 2.2), where y is a vector of scores values for one PC, with n = 335 individual score values. X is a matrix with c = 86 covariates: chip position (11 levels), chip ID (73 levels) and sex and generation, which were coded as dichotomous variables.

The R2 values for the linear regression are calculated as the sum of squares explained by the Sum of Squares Regression (SSR) divided by the Sum of Squares Total (SST), which gives the percentage of variance in each PC that is associated with batch effects,

1 − SSR R2 = (2.4) SST and adjusted for multiple covariates using:

(1 − R2)(n − 1) R2 = 1 − (2.5) ad j (n − c − 1)

P-values were obtained from an F-test of the total variance explained by the model as opposed to total variance. Multiple testing was accounted for by Bonferroni correction.

57 2.3.6 Estimation of heritability

To assess the extent of genetic variability held within the PCs were adjusted for in datasets (c) and (d), estimates of heritability were calculated for all PCs generated from dataset (a). To compute heritability for the PCs the score values for related individuals had to be estimated, which minimizes the presence of family effects retains the same PCs for eQTL and heritability analyses. The estimated scores were calculated by multiplying the probe values for those individuals with the eigenvectors calculated from the PCA decomposition:

Uest = XV (2.6)

Where:

  u11 ··· u1n    . .. .  Uest =  . . .    un1 ··· unn is the resulting matrix of estimated PC scores values with n= 425 individuals.

  x11 ··· x1p    . .. .  X =  . . .    xn1 ··· xnp is a matrix of probe values with p = 9,085 probes.

  v11 ··· v1n    . .. .  V =  . . .    vp1 ··· vpn is a matrix of eigenvectors, previously derived from the analysis of data from the 335 unrelated indi- viduals.

860x335 Uest was combined with U from the original PCA to create a R scores matrix containing values

58 of all individuals for the 335 PCs.

Heritabilities were estimated for the PCs generated from dataset (a) using Quantitative Trait Dise- quilibrium Test (QTDT), which partitions variance components attributed to additive genetic, envi- ronmental or common family variance [253]. This method utilizes the complete pedigree structure within the data. Two models were compared in this analysis, an Additive genetic and Environment (AE) model, which includes an additive genetic component and unique environment component, and a Common family effects and Environment (CE) model, which includes common and unique envi- ronment components. Additive genetic, common and unique environment variance estimates were divided by the total phenotypic variance to determine the proportion contributed by each factor.

Heritability estimates for all probes were calculated using QTDT on the full dataset of 860 individuals. The full dataset was corrected using the same four methods used for datasets (a)-(d). The distribution of probe heritability estimated obtained from the four different correction methods were compared. QTDT heritability estimates are constrained to have a minimum value of 0.

2.3.7 Genome-wide association study on PCs

A genome-wide association analysis for each PC was performed using PLINK software [254] on dataset (a) to provide an independent assessment of genetic effects present for PCs. SNPs were fil- tered based on a Minor Allele Frequency (MAF) > 0.05, missingness >0.10 and Hardy-Weinberg Equilibrium (HWE) p-value < 1e-6. After filtering, 488,462 SNPs remained for analysis. Signifi- cance was determined at a Family-Wise Error Rate (FWER) with Bonferroni correction (α = 0.05) and with an empirical p-value estimation for each PC using 1000 permutations in PLINK [254].

2.3.8 eQTL analysis

To determine the impact of each of the different correction methods employed in datasets (a) - (d), an eQTL analysis was performed for each probe within these four datasets using the same procedure as described above. To replicate the effect on eQTL detection we performed the sample analysis for an independent sample comprising of 139 unrelated individuals (CHDWB). local- and distal- eQTLs at multiple False Discovery Rate (FDR) levels [255] were extracted and compared between the datasets. local-eQTL associations are defined as being within +/- 1MB of the gene tested.

59 2.3.9 Biological Pathway Analysis

The pathway analysis was performed on the online Database for Annotation, Visualization and Inte- grated Discovery (DAVID) Bioinformatics Resources 6.7 [256]. Probes associated with the PCs of interest were submitted as a list of Illumina probe IDs and KEGG pathway enrichment was performed using Functional Annotation Chart implemented by DAVID [257]. The significance of pathway en- richment was determined from a modified Fisher’s test, which calculates the chance that a set of genes of related terms are presented at a certain percentage in the list. Multi-testing was accounted for using a Benjamini-Hochberg FDR of 0.05 [255].

2.4 Results

2.4.1 Non-genetic contributions to PCs

PVCA was used to estimate the extent of batch, sex and age effects within the whole dataset [252,258]. This method involves firstly decomposing the dataset into a series of PCs and estimating the effects of each covariate on these components. The total variance of each covariance is then summed across all PCs and scaled by the respective eigenvalue. PVCA was applied to datasets (a)-(d) to compare the effectiveness of these methods in removing batch effects from the data. For dataset (a), batch effects explained 23.1% of the total variance, with chip ID having the largest effect (Figure 2.1A). This proportion is far less than the 75.5% cumulative variance explained by the first 50 PCs (Figure 2.2A), which were removed in previous studies to correct for batch effects [77,163]. The results from linear model correction (Figure 2.1B) and PC correction (Figure 2.1C and Figure 2.1D) show that the majority of batch effects have been removed using these methods.

To analyze how batch effects are distributed across PCs, each of the four datasets (a-d) was decom- posed using SVD into 335 PCs. Multiple regression of each PC on chip ID, chip position, sex and age was used to analyze the distribution of these effects across all PCs. The majority of variance attributed to batch effects in the normalized dataset (a) was held within the first 58 PCs of the dataset (Figure 2.3A), with the proportion of variance explained by batch effects ranging from 24-74%. This indicates that unknown sources of variance, not associated to batch, sex or age effects are also present within these PCs. As the first PCs pick up the majority of variance within the dataset (Figure 2.2A), sig- nificant associations to these PCs indicate a much larger impact of batch, sex and age effects within

60 Figure 2.1: PVCA of gene expression datasets (A)-(D). PVCA calculates the proportion of variance in the entire dataset that is attributed to certain batch, sex and age effects. Altogether batch, sex and age effects account for 23% of the variance in the non-corrected dataset (A) and are fully removed in the corrected dataset (B). Batch effects present in PC corrected dataset are represented in (C) PC25 and (D) PC50. The residuals represent the remaining variance in the dataset not attributed to these batch effects.

61 Figure 2.2: Variance explained by PCs. Calculated from the eigenvalues obtained from the Singular Value Decomposition. Variance explained by each PC is plotted in black and cumulative variance in blue. A) The normalized dataset, B) Corrected with linear models, C) PC25 corrected D) PC50 corrected. All variances sum to 1.

62 the dataset. Batch effect associations to the first few PCs are expected to occur more frequently than in later PCs, because the initial PCs can capture the majority of correlation structure within the dataset [246]. Therefore whilst the dataset (b), corrected for batch effects with linear models, demon- strated as association with PC 330, the significance of this association is negligible (Figure 2.3B) as confirmed by PVCA (Figure 2.1).

Overall it is evident that batch effects have a large impact on gene expression variation. However, fitting these as covariates in a linear model during the correction procedure removes the presence of these batch effects (Figure 2.1B). This correction procedure (dataset (b)), which fits all batch effects as individual factors in a linear model (see methods), removes the vast majority of associations with batch effects in the principal components generated on the residuals (Figure 2.3B). There was only one significant association with PC330 present in this corrected dataset, and that accounted ˜0% of the variance, which explains why the variance attributed to batch effects was calculated to be zero with PVCA (Figure 2.1B). Batch effects were also efficiently removed in datasets (c) and (d) which were corrected for the first 25 PCs (68.4% of the variance), and 50 PCs (75.5% of the variance) (Figure 2.1C and Figure 2.1D respectively). However, there is still a 6% and 2% association to batch effects in datasets (c) and (d) respectively, and analysis of PCs generated on these two datasets show many components capturing these residual batch effects (Figure 2.3C and Figure 2.2D respectively). These results indicate that PC correction does not remove known batch effects as efficiently as correcting for chip ID, chip position, sex and generation with linear models.

Traditionally the last few PCs, which contain only a small fraction of the variance, are removed as they are attributed to noise or experimental error [259]. However, the correlation structure present in the dataset can drastically change which PCs are attributed to noise or batch effects [260]. In gene expression data, it has been demonstrated that the final PCs contain genetically relevant information [174, 261]. Due to the large variance that is due to batch effects [158], removal of PCs has recently focused on the initial PCs (up to 50) that explain the majority of the variance [77, 157, 163, 165]. We have demonstrated that the initial PCs indeed have the highest association to these batch effects. However, the variability in which PCs are associated with the batch effects (Figure 2.3A) makes it difficult just to select an arbitrary number of components for normalization. Removing an uninformed number of PCs could lead to inefficient correction as opposed to using linear models with recorded batch effects. Other factors including biological information could be present alongside these artifacts in these PCs although we acknowledge that additional batch effects, not recorded, may also be present.

63 Figure 2.3: Batch effect associations to PCs. Significant values (after Bonferroni correction p < 0.05 2 335 ) are shown in red. (A) R from a regression analysis of each PC on batch, sex and age effects in a dataset that is not corrected for any of those factors. (B) R2 from a regression analysis of each PC on the batch, sex and age effects in a dataset that is corrected for these factors. There is no significant association to batch effects present. (C) Association to batch effects in a dataset corrected for the first 25 PCs. (D) Association for batch effects in dataset corrected for the first 50 PCs.

64 2.4.2 Genetic contribution to PCs

To determine the total genetic component of the PCs that would be removed when using PC correction (as in datasets (c)-(d)), were estimated for all 335 PCs generated from the normalized dataset (a) (see methods). The heritability estimates showed that there was a considerable genetic component to nearly all 335 PCs generated (Figure 2.4). The mean heritability for all PCs was 0.429 (sd 0.1), and 0.39 (sd 0.13) for the first 50, demonstrating a large genetic component despite the strong association with batch effects. Comparisons between models for AE and CE revealed a significant Vg Vc 2 association between the estimates of genetic (V p ) and common family effects (V p ) with an R of 0.08 (p = 8.57e-08) (Figure 2.5). The association between the estimates of the two models indicates that there is a small amount of confounding between of measured genetic variance due to common family effects. This would most likely result in heritability estimates that are slightly biased upwards [262].

Figure 2.4: Narrow-sense heritability estimates of each PC obtained from QTDT. These results indicate that nearly all PCs hold genetic information.

65 Figure 2.5: Correlation between additive genetic and common environmental factors.Significant association (p = 8.57e-08 and R2=0.08) between the common environment and genetic components estimated in an AC and AE models. The proportion of common environment variance was calculated by dividing the variance attributed to common environment (Vc) by the total phenotypic variance (Vp). The proportion of genetic variability (heritability) was calculated by dividing the additive ge- netic component (Vg) by the total phenotypic variance (Vp). This result indicates that the heritability estimates obtained are confounded with common environment variance and, therefore, inflated up- wards by common family effects.

To assess the impact of linear model and PC correction on genetic variation, we estimated the heri- tabilities of the 9,086 probes in the entire dataset of 860 individuals using the same four correction methods (a)-(d) (see methods). Mean heritabilities for the 9,086 probes from the four normalization strategies were 0.32, 0.23, 0.21 and 0.18 respectively (Figure 2.6). The high mean heritability from strategy (a) is likely due to inflation from batch effects such as the date of RNA extraction, which was performed in family groups and, therefore, could not be corrected without removing heritable variation. With PC correction (c) and (d), there was a much higher proportion of zero heritability probes as opposed to strategy (b). The lower mean estimate and the higher proportion of zero heri- tability probes when correcting for 1-50 PCs suggests that PC correction has a much higher penalty of genetic variation than linear model correction.

66 Figure 2.6: Distribution of heritability estimates for 9,086 probes from four different correction methods.Black - standard normalized, Red - standard normalization with linear model correction for batch effects, Blue - corrected for batch effects using the first 25 PCs, Purple - corrected for batch effects using the first 50 PCs. There is a drop in heritability using the correction methods; however, this is more pronounced in the PC corrected dataset. The heritability estimates are constrained to zero due to the nature of genetic variance component estimating in QTDT.

To investigate the impact of PC correction on eQTL detection, we ran a series of eQTL analyzes for each probe in the unrelated BSGS datasets (a)-(d). Associated local and distal variants were extracted at multiple FDR thresholds (Table 2.1). The results demonstrated a much higher number of both local and distal eQTLs detected within the dataset corrected with linear models (b) as opposed to PC corrected datasets (c) and (d). PC correction, however, enhanced eQTL detection when compared to non-corrected datasets (a), likely reflecting a removal of false positives caused by batch effects. The improved removal of batch effects in the PC50 (d) vs. PC25(c) corrected dataset (Figure 2.1 and Figure 2.2) is also reflected by an increased number of local and distal eQTLs in dataset (d). These same trends were also observed in our replication sample (CHDWB) (Table 2.1). The results from these eQTL studies indicate that PC correction method negatively affects the number of eQTLs that

67 can be detected within gene expression datasets.

Table 2.1: eQTL results for all four datasets (a-d) in two cohorts. Local regions were defined as +/- 1MB either side of the TSS and distal as elsewhere in the genome. The numbers of probes with local or distal associations significant at various study-wide FDR thresholds is provided for each of the four correction methods. Study replicated in CHWBD

Dataset Dataset Position FDR = 0.2 FDR = 0.1 FDR = 0.05 FDR = 0.01 FDR = 0.001 FDR = 0.0001 BSGS a local 1586 840 596 449 349 273 BSGS a distal 8056 2144 170 99 75 56 BSGS b local 2824 1914 1746 1199 1005 848 BSGS b distal 9085 3025 490 264 218 183 BSGS c local 1676 929 650 502 402 316 BSGS c distal 9085 2530 254 137 104 88 BSGS d local 1707 1149 806 737 640 510 BSGS d distal 9085 2564 301 169 114 91 CHDWB a local 743 513 345 189 143 101 CHDWB a distal 8102 1249 90 51 22 9 CHDWB b local 1164 822 627 594 470 385 CHDWB b distal 8841 2076 388 215 136 141 CHDWB c local 815 597 373 252 207 168 CHDWB c distal 8234 1471 98 46 31 15 CHDWB d local 909 684 457 366 295 183 CHDWB d distal 8695 1538 152 120 86 27

To investigate loci driving genetic variation captured within PCs generated from dataset (a), we per- formed a GWAS for each PC. The PCs were tested for association with 488,462 genotyped SNPs using linear models implemented in Plink software [254]. At a study-wide significance level deter- mined by Bonferroni correction (0.05 /(488,462 SNPs x 335 PCs) = 3x10−10), no significant SNPs were found. We next examined the top SNPs that were significant after Bonferroni correction on each PC (0.05 / 448,462 = 1.0x10−7). There were 23 SNPs that were significant at this threshold, and these were confirmed using permutations (Table 2.2). The lack of genome-wide significant associations could be attributed to the study not having enough power due to a small sample size [263]. Another explanation is that the genetic variance attributed to PCs is highly polygenic. As multiple probes are driving each PC (Figure 2.7B), the magnitude of different signals can prevent the detection of individual SNPs effects.

68 Figure 2.7: Selection of probes driving PCs. 2A) Maximum (blue) and minimum (black) eigenvector values for each PC. The eigenvector values represent the extent of the correlation between a probe and a PC, with 0 indicating no association. Selection of probes driving each PC is based on the optimal number of probes for each PC that have the same eigenvector value cut-off. As multiple probes can contribute a small amount of variance to each PC it is reasonable that a low cut off value can pick up many of the significant probes driving each PC. Probes that have an eigenvector value of greater than 0.02 or less than -0.02 were selected for further biological enrichment analysis as this incorporated all the maximally and minimally expressed probes in this section. (2B) Selection of probes at this cut-off value enabled approximately similar numbers of probes to be selected for each PC. The slightly lower number of probes that are selected in the initial PCs is due to the lower maximum and minimum eigenvector values in these PCs as shown in (2A).

69 0.05 Table 2.2: Results from the GWAS for each PC. Probes that were found to be significant after a Bonferroni correction (α = 488,462 ) on each PC are listed in this table. Though none of these are significant after correcting for all PCs, they are significant at an empirical p-value of 0.05 for each PC after 1000 permutations. PC - principal component, CHR - chromosome, SNP - SNP ID, BP - base pair, BETA - regression coefficient, STAT - Coefficient T-statistic, P - Asymptotic p-value for t-statistic, EMP - empirical p-value after 1000 permutations, G LOC - the location (in relation to the gene) of the SNP, GENE - genes nearby/containing the SNP, GENE DESC - description of the gene name.

PC CHR SNP BP BETA STAT P EMP G LOC GENE GENE DESC

22 10 rs11004899 56960734 3.754 5.487 8.14E-08 0.036 intron-variant PCDH15 protocadherin related 15

35 2 rs1516174 51724845 2.826 5.43 1.09E-07 0.049 intron-variant LOC730100 uncharacterized LOC730100

81 16 rs9673242 14078070 -3.404 -5.524 6.69E-08 0.021 intron-variant MKL2 MKL1/myocardin like 2

81 16 rs1004637 14113245 -3.404 -5.524 6.69E-08 0.021 intron-variant MKL2 MKL1/myocardin like 2

100 16 rs7190803 77375823 -1.444 -5.535 6.33E-08 0.021 intron-variant WWOX WW domain containing oxidoreductase

106 2 rs10497190 158347486 -1.741 -5.585 4.87E-08 0.021 intron-variant ACVR1 activin A receptor type 1

106 13 rs17072974 21351926 2.275 5.501 7.53E-08 0.027 upstream-variant-2KB LINC00424 long intergenic non-protein coding RNA 424

106 13 rs12428031 21355249 2.275 5.501 7.53E-08 0.027 intergenic LINC00424 long intergenic non-protein coding RNA 424

110 11 rs10501384 59950456 2.555 5.482 8.31E-08 0.033 intergenic MS4A14 Membrane Spanning 4-Domains A14

110 11 rs17542525 59958103 2.555 5.482 8.31E-08 0.033 intron-variant MS4A5 Membrane Spanning 4-Domains A14

119 8 rs4596672 88124581 -1.641 -5.488 8.23E-08 0.032 intron-variant CNBD1 cyclic nucleotide binding domain containing 1

119 8 rs2974279 88144159 -1.385 -5.517 6.95E-08 0.028 intron-variant CNBD1 cyclic nucleotide binding domain containing 1

156 1 rs825113 221564768 1.762 5.503 7.45E-08 0.026 intron-variant SUSD4 sushi domain containing 4

157 2 rs11674634 132055980 -1.305 -6.005 5.02E-09 0.004 intron-variant POTEI POTE ankyrin domain family member I

175 4 rs6848983 298010 1.736 5.71 2.50E-08 0.005 intergenic ZNF732 Zinc Finger Protein 732

214 5 rs1279627 55966337 -1.095 -5.562 5.50E-08 0.019 intergenic C5orf67 Chromosome 5 Open Reading Frame 67

225 10 rs7919814 109720733 -1.055 -5.502 7.52E-08 0.03 intron-variant LINC01435 long intergenic non-protein coding RNA 1435

245 8 rs17128272 19257994 -1.516 -5.449 9.85E-08 0.042 intron-variant SH2D4A SH2 domain containing 4A

323 9 rs10813262 30474037 1.542 5.538 6.22E-08 0.044 intergenic LINC01242 Long Intergenic Non-Protein Coding RNA 1242

323 9 rs4878432 30490252 1.542 5.538 6.22E-08 0.044 intergenic LINC01242 Long Intergenic Non-Protein Coding RNA 1242

323 9 rs7866981 30548222 1.568 5.639 3.65E-08 0.034 intergenic LINC01242 Long Intergenic Non-Protein Coding RNA 1242

324 14 rs10498517 64832534 1.619 6.112 2.74E-09 0.002 intergenic CTD-2509G16.5 Clone-based

324 14 rs4902382 64834310 1.44 5.662 3.24E-08 0.017 intergenic CTD-2509G16.5 Clone-based 2.4.3 Biological pathway analysis:

We next sought to evaluate if the first 50 PCs generated from dataset (a) contained linear combinations of genes involved in an expression pathway. Pathway analysis was performed using DAVID Bioinfor- matics Resources 6.7, Functional Annotation Tool [256] on the first 50 PCs (Table 2.3). For numerous PCs we can demonstrate significant enrichment for multiple different KEGG pathways (FDR = 0.05).

71 Table 2.3: Pathway analysis for the first 50 PCs. Pathway analysis for PC1-50 was performed using DAVID Bioinformatics Resources 6.7, Functional Annotation Tool. PC = Principal Component, Term = name of KEGG pathway, Count = count of probes in each hit, % = percentage of all submitted probes that are present within the pathway, P= the p-value calculated using a modified Fisher’s exact test for enrichment, FDR = correction of p-values using the Benjamini-Hochberg FDR method.

PC Term Count % P FDR

1 Ribosome 8 7.9 1x10−6 4.5x10−5

3 Fc gamma R-mediated phagocytosis 14 2.7 4.3x10−5 5.4x10−3

7 Porphyrin metabolism 7 1.5 1.6x10−4 2x10−2

8 Proteasome 12 2.6 2.7x10−7 4x10−5

8 Oxidative phosphorylation 18 3.9 1.2x10−6 8.7x10−5

8 Huntington’s disease 21 4.6 1.9x10−6 9x10−5

8 Parkinson’s disease 15 3.3 8.5x10−5 3.1x10−3

8 Alzheimer’s disease 15 3.3 1.1x10−3 3x10−2

12 Hematopoietic cell lineage 13 2.5 1.4x10−5 2x10−3

12 B cell receptor signaling pathway 10 1.9 5.6x10−4 3.8x10−2

12 Antigen processing and presentation 10 1.9 1.2x10−3 5.3x10−2

12 Graft-versus-host disease 7 1.3 1.3x10−3 4.4x10−2

12 Non-small cell lung 8 1.5 1.5x10−3 4x10−2

12 Asthma 6 1.2 2x10−3 4.4x10−2

13 RNA degradation 11 2.3 1.3x10−5 1.8x10−3

13 Oxidative phosphorylation 15 3.1 7.8x10−5 5.6x10−3

13 Ribosome 11 2.3 5x10−4 2.4x10−2

18 Ribosome 14 2.9 9.4x10−7 1x10−4

24 B cell receptor signaling pathway 14 2.9 2x10−7 2.7x10−5

25 B cell receptor signaling pathway 11 2.2 1.1x10−4 1.6x10−2

2 Primary immunodeficiency 8 1.7 7.9x10−5 1.2x10−2

32 Oxidative phosphorylation 13 2.7 2.9x10−4 3.7x10−2

Immune function was the most common process with PC3, PC12, PC24, PC25 and PC26 all showing significant enrichment for immune functional pathways. PC3 showed enrichment for Fc-gamma R- mediated phagocytosis ( p = 5.4x10−3). PC12 showed enrichment for hematopoietic cell lineage (p

72 = 2x10−3), B-cell receptor signaling (p = 3.8x10−2), graft-vs-host disease (p = 4.4x10−2), non-small cell lung cancer ( p = 4x10−2) and asthma (p = 4.4x10−2 ). PC24 and PC25 were also enriched for B-cell receptor signaling ( p = 2.7x10−5 and p = 1.6x10−2 respectively) and PC26 showed enrichment for primary immunodeficiency ( p = 1.2x10−2). Metabolic processes were enriched in PC8, PC13 and PC32. Enrichment for oxidative phosphorylation was found in PC8, PC13 and PC32 (p = 4x10−5, p = 5.6x10−3 and p = 3.7x10−2). PC8 also showed enrichment for genes known to be involved in susceptibility to Parkinson’s (p=3x10−3), Alzheimer’s (p = 3x10−3 ) and Huntington’s disease (p = 9x10−5) and proteasomal components involved in peptide processing (p = 4x10−5). These brain conditions have been linked to oxidative metabolic dysfunction, due to oxidative damage to neurons [264–266] and neuronal energy deficiency [267] and also improper peptide processing that leads to the build-up of amyloid plaques [268].

Enrichments for ribosomal components were found in PC1, PC13 and PC18 ( p = 4.5x10−5, p = 2.4x10−2 and p = 1x10−4). PC13 also showed enrichment for RNA degradation processes (p = 1.8x10−3). PC7 was enriched for porphyrin metabolism (p = 2x10−2) involved in heme biosynthe- sis.Of the 11 PCs that showed a significant enrichment (1, 3, 7, 8, 12, 13, 18, 24, 25, 26 and 32), most of them also demonstrated relatively high heritability estimates (0.4, 0.4, 0.58, 0.49, 0.4, 0.45, 0.37, 0.39, 0.57, 0.37, 0.49, respectively), with an average heritability of 0.45. These results together demonstrate that biologically relevant and interesting probe enrichments are present within the first 50 PCs despite the high association with batch effects within the dataset.

2.5 Discussion

The removal of genetic variation alongside batch effects when using PC correction has been alluded to, but never formally investigated, in several papers [77, 163, 165, 171]. Due to the unique study design of BSGS, which contains both unrelated and family information, we can quantify the genetic components driving each PC with two independent approaches: SNP association analysis and her- itability estimates. From this, we have demonstrated that all PCs obtained from the decomposition of a gene expression dataset contain relevant genetic information. Most importantly, we show that the first 50 PCs, which have been removed in previously published papers to correct for batch ef- fects [77, 157, 158, 163, 165, 169, 170] contain both a considerable proportion of genetic variation influencing gene expression (Figure 2.4) and also an enrichment for biological networks. The consid- erable genetic variation found within the initial PCs cautions against the removal of such components

73 in the dataset due to the potential loss of genetic information.

We also show that batch effects are distributed across the first 59 PCs with varying effect sizes. As these initial PCs contain a combination of both genetic and batch effects, there appears to be a trade- off between removing batch effects and removing biologically relevant data when employing PC correction. Removing higher number of PCs improves batch effect correction while at the same time increasing the amount of genetic variation that is removed (Figure 2.6 and Table 2.1) and the removal of a smaller number of PCs (as in some studies e.g. [164]) can lead to the incomplete removal of batch effects. Whilst the exact proportions of genetic and batch effect variance observed in the PCs here are unique to this dataset, similar patterns of variance distributions are expected to be present in other high-throughput expression datasets.

PC correction became prevalent in gene expression studies after being used to correct for expression heterogeneity by a method called Surrogate Variable Analysis (SVA) [170]. This method corrects for noise in the dataset that is not accounted for by the primary variable of interest, which may be different experimental conditions or genes. It has been demonstrated as an effective means of enhancing the genetic signals of interest and minimizing false discovery rates. One fundamental difference between SVA and PCA correction is that the principal components were generated on the residuals of the data that was corrected for the primary variable of interest. These principal components contain residual ’noise’ in the dataset and their subsequent removal enhanced the power to detect signals associated with the primary variable. Later studies removed principal components from the whole dataset without former corrections for variables of interest [77, 163–165]. We have demonstrated here that this can remove genetically driven variation in the gene expression dataset that could be of interest to the researcher, due to the ability of PCs to pick up multiple sources of variation [157,173].

As the PCs pick up linear combinations of factors in the dataset, they also have the power to detect gene regulatory networks and co-expressed modules [247]. Gene regulatory networks have a con- siderable impact on disease as opposed to single gene changes [234], as groups of genes are known to interact and respond to environmental perturbations together [238]. These networks are made up co-expressed gene modules that are robust to environmental changes and even different microar- ray platforms [228]. Principal components have been used to quantify regulatory SNPs governing metabolic networks in both yeast [226] and human data [233]. We observed in our dataset that there was an enrichment for biological networks within the initial PCs. These were dominated by immune function processes, which have been detected previously with clustering analysis of gene expression data [228] as well as metabolic, ribosomal, protein and heme biosynthesis pathways. As these PCs

74 also demonstrate a high heritability, this indicates that a considerable proportion of the variance in these PCs is being explained by biological factors. The removal of such PCs to compensate for batch effects would have led to the removal of this interesting information. It has also been demonstrated previously that PC correction also removes a large proportion of covariance in a dataset, which could constitute these gene networks and interactions [157].

Our results show that PC correction negatively impacts both average probe heritability (Figure 2.6) and the number of eQTLs hits detected (Table 2.1). While the removal of a larger number of PCs improved batch effect correction (Figure 2.1) and increased the number of eQTLs identified in the dataset (Table 2.1), it had a much larger impact on mean probe heritability. This genetic variation could be comprised of additional factors such as genetic covariance contributing to gene networks that may not necessarily be found in an eQTL study. Linear model correction, on the other hand, demon- strated superior eQTL detection and retained a much higher probe average heritability. Therefore, if recorded processing dates are present, these values can be used to correct for batch effects in the data by incorporating them as covariates in a linear model [172]. This improves eQTL detection [165] but ensures that, unlike PC correction, large amounts genetic variance are not removed (Figure 2.6. Strong associations with PCs explaining large proportions of variance in a dataset can be used to se- lect for appropriate batch effects if a large number of different factors have been recorded [269]. We demonstrate here that linear model correction also has the advantage over PC correction in that it is more effective in correcting for known batch effects (Figure 2.3).

We have shown that the removal of PCs from gene expression datasets to correct for batch and en- vironmental effects should be treated with caution, as many different sources of variation can be present within them. This comes from the ability of PCs to detect many linear combinations of trait values [261]. As it can be difficult to distinguish between which factors are driving the PCs without the prior knowledge of the batch and environmental trends in the dataset, removal of PCs runs the risk of removing biologically interesting data. However, removal of the first 50 PCs can sometimes provide a useful method to account for technical artifacts that can lead to spurious associations when no batch effects have been recorded. Our results here show that it is clearly preferable to record such information during the generation of the data and correct for them using standard linear ap- proaches [154].

75 Chapter 3

Seasonal effects on gene expression

The work presented in this chapter has been published in:

Goldinger A, Shakhbazov K, Henders AK, McRae AF, Montgomery GW, Powell JE: Season effects on gene expression. PloS One 2015, 10:e0126995

76 3.1 Abstract

Many health conditions, ranging from psychiatric disorders to , display notable seasonal variation in severity and onset. To understand the molecular processes underlying this phe- nomenon, we have examined seasonal variation in the transcriptome of 606 healthy individuals. We show that 74 transcripts associated with a 12-month seasonal cycle were enriched for processes in- volved in DNA repair and binding. An additional 94 transcripts demonstrated significant seasonal variability that was largely influenced by blood cell count levels.

These transcripts were enriched for immune function, protein production and specific cellular markers for lymphocytes. Accordingly, cell counts for erythrocytes, platelets, neutrophils, monocytes and CD19 cells demonstrated a significant association with a 12-month seasonal cycle.

These results show that seasonal variation is an important environmental regulator of gene expression and blood cell composition. Notable changes in leukocyte counts and genes involved in immune function indicate that immune cell physiology varies throughout the year in healthy individuals.

3.2 Introduction

The expression levels of the majority of transcripts are heritable [40, 42–44]. However, environ- mental variation, which can be considered as 1-H2 contributes the largest source of variation [44]. Identifying and quantifying the influence of the environmental factors can provide a more thorough understanding of the differences in expression levels between individuals and between populations. Knowledge of environmental effects will also provide information on gene function and potentially gene-environment interactions, which will aid in understanding how expression profiles are affected by certain environmental conditions, such as geographical location [270].

One environmental factor that has a well-documented effect is seasonal variation. Changes in gene regulation have been associated with seasonal effects such as photoperiod in animals [271, 272] and plants [273, 274]. Previous research into the effect of seasonal variation has identified a relationship between circannual patterns and clusters of genes, most likely stemming from seasonal variation in blood cell counts [275].

Seasonal changes include, but are not limited to, differences in day length [276], Ultraviolet (UV)

77 index [277], humidity [278] and temperature [279], all of which could potentially influence expres- sion levels either independently or interactively. Several health conditions are affected by seasonal changes, including asthma [280], cardiovascular disease [281], depression [282], diabetes [283], bipo- lar disorder [284], schizophrenia [285] migraine [286] and multiple sclerosis [287, 288]. Environ- mental changes between seasons also influence rates of influenza and respiratory syncytial virus [289], and vitamin D deficiency has been attributed to seasonal UV changes [290, 291]. Ad- ditional factors that can create seasonal trends in physiological state are behaviours changes such as staying indoors [289], exercise [292] and diet [293].

Identifying genes, the expression levels of which change in response to the season could potentially shed light on some of the mechanisms that might be driving these health conditions. Here we report results from a systematic, genome-wide analysis of the effect of season on gene expression levels in a human population. We identified significant blood cell count changes in erythrocytes, leukocytes, and platelets associated with seasonality and enrichment for expressed cellular gene markers for lympho- cytes. Furthermore, after correcting for blood cell counts, we identified 135 probes whose expression levels were significantly associated with a 12-month seasonal cycle.

3.3 Materials and Methods

3.3.1 Samples

This study comprised of 606 individuals from 246 families in the BSGS [42]. Genotype, gene expres- sion and cell counts were measured for each individual.

3.3.2 Genotyping

All individuals were genotyped on an Illumina 610-Quad Beadchip (Illumina Inc, San Diego, USA) by the Scientific Services Division at deCODE Genetics, Iceland. Full details of the genotyping procedure are given in [42].

78 3.3.3 Gene expression

RNA levels were measured from whole blood collected with a PAXgeneTM tube (QIAGEN, Valencia, USA). The expression levels were quantified on an Illumina HumanHT-12 v4.0 Beadchip. Samples were randomized across the chip to minimize batch effects due to family, sex, and age. Full details of the procedures used to generate the expression levels are given in [42]. Pre-processing of the microarray data, including calculation of average bead signal, removal of outliers and the calculation of detection p-values, was performed in the Illumina software Genome studio. Probes were considered significantly expressed above background noise at a p<0.05. All probes falling below this threshold were considered non-expressed and denoted as such for further analysis. Probes that did not map to characterized Ref-Seq genes were removed. Probes with non-expression in > 50 of samples were excluded, leaving 13,311 probes for further analysis.

3.3.4 Cell counts

Cell counts measured in BSGS include individual measures of single cell types, along with measures representing a composite of multiple cell types. For example, total white blood count includes mea- sures of several cell types such as monocytes, lymphocytes, basophils, neutrophils, and eosinophils. We chose to correct for the individual blood cell types, rather than composite measures. The cell types that were selected for correction were red blood cell (RBC), Platelet (PLT), Monocyte (MONO), Basophil (BASO), Neutrophil (NEUT), Eosinophil (EOS), B-cell (CD19), Two subtypes of T-cell (CD4), T-cell (CD8) and Natural Killer Cell (CD56). Cell counts were log transformed and converted to z-scores. Linear regression was used to correct expression levels for effects due to the cellular composition.

3.3.5 Normalization

A rank-based Inverse Normal Transformation (INT) was used to transform probe expression to a normal distribution. The normalization was done using the R package GenABEL [294].

As the BSGS contains related individuals, the polygenetic (cryptic and family) effects were removed by fitting the relationship matrix (A), determined using an Identity-By-State (IBS) Genomic Rela- tionship Matrix (GRM) calculated in the software package, Genome-wide Complex Trait Analysis (GCTA) [295]:

79 y1 = g + e1 (3.1)

g ∼ N( ,A 2) e ∼ N( ,I 2 ) Where 0 σg and 1 0 σe1 . Variation in expression levels can be attributed to batch effects such as chip and chip processing. Corrections were made for batch effects using (Equation 3.2):

y2 = Xβ + Zb + e2 (3.2)

Where y2 = e1, Z is the incidence matrix for the chip ID fitted as a random effect (b) with b ∼ N( ,I 2) e ∼ N( ,I 2 ) X 0 σb and 2 0 σe2 . Fixed effect covariates ( ) were selected from a list of recorded batch effects using forward stepwise regression with model selection based on the lowest Akaike informa- tion criterion (AIC). Covariates that had a significant association with gene expression levels included chip position, age, sex, and length of sample storage, which is the difference between the date that the tissue sample was collected and the date that the mRNA was extracted. The values calculated from e2 (Equation 3.2)were used for further analysis and referred to as the ”uncorrected” dataset.

3.3.6 Cell count correction

The expression dataset was corrected for cell counts using (Equation 3.3):

y3 = Xβ + e3 (3.3)

y = e , e ∼ N( ,I 2 ) X Where 2 2 3 0 σe3 and is the fixed effect cell count covariates selected previously. The values obtained in e3 (Equation 3.3) were used for further analysis and referred to as the ” corrected” dataset.

3.3.7 Conversion to time series

BSGS tissue samples were collected over a six-year period, between February 2005 and March 2010. Expression levels and cell counts were averaged by the month of sample collection, creating a monthly time series for each probe. Of the 606 samples, 597 were collected between February 2005 and

80 February 2008. The remaining nine samples, which were collected between March 2008 and March 2010, were excluded from further time series analysis due to the low number of samples per month.

3.3.8 Seasonal decomposition

The gene expression and cell count time series data were decomposed into seasonal (s), trend (t) and error components (ε) using loess function [296].

y3 = g(t) + g(s) + e3 (3.4)

Where y3 are the residuals, e2, from (Equation 3.3). g(s) and g(t) are estimated by loess smoothing functions, which allow the estimation of repeating periodic variation without any constraint to a par- ticular cyclical pattern. The trend component represents the overall changes that occur over the whole time series.

3.3.9 Autocorrelation

Autocorrelation within time series data indicates the presence of periodic repeating patterns. A Ljung- Box test [297] was used to test for significant levels of autocorrelation in the g(s) estimates:

h r 2 Q = n(n + 2)∑ k (3.5) k n − k

Where n is the sample size, k is the lag, rk is the autocorrelation, and h is the number of lags [297]. The test statistic (Q) follows a chi-square distribution with h degrees of freedom.

3.3.10 Cosinor regression

Cyclic seasonal patterns, which have periodical cycles repeating over set time frames, can be modeled by the cosine function:

2πt f (t) = a × cos[( ) − θ] (3.6) T

81 where t = month (1-12 for January to December), T = time period (in months) over which the cycle repeats, a = amplitude and θ = horizontal shift or phase of the cosine function [298]. This transformation creates the cosinor regression model [299].

2πt 2πt y = β + β × sin( ) + β × cos( ) + e (3.7) 4 o 1 T 2 T 4

Where y4 = s, the seasonal component from (Equation 3.4). Cosine and sine curves with repeating cycles (T) of 12 months was fitted. Significance was determined with Analysis of Variance (ANOVA) F-statistic and multiple testing accounted for using a Bonferroni correction.

3.3.11 Measured weather variables

The cosinor model was applied to following environmental variables that were recorded monthly: mean maximum temperature, mean minimum temperature, mean daily ground minimum temperature, mean rainfall, mean number of rainy days, maximum wind gust speed, mean daily sunshine, mean daily solar exposure (MJ/m2), mean number of sunny days, mean number of cloudy days, and mean daily evaporation (all obtained from the Australian Bureau of Meteorology - http://www.bom.gov.au/) and mean UV levels (obtained from the Australian Radiation Protection and Nuclear Safety Agency - http://www.arpansa.gov.au/). The association of weather variables to a 12-month repeating cosine curve was determined with a Kendall tau rank correlation test. Kendall tau rank correlation is a nonparametric test that determines dependence between two ordinal variables. Weather variables for all samples are provided in (Table 3.1).

3.3.12 Biological enrichment analysis

Probes that demonstrated significant association to seasonal cyclic variation were tested for biological enrichment using DAVID (v6.7) [256, 257]. Functional annotation clustering was used to identify groups with shared annotation. Statistical significance of the clusters is given by an enrichment scores, where a score >= 1.3 is equivalent to a p-value of 0.05. Molecular pathways were identified from the KEGG pathway enrichment implemented by DAVID functional annotation tool. Significance of GO- terms obtained from KEGG pathway enrichment and the functional annotation chart was determined using modified the Fisher’s exact test [256, 257], corrected for multiple testing using the Benjamini- Hochberg FDR [255].

82 3.3.13 Blood cell specific markers

Enrichment for blood cell-specific markers was determined using the userListEnrichment() function in the WGCNA R package [229]. This function compares the gene lists obtained from the seasonal analysis to 11 other gene lists for known indicators of blood cell variation and batch effects found in previously published studies. The lists include a 224,66,23,74,51,988,56 for B-cell, CD4, CD8 T-cell, Natural Killer Cell (NK) cells, neutrophil, red blood cell and reticulocyte cellular abundance. Two lists for platelets were present, which included 48 and 431 genes and a composite measure comprising of 57 genes for lymphocytes, which includes B-cells, T-cells and NK cells. An additional list was available which included 30 genes associated with the batch effect ”time of blood draw.” Significant overlap between each list of genes and the cell markers was determined using a hypergeometric test. Significant enrichment was determined by a study-wide significance threshold of p < 0.05/11 [300].

3.4 Results

3.4.1 Decomposition of time series data

The BSGS dataset [42], comprising gene expression levels for 606 individuals and 13,311 probes, were decomposed into seasonal, trend and irregular (remainder) components using the loess smooth- ing function (Figure 3.1 and Methods). This enabled regular cyclic components for each probe to be isolated from residual or background noise.

83 Figure 3.1: Time series decomposition for TRIM23 (ILMN 1752741) using loess decomposi- tion.Original = The raw time-series data for the probe. Seasonal = The regular cyclic component. Trend = The linear drift over time. Remainder = The irregular (error) component that is not explained by the seasonal and trend components.

3.4.2 Effect of season on gene expression

Cosinor regression was used to test for the effect of season (based on when the expression levels were sampled) for each of the time series adjusted probes. Cosinor regression is a linear model with sine and cosine terms that estimate the parameters of repeating cyclic variation across multiple phases (see Methods). To investigate the effect of season on blood cell counts, we performed the cosinor regression analysis on expression levels that had been adjusted for cell counts (”corrected”, see Methods) and unadjusted (”uncorrected”).

84 Significant associations with season at study-wide threshold of p < 0.05/13,311 were identified for 169 (uncorrected) and 135 (corrected) probes (Table 3.1). The significant probes from these models also demonstrated significant autocorrelation, an alternative statistical test for repeating patterns, in 160 (uncorrected) and 121 (corrected) probes (Table 3.1). Of these probes, 75 (approximately 50% of the significant seasonal probes) were shared between the uncorrected and corrected datasets.

The probes that were significantly associated with season were located throughout the genome (Figure 3.2), indicating a diverse range of probes affected by seasonality. The mean proportion of phenotypic variation in expression levels explained by the seasonal effect was 0.13 (uncorrected) 0.12 (corrected) (Figure 3.3). The variance explained by the models (R2) values for probes with significant levels of association were much higher with both corrected and uncorrected datasets having and R2of 0.59 for (Table 3.1).

Table 3.1: Significant probes for the 12 month cosinor regression. Results for 12-month cosinor model and autocorrelation on two datasets: not corrected (NC) and corrected (CC) for cell count. Probes = number of significant probes for the 12-month cosinor model. E(R2) and Std(R2) = statistics for the R2 values obtained from probes significant for the 12-month cosinor model. Autocorrelation = The number of probes with significant autocorrelation.

Dataset Probes E(R2 ) Std(R2 ) Autcorrelation

NC 169 0.59 0.07 16

CC 135 0.59 0.07 121

85 Figure 3.2: Manhattan plot of the cosinor seasonal analysis. The -log10(p) of each cosinor re- gression model is plotted against the chromosomal location of each probe. Bonferroni correction significance line is added. A) Not corrected for cell count B) Corrected for cell count. Includes autosomal chromosomes 1-22, X(23), Y(24) and Mitochondrial(25).

86 Figure 3.3: Histograms showing the distribution of the variance explained by the model.R2 Blue denotes probes with a statistically significant association of gene expression levels and cyclic variation A) Not corrected for cell count B) Corrected for cell count.

87 3.4.3 Cell count seasonality

Cell counts for ten blood cell types representing distinct subgrouping in erythrocytes, platelets, gran- ulocytes, monocular cells and lymphocytes were selected for an association with seasonal variation using cosinor regression with a 12-month repeating cycle. Five cell types demonstrated a signifi- cant association with season: Erythrocytes ( p=1.78e-3, R2=0.3), Platelets (p =5.46e-6, R2=0.51), Neutrophils (p=4.41e-3, R2=0.27), Monocytes (p=1.58e-5, R2=0.48) and CD19 cells (p=4.89e-6, R2=0.51). The R 2 value denotes how much variance in the cell counts the linear model explains. The fitted 12-month seasonal cycle explained between 30-51% of variance in the cell counts. The change in cell count levels throughout the year demonstrates differing seasonal highs and lows (Fig- ure 3.4). The CD19 cells, Monocytes, and Platelets share a similar seasonal cycle that peaks in autumn and drops in spring. Red blood cells and Neutrophils demonstrate a slightly shifted pattern peaking in late winter/early spring and dropping in summer. These seasonal patterns have been reported before for platelets [301] and red blood cells [302].

88 Figure 3.4: Seasonal variation in cell count. Five cells types that demonstrate significant association 0.05 to seasonal variation (p < 10 , Bonferroni correction). The black lines represent the fitted values in a cosinor regression. The red lines represent the actual cell values. From these figures, it is evident that the cells follow complex repeating patterns of peaks and troughs throughout the year. However, it can be observed that they show a consistent seasonal trend following one clear peak and trough per year. These values were collected over a three-year period and are plotted in sequential order. The year of sample collection is labeled in the axis as a number (5-8) after the month, corresponding to 2005, 2006, 2007 and 2008 respectively.

89 3.4.4 Environmental variables

The cosinor regression models a regular cyclic wave that represents natural seasonal variation. For 12 measured environmental conditions ranging from temperature to UV exposure, there was a significant association with a 12 month repeating cosine curve (Figure 3.5) with tau-rank correlation coefficient of τ ˜1 for temperature (ground, maximum and minimum), τ ˜0.7 for UV, clear days, cloudy days, evaporation, rainy days, rainfall and solar exposure and τ ˜0.16 for wind speed and hours of sunshine (Figure 3.5). The 12-month cycle closely represents environmental conditions, in particular tem- perature, and could account the effect of seasonal environmental changes on gene expression levels (Figure 3.5).

Figure 3.5: Standardized monthly values for 12 weather conditions. Measured weather variables that exhibit seasonal variation in Brisbane (black dots and connecting lines). The red dots represent the cosine curve with a 12-month repeating cycle.

90 3.4.5 Enrichment analysis

We next sought to identify a shared biological function of genes exhibiting significant levels of cyclic seasonal variation by performing an enrichment analysis using the DAVID [257].

DAVID Functional Annotation of the significant seasonal probes for the uncorrected dataset showed enrichment for several KEGG pathways and GO terms related to immune function. This included a number of autoimmune disorders (autoimmune thyroid disease (p=2.3e-3), type 1 diabetes (p =4.77e- 4), asthma (p=2.1e-4)), antigen process and presentation (p=1.97e-3) (Figure 3.6) cellular activation, differentiation and development (cluster enrichment score 1.8) as well as Major Histocompatibility Complex (MHC) class 2 immune response (cluster enrichment score 2.28) (Table 3.2). There was also enrichment for protein production and modification including translation, post-translation modi- fications and localizations (cluster enrichment score >1.4). Cellular components involved in protein production, the endoplasmic reticulum (p=1.24e-3) and the Golgi apparatus membrane (cluster en- richment score 1.44) were also enriched (Table 3.2). However, The significant seasonal probes for the corrected dataset only showed enrichment for DNA repair and binding (Table 3.2), a pathway that was also identified for the uncorrected seasonal probes.

91 Figure 3.6: KEGG enriched pathway for Antigen Processing and Presentation. Significant sea- sonal genes in our study are highlighted with red stars.

92 Table 3.2: Table C2C2: Biological enrichment for 12-month cosinor model. Dataset = Datasets corrected (CC) and not-corrected (NC) for cell count. General process = Overall biological function , Terms= GO terms or pathways, Significance = Determined with a Cluster Enrichment Score (CER) or FDR corrected p-values . Dataset General process Terms Significance CC DNA DNA repair CER= 1.68 CC DNA DNA binding CER= 1.5 NC Protein production and modification Acetylation p=3.21e-5 NC Protein production and modification Ubiquitin CER= 1.64 NC Protein production and modification Ribosome biogenesis CER= 1.48 NC Protein production and modification Protein localization CER= 1.46 NC Protein production and modification Translation CER= 1.44 NC Protein production and modification Peptidase activity CER= 1.42 NC Cellular component Endoplasmic reticulum p=1.24e-3 NC Cellular component Golgi apparatus membrane CER= 1.44 NC Immune response Allograph rejection p= 4.19E-4 NC Immune response Asthma p=0.00021 NC Immune response Intestinal immune network for IgA production p=4.29e-4 NC Immune response Type 1 diabetes mellitus p=4.77e-4 NC Immune response Autoimmune thyroid disease p=2.3e-3 NC Immune response Antigen processing and presentation p=1.97e-3 NC Immune response Lymphocyte differentiation CER= 1.6 NC Immune response Antigen processing and presentation CER= 1.5 NC Immune response Immune cell activation, differentiation & development CER= 1.8 NC Immune response MHC class two immune response pathways CER= 2.28 NC DNA DNA binding CER= 1.75 NC DNA Nucleotide metabolism CER= 1.67 NC Cellular function Apoptosis CER= 1.93

3.4.6 Cell-specific mRNA markers

As the expression levels were measured in whole blood, which is composed of many cell types, we attempted to determine whether a specific cell type drove the seasonal expression patterns. Using the userListEnrichment() function incorporated in the WGCNA R package, we tested 11 lists of different blood cell markers for enrichment. This analysis revealed that the non-corrected seasonal probes were significantly enriched for lymphocyte markers (Table 3.3) after Bonferroni correction (0.05/11 test sets). After correction for cell count, no association to any cell markers were observed, indicating that cellular markers can reflect the cell counts present for the individuals. This demonstrates that fluctuations in cell count can be observed within the transcriptome through the presence of specific

93 cellular markers and also through the enrichment of seasonal genes involved in immune function.

Table 3.3: Gene list enrichment analysis for blood cells. . Enrichment for blood cell signature was found using the userListEnrichment function in the WGCNA R package.

User Defined Categories Type No. Corrected Genes Genes p-values Bcell Blood (composite) Blood 31 3.52E-06 BANK1, BCL11A, C22ORF13, C4ORF34, CCDC106, CCR6, CD24, CD79A, CD79B, CXXC5, CYBASC3, EIF2AK3, GJB6, GNB5, HLA-DOA, HVCN1, ITPR1, IVD, MEF2C, NOC3L, P2RY10, PACAP, PNOC, SMARCB1, SP100, SPIB, TLR10, TPD52, TTC21A, ZDHHC23, ZNF165 Lymphocytes genesCorrelatedAcrossIndividuals Blood 11 3.20E-02 BTG1, CD74, CD79A, CSF1R, HLA-DMB, Whitney HLA-DPA1, HLA-DRA, HLA-DRB4, MS4A1, SPIB, TCL1A

3.4.7 Discussion

There are seasonal differences in the prevalence and severity of conditions such as psychiatric dis- orders [280, 282, 284, 285, 303] and cardiovascular diseases [281, 304]. The effect of seasonal vari- ability has also been recorded for several molecular phenotypes such as homovanillic acid [305], serotonin [306, 307], monoamine neurotransmitters [308], 25-hydroxyvitamin D3 [309] and N-3 poly-unsaturated fatty acid [310]. Gene expression provides an intermediate phenotype between the genome and higher order phenotypes such as metabolites and can provide clues as to the underlying biological functions that are being altered in response to seasonal changes.

We investigated whole blood gene expression for seasonal variability in a cohort of 606 individuals. Using cosinor regression models we were able to identify 135 probes (1% of all probes tested) that showed significant seasonal variation after being corrected for blood cell composition. Significant autocorrelation, an alternative statistical technique, was also identified for 90% of these probes further confirming the presence of repeating cyclic trends in expression levels of numerous transcripts. To examine how this impacts seasonal gene expression, we examined the expression levels with and without corrections for cell counts.

Transcripts showing seasonal variation in the uncorrected data were enriched for immune pathways

94 and protein production. The enriched KEGG immune pathways included (Table 3.2) allograph rejec- tion, antigen processing and presentation (Figure 3.6), lymphocyte differentiation, immune cell acti- vation, differentiation, and development, MHC class two immune response and autoimmune diseases; type 1 diabetes, autoimmune thyroid disease, and asthma. The enrichment for immune responses would indicate a reaction to a seasonal , of which influenza is one of the most prevalent. In- fluenza outbreaks occur primarily in winter [289]. Influenza peaks are concordant with the significant seasonal peaks (in late autumn/early winter) and troughs (mid to late spring) observed for MONO and CD19 immune cells as well as for PLT. The autoimmune pathways enriched for these transcripts are likely to represent diseases that arise from disorders in the same immune pathways used for fighting . However, these pathways have also been shown to exhibit seasonal effects. Asthma also displays similar seasonal trends with autumn peaks and summer troughs [311–313]. has demonstrated similar seasonal patterns, with the highest frequency of new patients occurring in winter and the lowest in summer [314]. While documentation on the seasonal effect of autoimmune thyroid disease is minimal, there is a link between the condition and vitamin D deficiencies [315]. Vitamin D serum levels can exhibit considerable seasonal variation, which can be related to seasonal changes in sun exposure that are higher in summer months and lower in winter [316]. These results suggest that seasonal variations in cell counts may be occurring in response to environmental stimuli such as sun exposure and pathogen exposure, which in turn lead to the presence of various diseases.

Asthma is a chronic inflammatory disease condition, which has been observed to exhibit seasonal variability [280], potentially through the cellular mechanisms identified here. Other cellular function, such as protein modification and apoptosis, were found in the uncorrected gene expression dataset (Table 3.2). There were 74 seasonally associated genes shared between the corrected and uncorrected datasets, and these demonstrated enrichment with for DNA binding genes. The enrichment for DNA binding among these seasonal transcripts suggests that transcripts encoding genes involved with DNA binding experience significant seasonal variation, independent of seasonal fluctuations in cell count.

A previous study by De Jong et al. 2014 [275] identified three modules, comprising 5,062 probes (mapping to 1,458 unique genes), that were associated with cyclic seasonal patterns. However, these probes were primarily driven by changes in red blood cells and platelets [275]. Here, we did not identify cell type specific gene signatures for red blood cells but instead identified enrichment for leukocytes markers. This difference could be attributed to the single-gene approach we employed and that we only shared 2,406 genes (18% of our dataset) with the De Jong et al. analysis [275].

Here have demonstrated that the 12-month cosinor regression has perfect correlation (τ ˜1) with tem-

95 perature, a high correlation (τ ˜0.7) with UV index, number of clear, cloudy and rainy days, evapora- tion, rainfall and solar exposure, and a low correlation (τ ˜0.16) for wind speed and hours of sunshine. This relationship suggests that temperature could be an important factor driving the seasonal variation in gene expression levels identified with the 12-month cosinor regression model.

A limitation of this study is that each time point represents the mean expression levels of a group of samples collected during the same period. Therefore, estimates of effects represent population variation, rather than intra-individual variation. To more accurately assess the impact of seasonal environmental factors of gene expression, repeat measures should be collected for samples throughout the year.

3.4.8 Conclusion

Our results demonstrate the effect of seasonality on cell count and gene expression levels. We ob- serve that the cellular composition of erythrocytes, platelets and leukocytes varies throughout the year, following seasonal patterns. This trend was also evident in gene expression levels, with signifi- cant seasonal changes in the expression of genes involved in immune function and protein translation. However, we are unable to determine the direct route of causation for these transcriptional changes. Seasonal transcriptomic variation that are independent of cell counts, were enriched for DNA binding, a biological process that is essential to transcriptional control. Collectively, our results show that sea- sonality is an important environmental regulator of physiological processes, which can be identified through transcriptional variation.

96 Chapter 4

Common genetic control drives the co-expression of mRNA transcripts

4.1 Abstract

Amongst the expression levels of mRNA transcripts, there is an extensive correlation structure, arising from the co-regulation of genes. Statistical and computation methods can be used to identify discrete modules of systematically co-regulated transcripts, based on phenotypic correlations (rP), which form networks that share biological processes and functions. These modules, robustly identified in previous studies, have been utilized to classify individuals based on defined environmental and immunological conditions. Here, we sought to determine if the rP driving the formation of expression modules is due to common genetic regulation shared between pairs of transcripts. We firstly identified transcripts in whole blood that comprise co-regulating modules of gene expression. Using the GREML estimation method implemented in GCTA [22], we calculated significant heritability measures for the majority of transcripts in the modules. We extended these methods to estimate the degree of shared genetic control between pairs of transcripts - termed a rA. All pairs of transcripts within a module have highly significant rA (p<0.001), suggesting that shared genetic control strongly influences expression networks. Furthermore, we identified and replicated trans -eQTLs that were associated with two or more transcripts within a given module, further supporting the evidence for shared genetic co- regulation of transcripts within a module. Finally, we investigated tissue specificity of the genetic control of co-regulation using expression levels measured in subcutaneous fat, LCLs, and skin. We show that there is little evidence for shared genetic co-regulation across different tissues. Our results provide strong evidence for the role of shared genetic variation in the coordination of regulation across mRNA transcripts, acting in a tissue-specific manner.

97 4.2 Introduction

The majority of Gene transcripts function together with other genes, either due to regulation by factors such as miRNAs or TF or by shared functionalities that occur after translation, i.e., part of a metabolic network. The prevalence of co-expressed genes within the transcriptome is indicative of vast networks of co-regulated genes that interact with each other and share biological functionality. Computational and statistical methods have been used to identify transcriptional networks, and use these to elucidate some of the cellular processes underlying complex diseases and traits in humans [228, 317–320].

By using the covariance structure amongst the expression levels of transcripts, dimension reduction methods [37,114,115,317] can identify modules of correlated genes [228,229,231,321–323]. Chauss- abel et al. 2008 [114] used the correlation structure in peripheral-blood mononuclear cell (PBMC) gene expression. Patterns of up and down regulation in modules were able to distinguish patients with various degrees of disease activity from control patients in several conditions including Systemic Lu- pus Erythematosus (SLE), melanoma, immunosuppression from liver transplantation, Streptococcus pneumoniae infection, Crohn’s disease and ulcerative colitis [228]. Preininger et al. 2013 [114] estab- lished that these 28 coexpressed disease modules were present in healthy individuals. The presence of these axes in healthy individuals indicates that the architecture of gene expression variation is highly structured and constrained along regulated axes of variation. A person’s transcriptomic values within these axes of variation can provide value information on their metabolic or immune status that can change over time depending on specific environmental conditions. These values can highlight a pre- existing medical condition, indicate the degree of susceptibility to disease or can be used to map the progression of an illness and response to drug treatments.

These modules were condensed down from 28 modules obtained in patients with various autoimmune and bacterial [228]. Within two datasets of comprised of healthy individuals, the CHDWB containing 189 samples and a study consisting of 208 samples from Morocco [270], these 28 modules were condensed down into six distinct meta-modules. After correcting for these six modules in the transcriptome, the authors identified three additional modules from the unexplained residual variance. 7,538 transcripts (out of 14,343 tested) were significantly associated (after Bonferroni correction) with the first PC in one or more of the modules in both the CHDWB and Morocco dataset. In total, these nine modules captured up to 39% and 51% of the total transcriptome variance in the CHDWB and Morocco cohorts respectively [114].

Since these modules had demonstrated great utility in as a diagnostic procedure, Preininger et al.

98 2013 [114] wanted to select a smaller number of representative SNPs for each module. The high correlation structure present within these transcripts would mean that the selection of representative transcripts is possible. One selection criteria for the representative transcripts was an association with only one module. From this list, the transcripts with the highest correlation values became the representative transcript for each module. These 90 probes had Pearson or phenotypic correlation values >0.88 (px10−70 compared to the empirical null correlation values obtained from 10,000 sets of randomly sampled transcripts) These transcripts were named blood informative transcripts (BITs). This number of transcripts was sufficient to recreate the correlation structure (axes of variation) for the entire module. The selection of ten transcripts per modules, 90 transcripts in total, is ideal for the creation of a Real Time, Reverse Transcription Polymerase Chain Reaction (qRT-PCR) panel applicable for use in a hospital or clinic.

These BITs showed promise as a diagnostic tool, classifying patients with latent and pulmonary tuberculosis [114]. They could also classify individuals based on other environmental conditions, such as different geographical location [114, 115], demonstrating the ability of the PCA axis scores derived from each BITs module to classify individuals based on particular environmental condi- tions [114, 115]. The gene expression variance of an individual can define their transcriptomic posi- tion in the nine axes of variation. Preininger et al. 2013 hypothesized that an individuals’ position on the axes could constitute a map to determine susceptibility to various types of diseases and condi- tions [114].

The correlation structure of the nine modules was replicable in seven different datasets. The presence of these strongly coexpressed genes suggests a robust regulatory mechanism driving the transcripts of interest. Covariance between the transcripts arises from a combination of additive genetic and environmental covariances (for the rest of the article, the term ”genetic” will refer to ”additive genetic” variation), which is defined (assuming all genetic-environmental covariances are zero) as:

σP(x,y) = σA(x,y) + σE(x,y) (4.1)

Phenotypic correlations between transcripts (rP), is the metric employed by Chaussabel et al. 2008 [228] and Preininger et al. 2013 [114] to cluster genes into specific modules. The covariance between 2 two transcripts (σ P(x,y) from Equation 4.1) can be written as a function of phenotypic(rP), genetic

(rA) and environmental (rE) correlations scaled by the covariance:

99 rPσP(x,y) = rAσA(x,y) + rEσE(x,y) (4.2)

2 The variance (self-covariance) measures of individual transcripts (σP(x)) can be decomposed into corresponding genetic and environmental components as per Equation 4.2. Heritability (h2) is defined as:

2 2 σA h = 2 (4.3) σP

e2 = 1 − h2 2 (4.4) 2 σE e = 2 σP

Taking the square root and rearranging Equation 4.3 and Equation 4.4 results in: σA = hσP and

σE = eσP, which can be substituted into Equation 4.2:

rPσPxσPy = rAhxhyσPxσPy + rEexeyσPxσPy (4.5) rP = rAhxhy + rEexey

rA ranges between {-1,1} and indicates the level of shared genetic variation between a pair of tran- scripts. For example, an rA = 1 suggests that the same genetic variants influence the expression of the two transcripts. The impact of rA on the measured covariance between two transcripts, rP, is deter- 2 2 mined by the degree of h between the two transcripts. The rA and h values can be estimated using genetic relationship data from families or large samples of unrelated individuals [324].

Technical artifacts, such as batch effects, can be severe confounders in gene expression studies. The application of appropriate statistical techniques can control systematic batch effects [44]. Since the covariance structure of the nine BITs modules found in Preininger et al. 2013 [114] were replicated in nine datasets, representing several populations, these modules are unlikely to be driven by random batch effects. rE represents the degree of shared environmental effects and accounts for the remainder of the phenotypic correlation (after the genetic correlation, scaled by heritability is removed). rE are very broad measurements comprised of many factors including environmental conditions that

100 influence two traits simultaneously.

In this paper, we quantified the degree of genetic and environmental contribution driving the high covariance structure and rP between the BITs in each module. We estimated rA values in the BSGS dataset [42] using 85 BITs across the nine modules (3528 transcript pairs). rP values between tran- scripts is the standard measure used to detected covariance structures in transcriptomic data. Taking the rP values between the BITs and composing them into their constituent genetic and environmen- tal components, provides a fresh approach to understanding the regulatory architecture that underlies gene co-expression.

We found that it was indeed a high genetic component that was driving the BIT modules with high rA values found across all modules in the BSGS dataset. Interrogating validated eQTL databases from previous studies [57], we identified 13 previously replicated trans-eSNPs as the physical loci driving the high rA estimates between BITs in blood. The BITs also demonstrated enrichment for TFs, indicating that TFs provide the mobile medium regulating genes across the genome. We next wanted to determine whether this modular structure was present in other tissues. We examined the presences of the modules in three additional tissue types, lymphoblastoid cell lines (LCLs), subcutaneous fat and skin from the MuTHER cohort. From this, we found that the vast majority of modules demonstrated tissues specificity to blood. Two modules (four and six), however, were present in all four tissues. These results show that the modules comprising of BIT represent biologically relevant covariance structures that are regulated by genes and represent distinct cellular processes that are both tissue specific and non-specific.

4.3 Materials and Methods

4.3.1 Brisbane Systems Genetics Study

BSGS comprises of 846 individuals from 274 extended twin families, recruited as part of the Brisbane Twin Nevus and cognition studies [42]. The QIMR Berghofer Medical Research Institute’s human research ethics committee approved these studies, and all participants gave informed written consent. Genotyping was performed on Illumina 610-Quad Beadchips (Illumina Inc, San Diego CA) by the Scientific Services Division at deCODE Genetics, Iceland. Full genotyping procedures are described in [325].

101 Whole blood was collected using a PAXgeneTM tube (QIAGEN, Valencia, CA) and RNA was ex- tracted using the whole blood gene RNA purification kit (QIAGEN, Valencia, CA). The RNA expres- sion levels were measured on Illumina Human HT-12 v4.0 Beadchips (Illumina Inc, San Diego CA). The samples were randomized across microarray chips to minimize batch effects. Full details of the RNA extraction and quantification can be found in [42].

4.3.2 Normalization

Average bead signal and detection p-values were calculated with Genome Studio software. Probes with a detection p-value < 0.05 were considered significantly expressed above background noise, and any probe with non-expression in > 50% of samples were excluded. Likewise, probes that did not map to a characterized Ref-Seq gene were removed, leaving 13,306 probes for further analysis.

Two corrected datasets were used in this analysis. Dataset ε1 from Equation 4.6 was corrected for batch effects, including chip position and ID, sex and age effects as fixed effects (X) in a linear model (Equation 4.6). This dataset was used for estimations for heritability and genetic correlations.

y = Xβ + ε1 (4.6)

Since the BSGS [42] cohort contains related individuals, which facilities the estimation of genetic components, the samples do not meet the independently and identically distribution (iid) assumption underlying phenotypic correlations (rP). A mixed model with a random additive polygenetic effect (g ) corrected for relatedness and created an iid dataset:

y = Xβ + g + ε2 (4.7)

2 2 With g ∼ N(0,Aσg ) and ε ∼ N(0,Iσε ). Where y is a nx1 vector of the expression levels of a probe, and n is the sample size. g is a nx1 vector containing the estimated total additive genetic effects from all SNPs, and A is a genomic relationship matrix (GRM) created using a SNP-based identity 2 by state method [22, 295]. The resulting σg was estimated using Restricted Maximum Likelihood

(REML) implemented in GCTA [295]. Residuals ε1 and ε2 were normalized by a Blom rank-based transformation [326,327], ensuring that the expression levels for each probe follow a standard normal distribution.

102 4.3.3 Multiple Tissue Human Expression Resource

The MuTHER dataset comprises of gene expression and genotype data from healthy female twins with mRNA collected from three tissues: LCLs, (n=777), skin (n =705) and subcutaneous fat (n=776). The MuTHER consortium supplied raw, non-normalized gene expression data, and imputed genotype data. Gene expression levels were quantified using Illumina HT12 v3.0 Beadarrays and samples geno- typed on Illumina 1M-Duo and 1.2M Duo chips. Details of the sample collection, RNA extraction and genotyping can be found in [182].

We performed the same sample quality control and normalization procedures as applied to the BSGS cohort. Initially, probes that did not map to Refseq genes were removed, and remaining probes trans- formed to the log2 scale, followed by quantile normalized to standardize the variance across indi- viduals. Batch effects, consisting of microarray run, were corrected using a linear model (Equation 4.6). Polygenic effects were corrected for with a GRM in a linear mixed model as per Equation 4.7 to ensure independent samples for rP calculations. The residuals were scaled to a normal distribution using a Bloom rank transformation [326, 327].

4.3.4 Modules

The 90 BITs identified in [114] were analysed in this dataset. In the BSGS cohort, 85 of the transcripts were present after appropriate quality control (Table 4.3, Table 4.5, see Normalisation, Methods). In the MuTHER cohort [182], all 90 transcripts were ability for analysis.

The presence of the modules was confirmed in Preininger et al. [114] through the replicable high correlation values between the transcripts. We established the presence or significance of the modules across the tissues based on the null empirical values obtained in blood. The empirical values were collected through bootstrap resampling as described in section 4.3.6 and 4.3.7. We chose to use the upper bound of this empirical estimate, the mean plus standard error, as the metric to measure the presence of a module for both phenotypic and genetic correlation values.

The GO term enrichments described in this chapter for the BIT modules were published in Preininger et al. [114]. All transcripts associated with the modules were used in the enrichment analysis to increase the power to detect enrichments.

103 4.3.5 Heritability of gene expression

The narrow sense heritability (h2) is a measure of the proportion of total additive genetic variance in each trait over the total phenotypic variance. h2 is estimated by fitting the following linear model in GCTA:

y = g + ε (4.8)

2 2 With g ∼ N(0,Gσg ) and ε ∼ N(0,Iσε ). yis a vector containing the adjusted phenotypes in (ε) from (Equation 4.6), is a vector of random polygenic effects and ε is a vector of residuals. G is a genomic relationship matrix obtained from SNP-based IBS calculations (A from Equation 4.6) with a threshold (IBS > 0.05) where all values below that specified threshold were set to 0. This leaves an approximate IBD matrix that focuses on closely related individuals [328].

4.3.6 Phenotypic Correlations

Phenotypic correlations (rP) were calculated for each pair of BITs in a module using Pearson corre- lations. rP was calculated on the BSGS and MuTHER datasets corrected for polygenic effects (ε2) (Equation 4.7) to ensure independent samples. To calculate the significance of module correlation values, we took all the correlation values between probes in the BIT modules and performed a Wilcox signed rank test with the same number of correlation values from randomly sampled probes. rP values calculated from modules comprised of ten probes randomly sampled from the whole transcriptome enabled the calculation of the empirical p-values for rP. This sampling was repeating 10,000 times with replacement. The empirical p-value was calculated as the number of p-values ≤ the nominal test p-value divided by 10,001 tests.

4.3.7 Genetic Correlations

Genetic correlations (rA) were estimated using the bivariate linear mixed model. Equation 4.8 is fitted for transcripts x and y. The variance-covariance matrix (V) is defined as:

104   2 2 Gσgx + Iσex Gσgygx V =   (4.9) 2 2 Gσgxgy Gσgy + Iσey where gx,y and ex,y represents the genetic and environmental variances of transcripts x and y respec- tively. The estimation of rA between the probes was carried out using GCTA software [295,324]. The bootstrapping method described in section 4.3.6 to calculate the null empirical value of rP was used to calculate null empirical value for rA.

4.3.8 Proportion of phenotypic correlation explained by genetic effects

rP is driven by both genetic and environmental components (Equation 4.5). The total genetic contri- bution to rP is a combination of rA, scaled by the square root of the narrow-sense heritability of the two transcripts measured (hxhy)).

The environmental component can be calculated as the remainder after the additive heritability has been accounted for and can comprise of various environmental factors and non-additive genetic ef- fects. The contribution of environmental effects can be calculated as rE scaled by the square root p 2 of the environmental remainder (ex) of the two traits (i.e. ex = 1 − hx). The proportion of genetic components driving rP are calculated as follows:

|rAhxhy| pA = |rAhxhy| + |rEexey| (4.10) |rEhxhy| pE = |rAhxhy| + |rEexey|

The proportion of rP driven by rA (pA) was estimated for each module using the expected values across all transcripts in a module. Therefore Equation 4.10 was modified to 2 |E(rA)E(h )| pˆA = 2 2 andp ˆA measures were used throughout the article. |E(rA)E(h )|+|E(rE )(1−E(h ))|

105 4.3.9 eQTL analysis eQTLs summary data was collected from the blood eQTL browser (http://genenetwork.nl/bloodeqtlbrowser/) [57]. EQTLs that occur within +/-1MB of the Transcrip- tion Start Site (TSS) are called local-eQTLs while eQTLs located elsewhere are termed trans-eQTLs [2]. Due to the location close to the gene of interest, local eQTLs are likely to arise from disrup- tions in the regulatory regions of the genes, particularly in the binding sites of transcription factors, RNA polymerase and other proteins involved in RNA transcription. The presence of long-range act- ing elements can potentially blur the boundaries between distal and local eQTLs occurring on the same chromosome [329]. Distal eQTLs can result from transcription factors or other genes that are involved in regulating the transcripts of interest and affect both alleles of the target gene. We ex- amined the transcripts in this study for known local and distal eQTL associations from a replicated meta-analysis performed in whole blood samples of 5,311 individuals and replicated in a further 2,775 individuals [57]. The authors performed this eQTL analysis by testing 15,578 transcripts for associ- ations with SNPs previously identified as GWAS hits. The focus only on GWAS hits help to narrow the focus of the study to biologically-relevant SNPs and reduced the power to detect and replicated association. Local and distal eSNPs were considered significant at an FDR < 0.05 in [57] and also within this study.

The LD between significant local and distal eSNP was calculated using the online tool SNP Anno- tation and Proxy search (SNAP) [330]. The chromosomal region and the LD between eSNPs were plotted using LocusZoom [331]. The National Human Genome Research Institute (NHGRI) GWAS Catalog was used to determine which GWAS phenotype the shared eSNPs belonged to [332].

Next, we wanted to estimate the significance of the number of common eSNPs obtained. To calculate significance, we counted the number of shared eSNPs per module of ten transcripts that would be expected by chance from sampling the Westra et al. 2013 [57] eQTL database. We re-sampled sets of ten transcripts that were present in the Westra et al. 2013 [57] eQTL database and recorded how many times a significant eSNP (FDR < 0.05) was shared between two or more transcripts in that randomly generated module. This sampling procedure was repeated 10,000 times to create an empirical null estimate of the number of common probes expected by chance within a module.

106 4.3.10 Transcription factor enrichment

Vaquerizas et al [1] published results of 1,391 manually curated sequence-specific DNA binding tran- scription factors. This study published tissue-specific regulation of the transcription factors and the Interpro DNA-binding family. We interrogated this list of known transcripts factors for the occurrence of the BITs. A Fisher’s exact test was used to calculate significance of enrichment for transcription factors in the modules. This contingency table (Table 4.2 (a)) contained: (a) the number transcription factors found amongst the BITs, (b) 1,391 known transcription factors in the genome [1] - a ,(c ) 90 BITs - a and (d) The approximate 140,000 transcripts that are known in the genome [333–335] - (b and c). Enrichment analysis for the individual modules was also performed where (c) is 10-a (Table 4.2 (b)).

4.4 Results

4.4.1 Defining the presence of modules using phenotypic correlations

The nine modules which are driven up ten highly co-expression probes, the BITs, were discovered from decomposing 28 transcriptomic modules that previously associated with changes in diseases such as SLE [228]. Within healthy individuals, these modules these 28 modules of transcripts coa- lesced into six meta-modules. After correcting for these six meta-modules, the authors identified an additional three modules independently. 90 transcripts, called BITs, (10 per module), were selected to represent all transcripts within a module [114]. These modules were replicated in nine independent data sets.

All the gene expression data sets involved in identifying and replicating the modules were whole blood. Between these five studies, the between or intra-module correlation values changed drastically, indicating a further connection between the modules does not exist. The changes in intra-module cor- relation values represent shifts in the technical generation of the data, sources of blood, platforms or environmental conditions. The BIT transcripts that make up the modules are the only transcripts with correlations that remain the same between different studies and conditions. While some mod- ules appeared to condense into meta-modules in one study, other studies did not replicate this effect. Therefore it is the specific co-expression within the modules, as defined by the BITs, which is vali- dated and examined in this study.

107 Extensive co-expression (rP) structures are ubiquitous in transcriptomic datasets. This co-expression between genes is often attributed to shared functionality within transcripts, arising from gene networks or protein complexes. By decomposing the rP into the constituent rA and rE values we can gain an understanding of the mechanisms of that are driving this covariance structure between transcripts. We employed the BSGS, a data set comprised of 846 unrelated and related individuals with whole blood microarray samples to quantify the degree of co-regulation driving these transcripts.

Initially, we verified the presence of the modules within the BSGS cohort and by examining the pairwise phenotypic correlations (rP) between BITs within the nine modules (Table 4.2, Table 4.1).

As expected, on average probe pairs within modules exhibit positive rP (0.36 - 0.89, mean 0.64) (Figure 4.1A, Table 4.2).

108 Table 4.1: Table of BIT transcript information. Gene name, Illumina ID, starting base pair (BP) and chromosome (CHR) of the BIT probes

Module Gene Probe Illumina ID BP CHR 1 BIN1 ILMN 2309245 127805917 2 1 C5ORF39 ILMN 2098616 43075279 5 1 DDX24 ILMN 1700628 94517543 14 1 GLTSCR2 ILMN 1703565 52951763 19 1 IMP3 ILMN 1733696 73718651 15 1 LAGE3 ILMN 1708151 153706351 23 1 LIME1 ILMN 2183687 61840848 20 1 OCIAD2 ILMN 1700306 48887582 4 1 SAE1 ILMN 1657204 52405148 19 1 SNRPD2 ILMN 2369785 50882589 19 2 EPB42 ILMN 1814397 41276900 15 2 GMPR ILMN 1729487 16403382 6 2 IFIT1L ILMN 1759155 91134663 10 2 OR2W3 ILMN 1811927 246126108 1 2 SELENBP1 ILMN 1680652 151337014 1 2 SLC4A1 ILMN 1772809 39682702 17 2 SLC6A10P ILMN 1704446 32888901 16 2 SNCA ILMN 1766165 90869394 4 2 TNS1 ILMN 1807919 218375994 2 3 AFF3 ILMN 1775235 100163843 2 3 BLK ILMN 1668277 11420520 8 3 CD19 ILMN 1782704 28857792 16 3 CD72 ILMN 1723004 35600249 9 3 CD79A ILMN 1734878 42385350 19 3 EBF1 ILMN 1778681 158058476 5 3 FAM129C ILMN 1664063 17525344 19 3 FCRLA ILMN 1691071 159950596 1 3 POU2AF1 ILMN 1811049 110728486 11 3 VPREB3 ILMN 1700147 22425107 22 4 BCLAF1 ILMN 2357272 136622495 6

109 Module Gene Probe Illumina ID BP CHR 4 DYRK1A ILMN 2374293 37808642 21 4 HNRPK ILMN 2378048 85773503 9 4 NPTN ILMN 1792997 71639809 15 4 PAPD4 ILMN 1681845 79017369 5 4 SLK ILMN 1700834 105763023 10 4 SRP54 ILMN 2312275 34568048 14 4 TRIM33 ILMN 1682316 114935759 1 4 WASPIP ILMN 1668417 175133267 2 4 ZFAND5 ILMN 1795228 74159604 9 5 CXCR1 ILMN 1662524 219027895 2 5 C5AR1 ILMN 1689836 47824990 19 5 NUP214 ILMN 1666049 133097510 9 5 AQP9 ILMN 1715068 56265152 15 5 PHC2 ILMN 1808047 33789971 1 5 SIRPA ILMN 2372974 1868085 20 5 TSEN34 ILMN 2368292 59389241 19 5 MBOAT7 ILMN 1722218 59369210 19 5 HCK ILMN 1791771 30153066 20 5 AMICA1 ILMN 1778723 117570158 11 6 USP49 ILMN 1680279 41873670 6 6 ZNF14 ILMN 1692145 19821836 19 6 DTWD2 ILMN 2054554 NA NA 6 NLRP8 ILMN 2075794 56499648 19 6 BLZF1 ILMN 2106658 167632205 1 6 DMC1 ILMN 2162367 38915178 22 6 N4BP2 ILMN 2222101 39833337 4 6 ZNF682 ILMN 2313889 20116068 19 6 OCIAD1 ILMN 2330495 48558245 4 6 IL17RD ILMN 2407851 57099591 3 7 IFIT2 ILMN 1739428 91058378 10 7 HERC5 ILMN 1729749 89427202 4 7 RSAD2 ILMN 1657871 6954552 2

110 Module Gene Probe Illumina ID BP CHR 7 EPSTI1 ILMN 2388547 43462293 13 7 OAS3 ILMN 1745397 111894837 12 7 IRF7 ILMN 2349061 613117 11 7 SAMD9L ILMN 1799467 92759562 7 7 SERPING1 ILMN 1670305 57138571 11 7 MX1 ILMN 1662358 41752921 21 7 DDX58 ILMN 1797001 32467874 9 8 HERPUD2 ILMN 2066348 35638909 7 8 CD164 ILMN 1783852 109794680 6 8 HIF1A ILMN 2379788 61284383 14 8 SERINC3 ILMN 1713752 43127901 20 8 MCTS1 ILMN 1751816 119628902 23 8 PPP3CA ILMN 2044226 102163876 4 8 TMEM167A ILMN 1742813 82388598 5 8 ROCK1 ILMN 1739583 18529964 18 8 SP3 ILMN 2389844 174773391 2 9 GRB2 ILMN 1748797 70833600 17 9 PTPRC ILMN 2340217 196875068 1 9 TMED4 ILMN 1804148 44619131 7 9 PTPRC ILMN 1653652 196875072 1 9 PLEKHB2 ILMN 1698323 131606970 2 9 SLA ILMN 2291954 134156378 8 9 MCL1 ILMN 1756806 150550866 1

111 Table 4.2: Summary of values for each module in Blood in the BSGS cohort.Values are averaged across all transcripts within a module. Information is 2 provided for BIT heritability (h ), Phenotypic correlations (rP) , Genetic correlations (rA)) and Shared genetic associations. Values provided are the mean or expected value (E), standard error (SE) and p-value (p-val). The column ”Number of probes” shows the number of BIT probes that were available for analysis in the BSBS cohort after QC. The column ”Number of probes with significant h2” shows how many BITs in the BSGS cohort had significant h2 estimates. The column ”eSNPs” lists all the SNPs that were significantly (FDR < 0.05) associated with BIT in this module. The column ”eQTLs” contains the number of independent regions (groups of eSNPs in high LD with each other) that were associated with the BIT in this module. For example, in module nine eSNPs made up one eQTL associated with two transcripts and two eSNPs made up another eQTL associated with three transcripts.

Module information Probe heritability Phenotypic Correlations Genetic Correlations Shared genetic associations Number of probes 2 2 2 Module Number of probes E(h ) SE(h ) with significant h E(rP) SE(rP) p-val(rP) E(rA) SE(rA) eSNPs eQTL 1 10 0.28 0.04 10 0.6 0.02 2x10−4 0.6 0.03 0 0 2 9 0.46 0.01 9 0.81 0.009 1x10−4 0.82 0.02 11 2 3 10 0.47 0.02 10 0.65 0.02 1x10−4 0.88 0.01 0 4 10 0.3 0.02 10 0.65 0.015 1x10−4 0.72 0.03 0 0 5 10 0.38 0.02 10 0.52 0.016 1x10−4 0.62 0.02 0 0 6 10 0.35 0.01 10 0.89 0.014 1x10−4 0.96 0.01 0 0 7 10 0.3 0.03 9 0.73 0.014 2x10−4 0.73 0.02 2 1 8 9 0.19 0.03 8 0.51 0.022 2x10−3 0.62 0.04 0 0 9 7 0.48 0.05 7 0.35 0.042 5x10−2 0.56 0.06 0 0 Figure 4.1: These graphs are a visual representation of the correlation values between module probes in the BSGS dataset. Panels A, B, and C are heat maps for the correlation matrices for A) phenotypic (rP), B) genetic (rA) and C) environmental (rE) value between all pairs of the BIT probes. The BITs are ordered from module one to nine. Red values represent high correlation values (˜1) and blue values represent low correlation values (˜-1). In all of the heat map plots below, the intra-module correlation values are all highly positive, demonstrating their unique correlational structure captured by the BITs. There are many low to moderate intra-module correlations, showing similarity between the modules. These values, however, are not pertinent to the module identity, for reasons described in section 4.4.1. Panel D represents the values from panels A, B, and C as boxplots. Modules are ordered on the x-axis with ”R” representing the empirical null values for the modules, collected from 10,000 randomly sampled transcripts. From D) this is evident that the correlation values in all the modules are higher than what would be expected by chance. This graph shows that rA values are the highest for all modules, while the rP and rE approximately the same.

To robustly evaluate the significance of the intra-module rP we created an empirical distribution of intra-module mean rP by calculating the mean rP from randomly sampled sets of ten probes (Figure 4.1D). The bootstrap sampling enabled the estimation of the mean and sampling variance of intra- module rP, also known as the empirical null value. For blood the empirical null rP was 0.07 with a standard error of 0.03. Using the upper bound on the empirical null estimation (mean + standard error), we determined the presence of a module when the rP > 0.1.

113 Although not all of the BIT made it through QC in the BSGS dataset, all of the nine modules were present in the BSGS cohort having between 7-10 transcripts present. The correlation values between the transcripts in the BSGS modules were statistically significant (p< 0.05), compared to the empirical null (Table 4.2). We investigated the presence of modules in skin, LCLs and subcutaneous fat from the MuTHER cohort (˜700 samples) [182] (Figure 4.2F and Table 4.3).

114 Figure 4.2: Stacked histogram comparisons across all modules and tissues. (A) Heritability (B) Phenotypic correlations, (C) Genetic correlations, (D) Environmental correlations, (E) Proportion of phenotypic correlation (p ˆA) attributed to genetics and (F) Modules that are present. The y-axis of the graph is scaled to 1 so that the contribution of each tissue can be easily visualized. The actual correlation values are labeled on the graph. In (F) 1 means the presence of a module based on the empirical null value on phenotypic correlations (all modules are significant when examining genetic correlations). The parts of the graph that are hashed out represent modules that are not significant using the modules minimum method for that particular measure. In (F) the hashed out values are the modules that are not considered to be present overall - based module minimum genetic correlations.

115 The average rP was lower for all tissues except for blood, and some modules did not demonstrate covariance higher than the empirical mean calculated within the BSGS cohort. Modules 4, 6-9, were present across all four tissues (rP >0.1). Module 1 was present across three tissues (blood, fat and LCL), Module 3 (blood, LCL) and 5 (blood, fat) are present across two tissues and Module 2 was only present within blood.

4.4.2 Genetic contribution

The rA values for blood ranged between 0.56 (module 9) and 0.96 (module 6) with an average of 0.64 indicating that between half to nearly all of the genetic variance (h2) driving transcripts is shared between transcripts. We performed a bootstrap analysis of 10,000 randomly sampled transcripts to identify the empirical null rA (0.06 with a standard error of 0.05). We used the upper bound of this estimate rA ¿ 0.11 (empirical null rA + standard error), as the measure to determine whether module rA values are notable across the tissues. We calculated the significance of the BSGS dataset using the values from 10,000 bootstrap analysis and found that all the rA across the modules were significant (p <0.05). The significant rA values found in the blood modules indicates that the same loci are controlling all the transcripts within the module. This of interest since the BIT transcripts are located across the genome and not clustered in specific hotspots. Therefore the shared loci must be exerting their effects distally.

2 The contribution of rA to the rP, is dependent on the h of the transcripts (Equation 4.5 and Equation 4.10). The average BIT h2 in blood is 0.36, which indicates that genetic regulation controls just under half of the transcript variability. The average proportion of genetic contribution to rP,p ˆA (see 4.3.8), is 0.42, ranging from 0.28 in Module 1 to 0.75 in Module 9. The remainder of the variance is driven 2 by environmental variation (e ), with an average rE of 0.58 between the BITs in the modules. While the genetic component driving the measured rP in the modules is high, these values also indicate the presence of coordinated environmental responses. The genetic coexpression could build a framework that enables simultaneous transcript responses to environmental stimuli. pˆA were marginally higher in LCLs than blood, 0.46 on average. The range of these values, however, extended to values that were lower than those found in blood modules (0.16 in module 4) and less than the maximum found in blood (0.66 in module 8). h2 estimates that are higher than blood (0.4

BITs average) and lower rE measures (0.28 BITs average) could drive the largerp ˆA values in LCL modules. Therefore a highp ˆA estimate for a module indicates that the rP is widely driven by rA,

116 2 which is influenced by a large h or particularly low rE values.

Skin demonstrated ap ˆA that was similar to blood, with an averagep ˆA of 0.4. Skinp ˆA values ranged from 0.19 in module 5, a module which was not significantly present in this tissue, to 0.64 in module

9. Fat has an unusually low averagep ˆA of 0.27. Modules 1 and 5, which were significantly present in fat, had ap ˆA value of 0, indicating that there entirely driven by environmental covariance. Thep ˆA values in fat were high (0.44 and 0.78) in modules two and three which were not present in this tissue. This result could indicate that there is the presence of discordant environmental effects that negate presence of the modules in these tissues. In the other modules present in fat (modules 4 and 6-9), the pˆA is low, ranging from 0.1 (module 8) to 0.31 (modules 4 and 9).

4.4.3 Identifying modules across tissues

Correlation values higher than the upper bound of the empirical null estimates for rP (>0.1) and rA (>0.11) represent correlation values that are higher than would be expected by chance. However, while the empirical null value describes whether the correlation values found between a set of tran- scripts is statistically significant, this metric may be overly conservative. Since the null empirical value is so low, it may not be a sufficiently high enough measure to define the large-scale covariance structure that defines BIT modules. Since the modules were verified in blood, we used the ”module minimum” values obtained from this tissue. This module minimum value is obtained from the module with the lowest values for rP (0.36), rA (0.57) andp ˆA (0.23) for blood. The minimum values of the BITs modules represent the base correlation values required for the module structure. Therefore, we used the minimum blood values to identify the presence of modules in the rest of the chapter. Since the importance of the genetic covariance to the modules is evident in blood, (average rA=0.72), it should be noted that shared genetic variance appears to be the defining feature driving the modular structure of co-expression between BITs. Therefore rA is potentially a better measure to detect the presence of a module across a tissue.

Using the module minimum metric on the rP values, only modules 4 and 6 were present in all tis- sues. When taking into account rA andp ˆA values, however, module 4 is no longer present in LCLs

(4.2C).Module 6 has the strongest co-expression of all the modules, with an average rP of 0.9 across all tissues, followed by module 4 (0.54). The strong presence of these genetic correlations between tissue indicates that there is enrichment with transcripts that are essential to cellular function these particular modules. The GO term enrichments from Preininger et al. 2013, for module 4 include GO

117 terms such as ”mRNA metabolism and splicing” and ”intracellular transport,” which are all critical cellular functions [114]. Module six had no enrichments, which could indicate a lack of specificity in the transcriptomic pathways but instead a collection of transcriptions with various capacities. The enrichment for core cellular functions are usually ubiquitous across tissues [336] and could, therefore, explain why these particular modules exhibit such a high degree of tissue non-specificity.

In LCLs, seven modules were identified based on the empirical null rP (modules 1,3,4,6,8 and 9,

Figure 4.2F). Modules 1 and 3 both had a low rP (˜0.1) and rA values (˜0.08), followed by module 7

(rP=0.14 and rA=0.22), indicating that these modules have very little co-expression and are unlikely to be driving modules. Module 4 had a moderate rP (0.44). However, both measures of rA (0.21) and pˆA (0.16) were well below module minimum and are predominantly driven by environmental effects.

Therefore using the module minimum measures of both rP and rA, only modules 6,8 and 9 appear to be present in the LCL cells, with modules 8 and 9 only present in blood and LCL and not in other tissues. Enrichments pertinent to transformed B-cells that make up LCLs, such as ”RNA processing” and ”B-cell activation,” were present in module 8. Module 9 demonstrated enrichment for the GO terms ”Signal transduction by phosphorylation,” ”Programmed cell death,” ”B-cell number” and ”T- cell morphology,” which are again relevant to LCL and whole blood tissue samples.

The remaining modules 1-3,5 and 7 were only present in blood when using the minimum module metric on both rP and rA values. Module 2, was one of the modules only present in blood (using all correlation measures) and had the highest average module rP for all the blood modules (0.81).

This module was driven by a strong genetic component (rA=0.82) and genetic contribution to rP

((p ˆA=0.47). This strong specificity of blood indicates that module 2 could potentially be a critical regulatory hub for many blood specific transcriptomic processes. GO term enrichments for module two obtained from Preininger et al. 2013 [114] provide further support for the role of module two to blood-specific transcriptomic processes. Module 2 showed enrichment for terms involving ”Oxygen transporter activity” and ”Wound healing”, which are specific to blood, namely red blood cells and platelets. With the transcript enrichment for processes that can only occur in specific blood cells, it is not surprising that module 2 was found only in the whole blood samples.

Module 1 and Module 5 had similar rP and rA values (˜(0.6)) and were both enriched for terms specific to whole blood. Module 1 showed enrichment for blood specific terms such as ”T-cell physiology” and ”Leukopoiesis.” Module 5 demonstrated enrichment for ”Cytokine receptor activity,” ”Adaptive immunity,” ”Inflammatory response” and ”Innate immunity.” Module 3 has particular high rA (0.88) andp ˆA (0.64) measures indicating a strong genetic contribution. Enrichments for blood specific terms

118 including, ”B-cell activation,” ”B-cell morphology” and ”Immunoglobulin level” were also found in module 3.

Module 7 was present across all tissues when using the empirical null value. However, it had one of the lowest rP and rA values across the tissues (0.26 and 0.28 resp. on average between three tissues). When using the module minimum values as the cut-off value for these modules, module 7 is only present in blood. In blood rP and rA values for module 7 were high (0.74 and 0.73 resp.). GO term enrichments from Preininger et al. 2013 showed that module seven demonstrated enrichment for immune responses, in particular, viral responses. These included ”viral response (to infection)” and ”interferon-mediated signaling.” The presence of module 7 only in whole blood is therefore not surprising since immune cells make up a part of whole blood samples. It would have also been expected for module 7 to be present LCL cells, which are immortalized B-cells. It is possible that the anti-viral responses enriched in module 7 are indicative of cell-mediated viral responses that originate from T-cells, NK and macrophages, which are only be present in whole blood samples.

4.4.4 eQTL analysis

We have established that is the genetic regulation that drives the high covariance observed with BIT modules. We next sought to identify the genomic regions that were responsible for the genetic regu- lation of the modules.

To determine the genomic loci that shared between BITs in a module, we queried the online database for eSNPs produced by Westra et al. [57] for loci significantly associated with probes with the mod- ules. This database contains eSNPs identified in a meta-analysis of 5,311 individuals with replication in a further 2,775. We subsequently determined the number of eSNPs shared between BIT probes in a module.

At an FDR of 0.05, nine eSNPs were shared between two probes in module 2 at an FDR (Table 4.4, Figure 4.3A). Examining the haplotype block of these nine eSNPs revealed that all the SNPs were in moderate LD (r2 ∼ 0.5) with each other (Figure 4.4), indicating that they are a part of one eQTL. These eSNPs were located on chromosome 2 within the intronic region of the BCL11A gene. BCL11A encodes a transcription factor which has implicated in a wide variety of functions including lymphoma pathogenesis [337], lymphoid development [338], fetal levels [339] and brain development [340]. The biological functions of BCL11A was reflected by GWAS traits in which the SNPs were associated. The traits include fetal hemoglobin cells (F-cell) distribution [341, 342],

119 fetal hemoglobin levels [343–345], sickle-cell anemia [346–348], beta thalassemia or hemoglobin E disease [349]. These GWAS traits all appear to be affected by the dysregualation of fetal hemoglobin that is exerted by BCL11A.

120 Table 4.4: SNPs associated with transcripts in modules two and seven. Information on the signifi- cantly shared SNPs and their associated transcripts. ”SNP CHR” is the chromosome of the associated SNP. ”Probe Gene” is the gene tagged by the Illumina probe used in this study. ”Gene CHR” is the chromosome in which the gene, tagged by the Illumina probe, is located. In module two, nine SNPs were associated with two genes. LD between the nine SNPs ranged between 0.3 and 1, indicating that all of these nine SNPs are linked, forming one eQTL. Two additional SNPs in high LD (R2 = 0.9), comprising one eQTL, were associated with three genes in module two. In module seven, two genes were associated with one SNP (one eQTL).

Module SNP SNP CHR Probe Illumina ID Probe Gene Gene CHR 2 rs4671393 2 ILMN 1729487 GMPR 6 2 rs766432 2 ILMN 1729487 GMPR 6 2 rs1427407 2 ILMN 1729487 GMPR 6 2 rs10172646 2 ILMN 1729487 GMPR 6 2 rs1896294 2 ILMN 1729487 GMPR 6 2 rs10195871 2 ILMN 1729487 GMPR 6 2 rs7557939 2 ILMN 1729487 GMPR 6 2 rs11886868 2 ILMN 1729487 GMPR 6 2 rs7584113 2 ILMN 1729487 GMPR 6 2 rs4671393 2 ILMN 1759155 IFIT1L 10 2 rs766432 2 ILMN 1759155 IFIT1L 10 2 rs1427407 2 ILMN 1759155 IFIT1L 10 2 rs10172646 2 ILMN 1759155 IFIT1L 10 2 rs1896294 2 ILMN 1759155 IFIT1L 10 2 rs10195871 2 ILMN 1759155 IFIT1L 10 2 rs7557939 2 ILMN 1759155 IFIT1L 10 2 rs11886868 2 ILMN 1759155 IFIT1L 10 2 rs7584113 2 ILMN 1759155 IFIT1L 10 2 rs9468692 6 ILMN 1814397 EPB42 15 2 rs10947055 6 ILMN 1814397 EPB42 15 2 rs9468692 6 ILMN 1772809 SLC4A1 17 2 rs10947055 6 ILMN 1772809 SLC4A1 17 2 rs9468692 6 ILMN 1807919 TNS1 2 2 rs10947055 6 ILMN 1807919 TNS1 2 7 rs4917014 7 ILMN 1729749 HERC5 4 7 rs4917014 7 ILMN 1662358 MX1 21

121 Figure 4.3: eSNP association between BITs. eSNP associations in Module two (A) and Module seven (B) Genes are represented as blue boxes (gene labels above). Black and Purple lines represent two intances of shared eSNPs between BITs in Module two

A

B

122 Figure 4.4: Chromosome location and LD map of the SNPs significantly associated with two probes in module two. The -log10 (fdr) of the association between two SNPs is plotted as dots (y-axes on the left) with the relevant effect size (r2) colored as per the legend in the top right hand corner. The recombination of that section of the chromosome is represented by the blue line (y-axes on the right). The purple diamond is the selected SNP for which from which the association analysis statistics are performed (r2 and the FDR values). All SNPs are in moderate LD with each other. There appear to be two groups within this set that are in higher LD with each other.

123 Another three BITs in module 2 were associated with two eSNPs in high LD (R2 = 0.91) and therefrore comprising one eQTL on Chromosome 6 (Table 4.4, Figure 4.3B). These two eSNPs are located within the coding region of TRIM10. TRIM10 plays a role in the terminal differentiation of erythroid cells. One of these SNPs was a GWAS hit for human immunodeficiency virus (HIV) infection [350], indicating a potential link between erythroid cells and infecting virus. The second SNP was associated with cardiac hypertrophy (Table 4.5) [351].

One eSNP shared between two probes in module 7 (Table 4.5) is located in an intergenic region between the Refseq genes C7orf72 and IKZF1. IKZF1 is a gene that is a key regulator of lymphocyte differentiation, and dominant negative isoforms of this gene are associated with acute lymphoblastic leukemia and other B-cell malignancies [352, 353]. This eSNP is a GWAS hit for SLE [354, 355], an interesting observation since these modules were initially identified in a cohort of SLE patients [114, 228]. rs4917014 is also associated with the GWAS traits of lipid levels [356] and Stevens-Johnson syndrome or toxic epidermal necrolysis [357]. These results indicate the presence of dysregulated immune responses.

To calculate the significance of the number of shared loci between BITs, we calculated the expected number of shared eSNPs between 10 randomly sampled transcripts. We repeated this bootstrapped calculating 10,000 times. We found that on average less than one shared loci is present between 10 randomly sampled probes, with the two modules (2 and 7) having shared eSNPs at a higher frequency than would be expected by chance.

4.4.5 Transcription factors

The eSNP associated with the BIT in section 4.4.4 were all functioning distally (in trans) to co- ordinate the expression of the transcripts. We sought to examine functional mechanisms upon which these distal effects may occur. To do this, we tested whether the BIT modules were enriched for any transcription factors [1]. Transcription factors are proteins that bind to specific fragments of DNA and thereby controlling transcriptionally activity of the gene.

Six of the nine modules contained transcripts that identified as transcription factors in blood cells (Table 4.6). Enrichment analysis (Table 4.7) for the ten probes in the 85 BITs demonstrated that there was significant enrichment (p=3.3x10−4) of transcription factors. The enrichment of transcription factors in these modules provides insight into the genetic architecture that is driving the co-expression of the probes. It is possible that changes in the expression of a transcription factor lead to coordinated

124 Table 4.3: Correlation and heritability values for BIT modules. Mean/expected value (E) and 2 standard error (SE) for heritability (h ), phenotypic correlation rP and genetic correlation rA values for all the BITs in each modules, within each tissue. The genetic contribuiton to rP is reported (p ˆA). Transcripts that are labelled ”R” are the randomly selected transcripts used to evaluate the empirical null value.

2 2 Modules Tissues E(h ) SE(h ) E(rP) SE(rP) E(rA) SE(rA) E(rE ) pˆA 1 Blood 0.28 0.04 0.6 0.02 0.6 0.03 0.6 0.28 1 Fat 0.13 0.02 0.27 0.04 0 0.09 0.31 0 1 LCL 0.51 0.03 0.12 0.03 0.09 0.04 0.15 0.38 1 Skin 0.34 0.04 0.1 0.04 0.18 0.06 0.06 0.61 2 Blood 0.46 0.01 0.81 0.01 0.82 0.02 0.8 0.47 2 Fat 0.21 0.03 0.09 0.03 0.19 0.07 0.06 0.44 2 LCL 0.18 0.04 0.03 0.02 0.05 0.1 0.03 0.3 2 Skin 0.18 0.05 0.06 0.02 0.21 0.1 0.03 0.63 3 Blood 0.47 0.02 0.65 0.02 0.88 0.01 0.45 0.64 3 Fat 0.12 0.05 0.02 0.02 0.23 0.12 -0.01 0.78 3 LCL 0.51 0.03 0.11 0.04 0.08 0.05 0.14 0.37 3 Skin 0.21 0.04 0 0.01 0.04 0.08 -0.01 0.5 4 Blood 0.3 0.02 0.66 0.01 0.72 0.03 0.63 0.33 4 Fat 0.27 0.03 0.51 0.03 0.59 0.05 0.48 0.31 4 LCL 0.31 0.03 0.41 0.05 0.21 0.06 0.5 0.16 4 Skin 0.22 0.01 0.6 0.03 0.7 0.03 0.57 0.26 5 Blood 0.38 0.02 0.52 0.02 0.62 0.02 0.46 0.45 5 Fat 0.09 0.03 0.22 0.03 0 0.11 0.24 0 5 LCL 0.38 0.03 0.06 0.02 0.09 0.04 0.04 0.57 5 Skin 0.26 0.06 0.07 0.03 0.05 0.07 0.08 0.19 6 Blood 0.35 0.01 0.89 0.01 0.96 0.01 0.85 0.38 6 Fat 0.25 0.02 0.89 0.01 0.96 0.01 0.87 0.27 6 LCL 0.43 0.01 0.92 0.03 0.94 0.03 0.9 0.44 6 Skin 0.26 0.01 0.91 0.01 0.93 0.01 0.9 0.27 7 Blood 0.3 0.03 0.74 0.01 0.73 0.02 0.74 0.3 7 Fat 0.21 0.03 0.38 0.03 0.35 0.08 0.39 0.19 7 LCL 0.41 0.04 0.14 0.04 0.22 0.08 0.08 0.64 7 Skin 0.27 0.06 0.27 0.03 0.28 0.08 0.27 0.28 8 Blood 0.19 0.03 0.51 0.02 0.62 0.04 0.48 0.23 8 Fat 0.08 0.02 0.35 0.02 0.43 0.11 0.34 0.1 8 LCL 0.45 0.01 0.64 0.03 0.94 0.01 0.39 0.66 8 Skin 0.13 0.04 0.27 0.02 0.45 0.08 0.24 0.22 9 Blood 0.48 0.05 0.36 0.04 0.56 0.06 0.18 0.75 9 Fat 0.13 0.03 0.17 0.03 0.41 0.09 0.13 0.31 9 LCL 0.41 0.02 0.4 0.03 0.57 0.03 0.28 0.58 9 Skin 0.21 0.03 0.15 0.03 0.46 0.08 0.07 0.64 R Blood - - 0.07 0.03 0.06 0.05 - - 125 Table 4.5: Module associated GWAS SNPs. Chromosome, genome location, nearby genes, GWAS associated disease/traits and citations for the shared BIT SNPs (as listed in 4.4)

Module SNP Chromosome: Position Location Gene Disease Trait Citation 2 rs7584113 2:60,494,176 intron-variant BCL11A F-cell distribution [342] F-cell distribution [342] 2 rs11886868 2:60,493,111 intron-variant BCL11A Fetal hemoglobin levels [343] 2 rs7557939 2:60,494,212 intron-variant BCL11A F-cell distribution [342] 2 rs10195871 2:60,493,454 intron-variant BCL11A F-cell distribution [342] 2 rs1896294 2:60,491,939 intron-variant BCL11A F-cell distribution [342] 2 rs10172646 2:60,493,622 intron-variant BCL11A F-cell distribution [342] Fetal hemoglobin levels [344] 2 rs1427407 2:60,490,908 intron-variant BCL11A F-cell distribution [341, 342] Sickle cell anaemia [346–348] F-cell distribution [346] 2 rs766432 2:60,492,835 intron-variant BCL11A Fetal hemoglobin levels [345] Beta thalassemia/hemoglobin E disease [349] Hemoglobin levels [358] 2 rs4671393 2:60,493,816 intron-variant BCL11A F-cell distribution [359] 2 rs9468692 6:30,119,890 utr-variant-3-prime TRIM10 HIV infection [350] 2 rs10947055 6:30,093,364 intergenic TRIM31 - TRIM40 Cardiac hypertrophy [351] Systemic erythematosus [354, 355] 7 rs4917014 7:50,305,863 intergenic C7orf72-IKZF1 Lipid Levels [356] Stevens-Johnson syndrome/toxic epidermal necrolysis [357] changes in expression of its target transcripts. In the case of the modules, it is possible that the transcription factors could modify the expression of other module genes either directly or through other intermediate genes that genetically regulate (and are identified as eQTLs) other BITs.

Table 4.6: List of BITs that are transcription factors. ”Probe illumina ID” is identifier the BIT transcripts used in this study, ”Probe” is the gene in which the BIT probe is tagging. ”Tissue” is where this transcript/transcription factor is predominantly expressed in the body.

Module Probe Illumina ID Gene Tissue 2 ILMN 1784678 PBX1 - 3 ILMN 1775235 AFF3 - 3 ILMN 1778681 EBF1 - 4 ILMN 2357272 BCLAF1 Smooth muscle 6 ILMN 1692145 ZNF14 General 6 ILMN 2162367 DMC1 - 6 ILMN 2313889 ZNF682 - 7 ILMN 2349061 IRF7 Heart, lung, lymph node, thymus, tonsil, whole blood 8 ILMN 2379788 HIF1A Smooth muscle 8 ILMN 2389844 SP3 Whole blood

Table 4.7: BITs Transcription factor enrichment. (a) The number transcription factors found amongst the module probes, (b) 1,381 known transcription factors elsewhere [1], (c) the number of probes within modules that are not known transcription factors and (d) The number of genes that are not known transcriptional factors in whole blood. Details of the transcription factors is given in [1].

In Modules In Genome TF (a) 10 (b) 1381 Not TF (c) 75 (d) 20609

The enrichment of transcription factors within the BITs indicates that these modules could be a part of a regulatory genetic network of genes. Since the BITs themselves are the representative genes for each module, it is possible that the BITs represent the regulatory machines that control the downstream expression of many genes.

127 4.5 Discussion

Preininger et al. [114] identified and replicated nine distinct modules comprising of co-expressed transcripts that covaried in response to various environmental factors [115]. The presence of mod- ules that co-vary in response to disease or environmental conditions suggests that individual gene expression levels vary along pre-defined axes of variation. The degree of co-ordinated transcriptional responses within these axes to environmental or genetic perturbations could predefine an individual’s susceptibility to disease. PCA scores of BIT modules were able to differentiate individuals based on environmental conditions such as infection or geographical location [114, 115]. The mechanisms for this covariance is not yet well understood. However, the pervasive nature of gene coexpression indicates that there is a core regulatory mechanism constraining the expression of transcripts.

The majority of transcripts in CHDWB and Morocco cohorts share the covariance structure captured by the BIT modules. This covariance structure defined up to 50% of the transcriptome variance in these two cohorts [114]. The high correlation between all genes in a module enables the data to be represented and even recreated by a smaller number of representative transcripts. Therefore the ten BITs represent a much larger number (between 99 and 1028 as recorded in [114]) of transcripts. The robust intra-module correlations found between the BITs suggests that genetic regulation drives the rP. This feature is reflected by rA values since rE between transcripts are likely to disappear between cohorts exposed to different environmental conditions. This was supported by high rA values for each module, which was 0.72 on average across all BIT transcripts. Therefore these modules are extensively driven by genetic correlations and therefore a strong genetic component can be a good measure of the the presence of a BIT module.

We used two measured to identify the presence of the modules in this study. To determine the signifi- cance compared to random chance, we calculated the empirical null for the three types of correlation values between ten probes. This method takes transcripts that have expected correlation values of zero and calculates the empirical null values between them. Therefore the empirical null value is ex- tremely low correlation values. Since that BIT modules exhibit an incredibly high correlation value, the criteria for identifying the presence of the modules should include a higher correlation metric that is more representative the BIT correlation structure. Since the BIT modules were only validated in blood, we used the minimum correlation values for the modules in this cohort to establish a new met- ric to identify transcripts. This module minimum metric was much more discerning of the presence of the modules in the other tissues. The modules selected using the module minimum metric were

128 much more appropriate, i.e., had notable rA values, than those flagged by the empirical null value.

Therefore high rA and rP values are an important feature of the BIT modules.

In this study, we quantified the extent of the genetic regulation driving the representative 10 BITs for each module. We found that the BITS had significant intra-module correlation rP (Pearson’s correlation) within the BSGS cohort Figure 4.1D, Table 4.2). The average h2 for BIT modules was

0.34, therefore, the amount of genetic contribution to the rP comprised a little under half (p ˆA=0.42).

Therefore the environmental and genetic effects had roughly similar contributions to the high rP

(average 0.64 across all modules)) between BITs. The rE values between BITs (0.58) were much lower than the average rA values (0.72) (Figure 4.1D). This indicates that the rE had larger contribution 2 to rP than rA due to the low h values of the BITs (0.36). With the high average rA (0.72) found in the BITs within each module confirming that genetic regulation is a core feature driving the modules. Therefore, in healthy individuals, the BIT modules represents biologically meaningful transcriptional units. We also examined the presence of the modules in three different tissues: fat, skin and LCLs.

We observed notable differences in the heritability of the BITs as well as the mean within-module rP and rA across tissues. Differential genetic regulation between tissues has been found in several studies [41, 163, 182, 360]. These studies discovered a notable difference in the eQTLs regulating transcripts between tissues [184], sometimes with different allelic effects [102], demonstrating a different genetic architecture between the tissues. The persistence of these modules in other tissues can provide context on the extent to which this particular regulatory covariance structure is present in the cells in the body. The context specificity of modules provides valuable information regarding the utility of using these modules are molecular markers in other tissues besides whole blood [114, 115]).

The nine modules investigated here were identified and analyzed from whole blood samples; however, tissues specificity may occur within other cell types in the body. To investigate this, we examined the genetic control of BITs in three different tissues: fat, skin and LCLs. We observed notable differences in the heritability of the BITs as well as the mean within-module rP and rA across tissues. Differential genetic regulation between tissues has been found in several studies [41, 163, 182, 360]. These studies quantified various eQTLs regulating transcripts between tissues [184], sometimes with different allelic effects [102].

GWAS and eQTL studies have discovered many regulatory loci for individual transcripts. Little research has been performed to identify regulatory loci affecting co-expression. In this study, we interrogated a well-powered and replicated independent dataset of eQTLs [57] to determine the target loci driving the coexpression in each module. In particular, we looked for loci that are =shared

129 between probes in a module. We found many SNPs associated with five BITs in module two and one with module seven. The presence of loci associated with two or more transcripts in a module uncovers the pleiotropic regulation driving the high genetic covariance and consequently rP within the modules.

Module two appears to have a crucial role in blood. From all the tissues studied in this project, this module was only present in the blood cohort. Module two had one of the highest correlation measures and h2 in the blood cohort. Enrichment analysis performed in [114] for all of the transcripts associated with module two showed enrichment for terms related to RBC and platelet functions. Nine eSNPs in high LD with each other were associated with two BITs in module two. These SNPs were in high LD with each other and were located in the intronic region of the BCL11A gene. BCL11A is a TF associated with variable phenotypes, with the most notable being lymphoid development [338] and the production of fetal hemoglobin levels [339]. The eSNPs located within BCL11A were GWAS hits for a number of phenotypes related to red blood cell functionality, including include fetal hemoglobin cells (F-cell) distribution [341, 342], fetal hemoglobin levels [343–345], sickle-cell anemia [346–348], beta thalassemia or hemoglobin E disease [349]. These highlight the importance of BCL11A in dysregulating the production of hemoglobin (Table 4.5).

Since the BITs in modules two are located in across the genome (Table 4.5), are regulated distally to each other via soluble factors. It is possible that BCL11A interacts or is interacted with by the BITs either directly or through the recruitment other TFs, making up a regulatory network of transcripts and TFs. In fact, module two itself contains a TF that could expand upon on the regulatory network that is encompassed by this module. In fact, two-thirds of modules had a least one TF present, with 11% of the BIT modules in total being TFs. Therefore TFs appear to be a mechanism upon which the BITs are co-regulated.

The two genes in module two that were associated with the SNPs in BCL11A were IFIT1L and GMPR. IFIT1L is a gene up-regulated by interferons and acts as an inhibitor of viral replication [361]. GMPR has more of a housekeeping function in the recycling of nucleotides [362]. The immune-related functions of IFIT1L and BCL11A associated with module two indicate that the transcripts in this module are also involved in the function of immune cells present in blood tissues as well as platelets and RBCs. Another two SNPs in high LD were associated with three transcripts in module two. These SNPs are located within TRIM10 and are GWAS hits for HIV infection [350]. This SNP was in LD with another SNP, also associated with three transcripts, that was located in an intergenic region and was a GWAS hit for cardiac hypertrophy [351]. TRIM10 itself is involved in differentiation of

130 RBCs [363]. It is possible then that RBCs play a role in the infection rates and pathogenesis of HIV. In fact, there is a link between hemoglobin levels and the development of neurological disorders that are the results of HIV infection [364]. The three genes associated the two SNPs linked to TRIM10 (TNS1, EPB42 and SLC4A1) were all involved in red blood cell physiology. TNS1 is an important structural component to signaling molecules and involved in facilitating cellular attachment to the extracellular matrix [365]. EPB42 is a membrane protein that has an important role in the shape of red blood cells [366]. Finally, SLC4A1 is another protein in the cellular membrane of red blood cells that takes part in carbon dioxide processing [367]. Module two clearly shows enrichment of various aspects of red blood physiology, linked with immune function. The connection between RBC and immune function is highlighted by PBX1 a BIT TF in module two, which is a gene that is essential to the development of common lymphoid progenitor cells (B and NK cells) and T-cell development [368]. It could, therefore, be hypothesized that the functionality of red blood cells could have a direct impact on immune cells [369, 370].

Module seven is another module that was only present in blood and contains transcripts that are predominantly involved in viral immune responses [114]. The soluble signaling molecule, the TF, of module seven (IRF7) is involved in up-regulating anti-viral genes [371]. Two transcripts in module seven were associated with one eSNP found in an intergenic region between IKZF1 and C7orf72 on Chromosome 7 (Table 4.5 Figure 4.3B). IKZF1 is a transcription factor has a critical role in the hematopoietic system and is a key regulator of lymphocyte differentiation, which are cells that are important to viral defenses. In chip-seq studies, this protein had higher binding affinity to other genomic regions when compare to other genic regions in the genome [57]. Similar to BCL11A, the two BITs in module seven are associated with a TF, suggesting that the network of regulations present in this model.

The two genes associated with this eSNP were HERC5 and MX1. HERC5 is a gene that is upregulated by pro-inflammatory cytokines and produces a protein ligase that has a significant role in antiviral functions [372]. MX1 is involved in antiviral responses through the interference of the ribonucleo- protein complex [373]. The eSNP associated with the two BITs in module seven is a GWAS hit for two immune diseases, SLE [354, 355] and Stevens-Johnson syndrome or toxic epidermal necroly- sis [357] as well as lipid levels [356]. There has been a link between immune dysfunction after severe inflammation and lipid levels [374]. Diet has demonstrated an association with the development of autoimmune disease, in particular, the high-fat Western diet. The diet related autoimmune diseases were predominantly carried out through T-cells [375]. Exaggerated responses to viral infections have

131 contributed to the development of autoimmune disorders in healthy individuals [376,377]. Therefore it is possible that dysregulation of the viral pathways could contribute to the development of immune disorders in healthy individuals.

The original 28 modules, used to identify the BITs, were identified in gene expression samples from patients with SLE [228]. Therefore, it is interesting that there was an association between two BITs in module seven with a SNP that is an SLE variant. This association indicates that the co-expression networks expressed in patients with SLE remain present within the BITs. These SLE networks are potentially core regulators of immune function, notably viral immune function. Changes to this coex- pression from genetic perturbations or environmental conditions could lead to shifts in SLE disease susceptibility.

In additional to modules two and seven, modules one, three and five were also only present within the blood tissue. The transcripts within these modules show enrichment for immune responses. Module three contained two TFs, AFF3 and EBF1 which are both involved in lymphocytes development [378, 379]. Since there is an enrichment for immune function and these modules are not also present in LCLs. Therefore, these modules could hold networks that are predominantly occurring within T-cells and not B-cells.

LCLs are immune cells that are immortalized with Epstein-Barr virus. LCL have been used ex- tensively in gene expression studies as they provide a stable and replenishable source of RNA and DNA [177]. Previous studies have shown that the genetic architecture of gene expression is different between LCL and similar cells found in whole blood [179, 181, 380], including studies done on the MuTHER cohort used in this study [182]. We found that the correlation structures between tran- scripts differed between LCLs and blood with only modules 8 and 9 shared between the two tissues. These changes in transcript co-expression in LCLs can arise from a combination of gene expression perturbations that occur during the LCL immortalization process and B-cell tissue-specific genetic regulation that is enriched in LCL samples when compared to whole blood. Transcripts in both mod- ules eight and nine showed enrichment for B-cell functionality, explains their presence in the LCL tissue.

All modules, except for four and six, were found only within blood/LCL cells. The blood tissue specificity of the BIT modules is not surprising since these modules are enriched for cellular functions that would only occur in blood or immune cells. Module four was present in all tissues except for LCL based on the rA values. Module four showed enrichment for housekeeping functions which are not

132 tissue specific. Module four also contained one TF, BCLAF1, which was involved in apoptosis [381]. Module six has the highest correlation values for all tissues, including blood, and demonstrated no enrichment for any particular processes amongst the transcripts associated with it [114]. Module six had three TFs, the highest of any module. Two of Module six TFs were involved in the repression of zinc finger proteins (ZNF14 and ZNF682). Zinc-finger proteins are involved in a wide variety of core processes within the cell [382]. The third transcription factor was DMC1, which is a conserved gene involved in repairing double-stranded breaks in the DNA [383]. The housekeeping or core cellular functions expressed within modules four and six are inherently tissue non-specific and could explain the presence of these modules across all of the tissues studied [181, 336, 384, 385]. Module six had the largest number of TFs of any module. This large number of TFs could be indicative of a primarily regulatory role of the BITS in this module.

This study examined the genetic architecture of a set of nine modules that showed promise as a potential diagnostic tool to measure disease susceptibility and progress [114, 115]. These modules represent set of highly correlated transcripts, which are representative of a much larger group of transcripts that control up the half of the variance in the transcriptome. By decomposing rP values, often used as measures of coexpression, into rA values we were able to quantify the extent of the genetic regulation driving the high covariance structure observed in many gene expression datasets. It was also evident that all of these modules have correlation structures between the transcripts. This regulation was identified by shared eQTLs between two more transcripts in modules two and seven, indicating a network between the BITs in the modules. One of the eQTLs shared between two BITs in module two was a transcription factor. Transcription factors are one means in which a gene can regulate the expression of another gene. The enrichment of transcription factors within the BITs indicates that a strong genetically controlled network between the genes is driving the correlation structure observed here. These modules show enrichment for a variety of functions, which ranged from blood cell physiology to general housekeeping processes. In fact, the concordance between the genetic associations, transcript enrichment and tissue presence in these modules indicates that these modules could provide biological insights into the cellular functions and interactions of blood and immune cells. The tissue non-specificity of modules four and six shows that they represent a regulatory network that that is core to cellular functioning.

The network architecture between the transcripts through the use of TFs facilitates the coordinated expression of multiple genes that rapidly adjust to responses to environmental conditions. Genomic perturbation in any of the genes within a module could have a cascading effect throughout and entire

133 module. The genomic profile of an individual can have subtle effects across a molecule, influence their ability to respond to environmental conditions. This factors could be a driving force behind the differences in susceptibility to diseases and conditions observed throughout a population. This interplay between genetic regulation and environmental conditions within the modules can, therefore, provide a model upon which to study the etiology of complex traits and diseases.

134 Chapter 5

Discussion and Conclusion

5.1 Transcriptomic variability

5.1.1 Gene expression as a quantitative trait

The variability of gene expression across a population is one of the driving forces behind pheno- typic variation [386].The process of translating genes into RNA transcripts is the starting point upon which genetic information flows from the DNA into the production of cellular products, essential for cell structure and function. The expression levels of all characterized transcripts can be quickly and cheaply quantified using microarray technology. Gene expression is a quantitative, complex traits, influenced by a mixture of genetic and environmental factors. Expression levels are regulated through a variety of mechanisms that include repression of transcription, modification of transcriptional ma- chinery, degradation of mRNA transcripts and changes in chromosome confirmation.

Gene products rarely function alone and often make up networked and highly co-regulated biological processes, e.g. signal transduction, metabolic networks, multi-subunit proteins and immune signaling. Coordinate functions of genes were observed in chapter 2 where the first set of high variance principal components were able to pick up whole networks of genes alongside the batch effects. These coordi- nated functions within the transcriptome present themselves as coexpressed modules of genes. These networks of genes can comprise up to half of the total variance within the whole transcriptome [114]. The association of distal-eQTLs to several coexpressed genes has provided further weight on the im- portance of studying networks of genes instead of individual transcripts to understand the architecture of a system [39]. Up to 11% of eQTLs are shared between multiple transcripts [65]. In chapter 4, we quantify the extent of genetic control driving sets of highly correlated transcripts in blood [114] and

135 identify several shared eQTLs between these transcripts. We also identified TFs, enriched in these modules, as a potential mechanism of the shared regulation.

5.1.2 Biological impact of common SNPs

GWAS have been very prolific in identifying polymorphisms that underpin many complex traits and diseases. At the start of 2016, over 16,000 SNP-trait associations have been reported [332]. The next challenge is to annotate the SNPs obtained from these studies. Gene expression provides a highly accessible intermediate phenotype between the genome and higher-order traits. Gene expression variability in an HWE population provides a readily accessible phenotype to which to assess the direct biological impact from common SNPs (MAF > 0.05) in a population. This information enables the validation and functional characterization of GWAS SNPs.

GWAS/eQTL analysis was performed in two chapters of this thesis. In chapter 2, we demonstrate that networks of SNPs that are picked by PCs can have eQTLs associated with them(Table 2.2). With the removal of the batch effects, which affects networks of genes, eQTLs across all of the transcripts become easier to detect 2.1. In chapter 4, we identified two eQTLs shared between two and three transcripts respectively. These eQTLs were previous GWAS hits for hemoglobin levels and other blood-related disorders. The association with SNP associated with hemoglobin is consistent with the biology of the associated transcript in the modules, which were predominantly involved in red blood cell functionality. Therefore incorporating genetic and transcriptomic data can provide insight into the biological mechanisms driving the etiology of complex diseases and the regulatory architecture of complex traits [60].

5.1.3 Additive and non-additive genetic variability

Transcriptomic variability can be partitioned into additive genetic variation using GREML approaches [295] or family-based designs [21]. Additive genetic variation is of interest to quantitative genetics as it is the source of variation that passed down between generations. Additive genetic variation drives almost all of the genetic variation between measured in complex traits [20]. Additive genetic variation is also a key driver of global gene expression variation [2, 43, 44, 387].

Genetic-dependent variance not attributed to additive genetic variation can arise from dominance and interaction effects. Dominance and overdominance effects have demonstrated a negligible effect in

136 gene expression studies with only 5.3% of probe even having a significant measure of dominance variation, usually in conjunction (and well below) additive effects [44] (as discussed in section 1.4.1). Interaction effects include effects between genes, known GxG interactions or epistasis (as mentioned in section 1.4.2) and with the environment, known as GxE interactions or GEIs (as examined in section 1.4.3).

Only a few gene expression transcripts have had epistatic effects identified and replicated [85]. Al- together these did not have a significant impact on gene expression variability [44]. The extent of GEI in gene expression still needs to be characterized. The reason behind this is the difficulty in collecting easily comparable samples across a wide range of the environmental conditions. The inter- action effects between particular loci and environmental conditions could be a factor that influences the adaptability of individuals to certain stimuli and the development of certain health conditions/ diseases in the presence of specific environmental factors [44, 79].

One key regulatory feature that is driving the shared transcript regulation in modules is TFs. TFs have been documented to exhibit environmentally-specific binding [113]. The BITs in chapter 4 demonstrated strong enrichment for TFs. These modules demonstrate a great deal of co-ordinated variability to varying environmental conditions [114, 115]. Whilst this cannot be confirmed in this study, it is possible that a number of GEIs drive the responses of these modules. A genetic change in one part of a regulatory network could result in the effects across the entire system. The co-ordinated changes across the modules, however, may not be attributed to the effects of strong GEIs but instead to the natural biological responses required for cell to survive. The high genetic structure driven by TF enable the requires responses to be initiated by the modules.

5.1.4 Non-genetic variability

There are two key types of non-genetic variability that will be present in any transcriptomic dataset. This will be environmental variability, can be described as any variability that is ”not genetic” (1- h2). Common family environmental variation is a confounder of h2 estimates. Common family environmental effects occur when traits are measured within families, and similarities between the individuals are attributed to genetics are actually related to the shared environment between them. In the BSGS dataset, used in all of the chapters in this thesis, common family effects were very small, with ˜0.001% transcripts having any significant common family effects. The second type of non- genetic variability would be attributed to technical artifacts. This kind of variation is ubiquitous to

137 transcriptomic datasets and is often overlooked. However, technical artifacts can be hard to define and identify and they can present themselves as both genetic and environmental trends within the datasets and is a serious confounder that require specific models to remove. Technical artifacts are examined in chapter 2.

Environmental variability can be driven by various types environmental conditions and it comprises the largest portion of variance in gene [44] expression and other complex traits. The magnitude of dif- ferent components driving environmental variability make it difficult to distinguish any features from population-wise transcriptomic data. However, knowledge of any macro environmental (as discussed in section section 1.5.2) could provide a sufficient variable upon which meaningful information could be obtained from. This was performed in chapter 3, where seasonality information for each sample was used to distinguish transcripts that were significantly affected by naturally occurring variations across the year. Studies involving differential expression often want to dissect the cellular responses to particular environmental conditions or disease states [388–390] and are therefore looking for tran- scripts that exhibit a great deal of environmental variation in response to such stimuli.

5.2 Technical variability

A form of non-genetic variability that poses as a serious confounder in gene expression studies are technical artifacts, such as batch effects. These arise from minute fluctuations between batches of samples that occur during collection, processing, microarray chip hybridization and reading of the data [160]. This form of artificial variability can become incorporated into the natural genetic and environmental variation in the dataset.

In chapter 2, we used prerecorded processing dates to evaluate the extent of batch effect in gene expression data. We found that up to 29% (Figure 2.1) of the variance of the data was comprised of batch effects. Within the principal components generated from the data, these batch effects were located within the first 60 principal components, which contain the majority of the variance and the largest linear structures. Due to this structure of batch effect it was assumed in many studies that removing the first 50 PCs would eliminate the data. We show that whilst this is correct, the batch effects have integrated themselves amongst other heritable information and therefore require a more accurate correction method.

138 5.2.1 Batch effects

It is the first step when analyzing transcriptomic datasets to normalize and cleaning the data. Batch effects add artificial correlation structures to the dataset that can induce spurious associations with the variables of interest. In order to identify the batch effects present within a dataset, recorded pro- cessing dates throughout the generation of the data must be kept. The detailed records of processing dates within the BSGS dataset enables the examination of the variance and distribution of batch ef- fect. During the generation of transcriptomic data there are many factors that can generate technical artifacts can occur during the generation of the data that may not be recorded during the generation of the data. However, from the records available for the BSGS dataset it is evident that batch effects have the largest effect on transcriptomic variance. Batch effects often occur when, due to large sample sizes, samples are processed in smaller groups or batches. Approximately 19% of the BSGS dataset was attributed to measuring of the data on the microarray chips [152].

5.2.2 Principal Component correction

There have been many transcriptomic datasets that do not have any batch effects recorded. Inaccurate record keeping makes the removal of batch effects much more difficult and leads to the usage of ad-hoc methods such as PC correction to remove unknown variables. Since PC decomposition is an efficient method of isolating trends within the data, it provides a means of correcting for batch effects when there are no known recorded processing dates [157, 163, 164, 170]. The correlation structure induced by the batch effects are easily identified within the first few PCs generated on a dataset. PCs are ordered based on decreasing amounts of variance explained, with variance plateauing to 1% or less very quickly. Therefore the first few PCs often hold the most information about a dataset. The ability of PCs to decomposed the dataset into independent component comprised of linear trends in the data makes PCs a good method to capture to covariance structure of batch effects. In our study, we detected the presence of batch effects within the first 58 PCs generated from the data (Figure 2.3) indicating that the removal of a vast number of components will be required to completely eliminate these technical artifacts. Up to 50 PCs have been removed in previous studies [77, 163], however since the variables included in these PCs are unknown, the selection of the exact number of principal components to remove can be ambiguous.

The ability of PCs to capture the covariance structure of gene expression data means that any source of variation with the same linear trend will be detected. This can include highly correlated genes in co-

139 expressed transcriptomic networks such as metabolic pathways, immune processes, and muti-subunit proteins. Other correlated sets of genes can include sets of transcripts that have common functionality and shared genetic regulation as demonstrated with the BITs within the modules examined in chapter 4. These biologically-relevant sources of variation can become entangled with the batch effects in the dataset and therefore be present simultaneously within the PCs. All the processing information during the generation of the BSGS dataset were recorded, enabling the analysis of the distribution of batch effects across PC. The results in Chapter 2 revealed that both batch effects and biologically meaning pathways can be present at the same time within one PC. Therefore, PC correction is a blind method to correct for batch effect and can result in the over-correction of the dataset and the the removal of genetically regulated information.

To examine whether PC correction removes genetic information we reviewed the impact of the batch effect correction techniques on probe heritability. We demonstrated that the average probe heritability dropped from 0.32 in the uncorrected dataset to 0.18 when corrected with 50 PCs. In linear models, the average probe heritability fell to 0.23 (Figure 2.6). These results indicate that the uncorrected dataset had inflated heritability measurements due to confounding with systemic technical artifacts. The higher drop in both mean probe heritability and a skewed distribution towards a probe heritability of 0 (Figure 2.6) indicates that PC is removing heritable variation.

To examine whether the PCs are removing biological information we interrogated the first 50 PCs from the BSGS datasets for any evidence of biological enrichments. Since PC detect correlation pat- terns within the dataset, it was expected that biological networks of genes are likely to be present in these PCs. We found that 11 of the 50 PCs to be removed from the dataset had a significant enrich- ment for a number of different biological processes including immune function, metabolic processes, ribosomal components and RNA degradation. Since batch effects are artificial trends within the data, they are unlikely to be enriched with biological networks and therefore indicates that a number of net- works genes are being detected within the first 50 PCs alongside the batch effects. Further evidence of biological information being present with the first 50 PCs is also given by the detection of genomic loci associated with the PCs (Table 2.2). Together this provides evidence that PC correction removes a wide variety of information in a dataset that may not be associated with batch effects.

In previous studies, the number of PCs used for batch effect correction was selected based on the impact observed on eQTL detection. In Fehrmann et al. 2011, a gene expression study comprised of 1,469 individuals, removal of 50 PCs resulted in a two-fold increase in cis-eQTLs detected [77]. The removal of PCs eliminated noise from systemic errors (such as batch effects), environmental sources

140 and physiological effects such as cell types. This removal of noise and confounders increased the power to detect eQTLs associations. Within the BSGS dataset, ˜90% of known batch effects were eliminated when 50 PCs were removed (Table 2.1), demonstrating the effectiveness of PC removal in eradicating batch effects from the dataset. The removal of 50 PCs from the BSGS dataset resulted in 35% increase in local eQTLs and a 77% increase in distal eQTLs at an FDR 0.05. This was replicated in the CHWDB dataset with a 32% and 69% increase in local and distal eQTLs respectively at FDR 0.05 when 50 PCs are were removed from the dataset. Linear model correction increased power to detect eQTL association by a greater degree, with a 193% and 188% increase in local and distal eQTL hits respectively in the BSGS and 82% and 82% increase in local and distal eQTLs respectively in the CHDWD dataset. These results therefore demonstrate the removal of batch effects either using PC correction or linear models can improve the power to detect eQTL association. However, due to the overcorrection that is produced by the PCs, a more precise correction method of the batch effects in a dataset can improve the eQTL detect even more. This is due to a greater removal of batch effect and also the prevention of the accidental removal of genetic variation, which can lead to a drop in the power or complete elimination of the associations that can be detected.

Altogether, PC removal is an effective method of removing batch effects in a dataset (Figure 2.3), thereby increasing power to detect eQTL associations (Table 2.1). PC correction would be an appro- priate method to correct for batch effects in datasets that lack detailed records of processing times which could alternatively be used as surrogate markers for batch effects. PCs can additionally incor- porate environmental or physiological factors such as cell counts, population differences, and other unknown factors. It can be difficult to identify which PCs generated from a PCA decomposition would hold this information and to what extent are the PCs comprised of artifacts.

In chapter 2, we examined only one dataset in which PC correction proved to be an efficient model to remove batch effects. It would be expected that the distribution of batch effects across the PCs and the number required to remove such artifacts will vary widely across different datasets and would require careful examination of the data. Therefore the effectivity of PC correction may differ substantially between the various study designs. The BSGS dataset also had quite a small sample size, which mean that the resulting number of PCs would be lower than what would be lower than some of the studies that have employed PC correction. Therefore the distribution of variance across the first 50 PCs may constitute more of the variance in this dataset than others. This difference in sample size therefore makes comparisons between the BSGS studies and the previous studies correcting for PCs difficult. From our results, we can conclude that PC removal is a delicate balance between removing

141 unknown technical artifacts and environmental confounders at the expensive of over-fitting the data and removing genetic variation.

In summary, the appropriate cleaning of the data is the first step that is required for further analysis of gene expression data. The widespread and systematic effects of batch effects provides and exam- ination of the extensive correlation structures that are present in the gene expression datasets. Batch effects integrate within the covariance structures between genetics and the environment and therefore the decomposition of the is variance into principal components, highlights the presence of these corre- lation structures enriched for a variety of biological functions. Therefore the study of gene expression in the major interconnected groups of genes instead of single independent transcript will help in ex- panding our knowledge of the genetic architecture of gene expression. In chapter 4 we examine how co-ordinated genetic regulation between genes drives highly correlated modules of genes.

5.2.3 Environmental variability

The most prominent source of variation in global gene expression is environmental effects (Equation 1.1) [44]. The examination of transcriptomic variation attributed to environmental effects can elu- cidate the transcripts and associated molecular processes that are either modulated by or represent a targeted response to, specific environmental stimuli [19]. These environmental effects can be divided into macro environmental effects, which affect the majority or all of the individuals in the population, and micro environmental factors, which are specific to individuals [125].

Across a large enough sample size, as found with the BSGS, the environmental variation detected are driven by a magnitude of different environmental conditions. These micro environmental effects are likely to be diluted across a population and are therefore unlikely present themselves as serious confounders to study design. On the other hand, macro-environmental effects, such as seasonality can become potential confounders since all individuals are exposed to this source of variation. In chapter 3, a significant seasonal signature for a number of transcripts across the transcriptome was identified. This highlighted seasonality as an important macro-environmental factor in transcriptomic data.

5.2.3.1 Seasonal variability

Seasonal effects have been reported to have an impact of global gene expression variation [275,391]. In chapter 3, we examine the impact of a pervasive source of environmental variation, seasonality, on

142 a gene expression dataset that was collected over a number of years. To isolate the cyclic seasonal component for each transcript, a loess smoothing function was applied to the dataset. This model separates the cyclic components, from yearly trends and noise within the dataset (Figure 3.1). The significance of a 12-month repeating weather cycle was tested with a cosinor regression model (Figure 3.5) and confirmed with an orthogonal autocorrelation test for cyclic repeating patterns. We identified in total 135 transcripts, 1% of all 13,311 transcripts tested, that were associated with a 12-month seasonal cycle (Table 3.1).

5.2.3.2 Enrichment

We found that there were distinct biological processed that were enriched among the transcripts that were associated with seasonality. The significant seasonal transcripts were enriched for the pro- cesses of DNA repair and binding. This provides potential clues in the underlying biology that drives seasonal trends that have been reported in a number of physiological traits and diseases, such as cardiovascular disease [281] and multiple sclerosis [287, 288].

5.2.3.3 Causality

The direct causal factor driving seasonal transcriptomic variation are likely to be due to a number of secondary factors. These factors could include seasonal changes in influenza transmission fre- quency [289], weather patterns or temperature (Figure 3.5) and therefore cannot be identified in this study. Transcripts associated with significant seasonal cell count variation (as discussed later), further complicated the line of causality. With these transcripts, it is unclear whether it was the change in cell count that caused the increased in detectable transcripts or whether it was a transcriptional change that modified the cell count numbers. Either way, it can be concluded that a seasonal signature influencing cell count can be detected within the transcriptome.

5.2.3.4 Impact of seasonal effects

The identified seasonally-affected transcripts have provided evidence for two distinct notions. Firstly, we have identified biologically relevant processes that demonstrate a seasonal trend. These biological effects require replication in a different dataset for validation. The presence of an opposite seasonal effect in the north hemisphere for these transcripts involved in DNA function and repair will further

143 confirm this biological trend in seasonal fluctuations.

Secondly, since the BSGS was not collected as a part of a time-series design, the presence of clear seasonal effects highlights the ubiquity of this macro-environmental effect on the variation in gene ex- pression datasets. Therefore, it is likely that any transcriptomic data obtained from samples collected over time will contain seasonal signatures.

Whilst the overall impact of seasonal variability in the BSGS dataset is small, the detectable seasonal signature indicates that any samples measured over a period of time could be susceptibility for natural seasonal variation. This additional source of variation could complicate gene expression analyses performed with samples that are collected over time or as a part of a time-series experiment and therefore should be examined during data cleaning process in these instances.

5.2.4 Physiological variability

5.2.4.1 Cell counts

5.2.4.1.1 Seasonal effects Seasonal variation found in gene expression datasets have been at- tributed to changes in cell counts [275, 391]. In chapter 3, we found 94 additional transcripts that had a significant seasonal signature that was driven by cell count, notably lymphocytes, and the tran- scripts were enriched for various immunological processes indicating a strong immune signal that is associated with seasonality. It is reasonable to hypothesize that this signature is due to the influence of seasonal infections of several respiratory viruses such as influenza [289].

Altogether, we identified 168 transcripts that were significantly associated with seasonality were en- riched for cellular markers for lymphocytes (Table 3.3). Evidence for confounding with cellular abundance was further confirmed when the cell count of five distinct cell types was significantly asso- ciated with the cosinor seasonal model that was used to model a 12-month seasonal cycle in the study (Figure 3.4). When corrected for cellular abundance, 94 transcriptomic associations disappeared and additional 61 were identified (Table 3.3).

The 94 transcripts associated with cellular abundance were enriched for various immunological pro- cesses (Table 3.2 and Figure 3.6). This enrichment suggests that it is either change in lymphocyte cell counts or the increased expression of immunological genes that follow a seasonal trend. Both expression changes or cell count changes can be detected within the transcriptome. Similar results

144 in much larger sample sizes in both European and Oceanic samples have demonstrated increases in pro-inflammatory transcription and change changes in cellular blood composition associated with seasonality [391].

Together with having a strong seasonal signature, blood cell count also appears to have a significant contribution to the transcriptomic variability in the BSGS datasets [138]. Therefore it contributes what we classify as physiological variability to the dataset. These identification of variation from the transcriptome indicates that transcript levels can provide accurate markers that can detect the composition of blood and an individuals physiology.

Altogether these results provide information on the importance of examining environmental effect on the expression of genes. Whilst individual environmental effects are difficult to determine, macro en- vironmental effects are experiences by everybody and are therefore important when modeling global factors that change the expression levels of a gene. The incorporation of these environmental factors can improve the predictability of genetic models. This is particularly important since, as demonstrate here, some gene respond to environmental conditions more than other. Therefore by collect a catalog of transcriptomic environmental responses, we can incorporating this information to predict individ- ual responses alongside their expected genetic responses. These environmental response do not only occur on the individual transcript level. Genes that are co-regulated and are a part of a larger holistic network are likely to exhibit response to environmental condition in a co-ordinated fashion. This was observed in the modules examined in chapter 4.

5.2.4.1.2 Cellular heterogeneity This form of physiological variability occurs in samples com- prised of a heterogeneous source of cells, such as whole blood. Whole blood is a tissue source that is easily obtained from donors and therefore frequency used in transcriptomic research. The multiple cell types present in blood pose several challenges to transcriptomic inferences that can be made from this kind of tissue. First of all, even within the same cell types, there is extensive stochasticity be- tween cells [392]. Over a large population of cells, this variability follows a normal distribution [392]. Secondly, different cell types can have, to varying degrees, dissimilar transcriptomic profiles [393]. Therefore the transcriptome obtained from whole blood represents an average across all the various cell types [394].

Different environmental conditions can contribution to physiological variability induced by cell counts in whole blood. The occurs when an increased proliferation of specific cell types is invoked in blood, i.e. infection causing increases in circulating leukocyte numbers, can change the composition of the

145 transcriptome. Increases in cell counts modify the ratio between various cells in blood and therefore the quantity of certain transcript to others [275, 395]. Differences in the transcriptome as a result of the abundance of different cell types that can represent a significant driving factor behind inter- individual variability found in transcriptomic datasets. Overall these results demonstrate the practical importance when working with whole blood samples to ensure that cell count abundance is collected and ruled out as a confounder in downstream analyses.

5.2.4.2 Tissue specificity

When comparing transcriptomic samples collected between different tissues, the tissue-specific nature of gene expression can create an extreme form of physiological variability. Tissue-specific expression can drastically modify the expression landscape observed within the same individual and has been characterized in many previous studies( [40, 41, 71, 74, 181, 182, 393, 394, 396–398] to name a few). These changes are not surprising since each cell has its specific cellular functions that are unique to it. Housekeeping genes have essential functions within the cell and as a result demonstrate less tissue specificity [181, 336, 384]. The non-specific tissue presence of housekeeping genes was observed in chapter 4, where the modules enriched for housekeeping functions were present across all tissues in the studies. Modules that contained genes with specific blood or immune cell functionality were only present within the blood or LCLs. Therefore the genetic regulation and transcripts that are required to carry out functions particular to a cell are going to have cell-specific expression and regulation. Since cells and tissues work together to perform biological functions, an understanding of the genetic regulation across all the different tissues in the body is essential to obtain a global picture of human physiology. As demonstrated in chapter 4, co-ordinated transcriptomic expression occurs in a tissue- specific manner, indicating the networked expression of genes does not dissolve between tissue but is instead a universal regulatory structure in the transcriptome that varies according to cellular function. Diseases also are restricted to particular regions of the body and therefore an understanding of the transcriptomic processing across tissues will assist in understanding the mechanism driving these conditions.

5.2.4.2.1 eQTLs Each tissue type has its own unique differentiated biological functions and this is reflected in their transcriptomic profile. In whole blood each cell type has its own unique expression profile [397] and in additional to that there is a variety of cell-free circulating RNA from various tissue sources in the body [385]. The unique transcriptional profile of cells contributes to the gene expression

146 trends observed from changes in cell counts throughout the seasons in chapter 3. The Genotype-Tissue Expression (GTEx) project showed that the majority of genes were conserved between tissues, with about 21% of transcripts showing varying degrees of dissimilarity between tissues and over 50% of detected eQTLs are shared between tissues [393]. Therefore inferences made from transcriptomic analyses in blood may not be directly applicable in other tissues.

5.2.4.2.2 BIT In chapter 4, the co-regulation between BIT modules different between the four tissues measured: blood, LCLs, fat and skin (Figure 4.2). The extent of shared genetic control, mea- sured as the genetic correlation between transcripts, also varied between modules. The tissue-specific difference in genetic covariance indicates that the base genetic architecture regulating transcripts dif- fers between cell types and tissues. These differences are likely contributing to the highly specialized structure of the approximately 200 different types of cells in the [399]. Previous stud- ies on tissue-specificity have focused on single transcripts. This study provides an example of the covariance and genetic covariance varying between tissues showing a networked difference between tissues.

The modules were identified and validated in whole blood tissue samples. Within the BSGS dataset, all nine modules were present with a high average rA of 0.72 across all BIT module pairs. The presence of the modules were more accurate determined when they were based on the values obtained from BIT modules in blood. Using this measure, modules 1-3,5 and 7 were only present in blood with modules 8 and 9 also present in LCL cells. Modules 4 and 6 were present across all of the tissues. The universal presence of the modules 4 and 6 can be attributed to the predominantly housekeeping functions of the transcripts.

5.3 Genetic covariance

The extensive covariance structure of the transcriptome indicates that the high degree of network interactions between transcripts take place within a biological system to carry out a number of tasks. This complex system of regulation and network is an emergent property found within the -omics datasets as information is transmitted from the DNA into the complex cellular infrastructure required to keep the cell alive. This covariance structure shares the variance between multiple transcripts and provides a means of tracking entire biological pathways instead of individuals transcripts. These pathways can comprise of biological networks of genes can include, for example, metabolic and

147 immunological that require the precise coordination of genes to maintain homeostasis.

5.3.1 BITs

In chapter 4 we examined the mechanisms driving transcript co-expression using nine modules that were identified and replicated in a previous study [114,228]. These modules were primarily enriched from immunological and blood cell biological functions. The BIT modules have a large impact on the variance present in the transcriptome, with the majority of transcripts having 50% or greater of their variation associated with PCA-derived axes of the modules [114].

These modules were decomposed into a smaller set of ten transcripts, called BITs that represent the variance structure of the whole module. These BITs were used as proxies for characterizing the variance captured by the whole modules. These modules were identified in whole blood samples and, as mentioned previously, four of these modules appear to be tissue-specific (Figure 4.1).

5.3.2 Phenotypic correlations

Phenotypic correlation between transcript levels, as measured by Pearson’s correlation form the basis of which coexpressed modules are identified the transcriptome [114, 115, 228, 229, 322, 400]. Pheno- typic correlations between transcripts can be a driven by genetic factors, environment influences, and random chance.

Correlation between transcripts, which is an estimate of scaled covariance, between transcripts is highly susceptible to noise and have low repeatability [401]. The BIT modules were replicated across several studies across five independent studies that comprise of different population groups confirming the co-expression found in the modules [114].

Transcriptomic covariance, using a bivariate extension of GREML in GCTA [324], can be partitioned into genetic and environmental components (Equation 4.1). The genetic correlation between tran- scripts (Equation 4.2) provides a measure of the degree of genetic regulation that shared between transcriptions and the proportion of total covariance attributed to genetic effects.

Since these modules represent such a significant portion of the variance in the datasets, the quan- tification of the degree of genetic and environmental effects driving BIT covariance can give insight into the regulatory architecture driving covariance across the whole transcriptome. Since rP can be

148 driven by environmental effects, the measure of genetic contribution to rP (p ˆA) could provide a greater insight into the modules that are under genetic control, will have stable covariance across different groups and are likely to have import biological functions within the cells. Since the BIT modules were replicated in several studies, they are likely to be strongly regulated by genetic loci as observed in Chapter 4.

5.3.3 Genetic contribution to phenotypic correlations

As expected, the modules have a high degree of genetic correlation between intra-module BITs. How- 2 ever, the extent to which the rA impact rP is dependent on the h between the pair of transcripts (Equa- 2 2 tion 4.5). A low h will give environmental correlations too much weight to the rP. The h of the transcripts varied between 0.19-0.48, indicating that rE values have slightly more weight on the rP.

Coincidentally, the averagep ˆA across the modules was under half (0.42). In the BSGS dataset, the genetic and environmental correlations were similar (Figure 4.2) thereby almost equally contributing the high rP exhibited by the BITs.

The rA estimates themselves, however, were very high ranging from 0.56-0.96. This demonstrates that while the on average the total genetic regulation of the individual BITs is 0.36, the genetic control driving that portion of variance is almost entirely shared between BITs in a module. The robustness of the modules can, therefore, be attributed to the high genetic regulation driving the co-expressed BITS. The genetic regulatory structure driving these modules provides a backbone upon which coordinated responses to environmental conditions can take place.

5.3.4 Shared eQTLs

Taking these ten probes we demonstrated that their pairwise phenotypic correlations were higher than what would be expected by chance (Figure 4.1). The phenotypic correlations were driven roughly equally by both genetic and environmental factors (Figure 4.1). There was a significant enrichment for eSNPs in two modules (Figure 4.2) shared between probes. Within module two, there were loci located within the TRIM10 genes associated with three BITs and loci within BCL11A associated with two probes (Table 4.3). eQTLs found within intergenic regions were associated with two probes in module seven, providing further evidence of the extensive shared genetic regulation between these highly co-expressed transcripts.

149 These results hint at the possible effects of distal-eQTLs as regulatory hotspots for multiple genes [75]. Since the genetic variation shared between eQTLs probes could be the result of multiple loci with small effect sizes, it is likely that there is insufficient power to detect many of the shared associ- ations driving the genetic co-regulation of BIT probes.

5.3.5 Transcription Factors

One mechanism upon which distal-eQTLs can regulate the expression several genes is through tran- scription factors [1]. Across all the BITs there was a significant enrichment for the presence of transcription factors, providing some insight into the potential biological mechanism upon with the shared eQTLs are exerting their effects across the modules. Further analysis of the BITs and the other transcripts associated with each module [114] could help to elucidate the network of genetic control that propagates the genetic regulation of multiple transcripts.

5.3.6 GWAS enrichment

Nearly all of the shared eQTLs between BITs in module two were GWAS studies for blood diseases or heart conditions or traits (Table 4.5). The transcripts within module two are enriched for blood phenotypes such as platelet aggregation and hematopoiesis [114].

The information from the transcriptional modules has provided further information into the mecha- nisms upon which these loci contribute to disease. It is likely that these blood conditions arise due to disruption of biological pathways are the key to maintaining blood cell homeostasis.

Module two appeared to have the strongest correlation values in blood and had the most extensive degree of shared loci, with over half of the transcripts sharing a genetic loci. The transcripts in this module were enriched for a variety of blood cell functionality, predominantly red blood cells with some immune functionality. Two of the eQTLs associated with two BITs were located within a tran- scription factor involved in modulating hemoglobin and immune cell development. The association to a transcription factors indicates that mass regulation of multiple genes through a network of tran- scription factors is the genomic architecture that is driving the co-expression of this module. The enrichment for transcription factor in the BITs themselves confirm that the modular structure iden- tified here is a result of networked regulation of transcripts. Some of the GWAS traits associated with the BITs in transcripts suggested greater functionality of red blood cells. Particularly in HIV

150 function [364], the development of cardiac hypertrophy, the interplay between red blood cells and immune function [369, 370], most of which have been suggested before and examined within the literature. This suggests the networked interactions between multiple transcripts and their regulating loci provide clues into the biological nature of the tissues in which they are examined.

Module seven was also present only within blood and enriched for anti-viral functions. This was the only other modules in which a shared eQTL was observed. The eSNP associated with the two transcripts was located within the TRIM10 genes, which is also a TF. The eSNP associated with two transcripts in this module was associated with two immune-related conditions SLE [354, 355] and Stevens-Johnson syndrome [357]. This eSNP was also associated with lipid levels suggesting that a link between fats and immune disorders, which has been observed before [374]. Viral responses have also been linked to autoimmune disorders through the aggravation of immune cells after inflammation [376, 377].

The fact that one of the eQTLs associated with two transcripts in module seven was associated with SLE is an interesting result because the original studies from which the nine BIT modules were derived, isolated 28 modules from transcriptomic data of patients with SLE [228]. This result sug- gests that this particular module represents normal immunological processes within the cell, however, disruption of the regulatory architecture of the transcripts can lead to irregular immune response, re- sulting in the development of SLE. This study highlights that the origin of transcript detection can influence downstream analysis.

Since the modules were identified from SLE patients, the immune pathways were strongly enriched in the modules identified in [228]. These immune pathways were subsequently present within healthy individuals. Had these modules been initially identified in healthy individuals, there is a chance that this module may have been overlooked in favor of more prominent co-expression networks. Along- side the tissue-specific effects identified in the modules these origin-effects highlight the importance that the original conditions of which transcripts or networks of transcripts were identified can have for downstream analysis and interpretation of results.

The genetic associations found within module two and seven have not only identified the part of the regulatory architecture driving the covariance structure of the BIT modules but have provided information on underlying biological processes. By examining transcripts as part of shared regulatory networks, we can gain a broader global view of the emergent biological properties and gain insight into the genetic architecture driving these systems.

151 5.4 Conclusion

Within this thesis, several sources of transcriptomic variance were examined in detail. The exam- ination of these key sources of variation provided an understanding of the impact of confounders, which can arise not just from technical processing but also from pervasive environmental factors and physiological variability from tissue-specific effects.

The examination of environmental variability demonstrates the power of transcriptomic data to eluci- date specific molecular processes that are influenced by external stimuli. Macro-environmental effects have a significant impact not only the variance of individual transcripts across a population but also change the covariance between co-expressed probes as observed with the high contribution of rE to rP in the BIT modules.

The high genetic co-regulation between BIT transcripts could provide a regulatory framework that constrains the transcriptomic response BITs will have in response to environmental conditions, thereby influencing the flexibility of transcriptomic response. Changes in the degree of transcriptomic flexi- bility frequently occur in neurological disorders [37].

The intricate genetic regulation of transcripts imposes and an extra layer of complexity in tran- scriptomic data. From examining the global influence of environmental influences and genetic co- regulation, it is clear that networked holistic approaches are necessary to gain an accurate portrait of the dynamics that drive gene expression and its emergent properties that ultimately give rise to complex traits.

152 Bibliography

[1] Juan M Vaquerizas, Sarah K Kummerfeld, Sarah A Teichmann, and Nicholas M Luscombe. A census of human transcription factors: function, expression and evolution. Nature reviews. Genetics, 10(4):252–63, apr 2009.

[2] Vivian G Cheung and Richard S Spielman. Genetics of human gene expression: mapping DNA variants that influence gene expression. Nature Reviews Genetics, 10(9):595–604, 2009.

[3] Iakes Ezkurdia, David Juan, Jose Manuel Rodriguez, Adam Frankish, Mark Diekhans, Jennifer Harrow, Jesus Vazquez, Alfonso Valencia, and Michael L Tress. Multiple evidence strands suggest that there may be as few as 19,000 human protein-coding genes. Human molecular genetics, 23(22):5866–78, nov 2014.

[4] UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Research, 43(D1):D204–D212, jan 2015.

[5] Hosung Jung, Byung C Yoon, and Christine E Holt. Axonal mRNA localization and lo- cal protein synthesis in nervous system assembly, maintenance and repair. Nature reviews. Neuroscience, 13(5):308–24, may 2012.

[6] Pau Rue´ and Alfonso Martinez Arias. Cell dynamics and gene expression control in tissue homeostasis and development. Molecular , 11(1):792, jan 2015.

[7] Tony Ly, Yasmeen Ahmad, Adam Shlien, Dominique Soroka, Allie Mills, Michael J Emanuele, Michael R Stratton, and Angus I Lamond. A proteomic chronology of gene expression through the cell cycle in human myeloid leukemia cells. eLife, 3:e01630, jan 2014.

[8] Maga Rowicka, Andrzej Kudlicki, Benjamin P Tu, and Zbyszek Otwinowski. High-resolution timing of cell cycle-regulated gene expression. Proceedings of the National Academy of Sciences of the United States of America, 104(43):16892–7, oct 2007.

153 [9] Amaury de Montaigu, Antonis Giakountis, Matthew Rubin, Reka´ Toth,´ Fred´ eric´ Cremer, Vladislava Sokolova, Aimone Porri, Matthieu Reymond, Cynthia Weinig, and George Cou- pland. Natural diversity in daily rhythms of gene expression contributes to phenotypic vari- ation. Proceedings of the National Academy of Sciences of the United States of America, 112(3):905–10, jan 2015.

[10] D Schwartz. Genetic Studies on Mutant Enzymes in Maize. III. Control of Gene Action in the Synthesis of Ph 7.5 Esterase. Genetics, 47(11):1609–15, nov 1962.

[11] Leslie G Biesecker. Hypothesis-generating research and predictive medicine. Genome research, 23(7):1051–3, jul 2013.

[12] Margus Lukk, Misha Kapushesky, Janne Nikkila,¨ Helen Parkinson, Angela Goncalves, Wolf- gang Huber, Esko Ukkonen, and Alvis Brazma. A global map of human gene expression. Nature biotechnology, 28(4):322–324, 2010.

[13] Adriane Menssen, Thomas Haupl,¨ Michael Sittinger, Bruno Delorme, Pierre Charbord, and Jochen Ringe. Differential gene expression profiling of human bone marrow-derived mes- enchymal stem cells during adipogenic development. BMC genomics, 12(1):461, 2011.

[14] G Atzmon, X M Yang, R Muzumdar, X H Ma, I Gabriely, and N Barzilai. Differential gene expression between visceral and subcutaneous fat depots. and metabolic research, 34(11-12):622–628, 2002.

[15] Vasilis Sitras, Christopher Fenton, Ruth Paulssen, Ase˚ Vartun,˚ and G Acharya. Differences in gene expression between first and third trimester human placenta: a microarray study. PloS one, 7(3):e33294, jan 2012.

[16] Biljana Jovov, Felix Araujo-Perez, Carlie S Sigel, Jeran K Stratford, Amber N McCoy, Jen Jen Yeh, and Temitope Keku. Differential gene expression between African American and Euro- pean American colorectal cancer patients. PLoS ONE, 7(1):e30168, 2012.

[17] Toshiaki Watanabe, Takashi Kobunai, Yoko Yamamoto, Keiji Matsuda, Soichiro Ishihara, Kei- jiro Nozawa, Hisae Iinuma, Hiroki Ikeuchi, and Kiyoshi Eshima. Differential gene expression signatures between colorectal with and without KRAS mutations: Crosstalk between the KRAS pathway and other signalling pathways. European Journal of Cancer, 47(13):1946– 1954, 2011.

154 [18] Elizabeth K Brint, John MacSharry, Aine Fanning, Fergus Shanahan, and Eamonn M M Quigley. Differential expression of toll-like receptors in patients with irritable bowel syndrome. The American journal of gastroenterology, 106(2):329–36, feb 2011.

[19] Emma Nilsson, Per Anders Jansson, Alexander Perfilyev, Petr Volkov, Maria Pedersen, Maria K Svensson, Pernille Poulsen, Rasmus Ribel-Madsen, Nancy L Pedersen, Peter Alm- gren, Joao˜ Fadista, Tina Ronn,¨ Bente Klarlund Pedersen, Camilla Scheele, Allan Vaag, and Charlotte Ling. Altered DNA methylation and differential expression of genes influencing metabolism and inflammation in adipose tissue from subjects with type 2 diabetes. Diabetes, 63(9):2962–76, sep 2014.

[20] W G Hill, M E Goddard, and P M Visscher. Data and theory point to mainly additive genetic variance for complex traits. PLoS Genetics, 4(2):e1000008, 2008.

[21] Douglas S Falconer, Trudy F C Mackay, and Richard Frankham. Introduction to Quantitative Genetics (4th edn). Trends in Genetics, 12(7):280, 1996.

[22] Jian Yang, Beben Benyamin, Brian P McEvoy, Scott Gordon, Anjali K Henders, Dale R Nyholt, Pamela A Madden, Andrew C Heath, Nicholas G Martin, Grant W Montgomery, Michael E Goddard, and Peter M Visscher. Common SNPs explain a large proportion of the heritability for . Nature genetics, 42(7):565–569, jul 2010.

[23] Joseph E Powell, Peter M Visscher, and Michael E Goddard. Reconciling the analysis of IBD and IBS in complex trait studies. Nature reviews. Genetics, 11(11):800–5, nov 2010.

[24] Ertugrul M Ozbudak, Mukund Thattai, Iren Kurtser, Alan D Grossman, and Alexander van Oudenaarden. Regulation of noise in the expression of a single gene. Nature genetics, 31(1):69–73, may 2002.

[25] Jonathan M Raser and Erin K O’Shea. Noise in gene expression: origins, consequences, and control. science, 309(5743):2010–2013, sep 2005.

[26] Arjun Raj and Alexander van Oudenaarden. Nature, nurture, or chance: stochastic gene ex- pression and its consequences. Cell, 135(2):216–226, oct 2008.

[27] Kasper D Hansen, Zhijin Wu, Rafael A Irizarry, and Jeffrey T Leek. Sequencing technology does not eliminate biological variability. Nature biotechnology, 29(7):572–3, jul 2011.

155 [28] Hiroaki Kitano. Biological robustness. Nature Reviews Genetics, 5(11):826–837, nov 2004.

[29] Joanna Masel and Meredith V Trotter. Robustness and evolvability. Trends in genetics : TIG, 26(9):406–14, sep 2010.

[30] David A Garfield, Daniel E Runcie, Courtney C Babbitt, Ralph Haygood, William J Nielsen, and Gregory A Wray. The impact of gene expression variation on the robustness and evolvabil- ity of a developmental gene regulatory network. PLoS biology, 11(10):e1001696, oct 2013.

[31] Arjun Raj, Scott A Rifkin, Erik Andersen, and Alexander Van Oudenaarden. Variability in gene expression underlies incomplete penetrance. Nature, 463(7283):913–918, feb 2010.

[32] C Waddington. Canalization of development and the inheritance of acquired characters. Nature, 150:563–565, 1942.

[33] Neelroop N. Parikshak, Michael J. Gandal, and Daniel H. Geschwind. Systems biology and gene networks in neurodevelopmental and neurodegenerative disorders. Nature Reviews Genetics, 16(8):441–458, jul 2015.

[34] S Hong Lee, Stephan Ripke, Benjamin M Neale, Stephen V Faraone, Shaun M Purcell, Roy H Perlis, Bryan J Mowry, Anita Thapar, Michael E Goddard, John S Witte, Devin Absher, In- grid Agartz, Huda Akil, Farooq Amin, Ole A Andreassen, Adebayo Anjorin, Richard An- ney, Verneri Anttila, Dan E Arking, Philip Asherson, Maria H Azevedo, Lena Backlund, Ju- dith A Badner, Anthony J Bailey, Tobias Banaschewski, Jack D Barchas, Michael R Barnes, Thomas B Barrett, Nicholas Bass, Agatino Battaglia, Michael Bauer, Monica` Bayes,´ Frank Bellivier, Sarah E Bergen, Wade Berrettini, Catalina Betancur, Thomas Bettecken, Joseph Biederman, Elisabeth B Binder, Donald W Black, Douglas H R Blackwood, Cinnamon S Bloss, Michael Boehnke, Dorret I Boomsma, Gerome Breen, Rene´ Breuer, Richard Brugge- man, Paul Cormican, Nancy G Buccola, Jan K Buitelaar, William E Bunney, Joseph D Buxbaum, William F Byerley, Enda M Byrne, Sian Caesar, Wiepke Cahn, Rita M Cantor, Miguel Casas, Aravinda Chakravarti, Kimberly Chambert, Khalid Choudhury, Sven Cichon, C Robert Cloninger, David A Collier, Edwin H Cook, Hilary Coon, Bru Cormand, Aiden Corvin, William H Coryell, David W Craig, Ian W Craig, Jennifer Crosbie, Michael L Cuc- caro, David Curtis, Darina Czamara, Susmita Datta, Geraldine Dawson, Richard Day, Eco J De Geus, Franziska Degenhardt, Srdjan Djurovic, Gary J Donohoe, Alysa E Doyle, Jubao Duan, Frank Dudbridge, Eftichia Duketis, Richard P Ebstein, Howard J Edenberg, Josephine Elia,

156 Sean Ennis, Bruno Etain, Ayman Fanous, Anne E Farmer, I Nicol Ferrier, Matthew Flickinger, Eric Fombonne, Tatiana Foroud, Josef Frank, Barbara Franke, Christine Fraser, Robert Freed- man, Nelson B Freimer, Christine M Freitag, Marion Friedl, Louise Frisen,´ Louise Gallagher, Pablo V Gejman, Lyudmila Georgieva, Elliot S Gershon, Daniel H Geschwind, Ina Giegling, Michael Gill, Scott D Gordon, Katherine Gordon-Smith, Elaine K Green, Tiffany A Green- wood, Dorothy E Grice, Magdalena Gross, Detelina Grozeva, Weihua Guan, Hugh Gurling, Lieuwe De Haan, Jonathan L Haines, Hakon Hakonarson, Joachim Hallmayer, Steven P Hamil- ton, Marian L Hamshere, Thomas F Hansen, Annette M Hartmann, Martin Hautzinger, An- drew C Heath, Anjali K Henders, Stefan Herms, Ian B Hickie, Maria Hipolito, Susanne Hoe- fels, Peter A Holmans, Florian Holsboer, Witte J Hoogendijk, Jouke-Jan Hottenga, Christina M Hultman, Vanessa Hus, Andres´ Ingason, Marcus Ising, Stephane´ Jamain, Edward G Jones, Ian Jones, Lisa Jones, Jung-Ying Tzeng, Anna K Kahler,¨ Rene´ S Kahn, Radhika Kandaswamy, Matthew C Keller, James L Kennedy, Elaine Kenny, Lindsey Kent, Yunjung Kim, George K Kirov, Sabine M Klauck, Lambertus Klei, James A Knowles, Martin A Kohli, Daniel L Koller, Bettina Konte, Ania Korszun, Lydia Krabbendam, Robert Krasucki, Jonna Kuntsi, Phoenix Kwan, Mikael Landen,´ Niklas Langstr˚ om,¨ Mark Lathrop, Jacob Lawrence, William B Law- son, Marion Leboyer, David H Ledbetter, Phil H Lee, Todd Lencz, Klaus-Peter Lesch, Dou- glas F Levinson, Cathryn M Lewis, Jun Li, Paul Lichtenstein, Jeffrey A Lieberman, Dan-Yu Lin, Don H Linszen, Chunyu Liu, Falk W Lohoff, Sandra K Loo, Catherine Lord, Jennifer K Lowe, Susanne Lucae, Donald J MacIntyre, Pamela A F Madden, Elena Maestrini, Patrik K E Magnusson, Pamela B Mahon, Wolfgang Maier, Anil K Malhotra, Shrikant M Mane, Christa L Martin, Nicholas G Martin, Manuel Mattheisen, Keith Matthews, Morten Mattings- dal, Steven A McCarroll, Kevin A McGhee, James J McGough, Patrick J McGrath, Peter McGuffin, Melvin G McInnis, Andrew McIntosh, Rebecca McKinney, Alan W McLean, Fran- cis J McMahon, William M McMahon, Andrew McQuillin, Helena Medeiros, Sarah E Med- land, Sandra Meier, Ingrid Melle, Fan Meng, Jobst Meyer, Christel M Middeldorp, Lefkos Mid- dleton, Vihra Milanova, Ana Miranda, Anthony P Monaco, Grant W Montgomery, Jennifer L Moran, Daniel Moreno-De-Luca, Gunnar Morken, Derek W Morris, Eric M Morrow, Valentina Moskvina, Pierandrea Muglia, Thomas W Muhleisen,¨ Walter J Muir, Bertram Muller-Myhsok,¨ Michael Murtha, Richard M Myers, Inez Myin-Germeys, Michael C Neale, Stan F Nelson, Car- oline M Nievergelt, Ivan Nikolov, Vishwajit Nimgaonkar, Willem A Nolen, Markus M Nothen,¨ John I Nurnberger, Evaristus A Nwulia, Dale R Nyholt, Colm O’Dushlaine, Robert D Oades, Ann Olincy, Guiomar Oliveira, Line Olsen, Roel A Ophoff, Urban Osby, Michael J Owen, Aarno Palotie, Jeremy R Parr, Andrew D Paterson, Carlos N Pato, Michele T Pato, Brenda W

157 Penninx, Michele L Pergadia, Margaret A Pericak-Vance, Benjamin S Pickard, Jonathan Pimm, Joseph Piven, Danielle Posthuma, James B Potash, Fritz Poustka, Peter Propping, Vinay Puri, Digby J Quested, Emma M Quinn, Josep Antoni Ramos-Quiroga, Henrik B Rasmussen, Soumya Raychaudhuri, Karola Rehnstrom,¨ Andreas Reif, Marta Ribases,´ John P Rice, Mar- cella Rietschel, Kathryn Roeder, Herbert Roeyers, Lizzy Rossin, Aribert Rothenberger, Guy Rouleau, Douglas Ruderfer, Dan Rujescu, Alan R Sanders, Stephan J Sanders, Susan L San- tangelo, Joseph A Sergeant, Russell Schachar, Martin Schalling, Alan F Schatzberg, William A Scheftner, Gerard D Schellenberg, Stephen W Scherer, Nicholas J Schork, Thomas G Schulze, Johannes Schumacher, Markus Schwarz, Edward Scolnick, Laura J Scott, Jianxin Shi, Paul D Shilling, Stanley I Shyn, Jeremy M Silverman, Susan L Slager, Susan L Smalley, Johannes H Smit, Erin N Smith, Edmund J S Sonuga-Barke, David St Clair, Matthew State, Michael Stef- fens, Hans-Christoph Steinhausen, John S Strauss, Jana Strohmaier, T Scott Stroup, James S Sutcliffe, Peter Szatmari, Szabocls Szelinger, Srinivasa Thirumalai, Robert C Thompson, Alexandre A Todorov, Federica Tozzi, Jens Treutlein, Manfred Uhr, Edwin J C G van den Oord, Gerard Van Grootheest, Jim Van Os, Astrid M Vicente, Veronica J Vieland, John B Vin- cent, Peter M Visscher, Christopher A Walsh, Thomas H Wassink, Stanley J Watson, Myrna M Weissman, Thomas Werge, Thomas F Wienker, Ellen M Wijsman, Gonneke Willemsen, Nigel Williams, A Jeremy Willsey, Stephanie H Witt, Wei Xu, Allan H Young, Timothy W Yu, Stanley Zammit, Peter P Zandi, Peng Zhang, Frans G Zitman, Sebastian Zollner,¨ Bernie De- vlin, John R Kelsoe, Pamela Sklar, Mark J Daly, Michael C O’Donovan, Nicholas Craddock, Patrick F Sullivan, Jordan W Smoller, Kenneth S Kendler, and Naomi R Wray. Genetic relationship between five psychiatric disorders estimated from genome-wide SNPs. Nature genetics, 45(9):984–94, sep 2013.

[35] Shaun M Purcell, Jennifer L Moran, Menachem Fromer, Douglas Ruderfer, Nadia Solovi- eff, Panos Roussos, Colm O’Dushlaine, Kimberly Chambert, Sarah E Bergen, Anna Kahler,¨ Laramie Duncan, Eli Stahl, Giulio Genovese, Esperanza Fernandez,´ Mark O Collins, Noboru H Komiyama, Jyoti S Choudhary, Patrik K E Magnusson, Eric Banks, Khalid Shakir, Kiran Garimella, Tim Fennell, Mark DePristo, Seth G N Grant, Stephen J Haggarty, Stacey Gabriel, Edward M Scolnick, Eric S Lander, Christina M Hultman, Patrick F Sullivan, Steven A Mc- Carroll, and Pamela Sklar. A polygenic burden of rare disruptive mutations in schizophrenia. Nature, 506(7487):185–90, feb 2014.

[36] Ivan Iossifov, Brian J. O’Roak, Stephan J. Sanders, Michael Ronemus, Niklas Krumm, Dan

158 Levy, Holly A. Stessman, Kali T. Witherspoon, Laura Vives, Karynne E. Patterson, Joshua D. Smith, Bryan Paeper, Deborah A. Nickerson, Jeanselle Dea, Shan Dong, Luis E. Gonzalez, Jeffrey D. Mandell, Shrikant M. Mane, Michael T. Murtha, Catherine A. Sullivan, Michael F. Walker, Zainulabedin Waqar, Liping Wei, A. Jeremy Willsey, Boris Yamrom, Yoon-ha Lee, Ewa Grabowska, Ertugrul Dalkic, Zihua Wang, Steven Marks, Peter Andrews, Anthony Leotta, Jude Kendall, Inessa Hakker, Julie Rosenbaum, Beicong Ma, Linda Rodgers, Jennifer Troge, Giuseppe Narzisi, Seungtai Yoon, Michael C. Schatz, Kenny Ye, W. Richard McCombie, Jay Shendure, Evan E. Eichler, Matthew W. State, and Michael Wigler. The contribution of de novo coding mutations to spectrum disorder. Nature, 515(7526):216–221, oct 2014.

[37] Jessica C Mar, Nicholas A Matigian, Alan Mackay-Sim, George D Mellick, Carolyn M Sue, Peter A Silburn, John J McGrath, John Quackenbush, and Christine A Wells. Variance of gene expression identifies altered network constraints in neurological disease. PLoS genetics, 7(8):e1002207, aug 2011.

[38] Daniel F Jarosz and Susan Lindquist. Hsp90 and environmental stress transform the adaptive value of natural genetic variation. Science (New York, N.Y.), 330(6012):1820–4, dec 2010.

[39] Mete Civelek and Aldons J. Lusis. Systems genetics approaches to understand complex traits. Nature Reviews Genetics, 15(1):34–48, dec 2013.

[40] Alkes L Price, Agnar Helgason, Gudmar Thorleifsson, Steven A McCarroll, Augustine Kong, and Kari Stefansson. Single-tissue and cross-tissue heritability of gene expression via identity- by-descent in related or unrelated individuals. PLoS Genetics, 7(2):e1001317, 2011.

[41] Elin Grundberg, Kerrin S Small, Asa˚ K Hedman, Alexandra C Nica, Alfonso Buil, Sarah Keild- son, Jordana T Bell, Tsun-Po Yang, Eshwar Meduri, Amy Barrett, James Nisbett, Magdalena Sekowska, Alicja Wilk, So-Youn Shin, Daniel Glass, Mary Travers, Josine L Min, Sue Ring, Karen Ho, Gudmar Thorleifsson, Augustine Kong, Unnur Thorsteindottir, Chrysanthi Ainali, Antigone S Dimas, Neelam Hassanali, Catherine Ingle, David Knowles, Maria Krestyaninova, Christopher E Lowe, Paola Di Meglio, Stephen B Montgomery, Leopold Parts, Simon Potter, Gabriela Surdulescu, Loukia Tsaprouni, Sophia Tsoka, Veronique Bataille, Richard Durbin, Frank O Nestle, Stephen O’Rahilly, Nicole Soranzo, Cecilia M Lindgren, Krina T Zondervan, Kourosh R Ahmadi, Eric E Schadt, Kari Stefansson, , Mark I McCarthy, Panos Deloukas, Emmanouil T Dermitzakis, and Tim D Spector. Mapping cis- and trans- regulatory effects across multiple tissues in twins. Nature genetics, 44(10):1084–9, oct 2012.

159 [42] J E Powell, A K Henders, A F McRae, A Caracella, S Smith, M J Wright, J B Whitfield, E T Dermitzakis, N G Martin, and P M Visscher. The Brisbane Systems Genetics Study: Genetical Genomics Meets Complex Trait Genetics. PLoS ONE, 7(4):e35430, 2012.

[43] Anna L Dixon, Liming Liang, Miriam F Moffatt, Wei Chen, Simon Heath, Kenny C C Wong, Jenny Taylor, Edward Burnett, Ivo Gut, and Martin Farrall. A genome-wide association study of global gene expression. Nature genetics, 39(10):1202–1207, 2007.

[44] Joseph E Powell, Anjali K Henders, Allan F McRae, Jinhee Kim, Gibran Hemani, Nicholas G Martin, Emmanouil T Dermitzakis, Greg Gibson, Grant W Montgomery, and Peter M Visscher. Congruence of additive and non-additive effects on gene expression estimated from pedigree and SNP data. PLoS genetics, 9(5):e1003502, may 2013.

[45] Fred A Wright, Patrick F Sullivan, Andrew I Brooks, Fei Zou, Wei Sun, Kai Xia, Vered Madar, Rick Jansen, Wonil Chung, Yi-Hui Zhou, Abdel Abdellaoui, Sandra Batista, Casey Butler, Guanhua Chen, Ting-Huei Chen, David D’Ambrosio, Paul Gallins, Min Jin Ha, Jouke Jan Hottenga, Shunping Huang, Mathijs Kattenberg, Jaspreet Kochar, Christel M Middeldorp, Ani Qu, Andrey Shabalin, Jay Tischfield, Laura Todd, Jung-Ying Tzeng, Gerard van Grootheest, Jacqueline M Vink, Qi Wang, Wei Wang, Weibo Wang, Gonneke Willemsen, Johannes H Smit, Eco J de Geus, Zhaoyu Yin, Brenda W J H Penninx, and Dorret I Boomsma. Heritability and genomics of gene expression in peripheral blood. Nature genetics, 46(5):430–7, may 2014.

[46] V Emilsson, G Thorleifsson, B Zhang, A S Leonardson, F Zink, J Zhu, S Carlson, A Helgason, G B Walters, and S Gunnarsdottir. Genetics of gene expression and its effect on disease. Nature, 452(7186):423–428, 2008.

[47] Greg Elgar and Tanya Vavouri. Tuning in to the signals: noncoding sequence conservation in vertebrate genomes. Trends in genetics: TIG, 24(7):344, 2008.

[48] John S Mattick. RNA regulation: a new genetics? Nature Reviews Genetics, 5(4):316–323, 2004.

[49] John S Mattick and Igor V Makunin. Non-coding RNA. Human molecular genetics, 15(suppl 1):R17–R29, 2006.

[50] Paulo P Amaral, Marcel E Dinger, Tim R Mercer, and John S Mattick. The eukaryotic genome as an RNA machine. Science Signalling, 319(5871):1787, 2008.

160 [51] Sarah Djebali, Carrie A Davis, Angelika Merkel, Alex Dobin, Timo Lassmann, Ali Mortazavi, Andrea Tanzer, Julien Lagarde, Wei Lin, and Felix Schlesinger. Landscape of transcription in human cells. Nature, 489(7414):101–108, 2012.

[52] Jainab Khatun. An Integrated Encyclopedia of DNA Elements in the Human Genome. Nature, 2012.

[53] Lucia A Hindorff, Praveen Sethupathy, Heather A Junkins, Erin M Ramos, Jayashri P Mehta, Francis S Collins, and Teri A Manolio. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences, 106(23):9362–9367, jun 2009.

[54] Jian Yang, Teri A Manolio, Louis R Pasquale, Eric Boerwinkle, Neil Caporaso, Julie M Cunningham, Mariza De Andrade, Bjarke Feenstra, Eleanor Feingold, M Geoffrey Hayes, William G Hill, Maria Teresa Landi, Alvaro Alonso, Guillaume Lettre, Peng Lin, Hua Ling, William Lowe, Rasika A Mathias, Mads Melbye, Elizabeth Pugh, Marilyn C Cornelis, Bruce S Weir, Michael E Goddard, and Peter M Visscher. Genome partitioning of genetic variation for complex traits using common SNPs. Nature genetics, 43(6):519–525, jun 2011.

[55] Alexander Gusev, S.Hong Lee, Gosia Trynka, Hilary Finucane, BjarniJ. Vilhjalmsson,´ Han Xu, Chongzhi Zang, Stephan Ripke, Brendan Bulik-Sullivan, Eli Stahl, Anna K. Kahler,¨ Christina M. Hultman, Shaun M. Purcell, Steven A. McCarroll, Mark Daly, Bogdan Pasa- niuc, Patrick F. Sullivan, Benjamin M. Neale, Naomi R. Wray, Soumya Raychaudhuri, and Alkes L. Price. Partitioning Heritability of Regulatory and Cell-Type-Specific Variants across 11 Common Diseases. The American Journal of Human Genetics, 95(5):535–552, nov 2014.

[56] Harald H H Goring,¨ Joanne E Curran, Matthew P Johnson, Thomas D Dyer, Jac Charlesworth, Shelley A Cole, Jeremy B M Jowett, Lawrence J Abraham, David L Rainwater, Anthony G Co- muzzie, Michael C Mahaney, Laura Almasy, Jean W MacCluer, Ahmed H Kissebah, Gregory R Collier, Eric K Moses, and John Blangero. Discovery of expression QTLs using large-scale transcriptional profiling in human lymphocytes. Nature genetics, 39(10):1208–1216, oct 2007.

[57] Harm-Jan Westra, Marjolein J Peters, Tonu˜ Esko, Hanieh Yaghootkar, Claudia Schurmann, Johannes Kettunen, Mark W Christiansen, Benjamin P Fairfax, Katharina Schramm, Joseph E Powell, Alexandra Zhernakova, Daria V Zhernakova, Jan H Veldink, Leonard H Van den Berg, Juha Karjalainen, Sebo Withoff, Andre´ G Uitterlinden, Albert Hofman, Fernando Rivadeneira,

161 Peter A C ’t Hoen, Eva Reinmaa, Krista Fischer, Mari Nelis, Lili Milani, David Melzer, Luigi Ferrucci, Andrew B Singleton, Dena G Hernandez, Michael A Nalls, Georg Homuth, Matthias Nauck, Dorte¨ Radke, Uwe Volker,¨ Markus Perola, Veikko Salomaa, Jennifer Brody, Astrid Suchy-Dicey, Sina A Gharib, Daniel A Enquobahrie, Thomas Lumley, Grant W Montgomery, Seiko Makino, Holger Prokisch, Christian Herder, Michael Roden, Harald Grallert, Thomas Meitinger, Konstantin Strauch, Yang Li, Ritsert C Jansen, Peter M Visscher, Julian C Knight, Bruce M Psaty, Samuli Ripatti, Alexander Teumer, Timothy M Frayling, Andres Metspalu, Joyce B J van Meurs, and Lude Franke. Systematic identification of trans eQTLs as putative drivers of known disease associations. Nature genetics, 45(10):1238–43, oct 2013.

[58] Qiyuan Li, Ji-Heui Seo, Barbara Stranger, Aaron McKenna, Itsik Pe’er, Thomas Laframboise, Myles Brown, Svitlana Tyekucheva, and Matthew L Freedman. Integrative eQTL-based anal- yses reveal the biology of breast cancer risk loci. Cell, 152(3):633–41, jan 2013.

[59] Tanja Zeller, Philipp Wild, Silke Szymczak, Maxime Rotival, Arne Schillert, Raphaele Castagne, Seraya Maouche, Marine Germain, Karl Lackner, Heidi Rossmann, Medea Eleft- heriadis, Christoph R Sinning, Renate B Schnabel, Edith Lubos, Detlev Mennerich, Werner Rust, Claire Perret, Carole Proust, Viviane Nicaud, Joseph Loscalzo, Norbert Hubner,¨ David Tregouet, Thomas Munzel,¨ Andreas Ziegler, Laurence Tiret, Stefan Blankenberg, and Franc¸ois Cambien. Genetics and beyond–the transcriptome of human monocytes and disease suscepti- bility. PloS one, 5(5):e10693, jan 2010.

[60] William Cookson, Liming Liang, Gonc¸alo Abecasis, Miriam Moffatt, and Mark Lathrop. Map- ping complex disease traits with global gene expression. Nature Reviews Genetics, 10(3):184– 194, 2009.

[61] A C Nica and E T Dermitzakis. Using gene expression to investigate the genetic basis of complex disorders. Human molecular genetics, 17(R2):R129–R134, 2008.

[62] James C Chen, Mariano J Alvarez, Flaminia Talos, Harshil Dhruv, Gabrielle E Rieckhof, Archana Iyer, Kristin L Diefes, Kenneth Aldape, Michael Berens, Michael M Shen, and An- drea Califano. Identification of causal genetic drivers of human disease through systems-level analysis of regulatory networks. Cell, 159(2):402–14, oct 2014.

[63] Barbara E Stranger, Matthew S Forrest, Andrew G Clark, Mark J Minichiello, Samuel Deutsch, Robert Lyle, Sarah Hunt, Brenda Kahl, Stylianos E Antonarakis, and Simon Tavare.´ Genome-

162 wide associations of gene expression variation in humans. PLoS Genetics, 1(6):e78, 2005.

[64] Harm-Jan Westra and Lude Franke. From genome to function by studying eQTLs. Biochimica et biophysica acta, 1842(10):1896–1902, oct 2014.

[65] Alexis Battle, Sara Mostafavi, Xiaowei Zhu, James B Potash, Myrna M Weissman, Courtney McCormick, Christian D Haudenschild, Kenneth B Beckman, Jianxin Shi, Rui Mei, Alexan- der E Urban, Stephen B Montgomery, Douglas F Levinson, and Daphne Koller. Characteriz- ing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals. Genome research, 24(1):14–24, jan 2014.

[66] Christopher D Brown, Lara M Mangravite, and Barbara E Engelhardt. Integrative modeling of eQTLs and cis-regulatory elements suggests mechanisms underlying cell type specificity of eQTLs. PLoS genetics, 9(8):e1003649, jan 2013.

[67] James Ronald, Rachel B Brem, Jacqueline Whittle, and Leonid Kruglyak. Local regulatory variation in Saccharomyces cerevisiae. PLoS genetics, 1(2):e25, aug 2005.

[68] Frank W. Albert and Leonid Kruglyak. The role of regulatory variation in complex traits and disease. Nature Reviews Genetics, 16(4):197–212, feb 2015.

[69] Maxime Rotival, Tanja Zeller, Philipp S Wild, Seraya Maouche, Silke Szymczak, Arne Schillert, Raphaele Castagne,´ Arne Deiseroth, Carole Proust, Jessy Brocheton, Tiphaine Gode- froy, Claire Perret, Marine Germain, Medea Eleftheriadis, Christoph R Sinning, Renate B Schnabel, Edith Lubos, Karl J Lackner, Heidi Rossmann, Thomas Munzel,¨ Augusto Rendon, Jeanette Erdmann, Panos Deloukas, Christian Hengstenberg, Patrick Diemert, Gilles Montale- scot, Willem H Ouwehand, Nilesh J Samani, Heribert Schunkert, David-Alexandre Tregouet, Andreas Ziegler, Alison H Goodall, Franc¸ois Cambien, Laurence Tiret, and Stefan Blanken- berg. Integrating genome-wide genetic variations and monocyte expression data reveals trans- regulated gene modules in humans. PLoS genetics, 7(12):e1002367, dec 2011.

[70] Kerrin S Small, Asa K Hedman, Elin Grundberg, Alexandra C Nica, Gudmar Thorleifsson, Augustine Kong, Unnur Thorsteindottir, So-Youn Shin, Hannah B Richards, Nicole Soranzo, Kourosh R Ahmadi, Cecilia M Lindgren, Kari Stefansson, Emmanouil T Dermitzakis, Panos Deloukas, Timothy D Spector, and Mark I McCarthy. Identification of an imprinted master trans regulator at the KLF14 locus related to multiple metabolic phenotypes. Nature genetics, 43(6):561–4, jun 2011.

163 [71] E Petretto, J Mangion, N J Dickens, S A Cook, M K Kumaran, H Lu, J Fischer, H Maatz, V Kren, and M Pravenec. Heritability and tissue specificity of expression quantitative trait loci. PLoS Genetics, 2(10):e172, 2006.

[72] LuzD. Orozco, BrianJ. Bennett, CharlesR. Farber, Anatole Ghazalpour, Calvin Pan, Nam Che, Pingzi Wen, HongXiu Qi, Adonisa Mutukulu, Nathan Siemers, Isaac Neuhaus, Roumyana Yordanova, Peter Gargalovic, Matteo Pellegrini, Todd Kirchgessner, and AldonsJ. Lusis. Un- raveling Inflammatory Responses using Systems Genetics and Gene-Environment Interactions in Macrophages. Cell, 151(3):658–670, oct 2012.

[73] Denis A Smirnov, Michael Morley, Eunice Shin, Richard S Spielman, and Vivian G Che- ung. Genetic analysis of radiation-induced changes in human gene expression. Nature, 459(7246):587–91, may 2009.

[74] Atila van Nas, Leslie Ingram-Drake, Janet S Sinsheimer, Susanna S Wang, Eric E Schadt, Thomas Drake, and Aldons J Lusis. Expression quantitative trait loci: replication, tissue- and sex-specificity in mice. Genetics, 185(3):1059–68, jul 2010.

[75] Hyun Min Kang, Chun Ye, and Eleazar Eskin. Accurate discovery of expression quantita- tive trait loci under confounding from spurious and genuine regulatory hotspots. Genetics, 180(4):1909–1925, dec 2008.

[76] Benjamin P Fairfax, Peter Humburg, Seiko Makino, Vivek Naranbhai, Daniel Wong, Evelyn Lau, Luke Jostins, Katharine Plant, Robert Andrews, Chris McGee, and Julian C Knight. Innate immune activity conditions the effect of regulatory variants upon monocyte gene expression. Science (New York, N.Y.), 343(6175):1246949, mar 2014.

[77] Rudolf S N Fehrmann, Ritsert C Jansen, Jan H Veldink, Harm-Jan J Westra, Danny Arends, Marc Jan Bonder, Jingyuan Fu, Patrick Deelen, Harry J M Groen, Asia Smolonska, Rinse K Weersma, Robert M W Hofstra, Wim a Buurman, Sander Rensen, Marcel G M Wolfs, Mathieu Platteel, Alexandra Zhernakova, Clara C Elbers, Eleanora M Festen, Gosia Trynka, Marten H Hofker, Christiaan G J Saris, Roel a Ophoff, Leonard H van den Berg, David a van Heel, Cisca Wijmenga, Gerard J Te Meerman, and Lude Franke. Trans-eQTLs reveal that independent genetic variants associated with a complex phenotype converge on intermediate genes, with a major role for the HLA. PLoS genetics, 7(8):e1002197, aug 2011.

[78] Julien Bryois, Alfonso Buil, David M Evans, John P Kemp, Stephen B Montgomery, Donald F

164 Conrad, Karen M Ho, Susan Ring, Matthew Hurles, Panos Deloukas, George Davey Smith, and Emmanouil T Dermitzakis. Cis and trans effects of human genomic variants on gene expression. PLoS genetics, 10(7):e1004461, jul 2014.

[79] Towfique Raj, Katie Rothamel, Sara Mostafavi, Chun Ye, Mark N Lee, Joseph M Replogle, Ting Feng, Michelle Lee, Natasha Asinovski, Irene Frohlich, Selina Imboywa, Alina Von Ko- rff, Yukinori Okada, Nikolaos A Patsopoulos, Scott Davis, Cristin McCabe, Hyun-il Paik, Gyan P Srivastava, Soumya Raychaudhuri, David A Hafler, Daphne Koller, Aviv Regev, Nir Hacohen, Diane Mathis, Christophe Benoist, Barbara E Stranger, and Philip L De Jager. Polar- ization of the effects of autoimmune and neurodegenerative risk alleles in leukocytes. Science (New York, N.Y.), 344(6183):519–23, may 2014.

[80] M. N. Lee, C. Ye, A.-C. Villani, T. Raj, W. Li, T. M. Eisenhaure, S. H. Imboywa, P. I. Chipendo, F. A. Ran, K. Slowikowski, L. D. Ward, K. Raddassi, C. McCabe, M. H. Lee, I. Y. Frohlich, D. A. Hafler, M. Kellis, S. Raychaudhuri, F. Zhang, B. E. Stranger, C. O. Benoist, P. L. De Jager, A. Regev, and N. Hacohen. Common Genetic Variants Modulate Pathogen-Sensing Responses in Human Dendritic Cells. Science, 343(6175):1246980–1246980, mar 2014.

[81] Wen-Hua Wei, Gibran Hemani, and Chris S Haley. Detecting epistasis in human complex traits. Nat Rev Genet, 15(11):722–733, 2014.

[82] Xuefeng Wang, Robert C Elston, and Xiaofeng Zhu. The meaning of interaction. Human heredity, 70(4):269–77, 2010.

[83] Or Zuk, Eliana Hechter, Shamil R Sunyaev, and Eric S Lander. The mystery of missing heri- tability: Genetic interactions create phantom heritability. Proceedings of the National Academy of Sciences of the United States of America, 109(4):1193–8, jan 2012.

[84] Sven Stringer, Eske M. Derks, Rene´ S. Kahn, William G. Hill, Naomi R. Wray, B Ma- her, TA Manolio, FS Collins, NJ Cox, DB Goldstein, LA Hindorff, EE Eichler, J Flint, G Gibson, A Kong, SM Leal, AJ Clarke, DN Cooper, S Stringer, NR Wray, RS Kahn, EM Derks, PM Visscher, WG Hill, NR Wray, W Haenszel, SW Guo, NR Wray, II Gottes- man, O Zuk, E Hechter, SR Sunyaev, ES Lander, A Sigal, R Milo, A Cohen, N Geva-Zatorsky, Y Klein, WG Hill, XS Zhang, TFC Mackay, J van Dongen, PE Slagboom, HHM Draisma, NG Martin, DI Boomsma, K Silventoinen, I Rahman, AM Bennet, NL Pedersen, U de Faire, P Svensson, B Teucher, J Skinner, PML Skidmore, A Cassidy, SJ Fairweather-Tait, WG Hill,

165 ME Goddard, PM Visscher, L Petersen, TIA Sørensen, S Wright, NH Barton, PD Keightley, O Carlborg, CS Haley, PC Phillips, G Hemani, S Knott, C Haley, JF Crow, A Eyre-Walker, TF Mackay, RF Lyman, W Huang, S Richards, MA Carbone, D Zhu, RRH Anholt, TFC Mackay, S Richards, EA Stone, A Barbadilla, JF Ayroles, H Kacser, JA Burns, RK Suarez, CD Moyes, DA Fell, PD Keightley, J Yang, B Benyamin, BP McEvoy, S Gordon, AK Henders, SH Lee, NR Wray, ME Goddard, PM Visscher, J Yang, SH Lee, ME Goddard, PM Visscher, N Zaitlen, P Kraft, PM Visscher, MA Brown, MI McCarthy, J Yang, Consortium The Welcome Trust Case Control, L Jostins, S Ripke, RK Weersma, RH Duerr, DP McGovern, JS Bloom, IM Ehrenreich, WT Loo, TLV Lite, L Kruglyak, KE Kemper, ME Goddard, M Haile-Mariam, G Nieuwhof, K Beard, K Konstatinov, B Hayes, K Michailidou, P Hall, A Gonzalez-Neira, M Ghoussaini, J Dennis, PM Visscher, and ME Goddard. Assumptions and Properties of Lim- iting Pathway Models for Analysis of Epistasis in Complex Traits. PLoS ONE, 8(7):e68913, jul 2013.

[85] Gibran Hemani, Konstantin Shakhbazov, Harm-Jan Westra, Tonu Esko, Anjali K Henders, Allan F McRae, Jian Yang, Greg Gibson, Nicholas G Martin, Andres Metspalu, Lude Franke, Grant W Montgomery, Peter M Visscher, and Joseph E Powell. Detection and replication of epistasis influencing transcription in humans. Nature, 508(7495):249–53, apr 2014.

[86] Guillaume Pare,´ Nancy R. Cook, Paul M. Ridker, Daniel I. Chasman, HJ Cordell, MR Nelson, SL Kardia, RE Ferrell, CF Sing, MD Ritchie, LW Hahn, JH Moore, MD Ritchie, BC White, JS Parker, LW Hahn, JH Moore, Z Bochdanovits, D Sondervan, S Perillous, T van Beijster- veldt, D Boomsma, J Millstein, DV Conti, FD Gilliland, WJ Gauderman, J Marchini, P Don- nelly, LR Cardon, DM Evans, J Marchini, AP Morris, LR Cardon, W Putt, J Palmen, V Nicaud, DA Tregouet, N Tahri-Daizadeh, YM Cho, MD Ritchie, JH Moore, JY Park, KU Lee, CE Mur- cray, JP Lewinger, WJ Gauderman, PM Ridker, DI Chasman, RY Zee, A Parker, L Rose, PM Ridker, NR Cook, IM Lee, D Gordon, JM Gaziano, GW Louis, MG Myers Jr, S Farooqi, S O’Rahilly, K Chen, F Li, J Li, H Cai, S Strom, PM Ridker, G Pare, A Parker, RY Zee, JS Danik, G Pare, DI Chasman, M Kellogg, RY Zee, N Rifai, A Ponthieux, D Lambert, B Her- beth, S Droesch, M Pfister, B Puthothu, M Krueger, M Bernhardt, A Heinzmann, B Sarecka- Hujar, I Zak, J Krauze, S Romeo, J Kozlitina, C Xing, A Pertsemlidis, D Cox, LE Johansson, U Lindblad, CA Larsson, L Rastam, M Ridderstrale, Y Song, JE Manson, L Tinker, N Ri- fai, and NR Cook. On the Use of Variance per Genotype as a Tool to Identify Quantitative Trait Interaction Effects: A Report from the Women’s Genome Health Study. PLoS Genetics,

166 6(6):e1000981, jun 2010.

[87] Andrew Anand Brown, Alfonso Buil, Ana Vinuela,˜ Tuuli Lappalainen, Hou-Feng Zheng, J Brent Richards, Kerrin S Small, Timothy D Spector, Emmanouil T Dermitzakis, Richard Durbin, L. Almasy, J. Blangero, J. Ansel, H. Bottin, C. Rodriguez-Beltran, C. Damon, M. Na- garajan, S. Fehrmann, J. Franc¸ois, G. Yvert, YS. Aulchenko, D-J. De Koning, C. Haley, J. Becker, JR. Wendland, B. Haenisch, MM. Nothen, J. Schumacher, JS. Bloom, IM. Ehren- reich, WT. Loo, TL. Lite, L. Kruglyak, D. Bates, M. Maechler, B. Bolker, S. Walker, MS. Breen, C. Kemena, PK. Vlasov, C. Notredame, FA. Kondrashov, A. Dabney, JD. Storey, with assistance from Warnes GR, The Encode Project Consortium, D. Falconer, T. Mackay, HB. Fraser, EE. Schadt, E. Grundberg, KS. Small, AK. Hedman, AC. Nica, A. Buil, S. Keildson, JT. Bell, TP. Yang, E. Meduri, A. Barrett, J. Nisbett, M. Sekowska, A. Wilk, SY. Shin, D. Glass, M. Travers, JL. Min, S. Ring, K. Ho, G. Thorleifsson, A. Kong, U. Thorsteindottir, C. Ainali, AS. Dimas, N. Hassanali, C. Ingle, D. Knowles, M. Krestyaninova, CE. Lowe, P. Di Meglio, SB. Montgomery, L. Parts, S. Potter, G. Surdulescu, L. Tsaprouni, S. Tsoka, V. Bataille, R. Durbin, FO. Nestle, S. O’Rahilly, N. Soranzo, CM. Lindgren, KT. Zondervan, KR. Ah- madi, EE. Schadt, K. Stefansson, GD. Smith, MI. Mccarthy, P. Deloukas, ET. Dermitzakis, TD. Spector, Consortium Multiple Tissue Human SExpression Resource, J. Harrow, A. Frank- ish, JM. Gonzalez, E. Tapanari, M. Diekhans, F. Kokocinski, BL. Aken, D. Barrell, A. Zadissa, S. Searle, I. Barnes, A. Bignell, V. Boychenko, T. Hunt, M. Kay, G. Mukherjee, J. Rajan, G. Despacio-Reyes, G. Saunders, C. Steward, R. Harte, M. Lin, C. Howald, A. Tanzer, T. Der- rien, J. Chrast, N. Walters, S. Balasubramanian, B. Pei, M. Tress, JM. Rodriguez, I. Ezkur- dia, J. Van Baren, M. Brent, D. Haussler, M. Kellis, A. Valencia, A. Reymond, M. Gerstein, R. Guigo, TJ. Hubbard, G. Hemani, S. Knott, C. Haley, G. Hemani, K. Shakhbazov, H-J. Wes- tra, T. Esko, AK. Henders, AF. Mcrae, J. Yang, G. Gibson, NG. Martin, A. Metspalu, L. Franke, GW. Montgomery, PM. Visscher, JE. Powell, WG. Hill, ME. Goddard, PM. Visscher, BN. Howie, P. Donnelly, J. Marchini, A. Huang, S. Xu, X. Cai, W. Huang, S. Richards, MA. Car- bone, D. Zhu, RR. Anholt, JF. Ayroles, L. Duncan, KW. Jordan, F. Lawrence, MM. Magwire, CB. Warner, K. Blankenburg, Y. Han, M. Javaid, J. Jayaseelan, SN. Jhangiani, D. Muzny, F. Ongeri, L. Perales, YQ. Wu, Y. Zhang, X. Zou, EA. Stone, RA. Gibbs, TF. Mackay, JM. Jimenez-Gomez, JA. Corwin, B. Joseph, JN. Maloof, DJ. Kliebenstein, T. Lappalainen, SB. Montgomery, AC. Nica, ET. Dermitzakis, T. Lappalainen, M. Sammeth, MRT. Friedlander, PA. Hoen, J. Monlong, MA. Rivas, M. Gonzalez-Porta, N. Kurbatova, T. Griebel, PG. Fer- reira, M. Barann, T. Wieland, L. Greger, M. Van Iterson, J. Almlof, P. Ribeca, I. Pulyakhina,

167 D. Esser, T. Giger, A. Tikhonov, M. Sultan, G. Bertier, DG. Macarthur, M. Lek, E. Lizano, HP. Buermans, I. Padioleau, T. Schwarzmayr, O. Karlberg, H. Ongen, H. Kilpinen, S. Bel- tran, M. Gut, K. Kahlem, V. Amstislavskiy, O. Stegle, M. Pirinen, SB. Montgomery, P. Don- nelly, MI. Mccarthy, P. Flicek, TM. Strom, H. Lehrach, S. Schreiber, R. Sudbrak, A. Car- racedo, SE. Antonarakis, R. Hasler, AC. Syvanen, GJ. Van Ommen, A. Brazma, T. Meitinger, P. Rosenstiel, R. Guigo, IG. Gut, X. Estivill, ET. Dermitzakis, Geuvadis Consortium, H. Li, R. Durbin, TF. Mackay, TA. Manolio, FS. Collins, NJ. Cox, DB. Goldstein, LA. Hindorff, DJ. Hunter, MI. Mccarthy, EM. Ramos, LR. Cardon, A. Chakravarti, JC. Marioni, CE. Mason, SM. Mane, M. Stephens, Y. Gilad, N. Martin, D. Rowell, J. Whitfield, S. Nakagawa, H. Schielzeth, SP. Otto, MW. Feldman, G. Pare,´ NR. Cook, PM. Ridker, DI. Chasman, L. Parts, O. Ste- gle, J. Winn, R. Durbin, JE. Powell, AK. Henders, AF. McRae, J. Kim, G. Hemani, NG. Martin, ET. Dermitzakis, G. Gibson, GW. Montgomery, PM. Visscher, R Core Team, CA. Reynolds, M. Gatz, S. Berg, NL. Pedersen, KR. Rosenbloom, CA. Sloan, VS. Malladi, TR. Dreszer, K. Learned, VM. Kirkup, MC. Wong, M. Maddren, R. Fang, SG. Heitner, JD. Storey, X. Sun, R. Elston, N. Morris, X. Zhu, The 1000 Genomes Project Consortium, The Interna- tional Human Genome Sequencing Consortium, G. Wang, E. Yang, CL. Brinkmeyer-Langford, JJ. Cai, Z. Wang, M. Gerstein, M. Snyder, AM. Wentzell, HC. Rowe, BG. Hansen, C. Ticconi, BA. Halkier, DJ. Kliebenstein, J. Yang, RJ. Loos, JE. Powell, SE. Medland, EK. Speliotes, DI. Chasman, LM. Rose, G. Thorleifsson, V. Steinthorsdottir, R. Magi, L. Waite, AV. Smith, LM. Yerges-Armstrong, KL. Monda, D. Hadley, A. Mahajan, G. Li, K. Kapur, V. Vitart, JE. Huffman, SR. Wang, C. Palmer, T. Esko, K. Fischer, JH. Zhao, A. Demirkan, A. Isaacs, MF. Feitosa, J. Luan, NL. Heard-Costa, C. White, AU. Jackson, M. Preuss, A. Ziegler, J. Eriksson, Z. Kutalik, F. Frau, IM. Nolte, JV. Van Vliet-Ostaptchouk, JJ. Hottenga, KB. Jacobs, N. Ver- weij, A. Goel, C. Medina-Gomez, K. Estrada, . Bragg-Gresham, S. Sanna, C. Sidore, J. Tyrer, A. Teumer, I. Prokopenko, M. Mangino, CM. Lindgren, TL. Assimes, AR. Shuldiner, J. Hui, JP. Beilby, WL. McArdle, P. Hall, T. Haritunians, L. Zgaga, I. Kolcic, O. Polasek, T. Zemu- nik, BA. Oostra, MJ. Junttila, H. Gronberg, S. Schreiber, A. Peters, AA. Hicks, J. Stephens, NS. Foad, J. Laitinen, A. Pouta, M. Kaakinen, G. Willemsen, JM. Vink, SH. Wild, G. Navis, FW. Asselbergs, G. Homuth, U. John, C. Iribarren, T. Harris, L. Launer, V. Gudnason, JR. O’Connell, E. Boerwinkle, G. Cadby, LJ. Palmer, AL. James, AW. Musk, E. Ingelsson, BM. Psaty, JS. Beckmann, G. Waeber, P. Vollenweider, C. Hayward, AF. Wright, I. Rudan, LC. Groop, A. Metspalu, KT. Khaw, CM. van Duijn, IB. Borecki, MA. Province, NJ. Wareham, JC. Tardif, HV. Huikuri, LA. Cupples, LD. Atwood, CS. Fox, M. Boehnke, FS. Collins, KL. Mohlke, J. Erdmann, H. Schunkert, C. Hengstenberg, K. Stark, M. Lorentzon, C. Ohlsson,

168 D. Cusi, JA. Staessen, MM. Van der Klauw, PP. Pramstaller, S. Kathiresan, JD. Jolley, S. Ri- patti, MR. Jarvelin, EJ. de Geus, DI. Boomsma, B. Penninx, JF. Wilson, H. Campbell, SJ. Chanock, P. van der Harst, A. Hamsten, H. Watkins, A. Hofman, JC. Witteman, MC. Zillikens, AG. Uitterlinden, F. Rivadeneira, MC. Zillikens, LA. Kiemeney, SH. Vermeulen, GR. Abeca- sis, D. Schlessinger, S. Schipf, M. Stumvoll, A. Tonjes,¨ TD. Spector, KE. North, G. Lettre, MI. McCarthy, SI. Berndt, AC. Heath, PA. Madden, DR. Nyholt, GW. Montgomery, NG. Martin, B. McKnight, DP. Strachan, WG. Hill, H. Snieder, PM. Ridker, U. Thorsteinsdottir, K. Ste- fansson, TM. Frayling, JN. Hirschhorn, ME. Goddard, PM. Visscher, O. Zuk, E. Hechter, SR. Sunyaev, and ES. Lander. Genetic interactions affecting human gene expression identified by variance association mapping. eLife, 3:e01381, apr 2014.

[88] Leonid Padyukov, Camilla Silva, Patrik Stolt, Lars Alfredsson, and Lars Klareskog. A gene- environment interaction between smoking and shared epitope genes in HLA-DR provides a high risk of seropositive rheumatoid arthritis. Arthritis & Rheumatism, 50(10):3085–3092, oct 2004.

[89] Vladislav Grishkevich and Itai Yanai. The genomic determinants of genotype environment interactions in gene expression. Trends in Genetics, 29(8):479–487, 2013.

[90] M. H. Van IJzendoorn, Marian J. Bakermans-Kranenburg, Jay Belsky, Steven Beach, Gene Brody, Kenneth A Dodge, Mark Greenberg, Michael Posner, and Stephen Scott. Gene-by- environment experiments: A new approach to finding the missing heritability. Nature Reviews Genetics, 12(12):881–881, 2011.

[91] Danielle M Dick. Gene-environment interaction in psychological traits and disorders. Annual review of clinical psychology, 7:383–409, 2011.

[92] J Kim-Cohen, A Caspi, A Taylor, B Williams, R Newcombe, I W Craig, and T E Moffitt. MAOA, maltreatment, and geneenvironment interaction predicting children’s mental health: new evidence and a meta-analysis. Molecular Psychiatry, 11(10):903–913, oct 2006.

[93] T C Eley, K Sugden, A Corsico, A M Gregory, P Sham, P McGuffin, R Plomin, and I W Craig. Geneenvironment interaction analysis of serotonin system markers with adolescent depression. Molecular Psychiatry, 9(10):908–915, oct 2004.

[94] Marian J. Bakermans-Kranenburg and Marinus H. van IJzendoorn. Gene-environment interac- tion of the dopamine D4 receptor (DRD4) and observed maternal insensitivity predicting ex-

169 ternalizing behavior in preschoolers. Developmental Psychobiology, 48(5):406–409, jul 2006.

[95] Avshalom Caspi, Terrie E. Moffitt, Mary Cannon, Joseph McClay, Robin Murray, HonaLee Harrington, Alan Taylor, Louise Arseneault, Ben Williams, Antony Braithwaite, Richie Poul- ton, and Ian W. Craig. Moderation of the Effect of Adolescent-Onset Cannabis Use on Adult Psychosis by a Functional Polymorphism in the Catechol-O-Methyltransferase Gene: Longi- tudinal Evidence of a Gene X Environment Interaction. Biological Psychiatry, 57(10):1117– 1127, 2005.

[96] Joel Nigg, Molly Nikolas, and S. Alexandra Burt. Measured Gene-by-Environment Interaction in Relation to Attention-Deficit/Hyperactivity Disorder. Journal of the American Academy of Child & Adolescent Psychiatry, 49(9):863–873, 2010.

[97] Neil Risch, Richard Herrell, Thomas Lehner, Kung-Yee Liang, Lindon Eaves, Josephine Hoh, Andrea Griem, Maria Kovacs, Jurg Ott, and Kathleen Ries Merikangas. Interaction Between the Serotonin Transporter Gene (5-HTTLPR), Stressful Life Events, and Risk of Depression. JAMA, 301(23):2462, jun 2009.

[98] Xiang-Yang Lou, Guo-Bo Chen, Lei Yan, Jennie Z. Ma, Jun Zhu, Robert C. Elston, and Ming D. Li. A Generalized Combinatorial Approach for Detecting Gene-by-Gene and Gene- by-Environment Interactions with Application to Nicotine Dependence. The American Journal of Human Genetics, 80(6):1125–1137, 2007.

[99] Michael Kabesch. Gene by environment interactions and the development of asthma and al- lergy. Letters, 162(1):43–48, 2006.

[100] Stephanie J. London and Isabelle Romieu. Gene by Environment Interaction in Asthma. Annual Review of , 30(1):55–80, apr 2009.

[101] Gerard H. Koppelman. Gene by environment interaction in asthma. Current Allergy and Asthma Reports, 6(2):103–111, mar 2006.

[102] Benjamin P Fairfax, Seiko Makino, Jayachandran Radhakrishnan, Katharine Plant, Stephen Leslie, Alexander Dilthey, Peter Ellis, Cordelia Langford, Fredrik O Vannberg, and Julian C Knight. Genetics of gene expression in primary immune cells identifies cell type-specific mas- ter regulators and roles of HLA alleles. Nature genetics, 44(5):502–510, 2012.

[103] Luis M Franco, Kristine L Bucasas, Janet M Wells, Diane Nino,˜ Xueqing Wang, Gladys E

170 Zapata, Nancy Arden, Alexander Renwick, Peng Yu, John M Quarles, Molly S Bray, Robert B Couch, John W Belmont, and Chad A Shaw. Integrative genomic analysis of the human im- mune response to influenza vaccination. eLife, 2:e00299, jan 2013.

[104] Joel Schwartz, Sung Kyun Park, Marie S. O’Neill, Pantel S. Vokonas, David Sparrow, Scott Weiss, and Karl Kelsey. Glutathione-S-Transferase M1, Obesity, Statins, and Autonomic Ef- fects of Particles. American Journal of Respiratory and Critical Care Medicine, 172(12):1529– 1533, dec 2005.

[105] Matthew C Keller, L.E. Duncan, M.C. Keller, L.J. Eaves, M.R. Munafo, C. Durrant, G. Lewis, J. Flint, M.R. Munafo, J. Flint, N. Risch, R. Herrell, T. Lehner, K.Y. Liang, L. Eaves, J. Hoh, et Al., J. Flint, M.R. Munafo, G.H. McClelland, C.M. Judd, J.P. Ioannidis, F.J. Bosker, C.A. Hartman, I.M. Nolte, B.P. Prins, P. Terpstra, D. Posthuma, et Al., A.L. Collins, Y. Kim, P. Sklar, M.C. O’Donovan, P.F. Sullivan, A.C. Need, D. Ge, M.E. Weale, J. Maia, S. Feng, E.L. Heinzen, et Al., A.R. Sanders, J. Duan, D.F. Levinson, J. Shi, D. He, C. Hou, et Al., P.F. Sullivan, D. Lin, J.Y. Tzeng, E. van den Oord, D. Perkins, T.S. Stroup, et Al., C.E. Mur- cray, J.P. Lewinger, W.J. Gauderman, J.K. Hewitt, C. Johnston, B.B. Lahey, W. Matthys, J.G. Hull, J.C. Tedlie, D.A. Lehn, V.Y. Yzerbyt, D. Muller, C.M. Judd, D. Muller, C.M. Judd, V.Y. Yzerbyt, J. Kaufman, B.Z. Yang, H. Douglas-Palumberi, S. Houshyar, D. Lipschitz, J.H. Krys- tal, et Al., J. Cohen, P. Cohen, S.G. West, L.S. Aiken, A. Caspi, T.E. Moffitt, M. Cannon, J. McClay, R. Murray, H. Harrington, et Al., M.T. Lynskey, A.C. Heath, K.K. Bucholz, W.S. Slutske, P.A. Madden, E.C. Nelson, et Al., D. Cicchetti, F.A. Rogosch, M.L. Sturge-Apple, S.Z. Sabol, S. Hu, D. Hamer, D.M. Dick, A. Agrawal, M.A. Schuckit, L. Bierut, A. Hinrichs, L. Fox, et Al., A. Caspi, K. Sugden, T.E. Moffitt, A. Taylor, I.W. Craig, H. Harrington, et Al., A.B. Amstadter, K.C. Koenen, K.J. Ruggiero, R. Acierno, S. Galea, D.G. Kilpatrick, et Al., C. Aslund, J. Leppert, E. Comasco, N. Nordquist, L. Oreland, K.W. Nilsson, I.D. Waldman, M.J. Bakermans-Kranenburg, M.H. van Ijzendoorn, C.H. Bau, S. Almeida, M.H. Hutz, P.M. Bet, B.W. Penninx, Z. Bochdanovits, A.G. Uitterlinden, A.T. Beekman, N.M. van Schoor, et Al., E.B. Binder, R.G. Bradley, W. Liu, M.P. Epstein, T.C. Deveau, K.B. Mercer, et Al., D. Blomeyer, J. Treutlein, G. Esser, M.H. Schmidt, G. Schumann, M. Laucht, R.G. Bradley, E.B. Binder, M.P. Epstein, Y. Tang, H.P. Nair, W. Liu, et Al., A. Caspi, J. McClay, T.E. Moffitt, J. Mill, J. Martin, I.W. Craig, et Al., J. Chotai, A. Serretti, E. Lattuada, C. Lorenzi, R. Lilli, J. Covault, H. Tennen, S. Armeli, T.S. Conner, A.I. Herman, A.H. Cillessen, et Al., N.A. Fox, K.E. Nichols, H.A. Henderson, K. Rubin, L. Schmidt, D. Hamer, et Al., P. Gacek, T.S. Con-

171 ner, H. Tennen, H.R. Kranzler, J. Covault, J. Gervai, A. Novak, K. Lakatos, I. Toth, I. Danis, Z. Ronai, et Al., H.J. Grabe, M. Lange, B. Wolff, H. Volzke, M. Lucht, H.J. Freyberger, et Al., H.J. Grabe, C. Spitzer, C. Schwahn, A. Marcinek, A. Frahnow, S. Barnow, et Al., G.J. Ha- effel, M. Getchell, R.A. Koposov, C.M. Yrigollen, C.G. Deyoung, B.A. Klinteberg, et Al., C. Henquet, A. Rosa, P. Delespaul, S. Papiol, L. Fananas, J. van Os, et Al., M. Jokela, T. Lehti- maki, L. Keltikangas-Jarvinen, M. Jokela, T. Lehtimaki, L. Keltikangas-Jarvinen, M. Jokela, L. Keltikangas-Jarvinen, M. Kivimaki, S. Puttonen, M. Elovainio, R. Rontu, et Al., M. Jokela, K. Raikkonen, T. Lehtimaki, R. Rontu, L. Keltikangas-Jarvinen, R.S. Kahn, J. Khoury, W.C. Nichols, B.P. Lanphear, L. Keltikangas-Jarvinen, K. Raikkonen, J. Ekelund, L. Peltonen, K.C. Koenen, A.E. Aiello, E. Bakshis, A.B. Amstadter, K.J. Ruggiero, R. Acierno, et Al., J. Lahti, K. Raikkonen, J. Ekelund, L. Peltonen, O.T. Raitakari, Kelitkangas-Jarvinen, J. Lasky-Su, S.V. Faraone, C. Lange, M.T. Tsuang, A.E. Doyle, J.W. Smoller, et Al., M. Laucht, M.H. Skowronek, K. Becker, M.H. Schmidt, G. Esser, T.G. Schulze, et Al., M. Nobile, R. Giorda, C. Marino, O. Carlet, V. Pastore, L. Vanzin, et Al., M. Nobile, M. Rusconi, M. Bellina, C. Marino, R. Giorda, O. Carlet, et Al., T. Ozkaragoz, E.P. Noble, N. Perroud, P. Courtet, I. Vincze, I. Jaussent, F. Jollant, F. Bellivier, et Al., S.E. Racine, K.M. Culbert, C.L. Larson, K.L. Klump, W. Retz, C.M. Freitag, P. Retz-Junginger, D. Wenzler, M. Schneider, C. Kissling, et Al., G. Seeger, P. Schloss, M.H. Schmidt, A. Ruter-Jungfleisch, F.A. Henn, M.B. Stein, N.J. Schork, J. Gelernter, S.E. Stevens, R. Kumsta, J.M. Kreppner, K.J. Brookes, M. Rutter, E.J. Sonuga-Barke, N. Sun, Y. Xu, Y. Wang, H. Duan, S. Wang, Y. Ren, et Al., R.D. Todd, R.J. Neuman, R. van Winkel, C. Henquet, A. Rosa, S. Papiol, L. Fananas, M. De Hert, et Al., M.M. Vanyukov, B.S. Maher, B. Devlin, G.P. Kirillova, L. Kirisci, L.M. Yu, et Al., Y. Xu, F. Li, X. Huang, N. Sun, F. Zhang, P. Liu, et Al., Y.C. Yen, G.W. Rebok, M.J. Yang, and F.W. Lung. Gene environment interaction studies have not properly controlled for potential confounders: the problem and the (simple) solution. Biological psychiatry, 75(1):18–24, jan 2014.

[106] Casey E Romanoski, Sangderk Lee, Michelle J Kim, Leslie Ingram-Drake, Christopher L Plaisier, Roumyana Yordanova, Charles Tilford, Bo Guan, Aiqing He, Peter S Gargalovic, Todd G Kirchgessner, Judith A Berliner, and Aldons J Lusis. Systems genetics analysis of gene-by-environment interactions in human cells. American journal of human genetics, 86(3):399–410, mar 2010.

[107] Marian J. Bakermans-Kranenburg, Marinus H. Van IJzendoorn, Femke T. A. Pijlman, Judi Mesman, and Femmie Juffer. Experimental evidence for differential susceptibility: Dopamine

172 D4 receptor polymorphism (DRD4 VNTR) moderates intervention effects on toddlers’ exter- nalizing behavior in a randomized controlled trial. , 44(1):293–300, jan 2008.

[108] Christian R. Landry, Julia Oh, Daniel L. Hartl, and Duccio Cavalieri. Genome-wide scan reveals that genetic variation for transcriptional plasticity in yeast is biased towards multi-copy and dispensable genes. Gene, 366(2):343–351, 2006.

[109] Erin N Smith and Leonid Kruglyak. Gene-environment interaction in yeast gene expression. PLoS biology, 6(4):e83, apr 2008.

[110] Yang Li, Olga Alda Alvarez,´ Evert W Gutteling, Marcel Tijsterman, Jingyuan Fu, Joost A G Riksen, Esther Hazendonk, Pjotr Prins, Ronald H A Plasterk, and Ritsert C Jansen. Mapping determinants of gene expression plasticity by genetical genomics in C. elegans. PLoS Genetics, 2(12):e222, 2006.

[111] Vladislav Grishkevich, Shay Ben-Elazar, Tamar Hashimshony, Daniel H Schott, Craig P Hunter, Itai Yanai, TM. Barnes, Y. Kohara, A. Coulson, S. Hekimi, LR. Baugh, AA. Hill, DK. Slonim, EL. Brown, CP. Hunter, S. Brenner, ME. Feder, JC. Walser, DA. Garsin, JM. Villanueva, J. Begun, DH. Kim, CD. Sifri, SB. Calderwood, G. Ruvkun, FM. Ausubel, AP. Gasch, PT. Spellman, CM. Kao, O. CarmelHarel, MB. Eisen, G. Storz, D. Botstein, PO. Brown, J. Gerhart, M. Kirschner, J. Gerke, K. Lorenz, S. Ramnarine, B. Cohen, V. Gr- ishkevich, T. Hashimshony, I. Yanai, J. Hodgkin, P. Khaitovich, G. Weiss, M. Lachmann, I. Hellmann, W. Enard, B. Muetzel, U. Wirkner, W. Ansorge, S. Paabo, Y. Li, OA. Alvarez, EW. Gutteling, M. Tijsterman, J. Fu, JA. Riksen, E. Hazendonk, P. Prins, RH. Plasterk, RC. Jansen, R. Breitling, JE. Kammenga, TF. Mackay, EA. Stone, JF. Ayroles, WL. Nicholas, EC. Dougherty, EL. Hansen, SA. Rifkin, J. Kim, KP. White, SS. ShenOrr, Y. Pilpel, CP. Hunter, EN. Smith, L. Kruglyak, CA. Spike, J. Bader, V. Reinke, S. Strome, I. Tirosh, N. Barkai, I. Tirosh, S. Reikhav, AA. Levy, N. Barkai, I. Tirosh, A. Weinberger, M. Carmi, N. Barkai, I. Tirosh, KH. Wong, N. Barkai, K. Struhl, R. Wernersson, AS. Juncker, HB. Nielsen, PJ. Wit- tkopp, BK. Haerum, AG. Clark, PJ. Wittkopp, BK. Haerum, AG. Clark, I. Yanai, D. Graur, R. Ophir, I. Yanai, CP. Hunter, K. Yook, TW. Harris, T. Bieri, A. Cabunoc, J. Chan, WJ. Chen, P. Davis, N. de la Cruz, A. Duong, R. Fang, U. Ganesan, C. Grove, K. Howe, S. Kadam, R. Kishore, R. Lee, Y. Li, HM. Muller, C. Nakamura, and B. Nash. A genomic bias for genotype-environment interactions in C. elegans. Molecular systems biology, 8(1):587, jun

173 2012.

[112] Daniel Promislow. A regulatory network analysis of phenotypic plasticity in yeast. The American naturalist, 165(5):515–23, may 2005.

[113] Christopher T. Harbison, D. Benjamin Gordon, Tong Ihn Lee, Nicola J. Rinaldi, Kenzie D. Macisaac, Timothy W. Danford, Nancy M. Hannett, Jean-Bosco Tagne, David B. Reynolds, Jane Yoo, Ezra G. Jennings, Julia Zeitlinger, Dmitry K. Pokholok, Manolis Kellis, P. Alex Rolfe, Ken T. Takusagawa, Eric S. Lander, David K. Gifford, Ernest Fraenkel, and Richard A. Young. Transcriptional regulatory code of a eukaryotic genome. Nature, 431(7004):99–104, sep 2004.

[114] Marcela Preininger, Dalia Arafat, Jinhee Kim, Artika P Nath, Youssef Idaghdour, Kenneth L Brigham, and Greg Gibson. Blood-informative transcripts define nine common axes of periph- eral blood gene expression. PLoS genetics, 9(3):e1003362, jan 2013.

[115] Artika Praveeta Nath, Dalia Arafat, and Greg Gibson. Using blood informative transcripts in geographical genomics: impact of lifestyle on gene expression in fijians. Frontiers in genetics, 3:243, jan 2012.

[116] Alfonso Buil, Andrew Anand Brown, Tuuli Lappalainen, Ana Vinuela,˜ Matthew N Davies, Hou-Feng Zheng, J Brent Richards, Daniel Glass, Kerrin S Small, Richard Durbin, Timo- thy D Spector, and Emmanouil T Dermitzakis. Gene-gene and gene-environment interactions detected by transcriptome sequence analysis in twins. Nature genetics, 47(1):88–91, 2015.

[117] John D Storey, Jennifer Madeoy, Jeanna L Strout, Mark Wurfel, James Ronald, and Joshua M Akey. Gene-expression variation within and among human populations. American journal of human genetics, 80(3):502–9, mar 2007.

[118] Danitsja M van Leeuwen, Ebienus van Agen, Ralph W H Gottschalk, Robert Vlietinck, Marij Gielen, Marcel H M van Herwijnen, Lou M Maas, and Jos Kleinjans. Cigarette smoke-induced differential gene expression in blood cells from monozygotic twin pairs. Carcinogenesis, 28(3):691–697, 2006.

[119] Edwin Choy, Roman Yelensky, Sasha Bonakdar, Robert M Plenge, Richa Saxena, Philip L De Jager, Stanley Y Shaw, Cara S Wolfish, Jacqueline M Slavik, Chris Cotsapas, Manuel Rivas, Emmanouil T Dermitzakis, Ellen Cahir-McFarland, Elliott Kieff, David Hafler, Mark J

174 Daly, and David Altshuler. Genetic analysis of human traits in vitro: drug response and gene expression in lymphoblastoid cell lines. PLoS genetics, 4(11):e1000287, nov 2008.

[120] Sailendra Nath Sarma, Youn-Jung Kim, and Jae-Chun Ryu. Differential gene expression pro- files of human leukemia cell lines exposed to benzene and its metabolites. Environmental Toxicology and Pharmacology, 32(2):285–295, 2011.

[121] Michael J Lee, Albert S Ye, Alexandra K Gardino, Anne Margriet Heijink, Peter K Sorger, Gavin MacBeath, and Michael B Yaffe. Sequential application of anticancer drugs enhances cell death by rewiring apoptotic signaling networks. Cell, 149(4):780–94, may 2012.

[122] Ary A Hoffmann and Yvonne Willi. Detecting genetic responses to environmental change. Nature reviews. Genetics, 9(6):421–32, jun 2008.

[123] Fabio Morgante, Peter Sørensen, Daniel A Sorensen, Christian Maltecca, and Trudy F C Mackay. Genetic Architecture of Micro-Environmental Plasticity in Drosophila melanogaster. Scientific reports, 5:9785, jan 2015.

[124] Juliet Ansel, Hel´ ene` Bottin, Camilo Rodriguez-Beltran, Christelle Damon, Muniyandi Nagara- jan, Steffen Fehrmann, Jean Franc¸ois, and Gael¨ Yvert. Cell-to-cell stochastic variation in gene expression is a complex genetic trait. PLoS Genetics, 4(4):e1000049, apr 2008.

[125] Han A Mulder, Lars Ronneg¨ ard,˚ W Freddy Fikse, Roel F Veerkamp, and Erling Strandberg. Estimation of genetic variance for macro- and micro-environmental sensitivity using double hierarchical generalized linear models. Genetics, selection, evolution : GSE, 45(1):23, jan 2013.

[126] Ivan P. Gorlov, Olga Y. Gorlova, and Christopher I. Amos. Allelic Spectra of Risk SNPs Are Different for Environment/Lifestyle Dependent versus Independent Diseases. PLOS Genetics, 11(7):e1005371, jul 2015.

[127] G. de Jong and P. Bijma. Selection and phenotypic plasticity in evolutionary biology and animal breeding. Livestock Production Science, 78(3):195–214, dec 2002.

[128] Adrian Bird. Perceptions of . Nature, 447(7143):396–398, may 2007.

[129] Marco Trerotola, Valeria Relli, Pasquale Simeone, and Saverio Alberti. Epigenetic inheritance and the missing heritability. Human genomics, 9(1):17, jul 2015.

175 [130] David Ratel, Jean-Luc Ravanat, Franc¸ois Berger, and Didier Wion. N6-methyladenine: the other methylated base of DNA. BioEssays : news and reviews in molecular, cellular and developmental biology, 28(3):309–15, mar 2006.

[131] Tao P. Wu, Tao Wang, Matthew G. Seetin, Yongquan Lai, Shijia Zhu, Kaixuan Lin, Yifei Liu, Stephanie D. Byrum, Samuel G. Mackintosh, Mei Zhong, Alan Tackett, Guilin Wang, Lawrence S. Hon, Gang Fang, James A. Swenberg, and Andrew Z. Xiao. DNA methylation on N6-adenine in mammalian embryonic stem cells. Nature, 532(7599):329–333, mar 2016.

[132] Adrian P. Bird. CpG-rich islands and the function of DNA methylation. Nature, 321(6067):209–213, may 1986.

[133] S. Saxonov, P. Berg, and D. L. Brutlag. A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proceedings of the National Academy of Sciences, 103(5):1412–1417, jan 2006.

[134] Mehrnaz Fatemi, Martha M Pao, Shinwu Jeong, Einav Nili Gal-Yam, Gerda Egger, Daniel J Weisenberger, and Peter A Jones. Footprinting of mammalian promoters: use of a CpG DNA methyltransferase revealing nucleosome positions at a single molecule level. Nucleic acids research, 33(20):e176, nov 2005.

[135] Assaf Zemach, Ivy E. McDaniel, Pedro Silva, and Daniel Zilberman. Genome-Wide Evolu- tionary Analysis of Eukaryotic DNA Methylation. Science, 328(5980), 2010.

[136] A. Bird. DNA methylation patterns and epigenetic memory. Genes & Development, 16(1):6– 21, jan 2002.

[137] Kamel Jabbari and Giorgio Bernardi. Cytosine methylation and CpG, TpG (CpA) and TpA frequencies. Gene, 333:143–149, may 2004.

[138] Konstantin Shakhbazov, Joseph E. Powell, Gibran Hemani, Anjali K. Henders, Nicholas G. Martin, Peter M. Visscher, Grant W. Montgomery, Allan F. McRae, FW Albert, K Leonid, AF McRae, JE Powell, AK Henders, B Lisa, H Gibran, S Sonia, JT Bell, AA Pai, JK Pick- rell, DJ Gaffney, P-R Roger, JF Degner, EL Moen, Z Xu, M Wenbo, SM Delaney, W Claudia, MQ Jennifer, G-A Maria, L Tuuli, SB Montgomery, B Alfonso, O Halit, Y Alisa, J Raphael Gibbs, MP Brug, DG Hernandez, BJ Traynor, MA Nalls, L Shiao-Lin, Z Dandan, C Lijun, JA Badner, C Chao, C Qi, L Wei, L Mathieu, SHE Zaidi, B Maria, G Bing, A Dylan, G Ma-

176 rine, R Holliday, JE Pugh, AD Riggs, KR Eijk, S Jong, MPM Boks, L Terry, C Fabrice, JH Veldink, JE Powell, AK Henders, AF McRae, MJ Wright, NG Martin, ET Dermitzakis, JE Powell, AK Henders, AF McRae, K Jinhee, H Gibran, NG Martin, JE Powell, AK Henders, AF McRae, C Anthony, S Sara, MJ Wright, ME Price, AM Cotton, LL Lam, F Pau, E Eldon, CJ Brown, RA Fisher, RA Fisher, LE Reinius, A Nathalie, J Maaike, P Goran,¨ D Sven-Erik, G Dario, A Neil, J Mabbott, B Kenneth, B Helen, TC Freeman, DA Hume, H Eugene Andres, WP Accomando, DC Koestler, BC Christensen, CJ Marsit, HH Nelson, S Jian Yang, L Hong, ME Goddard, PM Visscher, SH Lee, J Yang, ME Goddard, PM Visscher, NR Wray, AE Jaffe, RA Irizarry, AL Williams, P Nick, G Joseph, H Hakon, R David, GR Abecasis, A Adam, LD Brooks, MA DePristo, RM Durbin, H Bryan, M Jonathan, S Matthew, H Bryan, F Chris- tian, S Matthew, M Jonathan, GR Abecasis, SE Medland, DR Nyholt, JN Painter, BP McEvoy, AF McRae, Z Gu, GR Abecasis, LR Cardon, WO Cookson, and R Gonc¸alo. Shared genetic control of expression and methylation in peripheral blood. BMC Genomics, 17(1):278, dec 2016.

[139] Matthew N Davies, Manuela Volta, Ruth Pidsley, Katie Lunnon, Abhishek Dixit, Simon Love- stone, Cristian Coarfa, R Alan Harris, Aleksandar Milosavljevic, Claire Troakes, Safa Al- Sarraj, Richard Dobson, Leonard C Schalkwyk, and Jonathan Mill. Functional annotation of the human brain methylome identifies tissue-specific epigenetic variation across brain and blood. Genome Biology, 13(6):R43, jun 2012.

[140] Allan F McRae, Joseph E Powell, Anjali K Henders, Lisa Bowdler, Gibran Hemani, Sonia Shah, Jodie N Painter, Nicholas G Martin, Peter M Visscher, and Grant W Montgomery. Con- tribution of genetic variation to transgenerational inheritance of DNA methylation. Genome biology, 15(5):R73, jan 2014.

[141] Sonia Shah, Allan F McRae, Riccardo E Marioni, Sarah E Harris, Jude Gibson, Anjali K Hen- ders, Paul Redmond, Simon R Cox, Alison Pattie, Janie Corley, Lee Murphy, Nicholas G Mar- tin, Grant W Montgomery, John M Starr, Naomi R Wray, Ian J Deary, and Peter M Visscher. Genetic and environmental exposures constrain epigenetic drift over the human life course. Genome research, 24(11):1725–33, dec 2014.

[142] K M Godfrey, P M Costello, and K A Lillycrop. The developmental environment, epigenetic and long-term health. Journal of developmental origins of health and disease, 6(5):399–406, oct 2015.

177 [143] Eric J Richards. Inherited epigenetic variation–revisiting soft inheritance. Nature reviews. Genetics, 7(5):395–401, may 2006.

[144] Andrea Baccarelli, Robert O. Wright, Valentina Bollati, Letizia Tarantini, Augusto A. Litonjua, Helen H. Suh, Antonella Zanobetti, David Sparrow, Pantel S. Vokonas, and Joel Schwartz. Rapid DNA Methylation Changes after Exposure to Traffic Particles. American Journal of Respiratory and Critical Care Medicine, 179(7):572–578, apr 2009.

[145] Paul A Wade and Trevor K Archer. Epigenetics: environmental instructions for the genome. perspectives, 114(3):A140–1, mar 2006.

[146] Mario F Fraga, Esteban Ballestar, Maria F Paz, Santiago Ropero, Fernando Setien, Maria L Ballestar, Damia Heine-Suner,˜ Juan C Cigudosa, Miguel Urioste, Javier Benitez, Manuel Boix- Chornet, Abel Sanchez-Aguilera, Charlotte Ling, Emma Carlsson, Pernille Poulsen, Allan Vaag, Zarko Stephan, Tim D Spector, Yue-Zhong Wu, Christoph Plass, and Manel Esteller. Epigenetic differences arise during the lifetime of monozygotic twins. Proceedings of the National Academy of Sciences of the United States of America, 102(30):10604–9, jul 2005.

[147] G K Smyth and T Speed. Normalization of cDNA microarray data. Methods, 31(4):265–273, 2003.

[148] Hyuna Yang, Christina A Harrington, Kristina Vartanian, Christopher D Coldren, Rob Hall, and Gary A Churchill. Randomization in laboratory procedure is key to obtaining reproducible microarray results. PLoS ONE, 3(11):e3724, 2008.

[149] Stefano Calza and Yudi Pawitan. Normalization of gene-expression microarray data. Methods in molecular biology (Clifton, N.J.), 673:37–52, jan 2010.

[150] Leming Shi, Laura H Reid, Wendell D Jones, Richard Shippy, Janet A Warrington, Shawn C Baker, Patrick J Collins, Francoise de Longueville, Ernest S Kawasaki, and Kathleen Y Lee. The MicroArray Quality Control (MAQC) project shows inter-and intraplatform reproducibil- ity of gene expression measurements. Nature biotechnology, 24(9):1151–1161, 2006.

[151] Penelope A Bryant, Gordon K Smyth, Roy Robins-Browne, and Nigel Curtis. Technical vari- ability is greater than biological variability in a microarray experiment but both are outweighed by changes induced by stimulation. PLoS ONE, 6(5):e19556, 2011.

[152] Anita Goldinger, Anjali K Henders, Allan F McRae, Nicholas G Martin, Greg Gibson, Grant W

178 Montgomery, Peter M Visscher, and Joseph E Powell. Genetic and nongenetic variation re- vealed for the principal components of human gene expression. Genetics, 195(3):1117–28, nov 2013.

[153] R A Irizarry, D Warren, F Spencer, I F Kim, S Biswal, B C Frank, E Gabrielson, J G N Gar- cia, J Geoghegan, and G Germino. Multiple-laboratory comparison of microarray platforms. Nature methods, 2(5):345–350, 2005.

[154] A Scherer. Batch effects and noise in microarray experiments: sources and solutions, volume 868. Wiley, 2009.

[155] L Thomas, E M Coffey, H Dai, Y D He, D A Kessler, K A Kilian, J E Koch, E LeProust, M J Marton, and M R Meyer. Effects of atmospheric ozone on microarray data quality. Analytical chemistry, 75(17):4672–4675, 2003.

[156] Benjamin M Bolstad, Rafael A Irizarry, Magnus Astrand,˚ and Terence P Speed. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2):185–193, 2003.

[157] Shaopu Peter Qin, Jinhee Kim, Dalia Arafat, and Greg Gibson. Effect of normalization on statistical and biological interpretation of gene expression profiles. Frontiers in Genetics, 3:160, 2012.

[158] J T Leek, R B Scharpf, H C Bravo, D Simcha, B Langmead, W E Johnson, D Geman, K Bag- gerly, and R A Irizarry. Tackling the widespread and critical impact of batch effects in high- throughput data. Nature Reviews Genetics, 11(10):733–739, 2010.

[159] J Luo, M Schumacher, A Scherer, D Sanoudou, D Megherbi, T Davison, T Shi, W Tong, L Shi, and H Hong. A comparison of batch effect removal methods for enhancement of prediction per- formance using MAQC-II microarray gene expression data. The pharmacogenomics journal, 10(4):278–291, 2010.

[160] Chao Chen, Kay Grennan, Judith Badner, Dandan Zhang, Elliot Gershon, Li Jin, and Chunyu Liu. Removing Batch Effects in Analysis of Expression Microarray Data: An Evaluation of Six Batch Adjustment Methods. PLoS ONE, 6(2):e17238, 2011.

[161] K A Baggerly, K R Coombes, and E S Neeley. Run batch effects potentially compromise the usefulness of genomic signatures for ovarian cancer. Journal of Clinical Oncology, 26(7):1186–

179 1187, 2008.

[162] R S Spielman and V G Cheung. Reply to On the design and analysis of gene expression studies in human populations. Nature genetics, 39(7):808–809, 2007.

[163] J Fu, M G M Wolfs, P Deelen, H J Westra, R S N Fehrmann, G J te Meerman, W A Buurman, S S M Rensen, H J M Groen, and R K Weersma. Unraveling the regulatory mechanisms under- lying tissue-dependent genetic variation of gene expression. PLoS Genetics, 8(1):e1002431, 2012.

[164] J K Pickrell, J C Marioni, A A Pai, J F Degner, B E Engelhardt, E Nkadori, J B Veyrieras, M Stephens, Y Gilad, and J K Pritchard. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature, 464(7289):768–772, 2010.

[165] O Stegle, A Kannan, R Durbin, and J Winn. Accounting for non-genetic factors improves the power of eQTL studies. In Research in Computational Molecular Biology, pages 411–422. Springer, 2008.

[166] O Alter, P O Brown, and D Botstein. Singular value decomposition for genome-wide ex- pression data processing and modeling. Proceedings of the National Academy of Sciences, 97(18):10101–10106, 2000.

[167] A L Price, N J Patterson, R M Plenge, M E Weinblatt, N A Shadick, and D Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature genetics, 38(8):904–909, 2006.

[168] N Patterson, A L Price, and D Reich. Population structure and eigenanalysis. PLoS Genetics, 2(12):e190, 2006.

[169] B E Stranger, S B Montgomery, A S Dimas, L Parts, O Stegle, C E Ingle, M Sekowska, G D Smith, D Evans, and M Gutierrez-Arcelus. Patterns of cis regulatory variation in diverse human populations. PLoS Genetics, 8(4):e1002639, 2012.

[170] J T Leek and J D Storey. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics, 3(9):e161, 2007.

[171] C D Brown, L M Mangravite, and B E Engelhardt. Integrative modeling of eQTLs and cis- regulatory elements suggest mechanisms underlying cell type specificity of eQTLs. arXiv

180 preprint arXiv:1210.3294, 2012.

[172] C Li and A Rabinovic. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8(1):118–127, 2007.

[173] O Stegle, L Parts, R Durbin, and J Winn. A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies. PLoS computational biology, 6(5):e1000770, 2010.

[174] S Ma and M R Kosorok. Identification of differential gene pathways with principal component analysis. Bioinformatics, 25(7):882–889, 2009.

[175] F Emmert-Streib and G V Glazko. Pathway analysis of expression data: deciphering functional building blocks of complex diseases. PLoS computational biology, 7(5):e1002053, 2011.

[176] J Misra, W Schmitt, D Hwang, L L Hsiao, S Gullans, and G Stephanopoulos. Interactive exploration of microarray gene expression patterns in a reduced dimensional space. Genome Res, 12(7):1112–1120, 2002.

[177] Vivian G Cheung, Richard S Spielman, Kathryn G Ewens, Teresa M Weber, Michael Morley, and Joshua T Burdick. Mapping determinants of human gene expression by regional and genome-wide association. Nature, 437(7063):1365–1369, 2005.

[178] Vincent Plagnol, Elif Uz, Chris Wallace, Helen Stevens, David Clayton, Tayfun Ozcelik, and John A Todd. Extreme clonality in lymphoblastoid cell lines with implications for allele spe- cific expression analyses. PLoS ONE, 3(8):e2966, 2008.

[179] Solveig K Sieberts and Eric E Schadt. Moving toward a system genetics view of disease. Mammalian Genome, 18(6):389–401, 2007.

[180] Jun Ding, Johann E Gudjonsson, Liming Liang, Philip E Stuart, Yun Li, Wei Chen, Michael Weichenthal, Eva Ellinghaus, Andre Franke, and William Cookson. Gene Expression in Skin and Lymphoblastoid Cells: Refined Statistical Method Reveals Extensive Overlap in cis-eQTL Signals. The American Journal of Human Genetics, 87(6):779–789, 2010.

[181] Joseph E Powell, Anjali K Henders, Allan F McRae, Margaret J Wright, Nicholas G Martin, Emmanouil T Dermitzakis, Grant W Montgomery, and Peter M Visscher. Genetic control of gene expression in whole blood and lymphoblastoid cell lines is largely independent. Genome

181 Res, 22(3):456–466, mar 2012.

[182] Alexandra C Nica, Leopold Parts, Daniel Glass, James Nisbet, Amy Barrett, Magdalena Sekowska, Mary Travers, Simon Potter, Elin Grundberg, Kerrin Small, Asa K Hedman, Veronique Bataille, Jordana Tzenova Bell, Gabriela Surdulescu, Antigone S Dimas, Catherine Ingle, Frank O Nestle, Paola di Meglio, Josine L Min, Alicja Wilk, Christopher J Hammond, Neelam Hassanali, Tsun-Po Yang, Stephen B Montgomery, Steve O’Rahilly, Cecilia M Lind- gren, Krina T Zondervan, Nicole Soranzo, Inesˆ Barroso, Richard Durbin, Kourosh Ahmadi, Panos Deloukas, Mark I McCarthy, Emmanouil T Dermitzakis, and Timothy D Spector. The architecture of gene regulatory variation across multiple human tissues: the MuTHER study. PLoS genetics, 7(2):e1002003, jan 2011.

[183] PM Reddy, G Stamatoyannopoulos, T Papayannopoulou, and CK Shen. Genomic footprinting and sequencing of human beta-globin locus. Tissue specificity and cell line artifact. J. Biol. Chem., 269(11):8287–8295, mar 1994.

[184] Antigone S Dimas, Samuel Deutsch, Barbara E Stranger, Stephen B Montgomery, Christelle Borel, Homa Attar-Cohen, Catherine Ingle, Claude Beazley, Maria Gutierrez Arcelus, and Magdalena Sekowska. miCommon regulatory variation impacts gene expression in a cell type- dependent manner. science, 325(5945):1246–1250, 2009.

[185] Ewan Birney, John A Stamatoyannopoulos, Anindya Dutta, Roderic Guigo,´ Thomas R Gin- geras, Elliott H Margulies, Zhiping Weng, Michael Snyder, Emmanouil T Dermitzakis, and Robert E Thurman. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447(7146):799–816, 2007.

[186] Hua Zhong, John Beaulaurier, Pek Yee Lum, Cliona Molony, Xia Yang, Douglas J MacNeil, Drew T Weingarth, Bin Zhang, Danielle Greenawalt, and Radu Dobrin. Liver and adipose expression associated SNPs are enriched for association to type 2 diabetes. PLoS Genetics, 6(5):e1000932, 2010.

[187] Dena G Hernandez, Mike A Nalls, Matthew Moore, Sean Chong, Allissa Dillman, Daniah Trabzuni, J Raphael Gibbs, Mina Ryten, Sampath Arepalli, and Michael E Weale. Integration of GWAS SNPs and tissue specific expression profiling reveal discrete eQTLs for human traits in blood and brain. Neurobiology of disease, 2012.

[188] Daniel J Gaffney. Global properties and functional complexity of human gene regulatory vari-

182 ation. PLoS genetics, 9(5):e1003501, may 2013.

[189] Christine Vogel and Edward M. Marcotte. Insights into the regulation of protein abundance from proteomic and transcriptomic analyses. Nature Reviews Genetics, 13(4):227–32, mar 2012.

[190] Mathias Wilhelm, Judith Schlegl, Hannes Hahne, Amin Moghaddas Gholami, Marcus Lieberenz, Mikhail M. Savitski, Emanuel Ziegler, Lars Butzmann, Siegfried Gessulat, Harald Marx, Toby Mathieson, Simone Lemeer, Karsten Schnatbaum, Ulf Reimer, Holger Wenschuh, Martin Mollenhauer, Julia Slotta-Huspenina, Joos-Hendrik Boese, Marcus Bantscheff, Anja Gerstmair, Franz Faerber, and Bernhard Kuster. Mass-spectrometry-based draft of the human proteome. Nature, 509(7502):582–587, may 2014.

[191] Juan A Vizca´ıno, Eric W Deutsch, Rui Wang, Attila Csordas, Florian Reisinger, Daniel R´ıos, Jose´ A Dianes, Zhi Sun, Terry Farrah, Nuno Bandeira, Pierre-Alain Binz, Ioannis Xe- narios, Martin Eisenacher, Gerhard Mayer, Laurent Gatto, Alex Campos, Robert J Chalk- ley, Hans-Joachim Kraus, Juan Pablo Albar, Salvador Martinez-Bartolome,´ Rolf Apweiler, Gilbert S Omenn, Lennart Martens, Andrew R Jones, and Henning Hermjakob. ProteomeX- change provides globally coordinated proteomics data submission and dissemination. Nature Biotechnology, 32(3):223–226, mar 2014.

[192] Mathias Uhlen, Per Oksvold, Linn Fagerberg, Emma Lundberg, Kalle Jonasson, Mattias Fors- berg, Martin Zwahlen, Caroline Kampf, Kenneth Wester, Sophia Hober, Henrik Wernerus, Lisa Bjorling,¨ and Fredrik Ponten. Towards a knowledge-based Human Protein Atlas. Nature Biotechnology, 28(12):1248–1250, dec 2010.

[193] Min-Sik Kim, Sneha M. Pinto, Derese Getnet, Raja Sekhar Nirujogi, Srikanth S. Manda, Raghothama Chaerkady, Anil K. Madugundu, Dhanashree S. Kelkar, Ruth Isserlin, Shobhit Jain, Joji K. Thomas, Babylakshmi Muthusamy, Pamela Leal-Rojas, Praveen Kumar, Nan- dini A. Sahasrabuddhe, Lavanya Balakrishnan, Jayshree Advani, Bijesh George, Santosh Renuse, Lakshmi Dhevi N. Selvan, Arun H. Patil, Vishalakshi Nanjappa, Aneesha Radhakrish- nan, Samarjeet Prasad, Tejaswini Subbannayya, Rajesh Raju, Manish Kumar, Sreelakshmi K. Sreenivasamurthy, Arivusudar Marimuthu, Gajanan J. Sathe, Sandip Chavan, Keshava K. Datta, Yashwanth Subbannayya, Apeksha Sahu, Soujanya D. Yelamanchi, Savita Jayaram, Pavithra Rajagopalan, Jyoti Sharma, Krishna R. Murthy, Nazia Syed, Renu Goel, Aafaque A. Khan, Sartaj Ahmad, Gourav Dey, Keshav Mudgal, Aditi Chatterjee, Tai-Chung Huang, Jun

183 Zhong, Xinyan Wu, Patrick G. Shaw, Donald Freed, Muhammad S. Zahari, Kanchan K. Mukherjee, Subramanian Shankar, Anita Mahadevan, Henry Lam, Christopher J. Mitchell, Susarla Krishna Shankar, Parthasarathy Satishchandra, John T. Schroeder, Ravi Sirdeshmukh, Anirban Maitra, Steven D. Leach, Charles G. Drake, Marc K. Halushka, T. S. Keshava Prasad, Ralph H. Hruban, Candace L. Kerr, Gary D. Bader, Christine A. Iacobuzio-Donahue, Harsha Gowda, and Akhilesh Pandey. A draft map of the human proteome. Nature, 509(7502):575– 581, may 2014.

[194] Terry Farrah, Eric W. Deutsch, Gilbert S. Omenn, Zhi Sun, Julian D. Watts, Tadashi Yamamoto, David Shteynberg, Micheleen M. Harris, and Robert L. Moritz. State of the Human Proteome in 2013 as Viewed through PeptideAtlas: Comparing the Kidney, Urine, and Plasma Proteomes for the Biology- and Disease-Driven Human Proteome Project. Journal of Proteome Research, 13(1):60–75, jan 2014.

[195] Markus Schirle, Marie-Anne Heurtier, and Bernhard Kuster. Profiling core proteomes of hu- man cell lines by one-dimensional PAGE and liquid chromatography-tandem mass spectrome- try. Molecular & cellular proteomics : MCP, 2(12):1297–305, dec 2003.

[196] M. Uhlen, Linn Fagerberg, B. M. Hallstrom, Cecilia Lindskog, Per Oksvold, Adil Mardinoglu, A. Sivertsson, Caroline Kampf, E. Sjostedt, Anna Asplund, Ingmarie Olsson, Karolina Edlund, Emma Lundberg, Sanjay Navani, Cristina A.-K. Szigyarto, Jacob Odeberg, Dijana Djureinovic, Jenny Ottosson Takanen, Sophia Hober, Tove Alm, P.-H. Edqvist, Holger Berling, Hanna Tegel, Jan Mulder, Johan Rockberg, Peter Nilsson, Jochen M Schwenk, Marica Hamsten, K. von Feilitzen, Mattias Forsberg, Lukas Persson, Fredric Johansson, Martin Zwahlen, G. von Heijne, Jens Nielsen, and F. Ponten. Tissue-based map of the human proteome. Science, 347(6220):1260419–1260419, 2015.

[197] Sarah Djebali, Carrie A. Davis, Angelika Merkel, Alex Dobin, Timo Lassmann, Ali Mor- tazavi, Andrea Tanzer, Julien Lagarde, Wei Lin, Felix Schlesinger, Chenghai Xue, Georgi K. Marinov, Jainab Khatun, Brian A. Williams, Chris Zaleski, Joel Rozowsky, Maik Roder,¨ Felix Kokocinski, Rehab F. Abdelhamid, Tyler Alioto, Igor Antoshechkin, Michael T. Baer, Na- dav S. Bar, Philippe Batut, Kimberly Bell, Ian Bell, Sudipto Chakrabortty, Xian Chen, Jacque- line Chrast, Joao Curado, Thomas Derrien, Jorg Drenkow, Erica Dumais, Jacqueline Dumais, Radha Duttagupta, Emilie Falconnet, Meagan Fastuca, Kata Fejes-Toth, Pedro Ferreira, Syl- vain Foissac, Melissa J. Fullwood, Hui Gao, David Gonzalez, Assaf Gordon, Harsha Gu-

184 nawardena, Cedric Howald, Sonali Jha, Rory Johnson, Philipp Kapranov, Brandon King, Colin Kingswood, Oscar J. Luo, Eddie Park, Kimberly Persaud, Jonathan B. Preall, Paolo Ribeca, Brian Risk, Daniel Robyr, Michael Sammeth, Lorian Schaffer, Lei-Hoon See, Atif Shahab, Jor- gen Skancke, Ana Maria Suzuki, Hazuki Takahashi, Hagen Tilgner, Diane Trout, Nathalie Wal- ters, Huaien Wang, John Wrobel, Yanbao Yu, Xiaoan Ruan, Yoshihide Hayashizaki, Jennifer Harrow, Mark Gerstein, Tim Hubbard, Alexandre Reymond, Stylianos E. Antonarakis, Gre- gory Hannon, Morgan C. Giddings, Yijun Ruan, Barbara Wold, Piero Carninci, Roderic Guigo,´ and Thomas R. Gingeras. Landscape of transcription in human cells. Nature, 489(7414):101– 108, sep 2012.

[198] Paola Picotti, Mathieu Clement-Ziza,´ Henry Lam, David S. Campbell, Alexander Schmidt, Eric W. Deutsch, Hannes Rost,¨ Zhi Sun, Oliver Rinner, Lukas Reiter, Qin Shen, Jacob J. Michaelson, Andreas Frei, Simon Alberti, Ulrike Kusebauch, Bernd Wollscheid, Robert L. Moritz, Andreas Beyer, and Ruedi Aebersold. A complete mass-spectrometric map of the yeast proteome applied to quantitative trait analysis. Nature, 494(7436):266–270, jan 2013.

[199] TeckYew Low, Sebastiaan vanHeesch, Henk vandenToorn, Piero Giansanti, Alba Cristobal, Pim Toonen, Sebastian Schafer, Norbert Hubner,¨ Bas vanBreukelen, Shabaz Mohammed, Ed- win Cuppen, AlbertJ.R. Heck, and Victor Guryev. Quantitative and Qualitative Proteome Characteristics Extracted from In-Depth Integrated Genomics and Proteomics Analysis. Cell Reports, 5(5):1469–1478, dec 2013.

[200] Bjorn¨ Schwanhausser,¨ Dorothea Busse, Na Li, Gunnar Dittmar, Johannes Schuchhardt, Jana Wolf, Wei Chen, and Matthias Selbach. Global quantification of mammalian gene expression control. Nature, 473(7347):337–342, may 2011.

[201] Anatole Ghazalpour, Brian Bennett, Vladislav A Petyuk, Luz Orozco, Raffi Hagopian, Imran N Mungrue, Charles R Farber, Janet Sinsheimer, Hyun M Kang, Nicholas Furlotte, Christopher C Park, Ping-Zi Wen, Heather Brewer, Karl Weitz, David G Camp, Calvin Pan, Roumyana Yor- danova, Isaac Neuhaus, Charles Tilford, Nathan Siemers, Peter Gargalovic, Eleazar Eskin, Todd Kirchgessner, Desmond J Smith, Richard D Smith, and Aldons J Lusis. Comparative analysis of proteome and transcriptome variation in mouse. PLoS genetics, 7(6):e1001393, jun 2011.

[202] David Melzer, John R B Perry, Dena Hernandez, Anna-Maria Corsi, Kara Stevens, Ian Raf- ferty, Fulvio Lauretani, Anna Murray, J Raphael Gibbs, Giuseppe Paolisso, Sajjad Rafiq,

185 Javier Simon-Sanchez, Hana Lango, Sonja Scholz, Michael N Weedon, Sampath Arepalli, Neil Rice, Nicole Washecka, Alison Hurst, Angela Britton, William Henley, Joyce van de Leemput, Rongling Li, Anne B Newman, Greg Tranah, Tamara Harris, Vijay Panicker, Colin Dayan, Amanda Bennett, Mark I McCarthy, Aimo Ruokonen, Marjo-Riitta Jarvelin, Jack Guralnik, Stefania Bandinelli, Timothy M Frayling, Andrew Singleton, and Luigi Ferrucci. A genome-wide association study identifies protein quantitative trait loci (pQTLs). PLoS genetics, 4(5):e1000072, may 2008.

[203] Eric J. Foss, Dragan Radulovic, Scott A. Shaffer, David R. Goodlett, Leonid Kruglyak, Anto- nio Bedalov, Y Chen, J Zhu, P. Y Lum, X Yang, S Pinto, V Emilsson, G Thorleifsson, B Zhang, A. S Leonardson, F Zink, A de la Fuente, T Ravasi, H Suzuki, C. V Cannistraci, S Katayama, V. B Bajic, E. E Schadt, E. J Chesler, L Lu, S Shou, Y Qu, J Gu, M. J Holland, P Lu, C Vo- gel, R Wang, X Yao, E. M Marcotte, S. P Gygi, Y Rochon, B. R Franza, R Aebersold, S. P Schrimpf, M Weiss, L Reiter, C. H Ahrens, M Jovanovic, M. T Marr 2nd, J. A D’Alessio, O Puig, R Tjian, J. R Rohde, R Bastidas, R Puria, M. E Cardenas, P Kapahi, D Chen, A. N Rogers, S. D Katewa, P. W Li, E. J Foss, D Radulovic, S. A Shaffer, D. M Ruderfer, A Bedalov, D Radulovic, S Jelveh, S Ryu, T. G Hamilton, E Foss, M. J Appel, R Labarre, D Radulovic, G Yvert, R. B Brem, J Whittle, J. M Akey, E Foss, D. M Ruderfer, S. C Pratt, H. S Seidel, L Kruglyak, R. B Brem, L Kruglyak, R. B Brem, J. D Storey, J Whittle, L Kruglyak, R. B Brem, G Yvert, R Clinton, L Kruglyak, M Weiss, S Schrimpf, M. O Hengartner, M. J Lercher, C von Mering, G Palla, I Derenyi, I Farkas, T Vicsek, M Lynch, B Walsh, Y Wang, C. L Liu, J. D Storey, R. J Tibshirani, D Herschlag, A Belle, A Tanay, L Bitincka, R Shamir, E. K O’Shea, J. R Warner, N. S Cutler, J Heitman, M. E Cardenas, J Rohde, J Heitman, M. E Car- denas, E. E Schadt, J Lamb, X Yang, J Zhu, S Edwards, M. J Hickman, F Winston, J Ronald, R. B Brem, J Whittle, L Kruglyak, S Doss, E. E Schadt, T. A Drake, A. J Lusis, M. V Rock- man, L Kruglyak, P Jorgensen, I Rupes, J. R Sharom, L Schneper, J. R Broach, D. E Martin, A Soulard, M. N Hall, A. G Hinnebusch, E. X Kwan, E. J Foss, L Kruglyak, A Bedalov, T. L Hamilton, M Stoneley, K. A Spriggs, M Bushell, A Panchaud, S Jung, S. A Shaffer, J. D Aitchison, D. R Goodlett, A Panchaud, A Scherl, S. A Shaffer, P. D von Haller, H. D Ku- lasekara, P Picotti, B Bodenmiller, L. N Mueller, B Domon, R Aebersold, J Zhu, B Zhang, E. N Smith, B Drees, and R. B Brem. Genetic Variation Shapes Protein Networks Mainly through Non-transcriptional Mechanisms. PLoS Biology, 9(9):e1001144, sep 2011.

[204] A. Battle, Z. Khan, S. H. Wang, A. Mitrano, M. J. Ford, J. K. Pritchard, and Y. Gilad. Impact

186 of regulatory variation from RNA to protein. Science, 347(6222):science.1260793–, dec 2014.

[205] Daniel R Schoenberg and Lynne E Maquat. Regulation of cytoplasmic mRNA decay. Nature reviews. Genetics, 13(4):246–59, apr 2012.

[206] Athma A Pai, Carolyn E Cain, Orna Mizrahi-Man, Sherryl De Leon, Noah Lewellen, Jean- Baptiste Veyrieras, Jacob F Degner, Daniel J Gaffney, Joseph K Pickrell, Matthew Stephens, Jonathan K Pritchard, and Yoav Gilad. The contribution of RNA decay quantitative trait loci to inter-individual variation in steady-state gene expression levels. PLoS genetics, 8(10):e1003000, jan 2012.

[207] Jingyuan Fu, Joost J B Keurentjes, Harro Bouwmeester, Twan America, Francel W A Verstap- pen, Jane L Ward, Michael H Beale, Ric C H de Vos, Martijn Dijkstra, Richard A Scheltema, Frank Johannes, Maarten Koornneef, Dick Vreugdenhil, Rainer Breitling, and Ritsert C Jansen. System-wide molecular evidence for phenotypic buffering in Arabidopsis. Nature genetics, 41(2):166–7, feb 2009.

[208] Zia Khan, Michael J. Ford, Darren A. Cusanovich, Amy Mitrano, Jonathan K. Pritchard, and Yoav Gilad. Primate Transcript and Protein Expression Levels Evolve Under Compensatory Selection Pressures. Science, 342(6162), 2013.

[209] Gary J. Patti, Oscar Yanes, and Gary Siuzdak. Innovation: Metabolomics: the apogee of the omics trilogy. Nature Reviews Molecular Cell Biology, 13(4):263–269, mar 2012.

[210] MichelleF. Clasquin, Eugene Melamud, Alexander Singer, JessicaR. Gooding, Xiaohui Xu, Aiping Dong, Hong Cui, ShawnR. Campagna, Alexei Savchenko, AlexanderF. Yakunin, JoshuaD. Rabinowitz, and AmyA. Caudy. Riboneogenesis in Yeast. Cell, 145(6):969–980, jun 2011.

[211] Wenyun Lu, Michelle F. Clasquin, Eugene Melamud, Daniel Amador-Noguez, Amy A. Caudy, and Joshua D. Rabinowitz. Metabolomic Analysis via Reversed-Phase Ion-Pairing Liquid Chromatography Coupled to a Stand Alone Orbitrap Mass Spectrometer. Analytical Chemistry, 82(8):3212–3221, apr 2010.

[212] Luiz Pedro S de Carvalho, Hong Zhao, Caitlyn E Dickinson, Nancy M Arango, Christopher D Lima, Steven M Fischer, Ouathek Ouerfelli, Carl Nathan, and Kyu Y Rhee. Activity-based metabolomic profiling of enzymatic function: identification of Rv1248c as a mycobacterial

187 2-hydroxy-3-oxoadipate synthase. Chemistry & biology, 17(4):323–32, apr 2010.

[213] Michelle F Clasquin, Eugene Melamud, Alexander Singer, Jessica R Gooding, Xiaohui Xu, Aiping Dong, Hong Cui, Shawn R Campagna, Alexei Savchenko, Alexander F Yakunin, Joshua D Rabinowitz, and Amy A Caudy. Riboneogenesis in yeast. Cell, 145(6):969–80, jun 2011.

[214] Karsten Suhre, So-Youn Shin, Ann-Kristin Petersen, Robert P Mohney, David Meredith, Brigitte Wagele,¨ Elisabeth Altmaier, Panos Deloukas, Jeanette Erdmann, Elin Grundberg, Christopher J Hammond, Martin Hrabe´ de Angelis, Gabi Kastenmuller,¨ Anna Kottgen,¨ Florian Kronenberg, Massimo Mangino, Christa Meisinger, Thomas Meitinger, Hans-Werner Mewes, Michael V Milburn, Cornelia Prehn, Johannes Raffler, Janina S Ried, Werner Romisch-Margl,¨ Nilesh J Samani, Kerrin S Small, H-Erich Wichmann, Guangju Zhai, Thomas Illig, Tim D Spector, Jerzy Adamski, Nicole Soranzo, and Christian Gieger. Human metabolic individual- ity in biomedical and pharmaceutical research. Nature, 477(7362):54–60, sep 2011.

[215] Johannes Kettunen, Taru Tukiainen, Antti-Pekka Sarin, Alfredo Ortega-Alonso, Emmi Tikka- nen, Leo-Pekka Lyytikainen,¨ Antti J Kangas, Pasi Soininen, Peter Wurtz,¨ Kaisa Silander, Danielle M Dick, Richard J Rose, Markku J Savolainen, Jorma Viikari, Mika Kah¨ onen,¨ Terho Lehtimaki,¨ Kirsi H Pietilainen,¨ Michael Inouye, Mark I McCarthy, Antti Jula, Johan Eriks- son, Olli T Raitakari, Veikko Salomaa, Jaakko Kaprio, Marjo-Riitta Jarvelin,¨ Leena Pelto- nen, Markus Perola, Nelson B Freimer, Mika Ala-Korpela, Aarno Palotie, and Samuli Ripatti. Genome-wide association study identifies multiple loci influencing human serum metabolite levels. Nature genetics, 44(3):269–76, mar 2012.

[216] Christian Gieger, Ludwig Geistlinger, Elisabeth Altmaier, Martin Hrabe´ de Angelis, Florian Kronenberg, Thomas Meitinger, Hans-Werner Mewes, H-Erich Wichmann, Klaus M Wein- berger, Jerzy Adamski, Thomas Illig, and Karsten Suhre. Genetics meets metabolomics: a genome-wide association study of metabolite profiles in human serum. PLoS genetics, 4(11):e1000282, nov 2008.

[217] Thomas Illig, Christian Gieger, Guangju Zhai, Werner Romisch-Margl,¨ Rui Wang-Sattler, Cor- nelia Prehn, Elisabeth Altmaier, Gabi Kastenmuller,¨ Bernet S Kato, Hans-Werner Mewes, Thomas Meitinger, Martin Hrabe´ de Angelis, Florian Kronenberg, Nicole Soranzo, H-Erich Wichmann, Tim D Spector, Jerzy Adamski, and Karsten Suhre. A genome-wide perspective of genetic variation in human metabolism. Nature Genetics, 42(2):137–141, feb 2010.

188 [218] Zeneng Wang, Elizabeth Klipfell, Brian J. Bennett, Robert Koeth, Bruce S. Levison, Bran- don DuGar, Ariel E. Feldstein, Earl B. Britt, Xiaoming Fu, Yoon-Mi Chung, Yuping Wu, Phil Schauer, Jonathan D. Smith, Hooman Allayee, W. H. Wilson Tang, Joseph A. DiDonato, Al- dons J. Lusis, and Stanley L. Hazen. Gut flora metabolism of phosphatidylcholine promotes cardiovascular disease. Nature, 472(7341):57–63, apr 2011.

[219] Catherine A Lozupone, Jesse I Stombaugh, Jeffrey I Gordon, Janet K Jansson, and Rob Knight. Diversity, stability and resilience of the human . Nature, 489(7415):220–30, sep 2012.

[220] Andrew B. Shreiner, John Y. Kao, and Vincent B. Young. The gut microbiome in health and in disease. Current opinion in gastroenterology, 31(1):69, 2015.

[221] Chunaram Choudhary, Brian T. Weinert, Yuya Nishida, Eric Verdin, and Matthias Mann. The growing landscape of lysine acetylation links metabolism and cell signalling. Nature Reviews Molecular Cell Biology, 15(8):536–550, jul 2014.

[222] Xianlin Han. Lipidomics for studying metabolism. Nature Reviews Endocrinology, 12(11):668–679, jul 2016.

[223] Siobhan F Clarke, Eileen F Murphy, Orla O’Sullivan, Alice J Lucey, Margaret Humphreys, Aileen Hogan, Paula Hayes, Maeve O’Reilly, Ian B Jeffery, Ruth Wood-Martin, David M Kerins, Eamonn Quigley, R Paul Ross, Paul W O’Toole, Michael G Molloy, Eanna Falvey, Fergus Shanahan, and Paul D Cotter. Exercise and associated dietary extremes impact on gut microbial diversity. Gut, 63(12):1913–1920, dec 2014.

[224] Edward J O’Brien, Joshua A Lerman, Roger L Chang, Daniel R Hyduke, Bernhard Ø Pals- son, R. Adadi, B. Volkmer, R. Milo, M. Heinemann, T. Shlomi, QK. Beg, A. Vazquez, J. Ernst, MA. de Menezes, Z. BarJoseph, AL. Barabasi, ZN. Oltvai, BD. Bennett, EH. Kim- ball, M. Gao, R. Osterhout, SJ. Van Dien, JD. Rabinowitz, S. Berthoumieux, H. de Jong, G. Baptist, C. Pinel, C. Ranquet, D. Ropers, J. Geiselmann, VM. Boer, CA. Crutchfield, PH. Bradley, D. Botstein, JD. Rabinowitz, DK. Button, BK. Cho, SA. Federowicz, M. Em- bree, YS. Park, D. Kim, BO. Palsson, BK. Cho, EM. Knight, BO. Palsson, AM. Feist, BO. Palsson, L. Gerosa, K. Kochanowski, M. Heinemann, U. Sauer, R. Gil, FJ. Silva, J. Pereto, A. Moya, WR. Harcombe, NF. Delaney, N. Leiby, N. Klitgord, CJ. Marx, BR. Haverkorn van Rijsewijk, A. Nanchen, S. Nallet, RJ. Kleijn, U. Sauer, Q. Hua, C. Yang, T. Oshima, H. Mori,

189 K. Shimizu, JR. Karr, JC. Sanghvi, DN. Macklin, MV. Gutschow, JM. Jacobs, BJ. Bolival, N. AssadGarcia, JI. Glass, MW. Covert, KV. Meyenburg, FG. Hansen, J. Kato, M. Hashimoto, IM. Keseler, A. Mackie, M. PeraltaGil, A. SantosZavaleta, S. GamaCastro, C. Bonavides- Martinez, C. Fulcher, AM. Huerta, A. Kothari, M. Krummenacker, M. Latendresse, L. Mu- nizRascado, Q. Ong, S. Paley, I. Schroder, AG. Shearer, P. Subhraveti, M. Travers, D. Weeras- inghe, V. Weiss, M. Kim, Z. Zhang, H. Okano, D. Yan, A. Groisman, T. Hwa, S. Klumpp, T. Hwa, S. Klumpp, Z. Zhang, T. Hwa, AL. Koch, P. Langfelder, S. Horvath, JA. Lerman, DR. Hyduke, H. Latif, VA. Portnoy, NE. Lewis, JD. Orth, AC. SchrimpeRutledge, RD. Smith, JN. Adkins, K. Zengler, BO. Palsson, GW. Li, E. Oh, JS. Weissman, D. Machado, R. Costa, M. Rocha, E. Ferreira, B. Tidor, I. Rocha, D. Molenaar, R. van Berlo, D. de Ridder, B. Teusink, J. Monod, R. Nahku, K. Valgepea, PJ. Lahtvee, S. Erm, K. Abner, K. Adamberg, R. Vilu, A. Nanchen, A. Schicker, U. Sauer, RW. O’Brien, OM. Neijssel, DW. Tempest, JD. Orth, TM. Conrad, J. Na, JA. Lerman, H. Nam, AM. Feist, BO. Palsson, SJ. Pirt, S. Proshkin, AR. Rahmouni, A. Mironov, E. Nudler, M. Riley, T. Abe, MB. Arnaud, MK. Berlyn, FR. Blat- tner, RR. Chaudhuri, JD. Glasner, T. Horiuchi, IM. Keseler, T. Kosuge, H. Mori, NT. Perna, Gr. Plunkett, KE. Rudd, MH. Serres, GH. Thomas, NR. Thomson, D. Wishart, BL. Wanner, R. Schuetz, L. Kuepfer, U. Sauer, R. Schuetz, N. Zamboni, M. Zampieri, M. Heinemann, U. Sauer, M. Scott, CW. Gunderson, EM. Mateescu, Z. Zhang, T. Hwa, DW. Selinger, MA. Wright, GM. Church, RW. Smith, AC. Dean, I. Thiele, RM. Fleming, A. Bordbar, J. Schel- lenberger, BO. Palsson, I. Thiele, RM. Fleming, R. Que, A. Bordbar, D. Diep, BO. Palsson, I. Thiele, N. Jamshidi, RM. Fleming, BO. Palsson, T. Tuller, A. Carmi, K. Vestsigian, S. Navon, Y. Dorfan, J. Zaborske, T. Pan, O. Dahan, I. Furman, Y. Pilpel, K. Valgepea, K. Adamberg, A. Seiman, R. Vilu, GN. Vemuri, E. Altman, DP. Sangurdekar, AB. Khodursky, MA. Eiteman, WD. Donachie, AC. Robinson, R. Young, H. Bremer, K. Zhuang, GN. Vemuri, and R. Ma- hadevan. Genome-scale models of metabolism and gene expression extend and refine growth phenotype prediction. Molecular systems biology, 9(1):693, oct 2013.

[225] U. S. Bhalla and Ravi Iyengar. Emergent Properties of Networks of Biological Signaling Pathways. Science, 283(5400):381–387, jan 1999.

[226] S Biswas, J D Storey, and J M Akey. Mapping gene expression quantitative trait loci by singular value decomposition and independent component analysis. BMC bioinformatics, 9(1):244, 2008.

[227] Paul Fogel, S Stanley Young, Douglas M Hawkins, and Nathalie Ledirac. Inferential, robust

190 non-negative matrix factorization analysis of microarray data. Bioinformatics, 23(1):44–49, 2007.

[228] Damien Chaussabel, Charles Quinn, Jing Shen, Pinakeen Patel, Casey Glaser, Nicole Bald- win, Dorothee Stichweh, Derek Blankenship, Lei Li, Indira Munagala, Lynda Bennett, Flo- rence Allantaz, Asuncion Mejias, Monica Ardura, Ellen Kaizer, Laurence Monnet, Windy Allman, Henry Randall, Diane Johnson, Aimee Lanier, Marilynn Punaro, Knut M Wittkowski, Perrin White, Joseph Fay, Goran Klintmalm, Octavio Ramilo, a Karolina Palucka, Jacques Banchereau, and Virginia Pascual. A modular analysis framework for blood genomics studies: application to systemic lupus erythematosus. Immunity, 29(1):150–64, jul 2008.

[229] Langfelder Peter, Horvath Steve, Peter Langfelder, and Steve Horvath. WGCNA: an R package for weighted correlation network analysis. BMC bioinformatics, 9(1):559, jan 2008.

[230] Eric A Stone and Julien F Ayroles. Modulated modularity clustering as an exploratory tool for functional genomic inference. PLoS Genetics, 5(5):e1000479, 2009.

[231] Peter S Gargalovic, Minori Imura, Bin Zhang, Nima M Gharavi, Michael J Clark, Joanne Pagnon, Wen-Pin Yang, Aiqing He, Amy Truong, and Shilpa Patel. Identification of inflamma- tory gene modules based on variations of human endothelial cell responses to oxidized lipids. Proceedings of the National Academy of Sciences, 103(34):12741–12746, 2006.

[232] Jeremy A Miller, Michael C Oldham, and Daniel H Geschwind. A systems level analysis of transcriptional changes in Alzheimer’s disease and normal aging. The Journal of Neuroscience, 28(6):1410–1420, feb 2008.

[233] R Abo, G D Jenkins, L Wang, and B L Fridley. Identifying the Genetic Variation of Gene Expression Using Gene Sets: Application of Novel Gene Set eQTL Approach to PharmGKB and KEGG. PLoS ONE, 7(8):e43301, 2012.

[234] Y Chen, J Zhu, P Y Lum, X Yang, S Pinto, D J MacNeil, C Zhang, J Lamb, S Edwards, and S K Sieberts. Variations in DNA elucidate molecular networks that cause disease. Nature, 452(7186):429–435, 2008.

[235] Alice Gerrits, Brad Dykstra, Marcel Otten, Leonid Bystrykh, and Gerald de Haan. Combining transcriptional profiling and genetic linkage analysis to uncover gene networks operating in hematopoietic stem cells and their progeny. Immunogenetics, 60(8):411–422, 2008.

191 [236] Dirk-Jan de Koning, Orjan Carlborg, and Chris S Haley. The genetic dissection of immune response using gene-expression studies and genome mapping. Veterinary immunology and immunopathology, 105(3-4):343–352, 2005.

[237] E E Schadt, J Lamb, X Yang, J Zhu, S Edwards, D GuhaThakurta, S K Sieberts, S Monks, M Reitman, and C Zhang. An integrative genomics approach to infer causal associations between gene expression and disease. Nature genetics, 37(7):710–717, 2005.

[238] J Ihmels, R Levy, and N Barkai. Principles of transcriptional control in the metabolic network of Saccharomyces cerevisiae. Nature biotechnology, 22(1):86–92, 2003.

[239] Sanjay K Shukla, Narayana S Murali, and Murray H Brilliant. Personalized medicine going precise: from genomics to microbiomics. Trends in molecular medicine, 21(8):461–2, aug 2015.

[240] G Ginsburg. Personalized medicine: revolutionizing drug discovery and patient care. Trends in Biotechnology, 19(12):491–496, dec 2001.

[241] Christos Katsios and Dimitrios H Roukos. Individual genomes and personalized medicine: life diversity and complexity. Personalized Medicine, 7(4):347–350, jul 2010.

[242] Andrea D. Weston and Leroy Hood. Systems Biology, Proteomics, and the Future of Health Care: Toward Predictive, Preventative, and Personalized Medicine. Journal of Proteome Research, 3(2):179–196, apr 2004.

[243] Jeremy K Nicholson. Global systems biology, personalized medicine and molecular epidemi- ology. Molecular systems biology, 2(1):52, jan 2006.

[244] G A Churchill. Fundamentals of experimental design for cDNA microarrays. Nature genetics, 32(supp):490–495, 2002.

[245] D B Allison, X Cui, G P Page, and M Sabripour. Microarray data analysis: from disarray to consolidation and consensus. Nature Reviews Genetics, 7(1):55–65, 2006.

[246] J E Jackson and J Wiley. A user’s guide to principal components, volume 19. Wiley Online Library, 1991.

[247] N S Holter, M Mitra, A Maritan, M Cieplak, J R Banavar, and N V Fedoroff. Fundamental patterns underlying gene expression profiles: simplicity from complexity. Proceedings of the

192 National Academy of Sciences, 97(15):8409–8414, 2000.

[248] A Fukushima, M Wada, S Kanaya, and M Arita. SVD-based anatomy of gene expressions for correlation analysis in Arabidopsis thaliana. DNA research, 15(6):367–374, 2008.

[249] M Y Hirai, K Sugiyama, Y Sawada, T Tohge, T Obayashi, A Suzuki, R Araki, N Sakurai, H Suzuki, and K Aoki. Omics-based identification of Arabidopsis Myb transcription fac- tors regulating aliphatic glucosinolate biosynthesis. Proceedings of the National Academy of Sciences, 104(15):6478–6483, 2007.

[250] S Raychaudhuri, J M Stuart, and R B Altman. Principal components analysis to summa- rize microarray experiments: application to sporulation time series. In Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, page 455. NIH Public Access, 2000.

[251] G H Golub and C Reinsch. Singular value decomposition and least squares solutions. Numerische Mathematik, 14(5):403–420, 1970.

[252] J Li, P R Bushell, T Chu, and R D Wolfinger. Principal Variance Components Analysis: Es- timating Batch Effects in Microarray Gene Expression Data. In Batch Effects and Noise in Microarray Experiments: Sources and Solutions., pages 141–154. Wiley & Sons, Ltd, Chich- ester, UK, 2009.

[253] G R Abecasis, L R Cardon, and W O C Cookson. A general test of association for quantitative traits in nuclear families. The American Journal of Human Genetics, 66(1):279–292, 2000.

[254] S Purcell, B Neale, K Todd-Brown, L Thomas, M A R Ferreira, D Bender, J Maller, P Sklar, P I W De Bakker, and M J Daly. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics, 81(3):559–575, 2007.

[255] Y Benjamini and Y Hochberg. Controlling the false discovery rate: a practical and powerful ap- proach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), pages 289–300, 1995.

[256] B T S Da Wei Huang and R A Lempicki. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature protocols, 4(1):44–57, 2008.

[257] B T Sherman, Q Tan, J Kir, D Liu, D Bryant, Y Guo, R Stephens, M W Baseler, H C Lane, and

193 R A Lempicki. DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic acids research, 35(suppl 2):W169–W175, 2007.

[258] M J Boedigheimer, R D Wolfinger, M B Bass, P R Bushel, J W Chou, M Cooper, J C Corton, J Fostel, S Hester, and J S Lee. Sources of variation in baseline gene expression levels from toxicogenomics study control animals across multiple laboratories. BMC genomics, 9(1):285, 2008.

[259] H G Gauch Jr. Noise reduction by eigenvector ordinations. Ecology, pages 1643–1649, 1982.

[260] P R Peres-Neto, D A Jackson, and K M Somers. How many principal components? Stopping rules for determining the number of non-trivial axes revisited. Computational Statistics & Data Analysis, 49(4):974–997, 2005.

[261] K Y Yeung and W L Ruzzo. Principal component analysis for clustering gene expression data. Bioinformatics, 17(9):763–774, 2001.

[262] M Lynch and B Walsh. Genetics and analysis of quantitative traits. Sinauer, Sunderland, MA, 1998.

[263] J H Park, S Wacholder, M H Gail, U Peters, K B Jacobs, S J Chanock, and N Chatterjee. Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nature genetics, 42(7):570–575, 2010.

[264] S E Browne, A C Bowling, U Macgarvey, M J Baik, S C Berger, M M K Muquit, E D Bird, and M F Beal. Oxidative damage and metabolic dysfunction in Huntington’s disease: selective vulnerability of the basal ganglia. Annals of neurology, 41(5):646–653, 1997.

[265] P Mecocci, U MacGarvey, and M F Beal. Oxidative damage to mitochondrial DNA is increased in Alzheimer’s disease. Annals of neurology, 36(5):747–751, 2004.

[266] V Rhein, X Song, A Wiesner, L M Ittner, G Baysang, F Meier, L Ozmen, H Bluethmann, S Drose,¨ and U Brandt. Amyloid-β and tau synergistically impair the oxidative phospho- rylation system in triple transgenic Alzheimer’s disease mice. Proceedings of the National Academy of Sciences, 106(47):20057–20062, 2009.

[267] S Hoyer. Oxidative energy metabolism in Alzheimer brain. Molecular and chemical

194 neuropathology, 16(3):207–224, 1992.

[268] T Jonsson, J K Atwal, S Steinberg, J Snaedal, P V Jonsson, S Bjornsson, H Stefansson, P Sulem, D Gudbjartsson, and J Maloney. A mutation in APP protects against Alzheimer/’s disease and age-related cognitive decline. Nature, 2012.

[269] L Parts, A˚ K Hedman, S Keildson, A J Knights, C Abreu-Goodger, M van de Bunt, J A Guerra- Assunc¸ao,˜ N Bartonicek, S van Dongen, and R Magi.¨ Extent, Causes, and Consequences of Small RNA Expression Variation in Human Adipose Tissue. PLoS Genetics, 8(5):e1002704, 2012.

[270] Youssef Idaghdour, Wendy Czika, Kevin V Shianna, Sang H Lee, Peter M Visscher, Hilary C Martin, Kelci Miclaus, Sami J Jadallah, David B Goldstein, Russell D Wolfinger, and Greg Gibson. Geographical genomics of human leukocyte gene expression variation in southern Morocco. Nature genetics, 42(1):62–7, jan 2010.

[271] Benjamin B Tournier, Hugues Dardente, Valerie´ Simonneaux, Berthe Vivien-Roels, Paul Pevet,´ Mireille Masson-Pevet,´ and Patrick Vuillez. Seasonal variations of clock gene expression in the suprachiasmatic nuclei and pars tuberalis of the European hamster (Cricetus cricetus). European Journal of Neuroscience, 25(5):1529–1536, 2007.

[272] Andrew Davie, Matteo Minghetti, and Herve Migaud. Seasonal variations in clock-gene ex- pression in Atlantic salmon (Salmo salar). Chronobiology international, 26(3):379–395, 2009.

[273] Suk-Hwan Yang and Carol A Loopstra. Seasonal variation in gene expression for loblolly pines (Pinus taeda) from different geographical regions. Tree physiology, 25(8):1063–1073, 2005.

[274] Lucia Landi and Gianfranco Romanazzi. Seasonal variation of defense-related gene expression in leaves from Bois noir affected and recovered grapevines. Journal of Agricultural and Food chemistry, 59(12):6628–6637, 2011.

[275] Simone De Jong, Marjolein Neeleman, Jurjen J Luykx, Maarten J ten Berg, Eric Strengman, Hanneke H Den Breeijen, Leon C Stijvers, Jacobine E Buizer-Voskamp, Steven C Bakker, Rene´ S Kahn, Steve Horvath, Wouter W Van Solinge, and Roel A Ophoff. Seasonal changes in gene expression represent cell-type composition in whole blood. Human molecular genetics, 23(10):2721–8, may 2014.

[276] Abdelhalim Azzi, Robert Dallmann, Alison Casserly, Hubert Rehrauer, Andrea Patrignani,

195 Bert Maier, Achim Kramer, and Steven A Brown. Circadian behavior is light-reprogrammed by plastic DNA methylation. Nature neuroscience, 17(3):377–82, mar 2014.

[277] Benjamin Weinkauf, Roman Rukwied, Hans Quiding, Leif Dahllund, Patrick Johansson, and Martin Schmelz. Local gene expression changes after UV-irradiation of human skin. PloS one, 7(6):e39411, jan 2012.

[278] Masanori Okamoto, Yoko Tanaka, Suzanne R Abrams, Yuji Kamiya, Motoaki Seki, and Eiji Nambara. High humidity induces abscisic acid 8’-hydroxylase in stomata and vasculature to regulate local and systemic abscisic acid responses in Arabidopsis. Plant physiology, 149(2):825–34, feb 2009.

[279] Toshikazu Kiyohara, Seiji Miyata, Takahiro Nakamura, Osamu Shido, Toshihiro Nakashima, and Masaaki Shibata. Differences in Fos expression in the rat brains between cold and warm ambient exposures. Brain Research Bulletin, 38(2):193–201, jan 1995.

[280] Chun-Chieh Kao, Jing-Long Huang, Liang-Shiou Ou, and Lai-Chu See. The prevalence, severity and seasonal variations of asthma, rhinitis and eczema in Taiwanese schoolchildren. Pediatric allergy and immunology : official publication of the European Society of Pediatric Allergy and Immunology, 16(5):408–15, aug 2005.

[281] Adrian G Barnett, Michael de Looper, and John F Fraser. The seasonality in heart failure deaths and total cardiovascular deaths. Australian and New Zealand journal of public health, 32(5):408–13, oct 2008.

[282] S Kasper, T A Wehr, and N E Rosenthal. Season-related forms of depression. I. Principles and clinical description of the syndrome. Der Nervenarzt, 59(4):191–9, apr 1988.

[283] A. Green and C. C. Patterson. Trends in the incidence of childhood-onset diabetes in Europe 19891998. Diabetologia, 44(S3):B3–B8, oct 2001.

[284] E Friedman, L Gyulai, M Bhargava, M Landen, S Wisniewski, J Foris, M Ostacher, R Medina, and M Thase. Seasonal changes in clinical status in : a prospective study in 1000 STEP-BD patients. Acta psychiatrica Scandinavica, 113(6):510–7, jun 2006.

[285] Erick Messias, Brian Kirkpatrick, Evelyn Bromet, David Ross, Robert W Buchanan, William T Carpenter, Cenk Tek, Kenneth S Kendler, Dermot Walsh, and Sonia Dollfus. Summer birth and deficit schizophrenia: a pooled analysis from 6 countries. Archives of general psychiatry,

196 61(10):985–9, oct 2004.

[286] K B Alstadhaug, R Salvesen, and S I Bekkelund. Seasonal variation in migraine. Cephalalgia : an international journal of headache, 25(10):811–6, oct 2005.

[287] Gerardo Iuliano, Cavit Boz, Edgardo Cristiano, Pierre Duquette, Alessandra Lugaresi, Celia Oreja-Guevara, and Vincent Van Pesch. Historical changes of seasonal differences in the fre- quency of multiple sclerosis clinical attacks: a multicenter study. Journal of neurology, pages 1–5, 2012.

[288] Grant P Parnell, Prudence N Gatt, Fiona C McKay, Stephen Schibeci, Malgorzata Krupa, Joseph E Powell, Peter M Visscher, Grant W Montgomery, Jeannette Lechner-Scott, Simon Broadley, Christopher Liddle, Mark Slee, Steve Vucic, Graeme J Stewart, and David R Booth. Ribosomal protein S6 mRNA is a upregulated in multiple sclerosis, downregulated by interferon treatment, and affected by season. Multiple sclerosis (Houndmills, Basingstoke, England), 20(6):675–85, may 2014.

[289] Kimberly Bloom-Feshbach, Wladimir J Alonso, Vivek Charu, James Tamerius, Lone Simon- sen, Mark A Miller, and Cecile´ Viboud. Latitudinal variations in seasonal activity of influenza and respiratory syncytial virus (RSV): a global comparative review. PloS one, 8(2):e54445, jan 2013.

[290] Vin Tangpricha, Elizabeth N Pearce, Tai C Chen, and Michael F Holick. Vitamin D insuffi- ciency among free-living healthy young adults. The American journal of medicine, 112(8):659, 2002.

[291] Silvina Levis, Angela Gomez, Camilo Jimenez, Luis Veras, Fangchao Ma, Shenghan Lai, Bruce Hollis, and Bernard A Roos. Vitamin D deficiency and seasonal variation in an adult South Florida population. Journal of Clinical Endocrinology & Metabolism, 90(3):1557–1562, 2005.

[292] P R Woodhouse and K T Khaw. Seasonal variation of risk factors for cardiovascular disease and diet in older adults. International journal of circumpolar health, 59(3-4):204–9, oct 2000.

[293] Dann G. Uitenbroek. Seasonal variation in leisure time physical activity. Medicine & Science in Sports & Exercise, 25(6):755–760, jun 1993.

[294] Yurii S Aulchenko, Stephan Ripke, Aaron Isaacs, and Cornelia M van Duijn. GenABEL: an R

197 library for genome-wide association analysis. Bioinformatics (Oxford, England), 23(10):1294– 6, may 2007.

[295] Jian Yang, S Hong Lee, Michael E Goddard, and Peter M Visscher. GCTA: a tool for genome- wide complex trait analysis. American journal of human genetics, 88(1):76–82, jan 2011.

[296] Robert B Cleveland, William S Cleveland, Jean E McRae, and Irma Terpenning. STL: A seasonal-trend decomposition procedure based on loess. Journal of Official Statistics, 6(1):3– 73, 1990.

[297] G. M. LJUNG and G. E. P. BOX. On a measure of lack of fit in time series models. Biometrika, 65(2):297–303, aug 1978.

[298] A. M. Stolwijk, H. Straatman, and G. A. Zielhuis. Studying seasonality by using sine and cosine functions in regression analysis. Journal of & Community Health, 53(4):235–238, apr 1999.

[299] Susan K Mikulich, Gary O Zerbe, Richard H Jones, and Thomas J Crowley. Comparing linear and nonlinear mixed model approaches to cosinor analysis. Statistics in medicine, 22(20):3195–211, oct 2003.

[300] Jeremy A Miller, Chaochao Cai, Peter Langfelder, Daniel H Geschwind, Sunil M Kurian, Daniel R Salomon, and Steve Horvath. Strategies for aggregating gene expression data: the collapseRows R function. BMC bioinformatics, 12(1):322, jan 2011.

[301] Massimo Gallerani, Roberto Reverberi, Raffaella Salmi, Michael H Smolensky, and Roberto Manfredini. Seasonal variation of platelets in a cohort of Italian blood donors: a preliminary report. European journal of medical research, 18(1):31, jan 2013.

[302] E. Kristal-Boneh, P. Froom, G. Harari, Y. Shapiro, and M. S. Green. Seasonal changes in red blood cell parameters. British Journal of Haematology, 85(3):603–607, mar 2008.

[303] G L Faedda, L Tondo, M H Teicher, R J Baldessarini, H A Gelbard, and G F Floris. Seasonal mood disorders. Patterns of seasonal recurrence in mania and depression. Archives of general psychiatry, 50(1):17–23, jan 1993.

[304] ROBERT SCRAGG. Seasonality of Cardiovascular Disease Mortality and the Possible Protec- tive Effect of Ultra-violet Radiation. International Journal of Epidemiology, 10(4):337–341,

198 dec 1981.

[305] Miklos F. Losonczy, Richard C. Mohs, and Kenneth L. Davis. Seasonal variations of human lumbar CSF neurotransmitter metabolite concentrations. Psychiatry Research, 12(1):79–87, may 1984.

[306] GW Lambert, C Reid, DM Kaye, GL Jennings, and MD Esler. Effect of and season on serotonin turnover in the brain. The Lancet, 360(9348):1840–1842, dec 2002.

[307] Jurjen J Luykx, Steven C Bakker, Eef Lentjes, Marco P M Boks, Nan van Geloven, Marinus J C Eijkemans, Esther Janson, Eric Strengman, Anne M de Lepper, Herman Westenberg, Kai E Klopper, Hendrik J Hoorn, Harry P M M Gelissen, Julian Jordan, Noortje M Tolenaar, Eric P A van Dongen, Bregt Michel, Lucija Abramovic, Steve Horvath, Teus Kappen, Peter Bruins, Peter Keijzers, Paul Borgdorff, Roel A Ophoff, and Rene´ S Kahn. Season of sampling and season of birth influence serotonin metabolite levels in human cerebrospinal fluid. PloS one, 7(2):e30497, jan 2012.

[308] Markus J. P. Kruesi. Cerebrospinal Fluid Monoamine Metabolites, Aggression, and Impulsivity in Disruptive Behavior Disorders of Children and Adolescents. Archives of General Psychiatry, 47(5):419, may 1990.

[309] SS Harris and B Dawson-Hughes. Seasonal changes in plasma 25-hydroxyvitamin D concen- trations of young American black and white women. Am J Clin Nutr, 67(6):1232–1236, jun 1998.

[310] S R De Vriese, A B Christophe, and M Maes. In humans, the seasonal variation in poly- unsaturated fatty acids is related to the seasonal variation in violent and serotonergic markers of violent suicide. Prostaglandins, leukotrienes, and essential fatty acids, 71(1):13–8, jul 2004.

[311] Mona Kimbell-Dunn, Neil Pearce, and Richard Beasley. Seasonal variation in asthma hospi- talizations and death rates in New Zealand. Respirology, 5(3):241–246, sep 2000.

[312] T. Harju, T. Tuuponen, T. Keistinen, S-L. Kivela,¨ T Tuuponen, T Keistinen, SL Kivela, A Khot, R Bum, N Evans, C Lenney, W Lenney, RE Upshur, K Knight, V Goel, H Eng, JB Mer- cer, DJ Lanska, RG Hoffmann, Y Mao, R Semenciw, H Morrison, DT Wigle, DM Flem- ing, KW Cross, R Sunderland, AM Ross, KB Weiss, RY Tseng, CN Lo, CK Li, TW Ling,

199 MM Mok, DM Fleming, R Sunderland, KW Cross, AM Ross, T Harju, T Keistinen, T Tu- uponen, SL Kivela, KH Carlsen, I Orstavik, J Leegaard, H Hoeg, W Fuller, LP Boulet, TR Bai, A Becker, D Berube, R Beveridge, DM Bowie, KR Chapman, J Cote, D Cockcroft, FM Ducharme, P Ernst, JM FitzGerald, T Kovesi, RV Hodder, P O’Byrne, B Rowe, MR Sears, FE Simons, S Spier, MA Monteil, S Juman, R Hassanally, KP Williams, L Pierre, M Rahaman, H Singh, A Trinidade, M Kimbell-Dunn, N Pearce, R Beasley, RJ Troisi, FE Speizer, WC Wil- lett, D Trichopoulos, B Rosner, D Lieberman, G Kopernik, A Porath, S Lazer, D Heimer, ML Osborne, WM Vollmer, AS Buist, RT Burnett, RE Dales, ME Raizenne, D Krewski, PW Summers, GR Roberts, M Raad-Young, T Dann, J Brook, N Mygind, SL Johnston, PK Pat- temore, G Sanderson, S Smith, MJ Campbell, LK Josephs, A Cunningham, BS Robinson, SH Myint, ME Ward, DA Tyrrell, ST Holgate, J Storr, W Lenney, A Khot, and R Bum. Sea- sonal variations in hospital treatment periods and deaths among adult asthmatics. European Respiratory Journal, 12(6):1362–1365, dec 1998.

[313] Stephen J Teach, Peter J Gergen, Stanley J Szefler, Herman E Mitchell, Agustin Calatroni, Jeremy Wildfire, Gordon R Bloomberg, Carolyn M Kercsmar, Andrew H Liu, Melanie M Makhija, Elizabeth Matsui, Wayne Morgan, George O’Connor, William W Busse, J.E. Moor- man, L.J. Akinbami, C.M. Bailey, H.S. Zahran, M.E. King, C.A. Johnson, et Al., J.I. Ivanova, R. Bergman, H.G. Birnbaum, G.L. Colice, R.A. Silverman, K. McLaurin, L.Y. Wang, Y. Zhong, L. Wheeler, S.J. Szefler, H. Mitchell, C.A. Sorkness, P.J. Gergen, W.J. Morgan, M. Kattan, et Al., M.R. Sears, G.R. Bloomberg, K.M. Trinkaus, E.B. Fisher, J.R. Musick, R.C. Strunk, H.A. Cohen, H. Blau, M. Hoshen, E. Batat, R.D. Balicer, P.J. Gergen, H. Mitchell, H. Lynn, N.W. Johnston, S.L. Johnston, J.M. Duncan, J.M. Greene, T. Kebadze, P.K. Keith, et Al., W.W. Busse, R.F. Lemanske, J.E. Gern, M.R. Sears, N.W. Johnston, N.W. Johnston, M.R. Sears, Lung National Heart, Blood Institute, R.A. Covar, S.J. Szefler, R.S. Zeiger, C.A. Sorkness, M. Moss, D.T. Mauger, et Al., E. Forno, J.C. Celedon, E. Forno, A. Fuhlbrigge, M.E. Soto-Quiros, L. Avila, B.A. Raby, J. Brehm, et Al., T. Haselkorn, R.S. Zeiger, B.E. Chipps, D.R. Mink, S.J. Szefler, F.E. Simons, et Al., W.W. Busse, W.J. Morgan, P.J. Gergen, H.E. Mitchell, J.E. Gern, A.H. Liu, et Al., A. Chevan, M. Sutherland, J.J. Wildfire, P.J. Gergen, C.A. Sorkness, H.E. Mitchell, A. Calatroni, M. Kattan, et Al., S.L. Johnston, P.K. Pattemore, G. Sanderson, S. Smith, F. Lampe, L. Josephs, et Al., S.L. Friedlander, W.W. Busse, N. Khet- suriani, N.N. Kazerouni, D.D. Erdman, X. Lu, S.C. Redd, L.J. Anderson, et Al., J.V. Williams, J.E. Crowe, R. Enriquez, P. Minton, R.S. Peebles, R.G. Hamilton, et Al., M. Schatz, R.S. Zeiger, S. Zhang, W. Chen, S.-J. Yang, C.A. Camargo, E.D. Bateman, R. Buhl, P.M. O’Byrne,

200 M. Humbert, H.K. Reddel, M.R. Sears, et Al., C.A. Sorkness, R.F. Lemanske, D.T. Mauger, S.J. Boehmer, V.M. Chinchilli, F.D. Martinez, et Al., M.C. McCormack, C. Aloe, J. Curtin- Brosnan, G.B. Diette, P.N. Breysse, E.C. Matsui, N.A. Hanania, S. Wenzel, K. Rosen,´ H.J. Hsieh, S. Mosesova, D.F. Choy, et Al., J. Corren, R.F. Lemanske, N.A. Hanania, P.E. Koren- blat, M.V. Parsey, J.R. Arron, et Al., D.L. Rosenstreich, P. Eggleston, M. Kattan, D. Baker, R.G. Slavin, P. Gergen, et Al., C.S. Murray, G. Poletti, T. Kebadze, J. Morris, A. Woodcock, S.L. Johnston, et Al., S.M. Tarlo, I. Broder, P. Corey, M. Chan-Yeung, A. Ferguson, A. Becker, et Al., E.F. Juniper, K. Svensson, A.C. Mork, and E. Stahl. Seasonal risk factors for asthma exacerbations among inner-city children. The Journal of allergy and clinical immunology, 135(6):1465–73.e5, jun 2015.

[314] E. V. Moltchanova, N. Schreier, N. Lammi, and M. Karvonen. Seasonal variation of diagnosis of Type1 diabetes mellitus in children worldwide. Diabetic Medicine, 26(7):673–678, jul 2009.

[315] Shaye Kivity, Nancy Agmon-Levin, Michael Zisappl, Yinon Shapira, Endre V Nagy, Katalin Danko,´ Zoltan Szekanecz, Pnina Langevitz, and Yehuda Shoenfeld. Vitamin D and autoim- mune thyroid diseases. Cellular and Molecular Immunology, 8(3):243–247, may 2011.

[316] M. Djennane, S. Lebbah, C. Roux, H. Djoudi, E. Cavalier, and J.-C. Souberbielle. Vitamin D status of schoolchildren in Northern Algeria, seasonal variations and determinants of vitamin D deficiency. Osteoporosis International, 25(5):1493–1502, may 2014.

[317] Irina Voineagu, Xinchen Wang, Patrick Johnston, Jennifer K Lowe, Yuan Tian, Steve Horvath, Jonathan Mill, Rita M Cantor, Benjamin J Blencowe, and Daniel H Geschwind. Transcriptomic analysis of autistic brain reveals convergent . Nature, 474(7351):380–4, jun 2011.

[318] F. Hormozdiari, O. Penn, E. Borenstein, and E. Eichler. The discovery of integrated gene networks for autism and related disorders. Genome Research, 25(1):142–154, nov 2014.

[319] Bin Zhang, Chris Gaiteri, Liviu-Gabriel Bodea, Zhi Wang, Joshua McElwee, Alexei A Podtelezhnikov, Chunsheng Zhang, Tao Xie, Linh Tran, Radu Dobrin, Eugene Fluder, Bruce Clurman, Stacey Melquist, Manikandan Narayanan, Christine Suver, Hardik Shah, Milind Ma- hajan, Tammy Gillis, Jayalakshmi Mysore, Marcy E MacDonald, John R Lamb, David A Ben- nett, Cliona Molony, David J Stone, Vilmundur Gudnason, Amanda J Myers, Eric E Schadt, Harald Neumann, Jun Zhu, and Valur Emilsson. Integrated systems approach identifies genetic

201 nodes and networks in late-onset Alzheimer’s disease. Cell, 153(3):707–20, apr 2013.

[320] Michael R Johnson, Jacques Behmoaras, Leonardo Bottolo, Michelle L Krishnan, Katharina Pernhorst, Paola L Meza Santoscoy, Tiziana Rossetti, Doug Speed, Prashant K Srivastava, Marc Chadeau-Hyam, Nabil Hajji, Aleksandra Dabrowska, Maxime Rotival, Banafsheh Raz- zaghi, Stjepana Kovac, Klaus Wanisch, Federico W Grillo, Anna Slaviero, Sarah R Langley, Kirill Shkura, Paolo Roncon, Tisham De, Manuel Mattheisen, Pitt Niehusmann, Terence J O’Brien, Slave Petrovski, Marec von Lehe, Per Hoffmann, Johan Eriksson, Alison J Coffey, Sven Cichon, Matthew Walker, Michele Simonato, Ben´ edicte´ Danis, Manuela Mazzuferi, Pa- trik Foerch, Susanne Schoch, Vincenzo De Paola, Rafal M Kaminski, Vincent T Cunliffe, Al- bert J Becker, and Enrico Petretto. Systems genetics identifies Sestrin 3 as a regulator of a pro- convulsant gene network in human epileptic hippocampus. Nature communications, 6:6031, jan 2015.

[321] Xiaolin Xiao, Aida Moreno-Moral, Maxime Rotival, Leonardo Bottolo, and Enrico Petretto. Multi-tissue Analysis of Co-expression Networks by Higher-Order Generalized Singular Value Decomposition Identifies Functionally Coherent Transcriptional Modules. PLoS genetics, 10(1):e1004006, jan 2014.

[322] Eran Segal, Michael Shapira, Aviv Regev, Dana Pe’er, David Botstein, Daphne Koller, and Nir Friedman. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nature genetics, 34(2):166–76, jun 2003.

[323] Hongqiang Li, Hao Chen, Lei Bao, Kenneth F Manly, Elissa J Chesler, Lu Lu, Jintao Wang, Mi Zhou, Robert W Williams, and Yan Cui. Integrative genetic analysis of transcription mod- ules: towards filling the gap between genetic loci and inherited traits. Human molecular genetics, 15(3):481–492, 2006.

[324] S H Lee, J Yang, M E Goddard, P M Visscher, and N R Wray. Estimation of pleiotropy between complex diseases using single-nucleotide polymorphism-derived genomic relationships and restricted maximum likelihood. Bioinformatics (Oxford, England), 28(19):2540–2, oct 2012.

[325] Sarah E Medland, Dale R Nyholt, Jodie N Painter, Brian P McEvoy, Allan F McRae, Gu Zhu, Scott D Gordon, Manuel A R Ferreira, Margaret J Wright, Anjali K Henders, Megan J Camp- bell, David L Duffy, Narelle K Hansell, Stuart Macgregor, Wendy S Slutske, Andrew C Heath, Grant W Montgomery, and Nicholas G Martin. Common variants in the trichohyalin gene are

202 associated with straight hair in Europeans. American journal of human genetics, 85(5):750–5, nov 2009.

[326] T Mark Beasley, Stephen Erickson, and David B Allison. Rank-based inverse normal trans- formations are increasingly used, but are they merited? Behavior genetics, 39(5):580–95, sep 2009.

[327] Isabella Zwiener, Barbara Frisch, and Harald Binder. Transforming RNA-Seq Data to Improve the Performance of Prognostic Gene Signatures. PloS one, 9(1):e85150, jan 2014.

[328] Noah Zaitlen, Peter Kraft, Nick Patterson, Bogdan Pasaniuc, Gaurav Bhatia, Samuela Pollack, and Alkes L Price. Using extended genealogy to estimate components of heritability for 23 quantitative and dichotomous traits. PLoS genetics, 9(5):e1003520, may 2013.

[329] Amartya Sanyal, Bryan R Lajoie, Gaurav Jain, and Job Dekker. The long-range interaction landscape of gene promoters. Nature, 489(7414):109–13, sep 2012.

[330] Andrew D Johnson, Robert E Handsaker, Sara L Pulit, Marcia M Nizzari, Christopher J O’Donnell, and Paul I W de Bakker. SNAP: a web-based tool for identification and anno- tation of proxy SNPs using HapMap. Bioinformatics (Oxford, England), 24(24):2938–9, dec 2008.

[331] Randall J Pruim, Ryan P Welch, Serena Sanna, Tanya M Teslovich, Peter S Chines, Terry P Gliedt, Michael Boehnke, Gonc¸alo R Abecasis, and Cristen J Willer. LocusZoom: regional visualization of genome-wide association scan results. Bioinformatics (Oxford, England), 26(18):2336–7, sep 2010.

[332] Danielle Welter, Jacqueline MacArthur, Joannella Morales, Tony Burdett, Peggy Hall, Heather Junkins, Alan Klemm, Paul Flicek, Teri Manolio, Lucia Hindorff, and Helen Parkinson. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic acids research, 42(Database issue):D1001–6, jan 2014.

[333] Mar Gonzalez-Porta,` Adam Frankish, Johan Rung, Jennifer Harrow, and Alvis Brazma. Tran- scriptome analysis of human tissues and cell lines reveals one dominant transcript per gene. Genome biology, 14(7):R70, jan 2013.

[334] Paul Flicek, M Ridwan Amode, Daniel Barrell, Kathryn Beal, Simon Brent, Denise Carvalho- Silva, Peter Clapham, Guy Coates, Susan Fairley, Stephen Fitzgerald, Laurent Gil, Leo Gor-

203 don, Maurice Hendrix, Thibaut Hourlier, Nathan Johnson, Andreas K Kah¨ ari,¨ Damian Keefe, Stephen Keenan, Rhoda Kinsella, Monika Komorowska, Gautier Koscielny, Eugene Kulesha, Pontus Larsson, Ian Longden, William McLaren, Matthieu Muffato, Bert Overduin, Miguel Pignatelli, Bethan Pritchard, Harpreet Singh Riat, Graham R S Ritchie, Magali Ruffier, Michael Schuster, Daniel Sobral, Y Amy Tang, Kieron Taylor, Stephen Trevanion, Jana Vandrovcova, , Mark Wilson, Steven P Wilder, Bronwen L Aken, Ewan Birney, Fiona Cun- ningham, Ian Dunham, Richard Durbin, Xose´ M Fernandez-Suarez,´ Jennifer Harrow, Javier Herrero, Tim J P Hubbard, Anne Parker, Glenn Proctor, Giulietta Spudich, Jan Vogel, Andy Yates, Amonida Zadissa, and Stephen M J Searle. Ensembl 2012. Nucleic acids research, 40(Database issue):D84–90, jan 2012.

[335] Fiona Cunningham, M Ridwan Amode, Daniel Barrell, Kathryn Beal, Konstantinos Billis, Si- mon Brent, Denise Carvalho-Silva, Peter Clapham, Guy Coates, Stephen Fitzgerald, Laurent Gil, Carlos Garc´ıa Giron,´ Leo Gordon, Thibaut Hourlier, Sarah E Hunt, Sophie H Janacek, Nathan Johnson, Thomas Juettemann, Andreas K Kah¨ ari,¨ Stephen Keenan, Fergal J Mar- tin, Thomas Maurel, William McLaren, Daniel N Murphy, Rishi Nag, Bert Overduin, Anne Parker, Mateus Patricio, Emily Perry, Miguel Pignatelli, Harpreet Singh Riat, Daniel Shep- pard, Kieron Taylor, Anja Thormann, Alessandro Vullo, Steven P Wilder, Amonida Zadissa, Bronwen L Aken, Ewan Birney, Jennifer Harrow, Rhoda Kinsella, Matthieu Muffato, Magali Ruffier, Stephen M J Searle, Giulietta Spudich, Stephen J Trevanion, Andy Yates, Daniel R Zerbino, and Paul Flicek. Ensembl 2015. Nucleic acids research, 43(D1):D662–669, oct 2014.

[336] Cheng-Wei Chang, Wei-Chung Cheng, Chaang-Ray Chen, Wun-Yi Shu, Min-Lung Tsai, Ching-Lung Huang, Ian C. Hsu, JD Watson, NH Hopkins, JW Roberts, JA Steitz, AM Weiner, AJ Butte, VJ Dzau, SB Glueck, Z Tu, L Wang, M Xu, X Zhou, T Chen, KI Goh, ME Cusick, D Valle, B Childs, M Vidal, AP Pilbrow, LJ Ellmers, MA Black, CS Moravec, WE Sweet, AA Morgan, JT Dudley, T Deshpande, AJ Butte, WC Cheng, CW Chang, CR Chen, ML Tsai, WY Shu, JA Warrington, A Nair, M Mahadevappa, M Tsyganskaya, LL Hsiao, F Dangond, T Yoshida, R Hong, RV Jensen, E Eisenberg, EY Levanon, J Zhu, F He, S Song, J Wang, J Yu, Z Dezso, Y Nikolsky, E Sviridov, W Shi, T Serebriyskaya, X She, CA Rohl, JC Castle, AV Kulkarni, JM Johnson, S Liang, Y Li, X Be, S Howes, W Liu, A Saito-Hisaminato, T Kata- giri, S Kakiuchi, T Nakamura, T Tsunoda, J Misra, W Schmitt, D Hwang, LL Hsiao, S Gul- lans, I Yanai, H Benjamin, M Shmoish, V Chalifa-Caspi, M Shklar, D Greco, P Somervuo, A Di Lieto, T Raitila, L Nitsch, L Wang, AK Srivastava, CE Schwartz, AE Vinogradov,

204 AE Vinogradov, D Farre, N Bellora, L Mularoni, X Messeguer, MM Alba, J Zhu, F He, S Hu, J Yu, L Zhang, WH Li, WC Cheng, ML Tsai, CW Chang, CL Huang, CR Chen, JN McClintick, HJ Edenberg, D Greco, D Leo, U di Porzio, C Perrone Capano, P Auvinen, M Ashburner, CA Ball, JA Blake, D Botstein, H Butler, M Kanehisa, S Goto, M Hattori, KF Aoki-Kinoshita, M Itoh, H Morita, HL Rehm, A Menesses, B McDonough, AE Roberts, KG Becker, KC Barnes, TJ Bright, SA Wang, GP Page, JW Edwards, GL Gadbury, P Yelisetti, J Wang, R Tibshirani, WJ Lin, HM Hsueh, JJ Chen, C Wei, J Li, RE Bumgarner, P Cahan, F Rovegno, D Mooney, JC Newman, G St Laurent 3rd, A Ramasamy, A Mondry, CC Holmes, DG Altman, RA Irizarry, D Warren, F Spencer, IF Kim, S Biswal, S Podder, P Mukhopadhyay, TC Ghosh, AI Su, MP Cooke, KA Ching, Y Hakak, JR Walker, AI Su, T Wiltshire, S Bat- alov, H Lapp, KA Ching, AE Vinogradov, OV Anatskaya, AD Smith, P Sumazin, MQ Zhang, J Salvatico, JH Kim, IK Chung, MT Muller, L DiMascio, C Voermans, M Uqoezwa, A Dun- can, D Lu, H Sandoval, P Thiagarajan, SK Dasgupta, A Schumacher, JT Prchal, H Parkinson, M Kapushesky, N Kolesnikov, G Rustici, M Shojatalab, CL Wilson, CJ Miller, AL Asare, Z Gao, VJ Carey, R Wang, V Seyfert-Margolis, RC Gentleman, VJ Carey, DM Bates, B Bol- stad, M Dettling, E Hubbell, WM Liu, R Mei, G Liu, AE Loraine, R Shigeta, M Cline, J Cheng, W Huang Da, BT Sherman, RA Lempicki, RM Kuhn, D Karolchik, AS Zweig, H Trumbower, and DJ Thomas. Identification of Human Housekeeping Genes and Tissue-Selective Genes by Microarray Meta-Analysis. PLoS ONE, 6(7):e22859, jul 2011.

[337] M A Weniger, K Pulford, S Gesk, S Ehrlich, A H Banham, L Lyne, J I Martin-Subero, R Siebert, M J S Dyer, P Moller,¨ and T F E Barth. Gains of the proto-oncogene BCL11A and nuclear accumulation of BCL11AXL protein are frequent in primary mediastinal B-cell lymphoma. Leukemia, 20(10):1880–1882, oct 2006.

[338] Yong Yu, Juexuan Wang, Walid Khaled, Shannon Burke, Peng Li, Xiongfeng Chen, Wei Yang, Nancy A. Jenkins, Neal G. Copeland, Shujun Zhang, and Pentao Liu. Bcl11a is essential for lymphoid development and negatively regulates p53. The Journal of Experimental Medicine, 209(13):2467–2483, dec 2012.

[339] T. Masuda, X. Wang, M. Maeda, M. C. Canver, F. Sher, A. P. W. Funnell, C. Fisher, M. Suciu, G. E. Martyn, L. J. Norton, C. Zhu, R. Kurita, Y. Nakamura, J. Xu, D. R. Higgs, M. Crossley, D. E. Bauer, S. H. Orkin, P. V. Kharchenko, and T. Maeda. Transcription factors LRF and BCL11A independently repress expression of fetal hemoglobin. Science, 351(6270):285–289, jan 2016.

205 [340] Cristina Dias, SaraB. Estruch, SarahA. Graham, Jeremy McRae, StephenJ. Sawiak, JaneA. Hurst, ShelaghK. Joss, SusanE. Holder, JennyE.V. Morton, Claire Turner, Julien Thevenon, Kelly Mellul, Gabriela Sanchez-Andrade,´ Ximena Ibarra-Soria, Pelagia Deriziotis, RuiF. San- tos, Song-Choon Lee, Laurence Faivre, Tjitske Kleefstra, Pentao Liu, MathewE. Hurles, Si- monE. Fisher, DarrenW. Logan, and Darren W Logan. BCL11A Haploinsufficiency Causes an Intellectual Disability Syndrome and Dysregulates Transcription. The American Journal of Human Genetics, 99(2):253–274, aug 2016.

[341] Stephan Menzel, Chad Garner, Ivo Gut, Fumihiko Matsuda, Masao Yamaguchi, Simon Heath, Mario Foglio, Diana Zelenika, Anne Boland, Helen Rooks, Steve Best, Tim D Spector, Martin Farrall, Mark Lathrop, and Swee Lay Thein. A QTL influencing F cell production maps to a gene encoding a zinc-finger protein on chromosome 2p15. Nature genetics, 39(10):1197–9, oct 2007.

[342] Pallav Bhatnagar, Shirley Purvis, Emily Barron-Casella, Michael R DeBaun, James F Casella, Dan E Arking, and Jeffrey R Keefer. Genome-wide association study identifies genetic variants influencing F-cell levels in sickle-cell patients. Journal of human genetics, 56(4):316–23, apr 2011.

[343] Manuela Uda, Renzo Galanello, Serena Sanna, Guillaume Lettre, Vijay G Sankaran, Weimin Chen, Gianluca Usala, Fabio Busonero, Andrea Maschio, Giuseppe Albai, Maria Grazia Pi- ras, Natascia Sestu, Sandra Lai, Mariano Dei, Antonella Mulas, Laura Crisponi, Silvia Naitza, Isadora Asunis, Manila Deiana, Ramaiah Nagaraja, Lucia Perseu, Stefania Satta, Maria Do- lores Cipollina, Carla Sollaino, Paolo Moi, Joel N Hirschhorn, Stuart H Orkin, Gonc¸alo R Abecasis, David Schlessinger, and Antonio Cao. Genome-wide association study shows BCL11A associated with persistent fetal hemoglobin and amelioration of the phenotype of beta-thalassemia. Proceedings of the National Academy of Sciences of the United States of America, 105(5):1620–5, feb 2008.

[344] Siana Nkya Mtatiro, Tarjinder Singh, Helen Rooks, Josephine Mgaya, Harvest Mariki, De- ogratius Soka, Bruno Mmbando, Evarist Msaki, Iris Kolder, Swee Lay Thein, Stephan Menzel, Sharon E Cox, Julie Makani, and Jeffrey C Barrett. Genome wide association study of fetal hemoglobin in sickle cell anemia in Tanzania. PloS one, 9(11):e111464, nov 2014.

[345] Nadia Solovieff, Jacqueline N Milton, Stephen W Hartley, Richard Sherva, Paola Sebastiani, Daniel A Dworkis, Elizabeth S Klings, Lindsay A Farrer, Melanie E Garrett, Allison Ashley-

206 Koch, Marilyn J Telen, Supan Fucharoen, Shau Yin Ha, Chi-Kong Li, David H K Chui, Clin- ton T Baldwin, and Martin H Steinberg. Fetal hemoglobin in sickle cell anemia: genome-wide association studies suggest a regulatory region in the 5’ olfactory receptor gene cluster. Blood, 115(9):1815–22, mar 2010.

[346] Pallav Bhatnagar, Shirley Purvis, Emily Barron-Casella, Michael R DeBaun, James F Casella, Dan E Arking, and Jeffrey R Keefer. Genome-wide association study identifies genetic variants influencing F-cell levels in sickle-cell patients. Journal of human genetics, 56(4):316–23, apr 2011.

[347] Jacqueline N Milton, Helen Rooks, Emma Drasar, Elizabeth L McCabe, Clinton T Baldwin, Efi Melista, Victor R Gordeuk, Mehdi Nouraie, Gregory R Kato, Gregory J Kato, Caterina Minniti, James Taylor, Andrew Campbell, Lori Luchtman-Jones, Sohail Rana, Oswaldo Castro, Yingze Zhang, Swee Lay Thein, Paola Sebastiani, Mark T Gladwin, Walk-PHAAST Investigators, and Martin H Steinberg. Genetic determinants of haemolysis in sickle cell anaemia. British journal of haematology, 161(2):270–8, apr 2013.

[348] Paula J Griffin, Paola Sebastiani, Heather Edward, Clinton T Baldwin, Mark T Gladwin, Vic- tor R Gordeuk, David H K Chui, and Martin H Steinberg. The genetics of hemoglobin A2 regulation in sickle cell anemia. American journal of hematology, 89(11):1019–23, nov 2014.

[349] Manit Nuinoon, Wattanan Makarasara, Taisei Mushiroda, Iswari Setianingsih, Pustika Amalia Wahidiyat, Orapan Sripichai, Natsuhiko Kumasaka, Atsushi Takahashi, Saovaros Svasti, Thongperm Munkongdee, Surakameth Mahasirimongkol, Chayanon Peerapittayamongkol, Vip Viprakasit, Naoyuki Kamatani, Pranee Winichagoon, Michiaki Kubo, Yusuke Nakamura, and Suthat Fucharoen. A genome-wide association identified the common genetic variants in- fluence disease severity in beta0-thalassemia/hemoglobin E. Human genetics, 127(3):303–14, mar 2010.

[350] Jacques Fellay, Dongliang Ge, Kevin V Shianna, Sara Colombo, Bruno Ledergerber, Eliz- abeth T Cirulli, Thomas J Urban, Kunlin Zhang, Curtis E Gumbs, Jason P Smith, An- tonella Castagna, Alessandro Cozzi-Lepri, Andrea De Luca, Philippa Easterbrook, Huldrych F Gunthard,¨ Simon Mallal, Cristina Mussini, Judith Dalmau, Javier Martinez-Picado, Jose´ M Miro, Niels Obel, Steven M Wolinsky, Jeremy J Martinson, Roger Detels, Joseph B Margolick, Lisa P Jacobson, Patrick Descombes, Stylianos E Antonarakis, Jacques S Beckmann, Stephen J O’Brien, Norman L Letvin, Andrew J McMichael, Barton F Haynes, Mary Carrington, Sheng

207 Feng, Amalio Telenti, David B Goldstein, and NIAID Center for HIV/AIDS Vaccine Immunol- ogy (CHAVI). Common genetic variation and the control of HIV-1 in humans. PLoS genetics, 5(12):e1000791, dec 2009.

[351] Afshin Parsa, Yen-Pei C Chang, Reagan J Kelly, Mary C Corretti, Kathleen A Ryan, Shawn W Robinson, Stephen S Gottlieb, Sharon L R Kardia, Alan R Shuldiner, and Stephen B Liggett. Hypertrophy-associated polymorphisms ascertained in a founder cohort applied to heart failure risk and mortality. Clinical and translational science, 4(1):17–23, feb 2011.

[352] Kim D Pruitt, Garth R Brown, Susan M Hiatt, Franc¸oise Thibaud-Nissen, Alexander Astashyn, Olga Ermolaeva, Catherine M Farrell, Jennifer Hart, Melissa J Landrum, Kelly M McGarvey, Michael R Murphy, Nuala A O’Leary, Shashikant Pujar, Bhanu Rajput, Sanjida H Rangwala, Lillian D Riddick, Andrei Shkeda, Hanzhen Sun, Pamela Tamez, Raymond E Tully, Craig Wallin, David Webb, Janet Weber, Wendy Wu, Michael DiCuccio, Paul Kitts, Donna R Ma- glott, Terence D Murphy, and James M Ostell. RefSeq: an update on mammalian reference sequences. Nucleic acids research, 42(Database issue):D756–63, jan 2014.

[353] Kim D Pruitt, Tatiana Tatusova, and Donna R Maglott. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic acids research, 35(Database issue):D61–5, jan 2007.

[354] Jian-Wen Han, Hou-Feng Zheng, Yong Cui, Liang-Dan Sun, Dong-Qing Ye, Zhi Hu, Jin-Hua Xu, Zhi-Ming Cai, Wei Huang, Guo-Ping Zhao, Hong-Fu Xie, Hong Fang, Qian-Jin Lu, Jian- Hua Xu, Xiang-Pei Li, Yun-Feng Pan, Dan-Qi Deng, Fan-Qin Zeng, Zhi-Zhong Ye, Xiao-Yan Zhang, Qing-Wen Wang, Fei Hao, Li Ma, Xian-Bo Zuo, Fu-Sheng Zhou, Wen-Hui Du, Yi- Lin Cheng, Jian-Qiang Yang, Song-Ke Shen, Jian Li, Yu-Jun Sheng, Xiao-Xia Zuo, Wei-Fang Zhu, Fei Gao, Pei-Lian Zhang, Qing Guo, Bo Li, Min Gao, Feng-Li Xiao, Cheng Quan, Chi Zhang, Zheng Zhang, Kun-Ju Zhu, Yang Li, Da-Yan Hu, Wen-Sheng Lu, Jian-Lin Huang, Sheng-Xiu Liu, Hui Li, Yun-Qing Ren, Zai-Xing Wang, Chun-Jun Yang, Pei-Guang Wang, Wen-Ming Zhou, Yong-Mei Lv, An-Ping Zhang, Sheng-Quan Zhang, Da Lin, Yi Li, Hui Qi Low, Min Shen, Zhi-Fang Zhai, Ying Wang, Feng-Yu Zhang, Sen Yang, Jian-Jun Liu, and Xue-Jun Zhang. Genome-wide association study in a Chinese Han population identifies nine new susceptibility loci for systemic lupus erythematosus. Nature genetics, 41(11):1234–7, nov 2009.

[355] James Bentham, David L Morris, Deborah S Cunninghame Graham, Christopher L Pinder,

208 Philip Tombleson, Timothy W Behrens, Javier Mart´ın, Benjamin P Fairfax, Julian C Knight, Lingyan Chen, Joseph Replogle, Ann-Christine Syvanen,¨ Lars Ronnblom,¨ Robert R Graham, Joan E Wither, John D Rioux, Marta E Alarcon-Riquelme,´ and Timothy J Vyse. Genetic association analyses implicate aberrant regulation of innate and adaptive immunity genes in the pathogenesis of systemic lupus erythematosus. Nature genetics, 47(12):1457–64, dec 2015.

[356] Cristen J Global Lipids Genetics Consortium, Cristen J Willer, Ellen M Schmidt, Sebanti Sen- gupta, Gina M Peloso, Stefan Gustafsson, Stavroula Kanoni, Andrea Ganna, Jin Chen, Mar- tin L Buchkovich, Samia Mora, Jacques S Beckmann, Jennifer L Bragg-Gresham, Hsing-Yi Chang, Aye Demirkan, Heleen M Den Hertog, Ron Do, Louise A Donnelly, Georg B Ehret, Tonu˜ Esko, Mary F Feitosa, Teresa Ferreira, Krista Fischer, Pierre Fontanillas, Ross M Fraser, Daniel F Freitag, Deepti Gurdasani, Kauko Heikkila,¨ Elina Hypponen,¨ Aaron Isaacs, Anne U Jackson, Asa Johansson, Toby Johnson, Marika Kaakinen, Johannes Kettunen, Marcus E Kle- ber, Xiaohui Li, Jian’an Luan, Leo-Pekka Lyytikainen,¨ Patrik K E Magnusson, Massimo Mangino, Evelin Mihailov, May E Montasser, Martina Muller-Nurasyid,¨ Ilja M Nolte, Jef- frey R O’Connell, Cameron D Palmer, Markus Perola, Ann-Kristin Petersen, Serena Sanna, Richa Saxena, Susan K Service, Sonia Shah, Dmitry Shungin, Carlo Sidore, Ci Song, Rona J Strawbridge, Ida Surakka, Toshiko Tanaka, Tanya M Teslovich, Gudmar Thorleifsson, Evita G Van den Herik, Benjamin F Voight, Kelly A Volcik, Lindsay L Waite, Andrew Wong, Ying Wu, Weihua Zhang, Devin Absher, Gershim Asiki, Inesˆ Barroso, Latonya F Been, Jennifer L Bolton, Lori L Bonnycastle, Paolo Brambilla, Mary S Burnett, Giancarlo Cesana, Maria Dim- itriou, Alex S F Doney, Angela Doring,¨ Paul Elliott, Stephen E Epstein, Gudmundur Ingi Eyjolfsson, Bruna Gigante, Mark O Goodarzi, Harald Grallert, Martha L Gravito, Christo- pher J Groves, Goran¨ Hallmans, Anna-Liisa Hartikainen, Caroline Hayward, Dena Hernan- dez, Andrew A Hicks, Hilma Holm, Yi-Jen Hung, Thomas Illig, Michelle R Jones, Pon- tiano Kaleebu, John J P Kastelein, Kay-Tee Khaw, Eric Kim, Norman Klopp, Pirjo Komu- lainen, Meena Kumari, Claudia Langenberg, Terho Lehtimaki,¨ Shih-Yi Lin, Jaana Lindstrom,¨ Ruth J F Loos, Franc¸ois Mach, Wendy L McArdle, Christa Meisinger, Braxton D Mitchell, Gabrielle Muller,¨ Ramaiah Nagaraja, Narisu Narisu, Tuomo V M Nieminen, Rebecca N Nsub- uga, Isleifur Olafsson, Ken K Ong, Aarno Palotie, Theodore Papamarkou, Cristina Pomilla, Anneli Pouta, Daniel J Rader, Muredach P Reilly, Paul M Ridker, Fernando Rivadeneira, Igor Rudan, Aimo Ruokonen, Nilesh Samani, Hubert Scharnagl, Janet Seeley, Kaisa Silan- der, Alena Stancakov´ a,´ Kathleen Stirrups, Amy J Swift, Laurence Tiret, Andre G Uitterlinden, L Joost van Pelt, Sailaja Vedantam, Nicholas Wainwright, Cisca Wijmenga, Sarah H Wild,

209 Gonneke Willemsen, Tom Wilsgaard, James F Wilson, Elizabeth H Young, Jing Hua Zhao, Linda S Adair, Dominique Arveiler, Themistocles L Assimes, Stefania Bandinelli, Franklyn Bennett, Murielle Bochud, Bernhard O Boehm, Dorret I Boomsma, Ingrid B Borecki, Ste- fan R Bornstein, Pascal Bovet, Michel Burnier, Harry Campbell, Aravinda Chakravarti, John C Chambers, Yii-Der Ida Chen, Francis S Collins, Richard S Cooper, John Danesh, George De- doussis, Ulf de Faire, Alan B Feranil, Jean Ferrieres,` Luigi Ferrucci, Nelson B Freimer, Chris- tian Gieger, Leif C Groop, Vilmundur Gudnason, Ulf Gyllensten, Anders Hamsten, Tamara B Harris, Aroon Hingorani, Joel N Hirschhorn, Albert Hofman, G Kees Hovingh, Chao Agnes Hsiung, Steve E Humphries, Steven C Hunt, Kristian Hveem, Carlos Iribarren, Marjo-Riitta Jarvelin,¨ Antti Jula, Mika Kah¨ onen,¨ Jaakko Kaprio, Antero Kesaniemi,¨ Mika Kivimaki, Jas- pal S Kooner, Peter J Koudstaal, Ronald M Krauss, Diana Kuh, Johanna Kuusisto, Kirsten O Kyvik, Markku Laakso, Timo A Lakka, Lars Lind, Cecilia M Lindgren, Nicholas G Mar- tin, Winfried Marz,¨ Mark I McCarthy, Colin A McKenzie, Pierre Meneton, Andres Metspalu, Leena Moilanen, Andrew D Morris, Patricia B Munroe, Inger Njølstad, Nancy L Pedersen, Chris Power, Peter P Pramstaller, Jackie F Price, Bruce M Psaty, Thomas Quertermous, Rainer Rauramaa, Danish Saleheen, Veikko Salomaa, Dharambir K Sanghera, Jouko Saramies, Peter E H Schwarz, Wayne H-H Sheu, Alan R Shuldiner, Agneta Siegbahn, Tim D Spector, Kari Stefansson, David P Strachan, Bamidele O Tayo, Elena Tremoli, Jaakko Tuomilehto, Matti Uusitupa, Cornelia M van Duijn, Peter Vollenweider, Lars Wallentin, Nicholas J Wareham, John B Whitfield, Bruce H R Wolffenbuttel, Jose M Ordovas, Eric Boerwinkle, Colin N A Palmer, Unnur Thorsteinsdottir, Daniel I Chasman, Jerome I Rotter, Paul W Franks, Samuli Ripatti, L Adrienne Cupples, Manjinder S Sandhu, Stephen S Rich, Michael Boehnke, Panos Deloukas, Sekar Kathiresan, Karen L Mohlke, Erik Ingelsson, and Gonc¸alo R Abecasis. Dis- covery and refinement of loci associated with lipid levels. Nature genetics, 45(11):1274–83, nov 2013.

[357] Mayumi Ueta, Hiromi Sawai, Chie Sotozono, Yuki Hitomi, Nahoko Kaniwa, Mee Kum Kim, Kyoung Yul Seo, Kyung-Chul Yoon, Choun-Ki Joo, Chitra Kannabiran, Tais Hitomi Waka- matsu, Virender Sangwan, Varsha Rathi, Sayan Basu, Takeshi Ozeki, Taisei Mushiroda, Emiko Sugiyama, Keiko Maekawa, Ryosuke Nakamura, Michiko Aihara, Kayoko Matsunaga, Aki- hiro Sekine, Jose´ Alvaro´ Pereira Gomes, Junji Hamuro, Yoshiro Saito, Michiaki Kubo, Shigeru Kinoshita, and Katsushi Tokunaga. IKZF1, a new susceptibility gene for cold medicine-related Stevens-Johnson syndrome/toxic epidermal necrolysis with severe mucosal involvement. The Journal of allergy and clinical immunology, 135(6):1538–45.e17, jun 2015.

210 [358] Fabrice Danjou, Magdalena Zoledziewska, Carlo Sidore, Maristella Steri, Fabio Busonero, Andrea Maschio, Antonella Mulas, Lucia Perseu, Susanna Barella, Eleonora Porcu, Giorgio Pistis, Maristella Pitzalis, Mauro Pala, Stephan Menzel, Sarah Metrustry, Timothy D Spector, Lidia Leoni, Andrea Angius, Manuela Uda, Paolo Moi, Swee Lay Thein, Renzo Galanello, Gonc¸alo R Abecasis, David Schlessinger, Serena Sanna, and Francesco Cucca. Genome-wide association analyses based on whole-genome sequencing in Sardinia provide insights into reg- ulation of hemoglobin levels. Nature genetics, 47(11):1264–71, nov 2015.

[359] Pallav Bhatnagar, Shirley Purvis, Emily Barron-Casella, Michael R DeBaun, James F Casella, Dan E Arking, and Jeffrey R Keefer. Genome-wide association study identifies genetic variants influencing F-cell levels in sickle-cell patients. Journal of human genetics, 56(4):316–23, apr 2011.

[360] I C Grieve, N J Dickens, M Pravenec, V Kren, N Hubner, S A Cook, T J Aitman, E Petretto, and J Mangion. Genome-wide co-expression analysis in multiple tissues. PLoS ONE, 3(12):e4033, 2008.

[361] Volker Fensterl and Ganes C Sen. Interferon-induced Ifit proteins: their role in viral pathogen- esis. Journal of virology, 89(5):2462–8, mar 2015.

[362] Jixi Li, Zhiyi Wei, Mei Zheng, Xing Gu, Yingfeng Deng, Rui Qiu, Fei Chen, Chao- neng Ji, Weimin Gong, Yi Xie, and Yumin Mao. Crystal Structure of Human Guanosine Monophosphate Reductase 2 (GMPR2) in Complex with GMP. Journal of Molecular Biology, 355(5):980–988, 2006.

[363] Kelly O’Brien, Adrianna Vlachos, Stacie M. Anderson, Crystiana Tsujiura, Lionel Blanc, Eva Atsidaftos, NIH Intramural Sequencing Center, Jason E. Farrar, Steven R Ellis, Jeffrey Michael Lipton, and David M. Bodine. Transcriptome Analysis of Erythroid Cells Cultured from Dia- mond Blackfan Anemia Patients with Ribosomal and GATA1 Mutations Reveals Dysregulation of Inflammatory Response Genes. Blood, 126(23), 2015.

[364] Asha R. Kallianpur, Quan Wang, Peilin Jia, Todd Hulgan, Zhongming Zhao, Scott L. Le- tendre, Ronald J. Ellis, Robert K. Heaton, Donald R. Franklin, Jill Barnholtz-Sloan, Ann C. Collier, Christina M. Marra, David B. Clifford, Benjamin B. Gelman, Justin C. McArthur, Susan Morgello, David M. Simpson, J. A. McCutchan, and Igor Grant. Anemia and Red Blood Cell Indices Predict HIV-Associated Neurocognitive Impairment in the Highly Active

211 Antiretroviral Therapy Era. Journal of Infectious Diseases, 213(7):1065–1073, apr 2016.

[365] Michifumi Yamashita, Satoshi Horikoshi, Katsuhiko Asanuma, Hisatsugu Takahara, Isao Shi- rato, and Yasuhiko Tomino. Tensin is potentially involved in extracellular matrix production in mesangial cells. Histochemistry and Cell Biology, 121(3):245–254, mar 2004.

[366] Y Takakuwa. Protein 4.1, a multifunctional protein of the erythrocyte membrane skele- ton: structure and functions in erythrocytes and nonerythroid cells. International journal of hematology, 72(3):298–309, oct 2000.

[367] Michael F Romero, An-Ping Chen, Mark D Parker, and Walter F Boron. The SLC4 family of bicarbonate (HCO) transporters. Molecular aspects of medicine, 34(2-3):159–82, 2013.

[368] Mrinmoy Sanyal, James W. Tung, Holger Karsunky, Hong Zeng, Licia Selleri, Irving L. Weiss- man, Leonore A. Herzenberg, and Michael L. Cleary. B-cell development fails in the absence of the Pbx1 proto-oncogene. Blood, 109(10), 2007.

[369] I.M. EL Bishlawy. Red blood cells, hemoglobin and the immune system. Medical Hypotheses, 53(4):345–346, oct 1999.

[370] Davinia Morera and Simon A MacKenzie. Is there a direct role for erythrocytes in the immune response? Veterinary research, 42(1):89, jul 2011.

[371] E J Smith, I Marie,´ A Prakash, A Garc´ıa-Sastre, and D E Levy. IRF3 and IRF7 phosphory- lation in virus-infected cells does not require double-stranded RNA-dependent protein kinase R or Ikappa B kinase but is blocked by Vaccinia virus E3L protein. The Journal of biological chemistry, 276(12):8951–7, mar 2001.

[372] Tomoharu Takeuchi, Satoshi Inoue, and Hideyoshi Yokosawa. Identification and Herc5- mediated ISGylation of novel target proteins. Biochemical and Biophysical Research Communications, 348(2):473–477, sep 2006.

[373] Judith Verhelst, Eef Parthoens, Bert Schepens, Walter Fiers, and Xavier Saelens. Interferon- inducible protein Mx1 inhibits influenza virus by interfering with functional viral ribonucleo- protein complex assembly. Journal of virology, 86(24):13445–55, dec 2012.

[374] James N. Fullerton, Alastair J. O’Brien, and Derek W. Gilroy. Lipid mediators in immune dysfunction after severe inflammation. Trends in Immunology, 35(1):12–21, jan 2014.

212 [375] Arndt Manzel, Dominik N. Muller, David A. Hafler, Susan E. Erdman, Ralf A. Linker, and Markus Kleinewietfeld. Role of Western Diet in Inflammatory Autoimmune Diseases. Current Allergy and Asthma Reports, 14(1):404, jan 2014.

[376] Robert S Fujinami, Matthias G von Herrath, Urs Christen, and J Lindsay Whitton. Molec- ular mimicry, bystander activation, or viral persistence: infections and autoimmune disease. Clinical microbiology reviews, 19(1):80–94, jan 2006.

[377] Daniel R Getts, Emily M L Chastain, Rachael L Terry, and Stephen D Miller. Virus infection, antiviral immunity, and autoimmunity. Immunological reviews, 255(1):197–209, sep 2013.

[378] Jakob Hofvander, Johnbosco Tayebwa, Jenny Nilsson, Linda Magnusson, Otte Brosjo,¨ Olle Larsson, Fredrik Vult von Steyern, Henryk A Domanski, Nils Mandahl, and Fredrik Mertens. RNA sequencing of sarcomas with simple karyotypes: identification and enrichment of fusion transcripts. Laboratory Investigation, 95(6):603–609, jun 2015.

[379] Haishan Lin and Rudolf Grosschedl. Failure of B-cell differentiation in mice lacking the tran- scription factor EBF. Nature, 376(6537):263–267, jul 1995.

[380] Josine L Min, Amy Barrett, Tim Watts, Fredrik H Pettersson, Helen E Lockstone, Cecilia M Lindgren, Jennifer M Taylor, Maxine Allen, Krina T Zondervan, and Mark I McCarthy. Vari- ability of gene expression profiles in human blood and lymphoblastoid cell lines. BMC genomics, 11(1):96, jan 2010.

[381] Peter E. Czabotar, Guillaume Lessene, Andreas Strasser, and Jerry M. Adams. Control of apop- tosis by the BCL-2 protein family: implications for physiology and therapy. Nature Reviews Molecular Cell Biology, 15(1):49–63, dec 2013.

[382] J H Laity, B M Lee, and P E Wright. Zinc finger proteins: new insights into structural and functional diversity. Current opinion in structural biology, 11(1):39–46, feb 2001.

[383] M.-T. Kurzbauer, C. Uanschou, D. Chen, and P. Schlogelhofer. The Recombinases DMC1 and RAD51 Are Functionally and Spatially Separated during Meiosis in Arabidopsis. The Plant Cell, 24(5):2058–2070, may 2012.

[384] Alice Bossi and Ben Lehner. Tissue specificity and the human protein interaction network. Molecular systems biology, 5:260, jan 2009.

213 [385] Winston Koh, Wenying Pan, Charles Gawad, H Christina Fan, Geoffrey A Kerchner, Tony Wyss-Coray, Yair J Blumenfeld, Yasser Y El-Sayed, and Stephen R Quake. Noninvasive in vivo monitoring of tissue-specific global gene expression in humans. Proceedings of the National Academy of Sciences of the United States of America, 111(20):7361–6, may 2014.

[386] Mads Kærn, Timothy C. Elston, William J. Blake, and James J. Collins. Stochasticity in gene expression: from theories to phenotypes. Nature Reviews Genetics, 6(6):451–464, jun 2005.

[387] E Grundberg, K S Small, A˚ K Hedman, A C Nica, A Buil, S Keildson, J T Bell, T P Yang, E Meduri, and A Barrett. Mapping cis-and trans-regulatory effects across multiple tissues in twins. Nature genetics, 44(10):1084–1089, 2012.

[388] Simon Anders, Davis J McCarthy, Yunshun Chen, Michal Okoniewski, Gordon K Smyth, Wolfgang Huber, and Mark D Robinson. Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Nature Protocols, 8(9):1765–1786, aug 2013.

[389] Fang-Fang Yu, Yan-Xiang Zhang, Lian-He Zhang, Wen-Rong Li, Xiong Guo, and Mikko J Lammi. Identified molecular mechanism of interaction between environmental risk factors and differential expression genes in cartilage of Kashin-Beck disease. Medicine, 95(52):e5669, dec 2016.

[390] Emma Nilsson, Per Anders Jansson, Alexander Perfilyev, Petr Volkov, Maria Pedersen, Maria K. Svensson, Pernille Poulsen, Rasmus Ribel-Madsen, Nancy L. Pedersen, Peter Alm- gren, Joao˜ Fadista, Tina Ronn,¨ Bente Klarlund Pedersen, Camilla Scheele, Allan Vaag, and Charlotte Ling. Altered DNA Methylation and Differential Expression of Genes Influenc- ing Metabolism and Inflammation in Adipose Tissue From Subjects With Type 2 Diabetes. Diabetes, 63(9), 2014.

[391] Xaquin Castro Dopico, Marina Evangelou, Ricardo C. Ferreira, Hui Guo, Marcin L. Pekalski, Deborah J. Smyth, Nicholas Cooper, Oliver S. Burren, Anthony J. Fulford, Branwen J. Hen- nig, Andrew M. Prentice, Anette-G. Ziegler, Ezio Bonifacio, Chris Wallace, and John a. Todd. Widespread seasonal gene expression reveals annual differences in human immunity and phys- iology. Nature Communications, 6(May):7000, 2015.

[392] Vincent Piras and Kumar Selvarajoo. The reduction of gene expression variability from single cells to populations follows simple statistical laws. Genomics, 105(3):137–44, mar 2015.

214 [393] The GTEx Consortium, K. G. Ardlie, D. S. Deluca, A. V. Segre, T. J. Sullivan, T. R. Young, E. T. Gelfand, C. A. Trowbridge, J. B. Maller, T. Tukiainen, M. Lek, L. D. Ward, P. Kheradpour, B. Iriarte, Y. Meng, C. D. Palmer, T. Esko, W. Winckler, J. N. Hirschhorn, M. Kellis, D. G. MacArthur, G. Getz, A. A. Shabalin, G. Li, Y.-H. Zhou, A. B. Nobel, I. Rusyn, F. A. Wright, T. Lappalainen, P. G. Ferreira, H. Ongen, M. A. Rivas, A. Battle, S. Mostafavi, J. Monlong, M. Sammeth, M. Mele, F. Reverter, J. M. Goldmann, D. Koller, R. Guigo, M. I. McCarthy, E. T. Dermitzakis, E. R. Gamazon, H. K. Im, A. Konkashbaev, D. L. Nicolae, N. J. Cox, T. Flutre, X. Wen, M. Stephens, J. K. Pritchard, Z. Tu, B. Zhang, T. Huang, Q. Long, L. Lin, J. Yang, J. Zhu, J. Liu, A. Brown, B. Mestichelli, D. Tidwell, E. Lo, M. Salvatore, S. Shad, J. A. Thomas, J. T. Lonsdale, M. T. Moser, B. M. Gillard, E. Karasik, K. Ramsey, C. Choi, B. A. Foster, J. Syron, J. Fleming, H. Magazine, R. Hasz, G. D. Walters, J. P. Bridge, M. Miklos, S. Sullivan, L. K. Barker, H. M. Traino, M. Mosavel, L. A. Siminoff, D. R. Valley, D. C. Rohrer, S. D. Jewell, P. A. Branton, L. H. Sobin, M. Barcus, L. Qi, J. McLean, P. Hariharan, K. S. Um, S. Wu, D. Tabor, C. Shive, A. M. Smith, S. A. Buia, A. H. Undale, K. L. Robinson, N. Roche, K. M. Valentino, A. Britton, R. Burges, D. Bradbury, K. W. Hambright, J. Seleski, G. E. Korzeniewski, K. Erickson, Y. Marcus, J. Tejada, M. Taherian, C. Lu, M. Basile, D. C. Mash, S. Volpi, J. P. Struewing, G. F. Temple, J. Boyer, D. Colantuoni, R. Little, S. Koester, L. J. Carithers, H. M. Moore, P. Guan, C. Compton, S. J. Sawyer, J. P. Demchok, J. B. Vaught, C. A. Rabiner, and N. C. Lockhart. The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science, 348(6235):648–660, may 2015.

[394] Shai S Shen-Orr, , Purvesh Khatri, Dale L Bodian, Frank Staedtler, Nicholas M Perry, Trevor Hastie, Minnie M Sarwal, Mark M Davis, and Atul J Butte. Cell type-specific gene expression differences in complex tissues. Nature methods, 7(4):287–9, apr 2010.

[395] Jing Zhao, Idowu Akinsanmi, Dalia Arafat, T.J. Cradick, Ciaran M. Lee, Samridhi Banskota, Urko M. Marigorta, Gang Bao, and Greg Gibson. A Burden of Rare Variants Associated with Extremes of Gene Expression in Human Peripheral Blood. The American Journal of Human Genetics, 98(2):299–309, feb 2016.

[396] Arianne C Richard, Paul A Lyons, James E Peters, Daniele Biasci, Shaun M Flint, James C Lee, Eoin F McKinney, Richard M Siegel, Kenneth GC Smith, DB Allison, X Cui, GP Page, M Sabripour, RP Loewe, PJ Nelson, R Hoffmann, T Seidl, M Dugas, BM Bolstad, RA Irizarry, M Astrand, TP Speed, LM Cope, RA Irizarry, HA Jaffee, Z Wu, TP Speed, RA Irizarry, Z Wu,

215 HA Jaffee, R Shippy, S Fulmer-Smentek, RV Jensen, WD Jones, PK Wolber, CD Johnson, PS Pine, C Boysen, X Guo, E Chudin, YA Sun, JC Willey, J Thierry-Mieg, D Thierry-Mieg, RA Setterquist, M Wilson, AB Lucas, N Novoradovskaya, A Papallo, Y Turpaz, SC Baker, JA Warrington, L Shi, D Herman, RA Irizarry, BM Bolstad, F Collin, LM Cope, B Hobbs, TP Speed, JT Leek, RB Scharpf, HC Bravo, D Simcha, B Langmead, WE Johnson, D Geman, K Baggerly, RA Irizarry, C Lazar, S Meganck, J Taminau, D Steenhoff, A Coletta, C Molter, DY Weiss-Solis, R Duque, H Bersini, A Nowe, C Chen, K Grennan, J Badner, D Zhang, E Ger- shon, L Jin, C Liu, J Luo, M Schumacher, A Scherer, D Sanoudou, D Megherbi, T Davison, T Shi, W Tong, L Shi, H Hong, C Zhao, F Elloumi, W Shi, R Thomas, S Lin, G Tillinghast, G Liu, Y Zhou, D Herman, Y Li, Y Deng, H Fang, P Bushel, M Woods, J Zhang, WE Johnson, C Li, A Rabinovic, ML Wong, JF Medrano, M Flagella, S Bui, Z Zheng, CT Nguyen, A Zhang, L Pastor, Y Ma, W Yang, KL Crawford, GK McMaster, F Witney, Y Luo, J Mieczkowski, ME Tyburczy, M Dabrowski, P Pokarowski, MN McCall, RA Irizarry, FF Millenaar, J Okyere, ST May, M van Zanten, LA Voesenek, AJ Peeters, N Jiang, LJ Leach, X Hu, E Potokina, T Jia, A Druka, R Waugh, MJ Kearsey, ZW Luo, J Seo, EP Hoffman, RD Canales, Y Luo, JC Willey, B Austermiller, CC Barbacioru, C Boysen, K Hunkapiller, RV Jensen, CR Knight, KY Lee, Y Ma, B Maqsodi, A Papallo, EH Peters, K Poulter, PL Ruppel, RR Samaha, L Shi, W Yang, L Zhang, FM Goodsaid, JE Larkin, BC Frank, H Gavras, R Sultana, J Quackenbush, T Yuen, E Wurmbach, RL Pfeffer, BJ Ebersole, SC Sealfon, GK Geiss, RE Bumgarner, B Birditt, T Dahl, N Dowidar, DL Dunaway, HP Fell, S Ferree, RD George, T Grogan, JJ James, M Maysuria, JD Mitton, P Oliveri, JL Osborn, T Peng, AL Ratcliffe, PJ Webster, EH David- son, L Hood, K Dimitrov, SA Bustin, T Nolan, SD Prokopec, JD Watson, DM Waggott, AB Smith, AH Wu, AB Okey, R Pohjanvirta, PC Boutros, JC Lee, PA Lyons, EF McKinney, JM Sowerby, EJ Carr, F Bredin, HM Rickman, H Ratlamwala, A Hatton, TF Rayner, M Parkes, KGC Smith, PA Lyons, M Koukoulaki, A Hatton, K Doggett, HB Woffendin, AN Chaudhry, KGC Smith, EF McKinney, PA Lyons, EJ Carr, JL Hollis, DR Jayne, LC Willcocks, M Kouk- oulaki, A Brazma, V Jovanovic, DM Kemeny, AJ Pollard, PA Macary, AN Chaudhry, KGC Smith, BS Carvalho, RA Irizarry, A Kauffmann, R Gentleman, W Huber, PP Reis, L Waldron, RS Goswami, W Xu, Y Xuan, B Perez-Ordonez, P Gullane, J Irish, I Jurisica, S Kamel-Reid, MJ Zilliox, RA Irizarry, MN McCall, K Uppal, HA Jaffee, MJ Zilliox, RA Irizarry, J Vandes- ompele, K De Preter, F Pattyn, B Poppe, N Van Roy, A De Paepe, F Speleman, R Bourgon, R Gentleman, W Huber, AJ Hackstadt, and AM Hess. Comparison of gene expression mi- croarray data with count-based RNA measurements informs microarray interpretation. BMC Genomics, 15(1):649, 2014.

216 [397] Chana Palmer, Maximilian Diehn, Ash A Alizadeh, and Patrick O Brown. Cell-type specific gene expression profiles of leukocytes in human peripheral blood. BMC genomics, 7(1):115, jan 2006.

[398] Adeline R Whitney, Maximilian Diehn, Stephen J Popper, Ash A Alizadeh, Jennifer C Boldrick, David A Relman, and Patrick O Brown. Individuality and variation in gene ex- pression patterns in human blood. Proceedings of the National Academy of Sciences of the United States of America, 100(4):1896–901, feb 2003.

[399] A. Hatano, H. Chiba, H. A. Moesa, T. Taniguchi, S. Nagaie, K. Yamanegi, T. Takai-Igarashi, H. Tanaka, and W. Fujibuchi. CELLPEDIA: a repository for human cell information for cell studies and differentiation analyses. Database, 2011(0):bar046–bar046, oct 2011.

[400] Joshua M Stuart, Eran Segal, Daphne Koller, and Stuart K Kim. A gene-coexpression network for global discovery of conserved genetic modules. Science (New York, N.Y.), 302(5643):249– 55, oct 2003.

[401] Peter Langfelder, Rui Luo, Michael C Oldham, and Steve Horvath. Is my network module preserved and reproducible? PLoS computational biology, 7(1):e1001057, jan 2011.

217