Statistical Methods for the Analysis of the Genetics of Gene Expression

Matthias Alexander Heinig
April 13, 2011

Dissertation for the degree of Doctor of Natural Sciences (Dr. rer. nat.) at the Department of Mathematics and Computer Science, Freie Universität Berlin

Reviewers: Prof. Dr. Martin Vingron, Prof. Dr. Norbert Hübner
First referee: Prof. Dr. Martin Vingron
Second referee: Prof. Dr. Norbert Hübner
Date of the doctoral defense: 13 December 2010

Preface

The work that led to this thesis is part of several collaborative projects in which I participated. For the sake of clarity, this thesis presents all results that led to the conclusions drawn, and individual contributions are detailed here. For each project I mention publications, highlight my contributions and acknowledge the work of my collaborators.

Star project

The work presented in section 3.3.1 was published in Nature Genetics [1] by the Star Consortium. It was a large EU-funded project initiated by Norbert Hübner and led by Kathrin Saar, aiming to assess genetic variability among rat strains and to provide haplotype and genetic maps. My contribution was the design of the strategy for the construction of the maps, and its implementation and application to three different populations of rats. Moreover, I performed the comparative analysis of these maps. I would like to acknowledge Herbert Schulz, Kathrin Saar, Norbert Hübner, Edwin Cuppen, Richard Mott, Domenique Gaugier and Marié-Thérèse Bihoreau for discussions and guidance, and Denise Brocklebank for discussion and contributions to the implementation of the analysis. I would like to thank Oliver Hummel and Herbert Schulz for the preprocessing of genotype data.

Ephx2 project

Analysis and results presented in sections 3.4.1, 4.1 and 4.2.3 were published in Nature Genetics [2]. This project was designed by Judith Fischer, Jan Monti and Norbert Hübner and carried out in the lab of Norbert Hübner, mainly by Judith Fischer and Svetlana Paskas.
My contribution to this project was the eQTL analysis, the analysis of the genetic effect on the arachidonic acid pathway (section 4.1) and the transcription factor binding site analysis of the Ephx2 promoter (section 4.2.3). In addition, I participated in the analysis of the physiological data together with Judith Fischer. I would like to thank Judith Fischer for leading the project and producing all the data. I would like to thank Norbert Hübner for discussions and for giving me the opportunity to work on this data. Finally, I would like to acknowledge Martin Vingron for his ideas on the pathway analysis (section 4.1). I have presented the methodology of this part of the project as a talk at the International conference on Genetics 2008 in Berlin.

Euratools project

The work presented in sections 3.3 and 4.3 was published in Nature [3]. The data were generated as part of the Euratools consortium, an EU-funded large integrated project. I contributed the eQTL analysis across seven tissues (section 3.3) and the design of the transcription factor and co-expression analysis in sections 4.3.1 and 4.3.2, which form the basis of this study. Together with Maxim Rotival and Enrico Petretto, I performed the co-expression analysis in the human eQTL data set (section 4.3.6). Analysis of the influence of the human chromosome 13 locus on the network (section 4.3.7) was carried out in parallel by myself, Anja Bauerfeind and Leonardo Bottolo. Analysis of the EBI2 cis-eQTL (section 4.3.7) was performed in parallel by myself and Anja Bauerfeind. Translation to human T1D data (section 4.3.8) was performed in collaboration with Chris Wallace, David Clayton and John Todd, with contributions from Anja Bauerfeind and myself. In particular, I would like to acknowledge that Chris Wallace developed the theory for the extensions of enrichment analysis to account for confounding factors (sections 2.2.3 and 4.3.8).
I am grateful to the Euratools consortium for providing me with a short term fellowship that got me started on this project during my stay in London. I thank Norbert Hübner, Enrico Petretto and Stuart Cook for initiating this project and finding the right collaboration partners to make it a success. I would like to thank Enrico Petretto for his constant support and advice in this project. I thank Helge Roider for his assistance with the PASTAA analysis. I am grateful to Carola Rintisch and Svetlana Paskas who performed the experiment to confirm my TFBS predictions. My special thanks go to Anja Bauerfeind for help and advice for the human eQTL analysis and genotype imputation. I would like to thank Sarah Langley who performed the cell type specificity analysis in section 4.3.3 and Leonardo Bottolo for the analysis in section 4.3.4. I would like to thank Stefan Blankeberg and his coworkers for providing us access to eQTL data from the Gutenberg Heart Study, Francois Cambien on behalf of the Cardiogenics consortium for access to eQTL data from the Cardiogenics study and John Todd for access to the human T1D data.

The analysis presented in section 4.4 has not been published yet, but a manuscript is currently being submitted. The data analysed there are the same as presented above. I have designed and implemented the analysis. I would like to thank Enrico Petretto and Mario Falchi for stimulating discussions over coffee that shaped my ideas. I would like to thank Martin Vingron for his ideas and advice on writing the manuscript. I acknowledge numerous discussions about linear models with Hughes Richard. Finally I would like to thank Norbert Hübner who prompted me to evaluate phenotype data in this context.

sTRAP project

The method presented in section 4.2 was published in Human Mutation [4]. The project was a close collaboration between Thomas Manke and myself. We designed and performed the analysis together.
I would like to thank Helge Roider for providing me the C-code of TRAP which allowed me to implement the tRap R-package and the sTRAP webserver.

Acknowledgements

I would like to thank both my supervisors Norbert Hübner and Martin Vingron for offering me the opportunity to get the best of both worlds, experimental and computational, by co-supervising my thesis. I am grateful for the many fruitful discussions, the ideas and support I have enjoyed from both during my thesis. I would like to thank Norbert Hübner for embedding me into the very stimulating environment of the Euratools consortium which allowed me to participate in many collaborations. I would like to thank Martin Vingron for creating an extraordinary working atmosphere in his group. I am very grateful to all members of the Vingron and Hübner groups and all PhD students of both institutes, many of whom became friends. Last but not least I would like to thank my parents for their support and Kristina for her patience.

Berlin, September 2010
Matthias Heinig

Contents

1 Introduction
    1.1 Objective and structure of the thesis
    1.2 Genetic mapping of quantitative trait loci (QTL)
    1.3 Genome wide association studies
    1.4 The BXH/HXB recombinant inbred strains
    1.5 The F2 intercross between SHHF and SHRSP
2 Statistical tools
    2.1 Linear models
        2.1.1 Estimation of parameters
        2.1.2 Hypothesis testing
        2.1.3 The multiple linear regression model
        2.1.4 Model selection via lasso
    2.2 Functional enrichment analysis
        2.2.1 Exact test on a contingency table
        2.2.2 Gene set enrichment analysis
        2.2.3 Extensions to gene set enrichment analysis
        2.2.4 The iterated hypergeometric test
3 Expression quantitative trait loci (eQTL)
    3.1 Gene expression as quantitative trait
    3.2 Measuring gene expression with microarrays
        3.2.1 Gene expression microarrays
        3.2.2 Normalisation
    3.3 Mapping of eQTLs in the BXH/HXB RI strains
        3.3.1 Construction of a high density SNP map
        3.3.2 Gene expression data
        3.3.3 Expression QTL mapping
    3.4 Integrated analysis of eQTLs and physiological data
        3.4.1 Identification of a risk factor for heart failure
        3.4.2 Identification of a candidate gene for systolic blood pressure
4 eQTL genes in gene expression networks
    4.1 Extension of gene set enrichment analysis for genetic mapping
        4.1.1 Linking the arachidonic acid pathway to heart failure
    4.2 Sequence variation in transcription factor binding sites
        4.2.1 Modelling transcription factor binding site affinities
        4.2.2 sTRAP: A framework to rank affinity changes
        4.2.3 Identification of the regulator of Ephx2
        4.2.4 Conclusions
    4.3 The role of transcription factors in clusters of trans-eQTLs
        4.3.1 Identification of genetically regulated TF-networks
        4.3.2 Extension of TF-networks by co-expression analysis
        4.3.3 Cross species cell type enrichment analysis
        4.3.4 Genetic mapping of the expanded TF-network
        4.3.5 Identification of a candidate regulatory factor
        4.3.6 Comparative co-expression analysis with humans
        4.3.7 Analysis of the human regulatory locus
        4.3.8 Translation to human GWAS data
        4.3.9 Conclusions
    4.4 Co-expression as quantitative trait
        4.4.1 A linear model for the mapping of co-expression as a quantitative trait
