Analysis of Genomic Regions of Increased Gene Expression (RIDGE)S in Immune Activation
Total Page:16
File Type:pdf, Size:1020Kb
Analysis of genomic Regions of IncreaseD Gene Expression (RIDGE)s in immune activation Lena Hansson Doctor of Philosophy Institute for Adaptive and Neural Computation School of Informatics and Division of Pathway Medicine Medical School University of Edinburgh 2009 Abstract A RIDGE (Region of IncreaseD Gene Expression), as defined by previous studies, is a con- secutive set of active genes on a chromosome that span a region around 110 kbp long. This study investigated RIDGE formation by focusing on the well-defined, immunological impor- tant MHC locus. Macrophages were assayed for gene expression levels using the Affymetrix MG-U74Av2 chip are were either 1) uninfected, 2) primed with IFN-g, 3) viral activated with mCMV, or 4) both primed and viral activated. Gene expression data from these conditions was studied using data structures and new software developed for the visualisation and handling of structured functional genomic data. Specifically, the data was used to study RIDGE structures and investigate whether physically linked genes were also functionally related, and exhibited co-expression and potentially co-regulation. A greater number of RIDGEs with a greater number of members than expected by chance were found. Observed RIDGEs featured functional associations between RIDGE members (mainly explored via GO, UniProt, and Ingenuity), shared upstream control elements (via PROMO, TRANSFAC, and ClustalW), and similar gene expression profiles. Furthermore RIDGE formation cannot be explained by sequence duplication events alone. When the analysis was extended to the entire mouse genome, it became apparent that known genomic loci (for example the protocadherin loci) were more likely to contain more and longer RIDGEs. RIDGEs outside such loci tended towards single-gene RIDGEs unaf- fected by the conditions of study. New RIDGEs were also uncovered in the cascading response to IFNg priming and mCMV infection, as found by investigating an extensive time series dur- ing the first 12 hours after treatment. Existing RIDGEs were found to be elongated having more members the further the cascade progress. iii Acknowledgements I would like to thank the entire team at the Division of Pathway Medicine. A special ac- knowledgement to those involved with the experiments analysed in this study; Sara Rodriguez Martin, Andrew Livingston, Kevin Robertson, Thorsten Forster, Paul Dickinson, and Garwin Kim Sing. A further thanks to Marilyn Horne for providing vital support. Finally I would like to thank my two supervisors, Dr Douglas Armstrong and Professor Peter Ghazal, for giving me this oppertunity and for all the time and effort they spent on the project. iv Declaration I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified. (Lena Hansson) v Contents 1 Introduction 1 1.1 Background . 3 1.1.1 Chromatin structure . 4 1.1.2 Possible explanations for non-random gene organisation . 9 1.1.3 Chromatin loops . 14 1.1.4 Regions of IncreaseD Gene Expression (RIDGE) . 15 1.2 Summary . 21 2 Materials and Methods 23 2.1 Experimental methods and datasets . 23 2.1.1 Gene expression data . 23 2.1.2 Active genes . 24 2.1.3 Data sources . 24 2.1.4 Probe-to-gene-projection . 24 2.1.5 Biological experiments . 25 2.2 Bioinformatics methods . 26 2.2.1 RIDGE determination . 26 2.2.2 Gene function . 28 2.2.3 Sequence comparisons . 29 2.2.4 Architecture . 31 2.3 Statistics . 31 2.3.1 The RIDGE activity score . 32 3 The Conceptual Framework 33 3.0.2 Requirements . 33 3.0.3 Existing software resources . 34 3.0.4 Architecture . 35 3.1 SORGE DB . 36 3.1.1 The genomic database . 37 vii 3.1.2 The database of functional annotation . 41 3.2 The data processing layer . 45 3.2.1 Probe-to-gene projection . 45 3.2.2 Determination of active genes . 49 3.2.3 SORGE DATA . 50 3.3 SORGE Visualisation . 51 3.3.1 Method . 52 3.3.2 Results . 54 3.4 Functionality of SORGE . 57 3.5 Discussion . 58 4 RIDGE definition and characterisation 61 4.1 RIDGE definition . 61 4.1.1 RIDGE, loop, dimensions . 62 4.1.2 The RW/GL and the MLS models . 62 4.1.3 Additionally suggested RIDGE dimensions . 64 4.1.4 RIDGEs are 110 ± 30 kbp long genomic regions . 66 4.2 RIDGE characterisation . 66 4.2.1 RIDGE members . 67 4.2.2 RIDGE distributions . 69 4.2.3 Genomic organisation of RIDGEs . 72 4.2.4 ClustalW analysis of RIDGE member sequences . 75 4.3 Evaluation of RIDGE dimensions . 76 4.3.1 RIDGE dimension 80 ± 20 kbp . 76 4.3.2 RIDGE dimension 123 ± 16 kbp . 77 4.3.3 RIDGE dimension 150 ± 50 kbp . 77 4.3.4 RIDGE dimension 220 ± 40 kbp . 79 4.3.5 The chosen RIDGE dimension, 110 ± 30 kbp . 79 5 RIDGE analysis of the MHC locus 81 5.1 The MHC locus .................................. 81 5.1.1 Rationale . 81 5.1.2 The biology . 82 5.1.3 Locus definition . 84 5.1.4 Experimental data . 86 5.2 Identification of RIDGEs in the MHC locus by SORGE . 87 5.2.1 Overlapping RIDGEs . 89 5.2.2 Discussion . 92 viii 5.3 RIDGE analysis in the MHC locus . 92 5.3.1 Gene expression profiles for RIDGE members . 94 5.3.2 RIDGE gain in macrophages that were both primed and viral activated . 95 5.3.3 RIDGE loss in primed macrophages . 97 5.3.4 RIDGE gain in activated macrophages . 99 5.3.5 RIDGE loss in activated macrophages . 100 5.3.6 Static RIDGEs . 101 5.4 RIDGE characteristics for the observed RIDGEs . 103 5.4.1 RIDGE gain, RIDGE loss, and RIDGE members in a flux . 103 5.4.2 Static RIDGES . 104 5.4.3 Quantitative data for RIDGEs and RIDGE members . 104 5.4.4 Functional associations between RIDGE members . 105 5.4.5 Protein Interaction Network (PIN) analysis of RIDGE members . 106 5.4.6 Regulatory control of RIDGEs . 107 5.4.7 Number of silenced genes in a RIDGE . 108 5.5 Concluding remarks . 108 6 Genome-wide RIDGE analysis 111 6.1 Genome-wide RIDGE analysis of the macrophage activation dataset . 111 6.1.1 Non-random chromosome organisation of RIDGEs . 111 6.1.2 Immune system genes . 114 6.1.3 RIDGEs on chromosome 17 . 116 6.1.4 RIDGEs on chromosome 11 . 117 6.2 Additional datasets . 118 6.2.1 The time series dataset . 118 6.2.2 The tissue dataset . 121 6.2.3 Immune system RIDGEs . 126 6.2.4 Discussion . 127 6.3 Additional loci . 128 6.3.1 The Protocadherin locus on chromosome 18 . 129 6.3.2 The Immunoglobin locus on chromosome 6 . 130 6.3.3 The Immunoglobin locus on chromosome 12 . 131 6.3.4 Random regions . 131 7 Discussion 135 7.1 RIDGEs . 135 7.1.1 Evolutionary linked units . 135 7.1.2 RIDGE definition . 135 ix 7.1.3 Consecutive RIDGE analysis . 136 7.1.4 Immune system RIDGEs . 137 7.1.5 Housekeeping RIDGEs . 137 7.1.6 Functional associations between RIDGE members (and RIDGEs) . 138 7.1.7 Time series analysis . 138 7.1.8 RIDGE gain, RIDGE loss, static RIDGEs, and RIDGEs in a flux . 139 7.2 Further Work . 140 7.2.1 Are RIDGEs conserved over evolution? . 140 7.2.2 Longer time series with less time in between time points . 140 7.2.3 Biological replicates . 140 7.2.4 Predictive biology . 140 7.3 Conclusion . 141 A ER diagrams and a JAVA class diagram 143 A.1 ER diagram of the project part of the genomic database . 144 A.2 ER diagram of the bootstrap part of the genomic database . ..