Analysis of Microarray Gene Expression Data Sets
© Lars M.T. Eijssen, Schimmert, 2006
ISBN-10: 90-9021327-9
ISBN-13: 978-90-9021327-9
Cover design: Lars Eijssen
Illustrations Chapter 1: Mike Gerards
Printed by Drukkerij Econoom BV, Beek-L

THESIS
to obtain the degree of doctor at Maastricht University, on the authority of the Rector Magnificus, Prof.mr. G.P.M.F. Mols, in accordance with the decision of the Board of Deans, to be defended in public on Tuesday 19 December 2006 at 16:00 by Lars Maria Theo Eijssen, born in Schimmert on 30 May 1976

Supervisor
Prof.dr. J.P.M. Geraedts

Co-supervisors
dr. P.J. Lindsey
dr. H.J.M. Smeets

Assessment committee
Prof.dr. J.C.S. Kleinjans (chair)
Prof.dr. F.C.P. Holstege (UMC Utrecht)
Prof.dr. C.A.J. Klaassen (Universiteit van Amsterdam)
Prof.dr. Y.M. Pinto
Prof.dr. E.O. Postma

The studies presented in this thesis were performed at the Department of Genetics and Cell Biology, Cardiovascular Research Institute Maastricht (CARIM), Maastricht University, Maastricht, the Netherlands

‘μαιευτικη τεχνη’ -Socrates-

Table of contents
Chapter 1 General introduction
Chapter 2 A novel stepwise analysis procedure of genome-wide expression profiles identifies transcript signatures of thiamine genes as classifiers of mitochondrial mutants in yeast
Chapter 3 Myocardial gene expression reveals maladaptive processes in cardiac myosin binding protein C knock-out mice
Chapter 4 Affymetrix expression chip data analysis: the gain of modeling
Chapter 5 Multivariate normal probe modeling for Affymetrix expression chip data
Chapter 6 The use of spikes in Affymetrix chip expression data analysis
Chapter 7 General discussion
Summary
Summary in Dutch (Samenvatting)
Acknowledgements (Dankwoord)
Curriculum Vitae

Chapter 1
General introduction

From genetics to genomics

Since its start, but especially during the last decades, the field of genetics has gone through
major changes. Until the 1990s, the focus within genetics was on chromosomal abnormalities and monogenic diseases, disorders characterized by one of the classical Mendelian patterns of inheritance. Research on monogenic disease was devoted to finding the causal locus (gene) and to unraveling the function of the protein encoded by the gene and the mechanism of the protein pathways in which it functions. Sometimes the situation was characterized by genetic heterogeneity, where defects in more than one gene can cause the same clinical phenotype, or by pleiotropy (phenotypic heterogeneity), where defects in one gene can cause one of several clinical phenotypes.

Over roughly the last 15 years, the field has shifted from genetics to so-called genomics, which is characterized by the simultaneous study of many genes and/or gene products. The human genome project, through which almost all human genes have become known [1, 2], has been one of the biggest stimuli to the genomics approach. In parallel with the sequencing of the human genome, many other eukaryotic genomes have been sequenced, e.g. those of the mouse [3] and yeast [4]. Furthermore, for many genes it has become known in which molecular pathways their gene products are involved.

The transition to genomics has been fuelled by an enormous increase in the technological possibilities available for genetic and molecular studies. The field of gene expression analysis in particular has profited from the development of several technical platforms, among which microarray or chip technology has been one of the major achievements [5-7]. In general, microarrays are glass slides that contain particular molecules (probes) attached to their surface, each of which can specifically bind a particular target molecule.
Microarrays are used for functional genomics research (retrieving differentially expressed genes), target discovery, biomarker determination, pharmacology and toxicology (to find the effects of drugs and toxic compounds, respectively), predicting disease prognosis, and subclassifying disease [8].

A corollary of the availability of high-throughput technology for research is an increased focus on diseases that are far more common in the population, such as cardiovascular disease, diabetes, and obesity. These are caused by the interaction of several genes and the environment and require genome-wide approaches to detect the genes involved. These diseases are especially complex to study, since all of the contributing genes and effects interact with each other and each only partially explains the disease [9, 10].

A direct result of this increase in scale is that experimental outcome can no longer be judged by eye, and computational systems are needed to interpret the results produced by these novel techniques. A single study typically produces tens of thousands or even hundreds of thousands of values. Fortunately, the computational power of the available hardware platforms has strongly increased at the same time, as described by Moore’s law [11]. However, the increase in the number of features of the different microarray platforms is consistently a step ahead of the increase in computer power, so that the analysis of the biggest chips remains a great challenge. For many chips, data analysis is possible on desktop or laptop machines, but for some a dedicated (server) system is needed.

Besides computational power, analytical methods are of course also needed to process the results. Before elaborating on those, a more detailed description of the microarray platform and its background is given. Figure 1 presents an overview of the complete experimental procedure for a gene expression microarray study.
Figure 1. Overview of the complete procedure followed when performing a gene expression microarray experiment. All steps are discussed throughout this text.

Gene expression

In order for the cell to produce proteins, DNA (deoxyribonucleic acid) is first transcribed into mRNAs (messenger ribonucleic acids), which are transported out of the nucleus to be translated into proteins. When this happens, the gene is said to be expressed. At the DNA and mRNA level, each triplet of bases codes for a certain amino acid, the building block of proteins. Besides the code to be translated, the mRNA molecule also contains control sequences in the untranslated regions (UTRs) at both ends of the molecule. As long as the mRNA is not degraded by the cellular machinery, it can be used to build more copies of the protein it encodes.

Since the completion of the human genome project it has become clear that the great complexity of our genome does not lie in the number of genes (around 30,000), but in the much larger number of proteins produced from them (more than 300,000). The key to this order-of-magnitude difference is alternative splicing, which means that several protein products can be made from a single gene. At the DNA level, the sequences within a gene that encode parts of the mRNA (so-called exons) are interrupted by sequences that are not retained in the mature mRNA (introns) and have to be spliced out. In alternative splicing, one or more of the exons are left out as well, producing several different types of mRNA from the same genetic locus (gene).

Based on the processes of transcription and translation, a variety of methods can be used to monitor the molecular functioning of the cell or parts of it. By analogy with the word genome, used for the entire DNA in the cell (nuclear, mitochondrial, chloroplastic), the collection of all RNAs present in the cell is called the transcriptome and the collection of all proteins the proteome.
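As an illustration of the triplet code and of alternative splicing as described above, the following minimal Python sketch builds every exon-skipping variant of a toy gene and translates each one. The exon sequences, the function names (`splice`, `translate`, `isoforms`) and the small excerpt of the codon table are illustrative assumptions, not data from a real gene. The model is deliberately simplified: introns are assumed to be removed already, and the first and last exon are always retained.

```python
from itertools import combinations

# Excerpt of the standard genetic code (RNA codons -> one-letter amino acids).
# Only the codons used by the toy exons below are listed; "*" marks a stop.
CODON_TABLE = {
    "AUG": "M", "GCU": "A", "GGU": "G", "AAA": "K",
    "UUU": "F", "GAA": "E", "UAA": "*",
}

# Hypothetical exons; each length is a multiple of 3 so that skipping an
# exon never shifts the reading frame.
TOY_EXONS = ["AUGGCU", "GGUAAA", "UUUGAA", "GGUUAA"]


def splice(exons, keep):
    """Join the exons whose indices are in `keep`, in genomic order."""
    return "".join(exons[i] for i in sorted(keep))


def translate(mrna):
    """Translate codon by codon from the first AUG up to a stop codon."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def isoforms(exons):
    """Yield (kept exon indices, protein) for every exon-skipping variant
    that retains the first and the last exon."""
    inner = range(1, len(exons) - 1)
    for n_kept in range(len(inner) + 1):
        for kept_inner in combinations(inner, n_kept):
            keep = (0, *kept_inner, len(exons) - 1)
            yield keep, translate(splice(exons, keep))


for keep, protein in isoforms(TOY_EXONS):
    print(keep, protein)
```

With two optional exons, this toy gene already yields four distinct protein products, which is the mechanism behind the difference between the number of genes and the number of proteins.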
Several systems have been developed to monitor the genome, the transcriptome, or the proteome massively in parallel, and microarrays exist for each of them. DNA microarrays have been developed either to measure common variants (single nucleotide polymorphisms, SNPs) on a genome-wide basis, or to sequence long stretches of DNA to find mutations in specific regions or genes. Gene expression arrays have been developed to detect the expression levels of tens of thousands of RNA molecules, or even the whole known transcriptome of the organism at hand. Finally, microarrays are also available for the detection of the proteome. Because the protein is the eventual functional product within the cell, a proteomic array most directly measures the number of effective molecules present. However, because proteome measurements are far more demanding, the mRNA expression array was the first type of array to be used on a wide scale. The next sections discuss the gene expression platform in more detail; there, the word ‘(micro)array’ refers to a gene expression microarray.

Microarrays

In very basic terms, a microarray is a glass slide that contains many probes, in this case sequences of nucleotides, attached to its surface. Each of these probes specifically recognizes a certain mRNA molecule by hybridization. The copies of the probe that recognizes a certain transcript are spotted together at the same position on the slide. When labeled sample material is brought onto the slide, transcripts bind to their respective probes. Because it is known which probe is at which location, scanning the label intensities gives a parallel measurement of the abundances of all transcripts recognized. Figure 2 shows examples of scans of slides from two microarray platforms.

Figure 2. Zoomed parts of scans of two different array types: a) a custom home-made slide; b) a commercial Affymetrix GeneChip.
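The readout principle described in this section, where the known position of each probe turns a scanned intensity into a measurement for one transcript, can be sketched in a few lines of Python. The slide layout, the probe names, the intensity values and the function name `read_out` are all hypothetical and serve only as illustration. Real arrays contain thousands of spots and require image analysis and normalization steps that are not shown here.

```python
# Hypothetical 2 x 3 slide layout: the transcript recognized by the probe
# spotted at each grid position (made-up identifiers).
LAYOUT = [
    ["geneA", "geneB", "geneC"],
    ["geneD", "geneE", "geneF"],
]

# Made-up scanned label intensities, one value per spot, same grid shape.
SCAN = [
    [120.0, 3400.0, 15.0],
    [980.0, 15.0, 560.0],
]


def read_out(layout, scan):
    """Pair each spot intensity with the transcript its probe recognizes."""
    same_shape = len(layout) == len(scan) and all(
        len(layout_row) == len(scan_row)
        for layout_row, scan_row in zip(layout, scan)
    )
    if not same_shape:
        raise ValueError("layout and scan must have the same grid shape")
    return {
        probe: intensity
        for layout_row, scan_row in zip(layout, scan)
        for probe, intensity in zip(layout_row, scan_row)
    }


abundances = read_out(LAYOUT, SCAN)
# Transcript with the strongest signal on this toy slide.
print(max(abundances, key=abundances.get))
```

A single scan thus yields one intensity per probe position, and the layout table is what makes those intensities interpretable as transcript abundances.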
In the early days of microarray technology, when arrays were still limited to detecting a few thousand transcripts, the probes represented either a sample of all genes in a certain genome or a thematic collection of genes (e.g. those expressed in a certain tissue, related to a certain biological process, or expected to be related to a type of disease). With the rapid increase in the number of probes per slide, most arrays are now general-purpose, covering most or all of a certain genome.