An Order Estimation Based Approach to Identify Response Genes
AN ORDER ESTIMATION BASED APPROACH TO IDENTIFY RESPONSE GENES FOR MICROARRAY TIME COURSE DATA

A Thesis
Presented to
The Faculty of Graduate Studies
of
The University of Guelph

by
ZHIHENG LU

In partial fulfilment of requirements for the degree of
Doctor of Philosophy
September, 2008

© Zhiheng Lu, 2008

Library and Archives Canada, Published Heritage Branch
395 Wellington Street, Ottawa ON K1A 0N4, Canada
ISBN: 978-0-494-47605-5

NOTICE: The author has granted a non-exclusive license allowing Library and Archives Canada to reproduce, publish, archive, preserve, conserve, communicate to the public by telecommunication or on the Internet, loan, distribute and sell theses worldwide, for commercial or non-commercial purposes, in microform, paper, electronic and/or any other formats.

The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

In compliance with the Canadian Privacy Act some supporting forms may have been removed from this thesis. While these forms may be included in the document page count, their removal does not represent any loss of content from the thesis.

ABSTRACT

AN ORDER ESTIMATION BASED APPROACH TO IDENTIFY RESPONSE GENES FOR MICROARRAY TIME COURSE DATA

Zhiheng Lu (Kevin)
University of Guelph, 2008
Advisor: Dr. B. Allen

Microarray time course experiments have been widely used to investigate temporal patterns of gene expression profiles. These expression profiles provide a unique opportunity to examine genome-wide signal processing and gene responses. A fundamental issue in microarray experimental design is that the treatment condition can be controlled only at the cell level rather than at the gene level. Some genes depend on other genes to detect changes in external conditions, and this dependency is not fully deterministic and may vary across genes and treatment conditions. The expression of each gene is therefore potentially affected by two confounded effects: the treatment effect and the gene context effect arising from the regulatory interaction structure among genes. This gene context effect is hard to isolate, nor can it simply be ignored. Instead, this gene context information, which differs across treatment conditions, is of primary biological interest and thus demands attention in statistical analysis. We introduce an approach that deals with the confounded effects and takes into account the uncontrollable gene context effect. Our method estimates, for each gene, the number of hidden states, also referred to as the order, of a hidden Markov model (HMM). The observed gene expressions are modeled by gamma distributions determined by the corresponding hidden state at each time point.
Those genes showing evidence for more than one hidden state can be categorized as signaling genes or, in a wider sense, as response genes, which are coordinated by a cell system in reaction to a specific external condition. These response genes can be used to compare different treatment conditions and to investigate the gene context effect under different treatments. Our method also provides flexibility in adjusting type I error rates to find response genes at different response intensity levels. Both simulated data and real microarray time course data are analyzed to demonstrate our method.

Acknowledgements

I sincerely thank my advisor Dr. Brian Allen for his insightful guidance, support and help during my study. I would like to thank my other advisory committee members, Dr. T. Desmond, Dr. R. Lu, Dr. G. Darlington and Dr. J. Horrocks for their treasured advice and help. Dr. A. Canty deserves special thanks for very extensive and constructive suggestions which have greatly improved this thesis.
Table of Contents

List of Tables
List of Figures
1 Introduction
1.1 Microarray Technology
1.2 Microarray Time Series Experiments
1.3 Microarray Data Pre-processing
2 General Model
3 Empirical Bayes Estimation of Shape Parameter
4 Order Estimation
4.1 Literature Review of Order Estimation for HMMs and Finite Mixture Models
4.2 Order Estimation
4.3 Computational Algorithm
4.4 Choices of the Threshold Parameters
4.5 Identifiability
5 Simulation Study
6 Real Data Analysis
7 Conclusions and Further Discussion
References
Appendix A
Appendix A.1, Response gene set for GDS1428 control group
Appendix A.2, Response gene set for GDS1428 treatment group
Appendix A.3, The common response gene set for GDS1428 treatment group and control group
Appendix A.4, The response gene set only for GDS1428 control group
Appendix A.5, The response gene set only for GDS1428 treatment group
Appendix A.6, The common gene set of gene set A and B
Appendix A.7, The genes in gene set A but not in gene set B
Appendix A.8, The genes not in gene set A but in gene set B
Appendix A.9, The common gene set of gene set A and C
Appendix A.10, The genes in gene set A but not in gene set C
Appendix A.11, The genes not in gene set A but in gene set C
Appendix B, Computation codes in R

List of Tables

Table 5.1, The four situations of order estimation
Table 5.2, Type I error rate at three separation thresholds for various single gamma distributions
Table 5.3, The specifications for the simulation of mixtures of two gamma distributions
Table 5.4, Simulation evaluations for the mixture models with 2 components
Table 5.5, Simulation evaluations for the mixture models with 3 components
Table 5.6, Simulation evaluations for HMMs with 2 hidden states
Table 6.1, The number of response genes for each treatment condition of GDS1428
Table 6.2, The comparison of gene lists between the two studies

List of Figures

Figure 1.1, A schematic of the role of RNA in gene expression and protein production
Figure 1.2, cDNA Array Image
Figure 1.3, GeneChip Affymetrix Array Image
Figure 1.4, cDNA microarray scheme
Figure 1.5, Gene expression profiles for 8 genes at time points 0, 1.5, 3, 6, 9, 12 and 24 hours
Figure 1.6, Microarray time course experiment under one specific treatment condition
Figure 1.7, First category of assumptions
Figure 1.8, Second category of assumptions
Figure 4.2.1, A gamma mixture with three component distributions
Figure 6.1, Plot of CV vs. probe (gene) index for treatment and control data for GDS1428
Figure 6.2, Plot of shape parameter vs. probe (gene) index for treatment and control data for GDS1428

Chapter 1
INTRODUCTION

Microarray technology makes it possible for researchers to simultaneously assess the expression of thousands of genes (Schena et al., 1995; Lockhart et al., 1996; Richmond et al., 1997; Eisen et al., 1998). A huge amount of data representing the expression patterns of potentially all of the genes in a cell under various treatments is generated by the scientific community and becomes available for statistical analysis. See Craig et al. (2003) for a more detailed review.
In this chapter, we first briefly introduce microarray technology in section 1.1. We then describe microarray time course experiments in section 1.2 and explain the motivation and the goal of our order estimation analysis. In section 1.3, we briefly introduce microarray data pre-processing.

1.1 Microarray Technology

According to the central dogma of molecular biology (Snustad and Simmons, 2003), many key biological functions of cells are performed by proteins. The production of proteins is controlled by genes, which are coded in DNA sequences and passed along generations of cells by the DNA replication process. Protein production from genes involves two stages, known as transcription and translation, as shown in Figure 1.1. During transcription, a single strand of messenger RNA (mRNA) is copied from the DNA sequence of the coding gene. During translation, the mRNA is used as a template to assemble a chain of amino acids that forms the protein. Since most cellular biological processes are related to changes in the mRNA levels of some genes, systematic investigation of mRNA abundance on a genome-wide scale is critical to understanding cell systems. Microarray technology makes it possible to examine the expression of a very large number of genes simultaneously, and hence, for the first time, the comprehensive measurement of a system as large and complex as a cell becomes available.