Testing for Differentially Expressed Genes and Key Biological Categories in DNA Microarray Analysis
A dissertation submitted to the Graduate School of the University of Cincinnati in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY in the Department of Environmental Health of the College of Medicine

2007

By Maureen A. Sartor
Masters in Biomathematics, North Carolina State University, August 2000
B.S., Xavier University, Cincinnati, Ohio, 1998

Committee Chair: Dr. Mario Medvedovic

ABSTRACT

DNA microarrays are a revolutionary technology able to measure the expression levels of thousands of genes simultaneously, providing a snapshot in time of a tissue or cell culture's transcriptome. Although microarrays have been in existence for several years now, research is still ongoing on how best to analyze the data, at least partly because small sample sizes (few replicates) are combined with large numbers of genes. Several challenges remain in maximizing the amount of biological information attainable from a microarray experiment. The key components of microarray analysis where these challenges lie are experimental design, preprocessing, statistical inference, identifying expression patterns, and understanding biological relevance.
In this dissertation we aim to improve the analysis and interpretation of microarray data by concentrating on two key steps in microarray analysis: obtaining accurate estimates of significance when testing for differentially expressed genes, and identifying key biological functions and cellular pathways affected by the experimental conditions. We identify opportunities to enhance analytical techniques, and demonstrate that these enhancements significantly improve the functional interpretation of microarray results. We develop three related Bayesian statistical models to improve the estimates of significance by exploiting the information available from all genes, and by functionally relating the gene variances to their expression levels. These novel methodologies are compared to previously proposed methods both in simulations and with real-world experimental data generated on multiple microarray platforms. In addition, we introduce a logistic regression method for identifying key biological categories and molecular pathways and compare this method with the commonly used Fisher's exact test and other relevant previously developed methods. We make our statistical methods available to the biomedical research community through the use of statistical software widely used for microarray analysis.

ACKNOWLEDGMENTS

I would like to express my gratitude to Dr. Mario Medvedovic for his guidance and support in my research and professional development. I would also like to thank the other members of my committee, Dr. Paul Succop, Dr. Siva Sivaganesan, Dr. Alvaro Puga, and Dr. Michael Wagner, for their time and helpful advice and suggestions. Of course, I cannot forget my husband, George Schmiesing, for his continual encouragement, my children Brixon and Elliot, and my parents for providing me with a strong educational foundation and an appreciation for knowledge.

TABLE OF CONTENTS

CHAPTER 1: Introduction, specific aims, and background
  1.1. INTRODUCTION
  1.2. BACKGROUND
    1.2.1. Microarray technology
    1.2.2. T-statistics and Bayesian models
    1.2.3. Microarray data statistics: testing for significance of differential expression
    1.2.4. Testing for key biological categories
      1.2.4.1. Gene Ontology Consortium and Kyoto Encyclopedia of Genes and Genomes
      1.2.4.2. Biological gene set enrichment analysis methods
    1.2.5. R statistical software and Bioconductor

CHAPTER 2: Intensity-based hierarchical Bayes method improves testing for differentially expressed genes in microarrays
  2.1. OVERVIEW
  2.2. RESULTS AND DISCUSSION
    2.2.1. Intensity-based Bayesian model
    2.2.2. Estimation of hyperparameters
    2.2.3. Simulation study
    2.2.4. Results from controlled spike-in datasets
    2.2.5. Case Studies: Analysis and interpretation of two microarray datasets
      2.2.5.1. Results from the MEF Ahr-/- dataset
      2.2.5.2. Results from nickel exposure dataset
  2.3. CONCLUSIONS
  2.4. METHODS
    2.4.1. Mice and exposure protocol
    2.4.2. Microarray hybridizations
    2.4.3. Data normalization and analysis

CHAPTER 3: Systematic comparisons reveal that logistic regression provides a simple yet powerful approach to identifying enriched biological groups in gene expression data
  3.1. OVERVIEW
  3.2. RESULTS AND DISCUSSION
    3.2.1. LRpath model
    3.2.2. Simulation study
    3.2.3. Comparisons with real-world breast cancer datasets
      3.2.3.1. Subset analyses
      3.2.3.2. Concordance analysis
    3.2.4. Application: Results from a human IPF dataset
  3.3. CONCLUSIONS
  3.4. MATERIALS AND METHODS
    3.4.1. Simulation steps
    3.4.2. Construction of the gold-standard set of GO terms for the breast cancer dataset

CHAPTER 4: Full hierarchical and empirical Bayesian spline-based models for the analysis of multiple types of microarray data
  4.1. OVERVIEW
  4.2. METHODS
    4.2.1. Empirical Bayes model
      4.2.1.1. Estimating the hyperparameters
      4.2.1.2.