Integration of Heterogeneous Expression Data Sets Extends the Role of the Retinol Pathway in Diabetes and Insulin Resistance
Total Page:16
File Type:pdf, Size:1020Kb
Integration of heterogeneous expression data sets extends the role of the retinol pathway in diabetes and insulin resistance The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation Park, P. J. et al. “Integration of Heterogeneous Expression Data Sets Extends the Role of the Retinol Pathway in Diabetes and Insulin Resistance.” Bioinformatics 25.23 (2009): 3121–3127. Web. As Published http://dx.doi.org/10.1093/bioinformatics/btp559 Publisher Oxford University Press Version Final published version Citable link http://hdl.handle.net/1721.1/72972 Terms of Use Creative Commons Attribution Non-Commercial Detailed Terms http://creativecommons.org/licenses/by-nc/2.5 Vol. 25 no. 23 2009, pages 3121–3127 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btp559 Gene expression Integration of heterogeneous expression data sets extends the role of the retinol pathway in diabetes and insulin resistance Peter J. Park1,2,3,, Sek Won Kong1,4, Toma Tebaldi5,6, Weil R. Lai3, Simon Kasif1,7,8 and Isaac S. Kohane1,2,3,∗ 1Children’s Hospital Informatics Program at the Harvard-MIT Division of Health Sciences and Technology, 2Brigham and Women’s Hospital, 3Center of Biomedical Informatics, Harvard Medical School, 4Department of Cardiology, Children’s Hospital, Boston, MA 02115, USA, 5Centre for Integrative Biology, 6Department of Information Engineering and Computer Science, University of Trento, Italy, 7Center for Advanced Genomic Technology, Boston University and 8Department of Biomedical Engineering, Boston University, Boston, MA 02215, USA Received on June 14, 2009; revised on September 7, 2009; accepted on September 22, 2009 Advance Access publication September 28, 2009 Associate Editor: Joaquin Dopazo ABSTRACT To understand the interface between insulin action, insulin Motivation: Type 2 diabetes is a chronic metabolic disease that resistance, obesity and the genetics of type 2 diabetes, the Diabetes involves both environmental and genetic factors. To understand the Genome Anatomy Project (DGAP) was initiated in 2003 to use a genetics of type 2 diabetes and insulin resistance, the DIabetes multi-dimensional genomic approach to characterize the relevant set Genome Anatomy Project (DGAP) was launched to profile gene of genes and gene products as well as the secondary changes in gene expression in a variety of related animal models and human subjects. expression that occur in response to the metabolic abnormalities We asked whether these heterogeneous models can be integrated present in diabetes. Over the course of the project, a variety of to provide consistent and robust biological insights into the biology data sets were collected through expression profiling studies on the of insulin resistance. Affymetrix platform, both from human and mouse tissues. In human Results: We perform integrative analysis of the 16 DGAP data studies, gene expression data were collected from case–control sets that span multiple tissues, conditions, array types, laboratories, studies involving normal, insulin resistant, obese and diabetic species, genetic backgrounds and study designs. For each data subjects; in mouse studies, expression patterns were obtained before set, we identify differentially expressed genes compared with and after insulin stimulation in normal and various knock-out control. Then, for the combined data, we rank genes according models, and adipogenic diets. An open question was whether to the frequency with which they were found to be statistically there were common mechanisms in insulin resistance or sensitivity significant across data sets. This analysis reveals RetSat as a widely that could be identified by integrating results across this highly shared component of mechanisms involved in insulin resistance and heterogeneous corpus. sensitivity and adds to the growing importance of the retinol pathway In this work, we carry out an integrative analysis of the ∼450 in diabetes, adipogenesis and insulin resistance. Top candidates arrays from the 16 data sets collected in this project. Analysis obtained from our analysis have been confirmed in recent laboratory of the aggregate data presents complications due to the multiple studies. sources of heterogeneity, such as species, platforms, laboratories, Contact: [email protected] sample sizes and experimental design. The data set, for instance, includes several array types including Hu6800 and U133 (human) and U74, U74v2 and MOE430 (mouse). Few are simple two-group 1 INTRODUCTION comparisons of clinical samples, while others involve strain, age, Type 2 diabetes mellitus is a chronic, progressive metabolic disorder tissue comparisons in multi-factorial designs. A few of the data and is one of the fastest-growing public health problems. Given sets have been studied extensively already but in isolation. We an increased prevalence of obesity and aging population, recent aim to carry out a comprehensive analysis of the aggregate data estimates suggest that the worldwide prevalence will grow from focusing on the commonalities between the data sets. There are 2.8% in 2000 to 4.4% in 2030, affecting 171 million in 2000 to 366 two important underlying assumptions in our analysis. The first million in 2030 (Wild et al., 2004). The primary characteristics of is that the individual experiments were appropriately designed to type 2 diabetes are insulin resistance, relative insulin deficiency and capture a transcriptome signature relevant to insulin resistance hyperglycemia, and it can be easily diagnosed based on chronic whether in a mouse model of IRS-1 ‘knock outs’ versus wild- elevated blood glucose concentration. While there is a strong type mice or in a comparison of obese diabetic humans versus inheritable component, this has not been defined in the vast majority obese non-diabetics. Second, given the well known heterogeneity of cases. of measurement across different platforms (Kuo et al., 2002), even from the same manufacturer (Nimgaonkar et al., 2003), only ∗To whom correspondence should be addressed. robustly shared molecular processes pertaining to several models © The Author(s) 2009. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.5/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. [20:10 3/11/2009 Bioinformatics-btp559.tex] Page: 3121 3121–3127 P.J.Park et al. of insulin resistance, obesity and/or diabetes will be detectable. 3 RESULTS That is, regardless of the multiplicity of etiologies, we assume 3.1 Combining data from multiple studies that there exists a small number of common pathophysiological mechanisms across diabetes, insulin resistance and obesity. Based The data sets in this project were highly diverse, as shown in Table 1. on our computations with these two assumptions, we have been For unbiased analysis, variables to consider in analysis include able to extend previous findings that implicate the retinol pathway organism (human, mouse), array type (MG-U74Av2, MOE430, HG- (Yang et al., 2005) in insulin sensitivity/resistance and adipogenesis, U133A), tissue type (adipocyte, brown preadipocyte, fibroblast, as well as reconfirming the well known dysregulation of oxidative hepatocyte, myocyte, pancreas, skeletal), mouse strain (B6 versus phosphorylation (Lowell and Shulman, 2005; Mootha et al., 2003; 129), mouse genotype (WT versus IRS-1 KOs), insulin sensitivity, Patti et al., 2003) and the JAK-STAT pathway (Schwartz and Porte, treatment (DM2, IGT, NGT), laboratory in which the experiment 2005). was conducted, and study design (case versus control, time series). After mapping the human and mouse genes using the Homologene database (release 43.1), we displayed all usable arrays as points in 2 METHODS a 3D principal component space (Figure 1A). To examine possible artefactual variations, we examined the distribution of the points 2.1 Data availability, normalization and quality control by each variable including species, tissue type, array type, and Both raw data (CEL files) and processed data are available from the DGAP laboratory (Figures 1B–E). In Figure 1C, for instance, we show the website (http://www.diabetesgenome.org). The total number of data sets in example of variation due to tissue type for mouse data. The figure this database is 19. Three data sets were excluded for the following reasons: illustrates that the variation due to tissue type is greater than that dataset 1 was generated on MG-U74A array, which was later found to have a large fraction of incorrectly labeled probes; dataset 3 was a time-course within each data set. In Figure 1E, we show the clustering effect experiment on adipocyte differentiation that did not have a proper control for the laboratory setting. The data from the four laboratories in to be informative for this study; dataset 9 was excluded because it was the which the experiments were performed are clearly separated. Some only one generated on the Affymetrix hu6800 platform. Including this early- of the variation is due to the differences in array type or species model platform would have reduced the total number of genes common to used in each laboratory, but there is clear effect of experimental the platforms significantly. protocols or technician at each site. These data sets were generated Given the large number of arrays generated