Multiplex Meta-Analysis of RNA Expression to Identify Genes with Variants Associated with Immune Dysfunction
Total Page:16
File Type:pdf, Size:1020Kb
Research and applications Multiplex meta-analysis of RNA expression to identify genes with variants associated with immune dysfunction Alexander A Morgan,1 Vasilios J Pyrgos,2 Kari C Nadeau,3 Peter R Williamson,2 Atul Janardhan Butte1 1Biomedical Informatics ABSTRACT many major, common diseases; this is known as the e Graduate Training Program and Objective We demonstrate a genome-wide method for ‘missing heritability’ problem.7 9 We need addi- Division of Systems Medicine, the integration of many studies of gene expression of tional tools to accelerate the process of uncovering Department of Pediatrics, 10 Stanford University, Stanford, phenotypically similar disease processes, a method of the causal variations giving rise to pathology. California, USA multiplex meta-analysis. We use immune dysfunction as Unfortunately, targeted candidate gene association 2Laboratory of Clinical Infectious an example disease process. studies have had a notoriously poor rate of Diseases, National Institute of Design We use a heterogeneous collection of datasets replication in contrast to the much less biased Allergy and Infectious Diseases, 11 12 Bethesda, Maryland, USA across human and mice samples from a range of tissues genome-wide approaches. An approach that 3Division of Immunology, and different forms of immunodeficiency. We developed can prioritize targeted portions of the genome Department of Pediatrics, a method integrating Tibshirani’s modified t-test (SAM) is implicated in disease association in an unbiased, Stanford University, Stanford, used to interrogate differential expression within a study genome-wide manner, based on data-driven, func- California, USA and Fisher’s method for omnibus meta-analysis to identify tional properties would provide a powerful tool for Correspondence to differentially expressed genes across studies. The ability future medical genomics. In this paper, we describe Dr Atul Janardhan Butte, of this overall gene expression profile to prioritize disease such a method and demonstrate its ability to Division of Systems Medicine, associated genes is evaluated by comparing against the prioritize the genes with variants implicated in Department of Pediatrics, results of a recent genome wide association study for a specific form of immunodeficiency, ‘common Stanford University, 1265 Welch common variable immunodeficiency (CVID). variable immunodeficiency’ (CVID), characterized Road X-163 MS-5415, CA fi 94305-5415, USA; Results Our approach is able to prioritize genes by patient inability to produce suf cient antibodies. [email protected] associated with immunodeficiency in general (area under Many common, multifactorial diseases include an the ROC curve ¼ 0.713) and CVID in particular (area infectious, autoimmune or inflammatory compo- Received 22 October 2011 under the ROC curve ¼ 0.643). nent; the mammalian immune response is a very Accepted 29 December 2011 Conclusions This approach may be used to investigate finely tuned, highly complex system with hundreds a larger range of failures of the immune system. Our of signaling molecules, dozens of different cell types, method may be extended to other disease processes, and the involvement of multiple tissue types and using RNA levels to prioritize genes likely to contain organs.13 It features all the motifs and elements of disease associated DNA variants. the most complex biological circuits and control systems, with many interacting feedback and feed- forward elements.14 Dysfunction can arise from variation in many different components of this INTRODUCTION complex, highly inter-connected system, and the One of the major goals of translational bioinfor- phenotypic changes in the immune system to the matics is to equip clinical medicine with the ability range of genomic variations possible is only begin- to use information about a patient’s genome for ning to be understood. However, characterization of diagnosis and decision-making. Several commercial the variations that lead to serious immune failure companies provide disease risk information can help in the treatment of affected patients and according to individual genotype,1 and other also, hopefully, provide insight into the range of approaches have been developed and used to human immune response as influenced by genetics. analyze and interpret high-depth patient sequence Although large knowledge bases on immune data to guide treatment23; these are all instances of function have been constructed,15 we also have personalized medicine approaches that integrate access to genome-wide functional data in the form genomics. Genotyping using arrays is a well-estab- of gene expression information. Large repositories lished commercial service, and the technology exists like the NIH NCBI Gene Expression Omnibus16 to provide high-depth sequencing and coverage at provide access to tens of thousands of different a cost comparable to many commonly used diag- highly parallelized gene expression measurements. nostic tests4; basic techniques exist for using these We and others have previously described methods data in a medically relevant fashion.56However, that look across multiple studies which each provide genomic medicine relies on knowledge of the many gene expression measurements (ie, multiplex) e genetic basis of disease, and although methods such in what we call ‘multiplex meta-analysis’17 20 to as genome-wide association studies using geno- obtain an overall picture of gene expression across This paper is freely available typing arrays have become the gold standard for studies. Taking advantage of the central dogma that online under the BMJ Journals unlocked scheme, see http:// discovering and exploring genetic variations, they DNA codes for RNA, which codes for protein, jamia.bmj.com/site/about/ have had a relatively poor success rate in explaining studying RNA at the gene expression level in specific unlocked.xhtml the chief genetic contributions to the heritability of phenotypes has provided potential insight into 284 J Am Med Inform Assoc 2012;19:284e288. doi:10.1136/amiajnl-2011-000657 Research and applications functional processes in the phenotype and suggests genes (DNA) detail in Witten and Tibshirani,38 using the standard deviation 21 22 s that may contain variants associated with that disease. In ( g) and a value so,g (scaling factor) selected to minimize the fi this study, we extend this idea and describe a method for inte- coef cient of variation of Tg across all oligonucleotides. grating gene expression data across multiple gene expression e À e ¼ xi;g xc;g studies for clinically/phenotypically similar presentations of Tg s (2) g þ so;g a range of immune deficiencies across species and tissue types to fi fi create an integrated multiplex (parallelized) meta-pro le of gene The modi ed t statistic Tg is then bootstrapped to calculate fi expression. This meta-expression pro le allows powerful priori- a p value (pg) for each gene. The oligonucleotides on the array are tization of the genes involved in general immune deficiency but mapped to genes using AILUN39 and across species using also shows predictive power over a recent genome-wide associa- HomoloGene groups.40 Using a method proposed by Fisher41 tion study of CVID (figure 1). We suggest that this method could and based on the fact that p values selected at random should be be used to investigate the genetic and molecular pathology of uniformly distributed, we can look for deviations by the c2 test other forms of immune dysfunction. (equation 3), with pg the p value for that gene in the subgroup g and ng the number of subgroups measuring that gene. METHODS ng À Á c2 ¼ + We collected data from the NIH NCBI Gene Expression 2ng 2 log pg (3) Omnibus for 16 different studies of immunodeficiency (table 1). g ¼ 1 Importantly, these gene expression measurements are from Importantly, this method will freely mix different directions many different forms of immune dysfunction and span multiple of variation across studies, up or down, using only the signifi- species and different tissue types. The gene expression samples cance of the test. The final meta p values may then be used to in each study were hand annotated and further divided into 37 rank the significance of expression differences between immune different experimental comparison subgroups (such as different deficient and normal controls across studies. We predict that mouse backgrounds used in the same study). Gene expression genes that are significantly differentially expressed across studies levels were compared between immune deficient samples and identified through our method are more likely to be involved in controls (normal immune function samples) in each subgroup immunodeficiency; the lower the p value, the greater the priority using the modified t test proposed by Tusher et al37 and incor- given. porated into the significance analysis of microarrays.38 The To establish a baseline and identify genes involved with the annotations of all samples are available upon request. immune system in general, we also took lists of genes annotated We calculate the mean log fold change38 of each oligonucleotide for involvement in acquired immunity, innate immunity, and in each array (m) in each experimental class (k,eitheri for inflammation taken from the Molecular Signatures Database.42 immunodeficiency or c for control) and each subgroup (g), with The genes with an annotation in each respective category were n indicating the total number of arrays of class k in subgroup g. k,g taken as predictions for having variants associated with