Effects of Copy Number Variable Regions on Local Gene Expression in White Blood Cells of Mexican Americans
Total Page:16
File Type:pdf, Size:1020Kb
European Journal of Human Genetics (2015) 23, 1229–1235 & 2015 Macmillan Publishers Limited All rights reserved 1018-4813/15 www.nature.com/ejhg ARTICLE Effects of copy number variable regions on local gene expression in white blood cells of Mexican Americans August Blackburn*,1,2,MarcioAlmeida1, Angela Dean3, Joanne E Curran1, Matthew P Johnson1,EricKMoses4, Lawrence J Abraham4, Melanie A Carless1,ThomasDDyer1,SatishKumar1,LauraAlmasy1, Michael C Mahaney1, Anthony Comuzzie1, Sarah Williams-Blangero1,5,JohnBlangero1, Donna M Lehman6 and Harald HH Göring1 Only few systematic studies on the contribution of copy number variation to gene expression variation have been published to date. Here we identify effects of copy number variable regions (CNVRs) on nearby gene expression by investigating 909 CNVRs and expression levels of 12059 nearby genes in white blood cells from Mexican-American participants of the San Antonio Family Heart Study. We empirically evaluate our ability to detect the contribution of CNVs to proximal gene expression (presumably in cis)atvariouswindowsizes(uptoa10Mbdistance) between the gene and CNV. We found a ~ 1-Mb window size to be optimal for capturing cis effects of CNVs. Up to 10% of the CNVs in this study were found to be significantly associated with the expression of at least one gene within their vicinity. As expected, we find that CNVs that directly overlap gene sequences have the largest effects on gene expression (compared with non-overlapping CNVRs located nearby), with positive correlation (except for a few exceptions) between estimated genomic dosage and expression level. We find that genes whose expression level is significantly influenced by nearby CNVRs are enriched for immunity and autoimmunity related genes. These findings add to the currently limited catalog of CNVRs that are recognized as expression quantitative trait loci, and have implications for future study designs as well as for prioritizing candidate causal variants in genomic regions associated with disease. European Journal of Human Genetics (2015) 23, 1229–1235; doi:10.1038/ejhg.2014.280; published online 14 January 2015 INTRODUCTION Copy number variation reported in the Database of Genomic Inter-individual variation in transcript abundance is known to be Variants covers roughly 70% of the genome,5 although this estimate is significantly heritable for many genes. Transcript level can be likely upward biased by inaccurate breakpoint identification.6 None- considered as a quantitative endophenotype whose genetic regulatory theless, in an individual, copy number variants (CNVs) make up more machinery can be mapped to the genome.1 The expression quantita- variation than SNPs on a per nucleotide basis.7 Thus, the potential tive trait loci (eQTL) with the strongest effects on gene expression act effects of CNVs as eQTL are likely large. Effects caused by gene dosage primarily in cis.1 A critical bottleneck in the search for disease genes is should be less tissue specific than variation in gene expression caused the identification of the underlying causal variants, which are often by genetic variation in distant regulatory regions, which is important initially localized in genome-wide association studies (GWAS). as we are often limited to studying surrogate tissues to identify eQTL. A promising hypothesis now being explored for complex traits and Only a few recent studies have sought to systematically identify CNVs diseases is that functional alleles may be regulatory in nature and exert which act as eQTL.8,9 Stranger et al9 identified 238 genes with their effect by altering gene expression2 (and thus making them expression levels that were significantly associated with copy number detectable by genetic investigations of expression levels). This general variation. More recently, Schlattl et al8 identified 110 genes with hypothesis is supported by various observations, including the fact that expression affected by CNVs. Both studies investigated the relative most of the identified common disease-associated SNPs are not in proportions of eQTL attributable to CNVs and SNPs, but the effects protein coding regions and often are located far away from exons of by both variants are difficult to disentangle because of the linkage known genes. Recent studies have shown that GWAS SNPs are disequilibrium between CNVs and SNPs. This correlation among enriched among eQTL.3,4 Further, GWAS SNPs, that are known genetic variants located in genomic proximity to one another also eQTL, often affect gene expression in the disease tissue.3 Given these suggests that some eQTL previously identified in SNP-based studies observations, gene transcript levels have received a high level of may be attributable to CNVs. Gamazon et al10 found that SNPs interest as endophenotypes that can be correlated with disease status, tagging CNVs were enriched for cis-eQTL, and that these SNPs are and whose genetic regulatory mechanisms can be mapped with overrepresented in the National Human Genome Research Institute’s considerable power.2 (NHGRI) catalog of GWAS SNPs. 1Department of Genetics, Texas Biomedical Research Institute, San Antonio, TX, USA; 2Department of Cellular and Structural Biology, UT Health Science Center San Antonio, San Antonio, TX, USA; 3Department of Computer Science, University of Texas San Antonio, San Antonio, TX, USA; 4Centre for Genetic Origins of Health and Disease, University of Western Australia, Perth, WA, Australia; 5Southwest National Primate Research Center, San Antonio, TX, USA; 6Department of Medicine/Division of Clinical Epidemiology, UT Health Science Center San Antonio, San Antonio, TX, USA *Correspondence: Dr A Blackburn, Department of Genetics, Texas Biomedical Research Institute, 7620 NW Loop 410, San Antonio, TX 78227, USA. Tel: +1 210 258 9208; Fax: +1 210 258 9444; E-mail: [email protected] Received 25 March 2014; revised 25 September 2014; accepted 26 November 2014; published online 14 January 2015 CNVs and gene expression ABlackburnet al 1230 In this study we seek to add to the growing catalog of eQTL by Gene expression identifying genes whose expression level in white blood cells (mainly For this study, we used gene expression values from Goring et al.1 The lymphocytes) is affected by CNVs in the San Antonio Family Heart ascertainment of transcript abundance measurements has been previously 1 fl fi Study (SAFHS). described in detail. Brie y, genome-wide transcription pro les were created using Illumina Sentrix Human Whole Genome (WG-6) Series I BeadChips, SUBJECTS AND METHODS and are archived under ArrayExpress accession number E-TABM-305. How- ever, these data were re-processed using the following approach. Based on the Study design number of probes with detectable expression (at ‘detection P-value’ ≤ 0.05), the Participants in the SAFHS11 are members of extended, multigenerational average of the raw expression levels across probes, and the average correlation families of Mexican-American descent. SAFHS is a family study where the across all probes between each sample and all others, 1244 samples were subjects were not ascertained on disease status. The Institutional Review Board determined to yield expression profile data of acceptable quality. Among these at the University of Texas Health Science Center San Antonio approved the samples, we tested whether there was an enrichment of samples with a current study, and informed consent was obtained from all participants. All detection P-value ≤ 0.05 for each probe (the ‘detection P-value’ is a quantity study related clinical exams were conducted in San Antonio, TX, USA. Gene provided by Illumina software for each probe in each sample, generated by expression and copy number variation data were available for 1104 participants. comparing the expression level of a given probe to null control probes on the array) to determine which probes detected significant expression. We did this Copy number variable regions using a binomial test of the number of samples with ‘detection P-value’ ≤ 0.05 We recently identified 2937 copy number variable regions (CNVRs) in (5% of the samples would be expected to have a P-value at this level by chance). participants of the SAFHS using various Illumina (San Diego, CA, USA) To correct for multiple testing (as there are many probes being tested) we kept Infinium Beadchips.12 Our ability to characterize these CNVRs is limited in the probes at a false discovery rate of 0.05. Subsequently, we performed background absence of sequencing data. Some CNVRs fall within known complex regions, noise correction, log2 transformation, and quantile normalization. We have or fall within regions of the genome that are predisposed to recurrent copy- used this procedure previously and it is also described here.20 For the sake of number-altering mutational events.13,14 However, we reason that the majority simplicity, tests were performed at the probe level, although in some cases genes are diallelic CNVs as we previously observed reasonable concordance in size were represented by more than one probe. illuminaHumanv1.db available and location between these CNVRs and those identified to be polymorphisms through the Bioconductor website was used for probe annotation. by HapMap3.12,15 Using Log R ratios for probes within each CNVR, we generated quantitative Statistical genetic analysis values representative of copy number using the principal components function The relationship between copy number variation and probe-level gene implemented within CNVtools.16 As described by Barnes et