Exome-Based Linkage Disequilibrium Maps of Individual Genes: Functional Clustering and Relationship to Disease
Hum Genet (2013) 132:233–243
DOI 10.1007/s00439-012-1243-6
ORIGINAL INVESTIGATION

Jane Gibson · William Tapper · Sarah Ennis · Andrew Collins

Received: 3 August 2012 / Accepted: 20 October 2012 / Published online: 4 November 2012
© Springer-Verlag Berlin Heidelberg 2012

Electronic supplementary material: The online version of this article (doi:10.1007/s00439-012-1243-6) contains supplementary material, which is available to authorized users.

J. Gibson · W. Tapper · S. Ennis · A. Collins (✉)
Genetic Epidemiology and Genomic informatics Group, Human Genetics, Faculty of Medicine, University of Southampton, Duthie Building (808), Southampton General Hospital, Tremona Road, Southampton SO16 6YD, UK
e-mail: [email protected]

Abstract

Exome sequencing identifies thousands of DNA variants and a proportion of these are involved in disease. Genotypes derived from exome sequences provide particularly high-resolution coverage enabling study of the linkage disequilibrium structure of individual genes. The extent and strength of linkage disequilibrium reflects the combined influences of mutation, recombination, selection and population history. By constructing linkage disequilibrium maps of individual genes, we show that genes containing OMIM-listed disease variants are significantly under-represented amongst genes with complete or very strong linkage disequilibrium (P = 0.0004). In contrast, genes with disease variants are significantly over-represented amongst genes with levels of linkage disequilibrium close to the average for genes not known to contain disease variants (P = 0.0038). Functional clustering reveals, amongst genes with particularly strong linkage disequilibrium, significant enrichment of essential biological functions (e.g. phosphorylation, cell division, cellular transport and metabolic processes). Strong linkage disequilibrium, corresponding to reduced haplotype diversity, may reflect selection in utero against deleterious mutations which have profound impact on the function of essential genes. Genes with very weak linkage disequilibrium show enrichment of functions requiring greater allelic diversity (e.g. sensory perception and immune response). This category is not enriched for genes containing disease variation. In contrast, there is significant enrichment of genes containing disease variants amongst genes with more average levels of linkage disequilibrium. Mutations in these genes may be less likely to lead to in utero lethality and may be subject to less intense selection.

Introduction

Exome sequencing generates up to 97 % of consensus coding sequences (CCDS) representing the protein coding portions (exons) of the genome (Parla et al. 2011). Exome sequencing has been very successful in the identification of disease causal mutations for a range of dominant and recessive Mendelian disorders (e.g. Ng et al. 2010). Along with whole genome and more targeted sequencing, exome sequencing is revolutionising clinical medicine and research into all genetic diseases. Sequencing studies typically align the samples’ short sequence reads against a reference genome assembly and identify positions and regions of sequence difference. Extensive filtering to remove common variation unlikely to be involved in disease reveals novel, rare or otherwise interesting single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) which are the basis of follow-up studies. As we describe here, genotypes can also be determined at every SNP for each individual, which enables high-resolution linkage disequilibrium (LD) map construction.

A key component of the development of recent genome-wide association studies (GWAS), targeting common diseases, was the characterisation of the underlying genome LD structure. From this work came the recognition that historical recombination is a dominant force in shaping LD patterns (Tapper et al. 2005) and recombination hot-spots are ‘highly punctate’ (Jeffreys et al. 2001), dividing ‘blocks’ of low haplotype diversity (Gabriel et al. 2002). The HapMap project (The International HapMap consortium 2003) set out to characterise LD structure for a range of populations and determine panels of ‘tagging’ SNPs (Daly et al. 2001) to be genotyped cost effectively in large-scale GWAS. Because exome sequencing can generate genotypes for every variant in a coding sequence, it provides a particularly high-resolution view of the LD structure of individual exons. Although most of the non-coding regions of genes are not sequenced in exome data, LD is quite extensive in the genome so SNPs in adjacent sequenced exons are often in LD with each other. Therefore, maps of most genes which have a high density of exonic markers will be representative of the LD structure across whole genes. Exome data are used here to construct high-resolution LD maps of individual genes for the investigation of relationships between gene function and LD structure.

Gene LD maps are constructed in linkage disequilibrium units (LDUs, Maniatis et al. 2002). LDU maps provide a metric which is analogous to, and closely related to, the centimorgan scale of linkage maps but reflects the impact of the full spectrum of LD influencing processes. One LD unit represents the distance, varying widely on the kilobase scale, over which LD declines to background levels. As gene sizes vary widely, the ratio of LDU/kb lengths provides a measure to compare the intensity of LD for individual genes.

Studies by Smith et al. (2005) and Sun et al. (2011) investigated the relationships between local sequence features and patterns of LD using data from the HapMap consortium. They examined differences between genes in regions of strong and weak LD by classifying each gene as overlapping the quartile of the genome with strong LD or the quartile with weak LD. Gene classifications were then compared to the Gene Ontology database (Ashburner et al. 2000). We present here LD maps of individual genes using exome sequence data from 30 individuals to further elucidate relationships between functional gene classification and the extent of LD within individual genes. We consider these LD patterns in relation to genes which contain known disease variants, as defined in OMIM (http://www.ncbi.nlm.nih.gov/omim/). The approach developed here reveals relationships between the biological functions of genes, the strength of LD and the tendency of genes to contain disease variants which impact the understanding of disease variation and the influence of selection in the genome.

Materials and methods

Exome sequence data and analysis

DNA samples from 30 unrelated patients were exome sequenced at the same time on the same platform, as part of in-house disease gene studies, to provide SNP genotypes. These samples comprised 10 with leukaemia, 5 with lymphoma, 4 with Beckwith–Wiedemann syndrome, 3 with macrocephaly-capillary malformation and 8 with paediatric inflammatory bowel disease (IBD). The eight IBD samples and their exome analysis are described by Christodoulou et al. (2012). We reasoned that genotypes from this set of 30 samples would be amenable to the construction of high-resolution LD maps of individual genes, given our earlier experience with the construction of whole-genome maps from the lower marker density HapMap database of 60 samples (Tapper et al. 2005). For LD map construction, we removed all low frequency variants (potentially somewhat enriched for disease variants in these samples), so the disease status of these samples had little or no impact on the resulting maps. Furthermore, use of these samples supported ease of access to raw sequence read data and the ability to call out individual genotypes simultaneously on all samples.

Exome sequencing through targeted exome capture was undertaken with the SureSelect Human All Exon 50 Mb kit (Agilent) and sequence data generated on the Illumina HiSeq platform. Novoalign (http://www.novocraft.com/main/index.php) was used for alignment of sequence reads to the hg18 UCSC reference sequence, creating an aligned .sam file. The format was specified as Illumina and approximate fragment lengths and standard deviations were set at 200 and 30. Alignments were “softclipped” back to the best local alignment. Quality calibration was enabled. Adapter sequences were stripped using the default sequences. The gap opening penalty was set to 65 and the gap extend penalty set to 7. Picardtools was used to identify duplicate reads. SamTools (v0.0.18, http://samtools.sourceforge.net/) “view” was used to create a .bam file from the aligned .sam file; alignments were skipped whenever the map quality was <20, the read was unmapped, the alignment was not the primary alignment, the alignment failed platform/vendor quality checks, or the alignment was a PCR/optical duplicate. SamTools “mpileup” was used to call sites which varied from the reference sequence; the maximum number of reads was kept to 2,000. Extended BAQ computation was carried out; the per sample read depth and per sample phred-scaled strand bias P values were output. All 30 samples were called together. BCFtools (part of the samtools package) was used to create a variant call file (.vcf), skipping sites where the reference allele was not A/C/G/T and outputting variant sites only. Summary statistics representing mapping quality and coverage are given in Supplementary Table 1. The ranges and means represent the total number of reads sequenced, the number of reads aligned to the reference sequence, the number of reads which mapped uniquely to the reference sequence, the percentage of reads mapped to the target (SureSelect bait), the percentage of reads mapped to the target ± 150 bp (~1 read length), the percentage of targets with read depth of 1, 5, 10 and 20 and the mean coverage depths.

Genotypes at each SNP and indel were obtained using

[…] reference sequence and/or poor marker coverage, we extensively filtered the gene set, as described below.

To produce LDMAP genotype input files, we filtered raw variants to remove all indels and poorly supported SNPs.