Hum Genet (2013) 132:233–243 DOI 10.1007/s00439-012-1243-6

ORIGINAL INVESTIGATION

Exome-based linkage disequilibrium maps of individual genes: functional clustering and relationship to disease

Jane Gibson • William Tapper • Sarah Ennis • Andrew Collins

Received: 3 August 2012 / Accepted: 20 October 2012 / Published online: 4 November 2012
© Springer-Verlag Berlin Heidelberg 2012

Abstract Exome sequencing identifies thousands of DNA variants and a proportion of these are involved in disease. Genotypes derived from exome sequences provide particularly high-resolution coverage, enabling study of the linkage disequilibrium structure of individual genes. The extent and strength of linkage disequilibrium reflect the combined influences of mutation, recombination, selection and population history. By constructing linkage disequilibrium maps of individual genes, we show that genes containing OMIM-listed disease variants are significantly under-represented amongst genes with complete or very strong linkage disequilibrium (P = 0.0004). In contrast, genes with disease variants are significantly over-represented amongst genes with levels of linkage disequilibrium close to the average for genes not known to contain disease variants (P = 0.0038). Functional clustering reveals, amongst genes with particularly strong linkage disequilibrium, significant enrichment of essential biological functions (e.g. phosphorylation, cell division, cellular transport and metabolic processes). Strong linkage disequilibrium, corresponding to reduced haplotype diversity, may reflect selection in utero against deleterious mutations which have profound impact on the function of essential genes. Genes with very weak linkage disequilibrium show enrichment of functions requiring greater allelic diversity (e.g. sensory perception and immune response). This category is not enriched for genes containing disease variation. In contrast, there is significant enrichment of genes containing disease variants amongst genes with more average levels of linkage disequilibrium. Mutations in these genes may be less likely to lead to in utero lethality and may be subject to less intense selection.

Electronic supplementary material The online version of this article (doi:10.1007/s00439-012-1243-6) contains supplementary material, which is available to authorized users.

J. Gibson · W. Tapper · S. Ennis · A. Collins (✉)
Genetic Epidemiology and Genomic informatics Group, Genetics, Faculty of Medicine, University of Southampton, Duthie Building (808), Southampton General Hospital, Tremona Road, Southampton SO16 6YD, UK
e-mail: [email protected]

Introduction

Exome sequencing generates up to 97 % of consensus coding sequences (CCDS) representing the coding portions (exons) of the genome (Parla et al. 2011). Exome sequencing has been very successful in the identification of disease causal mutations for a range of dominant and recessive Mendelian disorders (e.g. Ng et al. 2010). Along with whole genome and more targeted sequencing, exome sequencing is revolutionising clinical medicine and research into all genetic diseases. Sequencing studies typically align the samples' short sequence reads against a reference genome assembly and identify positions and regions of sequence difference. Extensive filtering to remove common variation unlikely to be involved in disease reveals novel, rare or otherwise interesting single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) which are the basis of follow-up studies. As we describe here, genotypes can also be determined at every SNP for each individual, which enables high-resolution linkage disequilibrium (LD) map construction.

A key component of the development of recent genome-wide association studies (GWAS), targeting common diseases, was the characterisation of the underlying genome LD structure. From this work came the recognition that historical recombination is a dominant force in shaping LD patterns (Tapper et al. 2005) and that recombination hot-spots are 'highly punctate' (Jeffreys et al. 2001), dividing 'blocks' of low haplotype diversity (Gabriel et al. 2002). The HapMap project (The International HapMap consortium 2003) set out to characterise LD structure for a range of populations and determine panels of 'tagging' SNPs (Daly et al. 2001) to be genotyped cost effectively in large-scale GWAS. Because exome sequencing can generate genotypes for every variant in a coding sequence, it provides a particularly high-resolution view of the LD structure of individual exons. Although most of the non-coding regions of genes are not sequenced in exome data, LD is quite extensive in the genome so SNPs in adjacent sequenced exons are often in LD with each other. Therefore, maps of most genes which have a high density of exonic markers will be representative of the LD structure across whole genes. Exome data are used here to construct high-resolution LD maps of individual genes for the investigation of relationships between gene function and LD structure.

Gene LD maps are constructed in linkage disequilibrium units (LDUs, Maniatis et al. 2002). LDU maps provide a metric which is analogous to, and closely related to, the centimorgan scale of linkage maps but reflects the impact of the full spectrum of LD influencing processes. One LD unit represents the distance, varying widely on the kilobase scale, over which LD declines to background levels. As gene sizes vary widely, the ratio of LDU/kb lengths provides a measure to compare the intensity of LD for individual genes. Studies by Smith et al. (2005) and Sun et al. (2011) investigated the relationships between local sequence features and patterns of LD using data from the HapMap consortium. They examined differences between genes in regions of strong and weak LD by classifying each gene as overlapping the quartile of the genome with strong LD or the quartile with weak LD. Gene classifications were then compared to the Gene Ontology database (Ashburner et al. 2000).

We present here LD maps of individual genes using exome sequence data from 30 individuals to further elucidate relationships between functional gene classification and the extent of LD within individual genes. We consider these LD patterns in relation to genes which contain known disease variants, as defined in OMIM (http://www.ncbi.nlm.nih.gov/omim/). The approach developed here reveals relationships between the biological functions of genes, the strength of LD and the tendency of genes to contain disease variants which impact the understanding of disease variation and the influence of selection in the genome.

Materials and methods

Exome sequence data and analysis

SNP genotypes were obtained from 30 DNA samples from unrelated patients, which were exome sequenced at the same time on the same platform as part of in-house disease studies. These samples comprised 10 with leukaemia, 5 with lymphoma, 4 with Beckwith–Wiedemann syndrome, 3 with macrocephaly-capillary malformation and 8 with paediatric inflammatory bowel disease (IBD). The eight IBD samples and their exome analysis are described by Christodoulou et al. (2012). We reasoned that genotypes from this set of 30 samples would be amenable to the construction of high-resolution LD maps of individual genes, given our earlier experience with the construction of whole-genome maps from the lower marker density HapMap database of 60 samples (Tapper et al. 2005). For LD map construction, we removed all low frequency variants (potentially somewhat enriched for disease variants in these samples), so the disease status of these samples had little or no impact on the resulting maps. Furthermore, use of these samples supported ease of access to raw sequence read data and the ability to call out individual genotypes simultaneously on all samples.

Exome sequencing through targeted exome capture was undertaken with the SureSelect Human All Exon 50 Mb kit (Agilent) and sequence data generated on the Illumina HiSeq platform. Novoalign (http://www.novocraft.com/main/index.php) was used for alignment of sequence reads to the hg18 UCSC reference sequence, creating an aligned .sam file. The format was specified as Illumina and approximate fragment lengths and standard deviations were set at 200 and 30. Alignments were "softclipped" back to the best local alignment. Quality calibration was enabled. Adapter sequences were stripped using the default sequences. The gap opening penalty was set to 65 and the gap extend penalty set to 7. Picardtools was used to identify duplicate reads. SamTools (v0.0.18, http://samtools.sourceforge.net/) "view" was used to create a .bam file from the aligned .sam file. Alignments were skipped whenever the mapping quality was <20, the read was unmapped, the alignment was not the primary alignment, the alignment failed platform/vendor quality checks, or the alignment was a PCR/optical duplicate. SamTools "mpileup" was used to call sites which varied from the reference sequence, with the maximum number of reads kept to 2,000. Extended BAQ computation was carried out; the per sample read depth and per sample phred-scaled strand bias P values were output. All 30 samples were called together. BCFtools (part of the samtools package) was used to create a variant call file (.vcf), skipping sites where the reference allele was not A/C/G/T and outputting variant sites only. Summary statistics representing mapping quality and coverage are given in Supplementary Table 1. The ranges and means represent the total number of reads sequenced, the number of reads aligned to the reference sequence, the number of reads which mapped uniquely to the reference sequence, the percentage of reads mapped to the target

(SureSelect bait), the percentage of reads mapped to the target ± 150 bp (~1 read length), the percentage of targets with read depth of 1, 5, 10 and 20 and the mean coverage depths.

Genotypes at each SNP and indel were obtained using the mpileup option in SamTools for all 30 samples simultaneously. This enabled the assignment of genotypes for individuals, even where homozygous for the hg18 reference allele, for any marker where the non-reference allele was present in at least one individual.

Linkage disequilibrium maps

The program LDMAP (Maniatis et al. 2002; Service et al. 2006) constructs maps in LDUs. LDU map locations for individual variants, when plotted against underlying kb locations, reveal a non-linear 'block and step' structure where blocks correspond to regions of low haplotype diversity and steps reflect recombination events and hot-spots. Zhang et al. (2002) describe the close correspondence between the steps and meiotic recombination events identified in sperm typing by Jeffreys et al. (2001) and between the LDU blocks and regions of low haplotype diversity identified by Daly et al. (2001). The contour of an LDU map corresponds closely to the linkage map, confirming the dominant impact of recombination on LD patterns (Service et al. 2006), but the local LDU structure also reflects the impact of processes such as mutation, selection and population history. This is in contrast to methods which employ an approximation to the coalescent to extract patterns of historical recombination from LD data (McVean et al. 2004), which exclude the impact of processes other than recombination.

The LDU scale provides a framework for comparison of the extent of LD in different genomic regions that is independent of the kb scale. One LDU represents the distance (a variable number of kilobases) over which LD declines to 'background' levels, so genomic locations more than one LDU apart are LD independent. LDMAP computes the LDU distance between adjacent SNPs by composite likelihood using all informative pairwise SNP data that include the interval. The default settings for windows and selection of informative pairwise data are detailed by Service et al. (2006).

As LDMAP computes LDU distances between adjacent SNPs, we anticipated that portions of the inter-gene regions, distant from target baits, would be poorly mapped. However, as the average extent of LD in the HapMap 'CEU' genome is ~3,200 Mb/57,819 LDU ≈ 55 kb (Lau et al. 2007), we expected that many intervals outside exons, but within introns and untranslated regions of specific genes, would be well mapped. To minimise inclusion of genes in genomic regions with poor alignment to the reference sequence and/or poor marker coverage, we extensively filtered the gene set, as described below.

To produce LDMAP genotype input files, we filtered raw variants to remove all indels and poorly supported SNPs. Individual genotypes were recoded as 'missing' where they had a sequence read depth of <4 and/or a genotype quality (GQ: a phred quality -10 log10 P, where P is the probability that the genotype call is wrong) of <10. At this stage, after recoding individual genotypes as missing, all SNPs with more than 50 % missing genotypes were excluded from further consideration. Input files representing chromosomes 1–22, for the LDMAP program, contained a total of 305,588 SNPs. Because we expect to see limited LD around rare SNPs, the LDMAP program also excludes all rare SNPs with a minor allele frequency (computed from the input data files) of 0.05 or less. The resultant LDU maps, therefore, reflect the LD structure in terms of common haplotypes with rare haplotypes excluded. After removing rare SNPs, LDMAP determined LDU map locations for 182,353 SNPs. The average SNP density in these maps is substantially higher for the exonic portion of the genome than is achieved by HapMap. Release 27 of HapMap (CEU population) contains 4,030,774 SNPs, giving an average SNP density of 1,260 SNPs/Mb, assuming the genome has 3,200 Mb. In these exomes, 182,353 SNPs are mapped from an exome capture size of ~50 Mb, giving a mean density of 3,647 SNPs/Mb, approaching threefold the average density for the HapMap data. The density in HapMap would be further reduced if used for LDU map construction and filtered for rare variation as employed above.

SNP annotation and characterisation of gene-specific maps

All SNPs with LDU locations were annotated using Annovar (http://www.openbioinformatics.org/annovar/annovar_gene.html) and the UCSC 'Known Gene' annotation option. Once annotated with respect to genes, intragenic SNPs were retained (exonic, intronic, splicing and UTR) and intergenic SNPs were excluded. Individual gene-specific LDU maps were extracted only for genes named in the HUGO Gene Nomenclature Committee list (http://www.genenames.org/cgi-bin/hgnc_downloads.cgi) using the 'custom downloads' facility to select approved symbols and synonyms. After matching on approved names or synonyms, and converting any synonyms to approved names, we determined the beginning and end locations of a total of 17,000 genes on both kb and LDU scales, the kb and LDU scale SNP densities and the ratio of LDU/kb gene lengths. To conservatively exclude genes with insufficient marker density to accurately model the LD structure between sequenced exon baits, we only considered genes with an average spacing of at least one SNP every 20 kb (300 genes excluded) and at least one SNP every 3 LDU (42 genes excluded). For the same reason, we only analysed genes containing at least 6 SNPs in total (6,197 genes excluded). In addition, a total of 47 outlying genes with LDU/kb > 10 were excluded as occupying genomic regions with particularly low marker density on the LD scale. The final analysis file contained 10,414 genes.

Functional enrichment analysis for low and high LD genes

The DAVID (Database for Annotation, Visualisation and Integrated Discovery) Gene Functional Classification Tool (http://david.abcc.ncifcrf.gov) organises subsets of genes into classes based on related biology (Wei Haung et al. 2009). The tool implements a measure of gene–gene similarity assuming functional relationships amongst genes sharing functional annotation profiles. DAVID functional annotation data are derived from 14 annotation categories which include Gene Ontology (GO), KEGG Pathways, NCBI and OMIM. DAVID implements an agglomeration method to classify a gene list into related gene groups (GG), given functional similarity scores. A gene group is considered more important if most of its members are associated with highly enriched annotation terms (overrepresented in the gene list), and this is quantified as a high enrichment EASE score (a chance-adjusted Fisher's exact P value). Enrichment scores order the relative importance of gene groups such that a higher score reflects group members with more enriched functions relative to the general human genome background.

We applied the Gene Functional Classification Tool from DAVID to determine whether groups of genes with particularly strong LD, and genes with particularly weak LD, showed enrichment of biological function. We tested for shared function amongst genes in either tail of the LDU/kb distribution: genes with complete or very strong LD show very low LDU/kb ratios and genes with weak LD show particularly high ratios. A large number of genes show complete or very strong LD, reflecting the prevalence in the genome of 'blocks' of low haplotype diversity (Daly et al. 2001). Each tail contained 208 genes representing 2 % of the total of 10,414 genes. The 2 % tails were chosen to represent 'extremes' of the LDU/kb distribution while giving a manageable number of genes to facilitate interpretation of functional clusters. Genes with very strong LD (with LDU/kb gene size ratios of zero) are described in Supplementary Table 2 and the weakest LD genes (LDU/kb > 0.22) are in Supplementary Table 3. For testing for functional enrichment relative to a human genome background in DAVID, we excluded 12 of the strong LD genes and 4 of the weak LD genes because they map within one LDU of another gene in the list and were therefore not LD independent, perhaps reflecting a gene cluster with related functions. For these dependent pairs of genes, the smaller gene was excluded in all cases (identified in Supplementary Tables 2 and 3).

The LD structure of genes containing OMIM disease variants

We downloaded 9,133 unique SNP variants from the OMIM database known to be associated with some human diseases from Galaxy (http://main.g2.bx.psu.edu/library), following Li et al. (2012). Phenotypes, as described by the OMIM database, are dominated by single-gene Mendelian disorders and traits and include some susceptibilities to complex disease (e.g. CFH and macular degeneration) and some somatic cell genetic disease (e.g. FGFR3 and bladder cancer). These SNP variants are found within 1,782 genes, and 1,086 of these are represented in our list of 10,414 genes. We tested for enrichment of disease genes in four LDU/kb categories: from 'very strong LD' (LDU/kb ≤ 0.01) through 'very weak LD' (LDU/kb > 0.05).

Results

Because LDMAP computes LDU distances between adjacent SNPs, maps for entire chromosomes were constructed and poorly mapped inter-genic regions subsequently excluded. However, despite incomplete coverage of chromosomes, reasonably contiguous exome-derived whole human chromosome LD maps show typical patterns (Fig. 1) including decreased LD towards telomeres and extensive LD around the centromere. The pattern reflects the smoothing effect of extended LD between sequenced exons. However, the exome-derived maps contain a number of 'holes', at which the distance between adjacent SNPs was capped at the maximum of three LDU (Service et al. 2006), and this affects total map length. Comparison with a HapMap-derived LDU map (Fig. 1) demonstrates very close correspondence between the map contours, despite the differences in marker density and spacing and the different sample sizes employed. Therefore, LDU map construction can be seen as generally robust to these sources of variation. After constructing the exome-based LD maps, we excluded inter-genic regions from further analysis and undertook extensive filtering to remove individual genes with poor SNP coverage.

Analysis of the gene set from the strong LD tail of the LDU/kb distribution by DAVID identified four distinct gene groups showing enrichment (enrichment scores ranging from 2.05 to 4.11, Table 1). Table 2 shows enriched GO terms (http://www.geneontology.org/GO.doc.shtml) in the 'biological process' category for gene groups 1–4 (GG1–GG4). All gene groups show enrichment in functions underlying core biological processes. GG1 shows enriched functions concerned with phosphorylation which, amongst other functions, is critical in altering the activity and function of enzymes. GG2 contains terms related to microtubules, which are crucial in maintaining cell structure, are involved in formation of the mitotic spindle and have roles in intracellular transport (Desai and Mitchison 1997). GG3 has no enriched GO biological process terms but contains genes involved in mitotic spindle assembly. GG4 comprises essential transport processes including those involving the Golgi apparatus.

Genes with the weakest LD contain members with shared functions in one gene group (GG1) (Table 3), dominated by processes involved in sensory perception (Table 4). These genes include many odour and taste receptors, groups of genes known to have high levels of allelic diversity with individually high recombination/mutation rates (Sharon et al. 1999), consistent with their function to discriminate between numerous and often novel sensory inputs. Furthermore, immune system genes represented here include CD70 and others (Supplementary Table 3), suggesting high recombination/mutation rates related to maintaining adaptability in the immune response.

Table 5 describes relationships between genes containing OMIM-derived disease variants and LDU/kb categories. Disease genes are found across the whole LDU/kb distribution but there is significant depletion and enrichment in two categories. In the strong LD category, 9.5 % of genes contain known disease variants (10.4 % expected across the whole sample), indicating significant under-representation of disease genes (P = 0.0004). Within the subset of these genes represented by the 2 % tail (large genes with the strongest LD), only 9.1 % of genes contain known disease variants (Supplementary Table 2).

In contrast, the 0.02–0.05 LDU/kb category contains 12.5 % of genes with disease variants, indicating significant enrichment (P = 0.004). This category includes the mean LDU/kb for all non-disease genes (LDU/kb = 0.0307, Table 5), consistent with enrichment of genes containing disease variants amongst genes which show 'average' amounts of LD.

To evaluate whether the broad functional clustering patterns identified using DAVID were consistent for the

Fig. 1 Linkage disequilibrium unit maps of chromosome 19. LDU maps of chromosome 19 constructed from the exome data and HapMap (the latter map presented by Lau et al. 2007). Both maps show very closely similar contours including extensive LD around the centromere (at location 30,000 kb), reduced LD towards the telomeres and numerous matching finer-scale features. The exome-derived map is shorter, reflecting numerous 'holes' between the sequenced exons where there are no markers and limited LD between outermost exons of adjacent genes. A particularly poorly mapped region, with few sequenced exons, is shown on the proximal q-arm (at ~33,000 kb) where several holes merge. LDU maps of individual genes were extracted from the exome-derived map and all inter-gene regions were discarded. The locations of gene midpoints, from the 736 genes with LDU maps, demonstrate the lack of genes in the centromeric region and the high gene density towards the telomeres. This mirrors the SNP density distribution (SNPs/Kb) with particularly low SNP density in and around the centromere
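
The read-level and genotype-level filters described in Materials and methods can be sketched in Python as follows. This is an illustrative sketch only, not the authors' pipeline (which used SamTools, BCFtools and LDMAP). The `Call` container and all function names are hypothetical, the flag bits are those defined by the SAM format, and the missingness and allele-frequency filters are combined in a single function here, whereas in the text the frequency filter is applied inside LDMAP.

```python
from dataclasses import dataclass
from typing import List, Optional

# SAM flag bits for the alignment classes that were skipped.
UNMAPPED = 0x4
NOT_PRIMARY = 0x100
QC_FAIL = 0x200
DUPLICATE = 0x400
SKIP_MASK = UNMAPPED | NOT_PRIMARY | QC_FAIL | DUPLICATE

MIN_MAPQ = 20        # alignments with mapping quality < 20 were skipped
MIN_DEPTH = 4        # genotypes with read depth < 4 are recoded as missing
MIN_GQ = 10          # genotypes with genotype quality < 10 are recoded as missing
MAX_MISSING = 0.50   # SNPs with > 50 % missing genotypes are excluded
MAX_RARE_MAF = 0.05  # SNPs with minor allele frequency <= 0.05 are excluded


def keep_alignment(flag: int, mapq: int) -> bool:
    """Return True if an alignment survives the read-level filters."""
    return mapq >= MIN_MAPQ and (flag & SKIP_MASK) == 0


@dataclass
class Call:
    """One genotype call (hypothetical container, not a VCF parser)."""
    alt_alleles: int  # copies of the non-reference allele: 0, 1 or 2
    depth: int        # sequence read depth at the site
    gq: float         # phred-scaled genotype quality


def recode(call: Call) -> Optional[int]:
    """Recode a poorly supported genotype as missing (None)."""
    if call.depth < MIN_DEPTH or call.gq < MIN_GQ:
        return None
    return call.alt_alleles


def keep_snp(calls: List[Call]) -> bool:
    """Apply the missingness filter, then the rare-allele filter."""
    genotypes = [recode(c) for c in calls]
    observed = [g for g in genotypes if g is not None]
    if not observed:
        return False
    if 1 - len(observed) / len(genotypes) > MAX_MISSING:
        return False
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1 - alt_freq) > MAX_RARE_MAF
```

Under these assumptions the minor allele frequency is computed from the non-missing genotypes only, which is one plausible reading of "computed from the input data files".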

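The gene-level quantities used throughout (the LDU/kb ratio, the marker-density filters and the LDU/kb categories) can be sketched in the same way. Again this is a hedged illustration rather than the authors' code: `GeneMap` and the function names are hypothetical, average spacing is taken here as gene length divided by SNP count, and the 0.02 boundary between the two middle categories is an assumption inferred from the '0.02–0.05' category named in the Results.

```python
from dataclasses import dataclass


@dataclass
class GeneMap:
    """Summary of one gene-specific LDU map (hypothetical container)."""
    symbol: str
    kb_length: float   # gene length on the kilobase scale
    ldu_length: float  # gene length on the LDU scale
    n_snps: int        # SNPs with LDU locations inside the gene

    @property
    def ldu_per_kb(self) -> float:
        """Intensity of LD: low values mean strong LD, high values weak LD."""
        return self.ldu_length / self.kb_length


def passes_density_filters(gene: GeneMap) -> bool:
    """Marker-density filters applied before the functional analysis."""
    if gene.n_snps < 6 or gene.kb_length <= 0:
        return False                       # at least 6 SNPs in total
    if gene.kb_length / gene.n_snps > 20:
        return False                       # at least one SNP every 20 kb
    if gene.ldu_length / gene.n_snps > 3:
        return False                       # at least one SNP every 3 LDU
    return gene.ldu_per_kb <= 10           # outliers with LDU/kb > 10 excluded


def ld_category(ldu_per_kb: float) -> str:
    """Assign one of four LDU/kb categories (0.02 boundary is assumed)."""
    if ldu_per_kb <= 0.01:
        return "<=0.01"
    if ldu_per_kb <= 0.02:
        return "0.01-0.02"
    if ldu_per_kb <= 0.05:
        return "0.02-0.05"
    return ">0.05"


def snps_per_mb(n_snps: int, region_mb: float) -> float:
    """Average SNP density over a region of the given size in Mb."""
    return n_snps / region_mb
```

With the figures quoted in the text, `snps_per_mb(182_353, 50)` and `snps_per_mb(4_030_774, 3_200)` reproduce the densities of about 3,647 and 1,260 SNPs/Mb, and the successive exclusions (300, 42, 6,197 and 47 genes from 17,000) leave the 10,414 genes analysed.
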
categories in Table 5, which contain larger numbers of genes, we undertook analysis of: (1) genes with the strongest LD (3,000 genes, this being the maximum number that can be analysed using Gene Functional Classification); (2) genes with 'average' LD (1,574 genes); (3) genes with weak LD (1,259 genes). The most enriched GO terms from the gene group with the highest enrichment factor in each category are given in Supplementary Table 4. Enriched GO terms for the strong LD category overlap broadly with the core functions identified for the 2 % tail of strong LD genes (Table 2), and terms for the weak LD category are also consistent with the 2 % tail of weak LD genes (Table 4) but include more immune system related terms. The enriched GO term classifications identified for the extreme 2 % tails of the LDU/kb distribution are also representative of enrichment patterns for the broader categories based on the LDU/kb distribution listed in Table 5. Interestingly, the 'average' LD category (Supplementary Table 4) shows a degree of overlap with the terms identified for weak LD genes, suggesting that the greatest functional distinction is between genes with complete/very strong LD and all other genes which show average to weak LD.

Table 1 Genes in DAVID functional gene groups (GG1–GG4, genes with strong LD: LDU/kb = 0)

Gene symbol  Chromosome  Gene name                                                                  Enriched gene group membership
CDK13        7           Cell division cycle 2-like 5                                               GG1
NEK4         3           NIMA (never in mitosis gene a)-related kinase 4                            GG1
TESK2        1           Testis-specific kinase 2                                                   GG1
VRK2         2           Vaccinia-related kinase 2                                                  GG1
IP6K1        3           Inositol hexakisphosphate kinase 1                                         GG1, GG2
BMP2K        4           BMP2 inducible kinase                                                      GG1, GG2
LIMK2        22          LIM domain kinase 2                                                        GG1
PXK          3           PX domain containing serine/threonine kinase                               GG1
PI4KA        22          Phosphatidylinositol 4-kinase, catalytic, alpha                            GG1
SRPK1        6           SRSF protein kinase 1                                                      GG1
HIPK3        11          Homeodomain interacting protein kinase 3                                   GG1
RYK          3           RYK receptor-like tyrosine kinase                                          GG1
PRKD3        2           Protein kinase D3                                                          GG1
IPMK         10          Inositol polyphosphate multikinase                                         GG1
SMG1         16          SMG1 homologue, phosphatidylinositol 3-kinase-related kinase (C. elegans)  GG1
MET          7           met Proto-oncogene (hepatocyte growth factor receptor)                     GG1
TAOK1        17          TAO kinase 1                                                               GG1
SGK3         8           Serum/glucocorticoid-regulated kinase family, member 3                     GG1
RAD54L2      3           RAD54-like 2 (S. cerevisiae)                                               GG2
KIF14        1           Kinesin family member 14                                                   GG2
KIF18A       11          Kinesin family member 18A                                                  GG2
SMC1B        22          Structural maintenance of chromosomes 1B                                   GG2
KIF27        9           Kinesin family member 27                                                   GG2
KIF1B        1           Kinesin family member 1B                                                   GG2
SPIRE1       18          Spire homologue 1 (Drosophila)                                             GG3
CEP192       18          Centrosomal protein 192 kDa                                                GG3
CLIP2        7           CAP-GLY domain containing linker protein 2                                 GG3
DCTN4        5           Dynactin 4 (p62)                                                           GG3
DOPEY1       6           Dopey family member 1                                                      GG4
COG5         7           Component of oligomeric golgi complex 5                                    GG4
STX18        4           Syntaxin 18                                                                GG4
SCAMP1       5           Secretory carrier membrane protein 1                                       GG4
SEC24A       5           SEC24 family, member A (S. cerevisiae)                                     GG4
SEC22A       3           SEC22 vesicle trafficking protein homologue A (S. cerevisiae)              GG4

DAVID enrichment scores: GG1 = 4.11, GG2 = 3.07, GG3 = 2.72, GG4 = 2.05

Discussion

Exome sequencing enables construction of LD maps using the full complement of exonic common polymorphisms in a DNA sample. High-resolution LD maps of individual


genes enable analysis of relationships between LD patterns, gene function and genes containing variants related to disease. We describe a sample of 30 exomes for which we have extensively filtered individual SNP genotypes and gene maps to minimise biases, such as locally poor sequence alignment and poor gene coverage. Recombination rates, which are known to correlate extremely closely with LD patterns (Tapper et al. 2005), vary at the kilobase level with much recombination occurring in hot-spots spanning only 1–2 kb (Jeffreys et al. 2001). Previous studies have not approached this level of resolution in individual genes but considered relationships between recombination hot/cold regions and gene function in high and low recombination quartiles of the genome (Smith et al. 2005; Sun et al. 2011).

Table 2 DAVID: Significantly enriched GO terms (P < 0.01 or less) for genes with strong LD (LDU/kb = 0) in gene groups GG1–GG4

Gene group membership: GO term (biological process)   Genes with GO term/genes in group (%)   P value (EASE score)   Fold enrichment relative to human genome background
GG1: phosphorylation                                  18/18 (100)                             <0.000001              16
GG1: phosphate metabolic process                      18/18 (100)                             <0.000001              13
GG1: phosphorus metabolic process                     18/18 (100)                             <0.000001              13
GG1: protein amino acid phosphorylation               15/18 (83.3)                            <0.000001              16
GG1: phosphoinositide phosphorylation                 3/18 (16.7)                             <0.01                  100
GG1: lipid phosphorylation                            3/18 (16.7)                             <0.01                  94
GG1: lipid modification                               3/18 (16.7)                             <0.01                  22
GG1: phosphoinositide metabolic process               3/18 (16.7)                             <0.01                  21
GG1: protein amino acid autophosphorylation           3/18 (16.7)                             <0.01                  18
GG1: glycerophospholipid metabolic process            3/18 (16.7)                             <0.01                  9.3
GG2: microtubule-based movement                       4/8 (50)                                <0.0001                51
GG2: microtubule-based process                        4/8 (50)                                <0.001                 23
GG4: Golgi vesicle transport                          6/6 (100)                               <0.000001              86
GG4: vesicle-mediated transport                       6/6 (100)                               <0.000001              20
GG4: intracellular transport                          6/6 (100)                               <0.000001              17
GG4: protein transport                                6/6 (100)                               <0.000001              15
GG4: establishment of protein localisation            6/6 (100)                               <0.000001              15
GG4: protein localisation                             6/6 (100)                               <0.00001               13
GG4: ER to Golgi vesicle-mediated transport           3/6 (50)                                <0.0001                11

LD patterns reflect complex interplay between recombination, mutation, selection and population history. It is difficult to disentangle the relative importance of these processes, which vary for individual genes. The relationship between recombination and mutation has received considerable attention. Chuang and Li (2004) examined functional relationships between genes located in mutational hot and cold regions of the genome. They found that genes located in high mutation regions are biased towards extracellular communication (surface receptors, cell adhesion, immune response), while those in mutational cold regions were biased towards essential cellular processes (gene regulation, RNA processing, protein modification). These categories match closely with the findings of Smith et al. (2005), where classifications are based on recombination and LD patterns, respectively. Smith et al. note that chromosome regions with weak LD contain genes for which considerable allelic diversity is advantageous at the species and individual level. In contrast, genes associated with DNA repair, , DNA metabolism etc., in regions of strong LD, are involved in conserved biological processes for which recombination/mutation might be expected to introduce particularly deleterious haplotypes. There is evidence that mutation is at least accelerated by recombination (Lercher and Hurst 2002) and high recombination rates are associated with specific gene families occupying mutational hot regions, including immune response genes (Papavasiliou and Schatz 2002) and olfactory receptor families (Sharon et al. 1999).

Gene sets with particularly weak LD (Supplementary Table 3), and the highest LDU/kb category (Table 5), do not show enrichment for genes containing disease-related variation despite the high allelic diversity generated through high recombination and/or mutation rates. This finding is consistent with high allelic diversity being tolerated (and indeed, functionally advantageous) within a proportion of genes in this class. Genes in the weak LD category (LDU/kb > 0.05), although physically smaller on average than other genes, have higher mean SNP densities (17.9/24.8 = 0.72 SNPs/kb) compared to all genes (17.6/52.3 = 0.34 SNPs/kb; Table 5). Many of the genes with high mutation rates have been highlighted as more likely to present sequence alignment problems and less likely to contain candidate variation for Mendelian forms of disease. Many of these genes also form part of a suggested initial

Table 3 Genes in DAVID functional gene group GG1 (genes with Table 3 continued weak LD: LDU/kb [ 0.22) Gene Chromosome Gene name Gene Chromosome Gene name symbol symbol OR1B1 9 , family 1, FPR1 19 Formyl peptide receptor 1 subfamily B, member 1 KIAA1024 15 KIAA1024 OR51A2 11 Olfactory receptor, family 51, CD70 19 CD70 molecule subfamily A, member 2 CTAGE1 18 Cutaneous T cell lymphoma- CDCP1 3 CUB domain containing protein 1 associated antigen 1 GPR144 9 G-protein-coupled receptor 144 TMEM128 4 Transmembrane protein 128 AVPR1B 1 Arginine vasopressin receptor 1B ST8SIA2 15 ST8 alpha-N-acetyl-neuraminide C18ORF26 18 Chromosome 18 open reading frame alpha-2,8-sialyltransferase 2 26 SIGLEC14 19 Sialic acid binding Ig-like lectin 14 OR2T8 1 Olfactory receptor, family 2, PLD3 19 Phospholipase D family, member 3 subfamily T, member 8; UGT2A1 4 UDP glucuronosyltransferase 2 GALNTL5 7 UDP-N-acetyl-alpha-D-galactosamine family, polypeptide A1 SLC5A12 11 Solute carrier family 5 (sodium/ OR4D9 11 Olfactory receptor, family 4, glucose cotransporter), member 12 subfamily D, member 9 SLC2A3 12 Solute carrier family 2 (facilitated OR5AU1 14 Olfactory receptor, family 5, glucose transporter), member 3 subfamily AU, member 1 OR1I1 19 Olfactory receptor, family 1, OR6N2 1 Olfactory receptor, family 6, subfamily I, member 1 subfamily N, member 2 OR13G1 1 Olfactory receptor, family 13, TMEM176A 7 Transmembrane protein 176A subfamily G, member 1 OR6N1 1 Olfactory receptor, family 6, OR6Q1 11 Olfactory receptor, family 6, subfamily N, member 1 subfamily Q, member 1 OR13F1 9 Olfactory receptor, family 13, LRRC15 3 Leucine-rich repeat containing 15 subfamily F, member 1 FUT5 19 Fucosyltransferase 5 (alpha (1,3) TMIGD2 19 Transmembrane and immunoglobulin fucosyltransferase) domain containing 2 MMD 17 Monocyte to macrophage SYT8 11 Synaptotagmin VIII differentiation-associated TMEM171 5 Transmembrane protein 171 MS4A6E 11 Membrane-spanning 4-domains, subfamily A, member 6E OR4X1 11 Olfactory 
receptor, family 4, subfamily X, member 1 SDC3 1 Syndecan 3 LRRC25 19 Leucine-rich repeat containing 25 LRIG3 12 Leucine-rich repeats and immunoglobulin-like domains 3 OR4D2 17 Olfactory receptor, family 4, subfamily D, member 2 TAS2R30 12 Taste receptor, type 2, member 30 OLR1 12 Oxidised low-density lipoprotein OR2T33 1 Olfactory receptor, family 2, (lectin-like) receptor 1 subfamily T, member 33 ZNRF4 19 Zinc and ring finger 4 MFRP 11 Membrane frizzled-related protein CCKBR 11 Cholecystokinin B receptor WSCD1 17 WSC domain containing 1 C1ORF186 1 open reading frame All genes assigned to one gene group, GG1; DAVID enrichment 186 score = 3.97 YIPF3 6 Yip1 domain family, member 3 C10ORF35 10 Chromosome 10 open reading frame ‘exclusion’ list for disease studies based on exomes 35 (Fuentes Fajardo et al. 2012). TAS2R31 12 Taste receptor, type 2, member 31 A number of studies have examined the relationship CD207 2 CD207 molecule, langerin between requirement during development (essentiality) of OR1L6 9 Olfactory receptor, family 1, genes, and their roles in disease, and some have concluded subfamily L, member 6 that most disease genes are non-essential (for e.g. Goh AVPR1A 12 Arginine vasopressin receptor 1A et al. 2007), but this has been questioned by Dickerson OR9G1 11 Olfactory receptor, family 9, et al. (2011). These studies have largely considered evi- subfamily G, member 9; dence from human orthologs of mouse essential genes. One C19ORF59 19 Chromosome 19 open reading frame 59 argument is that a proportion of mutations in essential genes prevent viability and account, for example, for


spontaneous miscarriages. This analysis demonstrates that a proportion of genes with strong LD show enrichment for essential core biological functions and this group of genes is also significantly depleted for genes containing variation related to disease. This is consistent with the arguments presented by Goh et al. (2007) that mutations in such genes are under enhanced selective pressure leading to lethality in utero. Increased selection would act to reduce haplotype diversity and hence increase measured LD for these essential genes. In contrast, we find significant enrichment of disease genes amongst genes which show more typical, average LD (and, presumably, by implication, average recombination and mutation rates, Table 5). This category of genes with average LD may be enriched for genes which are less widely expressed, defined as occupying more ‘functionally peripheral’ regions of the cell (Goh et al. 2007). Thus, disease mutations in this class of genes are less likely to result in in utero lethality and may contribute, in time, to detectable disease processes in a mutation carrier.

Table 4 DAVID: Significantly enriched GO terms (P < 0.01 or less) for genes with weak LD (LDU/kb > 0.22)

| Gene group membership: GO term (biological process) | Number of genes related to GO term/total genes in gene group (%) | P value (EASE score) | Fold enrichment relative to human genome background |
|---|---|---|---|
| GG1: sensory perception of chemical stimulus | 19/56 (34) | <0.000001 | 15 |
| GG1: sensory perception | 21/56 (38) | <0.000001 | 9.5 |
| GG1: G-protein-coupled receptor protein signalling pathway | 23/56 (41) | <0.000001 | 7.6 |
| GG1: sensory perception of smell | 17/56 (30) | <0.000001 | 14 |
| GG1: cognition | 21/56 (38) | <0.000001 | 8.5 |
| GG1: neurological system process | 21/56 (38) | <0.000001 | 6.4 |
| GG1: cell surface receptor linked signal transduction | 23/56 (41) | <0.000001 | 4.6 |
| GG1: regulation of systemic arterial blood pressure by vasopressin | 2/56 (4) | <0.01 | 130 |

A limitation of this study arises from the relatively broad categorisation of genes containing disease variation reported in OMIM. OMIM represents the most complete repository of known disease genes and disorders and includes not only variants underlying Mendelian disorders, but also a proportion of variants involved in complex traits. This heterogeneous categorisation, combined with the incomplete knowledge of the disease genome, may influence power to detect relationships between LD and disease. Future analyses might focus on sub-types of disease variants where possible (Mendelian, complex trait, etc.) and quantify the tendency of a particular gene to acquire

Table 5 Statistics for genes by LDU/kb category

| Statistic | ‘‘Strong’’ LD: 0 ≤ LDU/kb ≤ 0.01 | 0.01 < LDU/kb ≤ 0.02 | ‘‘Average’’ LD: 0.02 < LDU/kb ≤ 0.05 | ‘‘Weak’’ LD: LDU/kb > 0.05 | Totals | All disease genes | All non-disease genes |
|---|---|---|---|---|---|---|---|
| Number of disease genes | 568 | 188 | 197 | 133 | 1,086 | 1,086 | – |
| Number of non-disease genes | 5,404 | 1,421 | 1,377 | 1,126 | 9,328 | – | 9,328 |
| Total genes | 5,972 | 1,609 | 1,574 | 1,259 | 10,414 | 1,086 | 9,328 |
| Proportion of disease genes/P value | **0.0951/0.0004** | 0.1168/0.0804 | **0.1252/0.0038** | 0.1056/0.9054 | 0.1043/– | – | – |
| Gene size kb (mean/SD) | 55.8/89.8 | 60.2/102.3 | 52.7/84.8 | 24.8/44.6 | 52.3/87.6 | 59.2/94.3 | 51.5/86.8 |
| Gene size LDU (mean/SD) | 0.21/0.53 | 0.84/1.41 | 1.62/2.61 | 3.81/19.98 | 0.95/7.15 | 0.97/1.92 | 0.95/7.53 |
| SNPs per gene (mean/SD) | 16.4/14.8 | 19.5/16.4 | 20.0/23.9 | 17.9/26.5 | 17.6/18.5 | 20.7/20.3 | 17.3/18.2 |
| LDU/kb (mean/SD) | 0.0029/0.0030 | 0.0142/0.0028 | 0.0316/0.0084 | 0.1815/0.3462 | 0.0306/0.1331 | 0.0296/0.0888 | 0.0307/0.1373 |

Bold values are statistically significant

Overall P value for 2 × 4 contingency table disease genes/non-disease genes by LDU/kb category = 0.0015
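The derived quantities in Table 5 can be rechecked from its counts. The sketch below is illustrative only: the paper does not name the test behind the overall P value, so a Pearson chi-square test on the 2 × 4 table is an assumption here, and the helper names (`pearson_chi_square`, `chi2_sf_3df`) are hypothetical. It also recomputes the per-category disease-gene proportions and the SNP densities quoted in the text.

```python
import math

# Counts copied from Table 5, ordered by LDU/kb category:
# strong (0-0.01), 0.01-0.02, average (0.02-0.05), weak (>0.05).
disease = [568, 188, 197, 133]
non_disease = [5404, 1421, 1377, 1126]


def pearson_chi_square(rows):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(rows):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, dof


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with exactly 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Proportion of disease genes within each LDU/kb category.
proportions = [d / (d + n) for d, n in zip(disease, non_disease)]

# Assumed test: Pearson chi-square on the 2 x 4 table (3 degrees of freedom).
stat, dof = pearson_chi_square([disease, non_disease])
p_value = chi2_sf_3df(stat)

# SNP density = mean SNPs per gene / mean gene size in kb, as in the text.
weak_density = 17.9 / 24.8
all_density = 17.6 / 52.3

print([round(p, 4) for p in proportions])
print(round(stat, 2), dof, round(p_value, 4))
print(round(weak_density, 2), round(all_density, 2))
```

Under these assumptions the statistic is about 15.5 on 3 degrees of freedom, which gives a P value of roughly 0.0015, in line with the table footnote. The proportions and SNP densities reproduce the values printed in Table 5 and in the text.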

disease variation based on the number of known variants. Because we have conservatively excluded many variants and genes, future studies could usefully represent more of the genome in larger samples, with potential increases in power, to characterise LD and gene functional relationships. Although we have shown that a relatively small sample (30 exomes) is sufficient to characterise much of the LD structure and derive maps of individual genes, we recognise that more genome sequences would increase coverage to provide a more complete set of gene maps. Furthermore, whole-genome sequencing would be of particular interest to integrate information on the functions of gene control regions and their LD structure. We anticipate that the contour of LDU maps derived from whole-genome sequence will closely resemble Fig. 1 but will provide greater resolution across the genome than the HapMap-derived data.

In conclusion, the high-resolution linkage disequilibrium maps, derived from exome sequencing data, indicate that the strength of LD in individual genes is related to gene function and the tendency for a gene to contain disease causal variation. Genes with weak LD are particularly enriched for functions linked to, for example, sensory perception and the immune system for which high haplotype diversity is strongly advantageous. Despite the high mutation rate, these genes do not include an excess of disease-related variation. Genes containing disease variants are found across the whole spectrum from strong to weak LD but are particularly under-represented amongst genes with strong LD, many of which show essential biological functions. The reduced haplotype diversity implied by strong LD is consistent with previously published evidence for selective pressure against deleterious mutations in these genes. The data here identify a significant excess of disease variants within genes which show intermediate, average, strengths of LD. A proportion of these genes may have more ‘peripheral’ functions in the cell intermediate between those with core essential functions and those which require high allelic diversity and functional adaptability. Individual genes shown here are found to have widely varying LD structure which is related to gene function and the propensity of genes to carry disease causal variation. Gene LD structure can therefore contribute significant information to the identification of candidate disease genes and to the understanding of the functional roles of individual genes in cellular pathways.

References

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1):25–29
Christodoulou K, Wiskin AE, Gibson J, Tapper W, Willis C, Afzal NA, Upstill-Goddard R, Holloway JW, Simpson MA, Beattie RM et al (2012) Next generation sequencing of paediatric inflammatory bowel disease patients identifies rare and novel variants in candidate genes. Gut. doi:10.1136/gutjnl-2011-301833
Chuang JH, Li H (2004) Functional bias and spatial organization of genes in mutational hot and cold regions of the human genome. PLoS Biol 2(2):253–263
Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES (2001) High-resolution haplotype structure in the human genome. Nat Genet 29:229–232
Desai A, Mitchison TJ (1997) Microtubule polymerization dynamics. Annu Rev Cell Dev Biol 13:83–117
Dickerson JE, Zhu A, Robertson DL, Hentges KE (2011) Defining the role of essential genes in human disease. PLoS ONE 6(11):e273368
Fuentes Fajardo KV, Adams D, NISC Comparative Sequencing Program, Mason CE, Sincan M, Tifft C, Toro C, Boerkoel CF, Gahl W, Markello M (2012) Detecting false-positive signals in exome sequencing. Hum Mutat. doi:10.1002/humu.22033
Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M et al (2002) The structure of haplotype blocks in the human genome. Science 296:2225–2229
Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabási AL (2007) The human disease network. Proc Natl Acad Sci USA 104(21):8685–8690
Jeffreys AJ, Kauppi L, Neumann R (2001) Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat Genet 29:217–222
Lau W, Kuo T-Y, Tapper W, Cox S, Collins A (2007) Exploiting large scale computing to construct high resolution linkage disequilibrium maps of the human genome. Bioinformatics 23(4):517–519
Lercher MJ, Hurst LD (2002) Human SNP variability and mutation rate are higher in regions of high recombination. Trends Genet 18:337–340
Li M-X, Gui H-S, Kwan JSH, Bao S-Y, Sham PC (2012) A comprehensive framework for prioritizing variants in exome sequencing studies of Mendelian diseases. Nucleic Acids Res 40(7):e53
Maniatis N, Collins A, Ku X-F, McCarthy LC, Hewett DR, Tapper W, Ennis S, Ke X, Morton NE (2002) The first linkage disequilibrium (LD) maps: delineation of hot and cold blocks by diplotype analysis. Proc Natl Acad Sci USA 99(4):2228–2233
McVean GAT, Myers S, Hunt S, Deloukas P, Bentley DR, Donnelly P (2004) The fine-scale structure of recombination rate variation in the human genome. Science 304:581–584
Ng SB, Buckingham KJ, Lee C, Bigham AW, Tabor HK, Dent KM, Huff CD, Shannon PT, Jabs EW, Nickerson DA et al (2010) Exome sequencing identifies the cause of a Mendelian disorder. Nat Genet 42:30–35
Papavasiliou FN, Schatz DG (2002) Somatic hypermutation of immunoglobulin genes: merging mechanisms for genetic diversity. Cell 109(Suppl):S35–S44
Parla JS, Iossifov I, Grabill I, Spector MS, Kramer M, McCombie WR (2011) A comparative analysis of exome capture. Genome Biol 12:R97
Service S, DeYoung J, Karayiorgou M, Louw Roos J, Pretorious H, Bedoya G, Ospina G, Ruiz-Linares A, Macedo A, Almeida Palha J et al (2006) Magnitude and distribution of linkage disequilibrium in population isolates and implications for genome-wide association studies. Nat Genet 38(5):556–560


Sharon D, Glusman G, Pilpel Y, Khen M, Gruetzner F, Haaf T, Lancet D (1999) Primate evolution of an olfactory receptor cluster: diversification by gene conversion and recent emergence of pseudogenes. Genomics 61:24–36
Smith AV, Thomas DJ, Munro HM, Abecasis GR (2005) Sequence features in regions of weak and strong linkage disequilibrium. Genome Res 15(11):1519–1534
Sun P, Zhang R, Jiang Y, Wang X, Li J, Lv H, Tang G, Guo X, Meng X, Zhang H, Zhang R (2011) Assessing the patterns of linkage disequilibrium in genic regions of the human genome. FEBS J 278(19):3748–3755
Tapper W, Collins A, Gibson J, Maniatis N, Ennis S, Morton NE (2005) A map of the human genome in linkage disequilibrium units. Proc Natl Acad Sci USA 102(33):11835–11839
The International HapMap Consortium (2003) The International HapMap Project. Nature 426:789–796
Wei Haung D, Sherman BT, Lempicki RA (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4:44–57
Zhang W, Collins A, Maniatis N, Tapper W, Morton NE (2002) Properties of linkage disequilibrium (LD) maps. Proc Natl Acad Sci USA 99(26):17004–17007
