Genetic Epidemiology
Total Page:16
File Type:pdf, Size:1020Kb
Received: 10 May 2019 | Revised: 31 July 2019 | Accepted: 28 August 2019 DOI: 10.1002/gepi.22260 RESEARCH ARTICLE Population‐wide copy number variation calling using variant call format files from 6,898 individuals Grace Png1,2,3 | Daniel Suveges1,4 | Young‐Chan Park1,2 | Klaudia Walter1 | Kousik Kundu1 | Ioanna Ntalla5 | Emmanouil Tsafantakis6 | Maria Karaleftheri7 | George Dedoussis8 | Eleftheria Zeggini1,3* | Arthur Gilly1,3,9* 1Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Abstract Kingdom Copy number variants (CNVs) play an important role in a number of human 2Department of Medical Genetics, diseases, but the accurate calling of CNVs remains challenging. Most current University of Cambridge, Cambridge, approaches to CNV detection use raw read alignments, which are computationally United Kingdom intensive to process. We use a regression tree‐based approach to call germline CNVs 3Institute of Translational Genomics, Helmholtz Zentrum München—German from whole‐genome sequencing (WGS, >18x) variant call sets in 6,898 samples Research Center for Environmental across four European cohorts, and describe a rich large variation landscape Health, Neuherberg, Germany comprising 1,320 CNVs. Eighty‐one percent of detected events have been previously 4European Bioinformatics Institute, ‐ ‐ Wellcome Genome Campus, Hinxton, reported in the Database of Genomic Variants. Twenty three percent of high quality United Kingdom deletions affect entire genes, and we recapitulate known events such as the GSTM1 5William Harvey Research Institute, Barts and RHD gene deletions. We test for association between the detected deletions and and The London School of Medicine and 275 protein levels in 1,457 individuals to assess the potential clinical impact of the Dentistry, Queen Mary University of London, London, United Kingdom detected CNVs. We describe complex CNV patterns underlying an association with − 6Anogia Medical Centre, Anogia, Greece levels of the CCL3 protein (MAF = 0.15, p =3.6x10 12)attheCCL3L3 locus, and a 7Echinos Medical Centre, Echinos, Greece novel cis‐association between a low‐frequency NOMO1 deletion and NOMO1 protein 8 − Department of Nutrition and Dietetics, levels (MAF = 0.02, p =2.2x10 7). This study demonstrates that existing population‐ School of Health Science and Education, wide WGS call sets can be mined for germline CNVs with minimal computational Harokopio University of Athens, Athens, Greece overhead, delivering insight into a less well‐studied, yet potentially impactful class of 9Department of Public Health and genetic variant. Primary Care, University of Cambridge, Cambridge, United Kingdom KEYWORDS association study, copy‐number variant, whole‐genome sequencing Correspondence Arthur Gilly, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK. Email: arthur.gilly@helmholtz- muenchen.de Funding information Wellcome Trust, Grant/Award Number: 098051; European Research Council, Grant/Award Number: ERC‐2011‐StG 280559‐SEPI *Eleftheria Zeggini and Arthur Gilly contributed equally to this study. Genet. Epidemiol. 2019;1–11. www.geneticepi.org © 2019 Wiley Periodicals, Inc. | 1 2 | PNG ET AL. 1 | INTRODUCTION (do Nascimento & Guimaraes, 2017). Such variant call sets are typically produced in the variant call format Up to 19.2% of the human genome is susceptible to copy (VCF) in most association‐focused studies, and analysis number variation, which can have a severe impact on of these comparably small files for CNV calling would be gene function (Zarrei, MacDonald, Merico, & Scherer, computationally efficient. Here, we evaluate the effect of 2015). Copy number variant (CNV) calling can be CNVs on sequencing depth measured at variant sites performed for individuals or families in a clinical using a novel tool (Unimaginatively Named CNV caller context, or for large sample sizes in population cohorts. [UN‐CNVc]), and provide a proof‐of‐concept for calling Whole‐genome sequencing (WGS) at high depth has these large variations in population‐wide WGS variant been the gold standard for detecting large polymorph- call sets. isms in population studies and is starting to replace array‐based calling in the clinic. Yet, calling structural ‐ variants genome wide has been an ongoing challenge 2 | MATERIALS AND METHODS throughout the history of computational genetics, and ‐ producing population wide CNV call sets still represents The observed read depth for a single sample in a WGS a significant investment today. The reasons for this are experiment can be modelled as a noisy piecewise ‐ two fold. First, detecting structural variants requires a constant function: different study design compared with association stu- dies: whereas for the latter, haplotype diversity and dˆ ()=xk . ()+ xϵ hence sample size are key (Alex Buerkle & Gompert, ∑ dx()= k k 2013; Le & Durbin, 2011), for the former, high depth of sequencing is paramount, leading to prohibitive costs for where d ()=0.5xnis the ideal relative depth at position ‐ population wide studies. This is in addition to other x, n is the copy number at this position, ⎧ upstream processing features, such as insert size, ⎨1()=if d x k dx()= k()=x is the indicator function for polymerase chain reaction (PCR) amplification bias, ⎩0otherwise and choice of mapping software and reference genome, copy number k genome‐wide and ϵ ~(0,)N σ is the error that also influence structural variant detection sensitivity in estimating true read counts. This error term captures (Trost et al., 2018). Second, structural variant detection all non‐CNV factors influencing read depth, such as GC poses a computational challenge, since most algorithms content or reference sequence quality. These variations use aligned reads or read pileups as a starting point for tend to act on a short‐range, and over long stretches of event detection. As these file formats describe the entire sequence, average depths vary little around the per‐ read pool, processing them genome‐wide across an entire sample mean (Figure S1). population with high‐depth WGS is demanding in terms Methods for fitting piecewise constant functions for of both running time and memory. CNV calling pipelines CNV detection have included circular binary segmenta- involving a combination of read depth and insert‐size tion (Olshen, Venkatraman, Lucito, & Wigler, 2004; based tools are increasingly included in analysis pipe- Venkatraman & Olshen, 2007), hidden Markov models lines for large human population cohorts, however, the (Seiser & Innocenti, 2014), smoothing approaches (Hsu, computing requirements and complexity of such meth- 2005; Tibshirani & Wang, 2008) as well as Bayesian ods often preclude their use in other settings. This is methods (Hutter, 2007), often in the context of array especially true when CNV calling algorithms were not comparative genomic hybridization studies. Here, due to integrated in standard WGS processing pipelines from the density of the input dataset, we use regression trees to the get‐go, in which case the entire read pool needs to be fit a piecewise constant function, although any segmen- reprocessed again to produce a CNV callset. This issue tation algorithm able to handle hundreds of thousands of can be addressed by detecting deletions and insertions points could be used instead. Regression trees have been from existing variant call sets, which demands much less applied to WGS‐based detection of CNVs before (Chen compute effort. Such methods were pioneered in the era et al., 2015), and they have been used in analysing of genotyping chips (PennCNV; Wang et al., 2007) and variant‐level data from paired cancer samples (Putnam PlatinumCNV (Kumasaka et al., 2011), are still widely et al., 2017). We wrote the UN‐CNVc, a simple and fast used (Kayser et al., 2018; Selvanayagam et al., 2018) and CNV detection tool based on regression trees. Due to its have recently been proposed to call CNVs from marker‐ sparse input format and the simplicity of the model used, level data in paired cancer samples (Putnam et al., 2017). it is able to process call sets from thousands of samples To our knowledge, no such method exists for with WGS data in a reasonable time. A summary of the variant calls produced from population‐scale WGS CNV calling pipeline is described in Figure 1a. PNG ET AL. | 3 (a) (b) (c) (e) (d) FIGURE 1 Overview of the UN‐CNVc algorithm. (a) Overview of the pipeline, with input and output files in blue, and external tools and libraries in grey. (b) Output of a piecewise constant regression (in red) on a 10Mb window on chromosome 11, for a homozygous deletion carrier. The gray signal is the raw relative depth at every sequenced marker for that sample. (c) Pooled regressed segments across the population, with colour indicating the attributed ideal depth (0: red, 0.5: blue, 1: green, and 1.5: purple). (d) Raw count (dashed line) and run‐length encoding (shaded green bars) on the number of high‐quality segments with ideal depth <1. (e) Genotyping using both weighted average segment depth (colour, scheme identical to c) and average depth across markers (plotting glyphs, squares: 0, circles: 0.5, and triangles: 1). UN‐CNVc, Unimaginatively Named CNV caller 2.1 | Identifying variant regions For each window, we fit a Gaussian mixture model, with means constrained to multiples of 0.5 within the observed Briefly, for each sample in 10Mb windows spanning the depth range at that region. For each depth segment entire genome, we apply a regression tree using the rpart R produced by the regression, we assign an ideal depth which library to the depth at marker sites normalised by is the multiple of 0.5 relative depth that is closest to the chromosome‐wide depth. We use the default values of actual value of the segment. We also assign a score s=2p, 0.01 for the complexity parameter of the regression tree where p is the one‐sided p value for the Gaussian (the overall r2 of the model must increase of at least this component centered around the ideal depth for that value at each iteration) and 6 for the minimum leaf size. segment, and consider a call high‐quality when s>0.1.