What’s in the Dutch Genome?

Cisca Wijmenga, UMC, Groningen / Dorret Boomsma, VU, Amsterdam

Exciting times for and large scale sequencing 1,000 Genome Project (Oct 2015)

Results of an early next-generation DNA sequencing run. Recent advances enable parallel sequencing reactions that can produce up to 500 gigabases of data per instrument run. These data allow us to better understand the frequency of genetic mutations and their impact on human disease.

Genome of the GoNL (Aug 2014)

What’s in the Dutch genome?

GoNL: 7.6 million novel SNVs, with an excess of rare nonsense variants and frameshift indels, …

An average individual: 60 loss-of-function SNVs, 69 loss-of-function indels, 15 loss-of-function large deletions, and 20 disease-causing variants, …

SNV = single nucleotide variant; indel = insertion or deletion of bases

What makes GoNL unique? (1) Representative sample of the Dutch population

Selection criteria

• Caucasian background ((grand)parents born in same region)
• Equal distribution across provinces
• Unselected for disease

Boomsma et al., EJHG 2014

Made possible by ~1M Dutch inhabitants participating in biobank research

LifeLines cohort study, UMCG

Netherlands Twin Register, VU

Leiden Longevity Study, LUMC

Rotterdam Elderly Study, Erasmus MC

>200 biobanks participate in BBMRI-NL

BBMRI (Biobanking and Biomolecular Resources Research Infrastructure) NL

Important: major genetic differences between the North and the South

Abdellaoui et al. (2013) Eur J Hum Genet

[Figure: maps of religious denomination (Catholic, Protestant, Non-religious); panels: NL in 1849, NL today]

Abdellaoui et al. (2013) Association between autozygosity and major depression: stratification due to religious assortment. Behavior Genetics

Ancestry-informative PCs replicated in next-generation sequencing dataset GoNL

What makes GoNL unique? (2) Trio design

• Excellent haplotypes – high-quality imputation
• Insight into de novo events

GoNL: investigating de novo SNV and SV mutations from whole genomes

Father Mother

Child

GoNL:
• 11,020 de novo single nucleotide variants (SNVs)
• 332 de novo structural variants (SVs)

11,020 high-confidence de novo mutations = 18–74 per offspring

74% of new mutations are derived from fathers and increase with age

Pearson’s correlation = 0.47, P < 2.2 × 10^−16

Father: A A    Mother: A A

Child: A G

G has never been observed before; G is regarded as a new mutation

+1.47 new mutations per extra year (age)
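The trio logic on this slide (both parents A A, child A G, so G is treated as new) can be sketched in a few lines of Python. This is only an illustration of the Mendelian check, not the GoNL calling pipeline: the function name and the string encoding of genotypes are assumptions, and a real caller also has to account for sequencing error and coverage.

```python
def candidate_de_novo_alleles(father, mother, child):
    """Return the child's alleles that are absent from both parents.

    Genotypes are given as strings of allele letters, e.g. "AA" or "AG"
    (a hypothetical encoding chosen for this sketch).
    """
    parental_alleles = set(father) | set(mother)
    return sorted(allele for allele in set(child) if allele not in parental_alleles)


# The example from the slide: neither parent carries G, so G is flagged.
print(candidate_de_novo_alleles("AA", "AA", "AG"))

# If one parent carries G, the child's G can be inherited and nothing is flagged.
print(candidate_de_novo_alleles("AG", "AA", "AG"))
```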

The Genome of the Netherlands Consortium, Nat Genet 2014

GoNL includes monozygotic twins: germ line vs somatic mutation rate

~97% of de novo mutations are germ line

~3% of de novo mutations are somatic

Are de novo mutation rates correlated with replication timing?

Early replicating DNA (gene-rich regions)

Late replicating DNA (gene-poor regions)

Do the paternal age effects have functional consequences? Most likely!

N of genic (red) and intergenic (blue) de novo mutations in offspring (log scale) as a function of paternal age. The red line = regression for genic mutations; blue = regression for intergenic mutations. The steeper slope of the red line for genic mutations indicates a faster relative increase in the rate of genic mutations with paternal age.

+0.26% de novo mutations in genic regions per extra year of paternal age.

Offspring of 40y-old vs 20y-old fathers: 19.06 vs 9.63 genic mutations
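As a rough worked example, the two genic mutation counts above can be turned into a ratio and an implied yearly increase. The sketch assumes constant proportional growth between paternal ages 20 and 40 (in the spirit of the log-scale regression in the figure caption); that assumption and the variable names are illustrative and are not taken from the paper.

```python
import math

genic_at_20 = 9.63   # genic de novo mutations, offspring of 20-year-old fathers
genic_at_40 = 19.06  # genic de novo mutations, offspring of 40-year-old fathers
years = 40 - 20

# Fold change over the 20-year span.
ratio = genic_at_40 / genic_at_20

# Average absolute increase per year if growth were linear.
linear_per_year = (genic_at_40 - genic_at_20) / years

# Implied yearly percentage increase if growth were exponential (assumption).
annual_growth = math.exp(math.log(ratio) / years) - 1

print(f"fold change over {years} years: {ratio:.2f}")
print(f"linear increase: {linear_per_year:.2f} genic mutations per year")
print(f"implied compound increase: {annual_growth:.1%} per year")
```

Under these assumptions the genic mutation count roughly doubles between a paternal age of 20 and 40.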

Elevated mutation rates driven by CpGs

Francioli et al., Nat Genet 2015

The distribution of de novo mutations is non-random

Closely spaced mutations are enriched both across and within individuals (78 mutation clusters of up to 20 kb in length are observed). 1.5% of all de novo mutations are in such clusters

Francioli et al., Nat Genet 2015

GoNL also allowed investigation of de novo structural variations (SVs)

Reference

SNP

Deletion

Insertion

Tandem duplication

Profiling de novo structural variants requires a host of different methods

Kloosterman et al., Genome Research 2015

One in seven children bears a de novo SV mutation

Six offspring had two and one offspring had three de novo SVs

Kloosterman et al., Genome Research 2015

An excess of de novo structural changes originated on paternal haplotypes

A significantly larger fraction (66.1%) of indels and SVs arise on paternal chromosomes than on maternal chromosomes

No significant correlation between de novo structural change occurrence and paternal age (limited number of observations?)

Kloosterman et al., Genome Research 2015

Differences in impact of de novo SNVs, indels and SVs

7 times more de novo SNVs per child (average 45), but de novo SVs affect ~100 times more genomic DNA (4,084 bp)

18 times more genes are hit by de novo SNVs versus SVs

Kloosterman et al., Genome Research 2015

Patients with “neurodevelopmental” disease carry more de novo SNVs

How to investigate the ‘pathogenicity’ of de novo and rare SNVs and SVs?

Take advantage of the biobanks participating in BBMRI-NL

Systematic analysis of potentially pathogenic variants to assist clinical interpretation

Low to moderate risk Shared across populations

Higher risk Often more population specific

General population Clinical population

Biobanks allow context dependent interpretation of genetic variation

Downstream effects (RNA, metabolites, methylation, …)

Clinical information

Lifestyle

Family history

Unique BBMRI-NL resources: Biobank-based Integrative Omics Study

N = ~4000

Epigenome

Environmental and genetic variance change with age

• 33% of methylation sites: significant change in methylation level with age
• 10% of methylation sites: significant change of the genetic or environmental variance with age
• Most of these (82%): unique environmental variance changes
• Prevailing pattern: the higher the age, the larger the total variance in methylation, and the larger the unique environmental variance (E)

There is an age-related shift in the causes of variation in DNA methylation between people

Example of a rare variant that impacts allele-specific expression (and protein levels)

A deficiency of mannan-binding lectin is associated with susceptibility to infections and with the development of immunologic disease (NEJM 2003)

MASP2: c.359A>G (p.Asp120Gly)

G MAF: 0.02

[Figure labels: alleles A (Asp) and G (Gly); genotypes Asp/Asp, Asp/Gly, Gly/Gly]

In particular more difficult when moving away from families

A 5-year journey to obtain an ultra-sharp portrait of the Dutch population

[Timeline figure, 2010–2015. Milestones shown: pilot (2011, N=20); N=250; SNP calling complete; Variant Finder online; de novo SNVs characterized; de novo SVs characterized. Phases: data generation, data processing, data analysis]

GoNL data can be used in many different ways

• Reference for medical sequencing
• Population genetics
• Complex disease genetics (imputation)

Access to data can be requested through nlgenome.nl

Imputation of Dutch biobanks with GoNL identifies novel gene variants

Michigan Imputation server

A new SNP platform: NL-Axiom array using GoNL

Imputation backbone / modules: UK Biobank Axiom array / Psych chip / chromosome X / GWA catalogue

Virtual NL-Axiom array from GoNL reference sequence data

GoNL reference sequence data: N = 250 unrelated females

SNP extraction + QC (virtual array: 618,889 SNPs)

• MAF > 0.01
• Missing Ss > 0.90
• Missing SNPs > 0.95
• HWE > 10^−5

Phasing + imputation

• Against 1000G phase 3
• Test concordance of imputed SNPs with original GoNL SNPs

Imputation quality NL-Axiom

[Figure: median R2 (0.0–1.0) by minor allele frequency bin (0.001–0.01, 0.01–0.05, 0.05–0.50); series labels: 1000G SNPs, Also in 1000G]

Looking at ~10 mil SNPs

• Bad, 0–80% concordance: 1.7%
• OK, 80–95% concordance: 8.4%
• Good, >95% concordance: 89.9%
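The concordance bands used for the NL-Axiom test (Bad, OK, Good) amount to a simple thresholding step, sketched below in Python. The function names and the example values are hypothetical, and the handling of values exactly at 80% or 95% is an assumption, since the slide does not say which band the boundaries fall into.

```python
from collections import Counter


def concordance_band(concordance):
    """Map a per-SNP concordance (fraction between 0 and 1) to a band label."""
    if not 0.0 <= concordance <= 1.0:
        raise ValueError("concordance must be a fraction between 0 and 1")
    if concordance > 0.95:
        return "Good"
    if concordance >= 0.80:  # assumption: exactly 80% and exactly 95% count as OK
        return "OK"
    return "Bad"


def band_percentages(concordances):
    """Return the percentage of SNPs that fall into each band."""
    counts = Counter(concordance_band(value) for value in concordances)
    total = len(concordances)
    return {band: 100.0 * counts[band] / total for band in ("Bad", "OK", "Good")}


# Made-up concordance values for four SNPs, for illustration only.
example = [0.50, 0.90, 0.99, 0.97]
print(band_percentages(example))
```

Applied to the full set of imputed SNPs, a summary like this would yield the three percentages reported for the ~10 million SNPs.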

Complex disease genetics (imputation)

A rare missense variant in ABCA6 (p.Cys1359Arg), predicted to be deleterious

The frequency of the ABCA6 variant is 3.65-fold increased in GoNL (0.030 vs 0.008 in 1000GP)

One day, we may also understand the strange habits of the Dutch …

Using GCTA in a sample of distantly related NL individuals, the measured (and tagged) SNPs explain 25% of the variance in cannabis initiation. Chromosomes 4 and 18, previously linked with cannabis use and other phenotypes, account for the largest amount of variance in initiation.

Thanks to the GoNL team

Laurent Francioli, Jenny Van Dongen, Gertjan Van Ommen, Abdel Abdellaoui, Dorret Boomsma, Eline Slagboom, Cisca Wijmenga, Paul de Bakker, Cornelia Van Duijn, Morris Swertz, Kai Ye

Tomorrow: much more about biobank research in the Netherlands