<<

Global patterns of copy number variation in humans from a population-based analysis.

ICHG Kyoto

Jean Monlong April 5, 2016

BOURQUE LAB MCGILL UNIVERSITY HUMAN GENETICS DEPT. Disclosure Information

I have no financial relationships to disclose

2 Copy-Number Variation

Copy-Number Variation 3 Copy Number Variation (CNV)

Imbalanced involving more than 500bp.

Copy-Number Variation 4 CNV detection from High-Throughput Sequencing

Baker 2012, Nature Methods.

Copy-Number Variation 5 More prone to CNV. Enriched in Segmental Duplications (Sharp Annual Review 2006). Short Tandem Repeats highly polymorphic (Warbuton BMC 2008). Transposons involved in CNV formation (Sen AJHG 2006).

Involved in phenotype and disease. Short Tandem Repeats and expression (Gymrek Nat. Genetics 2016). Repeats CNV involved in ∼30 genetic disorders (Mirkin Nature 2007). Retrotransposition in cancer (Lee Science 2012).

Low-mappability regions

Repeat-rich regions, , . ∼13% of the .

Copy-Number Variation 6 Involved in phenotype and disease. Short Tandem Repeats and (Gymrek Nat. Genetics 2016). Repeats CNV involved in ∼30 genetic disorders (Mirkin Nature 2007). Retrotransposition in cancer (Lee Science 2012).

Low-mappability regions

Repeat-rich regions, centromeres, telomeres. ∼13% of the human genome.

More prone to CNV. Enriched in Segmental Duplications (Sharp Annual Review 2006). Short Tandem Repeats highly polymorphic (Warbuton BMC Genomics 2008). Transposons involved in CNV formation (Sen AJHG 2006).

Copy-Number Variation 6 Low-mappability regions

Repeat-rich regions, centromeres, telomeres. ∼13% of the human genome.

More prone to CNV. Enriched in Segmental Duplications (Sharp Annual Review 2006). Short Tandem Repeats highly polymorphic (Warbuton BMC Genomics 2008). Transposons involved in CNV formation (Sen AJHG 2006).

Involved in phenotype and disease. Short Tandem Repeats and gene expression (Gymrek Nat. Genetics 2016). Repeats CNV involved in ∼30 genetic disorders (Mirkin Nature 2007). Retrotransposition in cancer (Lee Science 2012).

Copy-Number Variation 6 PopSV approach

PopSV approach 7 PopSV approach

Objective Test the entire genome, including low-mappability regions, and detect subtle abnormal coverage.

PopSV: Population-based approach Use a set of reference experiments to detect abnormal patterns.

sample reference tested number of reads mapped number genomic window

PopSV approach 8 Benchmark and validation

Existing methods FREEC LASSO-based segmentation; GC and mappability correction. cn.MOPS Multi-sample Bayesian-based segmentation.

Whole-Genome Sequencing data 45 samples, including 10 twin families (i.e 2 twins + 2 parents). 95 pairs of normal/tumor samples from Renal Cell Carcinoma (CageKid).

PopSV approach 9 Benchmark and validation

Replication in the twins. Concordance with pedigree.

Replication in the paired tumor.

Concordance of different bin sizes PCR validation.

Overall performance and in different repeat context.

PopSV approach 10 Validation conclusions

PopSV detects 3-5x more variants. Wider genomic range. Robust across challenging regions: Low-coverage. Segmental duplications. DNA satellites. Short tandem repeats GC-rich/poor. Resolution down to half the bin size.

PopSV approach 11 CNV patterns in normal genomes

CNV patterns in normal genomes 12 Where are CNVs located ? In ? ? Segmental duplication ? DNA satellites ? Short tandem repeats ? Transposable Elements ? Exons ? Promoters ?

Control regions Same size distribution. Randomly distributed.

CNV in normal genomes

640 normal genomes 45 samples from the Twin study (∼40X) 95 normal samples from Renal Cell Carcinoma (∼54X). 500 unrelated samples from GoNL (∼14X).

CNV patterns in normal genomes 13 Control regions Same size distribution. Randomly distributed.

CNV in normal genomes

640 normal genomes 45 samples from the Twin study (∼40X) 95 normal samples from Renal Cell Carcinoma (∼54X). 500 unrelated samples from GoNL (∼14X).

Where are CNVs located ? In Centromere ? Telomere ? Segmental duplication ? DNA satellites ? Short tandem repeats ? Transposable Elements ? Exons ? Promoters ?

CNV patterns in normal genomes 13 CNV in normal genomes

640 normal genomes 45 samples from the Twin study (∼40X) 95 normal samples from Renal Cell Carcinoma (∼54X). 500 unrelated samples from GoNL (∼14X).

Where are CNVs located ? In Centromere ? Telomere ? Segmental duplication ? DNA satellites ? Short tandem repeats ? Transposable Elements ? Exons ? Promoters ?

Control regions Same size distribution. Randomly distributed.

CNV patterns in normal genomes 13 Enriched close to Centromere/Telomere/Gap (CTG)

1.00

0.75

0.50 cumulative proportioncumulative

0.25

region CNV

0.00 control

0e+00 2e+07 4e+07 6e+07 distance to centromere/telomere/gap (bp)

CNV patterns in normal genomes 14 Enriched in SD and low-coverage regions

CNV patterns in normal genomes 15 Same proportion overlapping a segmental duplication. Similar distance to CTG.

Going further

1. Control for the SD and CTG patterns. 2. Look at other repeat classes.

Control regions Randomly distributed. Same size distribution.

CNV patterns in normal genomes 16 Going further

1. Control for the SD and CTG patterns. 2. Look at other repeat classes.

Control regions Randomly distributed. Same size distribution. Same proportion overlapping a segmental duplication. Similar distance to CTG.

CNV patterns in normal genomes 16 Controlling for SD and distance to CTG

CNV patterns in normal genomes 17 Controlling for SD and distance to CTG

CNV patterns in normal genomes 17 Controlling for SD and distance to CTG

Satellites enrichment driven by ALR/Alpha, (GAATG)n/(CATTC)n families.

Short Tandem Repeats Enrichment distributed across families...... but stronger for larger STR.

Transposable elements (TE): SVA class enriched. Expected: L1HS, L1PA2 to L1PA5. Surprises: HERVH, LTR38, LTR4.

CNV patterns in normal genomes 18 Repeat CNVs and protein-coding

Genes with CNVs Set CNVs Exon + Promoter + Intron All CNVs 91733 7206 11341 13259 Low coverage 26888 682 1151 1977 Extremely low coverage 10010 347 465 521 STR 4286 45 286 748 Satellite 1822 2 21 33 TE 20491 164 1747 3998 STR/Satellite/TE 22313 166 1760 4014

Repeat CNV: more than 90% of the CNV is annotated as repeat.

CNV patterns in normal genomes 19 Conclusion

Conclusion 20 In normal genomes: CNVs enriched in low coverage regions. Specific enrichment in satellites, simple repeats, TEs.

Not due to segmental duplication enrichment. Replicated across datasets but different from somatic patterns.

Some CNVs in low coverage regions or repeats hit exonic sequence.

Summary

PopSV uses reference samples. detects more CNVs. is robust across the entire genome.

Conclusion 21 Summary

PopSV uses reference samples. detects more CNVs. is robust across the entire genome.

In normal genomes: CNVs enriched in low coverage regions. Specific enrichment in satellites, simple repeats, TEs.

Not due to segmental duplication enrichment. Replicated across datasets but different from somatic patterns.

Some CNVs in low coverage regions or repeats hit exonic sequence.

Conclusion 21 Guillaume Bourque Simon Gravel Mathieu Bourgey Mathieu Blanchette Louis Letourneau Francois Lefebvre Eric Audemard Toby Hocking Simon Girard

Patrick Cossette Guy Rouleau Caroline Meloche 23 Workflow

24 Replication in twins

25 Robust across challenging regions

1.00 1.00

0.75 0.75

PopSV set PopSV set 0.50 call 0.50 call null null

0.25 0.25 proportion of regions with concordant samples proportion of regions with concordant samples

0.00 0.00

low expected high [0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8] (0.8,1] coverage class GC content

1.00 1.00

0.75 0.75

PopSV set PopSV set 0.50 call 0.50 call null null

0.25 0.25 proportion of regions with concordant samples proportion of regions with concordant samples

0.00 0.00

[0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8] (0.8,1] [0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8] (0.8,1] segmental duplication proportion simple repeat proportion 26 Robust across challenging regions

PopSV

0.8

0.6

0.4

0.2

0.0 ● ● ● ● ● ● ● ● ● ● other5 other1 other3 other2 other4 1652−Twin1 1652−Twin2 1480−Twin2 1480−Twin1 1389−Twin1 1389−Twin2 1207−Twin1 1207−Twin2 1286−Twin1 1286−Twin2 1301−Twin1 1301−Twin2 1323−Twin1 1323−Twin2 1443−Twin2 1443−Twin1 1121−Twin1 1121−Twin2 1490−Twin1 1490−Twin2 1652−Father 1207−Father 1286−Father 1389−Father 1301−Father 1480−Father 1323−Father 1443−Father 1121−Father 1490−Father 1652−Mother 1480−Mother 1389−Mother 1207−Mother 1286−Mother 1301−Mother 1323−Mother 1443−Mother 1121−Mother 1490−Mother sample

Father ● Mother Twin

family ● 1121 ● 1207 ● 1286 ● 1301 ● 1323 ● 1389 ● 1443 ● 1480 ● 1490 ● 1652

Using only CNVs in extremely low coverage regions !

27 Resolution - 500 bp bins Vs 5 Kbp bins

1.00

0.75

0.50

0.25 proportion 5kbp−bin calls overlapping

0.00

0 2500 5000 7500 10000 12500 15000 17500 20000 size of the 500bp−bin call

28 Control regions

QC − SD, low−coverage and CTG distance control

1.00

0.75

set 0.50 CNV control

0.25 proportion the feature overlapping

0.00

lowmap segdup feature

29 Control regions

QC − SD, low−coverage and CTG distance control

1.00

0.75

0.50 cumulative proportioncumulative

0.25

region CNV

0.00 control

0e+00 2e+07 4e+07 6e+07 distance to centromere/telomere/gap (bp)

30 Control regions

S/2 S/2 S/2 S/2

S/2 S/2 Random region of size S

=

Random base in green

S

31 Controlling for SD and distance to CTG

SINE SVA TE LTR LINE DNA

SVA_F SVA_E SVA_D MER65A

LTR4 TE top families LTR38−int L1PA5 L1PA4 L1PA3 L1PA2 L1HS HERVH−int AluY Twins CK Normal GoNL CK Somatic cohort

Significance (−log10 Pvalue) 4 8 12

Depleted Enriched

32 Controlling for SD and distance to CTG

TAR1 SUBTEL_sa SST1 SATR2 SATR1 SAR REP522 MSR1

LSAU Satellite HSATII HSATI HSAT6 HSAT5 HSAT4 GSATX GSATII GSAT D20S16 CER BSR/Beta ALR/Alpha ACRO1 (GAATG)n (CATTC)n other TTTC TTCT TTCC TCTT TCTA TCCT TATC TAGA TA GGAA GCG

GATA STR GAAG GAAA CTTT CTTC CTG CGC CCTT CAG ATCT ATAG AT AGGA AGAT AGAA AAGG AAGA AAAG Twins CK Normal GoNL CK Somatic cohort

Depleted Enriched

Significance (−log10 Pvalue) 0 4 8 12

33 Distance to CTG for somatic CNVs

1.00

0.75

0.50 cumulative proportioncumulative

0.25

region CNV

0.00 control

0e+00 2e+07 4e+07 6e+07 distance to centromere/telomere/gap (bp)

34