Genome Structural Variation

Evan Eichler Howard Hughes Medical Institute University of Washington

January 18th, 2013, Comparative Genomics, Český Krumlov

Disclosure: EEE was a member of the Pacific Biosciences Advisory Board (2009-2013) Genome Structural Variation

Deletion Duplication Inversion Genetic Variation Types. Sequence • Single base-pair changes – point mutations • Small insertions/deletions– frameshift, microsatellite, minisatellite • Mobile elements—retroelement insertions (300bp -10 kb in size) • Large-scale genomic variation (>1 kb) – Large-scale Deletions, Inversion, translocations – Segmental Duplications • Chromosomal variation—translocations, inversions, fusions.

Cytogenetics Introduction • Genome structural variation includes copy- number variation (CNV) and balanced events such as inversions and translocations—originally defined as > 1 kbp but now >50 bp • Objectives 1. Genomic architecture and disease impact. 2. Detection and characterization methods 3. Primate genome evolution

Nature 455:237-41 2008

Nature 455:232-6 2008 Perspective: Segmental Duplications (SD) Definition: Continuous portion of genomic sequence represented more than once in the genome ( >90% and > 1kb in length)—a historical copy number variation

Intrachromosomal

Interchromosomal Distribution

Interspersed

Configuration Tandem Importance: Structural Variation A B C TEL

A B C TEL

Non Allelic Homologous Recombination GAMETES NAHR

A B C A B C TEL

TEL

Human Disease Triplosensitive, Haploinsufficient and Imprinted Calvin Bridges Alfred Sturtevant H.J. Mueller Importance: Evolution of New Function

Acquire New/ Modified Function

GeneA GeneA’

Maintain old Loss of Function Function Segmental Duplication Pattern chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 •~4% duplication (125 Mb) chr12 chr13 • >20 kb, >95% chr14 chr15 •59.5% pairwise (> 1 Mb) chr16 •EST rich/ “gene” rich chr17 chr18 •Associated with Alu repeats chr19 chr20 chr21 chr22 chrX chrY She, X et al., (2004) Nature 431:927-30 http://humanparalogy.gs.washington.edu

Mouse Segmental Duplication Pattern

•118 Mb or ~4% dup • >20 kb, >95%

• 89% are tandem •EST poor •Associated with LINEs

She, X et al., (2008) Nature Genetics Human Segmental Duplications Properties

• Large (>10 kb) • Recent (>95% identity) • Interspersed (60% are separated by more than 1 Mb) • Modular in organization • Difficult to resolve

Model #1: Rare Structural Variation

A B C TEL

A B C TEL

NAHR GAMETES

A B C A B C TEL

TEL

Human Disease Triplosensitive, Haploinsufficient and Imprinted Genes •Genomic Disorders: A group of diseases that results from genome rearrangement mediated mostly by non-allelic homologous recombination. (Inoue & Lupski , 2002). DiGeorge/VCFS/22q11 Syndrome

1/2000 live births 180 phenotypes 75-80% are sporadic (not inherited) Human Genome Segmental Duplication Map •130 candidate regions (298 Mb) •23 associated with genetic disease •Target patients array CGH

Bailey et al. (2002), Science 17

Chromosome 15

Chromosome 15 Genome Wide CNV Burden (15,767 cases of ID,DD,MCA vs. 8,328 controls)

Developmental Delay Cases Controls

~14.2% of genetic cause of developmental delay

Percentage of Population of Percentage explained by large CNVs (>500 kbp)

Minimum Size of CNV (kbp) Cooper et al., Nat. Genet, 2011 Model #2: Copy Number Polymorphisms and Disease

Gene Type Locus Seg. Dup Phenotype GSTT1 Decrease 22q11.2 54.3 kb halothane/epoxide sensitivity GSTM1 Decrease 1p13.3 18 kb toxin resistance, cancer susceptibility CYP2D6 Increase 22q13.1 5kb antidepressant sensitivity CYP21A2 Increase 6p21.3 35 kb Congenital adrenal hyperplasia LPA Decrease 6q27 5.5*n kb Coronary heart disease risk RHD Decrease 1p36.11 ~60 kb Rhesus blood group sensitivity C4A/B Decrease 6p21.33 32.8 kb Lupus (SLE) DEFB4 Decrease 8p23.1 ~310 kb Crohn Disease DEFB4 Increase 8p23.1 ~310 kb Psoriasis

Increase

Decrease • Disease CNPs enriched within duplicated sequences. Structural Variation and Enriched Gene Functions Unique regions Duplicated regions

Drug detoxification: glutathione-S-transferase, cytochromeP450, carboxylesterases Immune response and inflammation: Natural killer-cell receptors, defensin, complement factors Surface integrity genes: mucin, late epidermal cornified envelope genes, galectin Surface antigens: melanoma antigen gene family, rhesus antigen

•Environmental interaction and cell-cell signaling molecules enriched Cooper et al., 2007 Copy-Number Detection is not Sufficient!

Color-Blindness in Humans: The Opsin Loci

•Normal phenotypic variation •Red-green color vision defects,X-linked •8% of males and 0.5% females. NEur.

Deeb, SS, Clin. Genet, 2005 Common and Rare Structural Variation are Linked 17q21.31 Deletion Syndrome

Chromosome 17

A B C TEL

A B C TEL

TEL 17q21.31 Inversion

A B C Chromosome

C B A

Inverted • Region of recurrent deletion is a site of common inversion polymorphism in the human population • Inversion is largely restricted to Caucasian populations – 20% frequency in European and Mediterranean populations • Inversion is associated with increase in global recombination and increased fecundity

Stefansson, K et al., (2005) Nature Genetics A Common Inversion Polymorphism

Direct Orientation allele (H1) Inverted orientation allele (H2) •Tested 17 parents of children with microdeletion and found that every parent within whose germline the deletion occurred carried an inversion •Inversion polymorphism is a risk factor for the microdeletion event Duplication Architecture of 17q21.31 Inversion (H2) vs. Direct (H1) Haplotype

H1

H2

Del break Inversion break •Inversion occurred 2.3 million years ago and was mediated by the LRRC37A core duplicon •H2 haplotype acquired human-specific duplications in direct orientation that mediate rearrangement and disrupts KANSL1 gene Zody et al., Nat. Genet. 2008, Itsara et al., Am J. Human Genet 2012 Structural Variation Diversity Eight Distinct Complex Haplotypes

San Meltz-Steinberg et al., Boettger et al., Nat. Genet. 2012 Summary • Human genome is enriched for segmental duplications which predisposes to recurrent large CNVs during germ-cell production • 15% of neurocognitive disease in intellectual disabled children is “caused” by CNVs—8% of normals carry large events • Segmental Duplications enriched 10-25 fold for structural variation. • Increased complexity is beneficial and deleterious: Ancestral duplication predisposes to inversion polymorphism, inversion polymorphisms acquires duplication, haplotype becomes positively selected and now predisposes to microdeletion

Genome-wide SV Discovery Approaches

Hybridization-based Sequencing-based • Iafrate et al., 2004, Sebat et al., • Read-depth: Bailey et al, 2002 2004 • Fosmid ESP: Tuzun et al. 2005, • SNP microarrays: McCarroll et Kidd et al. 2008 al., 2008, Cooper et al., 2008, • Sanger sequencing: Mills et al., Itsara et al., 2009 2006 • Array CGH: Redon et al. 2006, • Next-gen sequencing: Korbel et Conrad et al., 2010, Park et al., al. 2007, Yoon et al., 2009, 2010, WTCCC, 2010 Alkan et al., 2009, Hormozdiari et al. 2009, Chen et al. 2009; Mills 1000 Genomes Project, Single molecule analysis Nature, 2011  Optical mapping: Teague et al., 2010 Array Comparative Genomic Hybridization

Array of DNA Molecules

Test individual Hybridization Normal reference DNA Sample DNA Sample

Cy5 Channel Cy3 Channel

12 mm Merge

One copy gain = log2(3/2) = 0.57 (3 copies vs. 2 copies in reference) One-copy loss = log2(1/2) = -1

SNP Microarray detection of Deletion (Illumina)

A- or B-

Allele Frequency Allele

- and B and AB AB

LogR ~55 kbp

Human chromosome 3 position

SNP Microarray detection of Duplication (Illumina)

Allele Frequency Allele -

ABB and B and

or AAB

AB AB LogR

Human position Clone-Based Sequence Resolution of Structural Variation

Human Genomic DNA

Genomic Library (1 million clones)

Sequence ends of genomic inserts & Map to human genome Inversions Concordant Insertion Deletion

Fosmid

> < > < > < < <

Build35 Dataset: 1,122,408 fosmid pairs preprocessed (15.5X genome coverage) 639,204 fosmid pairs BEST pairs (8.8 X genome coverage) Genome-wide Detection of Structural Variation (>8kb) by End-Sequence Pairs a) b) Insertion

< 32 kb >48 kb Deletion Putative Putative Insertion Deletion

Inversion c) discordant by orientation (yellow/gold) discordant size (red)

duplication track

Tuzun et al, Nat. Genetics, 2005; Kidd et al., Nature, 2008 Experimental Approaches Incomplete (Examined 5 identical genomes > 5kbp)

Fosmid ESP Array CGH Clone Conrad et al. sequencing N=1,128 283 634 Kidd et al. 790 N=1,206 278

128 132 130 84 5 76 5 25

McCarroll et al. N=236 Affymetrix 6.0 SNP Microarray Kidd et al., Cell 2010 Next-Generation Sequencing Methods

• Read pair analysis – Deletions, small novel insertions, inversions, transposons – Size and breakpoint resolution dependent to insert size • Read depth analysis – Deletions and duplications only – Relatively poor breakpoint resolution • Split read analysis – Small novel insertions/deletions, and mobile element insertions – 1bp breakpoint resolution • Local and de novo assembly – SV in unique segments – 1bp breakpoint resolution

Alkan et al., Nat Rev Genet, 2011 Computational Approaches are Incomplete 159 genomes (2-4X) (deletions only)

Read-Pair Read-Depth

6855 (63%) 486 3223 (80%)

4 3250

1772 (33%) Split-read

Mills et al., Nature 2011 Challenges • Size spectrum—>5 kbp discovery limit for most experimental platforms; NGS can detect much smaller but misses events mediated by repeats. • Class bias: deletions>>>duplications>>>>balanced events (inversions) • Multiallelic copy number states—incomplete references and the complexity of repetitive DNA • Exome vs. Genome • False negatives.

Using Sequence Read Depth • Map whole genome sequence toS reference genome – Variation in copy number correlates linearly with read-depth • Caveat: need to develop algorithms that can map reads to all possible locations given a preset divergence (eg. mrFAST, mrsFAST) Illumina Sequence

Reference Sequence Celera’s Sequence to Test 27.3 million reads NG Random Genome Sample

unique duplicated

Bailey et al., Science, 2002 Personalized Duplication or Copy-Number Variation Maps

Venter (Sanger) ‏ CNP#1

Watson (454) ‏

NA12878 (Solexa) ‏ CNP#2

NA12891 (Solexa) ‏

NA12892 (Solexa) ‏

•Two known ~70 kbp CNPs, CNP#1 duplication absent in Venter but predicted in Watson and NA12878, CNP#2 present mother but neither father or child Alkan, Nat. Genet, 2009 ! "#"$%&'%( ) *+ , - #. /0%1, 234%5 '%67#8+ , - . /0%6, 9"%: , + ; 4"<<. 0%= 72>%( , + ; , ?@0%A- B, %C?, <"- >3@0%D , 7>, %D , <7E. 0%F$, - 2"?2, %A- #3- , 227. 0%: , - %A<>, - . 0%. GGG%H"- 3+ "?%! $3I"2#% : 3- ?3$9) + 0%1, B%( J "- *) $". !%, - *%KL, - %K'%K72J <"$%. !'! => #0/)$. #5$!&?!@#5&. #!, 13#51#(A!B53; #)(3$8!&?!C /(%356$&5!/5' !$%#!D&< /)' !D96%#(!" #' 31/2!E5(7$9$#A!, #/F 2#A!C +!!G+632#5$!-#1%5&2&63#(A!, /5$/!H2/)/A!H+!!I - %#(#!/9$%&)(!1&5$)349$#' !#J 9/228!$&!$%3(!< &)K!

+4($)/1$% : #(92$(% STU=VVW!1&51&)' /51#!?&)!' 90231/7&5(!/5' !' #2#7&5(! •!2 ' %) ' 3 04*5&( 5' %&' ( ) 6) ' $57%#( *' ) %8' 4051$948%0:%( #*0;<5' %/0$1%4<3 #' &% D36%28!($)/7P#' !0&092/7&5!(0#13P1!(#6. #5$/2! ' #$#1$#' !48!, XM!6#5&$80356Y!!Z[ W!&?!)#63&5(!%/; #!\! ( /&0**%=>=%*' ?<' 4/' ) %7<3 ( 4%8' 403 ' *% ' 90231/7&5(!/)#!/119)/$#28!' #$#1$#' !/1)&((!4&$%!2&< ! ZVW!1&51&)' /51#!< 3$%!/))/8!H@D!4/(#' !6#5&$80356Y!

•!@( ;9) ( A04%B 957%:0<&%0&570;080<*%$;( C0&3 *%) ' 3 04*5&( 5' *%7987%/04/0&) ( 4/' % /5' !%36%!1&; #)/6#!6#5&. #(! F*%LMN%3 9/&0( &&( 1%OP5*( &( %) *(+,-(. //01% F*%( &&( 1%, - Q%%O, 04&( ) %) *(+,-(. /2/1%

•!2 ' %9) ' 4A:1%&' 8904*%0:%$0$<;( A04%*5&( AD/( A04%B 95794%*' 83 ' 45( ;% 140 =QJ G=! 120

) <$;9/( A04*%( 4) %*70B %57( 5%7987;1%*5&( AD' ) %8' 4' %:( 3 9;9' *%( &' %' 4&9/7' ) %:0&% %

5 4

( 100

*' 83 ' 45( ;%) <$;9/( A04*% )

&

%

0 5

/ 80

t

4

n

4

u

<

o

0

c 0

•!2 ' %9) ' 4A:1%57' %3 0*5%$0$<;( A04%*5&( AD' ) %8' 4' *%:( 3 9;9' *%94%57' %7<3 ( 4% / % / 60 8' 403 ' E%( ;;%0:%B 79/7%0F' &;( $%*' 83 ' 45( ;%) <$;9/( A04*% _ 40

20 " #$%&' (! 0 0.0 0.2 0.4 0.6 0.8 1.0 frac_correct *9`' % :&( /A04%/04/0&) ( 45% " )(*+, - !. /0(!#/1%!)#/' !$&!/22!0&((342#!. /00356!2&1/7&5! 35!$%#!6#5&. #! , #J 9#51356!4/(#' !1&08!59. 4#)!#(7. /7&5(!1/0$9)#!$%#! 0&092/7&5!($)91$9)#!&?!$%#!%36%28!($)/7P#' !6#5#!: : MNM. %

! "#$%&' ( ) *% qPCR based copy number estimates Sequencing based copy number estimates 25 ! "#$%&' ( 20 20

15

LO$)#. #! 15 population population

! t

t Asian Asian

n

n u 0&092/7&5! u

D 10 o

SUPPORTING FIGURES o European European c 10 c M ($)/7P1/7&5!&?! Yoruba Yoruba L $%#!L9)&0#/5!

H 5 Copy number from short read depth 5 (0#13P1!=QJ G=! +, - , - . , - , . , - , +. , - +, . ++++- , , . +, - , - , - , . +, - +, +. , +, - +- , - , - , , - , - - , , , - , - , - , . +. +. . +. ++. +. +, - +, . +- , +. , +- , +. , +- , - , - , . +, - +, - +, . % ' 90231/7&5!! 0 0 • Map reads to reference with mrsFAST 0 2 4 6 8 10 12 14 0 2 4 6 8 10 12 14 rel_log2 REGRESS_1000 , #6. #5$/2!>90231/7&5(! /0$1% /0$1% – Records all placements for each read X+=Z] VQ!>#0$%! H&; #)/6#! >3($/2!/' R/1#5$! - %#!. &($!0&092/7&5!($)/7P#' !6#5#!?/. 323#(!&; #)2/0!

! ' 90231/$#' !)#63&5!

– http://mrsfast.sourceforge.net 5

+;G( 4%) *(+,H%I J J K% (%&< (!351)#/(#' ! (#6. #5$/2!' 90231/7&5(% / 3 1&08!35!+(3/5(!

( Figure S1. The number of genomes from each population as a function of sequence coverage. • Per-library QC, (G+C)-bias correction + (#6. #5$/2! H&08!59. 4#) !3(!/119)/$#28!#(7. /$#' !48!)#/' !' #0$%! ' 90231/7&5! • Train estimator using depths at regions of 6#5#!?/. 328 6#5#!2#56$% &; #)2/0 ^ ($ : #/' !' #0$%!1&))#2/$#(!0&(37; #28!< 3$%!1&08!59. 4#)! RPRS T! UT ! ! >U J H"! known, invariable copy . V, =W =J K>J =J K>J J H>!

1000 . SX" ! "TI ! "TI J H>I !

• 1 kbp-windowed CN genomewide heatmap / MNYNNL KI =KK "I KK! J H>

800 4

! NWYTWPN =I T! =K ! TI U= J HTK

9

%

) $

0 600 Z - . I V ! =! T! ! =! T! J HT[

&

#

t

i

'

f

N !

' YLNM I UKTJ I UKTJ J HT"

/ 400 # ) WQ\TJ N TI K>I TI K>I J HT" 200 R] , "TT=[ I I I >! I I >! J HT> Figure S2. Multiplicative G+C correction for a high coverage genome NA18507. (a) Average read depth for specific G+C% is shown at blue points and fitted with orange line. Total average read depth (µtotal) is shown as a R- +RLK =[ "[ ! =[ "[ ! J HTT red dotted line. (b) The correction factor kgc is shown (truncated at 8) for each individual %(G+C) bin. (c) The 0 fraction of the genome at specific G+C% is shown as a histogram, in green, and overlaid in orange is the log transform of this histogram. , , R! =UUU =UUU J HTT 0 10 20 30 cp 1&08!59. 4#)! NSLL ! >U" ! "J K J HTI WQ^SR= TUK> ! T! T J HT=

CN Page 24 Individuals

Figure S3. (G+C)-corrected read depth and copy number are linearly correlated.

Figure S4. Correlation (mu_corr) of read depth to copy in control regions subsampled at different coverages (genome NA18507).

Page 25 Interphase FISH Read-Depth CNV Heat Maps vs. FISH Copy Number

9 8 7 6 5 4 3 2 1

•72/80 FISH assays correspond precisely to read-depth prediction (>20 kbp) •80/80 FISH assays correspond precisely to+/- 1 read-depth prediction 17q21 MAPT Region for 150 Genomes

CEPH 71% of Europeans carry at least European Partial duplication distal (17q21 associated)—all inversions carry the duplication

24% of Asians are hexaploid for Asian NSF gene N- ETHYLMALEIMIDE- SENSITIVE FACTOR potentially important in synapse membrane fusion; NSF (decreased expression Yoruba in schizophrenia brains (Mimics, 2000), Drosophila mutants results in aberrant synaptic transmission)

Sudmant et al., 2010, Science Read-Depth vs. Quantitative PCR

CCL3L1—chemokine ligand 3-like (1.9 kbp)

• Tested 155 genomes read-depth (1-2 X coverage) vs. QPCR • r2=0.93 between sequence and quantitative PCR estimates Unique Sequence Identifiers Distinguish Copies

copy1 ATGCTAGGCATATAATATCCGACGATATACATATAGATGTTAG… copy2 ATGCTAGGCATAGAATATCCGACGATATACATATACATGTTAG… copy3 ATGCTACGCATAGAATATCCCACGATATACATATACATGTTAG… copy4 ATGCTACGCATATAATATCCGACGATATAC--ATACATGTTAG.

• Self-comparison identifies 3.9 million singly unique nucleotide (SUN) identifiers in duplicated sequences • Select 3.4 million SUNs based on detection in 10/11 genomes=informative SUNs=paralogous sequence variants that are largely fixed • Measure read-depth for specific SUNs--genotype copy-number status of specific paralogs NBPF Gene Family Diversity CNV Detection by Exome Read-Depth A 16p13.1 Duplication

Discard first 10-12 components of variance

Krumm et al.,Genome Res., 2012 Detecting Smaller CNVs

• 5-fold increased sensitivity for CNVs <= 10 kbp than high density SNP microarray.

CoNIFERsoftware: http://conifer.sourceforge.net/index.html Going Forward

1) Focus on comprehensive assessment of genetic variation— NGS are indirect and do not resolve structure by provide specificity and excellent dynamic range response. 2) High quality sequence resolution of complex structural variation to establish alternate references/haplotypes—often show extraordinary differences in genetic diversity 3) Technology advances in whole genome sequencing “Third Generation Sequencing”: Long-read sequencing technologies with NGS throughput in order to sequence and assemble genomes de novo WG Sequencing Recent Gene Duplicates is difficult . Public Build34 25000

20000

15000 Inter Intra

10000

5000

0

91 92 95 96 97 98

90 93 94 99

100

91.5 95.5 96.5 97.5 98.5

90.5 94.5

92.5 93.5 99.5

Celera (WGSA) Percent Identity (%) Sum of Aligned Bases Bases Aligned (kb) of Sum 12000 Inter 10000 Intra 8000

6000 4000

2000

0

99

92 90 91 95 96 97 98

94 93

100

99.5

90.5 91.5 95.5 96.5 97.5 98.5

94.5

92.5 93.5 Percent Identity (%) She et al. Nature, 2005: Shorter-Read Technologies further Limit.

• SOAP-de novo Assembly YH—93% of SDs missing • Subsequent improvements in algorithms, llumina read length, reads from longer inserts, fosmid pools all improve continuity but leave 75-81% of SDs missing or mis-assembled Alkan et al. Nat. Methods, 2011, Chaisson et al unpublished Single-Molecule Real-Time Sequencing (SMRT)

Long reads no cloning or amplifciation but lower throughput and 15% error rate HGAP and QUIVER

https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP

Chin et al. Nat. Methods, 2013 A Simple Experiment

BAC Tiling Path

Seg Dup Organization

• Select tiling path of BAC previously sequenced using Sanger and corresponding to region of Complex SD • Sequence each clone (~200 fold) using on average 1 SMRT Cell and assemble using HGAP and QUIVER • Compare Sanger and Pacbio assembly using BLASR Results

• Accurate (QV>45) assembly of complex region of human genome by BAC– 125 differences—31/44 favor PacBio over Sanger • Most differences are indels but one large scale collapse of 20 kbp region to 12 kbp

Huddleston et al. Genome Res, 2014 Strategy for Resolving Complex Regions

Select Clones by BAC-end Sequence Data

Nextera Illumina Sequencing (96-well format)

PacBio Sequencing of Tiling Path Whole Genome Sequencing with PacBio • CHM1—complete hydatidiform mole (CHM1)- “Platinum Genome Assembly” • 10X Sequence coverage using RSII P5/C3 chemistry

http://datasets.pacb.com/2013/Human10x/READS/index.html Validated Breakpoint-Resolved Deletions

114.2 kb AF4.9R_CENTRAL kb 20 10 0 AFR_WEST 125 100 75 50 25 0 ASN population 300 AFR_CENTRAL 200 100 AFR_WEST

t 0 n

u EUR_NORTH ASN o

c 20 EUR_NORTH 15 10 EUR_SOUTH 5 0 SAN EUR_SOUTH 20 10 0 SAN

40 20 0 0.0 0.5 1.0 1.5 2.0 2.5 chr3.162512134.162626350 Inversions: Single Molecule Detection Transitioning into the Centromeric Satellite • Single 31.8 kb read mapping to edge of centromere on chromosome 16: • HSAT2RS anchor and extends 25 kbp into centromeric • Site of extensive copy number polymorphism and potential hotspot for rearrangements associated with cancer • Data suggest that 5 bp HSAT2 is organized into a 2.8 kbp HOR Summary • Approaches – Multiple methods need to be employed—Readpair+Read- depth+SplitRead and an experimental method – Tradeoff between sensitivity and specificity – Complexity not fully understood • Read-pair and read-depth NGS approaches – narrow the size spectrum of structural variation – lead to more accurate prediction of copy-number – unparalleled specificity in genotyping duplicated genes (reference genome quality key) • Third generation sequencing methods hold promise but require high coverage

Why? chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 •Ohno—Duplication is the chr12 chr13 primary force by which new chr14 chr15 gene functions are created chr16 chr17 •There are 990 annotated chr18 chr19 genes completely contained chr20 within segmental duplications chr21 chr22 chrX chrY Duplication Acceleration in Human Great Ape Ancestor Mbp of Overlap Chimp (16.58) Human (21.52) (17.85) (26.30)

(11.26) (14.01) (0.27) (11.54) (0.22) (1.31) (1.39)

SDs > 20Mbp kb (23.33) Orangutan Mbp/million years •A 3-4 fold excess in de novo segmental duplications in common ancestor of human, chimp and gorilla but after divergence from orangutan •Not a continuous accumulation Marques-Bonet et al., Nature, 2009; Ventura et al., Genome Res. 2011

Great Ape Genome Diversity Project

• Deep genome sequencing of 79 wild and captive born great apes (6 species and 7 subspecies) and 10 human genomes • 167 Mbp (83.6 million SNPs and 84.0 fixed SNVs) • 469 Mbp affected by copy number • 745 CNV; 1080 indels; 806 SNVs affect gene structure

Prado and Sudmant et al., Nature, 2013 chr5 a)

chr6

chr5 a)

chr5 a) chr6 Ape Segmental Duplication Patterns chr6b) b) chr5 chr5 b) chr5 chr6 Human Chimpanzee Bonobo

Gorilla chr6 0 2 4 6 8 10+ c) copy Orangutan fixed deletions

8.73 1.23 3.58 0.68(Mbp)

Gorilla-Chimpanzee-Human Chimpanzee-Human Human Human segregating 0 2 4 6 8 10+ c) copy Toscani chr6fixed deletions CEPH Eurpeans Finnish British 8.73 1.23 3.58 0.68(Mbp) Puerto Rican Columbian Mexican Gorilla-Chimpanzee-Human Chimpanzee-Human Human Human segregating Beijing Han Chinese Southern Han Chinese Japanese Yoruban Toscani Luhya CEPH Eurpeans African American Finnish Chimpanzee British Bonobo Puerto Rican Gorilla Columbian Bornean Orangutan Mexican Sumatran Orangutan Beijing Han Chinese Southern Han Chinese Japanese 0 Yoruban0.2 0.4 0.6 0.8 1 Luhya African Americanfrequency of locus Chimpanzee Bonobo Gorilla Bornean Orangutan Sumatran Orangutan

0 0.2 0.4 0.6 0.8 1 0 2 4 6 8 10+ c) frequency of locus copy fixed deletions

8.73 1.23 3.58 0.68(Mbp)

Gorilla-Chimpanzee-Human Chimpanzee-Human Human Human segregating

Toscani CEPH Eurpeans Finnish British Puerto Rican Columbian Mexican Beijing Han Chinese Southern Han Chinese Japanese Yoruban Luhya African American Chimpanzee Bonobo Gorilla Bornean Orangutan Sumatran Orangutan

0 0.2 0.4 0.6 0.8 1 frequency of locus Rate of Duplication

p=9.786 X 10-12 Sudmant PH et al. , Genome Res. 2013 Rate of Deletion

*p=4.79 X 10-9 Mosaic Architecture

11q14 11q14 10q26 11p15 7q36 22q12 4p16.1 21q21 4p16.1 12p11 4q24 Xq28 12q24 7q36 4p16.3 2p22 Duplicons

Primary Duplicative Transpositions

Duplication 2p11 Blocks 0 100 kbp Secondary Block Duplications

16p 15q

•A mosaic of recently transposed duplications •Duplications within duplications. •Potentiates “exon shuffling”, regulatory innovation Human Chromosome 16 Core Duplicon

•The burst of segmental duplications 8-12 mya corresponds to core- associated duplications which have occurred on six human (chromosomes 1,2, 7, 15, 16, 17) •Most of the recurrent genomic disorders associated with developmental delay, epilepsy, intellectual disability, etc. are mediated by duplication blocks centered on a core.

100 kbp LCR16a Jiang et al, Nat. Genet., 2007 Increasing Duplication Complexity and Recurrence

Human Baboon Orangutan

31 * * 29 1 30 * 16 13 PPY 29 * 16p13.3 13p13 3 2 13p12 16 16p13.2 4 PHA 29* 13p11.2 16p13.13 * 16 5 * PPY 5 PPY i PHA 31* 13p11.1 6 PPY h 13.3 16p13.11 PPY a3 7 PHA 5* PPY 7j 13q12.11 13.2 8 PPY a2 PHA 30* 16p12.2 * 13q12.13 PPY a1 9 10 PPY 27 13.1 PPY c 11 13q12.3 PHA 27* 16p11.2 PPY b 12 12 13q13.2 * PHA 13* PPY e 13 16p11.1 14 PPY d 15 PHA 19* 16q11.1 11.2 16 27 * 13q14.11 18 16q11.2 13q14.13 11.1 19 * 11.1 20 13q14.3 16q12.1 11.2 21 16q12.2 13q21.2 22 12.1 16q13 13q21.32 12.2 13 16q21 13q22.1 23 21 13q22.3 26 16q22.1 16q22.2 13q31.2 22 24 13q32.1 16q23.1 23 13q32.3 * 16q23.3 28 PHA 28* 13q33.2 24 31 * * PPY g PPY 28 13q34 PHA 31* 16q24.3 * Ancestral Locus •Duplication blocks have become increasingly more complex (more duplicons) and have expanded in an interspersed fashion over the last 25 million years. •Duplication blocks of different flanking content with exception of core Johnson et al., PNAS, 2006 Core Expansion Model

Time

Locus 1 Locus 2 Locus 3 Locus 4 X Y A B D AA BB E F D AA B G

Core Lineage Ape/Human Duplicon Specific Shared Human Great-ape “Core Duplicons” have led to the Emergence of New Genes RANBP2 GCC2

TRE2 TBC1D USP32

NPIP

EVI5 DUF1220 NBPF P

DND1 BPTF LRR LRRC37A P P P

RGPD

Features: No orthologs in mouse; multiple copies in chimp & human dramatic changes in expression profile; signatures of positive selection Core Duplicon Hypothesis

The selective disadvantage of interspersed duplications is offset by the benefit of evolutionary plasticity and the emergence of new genes with new functions associated with core duplicons.

Marques-Bonet and Eichler, CSHL Quant Biol, 2008

Human-specific gene family expansions

Notable human-specific expansion of brain development genes. Neuronal cell death: p=5.7e-4; Neurological disease: p=4.6e-2 . Sudmant et al., Science, 2010 SRGAP2 function

• SRGAP2 (SLIT-ROBO Rho GTPase activating 2) functions to control migration of neurons and dendritic formation in the cortex • Gene has been duplicated three times in human and no other mammalian lineage • Duplicated loci not in human genome

Guerrier et al., Cell, 2009 SRGAP2 Human Specific Duplication

SRGAP2A

240 kb Human

SRGAP2B p12.1 ~2.4 Chimp mya q21.1 ~3.4 >555 kb mya

Orang q32.1 SRGAP2C

Dennis, Nuttle et al., Cell, 2012 SRGAP2C is fixed in humans (n=661 individual genomes) SRGAP2 duplicates are expressed

RNAseq

In situ SRGAP2C duplicate antagonizes function

Charrier et al., Cell, 2012 Sahelanthropus K. platyops Homo

Orrorin A. garhi A. anamensis A. africanus

Ardipithecus A. afarensis A. boisei A. aethiopicus A. robustus

Australopithecus ~350 cc ~1000 cc Homo habilis Summary • Interspersed duplication architecture sensitized our genome to copy-number variation increasing our species predisposition to disease—children with autism and intellectual disability • Duplication architecture has evolved recently in a punctuated fashion around core duplicons which encode human great-ape specific gene innovations (eg. NPIP, NBPF, LRRC37, etc.). • Cores have propagated in a stepwise fashion “transducing” flanking sequences---human-specific acquisitions flanks are associated with brain developmental genes. • Core Duplicon Hypothesis: Selective disadvantage of these interspersed duplications offset by newly minted genes and new locations within our species. Eg. SRGAP2C

Overall Summary • I. Disease: Role of CNVs in human disease—two models common and rare—a genomic bias in location and gene type • II. Methods: Read-pair and read-depth methods to characterize SVs within genomes—need a high quality reference—not a solved problem. • III: Evolution: Rapid evolution of complex human architecture that predisposes to disease coupled to gene innovation Disease

Evolution Eichler Lab

http://eichlerlab.gs.washington.edu/ genguest