<<

Supplementary Information for Comparative Analysis of Mammalian Y Illuminates Ancestral Structure and Lineage-Specific Evolution

Gang Li, Brian W. Davis, Terje Raudsepp, Alison J. Pearks Wilkerson, Victor C. Mason, Malcolm Ferguson-Smith, Patricia C. O’Brien, Paul Waters and William J. Murphy.

Figures 1-12 Pgs. 2-16

Tables 1-3 * Pgs. 17-19

Supplemental Methods Pgs. 20-24

*Supplementary Table 4 is attached as a separate Excel file

1

Figure 1. Copy number analysis of felid Y using qPCR and FISH

A. Quantitative PCR analysis in three domestic cats

13 12 110 11 100 10 90 9 80

8 Number 70 7 60

6 Copy 50 5 40 4 30 3

Estimated Copy Number 20 2 Estimated 10 1 0 0

B. FISH mapping of domestic cat multicopy cDNA probes on snow leopard (Panthera uncia) Y chromosomes

CUL4BY TETY1 TSPY FLJ36031Y

2

Figure 2. Summary of copy number and expression information for domestic cat and dog genes. Copy Expression Expression Copy number pattern pattern number biased biased 1 MC Testis specific Testis Ubiquitous Other tissue Testis Testis specific SHROOM2 Ubiquitous SHROOM2 1 MC Other tissue TETY2 TETY2 UTY UTY DDX3Y DDX3Y USP9Y USP9Y HSFY EIF1AY AMELY KDM5D ZFY EIF1AY EIF2S3Y BCORY1 ZFY CYorf15 UBE1Y KDM5D RBMYL DYNG Cat BCORY2 UBE1Y Dog HSFY OFD1

RPS4Y

CYorf15-fusion SRY SRY family CYorf15

TSPY & TSPYL families

CUL4BY1 CUL4BY2

3

Figure 3: Comparison of sequence conservation among 5 Y mammalian chromosomes (, chimpanzee, rhesus, dog, and cat) using the sequenced cat (A) and dog (B, next page) Y chromosome sequence as a reference. Multispecies sequence conservation scores are shown by blue columns (bottom panel), and pairwise sequence identity between X and Y chromosome sequence is shown by red columns (top panel). Green circles with numbers indicate X-Y orthologous regions, the order of which is shown on a schematic of the at the bottom. and transcript information is listed below each panel. Functional genes are shown in black, while pseudogenes are shown in orange. Red bars at the bottom of each panel indicate known gene exons. A: 0 1 2 0.1 3 0.2 0.3 4 0.4 0.5mb 100% 60% 1 0.8 0.6 0.4 0.2 0

PAR CINPLP SHROOM2 TETY2 UTY DDX3Y USP9Y

0.5 5 0.6 6 0.7 7 8 0.8 9 0.9 10 11 1.0mb 100% 60% C/D/H/P 1 ZFY 5’ 0.8 0.6 0.4 0.2 0

MBTPS2P RPL12P SHROOM2P SHROOM2P AMELY EIF1AY EIF2S3Y ZFY UBE1Y enFeLV KDM5D 1.0 12 1.1 1.2 1.3 1.4 1.5mb 100% 60%

1 0.8 0.6 0.4 0.2 0

enFeLV HSFY EEF1A1P AP1S2P HSFY enFeLV HSFY RPS4YL HSFY EEF1A1P RPS4YL HSFY

1.5 1.6 14 1.7 13 1.8 1.86mb 100% 60% 20% 1 0.8 0.6 0.4 0.2 0

RPS4Y B NEK6P HSFY CYorf15P SRY CYorf15_fusion CYorf15

10 12 1 5 2 13 7 6 8 4 3 9 11 14 Cat X chromosome

4

B: 0 1 0.1 2 0.2 0.3 3 0.4 0.5mb 100% 60% 20% 1 0.8 0.6 0.4 0.2 0 PAR SHROOM2 TETY2 UTY DDX3Y CASP6P USP9Y HSFY 4 0.6 5 6 0.7 0.8 7 0.9 0.5 8 1.0mb 100% 60% 20% 1 0.8 0.6 0.4 0.2 0

RPL23LP EIF1AY KDM5D ZFY BCORY1 CYorf15 RBMYL DYNG

8 1.1 9 1.2 10 1.3 1.4 1.0 11 1.5mb 100% 60% 20% 1 0.8 0.6 0.4 0.2 0 SRY SRY TRAPPC2P BCORY2 UBE1Y OFD1 Hnrpa1P 1.5 1.6 1.7 1.8 1.9 2.0mb 100% 60% 20% 1 0.8 0.6 0.4 0.2 0 SRY SRY SRY SRY SRY 2.0 2.1 2.2 2.3 2.4 2.5mb 100% 60% 20% 1 0.8 0.6 0.4 0.2 0

TSPY & TSPYL CUL4BY1 CUL4BY2 1 10 7 4 6 8 3 2 9 5 family Dog X chromosome 11

5

Figure 4. Triangular self dot plots of self-self DNA sequence identities within the dog and cat MSY sequences. The horizontal panel below each figure displays the average % self-self identity, based on a sliding window analysis (window size=50 nucleotides, distance between each window step=5 nucleotides). Sequences with 100 percent identity present a dot in the triangular dot plot. Horizontal and vertical lines represent direct and inverted repeats. The MSY schematic with scale is shown below the triangular plot with colored background representing the different gene regions. Gene bodies are represented within the colored blocks as directional arrows, drawn to scale.

dog

100% 80% 60% 40% 20% 0%

0 0.5 1.0 1.5 2.0 2.5(mb)

cat

Pseudoautosomal region

100% Single copy region 80% 60% 40% Ampliconic region 20% X-transposed 0% 0 0.5 1.0 1.5 1.9(mb)

6

Figure 5. A) Gene structure and expression profile (RNASeq) of the novel coding gene, DYNG, found on the domestic dog Y chromosome. B) Self-self blast result of the transcript identifies the internal repeats within the open reading frame.

Expression of dog Y chromosome novel gene (DYNG) A: 0 2 4 6 8 10 12 14 16 20 22kb

100

testis 50

0

brain 50

0

liver 50

0

lung 50

0

heart 50

0

blood 50

0

muscle 50

0 ATG TAA Gene structure 1 2 3 4 5 6 7 8 9 10

7

Figure 6. Evolution of the CYorf15 gene family in select mammals. A) Structural comparison of human CYorf15 gene family, with B) dog CYorf15 gene, cat CYorf15 fusion gene and the X chromosome gene Gamma taxilin, (TXLNG). Dashed arrows and boxes indicate regions of between the different gene family members, based on pairwise sequence comparisons.

Chromosome X Human TXLNG gene

exon 1 2 3 4 5 6 7 8 9 10

A: human CYorf15A gene human CYorf15B gene

1 2 3 4 5 6 7 8 9

Dog CYorf15 gene

Chromosome X

exon 1 Ancestral carnivore TXLNG gene 2 3 4 5 6 7 8 9 10

cat CYorf15-fusion gene 1 2 3 4 5 6

de novo X degenerate gene fusion origination region region

B: 1 2 3 4 5 6 7 cat EIF1AY gene

8

Figure 7. OFD1 gene tree estimated using maximum likelihood method under a GTR+G+I model. Numbers near each node on the tree are bootstrap values (500 pseudoreplicates: only values higher than 80 have been shown). Taxon names are denoted by (species common name)/(ENSEMBL transcript name or NCBI accession number)/(available chromosome information). The chimpanzee Y chromosome OFD1 pseudogene is denoted by the start and end locations within the published Y chromosome sequence.

Chimp pseudogene Y 2158661-2174171 99 Chimp pseudogene Y 12093450-12108956 100 Chimp pseudogene Y 1875170-1888918 100 Chimp pseudogene Y 5539194-5552933 Human ENSG00000226011 pseudogene 1 Y 100 Human ENSG00000226611 pseudogene 2 Y 100 Human ENSG00000232585 pseudogene 12 Y Human ENSG00000225287 pseudogene 13 Y 100 100 Human ENSG00000235511 pseudogene 18 Y Human Human ENSG00000237302 pseudogene 11 Y human and chimpanzee 100 Human ENSG00000238088 pseudogene 7 Y OFD1 pesudogenes on Y Human ENSG00000229406 pseudogene 4 Y chromosome Human ENSG00000231988 pseudogene 3 Y Human ENSG00000231159 pseudogene 8 Y 76 85 Human ENSG00000230476 pseudogene 9 Y 100 Human ENSG00000234888 pseudogene 15 Y 100 Human ENSG00000225466 pseudogene 10 Y 100 Human ENSG00000242153 pseudogene 6 Y Human ENSG00000240438 pseudogene 5 Y 100 Chimp pseudogene Y 7184466-7201836 Human ENST00000340096 chrX 98 99 Chimp ENSPTRT00000040191 chrX 88 Gorilla ENSGGOT00000006486 chrX Old and New World 96 100 Orangutan XM 002831405 chrX monkey OFD1 genes on Gibbon XM 003261047 chrX 95 Macaque ENSMMUT00000027270 chrX X chromosome 93 Marmoset ENSCJAT00000007661 Tarsier ENSTSYT00000010650 Bushbaby ENSOGAT00000002877 99 Mouse Lemur ENSMICT00000006746 Tree Shrew ENSTBET00000004188 95 Pika ENSOPRT00000015058 Guinea Pig ENSCPOT00000015451 Mouse AJ438159 ChrX 98 Squirrel ENSSTOT00000003307 Cow NM 001192637 chrX 100 cattle OFD1 genes Cow JN193532 chrY 100 100 Dolphin ENSTTRT00000006504 100 100 Pig XM 003134936 chrX Alpaca ENSVPAT00000004619 Shrew ENSSART00000006423 Hedgehog ENSEEUT00000009611 Dog XM 537958 chrX 100 Dog JX964865 chrY dog OFD1 genes 98 Panda XM 002927445 OFD1 100 97 Cat ENSFCAT00000014879 chrX Megabat ENSPVAT00000011753 99 Microbat ENSMLUT00000012689 100 Horse XM 001917181 chrX Armadillo ENSDNOT00000002102 100 Sloth ENSCHOT00000009231 99 Hyrax ENSPCAT00000000091 85 Elephant ENSLAFT00000014407 100 100 Lesser hedgehog ENSETET00000004345 100 Opossum ENSMODT00000021978 chr7 Platypus ENSOANT00000004149 Anole Lizard XM 003218852 chr3 Chicken XM 416831 chr1 Zebra Finch ENSTGUT00000008548 chr1 0.1

9

Figure 8. Results of positive selection test for the branch leading to domestic dog OFD1Y gene. A) OFD1 gene tree, with branch lengths based on non-synonymous substitution rates. B) Bayesian posterior probabilities for sites under positive natural selection (branch-site models comparison). Colored bars underneath indicate different structural domains. C) Results of likelihood ratio tests for two pairs of model comparisons.

dog OFD1 Y A: cow OFD1 Y dog OFD1 X pig OFD1 X

cow OFD1 X

panda OFD1 X

cat OFD1 X

horse OFD1 X elephant OFD1 X

chimpanzee OFD1 X rhesus OFD1 X human OFD1 X 0.1 mouse OFD1 X

C: B: Bayesian posterior probabilities of positive selection Results of likelihood ratio tests for dog OFD1Y branch 1 Models Significance

0.8 Two ratio/One ratio P<0.001

0.6 Branch-site models P<0.001

0.4

0.2

0

LisH domain Coiled coil

10

Figure 9. Location of the dog (top) and cat (bottom) pseudoautosomal boundaries (dashed vertical line), inferred from X-Y sequence similarity plots. Gene annotations are shown at the bottom of each figure.

6,578kb dog X chromosome 6,634kb 100 Percent identity to Dog X chromosome

95

90 dog 85 Pseudoautosomal boundary

80

75

SHROOM2 TETY2 dog Y chromosome

6,642kb cat X Chromosome 6,708kb 100 Percent identity to Cat X Chromosome 95

90 cat 85 Pseudoautosomal boundary

80

75

SHROOM2 TETY2 cat Y chromosome

11

Figure 10. A) Maximum likelihood tree of TETX2 orthologues sampled from different eutherian X chromosomes, estimated under a GTR+G+I model. B) Schematic showing the origination and divergence of the Y-linked TETY2 gene on dog and cat Y chromosomes from the X chromosome progenitor gene, resulting from several mechanisms, including PAB movement and exon loss and de novo exon gain.

gibbon rhesus human 79 TETX2 gene tree A northern greater galago 100

horse 87 93 mouse Gm6927 53 cow 98 100 rat LOC680489 100

100 dog 0.1 cat rat LOC503435 mouse Gm7157

PAR X specific X chromosome exon1 exon2 WWC 5’ B SHROOM2 3’ TETX2 gene

ancestor Carnivora PAR SHROOM2 3’ exon1 exon2 WWC 5’ Ancestral gene

PAR MSY

exon1 lost SHROOM2 3’ exon2 exon3 exon4 UTY 3’ X degenerate De novo origination Y chromosome TETY2 gene

12

Figure 11. X-Y dot plots show regions of sequence conservation (>97%). The blue box highlights additional remnants of SHROOM2 X-Y similarity, well inside the MSY and distant from the PAB, reflecting ancient chromosomal inversions that likely occurred on the ancestral felid branch. High similarity region (>97%) between cat X and Y chromosomes 120M

100M

80M

60M

40M

PAR similar to 5’ end region of SHROOM2 gene 20M

0

TETY2 EIF1AY UBE1Y CYorf15 DDX3Y fusion CYorf15

UTY USP9Y AMELY ZFY KDM5D SRY SHROOM2 3’ end EIF2S3Y HSFY Family RPS4Y Family

13

Figure 12. Pairwise sequence comparisons of four Y chromosomes (dog versus cat, rhesus, and human), using a BLAST-based method on repeatmasked sequences. Each dot represents a 50 nucleotide window with at least 50 percent identity between two species. Green boxes indicate the UTY/DDX3Y/USP9Y conserved gene region. The schematic on the x and y axes of each figure indicates the size and location of each MSY gene (green bars).

dog Y chromosome vs. cat Y chromosome 1.9 1.5 1.0 cat Y chromosome 0.5 0

0 0.5 1.0 1.5 2.0 2.5 (mb) dog Y chromosome

dog Y Chromosome vs. rhesus Y chromosome dog Y chromosome vs. human Y chromosome ) mb 30.0 ( 12.0 25.0 10.0 20.0 8.0 15.0 6.0 rhesus Y chromosome 10.0 4.0 human Y chromosome 5.0 2.0 ) mb 0( 0 0.5 1.0 1.5 2.0 2.5 (mb) 0.5 1.0 1.5 2.0 2.5 (mb)

dog Y chromosome dog Y chromosome

14

Figure 13. Plots showing rates of MSY gene loss during the evolution of mammals and continuing in the carnivore lineage, assuming an exponential model of decay with baseline (Hughes et al. 2012). Plots are divided into two groups of X-degenerate gene classes: Stratum I (Murphy et al. 2006; Hughes et al. 2012), and stratum II/III (following Pearks Wilkerson et al. 2008). Estimated ancestral gene numbers (Hughes et al. 2012), and active genes inferred for different eutherian ancestors (with divergence times drawn from Meredith et al. 2011) are shown as blue diamonds. Age of the ancestral Stratum I genes (a) is based upon the inferred ~218 Mya prototherian/therian divergence time (Meredith et al. 2011), whereas the age of the origin of Stratum II/III genes (a) is derived from Pearks Wilkerson et al. 2008. Other nodes are as follows: (b) primate/rodent/carnivore ancestor, (c) primate/rodent ancestor, (d) carnivore ancestor, (e) catarrhine primate ancestor.

Stratum I Stratum II-III 500 500 a a 200 200

100 100

50 50

Gene number 10

10 Gene number b c d e b c d e 0 0 250 200 150 100 50 0 250 200 150 100 50 0 Time (million year) Time (million year)

15

Figure 14. Results of positive selection tests for several multicopy MSY gene families, performed using the branch-site model. Asterisks indicate significant (p<0.01) positive selection detected with likelihood ratio tests.

human HSFX B A rhesus HSFX cat CUL4BY **

opossum HSFY dog

opossum

cat HSFY clade cow HSFY clade elephant ** rat dog HSFY a 0.1 mouse platypus cow human HSFY clade pig b lizard bat rhesus HSFY 0.1 horse Tree chicken shrew dn tree of CUL4BX and CUL4BY gene CUL4BX dn tree of HSFX and HSFY gene families

C D E mouse Sycp3 clade

dog ZFX chimpanzee ZFX rat SSTY human ZFX rat SYCP3 gene cat ZFX

mouse Sstyl rat SSTYL cow ZFX

rat SPIN1 mouse Spin1 0.1 mouse Ssty2 clade Mouse Zfy1 mouse Ssty1 clade ** 0.1 0.02 Mouse Zfy2 mouse Sly gene clade

dn tree of SSTY gene families dn tree of SYCP3 and SLY gene families dn tree of ZFX and ZFY gene families

16

Supplementary Table 1. Structural domains of the domestic cat and dog Y chromosomes, and relative sequence coverage. Estimated sizes are based on cytogenetic and physical mapping data (Murphy et al. 2006, Pearks-Wilkerson et al. 2008). region PAR Y single copy Multicopy/ampliconic NOR estimated sequenced estimated sequenced estimated sequenced estimated sequenced size size size size cat 6.6 Mb 23 Kba 1-2 Mb 0.7 Mb ~35 Mb 1 Mb NA* NA Xp/Yp Yp Yp Yp Yp+Yq Yp dog 6.6 Mb 45 Kba 1-2 Mb 1.1 Mb 1-2 Mb 1.5 Mb 8-10 Mb 0 bp Xp/Yq Yq Yq Yq Yq Yq Yp Yp alength of corresponding Y-PAR. bNA=not applicable

Supplementary Table 2. Comparison of synonymous substitution rates between canine X-Y paralogs. Orthologous genes average sequence divergence Synonymous substitution rate X copy Y copy within intron region OFD1X OFD1Y 0.155 0.149 TRAPPC2 TRAPPC2P NA 0.146 DDX3X DDX3Y 0.527 not alignable UTX UTY 0.334 not alignable KDM5C KDM5D 0.751 not alignable UBA1 UBE1Y 0.560 not alignable USP9X USP9Y 0.561 not alignable ZFX ZFY 0.266 not alignable TXLNG CYorf15 0.498 not alignable BCOR BCORY1/2 0.774/0.687 not alignable EIF1AX EIF1AY 0.532 not alignable

17

Supplementary Table 3. Sequences involved in selection tests. Gene species name Accession number Gene species name Accession number AMELY pig AB091792 UTY chimpanzee ENSPTRT00000042001 AMELY horse AB032194 UTY mouse NM_009484 AMELY cow M63500 ZFY mouse AK076618 AMELY human ENST00000215479 ZFY mouse M24401 AMELY chimpanzee ENSPTRT00000041965 ZFY cow AF032867 DDX3Y human ENST00000360160 ZFY human ENST00000383052 DDX3Y chimpanzee ENSPTRT00000041996 ZFY chimpanzee ENSPTRT00000041960 DDX3Y mouse ENSMUST00000091190 HSFY human AF332227 EIF1AY human ENST00000361365 HSFY human AK058182 EIF1AY chimpanzee ENSPTRT00000042030 HSFY rhesus FJ527015 EIF1AY rhesus FJ527014 HSFY opossum GQ253469 EIF1AY cow FJ195366 HSFY cow NM_001077006 EIF1AY pig AY609402 HSFY cow XM_001250229 EIF2S3Y rat FJ775732 HSFY cow XM_003585244 EIF2S3Y mouse AJ006584 TSPY human ENST00000451548 KDM5D human ENST00000317961 TSPY human ENST00000457222 KDM5D chimpanzee ENSPTRT00000042025 TSPY human ENST00000428845 KDM5D mouse NM_011419 TSPY human ENST00000426950 SRY human AM884750 TSPY human ENST00000426950 SRY chimpanzee DQ977342 TSPY human ENST00000457222 SRY rhesus FJ527022 TSPY human ENST00000287721 SRY rabbit AY785433 TSPY chimpanzee ENSPTRT00000055849 SRY mouse U03645 TSPY chimpanzee ENSPTRT00000041983 SRY rat AF274872 TSPY chimpanzee ENSPTRT00000055827 SRY cow AB039748 TSPY chimpanzee ENSPTRT00000055807 SRY pig U49860 TSPY pig ENSSSCT00000022749 SRY horse AB004572 TSPY cow ENSBTAT00000049474 SRY panda AB292070 CDY human ENST00000426790 UBE1Y mouse AF150963 CDY human ENST00000306882 UBE1Y rat EF690356 CDY human ENST00000544303 UBE1Y opossum GQ253467 CDY human ENST00000306609 USP9Y human ENST00000338981 CDY chimpanzee ENSPTRT00000076216 USP9Y chimpanzee ENSPTRT00000041991 CDY chimpanzee ENSPTRT00000073911 USP9Y mouse NM_148943 CDY chimpanzee ENSPTRT00000076243 USP9Y cow NM_001145509 CDY chimpanzee ENSPTRT00000072418 UTY human ENST00000331397 CDY chimpanzee ENSPTRT00000072527

18

Gene species name Accession number Gene species name Accession number CDY rhesus FJ527011 SSTY1L rat XM_222167 DAZ chimpanzee ENSPTRT00000062513 SSTY1L rat XM_001078732 DAZ chimpanzee ENSPTRT00000062535 SstyL mouse XM_144574 DAZ chimpanzee ENSPTRT00000062545 Ssty1 mouse X05260 DAZ chimpanzee ENSPTRT00000055821 Ssty1 mouse Mm#S38822574 (EST) DAZ human ENST00000382440 Ssty1 mouse Mm#S38822973 DAZ human ENST00000405239 Ssty1 mouse Mm#S38822990 DAZ human ENST00000382365 Ssty1 mouse Mm#S8039565 DAZ human ENST00000440066 Ssty1 mouse Mm#S7116079 DAZ rhesus FJ648738 Ssty1 mouse Mm#S38821906 Cyorf15A human ENST00000407724 Ssty1 mouse Mm#S32784017 Cyorf15A chimpanzee NM_001009080 Ssty1 mouse Mm#S26680069 Cyorf15A rhesus FJ527012 Ssty1 mouse Mm#S26702624 Cyorf15A gorilla FJ532256 Ssty1 mouse Mm#S40811957 Cyorf15B human ENST00000382832 Ssty2 mouse Mm#S7227131 Cyorf15B rhesus FJ648737 Ssty2 mouse Mm#S52502287 Cyorf15B gorilla FJ532257 Ssty2 mouse Mm#S7227130 CUL4BX microbat ENSMLUT00000013243 Ssty2 mouse Mm#S38822045 CUL4BX dog ENSCAFT00000029435 Ssty2 mouse Mm#S19760504 CUL4BX platypus ENSOANT00000004088 Ssty2 mouse Mm#S14960481 CUL4BX pig ENSSSCT00000013782 Sly mouse BC049626 CUL4BX lizard ENSACAT00000010801 Sly mouse Mm#S38821277 (EST) CUL4BX chicken ENSGALT00000013942 Sly mouse Mm#S38821362 CUL4BX marmoset ENSCJAT00000005037 Sly mouse Mm#S38822154 CUL4BX human ENST00000404115 Sly mouse Mm#S38822342 CUL4BX tree shrew ENSTBET00000007462 Sly mouse Mm#S60235098 CUL4BX mouse NM_001110142 Sly mouse Mm#S38822478 CUL4BX cat ENSFCAT00000010442 Sly mouse Mm#S21508819 CUL4BX elephant ENSLAFT00000034221 Sly mouse Mm#S14960473 CUL4BX cow ENSBTAT00000024713 Sycp3 mouse ENSMUST00000033458 CUL4BX horse ENSECAT00000025736 Sycp3 mouse ENSMUST00000067763 CUL4BX rat ENSRNOT00000065023 Sycp3 rat ENSRNOT00000056700 CUL4BX rabbit ENSOCUT00000015646 FLJ36031Y cat (clone a260) DQ329513 CUL4BX opossum XM_001372861 FLJ36031Y cat (b321) DQ329514 Spin1 mouse AK168544 FLJ36031Y cat (c254) DQ329515 SPIN1 rat BC097359 FLJ36031Y cat (e268) DQ329517 SSTY rat AABR06008269 FLJ36031Y cat (d340) DQ329516

19

Supplementary Methods

Pipeline of data generation and analyses.

BAC clone library preparation and DNA isolation BAC finger print Fluorescence in situ hybridization BAC-end sequencing Paired end 454 sequencing for selected BAC clones

Initial de novo assembly

Normalize contigs Calculate short read depth of single copy genes Identification of collapsed contigs Analyze SNP frequency

mRNA library preparation Single copy contigs Collapsed contigs

Illumina sequencing Use DNATrapper like method to separate collapsed contigs

De novo assembly of Reorder the distribution map of all contigs RNA-seq data BAC-end sequences Complete cDNA sequences using copy number and paired end linkage information

Finish de novo assembly

Testis cDNA sequences RT-PCR EST data EST data Real-time RNA-seq data RNA-seq data GC content and Inter-chromosomal Self dot plot Gene expression Gene annotation repeatmaksing comparison comparison in different tissues

Phylogenetic tree Gene conversion Natural selection X-Y chromosome Multiple Y chromosome building detection tests comparison comparison

Conservation tests Rearrangement estimation

20

Domestic dog Y chromosome BAC clone map.

p

q PAR (~6.5 Mbp)

novel cds PAB multicopy region 130/135

RBMY CUL4BY ncRNA01 ncRNA28 TSPY SRY ncRNA139 OFD1 UBE1Y BCORY CYorf15 ZFY SMCY EIF1AY HSFY USP9Y DDX3Y UTY SHROOM2

GAP 390N10 28D22 59I16 71J6 258H6 PAR 269C23 108C4 190J20 264L3 18A23 298J16 29H7 128F24 119D5 237I10 278L21 103P19 107I13 4B2 218F2 337A10 102E13 426F11 126D9 163P17 77I17 384A13 27D7 53K8 411J11 1O15 127F8 281B16 62M2 166G5 146B5 301E22 86N21 28D22 285C21 183L18 422N23 239O15 161F23 275A20 230J8 334A18

pool 1 BAC pools used pool 4 pool 3 pool 2 for cDNA capture pool 5

= positive gene BLAST hit obtained from corresponding BAC pool-selected cDNA reads

De novo assembly of cat and dog Y chromosome sequences The initial sequence assembly was subjected to custom manual annotation, as follows: a). We normalized the initial de novo contigs containing significantly uneven depth of coverage by identifying regions with significant shifts in coverage, and cutting the contigs into new sub- contigs with even read depth coverage (see below).

b). Paired end sequences are favorable for sequence assembly by utilizing upstream and

21 downstream read recognition (see Fig. A below). An ideal de novo sequence assembly based on NGS data will exhibit relatively even read-depth within each contig and a contiguous distribution of matching paired-end reads (Fig. B below).

A: Paired End linkage Contigs 1 2 3

3 1 2

Consensus sequences F: forward read F R R: reverse read

B: Ideal sequences assembly (no duplication, no repetitive sequences)

1 2 3 4 5 6 7 8 9 10 11 12 read depth contig paired end sequences matching

Using this information, we identified collapsed contigs (caused by gene/sequence duplication) in the initial de novo assembly. Several examples of different assembly artifacts are shown below. For example, in a case where two regions with highly similar, paralogous sequences collapse into one contig with greater than average read coverage, this results in two distinct and abnormal patterns of paired end read distribution (Case 1 and Case 2 below).

Case 1 Nonadjacent contigs collapse Case 2 Adjacent contigs collapse

1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 F R

1 2 3 6 7 8 9 10 11 12 1 2 3 5 6 9 10 11 12 4&8 7 R F F R 4&5 R F R F

Case 3 Palindrome Case 4 BAC Overlap

F R A Scaffold

1 2 3 4 5 6 1 2 3 4 5 6 7 8 9 10 11 12 R R F R Clone1 Clone2 5 4 3 2 1 A 1 2 3 4 5

Average single copy Read depth Normal paired end sequences matching Abnormal paired end sequences matching

Significant high read depth

Case 3 indicates a palindrome structure of highly similar sequences with read depth and paired end reads matching. Regions of BAC clone overlap also can cause significantly higher read

22 depth (Case 4), but paired end sequences patterns that are distinct from collapsed contigs caused by gene duplication. Using this information we reanalyzed the initial de novo sequence assembly. First we evaluated the copy number of each contig that was identified as potential collapsed contigs containing sequences from highly similar, duplicated Y chromosome regions. In this step, two criteria were used. We estimated the average expected depth of sequence coverage in contigs which contain known single copy coding genes (e.g. UTY/DDX3Y/USP9Y gene coding region). All normalized contigs were then compared to this value to provide an initial estimate of the potential collapsed copy number. Second, because the sequenced MSY region is haploid, only sequencing error or collapsed sequences could account for polymorphisms found in the initial de novo assembled contigs. We analyzed the SNP frequency of polymorphic sites found in contigs exhibiting a significant excess of coverage depth to adjust our initial copy number estimate.

c). Using paired-end sequence information and the estimated copy number of collapsed contigs, we calculated the maximum parsimony distribution map of contigs to rebuild scaffolds (below).

Collapsed contigs Unique contig

3 3 1 1 2 1 2 2 4 2 1 contig4 3 contig1 3 4 contig2 1 contig3 Upstream Downstream 3 contig ID contig ID

1 3 2 1 4

2 3 1 3

Assembly

Read depth Paired end sequences matching

d). We used a method similar to that applied in the software DNPtrapper(Arner et al. 2006) to separate collapsed contigs into unique contigs. Collapsed contigs include highly conserved regions and polymorphic region that result from collapse of nearly identical sequences distinguished by rare sequence variants. Variable region sequences are separated from the original contig and then sequence coverage is recalculated. Using paired end information the new

23 separated contigs are rebuilt.

The de novo assembly is completed by repositioning the formerly collapsed contigs into unique positions in the existing assembly. Copy number information inferred from read depth data was validated with quantitative PCR.

References

Arner E, Tammi MT, Tran AN, Kindlund E, Andersson B. 2006. DNPTrapper: an assembly editing tool for finishing and analysis of complex repeat regions. BMC Bioinformatics 7:155. Hughes JF, Skaletsky H, Brown LG, Pyntikova T, Graves T, Fulton RS, Dugan S, Ding Y, Buhay CJ, Kremitzki C et al. 2012. Strict evolutionary conservation followed rapid gene loss on human and rhesus Y chromosomes. Nature 483(7387): 82-86. Meredith RW, Janecka JE, Gatesy J, Ryder OA, Fisher CA, Teeling EC, Goodbla A, Eizirik E, Simao TLL, Stadler T et al. 2011. Impacts of the Cretaceous Terrestrial Revolution and KPg Extinction on Mammal Diversification. Science 334(6055): 521-524. Murphy WJ, Wilkerson AJP, Raudsepp T, Agarwala R, Schaffer AA, Stanyon R, Chowdhary BP. 2006. Novel gene acquisition on carnivore Y chromosomes. Plos Genet 2(3): 353-363. Pearks Wilkerson AJ, Raudsepp T, Graves T, Albracht D, Warren W, Chowdhary BP, Skow LC, Murphy WJ. 2008. Gene discovery and comparative analysis of X-degenerate genes from the domestic cat Y chromosome. Genomics 92(5): 329-338.

24