The effects of natural selection on human Y chromosome amplicons

by

Levi S. Teitz

Bachelor of Science, Biological Sciences
University of Maryland, College Park, 2013

SUBMITTED TO THE DEPARTMENT OF BIOLOGY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY IN BIOLOGY AT THE MASSACHUSETTS INSTITUTE OF TECHNOLOGY

JUNE 2018

©2018 Massachusetts Institute of Technology. All rights reserved.

Signature of Author: ______Department of Biology May 29, 2018

Certified by: ______David C. Page Professor of Biology Thesis Supervisor

Accepted by: ______Amy E. Keating Professor of Biology Co-Chair, Biology Graduate Committee

The effects of natural selection on human Y chromosome amplicons

by

Levi S. Teitz

Submitted to the Department of Biology on May 29, 2018 in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Biology

ABSTRACT

The Y chromosome is unique among the mammalian chromosomes: it determines sex, and is therefore normally present in a single copy, unlike all other chromosomes that can recombine with an identical homolog. These two facts have had profound effects on the fate of the Y chromosome, subjecting it to unique evolutionary pressures that caused the loss of most of its genes. Because of this lack of functional genes, speculation abounded that natural selection is ineffective on a chromosome that lacks a homolog with which to recombine, and that the Y chromosome is doomed to eventually fade away.

In recent years, evidence has been building that the Y chromosome is indeed shaped by evolutionary forces acting to maintain its functional genes. However, these studies bypassed the amplicons—large, highly identical segmental duplications—which are a prominent feature of mammalian Y chromosomes and contain many genes crucial for spermatogenesis.

In this thesis, we present evidence that natural selection acts on the ampliconic regions of the human Y chromosome. We first develop computational tools to detect amplicon copy number changes from whole genome sequencing data of 1216 men, and find that many men have such changes. By projecting those changes onto a phylogenetic tree of the analyzed Y chromosomes, we find that the reference copy number of each amplicon is ancestral to all modern human Y chromosomes. We then use simulations and novel analytical methods to demonstrate that the ancestral copy number of each amplicon is maintained by selection within diverse human lineages, even in the face of extremely high rates of mutation. Finally, we find that deleted amplicons are preferentially restored to their previous copy number by subsequent duplications. These results are another step forward in the ongoing reframing of the history of the mammalian Y chromosome: the Y chromosome is not the victim of random neutral processes, but is the carefully calibrated result of a complex interplay between various selective forces.

Thesis Supervisor: David C. Page Title: Professor of Biology


To Grandma, Grandpa, Sabba, and Savta

בס״ד

ACKNOWLEDGEMENTS

First and foremost, thank you to my advisor, David Page, for your endless encouragement and faith in me. You are a true mentor, always looking to guide me and help me grow as a scientist. Your meticulous approach to research and the conscientious philosophy with which you run your lab are models that I can only hope to emulate in the future, wherever it takes me.

I joined the Page lab not just because of the science but also because of the people, who have created an amazingly warm, collaborative, and intellectually challenging atmosphere. Special thanks to Helen Skaletsky, for constantly sharing your incredible expertise and helping my science reach new heights; Tatyana Pyntikova, for helping me break my streak of doing no non-computational experiments, and for the tremendous amount of work you put in to assist me with my research; Laura Brown, for your kindness and care that shine through while sitting next to you every day; and Katherine Romer, for your friendship in sharing board game nights, puzzle hunts, and much more with me. And thank you to Winston Bellott. From the time I was your rotation student until today, you have happily acted as my sounding board and proofreader, taught me a tremendous amount through our countless conversations on every topic, kept me up to date on various rocket launches (from both Earth and Kerbin), and answered literally thousands of my questions. Your unending enthusiasm for and knowledge of both science and everything else is inspiring.

Thank you as well to the members of my thesis committee: Dave Bartel, Jing-Ke Weng, and Andy Clark, for your generous and valuable advice.

To my friends, in Cambridge, Somerville, and elsewhere: the meals, movies, cooking sessions, songs, games, shows, visits, phone calls, and more that we’ve shared have kept me going over the past five years. Without you, none of this work would have been possible.

And finally, to my family. I love you, and thank you for everything.

CONTENTS

CHAPTER 1: INTRODUCTION

PART I: THE Y CHROMOSOME IN THE PRE-GENOMIC ERA
PART II: COMPLETE SEQUENCE OF Y CHROMOSOMES
PART III: Y CHROMOSOME VARIATION
PART IV: CURRENT OPINIONS ON Y CHROMOSOME AMPLICONS
ACKNOWLEDGEMENTS
REFERENCES

CHAPTER 2: SELECTION HAS COUNTERED HIGH MUTABILITY TO PRESERVE THE ANCESTRAL COPY NUMBER OF Y CHROMOSOME AMPLICONS IN DIVERSE HUMAN LINEAGES

SUMMARY
INTRODUCTION
RESULTS
DISCUSSION
MATERIAL AND METHODS
ACKNOWLEDGEMENTS
REFERENCES
SUPPLEMENTAL MATERIAL AND METHODS
SUPPLEMENTAL FIGURES AND TABLES

CHAPTER 3: CONCLUSIONS AND FUTURE WORK

CONCLUSIONS
FUTURE WORK
ACKNOWLEDGEMENTS
REFERENCES


CHAPTER 1

INTRODUCTION

______

PART I: THE Y CHROMOSOME IN THE PRE-GENOMIC ERA

THE DISCOVERY OF THE Y CHROMOSOME

The first Y chromosome was discovered in 1905. Two studies observed via light microscopy that several insect species have a male-specific chromosome: each species has the same number of chromosomes in males and females, but in males one chromosome is significantly smaller than the rest (Stevens, 1905; Wilson, 1905). Until that point, scientists had only been aware of the X chromosome, of which one copy is absent in males (Henking, 1891; Sutton, 1902). Due to its male-specific nature, the Y chromosome was hypothesized to play a role in sex determination (Stevens, 1905; Wilson, 1905). This hypothesis proved only partially true: exceptions include fruit flies, whose sex is determined by the ratio of X chromosomes to autosomes, but which require the Y chromosome for male fertility (Bridges, 1916).

While most early sex chromosome research was done in insects, studies identifying human and other mammalian Y chromosomes followed (Painter, 1921).

However, the mammalian Y chromosome’s role in sex determination remained unclear until 1959, when studies of men with Klinefelter syndrome, who have two X chromosomes and one Y chromosome, and women with Turner syndrome, who have just a single X chromosome, demonstrated that maleness in humans depends on the presence or absence of the Y chromosome (Ford et al., 1959; Jacobs and Strong, 1959).

Y CHROMOSOME EVOLUTIONARY THEORY

In his studies of fly genetics, Hermann Muller noted that many traits exhibited the pattern of X-chromosome linkage, but none had been linked to the Y chromosome (Muller, 1914). To explain this pattern, he proposed the first theory of Y chromosome degeneration. Because male flies do not undergo recombination during meiosis, mutations on the Y chromosome are never transferred to the X chromosome. Further, since the Y chromosome is only ever present in a single copy, the corresponding copy of the gene on the X chromosome will always be present, and losing a single copy of a gene usually has no harmful effects. Therefore, the Y chromosome will accumulate mutations until it has lost all of its functional genes without ever affecting the phenotype of the fly, thereby escaping any selective pressure against such mutations.

Half a century later, Susumu Ohno’s studies of sex chromosomes led him to propose an updated model of Y chromosome evolution (Ohno, 1967). Ohno noted that the level of disparity in the cytological appearance of X and Y chromosomes varies between species of vertebrates: some have homomorphic X and Y chromosomes, which are cytologically indistinguishable; others have heteromorphic sex chromosomes, in which Y chromosomes are distinctly smaller than the X; and others still have no Y chromosome at all. Based on these observations, Ohno hypothesized that the X and Y chromosomes originated as a pair of ordinary autosomes. In each lineage, one acquired a sex-determining gene and became the Y chromosome, which then began to degenerate (Figure 1). The three observed categories represented steps in the ongoing process of Y chromosome degeneration. The homomorphic Y chromosomes had only recently acquired the sex-determining gene, and as such had diverged only slightly from the X chromosome. The heteromorphic Y chromosomes had acquired the sex-determining gene further into the past, and represented the advanced stages of Y chromosome decay. Finally, the species with no Y chromosomes had reached the inevitable conclusion and lost the Y chromosome once it served no further function.

Ohno’s theory owed much to Muller’s theorizing about the Drosophila Y chromosome, but Ohno was the first to formally state that sex chromosomes originated as normal autosomes and developed independently in various lineages, and the first to bring evidence of various states of Y chromosome decay from an assortment of species.

Additionally, unlike in Muller’s flies, in which males do not undergo recombination at all, recombination occurs in both sexes in the species Ohno studied. Some other mechanism must have originally suppressed recombination between the X and Y chromosomes for Y chromosome decay to begin. Ohno suggested that large inversions on the Y chromosome had served this purpose. Because of these inversions, crossing over between the X and Y chromosomes during meiosis would result in chromosome nondisjunction or massive deletions of sequence; as a result, the inversions would effectively halt any exchange of genetic information between the sex chromosomes. Later work suggested that the fixation of such inversions is encouraged by selection (Rice, 1987a): because male-beneficial alleles gravitate to the Y chromosome (Fisher, 1931; Charlesworth and Charlesworth, 1980; Rice, 1992), suppressing recombination between the X and Y chromosomes prevents the presence of such alleles in females, where they would be detrimental.

Another piece of evidence for the X and Y chromosomes’ shared origin emerged with the discovery that the tip of the Y chromosome is identical to the tip of the X chromosome, and therefore recombines with the X chromosome during meiosis (Cooke et al., 1985; Simmler et al., 1985). (Crossing over between the sex chromosomes had been hypothesized as early as the 1930s due to visual observation of pairing between the X and Y chromosomes during meiosis [Koller and Darlington, 1934], but direct evidence of sequence homology between the two did not arise until the 1980s.) Because this sequence, like the autosomes, is present in two copies during meiosis and recombines with a homolog, it was named the pseudoautosomal region, or PAR (Burgoyne, 1982).

FIGURE 1. EVOLUTION OF THE HUMAN Y CHROMOSOME

The human X and Y chromosomes began as a pair of autosomes before the Y chromosome acquired the sex-determining gene SRY in the therian ancestor. Subsequently, a series of inversions formed evolutionary strata and suppressed recombination between the X and Y chromosomes, leading to Y chromosome degradation within each stratum. Additionally, a fusion between an autosome and the X and Y chromosomes in the eutherian ancestor created the X-added region (XAR) and Y-added region (YAR). Red arrows represent hypothetical breakpoints of inversion events that formed evolutionary strata. Solid gray regions of the Y chromosome represent regions that have differentiated from the X chromosome and lost most of their genes after suppression of recombination.

The PAR consists of the sequence for which recombination was not suppressed by an inversion, and it has therefore retained its ancestral homology between the X and Y chromosomes. Still, the PAR is only a small fraction of the Y chromosome, and the rest of the Y chromosome, first named the non-recombining portion of the Y chromosome (NRY) and then renamed the male-specific portion of the Y chromosome (MSY) after intrachromosomal Y chromosome recombination was discovered (Skaletsky et al., 2003), has been the focus of the vast majority of Y chromosome research. For the remainder of this thesis, we use the term ‘Y chromosome’ to refer specifically to the MSY; the PAR is subject to its own distinct evolutionary pressures and is beyond the scope of this work.

As evidence built for the Y chromosome’s origin as a homolog of the X chromosome and its subsequent loss of sequence, four major evolutionary processes were proposed as drivers of Y chromosome degeneration (reviewed in depth in Charlesworth and Charlesworth, 2000). The first is Muller’s ratchet (Muller, 1964), which states that the genetic load of a nonrecombining chromosome irreversibly increases over time. Once recombination between the X and Y chromosomes is no longer possible, any mutations that the Y chromosome acquires cannot be restored to their original state by subsequent recombination. Second, strongly beneficial mutations will occasionally arise on the Y chromosome and become fixed within the population. Because the Y chromosome is inherited as a single haplotype, all weakly deleterious mutations that are present on the chromosome upon which the beneficial allele arose also become fixed, a process called genetic hitchhiking (Smith and Haigh, 1974; Rice, 1987b). The third process, background selection, is a mirror image of genetic hitchhiking. Rather than weakly deleterious alleles becoming fixed along with one strongly beneficial allele, weakly beneficial alleles that arise in linkage with a strongly deleterious allele are eliminated from the population along with that deleterious allele (Charlesworth et al., 1993). Finally, there is only one copy of the Y chromosome for every three copies of the X chromosome and every four copies of each autosome in the population. As a result of this reduced effective population size, the Y chromosome is more susceptible to genetic drift, increasing the chance that deleterious mutations will become fixed (Nei, 1970). More recent research has modeled how these forces interact during the different stages of Y chromosome decay (Bachtrog, 2008).
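The copy-number ratios behind the fourth process follow from simple counting. The sketch below is a minimal illustration, assuming equal numbers of breeding males and females; the function name and the 1:1 sex ratio are illustrative assumptions and are not taken from the cited work.

```python
from fractions import Fraction


def chromosome_copies(males: int, females: int) -> dict:
    """Count chromosome copies carried by a population.

    Males carry one X, one Y, and two copies of each autosome;
    females carry two X chromosomes and two copies of each autosome.
    """
    return {
        "Y": males,
        "X": males + 2 * females,
        "autosome": 2 * (males + females),
    }


# Assumption: one breeding male per breeding female.
copies = chromosome_copies(males=1, females=1)

# Number of Y copies relative to each other chromosome type.
y_per_x = Fraction(copies["Y"], copies["X"])
y_per_autosome = Fraction(copies["Y"], copies["autosome"])

print(copies)
print(y_per_x, y_per_autosome)
```

Under this assumption the Y chromosome is present at one third the copy number of the X chromosome and one quarter that of each autosome. Real populations can depart from these ratios, for example when reproductive success varies more among males than among females.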

Common to all of these theories was the assumption that the forces acting to destroy the Y chromosome would overpower any selective pressures acting to preserve it. However, as time passed it became clear that the human Y chromosome has retained some functional elements besides the sex-determining gene.

THE DISCOVERY OF FUNCTIONAL GENES ON THE HUMAN Y CHROMOSOME

The methods available to early geneticists (namely, the study of pedigrees) made mapping traits to the sex chromosomes ostensibly simple, as X- and Y-linked traits have distinctive patterns of inheritance. In the first half of the 20th century, many studies claimed to have found human traits whose genetic basis resided on the Y chromosome, from hairy ears to webbed toes to scaly skin (Tommasi, 1907a; Tommasi, 1907b; Schofield, 1921; Cockayne, 1933). This line of work came to a sudden halt when Curt Stern systematically debunked every purported instance of Y chromosome linkage (Stern, 1957). For the next 20 years, the prevailing view of the mammalian Y chromosome was that of a genetic desert containing an element that determines sex and nothing else.

The first challenge to this view arrived in 1976, with the discovery of a deletion on the long arm of the human Y chromosome that is associated with male infertility and lack of sperm production, or azoospermia (Tiepolo and Zuffardi, 1976). This region was therefore named the azoospermia factor, or AZF.

Then, with the development of modern methods for molecular biology in the late 1980s, a flurry of functional genes was discovered. TSPY, ZFY, and AMELY were all found on the short arm of the Y chromosome (Arnemann et al., 1987; Page et al., 1987; Lau et al., 1989). In 1990, SRY was identified as the sex-determining gene (Gubbay et al., 1990; Sinclair et al., 1990). RBMY and DAZ were mapped to the long arm of the Y and emerged as leading candidates for the genes whose deletion causes spermatogenic failure (Ma et al., 1993; Reijo et al., 1995). Concurrently, the AZF region, previously thought to be a single locus, was shown to actually be three, thereafter named the AZFa, AZFb, and AZFc regions (Vogt et al., 1996).

These discoveries proved that the Y chromosome does indeed contain functional elements. Still, the consensus remained that the Y chromosome is a genetic wasteland of minor functional significance, and that these genes are the remnants of a haphazard process of decay, retained by chance rather than because of any function they served.

However, from these genes, a striking pattern emerged. Unlike all other chromosomes, whose genes have a variety of functions seemingly arranged at random, the human Y chromosome’s genes are functionally coherent (Lahn and Page, 1997). All Y chromosome genes fall into one of two categories: housekeeping genes expressed in many tissues, and genes expressed predominantly or solely in the testis. This pattern was the first hint of an overarching reason why those genes were retained on the Y chromosome when all others were lost. The genomic basis for this pattern was revealed as part of a larger project that caused a foundational shift in the entire field of biology: the sequencing of the human genome.

______

PART II: COMPLETE SEQUENCE OF Y CHROMOSOMES

SEQUENCING OF THE HUMAN Y CHROMOSOME

Without a doubt, the most significant event in the history of Y chromosome research was the sequencing of the human Y chromosome in 2003 (Skaletsky et al., 2003). The Y chromosome’s sequence confirmed longstanding theories about its origin and evolution, while also raising unexpected new questions.

The most striking finding from the complete sequence of the human Y chromosome was that it is composed of four distinct sequence classes, each with its own characteristics (Figure 2). The first of these classes is the X-degenerate sequence. Spanning 9.4 Mb, it is the remains of the ancestral sequence shared by the X and Y chromosomes. The X-degenerate sequence contains 16 single-copy genes, each of which has a homolog on the human X chromosome. These genes tend to be broadly expressed regulatory genes (Bellott et al., 2014), accounting for the first gene category observed by Lahn and Page (1997). This class also contains 13 pseudogenes, the remains of ancestral X-Y gene pairs that acquired a mutation eliminating their function, but that have not yet decomposed to the point that they are unidentifiable as former X-Y gene pairs.

These X-Y gene pairs led to the discovery of distinct evolutionary strata on the human X chromosome: contiguous blocks of sequence that correspond to ancient inversions on the Y chromosome that suppressed recombination with the X chromosome (Lahn and Page, 1999a), which vindicated Ohno’s theory about the mechanism of recombinational suppression. The strata were identified because all X-Y gene pairs in a single stratum have similar levels of nucleotide divergence to each other, indicating that a single event caused all gene pairs in a stratum to begin to diverge at the same time.

The second discovered class is 3.4 Mb of sequence with 99% nucleotide identity to a similar region on the X chromosome. The fact that this sequence has no counterpart on other Y chromosomes, along with its level of nucleotide divergence from the corresponding X chromosome region, indicates that it was transposed from the X chromosome to the Y chromosome after humans diverged from other primates (Page et al., 1984). This class is therefore named the X-transposed sequence. Its presence demonstrates that the Y chromosome is more than just the husk of the ancestral X-Y pair, but can also acquire new sequence and even new functional genes. However, in a demonstration of the power of the evolutionary forces acting to degrade the Y chromosome, the X-transposed region on the Y chromosome has lost all but two of those functional genes in the relatively short amount of time since it was acquired.

The third sequence class is heterochromatin, which comprises more than half of the reference Y chromosome. The vast majority of this heterochromatin is present as a single block that is approximately 40 Mb long in the reference Y chromosome, but its length varies between men (Repping et al., 2006). Beyond this major heterochromatic block, prominent heterochromatic regions on the human Y chromosome include the centromere, telomeres, and a block of around 400 kb within the proximal long arm of the chromosome.

The fourth sequence class is something that, at the time the human Y chromosome was first sequenced, was entirely unexpected: the amplicons.

FIGURE 2. GENOMIC STRUCTURE OF MAMMALIAN Y CHROMOSOMES

Mammalian Y chromosomes contain four major sequence classes. The four mammalian Y chromosomes with fully assembled ampliconic sequence are shown. Detailed ampliconic structures are shown for the human AZFc region, the major block of chimpanzee ampliconic sequence, and the long arm of the mouse Y chromosome. Colored arrows (human and chimpanzee detail) and blocks (mouse detail) represent individual amplicon copies. Within each species, units of the same amplicon family are shown in the same color. Arrow direction indicates amplicon orientation.

AMPLICONS

The Y chromosome amplicons are large blocks of sequence—from tens of kilobases to over a megabase—that are present on the Y chromosome in two or more nearly identical copies, comprising 10.2 Mb in humans. Unlike most repetitive sequence in the genome, which is considered junk DNA and has little functional significance, the Y chromosome amplicons contain functional genes. In fact, ampliconic sequence is quite similar to regular, single-copy euchromatin, except that it is present in more than one copy. Beyond this fact, several features of the Y chromosome amplicons are remarkable: their arrangement, their level of identity, and their functional coherence.

Many of the amplicons on the human Y chromosome are arranged in palindromes: two copies of an ampliconic unit arranged in opposite orientations, and separated by only a small “spacer” sequence. Many others are arranged similarly, but with a much greater distance between copies; these amplicons are called inverted repeats. Most spectacularly, the amplicons of the AZFc region of the human Y chromosome are arranged in a complex, interwoven pattern over a span of 4.5 Mb (Figure 2).

The level of identity between amplicons is also exceptional. Copies of inverted repeat sequences differ by up to 34 and as little as 5 bases out of every 10,000 (99.66-99.95% nucleotide identity). The palindromes have an even more extreme level of identity: their copies differ by at most 6/10,000 bases (99.94% nucleotide identity), with some differing by as little as 3/100,000 bases (99.997% nucleotide identity). This level of identity is maintained by gene conversion (the unidirectional transfer of sequence, as opposed to reciprocal transfer caused by crossing over) between amplicon copies (Rozen et al., 2003). Because of the high level of identity between amplicon copies, typical methods of genome sequencing that use short or inaccurate reads, or that use sequence from multiple haplotypes, fail at determining the Y chromosome’s genomic structure, as multiple amplicon copies collapse into a single copy or form overly complex and incorrect structures upon assembly. Therefore, a method based on iteratively sequencing and tiling bacterial artificial chromosomes (BACs) was developed and used for the human Y chromosome (Kuroda-Kawaguchi et al., 2001).
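The identity percentages quoted for inverted repeats and palindromes follow directly from the stated difference counts. The sketch below shows the conversion; the helper name `percent_identity` and the dictionary labels are illustrative, while the counts themselves are the ones given in the text.

```python
def percent_identity(differences: int, bases: int) -> float:
    """Percent nucleotide identity given the number of differing bases."""
    return 100.0 * (1.0 - differences / bases)


# Difference counts quoted in the text, as (differences, bases compared).
quoted = {
    "inverted repeats, most divergent": (34, 10_000),
    "inverted repeats, least divergent": (5, 10_000),
    "palindromes, most divergent": (6, 10_000),
    "palindromes, least divergent": (3, 100_000),
}

# Round to three decimals, the precision used in the text.
identities = {
    name: round(percent_identity(d, n), 3) for name, (d, n) in quoted.items()
}

for name, value in identities.items():
    print(f"{name}: {value}%")
```

At these identity levels, two copies of a palindrome arm can go tens of kilobases without a single distinguishing base, which is why short or error-prone reads cannot assign themselves to one copy rather than the other.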

Finally, the ampliconic genes show remarkable functional coherence. Every ampliconic gene is expressed primarily or exclusively in the testes, meaning that the amplicons are responsible for the second category of genes on the human Y chromosome that were first noted by Lahn and Page (1997). Unlike the single-copy X-degenerate genes that comprise the first category, which were present on the ancestral autosomal X-Y chromosome pair, many ampliconic genes were acquired from other chromosomes (Saxena et al., 1996; Lahn and Page, 1999b; Kuroda-Kawaguchi et al., 2001).

Ampliconic sequence, defined as large, highly identical segmental duplications, was subsequently discovered on other chromosomes, such as the X chromosome (Mueller et al., 2013). Still, the Y chromosome amplicons remain by far the most extreme examples of amplicons in terms of sequence identity, size, and architectural complexity.

SEQUENCING OF OTHER MAMMALIAN Y CHROMOSOMES

Modern therian (placental and marsupial mammal) sex chromosomes are descended from the sex chromosomes of the therian common ancestor, as evidenced by shared genes on placental and marsupial X chromosomes (Watson et al., 1990). The placental sex chromosomes also have an added region caused by a chromosomal fusion of an autosome to the sex chromosomes that is absent in marsupials (Watson et al., 1991; Nanda et al., 2002). In the years following the sequencing of the human Y chromosome, several other mammalian Y chromosomes were sequenced to varying levels of completion, revealing the divergent evolutionary paths taken by the chromosome since its inception (Figure 2).

The first such Y chromosome to be sequenced belonged to the chimpanzee (Hughes et al., 2010). Although the autosomal euchromatic sequence of humans and chimpanzees is 99% similar (Chimpanzee Sequencing Analysis Consortium, 2005), their respective Y chromosomes are much more divergent. As mentioned above, the X-transposed sequence in humans is absent on the chimpanzee Y chromosome. The X-degenerate sequences are the most similar Y chromosome regions between human and chimpanzee, with all 12 of the chimpanzee X-degenerate genes present on the human Y chromosome. However, several major inversions and transpositions of X-degenerate sequence have occurred since the human-chimpanzee split. Even more divergent are the ampliconic regions. The chimpanzee Y chromosome contains 14.7 Mb of ampliconic sequence (compared to 10.2 Mb in humans), much of it arranged in palindromes and more complex architectures. Only around half of this sequence is common to both species, but it is drastically rearranged between the two; the remaining ampliconic sequence is unique to either chimpanzee or human.

The sequencing of the rhesus macaque and mouse Y chromosomes yielded similar results (Hughes et al., 2012; Soh et al., 2014). In both, many X-degenerate genes are shared with other mammalian Y chromosomes, and amplicons are present but extremely divergent from those of any other species. The rhesus macaque Y chromosome contains only 500 kb of ampliconic sequence, although it does share some genes with human and chimpanzee amplicons. The mouse Y chromosome contains 87.7 Mb of ampliconic sequence composed primarily of around 200 copies of a 0.5 Mb unit, and its ampliconic gene content is almost completely different from that of the primate Y chromosomes. The massive amplification on the mouse Y chromosome is hypothesized to be the result of a genetic arms race with the mouse X chromosome: homologous genes are also amplified on the mouse X chromosome, and deletion of those genes on either chromosome results in a skewed sex ratio in offspring (Cocquet et al., 2009; Cocquet et al., 2012). This sex ratio distortion could have driven amplification of these genes on both the X and Y chromosomes, as any increase in copy number on one chromosome would favor a similar increase on the other chromosome in order to return the sex ratio to even.

The sequencing of one non-mammalian sex chromosome also supported the existing theories of sex chromosome evolution. The chicken Z and W chromosomes (in birds, males are ZZ and females are ZW) arose from a different ancestral autosome than the X and Y chromosomes in mammals, evidenced by the fact that chromosomes 5, 9, and 18 in human correspond to the chicken Z chromosome, as determined by gene content and synteny (Nanda et al., 1999; Nanda et al., 2002). Similarly, chicken chromosomes 1 and 4 correspond to the human X chromosome (International Chicken Genome Sequencing, 2004; Ross et al., 2005). Both chicken and mammalian sex chromosomes correspond to autosomes in teleost fishes, further reinforcing the fact that sex chromosomes arise from ordinary autosomes (Bellott et al., 2010).

Partial assemblies of the marmoset, rat, bull, and opossum Y chromosomes followed (Bellott et al., 2014). Other Y chromosomes, including those of dog, cat, pig, and gorilla (Li et al., 2013; Skinner et al., 2016; Tomaszkiewicz et al., 2016), were sequenced without using the BAC-based methods used for the above chromosomes, leaving the architecture, copy number, and complete sequence of the amplicons a mystery.

From these varied Y chromosomes, several patterns emerge. First, the single-copy, X-degenerate genes have been conserved on mammalian Y chromosomes for hundreds of millions of years, implying that the mammalian Y chromosome is not continuing to decay but instead decayed quickly before reaching a steady state in which it remains today (Figure 3; Bellott et al., 2014). Additionally, these genes tend to be widely expressed, dosage sensitive, and regulators of basic biological processes (Bellott et al., 2014). These facts imply that the single-copy genes that remain on the Y chromosome are not the random remnants of degeneration but carefully cultivated survivors that were preserved because of their crucial functions. Because these genes are dosage sensitive, deletions behave as dominant loss of function mutations: losing the Y chromosome copy would be deleterious, counter to Muller’s original theory that Y chromosome mutations would have no phenotypic effect (Muller, 1914). This stands in contrast to most genes, for which loss of function is recessive: there is no harmful phenotype upon deletion of a single copy (or mild overexpression). This dosage sensitivity, combined with the role of these genes as regulators expressed in many tissues across development, which would cause massive deleterious effects if not operating properly, necessitates the retention of these genes on the Y chromosome.

Second, ampliconic sequence is present on every sequenced Y chromosome, and the ampliconic genes are testis-specific and often acquired from other chromosomes rather than ancestral to the X-Y chromosome pair. The testis specificity of ampliconic genes implies that they play some role in spermatogenesis and male fertility; the ubiquity of ampliconic sequence suggests that the formation and maintenance of amplicons confers some functional benefit beyond simply housing genes that are necessary for spermatogenesis. Several theories arose as to what this benefit is. Multiple copies of ampliconic genes can provide the proper gene dosage (Soh et al., 2014). Having multiple copies allows a gene that acquired a deleterious mutation to be restored to its previous state through gene conversion with an intact copy (Rozen et al., 2003; Connallon and Clark, 2010; Hallast et al., 2013). A modification of this theory posits that gene conversion also speeds up adaptation through gene conversion of beneficial mutations (Betran et al., 2012). Finally, the sex chromosomes are largely inactivated during the pachytene stage of spermatogenesis (Monesi, 1965), as genes can only be expressed at that stage if they are paired with homologous sequence (Turner et al., 2005). However, amplicons may be able to escape this inactivation and express their genes by pairing with themselves during meiosis (Warburton et al., 2004). These theories are not mutually exclusive, and they may have played different roles throughout the lifespan of amplicons, some driving their initial formation and others powering their maintenance.

Third, in contrast to the single-copy genes, multi-copy ampliconic genes vary between species. Even in closely related species that do share many ampliconic genes, such as human and chimpanzee, the ampliconic architecture and the copy number of the ampliconic genes are wildly different. This rapid evolution points to either a lack of selective constraint or extreme positive selection driving diversification.

FIGURE 3. Y CHROMOSOMES UNDERGO QUICK DECAY AND SUBSEQUENT STABILIZATION FOLLOWING STRATUM FORMATION

(Adapted from Bellott et al., 2014)

Number of ancestral Y chromosome genes on the MSY in each stratum (on a log scale) plotted against time. Circles represent the number of genes present in the ancestral X-Y chromosome pair (325 mya) and the most recent common ancestor of human and opossum (176 mya), human and bull (97 mya), human, mouse, and rat (91 mya), human and marmoset (44 mya), human and chimpanzee (6 mya), and modern humans (0 mya). Gene numbers are inferred from the set of ancestral Y chromosome genes retained in modern-day species descended from each common ancestor, and therefore represent the lower bound of gene number at each historical point. Lines represent a model of exponential decay with a constant baseline fitted to the circles. This model produces a better fit than other models that have been proposed to describe Y chromosome degeneration, such as simple exponential decay or linear decay.
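The model named in the Figure 3 caption, exponential decay toward a constant baseline, can be written as N(t) = b + (N0 - b)·exp(-kt), where N0 is the starting gene count, b the baseline, and k the decay rate. The following minimal Python sketch evaluates that form next to simple exponential decay. The function names and parameter values are illustrative assumptions, not the values fitted by Bellott et al. (2014).

```python
import math

def decay_with_baseline(t, n0, baseline, rate):
    """Gene count after time t: exponential decay toward a nonzero baseline."""
    return baseline + (n0 - baseline) * math.exp(-rate * t)

def simple_decay(t, n0, rate):
    """Gene count after time t under simple exponential decay (baseline of zero)."""
    return n0 * math.exp(-rate * t)

# Illustrative placeholder parameters (assumptions, not fitted values).
N0, BASELINE, RATE = 100.0, 10.0, 0.05

for t in (0, 25, 50, 100, 300):
    with_base = decay_with_baseline(t, N0, BASELINE, RATE)
    without = simple_decay(t, N0, RATE)
    print(f"t={t:>3}  baseline model={with_base:7.2f}  simple decay={without:7.2f}")
```

Under these placeholder values the two curves start together, but only the baseline model levels off at a stable gene count, which is the qualitative behavior the caption describes.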

The era of Y chromosome sequencing confirmed previous theories about the Y chromosome’s autosomal origins while debunking the idea that the Y chromosome was on the road to extinction. The Y chromosome sequences also demonstrated functional themes among single-copy genes and revealed the existence of the amplicons. The next frontier of Y chromosome research was not sequencing a single reference chromosome per species, but using the existing reference sequences to study tens, hundreds, and even thousands of Y chromosomes at a time.

______

PART III: Y CHROMOSOME VARIATION

SINGLE NUCLEOTIDE Y CHROMOSOME VARIATION

The reference sequence of the human Y chromosome opened the door to new areas of Y chromosome research. One particularly abundant field of study was Y chromosome demography. Because the Y chromosome does not recombine with a homologous chromosome, it is inherited as a single haplotype. Therefore, a phylogenetic tree of every human Y chromosome can be built, making the Y chromosome a powerful tool for reconstructing the history of human populations. A universal standard for this grand Y chromosome tree was adopted in 2002, in which major branches of the tree—called haplogroups—are labeled by letters of the alphabet, with further subdivisions given more detailed labels (Figure 4; Y Chromosome Consortium, 2002). Since then, the tree has grown ever more detailed by nonstop research, including the recent discovery of haplogroup A00, which diverged from all other haplogroups over 200,000 years ago (Mendez et al., 2013). Beyond classifying Y chromosomes, the study of Y chromosome demography allows for the reconstruction of ancient patterns of human populations. In some cases, these patterns can be linked to known historical events such as the colonization of continents, the expansion of cultures, and the development of agriculture (Batini et al., 2015; Karmin et al., 2015; Poznik et al., 2016).

An underlying assumption of these studies is that the Y chromosome’s SNPs reflect neutral evolutionary forces such as drift and migration. However, patterns of single-nucleotide variation on the Y chromosome point towards the presence of non-neutral evolutionary forces as well. Single-copy genes on the human Y chromosome show decreased nonsynonymous mutation compared to synonymous mutation, a classic indicator of purifying selection (Rozen et al., 2009). In addition, the low level of genetic diversity present within the single-copy sequence of the human Y chromosome as a whole is a signature of selection (Wilson Sayres et al., 2014). These studies remain the minority; most Y chromosome demographic research still largely assumes a neutral Y chromosome.

FIGURE 4. PHYLOGENETIC TREE OF Y CHROMOSOME HAPLOGROUPS

Major branches of the haplogroup tree that are present in the 1000 Genomes Project and haplogroup A00 are shown. Divergence times of haplogroups A-R are estimated from Poznik et al. (2016). Haplogroup A divergence time is a composite of several haplogroup A branches that each diverged independently. Haplogroup A00 divergence time is estimated from Elhaik et al. (2014), which is the lowest of several estimates of divergence time for the haplogroup.

LARGE-SCALE Y CHROMOSOME VARIATION

In addition to single-nucleotide variation, the Y chromosome is susceptible to structural variation: much larger deletions, duplications, inversions, and complex rearrangements of sequence (Figure 5). While the first foray into this topic was the discovery of deletions at the putative AZF locus in 1976 (Tiepolo and Zuffardi, 1976), the study of Y chromosome structural variation truly began in the 1990s. During those years, the AZF locus was revealed to actually be three loci: AZFa, AZFb, and AZFc (Vogt et al., 1996). Further studies discovered that the mechanism of deletion for all three regions is non-allelic homologous recombination (NAHR): crossing over between nearly identical sequences on either sister chromatids or a single chromatid looping to pair with itself. For AZFa deletions, this identical sequence is two copies of an endogenous retrovirus inserted within the X-degenerate sequence on the long arm of the Y chromosome, spaced 800 kb apart (Sun et al., 2000). The AZFb and AZFc deletions actually overlap, and both are caused by crossing over between amplicon copies (Sun et al., 2000; Repping et al., 2002). (Because of this overlap, the modern nomenclature of the human Y chromosome recognizes the region of complex interleaved amplicons on the long arm as the AZFc region, while AZFb only refers to the previously described deletion, not to a region of the reference chromosome.) These findings explained the recurrence of the AZF deletions, as the large size and high identity of these regions, particularly the amplicons, make them prime targets for NAHR.
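The deletion mechanism described above can be illustrated with a toy model. The sketch below represents a chromatid as an ordered list of segment labels and removes the sequence between two direct repeat copies, leaving a single recombinant repeat, as NAHR between direct repeats would. The segment names and the `nahr_deletion` helper are hypothetical and do not correspond to real Y chromosome coordinates.

```python
def nahr_deletion(segments, repeat):
    """Return the segment order after NAHR between the first two copies of `repeat`.

    Crossing over between two direct repeats removes everything between
    them and leaves a single recombinant copy of the repeat.
    """
    positions = [i for i, seg in enumerate(segments) if seg == repeat]
    if len(positions) < 2:
        raise ValueError("NAHR needs at least two copies of the repeat")
    first, second = positions[0], positions[1]
    # Keep everything up to and including the first copy,
    # then resume after the second copy.
    return segments[: first + 1] + segments[second + 1 :]

# Hypothetical chromatid: two direct repeats "R" flanking unique segments "u2" and "u3".
chromatid = ["u1", "R", "u2", "u3", "R", "u4"]
print(nahr_deletion(chromatid, "R"))
```

This sketch models only repeats in direct orientation. Repeats in inverted orientation would instead produce an inversion of the intervening sequence, which is not modeled here.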

These studies pinpointed the deletion coordinates using sequence-tagged sites, or STSs: small stretches of sequence that are unique, even within the highly identical amplicons, and whose locations are mapped across the Y chromosome (Vollrath et al., 1992). Amplifying these STSs by PCR can detect their presence or absence on a Y chromosome and enable rough triangulation of the span of a deletion. Further studies using similar methods identified other recurrent variants such as the gr/gr and b2/b3 deletions, which are nearly fixed in branches of haplogroups D and N, respectively (Repping et al., 2003; Fernandes et al., 2004; Repping et al., 2004), and the AMELY deletion (Lattanzi et al., 2005). These variants also arise due to NAHR between amplicon copies. In fact, the amplicons are especially vulnerable to large deletions and duplications, collectively termed copy number variants (CNVs), because the amplicons, with their large size and high identity, are prime targets for NAHR (Kuroda-Kawaguchi et al., 2001).

FIGURE 5. MAJOR Y CHROMOSOME STRUCTURAL VARIANTS

Each variant shown is caused by non-allelic homologous recombination (NAHR) between nearly identical sequence at the breakpoints. Span of deletions is shown in red. Detailed mechanism of deletion is shown for AZFc and gr/gr deletions. In each, (1) the colored arc shows the targets of NAHR on a single copy of the entire AZFc region, followed by (2) crossing over between two sister chromatids of the Y chromosome, causing a deletion, and (3) the resulting architecture after NAHR. An alternative mechanism, in which a single chromatid forms a loop and undergoes NAHR with itself, is not shown. Locations of the two centromeres on the isodicentric chromosome are indicated with arrows.

However, using STSs to detect amplicon structural variation has two inherent weaknesses. First, the method can only detect deletions, since PCR only detects the presence or absence of an STS, not how many copies of it are present. Duplications, inversions, and other rearrangements of sequence go undetected, and complex variants that result from consecutive mutation events can produce confusing patterns of scattered deleted STSs. Second, because STSs have long gaps between them, particularly within the amplicons, this method cannot identify the exact breakpoints of a deletion. For example, the AZFa breakpoints were determined by using STSs to find the approximate span of the deletion and then independently scanning that region for appropriate NAHR breakpoints (Sun et al., 2000).
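The limits of STS mapping can be made concrete with a small sketch. Given ordered STS positions and PCR presence/absence calls, the code below returns the minimum span of a deletion (first to last absent STS) and the maximum span (the nearest flanking STSs that are present); the true breakpoints lie somewhere in the gaps between the two. The marker names, coordinates, and the `deletion_bounds` helper are invented for illustration, and the sketch assumes a single contiguous deletion.

```python
def deletion_bounds(sts_calls):
    """Bound a single contiguous deletion from ordered STS PCR results.

    sts_calls: list of (name, position, present) tuples sorted by position.
    Returns ((min_start, min_end), (max_start, max_end)),
    or None if no STS is absent. A max bound of None means no present
    STS flanks the deletion on that side.
    """
    absent = [i for i, (_, _, present) in enumerate(sts_calls) if not present]
    if not absent:
        return None
    first, last = absent[0], absent[-1]
    min_span = (sts_calls[first][1], sts_calls[last][1])
    max_start = sts_calls[first - 1][1] if first > 0 else None
    max_end = sts_calls[last + 1][1] if last + 1 < len(sts_calls) else None
    return min_span, (max_start, max_end)

# Hypothetical markers: (name, position in kb, amplified by PCR?)
calls = [
    ("sts1", 100, True),
    ("sts2", 250, False),
    ("sts3", 400, False),
    ("sts4", 900, True),
]
print(deletion_bounds(calls))
```

In this invented example the deletion spans at least 250 to 400 kb and at most 100 to 900 kb, so the uncertainty in each breakpoint is set entirely by STS spacing. Scattered absent STSs produced by complex variants would violate the single-deletion assumption.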

One method that can detect variants other than deletions is FISH (fluorescence in situ hybridization): the use of fluorescent DNA probes that hybridize to known sequence within a cell, which is then observed with microscopy. This method allows direct visual observation of the copy number of amplicons. The arrangement of multiple amplicon units can also be determined with 2-color FISH, using two separate DNA probes that hybridize to different amplicon units and that each have a different colored fluorophore. FISH was used to determine amplicon copy number on the reference Y chromosome (Saxena et al., 2000) and in many subsequent studies of Y chromosome structural variation. One such study used FISH (among other methods) to observe ampliconic CNVs and rearrangements in 47 men from various haplogroups (Repping et al., 2006). One striking finding was the variability of the AZFc region, which underwent changes in amplicon copy number and/or architecture in 20 out of the 47 men. This study also reinforced the idea that most ampliconic structural variation is caused by NAHR, as many of the observed variants, including the AZFc structures, correspond to structures that can be formed by one or more NAHR events. Further research also identified isodicentric Y chromosomes and massive inversions spanning the Y chromosome centromere, both caused by crossing over between amplicon copies (Lange et al., 2009; Lange et al., 2013).

The downside of using FISH to detect Y chromosome structural variation is that the method is distinctly low-throughput and time-consuming. In addition, FISH shares the weakness of STSs in that it can provide large-scale information about variants but cannot precisely determine variant boundaries on its own. The next wave of studies addressed these issues by taking advantage of two emerging high-throughput technologies: array comparative genome hybridization (aCGH) and high-throughput sequencing.

HIGH-THROUGHPUT STUDIES

The invention of microarrays and, subsequently, high-throughput sequencing heralded an era of studies throughout biology that analyzed hundreds and even thousands of samples at once. Both technologies can be used to detect Y chromosome structural variation. Microarrays are used in aCGH, in which DNA is hybridized to tens of thousands of probes, and copy number is determined based on the relative levels of hybridization between test and control samples (reviewed in Carter, 2007). High-throughput sequencing generates hundreds of millions of short DNA reads, which can then be analyzed using a broad variety of computational methods to detect structural variants (Abyzov et al., 2011; Miller et al., 2011; Rausch et al., 2012; Fan et al., 2014). However, major studies of human structural variation tend to ignore the Y chromosome (Handsaker et al., 2015; Sudmant et al., 2015), as the repetitive nature of the amplicons complicates the analysis—mainstream computational tools to detect copy number variation often assume that each sequence is only present in a single location on the reference chromosome, and those tools fail when analyzing regions, such as the amplicons, that do not fulfill this assumption.
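The single-location assumption matters because read-depth methods infer copy number from sequencing coverage. The following is a minimal sketch of the general idea, not the pipeline of any tool cited above. Reads from every copy of an amplicon in a sample are spread across the copies present in the reference, so depth is first normalized to a single-copy control region and then scaled by the number of reference copies. The function and parameter names are hypothetical, and the depth values are invented.

```python
def estimate_copy_number(amplicon_depth, control_depth, reference_copies):
    """Estimate the number of amplicon copies in a sample from read depth.

    amplicon_depth: mean read depth averaged over all reference copies
        of the amplicon.
    control_depth: mean read depth over single-copy sequence on the same
        chromosome.
    reference_copies: number of copies of the amplicon in the reference.
    """
    if control_depth <= 0:
        raise ValueError("control depth must be positive")
    return reference_copies * amplicon_depth / control_depth

# Hypothetical depths for an amplicon with 2 copies in the reference.
print(estimate_copy_number(30.0, 30.0, 2))  # depth equals control: reference copy number
print(estimate_copy_number(15.0, 30.0, 2))  # half the control depth: one copy lost
print(estimate_copy_number(45.0, 30.0, 2))  # 1.5x the control depth: one copy gained
```

A tool that treats the amplicon as unique would in effect set `reference_copies` to 1 and misread the copy number. Real data would also need corrections, such as for GC content and mappability, that this sketch omits.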

In recent years, four notable studies used one of these two methods to detect structural variants, and particularly CNVs, on the Y chromosome both with greater precision and in a greater number of men than before. Wei et al. (2015) and Johansson et al. (2015) used aCGH to study 411 men and 1718 men, respectively; Espinosa et al. (2015) analyzed high-throughput sequencing data from a subset of 70 men from the 1000 Genomes Project; and Poznik et al. (2016), although focusing on the demographic analyses based on SNPs described above, also studied CNVs from the full set of 1244 men from the 1000 Genomes Project. The methodology of each group differs, with each describing variants that others would have omitted, limiting the power of any comparative analysis of the studies. Still, several overarching themes emerge from these reports. Y chromosome structural variants are extremely prevalent and diverse, with the studies detecting 22, 25, 19, and 110 distinct CNVs, respectively. The amplicons are particularly susceptible to CNVs: 59% of CNVs in Wei et al., 80% in Johansson et al., and 42% in Espinosa et al. were found within ampliconic regions. (Poznik et al. excluded several ampliconic loci because their variant detection pipeline had difficulties in calling CNVs in such repetitive regions, so the percentage of ampliconic CNVs in their study does not accurately reflect the prevalence of such variants.)

Another common theme is the recurrence of specific variants: many large CNVs are present in multiple men and even in multiple haplogroups, indicating that such variants arose independently many times throughout history. These recurrent CNVs are especially likely to be ampliconic, further reinforcing the extreme mutability of the amplicons.

These results also provide insight into the mechanistic origins of Y chromosome CNVs and supply an explanation for the recurrent nature of many of the variants. Many CNVs in all four studies—ampliconic or otherwise—are bordered by repetitive sequence, implicating NAHR as the cause of mutation and confirming that it is a prevalent mechanism for CNV formation on the Y chromosome. However, many non-NAHR-mediated variants were observed as well, implying that the mutational landscape of Y chromosome structural variation is more complex than previously shown.

All four studies also share common limitations. Little effort was made to reconstruct the mechanism of origin of variants more complex than a simple NAHR event. In addition, no evolutionary analysis was performed beyond the rudimentary assignment of CNVs to haplogroups, leaving the evolutionary pressures that shaped the formation and inheritance of these CNVs a mystery.

PHENOTYPES OF AMPLICONIC STRUCTURAL VARIATION

Y chromosome structural variants can have profound functional and phenotypic consequences, and their study extends beyond just cataloging variants. In fact, the first Y chromosome structural variants to be described were the AZF deletions that cause spermatogenic failure (Tiepolo and Zuffardi, 1976; Vogt et al., 1996). Since then, other phenotypic consequences of Y chromosome variants have been discovered. Transposition of the sex-determining gene SRY to the X chromosome causes sex reversal (Andersson et al., 1986). Isodicentric Y chromosomes caused by NAHR between amplicon copies can lead to loss of the Y chromosome and Turner syndrome (Lange et al., 2009). In addition to the canonical AZFa, AZFb, and AZFc deletions (Vogt et al., 1996), certain partial deletions of the AZFc region such as the gr/gr and b1/b3 deletions are associated with increased risk of spermatogenic failure (Lynch et al., 2005; Rozen et al., 2012). TSPY copy number variation can also affect spermatogenesis (Giachini et al., 2009). Other phenotypes can result from such variants as well: the gr/gr deletion is associated with testis cancer (Nathanson et al., 2005). However, the gr/gr deletion is almost fixed within a branch of haplogroup D, demonstrating that chromosomes with the deletion are fit enough to spread throughout a population (Repping et al., 2003). Other AZFc deletions, such as the b2/b3 deletion, appear to have no phenotypic effect, although this is debated (Lu et al., 2009; Rozen et al., 2012). A complicating factor is that the genetic background of the men with such variants—both on the remainder of the Y chromosome as well as on other chromosomes—could very well influence the presence and severity of these phenotypes. Meanwhile, the recent broad surveys of Y chromosome structural variation described above (Wei et al., 2015; Johansson et al., 2015; Espinosa et al., 2015; Poznik et al., 2016) made no attempt to examine the phenotypic consequences of the variants they detected, but the large majority of those variants have no known effect.

NONHUMAN Y CHROMOSOME VARIATION

The study of Y chromosome structural variation has recently expanded to include nonhuman Y chromosomes. Studies in chimpanzee, bonobo, macaque, and mouse found that, as in humans, large CNVs are common and recurrent, particularly within the amplicons (Ghenu et al., 2016; Oetjens et al., 2016; Morgan and de Villena, 2017). Although only a total of nine chimpanzees were studied, different subspecies display distinct patterns of amplicon copy number, suggesting that amplicon content is diverging between populations of chimpanzees (Oetjens et al., 2016). In mice, the ampliconic variation is consistent with the hypothesized genetic arms race between the X and Y chromosome ampliconic genes, further demonstrating the volatility of the amplicons and the myriad functional roles of the ampliconic genes (Morgan and de Villena, 2017).

______

PART IV: CURRENT OPINIONS ON Y CHROMOSOME AMPLICONS

From the time of their initial, unexpected discovery 15 years ago, it was clear that the Y chromosome amplicons play a crucial functional role. However, research into the amplicons has been largely limited to counting a few well-defined variants that have known effects on spermatogenesis. Recent broader studies of Y chromosome structural variation make it clear that the amplicons are uniquely susceptible to major and recurrent CNVs, but little has been done beyond cataloging these variants.

Instead, most deep functional and evolutionary research on the mammalian Y chromosome has focused on single-copy regions. Evolutionary comparisons are relatively easy to make for such regions because they are better conserved between species than the amplicons. In contrast, the rapidly evolving nature of the amplicons makes it almost impossible to reconstruct their history even by comparing closely related species.

Because of this rapid evolution, it is easy to revert to past theories of Y chromosome evolution and assume that the amplicons—and especially ampliconic copy number and architecture—are unconstrained by selection and victim to the random fluctuations of genetic drift and mutation. Recent work has begun to challenge this assumption, but to date there is much speculation and little concrete evidence or systematic investigation into the effect of selection on the amplicons.

In this thesis, we demonstrate that amplicon copy number in humans has been subject to natural selection over the past 200,000 years, and that such selection has maintained the ancestral copy number of each amplicon, even in the face of extremely high levels of mutation. We do so by analyzing high-throughput sequencing data from 1216 human males, using novel computational tools to detect amplicon copy number. We use that copy number information to predict amplicon architecture and the mechanism of mutation. Then, using the complete phylogenetic tree of the Y chromosomes analyzed, we compare the distribution of CNVs within the tree to the expected distribution if such CNVs were unconstrained by selection, and show that the observed pattern of CNVs is incompatible with neutrality and instead displays the signature of purifying selection.

This work is presented in Chapter 2. Chapter 3 presents conclusions and future directions to further investigate the mammalian Y chromosome amplicons.

______

ACKNOWLEDGEMENTS

I thank Winston Bellott, Mina Kojima, and David Page for their comments on this chapter.

______

REFERENCES

Abyzov, A., Urban, A. E., Snyder, M., & Gerstein, M. (2011). CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res 21, 974-84.

Andersson, M., Page, D. C., & de la Chapelle, A. (1986). Chromosome Y-specific DNA is transferred to the short arm of X chromosome in human XX males. Science 233, 786-8.

Arnemann, J., Epplen, J. T., Cooke, H. J., Sauermann, U., Engel, W., & Schmidtke, J. (1987). A human Y-chromosomal DNA sequence expressed in testicular tissue. Nucleic Acids Res 15, 8713-24.

Bachtrog, D. (2008). The temporal dynamics of processes underlying Y chromosome degeneration. Genetics 179, 1513-25.

Batini, C., Hallast, P., Zadik, D., Delser, P. M., Benazzo, A., Ghirotto, S., Arroyo-Pardo, E., Cavalleri, G. L., de Knijff, P., et al. (2015). Large-scale recent expansion of European patrilineages shown by population resequencing. Nat Commun 6, 7152.

Bellott, D. W., Hughes, J. F., Skaletsky, H., Brown, L. G., Pyntikova, T., Cho, T. J., Koutseva, N., Zaghlul, S., Graves, T., et al. (2014). Mammalian Y chromosomes retain widely expressed dosage-sensitive regulators. Nature 508, 494-9.

Bellott, D. W., Skaletsky, H., Pyntikova, T., Mardis, E. R., Graves, T., Kremitzki, C., Brown, L. G., Rozen, S., Warren, W. C., et al. (2010). Convergent evolution of chicken Z and human X chromosomes by expansion and gene acquisition. Nature 466, 612-6.

Betran, E., Demuth, J. P., & Williford, A. (2012). Why chromosome palindromes? Int J Evol Biol 2012, 207958.

Bridges, C. B. (1916). Non-Disjunction as Proof of the Chromosome Theory of Heredity. Genetics 1, 1-52.

Burgoyne, P. S. (1982). Genetic homology and crossing over in the X and Y chromosomes of Mammals. Hum Genet 61, 85-90.

Carter, N. P. (2007). Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 39, S16-21.

Charlesworth, B., & Charlesworth, D. (2000). The degeneration of Y chromosomes. Philos Trans R Soc Lond B Biol Sci 355, 1563-72.

Charlesworth, B., Morgan, M. T., & Charlesworth, D. (1993). The effect of deleterious mutations on neutral molecular variation. Genetics 134, 1289-303.

Charlesworth, D., & Charlesworth, B. (1980). Sex differences in fitness and selection for centric fusions between sex-chromosomes and autosomes. Genet Res 35, 205-14.

Chimpanzee Sequencing Analysis Consortium. (2005). Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69-87.

Cockayne, E. A. (1933). Inherited abnormalities of the skin and its appendages. London: Oxford University Press.

Cocquet, J., Ellis, P. J., Mahadevaiah, S. K., Affara, N. A., Vaiman, D., & Burgoyne, P. S. (2012). A genetic basis for a postmeiotic X versus Y chromosome intragenomic conflict in the mouse. PLoS Genet 8, e1002900.

Cocquet, J., Ellis, P. J., Yamauchi, Y., Mahadevaiah, S. K., Affara, N. A., Ward, M. A., & Burgoyne, P. S. (2009). The multicopy gene Sly represses the sex chromosomes in the male mouse germline after meiosis. PLoS Biol 7, e1000244.

Connallon, T., & Clark, A. G. (2010). Gene duplication, gene conversion and the evolution of the Y chromosome. Genetics 186, 277-86.

Cooke, H. J., Brown, W. R., & Rappold, G. A. (1985). Hypervariable telomeric sequences from the human sex chromosomes are pseudoautosomal. Nature 317, 687-92.

Elhaik, E., Tatarinova, T. V., Klyosov, A. A., & Graur, D. (2014). The 'extremely ancient' chromosome that isn't: a forensic bioinformatic investigation of Albert Perry's X-degenerate portion of the Y chromosome. Eur J Hum Genet 22, 1111-6.

Espinosa, J. R., Ayub, Q., Chen, Y., Xue, Y., & Tyler-Smith, C. (2015). Structural variation on the human Y chromosome from population-scale resequencing. Croat Med J 56, 194-207.

Fan, X., Abbott, T. E., Larson, D., & Chen, K. (2014). BreakDancer: Identification of Genomic Structural Variation from Paired-End Read Mapping. Curr Protoc Bioinformatics 45, 15.6.1-11.

Fernandes, S., Paracchini, S., Meyer, L. H., Floridia, G., Tyler-Smith, C., & Vogt, P. H. (2004). A large AZFc deletion removes DAZ3/DAZ4 and nearby genes from men in Y haplogroup N. Am J Hum Genet 74, 180-7.

Fisher, R. A. (1931). The evolution of dominance. Biological Reviews 6, 345-68.

Ford, C. E., Jones, K. W., Polani, P. E., De Almeida, J. C., & Briggs, J. H. (1959). A sex-chromosome anomaly in a case of gonadal dysgenesis (Turner's syndrome). Lancet 1, 711-3.

Ghenu, A. H., Bolker, B. M., Melnick, D. J., & Evans, B. J. (2016). Multicopy gene family evolution on primate Y chromosomes. BMC Genomics 17, 157.

Giachini, C., Nuti, F., Turner, D. J., Laface, I., Xue, Y., Daguin, F., Forti, G., Tyler-Smith, C., & Krausz, C. (2009). TSPY1 copy number variation influences spermatogenesis and shows differences among Y lineages. J Clin Endocrinol Metab 94, 4016-22.

Gubbay, J., Collignon, J., Koopman, P., Capel, B., Economou, A., Munsterberg, A., Vivian, N., Goodfellow, P., & Lovell-Badge, R. (1990). A gene mapping to the sex-determining region of the mouse Y chromosome is a member of a novel family of embryonically expressed genes. Nature 346, 245-50.

Hallast, P., Balaresque, P., Bowden, G. R., Ballereau, S., & Jobling, M. A. (2013). Recombination dynamics of a human Y-chromosomal palindrome: rapid GC-biased gene conversion, multi-kilobase conversion tracts, and rare inversions. PLoS Genet 9, e1003666.

Handsaker, R. E., Van Doren, V., Berman, J. R., Genovese, G., Kashin, S., Boettger, L. M., & McCarroll, S. A. (2015). Large multiallelic copy number variations in humans. Nat Genet 47, 296-303.

Henking, H. (1891). Über Spermatogenese und deren Beziehung zur Eientwicklung bei Pyrrhocoris apterus L. Zeitschrift für wissenschaftliche Zoologie 51, 685-736.

Hughes, J. F., Skaletsky, H., Brown, L. G., Pyntikova, T., Graves, T., Fulton, R. S., Dugan, S., Ding, Y., Buhay, C. J., et al. (2012). Strict evolutionary conservation followed rapid gene loss on human and rhesus Y chromosomes. Nature 483, 82-6.

Hughes, J. F., Skaletsky, H., Pyntikova, T., Graves, T. A., van Daalen, S. K., Minx, P. J., Fulton, R. S., McGrath, S. D., Locke, D. P., et al. (2010). Chimpanzee and human Y chromosomes are remarkably divergent in structure and gene content. Nature 463, 536-9.

International Chicken Genome Sequencing Consortium. (2004). Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432, 695-716.

Jacobs, P. A., & Strong, J. A. (1959). A case of human intersexuality having a possible XXY sex-determining mechanism. Nature 183, 302-3.

Johansson, M. M., Van Geystelen, A., Larmuseau, M. H., Djurovic, S., Andreassen, O. A., Agartz, I., & Jazin, E. (2015). Microarray Analysis of Copy Number Variants on the Human Y Chromosome Reveals Novel and Frequent Duplications Overrepresented in Specific Haplogroups. PLoS One 10, e0137223.

Karmin, M., Saag, L., Vicente, M., Wilson Sayres, M. A., Jarve, M., Talas, U. G., Rootsi, S., Ilumae, A. M., Magi, R., et al. (2015). A recent bottleneck of Y chromosome diversity coincides with a global change in culture. Genome Res 25, 459-66.

Koller, P. C., & Darlington, C. D. (1934). The genetical and mechanical properties of the sex chromosomes. Journal of Genetics 29, 159-73.

Kuroda-Kawaguchi, T., Skaletsky, H., Brown, L. G., Minx, P. J., Cordum, H. S., Waterston, R. H., Wilson, R. K., Silber, S., Oates, R., et al. (2001). The AZFc region of the Y chromosome features massive palindromes and uniform recurrent deletions in infertile men. Nat Genet 29, 279-86.

Lahn, B. T., & Page, D. C. (1997). Functional coherence of the human Y chromosome. Science 278, 675-80.

Lahn, B. T., & Page, D. C. (1999a). Four evolutionary strata on the human X chromosome. Science 286, 964-7.

Lahn, B. T., & Page, D. C. (1999b). Retroposition of autosomal mRNA yielded testis-specific gene family on human Y chromosome. Nat Genet 21, 429-33.

Lange, J., Noordam, M. J., van Daalen, S. K., Skaletsky, H., Clark, B. A., Macville, M. V., Page, D. C., & Repping, S. (2013). Intrachromosomal homologous recombination between inverted amplicons on opposing Y-chromosome arms. Genomics 102, 257-64.

Lange, J., Skaletsky, H., van Daalen, S. K., Embry, S. L., Korver, C. M., Brown, L. G., Oates, R. D., Silber, S., Repping, S., et al. (2009). Isodicentric Y chromosomes and sex disorders as byproducts of homologous recombination that maintains palindromes. Cell 138, 855-69.

Lattanzi, W., Di Giacomo, M. C., Lenato, G. M., Chimienti, G., Voglino, G., Resta, N., Pepe, G., & Guanti, G. (2005). A large interstitial deletion encompassing the amelogenin gene on the short arm of the Y chromosome. Hum Genet 116, 395-401.

Lau, E. C., Mohandas, T. K., Shapiro, L. J., Slavkin, H. C., & Snead, M. L. (1989). Human and mouse amelogenin gene loci are on the sex chromosomes. Genomics 4, 162-8.

Li, G., Davis, B. W., Raudsepp, T., Pearks Wilkerson, A. J., Mason, V. C., Ferguson-Smith, M., O'Brien, P. C., Waters, P. D., & Murphy, W. J. (2013). Comparative analysis of mammalian Y chromosomes illuminates ancestral structure and lineage-specific evolution. Genome Res 23, 1486-95.

Lu, C., Zhang, J., Li, Y., Xia, Y., Zhang, F., Wu, B., Wu, W., Ji, G., Gu, A., et al. (2009). The b2/b3 subdeletion shows higher risk of spermatogenic failure and higher frequency of complete AZFc deletion than the gr/gr subdeletion in a Chinese population. Hum Mol Genet 18, 1122-30.

Lynch, M., Cram, D. S., Reilly, A., O'Bryan, M. K., Baker, H. W., de Kretser, D. M., & McLachlan, R. I. (2005). The Y chromosome gr/gr subdeletion is associated with male infertility. Mol Hum Reprod 11, 507-12.

Ma, K., Inglis, J. D., Sharkey, A., Bickmore, W. A., Hill, R. E., Prosser, E. J., Speed, R. M., Thomson, E. J., Jobling, M., et al. (1993). A Y chromosome gene family with RNA-binding protein homology: candidates for the azoospermia factor AZF controlling human spermatogenesis. Cell 75, 1287-95.

Mendez, F. L., Krahn, T., Schrack, B., Krahn, A. M., Veeramah, K. R., Woerner, A. E., Fomine, F. L., Bradman, N., Thomas, M. G., et al. (2013). An African American paternal lineage adds an extremely ancient root to the human Y chromosome phylogenetic tree. Am J Hum Genet 92, 454-9.

Miller, C. A., Hampton, O., Coarfa, C., & Milosavljevic, A. (2011). ReadDepth: a parallel R package for detecting copy number alterations from short sequencing reads. PLoS One 6, e16327.

Monesi, V. (1965). Differential rate of ribonucleic acid synthesis in the autosomes and sex chromosomes during male meiosis in the mouse. Chromosoma 17, 11-21.

Morgan, A. P., & de Villena, F. P. V. (2017). Sequence and structural diversity of mouse Y chromosomes. Mol Biol Evol 34, 3186-204.

Mueller, J. L., Skaletsky, H., Brown, L. G., Zaghlul, S., Rock, S., Graves, T., Auger, K., Warren, W. C., Wilson, R. K., et al. (2013). Independent specialization of the human and mouse X chromosomes for the male germ line. Nat Genet 45, 1083-7.

Muller, H. J. (1914). A gene for the fourth chromosome of Drosophila. Journal of Experimental Zoology 17, 325-36.

Muller, H. J. (1964). The Relation of Recombination to Mutational Advance. Mutat Res 106, 2-9.

Nanda, I., Haaf, T., Schartl, M., Schmid, M., & Burt, D. W. (2002). Comparative mapping of Z-orthologous genes in vertebrates: implications for the evolution of avian sex chromosomes. Cytogenet Genome Res 99, 178-84.

Nanda, I., Shan, Z., Schartl, M., Burt, D. W., Koehler, M., Nothwang, H., Grutzner, F., Paton, I. R., Windsor, D., et al. (1999). 300 million years of conserved synteny between chicken Z and human chromosome 9. Nat Genet 21, 258-9.

Nathanson, K. L., Kanetsky, P. A., Hawes, R., Vaughn, D. J., Letrero, R., Tucker, K., Friedlander, M., Phillips, K. A., Hogg, D., et al. (2005). The Y deletion gr/gr and susceptibility to testicular germ cell tumor. Am J Hum Genet 77, 1034-43.

Nei, M. (1970). Accumulation of nonfunctional genes on sheltered chromosomes. American Naturalist 104, 311-22.

Oetjens, M. T., Shen, F., Emery, S. B., Zou, Z., & Kidd, J. M. (2016). Y-Chromosome Structural Diversity in the Bonobo and Chimpanzee Lineages. Genome Biol Evol 8, 2231-40.

Ohno, S. (1967). Sex Chromosomes and Sex-linked Genes. Berlin: Springer.

Page, D. C., Harper, M. E., Love, J., & Botstein, D. (1984). Occurrence of a transposition from the X-chromosome long arm to the Y-chromosome short arm during human evolution. Nature 311, 119-23.

Page, D. C., Mosher, R., Simpson, E. M., Fisher, E. M., Mardon, G., Pollack, J., McGillivray, B., de la Chapelle, A., & Brown, L. G. (1987). The sex-determining region of the human Y chromosome encodes a finger protein. Cell 51, 1091-104.

Painter, T. S. (1921). The Y-Chromosome in Mammals. Science 53, 503-4.

Poznik, G. D., Xue, Y., Mendez, F. L., Willems, T. F., Massaia, A., Wilson Sayres, M. A., Ayub, Q., McCarthy, S. A., Narechania, A., et al. (2016). Punctuated bursts in human male demography inferred from 1,244 worldwide Y-chromosome sequences. Nat Genet 48, 593-9.

Rausch, T., Zichner, T., Schlattl, A., Stutz, A. M., Benes, V., & Korbel, J. O. (2012). DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 28, i333-i39.

Reijo, R., Lee, T. Y., Salo, P., Alagappan, R., Brown, L. G., Rosenberg, M., Rozen, S., Jaffe, T., Straus, D., et al. (1995). Diverse spermatogenic defects in humans caused by Y chromosome deletions encompassing a novel RNA-binding protein gene. Nat Genet 10, 383-93.

Repping, S., Skaletsky, H., Brown, L., van Daalen, S. K., Korver, C. M., Pyntikova, T., Kuroda-Kawaguchi, T., de Vries, J. W., Oates, R. D., et al. (2003). Polymorphism for a 1.6-Mb deletion of the human Y chromosome persists through balance between recurrent mutation and haploid selection. Nat Genet 35, 247-51.

Repping, S., Skaletsky, H., Lange, J., Silber, S., Van Der Veen, F., Oates, R. D., Page, D. C., & Rozen, S. (2002). Recombination between palindromes P5 and P1 on the human Y chromosome causes massive deletions and spermatogenic failure. Am J Hum Genet 71, 906-22.

Repping, S., van Daalen, S. K., Brown, L. G., Korver, C. M., Lange, J., Marszalek, J. D., Pyntikova, T., van der Veen, F., Skaletsky, H., et al. (2006). High mutation rates have driven extensive structural polymorphism among human Y chromosomes. Nat Genet 38, 463-7.

Repping, S., van Daalen, S. K., Korver, C. M., Brown, L. G., Marszalek, J. D., Gianotten, J., Oates, R. D., Silber, S., van der Veen, F., et al. (2004). A family of human Y chromosomes has dispersed throughout northern Eurasia despite a 1.8-Mb deletion in the azoospermia factor c region. Genomics 83, 1046-52.

Rice, W. R. (1987a). The Accumulation of Sexually Antagonistic Genes as a Selective Agent Promoting the Evolution of Reduced Recombination between Primitive Sex Chromosomes. Evolution 41, 911-14.

Rice, W. R. (1987b). Genetic hitchhiking and the evolution of reduced genetic activity of the Y sex chromosome. Genetics 116, 161-7.

Rice, W. R. (1992). Sexually antagonistic genes: experimental evidence. Science 256, 1436-9.

Ross, M. T., Grafham, D. V., Coffey, A. J., Scherer, S., McLay, K., Muzny, D., Platzer, M., Howell, G. R., Burrows, C., et al. (2005). The DNA sequence of the human X chromosome. Nature 434, 325-37.

Rozen, S., Marszalek, J. D., Alagappan, R. K., Skaletsky, H., & Page, D. C. (2009). Remarkably little variation in proteins encoded by the Y chromosome's single-copy genes, implying effective purifying selection. Am J Hum Genet 85, 923-8.

Rozen, S., Skaletsky, H., Marszalek, J. D., Minx, P. J., Cordum, H. S., Waterston, R. H., Wilson, R. K., & Page, D. C. (2003). Abundant gene conversion between arms of palindromes in human and ape Y chromosomes. Nature 423, 873-6.

Rozen, S. G., Marszalek, J. D., Irenze, K., Skaletsky, H., Brown, L. G., Oates, R. D., Silber, S. J., Ardlie, K., & Page, D. C. (2012). AZFc deletions and spermatogenic failure: a population-based survey of 20,000 Y chromosomes. Am J Hum Genet 91, 890-6.

Saxena, R., Brown, L. G., Hawkins, T., Alagappan, R. K., Skaletsky, H., Reeve, M. P., Reijo, R., Rozen, S., Dinulos, M. B., et al. (1996). The DAZ gene cluster on the human Y chromosome arose from an autosomal gene that was transposed, repeatedly amplified and pruned. Nat Genet 14, 292-9.

Saxena, R., de Vries, J. W., Repping, S., Alagappan, R. K., Skaletsky, H., Brown, L. G., Ma, P., Chen, E., Hoovers, J. M., et al. (2000). Four DAZ genes in two clusters found in the AZFc region of the human Y chromosome. Genomics 67, 256-67.

Schofield, R. (1921). Inheritance of webbed toes. J Hered 12, 400-01.

Simmler, M. C., Rouyer, F., Vergnaud, G., Nystrom-Lahti, M., Ngo, K. Y., de la Chapelle, A., & Weissenbach, J. (1985). Pseudoautosomal DNA sequences in the pairing region of the human sex chromosomes. Nature 317, 692-7.

Sinclair, A. H., Berta, P., Palmer, M. S., Hawkins, J. R., Griffiths, B. L., Smith, M. J., Foster, J. W., Frischauf, A. M., Lovell-Badge, R., et al. (1990). A gene from the human sex-determining region encodes a protein with homology to a conserved DNA-binding motif. Nature 346, 240-4.

Skaletsky, H., Kuroda-Kawaguchi, T., Minx, P. J., Cordum, H. S., Hillier, L., Brown, L. G., Repping, S., Pyntikova, T., Ali, J., et al. (2003). The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature 423, 825-37.

Skinner, B. M., Sargent, C. A., Churcher, C., Hunt, T., Herrero, J., Loveland, J. E., Dunn, M., Louzada, S., Fu, B., et al. (2016). The pig X and Y Chromosomes: structure, sequence, and evolution. Genome Res 26, 130-9.

Smith, J. M., & Haigh, J. (1974). The hitch-hiking effect of a favourable gene. Genet Res 23, 23-35.

Soh, Y. Q., Alfoldi, J., Pyntikova, T., Brown, L. G., Graves, T., Minx, P. J., Fulton, R. S., Kremitzki, C., Koutseva, N., et al. (2014). Sequencing the mouse Y chromosome reveals convergent gene acquisition and amplification on both sex chromosomes. Cell 159, 800-13.

Stern, C. (1957). The problem of complete Y-linkage in man. Am J Hum Genet 9, 147-66.

Stevens, N. M. (1905). Studies in spermatogenesis. Washington, D.C.: Carnegie Institution of Washington.

Sudmant, P. H., Rausch, T., Gardner, E. J., Handsaker, R. E., Abyzov, A., Huddleston, J., Zhang, Y., Ye, K., Jun, G., et al. (2015). An integrated map of structural variation in 2,504 human genomes. Nature 526, 75-81.

Sun, C., Skaletsky, H., Rozen, S., Gromoll, J., Nieschlag, E., Oates, R., & Page, D. C. (2000). Deletion of azoospermia factor a (AZFa) region of human Y chromosome caused by recombination between HERV15 proviruses. Hum Mol Genet 9, 2291-6.

Sutton, W. S. (1902). On the morphology of the chromosome group in Brachystola magna. Biological Bulletin 4, 24-39.

Tiepolo, L., & Zuffardi, O. (1976). Localization of factors controlling spermatogenesis in the nonfluorescent portion of the human Y chromosome long arm. Hum Genet 34, 119-24.

Tomaszkiewicz, M., Rangavittal, S., Cechova, M., Campos Sanchez, R., Fescemyer, H. W., Harris, R., Ye, D., O'Brien, P. C., Chikhi, R., et al. (2016). A time- and cost-effective strategy to sequence mammalian Y Chromosomes: an application to the de novo assembly of gorilla Y. Genome Res 26, 530-40.

Tommasi, C. (1907a). Ipertricosi auricolare famigliare. Arch-Psichiatr Neuropat Antropol Crim Med Legale 28, 60-67.

Tommasi, C. (1907b). Ipertricosi auricolare famigliare. Giorn Psych Clin Tech Manic 35, 1-21.

Turner, J. M., Mahadevaiah, S. K., Fernandez-Capetillo, O., Nussenzweig, A., Xu, X., Deng, C. X., & Burgoyne, P. S. (2005). Silencing of unsynapsed meiotic chromosomes in the mouse. Nat Genet 37, 41-7.

Vogt, P. H., Edelmann, A., Kirsch, S., Henegariu, O., Hirschmann, P., Kiesewetter, F., Kohn, F. M., Schill, W. B., Farah, S., et al. (1996). Human Y chromosome azoospermia factors (AZF) mapped to different subregions in Yq11. Hum Mol Genet 5, 933-43.

Vollrath, D., Foote, S., Hilton, A., Brown, L. G., Beer-Romero, P., Bogan, J. S., & Page, D. C. (1992). The human Y chromosome: a 43-interval map based on naturally occurring deletions. Science 258, 52-9.

Warburton, P. E., Giordano, J., Cheung, F., Gelfand, Y., & Benson, G. (2004). Inverted repeat structure of the human genome: the X-chromosome contains a preponderance of large, highly homologous inverted repeats that contain testes genes. Genome Res 14, 1861-9.

Watson, J. M., Spencer, J. A., Riggs, A. D., & Graves, J. A. (1990). The X chromosome of monotremes shares a highly conserved region with the eutherian and marsupial X chromosomes despite the absence of X chromosome inactivation. Proc Natl Acad Sci U S A 87, 7125-9.

Watson, J. M., Spencer, J. A., Riggs, A. D., & Graves, J. A. (1991). Sex chromosome evolution: platypus gene mapping suggests that part of the human X chromosome was originally autosomal. Proc Natl Acad Sci U S A 88, 11256-60.

Wei, W., Fitzgerald, T. W., Ayub, Q., Massaia, A., Smith, B. H., Dominiczak, A. F., Morris, A. D., Porteous, D. J., Hurles, M. E., et al. (2015). Copy number variation in the human Y chromosome in the UK population. Hum Genet 134, 789-800.

Wilson, E. B. (1905). The Chromosomes in Relation to the Determination of Sex in Insects. Science 22, 500-2.

Wilson Sayres, M. A., Lohmueller, K. E., & Nielsen, R. (2014). Natural selection reduced diversity on human Y chromosomes. PLoS Genet 10, e1004064.

Y Chromosome Consortium. (2002). A nomenclature system for the tree of human Y-chromosomal binary haplogroups. Genome Res 12, 339-48.


CHAPTER 2

SELECTION HAS COUNTERED HIGH MUTABILITY TO PRESERVE THE ANCESTRAL COPY NUMBER OF Y CHROMOSOME AMPLICONS IN DIVERSE HUMAN LINEAGES

______

Levi S. Teitz, Tatyana Pyntikova, Helen Skaletsky, and David C. Page

Author Contributions: L.S.T. and D.C.P. planned the project. L.S.T. developed computational methods with assistance from H.S. and performed data analysis. T.P. performed FISH experiments. L.S.T., H.S., and T.P. analyzed FISH data. L.S.T. and D.C.P. wrote the manuscript.

Adapted from a manuscript submitted to The American Journal of Human Genetics.

SUMMARY

Amplicons—large, highly identical segmental duplications—are a prominent feature of mammalian Y chromosomes. Although they encode genes essential for fertility, these amplicons differ vastly between species, and little is known about the selective constraints acting on them. Here, we develop computational tools to detect amplicon copy number with unprecedented accuracy from high-throughput sequencing data. We find that one-sixth (16.9%) of 1216 males from the 1000 Genomes Project have at least one deleted or duplicated amplicon. However, each amplicon’s reference copy number is scrupulously maintained among divergent branches of the Y chromosome phylogeny, including the ancient branch A00, indicating that the reference copy number is ancestral to all modern human Y chromosomes. Using simulations and novel analytical methods, we demonstrate that this pattern of variation is incompatible with neutral evolution and instead displays hallmarks of mutation-selection balance. We also observe cases of amplicon rescue, in which deleted amplicons are restored through subsequent duplications. These results indicate that, contrary to the lack of constraint suggested by the differences between species, natural selection has suppressed amplicon copy number variation in diverse human lineages.

______

INTRODUCTION

The human Y chromosome does not recombine with a homologous chromosome along the vast majority of its length (Burgoyne, 1982). As a result, it developed a unique and complex genomic structure compared to the other chromosomes (Skaletsky et al., 2003). In particular, the Y chromosome contains distinct classes of DNA sequence, most strikingly the ampliconic sequence: large segments, ranging from tens of kilobases to megabases, present in two or more copies on the Y chromosome (Figure 1A). The amplicons are typically arranged in palindromes and have extremely high sequence identity (amplicon copies differ by as little as 3 in 100,000 base pairs) that is maintained by gene conversion (Rozen et al., 2003). Remarkably, the genes within the amplicons are functionally coherent: they are expressed predominantly or exclusively in the testis (Lahn and Page, 1997). The azoospermia factor c (AZFc) region, composed of ampliconic units interleaved in an intricate pattern (Figure 1B), is of particular functional importance, as large deletions within the region cause spermatogenic failure (Vogt et al., 1996; Kuroda-Kawaguchi et al., 2001). Other amplicon variants cause or increase the risk of spermatogenic failure, sex reversal, Turner syndrome, and testis cancer (Andersson et al., 1986; Lynch et al., 2005; Nathanson et al., 2005; Giachini et al., 2009; Lange et al., 2009; Rozen et al., 2012).

Little is known about the evolutionary forces that govern the formation, maintenance, and diversification of Y chromosome amplicons. Although amplicons are present on other mammalian Y chromosomes, the genetic content and genomic structure of those amplicons differ wildly between species (Hughes et al., 2010; Hughes et al., 2012; Soh et al., 2014). Even humans and chimpanzees, which diverged only 8 million years ago and whose autosomal euchromatic sequences are 99% identical, share some amplicons but drastically differ in overall ampliconic content and structure (Chimpanzee Sequencing Analysis Consortium, 2005). Because of this extreme divergence, the evolutionary history of the amplicons cannot be reconstructed by comparing Y chromosomes of different species.

FIGURE 1. GENOMIC STRUCTURE OF THE HUMAN Y CHROMOSOME

A. The human Y chromosome contains four major sequence classes (see Introduction for details). Ampliconic sequence includes palindromes with a small spacer sequence between copies and widely spaced inverted repeats. Arrows: palindrome or inverted repeat copies; arrow direction indicates copy orientation. Locations of ampliconic protein-coding genes are also shown.

B. AZFc region encompassing palindromes 1-3, containing multiple copies of six ampliconic subunits with ≥99.94% nucleotide identity. Arrows: blue, teal, green, red, gray, and yellow amplicon units; arrow direction indicates copy orientation. Small white rectangle: single copy of IR1. Locations of protein-coding genes found in the AZFc region are shown below AZFc architecture.

Amplicon variation within species, on the other hand, occurs over a timescale that is conducive to the study of amplicon evolution. Copy number variants (CNVs), that is, deletions and duplications of ampliconic sequence, have been studied on the human Y chromosome for decades. These CNVs are often caused by non-allelic homologous recombination (NAHR), as the large, nearly identical amplicon copies represent prime targets for this mechanism (Lange et al., 2009; Lange et al., 2013). Early studies of amplicon CNVs each focused on the detection of a single type of CNV (Vogt et al., 1996; Repping et al., 2002; Repping et al., 2004). Later studies, bolstered by developing technology, described many types of CNVs in larger numbers of men (Repping et al., 2006; Rozen et al., 2012; Espinosa et al., 2015; Johansson et al., 2015; Wei et al., 2015; Massaia and Xue, 2017). Amplicon CNVs have recently been discovered on the Y chromosomes of chimpanzees, macaques, gorillas, and mice (Ghenu et al., 2016; Oetjens et al., 2016; Tomaszkiewicz et al., 2016; Morgan and de Villena, 2017). Some amplicon CNVs have been implicated in spermatogenic failure, sex reversal, Turner syndrome, and testis cancer (Andersson et al., 1986; Vogt et al., 1996; Kuroda-Kawaguchi et al., 2001; Lynch et al., 2005; Nathanson et al., 2005; Giachini et al., 2009; Lange et al., 2009; Rozen et al., 2012). However, the amplicon CNVs with well-described phenotypes represent only a small part of the spectrum of amplicon variation; the vast majority of amplicon CNVs that have been discovered have no known effect on spermatogenesis or any other trait (Repping et al., 2006; Espinosa et al., 2015; Johansson et al., 2015; Wei et al., 2015).

Even though amplicon CNVs have been the subject of intense investigation, previous studies made only nominal attempts to reconstruct the evolution of the amplicons, instead focusing on documenting amplicon variation. Here, we present the most detailed reconstruction of Y chromosome amplicon evolution in a single species to date. Our study is made possible by the meticulous sequencing of the reference human Y chromosome, advances in technology that made large sequencing datasets available, and phylogenetic studies of the Y chromosome that enabled the construction of a detailed tree of the men in our dataset (Figure 2; Y Chromosome Consortium, 2002; 1000 Genomes Project Consortium et al., 2015; Poznik et al., 2016). With these three tools, we obtain accurate amplicon CNV calls in 1216 men and map those CNV calls onto the phylogenetic tree. We then use that tree to test evolutionary models in a more precise and powerful manner than previous studies, allowing us to better describe the evolutionary pressures acting on the amplicons.

We find that 16.9% of men in our dataset have an amplicon CNV, and that such CNVs are caused by both NAHR between amplicon copies and other mechanisms. Although these CNVs are present in almost all major haplogroups (branches of the Y chromosome phylogeny), the reference copy number of each amplicon is maintained among divergent branches, including in haplogroup A00, which diverged from all other human Y chromosomes over 200 kya (Mendez et al., 2013). Further, the distribution of men with CNVs within the phylogenetic tree of these Y chromosomes is incompatible with a model of neutral evolution and instead is indicative of mutation-selection balance. Finally, we observe cases of amplicon rescue, in which deleted amplicons are restored through subsequent duplication of other nearly identical amplicon copies. These results indicate that, although amplicons are susceptible to large-scale rearrangements, selection acts to maintain amplicon copy number among diverse human lineages.

______

FIGURE 2. DETAILED PHYLOGENETIC TREE OF 1000 GENOMES PROJECT Y CHROMOSOMES

Haplogroup names are shown around the tree. Branch lengths are measured in SNPs.

RESULTS

SEQUENCING DEPTH CORRESPONDS TO AMPLICON COPY NUMBER

To detect amplicon CNVs, we analyzed whole genome sequencing data from males from the 1000 Genomes Project (1000 Genomes Project Consortium et al., 2015). Detecting copy number variation from sequencing data is a well-studied problem, but widely used tools struggle to accurately call CNVs in the complex ampliconic regions. Therefore, we developed a pipeline to search for amplicon copy number changes (Figure 3). We aligned the data to the entire reference genome, masked genome-typical interspersed repeats on the Y chromosome, and computed the depth of 15 amplicons and 4 single-copy control regions on the Y chromosome (Table S1). We then adjusted depth to correct for GC content bias and to normalize for overall coverage of the Y chromosome so that, in the absence of a CNV, the expected depth of each control region and amplicon is 1.
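The normalization step can be illustrated with a minimal sketch. It assumes that each region's raw depth has already been averaged over all of its reference copies and GC-corrected (not shown), and that the baseline is the mean depth of the single-copy control regions; the function name and the input layout are hypothetical and are not taken from the pipeline itself.

```python
from statistics import mean


def normalize_depths(amplicon_depth, control_depth):
    """Scale mean depths so that single-copy control sequence has depth 1.

    Both arguments map a region name to its mean read depth.
    Assumption: the baseline is the mean depth of the control regions.
    """
    baseline = mean(control_depth.values())
    if baseline <= 0:
        raise ValueError("control regions have no coverage")
    # Divide every region by the same baseline so depths are comparable
    # across men sequenced to different coverage.
    normalized_amplicons = {name: d / baseline for name, d in amplicon_depth.items()}
    normalized_controls = {name: d / baseline for name, d in control_depth.items()}
    return normalized_amplicons, normalized_controls
```

Under these assumptions, an amplicon at its reference copy number has a normalized depth near 1, and a man who has lost half of the copies of an amplicon has a normalized depth near 0.5 for that amplicon.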

FIGURE 3. PIPELINE FOR DETECTING AMPLICON CNVS FROM SEQUENCING DATA

FIGURE 4. AMPLICON CNVS CAN BE DETECTED FROM SEQUENCING DATA

A. Normalized depth of blue, green, red, and yellow amplicons in 1216 men. Vertical lines: cutoff at which copy number call changes.

B. Normalized depth of single-copy control regions in 1216 men.

C. Normalized depth and copy number of each amplicon in a single man with the reference number of each amplicon. Bars: normalized depth of amplicons. Hash marks: copy number call cutoffs. Dotted gray line: normalized depth of 1.

D. Example of a partial amplicon CNV. Depth in each copy of the amplicon is shown. Blue dots: depth of 100-bp windows. Red lines: predicted change points.

After these steps, the depth of each amplicon provided an estimate of its copy number. When the depth of an amplicon for each individual was plotted as a histogram, we observed extraordinarily clear peaks corresponding to whole-amplicon deletions and duplications (Figure 4A). (The exception is the P7 amplicon, which did not give consistent results because of its very small size and was excluded from subsequent analyses.) In contrast, the control regions showed only single peaks centered around a normalized depth of 1 (Figure 4B). The sharp ampliconic peaks imply that most amplicon copy number variation affects whole amplicons at a time, consistent with the idea that NAHR is the dominant mechanism by which Y chromosome amplicon CNVs arise. Because of these peaks, we called amplicon copy number from depth alone and determined the complete amplicon copy number state of each individual (Figure 4C).
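Calling copy number from normalized depth alone can be sketched as follows. This is an illustration under stated assumptions, not the cutoffs actually used: it assumes that a normalized depth of 1 corresponds to the reference copy number and that the cutoffs sit halfway between adjacent integer copy numbers. Both function names are hypothetical.

```python
def call_copy_number(normalized_depth, reference_copies):
    """Convert a normalized depth into an integer copy number call.

    Assumption: depth 1.0 means `reference_copies` copies, and calls
    change at the midpoint between adjacent integer copy numbers.
    """
    estimate = normalized_depth * reference_copies
    return max(0, int(estimate + 0.5))  # round half up, never below zero


def depth_cutoffs(reference_copies, max_copies):
    """Normalized depths at which the call changes from k to k + 1 copies."""
    return [(k + 0.5) / reference_copies for k in range(max_copies)]
```

For an amplicon with four reference copies, a normalized depth of about 0.5 would be called as two copies and a depth of about 1.5 as six copies under this scheme.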

The AZFc region is particularly susceptible to CNVs due to NAHR because many of its amplicon copies are in the same orientation (Figure 1B), so crossing-over between those copies results in deletion or duplication (Kuroda-Kawaguchi et al., 2001; Repping et al., 2006). (In contrast, most other amplicons on the human Y chromosome are in opposite orientations and are only susceptible to inversion through this mechanism.) The AZFc architectures caused by NAHR events can be predicted by enumerating all possible series of one, two, and three NAHR events, and we matched observed copy number states to these predicted architectures (Figure S1). In this way, we used our copy number data to draw conclusions about the arrangement of CNVs and the mechanism by which they arose.
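The enumeration of NAHR series described above can be sketched with a simplified model. An architecture is represented as a tuple of `(amplicon_name, orientation)` units; recombination between two same-orientation copies yields a deletion and its reciprocal duplication, and recombination between opposite-orientation copies inverts the intervening sequence. The representation, the function names, and the toy architecture in the usage note are all hypothetical; the real AZFc architecture is not encoded here.

```python
from collections import Counter


def nahr_products(arch):
    """Return every architecture reachable from `arch` by one NAHR event."""
    products = set()
    for i in range(len(arch)):
        for j in range(i + 1, len(arch)):
            (name_i, orient_i), (name_j, orient_j) = arch[i], arch[j]
            if name_i != name_j:
                continue  # NAHR needs two copies of the same amplicon
            if orient_i == orient_j:
                # Direct repeats: deletion and the reciprocal duplication.
                products.add(arch[:i + 1] + arch[j + 1:])
                products.add(arch[:j + 1] + arch[i + 1:])
            else:
                # Inverted repeats: the intervening units are reversed and flipped.
                flipped = tuple((n, -o) for n, o in reversed(arch[i + 1:j]))
                products.add(arch[:i + 1] + flipped + arch[j:])
    return products


def copy_number_state(arch):
    """Summarize an architecture as sorted (amplicon, copy number) pairs."""
    return tuple(sorted(Counter(name for name, _ in arch).items()))


def enumerate_states(reference, max_steps=3):
    """Map each reachable copy number state to the fewest NAHR steps needed."""
    seen = {reference}
    frontier = {reference}
    states = {copy_number_state(reference): 0}
    for step in range(1, max_steps + 1):
        next_frontier = set()
        for arch in frontier:
            for product in nahr_products(arch):
                if product not in seen:
                    seen.add(product)
                    next_frontier.add(product)
                    states.setdefault(copy_number_state(product), step)
        frontier = next_frontier
    return states
```

For a toy architecture such as `(("a", 1), ("x", 1), ("a", 1))`, one step yields a deletion leaving a single `a` and a duplication with three copies of `a` and two of `x`. An observed copy number state can then be looked up in the returned dictionary to find a candidate series of events.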

CNVs that cause the deletion or duplication of only part of an amplicon copy can also occur. To detect these, we used a modified version of the binary segmentation algorithm (Sen and Srivastava, 1975) to detect abrupt changes in depth in the middle of amplicons (Figure 4D). These change points represent the breakpoints of partial (rather than whole-amplicon) CNV events, which are not caused by NAHR between whole amplicon copies. We detected 52 men with partial CNVs, six of whom had previously been called by our whole-amplicon analysis as men with the reference amplicon state.
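A plain binary segmentation for shifts in mean depth can be sketched as below. This is the textbook form of the algorithm, not the modified version used in the analysis, whose details are not given in this section. The gain threshold `min_gain` and the minimum segment length `min_size` are hypothetical tuning parameters.

```python
def best_split(values, min_size):
    """Find the split that most reduces the within-segment sum of squares."""
    n = len(values)
    total = sum(values)
    best_gain, best_k = 0.0, None
    left_sum = 0.0
    for k in range(1, n):
        left_sum += values[k - 1]
        if k < min_size or n - k < min_size:
            continue  # both segments must contain at least min_size windows
        right_sum = total - left_sum
        # Reduction in squared error from fitting two means instead of one.
        gain = left_sum ** 2 / k + right_sum ** 2 / (n - k) - total ** 2 / n
        if gain > best_gain:
            best_gain, best_k = gain, k
    return best_k, best_gain


def binary_segmentation(values, min_gain, min_size=5, offset=0):
    """Return sorted change-point indices found by recursive splitting."""
    if len(values) < 2 * min_size:
        return []
    k, gain = best_split(values, min_size)
    if k is None or gain < min_gain:
        return []
    left = binary_segmentation(values[:k], min_gain, min_size, offset)
    right = binary_segmentation(values[k:], min_gain, min_size, offset + k)
    return left + [offset + k] + right
```

Applied to the depth of consecutive windows along an amplicon copy, each returned index marks a window where the mean depth changes abruptly, which is a candidate breakpoint of a partial CNV.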

FILTERING AND CONFIRMATION OF COPY NUMBER CALLS

To confirm that no reads from elsewhere in the genome align to the amplicons or control regions, we measured amplicon and control region depth of 15 women and five men from the 1000 Genomes Project. Normalized ampliconic and control region depth in all women was near 0 (Figure 5A).

To ensure the accuracy of copy number calls, we removed men with either abnormal depth in their control regions or discordance between the mean and median depth of their amplicons. To ensure that noise due to low depth was not introducing artifactual calls to our dataset, we determined that the rate of CNVs did not significantly differ between men with lower depth and men with higher depth (Figure 6). We also compared amplicon copy number in two father-son pairs found in the 1000 Genomes Project. In each pair, the same copy number of each amplicon was present in father and son (Figure 5B). Additionally, 92 men have sequencing data from two independent sequencing libraries that pass the above filtering steps. When amplicon copy number calls were generated independently from both libraries for each man, 89 of the 92 men (96.7%) had concordant copy number calls between libraries, and the three men with discordant calls each differed in only a single amplicon (Figure 5C). From these results, we expect that 98.4% of the men in our dataset have accurate copy number calls for every amplicon.
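The step from 96.7% concordance to 98.4% expected accuracy is not spelled out here. One reading that reproduces the figure, offered as an assumption, is that each discordant pair reflects an error in exactly one of its two libraries, so three errors are spread over 184 library-level call sets. The function name is hypothetical.

```python
def per_library_accuracy(n_pairs, n_discordant):
    """Estimate single-library accuracy from replicate discordance.

    Assumption: every discordant pair contains exactly one erroneous
    library, and concordant pairs contain none.
    """
    n_libraries = 2 * n_pairs
    return 1 - n_discordant / n_libraries


concordance = 89 / 92                         # fraction of men with matching calls
accuracy = per_library_accuracy(92, 3)        # 1 - 3/184
print(f"{concordance:.1%} concordant, {accuracy:.1%} expected accurate")
```

This estimate ignores the small chance that both libraries share the same error, which would make a pair concordant but wrong.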

To validate the copy number calls using an orthogonal and non-computational method, we performed multi-color interphase FISH on 12 lymphoblastoid cell lines of men sequenced in the 1000 Genomes Project. These samples were chosen to represent a range of amplicon structures. We counted the copy number of green, red, and yellow amplicons using probes that hybridize to those amplicons. The FISH analysis confirmed our computational CNV calls (Figure 5D-F, Figure S2), although we found that FISH tended to underestimate amplicon copy number in men with large duplications.

FIGURE 5. VALIDATION OF AMPLICON CNV CALLS

A. Normalized amplicon and control region depth in 15 women and five men. Depth in men is ~0.5 because depth was normalized with an autosomal region.

B. Amplicon copy number calls in two father-son pairs.

C. Example of a man with two sequencing libraries with concordant amplicon copy number calls. In all copy number call plots of men with a deletion or duplication, the bar of the affected amplicon is shown in red or green, respectively.

D-F. Two-color interphase FISH using probes that hybridize to the green and red amplicons (left) and green and yellow amplicons (middle). Right: copy number calls. Bottom: predicted AZFc architecture.

FIGURE 6. SEQUENCING DEPTH IS NOT CORRELATED WITH CNV CALLS

A. Fraction of men with CNV calls in different ranges of sequencing depth. Error bars: binomial 95% confidence intervals.

B-D. Fraction of men with CNV calls in different ranges of sequencing depth in well-represented sub-haplogroups E1b (n=294), R1a (n=81), and R1b (n=206). This controls for the possibility of the whole-dataset results being affected by, for example, a haplogroup with a high fraction of men with CNVs that was sequenced more deeply than other haplogroups. Error bars: binomial 95% confidence intervals.

CLASSIFICATION AND PHENOTYPIC IMPACT OF OBSERVED CNVS

Of the 1216 men analyzed, 206 (16.9%) had one of 56 distinct CNVs affecting at least one amplicon. Of these men, 88 (43%) had deletions, 103 (50%) had duplications, and 15 (7%) had complex CNVs (deletions of some amplicons and duplications of others) (Figure 7A). These results are in rough concordance with a previous report that found CNVs in 14.7% of Y chromosomes (Johansson et al., 2015). The majority (171/206) of these men had amplicon CNVs found solely within the AZFc region (Figure 7B). However, we also detected 31 men with non-AZFc CNVs (Figure 7C), as well as four men with CNVs both within and outside of the AZFc region.

The majority (133/175, 76%) of men with AZFc CNVs had CNVs that corresponded to the predicted one-, two-, or three-step NAHR events (Figure 7D, Figure S1). Many of the observed CNVs are well-studied and recurrent, such as gr/gr deletions and b2/b3 deletions, while others are novel (Figure 7E). One-step events (n=84) were more common than two-step events (n=40), which were in turn more common than three-step events (n=9), consistent with our hypothesis that such CNVs are caused by independent and successive NAHR events. However, 34 (19%) of the AZFc CNVs did not correspond to predicted NAHR events, and instead corresponded to the deletion or duplication of a single block of sequence within the AZFc region (Figure 7F). Such CNVs could be caused by non-NAHR mechanisms such as non-homologous end joining, or by NAHR between small targets within amplicon copies, such as genome-typical interspersed repeats. An additional eight men had AZFc CNVs that could not be explained by either of the above two mechanisms. Such men likely represent a combination of NAHR and non-NAHR events. Several of these men had evidence of partial amplicon CNVs that, combined with the predicted AZFc architectures, suggested plausible mechanisms for the formation of such CNVs (Figure 8).

FIGURE 7. AMPLICON CNVS

A. Proportion of men with the reference amplicon state (no CNVs), duplications, deletions, and complex CNVs.

B. Locations of observed CNVs.

C. Copy number calls in a man with a non-AZFc CNV.

D. Predicted mechanism of AZFc CNV formation.

E. Examples of men with AZFc CNVs predicted to be caused by NAHR. Predicted architectures are shown below. Left: a man with the previously described gr/gr deletion, found in 49 men in our dataset. Middle: a man with the previously described b2/b3 deletion, found in 26 men in our dataset. Right: a man with a novel duplication, found in one man in our dataset.

F. Example of a man with an AZFc CNV not caused by NAHR between amplicon copies. Bottom: reference AZFc architecture with gray arc showing the predicted breakpoints of the non-NAHR event, and predicted CNV architecture.

FIGURE 8. MECHANISM OF FORMATION OF A COMPLEX AZFC CNV

A. Copy number calls of affected man. The copy number calls do not match any predicted AZFc architecture.

B. Evidence of a partial amplicon mutation event in this man. Blue dots: depth of 100-bp windows. Red lines: predicted change points.

C. Predicted multi-step mechanism of formation for this CNV. 1. Reference AZFc architecture. The green arc shows the targets of NAHR on a single copy of the AZFc region. 2. Crossing over occurs between two sister chromatids of the Y chromosome, causing a deletion. An alternative mechanism, in which a single chromatid forms a loop and undergoes NAHR with itself, is not shown. 3. Intermediate deletion stage after NAHR. The gray arc shows breakpoints of the subsequent non-NAHR duplication event. This duplication event occurred twice. 4. Final AZFc architecture, which matches the copy number calls in A. The copy number call for the green amplicon results from part of that amplicon being present in two copies and part of it being present in six copies.

The only CNVs in our dataset that have definitive phenotypic effects are the gr/gr and b1/b3 deletions, which we found in 49 men and one man, respectively, and which are risk factors for spermatogenic failure and/or testis cancer (Lynch et al., 2005; Nathanson et al., 2005; Rozen et al., 2012). We also observed 26 b2/b3 deletions, whose phenotype is less clear, with conflicting reports about whether or not it contributes to spermatogenic failure (Lu et al., 2009; Rozen et al., 2012). We observed no men with either of the canonical complete AZFb or AZFc deletions, which both cause spermatogenic failure with nearly complete penetrance (Kuroda-Kawaguchi et al., 2001; Repping et al., 2002). (The incidences of these complete AZFb and AZFc deletions in the general population are approximately 1/8188 and 1/2320, respectively, so it is unsurprising that we see no such variants in a study of our size [Repping et al., 2002; Rozen et al., 2012].) The remaining 53 distinct CNVs, which are present in the majority of men with CNVs in our dataset (130/206, 63.1%), have no known strong associations with a phenotype. This reinforces the fact that ampliconic copy number variation is broader than the few well-studied variants that have been brought to the forefront by studies in azoospermic men.

THE REFERENCE AMPLICON COPY NUMBER STATE PERVADES THE Y CHROMOSOME PHYLOGENETIC TREE

We next asked what this variation tells us about the evolution of the amplicons. Because the Y chromosome is inherited from father to son as a single haplotype, an accurate phylogenetic tree of all Y chromosomes can be constructed; major branches of this phylogeny are called haplogroups (Y Chromosome Consortium, 2002). The 1000 Genomes Project contains samples collected from populations around the globe, so most major haplogroups are represented (Figure 9A). We calculated the proportion of Y chromosomes of each haplogroup that have amplicon CNVs (Figure 9B). Two haplogroups, D and N, have no individuals with the reference copy number, the result of ancestral deletions that are fixed within those haplogroups (Repping et al., 2003; Fernandes et al., 2004; Repping et al., 2004). All other haplogroups (with the exception of B, which has only seven men in our data) contain men with CNVs, but the reference state is present in the majority (62%-93%) of men in each haplogroup. We then mapped the detected CNVs onto a modified version of the detailed phylogenetic tree of the 1000 Genomes Project Y chromosomes (Figure 2) built by Poznik et al. (2016). With this detailed tree, we calculated a lower bound for the amplicon CNV mutation rate of 3.83 × 10⁻⁴ mutations per father-to-son Y transmission. Given this high mutation rate, it is surprising that the reference amplicon state is so pervasive. If the amplicon CNVs were selectively neutral, we would expect to see a larger number of ancient mutations, which would cause most or all of a haplogroup to have a CNV. This led us to hypothesize that selection was acting to remove amplicon CNVs from the population.

FIGURE 9. AMPLICON CNVS ARE DISTRIBUTED THROUGHOUT THE Y CHROMOSOME PHYLOGENETIC TREE

A. Distribution of Y chromosome haplogroups in 1000 Genomes Project populations. Sri Lankan and Indian Telugu samples were collected from a population living in the United Kingdom; Gujarati Indian samples were collected from a population living in Houston, Texas.

B. Phylogenetic tree of major Y chromosome haplogroups represented in our dataset. Pie charts: proportions of men with different CNV classes in each haplogroup. Pie chart area is proportional to the number of men from that haplogroup in our dataset.

C. Copy number calls of two men from haplogroup A00.

HAPLOGROUP A00 HAS MAINTAINED THE REFERENCE COPY NUMBER STATE

The most ancient haplogroup known is A00, which diverged from all other Y chromosomes between 200 and 300 kya (Mendez et al., 2013). We determined amplicon copy number of two A00 men who were sequenced as part of a different study (Karmin et al., 2015). These two men represent an independent experiment of evolution over almost twice as much time as the other men we analyzed. We found that both A00 Y chromosomes have the reference copy number of each amplicon (Figure 9C), implying that the reference amplicon state is the ancestral state. Further, given the amplicon CNV mutation rate calculated above and the time since the divergence of A00 from the reference, it is extremely unlikely that the reference copy number would be maintained in the absence of selection (Figure 10).
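As a rough illustration of this argument, the following sketch computes the probability that a single paternal lineage accumulates no amplicon CNV over the time since the A00 divergence under neutrality without reversion. It is not the simulation summarized in Figure 10, which also models drift; the 30-year generation time and the function name are assumptions made only for this example.

```python
MU = 3.83e-4  # lower-bound CNV mutation rate per father-to-son transmission (from the text)

def prob_lineage_unmutated(years, generation_years=30.0, mu=MU):
    """Probability that one paternal lineage sees no CNV mutation in `years`.

    Assumes neutrality and no reversion; generation_years is an assumed value.
    """
    generations = years / generation_years
    return (1.0 - mu) ** generations

# Divergence window for A00 quoted in the text: 200 to 300 kya.
for years in (200_000, 300_000):
    print(years, prob_lineage_unmutated(years))
```

Under these assumptions the probability falls below ten percent for a single lineage, which is in line with the qualitative conclusion drawn from the simulations.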

THE DISTRIBUTION OF CNVS IS INCOMPATIBLE WITH A MODEL OF NEUTRAL EVOLUTION

We tested the empirical distribution of men with CNVs in the detailed phylogenetic tree against a model of neutral evolution. We did so by simulating neutral evolution of amplicon CNVs over the tree structure, using a range of mutation rates. For each simulation, we counted the number of mutation events that occurred during simulation and the total number of men with an amplicon CNV after the simulation was complete.
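A minimal sketch of this kind of simulation is given here, assuming a toy tree rather than the real Y chromosome phylogeny. The tree encoding, the function name, and the per-branch mutation probability of 1 − exp(−rate × length) are assumptions for illustration; the sketch implements only the simple model in which a mutated lineage cannot re-mutate or revert.

```python
import math
import random

# Hypothetical toy tree: each node is (branch_length, children); leaves have no children.
TOY_TREE = (0.0, [
    (5.0, [(1.0, []), (1.0, [])]),
    (2.0, [(4.0, []), (3.0, [(1.0, []), (1.0, [])])]),
])

def simulate(node, rate, rng, mutated=False):
    """Return (mutation_events, leaves_with_cnv, total_leaves) for the subtree at node.

    A branch still in the reference state mutates with probability
    1 - exp(-rate * length). Descendants of a mutated branch inherit the CNV.
    """
    length, children = node
    events = 0
    if not mutated and rng.random() < 1.0 - math.exp(-rate * length):
        mutated = True
        events = 1
    if not children:
        return events, int(mutated), 1
    cnv_leaves = 0
    total = 0
    for child in children:
        e, c, t = simulate(child, rate, rng, mutated)
        events += e
        cnv_leaves += c
        total += t
    return events, cnv_leaves, total

rng = random.Random(0)
for rate in (0.01, 0.1, 1.0):
    runs = [simulate(TOY_TREE, rate, rng) for _ in range(1000)]
    mean_events = sum(r[0] for r in runs) / len(runs)
    mean_cnv = sum(r[1] for r in runs) / len(runs)
    print(f"rate={rate}: mean events={mean_events:.2f}, mean leaves with CNV={mean_cnv:.2f}")
```

Each simulation yields one (mutation events, men with CNVs) pair, which is the quantity compared against the real data.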

At high mutation rates, there were many men with CNVs but few observed mutation events, as most mutation events occurred near the root of the tree, and mutated branches could not re-mutate in our model (Figure 11A). At intermediate mutation rates, we observed more mutation events, but fewer men with CNVs. Finally, at low mutation rates, there were few mutation events and few men with CNVs. However, in our real data, we observed many mutation events and a middling number of men with CNVs, a combination that is incompatible with the neutral model (Figure 11A). A more complex model that allowed for reversion to the ancestral state and subsequent re-mutation only matched our real data when the reversion rate was five to ten times higher than the mutation rate (Figure 12). Such an extreme discrepancy between mutation and reversion rates likely represents selection acting to remove CNVs from the population, rather than true chromosomal reversion.

FIGURE 10. SIMULATION OF A00 MUTATION AND DRIFT

Each cell is the average of 1,000 simulations. The lower bound of the true mutation rate, as calculated from the 1000 Genomes Project data, is 3.83 × 10⁻⁴ mutations per father-to-son Y transmission. CNVs are present in the large majority of the population in all simulations at or above the predicted mutation rate.

FIGURE 11. THE REFERENCE AMPLICON STATE HAS BEEN SELECTED FOR THROUGHOUT HUMAN EVOLUTION

A. Mutation events vs. number of men with CNVs. Each point represents one simulation over the phylogenetic tree of men in our dataset.

B. Distribution of mutation events over the phylogenetic tree. Blue curve: branches of the phylogenetic tree of men in our dataset, sorted by branch age. Red diagonal line: expected distribution if CNVs were selectively neutral. Gray lines: branches of the phylogenetic tree shuffled at random. 1,000 shuffles were performed. See Figure 13 for an in-depth description of this method.

CD. Amplicon rescue in C. gr/gr and D. b2/b3 deletions. Top: a chromosome with the deletion undergoes a blue-to-blue duplication that restores most of the amplicons to the reference copy number. The blue arc shows the targets of NAHR on a single copy of the AZFc region. Middle: men with the pre-rescue and post-rescue AZFc structures. Bottom: phylogenetic trees containing the rescued men. Men in red have the gr/gr or b2/b3 deletion; men in green have the respective rescue amplicon copy number states.

FIGURE 12. SIMULATIONS OF NEUTRAL EVOLUTION WITH REVERSION

Mutation events vs. number of men with mutants. Each point represents one simulation over the phylogenetic tree of men in our dataset.

MUTATION TIMES ARE SKEWED TOWARD THE RECENT PAST

A second phylogenetic analysis reinforced the idea that amplicon CNVs are not neutral. If amplicon variation were neutral, we would expect mutation events to be distributed evenly between ancient and recent branches of the tree. Instead, we found that mutation events are significantly skewed toward the recent branches of the tree (p = 1.01 × 10⁻⁷, KS test, Figure 11B, Figure 13): 72% of mutation events took place in the latter 50% of the tree. This skew fits with a history of mutation-selection balance, in which mutations occur at a high rate, but are constantly removed from the population due to selection. Recent mutations, which have not yet had enough time to be removed from the population, are therefore more common than ancient mutations.
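The comparison against a constant mutation rate can be sketched as a one-sample Kolmogorov-Smirnov test against a uniform distribution. In the sketch below, the edge tuples, the placement of each event at the midpoint of its edge, and the asymptotic p-value approximation are all assumptions made for illustration, not a description of the scripts used in this study.

```python
import math

def event_positions(edges):
    """Place mutation events on a 0-1 line of edges sorted from oldest to youngest.

    edges: list of (age, length, n_events); a larger age means an older edge.
    Assumption: each event sits at the midpoint of its edge.
    """
    ordered = sorted(edges, key=lambda e: e[0], reverse=True)
    total = sum(length for _, length, _ in ordered)
    positions, start = [], 0.0
    for _, length, n_events in ordered:
        positions.extend([(start + length / 2.0) / total] * n_events)
        start += length
    return positions

def ks_uniform(positions):
    """One-sample KS statistic D against Uniform(0, 1)."""
    xs = sorted(positions)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        d = max(d, (i + 1) / n - x, x - i / n)
    return d

def ks_pvalue(d, n, terms=100):
    """Asymptotic Kolmogorov p-value with a small-sample correction to lambda."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return 1.0  # the series converges slowly here; the p-value is essentially 1
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))

# Hypothetical toy input: (age, length in SNPs, number of mutation events).
toy_edges = [(8, 40.0, 0), (6, 30.0, 0), (3, 20.0, 2), (1, 10.0, 3)]
pos = event_positions(toy_edges)
d = ks_uniform(pos)
print(d, ks_pvalue(d, len(pos)))
```

Events concentrated on young edges give positions near 1 and therefore a large D, which is the pattern of skew reported here.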

An alternative interpretation of this pattern of amplicon variation is a recent change in the amplicon mutation rate, possibly due to environmental factors. However, such a change would have had to occur independently in each haplogroup, many of which are geographically isolated. Our observations are also not explained by bursts of Y chromosome population expansion described in Poznik et al. (2016). The detailed phylogenetic tree of Y chromosomes used in our analyses is built from SNPs, and its branch lengths are dependent solely on SNPs; therefore, any historical dynamics that affect the tree's structure and branch length are accounted for when we compare the amplicon CNVs to the tree. For these reasons, selection against amplicon CNVs is the most plausible interpretation of our findings.

The above analyses are particularly sensitive to false positive CNV calls, as false calls would appear as recent, isolated mutations. Men that differ from the reference state by a single copy of a single amplicon are the most likely to be false positives. Therefore, we generated a high-confidence CNV callset by excluding 22 such men that we could not validate with partial-amplicon CNV detection or FISH. We then performed the above two analyses using this high-confidence callset; in both cases, the results change only minimally (Figure S3).

AMPLICON DELETIONS ARE RESCUED BY SUBSEQUENT DUPLICATION

In addition to selection acting to eliminate CNVs, variant Y chromosomes can undergo subsequent mutations that restore most or all amplicons to the reference copy number. We used our phylogenetic tree to directly observe this process, which we call amplicon rescue. We observed no cases of rescued duplications, which, in the absence of phylogenetic evidence, would be indistinguishable from other men with the reference copy number. We did, however, observe several cases of rescued deletions (Figure 11CD, Figure 14). Amplicon rescue occurred at a significantly higher rate than analogous duplications on the reference background: we observed 9 blue-to-blue duplications on a deletion background (75 men with gr/gr or b2/b3 deletions) vs. 6 blue-to-blue duplications on a reference background (1010 reference state men; p = 2.00 × 10⁻⁷, Fisher exact test). Since both events are mechanistically analogous, this difference likely represents selection favoring rescued Y chromosomes over chromosomes with deletions, rather than a difference in the rate of incidence of such mutation events.
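For readers who want to see how such a comparison works, here is a small, self-contained two-sided Fisher exact test. The layout of the 2×2 table is an assumption: we treat 75 and 1010 as row totals that include the men carrying duplications, which may differ from how the counts were tabulated in the original analysis, so the resulting p-value need not equal the one reported above.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables that are no more
    likely than the observed one, with the margins held fixed.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    total = sum(p for p in (prob(k) for k in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Assumed table: rows = deletion vs. reference background,
# columns = blue-to-blue duplication observed vs. not observed.
p = fisher_exact_two_sided(9, 75 - 9, 6, 1010 - 6)
print(p)
```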

______


FIGURE 13. METHODOLOGY FOR CALCULATING CNV DISTRIBUTION OVER THE PHYLOGENETIC TREE

A. Sample phylogenetic tree. Red leaves: individuals with a CNV.

B. Step 1: Find the edges of the tree in which mutation events occurred by maximum parsimony, shown in red.

C. Step 2: Annotate edges by age. Edge 1 is the oldest branch. Edge 8 is the youngest branch.

D. Step 3: Arrange the edges in a single line and sort edges by age. After sorting, edges closer to the root of the tree will be further to the left, and edges closer to the leaves of the tree will be further to the right. The length of this line is the sum of the edge lengths, which is equal to the total evolutionary time traversed by the tree. Evolutionary time is measured in SNPs, as the phylogenetic tree used here is built using single-nucleotide changes as a molecular clock.

E. Step 4: Plot the cumulative fraction of CNVs observed from the beginning of the line to the end of the line. In this case, there are two edges with mutation events, so 50% of events are observed at branch 2, and 100% of events are observed at branch 7.

F. Step 5: Using the Kolmogorov-Smirnov test, compare the distribution of CNVs to the null distribution (dotted gray line), which represents a constant rate of mutation over time.

G. Distribution of real mutation events over phylogenetic tree. Blue curve: branches of the phylogenetic tree sorted by branch age. Red diagonal line: expected distribution if CNVs were selectively neutral. p = 1.01 × 10⁻⁷, KS test.

H. Distribution of shuffled real mutation events over phylogenetic tree. Gray lines: branches of the phylogenetic tree shuffled at random. 1,000 shuffles were performed. Red diagonal line: expected distribution. Minimum p-value of shuffles = 1.46 × 10⁻³, KS test.

I. Distribution of simulated mutation events over phylogenetic tree. Gray lines: branches of the phylogenetic tree with simulated mutations sorted by branch age. 1,000 simulations were performed. Red diagonal line: expected distribution. Minimum p-value of simulations = 4.99 × 10⁻⁴, KS test.

FIGURE 14. MECHANISM OF AMPLICON RESCUE

AB. 1. Architecture of the AZFc region with the A. gr/gr deletion and B. b2/b3 deletion. The blue arc shows the targets of NAHR on a single copy of the AZFc region. 2. Crossing over occurs between two sister chromatids of the Y chromosome, causing a duplication. 3. The resulting architecture after NAHR.

DISCUSSION

Natural selection has played a foundational role in shaping autosomal and X-linked copy number variation (Sudmant et al., 2015). However, it has been stated that selection is ineffective on the mammalian Y chromosome, because the Y chromosome does not undergo recombination with a homologous chromosome (Charlesworth and Charlesworth, 2000). (This lack of recombination actually makes our study possible, enabling a deep reconstruction of the Y chromosome’s history that is impossible for any autosome or X chromosome.) Demographic studies emphasize the role of neutral processes rather than selection to explain the history of the Y chromosome (Poznik et al., 2016). Some have even speculated that, in the absence of effective selection, the mammalian Y chromosome will eventually decay and be lost (Graves, 2006). On the contrary, we have shown that Y chromosome amplicons have been subject to purifying selection to maintain the ancestral copy number for more than 200,000 years of human history. Previous studies had speculated about the effects of selection on amplicons, without providing direct evidence of ampliconic selection (Repping et al., 2006; Wilson Sayres et al., 2014). Selection also acts to maintain the non-ampliconic regions of the Y chromosome (Rozen et al., 2009; Bellott et al., 2014; Wilson Sayres et al., 2014). In conjunction, these results demonstrate that the human Y chromosome’s current form is the result of the complex interplay between selective forces acting on its varied sequence classes.

Most of the early research into Y chromosome structural variation arose from the study of azoospermic men, leading to the discovery of ampliconic deletions that cause spermatogenic failure (Kuroda-Kawaguchi et al., 2001; Repping et al., 2002; Lynch et al., 2005; Nathanson et al., 2005; Rozen et al., 2012). This initial focus on variants that affect spermatogenesis was compounded by the fact that infertility clinics were a major resource for such research, resulting in significant ascertainment bias in the set of variants that became well-studied and well-known. Additionally, almost all of these well-known amplicon CNVs are deletions, the legacy of the early years of Y chromosome research in which the primary method of detecting CNVs was the use of sequence-tagged sites, which can detect deletions but not duplications or inversions (Vollrath et al., 1992).

The true breadth of amplicon copy number variation has been revealed by recent surveys (Repping et al., 2006; Espinosa et al., 2015; Johansson et al., 2015; Wei et al., 2015). In accordance with these studies, we found that most amplicon CNVs in the general population do not fall within the small set of CNVs with confirmed phenotypes, and that duplications are more common than deletions. Our results suggest that most or all amplicon CNVs have phenotypic effects that cause selection to remove them from the population. The obvious candidate for this phenotype is a negative effect on spermatogenesis, as ampliconic genes are expressed exclusively in the testis (Lahn and Page, 1997), and recent evidence suggests that both deletions and duplications in the AZFc region can increase the risk of spermatogenic failure in certain populations (Yang et al., 2015). However, the mechanism through which any of these mutations affect spermatogenesis is still a mystery. Because the large CNVs within the AZFc region that cause spermatogenic failure delete multiple genes at once, the gene or genes responsible for the resulting phenotype cannot be determined. Further, studies in model systems are thwarted because most human ampliconic genes are not present on, for example, the mouse Y chromosome (Soh et al., 2014). For these reasons, the functions of individual human ampliconic genes are unknown, with the only information being that they are expressed in the testis and at least one is crucial for spermatogenesis. It is even possible that noncoding elements within the amplicons play a major phenotypic role. Further complicating this question is the fact that different CNVs change the copy number of different genes, which may cause different molecular phenotypes altogether, but our dataset is not large enough to characterize even the magnitude of the phenotypic effects of different CNVs.

Due to the tremendous differences in amplicon structure and content between species, we might expect amplicon structure within species to also be highly variable. Such diversity would not be entirely unexpected, since male reproductive genes evolve rapidly (Wyckoff et al., 2000). Instead, although the amplicons are quite mutable, the ancestral amplicon copy number state has nevertheless been maintained for 200,000 years of human history. We propose two explanations for this observation, which cannot be distinguished using our results.

First, the differences observed between species could be driven by changes in environmental and reproductive pressures that alter the optimal copy number state of the amplicons. For example, differences in primate mating systems affect the role of sperm competition in reproduction; different amplicon states may be optimal for environments with differing levels of sperm competition (Dixson, 1998; Hughes et al., 2005). Alternatively, genetic changes, such as the Y chromosome’s acquisition of the DAZ gene in primates, could change the optimal amplicon state (Saxena et al., 1996). In either case, the new optimal state would quickly spread throughout the population; because many variant amplicon states are always present within the population, it is likely that some Y chromosomes are already close to the new optimum. This paradigm is analogous to standing variation within bacterial populations. Further, because the amplicons are so mutable, this race to a new optimum would shuffle the amplicon architecture, leaving the new structure unrecognizable compared to the previous one. This model predicts that the amplicons are in a state of punctuated equilibrium, in which long periods of stability are interrupted by sudden and drastic rearrangements.

Second, the differences observed between species could be driven by rare amplicon CNVs with large positive effects. The mutation-selection balance we have observed would represent change in the short term, but long-term amplicon change would instead be driven by these rare mutations. In contrast to our first model, this model predicts steady change in the dominant amplicon copy number state: when observing amplicon evolution over a timescale between the 200,000 years of human history studied here and the 8 million years since the human-chimpanzee split, we would see a progression of intermediate amplicon structures.

Our results also provide insight into the biological role of ampliconic sequence. The ubiquity of ampliconic sequence on mammalian Y chromosomes suggests that amplification itself confers a functional benefit. Theories about this benefit include 1. gene conversion between amplicon copies allows for the rescue of deleterious mutations, 2. multiple copies provide the proper dosage of ampliconic genes, and 3. palindromes allow ampliconic genes to escape sex chromosome inactivation during meiosis by pairing with themselves (Rozen et al., 2003; Warburton et al., 2004; Connallon and Clark, 2010; Hallast et al., 2013; Soh et al., 2014). Our finding that fitness is negatively affected by both duplications and deletions of amplicons supports the gene dosage theory, as extra amplicon copies should have no deleterious effect on either gene conversion or escape from inactivation. However, these theories are not mutually exclusive; it is even possible that the initial driver of amplicon formation was gene conversion or escape from inactivation, and only afterwards was gene expression tuned to the number of amplicon copies.

This study provides a foundation for deeper investigation of the evolutionary questions presented here, as sequencing technologies grow increasingly powerful and large datasets—some orders of magnitude larger than the one we analyzed—become available (Gudbjartsson et al., 2015; Nagasaki et al., 2015; Telenti et al., 2016). For example, these datasets might contain the rare beneficial amplicon CNVs predicted by our second model, which, in a study of our size, would either be absent or occur at such a low frequency as to be indistinguishable from the other, deleterious CNVs. These datasets can also be used to tease out the magnitude of each amplicon’s contribution to reproductive fitness.

Additionally, data from other species can determine if the maintenance of an ancestral amplicon copy number state is common or restricted to humans. Of course, amplicons are subject to a variety of evolutionary forces that differ between species. For example, the mouse Y chromosome underwent runaway expansion as part of a genetic arms race with the X chromosome (Soh et al., 2014). Even if some selective pressures favored the ancestral amplicon state in mouse, the opposing pressure to amplify could have overridden them. Studying other species, particularly those with population divergence times greater than the 200,000 years observed in humans, can also help choose between our two proposed models of amplicon evolution. A promising possibility is the chimpanzee: its Y chromosome is highly ampliconic and has high-quality reference sequence, the most recent common ancestor of the chimpanzee Y chromosomes is over one million years old, and chimpanzee genomes are beginning to be sequenced in high numbers (Prado-Martinez et al., 2013; Hallast et al., 2016). Future studies, using data from both human and non-human species, will continue to shed light on the evolutionary history of the Y chromosome amplicons and their roles in fitness and reproduction.

______

MATERIAL AND METHODS

ANNOTATION OF Y CHROMOSOME AMPLICONS

When the human Y chromosome was sequenced, the methods used to annotate the amplicons were sufficient to describe their overall structure (Kuroda-Kawaguchi et al., 2001; Skaletsky et al., 2003). However, we wanted a more precise description of amplicon coordinates for this study. Therefore, we re-annotated the Y chromosome amplicons as follows. We divided the reference Y chromosome sequence into overlapping 100-base-pair windows and aligned these windows to the entire reference genome (hg38) using bowtie2 (Langmead and Salzberg, 2012), with settings to return up to 10 alignments with alignment score -11 or greater. (These alignments were also used for masking repetitive sequence; see below.) We then created a bedgraph file of the Y chromosome, in which the value for each base equals the number of times the window beginning at that base aligned to the genome, and visualized the file using the UCSC Genome Browser (Kent et al., 2002; Kent et al., 2010). Then, using the previous amplicon annotations as a guide, we inspected the regions of the Y chromosome where windows aligned to more than one location on the Y chromosome. We determined the precise start and end coordinates of previously described amplicons, gaps in amplicon copies that have low identity to corresponding copies, and several short and previously undescribed amplicons. Table S1 is a list of coordinates for the amplicons studied in this paper. Table S2 is a complete list of amplicons that we annotated with this method.

1000 GENOMES PROJECT DATA

We analyzed whole-genome sequencing data of 1225 men from the 1000 Genomes Project (1000 Genomes Project Consortium et al., 2015). We downloaded FASTQ files from the Project’s FTP site (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/). To ensure the proper alignment of reads, we used only files with paired-end reads and read length ≥ 50. We then aligned the FASTQ files to the latest build of the human reference genome (hg38) using bowtie2 (Langmead and Salzberg, 2012).

GC BIAS CORRECTION AND REPEAT MASKING

The GC content of DNA affects read depth in high-throughput sequencing (Dohm et al., 2008). To correct for this effect, we used custom Python scripts to build a GC bias curve for each sequencing library and correct sequencing depth based on those curves. See Supplemental Material and Methods for a full description of our method.
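To give a sense of what a GC bias correction can look like, the sketch below bins windows by GC fraction, estimates the mean depth in each bin, and rescales each window so that every bin has the overall mean depth. This is a generic binning approach written for illustration only; the function names and the choice of 20 bins are assumptions, and the procedure actually used is the one described in the Supplemental Material and Methods.

```python
from collections import defaultdict

def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return sum(1 for base in seq if base in "GC") / len(seq) if seq else 0.0

def correct_depth(windows, n_bins=20):
    """Rescale window depths so every GC bin has the overall mean depth.

    windows: list of (gc_fraction, depth) pairs for one sequencing library.
    Returns corrected depths in the same order as the input.
    """
    def bin_of(gc):
        return min(int(gc * n_bins), n_bins - 1)

    overall = sum(depth for _, depth in windows) / len(windows)
    by_bin = defaultdict(list)
    for gc, depth in windows:
        by_bin[bin_of(gc)].append(depth)
    # The per-bin mean depths play the role of the GC bias curve.
    bin_mean = {b: sum(ds) / len(ds) for b, ds in by_bin.items()}
    corrected = []
    for gc, depth in windows:
        m = bin_mean[bin_of(gc)]
        corrected.append(depth * overall / m if m > 0 else depth)
    return corrected

# Hypothetical toy library in which GC-rich windows are over-represented.
toy = [(0.3, 10.0), (0.3, 10.0), (0.6, 20.0), (0.6, 20.0)]
print(correct_depth(toy))
```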

We also masked repetitive sequence from our subsequent analyses. From our re-annotation of the amplicons, we had data on the number of times each 100-bp window of the Y chromosome aligned elsewhere in the genome. We masked all windows that aligned more than the expected number of times. For example, in the green amplicon, which has three copies, we masked all windows that aligned four or more times. Because many genomic repetitive elements are divergent enough that sequencing reads align to them uniquely, this method is significantly less stringent than masking by RepeatMasker (Smit et al., 2013-2015), which was used as part of the GC correction pipeline. As a result, we masked a smaller percentage of the amplicon sequence (18% vs. 50%), which increases the accuracy of our subsequent analyses.
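The masking rule can be expressed in a few lines. In this sketch the function names and the toy alignment counts are hypothetical; only the rule itself (mask a window when it aligns more often than the amplicon's reference copy number) comes from the text.

```python
def mask_windows(alignment_counts, expected_copies):
    """Return True for each window that aligned more often than expected."""
    return [count > expected_copies for count in alignment_counts]

def masked_fraction(mask):
    """Fraction of windows that are masked."""
    return sum(mask) / len(mask) if mask else 0.0

# Toy counts for windows in an amplicon with three reference copies.
mask = mask_windows([3, 3, 4, 10, 3], expected_copies=3)
print(mask, masked_fraction(mask))
```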

WHOLE AMPLICON CNV DETECTION

To detect CNVs that delete or duplicate whole copies of amplicons, we calculated the mean depth of 15 amplicons by dividing the total depth in all copies of that amplicon by the combined size of the copies. We also calculated the mean depth of four single-copy regions on the Y chromosome to act as negative controls in the filtering steps below. We then normalized by the mean depth of a 1-Mb single-copy region of the Y chromosome. After normalization, the expected depth of an amplicon with the reference copy number is 1. We called copy number of amplicons based on the total amplicon depth. The thresholds for calling an amplicon CNV were the midpoints between the expected depths for each copy number of an amplicon. For example, the blue amplicon has four copies in the reference genome. In a man with four copies, the expected depth is 1; in a man with three copies, the expected depth is 0.75; and in a man with two copies, the expected depth is 0.5. Therefore, a man will be called as having three copies if depth is between 0.625 and 0.875.
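The midpoint thresholds described above are equivalent to rounding the normalized depth, scaled by the reference copy number, to the nearest integer. A minimal sketch follows; the function names and the toy depth values are assumptions for illustration.

```python
import math

def normalized_amplicon_depth(copy_depth_sums, copy_sizes, control_mean_depth):
    """Mean depth over all copies of an amplicon, divided by the control depth."""
    mean_depth = sum(copy_depth_sums) / sum(copy_sizes)
    return mean_depth / control_mean_depth

def call_copy_number(normalized_depth, reference_copies):
    """Call copy number using thresholds at midpoints between expected depths.

    The expected depth for k copies is k / reference_copies, so the midpoint
    thresholds fall at (k + 0.5) / reference_copies.
    """
    return max(0, math.floor(normalized_depth * reference_copies + 0.5))

# Blue amplicon example from the text: four reference copies.
for depth in (1.0, 0.75, 0.7, 0.5):
    print(depth, call_copy_number(depth, 4))
```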

PARTIAL AMPLICON CNV DETECTION

To detect CNVs that delete or duplicate only part of an amplicon, we used a modified version of the binary segmentation algorithm. Our input was non-overlapping 100-bp windows of depth for each amplicon copy. We removed windows with ≥ 25% masked bases. We calculated change points in the mean as described in Sen and Srivastava (1975), except we used the Mann-Whitney U statistic rather than the t-statistic. We did so because the t-test assumes normality, and sequencing depth is not distributed normally (Sampson et al., 2011). Because of this modification, we cannot use previous methods to determine statistical significance. Instead, for each amplicon, we manually assessed the men with the 64 highest U-statistic values. (Each copy of an amplicon has its own change point; we defined the U-statistic value of an amplicon as the lowest of its copies’ values.) A partial CNV was called if the following three criteria were met: 1. the change points in each amplicon copy were close to each other, 2. the direction of the shift in mean across the change point was the same in each amplicon copy, and 3. in each amplicon copy, the predicted copy number based on depth was different on each side of the change point. Several exceptions were made to criterion 1 in cases where an entire duplication or deletion was contained in a single amplicon. In such cases, two change points are present in the single amplicon, and the maximally significant change point differed in different amplicon copies. When such cases were obvious upon visual inspection, we called them as partial CNVs. We also removed four men from the dataset that had extremely noisy depth and an abnormally high number of change points.
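The core step, finding the single split that best separates two depth levels using a rank-based statistic, might be sketched as follows. The standardization of U, the minimum segment size, and the function names are assumptions for this example; the sketch performs one split only and does not reproduce the manual review described above.

```python
import math

def average_ranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def best_change_point(depths, min_size=5):
    """Return (split, z) for the split maximizing the standardized |U - mean(U)|.

    depths[:split] and depths[split:] are the two segments. The normal
    approximation to U (without tie correction) is used only to compare splits.
    """
    n = len(depths)
    ranks = average_ranks(depths)
    best_split, best_z = None, 0.0
    rank_sum = 0.0
    for split in range(1, n):
        rank_sum += ranks[split - 1]  # rank sum of the left segment
        m, k = split, n - split
        if m < min_size or k < min_size:
            continue
        u = rank_sum - m * (m + 1) / 2.0
        z = abs(u - m * k / 2.0) / math.sqrt(m * k * (n + 1) / 12.0)
        if z > best_z:
            best_split, best_z = split, z
    return best_split, best_z

# Hypothetical window depths with a shift in level after the 20th window.
toy = [10 + 0.01 * i for i in range(20)] + [20 + 0.01 * i for i in range(20)]
print(best_change_point(toy))
```

Full binary segmentation would apply the same step recursively to each resulting segment.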

It should be noted that this method is less powerful than the more recent circular binary segmentation algorithms that are commonly used to detect genomic CNVs (Olshen et al., 2004). Our choice of a less powerful algorithm was intentional, as we only wish to detect the largest and most obvious partial amplicon CNVs. Smaller CNVs are probably under different evolutionary pressures than whole-amplicon and large partial CNVs, and therefore are less relevant to our analysis here.

Y CHROMOSOME SPECIFICITY OF AMPLICONIC SEQUENCE

We confirmed that no reads from elsewhere in the genome aligned to the amplicons or control regions. We measured amplicon depth as described above from whole genome sequencing data of 15 women and five men from the 1000 Genomes Project, using a 5-Mb region of chromosome 2 (chr2: 80,000,000-85,000,000) as the normalization region. Single-copy sequence on the Y is unfit for this purpose, since its expected depth in women is 0.

FILTERING OF COPY NUMBER CALLS

We performed two filtering steps. First, for each amplicon, we calculated the median depth of the 100-bp windows used in partial amplicon CNV detection. In 40 men, the predicted copy number (using thresholds as described above) was different using this median value and the mean depth. In 38 of these men, mean and median copy number calls differed in a single amplicon, and the copy number state using the mean values was either the reference state or a common and predicted CNV state. Two men that did not fit these criteria were removed from the dataset. Second, we calculated the mean and standard deviation of the depth of each control region, excluding four men with large short-arm deletions that remove control regions 2 and 3. There were 28 men with two or more control regions more than two standard deviations away from their means. 27 of these men had either the reference copy number or a common and predicted CNV state. The one man that had neither was removed from the dataset.

We estimated call accuracy rate from men with two libraries as follows. Assume a library will contain erroneous CNV calls with a fixed probability x. The probability of both libraries having correct CNV calls is (1 − x)². We found that 89/92 (96.7%) of men with two libraries had concordant amplicon calls. Further assuming that the chance of both libraries having erroneous calls that are concordant with each other is negligible, we can set (1 − x)² = 89/92 and solve for x = 1.6%. We removed two of the men with discordant copy number calls from the dataset. The third man had concordant CNV calls of other amplicons that matched a common predicted CNV, and we adjusted the discordant call to conform to that state.
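The arithmetic behind this estimate is short enough to check directly. The function name below is hypothetical; the calculation simply solves (1 − x)² = 89/92 for x.

```python
import math

def per_library_error_rate(concordant, total):
    """Solve (1 - x)**2 = concordant / total for the per-library error rate x."""
    return 1.0 - math.sqrt(concordant / total)

x = per_library_error_rate(89, 92)
print(f"{x:.1%}")  # prints 1.6%
```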

MULTI-COLOR FISH

Cell lines of 12 men from the 1000 Genomes Project (HG00142, HG00271, HG01187, HG01890, HG02394, HG02982, HG03445, NA12812, NA18504, NA18960, NA18983, and NA20520) were obtained from the NHGRI Sample Repository for Human Genetic Research at the Coriell Institute for Medical Research (https://www.coriell.org/1/NHGRI). Two-color interphase FISH was performed as described (Saxena et al., 2000). We scored at least 200 cells for each set of probes in each cell line. Images were recolored to match the color of amplicon names.

AMPLICON ARCHITECTURE PREDICTION

We simulated all AZFc architectures formed by one, two, or three NAHR events between amplicon copies, as previously described (Figure S1; Repping et al., 2006). Men with amplicon CNVs were matched to architectures with the same copy number of each amplicon. When multiple architectures matched, we chose the architecture(s) formed by the fewest NAHR events.
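The matching step can be illustrated with a small lookup. The candidate architectures, their names, and their copy numbers below are invented toy values, not simulated AZFc architectures; only the selection rule (exact copy number match, then fewest NAHR events) follows the text.

```python
def match_architectures(observed, candidates):
    """Return names of matching architectures formed by the fewest NAHR events.

    observed: dict mapping amplicon name to called copy number.
    candidates: list of (name, n_nahr_events, copy_numbers) tuples, where
    copy_numbers is a dict in the same format as `observed`.
    """
    matches = [(events, name) for name, events, copies in candidates
               if copies == observed]
    if not matches:
        return []
    fewest = min(events for events, _ in matches)
    return sorted(name for events, name in matches if events == fewest)

# Hypothetical candidates with toy copy numbers.
toy_candidates = [
    ("arch_A", 1, {"blue": 2, "green": 2, "red": 2}),
    ("arch_B", 2, {"blue": 2, "green": 2, "red": 2}),
    ("arch_C", 1, {"blue": 6, "green": 4, "red": 6}),
]
print(match_architectures({"blue": 2, "green": 2, "red": 2}, toy_candidates))
```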

HAPLOGROUPS

The haplogroups of 1210 men in the 1000 Genomes Project are already annotated. For the remaining 15, we determined haplogroups using Ytree (https://bitbucket.org/reneeg36/ytree/overview). The phylogenetic tree in Figure 9B was built using estimates of divergence time from Poznik et al. (2016).

MODIFICATION OF DETAILED PHYLOGENETIC TREE

In many of our subsequent analyses, we used the detailed phylogenetic tree of the 1000 Genomes Project Y chromosomes (Figure 2) built by Poznik et al. (2016). We modified the original tree from that study in two ways. First, we manually identified instances where the tree architecture was inconclusive because no SNPs differed between three or more branches, but two or more of those branches contained the same CNV. We corrected the tree in such instances where its original, arbitrarily determined architecture contradicted the architecture implied by the CNVs.

Second, due to low sequencing coverage in individual men, SNPs may be missing from branches near the tips of the tree, leading to those branches being depicted as shorter than they actually are. This would, in turn, cause CNVs to appear to cluster in the more recent past, even if they were in fact distributed evenly over time. Because such clustering is a key result of this study, it is essential to correct for this effect. First, any branch with length 0 was changed to have length 0.5. Then, we adjusted each branch as follows. As described in depth in Poznik et al. (2016), we assume that (1) a SNP is detected if at least two reads covering the site of the SNP are observed, and (2) the number of reads at a given site can be described by a Poisson distribution with mean equal to the overall sequencing coverage of the Y chromosome. Under these assumptions, we expect the length of a branch to be reduced by a fraction $p_0 + p_1$ of its true length, where $p_i$ = the probability of observing exactly $i$ sequencing reads at a given site. Therefore, we divide each observed branch length $x$ (measured in SNPs) by $1 - (p_0 + p_1)$ to correct for this reduction, using the combined sequencing depth of each individual descended from that branch to calculate $p_0$ and $p_1$. This method is imperfect: as discussed in Poznik et al. (2016), it is intractable to completely model and correct for the effect of missing SNPs. However, our method of correction extends the lengths of the terminal branches of the tree so that each is at least as long as its expected true length. Therefore, our correction is, at worst, incorrectly extending the terminal branches of the tree at the expense of the more ancient branches, so we can be confident that the clustering of CNVs in the more recent branches of the tree is not an artifact caused by this effect.
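The branch-length correction can be written out as a short function. This is a minimal sketch assuming the Poisson read-depth model stated above; the function names and the example coverage value are hypothetical and chosen only for illustration.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of observing exactly k reads under a Poisson model."""
    return mean ** k * math.exp(-mean) / math.factorial(k)


def corrected_branch_length(observed_snps: float, coverage: float) -> float:
    """Divide the observed length by 1 - (p0 + p1), the chance a SNP is detectable.

    coverage is the combined sequencing depth of the men descended from the branch.
    Branches of length 0 are first set to 0.5, as in the text.
    """
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    if observed_snps == 0:
        observed_snps = 0.5
    p_missing = poisson_pmf(0, coverage) + poisson_pmf(1, coverage)
    return observed_snps / (1.0 - p_missing)


# Example with a hypothetical combined coverage of 4x.
print(round(corrected_branch_length(10, 4.0), 3))
```

At high combined coverage the correction factor approaches 1, so deep, ancient branches are left nearly unchanged while low-coverage terminal branches are lengthened.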

CALCULATION OF AMPLICON MUTATION RATE

To calculate a lower bound on the amplicon CNV mutation rate, we divided the number of mutation events in the detailed phylogenetic tree by the total evolutionary time traversed by the tree. The number of mutation events, as determined above by Fitch’s algorithm, was 139. The total branch length of the tree after correction for missing SNPs as described above was 69,029 SNPs. We converted SNPs to generations as described below to obtain a total branch length of 363,369 generations. These values yielded a rate of $3.83 \times 10^{-4}$ mutations per father-to-son Y transmission. We expect that the true mutation rate is higher, because selection depresses the number of mutations observed by removing Y chromosomes with mutations from the population.
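As a check on the arithmetic, the lower bound follows directly from the two totals reported above. The function name below is hypothetical.

```python
def lower_bound_mutation_rate(events: int, generations: float) -> float:
    """Mutation events divided by total evolutionary time in generations."""
    return events / generations


rate = lower_bound_mutation_rate(139, 363_369)
print(f"{rate:.2e} mutations per father-to-son transmission")  # 3.83e-04
```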

SIMULATION OF NEUTRAL EVOLUTION

We simulated neutral evolution using the detailed phylogenetic tree. In our simulations, nodes of the tree can be in one of two states: reference or mutated. Each simulation began at the root of the tree and traveled along each branch to the leaves. Every generation, there was a fixed probability of mutation from the reference state to a mutated state. We also used a model in which, every generation, there was a fixed probability of reversion from the mutated state to the reference state. Generations are measured by converting branch lengths of the tree, measured in number of SNPs, to years, as described in Poznik et al. (2016). Briefly, the Y chromosome mutation rate is estimated as $0.76 \times 10^{-9}$ SNP mutations per bp per year as calculated by Fu et al. (2014). The total number of bases analyzed to build the tree is approximately 10,000,000. Therefore, $(0.76 \times 10^{-9}\ \text{SNP mutations per bp per year} \times 10^{7}\ \text{bp})^{-1} = 131.6\ \text{years per SNP mutation}$. We then converted years to generations, assuming a generation time of 25 years. Because a branch length of 1 SNP corresponds to 5.26 generations, we simulated fractional generations at the end of each branch.
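The unit conversion and the per-branch simulation can be sketched as below. This is not the thesis code: the function and constant names are hypothetical, and the handling of the final fractional generation (scaling the per-generation probability by the fraction) is an assumption, since the text does not specify how fractional generations were simulated.

```python
import random

SNP_RATE_PER_BP_PER_YEAR = 0.76e-9   # Fu et al. (2014), as quoted in the text
BASES_ANALYZED = 1e7                 # approximate number of bases used to build the tree
GENERATION_TIME_YEARS = 25

YEARS_PER_SNP = 1.0 / (SNP_RATE_PER_BP_PER_YEAR * BASES_ANALYZED)
GENERATIONS_PER_SNP = YEARS_PER_SNP / GENERATION_TIME_YEARS


def simulate_branch(state: str, length_snps: float, mutation_rate: float,
                    reversion_rate: float, rng: random.Random) -> str:
    """Evolve a two-state character ('reference' or 'mutated') along one branch."""
    generations = length_snps * GENERATIONS_PER_SNP
    whole = int(generations)
    # Whole generations, then one partial generation (assumed: scaled probability).
    steps = [1.0] * whole + [generations - whole]
    for fraction in steps:
        rate = mutation_rate if state == "reference" else reversion_rate
        if rng.random() < rate * fraction:
            state = "mutated" if state == "reference" else "reference"
    return state


print(round(YEARS_PER_SNP, 1), round(GENERATIONS_PER_SNP, 2))
print(simulate_branch("reference", 10, 1e-3, 0.0, random.Random(0)))
```

Setting `reversion_rate` to 0 gives the mutation-only model; a nonzero value gives the model with reversion.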

We performed 10,000 simulations for each of 24 different mutation rates and 24 reversion rates, ranging from 0.5 to $1 \times 10^{-6}$ mutations per generation. We implemented Fitch’s algorithm to count the number of mutation events that occurred (Fitch, 1971). When calculating mutation events in the real tree, we did not distinguish between different types of CNV events, as our model has only two states: reference and mutated. Therefore, the number of real mutation events counted here is lower than the number used above when calculating the amplicon mutation rate. Simulations were run using custom Python scripts and the ete3 Python module.
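For readers unfamiliar with Fitch's algorithm, the following sketch counts the minimum number of state changes on a small binary tree. It is a self-contained illustration that represents trees as nested tuples, which is an assumption made here for brevity; the thesis scripts used the ete3 module instead.

```python
def fitch_count(tree) -> int:
    """Minimum number of state changes on a binary tree (Fitch, 1971).

    A leaf is a state string such as 'ref' or 'mut'; an internal node is a
    tuple of two subtrees.
    """
    changes = 0

    def state_set(node):
        nonlocal changes
        if isinstance(node, str):
            return {node}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:
            return common
        # Children disagree: one change is required somewhere below this node.
        changes += 1
        return left | right

    state_set(tree)
    return changes


toy_tree = (("ref", "mut"), ("ref", "ref"))
print(fitch_count(toy_tree))  # 1
```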


CALCULATING CNV DISTRIBUTION OVER THE PHYLOGENETIC TREE

See Figure 13. We defined branch age as the mean distance from the child node of the branch to leaves that descend from that node, plus half the length of the branch. See Supplemental Material and Methods for a further discussion of this test.
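The branch-age definition can be made concrete with a small example. The `Node` class below is a hypothetical stand-in for a tree node (the thesis used ete3 trees); each node stores the length of the branch leading to it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    length: float                               # length of the branch above this node
    children: List["Node"] = field(default_factory=list)


def leaf_distances(node: Node) -> List[float]:
    """Distances from this node to every leaf that descends from it."""
    if not node.children:
        return [0.0]
    return [child.length + d
            for child in node.children
            for d in leaf_distances(child)]


def branch_age(node: Node) -> float:
    """Mean distance from the branch's child node to its leaves, plus half the branch."""
    distances = leaf_distances(node)
    return sum(distances) / len(distances) + node.length / 2


internal = Node(4.0, [Node(2.0), Node(6.0)])
print(branch_age(internal))  # 6.0
```

For a terminal branch the mean leaf distance is zero, so its age is simply half its own length.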

HAPLOGROUP A00 MEN

We analyzed two haplogroup A00 men who were sequenced by Karmin et al. (2015). We downloaded Y chromosome BAM files of these men from the Estonian Biocentre data repository (www.ebc.ee/free_data/chrY) and converted the BAM files to FASTQ files using bedtools (Quinlan and Hall, 2010). We then processed these files using the same pipeline as the 1000 Genomes Project samples, with one exception: we did not perform GC bias correction in these samples, because autosomal data are necessary for GC correction and only Y chromosome data were available for these men.

We simulated A00 evolution using a model of haploid genetic drift. Men can be in one of two states, reference or mutated. We began with $N$ men, all in the reference state. Each generation, we drew a number $x$ from a binomial distribution $B(N, p + m)$, where $p$ = the fraction of men in the previous generation with the mutated state and $m$ = the mutation rate per generation per individual. We then set the number of men with the mutated state in the next generation to $x$. We simulated 10,000 generations, corresponding to 250,000 years of history given a generation time of 25 years.
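The drift model above can be sketched in plain Python. This is an illustration, not the thesis code: the function name and the example parameter values are hypothetical, the binomial draw is implemented as a sum of Bernoulli trials to avoid external dependencies, and clamping $p + m$ at 1 is an added assumption to keep the probability valid.

```python
import random
from typing import List


def simulate_drift(n_men: int, mutation_rate: float, generations: int,
                   rng: random.Random) -> List[int]:
    """Track the number of men in the mutated state over time.

    Each generation draws x ~ Binomial(N, p + m), where p is the mutated
    fraction in the previous generation and m is the mutation rate.
    """
    mutated = 0                      # everyone starts in the reference state
    history = [mutated]
    for _ in range(generations):
        prob = min(1.0, mutated / n_men + mutation_rate)  # assumed clamp
        mutated = sum(rng.random() < prob for _ in range(n_men))
        history.append(mutated)
    return history


# Small illustrative run (the thesis simulated 10,000 generations).
history = simulate_drift(n_men=100, mutation_rate=1e-3, generations=200,
                         rng=random.Random(42))
print(len(history), 0 <= history[-1] <= 100)
```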

FIGURE GENERATION

Plots were generated using Adobe Illustrator and custom Python scripts with the Python modules matplotlib, seaborn, and ete3 (Hunter, 2007; Huerta-Cepas et al., 2016; https://doi.org/10.5281/zenodo.54844).

CUSTOM CODE

All custom code used is available at https://github.com/lsteitz/y-amplicon-evolution.

______

ACKNOWLEDGEMENTS

We thank Winston Bellott and Alex Godfrey for advice on analyses and figures, Renee George for assistance with Ytree, David Poznik for providing a formatted version of the Y chromosome phylogenetic tree, and Winston Bellott, Jenn Hughes, Alex Godfrey, and Emily Jackson for critical reading of the manuscript. This work was supported by the National Institutes of Health and the Howard Hughes Medical Institute.

______

REFERENCES

1000 Genomes Project Consortium, Auton, A., Brooks, L. D., Durbin, R. M., Garrison, E. P., Kang, H. M., Korbel, J. O., Marchini, J. L., McCarthy, S., et al. (2015). A global reference for human genetic variation. Nature 526, 68-74.

Andersson, M., Page, D. C., & de la Chapelle, A. (1986). Chromosome Y-specific DNA is transferred to the short arm of X chromosome in human XX males. Science 233, 786-8.

Bellott, D. W., Hughes, J. F., Skaletsky, H., Brown, L. G., Pyntikova, T., Cho, T. J., Koutseva, N., Zaghlul, S., Graves, T., et al. (2014). Mammalian Y chromosomes retain widely expressed dosage-sensitive regulators. Nature 508, 494-9.

Burgoyne, P. S. (1982). Genetic homology and crossing over in the X and Y chromosomes of Mammals. Hum Genet 61, 85-90.

Charlesworth, B., & Charlesworth, D. (2000). The degeneration of Y chromosomes. Philos Trans R Soc Lond B Biol Sci 355, 1563-72.

Chimpanzee Sequencing Analysis Consortium. (2005). Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69-87.

Connallon, T., & Clark, A. G. (2010). Gene duplication, gene conversion and the evolution of the Y chromosome. Genetics 186, 277-86.

Dixson, A. F. (1998). Primate Sexuality: Comparative Studies of the Prosimians, Monkeys, Apes and Human Beings. Chicago: Univ Chicago Press.

Dohm, J. C., Lottaz, C., Borodina, T., & Himmelbauer, H. (2008). Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res 36, e105.

Espinosa, J. R., Ayub, Q., Chen, Y., Xue, Y., & Tyler-Smith, C. (2015). Structural variation on the human Y chromosome from population-scale resequencing. Croat Med J 56, 194-207.

Fernandes, S., Paracchini, S., Meyer, L. H., Floridia, G., Tyler-Smith, C., & Vogt, P. H. (2004). A large AZFc deletion removes DAZ3/DAZ4 and nearby genes from men in Y haplogroup N. Am J Hum Genet 74, 180-7.

Fitch, W. M. (1971). Toward defining the course of evolution: minimum change for a specified tree topology. Systematic Zoology 20, 406-16.

Fu, Q., Li, H., Moorjani, P., Jay, F., Slepchenko, S. M., Bondarev, A. A., Johnson, P. L., Aximu-Petri, A., Prufer, K., et al. (2014). Genome sequence of a 45,000-year-old modern human from western Siberia. Nature 514, 445-9.

Ghenu, A. H., Bolker, B. M., Melnick, D. J., & Evans, B. J. (2016). Multicopy gene family evolution on primate Y chromosomes. BMC Genomics 17, 157.

Giachini, C., Nuti, F., Turner, D. J., Laface, I., Xue, Y., Daguin, F., Forti, G., Tyler-Smith, C., & Krausz, C. (2009). TSPY1 copy number variation influences spermatogenesis and shows differences among Y lineages. J Clin Endocrinol Metab 94, 4016-22.

Graves, J. A. (2006). Sex chromosome specialization and degeneration in mammals. Cell 124, 901-14.

Gudbjartsson, D. F., Helgason, H., Gudjonsson, S. A., Zink, F., Oddson, A., Gylfason, A., Besenbacher, S., Magnusson, G., Halldorsson, B. V., et al. (2015). Large-scale whole-genome sequencing of the Icelandic population. Nat Genet 47, 435-44.

Hallast, P., Balaresque, P., Bowden, G. R., Ballereau, S., & Jobling, M. A. (2013). Recombination dynamics of a human Y-chromosomal palindrome: rapid GC-biased gene conversion, multi-kilobase conversion tracts, and rare inversions. PLoS Genet 9, e1003666.

Hallast, P., Maisano Delser, P., Batini, C., Zadik, D., Rocchi, M., Schempp, W., Tyler-Smith, C., & Jobling, M. A. (2016). Great ape Y Chromosome and mitochondrial DNA phylogenies reflect subspecies structure and patterns of mating and dispersal. Genome Res 26, 427-39.

Huerta-Cepas, J., Serra, F., & Bork, P. (2016). ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data. Mol Biol Evol 33, 1635-8.

Hughes, J. F., Skaletsky, H., & Page, D. C. (2012). Sequencing of rhesus macaque Y chromosome clarifies origins and evolution of the DAZ (Deleted in AZoospermia) genes. Bioessays 34, 1035-44.

Hughes, J. F., Skaletsky, H., Pyntikova, T., Graves, T. A., van Daalen, S. K., Minx, P. J., Fulton, R. S., McGrath, S. D., Locke, D. P., et al. (2010). Chimpanzee and human Y chromosomes are remarkably divergent in structure and gene content. Nature 463, 536-9.

Hughes, J. F., Skaletsky, H., Pyntikova, T., Minx, P. J., Graves, T., Rozen, S., Wilson, R. K., & Page, D. C. (2005). Conservation of Y-linked genes during human evolution revealed by comparative sequencing in chimpanzee. Nature 437, 100-3.

Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing In Science & Engineering 9, 90-95.

Johansson, M. M., Van Geystelen, A., Larmuseau, M. H., Djurovic, S., Andreassen, O. A., Agartz, I., & Jazin, E. (2015). Microarray Analysis of Copy Number Variants on the Human Y Chromosome Reveals Novel and Frequent Duplications Overrepresented in Specific Haplogroups. PLoS One 10, e0137223.

Karmin, M., Saag, L., Vicente, M., Wilson Sayres, M. A., Jarve, M., Talas, U. G., Rootsi, S., Ilumae, A. M., Magi, R., et al. (2015). A recent bottleneck of Y chromosome diversity coincides with a global change in culture. Genome Res 25, 459-66.

Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A. M., & Haussler, D. (2002). The human genome browser at UCSC. Genome Res 12, 996-1006.

Kent, W. J., Zweig, A. S., Barber, G., Hinrichs, A. S., & Karolchik, D. (2010). BigWig and BigBed: enabling browsing of large distributed datasets. Bioinformatics 26, 2204-7.

Kuroda-Kawaguchi, T., Skaletsky, H., Brown, L. G., Minx, P. J., Cordum, H. S., Waterston, R. H., Wilson, R. K., Silber, S., Oates, R., et al. (2001). The AZFc region of the Y chromosome features massive palindromes and uniform recurrent deletions in infertile men. Nat Genet 29, 279-86.

Lahn, B. T., & Page, D. C. (1997). Functional coherence of the human Y chromosome. Science 278, 675-80.

Lange, J., Noordam, M. J., van Daalen, S. K., Skaletsky, H., Clark, B. A., Macville, M. V., Page, D. C., & Repping, S. (2013). Intrachromosomal homologous recombination between inverted amplicons on opposing Y-chromosome arms. Genomics 102, 257-64.

Lange, J., Skaletsky, H., van Daalen, S. K., Embry, S. L., Korver, C. M., Brown, L. G., Oates, R. D., Silber, S., Repping, S., et al. (2009). Isodicentric Y chromosomes and sex disorders as byproducts of homologous recombination that maintains palindromes. Cell 138, 855-69.

Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357-9.

Lu, C., Zhang, J., Li, Y., Xia, Y., Zhang, F., Wu, B., Wu, W., Ji, G., Gu, A., et al. (2009). The b2/b3 subdeletion shows higher risk of spermatogenic failure and higher frequency of complete AZFc deletion than the gr/gr subdeletion in a Chinese population. Hum Mol Genet 18, 1122-30.

Lynch, M., Cram, D. S., Reilly, A., O'Bryan, M. K., Baker, H. W., de Kretser, D. M., & McLachlan, R. I. (2005). The Y chromosome gr/gr subdeletion is associated with male infertility. Mol Hum Reprod 11, 507-12.

Massaia, A., & Xue, Y. (2017). Human Y chromosome copy number variation in the next generation sequencing era and beyond. Hum Genet 136, 591-603.

Mendez, F. L., Krahn, T., Schrack, B., Krahn, A. M., Veeramah, K. R., Woerner, A. E., Fomine, F. L., Bradman, N., Thomas, M. G., et al. (2013). An African American paternal lineage adds an extremely ancient root to the human Y chromosome phylogenetic tree. Am J Hum Genet 92, 454-9.

Morgan, A. P., & de Villena, F. P. V. (2017). Sequence and structural diversity of mouse Y chromosomes. Mol Biol Evol 34, 3186-204.

Nagasaki, M., Yasuda, J., Katsuoka, F., Nariai, N., Kojima, K., Kawai, Y., Yamaguchi-Kabata, Y., Yokozawa, J., Danjoh, I., et al. (2015). Rare variant discovery by deep whole-genome sequencing of 1,070 Japanese individuals. Nat Commun 6, 8018.

Nathanson, K. L., Kanetsky, P. A., Hawes, R., Vaughn, D. J., Letrero, R., Tucker, K., Friedlander, M., Phillips, K. A., Hogg, D., et al. (2005). The Y deletion gr/gr and susceptibility to testicular germ cell tumor. Am J Hum Genet 77, 1034-43.

Oetjens, M. T., Shen, F., Emery, S. B., Zou, Z., & Kidd, J. M. (2016). Y-Chromosome Structural Diversity in the Bonobo and Chimpanzee Lineages. Genome Biol Evol 8, 2231-40.

Olshen, A. B., Venkatraman, E. S., Lucito, R., & Wigler, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5, 557-72.

Poznik, G. D., Xue, Y., Mendez, F. L., Willems, T. F., Massaia, A., Wilson Sayres, M. A., Ayub, Q., McCarthy, S. A., Narechania, A., et al. (2016). Punctuated bursts in human male demography inferred from 1,244 worldwide Y-chromosome sequences. Nat Genet 48, 593-9.

Prado-Martinez, J., Sudmant, P. H., Kidd, J. M., Li, H., Kelley, J. L., Lorente-Galdos, B., Veeramah, K. R., Woerner, A. E., O'Connor, T. D., et al. (2013). Great ape genetic diversity and population history. Nature 499, 471-5.

Quinlan, A. R., & Hall, I. M. (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841-2.

Repping, S., Skaletsky, H., Brown, L., van Daalen, S. K., Korver, C. M., Pyntikova, T., Kuroda-Kawaguchi, T., de Vries, J. W., Oates, R. D., et al. (2003). Polymorphism for a 1.6-Mb deletion of the human Y chromosome persists through balance between recurrent mutation and haploid selection. Nat Genet 35, 247-51.

Repping, S., Skaletsky, H., Lange, J., Silber, S., Van Der Veen, F., Oates, R. D., Page, D. C., & Rozen, S. (2002). Recombination between palindromes P5 and P1 on the human Y chromosome causes massive deletions and spermatogenic failure. Am J Hum Genet 71, 906-22.

Repping, S., van Daalen, S. K., Brown, L. G., Korver, C. M., Lange, J., Marszalek, J. D., Pyntikova, T., van der Veen, F., Skaletsky, H., et al. (2006). High mutation rates have driven extensive structural polymorphism among human Y chromosomes. Nat Genet 38, 463-7.

Repping, S., van Daalen, S. K., Korver, C. M., Brown, L. G., Marszalek, J. D., Gianotten, J., Oates, R. D., Silber, S., van der Veen, F., et al. (2004). A family of human Y chromosomes has dispersed throughout northern Eurasia despite a 1.8-Mb deletion in the azoospermia factor c region. Genomics 83, 1046-52.

Rozen, S., Marszalek, J. D., Alagappan, R. K., Skaletsky, H., & Page, D. C. (2009). Remarkably little variation in proteins encoded by the Y chromosome's single-copy genes, implying effective purifying selection. Am J Hum Genet 85, 923-8.

Rozen, S., Skaletsky, H., Marszalek, J. D., Minx, P. J., Cordum, H. S., Waterston, R. H., Wilson, R. K., & Page, D. C. (2003). Abundant gene conversion between arms of palindromes in human and ape Y chromosomes. Nature 423, 873-6.

Rozen, S. G., Marszalek, J. D., Irenze, K., Skaletsky, H., Brown, L. G., Oates, R. D., Silber, S. J., Ardlie, K., & Page, D. C. (2012). AZFc deletions and spermatogenic failure: a population-based survey of 20,000 Y chromosomes. Am J Hum Genet 91, 890-6.

Sampson, J., Jacobs, K., Yeager, M., Chanock, S., & Chatterjee, N. (2011). Efficient study design for next generation sequencing. Genet Epidemiol 35, 269-77.

Saxena, R., Brown, L. G., Hawkins, T., Alagappan, R. K., Skaletsky, H., Reeve, M. P., Reijo, R., Rozen, S., Dinulos, M. B., et al. (1996). The DAZ gene cluster on the human Y chromosome arose from an autosomal gene that was transposed, repeatedly amplified and pruned. Nat Genet 14, 292-9.

Saxena, R., de Vries, J. W., Repping, S., Alagappan, R. K., Skaletsky, H., Brown, L. G., Ma, P., Chen, E., Hoovers, J. M., et al. (2000). Four DAZ genes in two clusters found in the AZFc region of the human Y chromosome. Genomics 67, 256-67.

Sen, A., & Srivastava, M. S. (1975). On tests for detecting a change in mean. Annals of Statistics 3, 98-108.

Skaletsky, H., Kuroda-Kawaguchi, T., Minx, P. J., Cordum, H. S., Hillier, L., Brown, L. G., Repping, S., Pyntikova, T., Ali, J., et al. (2003). The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature 423, 825-37.

Smit, A. F. A., Hubley, R., & Green, P. (2013-2015). RepeatMasker Open-4.0. Retrieved from http://www.repeatmasker.org

Soh, Y. Q., Alfoldi, J., Pyntikova, T., Brown, L. G., Graves, T., Minx, P. J., Fulton, R. S., Kremitzki, C., Koutseva, N., et al. (2014). Sequencing the mouse Y chromosome reveals convergent gene acquisition and amplification on both sex chromosomes. Cell 159, 800-13.

Sudmant, P. H., Mallick, S., Nelson, B. J., Hormozdiari, F., Krumm, N., Huddleston, J., Coe, B. P., Baker, C., Nordenfelt, S., et al. (2015). Global diversity, population stratification, and selection of human copy-number variation. Science 349, aab3761.

Telenti, A., Pierce, L. C., Biggs, W. H., di Iulio, J., Wong, E. H., Fabani, M. M., Kirkness, E. F., Moustafa, A., Shah, N., et al. (2016). Deep sequencing of 10,000 human genomes. Proc Natl Acad Sci U S A 113, 11901-06.

Tomaszkiewicz, M., Rangavittal, S., Cechova, M., Campos Sanchez, R., Fescemyer, H. W., Harris, R., Ye, D., O'Brien, P. C., Chikhi, R., et al. (2016). A time- and cost-effective strategy to sequence mammalian Y Chromosomes: an application to the de novo assembly of gorilla Y. Genome Res 26, 530-40.

Vogt, P. H., Edelmann, A., Kirsch, S., Henegariu, O., Hirschmann, P., Kiesewetter, F., Kohn, F. M., Schill, W. B., Farah, S., et al. (1996). Human Y chromosome azoospermia factors (AZF) mapped to different subregions in Yq11. Hum Mol Genet 5, 933-43.

Vollrath, D., Foote, S., Hilton, A., Brown, L. G., Beer-Romero, P., Bogan, J. S., & Page, D. C. (1992). The human Y chromosome: a 43-interval map based on naturally occurring deletions. Science 258, 52-9.

Warburton, P. E., Giordano, J., Cheung, F., Gelfand, Y., & Benson, G. (2004). Inverted repeat structure of the human genome: the X-chromosome contains a preponderance of large, highly homologous inverted repeats that contain testes genes. Genome Res 14, 1861-9.

Wei, W., Fitzgerald, T. W., Ayub, Q., Massaia, A., Smith, B. H., Dominiczak, A. F., Morris, A. D., Porteous, D. J., Hurles, M. E., et al. (2015). Copy number variation in the human Y chromosome in the UK population. Hum Genet 134, 789-800.

Wilson Sayres, M. A., Lohmueller, K. E., & Nielsen, R. (2014). Natural selection reduced diversity on human Y chromosomes. PLoS Genet 10, e1004064.

Wyckoff, G. J., Wang, W., & Wu, C. I. (2000). Rapid evolution of male reproductive genes in the descent of man. Nature 403, 304-9.

Y Chromosome Consortium. (2002). A nomenclature system for the tree of human Y-chromosomal binary haplogroups. Genome Res 12, 339-48.

Yang, B., Ma, Y. Y., Liu, Y. Q., Li, L., Yang, D., Tu, W. L., Shen, Y., Dong, Q., & Yang, Y. (2015). Common AZFc structure may possess the optimal spermatogenesis efficiency relative to the rearranged structures mediated by non-allele homologous recombination. Sci Rep 5, 10551.

SUPPLEMENTAL MATERIAL AND METHODS

GC BIAS CORRECTION

The GC content of DNA affects read depth in high-throughput sequencing (Dohm et al., 2008). This bias can drastically differ between sequencing libraries and is primarily driven by the GC content of the entire DNA fragment, rather than just the sequenced read (Benjamini and Speed, 2012). To correct for this effect, we built a GC bias curve for each sequencing library and corrected sequencing depth based on those curves. To build a GC bias curve, we began by selecting 10,000,000 positions on the autosomes, excluding repetitive regions as annotated by RepeatMasker (Smit et al., 2013-2015). In order to reduce the possibility of any systematic bias due to unanticipated factors in specific regions of the genome, these locations were different for each curve we built, but were always chosen so that regions with very high and very low GC content, which are relatively rare, were well-represented. Then, using the mapping locations of paired reads in the library, we built an empirical distribution of DNA fragment sizes present in the library. For each of the 10,000,000 chosen locations in the genome, we randomly selected from the empirical fragment size distribution and calculated the GC content of a window of the selected size starting at that location. We sorted each calculated GC percentage into bins of 0.5%. Then, we calculated the GC content of each fragment from the library that began at one of the chosen locations, and again sorted each calculated percentage into bins of 0.5%. For each bin, we divided the number of real fragments by the number of locations and normalized by total sequencing depth of the library. Finally, we smoothed the resulting GC curve with the LOWESS method, using the Statistics module of Biopython (Cock et al., 2009). The value of each bin in the final curve equals the over- or underrepresentation of observed fragments (fragments with that bin’s GC content in the sequencing library) relative to expected fragments (the prevalence of regions with that GC content in the genome).

After calculating GC curves, we calculated corrected sequencing depth. Sequencing depth for a location in the genome is normally calculated by adding 1 for each read that overlaps that location. For corrected depth, instead of adding 1 for each read, we add 1 divided by the value of the GC bias curve for the fragment’s GC content. If this value is > 3, we add 3 instead. This occurs most often for fragments with extremely high or low GC content, which tend to have very low GC curve values. Capping the depth value of a read at 3 prevents rare instances in which, by chance, a region of such fragments has a high number of reads, leading to its depth being exaggerated to extremely high levels in the absence of such a cap.
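The capped weighting can be illustrated with a short sketch. The data layout is hypothetical: reads are given as `(start, end, fragment_gc_percent)` tuples, and the GC curve is a dictionary keyed by 0.5% bin index with toy values. Treating a non-positive curve value as a capped read is an added assumption to avoid division by zero.

```python
from typing import Dict, List, Tuple

MAX_READ_WEIGHT = 3.0  # cap described in the text


def read_weight(fragment_gc_percent: float, gc_curve: Dict[int, float]) -> float:
    """Weight of one read: 1 / (GC curve value for its fragment), capped at 3."""
    bin_index = int(fragment_gc_percent * 2)   # bins are 0.5% wide
    curve_value = gc_curve[bin_index]
    if curve_value <= 0:                       # assumed handling of empty bins
        return MAX_READ_WEIGHT
    return min(1.0 / curve_value, MAX_READ_WEIGHT)


def corrected_depth(position: int, reads: List[Tuple[int, int, float]],
                    gc_curve: Dict[int, float]) -> float:
    """Sum of read weights over reads overlapping a position (half-open intervals)."""
    return sum(read_weight(gc, gc_curve)
               for start, end, gc in reads
               if start <= position < end)


toy_curve = {80: 1.0, 81: 0.5, 20: 0.1}        # bin index -> curve value
toy_reads = [(0, 100, 40.0), (50, 150, 40.5), (90, 190, 10.0)]
print(corrected_depth(95, toy_reads, toy_curve))  # 1 + 2 + 3 (capped) = 6.0
```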

BRANCH-SORTING ANALYSIS

The branch-sorting test generates an analytical, rather than simulation-based, p-value of observing a distribution of amplicon CNVs over the detailed phylogenetic tree under selectively neutral conditions. (See Figure 13 and Material and Methods for a description of this test.) We make the assumption that, under selectively neutral conditions, mutation events will be distributed uniformly over the total evolutionary time covered by the tree. This assumption holds true even if Y chromosomes underwent bursts of population expansion throughout history. The phylogenetic tree of Y chromosomes contains within itself the information about such population dynamics; because this analysis calculates the distribution of CNV events over the total evolutionary time traversed by the tree, the greater number of men in which mutation can occur after a population expansion is reflected in the greater evolutionary time covered by such men within the tree.

For this analysis, we annotate each mutation event, including events that happen on a Y chromosome that has already undergone a previous amplicon CNV mutation event. In contrast, our other, simulation-based analysis did not count such events. We made this distinction to make the simulation-based method more tractable, at some cost of verisimilitude. Most branches in which a mutation event occurred can be annotated by Fitch’s algorithm (Fitch, 1971). For 25 branches in which Fitch’s algorithm gave an inconclusive result, we manually annotated mutation events based on the most likely mechanism of mutation. For example, when two different variants are child nodes of the same parent node, Fitch’s algorithm is inconclusive. If one of the variants could result from a mutation event occurring on a chromosome with the other variant, we annotated the branch of the parent node and the branch of the former variant as having mutation events. If neither variant could arise from an event occurring on a chromosome with the other variant, we annotated both child node branches as having mutation events.

For a p-value to be valid, its values when testing data that conforms to the null hypothesis must be uniformly distributed between 0 and 1. Therefore, to test the validity of this analysis, we shuffled the order of the branches within the tree, maintaining the presence or absence of a mutation event in each branch, and calculated a p-value in the same way we calculated the p-value of the sorted branches. We performed this process 1,000 times and calculated the distribution of resulting p-values. We found that the p-values generated by shuffling the branches were indeed uniformly distributed, demonstrating that the test does perform well in this case.
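The shuffling check can be sketched as follows. This is an illustration under several assumptions rather than the thesis code: branches are given as hypothetical `(length, has_event)` pairs, each event is placed at the center of its branch along the cumulative length of the ordered branches, and only the Kolmogorov-Smirnov statistic against a uniform distribution is computed (no p-value), so that the example needs nothing beyond the standard library.

```python
import random
from typing import List, Tuple

Branch = Tuple[float, bool]  # (branch length, whether a mutation event occurred)


def event_positions(branches: List[Branch]) -> List[float]:
    """Positions in [0, 1] of events, each placed at the center of its branch."""
    total = sum(length for length, _ in branches)
    positions, elapsed = [], 0.0
    for length, has_event in branches:
        if has_event:
            positions.append((elapsed + length / 2) / total)
        elapsed += length
    return positions


def ks_statistic_uniform(positions: List[float]) -> float:
    """One-sample KS statistic against the uniform distribution on [0, 1]."""
    if not positions:
        raise ValueError("at least one event is required")
    xs = sorted(positions)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def shuffled_statistics(branches: List[Branch], n_shuffles: int,
                        rng: random.Random) -> List[float]:
    """KS statistics after shuffling branch order, keeping each branch's event flag."""
    stats = []
    for _ in range(n_shuffles):
        shuffled = branches[:]
        rng.shuffle(shuffled)
        stats.append(ks_statistic_uniform(event_positions(shuffled)))
    return stats


# Toy branches, ordered so that events cluster at the start.
toy = [(1.0, True), (1.0, True), (2.0, False), (4.0, False), (8.0, False)]
observed = ks_statistic_uniform(event_positions(toy))
null = shuffled_statistics(toy, 1000, random.Random(0))
print(observed, sum(d >= observed for d in null) / len(null))
```

Comparing the observed statistic with the shuffled ones gives an empirical sense of how extreme the ordering is, which complements the analytical p-value discussed in the text.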

However, two further tests demonstrate the limitations of our method. First, we took the empirical tree structure of the 1000 Genomes Project men and randomly assigned mutation events to branches with various mutation frequencies, ranging from $5 \times 10^{-1}$ to $5 \times 10^{-7}$ mutations per father-to-son Y transmission. This generated a number of trees, one per mutation rate, and each with a different total number of mutation events. We then performed 1,000 shuffles of each of these trees. In the trees with a high number of mutation events, the p-value distribution of the resulting shuffled trees was skewed towards low p-values. Second, we simulated amplicon mutation over the tree structure 1,000 times using a mutation rate of $3.83 \times 10^{-4}$ mutations per father-to-son Y transmission, which is the lower bound calculated from the real data. Unlike our other simulations, branches in which a mutation had occurred in an ancestral branch were allowed to mutate a second time; this allowance of re-mutation is necessary to match our assumption that mutation events should be uniformly distributed over the evolutionary time within the tree. For each simulation, we calculated p-values as described. In this case too, the distribution was skewed towards low p-values (Figure 13I). Additionally, the simulated trees tended to curve below the line representing neutral evolution (i.e., they had more mutation events in the recent past and fewer in the ancient past).

These results occur for two reasons. First, the KS test is designed to test continuous distributions. Here, the distribution of mutation events is discrete, as we place the mutation event at the center of the branch in which it occurred. Second, our model only allows a single mutation event per branch.

When mutations are rare (as is the case with the real data), these factors make little difference. However, when mutations are more common, the fact that all mutation events are in the center of each branch, combined with the fact that branches are not all the same length, creates enough deviation from the continuous null uniform distribution to skew the p-values toward lower values. Further, the fact that longer branches tend to be in the more ancient parts of the tree means that it is more likely that two mutation events (either true or simulated) would occur in a single branch in the more ancient parts of the tree. Those events are only counted as a single event by our method, reducing the number of events counted in the ancient branches of the tree.

Allowing multiple mutations to occur in each branch and distributing them randomly within the branch, rather than in the center, ameliorated these issues. For analysis of our real data, we chose not to do this, to keep the method as simple as possible. We note that the true data had a more extreme KS statistic than all 1,000 simulations; further, the minimum p-value of the 1,000 simulations was $4.99 \times 10^{-4}$, compared to $p = 1.01 \times 10^{-7}$ for the real data. Therefore, although our test exaggerates the significance of the p-value, the deviation of the real data from neutral expectation is nevertheless extremely significant. However, our method must be modified for trees that are more densely populated with mutation events and for trees in which the signature of selection is less extreme.

REFERENCES FOR SUPPLEMENTAL MATERIAL AND METHODS

Benjamini, Y., & Speed, T.P. (2012). Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 40, e72.

Cock, P.J., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., et al. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422-3.

Dohm, J.C., Lottaz, C., Borodina, T., & Himmelbauer, H. (2008). Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 36, e105.

Fitch, W.M. (1971). Toward defining the course of evolution: minimum change for a specified tree topology. Systematic Zoology 20, 406-16.

Smit, A. F. A., Hubley, R., & Green, P. (2013-2015). RepeatMasker Open-4.0. Retrieved from http://www.repeatmasker.org

SUPPLEMENTAL FIGURES AND TABLES

FIGURE S1. PREDICTED AZFC CNV STATES ARISING THROUGH NAHR

FIGURE S2. RESULTS OF TWO-COLOR FISH ANALYSIS

FIGURE S3. EVOLUTIONARY ANALYSIS OF HIGH-CONFIDENCE CNV CALLS

TABLE S1. AMPLICON AND CONTROL REGION BOUNDARIES

TABLE S2. COMPLETE AMPLICON ANNOTATION OF THE HUMAN Y CHROMOSOME

FIGURE S1. PREDICTED AZFC CNV STATES ARISING THROUGH NAHR

A. AZFc reference architecture.

B. AZFc architectures formed by one NAHR event between amplicon copies.

C. AZFc architectures formed by two NAHR events between amplicon copies. The

799 AZFc architectures formed by three NAHR events, not shown here, are available

in Repping et al. (2006).

D. AZFc architectures corresponding to copy number states found in 1000 Genomes

Project men. Some copy number states are concordant with multiple amplicon

architectures.

FIGURE S2. RESULTS OF TWO-COLOR FISH ANALYSIS

A. Hybridization locations of FISH probes used. (B-J) Selected FISH images and copy number call plots of the remaining 9 men on whom FISH was performed. AZFc architectures are shown for men whose computational CNV calls matched a predicted architecture.

B-D. Men with deletions. FISH of the man in C. detected an error in the computational

CNV call.

E-F. Men with the reference copy number.

G-J. Men with duplications. At high copy numbers, FISH underestimates the number of amplicon copies.

FIGURE S3. EVOLUTIONARY ANALYSIS OF HIGH-CONFIDENCE CNV CALLS

A. Mutation events vs. number of men with CNVs. Each point represents one simulation

over the phylogenetic tree of men in our dataset.

B. Distribution of CNV mutation events over the evolutionary tree. Blue curve: branches

of the phylogenetic tree of men in our dataset sorted by branch age. Red diagonal line:

expected distribution if CNVs were selectively neutral. Gray lines: branches of the

phylogenetic tree shuffled at random. 1,000 shuffles were performed. p = 6.57 × 10⁻⁷, KS test.

TABLE S1. AMPLICON AND CONTROL REGION BOUNDARIES

| Group | Amplicon name | Start location | End location | Length¹ | Orientation | Notes |
|---|---|---|---|---|---|---|
| P8 | Proximal P8 | 13984446 | 14019663 | 35217 | F | |
| P8 | Distal P8 | 14023068 | 14058287 | 35219 | R | |
| P7 | Proximal P7 | 15874850 | 15883583 | 8733 | F | |
| P7 | Distal P7 | 15896217 | 15904951 | 8734 | R | |
| P6 | Proximal P6 | 16159548 | 16269551 | 110003 | F | |
| P6 | Distal P6 | 16315772 | 16425803 | 110031 | R | |
| P5 | Proximal P5 | 17455803 | 17823070 | 367267 | F | Excludes IR5 region |
| P5 | Distal P5 | 17823070 | 17951262 | 128192 | R | Excludes IR5 region |
| IR1 | Yp IR1 | 7586095 | 7608949 | 21620 | F | Excludes gap from 7592160-7608949 |
| IR1 | Yq IR1 | 22722170 | 22746680 | 21526 | R | Excludes gaps from 22726696-22729621 and 22740559-22740618 |
| IR2 | Proximal IR2 | 21551835 | 21567301 | 15466 | F | Excludes RBMY genes |
| IR2 | Distal IR2 | 21862115 | 21877568 | 15453 | R | Excludes RBMY genes |
| IR3 | Distal IR3 | 6234810 | 6532527 | 276314 | F | Excludes TSPY copy from 6245524-6266828 and gap from 6361948-6362047 |
| IR3 | Proximal IR3 | 9628646 | 9919596 | 279977 | R | Excludes gaps from 9802679-9812672 and 9907906-9908886 |
| IR5 | IR5-1 | 17823070 | 17951262 | 128192 | F | |
| IR5 | IR5-2 | 17954717 | 18082941 | 128224 | R | |
| IR5 | IR5-3 | 23992775 | 24114882 | 122107 | F | |
| IR5 | IR5-4 | 25555259 | 25677365 | 122106 | R | |
| Blue | b1 | 21925349 | 22093051 | 167702 | F | |
| Blue | b2 | 22493356 | 22661025 | 167669 | R | |
| Blue | b3 | 23535162 | 23702730 | 167568 | F | |
| Blue | b4 | 25967344 | 26135454 | 168110 | R | |
| Teal | t1 | 22093051 | 22208739 | 94273 | F | Excludes RBMY gene from 22165030-22186445 |
| Teal | t2 | 22377680 | 22493356 | 94259 | R | Excludes RBMY gene from 22399976-22421393 |
| Green | g1 | 22754280 | 23061370 | 307090 | F | |
| Green | g2 | 24388041 | 24695116 | 307075 | F | |
| Green | g3 | 24974970 | 25282100 | 307130 | R | |
| Red | r1 | 23061754 | 23208205 | 76698 | F | Excludes DAZ gene from 23129355-23199108 |
| Red | r2 | 23210394 | 23359020 | 76703 | R | Excludes DAZ gene from 23219434-23291357 |
| Red | r3 | 24695499 | 24822589 | 76667 | F | Excludes DAZ gene from 24763070-24813493 |
| Red | r4 | 24824725 | 24974626 | 76680 | R | Excludes DAZ gene from 24833820-24907041 |
| Gray | gray1 | 23359751 | 23474022 | 114271 | F | |
| Gray | gray2 | 26196591 | 26311317 | 114726 | R | |
| Yellow | y1 | 23702730 | 24276001 | 451164 | F | Excludes IR5 region from 23992775-24114882; excludes distal region that aligns to multiple locations in the genome |
| Yellow | y2 | 25394048 | 25967344 | 451190 | R | Excludes IR5 region from 25555259-25677365; excludes proximal region that aligns to multiple locations in the genome |
| Control regions | Control region 1 | 4000000 | 5000000 | 1000000 | N/A | |
| Control regions | Control region 2 | 7000000 | 7500000 | 500000 | N/A | |
| Control regions | Control region 3 | 8000000 | 8500000 | 500000 | N/A | |
| Control regions | Control region 4 | 20500000 | 20550000 | 50000 | N/A | |
| Control regions | Normalization region | 14500000 | 15500000 | 1000000 | N/A | |

¹ Total length of sequence used to calculate region depth. Some regions excluded internal sequence from the depth calculation; for these regions, the length is less than the distance from start location to end location. Details for each such region are given in the Notes column.
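As a worked example of the footnote, the sketch below recomputes the depth-calculation length of three regions from their boundaries and the excluded intervals listed in the Notes column. The coordinates are taken from Table S1; the function name `effective_length` and the treatment of coordinates as half-open intervals are assumptions made for illustration.

```python
def effective_length(start, end, excluded=()):
    """Length of a region after removing excluded sub-intervals.

    Assumes half-open coordinates and non-overlapping excluded intervals
    that lie entirely inside the region.
    """
    total = end - start
    for ex_start, ex_end in excluded:
        if not (start <= ex_start <= ex_end <= end):
            raise ValueError("excluded interval falls outside the region")
        total -= ex_end - ex_start
    return total

# Boundaries and exclusions as listed in Table S1.
t1 = effective_length(22093051, 22208739, [(22165030, 22186445)])
r1 = effective_length(23061754, 23208205, [(23129355, 23199108)])
yq_ir1 = effective_length(
    22722170, 22746680, [(22726696, 22729621), (22740559, 22740618)]
)

print(t1, r1, yq_ir1)  # 94273 76698 21526
```

The three results match the Length column for t1, r1, and Yq IR1.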

TABLE S2. COMPLETE AMPLICON ANNOTATION OF THE HUMAN Y CHROMOSOME

| Start | End | Orientation | Repeat | Subrepeat | Repeat Copy | Length | Copy Number |
|---|---|---|---|---|---|---|---|
| 6234810 | 6245524 | F | IR3 | IR3-1 | Distal IR3 | 10714 | 2 |
| 6245524 | 6266828 | R | TSPY | TSPY | Small TSPY array | 21304 | 2 |
| 6266828 | 6361948 | F | IR3 | IR3-2 | Distal IR3 | 95120 | 2 |
| 6362047 | 6532527 | F | IR3 | IR3-3 | Distal IR3 | 170480 | 2 |
| 7578486 | 7586095 | R | IR1 | Green/IR1 | Yp IR1 | 7609 | 4 |
| 7586095 | 7592160 | F | IR1 | IR1-1 | Yp IR1 | 6065 | 2 |
| 7593394 | 7604421 | F | IR1 | IR1-2 | Yp IR1 | 11027 | 2 |
| 7604421 | 7608949 | F | IR1 | IR1-3 | Yp IR1 | 4528 | 2 |
| 7608949 | 7638905 | F | IR1 | Blue/IR1 | Yp IR1 | 29956 | 4 |
| 9333511 | 9537359 | R | TSPY | TSPY | Large TSPY array | 203848 | 2 |
| 9628646 | 9802679 | R | IR3 | IR3-3 | Proximal IR3 | 174033 | 2 |
| 9812672 | 9907906 | R | IR3 | IR3-2 | Proximal IR3 | 95234 | 2 |
| 9908886 | 9919596 | R | IR3 | IR3-1 | Proximal IR3 | 10710 | 2 |
| 13984446 | 14019663 | F | P8 | P8 | Proximal P8 | 35217 | 2 |
| 14023068 | 14058287 | R | P8 | P8 | Distal P8 | 35219 | 2 |
| 15874850 | 15883583 | F | P7 | P7 | Proximal P7 | 8733 | 2 |
| 15896217 | 15904951 | R | P7 | P7 | Distal P7 | 8734 | 2 |
| 16159548 | 16269551 | F | P6 | P6 | Proximal P6 | 110003 | 2 |
| 16315772 | 16425803 | R | P6 | P6 | Distal P6 | 110031 | 2 |
| 17455803 | 17823070 | F | P5 | P5 | Proximal P5 | 367267 | 2 |
| 17823070 | 17951262 | F | P5 | IR5 | IR5-1 | 128192 | 4 |
| 17954717 | 18082941 | R | P5 | IR5 | IR5-2 | 128224 | 4 |
| 18082941 | 18450203 | R | P5 | P5 | Distal P5 | 367262 | 2 |
| 18450212 | 18640367 | F | P4 | P4 | Proximal P4 | 190155 | 2 |
| 18680022 | 18870186 | R | P4 | P4 | Distal P4 | 190164 | 2 |
| 20055177 | 20350712 | R | DYZ19 | DYZ19 | DYZ19 | 295535 | 1 |
| 21010200 | 21021696 | R | Rep1 | Rep1 | Rep1-1 | 11496 | 2 |
| 21051500 | 21063551 | R | Rep1 | Rep1 | Rep1-2 | 12051 | 2 |
| 21493770 | 21504762 | F | RBMY1 | RBMY1-1 | RBMY1-1.0 | 10992 | 7 |
| 21504762 | 21508333 | F | IR2 | IR2-RBMY1-spacer | Proximal IR2 | 3571 | 4 |
| 21508333 | 21517311 | F | IR2 | RBMY1-2 | Proximal IR2 | 8978 | 6 |
| 21517311 | 21528302 | F | IR2 | RBMY1-1 | Proximal IR2 | 10991 | 7 |
| 21528302 | 21531874 | F | IR2 | IR2-RBMY1-spacer | Proximal IR2 | 3572 | 4 |
| 21531874 | 21540851 | F | IR2 | RBMY1-2 | Proximal IR2 | 8977 | 6 |
| 21540851 | 21551835 | F | IR2 | RBMY1-1 | Proximal IR2 | 10984 | 7 |
| 21551835 | 21567301 | F | IR2 | IR2 | Proximal IR2 | 15466 | 2 |
| 21862115 | 21877568 | R | IR2 | IR2 | Distal IR2 | 15453 | 2 |
| 21877568 | 21888451 | R | IR2 | RBMY1-1 | Distal IR2 | 10883 | 7 |
| 21888451 | 21897531 | R | IR2 | RBMY1-2 | Distal IR2 | 9080 | 6 |
| 21897531 | 21901098 | R | IR2 | IR2-RBMY1-spacer | Distal IR2 | 3567 | 4 |
| 21901098 | 21912093 | R | IR2 | RBMY1-1 | Distal IR2 | 10995 | 7 |
| 21912093 | 21921073 | R | IR2 | RBMY1-2 | Distal IR2 | 8980 | 6 |
| 21921073 | 21924668 | R | IR2 | IR2-RBMY1-spacer | Distal IR2 | 3595 | 4 |
| 21925349 | 22093051 | F | P3 | Blue | b1 | 167702 | 4 |
| 22093051 | 22165030 | F | P3 | Teal-1 | t1 | 71979 | 2 |
| 22166027 | 22177018 | R | P3 | RBMY1-1 | t1 | 10991 | 7 |
| 22177028 | 22185987 | R | P3 | RBMY1-2 | t1 | 8959 | 6 |
| 22186445 | 22208739 | F | P3 | Teal-2 | t1 | 22294 | 2 |
| 22377680 | 22399976 | R | P3 | Teal-2 | t2 | 22296 | 2 |
| 22400430 | 22409408 | F | P3 | RBMY1-2 | t2 | 8978 | 6 |
| 22409408 | 22420396 | F | P3 | RBMY1-1 | t2 | 10988 | 7 |
| 22421393 | 22493356 | R | P3 | Teal-1 | t2 | 71963 | 2 |
| 22493356 | 22661025 | R | P3 | Blue | b2 | 167669 | 4 |
| 22661506 | 22692207 | R | P3 | Blue-plus | b2 | 30701 | 3 |
| 22692207 | 22722170 | R | IR1 | Blue/IR1 | b2/IR1 | 29963 | 4 |
| 22722170 | 22726696 | R | IR1 | IR1-3 | Yq IR1 | 4526 | 2 |
| 22729621 | 22740559 | R | IR1 | IR1-2 | Yq IR1 | 10938 | 2 |
| 22740618 | 22746680 | R | IR1 | IR1-1 | Yq IR1 | 6062 | 2 |
| 22746680 | 22754280 | F | IR1 | Green/IR1 | g1 | 7600 | 4 |
| 22754280 | 23061370 | F | g1 | Green | g1 | 307090 | 3 |
| 23061754 | 23129355 | F | P2 | Red-1 | r1 | 67601 | 4 |
| 23129355 | 23199108 | F | P2 | DAZ | r1 | 69753 | 4 |
| 23199108 | 23208205 | F | P2 | Red-2 | r1 | 9097 | 4 |
| 23208205 | 23210394 | F | P2 | Red-spacer | r1-r2 | 2189 | 2 |
| 23210394 | 23219434 | R | P2 | Red-2 | r2 | 9040 | 4 |
| 23219434 | 23291357 | R | P2 | DAZ | r2 | 71923 | 4 |
| 23291357 | 23359020 | R | P2 | Red-1 | r2 | 67663 | 4 |
| 23359751 | 23474022 | F | P1 | Gray | gray1 | 114271 | 2 |
| 23474022 | 23503983 | F | P1 | Blue/IR1 | b3 | 29961 | 4 |
| 23503983 | 23534676 | F | P1 | Blue-plus | b3 | 30693 | 3 |
| 23535162 | 23702730 | F | P1 | Blue | b3 | 167568 | 4 |
| 23702730 | 23992775 | F | P1 | Yellow-1 | y1 | 290045 | 2 |
| 23992775 | 24114882 | F | P1 | IR5 | y1 | 122107 | 4 |
| 24114882 | 24124587 | F | P1.2 | P1.1/2 | y1 | 9705 | 4 |
| 24124587 | 24128529 | F | P1.2 | P1.1/2-spacer | y1 | 3942 | 2 |
| 24128529 | 24138240 | R | P1.2 | P1.1/2 | y1 | 9711 | 4 |
| 24138240 | 24180078 | F | P1 | P1-chr15-1 | y1 | 41838 | 2 |
| 24180078 | 24181684 | F | P1.3 | P1.3/4 | y1 | 1606 | 4 |
| 24181684 | 24207522 | F | P1.3 | P1.3/4-spacer | y1 | 25838 | 2 |
| 24207522 | 24209128 | R | P1.3 | P1.3/4 | y1 | 1606 | 4 |
| 24209128 | 24276001 | F | P1 | P1-ch15-2 | y1 | 66873 | 2 |
| 24276001 | 24380446 | F | P1 | P1-mm | y1 | 104445 | 2 |
| 24380446 | 24388041 | F | P1 | Green/IR1 | g2 | 7595 | 4 |
| 24388041 | 24695116 | F | P1 | Green | g2 | 307075 | 3 |
| 24695499 | 24763070 | F | P1 | Red-1 | r3 | 67571 | 4 |
| 24763070 | 24813493 | F | P1 | DAZ | r3 | 50423 | 4 |
| 24813493 | 24822589 | F | P1 | Red-2 | r3 | 9096 | 4 |
| 24822589 | 24824725 | F | P1 | Red-spacer | r3-r4 | 2136 | 2 |
| 24824725 | 24833820 | R | P1 | Red-2 | r4 | 9095 | 4 |
| 24833820 | 24907041 | R | P1 | DAZ | r4 | 73221 | 4 |
| 24907041 | 24974626 | R | P1 | Red-1 | r4 | 67585 | 4 |
| 24974970 | 25282100 | R | P1 | Green | g3 | 307130 | 3 |
| 25282100 | 25289694 | R | P1 | Green/IR1 | g3 | 7594 | 4 |
| 25289694 | 25394048 | R | P1 | P1-mm | y2 | 104354 | 2 |
| 25394048 | 25461016 | R | P1 | P1-ch15-2 | y2 | 66968 | 2 |
| 25461016 | 25462622 | F | P1.4 | P1.3/4 | y2 | 1606 | 4 |
| 25462622 | 25488457 | R | P1.4 | P1.3/4-spacer | y2 | 25835 | 2 |
| 25488457 | 25490063 | R | P1.4 | P1.3/4 | y2 | 1606 | 4 |
| 25490063 | 25531901 | R | P1 | P1-chr15-1 | y2 | 41838 | 2 |
| 25531901 | 25541604 | F | P1.1 | P1.1/2 | y2 | 9703 | 4 |
| 25541604 | 25545550 | R | P1.1 | P1.1/2-spacer | y2 | 3946 | 2 |
| 25545550 | 25555259 | R | P1.1 | P1.1/2 | y2 | 9709 | 4 |
| 25555259 | 25677365 | R | P1 | IR5 | y2 | 122106 | 4 |
| 25677365 | 25967344 | R | P1 | Yellow-1 | y2 | 289979 | 2 |
| 25967344 | 26135454 | R | P1 | Blue | b4 | 168110 | 4 |
| 26135935 | 26166629 | R | P1 | Blue-plus | b4 | 30694 | 3 |
| 26166629 | 26196591 | R | P1 | Blue/IR1 | b4 | 29962 | 4 |
| 26196591 | 26311317 | R | P1 | Gray | gray2 | 114726 | 2 |
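The annotation in Table S2 can be checked against the region lengths in Table S1. The sketch below does this for Distal IR3, using the three IR3 subrepeat rows from Table S2; the tuple layout and the function name `rows_consistent` are illustrative assumptions.

```python
# (start, end, subrepeat, length) for the Distal IR3 rows of Table S2.
distal_ir3 = [
    (6234810, 6245524, "IR3-1", 10714),
    (6266828, 6361948, "IR3-2", 95120),
    (6362047, 6532527, "IR3-3", 170480),
]

def rows_consistent(rows):
    """True if every row's Length column equals End minus Start."""
    return all(end - start == length for start, end, _, length in rows)

total_length = sum(length for _, _, _, length in rows_consistent and distal_ir3)

print(rows_consistent(distal_ir3))  # True
print(total_length)                 # 276314
```

The total equals the Length reported for Distal IR3 in Table S1, which excludes the TSPY copy and the gap that separate these subrepeats.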

CHAPTER 3

CONCLUSIONS AND FUTURE WORK

______

CONCLUSIONS

Research into the Y chromosome over the past century has been dominated by the idea that the preservative forces of evolution were, at best, valiantly fighting a losing battle and, at worst, completely ineffectual (Muller, 1914; Ohno, 1967; Charlesworth and Charlesworth, 2000). However, in recent years this view has begun to change as evidence of Y chromosome maintenance and selection, primarily within single-copy sequence, has emerged (Rozen et al., 2009; Hughes et al., 2010; Hughes et al., 2012; Bellott et al., 2014; Wilson Sayres et al., 2014). In this thesis, we have demonstrated that such forces act on the amplicons as well. In particular, amplicon copy number on the human Y chromosome has been conserved for over 200,000 years of history. This preservation occurred despite the extremely high rate at which the amplicons mutate through CNVs. We found that the majority of amplicon CNVs, especially those within the AZFc region, are caused by NAHR, as expected based on previous research. We also demonstrated that the mutational mechanism can be reconstructed even for complex CNVs, which are often the result of multiple steps of NAHR, sometimes in conjunction with other mechanisms.

The conservation of the amplicons points to the action of strong selective pressures working to maintain amplicon copy number, and suggests a stability of amplicon structure within humans that is unexpected in the face of the massive differences in amplicon structure and content seen between species (Hughes et al., 2010; Hughes et al., 2012; Soh et al., 2014). This work is a step forward in the ongoing reframing of the mammalian Y chromosome’s history: the Y chromosome is not just the victim of random neutral processes, but is also the carefully calibrated result of the action of various selective forces.

______

FUTURE WORK

Recent years have produced answers to many longstanding questions about the Y

chromosome, but many questions—especially about the amplicons—remain. Future

research will continue to reveal the function of the amplicons and more details about the

selective forces that act upon them. In the Discussion of Chapter 2, we raised the

potential for future work to measure the magnitude of the phenotypic effects of different

ampliconic CNVs, distinguish between competing models of amplicon evolution, and

expand our analyses by using emerging high-throughput sequencing datasets in humans

and other species. Other potentially fruitful lines of inquiry are studying the function of

individual ampliconic genes, using high-throughput genomic approaches to understand

the impact of genetic background on the phenotypes of amplicon structural variation,

leveraging cutting-edge sequencing technologies to efficiently assemble the ampliconic structure of many species, sequencing targeted groups of closely related Y chromosomes, and studying other mechanisms of ampliconic change.

Although many ampliconic CNVs, such as the AZFc and gr/gr deletions, have known phenotypes (Kuroda-Kawaguchi et al., 2001; Lynch et al., 2005; Nathanson et al., 2005), the functions of the individual ampliconic genes are still unclear. Because of the complex architecture of the amplicons, these deletions remove copies of multiple genes, making it impossible to determine which gene is responsible for the resulting phenotype. The challenge of determining ampliconic gene function is compounded because the Y chromosomes of classical model systems such as mice cannot be used to perform genetic studies of human ampliconic genes, due to the drastic differences in ampliconic gene content between species. Biochemical and expression studies of ampliconic genes, neither of which is affected by the genetic complexity of the amplicons, can clarify the

nature of the proteins coded by the ampliconic genes and those genes’ roles within

biological pathways, respectively. Another approach to this problem is the use of induced

pluripotent stem cells (iPSCs): cell lines can be generated with an existing deletion and

complemented with ampliconic genes one at a time, or modern gene editing techniques

can generate cell lines with every copy of a single gene knocked out but with all other

genes intact. These methods have already shown promise in studies that implanted human

cell lines with Y chromosome deletions into mice to study the effects on germ cell

formation in vivo (Ramathal et al., 2014; Ramathal et al., 2015). Finally, as discussed in

Chapter 2, sequencing datasets that are an order of magnitude larger than the one studied

here are becoming available (Gudbjartsson et al., 2015; Nagasaki et al., 2015; Telenti et

al., 2016). Large sequencing studies can identify men with rare mutations that eliminate

the function of a single ampliconic gene, and the phenotypes of such men can provide

insight into the deleted gene’s function.

Beyond their use in the study of individual ampliconic genes, these larger datasets can also clarify the effects of various ampliconic CNVs on different genetic backgrounds.

We found that the gr/gr and b2/b3 deletions are both preferentially rescued by duplications in the 1000 Genomes Project men. The gr/gr deletion is a known risk factor for spermatogenic failure and testis cancer (Lynch et al., 2005; Nathanson et al., 2005; Rozen et al., 2012), but is nearly fixed within a branch of haplogroup D (Repping et al., 2003). The results of studies that looked directly for a phenotypic effect of the b2/b3 deletion are less clear. The deletion is nearly fixed in a branch of haplogroup N (Fernandes et al., 2004; Repping et al., 2004), and a massive survey of 20,000 Y chromosomes found no association between the b2/b3 deletion and spermatogenic failure (Rozen et al., 2012). However, other work has claimed that the b2/b3 deletion is a risk factor for spermatogenic failure (Lu et al., 2009; Yang et al., 2015). These seemingly contradictory results could arise if the gr/gr and b2/b3 deletions have different phenotypic effects depending on the genetic background of the Y chromosome on which they occur. Haplogroups D and N could have protective mutations that mitigate the effects of such deletions. The studies that found deleterious effects of the b2/b3 deletion were performed on Han Chinese men, while those that found no effect studied men from other populations. While our results indicate that these and similar CNVs have deleterious effects on fitness across all Y chromosomes, the magnitude of those effects may differ depending on genetic background. As association studies grow to include tens or even hundreds of thousands of men, enough data can be generated to differentiate between those magnitudes.

Beyond simply sequencing more men with current methods, cutting-edge sequencing technology will also assist in the study of the amplicons. Nanopore sequencing can generate reads up to hundreds of kb long (Jain et al., 2018). Although the base calling accuracy of these reads is still poor compared to short-read technologies, such long reads are a powerful tool for resolving complex ampliconic structures. (In a testament to the impressiveness of the Y chromosome amplicons, the largest ampliconic regions, such as palindrome P1 on the human Y chromosome or the long arm of the mouse Y chromosome, are too long to be resolved even by reads of this size without significant additional effort. Still, these long reads are a massive leap forward towards assembling ampliconic structures relatively quickly and easily.) This technology could usher in an era in which the Y chromosome ampliconic structure is known for tens or even hundreds of species, instead of the handful available today. A repository of ampliconic structures and genes would enable comparative evolutionary studies of the amplicons that could answer longstanding questions about the amplicons’ origins and functions, such as the mechanisms by which amplicons arise, the stability of ampliconic structures over different timescales, and the structural and sequence requirements for maintenance of amplicons.

In this vein, sequencing specifically selected groups of closely related Y chromosomes can advance our understanding of ampliconic origins and functions. To date, the most closely related Y chromosomes with complete ampliconic sequence are the human and chimpanzee, which diverged 8 million years ago (Chimpanzee Sequencing Analysis Consortium, 2005). The differences between the ampliconic sequence in those two species make it difficult to reconstruct an ancestral amplicon structure and the series of mutational events that took place to transform the ancestral human/chimpanzee Y chromosome into the modern Y chromosomes in each species. Sequencing two or more species that diverged more recently could reveal more about the rate of amplicon divergence between species and the types of mutations that become fixed within species.

A further avenue of study is investigating other methods of ampliconic change. The work described here focused on large structural variants within amplicons, and particularly on deletions and duplications of whole amplicon copies. While these mutation events change the copy number of an amplicon, most of them do not cause an ampliconic unit to grow or shrink. It is known that amplicons can either gain or lose sequence: humans, chimpanzees, and gorillas share some ampliconic sequence, but for many of these shared amplicons, the inner palindrome boundary sequence is common to multiple species, while the outer palindrome boundary differs (Rozen et al., 2003). Studying how amplicon boundaries differ within a species, as well as sequencing more closely related species that may share some ampliconic sequence, can shed light on how amplicons expand and contract over time.

All in all, the Y chromosome amplicons are the result of tens of millions of years of evolution unconstrained by many typical limitations faced by the X chromosome and the autosomes, but subject to their own unique restrictions and pressures. In many ways, our view of the structures and functions that DNA can take is limited by what is common and easy to study, and the amplicons reveal that drastically different modes are possible as well. The future will bring even greater insights into these possibilities through continued study of the history, dynamics, and functions of this fascinating but often overlooked part of the genome.

ACKNOWLEDGEMENTS

I thank Winston Bellott, Mina Kojima, and David Page for their comments on this

chapter.

______

REFERENCES

Bellott, D. W., Hughes, J. F., Skaletsky, H., Brown, L. G., Pyntikova, T., Cho, T. J., Koutseva, N., Zaghlul, S., Graves, T., et al. (2014). Mammalian Y chromosomes retain widely expressed dosage-sensitive regulators. Nature 508, 494-9.

Charlesworth, B., & Charlesworth, D. (2000). The degeneration of Y chromosomes. Philos Trans R Soc Lond B Biol Sci 355, 1563-72.

Chimpanzee Sequencing Analysis Consortium. (2005). Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69-87.

Fernandes, S., Paracchini, S., Meyer, L. H., Floridia, G., Tyler-Smith, C., & Vogt, P. H. (2004). A large AZFc deletion removes DAZ3/DAZ4 and nearby genes from men in Y haplogroup N. Am J Hum Genet 74, 180-7.

Gudbjartsson, D. F., Helgason, H., Gudjonsson, S. A., Zink, F., Oddson, A., Gylfason, A., Besenbacher, S., Magnusson, G., Halldorsson, B. V., et al. (2015). Large-scale whole-genome sequencing of the Icelandic population. Nat Genet 47, 435-44.

Hughes, J. F., Skaletsky, H., Brown, L. G., Pyntikova, T., Graves, T., Fulton, R. S., Dugan, S., Ding, Y., Buhay, C. J., et al. (2012). Strict evolutionary conservation followed rapid gene loss on human and rhesus Y chromosomes. Nature 483, 82-6.

Hughes, J. F., Skaletsky, H., Pyntikova, T., Graves, T. A., van Daalen, S. K., Minx, P. J., Fulton, R. S., McGrath, S. D., Locke, D. P., et al. (2010). Chimpanzee and human Y chromosomes are remarkably divergent in structure and gene content. Nature 463, 536-9.

Jain, M., Koren, S., Miga, K. H., Quick, J., Rand, A. C., Sasani, T. A., Tyson, J. R., Beggs, A. D., Dilthey, A. T., et al. (2018). Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol 36, 338-45.

Kuroda-Kawaguchi, T., Skaletsky, H., Brown, L. G., Minx, P. J., Cordum, H. S., Waterston, R. H., Wilson, R. K., Silber, S., Oates, R., et al. (2001). The AZFc region of the Y chromosome features massive palindromes and uniform recurrent deletions in infertile men. Nat Genet 29, 279-86.

Lu, C., Zhang, J., Li, Y., Xia, Y., Zhang, F., Wu, B., Wu, W., Ji, G., Gu, A., et al. (2009). The b2/b3 subdeletion shows higher risk of spermatogenic failure and higher frequency of complete AZFc deletion than the gr/gr subdeletion in a Chinese population. Hum Mol Genet 18, 1122-30.

Lynch, M., Cram, D. S., Reilly, A., O'Bryan, M. K., Baker, H. W., de Kretser, D. M., & McLachlan, R. I. (2005). The Y chromosome gr/gr subdeletion is associated with male infertility. Mol Hum Reprod 11, 507-12.

Muller, H. J. (1914). A gene for the fourth chromosome of Drosophila. Journal of Experimental Zoology 17, 325-36.

Nagasaki, M., Yasuda, J., Katsuoka, F., Nariai, N., Kojima, K., Kawai, Y., Yamaguchi-Kabata, Y., Yokozawa, J., Danjoh, I., et al. (2015). Rare variant discovery by deep whole-genome sequencing of 1,070 Japanese individuals. Nat Commun 6, 8018.

Nathanson, K. L., Kanetsky, P. A., Hawes, R., Vaughn, D. J., Letrero, R., Tucker, K., Friedlander, M., Phillips, K. A., Hogg, D., et al. (2005). The Y deletion gr/gr and susceptibility to testicular germ cell tumor. Am J Hum Genet 77, 1034-43.

Ohno, S. (1967). Sex Chromosomes and Sex-linked Genes. Berlin: Springer.

Ramathal, C., Angulo, B., Sukhwani, M., Cui, J., Durruthy-Durruthy, J., Fang, F., Schanes, P., Turek, P. J., Orwig, K. E., et al. (2015). DDX3Y gene rescue of a Y chromosome AZFa deletion restores germ cell formation and transcriptional programs. Sci Rep 5, 15041.

Ramathal, C., Durruthy-Durruthy, J., Sukhwani, M., Arakaki, J. E., Turek, P. J., Orwig, K. E., & Reijo Pera, R. A. (2014). Fate of iPSCs derived from azoospermic and fertile men following xenotransplantation to murine seminiferous tubules. Cell Rep 7, 1284-97.

Repping, S., Skaletsky, H., Brown, L., van Daalen, S. K., Korver, C. M., Pyntikova, T., Kuroda-Kawaguchi, T., de Vries, J. W., Oates, R. D., et al. (2003). Polymorphism for a 1.6-Mb deletion of the human Y chromosome persists through balance between recurrent mutation and haploid selection. Nat Genet 35, 247-51.

Repping, S., van Daalen, S. K., Korver, C. M., Brown, L. G., Marszalek, J. D., Gianotten, J., Oates, R. D., Silber, S., van der Veen, F., et al. (2004). A family of human Y chromosomes has dispersed throughout northern Eurasia despite a 1.8-Mb deletion in the azoospermia factor c region. Genomics 83, 1046-52.

Rozen, S., Marszalek, J. D., Alagappan, R. K., Skaletsky, H., & Page, D. C. (2009). Remarkably little variation in proteins encoded by the Y chromosome's single-copy genes, implying effective purifying selection. Am J Hum Genet 85, 923-8.

Rozen, S., Skaletsky, H., Marszalek, J. D., Minx, P. J., Cordum, H. S., Waterston, R. H., Wilson, R. K., & Page, D. C. (2003). Abundant gene conversion between arms of palindromes in human and ape Y chromosomes. Nature 423, 873-6.

Rozen, S. G., Marszalek, J. D., Irenze, K., Skaletsky, H., Brown, L. G., Oates, R. D., Silber, S. J., Ardlie, K., & Page, D. C. (2012). AZFc deletions and spermatogenic failure: a population-based survey of 20,000 Y chromosomes. Am J Hum Genet 91, 890-6.

Soh, Y. Q., Alfoldi, J., Pyntikova, T., Brown, L. G., Graves, T., Minx, P. J., Fulton, R. S., Kremitzki, C., Koutseva, N., et al. (2014). Sequencing the mouse Y chromosome reveals convergent gene acquisition and amplification on both sex chromosomes. Cell 159, 800-13.

Telenti, A., Pierce, L. C., Biggs, W. H., di Iulio, J., Wong, E. H., Fabani, M. M., Kirkness, E. F., Moustafa, A., Shah, N., et al. (2016). Deep sequencing of 10,000 human genomes. Proc Natl Acad Sci U S A 113, 11901-06.

Wilson Sayres, M. A., Lohmueller, K. E., & Nielsen, R. (2014). Natural selection reduced diversity on human Y chromosomes. PLoS Genet 10, e1004064.

Yang, B., Ma, Y. Y., Liu, Y. Q., Li, L., Yang, D., Tu, W. L., Shen, Y., Dong, Q., & Yang, Y. (2015). Common AZFc structure may possess the optimal spermatogenesis efficiency relative to the rearranged structures mediated by non-allele homologous recombination. Sci Rep 5, 10551.
