The Pennsylvania State University

The Graduate School

EVOLUTION OF Y CHROMOSOME AMPLICONIC GENES IN GREAT APES

A Dissertation in

Bioinformatics and Genomics

by

Rahulsimham Vegesna

© 2020 Rahulsimham Vegesna

Submitted in Partial Fulfillment

of the Requirements

for the Degree of

Doctor of Philosophy

May 2020

The dissertation of Rahulsimham Vegesna was reviewed and approved by the following:

Paul Medvedev Associate Professor of Computer Science & Engineering Associate Professor of Biochemistry & Molecular Biology Dissertation Co-Adviser Co-Chair of Committee

Kateryna D. Makova Pentz Professor of Biology Dissertation Co-Adviser Co-Chair of Committee

Michael DeGiorgio Associate Professor of Biology and Statistics

Wansheng Liu Professor of Animal Genomics

George H. Perry Chair, Intercollege Graduate Degree Program in Bioinformatics and Genomics Associate Professor of Anthropology and Biology


ABSTRACT

In addition to the sex-determining SRY gene and several other single-copy genes, the human Y chromosome harbors nine multi-copy ampliconic gene families that are expressed exclusively in testis. In humans, these gene families are important for spermatogenesis, and their loss is observed in patients suffering from infertility. However, only five of the nine ampliconic gene families are found across great apes, while the others are missing or pseudogenized in some species. My research goal is to understand the evolution of the Y ampliconic gene families in humans and in non-human great ape species. The specific objectives I addressed in this dissertation are:

1. To test whether Y ampliconic gene expression levels depend on their copy number and whether gene dosage compensation counteracts the ampliconic gene copy number variation observed in humans. For the nine ampliconic gene families found in humans, copy number and expression levels were estimated in 149 men. Across the Y ampliconic gene families, families with higher copy number have higher expression levels. Within a Y ampliconic gene family, however, copy number does not influence gene expression; rather, a high tolerance for variation in gene expression was observed in the testis of presumably healthy men. We also found that expression of five Y ampliconic gene families is coordinated with that of their non-Y (i.e., X or autosomal) homologs. Indeed, five ampliconic gene families had consistently lower expression levels than their non-Y homologs, suggesting dosage regulation, while the HSFY family had higher expression levels than its X homolog and thus lacked dosage regulation.

2. To test whether the Y ampliconic gene copy number and gene expression levels are conserved across great apes. For the ampliconic gene families found in great apes, the copy number and expression levels were estimated in independent datasets ranging from two to 14 samples per species. Our results indicate high variability in gene family size but conservation in gene expression levels in Y ampliconic gene families. This relationship was similar to what was observed in humans. However, for three gene families, size was positively correlated with gene expression levels across species, suggesting that, given sufficient evolutionary time, copy number influences gene expression on the Y chromosome.

3. To study the dynamics of gene (and gene family) loss and gain on great ape Y chromosomes. Using the assemblies and alignments of great ape Y chromosomes, we determined the gene content of the bonobo and orangutan Y chromosomes. We then reconstructed the evolutionary history of gene content across great apes and observed an increased rate of gene loss in the Pan genus (bonobo and chimpanzee) compared with other great apes. The human palindromes P6 and P7, which are devoid of known ampliconic genes, are conserved across great apes. A potential reason for their conservation is the presence of gene expression regulators, rather than genes, on these palindromes.

The results of this dissertation significantly advance our understanding of Y chromosome evolution in great apes. They provide an overview of variation in gene copy number and expression levels of these highly similar gene families which have been a challenge to study previously.

Table of Contents

LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGMENTS

Chapter 1
Introduction
  References

Chapter 2
Dosage regulation, and variation in gene expression and copy number at human Y chromosome ampliconic genes
  Abstract
  Introduction
  Results
    AmpliCoNE: Ampliconic Copy Number Estimator
    Y ampliconic gene copy number estimates
    Y ampliconic gene families with low copy number in humans are frequently deleted in non-human great apes
    Y ampliconic gene expression
    More copious gene families have higher gene expression levels
    Within a family, copy number and gene expression are not correlated
    Y haplogroups and ampliconic gene families
    The role of age in ampliconic gene expression
    Ampliconic gene dosage regulation
  Discussion
    Variability in Y ampliconic gene copy number
    Variability in Y ampliconic gene expression
    Dosage regulation of Y ampliconic genes
  Materials and Methods
    AmpliCoNE: Ampliconic Copy Number Estimator
    Simulation-based validation of AmpliCoNE
    Datasets
    Pipeline for human WGS analysis
    Experimental validation with droplet digital PCR (ddPCR)
    Estimating gene expression levels
    Human Y haplogroup determination
    Code availability
  References

Chapter 3
Ampliconic genes on the great ape Y chromosomes: Rapid evolution of copy number but conservation of expression levels
  Abstract
  Introduction
  Results
    Dynamic evolution of Y ampliconic gene copy number
    Conservation of Y ampliconic gene expression in great apes
    The relationship between copy number and gene expression levels
    Y ampliconic gene copy number variation and phenotypes related to sperm competition
  Discussion
  Materials and Methods
    DNA samples
    ddPCR assays for ampliconic gene copy number estimation in bonobo and orangutan
    Construction of Y-specific great ape phylogenetic trees
    Analysis of conservation in copy number across great ape species
    RNA-Seq datasets
    Transcriptome assembly of Y ampliconic genes in great apes
    Estimating gene expression levels from RNA-Seq datasets
    Testing for conservation in gene expression levels
    Availability of data and materials
  References

Chapter 4
Dynamic evolution of great ape Y chromosomes
  Introduction
  Results
    Conservation of human and chimpanzee palindrome sequences
    What drives conservation of gene-free palindromes P6 and P7?
    Evolution of gene content in great apes
    Gene conversion between X and Y chromosomes of great apes
  Discussion
    Palindromes
    Genes
    Gene conversion
  Materials and Methods
    Human and chimpanzee palindrome arm sequence conservation
    Palindrome sequence read depth in bonobo, gorilla, and orangutan
    Search for regulatory factors in human palindromes P6 and P7
    Analysis of genes homologous to human Y genes
    Novel genes in bonobo and orangutan assemblies
    Reconstruction of gene content of great apes
    Gene conversion events between the X and Y chromosomes
  References

Chapter 5
Summary
  Significance and future work
  Major contributions
  References

Appendix A
  Supplemental Tables
  Supplemental Figures
  References

Appendix B
  Supplemental Note 1. CAFE Simulations
  Supplemental Note 2. EVE Simulations
  Supplemental Tables
  Supplemental Figures
  References

Appendix C
  Supplemental Note S1. Evolutionary scenarios for palindromes
  Supplemental Note S2. Augustus predictions
  Supplemental Tables
  Supplemental Figures
  References


LIST OF TABLES

Table 3-1. Differences in Y ampliconic gene copy numbers across species as evaluated with ANOVA and CAFE. To determine which ampliconic gene families vary in their copy number across great ape species, we performed conventional one-way ANOVA (F-statistic and p-value are shown). To identify significant expansions or contractions of gene family size across great apes, we performed CAFE analysis. Significant p-values (Bonferroni-corrected p-value cutoff of 0.05/9 ≈ 0.006; nine gene families) are in bold.

Table 4-1. Gene conversion between X and Y chromosomes of great apes using GENECONV. The first column has information about the gene symbols with the stratum they belong to in brackets. The values in the following column represent the number of gene conversion events with significant p-values (<0.05) which are corrected for multiple comparisons across all sequence pairs. The values in brackets represent the number of gene conversion events with significant p-values (<0.05) for pairwise comparisons corrected for the length of the alignment.


LIST OF FIGURES

Figure 2-1. Correlation in copy number and expression levels across Y ampliconic gene families. The gene families are clustered based on correlation coefficients. (A) Correlation in copy numbers among 167 individuals. (B) Correlation in gene expression levels among 149 individuals.

Figure 2-2. Relationship between copy number and expression levels for nine Y ampliconic gene families. The copy number (X-axis) and gene expression (Y-axis) values for 149 individuals are presented on a natural log scale. The dots are values for different men, and boxplots are the distribution of values for individual gene families. Both the dots and boxplots are color-coded by their respective gene families. The black line represents the linear function (copy number ~ expression) fitted to the points on the plot. The coefficient of determination (R²) for the linear model is 0.25.

Figure 2-3. The distribution of ampliconic gene copy numbers and expression levels across Y haplogroups. For each plot the x-axis shows Y haplogroups: E - African (N=22, yellow), I - European (N=24, green), J - Western Asian (N=11, blue), and R - European (N=85, red), and the y-axis shows copy number estimates or gene expression levels. The black dashed line represents the overall mean copy number or expression values for all the samples on each plot. The permutation-based significance of pairwise haplogroup comparisons is shown with stars (*** P < 0.001, ** P < 0.01, * P < 0.05). The one-way ANOVA test p-values are printed at the top of each plot. A Bonferroni-corrected cutoff for nine tests (0.05/9 ≈ 0.006) is used to identify significance of ANOVA.

Figure 2-4. Possible differences in expression between Y ampliconic genes and their non-Y homologs. (A) Neofunctionalization: Ampliconic gene family, after moving to the Y or divergence from the X, obtained a new beneficial function. The gene family might be under positive selection and its expression might be independent from its non-Y homolog. (B) Subfunctionalization: Ampliconic gene family, after moving to the Y or divergence from the X, acquired a (testis-specific) sub-function. Because subfunctionalization is division of labour, the sum of ampliconic gene expression and non-Y homolog expression should be equal to the ancestral expression. The average expression of an ampliconic gene family could be lower or higher than that of its homologs and depends on the subfunction. (C) Relaxed selection: Ampliconic gene family, due to its multi-copy nature and presence of gene conversion, evolves at a faster rate than its non-Y homolog and is not under selection.

Figure 2-5. Expression differences between Y ampliconic genes and their non-Y homologs. Each plot compares expression levels of a Y ampliconic gene family (the sum of expression of all copies of a gene family, blue) to its homologs on the X chromosome (green) or autosomes (yellow). The gene names are shown on the x-axis and normalized expression levels on the y-axis.

Figure 2-6. Individual-level relationship between Y ampliconic genes and their non-Y homologs. Each plot compares expression levels of a Y ampliconic gene family (the sum of expression of all copies of a gene family) to its non-Y homologs in each individual (N=149). Each dot represents an individual. The black line represents the linear regression fit to the data and the respective equation is at the top of each plot. The R-squared value is at the bottom right of each plot.

Figure 3-1. Y ampliconic genes in great apes. A. Venn diagram showing gene content comparison among great ape species. B. Plot of the first two principal components (PCs) of Y ampliconic gene copy numbers across great ape species (the first and second PCs explained 68.7% and 22.8% of the variation, respectively; Fig. S2 shows variation explained by the other components).

Figure 3-2. Variation in copy number of Y ampliconic gene families in great apes. Box plots summarizing the distribution of copy numbers of the six great ape species across nine Y ampliconic gene families. The gene families are separated into individual plots with the gene family name at the top. Within each plot, the X-axis represents six species (bonobo, chimpanzee, human, gorilla, Bornean orangutan, and Sumatran orangutan), and the Y-axis represents copy number. The black dot within each boxplot is the median value per species.

Figure 3-3. Larger Y ampliconic gene families are more variable across great apes. The six scatter plots represent the relationship between median copy number and variance for each of the species, and the species name is present at the top of each plot. The X-axis represents the natural logarithm of median copy number and the Y-axis the natural logarithm of variance in copy number. The Spearman correlations were calculated using the cor.test() function in R and the p-values are in parentheses. The black line represents the linear function fitted to the given data points. The dots are color-coded to represent the nine gene families, with missing dots indicating gene family absence in that species.

Figure 3-4. Results of CAFE analysis identifying Y ampliconic gene families with significant shifts in gene copy number when compared to their ancestors. For each gene family with a significant difference in copy number, the phylogenetic tree representing the estimated copy number at internal nodes is shown. Significant shifts are highlighted in blue (contraction) and red (expansion). The copy numbers at the internal nodes were predicted by CAFE.

Figure 3-5. Summary of gene expression levels across great apes. In the dot plot, the X-axis represents nine ampliconic gene families and the Y-axis represents their expression levels. The plot represents testis-specific expression of 12 great ape samples. Each dot within a gene family represents expression levels of an individual and the color of the dot denotes the species it belongs to. Missing dots represent gene families that are considered missing or pseudogenized, and their expression levels are excluded from the gene expression analysis (Table S5).

Figure 3-6. Relationship between copy number and gene expression of Y ampliconic gene families in great ape species. The five scatter plots represent the relationship between expression and copy number for each of the five species; the name of the species is present at the top of each plot. In each of the scatter plots the X-axis represents the natural logarithm of median copy number and the Y-axis represents the natural logarithm of median gene expression. The Spearman correlations were calculated using the cor.test() function in R and the p-values are in parentheses. The black line is the linear function fitted to the given data points. The dots are color-coded to represent the nine gene families, with missing dots corresponding to the gene families that are pseudogenized, deleted, or not expressed in that species.

Figure 3-7. Relationship between copy number and gene expression across species. In each of the scatter plots the X-axis represents the natural logarithm of median copy number and the Y-axis represents the natural logarithm of median gene expression. The Spearman correlations were calculated using the cor.test() function in R and the p-values are in parentheses. The black line represents the linear function fitted to the given data points. The dots are color-coded to represent the five species. The five scatter plots represent the relationship between expression and copy number for each of the five gene families, with the name of the gene family present at the top of each plot. Only the gene families that are present in all species are shown here.

Figure 4-1. Evolution of sequences homologous to human and chimpanzee palindromes. (A) Heatmaps showing coverage for each palindrome in each species in the multi-species alignment, and box plots representing copy number (natural log) of 1-kb windows which have homology with human or chimpanzee palindromes. (B) The great ape phylogenetic tree with evolution of human (shown in blue) and chimpanzee (purple) palindromic sequences overlaid on it. Palindrome names in bold indicate that their sequences were present in ≥2 copies. Positive (+) and negative (-) signs indicate gain and loss of palindrome sequence (possibly only partial), respectively. Arrows represent gain (↑) or loss (↓) of palindrome copy number. If several equally parsimonious scenarios were possible, we assumed a later date of acquisition of the multi-copy state for a palindrome (Note S1).

Figure 4-2. IGV screenshots of peaks in DNase-seq, H3K4me1 and H3K27ac on palindrome P6. A. Peaks on both arms of the palindrome are shown within the blue and grey boxes. B. Zoomed-in view of peaks on the left arm of palindrome P6. The coverage track represents the depth of coverage and the peaks track represents the peaks identified by macs2. C. Zoomed-in view of peaks on the right arm of palindrome P6. The coverage track represents the depth of coverage and the peaks track represents the peaks identified by macs2.

Figure 4-3. IGV screenshot of CREB1 peaks on palindrome P7. Peaks on both arms of the palindrome are shown. The coverage track represents the depth of coverage and the peaks track represents the peaks identified by macs2.

Figure 4-4. Evolution of Y chromosome gene content in great apes. The reconstructed history of gene birth and death for X-degenerate (blue) and ampliconic (red) genes was overlaid on the great ape phylogenetic tree (not drawn to scale), using macaque as an outgroup. The rates of gene birth and death (in events per million years) are shown in parentheses (for complete data see Fig. S3). The list at the root includes the genes that were present in the common ancestor of great apes and macaque. In addition to most of the genes on the human Y, the macaque Y harbors the X-degenerate MXRA5Y gene, which we found to be deleted in orangutan and pseudogenized in bonobo, chimpanzee, gorilla, and human. We currently cannot find a full-length copy of the VCY gene in bonobo. TXLNGY and DDX3Y are also known as CYorf15B and DBY, respectively.


ACKNOWLEDGMENTS

I would like to thank Paul Medvedev, Kateryna Makova, Shashikant Cooduvalli, Michael DeGiorgio, Steve Schaeffer, Marry Poss, Wansheng Liu, and Francesca Chiaromonte for their time and effort during the last six and a half years. I would like to thank past and present members of the Medvedev lab and the Makova lab for their support through the highs and lows of my PhD journey and for being there to celebrate even the small victories. I would also like to thank the Huck administrative staff and bx administrators for their help throughout my stay. I would also like to thank members of GenoMIX and IGSA, who helped me grow as a person. I learned a lot from every one of you, and thank you for making my PhD journey memorable.

I would also like to thank the funding sources that provided financial support over the years, including the PSU-NIH funded CBIOS Predoctoral Training Program, grants from the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R01GM130691 (to K.D.M.), and the National Science Foundation (NSF) awards DBI-1356529 (to P.M.), IIS-1453527, IIS-1421908, and CCF-1439057 (to P.M.). In addition, funds were made available through the Clinical and Translational Sciences Institute, Institute for Cyber Science, and Eberly College of Sciences at Penn State, and the Pennsylvania Department of Health using Tobacco Settlement and CURE Funds. The findings and conclusions do not necessarily reflect the view of the above funding agencies.

Finally, I would like to thank my family and friends for their continuous support and for believing in me at every point of my life.


Chapter 1

Introduction

Hominidae (great apes) is a family of primates whose members shared a common ancestor approximately 13 million years ago (Glazko & Nei, 2003). The family includes four extant genera: chimpanzees and bonobos (Pan), gorillas (Gorilla), humans (Homo), and orangutans (Pongo). These primates are classified as great apes because of their intelligence, strength, large size, and absence of a tail (“Fowler’s Zoo and Wild Animal Medicine, Volume 8 | ScienceDirect,” n.d.). Geographically, non-human great apes are restricted to equatorial Africa (chimpanzees, bonobos, and gorillas) and Southeast Asia (orangutans), and they are under significant threat from severe hunting, habitat loss, and the spread of infectious diseases (“Fowler’s Zoo and Wild Animal Medicine, Volume 8 | ScienceDirect,” n.d.). The International Union for Conservation of Nature has classified these non-human great apes as endangered (Fruth et al., 2016; Iucn & IUCN, 2016a, 2016b, 2016c, 2017). There are multiple efforts to conserve these great ape species, either in captivity or in the wild (“Fowler’s Zoo and Wild Animal Medicine, Volume 8 | ScienceDirect,” n.d.). One of the most important goals of conserving endangered species is to preserve genetic variation and reproductive fitness to ensure their recovery as self-sustaining, healthy populations (Comizzoli, Songsasen, & Wildt, 2010). The Y chromosome of great apes harbors gene families linked to spermatogenesis (Skaletsky et al., 2003), which could influence the reproductive success of great apes. There is a substantial gap in our understanding of the basic biology of these multi-copy gene families on the Y chromosome, which this dissertation helps to address.

Great ape sex chromosomes consist of an autosome-like X and a hemizygous Y, which originated from a pair of homologous chromosomes approximately 160-190 million years ago (MYA) (Luo, Yuan, Meng, & Ji, 2011; Veyrunes et al., 2008). One of the proto-sex chromosomes acquired the testis-determining factor (SRY), turning it into a proto-Y; a series of inversions then prevented the Y chromosome from recombining with the X (Lahn & Page, 1999; Ross et al., 2005). Lack of recombination resulted in massive gene loss and degradation of the Y (Bellott et al., 2014; Skaletsky et al., 2003). However, the Y chromosome counteracted this process by acquiring large repeat structures (amplicons), which enable gene conversion and non-allelic homologous recombination within the Y (Betrán, Demuth, & Williford, 2012). The ampliconic regions consist of intrachromosomal duplications in the form of inverted and tandem repeats (Skaletsky et al., 2003). Y chromosome assemblies are available for human, chimpanzee, and gorilla, but are missing for bonobo and orangutans (Hughes et al., 2010; Skaletsky et al., 2003; Tomaszkiewicz et al., 2016). Cytogenetic studies showed differences in gene content, gene order, and size of the Y chromosome among great apes (Gläser et al., 1998; Weber, Schempp, & Wiesner, 1986; Yunis & Prakash, 1982).

The repeats in the ampliconic region include tandem repeats and palindromic structures composed of highly similar inverted repeats (arms) around a relatively short unique sequence (spacer). The close proximity of repeats to each other enables gene conversion and non-allelic homologous recombination, which remove deleterious mutations and lower the diversity of the repeats and the genes within them (Betrán et al., 2012). As a result, the arms within a palindrome are 99.9% identical to each other, which makes the paralogous genes on the arms very similar to each other in sequence (Skaletsky et al., 2003). The genes in the ampliconic region are present as multi-copy gene families (Skaletsky et al., 2003). There are nine protein-coding ampliconic gene families on the human Y: BPY2 (basic protein Y2), CDY (chromodomain Y), DAZ (deleted in azoospermia), HSFY (heat-shock transcription factor Y), PRY (PTP-BL related Y), RBMY (RNA-binding motif Y), TSPY (testis-specific Y), VCY (variable charge), and XKRY (X Kell blood-related Y). In non-human great apes, a few of these gene families are missing: HSFY, PRY, and XKRY are missing in chimpanzee and possibly also in bonobo, and the VCY gene family is missing in bonobo, gorilla, and orangutans (Hallast & Jobling, 2017). A byproduct of intraspecific non-allelic homologous recombination within the ampliconic region is variation in the copy number of ampliconic genes within a species (Hughes et al., 2010; Lucotte et al., 2018; Oetjens, Shen, Emery, Zou, & Kidd, 2016; Repping et al., 2006; Schaller et al., 2010; Skov & Schierup, 2017; Tomaszkiewicz et al., 2016; Ye et al., 2018). The evolutionary conservation of ampliconic gene family copy number across great apes has not been established, because the ampliconic copy number of the orangutan species is still unknown. For the species with available data, the TSPY and RBMY gene families show higher variation in copy number than other gene families among humans, chimpanzees, and gorillas (Hughes et al., 2010; Lucotte et al., 2018; Oetjens et al., 2016; Repping et al., 2006; Schaller et al., 2010; Skov & Schierup, 2017; Tomaszkiewicz et al., 2016; Ye et al., 2018).
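The palindrome structure described above (two inverted-repeat arms around a spacer) can be made concrete with a small sketch. The sequences below are invented toy strings, not real Y chromosome sequence, and the helper names `reverse_complement` and `percent_identity` are hypothetical; the example only illustrates what it means for two arms to be 99.9% identical once one arm is reverse-complemented.

```python
# Toy illustration of a palindrome: two inverted-repeat arms around a unique spacer.
# All sequences are made up for illustration; they are not real Y chromosome sequence.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


left_arm = "ACGTTGCAAGGCTTACGGAT" * 50  # hypothetical 1,000-bp arm
spacer = "TTTTGGGGCC"                   # hypothetical short unique spacer

# A perfect right arm is the reverse complement of the left arm.
right_arm_bases = list(reverse_complement(left_arm))
# Introduce one substitution to mimic a mutation not yet erased by gene conversion.
right_arm_bases[0] = "A" if right_arm_bases[0] != "A" else "C"
right_arm = "".join(right_arm_bases)

palindrome = left_arm + spacer + right_arm

# Compare the left arm with the reverse complement of the right arm.
identity = percent_identity(left_arm, reverse_complement(right_arm))
print(f"arm-to-arm identity: {identity:.1f}%")  # prints "arm-to-arm identity: 99.9%"
```

Real arm comparisons require sequence alignment, because insertions and deletions break the position-by-position comparison used here.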

Ampliconic genes are expressed predominantly or exclusively in testis and play an important role in spermatogenesis (Skaletsky et al., 2003). Loss or partial deletion of ampliconic gene copies is linked to infertility in humans. Screening for deletions in the Y chromosome of individuals with infertility identified three azoospermia factor regions (AZFa, AZFb, and AZFc), which are active during different phases of spermatogenesis (P. H. Vogt et al., 1996), and complete or partial deletion of these regions is linked to azoospermia and arrest of spermatogenesis (Carvalho, Zhang, & Lupski, 2011; Krausz et al., 2011; Navarro-Costa, Plancha, & Gonçalves, 2010; Repping et al., 2002; Peter H. Vogt, 1996). The exact function of individual gene families is understudied (Navarro-Costa, 2012). Information about ampliconic gene expression is limited. Studies showed that testis-specific expression of Y ampliconic genes was acquired prior to their amplification on the Y (Bellott et al., 2014). The expression levels of Y genes decreased when compared to ancestral expression levels of single proto-sex chromosome alleles in primates (Cortez et al., 2014). A recent study investigating the expression of Y ampliconic genes during male meiosis found that gene families with high variation in copy number also have high expression levels at different stages of sperm development (Lucotte et al., 2018). Apart from this, very little is known about the expression of ampliconic genes and how copy number variation impacts gene expression.

Despite the importance of Y chromosome ampliconic gene families for infertility and reproductive fitness, there is still a substantial gap in our understanding of the basic biology of ampliconic gene families on the Y chromosome. For example, some questions of interest are: what is the effect of variation in ampliconic gene copy number on gene expression levels; what mechanisms has the testis adopted to handle differences in gene copy number between individuals; are ampliconic gene copy number and expression levels conserved across great apes; and do the dynamics of gene loss or gain on great ape Y chromosomes explain the differences in mating patterns across great apes? My research goal is to understand the evolution of the Y ampliconic gene families in humans and in non-human great ape species. In this dissertation, my specific objectives are to explore the variability of ampliconic genes at the DNA and RNA levels using high-throughput next-generation sequencing technologies. In Chapter 2, I test whether Y ampliconic gene expression levels depend on their copy number, and whether gene dosage regulation counteracts the ampliconic gene copy number variation observed in humans, using the Genotype-Tissue Expression (GTEx) dataset (Ardlie et al., 2015). In Chapter 3, as a follow-up to Chapter 2, we test whether Y ampliconic gene copy number and gene expression levels are conserved across great apes. The goal is to test for significant gain or loss of copies and for shifts in gene expression levels across great ape species, and to test whether the variation observed can explain phenotypes related to sperm competition in great apes. In Chapter 4, we study the dynamics of gene (and gene family) loss and gain on great ape Y chromosomes using Y chromosome assemblies of great apes. Chapter 5 presents conclusions from the dissertation.
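The first question above, whether expression tracks copy number, is at its core a rank-correlation question. The figure captions in this dissertation report Spearman correlations computed with `cor.test()` in R; the following is only a minimal Python stand-in for that idea, not the pipeline used in the later chapters. The function names and the five data points are invented for illustration.

```python
# Minimal sketch: Spearman rank correlation between copy number and expression.
# The numbers below are invented toy values, not results from this dissertation.
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at sorted position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical median copy number and expression for five gene families.
copy_number = [2, 4, 6, 30, 35]
expression = [1.0, 3.5, 2.0, 40.0, 55.0]

rho = spearman(copy_number, expression)
print(f"Spearman rho = {rho:.2f}")  # prints "Spearman rho = 0.90"
```

A rank-based measure is a natural choice here because copy number and expression span very different scales across gene families; the sketch omits the p-value that `cor.test()` also reports.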

References

Ardlie, K. G., DeLuca, D. S., Segrè, A. V., Sullivan, T. J., Young, T. R., Gelfand, E. T., … Lockhart. (2015). The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science, 348(6235), 648–660.

Bellott, D. W., Hughes, J. F., Skaletsky, H., Brown, L. G., Pyntikova, T., Cho, T. J., … Page, D. C. (2014). Mammalian Y chromosomes retain widely expressed dosage-sensitive regulators. Nature, 508(7497), 494–499.

Betrán, E., Demuth, J. P., & Williford, A. (2012). Why Chromosome Palindromes? International Journal of Evolutionary Biology, 2012(Figure 2), 1–14.

Carvalho, C. M. B., Zhang, F., & Lupski, J. R. (2011). Structural variation of the human genome: mechanisms, assays, and role in male infertility. Systems Biology in Reproductive Medicine, 57(1-2), 3–16.

Comizzoli, P., Songsasen, N., & Wildt, D. E. (2010). Protecting and extending fertility for females of wild and endangered mammals. Cancer Treatment and Research, 156, 87–100.

Cortez, D., Marin, R., Toledo-Flores, D., Froidevaux, L., Liechti, A., Waters, P. D., … Kaessmann, H. (2014). Origins and functional evolution of Y chromosomes across mammals. Nature, 508(7497), 488–493.

Fowler’s Zoo and Wild Animal Medicine, Volume 8 | ScienceDirect. (n.d.). Retrieved September 4, 2019, from https://www.sciencedirect.com/book/9781455773978/fowlers-zoo-and-wild-animal-medicine-volume-8

Fruth, B., Hickey, J. R., André, C., Furuichi, T., Hart, J., Hart, T., … Others. (2016). Pan paniscus. The IUCN Red List of Threatened Species 2016: e.T15932A102331567.

Gläser, B., Grützner, F., Willmann, U., Stanyon, R., Arnold, N., Taylor, K., … Schempp, W. (1998). Simian Y chromosomes: species-specific rearrangements of DAZ, RBM, and TSPY versus contiguity of PAR and SRY. Mammalian Genome: Official Journal of the International Mammalian Genome Society, 9(3), 226–231.

Glazko, G. V., & Nei, M. (2003). Estimation of divergence times for major lineages of primate species. Molecular Biology and Evolution, 20(3), 424–434.

Hallast, P., & Jobling, M. A. (2017). The Y chromosomes of the great apes. Human Genetics, 136(5), 511–528.

Hughes, J. F., Skaletsky, H., Pyntikova, T., Graves, T. A., van Daalen, S. K. M., Minx, P. J., … Page, D. C. (2010). Chimpanzee and human Y chromosomes are remarkably divergent in structure and gene content. Nature, 463(7280), 536–539.

Iucn, & IUCN. (2016a). Gorilla gorilla: Maisels, F., Bergl, R.A. & Williamson, E.A. IUCN Red List of Threatened Species. https://doi.org/10.2305/iucn.uk.2018-2.rlts.t9404a136250858.en

Iucn, & IUCN. (2016b). Pan troglodytes: Humle, T., Maisels, F., Oates, J.F., Plumptre, A. & Williamson, E.A. IUCN Red List of Threatened Species. https://doi.org/10.2305/iucn.uk.2016-2.rlts.t15933a17964454.en

Iucn, & IUCN. (2016c). Pongo pygmaeus: Ancrenaz, M., Gumal, M., Marshall, A.J., Meijaard, E., Wich, S.A. & Husson, S. IUCN Red List of Threatened Species. https://doi.org/10.2305/iucn.uk.2016-1.rlts.t17975a17966347.en

Iucn, & IUCN. (2017). Pongo abelii: Singleton, I., Wich, S.A., Nowak, M., Usher, G. & Utami-Atmoko, S.S. IUCN Red List of Threatened Species. https://doi.org/10.2305/iucn.uk.2017-3.rlts.t121097935a115575085.en

Krausz, C., Chianese, C., Giachini, C., Guarducci, E., Laface, I., & Forti, G. (2011). The Y chromosome-linked copy number variations and male fertility. Journal of Endocrinological Investigation, 34(5), 376–382.

Lahn, B. T., & Page, D. C. (1999). Four evolutionary strata on the human X chromosome. Science, 286(5441), 964–967.

Lucotte, E. A., Skov, L., Jensen, J. M., Macià, M. C., Munch, K., & Schierup, M. H. (2018). Dynamic Copy Number Evolution of X- and Y-Linked Ampliconic Genes in Human Populations. Genetics, 209(3), 907–920.

Luo, Z.-X., Yuan, C.-X., Meng, Q.-J., & Ji, Q. (2011). A Jurassic eutherian mammal and divergence of marsupials and placentals. Nature, 476(7361), 442–445.

Navarro-Costa, P. (2012). Sex, rebellion and decadence: the scandalous evolutionary history of the human Y chromosome. Biochimica et Biophysica Acta, 1822(12), 1851–1863.

Navarro-Costa, P., Plancha, C. E., & Gonçalves, J. (2010). Genetic dissection of the AZF regions of the human Y chromosome: Thriller or filler for male (In)fertility? Journal of Biomedicine & Biotechnology, 2010. https://doi.org/10.1155/2010/936569

Oetjens, M. T., Shen, F., Emery, S. B., Zou, Z., & Kidd, J. M. (2016). Y-Chromosome Structural Diversity in the Bonobo and Chimpanzee Lineages. Genome Biology and Evolution, 8(7), 2231–2240.

Repping, S., Skaletsky, H., Lange, J., Silber, S., van der Veen, F., Oates, R. D., … Rozen, S. (2002). Recombination between Palindromes P5 and P1 on the Human Y Chromosome Causes Massive Deletions and Spermatogenic Failure. American Journal of Human Genetics, 71(4), 906–922.

Repping, S., van Daalen, S. K. M., Brown, L. G., Korver, C. M., Lange, J., Marszalek, J. D., … Rozen, S. (2006). High mutation rates have driven extensive structural polymorphism among human Y chromosomes. 
Nature Genetics, 38(4), 463–467. Ross, M. T., Grafham, D. V., Coffey, A. J., Scherer, S., McLay, K., Muzny, D., … Bentley, D. R. (2005). The DNA sequence of the human X chromosome. Nature, 434(7031), 325–337. Schaller, F., Fernandes, A. M., Hodler, C., Münch, C., Pasantes, J. J., Rietschel, W., & Schempp, W. (2010). Y Chromosomal Variation Tracks the Evolution of Mating

5

Systems in Chimpanzee and Bonobo. PloS One, 5(9), e12482. Skaletsky, H., Kuroda-Kawaguchi, T., Minx, P. J., Cordum, H. S., Hillier, L., Brown, L. G., … Page, D. C. (2003). The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature, 423(6942), 825–837. Skov, L., & Schierup, M. H. (2017). Analysis of 62 hybrid assembled human Y chromosomes exposes rapid structural changes and high rates of gene conversion. PLoS Genetics, 13(8), 1–20. Tomaszkiewicz, M., Rangavittal, S., Cechova, M., Campos Sanchez, R., Fescemyer, H. W., Harris, R., … Makova, K. D. (2016). A time- and cost-effective strategy to sequence mammalian Y Chromosomes: an application to the de novo assembly of gorilla Y. Genome Research, 26(4), 530–540. Veyrunes, F., Waters, P. D., Miethke, P., Rens, W., McMillan, D., Alsop, A. E., … Marshall Graves, J. A. (2008). Bird-like sex chromosomes of platypus imply recent origin of mammal sex chromosomes. Genome Research, 18(6), 965–973. Vogt, P. H. (1996). Human Y Chromosome Function in Male Germ Cell Development. In Advances in Developmental Biology (1992) (pp. 191–257). Vogt, P. H., Edelmann, A., Kirsch, S., Henegariu, O., Hirschmann, P., Kiesewetter, F., … Haidl, G. (1996). Human Y chromosome azoospermia factors (AZF) mapped to different subregions in Yq11. Human Molecular Genetics, 5(7), 933–943. Weber, B., Schempp, W., & Wiesner, H. (1986). An evolutionarily conserved early replicating segment on the sex chromosomes of man and the great apes. Cytogenetic and Genome Research, 43(1-2), 72–78. Ye, D., Zaidi, A., Tomaszkiewicz, M., Anthony, K., Liebowitz, C., DeGiorgio, M., … Makova, K. D. (2018). High levels of copy number variation of ampliconic genes across major human Y haplogroups. Genome Biology and Evolution, (May). https://doi.org/10.1093/gbe/evy086 Yunis, J., & Prakash, O. (1982). The origin of man: a chromosomal pictorial legacy. Science, 215(4539), 1525–1530.

Chapter 2 Dosage regulation, and variation in gene expression and copy number at human Y chromosome ampliconic genes

The material in this chapter was previously published in 2019 as a research article by R. Vegesna, M. Tomaszkiewicz, P. Medvedev, and K.D. Makova, appearing in PLoS Genetics 15(9): e1008369. Supporting information that accompanied this article is provided in Appendix A.

Abstract

The human Y chromosome harbors nine multi-copy ampliconic gene families expressed exclusively in testis. The gene copies within each family are >99% identical to each other, which poses a major challenge in evaluating their copy number. Recent studies demonstrated high variation in Y ampliconic gene copy number among humans. However, how this variation affects expression levels in human testis remains understudied. Here, we developed a novel computational tool, Ampliconic Copy Number Estimator (AmpliCoNE), that utilizes read sequencing depth information to estimate Y ampliconic gene copy number per family. We applied this tool to whole-genome sequencing data of 149 men with matched testis expression data whose samples are part of the Genotype-Tissue Expression (GTEx) project. We found that the Y ampliconic gene families with low copy number in humans were deleted or pseudogenized in non-human great apes, suggesting relaxation of functional constraints. Among the Y ampliconic gene families, higher copy number leads to higher expression. Within the Y ampliconic gene families, copy number does not influence gene expression; rather, a high tolerance for variation in gene expression was observed in testis of presumably healthy men. No differences in gene expression levels were found among major Y haplogroups. Age positively correlated with expression levels of the HSFY and PRY gene families in the African subhaplogroup E1b, but not in the European subhaplogroups R1b and I1. We also found that expression of five out of six Y ampliconic gene families is coordinated with that of their non-Y (i.e. X or autosomal) homologs. Indeed, five ampliconic gene families had consistently lower expression levels when compared to their non-Y homologs, suggesting dosage regulation, while the HSFY family had higher expression levels than its X homolog and thus lacked dosage regulation.

Introduction

The human Y chromosome harbors 10.2 million bases (Mb) of ampliconic regions containing nine protein-coding multi-copy gene families (Skaletsky et al., 2003). These genes are important not only because of their association with male infertility (Repping et al., 2002; Skaletsky et al., 2003) but also because they might hold the key to understanding the evolutionary forces that have shaped the Y chromosome. Ampliconic gene families show a high level of copy number variability (Lucotte et al., 2018; Skov, Danish Pan Genome Consortium, & Schierup, 2017; Ye et al., 2018) and, possibly, a similar variability in gene expression levels. Understanding the relationship between these two kinds of variability is an important step in the study of these genes. Yet, to date there has been no comprehensive investigation of the expression of these gene families and its connection to copy number at a large, population-level scale.

Studying ampliconic gene families has been a considerable challenge because they exhibit a much higher intra-familial sequence similarity than other gene families. The majority (eight out of nine) of Y ampliconic gene families are located in palindromes, structures composed of highly similar inverted repeats (arms) around a relatively short unique sequence (spacer). The arms within a palindrome are 99.9% identical to each other, which results in a high sequence identity among paralogous genes located on the arms (Skaletsky et al., 2003). The ninth family, TSPY, is present as an array of tandem repeats outside of palindromes (Skaletsky et al., 2003); however, its genes still share sequence identity of >99%. It has been hypothesized that the Y chromosome has acquired its ampliconic structure as a way of facilitating gene conversion (Rozen et al., 2003), which can overcome the decay due to a lack of interchromosomal recombination (Betrán, Demuth, & Williford, 2012; B. Charlesworth & Charlesworth, 2000).

Why these ampliconic gene families are preserved on the Y chromosome remains an open question. It has been suggested that this is due to sexual antagonism eventually leading to increased male reproductive fitness (Bellott et al., 2014; Betrán et al., 2012; Rozen et al., 2003). Sexual antagonism is expected to lead to the accumulation of genes and mutations benefiting males on the Y chromosome (D. Charlesworth & Charlesworth, 1980). Consistent with the sexual antagonism hypothesis, all ampliconic genes on the Y are expressed exclusively or predominantly in testis. However, it is also possible that these genes have recently evolved under relaxed functional constraints. The ability to analyze the expression levels of Y ampliconic genes at a large scale can help in exploring their potential functional constraints by comparing their testis expression levels to those of their non-Y homologs (when available). For instance, if a Y ampliconic gene family undergoes neofunctionalization, then its resulting expression level is expected to be independent of, and potentially higher than, that of its non-Y homologs (which we assume retained the ancestral function).

In support of some functional constraints is the observation that the loss or partial deletion of Y ampliconic gene copies is linked to infertility in humans. For example, TSPY copy number was linked to both infertility (Giachini et al., 2009) and sperm count (Giachini et al., 2009; C. Krausz et al., 2011; Csilla Krausz, Giachini, & Forti, 2010). The long arm of the human Y chromosome includes three azoospermia factor regions (AZFa, AZFb, and AZFc), which cover most of the ampliconic gene families and are active during different phases of spermatogenesis (Vogt et al., 1996). Complete or partial deletion of these regions is linked to azoospermia and arrest of spermatogenesis (Carvalho, Zhang, & Lupski, 2011; C. Krausz et al., 2011; Navarro-Costa, Plancha, & Gonçalves, 2010; Repping et al., 2002; Vogt et al., 1996). Presumably, a copy number decrease linked with infertility is accompanied by a reduction in gene expression of the affected Y ampliconic gene families; however, this has yet to be demonstrated.

Recent studies indicated high variation in Y ampliconic gene copy number in healthy men (Lucotte et al., 2018; Skov et al., 2017; Ye et al., 2018). Skov and colleagues (Skov et al., 2017) studied Y ampliconic gene copy number variation in 62 men of Danish descent and identified multiple copy number changes across all nine gene families among unrelated individuals, as well as de novo copy number differences for the TSPY and VCY gene families between a father and a son. Ye and colleagues (Ye et al., 2018) assessed Y ampliconic gene copy number variation in 100 individuals from around the world. They observed that the size of a gene family is correlated with its variation in copy number: larger families, such as TSPY and RBMY, have higher levels of variation; however, the variation appears to be independent of the Y haplogroup. Two men rarely had the same Y ampliconic gene copy number profile and, when they did, this was likely a result of homoplasy. Lucotte and colleagues (Lucotte et al., 2018) used the data from the Simons Genome Diversity Project (Mallick et al., 2016) and observed substantial variation in copy number in six out of nine human Y ampliconic gene families. Teitz and colleagues (Teitz, Pyntikova, Skaletsky, & Page, 2018) assessed copy number of full-length Y chromosome amplicons located in the AZFc region in men sequenced by the 1000 Genomes Project. Their results suggest that selection has preserved the ancestral ampliconic copy number on the Y chromosome in diverse human lineages (Teitz et al., 2018).

These multiple studies of copy number notwithstanding, there has been little investigation of gene expression in Y ampliconic genes. A recent study investigating the expression of Y ampliconic genes during male meiosis found that gene families with high variation in copy number also have high expression levels at different stages of sperm development (Lucotte et al., 2018). Other than the results of this single study, there is a large gap in our understanding of variation in expression of Y ampliconic genes among humans, even though gene expression could be a better predictor of genes’ functions than copy number. Additionally, previous studies have reported that aging affects gene expression (Vinuela et al., 2016; Yang et al., 2015).

Even less is known about how variation in copy number of Y ampliconic genes affects their gene expression. Most parsimoniously, a gain of a complete gene copy should lead to an increase in gene expression levels, unless the extra copy obtains a new function through neofunctionalization, has decreased functional demands due to subfunctionalization, or is lost due to pseudogenization. Indeed, this parsimonious hypothesis was supported by the data from the 1000 Genomes Project, where most genes overlapping multiallelic copy number variations (CNVs) display a positive correlation between copy number and gene expression (Handsaker et al., 2015). However, studies across different model organisms have reported that differences in copy number result in either increased, decreased, or unchanged expression levels among individuals in a population (Henrichsen, Chaignat, & Reymond, 2009). This more complex relationship can be caused by several scenarios during duplication. For instance, a tandem duplication event may not include regulatory elements, may physically disrupt topologically associated domains (TADs), which prevents the interaction of the gene with its enhancer in 3D space (Lupiáñez et al., 2015; Spielmann, Lupiáñez, & Mundlos, 2018), or may result in a new copy acting as a negative feedback loop to reduce transcription (Henrichsen et al., 2009). Moreover, a non-tandem duplication may occur to a site that is not transcriptionally active (Henrichsen et al., 2009). Which of these parsimonious or more complex scenarios applies to human Y chromosome ampliconic genes has not been explored.

In this study, we explored the above questions by analyzing the largest data set available to date consisting of expression data from testis, along with matched whole-genome sequencing data, from 170 men, as generated by the Genotype-Tissue Expression (GTEx) consortium (Carithers et al., 2015). Simultaneously, we developed a novel computational tool, AmpliCoNE, to estimate the copy number of an ampliconic gene family from sequencing data. Such estimation is complicated by the presence of multiple highly similar gene copies in the reference, which makes conventional tools inapplicable (Medvedev, Stanciu, & Brudno, 2009). Custom strategies have been developed and shown to be effective (Cortez et al., 2014; Handsaker et al., 2015; Lucotte et al., 2018; Oetjens, Shen, Emery, Zou, & Kidd, 2016; Skov et al., 2017; Sudmant et al., 2010), but we did not identify any existing software that could be run directly on ampliconic Y chromosome gene families.

Using AmpliCoNE, we explored whether variation in Y ampliconic gene expression levels could be explained by variation in gene copy number, Y haplogroup, and an individual’s age. We correlated the copy numbers of Y ampliconic gene families, as estimated with AmpliCoNE, with their expression levels in testis and studied how this correlation is affected by Y haplogroups. Additionally, we investigated how testis-specific expression of Y ampliconic genes diverged from that of their non-Y homologs during evolution.

Results

AmpliCoNE: Ampliconic Copy Number Estimator

AmpliCoNE is composed of two programs. The first (AmpliCoNE-build) is executed only once to process the reference genome. It takes the location of all the gene copies in the reference genome, grouped by family, and determines which positions in the genes are informative (i.e. where read depth is an effective predictor of copy number) and which positions in the reference can be used as a control (where copy number variation is infrequent and the read depth has limited noise). The second (AmpliCoNE-count) is then executed separately for every sample. It parses read alignments and measures the GC-corrected read depth at the informative positions. It then accumulates this information at the family level and reports the copy number for each gene family, using the read depth at control positions as a baseline. We provide further details in the Methods.
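
The read-depth logic described above can be illustrated with a minimal Python sketch. This is not AmpliCoNE's actual code: the function name, the use of a plain mean, and the toy depth values are assumptions made for illustration. The depths are taken to be GC-corrected already, and the control positions are assumed to be single-copy, so that their mean depth serves as the per-copy baseline.

```python
from statistics import mean

def estimate_family_copy_number(informative_depths, control_depths):
    """Ratio of mean depth at informative positions to the single-copy baseline.

    Hypothetical helper: both inputs are lists of GC-corrected read depths.
    """
    baseline = mean(control_depths)  # depth expected for one copy
    if baseline <= 0:
        raise ValueError("control depth must be positive")
    return mean(informative_depths) / baseline

# Toy, made-up depths: the family positions are covered about twice as deeply
# as the control positions, so the estimate is close to two copies.
control = [30, 31, 29, 30]
family = [61, 59, 60, 60]
print(round(estimate_family_copy_number(family, control), 2))
```

Dividing by a within-sample control depth makes the estimate independent of the overall sequencing coverage of the sample, which is why a baseline is needed at all.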

To evaluate AmpliCoNE’s accuracy, we ran it on simulated data and whole-genome short-read data from the Genome in a Bottle (GIAB) consortium (Zook et al., 2014). Using the hg38 reference, we simulated three datasets with varying copy numbers of RBMY, TSPY, and VCY gene families and kept the copy numbers for the remaining six gene families constant (i.e. with the copy number found in the reference). AmpliCoNE estimated ampliconic copy numbers correctly 100% of the time in the simulated datasets (Table S1). We then compared gene family copy numbers between different GIAB experimental runs (technical replicates) for the same human sample (Table S2), as well as between a father and a son (which can be treated as biological replicates because copy number differences between generations are expected to be rare (Skov et al., 2017)). AmpliCoNE consistently predicted copy numbers with a difference of less than 0.5 copies per family. We tested AmpliCoNE at different depths of coverage and showed that it can predict similar copy numbers (estimates with difference of less than 0.5) even for datasets with the Y chromosome sequencing depth as low as 6x (Table S3). AmpliCoNE’s runtime is dependent on the number of reads it needs to process. For instance, it took AmpliCoNE 11 minutes to process the GTEx Y-chromosome-specific BAM file (~500 MB in size).

To measure the concordance between AmpliCoNE’s copy number estimates and complementary non-sequencing assays, we used droplet digital PCR (ddPCR). Both AmpliCoNE and ddPCR were applied to estimate Y ampliconic gene copy numbers for four males sequenced by the GIAB consortium (Tables 1 and S4) (Zook et al., 2014). The ddPCR estimates were identical to AmpliCoNE estimates for five out of nine gene families (BPY2, DAZ, HSFY, PRY, and XKRY) in all four samples. The CDY and RBMY family copy numbers differed between the two methods in only one and two individuals, respectively. The VCY and TSPY family copy number estimates differed in three and four individuals, respectively. Compared with ddPCR, AmpliCoNE consistently underestimated the copy number for the VCY gene family. Previous studies have indicated the presence of X-to-Y gene conversion between VCX and VCY (Iwase, Satta, Hirai, Hirai, & Takahata, 2010; Trombetta, Cruciani, Underhill, Sellitto, & Scozzari, 2010). We investigated this case in more detail and discovered that genes from the VCY family harbor only a very short (220-bp) sequence distinguishing them from their VCX paralogs. This sequence has a low sequencing depth even after GC correction, which results in the underestimation of the VCY copy number by AmpliCoNE. TSPY is known to have many highly similar copies, which may themselves vary in copy number and can thus confound both AmpliCoNE and ddPCR estimates. These caveats notwithstanding, AmpliCoNE’s biases in estimating copy numbers for TSPY and VCY are consistent across samples and thus should not affect our results in a systematic way.

Y ampliconic gene copy number estimates

Using AmpliCoNE, we estimated copy numbers of Y chromosome ampliconic genes in 170 presumably healthy men whose genomes were sequenced in their entirety as part of the GTEx project (Carithers et al., 2015). These individuals (Table S5) were selected because they had matched testis expression data. The individuals belonged to nine major haplogroups: B, E, G, I, J, L, O, Q, R, and T (Table 2). The majority of the samples in the dataset had European or African Y haplogroups, with a few Asian haplogroups present. We also used AmpliCoNE to estimate the copy numbers of X-degenerate genes, which are expected to be 1 in healthy samples. Three samples had copy number estimates close to zero for two or more ampliconic gene families, or had less than one copy for several X-degenerate genes. This could indicate an individual with a disease or could result from a technical artifact, and these samples were therefore removed from the downstream analysis. As a result, we retained 167 samples.

Gene families with higher median copy number had higher variation when compared to gene families with lower median copy number (R² = 0.91; Fig. S1). RBMY and TSPY were the largest gene families and displayed the highest variation in copy number (5-14 and 20-64 copies for RBMY and TSPY, respectively). HSFY, PRY, VCY, and XKRY were the smallest gene families, which on average had two copies per individual, and displayed low variation in copy number. We observed a positive correlation in copy number among the BPY2, CDY, and DAZ gene families, which could be explained by their co-localization on palindrome P1; a duplication or deletion involving P1 can affect the copy numbers of all three gene families (Fig. 1A).

Figure 2-1. Correlation in copy number and expression levels across Y ampliconic gene families. The gene families are clustered based on correlation coefficients. (A) Correlation in copy numbers among 167 individuals. (B) Correlation in gene expression levels among 149 individuals.

Y ampliconic gene families with low copy number in humans are frequently deleted in non-human great apes

We expected gene families with lower median copy number to have a higher probability of being completely deleted by random rearrangements. Therefore, we aimed to test whether the gene families with lower copy number in humans had a higher chance of being deleted in non-human great ape species. It is known from previous studies that the VCY gene family is missing in bonobo, gorilla, and orangutan, whereas the HSFY, PRY, and XKRY families are missing in bonobo and chimpanzee (Hallast & Jobling, 2017). Consistent with our hypothesis, the HSFY, PRY, VCY, and XKRY gene families had low copy numbers in humans (Fig. S1; Table S6).

Y ampliconic gene expression

To explore the relationship between ampliconic gene copy number and expression levels, we analyzed testis expression data from the same 167 humans whose Y ampliconic gene copy number was estimated with AmpliCoNE. After removing outliers (see Materials and Methods), we retained 149 samples and, in each of them, obtained the expression level for each gene family (the sum of expression of all the gene copies within that family). We found that, similar to our observation for copy numbers (Fig. S1), families with higher gene expression levels had higher variation in gene expression (R² = 0.99; Fig. S2). The TSPY family had the highest gene expression level and the highest variation in expression across individuals, and XKRY had the lowest (Table S6; Fig. S2). The XKRY gene family could be considered to be not expressed (as its expression levels are zero) in 58 individuals or expressed at very low levels (with DESeq2 normalized read count < 10) in the remaining 91 individuals. The DAZ, HSFY, and RBMY gene families had similar median expression levels and variance among themselves (Table S6; Fig. S2). Within our dataset, we found two sets of ampliconic gene families whose expression levels were positively correlated with each other (Fig. 1B). The first set included BPY2, CDY, HSFY, and PRY, and the second set included DAZ, TSPY, RBMY, and VCY (Fig. 1B). The expression of these sets of gene families could be co-regulated or might have cell-type specificity.
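
The family-level expression used throughout this section is a sum over gene copies, which can be sketched in a few lines of Python. The function name, the input layout, and the counts below are assumptions for illustration only; they are not the pipeline or the values used in this study.

```python
from collections import defaultdict

def family_expression(copy_counts, copy_to_family):
    """Sum normalized read counts of all gene copies belonging to each family."""
    totals = defaultdict(float)
    for gene_copy, count in copy_counts.items():
        totals[copy_to_family[gene_copy]] += count
    return dict(totals)

# Made-up normalized counts for a single sample.
counts = {"DAZ1": 120.0, "DAZ2": 95.5, "CDY1": 40.0, "CDY2": 38.5}
families = {"DAZ1": "DAZ", "DAZ2": "DAZ", "CDY1": "CDY", "CDY2": "CDY"}
print(family_expression(counts, families))
```

Summing per family sidesteps the problem that reads cannot be assigned reliably to individual copies that are >99% identical.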

More copious gene families have higher gene expression levels

When we investigated the relationship between expression levels and copy number among all 149 individuals across nine ampliconic gene families, we found that more copious gene families tended to have higher expression levels in comparison to the less copious gene families (Fig. 2). Indeed, the expression levels were positively correlated with estimates of copy numbers (Spearman's rank correlation rho = 0.43; P-value < 2.2 × 10⁻¹⁶). The DAZ, HSFY, and VCY gene families appeared to be outliers in this analysis, as they had gene expression levels similar to the RBMY gene family even though their median copy number estimates were approximately half or less than half of that of the RBMY gene family. The DAZ gene family had a similar gene copy number yet higher expression levels when compared to the CDY gene family. The XKRY family consistently had very low expression levels, even though its median copy number per individual was two.
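
For readers unfamiliar with the statistic, Spearman's rho is the Pearson correlation computed on ranks. The following standard-library Python sketch shows that calculation; the function names are hypothetical, it does not compute a P-value, and it is not the code used to obtain the value reported above.

```python
def average_ranks(values):
    """1-based ranks, with tied values receiving the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up copy numbers and expression values for five hypothetical samples.
print(round(spearman_rho([2, 2, 4, 6, 30], [5, 9, 40, 35, 900]), 2))
```

Because only ranks enter the calculation, the statistic is insensitive to the large differences in scale between small families and large ones such as TSPY.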

Figure 2-2. Relationship between copy number and expression levels for nine Y ampliconic gene families. The copy number (X-axis) and gene expression (Y-axis) values for 149 individuals are presented on a natural log scale. The dots are values for different men, and boxplots are the distribution of values for individual gene families. Both the dots and boxplots are color-coded by their respective gene families. The black line represents the linear function (copy number ~ expression) fitted to the points on the plot. The coefficient of determination (R²) for the linear model is 0.25.

Within a family, copy number and gene expression are not correlated

Next, we tested whether copy number, as measured for each individual, is positively correlated with gene expression levels, again measured for each individual, within the same gene family. There was no significant correlation in any of the nine families studied (all P-values were above the Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006; Fig. S3; Table S7). To control for genetic variation on the Y, we next compared copy number estimates to gene expression levels for individuals with the same Y subhaplogroup. We focused on the European R1b and I1a, and the African E1b subhaplogroups because they had more than 10 individuals in our dataset (77, 15 and 22, respectively; Table 2). We still found no significant correlations between copy number and expression levels in any of the nine gene families for individuals from any of these three subhaplogroups (all P-values were above the Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006; Figs. S4-S6; Table S7).
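
The Bonferroni cutoff quoted above is simply the significance level divided by the number of tests. A small Python sketch of that arithmetic follows; the helper names and the P-values are hypothetical and are not results from this study.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests

def significant_tests(p_values, alpha=0.05):
    """Names of the tests whose P-value falls below the corrected cutoff."""
    cutoff = bonferroni_cutoff(alpha, len(p_values))
    return sorted(name for name, p in p_values.items() if p < cutoff)

# Nine made-up P-values, one per hypothetical gene family.
example_p = {"fam1": 0.40, "fam2": 0.12, "fam3": 0.03, "fam4": 0.75, "fam5": 0.51,
             "fam6": 0.002, "fam7": 0.09, "fam8": 0.66, "fam9": 0.88}
print(round(bonferroni_cutoff(0.05, 9), 4))
print(significant_tests(example_p))
```

With nine families the exact cutoff is 0.05/9, which is about 0.0056 and is rounded to 0.006 in the text.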

Y haplogroups and ampliconic gene families

We further asked whether the major Y haplogroup can at least in part explain the variation we observed in copy number and in gene expression levels of Y chromosome ampliconic genes. We focused our analysis on major haplogroups R (European), I (European), E (African), and J (Western Asian) because they were represented by at least 10 samples in our dataset (Table 2). Using one-way ANOVA, we found that the copy numbers of the BPY2 (P = 2.34 × 10⁻³), RBMY (P = 2.97 × 10⁻⁸), and TSPY (P = 1.07 × 10⁻²²) gene families had significant differences among the four major Y haplogroups analyzed (Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006; Table 3). The remaining six gene families did not display significant differences among Y haplogroups (Table 3). We then compared the mean copy number differences between haplogroups in a pairwise fashion using a permutation test (1 million permutations; nine gene families were tested for six haplogroup pairs: R vs. E, R vs. I, R vs. J, I vs. E, I vs. J, and E vs. J; thus we performed 9 × 6 = 54 tests, with a Bonferroni-corrected P-value cutoff of 0.05/54 ≈ 0.00093). TSPY differed significantly in copy numbers (Fig. 3) between the major European (R and I) vs. African (E) or vs. Western Asian (J) haplogroups (P = 0 for R vs. E; P = 0 for I vs. E; P = 0 for R vs. J; P = 0.3 × 10⁻⁵ for I vs. J; Table S8). RBMY copy numbers differed significantly between a European (R) vs. African (E) or Western Asian (J) haplogroup (P = 6.94 × 10⁻⁴ for R vs. E; P = 0 for R vs. J; Table S8). No significant differences between the two major European haplogroups (R and I) were observed (Table S8).

In contrast, we found that gene expression levels of all nine Y ampliconic gene families were not significantly different among major Y haplogroups (all P-values were above the P-value cutoff of 0.05/9 ≈ 0.006; one-way ANOVA; Table 3). We observed a trend suggesting differences in expression values among haplogroups for the BPY2 and DAZ gene families, but these differences were small in scale. Nevertheless, out of the nine gene families, BPY2 (P = 0.056) and DAZ (P = 0.01) had low P-values for the ANOVA analysis (Table 3, Fig. 3) and for the permutation test comparing mean expression levels between haplogroups (P = 7.09 × 10⁻³ for E vs. R for BPY2; P = 1.36 × 10⁻² for E vs. R for DAZ; P cutoff of 0.05/54 ≈ 0.00093; Table S9). When we compared the trend in copy number and gene expression differentiation among haplogroups, we observed that in the TSPY gene family both copy number and gene expression levels were lower for the European haplogroups (I, R) than for the African (E) or Western Asian (J) haplogroups (Fig. 3). This trend was statistically significant for copy number, but not significant for gene expression. Analyzing a larger sample size might reveal this trend to be significant for gene expression as well.
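
A pairwise permutation test of the kind described in this section can be sketched with the Python standard library. The version below is an illustration under stated assumptions rather than the analysis code used here: it assumes a two-sided test on the absolute difference in means, defines the P-value as the plain fraction of permutations at least as extreme as the observed difference (so it can equal 0), and uses far fewer permutations than the 1 million reported above. All names are hypothetical.

```python
import random
from itertools import combinations

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation P-value for a difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def abs_mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = abs_mean_diff(pooled)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        if abs_mean_diff(pooled) >= observed:
            extreme += 1
    return extreme / n_permutations

# Number of pairwise comparisons and the corresponding Bonferroni cutoff.
haplogroups = ["R", "I", "E", "J"]
pairs = list(combinations(haplogroups, 2))  # six haplogroup pairs
n_tests = 9 * len(pairs)                    # nine gene families per pair
cutoff = 0.05 / n_tests
print(len(pairs), n_tests, round(cutoff, 5))
```

Under this definition, the smallest non-zero P-value that 1 million permutations can produce is 10⁻⁶, so a reported P = 0 would mean that no permuted difference reached the observed one.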

Figure 2-3. The distribution of ampliconic gene copy numbers and expression levels across Y haplogroups. For each plot, the x-axis shows Y haplogroups: E - African (N=22, yellow), I - European (N=24, green), J - Western Asian (N=11, blue), and R - European (N=85, red), and the y-axis shows copy number estimates or gene expression levels. The black dashed line represents the overall mean copy number or expression values for all the samples on each plot. The permutation-based significance of pairwise haplogroup comparisons is shown with stars (*** < 0.001, ** < 0.01, * < 0.05 P-value). The one-way ANOVA test P-values are printed at the top of each plot. The Bonferroni-corrected cutoff for nine tests (0.05/9 ≈ 0.006) is used to identify significance of ANOVA.

The role of age in ampliconic gene expression

To examine the potential role of aging in determining Y ampliconic gene expression, we compared the ages of individuals at the time of sample collection to the ampliconic gene expression levels and found no statistically significant relationship (nine gene families were tested for correlation, which results in a Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006; Fig. S7; Table S10). Next, to perform a similar analysis for individuals with the same subhaplogroup, we limited our analysis to individuals with the European R1b and I1a, and African E1b subhaplogroups (77, 15, and 22 individuals, respectively). For the R1b and I1a subhaplogroups, we found no significant relationship between age and expression levels for any of the nine Y ampliconic gene families studied (Figs. S8-S9; Table S10). However, for the African E1b subhaplogroup, the gene families HSFY (Spearman correlation = 0.57; P = 0.0061) and PRY (Spearman correlation = 0.61; P = 0.0028) had a positive correlation between expression levels and age, which was significant after Bonferroni correction (Fig. S10; Table S10). A larger dataset of African haplogroups should be studied to validate this relationship.

Ampliconic gene dosage regulation

The presence of homologs outside of the Y for two groups of Y ampliconic gene families allows us to study evolution of their gene expression levels (Bhowmick, Satta, & Takahata, 2007). In particular, the CDY and DAZ genes were copied to the Y chromosome from autosomes (Bhowmick et al., 2007); the HSFY, RBMY, TSPY, VCY, and XKRY gene families have homologs on the X and were likely present on the ancestral autosomes giving rise to the two sex chromosomes (Bhowmick et al., 2007). In the analyses below, we assume that the testis-specific expression of Y ampliconic genes was acquired prior to their amplification on the Y (Bellott et al., 2014) and that their autosomal or X-chromosomal homologs have maintained ancestral expression levels, i.e. they possess expression levels of Y ampliconic genes prior to their Y linkage (L. Gu & Walters, 2017). The latter assumption is based on the overall slower rates of evolution of X-chromosomal and autosomal genes as compared to their Y-chromosomal homologs.

We envision three possible scenarios for gene expression evolution of Y ampliconic gene families that have non-Y homologs (Fig. 4). First, because of sexual antagonism, a gene on the Y could obtain beneficial mutations and diverge in function from its non-Y homolog to acquire new functions in testis (i.e. neo-functionalization). The expression of such a gene family would be independent from, and potentially higher than that for, its non-Y homologs (scenario A). Second, a gene family on the Y could retain function of the non-Y homolog but acquire testis-specific expression (i.e. sub-functionalization). In this case, either the non-Y copy represents the ancestral expression levels and the Y copies are expected to maintain low expression levels, or the sum of expression from the Y and non-Y copies is regulated to be at levels similar to those of the non-Y copy in the ancestor (scenario B). In this case, the expression of both Y and non-Y homologs might be downregulated. Third, genes on the Y might be under relaxed selective constraints and thus have low expression levels (scenario C) (Lynch & Conery, 2000). Below we test these three scenarios by comparing expression levels of both Y and non-Y ampliconic gene homologs in testis tissue.

In addition to the analysis of such overall differences in the expression level (Fig. 4), we can also examine the relationship between the Y ampliconic genes’ and their non-Y homologs’ gene expression across individuals, which should further assist in determining a particular evolutionary scenario (Fig. S11). If the expression levels of Y ampliconic genes are higher than those of their non-Y homologs, and across individuals the expression levels of these two groups of genes are positively correlated, then this pattern is consistent with neo-functionalization of the Y ampliconic genes.

Figure 2-4. Possible differences in expression between Y ampliconic genes and their non-Y homologs. (A) Neofunctionalization: Ampliconic gene family, after moving to Y or divergence from the X, obtained a new beneficial function. The gene family might be under positive selection and its expression might be independent from its non-Y homolog. (B) Subfunctionalization: Ampliconic gene family, after moving to the Y or divergence from the X, acquired a (testis-specific) sub-function. Because subfunctionalization is division of labour, the sum of ampliconic gene expression and non-Y homolog expression should be equal to the ancestral expression. The average expression of ampliconic gene family could be lower or higher than that of their homologs and depends on the subfunction. (C) Relaxed selection: Ampliconic gene family, due to its multi-copy nature and presence of gene conversion, evolves at a faster rate than its non-Y homolog and is not under selection.

This is because higher expression levels of ampliconic genes than those at the ancestral state suggest independent expression of Y ampliconic genes from their non-Y homologs, and a positive correlation between Y ampliconic genes and their non-Y homologs suggests coregulation, e.g. they might share similar transcription factors (Yu, Luscombe, Qian, & Gerstein, 2003). A combination of these two patterns suggests an acquisition of a new function (scenario A) (Fig. S11A). If the expression levels of Y ampliconic genes are higher than those of their non-Y homologs, and across individuals the expression levels of these two groups of genes are negatively correlated, then the data are compatible with neo- or sub-functionalization (scenario A or B). Indeed, the observed negative correlation could be explained by neo-functionalization, where ampliconic genes acquired a new function and inhibit the expression of the non-Y homologs. Alternatively, the negative correlation could be explained by sub-functionalization, where ampliconic genes acquired new transcription factors which limit their expression to a few cell types, and the negative correlation is due to the differences in the abundance of cell types in which ampliconic genes are expressed (Fig. S11B). If the expression levels of Y ampliconic genes are lower than those of their non-Y homologs, and across individuals the expression levels of these two groups of genes are positively correlated, then this pattern is consistent with any of the three scenarios A-C. This is because the lower expression levels of Y ampliconic genes could be due to downregulation of gene expression by the Y chromosome to accommodate the multi-copy state of ampliconic genes (Lan & Pritchard, 2016), evolution of which could still be compatible with any of the three scenarios A-C (Fig. S11C). If the expression levels of the Y ampliconic genes are lower than those of their non-Y homologs, and across individuals the expression levels of these two groups of genes are negatively correlated, then the data are compatible with scenario A or B. This is because negative correlation eliminates the scenario of relaxed selection, i.e. scenario C (Fig. S11D). Finally, if we observe no correlation in expression levels between Y ampliconic genes and their non-Y homologs, then we can conclude that their expression is independent from each other, which could be a result of neo-functionalization, sub-functionalization or random drift in expression levels under relaxed selection.
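The interpretation rules laid out above can be summarized as a small lookup. This is only a restatement of the reasoning in the text as a sketch; the function name and the string labels are hypothetical.

```python
def compatible_scenarios(y_vs_homolog, correlation):
    """Return the scenarios (A, B, C) compatible with an observed pattern.

    y_vs_homolog: "higher" or "lower" (Y family expression relative to
                  its non-Y homologs).
    correlation:  "positive", "negative", or "none" (across individuals).
    A = neo-functionalization, B = sub-functionalization,
    C = relaxed selection.
    """
    if correlation == "none":
        # Independent expression: all three scenarios remain possible.
        return {"A", "B", "C"}
    table = {
        ("higher", "positive"): {"A"},
        ("higher", "negative"): {"A", "B"},
        ("lower", "positive"): {"A", "B", "C"},
        ("lower", "negative"): {"A", "B"},
    }
    return table[(y_vs_homolog, correlation)]


print(sorted(compatible_scenarios("higher", "positive")))
print(sorted(compatible_scenarios("lower", "negative")))
```

Only the combination of higher Y expression with a positive correlation narrows the outcome to a single scenario; every other pattern leaves at least two scenarios open.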

To test these scenarios, we first compared testis expression levels between Y ampliconic gene families CDY and DAZ, which were copied to the Y from autosomes, and their autosomal homologs (Fig. 5). The CDY autosomal homologs CDYL and CDYL2 are ubiquitously expressed, and the DAZ autosomal homolog DAZL has testis-specific expression (Ardlie et al., 2015; Bhowmick et al., 2007; Dorus, Gilbert, Forster, Barndt, & Lahn, 2003; Vangompel & Xu, 2011). The expression levels of CDY (the sum of expression levels for the whole gene family) were 89% lower than those for their autosomal homologs (the sum of expression of CDYL and CDYL2), and for DAZ they were 63% lower than those for their autosomal homolog DAZL (Fig. 5). Next, we tested whether the expression levels for Y ampliconic genes and their autosomal homologs are regulated at the level of each individual. For each gene family, we examined a potential correlation in gene expression levels between the Y ampliconic genes and their non-Y homologs. We observed a significant negative correlation between CDY and CDYL+CDYL2 expression levels (Spearman correlation=-0.31; P=2×10⁻⁴), which indicates that, across individuals, whenever the CDY expression level increases, the CDYL+CDYL2 expression levels decrease (Fig. 6). In the case of DAZ, a positive correlation in expression levels (Spearman correlation=0.57; P=0) was observed between DAZ and its autosomal homolog DAZL (Fig. 6). Lower expression of CDY and DAZ than of their non-Y homologs could be a result of downregulation of gene expression by the Y chromosome to maintain the multi-copy state; however, the negative correlation in CDY vs. CDYL+CDYL2 expression levels indicates the presence of either neo- or subfunctionalization. DAZ could have undergone any of the three scenarios, which are difficult to differentiate based on the available data.
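The percentage differences quoted above compare the summed expression of a Y gene family with the summed expression of its homologs. A minimal sketch of that arithmetic follows; the function name is hypothetical, the input values are invented so that the result matches the 89% figure, and whether the original comparison used means, medians, or another summary across individuals is not stated here and is treated as an assumption.

```python
def percent_difference(y_family_expression, homolog_expression):
    """Percent difference of summed Y family expression relative to the
    summed expression of its non-Y homologs (negative means lower)."""
    y_total = sum(y_family_expression)
    homolog_total = sum(homolog_expression)
    return 100.0 * (y_total - homolog_total) / homolog_total


# Invented normalized expression values for illustration only.
cdy_copies = [2.0, 3.0, 6.0]      # hypothetical per-copy CDY expression
cdyl_and_cdyl2 = [60.0, 40.0]     # hypothetical CDYL and CDYL2 expression
print(percent_difference(cdy_copies, cdyl_and_cdyl2))  # -89.0
```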

We next examined how testis-specific gene expression of the HSFY, RBMY, TSPY, VCY, and XKRY gene families diverged from that of their X homologs. Most of the X homologs of ampliconic genes (except for VCY and XKRY) are expressed in multiple tissues in addition to testis. The XKRX gene, an X homolog of the XKRY gene family, is not expressed in testis and we omitted this gene family from our analysis (Table S11). Three Y gene families studied (RBMY, TSPY, and VCY) on average had lower expression levels in comparison to their X homologs (66%, 75%, and 71% lower, respectively; Fig. 5). HSFY was the only gene family that on average had higher expression in comparison to its homologs on the X (35% higher than X homologs). This could imply that HSFY might have acquired a new function which is selected for in testis (scenario A). At the level of the studied individuals, all four studied gene families exhibited positive correlation in gene expression levels between their Y ampliconic and X homolog genes, suggesting a potential co-regulation (Fig. 6). This correlation was particularly strong for the HSFY and VCY gene families (Spearman correlation of 0.69 and 0.84, respectively). The observed higher expression of HSFY than of its X homologs, as well as positive correlation in gene expression levels between these two groups of genes, is a strong indicator of neofunctionalization. In the case of RBMY, TSPY, and VCY, it is challenging to differentiate among the three scenarios we propose based on the available data.


Figure 2-5. Expression differences between Y ampliconic genes and their non-Y homologs. Each plot compares expression levels of Y ampliconic gene family (the sum of expression of all copies of a gene family, blue) to their homologs on the X chromosome (green) or autosomes (yellow). The gene names are shown on the x-axis and normalized expression levels on the y-axis.


Figure 2-6. Individual-level relationship between Y ampliconic genes and their non-Y homologs. Each plot compares expression levels of Y ampliconic gene family (the sum of expression of all copies of a gene family) to their non-Y homologs in each individual (N=149). Each dot represents an individual. The black line represents the linear regression fit to the data and the respective equation is at the top of each plot. The R-squared value is shown at the bottom right of each plot.

Discussion

Ampliconic genes constitute the majority (80%) of protein-coding genes present on the human Y chromosome and play an important role in spermatogenesis (Skaletsky et al., 2003). Yet, very little is known about the significance of Y ampliconic gene copy number variation in determining their expression levels in humans. Here we analyzed both copy number and testis-specific expression of ampliconic gene families in 149 presumably healthy men. Our goal was to understand the relationship between copy number variation and expression levels while accounting for Y chromosome haplogroups.

Variability in Y ampliconic gene copy number

Our results indicate that smaller Y ampliconic gene families maintain lower variation in copy number and, as the size of gene families increases, variation in copy number also increases, in agreement with previous studies (Lucotte et al., 2018; Skov et al., 2017; Ye et al., 2018). The parsimonious explanation for this observation is that a greater number of gene copies leads to loss or gain of gene copies because of a higher probability of rearrangements via replication slippage and/or non-allelic homologous recombination (NAHR) (Jobling, 2008; Lambert et al., 1999; Hastings & Lupski, 2010). On the human Y, the larger gene families are either spread across multiple palindromes (e.g., RBMY) or are arranged as a tandem array (TSPY), and such arrangements can result in multiple scenarios of NAHR which will lead to gain or loss of gene copies. BPY2 has two functional copies on palindrome P1 and one copy outside of palindromes, and such an arrangement can also result in NAHR.

We found that the large TSPY and RBMY gene families have not only a high level of variation in copy number, but also a significantly different number of gene copies among the major Y haplogroups analyzed. An earlier study also found significant differences in copy number for these two gene families among human Y haplogroups across the world and suggested that this observation cannot be explained by selection (Ye et al., 2018). However, an explanation based on selection might warrant further investigation. Indeed, a recent molecular analysis of infertile men indicated a positive correlation between the number of RBMY copies and sperm count and motility (Yan et al., 2017). Moreover, RBMY is a male-specific oncogene (Tsuei et al., 2011). Therefore it will be of interest to investigate whether variation in RBMY copy number across Y haplogroups influences these two disease-related phenotypes and might be subject to natural selection. Similarly, TSPY is a candidate proto-oncogene which can regulate its own expression via a positive feedback loop in gonadoblastoma and a variety of somatic cancers (Kido & Lau, 2014). Thus, additional studies should be performed to test whether variation in TSPY copy number across haplogroups is associated with differential predisposition to gonadoblastoma.


The smaller Y ampliconic gene families (HSFY, PRY, VCY, and XKRY) have lower variation in copy number compared with larger families. These gene families, for which the average family size is only two copies, are each present on an individual palindrome (the two copies are present as inverted repeats on opposite palindrome arms). Recombination between inverted repeats is expected to result in an inversion keeping copy number constant (W. Gu, Zhang, & Lupski, 2008). In addition, the presence of only two copies increases the chances of a complete gene family elimination due to Muller’s ratchet or of rearrangements which involve the whole palindrome. Consistent with this prediction, we find these gene families to be deleted or pseudogenized in several great ape species (Hallast & Jobling, 2017).

Thus, the copy number of ampliconic genes is an important factor in determining the survival of a gene family on the Y chromosome. Too few copies can lead to a complete loss of a gene family (see the preceding paragraph), whereas too many copies can lead to frequent NAHR which can rapidly increase or decrease copy number (Connallon & Clark, 2010). Consistent with this expectation, it was suggested that the human Y chromosome evolves under selection to maintain an optimal copy number for its amplicons in diverse human lineages (Teitz et al., 2018). Most likely both random genetic drift and natural selection contribute to determining the Y chromosome ampliconic gene copy number. Drift leads to smaller-scale changes in copy numbers, whereas selection might act at removing extreme copy numbers because too few copies might lead to infertility and too many copies might lead to genetic instability and thus both are selected against. Variation in Y ampliconic gene copy number in subfertile and infertile males should be investigated in future studies and should shed additional light on the balance between these two evolutionary processes.

Note that in the present study we only examined complete gene copy gains or losses, but insertions and deletions inside a gene can also affect gene expression and functionality, and might be linked to infertility (Poli, Iriarte, Iudica, Zanier, & Coco, 2015). The effects of such smaller CNVs are more robustly evaluated from long-read data and we leave this exploration to future work.

Variability in Y ampliconic gene expression

Here we studied the expression levels of the Y ampliconic gene families in testis tissue of presumably healthy individuals. The vast majority of cells in testis are germline cells in the seminiferous ducts, where spermatogenesis takes place. We primarily captured Y chromosome gene expression in spermatogonia prior to meiosis and throughout different spermatogenesis stages after meiosis (Larson, Kopania, & Good, 2018; Sin, Ichijima, Koh, Namiki, & Namekawa, 2012); this is because Y transcription is silenced at other stages of spermatogenesis due to meiotic sex chromosome inactivation (Handel, 2004; Larson et al., 2018) and postmeiotic sex chromosome repression (Larson et al., 2018; Sin et al., 2012). As a tissue, testis is a mixture of germline cells at different stages of development, Sertoli cells, myoid peritubular cells, and interstitial Leydig cells. Thus, the expression values generated using testis tissue as a source represent cumulative gene expression of germline cells at different stages of spermatogenesis with a mixture of somatic cells. This potential limitation notwithstanding, our results indicate substantial variation in expression levels for Y ampliconic genes in testis among men and suggest that different levels of Y ampliconic genes’ expression are tolerated by presumably healthy individuals.

When we compared copy number of ampliconic genes to their gene expression values across gene families, we found that families with higher median copy number had higher expression levels. This is consistent with an observation made by Lucotte and colleagues (Lucotte et al., 2018), who reported on the expression of Y ampliconic genes at different stages of spermatogenesis with respect to variation in their copy number. Overall, the Y chromosome has higher gene copy number for those gene families whose median expression levels are higher in testis; however, this relationship might differ among individual cell types in testis and should be studied further.

When we examined the relationship between copy number and expression within a gene family, our analysis revealed that expression of Y ampliconic gene families is independent of their copy number. Moreover, no significant differences in Y ampliconic gene expression levels were observed among Y haplogroups, even though we found significant differences among Y haplogroups in copy number for some gene families (BPY2, TSPY, and RBMY). This suggests that testis tissue might have evolved the ability to tolerate different Y ampliconic gene copy numbers, and also variable Y ampliconic gene expression levels.

Approximately 77% of all protein-coding genes in the human genome are expressed in testis (Djureinovic et al., 2014), and some of these genes could regulate expression of the Y ampliconic genes. Understanding the 3D organization and chromatin structure on the Y is expected to aid in identifying the genomic regions and genes that ampliconic genes interact with and are regulated by in the genome. Future studies analyzing expression data at different stages of spermatogenesis in individuals with different Y ampliconic gene copy numbers will assist in deciphering the role of copy number variation in determining gene expression in more detail. Additionally, our findings should be confirmed by studies of gene expression at the protein level.

A man’s advanced age has significant negative impact on reproduction (Harris, Fronczak, Roth, & Meacham, 2011). Semen parameters such as daily sperm production, total sperm count, and sperm viability are negatively correlated with age (Gunes, Hekim, Arslan, & Asci, 2016). However, within our dataset, we observed mixed results regarding age effects on Y ampliconic gene expression: age did not influence variation in gene expression of these genes in individuals with European Y subhaplogroups I1a and R1b, however HSFY and PRY expression had a positive correlation with age in individuals with an African subhaplogroup E1b. These findings should be validated with a larger data set to examine the role of Y ampliconic genes in changes in spermatogenesis with age.

Dosage regulation of Y ampliconic genes

The Y chromosome degradation, which is common across eutherian mammals, has resulted in the loss of the majority of genes originally present on the proto-sex autosomal pair (Vicoso & Bachtrog, 2009). To balance the loss of genes on the Y in males, the mammalian X chromosome adapted its expression levels by inactivating one of its copies and increasing the expression of the other copy in females (Nguyen & Disteche, 2005; Straub & Becker, 2007; Vicoso & Bachtrog, 2009). We wondered whether a similar process evolved at Y ampliconic genes that have non-Y homologs, namely whether the expression of Y ampliconic genes and their non-Y homologs has been co-regulated. Alternatively, Y ampliconic genes might have evolved new functions, and thus potentially high expression levels, independent of their non-Y homologs. Yet another alternative would be the overall low expression levels because of the relaxation of functional constraints on the Y ampliconic genes. The precise functions of Y ampliconic genes have been under-characterized (Table S12) due to the repeated nature of the Y chromosome and scarcity of testable orthologs in model organisms. While Y ampliconic genes have testis-specific expression likely as a result of sexual antagonism, the majority of non-Y homologs of Y ampliconic genes have ubiquitous expression.

Recently, a multi-step model for preservation of tandem duplicate genes was presented. According to this model, the expression of gene duplicates is downregulated immediately after the duplication event, followed by dosage sharing which could lead to functional adaptations such as sub- or neofunctionalization (Lan & Pritchard, 2016). Knowing that non-Y homologs of Y ampliconic genes are expressed in testis (except for XKRX), we compared the expression levels of closely related homologs of ampliconic genes on both autosomes and X chromosome to the sum of expression levels for all the copies of a Y gene family. We demonstrated that, with the exception of the HSFY family, Y ampliconic gene families have consistently lower expression levels when compared to their non-Y homologs, thus not elevating the overall expression level of the family. We term this phenomenon dosage regulation of Y ampliconic genes. Lower expression of Y ampliconic gene families could be an adaptation of the Y to maintain the multi-copy state of ampliconic gene families. By lowering the expression of the whole gene family, the Y can buffer sudden loss or gain of gene copies. In addition to dosage regulation, the gene family should be expressed at optimal levels to maintain its functionality during spermatogenesis. Lower optimal expression of Y ampliconic gene families compared to their non-Y homologs could be a result of subfunctionalization (e.g., testis specificity in expression), which benefits germline cell development. Alternatively, such low expression could be a result of relaxed selection, and, in agreement with this possibility, Y ampliconic genes show a higher ratio of nonsynonymous to synonymous substitution rates compared to single-copy X-degenerate genes on the Y (Betrán et al., 2012). Alternatively, a gene family could be under positive selection or undergoing neofunctionalization even in its low-level expression state. The expression of ampliconic gene families is important for spermatogenesis because of an association between gene deletions and infertility, but relaxed selection can facilitate rapid differentiation of ampliconic gene function.

We found that expression levels of the CDY ampliconic genes and those of their autosomal homologs are negatively correlated among individual men. This suggests that the CDY gene family might not be expressed at the same time during spermatogenesis as its autosomal homologs or that there is a coordinated downregulation of CDY expression with a rise in CDYL and CDYL2 expression (and vice versa). In humans, the CDYL and CDYL2 autosomal genes produce the ubiquitously expressed long transcripts, but lost the testis-specific short transcript which is now produced by CDY (Dorus et al., 2003). The combined tissue expression patterns of CDY, CDYL, and CDYL2 in human recapitulate the expression patterns of CDYL and CDYL2 in mouse or rabbit, which do not have CDY on their Y chromosome (Dorus et al., 2003).

In contrast with CDY, we found that expression levels of DAZ, HSFY, and VCY gene families are strongly positively correlated with their non-Y homolog expression across individuals, which suggests a co-regulation in gene expression levels of these ampliconic gene families and their homologs (the RBMY and TSPY families also show positive correlation, however it is not strong). When we examine the linear relationship between ampliconic gene families and their homologs among individual men, the Y ampliconic gene expression increases at a slower pace when compared to the expression of their non-Y homologs, except for HSFY where the expression increases at a similar rate for both Y and non-Y homologs (Fig. 6).

The VCY gene family is the most commonly lost gene family among great apes; however, in our dataset the expression of this gene family is higher than for most other gene families on the Y and is higher than is predicted from its copy number (Fig. 2). The homologs of VCY on the X chromosome (VCX, VCX2, VCX3A, and VCX3B) are expressed in testis (Lahn & Page, 2000; Uhlén et al., 2015), and we show that they are expressed at higher levels than the VCY family itself. In addition, there is high sequence identity (>95%) between the VCX and VCY gene families, which could imply that both VCX and VCY have been under selection to maintain function of the gene family. However, to balance the expression of the multi-copy VCX family, VCY might have lowered its expression. The role of both VCX and VCY in ribosome assembly in spermatogenesis has been suggested (Zou et al., 2003). The loss of VCY in great ape species might have been compensated by functionally similar VCX family expression in testis. The expression levels of the VCX family across great apes must be studied to understand its role in the loss of VCY.

A recent study found multiple distinct clusters of full-length Y ampliconic gene transcripts, likely originating from different copies of the same family (Sahlin, Tomaszkiewicz, Makova, & Medvedev, 2018). Therefore, the presence of multiple full-length transcripts (Sahlin et al., 2018) and low expression levels for Y ampliconic gene families (the present study) suggest that individual gene copies within a family are downregulated to accommodate the expression of the whole gene family on the Y chromosome and outside of it (on autosomes and on the X). This hypothesis needs to be examined in future studies in which expression levels of individual gene copies will be evaluated with long-read sequencing technology. It will also be important to decipher the isoforms and their expression levels for Y ampliconic genes and their non-Y homologs to understand whether Y ampliconic genes and their homologs express the same isoforms, or whether Y ampliconic genes express their own, unique, testis-specific isoforms.

It is essential to note that, in addition to evolution of expression levels of the whole gene family including its non-Y homologs, the Y ampliconic genes can diverge to acquire additional male-specific functionality because they are present on the Y, which is susceptible to accumulating genetic differences dictated by sexual antagonism. In other words, Y ampliconic genes could have gained secondary functions independent of their functions on the proto-sex chromosomes. This scenario might be exemplified by the case of the HSFY family, whose expression levels have increased in comparison to its X-chromosome homologs. This pattern suggests that this gene family underwent neofunctionalization. The exact function of HSFY is unknown, but its role in transcription regulation has been suggested because it harbors a DNA-binding domain (Shinka et al., 2004). In fact, it was shown that HSFY and HSFX share only this DNA-binding domain but not the rest of their sequences and thus indeed might have diverged in their functions (Shinka et al., 2004). Moreover, HSFY has stage-specific expression during spermatogenesis, suggesting that it acquired a function different from that of the heat shock factors to which it is homologous (Shinka et al., 2004). The loss of HSFY was linked to infertility (Kichine et al., 2012; Kinoshita et al., 2006; Shinka et al., 2004). In another study, underexpression of HSFY was linked to arrest of maturation of nascent germ cells to motile sperm (Stahl, Mielnik, Barbieri, Schlegel, & Paduch, 2012). According to our study, the expression of the HSFY gene family was positively correlated with age in the African E1b Y haplogroup; however, such a relationship was not found in the R1b haplogroup. Further studies addressing transcription regulation by the HSFY family in individuals of varying age across different Y haplogroups are required to understand the HSFY functionality in more detail.


We assume that non-Y homologs have retained the ancestral expression state because of the overall fast evolutionary rate on the Y chromosome (Makova & Li, 2002). However, the X chromosome and autosomes have also been evolving, albeit slower than the Y. Evolutionary changes acquired by non-Y homologs since they diverged from the Y homologs have not been addressed in this study due to the lack of ancestral expression data. To address this, future studies should identify species which have orthologs of human ampliconic genes in a single-copy state on their Y chromosome. In the case of CDY and DAZ gene families, future studies should identify species in which these genes’ orthologs are present in a single-copy state on their autosomes and absent from the sex chromosomes. Once such species are identified, their testis-specific expression data for these genes could be used as the ancestral expression state.

Materials and Methods

AmpliCoNE: Ampliconic Copy Number Estimator

To estimate copy number in highly-similar multi-copy gene families, several strategies have been proposed. One can align each read to all possible locations in the reference genome (Alkan et al., 2009), identify sites in the reference genome that uniquely distinguish and tag paralogs of interest (Handsaker et al., 2015; Oetjens et al., 2016; Sudmant et al., 2010; Teitz et al., 2018), use simulated reads for mock genomes with human gene cDNAs at different gene copy counts to obtain a theoretical function of the coverage distribution with respect to copy number (Cortez et al., 2014), or customize the reference to keep a single copy of each gene family (Lucotte et al., 2018; Skov et al., 2017). While these strategies were effective in their respective papers, we could not find software that could work on human Y ampliconic genes. We therefore combine the ideas from these strategies into AmpliCoNE, a tool for estimating copy number in highly-similar multi-copy gene families. The Results section contains an overview of AmpliCoNE, but we provide more details here.

Determining control and informative positions

To calibrate a baseline for read depth at copy number of one, AmpliCoNE uses single copy regions unique to the Y (AmpliCoNE also provides an option to use X-degenerate regions as a control, but we do not describe it in the manuscript). These regions are identified by AmpliCoNE-build as positions in the Y chromosome such that the k-mer starting at the position does not match any other location in the reference. We used a k-mer of size 101, since this is the length of the shortest reads that we use from GTEx, and we allowed up to two edits in matching to other locations. AmpliCoNE-build computes these positions by using the GEM-mappability tool (v1.315) (Derrien et al., 2012) and extracting those locations with a mappability of 1.
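The idea behind control positions can be illustrated with a deliberately simplified sketch. It is not the GEM-mappability procedure: it uses exact matching only (the analysis above allows up to two edits), considers a single strand, and runs on a toy string with a small k. The function name and the example sequence are hypothetical.

```python
from collections import Counter


def single_copy_positions(reference, k):
    """Return 0-based start positions whose k-mer occurs exactly once.

    Simplification: exact matches on the forward strand only.
    """
    kmers = [reference[i:i + k] for i in range(len(reference) - k + 1)]
    counts = Counter(kmers)
    return [i for i, kmer in enumerate(kmers) if counts[kmer] == 1]


# Toy reference in which the 4-mer ACGT occurs twice (positions 0 and 4).
toy_reference = "ACGTACGTTT"
print(single_copy_positions(toy_reference, k=4))  # [1, 2, 3, 5, 6]
```

Positions 0 and 4 are excluded because their k-mer is duplicated, which is the toy analogue of a position whose mappability differs from 1.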

AmpliCoNE-build pre-determines which positions of the genome will later be used for measuring read depth. A position within a gene family is said to be informative if (1) the location is not within an annotated high-copy repeat region (e.g. a transposable element) or a short tandem repeat, (2) the k-mer starting at that location is specific to its gene family of origin, and (3) the k-mer is non-repetitive within its gene. For (1), AmpliCoNE-build takes repeat annotations as input, such as ones generated by RepeatMasker (“RepeatMasker Home Page,” n.d.) and Tandem Repeat Finder (Benson, 1999). This step is necessary since these regions are notoriously hard to align to. For (2) and (3), AmpliCoNE-build uses a strategy similar to the one used by Teitz and colleagues to annotate Y chromosome amplicons (Teitz et al., 2018). It extracts all the 101-mers from the Y chromosome and maps them back to the reference using Bowtie2 (Langmead & Salzberg, 2012). It allows Bowtie2 to generate up to 15 alignments per k-mer (-k 15) but discards alignments that have more than two mismatches. A k-mer is then determined to be specific to the gene family if all its alignments fall within the gene regions in the family. It is determined to be non-repetitive within its gene if the number of alignments equals the number of genes in the family. Using only gene-specific locations is crucial for AmpliCoNE's accuracy, since non-specific locations would add biased noise to later read depth estimates.

Computing read depth and calling copy number

AmpliCoNE-count takes an alignment file of male reads to the male reference as input. It only retains alignments that are part of a properly mapping read pair and have at least 88 perfect matches in the first 90 bp of the read. This threshold is designed to retain only reliable alignments and is intended to match the criteria used for determining control positions. For each non-repeat-masked position $i$ on the Y chromosome, AmpliCoNE-count then computes the number of alignments starting at $i$, which we refer to as the read depth $D_i$.
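The two rules described above (classifying a k-mer as informative, and counting alignment starts after filtering) can be illustrated with a minimal Python sketch. This is not the AmpliCoNE source code: the function names, the tuple layouts, and the use of a k-mer's start coordinate to decide whether an alignment falls within a gene are all assumptions made for illustration. Only the thresholds (two mismatches, 88 matches) come from the text.

```python
from collections import Counter


def is_informative(alignments, gene_intervals, max_mismatches=2):
    """Apply criteria (2) and (3) to one k-mer.

    alignments: list of (start_position, n_mismatches) for the k-mer
                (hypothetical layout, e.g. parsed from aligner output).
    gene_intervals: list of half-open (start, end) intervals, one per gene
                    copy in the family (functional or pseudogenized).
    """
    # Discard alignments with more than the allowed number of mismatches.
    kept = [pos for pos, mm in alignments if mm <= max_mismatches]

    def in_family(pos):
        # Simplification: only the start coordinate is tested.
        return any(start <= pos < end for start, end in gene_intervals)

    # (2) family-specific: every retained alignment lies in a family gene.
    specific = bool(kept) and all(in_family(pos) for pos in kept)
    # (3) non-repetitive: one retained alignment per gene in the family.
    non_repetitive = len(kept) == len(gene_intervals)
    return specific and non_repetitive


def start_depth(read_alignments, min_matches=88):
    """Count retained alignments starting at each position (the depth D_i).

    read_alignments: iterable of
        (start_position, is_properly_paired, matches_in_first_90bp).
    """
    depth = Counter()
    for start, properly_paired, matches in read_alignments:
        if properly_paired and matches >= min_matches:
            depth[start] += 1
    return depth
```

Positions with no retained alignment simply have a depth of zero, which a `Counter` returns by default.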

It is known that the GC bias affects the depth of reads generated using Illumina technology (Bentley et al., 2008). Therefore, AmpliCoNE-count adjusts the read depth by using a sliding-window-based GC correction method (Yoon, Xuan, Makarov, Ye, & Sebat, 2009).

Concretely, AmpliCoNE-count first collects the read depths for the control positions and notes the GC percentage of the 501 bp window centered at those positions. The read depths are then binned according to their GC percentage, using 100 bins: for a given bin $b$, we calculate $\mu_b$, the mean read depth of the control locations with a GC percentage belonging to $b$. We also let $\mu$ be the mean read depth over all control positions. The GC-corrected read depth for position $i$ is then calculated as $\mu D_i / \mu_b$.

For each gene, AmpliCoNE-count computes the gene copy count as the mean GC-corrected read depth for all informative locations in the gene, divided by the mean control read depth $\mu$. To obtain the final copy count for each family, AmpliCoNE-count reports the sum of the copy counts of all the genes in the family.
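The GC correction and copy-number calculation described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the AmpliCoNE implementation: the function names and array layout are hypothetical, the GC percentage of each position's 501 bp window is assumed to be precomputed, and positions whose GC bin contains no control positions are left as NaN, which is a choice made here and not stated in the text.

```python
import numpy as np


def gc_corrected_depth(depth, gc_percent, control_idx, n_bins=100):
    """Return (corrected depth per position, mean control depth mu).

    depth:       read depth D_i for every position.
    gc_percent:  GC percentage (0-100) of the window centered on each position.
    control_idx: indices of the single-copy control positions.
    """
    depth = np.asarray(depth, dtype=float)
    gc = np.asarray(gc_percent, dtype=float)
    control_idx = np.asarray(control_idx, dtype=int)

    # Assign every position to one of n_bins GC bins.
    bins = np.minimum((gc * n_bins / 100.0).astype(int), n_bins - 1)

    control_depth = depth[control_idx]
    control_bins = bins[control_idx]
    mu = control_depth.mean()  # mean depth over all control positions

    corrected = np.full(depth.shape, np.nan)
    for b in np.unique(control_bins):
        mu_b = control_depth[control_bins == b].mean()  # bin-specific mean
        if mu_b > 0:
            in_bin = bins == b
            corrected[in_bin] = mu * depth[in_bin] / mu_b
    return corrected, mu


def family_copy_number(corrected, mu, informative_by_gene):
    """Sum, over the genes of a family, of mean corrected depth divided by mu.

    informative_by_gene: dict mapping a gene name to the indices of its
    informative positions.
    """
    return sum(
        np.nanmean(corrected[np.asarray(idx, dtype=int)]) / mu
        for idx in informative_by_gene.values()
    )
```

For example, with control positions whose depth is 10 at 40% GC and 20 at 60% GC, a gene covered at depth 20 in 40% GC sequence is called at two copies, because its corrected depth (30) is twice the overall control mean (15).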

Simulation-based validation of AmpliCoNE

To evaluate the accuracy of AmpliCoNE, we ran simulations. There are nine TSPY genes (six functional + three pseudogenes), six RBMY genes, and two VCY genes in the hg38 reference. We added different numbers of copies of these three ampliconic gene families to the Y chromosome (Table S1) to simulate reads. The total numbers of gene copies in the three custom references used to generate the simulated reads were 22/7/4 copies (for TSPY/RBMY/VCY, respectively) in set 1; 29/12/2 copies in set 2; and 23/9/3 copies in set 3. Using wgsim [v0.3.2] (Li, 2011), we simulated 666 million paired-end reads of length 101 bp and insert size of 260 bp (the exact parameters were "-d 260 -N 666873346 -1 101 -2 101 -S 9 -e 0 -r 0 -R 0"). The reads from the three simulated datasets were aligned to the hg38 reference genome using BWA-MEM [v0.7.15] (Li, 2013). The SAM files were sorted and PCR duplicates were removed using the PICARD toolkit [v1.128] (“Website,” n.d.). Finally, samtools [v1.3.1] (Li et al., 2009) was used to index the alignments. The sorted, indexed BAM files were given as input to AmpliCoNE-count.

Datasets

We used mRNA sequencing data for 170 testis samples with matched whole-genome sequencing (WGS) data from the GTEx project (Carithers et al., 2015). The GTEx RNA-seq libraries were generated with the Illumina TruSeq protocol, and whole-genome sequencing was performed with paired-end reads ranging from 100 bp to 150 bp in length with a target insert size of 350-370 bp (Carithers et al., 2015). As the validation dataset for AmpliCoNE, we used WGS data from four males (depth of coverage of ~45-50x in HG002 and HG003, ~300x in HG005, and ~100x in HG006) sequenced by the GIAB consortium (Zook et al., 2014).

Pipeline for human WGS analysis

The Y-chromosome-specific alignments of the GTEx dataset were extracted from dbGAP using the SRA toolkit (Leinonen, Sugawara, Shumway, & on behalf of the International Nucleotide Sequence Database Collaboration, 2010). From the alignments, we extracted the reads and aligned them to the hg38 reference genome using BWA-MEM [v0.7.15] (Li, 2013). The SAM files were sorted and PCR duplicates were removed using the PICARD toolkit [v1.128] (“Website,” n.d.). Finally, samtools [v1.3.1] (Li et al., 2009) was used to index the read alignment files. The generated BAM files were given as input to AmpliCoNE-count to estimate ampliconic gene copy number.

AmpliCoNE-build requires the locations of all the gene copies in the reference genome for each ampliconic gene family. While the locations of functional copies are already annotated in hg38, these annotations do not include highly similar pseudogenized copies. The pseudogenized copies must be included because they affect the read mappings. For each family, we therefore took an arbitrary annotated copy of a gene and used BLAT (W. James Kent, 2002) to find all sites aligning with >99% identity (Table S13). These locations were given as input to AmpliCoNE-build.

Experimental validation with droplet digital PCR (ddPCR)

To validate the in silico ampliconic gene copy number estimates in four individuals sequenced by the GIAB consortium (Zook et al., 2014), we acquired their DNA (NA24385, NA24149, NA24631, and NA24694) from Coriell and performed ddPCR for all nine Y ampliconic gene families. To infer the copy number of these gene families, we used SRY, a single-copy gene on the Y chromosome, and RPP30, a two-copy autosomal gene, as references. We ran ddPCR for each sample in triplicate using EvaGreen dsDNA dye (Bio-Rad) on the Bio-Rad QX200 digital droplet platform with the protocol and primers from our previous publication (Tomaszkiewicz et al., 2016). The results were analyzed using QuantaSoft software. Subsequently, the concentration (copies/µL) of each ampliconic gene family of interest was divided by the concentration of the references, SRY and RPP30 (Table S4).
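The normalization described above is a simple ratio of concentrations. A minimal sketch follows, under stated assumptions: concentrations are in copies/µL; triplicate measurements are averaged before taking the ratio (the text does not say how triplicates were combined); and the ratio to RPP30 is multiplied by two to account for its two autosomal copies per genome (also an assumption, not a detail given in the text). The function name and the concentrations in the usage note are hypothetical.

```python
from statistics import mean


def ddpcr_copy_number(target_concs, reference_concs, reference_copies=1):
    """Estimate gene family copy number from ddPCR concentrations.

    target_concs:     replicate concentrations (copies/uL) of the gene family.
    reference_concs:  replicate concentrations (copies/uL) of the reference.
    reference_copies: copies of the reference per genome
                      (1 for SRY, 2 for RPP30 under the assumption above).
    """
    target = mean(target_concs)
    reference = mean(reference_concs)
    if reference <= 0:
        raise ValueError("reference concentration must be positive")
    return target * reference_copies / reference
```

With hypothetical triplicates of 290, 300, and 310 copies/µL for a gene family and 100 copies/µL for SRY, the estimate is three copies; the same family measured against RPP30 at 200 copies/µL with `reference_copies=2` gives the same value.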

Estimating gene expression levels

Gene expression estimates were obtained using the kallisto-DESeq2 pipeline described below. The standard human (hg38) RefSeq transcripts obtained from the UCSC Genome Browser (W. J. Kent et al., 2002) were used as the reference. We generated an index for the reference using the kallisto [v0.43.0] index function with default parameters (Bray, Pimentel, Melsted, & Pachter, 2016). For each sample, we obtained read counts per transcript using the kallisto quant function (--bias, --seed=9, --bootstrap-samples=100). The hg38 refFlat file containing the transcript-to-gene mapping information was obtained from the UCSC Genome Browser (W. J. Kent et al., 2002) annotation database and was used to convert the transcript-level read counts to gene-level expression levels with the tximport package [v1.2.0] (Soneson, Love, & Robinson, 2015). Since there were no replicates for the samples, we set the 170 sample IDs as different conditions in the design, and the gene-level read counts for the 170 RNA-seq samples were normalized using DESeq2 [v1.14.1] (Love, Huber, & Anders, 2014). Additionally, read counts transformed with the vst (Variance Stabilizing Transformation) function in DESeq2 were used to check for outliers. To identify outliers in the dataset, we performed Principal Component Analysis using the prcomp() function on the vst-normalized read counts. When we plotted the first and second principal components, we found 21 samples outside the main cluster of the remaining 149 samples (Fig. S12). We also followed the steps described in the DESeq2 vignettes and plotted a heatmap of sample-to-sample distances for the top 1,000 most highly expressed genes to identify outliers visually, and we found the same 21 samples as outliers. Thus, we filtered out these 21 samples and used the expression values for the nine ampliconic families in the remaining 149 samples in the downstream analysis.
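The PCA step above was run with R's prcomp() and outliers were identified by eye. As a rough Python stand-in, the sketch below computes the same kind of principal component scores (centered, unscaled) with NumPy. The automatic flagging rule, which measures each sample's distance from the median of the first two components against a median-absolute-deviation cutoff, is an assumption added for illustration and was not used in the study. Function names and the cutoff value are hypothetical.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """Principal component scores for a samples-by-genes matrix.

    Columns are centered but not scaled, then decomposed by SVD.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


def flag_outliers(scores, n_mads=5.0):
    """Flag samples far from the bulk in PC space (illustrative rule only)."""
    center = np.median(scores, axis=0)
    dist = np.linalg.norm(scores - center, axis=1)
    med = np.median(dist)
    mad = np.median(np.abs(dist - med))
    return dist > med + n_mads * mad
```

In practice the flags would be compared against the PCA plot and the sample-to-sample distance heatmap rather than trusted on their own.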
We summed the expression values for all the gene copies within a gene family to obtain family-level expression values.

Human Y haplogroup determination

Yhaplo [v1.0.11] (Poznik & David Poznik, 2016) was used to predict the Y haplogroup of the samples. The version of Yhaplo we used expects SNP coordinates consistent with the hg19 (Church et al., 2011) version of the human reference. The Y-chromosome-specific BAM files downloaded from dbGAP were aligned to the hg19 version of the human reference using BWA MEM. We directly converted the downloaded BAM files into pileup format using the samtools mpileup function. A custom script was used to convert the pileup file into the Yhaplo-compatible input format. We annotated the Y haplogroup for all the samples in the dataset using Yhaplo with default parameters.

Code availability

Code used in the manuscript is available on GitHub at https://github.com/makovalab-psu/GTEx_Testis_Analysis. Steps to install and use AmpliCoNE are available on GitHub at https://github.com/makovalab-psu/AmpliCoNE-tool

References

Alkan, C., Kidd, J. M., Marques-Bonet, T., Aksay, G., Antonacci, F., Hormozdiari, F., … Eichler, E. E. (2009). Personalized copy number and segmental duplication maps using next-generation sequencing. Nature Genetics, 41(10), 1061–1067. Ardlie, K. G., DeLuca, D. S., Segrè, A. V., Sullivan, T. J., Young, T. R., Gelfand, E. T., … Lockhart. (2015). The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science, 348(6235), 648–660. Bellott, D. W., Hughes, J. F., Skaletsky, H., Brown, L. G., Pyntikova, T., Cho, T.-J., … Page, D. C. (2014). Mammalian Y chromosomes retain widely expressed dosage- sensitive regulators. Nature, 508(7497), 494–499. Benson, G. (1999). Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research, 27(2), 573–580. Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., Milton, J., Brown, C. G., … Smith, A. J. (2008). Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456(7218), 53. Betrán, E., Demuth, J. P., & Williford, A. (2012). Why Chromosome Palindromes? International Journal of Evolutionary Biology, 2012(Figure 2), 1–14. Bhowmick, B. K., Satta, Y., & Takahata, N. (2007). The origin and evolution of human ampliconic gene families and ampliconic structure. Genome Research, 17(4), 441– 450. Bray, N. L., Pimentel, H., Melsted, P., & Pachter, L. (2016). Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology, 34(5), 525–527. Carithers, L. J., Ardlie, K., Barcus, M., Branton, P. A., Britton, A., Buia, S. A., … GTEx

Consortium. (2015). A Novel Approach to High-Quality Postmortem Tissue Procurement: The GTEx Project. Biopreservation and Biobanking, 13(5), 311–319. Carvalho, C. M. B., Zhang, F., & Lupski, J. R. (2011). Structural variation of the human genome: mechanisms, assays, and role in male infertility. Systems Biology in Reproductive Medicine, 57(1-2), 3–16. Charlesworth, B., & Charlesworth, D. (2000). The degeneration of Y chromosomes. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 355(1403), 1563–1572. Charlesworth, D., & Charlesworth, B. (1980). Sex differences in fitness and selection for centric fusions between sex-chromosomes and autosomes. Genetical Research, 35(02), 205. Church, D. M., Schneider, V. A., Graves, T., Auger, K., Cunningham, F., Bouk, N., … Hubbard, T. (2011). Modernizing reference genome assemblies. PLoS Biology, 9(7), e1001091. Connallon, T., & Clark, A. G. (2010). Gene duplication, gene conversion and the evolution of the Y chromosome. Genetics, 186(1), 277–286. Cortez, D., Marin, R., Toledo-Flores, D., Froidevaux, L., Liechti, A., Waters, P. D., … Kaessmann, H. (2014). Origins and functional evolution of Y chromosomes across mammals. Nature, 508(7497), 488–493. Derrien, T., Estellé, J., Marco Sola, S., Knowles, D. G., Raineri, E., Guigó, R., & Ribeca, P. (2012). Fast computation and applications of genome mappability. PloS One, 7(1), e30377. Djureinovic, D., Fagerberg, L., Hallström, B., Danielsson, A., Lindskog, C., Uhlén, M., & Pontén, F. (2014). The human testis-specific proteome defined by transcriptomics and antibody-based profiling. Molecular Human Reproduction, 20(6), 476–488. Dorus, S., Gilbert, S. L., Forster, M. L., Barndt, R. J., & Lahn, B. T. (2003). The CDY- related gene family: coordinated evolution in copy number, expression profile and protein sequence. Human Molecular Genetics, 12(14), 1643–1650. Giachini, C., Nuti, F., Turner, D. J., Laface, I., Xue, Y., Daguin, F., … Krausz, C. (2009). 
TSPY1 Copy Number Variation Influences Spermatogenesis and Shows Differences among Y Lineages. The Journal of Clinical Endocrinology and Metabolism, 94(10), 4016–4022. Gu, L., & Walters, J. R. (2017). Evolution of Sex Chromosome Dosage Compensation in Animals: A Beautiful Theory, Undermined by Facts and Bedeviled by Details.
Genome Biology and Evolution, 9(9), 2461–2476. Gunes, S., Hekim, G. N. T., Arslan, M. A., & Asci, R. (2016). Effects of aging on the male reproductive system. Journal of Assisted Reproduction and Genetics, 33(4), 441– 454. Gu, W., Zhang, F., & Lupski, J. R. (2008). Mechanisms for human genomic rearrangements. PathoGenetics, 1(1), 4. Hallast, P., & Jobling, M. A. (2017). The Y chromosomes of the great apes. Human Genetics, 136(5), 511–528. Handel, M. A. (2004). The XY body: a specialized meiotic chromatin domain. Experimental Cell Research, 296(1), 57–63. Handsaker, R. E., Van Doren, V., Berman, J. R., Genovese, G., Kashin, S., Boettger, L. M., & McCarroll, S. A. (2015). Large multiallelic copy number variations in humans. Nature Genetics, 47(3), 296–303. Harris, I. D., Fronczak, C., Roth, L., & Meacham, R. B. (2011). Fertility and the aging male. Reviews in Urology, 13(4), e184–e190. Henrichsen, C. N., Chaignat, E., & Reymond, A. (2009). Copy number variants, diseases and gene expression. Human Molecular Genetics, 18(R1), R1–R8. Iwase, M., Satta, Y., Hirai, H., Hirai, Y., & Takahata, N. (2010). Frequent gene conversion events between the X and Y homologous chromosomal regions in primates. BMC Evolutionary Biology, 10, 225. Jobling, M. A. (2008). Copy number variation on the human Y chromosome. Cytogenetic and Genome Research, 123(1-4), 253–262. Kent, W. J. (2002). BLAT--the BLAST-like alignment tool. Genome Research, 12(4), 656– 664. Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A. M., & Haussler, a. D. (2002). The Human Genome Browser at UCSC. Genome Research, 12(6), 996–1006. Kichine, E., Rozé, V., Di Cristofaro, J., Taulier, D., Navarro, A., Streichemberger, E., … Mitchell, M. J. (2012). HSFY genes and the P4 palindrome in the AZFb interval of the human Y chromosome are not required for spermatocyte maturation. Human Reproduction , 27(2), 615–624. Kido, T., & Lau, Y.-F. C. (2014). 
The Y-located gonadoblastoma gene TSPY amplifies its own expression through a positive feedback loop in prostate cancer cells. Biochemical and Biophysical Research Communications, 446(1), 206–211.

Kinoshita, K., Shinka, T., Sato, Y., Kurahashi, H., Kowa, H., Chen, G., … Nakahori, Y. (2006). Expression analysis of a mouse orthologue of HSFY, a candidate for the azoospermic factor on the human Y chromosome. The Journal of Medical Investigation: JMI, 53(1-2), 117–122. Krausz, C., Chianese, C., Giachini, C., Guarducci, E., Laface, I., & Forti, G. (2011). The Y chromosome-linked copy number variations and male fertility. Journal of Endocrinological Investigation, 34(5), 376–382. Krausz, C., Giachini, C., & Forti, G. (2010). TSPY and Male Fertility. Genes, 1(2), 308– 316. Lahn, B. T., & Page, D. C. (2000). A human sex-chromosomal gene family expressed in male germ cells and encoding variably charged proteins. Human Molecular Genetics, 9(2), 311–319. Lambert, S., Saintigny, Y., Delacote, F., Amiot, F., Chaput, B., Lecomte, M., … Lopez, B. S. (1999). Analysis of intrachromosomal homologous recombination in mammalian cell, using tandem repeat sequences. Mutation Research, 433(3), 159–168. Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357–359. Lan, X., & Pritchard, J. K. (2016). Coregulation of tandem duplicate genes slows evolution of subfunctionalization in mammals. Science, 352(6288), 1009–1013. Larson, E. L., Kopania, E. E. K., & Good, J. M. (2018). Spermatogenesis and the Evolution of Mammalian Sex Chromosomes. Trends in Genetics: TIG, 34(9), 722–732. Leinonen, R., Sugawara, H., Shumway, M., & on behalf of the International Nucleotide Sequence Database Collaboration. (2010). The Sequence Read Archive. Nucleic Acids Research, 39(Database), D19–D21. Li H. (2011). wgsim-Read simulator for next generation sequencing. Github Repository. [online] http://github.com/lh3/wgsim.

Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA- MEM. Retrieved from http://arxiv.org/abs/1303.3997 Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., … 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics , 25(16), 2078–2079. Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550.

Lucotte, E. A., Skov, L., Jensen, J. M., Coll Macià, M., Munch, K., & Schierup, M. H. (2018). Dynamic Copy Number Evolution of X- and Y-Linked Ampliconic Genes in Human Populations. Genetics. https://doi.org/10.1534/genetics.118.300826 Lupiáñez, D. G., Kraft, K., Heinrich, V., Krawitz, P., Brancati, F., Klopocki, E., … Mundlos, S. (2015). Disruptions of topological chromatin domains cause pathogenic rewiring of gene-enhancer interactions. Cell, 161(5), 1012–1025. Lynch, M., & Conery, J. S. (2000). The evolutionary fate and consequences of duplicate genes. Science, 290(5494), 1151–1155. Makova, K. D., & Li, W.-H. (2002). Strong male-driven evolution of DNA sequences in humans and apes. Nature, 416(6881), 624–626. Mallick, S., Li, H., Lipson, M., Mathieson, I., Gymrek, M., Racimo, F., … Reich, D. (2016). The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature, 538(7624), 201–206. Medvedev, P., Stanciu, M., & Brudno, M. (2009). Computational methods for discovering structural variation with next-generation sequencing. Nature Methods, 6(11s), S13. Navarro-Costa, P. (2012). Sex, rebellion and decadence: the scandalous evolutionary history of the human Y chromosome. Biochimica et Biophysica Acta, 1822(12), 1851–1863. Navarro-Costa, P., Plancha, C. E., & Gonçalves, J. (2010). Genetic dissection of the AZF regions of the human Y chromosome: Thriller or filler for male (In)fertility? Journal of Biomedicine & Biotechnology, 2010. https://doi.org/10.1155/2010/936569 Nguyen, D. K., & Disteche, C. M. (2005). Dosage compensation of the active X chromosome in mammals. Nature Genetics, 38(1), 47–53. Oetjens, M. T., Shen, F., Emery, S. B., Zou, Z., & Kidd, J. M. (2016). Y-Chromosome Structural Diversity in the Bonobo and Chimpanzee Lineages. Genome Biology and Evolution, 8(7), 2231–2240. Hastings, P. J., Lupski, J. R., Rosenberg, S. M., & Ira, G. (2010). Mechanisms of change in gene copy number. Nature Reviews. Genetics, 10(8), 551–564. Poli, M.
N., Iriarte, P. F., Iudica, C., Zanier, J. H. M., & Coco, R. (2015). New Sequence Variations in Spermatogenesis Candidates Genes. JBRA Assisted Reproduction, 19(4), 216–222. Poznik, G. D., & David Poznik, G. (2016). Identifying Y-chromosome haplogroups in arbitrarily large samples of sequenced or genotyped men. https://doi.org/10.1101/088716

RepeatMasker Home Page. (n.d.). Retrieved June 20, 2018, from http://www.repeatmasker.org Repping, S., Skaletsky, H., Lange, J., Silber, S., van der Veen, F., Oates, R. D., … Rozen, S. (2002). Recombination between Palindromes P5 and P1 on the Human Y Chromosome Causes Massive Deletions and Spermatogenic Failure. American Journal of Human Genetics, 71(4), 906–922. Rozen, S., Skaletsky, H., Marszalek, J. D., Minx, P. J., Cordum, H. S., Waterston, R. H., … Page, D. C. (2003). Abundant gene conversion between arms of palindromes in human and ape Y chromosomes. Nature, 423(6942), 873–876. Sahlin, K., Tomaszkiewicz, M., Makova, K. D., & Medvedev, P. (2018). Deciphering highly similar multigene family transcripts from Iso-Seq data with IsoCon. Nature Communications, 9(1), 4601. Shinka, T., Sato, Y., Chen, G., Naroda, T., Kinoshita, K., Unemi, Y., … Nakahori, Y. (2004). Molecular characterization of heat shock-like factor encoded on the human Y chromosome, and implications for male infertility. Biology of Reproduction, 71(1), 297–306. Sin, H.-S., Ichijima, Y., Koh, E., Namiki, M., & Namekawa, S. H. (2012). Human postmeiotic sex chromatin and its impact on sex chromosome evolution. Genome Research, 22(5), 827–836. Skaletsky, H., Kuroda-Kawaguchi, T., Minx, P. J., Cordum, H. S., Hillier, L., Brown, L. G., … Page, D. C. (2003). The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature, 423(6942), 825–837. Skov, L., Danish Pan Genome Consortium, & Schierup, M. H. (2017). Analysis of 62 hybrid assembled human Y chromosomes exposes rapid structural changes and high rates of gene conversion. PLoS Genetics, 13(8), e1006834. Soneson, C., Love, M. I., & Robinson, M. D. (2015). Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research, 4, 1521. Spielmann, M., Lupiáñez, D. G., & Mundlos, S. (2018). Structural variation in the 3D genome. Nature Reviews. Genetics, 19(7), 453–467. Stahl, P. 
J., Mielnik, A. N., Barbieri, C. E., Schlegel, P. N., & Paduch, D. A. (2012). Deletion or underexpression of the Y-chromosome genes CDY2 and HSFY is associated with maturation arrest in American men with nonobstructive azoospermia. Asian Journal of Andrology, 14(5), 676–682. Straub, T., & Becker, P. B. (2007). Dosage compensation: the beginning and end of
generalization. Nature Reviews. Genetics, 8(1), 47–57. Sudmant, P. H., Kitzman, J. O., Antonacci, F., Alkan, C., Malig, M., Tsalenko, A., … Eichler, E. E. (2010). Diversity of human copy number variation and multicopy genes. Science, 330(6004), 641–646. Teitz, L. S., Pyntikova, T., Skaletsky, H., & Page, D. C. (2018). Selection Has Countered High Mutability to Preserve the Ancestral Copy Number of Y Chromosome Amplicons in Diverse Human Lineages. American Journal of Human Genetics, 103(2), 261–275. Tomaszkiewicz, M., Rangavittal, S., Cechova, M., Campos Sanchez, R., Fescemyer, H. W., Harris, R., … Makova, K. D. (2016). A time- and cost-effective strategy to sequence mammalian Y Chromosomes: an application to the de novo assembly of gorilla Y. Genome Research, 26(4), 530–540. Trombetta, B., Cruciani, F., Underhill, P. A., Sellitto, D., & Scozzari, R. (2010). Footprints of X-to-Y gene conversion in recent human evolution. Molecular Biology and Evolution, 27(3), 714–725. Tsuei, D.-J., Lee, P.-H., Peng, H.-Y., Lu, H.-L., Su, D.-S., Jeng, Y.-M., … Chang, M.-H. (2011). Male germ cell-specific RNA binding protein RBMY: a new oncogene explaining male predominance in liver cancer. PloS One, 6(11), e26948. Uhlén, M., Fagerberg, L., Hallström, B. M., Lindskog, C., Oksvold, P., Mardinoglu, A., … Pontén, F. (2015). Proteomics. Tissue-based map of the human proteome. Science, 347(6220), 1260419. Vangompel, M. J. W., & Xu, E. Y. (2011). The roles of the DAZ family in spermatogenesis: More than just translation? Spermatogenesis, 1(1), 36–46. Vicoso, B., & Bachtrog, D. (2009). Progress and prospects toward our understanding of the evolution of dosage compensation. Chromosome Research: An International Journal on the Molecular, Supramolecular and Evolutionary Aspects of Chromosome Biology, 17(5), 585–602. Vinuela, A., Brown, A. A., Buil, A., Tsai, P.-C., Davies, M. N., Bell, J. T., … Small, K. (2016). 
Age-dependent changes in mean and variance of gene expression across tissues in a twin cohort. https://doi.org/10.1101/063883 Vogt, P. H., Edelmann, A., Kirsch, S., Henegariu, O., Hirschmann, P., Kiesewetter, F., … Haidl, G. (1996). Human Y chromosome azoospermia factors (AZF) mapped to different subregions in Yq11. Human Molecular Genetics, 5(7), 933–943. Website. (n.d.). Retrieved June 13, 2018, from http://broadinstitute.github.io/picard/ Yang, J., The GTEx Consortium, Huang, T., Petralia, F., Long, Q., Zhang, B., … Tu, Z.
(2015). Synchronized age-related gene expression changes across multiple tissues in human and the link to complex diseases. Scientific Reports, 5(1). https://doi.org/10.1038/srep15145 Yan, Y., Yang, X., Liu, Y., Shen, Y., Tu, W., Dong, Q., … Yang, Y. (2017). Copy number variation of functional RBMY1 is associated with sperm motility: an azoospermia factor-linked candidate for asthenozoospermia. Human Reproduction , 32(7), 1521– 1531. Ye, D., Zaidi, A. A., Tomaszkiewicz, M., Anthony, K., Liebowitz, C., DeGiorgio, M., … Betran, E. (2018). High Levels of Copy Number Variation of Ampliconic Genes across Major Human Y Haplogroups. Genome Biology and Evolution, 10(5), 1333–1350. Yoon, S., Xuan, Z., Makarov, V., Ye, K., & Sebat, J. (2009). Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Research, 19(9), 1586–1592. Yu, H., Luscombe, N. M., Qian, J., & Gerstein, M. (2003). Genomic analysis of gene expression relationships in transcriptional regulatory networks. Trends in Genetics: TIG, 19(8), 422–427. Zook, J. M., Chapman, B., Wang, J., Mittelman, D., Hofmann, O., Hide, W., & Salit, M. (2014). Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nature Biotechnology, 32(3), 246. Zou, S. W., Zhang, J. C., Zhang, X. D., Miao, S. Y., Zong, S. D., Sheng, Q., & Wang, L. F. (2003). Expression and localization of VCX/Y proteins and their possible involvement in regulation of ribosome assembly during spermatogenesis. Cell Research, 13(3), 171–177.

Chapter 3

Ampliconic genes on the great ape Y chromosomes: Rapid evolution of copy number but conservation of expression levels

This chapter will be submitted as a research article by R. Vegesna, M. Tomaszkiewicz, O.A. Ryder, R. Campos-Sánchez, P. Medvedev, M. DeGiorgio and K.D. Makova. In this chapter, M. Tomaszkiewicz performed all the wet-lab experimental work. Supporting information is provided in Appendix B.

Abstract

Multi-copy ampliconic gene families on the Y chromosome play an important role in spermatogenesis. Thus, studying their genetic variation in endangered great ape species is critical. We estimated the sizes (copy number) of nine Y ampliconic gene families in population samples of chimpanzee, bonobo, and orangutan with droplet digital PCR, combined these estimates with published data for human and gorilla, and produced genome-wide testis gene expression data for great apes. Analyzing this comprehensive dataset within an evolutionary framework, we first found high inter- and intraspecific variation in gene family size, with larger families exhibiting higher variation than smaller families, a pattern consistent with random genetic drift. Second, for four gene families, we observed significant interspecific size differences, sometimes even between the sister species chimpanzee and bonobo. Sperm competition and mating structure, which differ drastically among great apes, might have affected these patterns. Third, despite substantial variation in copy number, Y ampliconic gene families’ expression levels did not differ significantly among species, suggesting dosage regulation. Fourth, for three gene families, size was positively correlated with gene expression levels across species, suggesting that, given sufficient evolutionary time, copy number influences gene expression. Our results indicate high variability in size but conservation in gene expression levels in Y ampliconic gene families, significantly advancing our understanding of Y chromosome evolution in great apes. The Y gene copy number estimation protocols developed here can be used to trace male-biased dispersal and thus aid in conservation efforts of endangered great apes.

Introduction

Great apes (family Hominidae) include four genera: Pongo (Bornean, Sumatran, and Tapanuli orangutans), Gorilla (eastern and western gorillas), Pan (common chimpanzee and bonobo), and Homo (humans). They shared a common ancestor approximately 13 million years ago (MYA) (Glazko & Nei, 2003). All great apes but humans are endangered species (Fruth et al., 2016; Iucn & IUCN, 2016a, 2016b, 2016c, 2017). Therefore, understanding genetic variation within and among species and preserving the reproductive fitness of these animals are of utmost importance. Some of the clues to addressing these pressing questions lie in the analysis of the sex chromosomes of great apes. Sex chromosomes harbor genes linked to spermatogenesis and influencing fertility and reproduction (Ross et al., 2005; Skaletsky et al., 2003); however, we lack a key understanding of the diversity of these genes across great apes.

The great ape sex chromosomes, X and Y, originated from a pair of homologous chromosomes in the common ancestor of therian mammals 160-190 MYA (Luo, Yuan, Meng, & Ji, 2011; Veyrunes et al., 2008). Over time, the X chromosome has retained most of its ancestral gene content facilitated by continued recombination in females (Mueller et al., 2013), whereas the Y chromosome underwent a series of inversions and lost most of its genes due to the lack of recombination with the X (Charlesworth & Charlesworth, 1980; Skaletsky et al., 2003). Additionally, sexual antagonism led to accumulation of genes and mutations beneficial to males on the Y (Fisher, 1931). Since the split of great apes from their common ancestor, the Y chromosome continued to diverge. Cytogenetic studies have demonstrated that the Y chromosome differs in size, gene content, and gene order among great ape species (Gläser et al., 1998). To date, among great apes, only the human, chimpanzee and gorilla Y chromosomes have been sequenced and assembled, with the gorilla assembly being in draft state (Hughes et al., 2010; Skaletsky et al., 2003; Tomaszkiewicz et al., 2016). The Y chromosomes of other great ape species are yet to be deciphered.

The same sequence regions are present on the Y chromosomes of great apes studied to date. These include the pseudoautosomal region (PAR) and the male-specific region of the Y (MSY) (Hughes et al., 2012; Skaletsky et al., 2003). The PAR can recombine with the X and thus is identical to the homologous region on it (Graves & Marshall Graves, 1995; Lahn & Page, 1999). The MSY region in great apes is interspersed with heterochromatic (Cechova et al., 2019) and euchromatic sequences of different sizes. The euchromatic MSY portion consists of single-copy X-degenerate and X-transposed regions, and of highly repetitive ampliconic regions (Hughes et al., 2012, 2010; Skaletsky et al., 2003; Tomaszkiewicz et al., 2016). The X-degenerate regions constitute the remnants of the ancient proto-sex chromosomes, while the X-transposed region (so far found only in human) represents a recent transposition from the X to the Y. The ampliconic regions harbor protein-coding multi-copy gene families that are expressed in testis and are associated with spermatogenesis and male fertility (Hughes et al., 2012, 2010; Rozen et al., 2003; Skaletsky et al., 2003; Tomaszkiewicz et al., 2016). In humans, these are BPY2 (basic protein Y2), CDY (chromodomain Y), DAZ (deleted in azoospermia), HSFY (heat-shock transcription factor Y), PRY (PTP-BL related Y), RBMY (RNA-binding motif Y), TSPY (testis-specific Y), VCY (variable charge), and XKRY (X Kell blood-related Y) (Skaletsky et al., 2003). Five of these nine gene families (BPY2, CDY, DAZ, RBMY, and TSPY) are shared among great apes studied so far; however, information about the presence or absence of the other four gene families in different great ape species remains incomplete (reviewed in Hallast & Jobling, 2017). The majority of ampliconic gene families in human and chimpanzee are located in palindromes, large inverted repeats common on the Y chromosome (Hughes et al., 2010; Rozen et al., 2003; Skaletsky et al., 2003). (The exception to this pattern is the TSPY gene family, which in humans is present as a tandem array outside of palindromes (Skaletsky et al., 2003).) The presence of palindromes facilitates gene conversion between ampliconic genes, which removes deleterious mutations and lowers sequence diversity within Y ampliconic gene families (Hallast, Balaresque, Bowden, Ballereau, & Jobling, 2013; Rozen et al., 2003). Understanding the evolutionary dynamics of these gene families across great apes is essential for preserving the genetic diversity and survival of these endangered species. However, a comprehensive investigation of the evolution of copy number and expression of Y ampliconic gene families has been lacking to date.

Several studies indicated intraspecific variation in Y chromosome ampliconic gene copy number in great apes (Hughes et al., 2010; Lucotte et al., 2018; Oetjens, Shen, Emery, Zou, & Kidd, 2016; Repping et al., 2006; Schaller et al., 2010; Skov & Schierup, 2017; Tomaszkiewicz et al., 2016; Ye et al., 2018). High variation in gene copy number for the RBMY and TSPY gene families was observed in humans, chimpanzees, and gorillas (Lucotte et al., 2018; Oetjens et al., 2016; Skov & Schierup, 2017; Tomaszkiewicz et al., 2016; Vegesna, Tomaszkiewicz, Medvedev, & Makova, 2019; Ye et al., 2018). Chimpanzees also exhibit high copy number variation in the DAZ gene family (Oetjens et al., 2016; Schaller et al., 2010), and gorillas—in the CDY and HSFY gene families (Tomaszkiewicz et al., 2016). Targeted fluorescence in situ hybridization (FISH) intraspecific analysis of the DAZ and CDY gene families identified no variation in Bornean orangutan, but two variants in Sumatran orangutan (Greve et al., 2011). Thus, the precise range of copy number variation for Y ampliconic gene families remains unknown in either of these two orangutan species. The available information on copy number variation for Y ampliconic gene families in bonobos is currently limited to two individuals (Oetjens et al., 2016). Thus, we are critically missing data on ampliconic gene copy number variation in orangutans and bonobos. Moreover, variation in Y ampliconic gene copy number in great apes has never been analyzed in an evolutionary framework, which could contribute to conservation genetics evaluations.

The evolution of gene expression of Y ampliconic gene families in great apes has remained even more understudied. Recently, we demonstrated dosage regulation of human Y ampliconic gene expression in testis when compared to their homologs on the X or autosomes (Vegesna et al., 2019). Additionally, expression levels and Y haplogroup or gene copy number of an individual were not significantly associated with each other (Vegesna et al., 2019). However, across gene families, we observed a positive correlation between the copy number and expression levels (Vegesna et al., 2019), which was also shown in another study examining expression of Y ampliconic gene families at different stages of spermatogenesis (Lucotte et al., 2018). Apart from these few studies, little is known about variation and evolution in expression of Y ampliconic gene families in great apes. Moreover, the relationship between copy number and expression levels for the Y ampliconic genes in non-human great ape species remains unexplored.

Previous studies suggested that evolution of the Y chromosome could reflect different mating patterns and social structure in great apes (Hallast et al., 2016; Hughes et al., 2010; Schaller et al., 2010). Great apes exhibit substantial variation in mating systems, which can result in different levels of sperm competition. Bonobos and chimpanzees have a multimale-multifemale, i.e. polygynandrous, mating system, where female promiscuity results in high levels of sperm competition. In contrast, gorillas have a unimale-multifemale, i.e. polygynous, mating system, which results in low levels of sperm competition. Orangutans and humans fall in between (Harrison & Chivers, 2007; Wistuba et al., 2003). The roving male polygynous mating system in orangutans, and mating systems ranging from monogamous to polygynous in humans, result in levels of sperm competition that are lower than those in chimpanzees/bonobos and higher than those in gorillas (Harrison & Chivers, 2007; Wistuba et al., 2003). Using testis size and several sperm phenotypes as a proxy of sperm competition (Harcourt, Harvey, Larson, & Short, 1981), one can examine an association between genetic variation on the Y chromosome and varying levels of sperm competition across great ape species. Based on the importance of Y ampliconic gene families in spermatogenesis, it is reasonable to hypothesize that their copy number and/or expression levels differ among great apes with various mating patterns and different levels of sperm competition, and can be associated with sperm phenotypes; however, this has not been explored previously.

In this study, we present the first comprehensive analysis of the evolution of Y ampliconic gene copy number and expression levels across great apes. Using droplet digital PCR (ddPCR), we estimated copy number of ampliconic gene families in non-human great apes. We tested whether the copy number of Y ampliconic gene families is conserved across great apes and identified species that have experienced a significant gain or loss in copy number. Additionally, we generated testis expression data for bonobo and Bornean orangutan, thus augmenting such data we and others previously generated for gorilla, orangutan, chimpanzee, and human (Ardlie et al., 2015; Brawand et al., 2011; Carelli et al., 2016; Fungtammasan et al., 2016; Ruiz-Orera et al., 2015; Tomaszkiewicz et al., 2016). We assembled the transcripts of Y ampliconic gene families and tested whether their expression is conserved across great apes. Given the Y ampliconic gene families’ copy number and expression data, we investigated the evolutionary relationship between them. Finally, we examined whether the variation in copy number or gene expression can explain phenotypes related to sperm competition. Our results highlight the important role of ampliconic genes in shaping Y chromosome evolution and evolution of great apes in general.

Results

Dynamic evolution of Y ampliconic gene copy number

Overall copy number and variance. To evaluate copy number of Y ampliconic genes, we used a ddPCR protocol similar to the one utilized in previous studies from our group (Tomaszkiewicz et al., 2016; Ye et al., 2018) (see Materials and Methods). With ddPCR, template DNA is fractionated into multiple droplets within which PCR takes place, and each droplet is analyzed to determine copy number in a sample (B. J. Hindson et al., 2011). ddPCR differs from quantitative real-time PCR in that it estimates the absolute quantity of DNA without generating a standard curve (B. J. Hindson et al., 2011). It serves as a more economical alternative to whole-genome sequencing for copy number evaluation for targeted genomic regions or gene families, while providing similar copy number estimates (Vegesna et al., 2019), and is particularly attractive in the absence of the Y chromosome reference (which is the case for bonobo and Sumatran and Bornean orangutans).
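The absolute quantification described above rests on Poisson statistics: if a fraction p of droplets is positive for a target, the mean number of template copies per droplet is −ln(1 − p). The following minimal Python sketch illustrates this calculation. It is not the analysis pipeline used in this study; the function names, the normalization against a reference amplicon of known copy number, and the droplet counts are assumptions made for illustration only.

```python
import math


def copies_per_droplet(positive, total):
    """Poisson-corrected mean number of template copies per droplet."""
    if total <= 0 or not 0 <= positive < total:
        raise ValueError("need 0 <= positive < total droplets")
    # P(droplet is negative) = exp(-lambda), so lambda = -ln(fraction negative)
    return -math.log(1.0 - positive / total)


def relative_copy_number(target_pos, target_total, ref_pos, ref_total, ref_copies=1):
    """Copy number of a target relative to a reference of known copy number.

    The reference amplicon and its copy number are hypothetical inputs.
    """
    target = copies_per_droplet(target_pos, target_total)
    reference = copies_per_droplet(ref_pos, ref_total)
    return ref_copies * target / reference


# Hypothetical droplet counts for one sample (not data from this study)
estimate = relative_copy_number(9000, 15000, 600, 15000, ref_copies=1)
print(round(estimate, 1))
```

Because the correction is nonlinear, the ratio of Poisson-corrected concentrations, rather than the raw ratio of positive droplet counts, is what scales with copy number.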

Our samples included seven bonobos, nine chimpanzees, seven Bornean orangutans, and five Sumatran orangutans. To the best of our knowledge, all samples came from wild-born, unrelated individuals. Additionally, we used Y ampliconic gene copy number estimates generated by our group previously for 10 humans with African ancestry (Ye et al., 2018) and 14 wild-born gorillas (Tomaszkiewicz et al., 2016). Summarizing the data generated in our study and previous findings (reviewed in (Hallast & Jobling, 2017)), we show that eight ampliconic gene families (BPY2, CDY, DAZ, HSFY, PRY, RBMY, TSPY, and XKRY) are shared exclusively among human, gorilla, and both species of orangutan (Fig. 1A). We demonstrate that the XKRY gene family is present in both Bornean and Sumatran orangutans (Fig. 1A). HSFY, PRY, and XKRY are pseudogenized in chimpanzee and bonobo, and VCY is absent in bonobo, gorilla, and both species of orangutan examined (Fig. 1A). For these gene families, we assigned a gene count of zero in our analysis.

Figure 3-1. Y ampliconic genes in great apes. A. Venn diagram showing gene content comparison among great ape species. B. Plot of the first two principal components (PCs) of Y ampliconic gene copy numbers across great ape species (the first and second PCs explained 68.7% and 22.8% of the variation, respectively; Fig. S2 shows variation explained by the other components).

When we compared the overall number of Y ampliconic gene copies (i.e. the sum of copies of all gene families) among species (Table S1), we found that the two orangutan species had the highest numbers (median≈130 genes), and chimpanzee (median=40.9) and gorilla (median=44.1) had the lowest numbers, whereas the numbers for human (median=64.3) and bonobo (median=82.8) were in-between. When we computed the inter-individual variance in the number of Y ampliconic gene copies within each species (Table S1), we observed that Bornean orangutan (variance=455) and Sumatran orangutan (variance=136) had the highest variances, whereas gorilla (variance=13.0) had the lowest, with human’s (variance=33.7), chimpanzee’s (variance=42.5) and bonobo’s (variance=87.5) variances being in-between. In general, the higher the total number of Y ampliconic gene copies a species had, the higher was its variance in copy number (Fig. S1). After principal component analysis (PCA) of copy number estimates in individual great ape samples, species formed well-separated clusters (Figs. 1B and S2).
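The per-species summaries in this paragraph (total copies per individual, followed by the median and variance across individuals) can be sketched as follows. The input layout and the copy number values are hypothetical, and the use of the sample variance (n − 1 denominator) is an assumption, since the text does not state which estimator was used.

```python
from statistics import median, variance


def summarize_species(copy_numbers):
    """Return (median, variance) of the total Y ampliconic copy number.

    copy_numbers maps each individual to a {gene_family: copies} dict;
    families absent from a species are simply entered with zero copies.
    """
    totals = [sum(families.values()) for families in copy_numbers.values()]
    return median(totals), variance(totals)


# Hypothetical individuals and values (not data from this study)
example = {
    "ind1": {"CDY": 4.0, "RBMY": 10.0, "TSPY": 26.0, "VCY": 0.0},
    "ind2": {"CDY": 4.0, "RBMY": 11.0, "TSPY": 27.0, "VCY": 0.0},
    "ind3": {"CDY": 5.0, "RBMY": 11.0, "TSPY": 28.0, "VCY": 0.0},
}
med, var = summarize_species(example)
print(med, var)
```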

Copy number and its variance in individual gene families. Separating the data by Y ampliconic gene family and species (Fig. 2), we observed a positive correlation between gene family copy number and its variance in each species (Fig. 3) and in each gene family (Fig. S3). The TSPY gene family had consistently higher copy number and variance than other Y chromosome ampliconic gene families in bonobo, chimpanzee, and Sumatran orangutan, and had the second highest (after CDY) copy number and variance in Bornean orangutan (Fig. 3). The RBMY gene family also had high copy number and variance in all great ape species except for the two orangutan species. In contrast, the VCY family, present only in human and chimpanzee, had consistently low copy numbers (Figs. 2-3).

Figure 3-2. Variation in copy number of Y ampliconic gene families in great apes. Box plots summarizing the distribution of copy numbers of the six great ape species across nine Y ampliconic gene families. The gene families are separated into individual plots with the gene family name at the top. Within each plot, the X-axis represents six species (bonobo, chimpanzee, human, gorilla, Bornean orangutan, and Sumatran orangutan), and the Y-axis represents copy number. The black dot within each boxplot is the median value per species.

Most ampliconic gene families were more copious in orangutans than in other species (Figs. 2-3). For example, the PRY and XKRY gene families each had at least eight copies in orangutans (Bornean orangutan: eight copies for PRY and 15 copies for XKRY; Sumatran orangutan: 10 copies for PRY and 22 copies for XKRY)—in contrast, in Homininae these gene families were either lost (in bonobo and chimpanzee) or had a median size of only two copies (in human and gorilla). Also, each of the BPY2, CDY, and DAZ gene families had more than twice the number of gene copies in orangutans than that found in other great ape species.

Gene families lost in some species (HSFY, PRY, VCY, and XKRY) usually had few copies and low variation in the closely related species. For instance, the HSFY, PRY, and XKRY gene families were pseudogenized in bonobo and chimpanzee, and human had a low copy number (on average two copies) for these gene families (Figs. 2-3). Similarly, the VCY gene family was lost in the majority of great ape species, except for chimpanzee and human, in which it had a low copy number (on average two copies; Figs. 2-3).

Copy number differences between recently diverged species. As an initial investigation into how quickly Y ampliconic gene families evolve, we tested whether copy numbers for individual gene families differed significantly between recently diverged, sister species. Two pairs of closely related species were included in this comparison: chimpanzee and bonobo, separated ~0.77-1.8 MYA (Hey, 2010; Yu et al., 2003), and Sumatran and Bornean orangutans, separated ~0.4 MYA (Locke et al., 2011). Five gene families were tested in bonobo vs. chimpanzee, and eight in Sumatran vs. Bornean orangutans (Fig. 1A), with a permutation test in which we compared the mean copy number difference between the two species (permuting species labels with one million permutations; bonobo vs. chimpanzee Bonferroni-corrected p-value cutoff of 0.05/5 = 0.01; Sumatran vs. Bornean orangutan Bonferroni-corrected p-value cutoff of 0.05/8 = 0.00625). Between bonobo and chimpanzee, each of the five gene families tested exhibited a significant difference in its copy number (p-values of 1.67×10⁻³, <10⁻⁶, <10⁻⁶, <10⁻⁶, and <10⁻⁶ for the BPY2, CDY, DAZ, RBMY, and TSPY gene families, respectively). In contrast, between the two orangutan species, we only found a significant difference in copy number for XKRY (p-value=1.32×10⁻³; see Table S2 for p-values for the other gene families). Thus, significant differences in Y ampliconic gene copy number do exist, even between closely related species.
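The label-permutation test and Bonferroni cutoffs described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in this study: the two-sided form of the statistic, the default of 10,000 permutations (the study used one million), the function names, and the copy number values are all hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for a difference in mean copy number."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign species labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            hits += 1
    return hits / n_perm


def bonferroni_cutoff(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# Hypothetical copy numbers for one gene family in two sister species
species_a = [48.1, 50.3, 46.9, 49.2, 51.0, 47.4, 52.2]
species_b = [18.2, 17.5, 19.1, 18.0, 20.3, 16.8, 18.4, 17.9, 19.0]
p_value = permutation_test(species_a, species_b)
print(p_value, p_value < bonferroni_cutoff(0.05, 5))
```

With this estimator the smallest p-value that can be reported is bounded by the number of permutations, which is why a result with no permuted difference as extreme as the observed one is stated as an upper bound (for example, <10⁻⁶ for one million permutations).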

Figure 3-3. Larger Y ampliconic gene families are more variable across great apes. The six scatter plots represent the relationship between median copy number and variance for each of the species, and the species name is present at the top of each plot. The X-axis represents the natural logarithm of median copy number and the Y-axis represents the natural logarithm of variance in copy number. The Spearman correlations were calculated using the cor.test() function in R and the p-values are in parentheses. The black line represents the linear function fitted to the given data points. The dots are color-coded to represent the nine gene families, with missing dots indicating gene family absence in that species.

Evolution of copy number across species. Building upon this observation, we tested whether copy number in Y ampliconic gene families is conserved across great ape species and identified species with significant gain or loss of gene copies. Simple ANOVA performed on all the copy number values showed a significant difference in gene copy number for each gene family analyzed across great apes except for VCY (Table 1). We next tested whether these differences were still observed after taking the phylogenetic relationship among great apes into consideration. For this purpose, we used CAFE (Han, Thomas, Lugo-Martinez, & Hahn, 2013), a tool that implements a stochastic birth-and-death process to model the expansion and contraction of gene family sizes over a phylogeny, and ran it with the Y-chromosome-specific phylogenetic tree (see Materials and Methods) and the median copy number per species for each of the families as input. We performed simulations to validate the use of CAFE for the given dataset (see Supplementary Note 1). CAFE estimated the rate of birth/death (λ) of ampliconic genes as 5.03×10⁻⁵ events per thousand years.
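To give an intuition for the kind of birth-and-death process mentioned above, the sketch below simulates the size of one gene family along a single branch. It is not CAFE and does not reproduce its likelihood calculations. It assumes that each gene copy is duplicated and lost at the same rate λ, that λ is expressed per copy per thousand years, and that branch length is given in thousands of years; the function name and the example inputs are hypothetical.

```python
import random


def simulate_family_size(n0, lam, branch_length, rng):
    """Simulate gene family size at the end of a branch (Gillespie algorithm).

    n0: copy number at the ancestral node
    lam: per-copy rate of gain, assumed equal to the per-copy rate of loss
    branch_length: branch length in the same time unit as lam
    """
    if lam <= 0:
        return n0
    n, elapsed = n0, 0.0
    while n > 0:  # a family with zero copies cannot be regained in this model
        # Waiting time to the next event; total rate is 2 * lam * n
        elapsed += rng.expovariate(2.0 * lam * n)
        if elapsed > branch_length:
            break
        n += 1 if rng.random() < 0.5 else -1  # gain and loss equally likely
    return n


rng = random.Random(42)
# Hypothetical branch: 30 ancestral copies evolving for 1,000 thousand years
sizes = [simulate_family_size(30, 5.03e-5, 1000.0, rng) for _ in range(1000)]
print(min(sizes), max(sizes))
```

Under these assumptions the expected family size stays at its ancestral value, so a lineage whose observed copy number lies far in the tail of the simulated distribution is the kind of case that a model-based test would flag as a significant expansion or contraction.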

CAFE predicted that two of the nine Y ampliconic gene families tested had a significant expansion or contraction in their size (RBMY, p=0.001; and XKRY, p=0.001; Table 1 and Fig. 4; Bonferroni-corrected p-value cutoff of 0.05/9=0.006; nine gene families) and two additional gene families had low p-values even if non-significant after correcting for multiple testing (CDY, p=0.018; and TSPY, p=0.009; Table 1 and Fig. 4). For these four gene families, CAFE also provided p-values for each branch of the phylogenetic tree with a significant gain or loss of gene copies when compared to its immediate ancestral node (Bonferroni-corrected p-value cutoff of 0.05/10=0.005; 10 nodes in great ape phylogenetic tree). Three interesting patterns emerged from this analysis (Fig. 4 and Table S3). First, the TSPY gene family, which had consistently high variation in copy number across great apes (Figs. 2-3), had the largest number of branches with significant differences in copy number across the phylogeny. We observed significant lineage-specific reductions in its family size in chimpanzee (from 30 copies inferred in the immediate ancestral node to 18 copies in chimpanzee, p=1.39×10⁻⁹), gorilla (from 21 to six copies, p=5.07×10⁻⁵), and Sumatran orangutan (from 27 to 23 copies, p=1.01×10⁻³), and significant expansions in Bornean orangutan (from 27 to 32 copies, p=2.68×10⁻⁴) and bonobo (from 30 to 48 copies, p=3.89×10⁻⁷).

Table 3-1. Differences in Y ampliconic gene copy numbers across species as evaluated with ANOVA and CAFE. To determine which ampliconic gene families vary in their copy number across great ape species, we performed conventional one-way ANOVA (F-statistic and p-value are shown). To identify significant expansions or contractions of gene family size across great apes, we performed CAFE analysis. Significant p-values (Bonferroni-corrected p-value cutoff of 0.05/9=0.006; nine gene families) are in bold.

Gene family    ANOVA F-value    ANOVA p-value    CAFE p-value
BPY2           80.84            1.26×10⁻²¹       0.345
CDY            178.23           6.52×10⁻²⁹       0.014
DAZ            366.80           7.51×10⁻³⁶       0.331
HSFY           26.74            7.55×10⁻⁹        0.488
PRY            357.70           1.11×10⁻²⁴       0.209
RBMY           162.17           5.08×10⁻²⁸       0.001
TSPY           82.98            7.38×10⁻²²       0.008
VCY*           1.19             0.290            0.187
XKRY           481.47           1.08×10⁻²⁶       0.001

*For VCY, power is limited because we only used the data from two species (chimpanzee and human). This gene family is absent in the other great ape species analyzed (see text for details).

Second, two gene families (CDY and XKRY) showed significant expansions in the branch leading to the two orangutan species, and one of them (XKRY) also exhibited significant differences between the two orangutan species. In the case of CDY, the node connecting the two orangutan species gained a significant number of copies when compared to the common ancestor of great apes (from 15 to 35 copies, p=1.86×10⁻³). In the XKRY gene family, there was also a significant gain in gene copies in the common ancestor of Bornean and Sumatran orangutans when compared to the common ancestor of great apes (from six to 18 copies, p=4.31×10⁻³). Additionally, Bornean orangutan lost gene copies (from 18 to 15 copies, p=2.86×10⁻³) and Sumatran orangutan gained gene copies (from 18 to 22 copies, p=6.46×10⁻⁴) when compared to their common ancestor. And third, intriguingly, in the RBMY gene family, bonobo gained gene copies (from 17 to 29 copies, p=2.02×10⁻⁷), while chimpanzee lost gene copies (from 17 to 11 copies, p=9.61×10⁻⁴), when compared to their common ancestor.

Figure 3-4. Results of CAFE analysis identifying Y ampliconic gene families with significant shifts in gene copy number when compared to their ancestors. For each gene family with a significant difference in copy number, the phylogenetic tree representing the estimated copy number at internal nodes is shown. Significant shifts are highlighted in blue (contraction) and red (expansion). The copy numbers at the internal nodes were predicted by CAFE.

To evaluate the influence of intraspecific variation on our analysis of gene copy number evolution, we ran CAFE while using the data for five randomly subsampled individuals (instead of a single summary value, i.e. median) per species (an approximately star phylogeny among individuals of the same species was assumed, see Materials and Methods). The sample size of five here corresponds to the smallest sample size we have per species (for Sumatran orangutan). This procedure was performed 100 times. We observed that differences in chimpanzee, bonobo and gorilla lineages were still significant most of the time (in 83-100 times out of 100, Table S4). In the case of orangutans, the shift in CDY copy number was supported in 100 out of 100 replicates, and in the TSPY and XKRY gene families, the shift was supported in 12-68 out of 100 replicates (Table S4). Thus, intraspecific variability in the studied samples does not affect the robustness of our results, except for the orangutan-specific observations for the TSPY and XKRY families.

Conservation of Y ampliconic gene expression in great apes

Expression levels of Y ampliconic gene families. To study evolution of Y ampliconic gene expression, we evaluated expression levels for these gene families in great ape testis samples. We assembled complete or partial reference Y ampliconic gene transcripts using RNA-Seq datasets that were either publicly available or generated by us (Dataset S1) and used them to estimate the ampliconic gene expression across great apes (Table S5; see Materials and Methods). Namely, reference transcript data for bonobo were generated using RNA-Seq datasets produced in house, and for Bornean orangutan using a mix of publicly available and in-house RNA-Seq datasets, whereas such data for other species were retrieved from publications (Ardlie et al., 2015; Brawand et al., 2011; Carelli et al., 2016; Fungtammasan et al., 2016; Ruiz-Orera et al., 2015; Tomaszkiewicz et al., 2016). The endangered status of great ape species poses a particular challenge for collection of tissues from these animals. Because of this hurdle, in this study we were able to include only two to three sampled individuals per species (Fig. 5; we excluded the results for Sumatran orangutan because only one sample was available for this species; see Materials and Methods).

Our analysis of Y ampliconic gene expression levels led to the following observations (Fig. 5). The BPY2, PRY and XKRY gene families had consistently low expression levels across great apes. In contrast, the TSPY and HSFY gene families had comparatively high expression levels across species, and intra- and interspecific variation in expression levels was also high for these two gene families. In comparison, the CDY, DAZ and RBMY gene families had intermediate expression levels and limited intra- and interspecific variation. Surprisingly, in chimpanzee and human, the expression levels for the VCY family were higher than those for the BPY2 and CDY gene families, even though the VCY family was lost in the majority of great apes, whereas the BPY2 and CDY gene families were conserved across great apes (Fig. 1A).

Figure 3-5. Summary of gene expression levels across great apes. In the dot plot below, the X-axis represents nine ampliconic gene families and the Y-axis represents their expression levels. The plot represents testis-specific expression of 12 great ape samples. Each dot within a gene family represents expression levels of an individual and the color of the dot denotes the species it belongs to. Missing dots represent gene families that are considered missing or pseudogenized, and their expression levels are excluded from the gene expression analysis (Table S5).

Evolution of gene expression. Using this dataset (Fig. 5, Table S5), we next evaluated whether expression levels of the Y ampliconic gene families were conserved across great ape species. To examine this, we performed phylogenetic ANOVA, which conducts an ANOVA-like test while taking the phylogenetic relationship of great apes into consideration (see Materials and Methods). This test was carried out for five gene families (BPY2, CDY, DAZ, RBMY, and TSPY), which are present in all the great ape species analyzed (Fig. 1A). Phylogenetic ANOVA was performed by applying the EVE model (Rohlfs & Nielsen, 2015) separately to each of the five Y ampliconic gene families, and identified that all five of them had conserved expression, i.e. with no branches experiencing significant speedup or slowdown in expression evolution, across great apes (Table S6). To test the validity of our conclusions given the small sample size and particular gene copy numbers, we performed simulations for different parameters under the EVE model and generated gene expression levels with the sample sizes identical to those in our study (Supplemental Note S2). In 95 out of 100 simulations, we were able to predict conservation of gene expression correctly.

The relationship between copy number and gene expression levels

We studied the relationship between Y ampliconic gene copy number and expression levels across five great ape species (with Sumatran orangutan excluded due to the lack of multiple testis samples). When we analyzed the median copy number of each family from our copy number dataset (Fig. 2) together with the median gene expression levels of these families from our gene expression dataset (Fig. 5), we observed that the correlation between them for the majority of species was positive, consistent with previous results in humans (Vegesna et al., 2019). However, there were also some differences (Fig. 6). In bonobo and chimpanzee, we identified a strong positive correlation between gene copy number and expression levels across gene families (bonobo: Spearman correlation rho=0.9, p=0.083; chimpanzee: rho=0.94, p=0.017). In human and gorilla, we also observed a positive correlation, but it was weaker (human: rho=0.58, p=0.108; gorilla: rho=0.59, p=0.126). In the case of Bornean orangutan, we did not observe a positive correlation (rho=-0.05, p=0.93), and one of the reasons that might explain this finding is the high variation in Y ampliconic gene copy number in this species (Fig. 3).
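With only five or six gene families per species, the attainable p-values of a Spearman test are coarse, which helps explain why a correlation as high as rho=0.9 is not significant. The sketch below computes an exact two-sided Spearman p-value by enumerating all rank permutations. It assumes no tied values and assumes that exact p-values were used (as R's cor.test() can report for small samples without ties); the function names and input values are hypothetical.

```python
from itertools import permutations


def ranks(values):
    """Ranks 1..n of the values; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def exact_spearman(x, y):
    """Spearman rho and its exact two-sided p-value (feasible for small n)."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    s_obs = sum((a - b) ** 2 for a, b in zip(rx, ry))
    centre = n * (n * n - 1) // 6  # value of sum(d^2) at which rho = 0
    rho = 1.0 - s_obs / centre
    base = list(range(1, n + 1))
    extreme = total = 0
    for perm in permutations(base):
        s = sum((a - b) ** 2 for a, b in zip(base, perm))
        total += 1
        if abs(s - centre) >= abs(s_obs - centre):  # |rho| at least as large
            extreme += 1
    return rho, extreme / total


# Hypothetical median copy numbers and expression levels for five families
copy_number = [2.0, 4.0, 6.0, 29.0, 48.0]
expression = [1.5, 3.0, 2.4, 20.0, 55.0]
print(exact_spearman(copy_number, expression))
```

Under these assumptions, five untied points whose rank orders differ by a single adjacent swap give rho=0.9 with p=10/120≈0.083, matching the value reported for bonobo, and the smallest two-sided p-value attainable with five points is 2/120≈0.017.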

Next, we studied the relationship between copy number and gene expression separately for each gene family (Fig. 7). Previous studies in humans showed that within a Y ampliconic gene family the variation in gene copy number does not correlate with their gene expression (Vegesna et al., 2019). Here we tested whether the longer divergence times between great ape species enabled gene copy number to influence expression levels. We observed that, across species, there was a strong (but statistically non-significant) positive correlation of gene expression level with gene count for DAZ (rho=0.9, p=0.083) and TSPY (rho=0.9, p=0.083), and a moderate and non-significant correlation for RBMY (rho=0.6, p=0.35). There was no such trend observed for BPY2 (rho=0.2, p=0.783) with all species having similar expression levels except for gorilla, which had comparatively low expression levels. A negative but non-significant relationship was observed for CDY (rho=-0.5, p=0.45), with chimpanzee and human having higher expression levels but fewer gene copies in comparison to gorilla and Bornean orangutan. In general, we might be lacking power to detect significant associations between gene expression levels and copy number of Y ampliconic genes because of the small sample size. However, we did observe a trend of copy number influencing gene expression levels in three out of five gene families tested.

Figure 3-6. Relationship between copy number and gene expression of Y ampliconic gene families in great ape species. The five scatter plots represent the relationship between expression and copy number for each of the five species, with the name of the species present at the top of each plot. In each of the scatter plots the X-axis represents the natural logarithm of median copy number and the Y-axis represents the natural logarithm of median gene expression. The Spearman correlations were calculated using the cor.test() function in R and the p-values are in parentheses. The black line is the linear function fitted to the given data points. The dots are color-coded to represent the nine gene families, with missing dots corresponding to the gene families that are pseudogenized, deleted, or not expressed in that species.

Figure 3-7. Relationship between copy number and gene expression across species. In each of the scatter plots the X-axis represents the natural logarithm of median copy number and the Y-axis represents the natural logarithm of median gene expression. The Spearman correlations were calculated using the cor.test() function in R and the p-values are in parentheses. The black line represents the linear function fitted to the given data points. The dots are color-coded to represent the five species. The five scatter plots represent the relationship between expression and copy number for each of the five gene families, with the name of the gene family present at the top of each plot. Only the gene families that are present in all species are shown here.

Y ampliconic gene copy number variation and phenotypes related to sperm competition

We next tested whether the differences observed in copy number were linked to presence/absence or degree of sperm competition in great apes. Variation in the overall copy number of all Y ampliconic genes in different great ape species suggested a possible association: monoandrous gorillas had the lowest copy number and its variance, whereas great apes with polyandrous or dispersed mating patterns (e.g., chimpanzees, bonobos, and orangutans) had higher copy number and variance (Table S1). We examined a potential association between copy number variation in four Y ampliconic gene families showing significant differences among species (CDY, RBMY, XKRY, and TSPY) and four different phenotypes related to sperm competition: residual testis weight (Dixson & Anderson, 2004), sperm midpiece volume (Anderson, Nyholt, & Dixson, 2005), sperm concentration (Møller, 1988), and sperm motility (Møller, 1988). While the small number of species examined precluded us from examining associations statistically, the overall trends could be examined. Variation in the four sperm phenotypes was in line with expectations related to the degree of sperm competition in the studied species (Fig. S4). Among the four gene families examined (Fig. S5), RBMY and TSPY exhibited copy number variation that might be associated with sperm competition. It has been demonstrated that RBMY copy number is positively associated with sperm motility in humans (Yan et al., 2017) (although this finding was recently disputed (Shi, Louzada, et al., 2019)). Across great apes, we observed that bonobo, the species with the highest sperm motility, concentration, and volume, had the highest copy number (and its variance) of RBMY genes (Figs. S3-S5). In contrast, orangutans, with the lowest sperm concentration and volume and one of the lowest levels of sperm motility, had the lowest copy number and its variance of RBMY genes (Figs. S3-S5). Both chimpanzees and gorillas, with relatively low sperm motility, also displayed low RBMY copy numbers. Thus, RBMY copy number variation might be associated with sperm motility, a sperm-competition-related phenotype, in great apes. TSPY copy number was shown to be positively associated with sperm concentration, one of the sperm-competition-related phenotypes, in humans (Giachini et al., 2009). Though a clear association with sperm concentration was not detected, we found that TSPY had low copy number (and its variance) in gorilla, the species with minimal sperm competition, and higher copy number in the other species studied (Fig. S3), which have higher levels of sperm competition. Importantly, in bonobo, the species with the highest sperm concentration, we observed the highest number of TSPY gene copies (Figs. S4-S5). Thus, TSPY copy number variation might be associated with sperm competition in great apes.

Discussion

To study evolution of multi-copy gene families on the Y chromosome in the extant great apes, we analyzed copy number and expression levels of Y ampliconic genes across six great ape species. For the first time, we estimated variation in copy number of Y ampliconic gene families in a large sample of bonobos (previously only two individuals were compared (Oetjens et al., 2016)) and in two species of orangutan. We also generated testis expression data for Y ampliconic gene families for bonobo and Bornean orangutans, thus providing unique datasets that were previously missing from publicly available resources. Combining these new and previously published data, we investigated the evolution of Y ampliconic gene copy number and of their expression levels, as well as the evolutionary relationship between them, and examined whether the observed variation in copy number could explain phenotypes related to sperm competition and mating structure. Some of our tests showed strong trends but lacked statistical significance, likely because they were underpowered due to a small sample size. Increasing the sample size for this type of study is currently extremely challenging because of the limited availability of samples from endangered species.

Loss of some Y ampliconic gene families in all non-human great ape species examined. Combining data from this and other studies (Cortez et al., 2014; Hughes et al., 2010; Oetjens et al., 2016; Skaletsky et al., 2003; Tomaszkiewicz et al., 2016), we have demonstrated that, compared with human, all other great ape species examined lack one (VCY missing in gorilla and both species of orangutan) or several (HSFY, PRY, and XKRY pseudogenized in bonobo and chimpanzee) Y ampliconic gene families (Fig. 1A). We discovered that gene families with low copy number in some species are frequently lost or pseudogenized in other species. For instance, we found that PRY and XKRY, which were pseudogenized in bonobo and chimpanzee, have low copy number in other great apes examined. Similarly, a recent study from our group showed that Y ampliconic gene families with low copy numbers in humans (Vegesna et al., 2019) were pseudogenized or lost in some great ape species (HSFY, PRY, VCY, and XKRY) due to the lack of recombination, Muller’s ratchet, and/or non-allelic homologous recombination (NAHR) on the Y. Regardless of the mechanism, it is a fact that not all human Y ampliconic gene families are essential in all great ape species.

Our study indicates that low expression levels could also be an important predictor of gene family’s non-essentiality on the Y. Consistent with this hypothesis, lowly expressed PRY and XKRY gene families were pseudogenized in chimpanzee and bonobo. However, gene families such as VCY and HSFY, which were also lost in several great ape species, had relatively high expression levels. VCY and its homolog VCX, which is present on the X chromosome, are highly similar in sequence (at least in humans), and thus the loss of VCY in some species could be compensated by VCX (Vegesna et al., 2019). HSFY could have undergone neofunctionalization in humans (Vegesna et al., 2019). Therefore, in other great apes who lost HSFY, its homologs could also compensate for its ancestral function. Expression of paralogous genes present on other chromosomes should be studied together with expression of Y ampliconic gene families in future studies.


Y ampliconic gene families: Life on palindromes and tandem repeats. We observed a positive relationship between copy number and its variance for Y ampliconic gene families in great apes. This finding echoes recent studies in humans (Vegesna et al., 2019; Ye et al., 2018) and points toward a similar organization of these gene families in repeats across great ape species and in their common ancestor. In human and chimpanzee, most Y ampliconic gene families (except for TSPY, which is organized as a tandem repeat) are located in palindromes (Hughes et al., 2010; Skaletsky et al., 2003). Palindromes also exist on the gorilla Y chromosome (Tomaszkiewicz et al., 2016). The high copy number variation in ampliconic genes in bonobo and orangutans found here suggests that their Y chromosomes also have a repetitive structure, confirming cytogenetic findings (Gläser et al., 1998).

The presence of palindromes on the Y chromosome in great apes enables frequent rearrangements via NAHR and gene conversion among different palindrome arms, leading to the observed high variation within species. Gene conversion leads to conservation of gene sequences and their rescue from accumulation of deleterious mutations. A study of palindrome P8 in healthy humans showed diverse palindromic structures carrying from one to four copies of VCY (Shi, Massaia, et al., 2019). Large-scale chromosomal rearrangements also contribute to the dynamic copy number evolution of Y ampliconic gene families across great apes, as shown by previous cytogenetic analyses (Gläser et al., 1998; Repping et al., 2006; Shi, Louzada, et al., 2019), as well as by an example in the next paragraph.

Interesting patterns were observed for median copy numbers for each of the five Y ampliconic gene families present in bonobo and chimpanzee (Fig. 3). For low-copy-number families, we found a total of six gene copies (one BPY2, three CDY, and two DAZ gene copies) in bonobo. For the same gene families in chimpanzee, six gene copies (one BPY2, three CDY, and two DAZ gene copies) are present on three palindromes of the short arm and five gene copies (one BPY2, two CDY, and two DAZ genes) are located on the three palindromes of the long arm (Hughes et al., 2010). The differences in copy number between bonobo and chimpanzee are consistent with a deletion of the three palindromes bearing five gene copies on the long Y arm in the bonobo lineage after its divergence from the bonobo-chimpanzee common ancestor. This hypothesis is strengthened by the results of a cytogenetic study that mapped the CDY and DAZ gene families to the short arm of the bonobo Y, but to both short and long arms of the chimpanzee Y (Schaller et al., 2010). This pattern was in contrast to that observed for high-copy-number gene families, TSPY and RBMY (Fig. 3). We found that bonobo Y had approximately three times more TSPY and RBMY gene copies than the chimpanzee Y (Table S1). Consistent with this finding, a cytogenetic study reported high amplification of RBMY and TSPY gene families via segmental duplications of a large euchromatic segment in bonobo (Gläser et al., 1998).

Evolutionary forces affecting copy number variation among great apes. Our test of conservation of gene copy number across great ape species indicated significant differences in copy numbers of CDY, RBMY, TSPY, and XKRY. Chimpanzee had lost, whereas bonobo had gained, TSPY and RBMY gene copies when compared to their common ancestor. In the case of gorilla, we observed a significant loss of TSPY gene copies. Additionally, interspecific differences in copy number in our dataset were correlated with interspecific differences in copy number variance (Fig. 3). What can explain the differences observed among species?

Whereas a detailed analysis is outside the scope of our study, we can speculate about the evolutionary forces driving copy number variation of Y ampliconic gene families in great apes. If a certain range of copy number were beneficial, then directional selection would limit variation in copy number even for large gene families. We did not observe such a pattern. In contrast, if variability in copy numbers were beneficial, then diversifying selection would enhance variation in copy number even for small gene families. This pattern was also not found. Instead we observed that variance in copy number was approximately proportional to the size of gene family (Fig. 3), suggesting that larger families have more opportunities for rearrangements resulting in high copy number variability. Thus, our data overall are consistent with random genetic drift resulting from frequent copy number changes of repetitive regions being the major driver of Y ampliconic gene families’ evolution. A similar conclusion was reached when a larger dataset of humans representing multiple haplogroups was studied (Ye et al., 2018). Nevertheless, selection might be operating within certain great ape species, or at particular gene families. Note that high variability observed for copious gene families might mask signatures of diversifying selection, and vice versa, low variability observed for gene families with low copy number might mask signatures of directional selection.


Y ampliconic gene copy number and sperm competition and morphology. Sperm competition could be the selective force behind Y ampliconic gene number evolution. We examined a potential link between variation of Y ampliconic genes and levels of sperm competition across great apes using four sperm phenotypes positively associated with sperm competition—residual testis weight, sperm midpiece volume, sperm concentration, and sperm motility. The higher the number of copulations a female has with males with large testes, the larger is the sperm midpiece, where the mitochondria reside and fuel sperm motility (Dixson & Anderson, 2004; Møller, 1988). Also, polyandrous primates evolved to produce more sperm that swim faster, e.g. chimpanzees produce 223 times more sperm than gorilla and 14 times more sperm than orangutan (Fujii-Hanamoto, Matsubayashi, Nakano, Kusunoki, & Enomoto, 2011; Nascimento et al., 2008).

Our results suggest that sperm competition might contribute to determining TSPY copy number evolution across great apes. TSPY exhibited copy number variation that might be associated with sperm competition: its copy number was the lowest in gorilla (who has minimal sperm competition), and was highest in bonobo and chimpanzee experiencing high levels of sperm competition; the copy number was intermediate in human (orangutan was an outlier). We also hypothesize that potential selection for sperm motility, a sperm-related phenotype, is important for shaping RBMY copy number variation, as its copy number variation followed a similar trend to the one observed for TSPY (Figs. S4-S5). Four of the five ampliconic gene families shared among great apes (Fig. 1A), namely BPY2, CDY, DAZ, and RBMY, are located within two azoospermia factor regions (AZFb and AZFc). Mutations or deletions of AZFs result in spermatogenic failure phenotypes, including abnormal sperm morphology, in humans (Choi et al., 2007; de Vries et al., 2001; Gläser et al., 1998; Lu et al., 2014; Vogt et al., 1996; Yan et al., 2017). Thus, copy number of these four gene families might contribute to determining sperm morphology. Consistent with this hypothesis, gorillas and humans have the lowest copy number for these gene families (Table S7) and the highest rate of sperm abnormalities (Seuánez, 1980), whereas both species of orangutans have the highest copy number (Table S7) and the lowest rate of sperm pleomorphism (Seuánez, 1980).

Y ampliconic gene copy number and mating patterns in great apes. In species with female dominance, such as bonobo, dominant females pick their mates and enable their male progeny to access other females, which benefits spreading the genes on the X chromosome rather than the Y chromosome. In contrast, in species with male dominance, such as chimpanzee, males have to maintain the hierarchy so they have more pressure to conserve the Y chromosome. We hypothesize that the reduced burden on the Y chromosome in species with female dominance results in a smaller and less conserved Y ampliconic region than in species with male dominance. If our prediction is correct, then we expect the chimpanzee Y chromosome to retain more gene families and to be more conserved, i.e. have lower variance in copy number. Though the number of Y ampliconic gene families was the same for chimpanzee and bonobo (Fig. 1A), chimpanzee exhibited lower variance in their overall copy number than bonobo (Table S1), consistent with the predictions resulting from male- vs. female-dominant social structures. However, this result might be affected by the fact that only one chimpanzee subspecies, western chimpanzee, known to have low diversity on the Y (Hallast et al., 2016), was included in the present study.

In bonobo, from the five gene families studied, we observed an increase in copy number in two gene families (TSPY and RBMY) and a decrease in gene copy number in the other three gene families (BPY2, CDY, and DAZ). The loss of gene copies in BPY2, CDY, and DAZ could have been facilitated by the low selective pressure to maintain these genes on the bonobo Y due to female dominance. The lower selective pressure on the Y could have enabled the loss of gene copies in these three families despite their higher gene count in their closest relative, chimpanzee (Table S1).

In the case of the two orangutan species studied, they have diverged more recently and their mating and social patterns remain similar (Delgado & Van Schaik, 2000). In fact, hybrids between Sumatran and Bornean orangutans are fertile (Gläser et al., 1998). As a result, though we observed differences in copy number between the two species for the TSPY and XKRY gene families (Fig. 4), these differences were unstable when we considered variation among individuals (and not a median value, Table S4). High variation in copy number in orangutans (Figs. 3 and S3) could be explained by the fact that their common ancestor experienced a significant gain in gene copy number for several gene families (CDY and XKRY; Fig. 4), and thus there are more opportunities for different copy numbers to be created and be subjected to random genetic drift.


Evolution of gene expression, and the relationship between gene expression and copy number. Our analysis suggests that the Y ampliconic gene families present in all great ape species studied exhibit limited interspecific variation in gene expression, and this variation was not significant with our EVE model analysis, either due to the small sample size or reflecting overall conserved expression levels for these gene families. A larger study is needed to distinguish between these two possibilities. If interspecific variation is indicated in such a study, then it would echo an earlier study, in which genome-wide variation in gene expression across different tissues in human and chimpanzee was investigated, and genes with high intraspecific variation were found to exhibit high interspecific variation, particularly in testis (Khaitovich, Enard, Lachmann, & Pääbo, 2006; Khaitovich et al., 2005). If, in contrast, conserved expression levels are confirmed, then DNA methylation might play an important role in dosage regulation of duplicated genes (Chang & Liao, 2012), and future studies should investigate the upstream regions of Y ampliconic genes for DNA methylation patterns.

In humans, testis tissue tolerates high variation in gene expression levels and undergoes dosage regulation to maintain overall conserved gene expression in the presence of gene copy number variation (Vegesna et al., 2019). However, across species, we observed a mixed pattern in the relationship between gene expression levels and copy number: DAZ, RBMY, and TSPY showed a positive correlation, BPY2 displayed no association, and CDY had a negative correlation (Fig. 7). These results imply that, given enough evolutionary time, copy number could influence gene expression levels in some ampliconic gene families.

We observed a positive correlation between copy number and gene expression across gene families in each great ape species but Bornean orangutan. This correlation was stronger in bonobo and chimpanzee, species experiencing high levels of sperm competition, suggesting that in these species it can be particularly important biologically. In general, evolution of copy number and gene expression, although related to each other, might follow different time scales. Our results suggest that evolution of copy number is faster than evolution of gene expression. Rapid, back-and-forth changes in copy number for Y ampliconic genes eventually influence the direction in which gene expression levels shift over longer periods of time. Future studies should decipher isoform sequences of Y ampliconic genes, as well as of their X-chromosomal and autosomal counterparts, and analyze their differential expression, and thus will examine the dynamic evolution of male fertility genes in greater detail.

It is important to note that factors such as age and cellular composition of the testis tissues could influence the estimated expression levels of the Y ampliconic gene families. These factors have to be addressed in future studies. Also, further refinements of reference transcriptomes for each species will aid in obtaining more accurate estimates of expression levels.

Y ampliconic genes and conservation genetics. Copy number variation of Y ampliconic gene families can be used to trace male dispersal and could ultimately aid conservation efforts targeted at preserving endangered great ape species. Indeed, Y chromosome markers provide a potential tool for evaluating historical patterns of male migration in great apes, similar to approaches applied to humans (Hammer et al., 1998; Underhill et al., 2000). Differential dispersal of males vs. females influences population structure at local scales and across landscapes (Underhill & Kivisild, 2007), and becomes critically relevant to reserve and corridor design for long-term conservation of wild populations of great apes. Studies using uniparentally-inherited sex-specific markers can assist in the identification of genetically distinct populations or even species, which should be subsequently preserved as reservoirs of genetic diversity. For example, a recent study of evolutionary history of orangutans using sequences of mtDNA and Y chromosome allowed the identification of a new orangutan species, Pongo tapanuliensis (Nater et al., 2017). Here we presented the first study exploring variation in copy number and expression of Y ampliconic genes across most great ape species (only omitting Tapanuli orangutan, which was recently discovered). To evaluate copy number of Y ampliconic genes, we used ddPCR assays, which were previously demonstrated to be highly accurate and reproducible (C. M. Hindson et al., 2013; Taylor, Laperriere, & Germain, 2017; Vegesna et al., 2019). We presented ampliconic gene copy number variation in bonobos and orangutans for the first time, and showed that orangutans have the highest copy number and the highest variation in copy number across great apes. We observed significant differences in copy number in four out of nine Y ampliconic gene families. 
To obtain the gene expression dataset, we assembled transcripts and estimated expression levels of Y ampliconic genes using testis-specific RNA-Seq datasets that were either publicly available or generated in house. The analysis of this dataset indicated conserved evolution, i.e. none of the Y ampliconic gene families had significant shifts in their expression levels between species, despite substantial and significant variation in their copy number. We observed a positive correlation between copy number and expression levels for the DAZ, RBMY, and TSPY gene families, in contrast to the results in human, where such correlation was not observed for any Y ampliconic gene families (Vegesna et al., 2019). Thus, copy number can influence gene expression given sufficient evolutionary time. We showed that variation in copy number for the TSPY and RBMY families might be associated with interspecific differences in sperm competition across great apes. More specifically, we observed that bonobo, a species with the highest sperm competition, concentration, and motility, had the highest copy number and variance of copy number of the RBMY and TSPY gene families, which were shown to be associated with human sperm motility and concentration, respectively, in other studies (Giachini et al., 2009; Yan et al., 2017). Our results have important implications for understanding Y chromosome evolution in endangered great apes and should aid conservation efforts aimed at restoring their genetic diversity.

Materials and Methods

DNA samples

DNA samples (Table S8) from seven bonobos, six Bornean orangutans, and four Sumatran orangutans, as well as blood samples from an additional Bornean orangutan (KB5405) and an additional Sumatran orangutan (KB5565) were provided by the San Diego Zoological Society. We extracted DNA from the latter two samples using the DNeasy Blood and Tissue Kit (Qiagen). DNA samples from nine western chimpanzees were provided by Mark Shriver at Pennsylvania State University.

ddPCR assays for ampliconic gene copy number estimation in bonobo and orangutan

The PCR protocol and primers for EvaGreen-based ddPCR assays were designed according to the parameters specified in (Tomaszkiewicz et al., 2016), using great ape species-specific sequences of a two-copy RPP30 and a single-copy SRY as references. BWA-MEM alignments (version 0.7.10) (H. Li, 2013) of raw Illumina reads from several male orangutan and bonobo datasets (RNA-Seq datasets present in Table S9, whole-genome-sequence datasets listed below) to the reference gene sequences were visualized in Integrative Genomics Viewer (IGV) (version 2.3.72) (Thorvaldsdóttir, Robinson, & Mesirov, 2013) and consensus sequences were retrieved in order to design the primers. The primers for evaluating the copy number of BPY2, CDY, DAZ, HSFY, PRY, RBMY, and TSPY gene families in Bornean and Sumatran orangutans were designed in the protein-coding regions using the gene sequences previously published for Sumatran orangutan (Cortez et al., 2014) and RNA-seq datasets generated in-house for the Bornean orangutan (Dataset S2 and see below). Primers for orangutan XKRY (Dataset S2) were designed using the published Sumatran orangutan whole-genome sequencing dataset (SRR10393305, without gene annotation) and their male-specific presence in genomic DNA of both orangutan species was confirmed by PCR. Primers for SRY were designed using a previously published Sumatran orangutan gene sequence (Cortez et al., 2014), and primers for RPP30 were designed using the Sumatran orangutan reference female genome (ponAbe3). To estimate the copy number of the BPY2, CDY, DAZ, RBMY, and TSPY gene families in bonobo and chimpanzee, we used previously published primers for gorilla and human (Tomaszkiewicz et al., 2016) that identically matched the chimpanzee Y-specific sequence, and the publicly available whole-genome male bonobo dataset (SRR740905). The VCY and SRY primers for chimpanzees were designed using the chimpanzee Y chromosome reference sequence (Hughes et al., 2010). The SRY primers for bonobo were designed using the whole-genome male bonobo dataset generated in house (Dataset S2). The primers for RPP30 were designed using the chimpanzee (panTro6) and bonobo (panpan1.1) reference female genomes (Dataset S2). To complete our copy number dataset, we also used the previously published (by our group) copy number data from 14 wild-born gorillas (Tomaszkiewicz et al., 2016) and 10 humans with African Y haplogroup (E) (Ye et al., 2018). Each sample was run in at least three replicates (Dataset S3) and the mean value was calculated across the replicates (Table S10).
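As an illustration of the replicate averaging described above, the following is a minimal sketch of how ddPCR readings could be converted into a copy number estimate. The ratio-to-reference normalization, the function names, and the example concentrations are assumptions made for illustration only; the text itself states only that a two-copy RPP30 and a single-copy SRY served as references and that the mean was taken across at least three replicates.

```python
from statistics import mean


def copy_number(target_conc: float, reference_conc: float,
                reference_copies: int) -> float:
    """Assumed normalization: scale the target/reference concentration
    ratio by the known copy number of the reference gene
    (1 for SRY, 2 for RPP30)."""
    if reference_conc <= 0:
        raise ValueError("reference concentration must be positive")
    return target_conc / reference_conc * reference_copies


def mean_copy_number(replicates, reference_copies: int) -> float:
    """Average per-replicate estimates.

    `replicates` is a list of (target_conc, reference_conc) pairs;
    the text requires at least three replicates per sample.
    """
    if len(replicates) < 3:
        raise ValueError("expected at least three replicates")
    return mean(copy_number(t, r, reference_copies) for t, r in replicates)


if __name__ == "__main__":
    # Hypothetical concentrations (copies/uL), not taken from the study.
    readings = [(400.0, 100.0), (410.0, 100.0), (390.0, 100.0)]
    print(mean_copy_number(readings, reference_copies=1))
```

Averaging the per-replicate ratios, rather than the raw concentrations, keeps each replicate's target and reference measurements paired.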

Construction of Y-specific great ape phylogenetic trees

Hallast and colleagues identified 54,611 positions with intra- and interspecific single-nucleotide variants by comparing a 750,616-bp region across the Y chromosomes of great apes (Hallast et al., 2016). From this dataset, we picked sequences of one individual per species (uniformly at random) and generated a distance matrix (pairwise nucleotide differences) using MEGA7 (Kumar, Stecher, & Tamura, 2016). Using the mutation rate on the Y chromosome of humans ($8.88 \times 10^{-10}$ mutations per position per year (Helgason et al., 2015)) as a constant for all great apes, and the number of nucleotide differences among species obtained from the distance matrix above, we estimated the time in years since the most recent common ancestor (TMRCA), where time is defined as (Walsh, 2001)

$$\mathrm{TMRCA} = \frac{1}{2(\text{mutation rate})} \times \log_e\left(\frac{\text{length of sequences}}{\text{no. of matches}}\right) \tag{1}$$
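Equation 1 can be evaluated directly. The sketch below uses the mutation rate and region length quoted in the text; it assumes that the number of matches equals the sequence length minus the number of pairwise differences, and the difference count in the example is a placeholder rather than a value from the study.

```python
import math

# Values quoted in the text.
MUTATION_RATE = 8.88e-10   # mutations per position per year (human Y)
REGION_LENGTH = 750_616    # bp compared across great ape Y chromosomes


def tmrca_years(n_differences: int, length: int = REGION_LENGTH,
                mutation_rate: float = MUTATION_RATE) -> float:
    """Equation 1: TMRCA = 1 / (2 * mu) * ln(length / matches).

    Assumption: matches = length - n_differences.
    """
    matches = length - n_differences
    if matches <= 0:
        raise ValueError("need at least one matching position")
    return math.log(length / matches) / (2.0 * mutation_rate)


if __name__ == "__main__":
    hypothetical_differences = 3000  # placeholder, for illustration only
    kyr = tmrca_years(hypothetical_differences) / 1000
    print(f"TMRCA of about {kyr:.0f} thousand years")
```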

MEGA7 was used to estimate an unrooted maximum likelihood tree from the same set of sequences employed to compute TMRCA. We generated two separate trees, the first one including all great ape species analyzed (bonobo, chimpanzee, human, gorilla, Sumatran orangutan, and Bornean orangutan) and the second one excluding Sumatran orangutan. The second tree was used in gene expression analysis, in which we lacked expression data from multiple Sumatran orangutan individuals. Both trees were converted to rooted trees using the reroot() function from the ape package in R (Paradis, Claude, & Strimmer, 2004). As a final step, we recalibrated the trees to represent the branch lengths as TMRCA (in thousands of years) using the chronos() function in the ape package (Paradis et al., 2004). We used the makeChronosCalib() function to set the TMRCA for each common ancestor node of great apes and passed it as a parameter to the chronos() function, which enforces a molecular clock. Finally, we encoded the trees in Newick format for downstream analysis. The two Newick-formatted trees were ((((Bonobo:2432,Chimp:2432):5641,Human:8073):4633,Gorilla:12706):15346.37,(Borangutan:578,Sorangutan:578):27474.37) and ((((Bonobo:2432,Chimp:2432):5641,Human:8073):4633,Gorilla:12706):15351.43,Borangutan:28057.43).
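Because chronos() enforces a molecular clock, every tip of the resulting trees should lie at the same distance from the root. The sketch below checks this for the first Newick string given above. The small parser is an illustrative helper written for this check, not part of the original pipeline, and it handles only the subset of Newick used here (leaf names and branch lengths, no internal node labels).

```python
TREE_ALL = ("((((Bonobo:2432,Chimp:2432):5641,Human:8073):4633,"
            "Gorilla:12706):15346.37,"
            "(Borangutan:578,Sorangutan:578):27474.37)")


def root_to_tip(newick: str) -> dict[str, float]:
    """Return the summed branch length from the root to every leaf."""
    text = newick.strip().rstrip(";")
    depths: dict[str, float] = {}
    pos = 0

    def parse_node() -> list[str]:
        """Parse the node starting at `pos`; return the leaves below it."""
        nonlocal pos
        leaves: list[str] = []
        if text[pos] == "(":
            pos += 1  # consume "("
            while True:
                leaves += parse_node()
                if text[pos] == ",":
                    pos += 1  # another child follows
                    continue
                pos += 1  # consume ")"
                break
        else:
            start = pos
            while pos < len(text) and text[pos] not in ",():":
                pos += 1
            name = text[start:pos]
            depths[name] = 0.0
            leaves = [name]
        # Optional branch length leading to this node.
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
            for leaf in leaves:
                depths[leaf] += length
        return leaves

    parse_node()
    return depths


if __name__ == "__main__":
    depths = root_to_tip(TREE_ALL)
    spread = max(depths.values()) - min(depths.values())
    print("ultrametric" if spread < 1e-6 else "not ultrametric")
```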

Analysis of conservation in copy number across great ape species

We used CAFE v4.2 (Computational Analysis of gene Family Evolution) (Han et al., 2013) to study the evolution of ampliconic gene family size. CAFE uses a birth-and-death stochastic process to model gene gain or loss along each lineage of a given phylogenetic tree (De Bie, Cristianini, Demuth, & Hahn, 2006). For each of the nine ampliconic gene families, the median gene copy number in each great ape species and the first phylogenetic tree from the previous section were provided as input to CAFE. Using maximum likelihood, CAFE estimated the rate parameter λ (rate of gene birth and death), and gene copy numbers of each gene family at the internal nodes of the phylogenetic tree. Based on these maximum likelihood estimates, CAFE computed gene-family-specific p-values for tests of significant gain or loss of copy number in individual gene families in any particular extant and ancestral great ape species. For each gene family with a significant gene-family-specific p-value (significance cutoff of 0.05 was used), CAFE also provided a p-value for every branch on the phylogenetic tree, which indicates the significance of the shift in gene family copy number along the branch. Based on these branch-specific p-values, we identified gene families that have undergone expansions or contractions along branches on the phylogenetic tree (Bonferroni-corrected p-value cutoff of 0.05/10=0.005; 10 nodes in great ape phylogenetic tree).
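The branch-level filtering step can be illustrated with a short sketch. It applies the Bonferroni cutoff stated above (0.05/10) to a dictionary of branch-specific p-values; the branch labels and p-values are invented placeholders, not CAFE output from this study.

```python
def bonferroni_cutoff(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def significant_branches(branch_p: dict[str, float], alpha: float = 0.05,
                         n_tests: int = 10) -> list[str]:
    """Return branches whose p-value falls below the corrected cutoff."""
    cutoff = bonferroni_cutoff(alpha, n_tests)
    return [name for name, p in branch_p.items() if p < cutoff]


if __name__ == "__main__":
    # Hypothetical branch-specific p-values for one gene family.
    example = {"branch_a": 0.0004, "branch_b": 0.03, "branch_c": 0.2}
    print(significant_branches(example))
```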

RNA-Seq datasets

Testis-specific RNA-Seq datasets were obtained for human, bonobo, chimpanzee, gorilla, Bornean orangutan, and Sumatran orangutan (Table S9). The datasets were either publicly available (Ardlie et al., 2015; Brawand et al., 2011; Carelli et al., 2016; Fungtammasan et al., 2016; Ruiz-Orera et al., 2015; Tomaszkiewicz et al., 2016) or generated in-house. The public RNA-Seq datasets included sequences of strand-specific, paired-end libraries (SRR2040590 and SRR2040591 for chimpanzee, SRR2176206 and SRR2176207 for Bornean orangutan, SRR10393299-SRR10393304 for Sumatran orangutan, SRR3053573 and SRR10393358 for gorilla, and SRR1090722 and SRR1077753 for human) and of unstranded libraries (SRR306837 for bonobo, SRR306825 for chimpanzee, and SRR306810 for gorilla) (Ardlie et al., 2015; Brawand et al., 2011; Carelli et al., 2016; Ruiz-Orera et al., 2015). Testis-specific expression data for three humans with African Y-chromosome haplogroup E (SRR817512, SRR1100440, SRR1102852) were retrieved from the GTEx project (Ardlie et al., 2015). An additional bonobo RNA-Seq library was generated from total RNA extracted from the bonobo testis sample (individual ID 5013, from San Diego Zoological Society) with the RNeasy Mini Kit (Qiagen) and subsequently treated with DNase I (Ambion). Ribosomal RNA was depleted with the RiboZero Gold rRNA removal kit (Epicentre). The cDNA library was generated with the RNA ScriptSeq v2 RNA-Seq library preparation kit (Epicentre) and quantified with Qubit (Life Technologies) and Bioanalyzer (Agilent 2100). RNA sequencing was carried out on MiSeq using a 151-bp paired-end sequencing protocol (approximately 100 million reads were generated). Raw sequence data were deposited in the NCBI Sequence Read Archive under accession numbers SRR10392519-SRR10392521. An additional Bornean orangutan RNA-Seq data set was generated as follows. RNA was extracted from testis tissue from a sample provided by San Diego Zoological Society (individual ID 3405).
As in (Ardlie et al., 2015; Brawand et al., 2011; Carelli et al., 2016; Fungtammasan et al., 2016; Ruiz-Orera et al., 2015; Tomaszkiewicz et al., 2016), we extracted RNA, verified its integrity with Bioanalyzer, and sequenced it on HiSeq2500 after preparing the libraries with the TruSeq RNA Sample Prep kit (Illumina). Raw sequence data were deposited in the NCBI Sequence Read Archive under accession numbers SRR10392514-SRR10392519.

Female liver RNA-Seq datasets were obtained from publicly available datasets (SRR306835 for bonobo, SRR306823 for chimpanzee, SRR306808 for gorilla, SRR306798 for orangutan, and SRR1071668 for human) (Ardlie et al., 2015; Brawand et al., 2011; Carithers et al., 2015). They were used to filter out female transcripts during transcriptome assembly.

Transcriptome assembly of Y ampliconic genes in great apes

The reference genomes for great apes, namely gorGor5 (Gordon et al., 2016), panPan1 (Prüfer et al., 2012), panTro5 (Chimpanzee Sequencing and Analysis Consortium, 2005), PonAbe2 (Locke et al., 2011), and hg38, were downloaded from the UCSC Genome Browser (Kent et al., 2002). The transcriptome assembly pipeline was adapted from (Tomaszkiewicz et al., 2016). The RNA-Seq reads (from the previous section) were first checked for the presence of TruSeq adapters, and then we removed the adapters and low-quality regions using Trimmomatic[v0.36] (Bolger, Lohse, & Usadel, 2014). For each great ape species, its testis RNA-Seq reads were first mapped to their respective female reference genome (reference genome excluding the Y chromosome in case of human and chimpanzee) with Tophat2[v2.1.1] (Kim et al., 2013), and the unmapped reads (enriched for male-specific transcripts) were assembled with Trinity[v2.4.0] (Grabherr et al., 2011; Haas et al., 2013) and SOAPdenovo-Trans[v1.03] (Xie et al., 2014) with k-mer size of 25 bp. Other parameters were set based on read length and insert size required for each particular RNA-Seq dataset. The resulting contigs were aligned to the respective female reference genomes with BLAT[v36x2] (Kent, 2002), and contigs that aligned at >90% of their length with 100% identity were filtered from subsequent steps. Next, we aligned female liver RNA-Seq reads to the filtered contigs using Bowtie[v1.1.2] (Langmead, 2010) and removed contigs that were covered at over 90% of their length by mapped female liver RNA-Seq reads. The coverage information of the contigs was obtained using BEDTools[v2.26.0-87-g6f9c61f] (Quinlan & Hall, 2010). We combined the contigs from both the Trinity and SOAPdenovo-Trans assemblers and used CD-HIT[v4.7] (Fu, Niu, Zhu, Wu, & Li, 2012; W. Li & Godzik, 2006) to remove redundant sequences. We next scaffolded the remaining contigs using SSPACE[v3.0] (Boetzer, Henkel, Jansen, Butler, & Pirovano, 2011). We further mapped testis and female liver RNA-Seq reads to the gene scaffolds with Bowtie[v1.1.2] (Langmead, 2010) and retained only male-specific gene scaffolds (with at least 80% of the sequence covered by male-specific reads and no more than 20% of the sequence covered by female-specific reads). From the filtered male-specific scaffolds we generated consensus sequences using minimus2[v3.1.0] from the AMOS consortium (Sommer, Delcher, Salzberg, & Pop, 2007). Annotation of the final transcripts was performed using nucleotide and protein databases using BLAST[v2.6.0+] (Altschul, Gish, Miller, Myers, & Lipman, 1990). The above pipeline (Fig. S6) was run for each great ape species separately, and the longest transcript representing each ampliconic gene family was obtained for each species (Dataset S1). The transcripts for genes with low expression were not assembled due to lack of reads covering these genes.
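The male-specificity filter applied to the scaffolds (at least 80% covered by male reads, no more than 20% covered by female reads) can be sketched as follows. This is an illustrative reimplementation in plain Python, assuming read alignments are available as half-open (start, end) intervals on a scaffold; the actual pipeline obtained coverage with BEDTools, and the function names here are hypothetical.

```python
def covered_fraction(length: int, intervals: list[tuple[int, int]]) -> float:
    """Fraction of positions in [0, length) covered by at least one
    half-open (start, end) interval; overlaps are counted once."""
    covered = 0
    furthest = 0  # end of the region already counted
    for start, end in sorted(intervals):
        start = max(start, furthest, 0)
        end = min(end, length)
        if end > start:
            covered += end - start
            furthest = end
    return covered / length


def is_male_specific(male_fraction: float, female_fraction: float,
                     min_male: float = 0.80,
                     max_female: float = 0.20) -> bool:
    """Apply the thresholds stated in the text."""
    return male_fraction >= min_male and female_fraction <= max_female


if __name__ == "__main__":
    # Hypothetical alignments on a 100-bp scaffold.
    male = covered_fraction(100, [(0, 50), (40, 90)])
    female = covered_fraction(100, [(10, 20)])
    print(is_male_specific(male, female))
```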

Estimating gene expression levels from RNA-Seq datasets

To obtain gene expression levels of Y ampliconic gene families, we used the human RefSeq database downloaded from the UCSC Genome Browser (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/, date: October 2016) as a reference, along with the longest ampliconic gene transcript assembled for each available gene family (see previous section). We generated an index for the reference using the salmon[v0.14.1] index function (Patro, Duggal, Love, Irizarry, & Kingsford, 2017) with a k-mer size of 31 (-k 31 --keepDuplicates). Standard pipelines such as Tophat2 (Kim et al., 2013) and RSEM (B. Li & Dewey, 2011), which are optimized for aligning reads to a reference from the same species, could not be used in our case, so we developed a new pipeline. For each testis sample of each species, we obtained read counts per transcript using the salmon quant function (-l A -p 8 --validateMappings). The transcript-level read counts were converted to gene-level counts using the tximport package[v1.2.0] (Soneson, Love, & Robinson, 2015). The gene-level read counts for the RNA-Seq samples were normalized using DESeq2[v1.14.1] (Love, Huber, & Anders, 2014). The collection included paired-end and single-end RNA-Seq datasets with or without replicates; for the sake of consistency, we processed all datasets as single-end files and used only one replicate per sample to reduce batch effects. PCA of the expression data showed distinct clustering of each species, except for one bonobo sample, which appeared closer to chimpanzee (Figs. S7-S8). Because we aligned reads from all great ape species to a human reference, the reads from the apes closest to human (chimpanzee and bonobo) are expected to align better than the reads from more distant species (e.g. orangutans). However, we expected that our use of k-mer-based pseudo-alignment instead of standard whole-read alignment would reduce reference bias. To test this hypothesis, we employed the rlog() function in DESeq2 (Love et al., 2014) to normalize read counts and then used the R dist() function to calculate the Euclidean distance between samples. When viewed as a heat map, with the exception of bonobos, the distances between the samples were consistent with the phylogenetic relationships among great ape species (Fig. S9), which was also supported by the dendrograms relating the rows and columns of the heat map. This pattern agrees with the species-level clustering observed in the PCA (Figs. S7-S8).
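The sample-to-sample distance check described above can be sketched in Python as follows. This is only an illustration under stated assumptions: a plain log2(count + 1) transform stands in for the regularized-log transform (it is not equivalent to rlog), and the sample names and counts are invented toy values rather than data from this study.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def log_transform(counts: List[float]) -> List[float]:
    """Stand-in for a variance-stabilizing transform: log2(count + 1).

    Assumption: used here only for illustration; it is NOT the rlog transform.
    """
    return [math.log2(c + 1.0) for c in counts]


def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two expression vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("expression vectors must have the same genes")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(
    samples: Dict[str, List[float]]
) -> Dict[Tuple[str, str], float]:
    """Distance for every unordered pair of samples on transformed counts."""
    transformed = {name: log_transform(c) for name, c in samples.items()}
    return {
        (s1, s2): euclidean(transformed[s1], transformed[s2])
        for s1, s2 in combinations(sorted(transformed), 2)
    }


# Hypothetical toy counts for three genes in three samples.
toy = {
    "chimp_1": [0, 3, 7],
    "chimp_2": [0, 3, 7],
    "orang_1": [1, 1, 1],
}
for pair, dist in pairwise_distances(toy).items():
    print(pair, round(dist, 3))
```

If reference bias were strong, samples would be expected to separate by their divergence from human rather than by their pairwise relationships, which is the pattern a distance matrix of this kind is meant to expose.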

Testing for conservation in gene expression levels

To test whether the expression levels of Y ampliconic gene families are conserved across great ape species, we used the EVE model (expression variance and evolution model) (Rohlfs, Harrigan, & Nielsen, 2014). This model parametrizes the ratio of population to evolutionary expression variance (β) while taking phylogeny into account, i.e. it provides an implementation of phylogenetic ANOVA. Similar to the F statistic in ANOVA (a measure of the ratio of variation between groups to variation within groups), the β parameter estimated by the EVE model represents the ratio of within-species expression variation to phylogenetically corrected between-species expression variation. The EVE model is based on an Ornstein-Uhlenbeck (OU) process (Rohlfs & Nielsen, 2015), which models a random walk with a pull toward an optimal value. In the OU process employed by the EVE model, genetic drift (σ²) is modeled by the random walk, the strength of selection (α) by the directional pull, and the species-level optimal gene expression (θ) by the optimal value. The EVE model also has a parameter that captures the variation of expression within species (τ), which is used to estimate β. Given the Y chromosome-specific phylogenetic tree (the second tree, without Bornean orangutan, was used; see section on ‘Construction of Y-specific great ape phylogenetic trees’) and the expression values for the five Y ampliconic gene families found in all great apes from multiple samples per species, the EVE model estimated the above-mentioned parameters and used them to calculate the β parameter for each gene family i (βi) and for all the gene families together (βshared). The EVE model was then used to test whether βi for each gene i was similar to that of all genes evolving neutrally in the phylogeny (i.e. βi = βshared, where i indexes each gene in the dataset). Deviations from this expectation are suggestive of selection. We tested whether the βi parameter for any one gene family deviated from this expectation (i.e. βi ≠ βshared). If βi > βshared, then there is more variation within species than between species at gene i than expected, which could be suggestive of diversifying selection within species. Conversely, if βi < βshared, then there is more variation between species than within species at gene i than expected, which could be indicative of directional selection along extant or ancestral branches of the phylogeny. We used the -S parameter in the EVE model to perform the expression divergence/diversity test on each of the five gene families (-n 5) separately, using the ampliconic gene expression values from the previous section and the Y chromosome-specific phylogenetic tree as inputs. The EVE model calculated the likelihood ratio between the null and alternative hypotheses (H0: βi = βshared vs. Ha: βi ≠ βshared). Under the null hypothesis, the likelihood ratio test statistics follow a chi-square distribution with one degree of freedom, which makes it possible to convert them to p-values. The p-values were used to infer whether the expression levels of the gene families tested are conserved across great ape species after taking their phylogenetic relationships into account.
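The conversion from a likelihood ratio test statistic to a p-value can be sketched with the Python standard library alone. For one degree of freedom the chi-square survival function has the closed form erfc(sqrt(x / 2)), so no statistics package is needed. The function names below are hypothetical, and whether the statistic is supplied directly or computed from two log-likelihoods is an assumption about the upstream output format.

```python
import math


def lrt_statistic(loglik_null: float, loglik_alt: float) -> float:
    """Likelihood ratio test statistic, 2 * (lnL_alt - lnL_null).

    Clamped at zero because numerical optimization can leave the
    alternative marginally below the null.
    """
    return max(0.0, 2.0 * (loglik_alt - loglik_null))


def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability of a chi-square distribution with 1 d.f.

    Uses the identity P(Z^2 > x) = erfc(sqrt(x / 2)) for standard normal Z.
    """
    if x < 0:
        raise ValueError("test statistic must be non-negative")
    return math.erfc(math.sqrt(x / 2.0))


# Hypothetical log-likelihoods for one gene family (toy values).
stat = lrt_statistic(loglik_null=-10.0, loglik_alt=-8.0)
print(stat, chi2_sf_df1(stat))
```

Because five gene families are tested, the resulting p-values may additionally warrant a multiple-testing correction; the sketch leaves that step out.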

Availability of data and materials

Transcript sequences are presented in Dataset S1. Primer sequences are presented in Dataset S2. ddPCR replicates of copy numbers are shown in Dataset S3, and expression data are presented in Table S5. Code used in the manuscript is available on GitHub: https://github.com/makovalab-psu/Y-AmpGene_CN_GE_GreatApes.git

References

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local

alignment search tool. Journal of Molecular Biology, 215(3), 403–410.

Anderson, M. J., Nyholt, J., & Dixson, A. F. (2005). Sperm competition and the evolution

of sperm midpiece volume in mammals. Journal of Zoology, 267(02), 135.

Ardlie, K. G., DeLuca, D. S., Segrè, A. V., Sullivan, T. J., Young, T. R., Gelfand, E. T., …

Lockhart. (2015). The Genotype-Tissue Expression (GTEx) pilot analysis:

Multitissue gene regulation in humans. Science, 348(6235), 648–660.

Boetzer, M., Henkel, C. V., Jansen, H. J., Butler, D., & Pirovano, W. (2011). Scaffolding

pre-assembled contigs using SSPACE. Bioinformatics, 27(4), 578–579.

Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for

Illumina sequence data. Bioinformatics, 30(15), 2114–2120.

Brawand, D., Soumillon, M., Necsulea, A., Julien, P., Csárdi, G., Harrigan, P., …

Kaessmann, H. (2011). The evolution of gene expression levels in mammalian

organs. Nature, 478(7369), 343–348.

Carelli, F. N., Hayakawa, T., Go, Y., Imai, H., Warnefors, M., & Kaessmann, H. (2016).

The life history of retrocopies illuminates the evolution of new mammalian genes.

Genome Research, 26(3), 301–314.

Carithers, L. J., Ardlie, K., Barcus, M., Branton, P. A., Britton, A., Buia, S. A., … GTEx

Consortium. (2015). A Novel Approach to High-Quality Postmortem Tissue

Procurement: The GTEx Project. Biopreservation and Biobanking, 13(5), 311–319.

Cechova, M., Harris, R. S., Tomaszkiewicz, M., Arbeithuber, B., Chiaromonte, F., &

Makova, K. D. (2019). High satellite repeat turnover in great apes studied with short-

and long-read technologies. Molecular Biology and Evolution.

https://doi.org/10.1093/molbev/msz156

Chang, A. Y.-F., & Liao, B.-Y. (2012). DNA methylation rebalances gene dosage after

mammalian gene duplications. Molecular Biology and Evolution, 29(1), 133–144.

Charlesworth, D., & Charlesworth, B. (1980). Sex differences in fitness and selection for

centric fusions between sex-chromosomes and autosomes. Genetical Research,

35(2), 205–214.

Chimpanzee Sequencing and Analysis Consortium. (2005). Initial sequence of the

chimpanzee genome and comparison with the human genome. Nature, 437(7055),

69–87.

Choi, J., Koh, E., Suzuki, H., Maeda, Y., Yoshida, A., & Namiki, M. (2007). Alu sequence

variants of the BPY2 gene in proven fertile and infertile men with Sertoli cell-only

phenotype. International Journal of Urology, 14, 431–435.

https://doi.org/10.1111/j.1442-2042.2007.01741.x

Cortez, D., Marin, R., Toledo-Flores, D., Froidevaux, L., Liechti, A., Waters, P. D., …

Kaessmann, H. (2014). Origins and functional evolution of Y chromosomes across

mammals. Nature, 508(7497), 488–493.

De Bie, T., Cristianini, N., Demuth, J. P., & Hahn, M. W. (2006). CAFE: a computational

tool for the study of gene family evolution. Bioinformatics, 22(10), 1269–1271.

Delgado, R. A., Jr, & Van Schaik, C. P. (2000). The behavioral ecology and conservation

of the orangutan (Pongo pygmaeus): a tale of two islands. Evolutionary

Anthropology: Issues, News, and Reviews, 9(5), 201–218.

de Vries, J. W., Repping, S., Oates, R., Carson, R., Leschot, N. J., & van der Veen, F.

(2001). Absence of deleted in azoospermia (DAZ) genes in spermatozoa of infertile

men with somatic DAZ deletions. Fertility and Sterility, 75(3), 476–479.

Dixson, A. F., & Anderson, M. J. (2004). Sexual behavior, reproductive physiology and

sperm competition in male mammals. Physiology and Behavior, 83(2), 361–371.

Fisher, R. A. (1931). THE EVOLUTION OF DOMINANCE. Biological Reviews of the

Cambridge Philosophical Society, 6(4), 345–368.

Fruth, B., Hickey, J. R., André, C., Furuichi, T., Hart, J., Hart, T., … Others. (2016). Pan

paniscus. The IUCN Red List of Threatened Species 2016: e. T15932A102331567.

Fujii-Hanamoto, H., Matsubayashi, K., Nakano, M., Kusunoki, H., & Enomoto, T. (2011).

A comparative study on testicular microstructure and relative sperm production in

gorillas, chimpanzees, and orangutans. American Journal of Primatology, 73(6),

570–577.

Fu, L., Niu, B., Zhu, Z., Wu, S., & Li, W. (2012). CD-HIT: accelerated for clustering the

next-generation sequencing data. Bioinformatics, 28(23), 3150–3152.

Fungtammasan, A., Tomaszkiewicz, M., Campos-Sánchez, R., Eckert, K. A., DeGiorgio,

M., & Makova, K. D. (2016). Reverse Transcription Errors and RNA–DNA

Differences at Short Tandem Repeats. Molecular Biology and Evolution, 33(10),

2744–2758.

Giachini, C., Nuti, F., Turner, D. J., Laface, I., Xue, Y., Daguin, F., … Krausz, C. (2009).

TSPY1 copy number variation influences spermatogenesis and shows differences

among Y lineages. The Journal of Clinical Endocrinology and Metabolism, 94(10),

4016–4022.

Gläser, B., Grützner, F., Willmann, U., Stanyon, R., Arnold, N., Taylor, K., … Schempp,

W. (1998). Simian Y chromosomes: species-specific rearrangements of DAZ, RBM,

and TSPY versus contiguity of PAR and SRY. Mammalian Genome: Official Journal

of the International Mammalian Genome Society, 9(3), 226–231.

Glazko, G. V., & Nei, M. (2003). Estimation of divergence times for major lineages of

primate species. Molecular Biology and Evolution, 20(3), 424–434.

Gordon, D., Huddleston, J., Chaisson, M. J. P., Hill, C. M., Kronenberg, Z. N., Munson,

K. M., … Eichler, E. E. (2016). Long-read sequence assembly of the gorilla genome.

Science, 352(6281), aae0344.

Grabherr, M. G., Haas, B. J., Yassour, M., Levin, J. Z., Thompson, D. A., Amit, I., …

Regev, A. (2011). Full-length transcriptome assembly from RNA-Seq data without a

reference genome. Nature Biotechnology, 29(7), 644–652.

Graves, J. A. M. (1995). The origin and function of the

mammalian Y chromosome and Y-borne genes - an evolving understanding.

BioEssays: News and Reviews in Molecular, Cellular and Developmental Biology,

17(4), 311–320.

Greve, G., Alechine, E., Pasantes, J. J., Hodler, C., Rietschel, W., Robinson, T. J., &

Schempp, W. (2011). Y-Chromosome variation in hominids: intraspecific variation is

limited to the polygamous chimpanzee. PloS One, 6(12), e29311.

Haas, B. J., Papanicolaou, A., Yassour, M., Grabherr, M., Blood, P. D., Bowden, J., …

Regev, A. (2013). De novo transcript sequence reconstruction from RNA-seq using

the Trinity platform for reference generation and analysis. Nature Protocols, 8(8),

1494–1512.

Hallast, P., Balaresque, P., Bowden, G. R., Ballereau, S., & Jobling, M. A. (2013).

Recombination dynamics of a human Y-chromosomal palindrome: rapid GC-biased

gene conversion, multi-kilobase conversion tracts, and rare inversions. PLoS

Genetics, 9(7), e1003666.

Hallast, P., & Jobling, M. A. (2017). The Y chromosomes of the great apes. Human

Genetics, 136(5), 511–528.

Hallast, P., Maisano Delser, P., Batini, C., Zadik, D., Rocchi, M., Schempp, W., …

Jobling, M. A. (2016). Great ape Y Chromosome and mitochondrial DNA

phylogenies reflect subspecies structure and patterns of mating and dispersal.

Genome Research, 26(4), 427–439.

Hammer, M. F., Karafet, T., Rasanayagam, A., Wood, E. T., Altheide, T. K., Jenkins, T.,

… Zegura, S. L. (1998). Out of Africa and back again: nested cladistic analysis of

human Y chromosome variation. Molecular Biology and Evolution, 15(4), 427–441.

Han, M. V., Thomas, G. W. C., Lugo-Martinez, J., & Hahn, M. W. (2013). Estimating

gene gain and loss rates in the presence of error in genome assembly and

annotation using CAFE 3. Molecular Biology and Evolution, 30(8), 1987–1997.

Harcourt, A. H., Harvey, P. H., Larson, S. G., & Short, R. V. (1981). Testis weight, body

weight and breeding system in primates. Nature, 293(5827), 55–57.

Harrison, M. E., & Chivers, D. J. (2007). The orang-utan mating system and the

unflanged male: A product of increased food stress during the late Miocene and

Pliocene? Journal of Human Evolution, 52(3), 275–293.

Helgason, A., Einarsson, A. W., Guðmundsdóttir, V. B., Sigurðsson, Á., Gunnarsdóttir,

E. D., Jagadeesan, A., … Stefánsson, K. (2015). The Y-chromosome point mutation

rate in humans. Nature Genetics, 47(5), 453–457.

Hey, J. (2010). The divergence of chimpanzee species and subspecies as revealed in

multipopulation isolation-with-migration analyses. Molecular Biology and Evolution,

27(4), 921–933.

Hindson, B. J., Ness, K. D., Masquelier, D. A., Belgrader, P., Heredia, N. J., Makarewicz,

A. J., … Colston, B. W. (2011). High-throughput droplet digital PCR system for

absolute quantitation of DNA copy number. Analytical Chemistry, 83(22), 8604–

8610.

Hindson, C. M., Chevillet, J. R., Briggs, H. A., Gallichotte, E. N., Ruf, I. K., Hindson, B.

J., … Tewari, M. (2013). Absolute quantification by droplet digital PCR versus

analog real-time PCR. Nature Methods, 10(10), 1003–1005.

Hughes, J. F., Skaletsky, H., Brown, L. G., Pyntikova, T., Graves, T., Fulton, R. S., …

Page, D. C. (2012). Strict evolutionary conservation followed rapid gene loss on

human and rhesus Y chromosomes. Nature, 483(7387), 82–86.

Hughes, J. F., Skaletsky, H., Pyntikova, T., Graves, T. A., van Daalen, S. K. M., Minx, P.

J., … Page, D. C. (2010). Chimpanzee and human Y chromosomes are remarkably

divergent in structure and gene content. Nature, 463(7280), 536–539.

IUCN. (2016a). Gorilla gorilla: Maisels, F., Bergl, R.A. & Williamson, E.A. IUCN

Red List of Threatened Species.

https://doi.org/10.2305/iucn.uk.2018-2.rlts.t9404a136250858.en

IUCN. (2016b). Pan troglodytes: Humle, T., Maisels, F., Oates, J.F., Plumptre, A.

& Williamson, E.A. IUCN Red List of Threatened Species.

https://doi.org/10.2305/iucn.uk.2016-2.rlts.t15933a17964454.en

IUCN. (2016c). Pongo pygmaeus: Ancrenaz, M., Gumal, M., Marshall, A.J.,

Meijaard, E., Wich, S.A. & Husson, S. IUCN Red List of Threatened Species.

https://doi.org/10.2305/iucn.uk.2016-1.rlts.t17975a17966347.en

IUCN. (2017). Pongo abelii: Singleton, I., Wich, S.A., Nowak, M., Usher, G. &

Utami-Atmoko, S.S. IUCN Red List of Threatened Species.

https://doi.org/10.2305/iucn.uk.2017-3.rlts.t121097935a115575085.en

Kent, W. J. (2002). BLAT--the BLAST-like alignment tool. Genome Research, 12(4),

656–664.

Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A. M., &

Haussler, D. (2002). The human genome browser at UCSC. Genome Research,

12(6), 996–1006.

Khaitovich, P., Enard, W., Lachmann, M., & Pääbo, S. (2006). Evolution of primate gene

expression. Nature Reviews. Genetics, 7(9), 693–702.

Khaitovich, P., Hellmann, I., Enard, W., Nowick, K., Leinweber, M., Franz, H., … Pääbo,

S. (2005). Parallel patterns of evolution in the genomes and transcriptomes of

humans and chimpanzees. Science, 309(5742), 1850–1854.

Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., & Salzberg, S. L. (2013).

TopHat2: accurate alignment of transcriptomes in the presence of insertions,

deletions and gene fusions. Genome Biology, 14(4), R36.

Kumar, S., Stecher, G., & Tamura, K. (2016). MEGA7: Molecular Evolutionary Genetics

Analysis Version 7.0 for Bigger Datasets. Molecular Biology and Evolution, 33(7),

1870–1874.

Lahn, B. T., & Page, D. C. (1999). Four evolutionary strata on the human X

chromosome. Science, 286(5441), 964–967.

Langmead, B. (2010). Aligning short sequencing reads with Bowtie. Current Protocols in

Bioinformatics, Chapter 11, Unit 11.7.

Li, B., & Dewey, C. N. (2011). RSEM: accurate transcript quantification from RNA-Seq

data with or without a reference genome. BMC Bioinformatics, 12(1), 323.

Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with

BWA-MEM. arXiv:1303.3997 [q-bio.GN]. https://arxiv.org/abs/1303.3997

Li, W., & Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large

sets of protein or nucleotide sequences. Bioinformatics, 22(13), 1658–1659.

Locke, D. P., Hillier, L. W., Warren, W. C., Worley, K. C., Nazareth, L. V., Muzny, D. M.,

… Wilson, R. K. (2011). Comparative and demographic analysis of orang-utan

genomes. Nature, 469(7331), 529–533.

Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and

dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550.

Lu, C., Jiang, J., Zhang, R., Wang, Y., Xu, M., Qin, Y., … Wang, X. (2014). Gene copy

number alterations in the azoospermia-associated AZFc region and their effect on

spermatogenic impairment. Molecular Human Reproduction, 20(9), 836–843.

Lucotte, E. A., Skov, L., Jensen, J. M., Macià, M. C., Munch, K., & Schierup, M. H.

(2018). Dynamic Copy Number Evolution of X- and Y-Linked Ampliconic Genes in

Human Populations. Genetics, 209(3), 907–920.

Luo, Z.-X., Yuan, C.-X., Meng, Q.-J., & Ji, Q. (2011). A Jurassic eutherian mammal and

divergence of marsupials and placentals. Nature, 476(7361), 442–445.

Møller, A. P. (1988). Ejaculate quality, testes size and sperm competition in primates.

Journal of Human Evolution, 17(5), 479–488.

Mueller, J. L., Skaletsky, H., Brown, L. G., Zaghlul, S., Rock, S., Graves, T., … Page, D.

C. (2013). Independent specialization of the human and mouse X chromosomes for

the male germ line. Nature Genetics, 45(9), 1083–1087.

Nascimento, J. M., Shi, L. Z., Meyers, S., Gagneux, P., Loskutoff, N. M., Botvinick, E. L.,

& Berns, M. W. (2008). The use of optical tweezers to study sperm competition and

motility in primates. Journal of the Royal Society Interface,

5(20), 297–302.

Nater, A., Mattle-Greminger, M. P., Nurcahyo, A., Nowak, M. G., de Manuel, M., Desai,

T., … Krützen, M. (2017). Morphometric, Behavioral, and Genomic Evidence for a

New Orangutan Species. Current Biology: CB, 27(22), 3576–3577.

Oetjens, M. T., Shen, F., Emery, S. B., Zou, Z., & Kidd, J. M. (2016). Y-Chromosome

Structural Diversity in the Bonobo and Chimpanzee Lineages. Genome Biology and

Evolution, 8(7), 2231–2240.

Paradis, E., Claude, J., & Strimmer, K. (2004). APE: Analyses of Phylogenetics and

Evolution in R language. Bioinformatics, 20(2), 289–290.

Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon

provides fast and bias-aware quantification of transcript expression. Nature

Methods, 14(4), 417–419.

Prüfer, K., Munch, K., Hellmann, I., Akagi, K., Miller, J. R., Walenz, B., … Pääbo, S.

(2012). The bonobo genome compared with the chimpanzee and human genomes.

Nature, 486(7404), 527–531.

Quinlan, A. R., & Hall, I. M. (2010). BEDTools: a flexible suite of utilities for comparing

genomic features. Bioinformatics, 26(6), 841–842.

Repping, S., van Daalen, S. K. M., Brown, L. G., Korver, C. M., Lange, J., Marszalek, J.

D., … Rozen, S. (2006). High mutation rates have driven extensive structural

polymorphism among human Y chromosomes. Nature Genetics, 38(4), 463–467.

Rohlfs, R. V., Harrigan, P., & Nielsen, R. (2014). Modeling gene expression evolution

with an extended Ornstein-Uhlenbeck process accounting for within-species

variation. Molecular Biology and Evolution, 31(1), 201–211.

Rohlfs, R. V., & Nielsen, R. (2015). Phylogenetic ANOVA: The expression variance and

evolution model for quantitative trait evolution. Systematic Biology, 64(5), 695–708.

Ross, M. T., Grafham, D. V., Coffey, A. J., Scherer, S., McLay, K., Muzny, D., …

Bentley, D. R. (2005). The DNA sequence of the human X chromosome. Nature,

434(7031), 325–337.

Rozen, S., Skaletsky, H., Marszalek, J. D., Minx, P. J., Cordum, H. S., Waterston, R. H.,

… Page, D. C. (2003). Abundant gene conversion between arms of palindromes in

human and ape Y chromosomes. Nature, 423(6942), 873–876.

Ruiz-Orera, J., Hernandez-Rodriguez, J., Chiva, C., Sabidó, E., Kondova, I., Bontrop, R.,

… Albà, M. M. (2015). Origins of De Novo Genes in Human and Chimpanzee. PLoS

Genetics, 11(12), e1005721.

Schaller, F., Fernandes, A. M., Hodler, C., Münch, C., Pasantes, J. J., Rietschel, W., &

Schempp, W. (2010). Y Chromosomal Variation Tracks the Evolution of Mating

Systems in Chimpanzee and Bonobo. PloS One, 5(9), e12482.

Seuánez, H. N. (1980). Chromosomes and spermatozoa of the African great apes.

Journal of Reproduction and Fertility. Supplement, Suppl 28, 91–104.

Shi, W., Louzada, S., Grigorova, M., Massaia, A., Arciero, E., Kibena, L., … Xue, Y.

(2019). Evolutionary and functional analysis of RBMY1 gene copy number variation

on the human Y chromosome. Human Molecular Genetics, 28(16), 2785–2798.

Shi, W., Massaia, A., Louzada, S., Handsaker, J., Chow, W., McCarthy, S., … Tyler-

Smith, C. (2019). Birth, expansion, and death of VCY-containing palindromes on the

human Y chromosome. Genome Biology, 20(1), 207.

Skaletsky, H., Kuroda-Kawaguchi, T., Minx, P. J., Cordum, H. S., Hillier, L., Brown, L. G.,

… Page, D. C. (2003). The male-specific region of the human Y chromosome is a

mosaic of discrete sequence classes. Nature, 423(6942), 825–837.

Skov, L., & Schierup, M. H. (2017). Analysis of 62 hybrid assembled human Y

chromosomes exposes rapid structural changes and high rates of gene conversion.

PLoS Genetics, 13(8), 1–20.

Sommer, D. D., Delcher, A. L., Salzberg, S. L., & Pop, M. (2007). Minimus: a fast,

lightweight genome assembler. BMC Bioinformatics, 8, 64.

Soneson, C., Love, M. I., & Robinson, M. D. (2015). Differential analyses for RNA-seq:

transcript-level estimates improve gene-level inferences. F1000Research, 4, 1521.

Taylor, S. C., Laperriere, G., & Germain, H. (2017). Droplet Digital PCR versus qPCR for

gene expression analysis with low abundant targets: from variable nonsense to

publication quality data. Scientific Reports, 7(1), 2409.

Thorvaldsdóttir, H., Robinson, J. T., & Mesirov, J. P. (2013). Integrative Genomics

Viewer (IGV): high-performance genomics data visualization and exploration.

Briefings in Bioinformatics, 14(2), 178–192.

Tomaszkiewicz, M., Rangavittal, S., Cechova, M., Campos Sanchez, R., Fescemyer, H.

W., Harris, R., … Makova, K. D. (2016). A time- and cost-effective strategy to

sequence mammalian Y Chromosomes: an application to the de novo assembly of

gorilla Y. Genome Research, 26(4), 530–540.

Underhill, P. A., & Kivisild, T. (2007). Use of Y chromosome and mitochondrial DNA

population structure in tracing human migrations. Annual Review of Genetics, 41,

539–564.

Underhill, P. A., Shen, P., Lin, A. A., Jin, L., Passarino, G., Yang, W. H., … Oefner, P. J.

(2000). Y chromosome sequence variation and the history of human populations.

Nature Genetics, 26, 358–361. https://doi.org/10.1038/81685

Vegesna, R., Tomaszkiewicz, M., Medvedev, P., & Makova, K. D. (2019). Dosage

regulation, and variation in gene expression and copy number of human Y

chromosome ampliconic genes. PLoS Genetics, 15(9), e1008369.

Veyrunes, F., Waters, P. D., Miethke, P., Rens, W., McMillan, D., Alsop, A. E., …

Marshall Graves, J. A. (2008). Bird-like sex chromosomes of platypus imply recent

origin of mammal sex chromosomes. Genome Research, 18(6), 965–973.

Vogt, P. H., Edelmann, A., Kirsch, S., Henegariu, O., Hirschmann, P., Kiesewetter, F., …

Haidl, G. (1996). Human Y chromosome azoospermia factors (AZF) mapped to

different subregions in Yq11. Human Molecular Genetics, 5(7), 933–943.

Walsh, B. (2001). Estimating the time to the most recent common ancestor for the Y

chromosome or mitochondrial DNA for a pair of individuals. Genetics, 158(2), 897–

912.

Wistuba, J., Schrod, A., Greve, B., Hodges, J. K., Aslam, H., Weinbauer, G. F., &

Luetjens, C. M. (2003). Organization of seminiferous epithelium in primates:

relationship to spermatogenic efficiency, phylogeny, and mating system. Biology of

Reproduction, 69(2), 582–591.

Xie, Y., Wu, G., Tang, J., Luo, R., Patterson, J., Liu, S., … Wang, J. (2014).

SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads.

Bioinformatics, 30(12), 1660–1666.

Yan, Y., Yang, X., Liu, Y., Shen, Y., Tu, W., Dong, Q., … Yang, Y. (2017). Copy number

variation of functional RBMY1 is associated with sperm motility: an azoospermia

factor-linked candidate for asthenozoospermia. Human Reproduction, 32(7), 1521–1531.

Ye, D., Zaidi, A., Tomaszkiewicz, M., Anthony, K., Liebowitz, C., DeGiorgio, M., …

Makova, K. D. (2018). High levels of copy number variation of ampliconic genes

across major human Y haplogroups. Genome Biology and Evolution, (May).

https://doi.org/10.1093/gbe/evy086

Yu, N., Jensen-Seaman, M. I., Chemnick, L., Kidd, J. R., Deinard, A. S., Ryder, O., … Li,

W.-H. (2003). Low nucleotide diversity in chimpanzees and bonobos. Genetics,

164(4), 1511–1518.

Chapter 4 Dynamic evolution of great ape Y chromosomes

This chapter will be submitted as a research article by M. Cechova, R. Vegesna, M. Tomaszkiewicz, R.S. Harris, D. Chen, S. Rangavittal, P. Medvedev, K.D. Makova. In this chapter, R. Vegesna, the author of this dissertation, performed the analyses related to the gene content and palindromes of the bonobo and orangutan Y chromosome assemblies. M. Tomaszkiewicz performed all the wet-lab experimental work. M. Michalovova, R.S. Harris and S. Rangavittal assembled the Y chromosomes. R.S. Harris and D. Chen helped with the multiple sequence alignment of great ape sex chromosomes. Supporting information is provided in Appendix C.

Introduction

Great apes include humans and their closest living relatives: chimpanzees, bonobos, gorillas, and orangutans. The male-specific sex chromosome, the Y, is involved in sex determination and male fertility, but has been understudied in great apes. Y chromosome assemblies are available for chimpanzee, human, and gorilla (Hughes et al., 2010; Skaletsky et al., 2003; Tomaszkiewicz et al., 2016). Cytogenetic studies of great ape Y chromosomes demonstrated substantial interspecific variation in their structure and gene content (Gläser et al., 1998). However, our understanding of the evolutionary history of great ape Y chromosomes, in terms of their gene and palindrome content, remains limited. This study addresses that gap.

Studies of the human Y chromosome have identified variation in the copy number of genes present in its ampliconic region (Lucotte et al., 2018; Skov & Schierup, 2017; Vegesna, Tomaszkiewicz, Medvedev, & Makova, 2019; Ye et al., 2018). This variation results from intraspecific rearrangements among the amplicons on the Y chromosome (Lange et al., 2013). The existence of variation in ampliconic gene copy number across great apes suggests that similar repetitive sequences, which undergo frequent rearrangements, are present on all great ape Y chromosomes (Vegesna et al. 2020; Chapter 3). The most frequent repetitive structures observed on the great ape Y chromosomes studied to date are palindromes, which consist of large inverted repeats (arms) separated by a relatively short sequence (spacer) (Hughes et al., 2010; Skaletsky et al., 2003). These palindromic structures enable gene conversion and non-allelic homologous recombination (NAHR), which help compensate for the lack of recombination within the male-specific region of the Y (Rozen et al., 2003). However, very little is known about palindromes outside of the three great ape species for which Y chromosome assemblies are available (Hughes et al., 2010; Skaletsky et al., 2003; Tomaszkiewicz et al., 2016).

The conservation of ampliconic structures on Y chromosomes has been linked to Muller's ratchet and genetic hitchhiking (Betrán, Demuth, & Williford, 2012). However, genes located in the ampliconic regions have testis-specific expression, whereas genes outside the amplicons are expressed ubiquitously (Skaletsky et al., 2003). This implies that, apart from sequence conservation, amplicons could have a secondary function: regulating the genes within them so that they are selectively expressed in testis. Not all palindromes on the human Y are conserved in chimpanzee or gorilla (Hughes et al., 2010; Tomaszkiewicz et al., 2016). There are nineteen palindromes on the chimpanzee Y, of which only seven are homologous to human palindromes and the remaining twelve are chimpanzee-specific (Hughes et al., 2010). The gorilla Y chromosome assembly has sequences homologous to all human palindromes and to nine of the twelve chimpanzee-specific palindromes; these sequences are likely to form palindromes in gorilla as well because of their high read depth in a male gorilla individual (Tomaszkiewicz et al., 2016). Of the human palindromes conserved in chimpanzee and gorilla, two (P6 and P7) do not harbor protein-coding genes (Skaletsky et al., 2003). Studying these palindromes should aid in understanding the potential function of amplicons on the Y chromosomes of great apes.

The mammalian Y and X chromosomes originated from a pair of autosomes; however, when compared to the X chromosome, the Y has degraded and lost the majority of its genes (Bellott et al., 2014; Skaletsky et al., 2003). On the one hand, one study has shown that, due to the lack of recombination, the Y chromosome has been under constant purifying selection in humans, which removes deleterious mutations and maintains the reproductive fitness of the population (Wilson Sayres, Lohmueller, & Nielsen, 2014). On the other hand, beneficial mutations can be fixed in a population via selective sweeps, with the fixation of deletion or pseudogenization events via genetic hitchhiking as a byproduct (Hughes et al., 2005; Perry, Tito, & Verrelli, 2007). Reconstructing the evolutionary history of the genes on the great ape Y chromosomes will provide insights into the evolution of their gene content. We can deduce whether the observed gene content is the result of a constant rate of gene birth and death across great apes, or whether there are, for instance, species-specific increases in deletion rates on the Y chromosome, potentially associated with species-specific differences in mating preferences and social structure.

Recombination between the X and the Y chromosomes is limited to the pseudoautosomal regions (Graves, 1995; Lahn & Page, 1999). However, there are cases in which regions of the Y that normally do not recombine with the X (i.e. X-degenerate regions) undergo gene conversion with the X (Rosser, Balaresque, & Jobling, 2009; Trombetta, Cruciani, Underhill, Sellitto, & Scozzari, 2010). Genes such as PRKY and VCY share higher similarity with their X homologs as a result of undergoing X-Y gene conversion in the human lineage (Rosser et al., 2009; Trombetta et al., 2010). To better understand the evolution and conservation of genes on the Y, there is a need to identify regions which undergo X-Y gene conversion on great ape Y chromosomes.

In this study, we assembled the Y chromosomes of bonobo and Sumatran orangutan. Within these assemblies we identified the palindromes which are shared with human and chimpanzee. Independent of the assemblies, we identified new factors other than genes that could determine the conservation of palindromes across great apes. Using a model developed by Iwasaki and Takagi (Iwasaki & Takagi, 2007), we reconstructed the evolutionary history of genes across great apes. By comparing the X and Y chromosomes of human, chimpanzee, bonobo, gorilla, and orangutan, we identified conserved regions and searched for X-Y gene conversion events within them.

Results

Conservation of human and chimpanzee palindrome sequences

Did the palindromes now present on the human Y (P1-P8) and chimpanzee Y (C1-C19) evolve before or after the great ape lineages split? To answer this question, we identified the proportions of human and chimpanzee palindrome sequences that aligned to bonobo, orangutan and gorilla Ys in our multi-species alignments (Fig. 1A; Table S1-2). Among human palindromes, P5 and P6 were the most conserved (covered by 89-97% of other great ape Y assemblies), whereas the majority of P3 sequences were human-specific (covered by only 31-38% of other great ape Y assemblies). Nevertheless, the common ancestor of great apes most likely already had substantial lengths of sequences homologous to P1, P2, and P4-P8, and some sequences of P3 (Fig. 1B). Chimpanzee palindromes C17, C18, and C19 are homologous to human palindromes P8, P7, and P6, respectively (Hughes et al., 2010). Therefore, below we focused on the other chimpanzee palindromes and, following Hughes et al. (2010), divided them into five homologous groups: C1 (C1+C6+C8+C10+C14+C16), C2 (C2+C11+C15), C3 (C3+C12), C4 (C4+C13), and C5 (C5+C7+C9) (Table S2). The palindromes in the C3, C4, and C5 groups had substantial proportions (usually 70-90%) of their sequences covered by alignments with other great ape Ys (Fig. 1A). In contrast, most of C2 sequences (85%) were shared with bonobo, and a substantial proportion of C1 sequences was chimpanzee-specific. Nonetheless, the common ancestor of great apes likely already had large amounts of sequences homologous to group C3, C4, and C5 palindromes, and also some sequences homologous to group C1 and C2 palindromes (Fig. 1B).

To determine whether the bonobo, orangutan, and gorilla sequences homologous to human or chimpanzee palindromes were multi-copy (i.e. present in more than one copy), and thus could form palindromes, in the common ancestor of great apes, we obtained their read depths from whole-genome sequencing of their respective males (Fig. 1A; see Methods). This approach was used because we expect that some palindromes were collapsed in our Y assemblies and, hence, the copy number within the assembly may be unreliable. Additionally, we used the data on the homology between human and chimpanzee palindromes summarized from the literature (Hughes et al., 2010; Skaletsky et al., 2003b; Tomaszkiewicz et al., 2016) (Table S3-4). Using maximum parsimony reconstruction, we concluded (Note S1) that sequences homologous to P4, P5, P8, C4, and partial sequences homologous to P1, P2, C2, and C3 were multi-copy in the common ancestor of great apes (Fig. 1B). Sequences homologous to P3, P6, and C1 were multi-copy in the human-gorilla common ancestor, those homologous to P7 were multi-copy in the human-chimpanzee common ancestor, and those homologous to C5 were multi-copy in the bonobo-chimpanzee common ancestor (Fig. 1B; Note S1).
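As an illustration of the kind of reasoning behind a maximum parsimony reconstruction, the following minimal sketch applies the standard Fitch algorithm to a single binary character (1 = multi-copy, 0 = single-copy or absent) on the great ape tree. It is not the procedure of Note S1: the leaf states are toy values, the function name `fitch` is hypothetical, and the tie-breaking rule mentioned in the Figure 1 legend (preferring a later acquisition of the multi-copy state) is not implemented.

```python
# Fitch small-parsimony sketch for one binary character.
# Toy data only; names and states are assumptions for illustration.

def fitch(node, states):
    """Return (candidate state set, minimum number of changes) for a subtree.

    `node` is either a leaf name (str) or a 2-tuple of child nodes.
    `states` maps each leaf name to its observed state (0 or 1).
    """
    if isinstance(node, str):
        return {states[node]}, 0
    left_set, left_cost = fitch(node[0], states)
    right_set, right_cost = fitch(node[1], states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra change needed.
        return shared, left_cost + right_cost
    # Children disagree: keep both options and count one change.
    return left_set | right_set, left_cost + right_cost + 1

# Great ape topology as nested pairs.
GREAT_APE_TREE = (((("chimpanzee", "bonobo"), "human"), "gorilla"), "orangutan")

# Hypothetical states for one palindrome-homologous sequence.
toy_states = {"chimpanzee": 1, "bonobo": 0, "human": 1, "gorilla": 1, "orangutan": 0}

root_set, n_changes = fitch(GREAT_APE_TREE, toy_states)
```

With these toy states the root remains ambiguous, which is the situation in which an additional rule for choosing among equally parsimonious scenarios becomes necessary.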

Figure 4-1. Evolution of sequences homologous to human and chimpanzee palindromes. (A) Heatmaps showing coverage for each palindrome in each species in the multi-species alignment, and box plots representing copy number (natural log) of 1-kb windows which have homology with human or chimpanzee palindromes. (B) The great ape phylogenetic tree with evolution of human (shown in blue) and chimpanzee (purple) palindromic sequences overlaid on it. Palindrome names in bold indicate that their sequences were present in ≥2 copies. Positive (+) and negative (-) signs indicate gain and loss of palindrome sequence (possibly only partial), respectively. Arrows represent gain (↑) or loss (↓) of palindrome copy number. If several equally parsimonious scenarios were possible, we assumed a later date of acquisition of the multi-copy state for a palindrome (Note S1).

What drives conservation of gene-free palindromes P6 and P7?

Palindromes are hypothesized to evolve to allow ampliconic genes to withstand high mutation rates on the Y via gene conversion in the absence of interchromosomal recombination (Betrán et al., 2012; Rozen et al., 2003). Two human palindromes, P6 and P7, do not harbor any genes; however, large proportions of their sequences are present and are multi-copy (Fig. 1A, Table S3) in most great ape species we examined (with the exceptions of P7, which is absent from bonobo, and of P6 and P7, which are single-copy in orangutan).

In fact, P6 is the most conserved human palindrome (Fig. 1A) and was present in the multi-copy state in the human-gorilla common ancestor (Fig. 1B). We hypothesized that conservation of P6 and P7 might be explained by their role in regulation of gene expression.

Using ENCODE data (Davis et al., 2018), we identified candidate open chromatin and protein-binding sites in P6 and P7 (Fig. 2-3). In P6, we found markers for open chromatin (DNase I hypersensitive sites) and histone modifications H3K4me1 and H3K27ac, associated with enhancers (Pennacchio et al., 2013), in human umbilical vein endothelial cells (HUVEC). In P7, we found markers for cAMP-responsive element binding protein 1 (CREB-1), which participates in transcription regulation (Mayr & Montminy, 2001), in the human liver cancer cell line HepG2. Interestingly, we did not identify any open-chromatin or enhancer marks in P6 and P7 in testis, suggesting that the sites we found above regulate genes expressed outside of this tissue.

Evolution of gene content in great apes

Utilizing sequence assemblies and testis expression data (Vegesna et al., 2020), we evaluated gene content and the rates of gene birth and death on the Y chromosomes of five great ape species. First, we examined the presence/absence of homologs of human Y chromosome genes (16 X-degenerate genes and nine ampliconic gene families, a total of 25 gene families; for multi-copy gene families, we were not studying copy number variation, but only presence/absence of a family in a species; Fig. S1). Such data were previously available for the chimpanzee Y, in which seven out of 25 human Y gene families became pseudogenized or deleted (Hughes et al., 2010), and for the gorilla Y, in which only one gene family (VCY) out of 25 is absent (Tomaszkiewicz et al., 2016). Here, we compiled the data for bonobo and orangutan. From the 25 gene families present on the human Y, the bonobo Y lacked seven (HSFY, PRY, TBL1Y, TXLNGY, USP9Y, VCY, and XKRY) and the orangutan Y lacked five (TXLNGY, CYorf15A, PRKY, USP9Y, and VCY). Second, we annotated putative new genes in our bonobo and orangutan Y assemblies (Note S2). Our results suggest that the bonobo and orangutan Y chromosomes, similar to the chimpanzee (Hughes et al., 2010) and gorilla (Tomaszkiewicz et al., 2016) Ys, do not harbor novel genes. As a result, we obtained the complete information about gene family content on the Y chromosome in five great ape species.

Figure 4-2. IGV screenshots of peaks in DNase-seq, H3K4me1 and H3K27ac on palindrome P6. (A) Peaks on both arms of the palindrome are shown within the blue and grey boxes. (B) Zoomed-in view of peaks on the left arm of palindrome P6. (C) Zoomed-in view of peaks on the right arm of palindrome P6. In (B) and (C), the coverage track represents the depth of coverage and the peaks track represents the peaks identified by macs2.

Figure 4-3. IGV screenshot of CREB1 peaks on palindrome P7. Peaks on both arms of the palindrome are shown. The coverage track represents the depth of coverage and the peaks track represents the peaks identified by macs2.

Using this information, we reconstructed gene content at ancestral nodes and asked whether the rates of gene birth and death varied across the great ape phylogeny. For this analysis, we employed the evolutionary model developed by Iwasaki and Takagi (Iwasaki & Takagi, 2007) and used the macaque Y chromosome as an outgroup (Hughes et al., 2012). Because X-degenerate and ampliconic genes might exhibit different trends, we analyzed them separately (Fig. 4, Table S5). Considering gene births, none were observed for X-degenerate genes, and only one (VCY, in the human-chimpanzee-bonobo common ancestor) was observed for ampliconic genes, leading to overall low gene birth rates. Considering gene deaths, three ampliconic gene families and three X-degenerate genes were lost by the chimpanzee-bonobo common ancestor, leading to death rates of 0.095 and 0.049 events/MY, respectively. Bonobo lost an additional ampliconic gene, whereas chimpanzee lost an additional X-degenerate gene, leading to death rates of 0.182 and 0.080 events/MY, respectively. In contrast, no deaths of either ampliconic or X-degenerate genes were observed in human and gorilla. Orangutan did not experience any deaths of X-degenerate genes, but lost four ampliconic genes. Its ampliconic gene death rate (0.021 events/MY) was still lower than that in the bonobo or in the bonobo-chimpanzee common ancestor. To summarize, across great apes, the Pan genus exhibited the highest death rates for both X-degenerate and ampliconic genes.

Figure 4-4. Evolution of Y chromosome gene content in great apes. The reconstructed history of gene birth and death for X-degenerate (blue) and ampliconic (red) genes was overlaid on the great ape phylogenetic tree (not drawn to scale), using macaque as an outgroup. The rates of gene birth and death (in events per million years) are shown in parentheses (for complete data see Fig. S3). The list at the root includes the genes that were present in the common ancestor of great apes and macaque. In addition to most of the genes on the human Y, the macaque Y harbors the X-degenerate MXRA5Y gene, which we found to be deleted in orangutan and pseudogenized in bonobo, chimpanzee, gorilla, and human. We currently cannot find a full-length copy of the VCY gene in bonobo. TXLNGY and DDX3Y are also known as CYorf15B and DBY, respectively.

Gene conversion between X and Y chromosomes of great apes

To study the presence of gene conversion between X-degenerate genes and their homologs on the X chromosome in great apes, we generated a multiple-sequence alignment of the X and Y chromosomes (see Methods). From this alignment, we extracted alignment blocks specific to 12 single-copy X-degenerate genes (including both exons and introns) on the human Y (from the 16 human X-degenerate genes we excluded CYorf15A, CYorf15B, RPS4Y1, and RPS4Y2, as parts of these genes have repeats on the Y and have homologs on the X chromosome, making detection of gene conversion difficult). We further retained only alignment blocks greater than 50 bp and tested for the presence of gene conversion in them using GENECONV (Sawyer, 1999). The resulting alignment blocks covered 10-75% of sequence among the 12 X-degenerate genes, and gene conversion tracts were up to 410 bp in length. Gene conversion events were observed in NLGN4Y for all species (Table 1). From multiple species alignments, the total number of gene conversion events ranged from nine to 13 across great ape species, except for orangutan, which had only two high-confidence events. From pairwise chromosome X and Y alignments in each species, seven to 38 gene conversion events were observed across great ape species. Gene conversion events were most frequent in PRKY, which was previously shown to undergo gene conversion in humans (Rosser et al., 2009). Our results indicate that this gene also undergoes gene conversion in other great apes except for orangutan. No X-Y gene conversion events were detected for EIF1AY, KDM5D, SRY, TMSB4Y, USP9Y, and UTY in any great ape species studied.

Table 4-1. Gene conversion between X and Y chromosomes of great apes using GENECONV. The first column lists the gene symbols, with the stratum they belong to in brackets. The values in the following columns represent the number of gene conversion events with significant p-values (<0.05), corrected for multiple comparisons across all sequence pairs. The values in brackets represent the number of gene conversion events with significant p-values (<0.05) for pairwise comparisons, corrected for the length of the alignment.

Gene (stratum)   Bonobo    Chimpanzee   Human     Gorilla   Orangutan
AMELY (4)        1 (1)     1 (1)        1 (1)     1 (1)     0
DBY (3)          0 (1)     0 (1)        0         0         0 (1)
EIF1AY (3)       0         0            0         0         0
KDM5D (2)        0         0            0         0         0
NLGN4Y (4)       2 (6)     2 (7)        3 (7)     1 (8)     1 (3)
PRKY (5)         8 (25)    10 (27)      4 (17)    8 (17)    0
SRY (1)          0         0            0         0         0
TBL1Y (4)        1 (4)     0 (2)        1 (7)     0 (3)     0 (2)
TMSB4Y (3)       0         0            0         0         0
USP9Y (3)        0         0            0         0         0
UTY (3)          0         0            0         0         0
ZFY (3)          0         0            0         0         1 (1)
Total            12 (37)   13 (38)      9 (32)    10 (29)   2 (7)

Discussion

Palindromes

We found that multi-copy sequences are abundant in great ape Ys and that many of them were present in the common ancestor of great apes. In fact, substantial portions of most human palindromes (five out of eight), and of most chimpanzee palindrome groups (three out of five), were likely multi-copy (and thus potentially palindromic) in the common ancestor of great apes, suggesting conservation over >13 MY of evolution. Moreover, two of the three rhesus macaque palindromes are conserved with human palindromes P4 and P5 (Hughes et al., 2012), indicating conservation over >25 MY. On the other hand, our study also found species-specific amplification or loss of palindromes and other repetitive sequences, indicating that Y ampliconic sequence evolution is highly dynamic. A vivid example of this pattern is an extensive amplification of sequences homologous to P1, P2, P4, P5, and C3-C4 in orangutan. We also found that sizeable proportions of species-specific sequences in the bonobo, gorilla and orangutan Ys are multi-copy. Future studies will establish which of them form palindromes as opposed to tandem repeats. Regardless, repetitive sequences constitute a biologically significant component of great ape Y chromosomes and their multi-copy state might be selected for.

Previous studies (e.g., reviewed in Betrán et al., 2012; Rozen et al., 2003; Trombetta & Cruciani, 2017) focused on the role of Y-Y gene conversion in preserving Y ampliconic gene families, which are critical for spermatogenesis and fertility (Skaletsky et al., 2003), and suggested that this phenomenon constitutes the major adaptive role of palindromic sequences. Our findings suggest that, instead of genes, some human palindromes (P6 and P7) possess regions that regulate the expression of genes transcribed outside of testis and thus likely located on non-Y chromosomes, and that such regulatory regions might drive conservation of gene-free palindromes. This observation should be examined in more detail in the future, but can potentially shift a paradigm in our understanding of Y chromosome functions. Indeed, our results imply that, in addition to carrying genes important for male sex determination and spermatogenesis, the Y chromosome participates in regulating gene expression in the genome. This echoes findings in Drosophila (e.g., Lemos et al., 2010) and in the mouse Y chromosome, which contains a small, gene-free region that interacts with the rest of the genome (Kaufmann et al., 2015).

Genes

We found that the gene content in the common ancestor of great apes likely was the same as is currently found in gorilla, and included eight ampliconic and 16 X-degenerate genes (Fig. 4). Analyzing the data on the evolution of ampliconic gene content (Fig. 4), palindrome sequence (Fig. 1B), and ampliconic gene copy number (Fig. 2 in Chapter 3) jointly, we can infer which ampliconic genes were present in the multi-copy state in the common ancestor of great apes. Our results suggest that the common ancestor of great apes had sequences homologous to P1, P2, P4, P5 and P8 in the multi-copy state (Fig. 1B), which in human carry DAZ, BPY2, CDY, HSFY, XKRY, and VCY (Tomaszkiewicz et al., 2017). Except for VCY, which was acquired by the human-chimpanzee common ancestor, the remaining five genes were likely present as multi-copy gene families in the common ancestor of great apes, because three of them (DAZ, BPY2, and CDY) are present as multi-copy in all great ape species (Vegesna et al., 2020), and the other two (HSFY and XKRY) are multi-copy in all great ape species but chimpanzee and bonobo (Vegesna et al., 2020), in which they were lost (Fig. 4).

Our comprehensive analysis of great ape Y gene content has allowed us to investigate rates of gene birth and death on the Y. We discovered that there is only one gene family that was born throughout the whole great ape phylogeny—VCY was acquired by the common ancestor of human and chimpanzee. As a result, except for this branch, we found uniformly low rates of gene birth. A low rate of ampliconic gene birth contradicts predictions of high birth rate made in previous studies for such genes (Trombetta & Cruciani, 2017), but suggests that great ape radiation does not provide sufficient time for gene acquisition by ampliconic regions. For new genes to survive on the Y chromosome, they should be beneficial to males. Ampliconic regions on the Y chromosomes of several other mammals indeed acquired such genes (Chang et al., 2013; Soh et al., 2014).

We expected to observe a high death rate for X-degenerate genes, but a low death rate for ampliconic genes, because the former genes do not undergo Y-Y gene conversion and thus should accumulate deleterious mutations, whereas the latter genes are multi-copy (all copies need to be deleted or pseudogenized for the gene family to die) and can be rescued by Y-Y gene conversion. Unexpectedly, the rates of gene death did not differ between ampliconic and X-degenerate genes. Indeed, ~44.4% of ampliconic gene families were either deleted or pseudogenized, as compared with ~43.8% of X-degenerate genes, across the great ape Y phylogenetic tree. These values were 67% and 50%, respectively, if we included the macaque Y. While our data did not support our hypothesis, other findings suggest that death of ampliconic genes is a gradual process. Indeed, we have recently shown that ampliconic gene families dead in some great ape species have reduced copy number in other species (Vegesna et al., 2019, 2020), lowering the chances for Y-Y gene conversion. These observations suggest that such genes are on the way to becoming non-essential and are at death’s door in great apes.

The rates of gene death varied among great ape species. In particular, we observed high rates of death in the Pan lineage—in the common ancestor of bonobo and chimpanzee, but also in the bonobo and chimpanzee lineages. Thus, the evolutionary forces driving such a high rate of gene death have likely been operating in the Pan lineage continuously since its divergence from the human lineage. What evolutionary forces could explain this observation? First, gene-disrupting or gene deletion mutations could be hitchhiking in haplotypes with positively selected mutations. Positive selection might be acting in the Pan lineage due to its polyandrous mating pattern and sperm competition. No gene deaths in the human and gorilla lineages, experiencing no sperm competition, and low gene death rates in orangutan, experiencing limited sperm competition, are consistent with this explanation.

Gene conversion

The suppression of recombination between the X and Y chromosomes has occurred in at least five distinct steps (strata) (Lahn & Page, 1999; Ross et al., 2005). The youngest stratum retains the highest X-Y sequence similarity of ~95% (Ross et al., 2005). However, there are cases where genes on the male-specific part of the Y undergo gene conversion with the X (e.g., PRKY and VCY), which results in regions with higher similarity with the X chromosome (Rosser et al., 2009; Trombetta et al., 2010). We demonstrated X-Y gene conversion at several X-degenerate genes (AMELY, NLGN4Y, PRKY, and TBL1Y) in all great apes (Table 1). Among the genes with X-Y gene conversion, AMELY, NLGN4Y, and TBL1Y belong to stratum 4 and PRKY belongs to stratum 5 (Hughes et al., 2012; Lahn & Page, 1999; Ross et al., 2005). PRKY has a higher frequency of gene conversion events when compared to the other three genes, which is consistent with the requirement of high sequence similarity (>92%) for gene conversion to take place (Chen, Cooper, Chuzhanova, Férec, & Patrinos, 2007). These conclusions are based on a multiple sequence alignment of X and Y chromosomes of great apes. The Y chromosome assemblies of human and chimpanzee are of a higher quality when compared to those of other great apes, which could influence the count and length of gene conversion events predicted. In the future, gene conversion in ampliconic genes should also be studied. This would require obtaining the fully resolved (i.e. non-collapsed) ampliconic sequences in the assemblies of the Y chromosome.

Materials and Methods

Human and chimpanzee palindrome arm sequence conservation

The multiple sequence alignment of the Y chromosomes of bonobo, chimpanzee, human, gorilla and orangutan was generated using the reference-free whole-genome aligner cactus (Paten et al., 2011). We used default parameters and did not provide a guide tree to cactus for the generation of the multiple sequence alignment. The cactus output (.hal) file was converted to MAF twice using the hal2maf function (--noAncestors --refGenome hg_Y --maxRefGap 100 --maxBlockLen 10000) in HAL tools (Hickey, Paten, Earl, Zerbino, & Haussler, 2013). The first time, we set --refGenome to ‘human’, which results in a human-centric MAF file (where the first line represents sequences from the human Y), and the second time, we set the --refGenome parameter to ‘chimpanzee’ to obtain a chimpanzee-based MAF file. The coordinates of the human and chimpanzee palindromes (PanTro4) were obtained from a previous publication (Tomaszkiewicz et al., 2016) (Table S6-7). Chimpanzee palindrome coordinates were converted to the PanTro6 version using the liftOver utility from the UCSC Genome Browser (Kent et al., 2002). For the few chimpanzee palindromes for which liftOver failed to convert the coordinates, we generated a dotplot of the chimpanzee Y to itself using Gepard (Krumsiek, Arnold, & Rattei, 2007) and identified the locations of palindromes from the dotplot.

Using a custom Python script that makes use of AlignIO from Biopython (Cock et al., 2009), we parsed the MAF files to obtain all the alignment blocks which overlapped given palindrome arm coordinates and, for each species, summed the number of sites that are covered by these parsed alignment blocks. For each alignment block within the palindrome arm, we considered only those sequences which had less than 5% gaps and counted the number of aligned nucleotides per species at sites where the palindrome arm is unmasked (soft-masked with RepeatMasker, with lower-case nucleotides representing repeats). Next, we calculated the number of non-repetitive characters in the whole palindrome arm and used it to calculate the percentage of non-repetitive alignment within the palindrome arm per species.
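The coverage calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the custom script used in the study: alignment blocks are represented as plain dictionaries mapping a species name to its gapped, soft-masked row (rather than Biopython alignment objects), and the function names and the toy blocks are hypothetical.

```python
# Sketch of per-species coverage of the non-repetitive part of a palindrome arm.
# Each block is a dict: species name -> aligned row (same length within a block).
# Upper-case bases in the reference row are unmasked (non-repetitive).

def gap_fraction(row):
    """Fraction of gap characters in an aligned row."""
    return row.count("-") / len(row) if row else 1.0

def covered_sites(block, reference, species, max_gap_fraction=0.05):
    """Count unmasked reference sites aligned to a nucleotide of `species`."""
    row = block.get(species)
    if row is None or gap_fraction(row) >= max_gap_fraction:
        return 0  # species absent from the block, or too many gaps
    return sum(
        1
        for ref_base, base in zip(block[reference], row)
        if ref_base in "ACGT" and base.upper() in "ACGT"
    )

def percent_covered(blocks, reference, species, n_unmasked_in_arm):
    """Percentage of the arm's non-repetitive sites covered by `species`."""
    total = sum(covered_sites(b, reference, species) for b in blocks)
    return 100.0 * total / n_unmasked_in_arm

# Toy blocks (hypothetical sequences).
toy_blocks = [
    {"human": "ACGTacgtAC", "gorilla": "ACGTACGTAC"},  # 6 unmasked sites, all covered
    {"human": "GGGGGGGGGG", "gorilla": "GG--GGGGGG"},  # 20% gaps: row is skipped
]
toy_percent = percent_covered(toy_blocks, "human", "gorilla", n_unmasked_in_arm=16)
```

The denominator is passed in explicitly because, as described above, it is computed over the whole palindrome arm rather than over the aligned blocks only.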

Palindrome sequence read depth in bonobo, gorilla, and orangutan

We used a previously established pipeline for identifying human and chimpanzee palindrome sequences in Y assemblies (Tomaszkiewicz et al., 2016). The bonobo, gorilla, and orangutan Y contigs were broken into non-overlapping 1-kb windows, and each window was aligned to the human (hg38) and, separately, the chimpanzee (panTro6) Y chromosome sequences using LASTZ (version 1.04.00) (--scores=human_primate.q, --seed=match12, --markend) (Harris, 2007). The substitution scores were set identical to those used for published LASTZ alignments of primates (Miller et al., 2007). The best alignment for each window was retrieved from the LASTZ output. All the windows with alignments of >800 bp were considered homologous to human or chimpanzee Y palindromes and used in the downstream analysis.
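The windowing and filtering steps can be illustrated with a short sketch. Several details are assumptions made only for this example: the "best" alignment is taken to be the one with the highest score, the last window of a contig is allowed to be shorter than 1 kb, and alignment hits are given as already-parsed tuples rather than read from LASTZ output. The function names are hypothetical.

```python
# Sketch: split a contig into non-overlapping 1-kb windows and keep windows
# whose best alignment is longer than 800 bp.

def split_into_windows(contig_length, window=1000):
    """Return half-open (start, end) intervals tiling the contig."""
    return [
        (start, min(start + window, contig_length))
        for start in range(0, contig_length, window)
    ]

def homologous_windows(hits, min_aligned=800):
    """Keep the best hit per window and filter by aligned length.

    `hits` is an iterable of (window_id, aligned_length, score) tuples.
    Returns a dict: window_id -> aligned length of its best hit.
    """
    best = {}
    for window_id, aligned_length, score in hits:
        if window_id not in best or score > best[window_id][1]:
            best[window_id] = (aligned_length, score)
    return {
        window_id: length
        for window_id, (length, _score) in best.items()
        if length > min_aligned
    }

# Toy hits (hypothetical values).
toy_hits = [
    ("w1", 950, 9000),
    ("w1", 600, 4000),
    ("w2", 700, 8000),
    ("w3", 801, 5000),
]
kept = homologous_windows(toy_hits)
```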

The whole-genome male sequencing reads (paired-end reads of length 251 bp) from orangutan (ID PR00251), gorilla (ID KB3781), and bonobo (ID 1991-0051) were mapped to the human and (separately) chimpanzee Ys using bwa mem (version 0.7.17-r1188) (Li, 2013) and the resulting output files were sorted and indexed using samtools (version 1.9) (Li et al., 2009). Using computeGCBias and correctGCBias functions from deepTools (version 3.3.0) (Ramírez et al., 2016), we performed GC correction of the sequencing read depths. Finally, using bedtools (v2.27.1) (Quinlan & Hall, 2010) ‘coverage’ function, we obtained the read depth of the windows homologous to human and chimpanzee palindrome sequences identified as described in the previous paragraph. We compared the read depths of palindromic sequence windows to that of windows overlapping X-degenerate genes (Table S8-9). In the case of chimpanzee, windows mapping to all the palindromes with similar repeats were grouped into C1 (C1, C6, C8, C10, C14, and C16), C2 (C2, C11, and C15), C3 (C3 and C12), C4 (C4 and C13), and C5 (C5, C7 and C9) palindrome groups.
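One way to turn such a read depth comparison into a copy number estimate is sketched below. The exact normalization is not specified in this section, so the choice made here (dividing each window's depth by the median depth of windows overlapping X-degenerate genes, which are expected to be single-copy, and then taking the natural log as in the Figure 1A box plots) is an assumption, and the function names and depth values are hypothetical.

```python
import math
from statistics import median

# Sketch: estimate relative copy number of palindrome-homologous windows,
# using X-degenerate windows as an assumed single-copy baseline.

def relative_copy_number(window_depths, xdeg_depths):
    """Depth of each window divided by the median X-degenerate depth."""
    baseline = median(xdeg_depths)
    return [depth / baseline for depth in window_depths]

def log_copy_number(copy_numbers):
    """Natural log of copy number; windows with zero depth are skipped."""
    return [math.log(c) for c in copy_numbers if c > 0]

# Toy depths (hypothetical values).
toy_copy_numbers = relative_copy_number([20.0, 40.0, 10.0], [9.0, 10.0, 11.0])
```

Under this assumption, a window with a relative copy number near 2 would be consistent with a sequence present in two copies, for example on both arms of a palindrome.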

Search for regulatory factors in human palindromes P6 and P7

We checked for the presence of sites specific to DNA-binding proteins on human palindromes P6 and P7, which could imply the presence of functionality related to gene regulation. Initially, we used the ENCODE track (http://genome.ucsc.edu/ENCODE/) at the UCSC Genome Browser (Kent et al., 2002) to search for epigenetic modifications in the palindrome regions (Dunham et al., 2012). In particular, we used the track from the Bernstein Lab at the Broad Institute containing H3K27Ac and H3K4Me1 data on seven cell lines from ENCODE, among which we identified human umbilical vein endothelial cells (HUVEC) as having signals in palindrome P6. Later, we extended our search to the ENCODE data portal (https://www.encodeproject.org/), from which we downloaded the BAM files to look for the presence of peaks (Table S10) (Davis et al., 2018). We used the “Search by region” page under Data in the ENCODE data portal (https://www.encodeproject.org/region-search/) and the GRCh38 setting to search for files which are related to the coordinates of palindromes P6 and P7. The resulting files were visualized in the UCSC Genome Browser to identify signals in the peaks, and from this information we identified signals in the human liver cancer cell line HepG2 for P7. The ENCODE data processing pipeline by default filters out reads with low mapping quality; as a result, we did not find any peaks/signals in the majority of the datasets (https://www.encodeproject.org/pipelines/). Thus, the current version of ENCODE tracks cannot be used as a source for studying epigenetic modifications in palindromes. To validate the two cell types, HUVEC and HepG2, in which the signal was observed, we manually downloaded the unfiltered BAM files (HUVEC: H3K27Ac, H3K4Me1 and DNase-seq; Testis: H3K27Ac, H3K4Me1 and DNase-seq; and HepG2: CREB1; Table S10), which include low-mapping-quality reads. We performed peak calling using macs2 (version 2.1.4; -f BAM --broad --broad-cutoff 0.05) (Gaspar, 2018), and the control samples were used whenever available. We used the Integrative Genomics Viewer (IGV) (version 2.4.19) (Robinson et al., 2011) to visualize the data.

Analysis of genes homologous to human Y genes

For the bonobo and orangutan Y chromosomes, some data on the X-degenerate gene content were obtained from a recent review (Hallast & Jobling, 2017) (Note S2), and data on the ampliconic gene content were obtained from a recent publication (Vegesna et al. 2020; Chapter 3). However, information about RPS4Y2 remained missing. We used gene predictions (see next paragraph) and testis-specific transcriptome assemblies from the same publication to check for the presence of the RPS4Y2 and MXRA5Y genes (Note S2) (Vegesna et al. 2020; Chapter 3). The gene content of the macaque Y chromosome was used as an outgroup of great apes (Hughes et al., 2012).

Novel genes in bonobo and orangutan assemblies

We used AUGUSTUS (Stanke & Waack, 2003) (--species=human --softmasking=on --codingseq=on) to predict genes in the Y chromosome assemblies of bonobo and orangutan. From the list of predicted genes, we retained those that have a start and a stop codon. Using blastp from BLAST (2.9.0; -db uniprot_sprot.pep -max_target_seqs 1 -evalue 1e-5 -num_threads 10) (Altschul, Gish, Miller, Myers, & Lipman, 1990), we annotated the predicted gene sequences. Based on the blast annotation, we classified all the genes which are not annotated by blastp as candidate de novo genes. From these genes, we retained those that had >90% identity to the sequence in the UniProt (Boutet et al., 2016) protein database and covered at least 90% of the gene sequence in the blast output. To make sure that the predicted genes are on the Y chromosome and do not represent misassemblies, we performed an extra filtering step where we used only those genes which are present on contigs which align to either human or chimpanzee Y chromosomes. The resulting gene annotations which are not found on the human Y chromosome were assigned as candidate novel genes translocated to the Y (Table S11). The longest predicted transcript sequences for genes annotated as Y-chromosome-specific were obtained without the above filters for validation of missing gene content in bonobo and orangutan.
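The first filter above (retaining predictions with both a start and a stop codon) can be illustrated on predicted coding sequences. This is only one possible way to apply such a check and is an assumption of this sketch; the original filter may equally have relied on the feature annotations in the gene prediction output. The function name and example sequences are hypothetical.

```python
# Sketch: keep predicted coding sequences that begin with a start codon
# and end with a stop codon (standard genetic code).

STOP_CODONS = {"TAA", "TAG", "TGA"}

def has_start_and_stop(cds):
    """True if the sequence starts with ATG and ends with a stop codon."""
    cds = cds.upper()
    return len(cds) >= 6 and cds.startswith("ATG") and cds[-3:] in STOP_CODONS

# Toy predictions (hypothetical sequences).
toy_predictions = {
    "g1": "ATGAAACCCTAA",  # complete
    "g2": "ATGAAACCCGGG",  # no stop codon
    "g3": "CCCAAACCCTGA",  # no start codon
}
complete = [name for name, cds in toy_predictions.items() if has_start_and_stop(cds)]
```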

Reconstruction of gene content of great apes

Once we obtained the complete Y chromosome gene content in great apes, we converted the gene content into binary values which represent the presence or absence of a gene in a species. The complete loss or pseudogenization of a gene or gene family is represented by zero and the remaining cases are represented as ones. Using the model developed by Iwasaki and Takagi (Iwasaki & Takagi, 2007), we reconstructed the evolutionary history of genes (separately ampliconic and X-degenerate) across great apes. The phylogenetic tree of great apes (Locke et al., 2011), along with the table representing the presence or absence of genes, was used as an input to generate the rate of gene birth and gene death for each branch of the tree, and the reconstructed gene content at each internal node of the tree. To obtain the rate in the units of events per million years, the rate of gene birth, as well as the rate of gene death, was divided by the length of the branches (in millions of years) (Iwasaki & Takagi, 2007).
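The two bookkeeping steps in this procedure, binary encoding of gene status and conversion of a per-branch rate to events per million years, are simple enough to sketch. The sketch does not implement the Iwasaki and Takagi model itself; the status labels, gene names, rates, and branch lengths below are placeholders, and the function names are hypothetical.

```python
# Sketch: encode gene status as presence/absence and rescale a per-branch
# rate by branch length. All values are illustrative placeholders.

ABSENT_STATUSES = {"deleted", "pseudogenized"}

def to_binary(status):
    """Return 0 for a deleted or pseudogenized gene (family), 1 otherwise."""
    return 0 if status in ABSENT_STATUSES else 1

def presence_table(status_by_species):
    """Convert {species: {gene: status}} into {species: {gene: 0 or 1}}."""
    return {
        species: {gene: to_binary(status) for gene, status in genes.items()}
        for species, genes in status_by_species.items()
    }

def rate_per_my(branch_rate, branch_length_my):
    """Divide a per-branch rate by the branch length in millions of years."""
    return branch_rate / branch_length_my

# Toy input (hypothetical gene names and statuses).
toy_status = {
    "speciesA": {"geneA": "present", "geneB": "pseudogenized"},
    "speciesB": {"geneA": "present", "geneB": "deleted"},
}
toy_table = presence_table(toy_status)
```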

Gene conversion events between the X and Y chromosomes

We soft-masked the X and Y chromosomes of great apes using RepeatMasker (Smit, 2004) (RepeatMasker -pa 63 -xsmall -species Primates ${assembly}.rmsk.fa). Progressive Cactus (Paten et al., 2011) was used to align the chromosomes. A guide tree that pairs the X and Y chromosomes from the same species was used (((chimpY, chimpX),(bonoboY, bonoboX),(humanY, humanX),(gorillaY, gorillaX),(sorangY, sorangX))). The resulting alignment output from Cactus was converted to MAF format using hal2maf (Hickey et al., 2013) with human Y as the reference (--noAncestors --refGenome humanY --maxRefGap 100 --maxBlockLen 10000). Then we parsed the alignment blocks (retaining blocks longer than 50 bp; gene conversion tracts in humans range from 55 to 290 bp, as observed in previous studies (Chen et al., 2007; Jeffreys & May, 2004)) that fall within the coordinates of X-degenerate genes (excluding CYorf15A, CYorf15B, RPS4Y1, and RPS4Y2; see Results). We did not perform additional filtering based on repeat content within the alignment blocks. For each block, the alignments that include sequences from both the X and Y chromosomes of a species were written to a FASTA file. We used GENECONV (Sawyer, 1999) (/w9 /lp -nolog) to identify gene conversion events from the multiple sequence alignment files. By default, the output of GENECONV constitutes a global list of high-confidence gene conversion events, corrected for multiple testing, for each sequence pair. We also used the /lp parameter, which produces a second list of significant gene conversion events for all possible pairwise comparisons. We used a p-value cutoff of 0.05, as GENECONV provides p-values after correcting for multiple comparisons. We used the pairwise comparisons to address gene conversion in cases where a chromosome was represented by more than one sequence in the alignment.
From the GENECONV output, we parsed events that constitute gene conversion between the X and Y chromosomes of the same species and retained events longer than 50 bp.
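The final filtering step can be sketched as follows. This is an illustration only: it does not parse real GENECONV output, and the `ConversionEvent` record, its fields, the inclusive coordinates, and the example events are all hypothetical. The only conventions taken from the text are the sequence names from the guide tree (species name followed by X or Y), the 50 bp length threshold, and the 0.05 p-value cutoff (applied here as strictly below 0.05, which is an assumption).

```python
from dataclasses import dataclass


@dataclass
class ConversionEvent:
    """Hypothetical record for one candidate gene conversion event."""
    seq1: str        # e.g. "humanX"
    seq2: str        # e.g. "humanY"
    begin: int       # start in the alignment block (assumed 1-based, inclusive)
    end: int         # end in the alignment block (assumed inclusive)
    p_value: float   # multiple-comparison corrected p-value

    @property
    def length(self):
        return self.end - self.begin + 1


def split_name(name):
    """Split a name such as 'sorangY' into ('sorang', 'Y')."""
    return name[:-1], name[-1]


def is_same_species_xy(event):
    """True if the event pairs the X and Y chromosomes of one species."""
    species1, chrom1 = split_name(event.seq1)
    species2, chrom2 = split_name(event.seq2)
    return species1 == species2 and {chrom1, chrom2} == {"X", "Y"}


def filter_events(events, min_length=50, max_p=0.05):
    """Keep same-species X-Y events longer than min_length with p below max_p."""
    return [
        e for e in events
        if is_same_species_xy(e) and e.length > min_length and e.p_value < max_p
    ]


# Hypothetical events for illustration
events = [
    ConversionEvent("humanX", "humanY", 101, 200, 0.01),    # kept
    ConversionEvent("humanY", "chimpY", 1, 300, 0.001),     # different species
    ConversionEvent("gorillaX", "gorillaY", 10, 40, 0.01),  # too short
    ConversionEvent("sorangY", "sorangX", 5, 120, 0.2),     # not significant
]
kept = filter_events(events)
print([(e.seq1, e.seq2, e.length) for e in kept])
```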

References

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local

alignment search tool. Journal of Molecular Biology, 215(3), 403–410.

Bellott, D. W., Hughes, J. F., Skaletsky, H., Brown, L. G., Pyntikova, T., Cho, T. J.,

Koutseva, N., Zaghlul, S., Graves, T., Rock, S., Kremitzki, C., Fulton, R. S., Dugan,

S., Ding, Y., Morton, D., Khan, Z., Lewis, L., Buhay, C., Wang, Q., … Page, D. C.

(2014). Mammalian Y chromosomes retain widely expressed dosage-sensitive

regulators. Nature, 508(7497), 494–499.

Betrán, E., Demuth, J. P., & Williford, A. (2012). Why chromosome palindromes?

International Journal of Evolutionary Biology, 2012, 207958.

Boutet, E., Lieberherr, D., Tognolli, M., Schneider, M., Bansal, P., Bridge, A. J., Poux, S.,

Bougueleret, L., & Xenarios, I. (2016). UniProtKB/Swiss-Prot, the Manually

Annotated Section of the UniProt KnowledgeBase: How to Use the Entry View.

Methods in Molecular Biology, 1374, 23–54.

Chang, T.-C., Yang, Y., Retzel, E. F., & Liu, W.-S. (2013). Male-specific region of the

bovine Y chromosome is gene rich with a high transcriptomic activity in testis

development. Proceedings of the National Academy of Sciences of the United

States of America, 110(30), 12373–12378.

Chen, J.-M., Cooper, D. N., Chuzhanova, N., Férec, C., & Patrinos, G. P. (2007). Gene

conversion: mechanisms, evolution and human disease. Nature Reviews. Genetics,

8(10), 762–775.

Cheung, V. G., Nayak, R. R., Wang, I. X., Elwyn, S., Cousins, S. M., Morley, M., &

Spielman, R. S. (2010). Polymorphic cis- and trans-regulation of human gene

expression. PLoS Biology, 8(9). https://doi.org/10.1371/journal.pbio.1000480

Cock, P. J. A., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., Friedberg,

I., Hamelryck, T., Kauff, F., Wilczynski, B., & de Hoon, M. J. L. (2009). Biopython:

freely available Python tools for computational molecular biology and bioinformatics.

Bioinformatics, 25(11), 1422–1423.

Davis, C. A., Hitz, B. C., Sloan, C. A., Chan, E. T., Davidson, J. M., Gabdank, I., Hilton,

J. A., Jain, K., Baymuradov, U. K., Narayanan, A. K., Onate, K. C., Graham, K.,

Miyasato, S. R., Dreszer, T. R., Strattan, J. S., Jolanki, O., Tanaka, F. Y., & Cherry,

J. M. (2018). The Encyclopedia of DNA elements (ENCODE): data portal update.

Nucleic Acids Research, 46(D1), D794–D801.

Dunham, I., Kundaje, A., Aldred, S. F., Collins, P. J., Davis, C. A., Doyle, F., Epstein, C.

B., Frietze, S., Harrow, J., Kaul, R., Khatun, J., Lajoie, B. R., Landt, S. G., Lee, B.

K., Pauli, F., Rosenbloom, K. R., Sabo, P., Safi, A., Sanyal, A., … Lochovsky, L.

(2012). An integrated encyclopedia of DNA elements in the human genome. Nature,

489(7414), 57–74.

Gaspar, J. M. (2018). Improved peak-calling with MACS2. In bioRxiv (p. 496521).

https://doi.org/10.1101/496521

Gläser, B., Grützner, F., Willmann, U., Stanyon, R., Arnold, N., Taylor, K., Rietschel, W.,

Zeitler, S., Toder, R., & Schempp, W. (1998). Simian Y chromosomes: species-

specific rearrangements of DAZ, RBM, and TSPY versus contiguity of PAR and

SRY. Mammalian Genome: Official Journal of the International Mammalian Genome

Society, 9(3), 226–231.

Graves, J. A. M. (1995). The origin and function of the mammalian Y chromosome and

Y‐borne genes – an evolving understanding. BioEssays: News and Reviews in

Molecular, Cellular and Developmental Biology, 17(4), 311–320.

Hallast, P., & Jobling, M. A. (2017). The Y chromosomes of the great apes. Human

Genetics, 136(5), 511–528.

Harris, R. S. (2007). Improved pairwise alignment of genomic DNA.

https://etda.libraries.psu.edu/catalog/7971

Hickey, G., Paten, B., Earl, D., Zerbino, D., & Haussler, D. (2013). HAL: a hierarchical

format for storing and analyzing multiple genome alignments. Bioinformatics,

29(10), 1341–1342.

Hughes, J. F., Skaletsky, H., Brown, L. G., Pyntikova, T., Graves, T., Fulton, R. S.,

Dugan, S., Ding, Y., Buhay, C. J., Kremitzki, C., Wang, Q., Shen, H., Holder, M.,

Villasana, D., Nazareth, L. V., Cree, A., Courtney, L., Veizer, J., Kotkiewicz, H., …

Page, D. C. (2012). Strict evolutionary conservation followed rapid gene loss on

human and rhesus Y chromosomes. Nature, 483(7387), 82–86.

Hughes, J. F., Skaletsky, H., Pyntikova, T., Graves, T. A., van Daalen, S. K. M., Minx, P.

J., Fulton, R. S., McGrath, S. D., Locke, D. P., Friedman, C., Trask, B. J., Mardis, E.

R., Warren, W. C., Repping, S., Rozen, S., Wilson, R. K., & Page, D. C. (2010).

Chimpanzee and human Y chromosomes are remarkably divergent in structure and

gene content. Nature, 463(7280), 536–539.

Hughes, J. F., Skaletsky, H., Pyntikova, T., Minx, P. J., Graves, T., Rozen, S., Wilson, R.

K., & Page, D. C. (2005). Conservation of Y-linked genes during human evolution

revealed by comparative sequencing in chimpanzee. Nature, 437(7055), 100–103.

Iwasaki, W., & Takagi, T. (2007). Reconstruction of highly heterogeneous gene-content

evolution across the three domains of life. Bioinformatics, 23(13), i230–i239.

Jeffreys, A. J., & May, C. A. (2004). Intense and highly localized gene conversion activity

in human meiotic crossover hot spots. Nature Genetics, 36(2), 151–156.

Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A. M., &

Haussler, D. (2002). The Human Genome Browser at UCSC. Genome

Research, 12(6), 996–1006. https://doi.org/10.1101/gr.229102

Krumsiek, J., Arnold, R., & Rattei, T. (2007). Gepard: a rapid and sensitive tool for

creating dotplots on genome scale. Bioinformatics, 23(8), 1026–1028.

Kuroda-Kawaguchi, T., Skaletsky, H., Brown, L. G., Minx, P. J., Cordum, H. S.,

Waterston, R. H., Wilson, R. K., Silber, S., Oates, R., Rozen, S., & Page, D. C.

(2001). The AZFc region of the Y chromosome features massive palindromes and

uniform recurrent deletions in infertile men. Nature Genetics, 29(3),

279–286. https://doi.org/10.1038/ng757

Lahn, B. T., & Page, D. C. (1999). Four Evolutionary Strata on the Human

X Chromosome. Science, 286(5441), 964–967.

Lange, J., Noordam, M. J., van Daalen, S. K. M., Skaletsky, H., Clark, B. A., Macville, M.

V., Page, D. C., & Repping, S. (2013). Intrachromosomal homologous

recombination between inverted amplicons on opposing Y-chromosome arms.

Genomics, 102(4), 257–264.

Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with

BWA-MEM. In arXiv [q-bio.GN]. arXiv. http://arxiv.org/abs/1303.3997

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G.,

Abecasis, G., Durbin, R., & 1000 Genome Project Data Processing Subgroup.

(2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16),

2078–2079.

Locke, D. P., Hillier, L. W., Warren, W. C., Worley, K. C., Nazareth, L. V., Muzny, D. M.,

Yang, S.-P., Wang, Z., Chinwalla, A. T., Minx, P., Mitreva, M., Cook, L., Delehaunty,

K. D., Fronick, C., Schmidt, H., Fulton, L. A., Fulton, R. S., Nelson, J. O., Magrini,

V., … Wilson, R. K. (2011). Comparative and demographic analysis of orang-utan

genomes. Nature, 469(7331), 529–533.

Lucotte, E. A., Skov, L., Jensen, J. M., Macià, M. C., Munch, K., & Schierup, M. H.

(2018). Dynamic Copy Number Evolution of X- and Y-Linked Ampliconic Genes in

Human Populations. Genetics, 209(3), 907–920.

Mayr, B., & Montminy, M. (2001). Transcriptional regulation by the phosphorylation-

dependent factor CREB. Nature Reviews. Molecular Cell Biology, 2(8), 599–609.

Miller, W., Rosenbloom, K., Hardison, R. C., Hou, M., Taylor, J., Raney, B., Burhans, R.,

King, D. C., Baertsch, R., Blankenberg, D., Kosakovsky Pond, S. L., Nekrutenko, A.,

Giardine, B., Harris, R. S., Tyekucheva, S., Diekhans, M., Pringle, T. H., Murphy, W.

J., Lesk, A., … Kent, W. J. (2007). 28-Way vertebrate alignment and conservation

track in the UCSC Genome Browser. Genome Research, 17(12),

1797–1808. https://doi.org/10.1101/gr.6761107

Oetjens, M. T., Shen, F., Emery, S. B., Zou, Z., & Kidd, J. M. (2016). Y-Chromosome

Structural Diversity in the Bonobo and Chimpanzee Lineages. Genome Biology and

Evolution, 8(7), 2231–2240.

Paten, B., Earl, D., Nguyen, N., Diekhans, M., Zerbino, D., & Haussler, D. (2011).

Cactus: Algorithms for genome multiple sequence alignment. Genome Research,

21(9), 1512–1528.

Pennacchio, L. A., Bickmore, W., Dean, A., Nobrega, M. A., & Bejerano, G. (2013).

Enhancers: five essential questions. Nature Reviews. Genetics, 14(4), 288–295.

Perry, G. H., Tito, R. Y., & Verrelli, B. C. (2007). The evolutionary history of human and

chimpanzee Y-chromosome gene loss. Molecular Biology and Evolution, 24(3),

853–859.

Quinlan, A. R., & Hall, I. M. (2010). BEDTools: a flexible suite of utilities for comparing

genomic features. Bioinformatics, 26(6), 841–842.

Ramírez, F., Ryan, D. P., Grüning, B., Bhardwaj, V., Kilpert, F., Richter, A. S., Heyne, S.,

Dündar, F., & Manke, T. (2016). deepTools2: a next generation web server for

deep-sequencing data analysis. Nucleic Acids Research, 44(W1), W160–W165.

Robinson, J. T., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G.,

& Mesirov, J. P. (2011). Integrative genomics viewer. Nature Biotechnology, 29(1),

24–26.

Rosser, Z. H., Balaresque, P., & Jobling, M. A. (2009). Gene Conversion between the X

Chromosome and the Male-Specific Region of the Y Chromosome at a

Translocation Hotspot. American Journal of Human Genetics, 85(1), 130–134.

Ross, M. T., Grafham, D. V., Coffey, A. J., Scherer, S., McLay, K., Muzny, D., Platzer,

M., Howell, G. R., Burrows, C., Bird, C. P., Frankish, A., Lovell, F. L., Howe, K. L.,

Ashurst, J. L., Fulton, R. S., Sudbrak, R., Wen, G., Jones, M. C., Hurles, M. E., …

Bentley, D. R. (2005). The DNA sequence of the human X chromosome. Nature,

434(7031), 325–337.

Rozen, S., Skaletsky, H., Marszalek, J. D., Minx, P. J., Cordum, H. S., Waterston, R. H.,

Wilson, R. K., & Page, D. C. (2003). Abundant gene conversion between arms of

palindromes in human and ape Y chromosomes. Nature, 423(6942), 873–876.

Sawyer, S. A. (1999). GENECONV: a computer package for the statistical detection of

gene conversion. Distributed by the author, Department of Mathematics,

Washington University in St. Louis. St. Louis.

Skaletsky, H., Kuroda-Kawaguchi, T., Minx, P. J., Cordum, H. S., Hillier, L., Brown, L. G.,

Repping, S., Pyntikova, T., Ali, J., Bieri, T., Chinwalla, A., Delehaunty, A.,

Delehaunty, K., Du, H., Fewell, G., Fulton, L., Fulton, R., Graves, T., Hou, S.-F., …

Page, D. C. (2003a). The male-specific region of the human Y chromosome is a

mosaic of discrete sequence classes. Nature, 423(6942), 825–837.

Skaletsky, H., Kuroda-Kawaguchi, T., Minx, P. J., Cordum, H. S., Hillier, L., Brown, L. G.,

Repping, S., Pyntikova, T., Ali, J., Bieri, T., Chinwalla, A., Delehaunty, A.,

Delehaunty, K., Du, H., Fewell, G., Fulton, L., Fulton, R., Graves, T., Hou, S.-F., …

Page, D. C. (2003b). The male-specific region of the human Y chromosome is a

mosaic of discrete sequence classes. Nature, 423(6942), 825–837.

Skov, L., & Schierup, M. H. (2017). Analysis of 62 hybrid assembled human Y

chromosomes exposes rapid structural changes and high rates of gene conversion.

PLoS Genetics, 13(8), e1006834.

Smit, A. F. A. (2004). RepeatMasker Open-3.0. http://www.repeatmasker.org.

https://ci.nii.ac.jp/naid/10029514778/

Soh, Y. Q. S., Alföldi, J., Pyntikova, T., Brown, L. G., Graves, T., Minx, P. J., Fulton, R.

S., Kremitzki, C., Koutseva, N., Mueller, J. L., Rozen, S., Hughes, J. F., Owens, E.,

Womack, J. E., Murphy, W. J., Cao, Q., de Jong, P., Warren, W. C., Wilson, R. K.,

… Page, D. C. (2014). Sequencing the mouse Y chromosome reveals convergent

gene acquisition and amplification on both sex chromosomes. Cell, 159(4), 800–

813.

Spicuglia, S., & Vanhille, L. (2012). Chromatin signatures of active enhancers. Nucleus,

3(2), 126–131.

Stanke, M., & Waack, S. (2003). Gene prediction with a hidden Markov model and a new

intron submodel. Bioinformatics, 19 Suppl 2, ii215–ii225.

Tomaszkiewicz, M., Medvedev, P., & Makova, K. D. (2017). Y and W Chromosome

Assemblies: Approaches and Discoveries. Trends in Genetics: TIG, 33(4), 266–282.

Tomaszkiewicz, M., Rangavittal, S., Cechova, M., Campos Sanchez, R., Fescemyer, H.

W., Harris, R., Ye, D., O’Brien, P. C. M., Chikhi, R., Ryder, O. A., Ferguson-Smith,

M. A., Medvedev, P., & Makova, K. D. (2016). A time- and cost-effective strategy to

sequence mammalian Y Chromosomes: an application to the de novo assembly of

gorilla Y. Genome Research, 26(4), 530–540.

Trombetta, B., & Cruciani, F. (2017). Y chromosome palindromes and gene conversion.

Human Genetics, 136(5), 605–619.

Trombetta, B., Cruciani, F., Underhill, P. A., Sellitto, D., & Scozzari, R. (2010). Footprints

of X-to-Y gene conversion in recent human evolution. Molecular Biology and

Evolution, 27(3), 714–725.

Vegesna, R., Tomaszkiewicz, M., Medvedev, P., & Makova, K. D. (2019). Dosage

regulation, and variation in gene expression and copy number of human Y

chromosome ampliconic genes. PLoS Genetics, 15(9), e1008369.

Vegesna, R., Tomaszkiewicz, M., Ryder, O. A., Campos-Sánchez, R., Medvedev, P.,

DeGiorgio, M., & Makova, K. D. (2020). Ampliconic genes on the great ape Y

chromosomes: Rapid evolution of copy number but conservation of expression

levels. In Review.

Wilson Sayres, M. A., Lohmueller, K. E., & Nielsen, R. (2014). Natural selection reduced

diversity on human Y chromosomes. PLoS Genetics, 10(1), e1004064.

Ye, D., Zaidi, A. A., Tomaszkiewicz, M., Anthony, K., Liebowitz, C., DeGiorgio, M.,

Shriver, M. D., & Makova, K. D. (2018). High Levels of Copy Number Variation of

Ampliconic Genes across Major Human Y Haplogroups. Genome Biology and

Evolution, 10(5), 1333–1350.

Chapter 5

Summary

In chapter two, we developed AmpliCoNE, a method to estimate the copy number of ampliconic genes on the Y chromosome from whole-genome sequencing data. Using this method, we examined the relationship between copy number and gene expression of Y ampliconic genes in 149 men. This analysis revealed a high tolerance for variation in ampliconic gene expression in testis and provided evidence for dosage regulation of ampliconic genes that compensates for the variation in their copy number.

In chapter three, as a follow-up to chapter two, we estimated Y ampliconic gene copy number and expression levels in great apes and examined their conservation across species. We observed significant differences in gene copy number between species, whereas the overall expression levels were conserved. Across species, copy number and expression were positively correlated in three gene families, which implies that copy number does influence gene expression over long evolutionary times. As in humans, within ampliconic gene families we did not observe a relationship between copy number and gene expression, which strengthens our observation of dosage regulation of Y ampliconic genes. We observed significant interspecific size differences, sometimes even between sister species (chimpanzee and bonobo). We hypothesize that sperm competition and mating structure, which differ drastically among great apes, might have affected these patterns.

Finally, in chapter four, we reconstructed the evolutionary history of gene content on great ape Y chromosomes to study whether the rate at which the Y chromosome gained or lost genes is constant across great apes or whether there are species-specific increases. We described the conservation of human and chimpanzee palindromes in other great apes and attempted to reconstruct the palindromic structure in the great ape common ancestor. We showed that not only genes but also transcription factors can influence the conservation of palindromes. We also identified gene conversion events between X and Y across great apes, which had previously been shown only in humans and chimpanzees.

Significance and future work

In humans, there are nine ampliconic gene families (BPY2, CDY, DAZ, HSFY, PRY, RBMY, TSPY, VCY, and XKRY), which play an important role in spermatogenesis and are expressed exclusively in testis (Skaletsky et al., 2003). Loss of ampliconic gene families is linked to infertility (Carvalho, Zhang, & Lupski, 2011; Krausz et al., 2011; Navarro-Costa, Plancha, & Gonçalves, 2010; Repping et al., 2002; Vogt, 1996). Recent studies reported high variation in ampliconic gene copy number in healthy men (Lucotte et al., 2018; Skov, Danish Pan Genome Consortium, & Schierup, 2017; Ye et al., 2018). However, very little was known about how differences in ampliconic gene copy number influence their expression. We filled this gap by defining the relationship between Y ampliconic gene expression levels and variation in gene copy number, Y haplogroup, and individual’s age. We also presented evidence of dosage compensation, which aids in understanding how the Y chromosome adapts to dynamic shifts in copy number.

Similar to variation observed in humans, there is known variation in copy number of ampliconic genes across great apes (Hughes et al., 2010; Oetjens, Shen, Emery, Zou, & Kidd, 2016; Repping et al., 2006; Schaller et al., 2010; Tomaszkiewicz et al., 2016). However, we lacked information about the relationship between ampliconic gene copy number and their expression levels (Hughes et al., 2012, 2010; Skaletsky et al., 2003; Tomaszkiewicz et al., 2016). We estimated the ampliconic gene copy number and expression levels in great apes’ Y chromosomes and addressed this question. On the one hand, Y ampliconic genes are linked to spermatogenesis, and on the other hand, there is known variation in mating pattern of great apes (Harrison & Chivers, 2007; Wistuba et al., 2003). We tested whether Y ampliconic genes’ copy number and expression levels are conserved across great apes and demonstrated that gene families such as TSPY and RBMY have species-specific differences which can explain phenotypes linked to mating pattern across great apes. These results shed light on how Y chromosome genes might be involved in achieving different mating patterns among great apes.

As a result of sexual antagonism, ampliconic gene families might have accumulated on the Y chromosome to increase male reproductive fitness (Bellott et al., 2014; Betrán, Demuth, & Williford, 2012; Rozen et al., 2003). However, only five of the nine human ampliconic gene families are found across all great ape species (Hallast & Jobling, 2017). The remaining four gene families are lost or pseudogenized in some species. We annotated the Y chromosome assemblies of great apes to study whether they have acquired new genes, and reconstructed the evolution of gene content on the Y to understand whether there is species-specific loss or gain of genes. We also identified factors that influence the survival of palindrome structures, which can help us better understand the evolution of ampliconic gene families, most of which are located in palindromes on the Y.

Non-human great ape species are confined to two geographic locations: Africa (gorillas, chimpanzees, and bonobos) and several islands of Indonesia (orangutans), and all of them are endangered or critically endangered. There is a need to restore their populations. Ampliconic genes make up the majority of protein-coding genes on the Y, and insights into their role in male fertility should help improve male reproductive fitness, which influences the survival of great ape populations. This dissertation should be viewed as a starting point for learning about the evolution of ampliconic genes and their variation in great apes.

Below is a list of possible future directions. They are to

1. obtain complete Y ampliconic gene sequences from long reads rather than short reads; the latter cannot address small indels and deleterious mutations.

2. look at expression levels of ampliconic orthologs that moved from sex chromosomes onto autosomes and determine the role they play in dosage compensation of gene expression.

3. obtain cell-type specific expression data, as testis as a tissue is a mixture of different cell-types.

4. study expression data at different stages of spermatogenesis with different copies of genes, because ampliconic genes are linked to spermatogenesis.

5. develop tools to study transcription factors in palindromic regions which remain understudied.

6. study Y regulatory sequences which can benefit male fitness, just as the Y has accumulated genes beneficial to males.

Major contributions

1. Presented a method to estimate the copy number of human Y ampliconic genes.

2. Presented a study of Y ampliconic copy number and expression among great apes.

3. Studied the relationship between ampliconic gene expression and copy number in humans and other great apes.

4. Demonstrated conservation of ampliconic genes with respect to sperm competition and mating structure of great apes.

5. Provided information on gene and palindrome content of common ancestor of great apes.

References

Bellott, D. W., Hughes, J. F., Skaletsky, H., Brown, L. G., Pyntikova, T., Cho, T.-J., …

Page, D. C. (2014). Mammalian Y chromosomes retain widely expressed dosage-

sensitive regulators. Nature, 508(7497), 494–499.

Betrán, E., Demuth, J. P., & Williford, A. (2012). Why Chromosome Palindromes?

International Journal of Evolutionary Biology, 2012, 1–14.

Carvalho, C. M. B., Zhang, F., & Lupski, J. R. (2011). Structural variation of the human

genome: mechanisms, assays, and role in male infertility. Systems Biology in

Reproductive Medicine, 57(1-2), 3–16.

Hallast, P., & Jobling, M. A. (2017). The Y chromosomes of the great apes. Human

Genetics, 136(5), 511–528.

Harrison, M. E., & Chivers, D. J. (2007). The orang-utan mating system and the

unflanged male: A product of increased food stress during the late Miocene and

Pliocene? Journal of Human Evolution, 52(3), 275–293.

Hughes, J. F., Skaletsky, H., Brown, L. G., Pyntikova, T., Graves, T., Fulton, R. S., …

Page, D. C. (2012). Strict evolutionary conservation followed rapid gene loss on

human and rhesus Y chromosomes. Nature, 483(7387), 82–86.

Hughes, J. F., Skaletsky, H., Pyntikova, T., Graves, T. A., van Daalen, S. K. M., Minx, P.

J., … Page, D. C. (2010). Chimpanzee and human Y chromosomes are remarkably

divergent in structure and gene content. Nature, 463(7280), 536–539.

Krausz, C., Chianese, C., Giachini, C., Guarducci, E., Laface, I., & Forti, G. (2011). The

Y chromosome-linked copy number variations and male fertility. Journal of

Endocrinological Investigation, 34(5), 376–382.

Lucotte, E. A., Skov, L., Jensen, J. M., Coll Macià, M., Munch, K., & Schierup, M. H.

(2018). Dynamic Copy Number Evolution of X- and Y-Linked Ampliconic Genes in

Human Populations. Genetics. https://doi.org/10.1534/genetics.118.300826

Navarro-Costa, P., Plancha, C. E., & Gonçalves, J. (2010). Genetic Dissection of the

AZF Regions of the Human Y Chromosome: Thriller or Filler for Male (In)fertility?

Journal of Biomedicine & Biotechnology, 2010, 1–18.

Oetjens, M. T., Shen, F., Emery, S. B., Zou, Z., & Kidd, J. M. (2016). Y-Chromosome

Structural Diversity in the Bonobo and Chimpanzee Lineages. Genome Biology and

Evolution, 8(7), 2231–2240.

Repping, S., Skaletsky, H., Lange, J., Silber, S., van der Veen, F., Oates, R. D., …

Rozen, S. (2002). Recombination between Palindromes P5 and P1 on the Human

Y Chromosome Causes Massive Deletions and Spermatogenic Failure. American

Journal of Human Genetics, 71(4), 906–922.

Repping, S., van Daalen, S. K. M., Brown, L. G., Korver, C. M., Lange, J., Marszalek, J.

D., … Rozen, S. (2006). High mutation rates have driven extensive structural

polymorphism among human Y chromosomes. Nature Genetics, 38(4), 463–467.

Rozen, S., Skaletsky, H., Marszalek, J. D., Minx, P. J., Cordum, H. S., Waterston, R. H.,

… Page, D. C. (2003). Abundant gene conversion between arms of palindromes in

human and ape Y chromosomes. Nature, 423(6942), 873–876.

Schaller, F., Fernandes, A. M., Hodler, C., Münch, C., Pasantes, J. J., Rietschel, W., &

Schempp, W. (2010). Y Chromosomal Variation Tracks the Evolution of Mating

Systems in Chimpanzee and Bonobo. PloS One, 5(9), e12482.

Skaletsky, H., Kuroda-Kawaguchi, T., Minx, P. J., Cordum, H. S., Hillier, L., Brown, L. G.,

… Page, D. C. (2003). The male-specific region of the human Y chromosome is a

mosaic of discrete sequence classes. Nature, 423(6942), 825–837.

Skov, L., Danish Pan Genome Consortium, & Schierup, M. H. (2017). Analysis of 62

hybrid assembled human Y chromosomes exposes rapid structural changes and

high rates of gene conversion. PLoS Genetics, 13(8), e1006834.

Tomaszkiewicz, M., Rangavittal, S., Cechova, M., Campos Sanchez, R., Fescemyer, H.

W., Harris, R., … Makova, K. D. (2016). A time- and cost-effective strategy to

sequence mammalian Y Chromosomes: an application to the de novo assembly of

gorilla Y. Genome Research, 26(4), 530–540.

Vogt, P. H. (1996). Human Y Chromosome Function in Male Germ Cell Development. In

Advances in Developmental Biology (1992) (pp. 191–257).

Wistuba, J., Schrod, A., Greve, B., Keith Hodges, J., Aslam, H., Weinbauer, G. F., &

Marc Luetjens, C. (2003). Organization of Seminiferous Epithelium in Primates:

Relationship to Spermatogenic Efficiency, Phylogeny, and Mating System. Biology

of Reproduction, 69(2), 582–591.

Ye, D., Zaidi, A. A., Tomaszkiewicz, M., Anthony, K., Liebowitz, C., DeGiorgio, M., …

Makova, K. D. (2018). High Levels of Copy Number Variation of Ampliconic Genes

across Major Human Y Haplogroups. Genome Biology and Evolution, 10(5), 1333–

1350.

Appendix A

Supporting Material for Chapter 2

Supplemental Tables

Table S1. Copy number counts in the simulated sets. The numbers in the table represent the total number of gene copies for each ampliconic gene family present on the Y chromosome reference used to simulate paired-end FASTQ files (Expected) and the copy number estimates obtained using AmpliCoNE (Observed). The gene families in bold have custom copy numbers in each set.

Gene family   Set 1                 Set 2                 Set 3
              Expected  Observed    Expected  Observed    Expected  Observed
TSPY          22        22.0        29        29.0        23        23.3
RBMY          7         6.99        12        11.9        9         8.94
VCY           4         4.01        2         1.92        3         3.11
BPY2          3         3.02        3         2.98        3         2.92
CDY           4         3.93        4         4.07        4         3.96
DAZ           4         4.01        4         4.02        4         4.00
HSFY          2         2.08        2         2.04        2         1.93
PRY           2         1.96        2         2.01        2         1.98
XKRY          2         1.95        2         1.92        2         1.99

Table S2. Ampliconic gene copy number estimates across technical and biological replicates in three male samples from the GIAB consortium: Ashkenazim Son (HG002), Ashkenazim Father (HG003), and Chinese Father (HG006). The ampliconic gene copy number was estimated for two sequencing runs per individual, which are shown as separate columns in the table.

Gene family   HG002    HG002    HG003    HG003    HG006             HG006
              (Run1)   (Run2)   (Run1)   (Run3)   (141008_D00360)   (141015_D00360)
BPY2          2.93     2.93     2.92     2.74     1.82              1.99
CDY           4.34     3.97     4.08     3.96     3.05              3.14
DAZ           3.86     3.80     3.86     3.87     1.91              1.78
HSFY          2.01     2.24     2.20     2.18     2.28              2.01
PRY           2.02     1.98     1.84     2.02     2.03              2.04
RBMY          5.83     5.87     6.01     5.44     7.00              6.95
TSPY          40.0     40.2     39.7     39.8     19.2              19.4
VCY           1.45     1.23     1.58     1.13     1.38              1.21
XKRY          1.96     1.98     2.16     1.99     1.81              1.77

Table S3. Ampliconic gene copy number estimates across different depths of coverage of the Y chromosome in Ashkenazim Son (HG002 Run1) from the GIAB consortium. The reads were subsampled using the samtools view function.

Gene family   21x     17x     11x     6x
BPY2          2.93    2.92    2.84    2.89
CDY           4.34    4.34    4.24    4.22
DAZ           3.86    3.85    3.89    3.84
HSFY          2.01    1.96    1.96    1.85
PRY           2.02    2.05    2.05    2.11
RBMY          5.83    5.73    5.69    5.70
TSPY          40.0    40.1    39.7    39.7
VCY           1.45    1.40    1.40    1.64
XKRY          1.96    1.86    1.90    1.77

Table S4. ddPCR-based ampliconic gene copy number estimates for four males from the GIAB consortium. Replicates a, b and c are the copy numbers from the three replicate experiments that were performed, and their mean value is used as the final estimate of the copy number. N/A - not available.

Individual ID   Gene   Replicate a   Replicate b   Replicate c   Mean copy number
NA24149         DAZ    3.92          3.92          4.12          3.99
NA24149         PRY    2.05          2.05          2.24          2.11
NA24149         VCY    2.24          2.22          2.24          2.24
NA24149         BPY2   2.99          3.08          2.93          3.00
NA24149         CDY    3.55          3.64          3.84          3.68
NA24149         HSFY   1.93          1.92          1.78          1.88
NA24149         RBMY   6.11          6.81          6.56          6.49
NA24149         TSPY   44.6          44.5          43.7          44.3
NA24149         XKRY   1.98          1.93          1.90          1.93
NA24385         DAZ    3.72          3.90          3.80          3.8
NA24385         PRY    1.94          1.91          1.91          1.92
NA24385         VCY    2.03          2.09          2.06          2.06
NA24385         BPY2   2.87          2.72          2.86          2.82
NA24385         CDY    3.43          3.49          3.49          3.47
NA24385         HSFY   1.88          1.85          1.81          1.85
NA24385         RBMY   6.87          6.53          6.75          6.72
NA24385         TSPY   42.0          N/A           42.7          42.4
NA24385         XKRY   1.73          1.78          1.81          1.77
NA24631         DAZ    2.01          1.97          2.04          2.01
NA24631         PRY    2.05          2.10          2.17          2.11
NA24631         VCY    2.08          2.09          2.10          2.09
NA24631         BPY2   1.85          1.96          1.89          1.9
NA24631         CDY    2.83          2.91          2.81          2.85
NA24631         HSFY   1.92          1.98          1.93          1.95
NA24631         RBMY   7.75          7.95          7.94          7.88
NA24631         TSPY   21.9          22.2          21.7          22.0
NA24631         XKRY   1.96          1.90          1.86          1.91
NA24694         DAZ    1.88          1.83          1.87          1.86
NA24694         PRY    2.01          2.03          2.01          2.02
NA24694         VCY    2.09          2.09          2.06          2.08
NA24694         BPY2   1.92          1.95          1.93          1.94
NA24694         CDY    2.92          2.90          2.90          2.91
NA24694         HSFY   1.97          2.02          1.89          1.96
NA24694         RBMY   8.06          8.01          7.79          7.95
NA24694         TSPY   21.8          22.1          22.0          22.0
NA24694         XKRY   1.83          1.95          1.96          1.91
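
The averaging described in the caption of Table S4 can be sketched as follows. This is an illustration, not the analysis code used in the study: the function name is hypothetical, and skipping N/A replicates before averaging is an assumption (it is consistent with the NA24385 TSPY row, where the reported mean of 42.4 matches the average of the two available replicates). The replicate values are copied from Table S4.

```python
def mean_copy_number(replicates):
    """Average ddPCR replicate estimates, skipping entries marked 'N/A'."""
    values = [float(v) for v in replicates if v != "N/A"]
    if not values:
        return None  # no usable replicate
    return sum(values) / len(values)


# Rows copied from Table S4: (individual, gene, replicates a-c)
rows = [
    ("NA24385", "TSPY", ["42.0", "N/A", "42.7"]),
    ("NA24631", "DAZ", ["2.01", "1.97", "2.04"]),
]

for individual, gene, reps in rows:
    print(individual, gene, f"{mean_copy_number(reps):.2f}")
```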

Table S5. Sample IDs of the 170 GTEx samples used initially, with their status (retained or filtered) after filtering for outliers in the gene expression analysis.

SAMPLE ID STATUS

GTEX-111CU RETAINED

GTEX-111FC RETAINED

GTEX-111VG RETAINED

GTEX-111YS RETAINED

GTEX-117XS RETAINED

GTEX-117YW RETAINED

GTEX-117YX FILTERED

GTEX-1192W FILTERED

GTEX-11DXY FILTERED

GTEX-11DXZ RETAINED

GTEX-11EI6 RETAINED

GTEX-11EQ8 RETAINED

GTEX-11EQ9 RETAINED

GTEX-11GS4 RETAINED

GTEX-11LCK RETAINED

GTEX-11NUK RETAINED

GTEX-11NV4 RETAINED

GTEX-11P7K RETAINED

GTEX-11P82 RETAINED

GTEX-11TT1 RETAINED

GTEX-11TUW RETAINED

GTEX-11UD2 FILTERED

GTEX-11WQC RETAINED

GTEX-11WQK RETAINED

GTEX-11ZUS RETAINED

GTEX-1212Z RETAINED

GTEX-12BJ1 FILTERED

GTEX-12C56 RETAINED

GTEX-12WSH RETAINED

GTEX-12WSI RETAINED

GTEX-12WSL RETAINED

GTEX-12WSM RETAINED

GTEX-12ZZY RETAINED

GTEX-13111 RETAINED

GTEX-13112 RETAINED

GTEX-131XE RETAINED

GTEX-131XF FILTERED

GTEX-132QS RETAINED

GTEX-1399Q RETAINED

GTEX-139TT RETAINED

GTEX-13FHP RETAINED

GTEX-13FLW RETAINED

GTEX-13FTW RETAINED

GTEX-13G51 FILTERED

GTEX-13N2G RETAINED

GTEX-13NYB RETAINED

GTEX-13NZA RETAINED

GTEX-13NZB RETAINED

GTEX-13O1R RETAINED

GTEX-13O21 RETAINED

GTEX-13O61 RETAINED

GTEX-13OVH RETAINED

GTEX-13OVL RETAINED

GTEX-13OW5 RETAINED

GTEX-13OW6 RETAINED

GTEX-13OW8 FILTERED

GTEX-13QJ3 FILTERED

GTEX-13VXU RETAINED

GTEX-145MF FILTERED

GTEX-145MH RETAINED

GTEX-145MO RETAINED

GTEX-14753 RETAINED

GTEX-147F4 RETAINED

GTEX-147JS RETAINED

GTEX-14A6H RETAINED

GTEX-14ABY RETAINED

GTEX-N7MS RETAINED

GTEX-NPJ8 FILTERED

GTEX-O5YT RETAINED

GTEX-OOBK RETAINED

GTEX-P4PQ FILTERED

GTEX-P4QS RETAINED

GTEX-PLZ5 RETAINED

GTEX-PLZ6 FILTERED

GTEX-PW2O RETAINED

GTEX-Q2AH RETAINED

GTEX-Q2AI RETAINED

GTEX-QDVN FILTERED

GTEX-QEG4 RETAINED

GTEX-QEG5 RETAINED

GTEX-QLQ7 RETAINED

GTEX-QLQW RETAINED

GTEX-QMRM RETAINED

GTEX-QV31 RETAINED

GTEX-QV44 RETAINED

GTEX-R55C RETAINED

GTEX-R55D RETAINED

GTEX-R55E RETAINED

GTEX-REY6 RETAINED

GTEX-RM2N RETAINED

GTEX-RN64 RETAINED

GTEX-RUSQ RETAINED

GTEX-RWSA RETAINED

GTEX-S33H RETAINED

GTEX-S3XE RETAINED

GTEX-S4Q7 RETAINED

GTEX-S4Z8 FILTERED

GTEX-S7PM RETAINED

GTEX-S7SE RETAINED

GTEX-S95S RETAINED

GTEX-SIU7 FILTERED

GTEX-SUCS RETAINED

GTEX-T5JC RETAINED

GTEX-T6MN RETAINED

GTEX-T8EM RETAINED

GTEX-TKQ1 RETAINED

GTEX-TKQ2 RETAINED

GTEX-U3ZH RETAINED

GTEX-U3ZM RETAINED

GTEX-U4B1 RETAINED

GTEX-U8T8 RETAINED

GTEX-U8XE RETAINED

GTEX-UPJH RETAINED

GTEX-V1D1 RETAINED

GTEX-V955 RETAINED

GTEX-VJYA RETAINED

GTEX-WFG8 RETAINED

GTEX-WFON RETAINED

GTEX-WH7G RETAINED

GTEX-WHSB RETAINED

GTEX-WHSE RETAINED

GTEX-WK11 RETAINED

GTEX-WOFM RETAINED

GTEX-WVLH RETAINED

GTEX-WY7C FILTERED

GTEX-WYJK RETAINED

GTEX-WZTO RETAINED

GTEX-X261 RETAINED

GTEX-X3Y1 RETAINED

GTEX-X4XX FILTERED

GTEX-X5EB RETAINED

GTEX-XAJ8 RETAINED

GTEX-XBEC RETAINED

GTEX-XBED RETAINED

GTEX-XGQ4 RETAINED

GTEX-XMK1 RETAINED

GTEX-XPT6 RETAINED

GTEX-XPVG RETAINED

GTEX-XQ3S RETAINED

GTEX-Y111 RETAINED

GTEX-Y3I4 RETAINED

GTEX-Y5V6 RETAINED

GTEX-Y8E4 RETAINED

GTEX-Y9LG RETAINED

GTEX-YEC3 RETAINED

GTEX-YEC4 RETAINED

GTEX-YF7O RETAINED

GTEX-YFCO RETAINED

GTEX-YJ89 RETAINED

GTEX-Z93S RETAINED

GTEX-ZA64 RETAINED

GTEX-ZAB4 RETAINED

GTEX-ZAB5 RETAINED

GTEX-ZDTT FILTERED

GTEX-ZDYS RETAINED

GTEX-ZLFU FILTERED

GTEX-ZPU1 RETAINED

GTEX-ZQUD RETAINED

GTEX-ZT9W RETAINED

GTEX-ZT9X RETAINED

GTEX-ZTSS RETAINED

GTEX-ZTX8 RETAINED

GTEX-ZUA1 RETAINED

GTEX-ZV7C RETAINED

GTEX-ZVTK RETAINED

GTEX-ZVZP RETAINED

GTEX-ZY6K FILTERED

GTEX-ZYFC RETAINED

GTEX-ZYT6 RETAINED

GTEX-ZZ64 RETAINED

Table S6. Median, standard deviation (SD) and range of copy number (CN, N=167) and gene expression (GE) values per ampliconic gene family (N=149).

Gene Family CN.Median CN.SD CN.Range GE.Median GE.SD GE.Range

BPY2 3.76 0.69 1.28-8.19 147.91 69.48 39-395

CDY 4.68 0.49 3.05-7.58 259.71 144.42 44-838

DAZ 4.39 0.81 2.05-10.91 1210.02 397.21 485-2791

HSFY 2.29 0.22 1.84-3.06 941.04 381.06 322-2359

PRY 2.16 0.16 1.65-2.8 38.48 21.85 5-122

RBMY 9.46 1.36 4.83-14.42 1017.46 310.5 465-2763

TSPY 35.7 6.34 22.4-65.69 3272.97 915.9 1860-7569

VCY 1.78 0.36 1.02-2.99 1304.96 774.94 547-8431

XKRY 2.21 0.22 1.58-2.82 2.18 2.08 0-9

Table S7. Copy number and gene expression correlation values. Spearman correlation coefficient values (r) and P-values for each gene family were calculated using the cor.test() function in R. The P-value cutoff after Bonferroni correction for nine tests is 0.05/9 ≈ 0.006.

All samples R1b (European) E1b (African) I1a (European)

Gene Family r P-value r P-value r P-value r P-value

BPY2 0.05 0.55 0.00 0.979 0.07 0.74 0.12 0.658

CDY -0.02 0.839 0.04 0.712 -0.21 0.34 -0.17 0.532

DAZ 0.08 0.351 0.15 0.195 0.15 0.51 -0.09 0.763

HSFY 0.10 0.212 0.03 0.821 0.43 0.05 0.16 0.558

PRY 0.03 0.702 0.04 0.707 -0.06 0.79 -0.25 0.375

RBMY 0.10 0.231 -0.12 0.298 -0.22 0.32 0.27 0.327

TSPY 0.19 0.022 0.00 0.991 -0.27 0.22 0.14 0.611

VCY -0.06 0.451 -0.19 0.102 0.23 0.30 0.24 0.397

XKRY -0.07 0.374 -0.11 0.349 0.07 0.77 -0.53 0.040
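The correlation and multiple-testing logic behind Table S7 can be illustrated with a minimal sketch. It is written in plain Python rather than R, so it is not the cor.test() call used for the table. It computes only the Spearman coefficient (no P-value), and the helper names `average_ranks`, `spearman_r`, and `bonferroni_cutoff` are hypothetical. The P-values in the usage lines are copied from the "All samples" column above.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_r(x, y):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def bonferroni_cutoff(alpha, n_tests):
    """Per-test significance cutoff after Bonferroni correction."""
    return alpha / n_tests

# P-values from the "All samples" column of Table S7.
all_samples_p = {"BPY2": 0.55, "CDY": 0.839, "DAZ": 0.351, "HSFY": 0.212,
                 "PRY": 0.702, "RBMY": 0.231, "TSPY": 0.022, "VCY": 0.451,
                 "XKRY": 0.374}
cutoff = bonferroni_cutoff(0.05, 9)
passing = [g for g, p in all_samples_p.items() if p < cutoff]
```

With these inputs `passing` is empty, which matches the absence of significant correlations across all samples.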

Table S8. P-values from permutation tests for copy number differences between haplogroup pairs. To test whether the difference in copy number between two haplogroups is significant, we compared the observed difference in mean copy number between the haplogroups to the differences in means from 1 million random permutations (in which the haplogroup assignments of the two haplogroups were randomly rearranged). The P-value represents the fraction of permuted mean differences that are larger than the one observed in our actual data. P-values that pass a Bonferroni-corrected cutoff for 54 tests (0.05/54 = 0.00093) are highlighted in bold.

Copy Number E vs. I E vs. J E vs. R I vs. R J vs. I J vs. R

BPY2 4.44x10-03 2.69x10-02 1.70x10-03 0.771 0.138 0.444

CDY 1.12x10-03 0.533 1.74x10-02 0.119 1.05x10-02 0.314

DAZ 5.04x10-02 0.155 5.30x10-03 0.481 0.304 0.839

HSFY 0.866 0.476 0.697 0.839 0.393 0.292

PRY 5.36x10-02 0.288 2.01x10-02 0.873 0.686 0.730

RBMY 0.401 3.53x10-03 6.94x10-04 8.31x10-02 3.29x10-03 0.00

TSPY 0.00 0.440 0.00 3.47x10-02 3.00x10-06 0.00

VCY 0.638 0.456 0.536 0.989 0.334 0.206

XKRY 0.301 0.483 0.463 0.595 0.113 0.187
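The permutation procedure described in the caption of Table S8 can be sketched as follows. This is an illustration under stated assumptions, not the code used to generate the table: the function name `permutation_p_value` is hypothetical, the test is taken to be two-sided (absolute difference in means), ties with the observed difference are counted as at least as extreme, and the default number of permutations is far smaller than the 1 million used in the analysis.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Fraction of label permutations whose absolute difference in means
    is at least as large as the observed one (assumed two-sided test)."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(n_permutations):
        # Shuffling the pooled values reassigns group labels at random.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_permutations
```

A call such as `permutation_p_value(cn_haplogroup_E, cn_haplogroup_I)` (hypothetical lists of copy numbers) returns a value that would then be compared with the Bonferroni-corrected cutoff of 0.05/54.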

Table S9. P-values from permutation tests for gene expression differences between haplogroup pairs. To test whether the difference in gene expression between two haplogroups is significant, we compared the observed difference in mean gene expression between the haplogroups to the differences in means from 1 million random permutations (in which the haplogroup assignments were randomly rearranged). The P-value represents the fraction of permuted mean differences that are larger than the one observed in our actual data. None of the P-values pass a Bonferroni-corrected cutoff for 54 tests (0.05/54 = 0.00093).

Gene Expression E vs. I E vs. J E vs. R I vs. R J vs. I J vs. R

BPY2 3.12x10-02 0.148 7.09x10-03 0.759 0.963 0.798

CDY 0.517 0.416 0.124 0.522 0.755 0.934

DAZ 0.715 0.534 1.36x10-02 3.04x10-02 0.829 0.171

HSFY 0.329 0.233 9.84x10-02 0.592 0.500 0.716

PRY 0.309 0.749 0.116 0.755 0.724 0.521

RBMY 0.341 0.451 0.114 0.995 0.269 7.65x10-02

TSPY 0.307 0.177 0.825 0.203 6.10x10-02 7.08x10-02

VCY 0.542 0.275 0.234 0.406 0.922 0.629

XKRY 0.077 0.906 0.353 0.192 0.141 0.571

Table S10. Correlation between gene expression and age. Spearman correlation coefficient values (r) and P-values for each gene family were calculated using the cor.test() function in R. P-values that pass a Bonferroni-corrected cutoff for nine tests (0.05/9 ≈ 0.006) are highlighted in bold.

All samples R1b (European) E1b (African) I1a (European)

Gene Family r P-value r P-value r P-value r P-value

BPY2 -0.05 0.522 -0.19 9.02x10-02 0.52 1.31x10-02 0.07 0.815

CDY 0.01 0.901 -0.1 0.399 0.45 3.6x10-02 0.23 0.400

DAZ -0.18 2.63x10-02 -0.11 0.327 -0.35 0.107 -0.38 0.157

HSFY 0.09 0.282 -0.03 0.817 0.57 6.10x10-3 0.09 0.761

PRY 0.05 0.553 -0.12 0.292 0.61 2.80x10-3 0.11 0.703

RBMY -0.1 0.226 0.03 0.825 -0.06 0.789 -0.30 0.282

TSPY -0.06 0.486 0.02 0.840 0.03 0.904 -0.11 0.703

VCY 0.06 0.490 0.12 0.278 0.17 0.441 -0.44 0.105

XKRY 0.16 4.54x10-02 0.16 0.172 0.21 0.343 0.19 0.486

Table S11. Ampliconic gene homologs and their expression patterns. The homologs of ampliconic genes were obtained from a recent review [1]. The expression pattern was obtained from the Human Protein Atlas (HPA) [2]. Tissue-enriched: expression in one tissue is at least five-fold higher than that in all other tissues/cell lines. Tissue-enhanced: five-fold higher average transcripts per million (TPM) in one or more tissues/cell lines compared to the mean TPM of all tissues/cell lines. Ubiquitous: ≥ 1 TPM in all tissues/cell lines. Mixed: detected in at least one tissue/cell line and in none of the above categories.

Ampliconic gene family | Homolog | Chromosome | Homolog expression status
CDY | CDYL | chr6 | Ubiquitous
CDY | CDYL2 | chr16 | Mixed
DAZ | DAZL | chr3 | Testis-enriched
DAZ | BOLL | chr2 | Testis-enriched
HSFY | HSFX1 | chrX | Mixed
HSFY | HSFX2 | chrX | Testis-enriched
RBMY | RBMX | chrX | Ubiquitous
RBMY | RBMX2 | chrX | Ubiquitous
RBMY | RBMXL1 | chr1 | Ubiquitous
RBMY | RBMXL2 | chr11 | Testis-enriched
RBMY | RBMXL3 | chrX | Testis-enriched
TSPY | TSPYL1 | chr6 | Ubiquitous
TSPY | TSPYL2 | chrX | Ubiquitous
TSPY | TSPYL4 | chr6 | Ubiquitous
TSPY | TSPYL5 | chr8 | Testis-enhanced
TSPY | TSPYL6 | chr2 | Testis-enriched
VCY | VCX | chrX | Testis-enriched
VCY | VCX2 | chrX | Testis-enriched
VCY | VCX3A | chrX | Testis-enriched
VCY | VCX3B | chrX | Testis-enriched
XKRY | XKRX | chrX | Skin-enhanced

Expression pattern adapted from the HPA database [2].


Table S12. Predicted function of ampliconic gene families. The table was adapted from Table 1 of Paulo Navarro-Costa’s review on ampliconic gene families [1]. All the gene families are linked to spermatogenesis and infertility.

Gene symbol | Name | Function
BPY2 | Basic charge, Y-linked, 2 | Proposed role in cytoskeletal regulation.
CDY | Chromodomain protein, Y-linked | Transcriptional co-repressor with histone acetyltransferase activity. Postulated role in gene expression regulation and chromatin remodeling.
DAZ | Deleted in azoospermia | RNA-binding protein, regulates pre-meiotic transcript transport/storage and translation initiation.
HSFY | Heat shock transcription factor, Y-linked | Testis-predominant transcription factor.
PRY | PTPN13-like, Y-linked | Putative pro-apoptotic signaling molecule.
RBMY1 | RNA binding motif protein, Y-linked, family 1 | RNA-binding protein involved in RNA splicing and metabolism, signal transduction and meiotic regulation.
TSPY | Testis specific protein, Y-linked | Presumable androgen-dependent regulator of early germ cell development. Putative functions in cell cycle regulation.
VCY | Variable charge, Y-linked | Unknown, possible nucleoli-related regulator of ribosomal assembly. Involved in the cytoskeletal network [3]. Possible relationship in the pathogenesis of male infertility [4].
XKRY | XK, Kell blood group complex subunit related, Y-linked | Putative multipass transmembrane protein involved in gamete interaction.

Table S13. Numbers of high identity alignments to the references for each Y ampliconic gene family. BLAT was used to find all sites with >99% identity. For TSPY there are three pseudogene copies in the reference which share identity with parts of functional copies of TSPY genes.

Gene family Copies in the reference

BPY2 3

CDY 4

DAZ 4

HSFY 2

PRY 2

RBMY 6

TSPY 9 (6+3 pseudo)

VCY 2

XKRY 2

Supplemental Figures

Figure S1. Variation in copy number of ampliconic gene families. In the dotplot, the X-axis represents the natural log of the median copy number and the Y-axis represents the natural log of the variance in copy number for the 167 individuals analyzed. The blue line represents the linear regression fit (median ~ var) with an R2 value of 0.91. Each dot is colored by ampliconic gene family, as shown in the legend.

Figure S2. Variation in gene expression of ampliconic gene families. In the dotplot, the X-axis represents the natural log of the median normalized gene expression values and the Y-axis represents the natural log of the variance in gene expression for the 149 individuals analyzed. The blue line represents the linear regression fit (median ~ var) with an R2 value of 0.99. Each dot is colored by ampliconic gene family, as shown in the legend.

Figure S3. The relationship between expression level and copy number (N=149). Within each scatter plot, the X-axis represents the copy number values and the Y-axis represents the normalized gene expression values. The Spearman correlations were calculated using the cor.test() function in R and the P-values are shown in brackets. The gray line represents the linear function fitted to the given data points. The nine scatter plots represent the relationship between expression and copy number for each of the nine ampliconic gene families. There is no significant relationship in any of the nine gene families (Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006).

Figure S4. The relationship between gene expression and copy number for individuals with an R1b (European) subhaplogroup (N=77). Within each scatter plot, the X-axis represents the copy number values and the Y-axis represents the normalized gene expression values. The Spearman correlations were calculated using the cor.test() function in R and the P-values are shown in brackets. The gray line represents the linear function fitted to the given data points. The nine scatter plots represent the relationship between expression and copy number of the ampliconic gene families. There is no significant relationship in any of the nine gene families (Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006).

Figure S5. The relationship between gene expression and copy number for individuals with an I1a (European) subhaplogroup (N=15). Within each scatter plot, the X-axis represents the copy number values and the Y-axis represents the normalized gene expression values. The Spearman correlations were calculated using the cor.test() function in R and the P-values are shown in brackets. The gray line represents the linear function fitted to the given data points. The nine scatter plots represent the relationship between expression and copy number of the ampliconic gene families. There is no significant relationship in any of the nine gene families (Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006).

Figure S6. The relationship between gene expression and copy number for individuals with an E1b (African) subhaplogroup (N=22). Within each scatter plot, the X-axis represents the copy number values and the Y-axis represents the normalized gene expression values. The Spearman correlations were calculated using the cor.test() function in R and the P-values are shown in brackets. The gray line represents the linear function fitted to the given data points. The nine scatter plots represent the relationship between expression and copy number of the ampliconic gene families. There is no significant relationship in any of the nine gene families (Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006).

Figure S7. The relationship between gene expression and age in the individuals analyzed (N=149). The nine scatter plots represent the nine ampliconic gene families, with the family name as the title of each plot. Within each scatter plot, the Y-axis represents the age and the X-axis represents the gene expression values. The Spearman correlations were calculated using the cor.test() function in R and the P-values are shown in brackets. There is no significant relationship between age and expression in any of the nine families (Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006). The gray line represents the linear function fitted to the points in the plot.

Figure S8. The relationship between gene expression and age for individuals with an R1b (European) subhaplogroup (N=77). The nine scatter plots represent the nine ampliconic gene families, with the family name as the title of each plot. Within each scatter plot, the Y-axis represents the age and the X-axis represents the gene expression values. The Spearman correlations were calculated using the cor.test() function in R and the P-values are shown in brackets. There is no significant relationship between age and expression in any of the nine families (Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006). The gray line represents the linear function fitted to the points in the plot.

Figure S9. The relationship between gene expression and age for individuals with an I1a (European) haplogroup (N=15). The nine scatter plots represent the nine ampliconic gene families, with the family name as the title of each plot. Within each scatter plot, the Y-axis represents the age and the X-axis represents the gene expression values. The Spearman correlations were calculated using the cor.test() function in R and the P-values are shown in brackets. There is no significant relationship between age and expression in any of the nine families (Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006). The gray line represents the linear function fitted to the points in the plot.

Figure S10. The relationship between gene expression and age for individuals with an E1b (African) haplogroup (N=22). The nine scatter plots represent the nine ampliconic gene families, with the family name as the title of each plot. Within each scatter plot, the Y-axis represents the age and the X-axis represents the gene expression values. The Spearman correlations were calculated using the cor.test() function in R and the P-values are shown in brackets. There is a significant relationship between age and expression in the HSFY and PRY families (Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006). The gray line represents the linear function fitted to the points in the plot.

Figure S11. Combining expression level differences with the individual-level relationship between Y ampliconic gene families and their non-Y homologs can better explain the possible evolutionary scenarios for the former. Within each row (A-D), the plot on the left represents the expression level differences between Y ampliconic genes (blue boxplot) and their non-Y homologs (orange boxplot), the plot in the middle represents the individual-level relationship between Y ampliconic genes (X-axis) and their non-Y homologs (Y-axis), and the plot on the right shows the expected evolutionary scenario. Assuming non-Y homologs represent ancestral expression levels, higher expression of Y ampliconic genes implies independent expression (A,B) and lower expression implies dosage regulation (C,D). A negative correlation between ampliconic genes and their non-Y homologs suggests a lack of co-regulation (B,D), and a positive correlation suggests co-regulation of gene expression (A,C).

Figure S12. PCA plot of all 170 samples using Variance Stabilizing Transformation (VST) normalized read counts. All points with a PC1 value (X-axis) greater than 20 were filtered out.

References

1. Navarro-Costa P. Sex, rebellion and decadence: the scandalous evolutionary history of the human Y chromosome. Biochim Biophys Acta. 2012 Dec;1822(12):1851–63.

2. Uhlén M, Fagerberg L, Hallström BM, Lindskog C, Oksvold P, Mardinoglu A, et al. Proteomics. Tissue-based map of the human proteome. Science. 2015 Jan 23;347(6220):1260419.

3. Wong EYM, Tse JYM, Yao K-M, Lui VCH, Tam P-C, Yeung WSB. Identification and characterization of human VCY2-interacting protein: VCY2IP-1, a microtubule-associated protein-like protein. Biol Reprod. 2004 Mar;70(3):775–84.

4. Tse JYM, Wong EYM, Cheung ANY, O WS, Tam PC, Yeung WSB. Specific expression of VCY2 in human male germ cells and its involvement in the pathogenesis of male infertility. Biol Reprod. 2003 Sep;69(3):746–51.

Appendix B

Supporting Material for Chapter 3

Supplemental Note 1. CAFE Simulations

Simulations to test gene family size. The great ape dataset includes copy number estimates of nine gene families from six species. We tested whether nine gene families (n=9) were sufficient to predict the rate of gene birth and death, because uncertainty in the rate parameter could influence p-value estimates generated by CAFE (Han, Thomas, Lugo-Martinez, & Hahn, 2013). To test this, we used the gene family copy numbers and the phylogenetic tree from the original CAFE article (Hahn, Demuth, & Han, 2007). However, we only used primate-specific data (human, chimpanzee, and macaque) in our simulations. Our goal was to test the reproducibility of birth and death rate (λ) predictions by CAFE. First, we fixed the phylogenetic tree representing the three primate species to be used in all simulations as (((Chimp:6,Human:6):18,Macaque:24)), and then applied different combinations of gene family copy numbers as input to estimate λ. As a filtering step, we removed all the gene families that had >200 gene copies cumulatively across the three species. This filtering step was used to remove excessively large gene families and retain gene families whose copy numbers are in the range observed for Y chromosome ampliconic gene families. Next, we removed all gene families that had the same gene count in all species to ensure there was variation across the species. These filtering steps reduced the gene family count from ~9,800 to 2,445. From this set we picked 30 gene families uniformly at random, and applied CAFE to this set to estimate λ. We repeated this step 20 times (Table SN1). Next, from each set of 30 gene families, we subsampled 5, 10, and 15 gene families uniformly at random and applied CAFE to each subsample to estimate λ. We observed that with different sizes and combinations of gene families CAFE predicted different values for λ (Table SN1). Next, we picked 100 gene families with the highest variation in copy number across species and performed the same analyses as above. For the input sizes of 5, 10, 15, and 30 considered, the estimated rate of gene birth and death was identical in all but one replicate (λ=0.041667; Table SN2). This result suggests that CAFE requires gene families with high variation in their gene count across species to predict a consistent λ value, and the majority of the ampliconic genes considered here have high variation in gene count across great ape species, thereby making CAFE an ideal approach to study their evolution.
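The filtering and subsampling steps described above can be sketched as follows. This is a minimal illustration, not the scripts used in the analysis: the function names, the dictionary layout (family name mapped to a tuple of per-species counts), and the toy data are all assumptions, and the CAFE run on each selected set is deliberately left out.

```python
import random

def filter_gene_families(counts_by_family, max_total=200):
    """Drop families with more than max_total copies summed over species,
    and families whose count is identical in every species."""
    kept = {}
    for family, counts in counts_by_family.items():
        if sum(counts) > max_total:
            continue  # excessively large family
        if len(set(counts)) == 1:
            continue  # no variation across species
        kept[family] = counts
    return kept

def draw_replicate(families, n_full=30, subsample_sizes=(5, 10, 15), rng=None):
    """Pick n_full families uniformly at random, then subsample smaller
    sets from that same selection. Returns {set size: family names}."""
    rng = rng or random.Random()
    full = rng.sample(sorted(families), n_full)
    replicate = {n_full: full}
    for size in subsample_sizes:
        replicate[size] = rng.sample(full, size)
    return replicate

# Hypothetical counts in (chimpanzee, human, macaque) order.
toy = {"famA": (3, 4, 5), "famB": (2, 2, 2), "famC": (150, 40, 30), "famD": (1, 0, 2)}
kept = filter_gene_families(toy)
```

Here `kept` retains only `famA` and `famD`: `famB` has no variation across species and `famC` exceeds 200 copies in total. Each set returned by `draw_replicate` would then be passed to CAFE to estimate λ.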

Simulations to include multiple samples per species in CAFE analyses. CAFE typically takes the mean/median value as a representation of copy number of gene families in a species. However, we observed high intraspecific variation in copy number of gene families within our dataset, and wished to leverage this information in our predictions of gene family copy number evolution. We therefore tested whether CAFE could reproduce the significant differences observed in great ape ampliconic gene families when we provide it with multiple individuals per species instead of the default application of a single value (mean or median copy number) per gene family as input. With the presence of multiple samples per species, we assumed that CAFE might take the variation in copy number into consideration while estimating p-values. Because the exact phylogenetic relationship among the samples is unknown in terms of the time since their common ancestor, we estimated the time since the most recent common ancestor (TMRCA) for each species using an external data set (Hallast et al., 2016). We followed the same steps used in the generation of the phylogenetic tree of great apes (see Materials and Methods), except that here we calculated the TMRCA for all the combinations of samples available, and took the median TMRCA value within a species as the representative TMRCA for the species.

For each species, we added five branches at the tips (i.e. approximately a star phylogeny) of the great ape phylogenetic tree, such that each species was represented by five individuals (Supplementary Figure SA). The lengths of these five branches are represented by the TMRCA of lineages within each species, with a difference of one thousand years between each internal node. That is, the last split for each individual within a species was within four thousand years since the TMRCA of all five individuals, and this mimics a star phylogeny as the TMRCA ranging from 75 to 550 thousand years is much larger than four thousand years. From our dataset, we picked five individuals per species uniformly at random and provided their ampliconic gene family copy numbers as input to CAFE along with the updated phylogenetic trees (with star phylogenies at the tips) to test for significant gains or losses in copy number. This procedure was repeated 100 times. In these simulations, for each gene family we looked for significant shifts (gains or losses) in gene copy number at each branch along the phylogenetic tree, which is indicated by a p-value (Bonferroni-corrected p-value < 0.005; 0.05/10 by correcting for the 10 external and ancestral branches present on the phylogenetic tree) and compared them to the original run (Table S4) of CAFE. All significant observations identified in the analysis based on median copy number were also significant when we used multiple samples per species. However, within the set of 100 simulated replicates, the frequency in which each branch has a significant shift was not the same. On the ancestral orangutan branch, the CDY gene family displayed a significant shift in all 100 simulations. On the bonobo and chimpanzee branches, the RBMY gene family showed a significant shift in 100 and 83 simulations, respectively. The TSPY gene family had significant shifts on multiple branches, with 99 simulations showing significance in bonobo and chimpanzee, 84 in gorilla, 57 in Bornean orangutan, and 39 in Sumatran orangutan. Finally, the XKRY gene family had 55 simulations with a shift in the ancestral branch of orangutans and 68 with a shift in Sumatran orangutan. The XKRY-specific shift in Bornean orangutan was significant only 12 times out of 100 simulations.
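The resampling and tallying around the CAFE runs can be sketched as follows. This is a sketch under stated assumptions: `sample_individuals` and `tally_significant` are hypothetical helper names, the per-replicate p-values are assumed to have been parsed from CAFE output into dictionaries keyed by (gene family, branch), and the CAFE invocation itself is not shown.

```python
import random
from collections import Counter

# Cutoff from the text: 0.05 corrected for the 10 branches tested.
BONFERRONI_CUTOFF = 0.05 / 10

def sample_individuals(individuals_by_species, n=5, rng=None):
    """Pick n individuals per species uniformly at random, without replacement."""
    rng = rng or random.Random()
    return {species: rng.sample(individuals, n)
            for species, individuals in individuals_by_species.items()}

def tally_significant(replicate_p_values, cutoff=BONFERRONI_CUTOFF):
    """Count, per (gene family, branch), the replicates with p below cutoff."""
    tally = Counter()
    for p_values in replicate_p_values:
        for key, p in p_values.items():
            if p < cutoff:
                tally[key] += 1
    return tally

# Three hypothetical replicates with made-up p-values.
replicates = [
    {("TSPY", "Gorilla"): 0.001, ("XKRY", "Bornean orangutan"): 0.2},
    {("TSPY", "Gorilla"): 0.004, ("XKRY", "Bornean orangutan"): 0.003},
    {("TSPY", "Gorilla"): 0.01, ("XKRY", "Bornean orangutan"): 0.5},
]
counts = tally_significant(replicates)
```

In this toy input the TSPY shift in gorilla is counted in two replicates and the XKRY shift in Bornean orangutan in one. The same tally over 100 real replicates gives the per-branch frequencies reported above.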

Table SN1. λ values based on randomly selected gene families. Each column indicates the number of genes used to estimate the rate of gene birth or death and each row represents a different replicate.

N=5 N=10 N=15 N=30

0.011158 0.041667 0.034038 0.017218

0.035354 0.035568 0.022321 0.013378

0.003766 0.010327 0.009695 0.011261

0.007554 0.027215 0.0206 0.017613

0.041667 0.024575 0.017532 0.012252

0.007613 0.00628 0.008843 0.010396

0.006239 0.006899 0.011568 0.010448

0.026245 0.013568 0.041667 0.038559

0.041667 0.041667 0.041667 0.023911

0.004424 0.005405 0.006028 0.009473

0.013036 0.009137 0.012186 0.011632

0.009044 0.010184 0.00975 0.010824

0.004717 0.007606 0.008536 0.023141

0.041667 0.041667 0.032196 0.018089

0.01187 0.011803 0.010866 0.012213

0.004724 0.005352 0.008868 0.009709

0.005236 0.014412 0.010689 0.017878

0.041667 0.041667 0.036225 0.038344

0.004755 0.01066 0.012642 0.009939

0.017927 0.018403 0.020957 0.016235

Table SN2. λ values based on gene families with high copy number variation across species. Each column indicates the number of genes used to estimate the rate of gene birth or death and each row represents a different replicate.

N=5 N=10 N=15 N=30

0.031804 0.041667 0.041667 0.041667

0.041667 0.041667 0.041667 0.041667

0.041667 0.041667 0.041667 0.041667

0.041667 0.041667 0.041667 0.041667

0.041667 0.041667 0.041667 0.041667

0.041667 0.041667 0.041667 0.041667

0.041667 0.041667 0.041667 0.041667

0.041667 0.041667 0.041667 0.041667

0.041667 0.041667 0.041667 0.041667

0.041667 0.041667 0.041667 0.041667

Supplemental Figure SA. The great ape phylogenetic tree with five individuals sampled per species (star phylogeny) used in CAFE analysis. Significant shifts were assessed for each of the internal edges numbered 1-10 (edges from the original phylogenetic tree), whereas shifts were not examined for external edges representing the star phylogenies relating sampled individuals within a species.

Supplemental Note 2. EVE Simulations

Because of the difficulty in obtaining testis samples to generate expression data, we had a small sample size for each species except for humans. Currently, the EVE model (R. V. Rohlfs & Nielsen, 2015) cannot handle missing values, and so we limited our analysis to the five gene families that are found in all great ape species. In our dataset, we have two to three individuals sampled per species, except for humans, for which we had testis expression measured for more than one hundred sampled individuals. To ensure that sample sizes were comparable across the sampled great apes, we used three (the maximum sample size in non-human species) human testis expression samples that were sampled uniformly at random from the set with African Y-specific haplogroup E from the GTEx dataset (Ardlie et al., 2015). We chose samples with African ancestry as they are expected to have high genetic variation in humans. In total, we had 12 testis samples (three human, two bonobo, three chimpanzee, two gorilla, and two B. orangutan). We tested whether the EVE model performs well with five gene families and two or three individuals sampled per species.

The built-in function -H in the EVE model simulates expression values when provided with a phylogenetic tree, the number of individuals sampled per species, along with the parameter values for selection strength (α), drift (σ2), ratio of within- to between-species variance (β), and optimal expression value (θ). The within-species variance (τ2), or population variance, is represented by $\tau^2 = \beta\,\sigma^2 / (2\alpha)$ (R. V. Rohlfs, Harrigan, & Nielsen, 2014). For the simulations, we matched sample sizes to empirical sample sizes (3, 2, 3, 2, and 2 for human, bonobo, chimpanzee, gorilla, and B. orangutan, respectively), employed the great ape phylogenetic tree generated using whole-genome sequencing data (Locke et al., 2011), and considered the parameter values from the simulation studies presented in the EVE model article (R. Rohlfs & Nielsen, 2014) (σ2=5, α=3.0, β=6, and θ=100).

Using -H mode in the EVE model, we simulated gene expression values by varying one of the four parameter values (σ2=5, α=3.0, τ2=6, and θ=100) from 0.5 to 60 (0.5, 1, 3, 5, 10, 15, 20, 40, 60) at a time while keeping the other three parameters, sample size, and phylogenetic tree fixed. Based on the combination of parameters, the EVE model outputs an expression value matrix of size m × n, where m is the number of sampled expressed genes (m=5) and n is the total number of individuals sampled across all species together (n=12). Using this matrix as input, it tests whether βi=βshared for each simulated gene i, i=1,2,...,m. For each combination of parameters, we simulated gene expression values for 100 independent replicates. Because the parameters for the model were predefined and we did not introduce external variation, we expect the EVE model to predict that all of the simulated gene expression values are conserved across species and that the p-values for the tests of whether βi=βshared should be non-significant (i.e. p>0.05). If the number of genes and sample sizes are insufficient, then we expect the EVE model to predict that expression of some or all genes is significantly different from the shared expression. Using the significance cutoff of 0.05, we summarized the number of times a gene was differentially expressed across great apes in the 100 replicates of each combination of parameters. We plotted the percentage of replicates that have their p-value above the threshold of 0.05 as a heatmap (Figures SB-SD). We observed that 95% of replicates had a non-significant p-value; the remaining 5% of replicates could result from random sampling of expression values from a multivariate normal distribution by the EVE model. Based on our simulation results, we are confident that for our sample sizes, the EVE model can predict the difference in gene expression variance correctly 95 out of 100 times.
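The one-parameter-at-a-time design and the summary statistic plotted in the heatmaps can be sketched as follows. This is an illustration only: the dictionary keys, `one_at_a_time_grid`, and `percent_non_significant` are hypothetical names, the baseline values are those listed in the text, and the EVE simulation and test that would produce the p-values are not shown.

```python
# Baseline parameter values and sweep values as listed in the text.
BASELINE = {"sigma2": 5.0, "alpha": 3.0, "tau2": 6.0, "theta": 100.0}
SWEEP_VALUES = (0.5, 1, 3, 5, 10, 15, 20, 40, 60)

def one_at_a_time_grid(baseline, varied, values):
    """Return one parameter set per value, changing only `varied`."""
    return [{**baseline, varied: value} for value in values]

def percent_non_significant(p_values, cutoff=0.05):
    """Percentage of replicates whose p-value is above the cutoff."""
    return 100.0 * sum(p > cutoff for p in p_values) / len(p_values)

# Nine parameter sets in which only alpha changes.
alpha_grid = one_at_a_time_grid(BASELINE, "alpha", SWEEP_VALUES)

# Hypothetical p-values for 100 replicates of one gene at one setting.
example = percent_non_significant([0.5] * 95 + [0.01] * 5)
```

For the hypothetical list of 95 non-significant and 5 significant p-values, `example` is 95.0, the kind of value shown in each heatmap cell.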

Supplemental Figure SB. Proportion of replicates that have their p-value above the threshold of 0.05 when the selection (α) parameter varied from 0.5 to 60 while other parameters were fixed (σ2=5, β=6, and θ=100).

Supplemental Figure SC. Proportion of replicates that have their p-value above the threshold of 0.05 when the drift (σ2) parameter varied from 0.5 to 60 while other parameters were fixed (α=3.0, β=6, and θ=100).

Supplemental Figure SD. Proportion of replicates that have their p-value above the threshold of 0.05 when the variation within species parameter (τ2) varied from 0.5 to 60 while other parameters were fixed (σ2=5, α=3.0, and θ=100).

Supplemental Tables

Table S1. Summary of ampliconic gene copy numbers across great apes. The median, variance, and range of each individual gene family and all families together in each species studied.

Gene Family Bonobo.Median Bonobo.Variance Bonobo.Min Bonobo.Max Chimpanzee.Median Chimpanzee.Variance Chimpanzee.Min Chimpanzee.Max

BPY2 1.06 0.27 0.84 2.15 2.05 0.02 1.84 2.3

CDY 2.99 0.06 2.49 3.25 5.28 0.1 4.59 5.52

DAZ 2.03 0.04 1.65 2.19 4.3 0.1 3.46 4.44

HSFY 0 0 0 0 0 0 0 0

PRY 0 0 0 0 0 0 0 0

RBMY 28.59 11.3 22.53 31.52 10.86 3.27 6.25 12.18

TSPY 48.07 31.98 38.09 52.19 17.62 25.5 11.54 25.45

VCY 0 0 0 0 2.07 0.01 1.88 2.25

XKRY 0 0 0 0 0 0 0 0

Species 82.77 87.48 66.39 89.78 40.93 42.5 29.64 50.01

Gene Family Human.Median Human.Variance Human.Min Human.Max Gorilla.Median Gorilla.Variance Gorilla.Min Gorilla.Max

BPY2 3.26 0.4 1.68 3.67 2.01 0.01 1.95 2.17

CDY 4.06 0.4 2.87 4.63 7.86 3.57 5.91 12.17

DAZ 4.37 0.58 2.78 5.27 2.02 0 1.96 2.17

HSFY 2.07 0.07 1.58 2.29 5.95 2.34 3.97 8.19

PRY 2.13 0.1 1.59 2.78 2.06 0 1.96 2.13

RBMY 9.5 3.93 4.95 11.03 14.16 2.15 12.03 18.06

TSPY 33.04 12.54 26.03 39.09 6.07 1.16 5.84 8.24

VCY 2.11 0.15 1.67 3.17 0 0 0 0

XKRY 1.98 0.06 1.53 2.21 2 0 1.95 2.09

Species 64.32 33.74 50.37 69.34 44.11 13.03 37.82 48.96

Gene Family B.Orangutan.Median B.Orangutan.Variance B.Orangutan.Min B.Orangutan.Max S.Orangutan.Median S.Orangutan.Variance S.Orangutan.Min S.Orangutan.Max

BPY2 14.83 13.3 12.77 21.54 13.64 29.29 12.43 25.33

CDY 34.99 66.68 30.04 49.64 35.75 6.69 33.16 39.58

DAZ 10.52 1.36 9.69 12.59 12.39 0.85 11.55 13.84

HSFY 4.86 0.49 4.36 6.21 5.49 0.36 4.95 6.47

PRY 8.46 1.79 6.77 10.33 10.17 0.16 9.59 10.69

RBMY 3.39 0.41 2.67 4.19 2.8 0.02 2.78 3.08

TSPY 32.23 56.38 28.62 47.7 22.83 50.47 14.04 31.96

VCY 0 0 0 0 0 0 0 0

XKRY 15.21 6.11 13.78 19.81 22.19 3.95 20.49 25.52

Species 128.32 455.34 110.9 160.36 131.72 136.12 116.57 142.83


Table S2. P-values from permutation tests for copy number differences between Sumatran and Bornean orangutans. Given two species, we tested whether the difference in copy number between the species is significant. We compared the observed difference in mean copy number between the species to the differences in means obtained from one million random permutations (in which the species assignments of the individuals were randomly rearranged). The p-value represents the fraction of permuted mean differences that are larger than the one we observed in our actual data. The p-values that pass a Bonferroni-corrected cutoff for eight tests (0.05/8 = 0.00625) are highlighted in bold.

Gene family   p-value
BPY2          0.863
CDY           0.747
DAZ           0.02
HSFY          0.195
PRY           0.042
RBMY          0.094
TSPY          0.028
XKRY          0.001
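The permutation test described in the caption of Table S2 can be illustrated with a minimal sketch. It uses the XKRY copy numbers of the seven Bornean and five Sumatran orangutans listed in Table S10. Two points are assumptions of this sketch rather than the procedure used in the study: it enumerates all 792 possible relabelings exactly instead of drawing one million random permutations, and it compares absolute differences in means, counting ties with the observed value as extreme.

```python
from itertools import combinations
from statistics import mean

# XKRY copy numbers per individual, copied from Table S10.
bornean = [19.24, 14.86, 15.21, 17.17, 13.78, 19.81, 13.99]
sumatran = [23.19, 20.49, 25.52, 21.06, 22.19]


def exact_permutation_p(group_a, group_b):
    """Exact permutation p-value for a difference in group means.

    Every way of assigning len(group_a) individuals to the first group
    is enumerated. Using the absolute difference (two-sided) and counting
    ties as extreme are assumptions of this sketch.
    """
    pooled = group_a + group_b
    indices = range(len(pooled))
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    total = 0
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        if abs(mean(a) - mean(b)) >= observed:
            extreme += 1
        total += 1
    return extreme / total


p_xkry = exact_permutation_p(sumatran, bornean)
bonferroni_cutoff = 0.05 / 8  # eight gene families tested
print(f"XKRY p = {p_xkry:.4f}; passes cutoff: {p_xkry < bonferroni_cutoff}")
```

With only twelve individuals the exact enumeration is cheap; for larger samples, random permutations as described in the caption would be the practical choice.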


Table S3. The branch-level p-values showing the presence of significant shift in copy number when compared to its immediate ancestor in the great ape phylogenetic tree. The columns represent the branches in the great ape phylogenetic tree and the rows represent gene families with significant expansions or contractions in one or more branches as predicted by CAFE.

(Bonobo, (Bonobo, Gene (Bonobo, Chimp, Bornean Sumatran Bonobo Chimpanzee Human Chimp, Gorilla Orangutans family Chimp) Human, orangutan orangutan Human) Gorilla)

CDY 0.344 0.158 0.577 0.644 0.246 0.476 0.042 1.86×10^-3 0.814 0.298

RBMY 2.02×10^-7 9.61×10^-4 0.105 0.312 0.522 0.4 0.292 0.195 0.557 0.557

TSPY 1.39×10^-9 3.89×10^-7 0.324 0.115 0.032 5.07×10^-5 0.802 0.43 2.68×10^-4 1.01×10^-3

XKRY 0.5 0.5 0.026 0.79 0.716 0.843 0.135 4.31×10^-3 2.86×10^-3 6.46×10^-4


Table S4. Summary of CAFE results with five individuals per species added as star phylogeny. Each number in the table represents the number of times out of the 100 simulations in which CAFE estimated a significant shift in copy number (p<0.005). For each simulation, ampliconic gene copy numbers from five random individuals per species were used in CAFE analysis to capture the copy number variation within each species. Columns represent the branches of the phylogenetic tree. Rows represent ampliconic gene families. The numbers in bold are the branches with observed significant shifts in copy number when median copy number per species was used in CAFE analysis (Table S3).

(Bonobo, (Bonobo, Gene (Bonobo, Bornean Sumatran Bonobo Chimp Human Chimp, Gorilla Chimp, Human, Orangutans family Chimp) orangutan orangutan Human) Gorilla)

BPY2 1 0 1 0 0 0 15 15 4 0

CDY 15 0 0 3 9 0 15 100 5 2

DAZ 15 0 0 0 0 7 5 11 0 9

HSFY 0 0 13 0 14 1 0 0 0 3

PRY 0 0 15 0 0 0 6 12 0 12

RBMY 100 83 0 2 0 6 3 14 0 22

TSPY 99 99 0 15 15 84 15 15 57 39

VCY 4 11 0 4 0 1 0 0 0 0

XKRY 0 0 8 0 1 0 15 55 12 68


Table S5. Gene expression values for Y ampliconic gene families across great apes. Numbers represent read counts after normalization. The read counts for Y ampliconic gene families missing in great apes are represented as NA.

S. Bonobo Chimpanzee Human Gorilla B. orangutan Oranguta n SRR10 SRR10392 Gene SRR3068 SRR20 SRR204 SRR30 SRR11 SRR817 SRR110 SRR305 SRR30 392517 SRR2176 SRR1039 521 (this family 37 40590 0591 6825 00440 512 2852 3573 6810 (this 206 3300 study) study)

BPY2 25.79 79.63 30.49 65.6 58.19 39.66 78.9 62.82 0 4.42 40.17 4.71 36.65

CDY 115.5 97.32 201.21 388.88 793.28 77.89 258.37 157.29 52.09 85.44 107.11 53.64 2040.28

DAZ 274.72 183.58 646.94 420.81 274.12 611.46 493.15 630.82 277.79 251.9 1071.1 644.59 134.39

HSFY NA NA NA NA NA 219.13 436.96 588.14 1820.86 290.2 2155.58 558.95 5558.86

PRY NA NA NA NA NA 13.98 26.42 38.36 0 0 0 0.94 0

RBMY 590.93 453.42 934.47 453.48 356.82 492.76 519.57 575.19 403.67 474.34 66.94 46.11 0

TSPY 913.87 1431.04 530.45 1376.91 722.83 2329.14 1620.16 2013.06 883.3 212.13 1332.18 604.12 598.65

VCY NA NA 386.37 288.1 73.51 883.38 862.83 1179.4 NA NA NA NA NA

XKRY NA NA NA NA NA 0.57 0.7 0.96 0 0 0 0 0


Table S6. EVE-model-based likelihood ratios and p-values showing no significant shift in gene expression of shared ampliconic gene families across great apes.

Gene family   Likelihood ratio   p-value
BPY2          0.1064296          0.744
CDY           0.2014877          0.654
DAZ           0.2407138          0.623
RBMY          0.01832644         0.892
TSPY          0.02637181         0.871
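As an illustration of how the likelihood ratios in Table S6 relate to the p-values, the sketch below converts each likelihood ratio statistic into a p-value using a chi-square distribution with one degree of freedom. The choice of one degree of freedom is an assumption of this sketch; under it, the computed values agree with the tabulated p-values to within about 0.001.

```python
import math

# Likelihood ratios and reported p-values, copied from Table S6.
likelihood_ratios = {
    "BPY2": 0.1064296,
    "CDY": 0.2014877,
    "DAZ": 0.2407138,
    "RBMY": 0.01832644,
    "TSPY": 0.02637181,
}
reported_p = {"BPY2": 0.744, "CDY": 0.654, "DAZ": 0.623,
              "RBMY": 0.892, "TSPY": 0.871}


def chi2_sf_df1(statistic):
    """Upper-tail probability of a chi-square variable with 1 df.

    For 1 df, P(X > x) = erfc(sqrt(x / 2)), so no external library
    is needed.
    """
    return math.erfc(math.sqrt(statistic / 2.0))


computed_p = {gene: chi2_sf_df1(lr) for gene, lr in likelihood_ratios.items()}
for gene, p in computed_p.items():
    print(f"{gene}: computed {p:.3f}, reported {reported_p[gene]}")
```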

Table S7. Sum of median copy number values of shared genes within AZFb and AZFc regions (BPY2, CDY, DAZ and RBMY) across great apes. Gorillas have the lowest copy number and both species of orangutans have the highest copy number of these genes.

Species Sum of median copy number values

Gorilla 14.04

Human 21.19

Chimpanzee 22.49

Bonobo 34.67

B. orangutan 63.73

S. orangutan 64.58
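The sums in Table S7 can be recomputed from the median copy numbers in Table S1, as in the minimal sketch below (the dictionary layout is only illustrative). Recomputed this way, five of the six sums match the table; the gorilla medians (2.01, 7.86, 2.02, and 14.16) add up to 26.05 rather than 14.04, so that entry may be worth rechecking against Table S1.

```python
# Median copy numbers of the shared AZFb/AZFc genes, copied from Table S1.
medians = {
    "Gorilla":      {"BPY2": 2.01,  "CDY": 7.86,  "DAZ": 2.02,  "RBMY": 14.16},
    "Human":        {"BPY2": 3.26,  "CDY": 4.06,  "DAZ": 4.37,  "RBMY": 9.5},
    "Chimpanzee":   {"BPY2": 2.05,  "CDY": 5.28,  "DAZ": 4.3,   "RBMY": 10.86},
    "Bonobo":       {"BPY2": 1.06,  "CDY": 2.99,  "DAZ": 2.03,  "RBMY": 28.59},
    "B. orangutan": {"BPY2": 14.83, "CDY": 34.99, "DAZ": 10.52, "RBMY": 3.39},
    "S. orangutan": {"BPY2": 13.64, "CDY": 35.75, "DAZ": 12.39, "RBMY": 2.8},
}

# Sum the four medians per species, rounded to two decimals as in Table S7.
sums = {species: round(sum(genes.values()), 2)
        for species, genes in medians.items()}

for species, total in sorted(sums.items(), key=lambda item: item[1]):
    print(f"{species}: {total}")
```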


Table S8. All the great ape copy number samples used in the study.

Species              IID
Gorilla              KB3781
Gorilla              KB10845
Gorilla              KB14216
Gorilla              KB15257
Gorilla              KB15813
Gorilla              KB3456
Gorilla              KB3512
Gorilla              KB4987
Gorilla              KB4988
Gorilla              KB4989
Gorilla              KB5712
Gorilla              KB6319
Gorilla              KB7026
Gorilla              KB7801
Chimpanzee           Bandit
Chimpanzee           BigDaddy
Chimpanzee           Conan
Chimpanzee           Moose
Chimpanzee           Rock
Chimpanzee           Budda
Chimpanzee           Cordova
Chimpanzee           Neptune
Chimpanzee           Zippy
Bornean orangutan    KB13383
Bornean orangutan    KB3042
Bornean orangutan    KB4204
Bornean orangutan    KB5418
Bornean orangutan    KB5419
Bornean orangutan    KB6109
Bornean orangutan    KB9002
Bonobo               KB1843
Bonobo               KB3841
Bonobo               KB4229
Bonobo               KB7032
Bonobo               KB7781
Bonobo               KB7782
Bonobo               KB7998
Sumatran orangutan   KB4650
Sumatran orangutan   KB4661
Sumatran orangutan   KB5390
Sumatran orangutan   KB5565
Sumatran orangutan   KB5883
Human                7
Human                8
Human                17
Human                23
Human                42
Human                46
Human                47
Human                48
Human                67
Human                72


Table S9. List of RNA-Seq samples.

Species              NCBI SRA ID                               Replicates   Sequencing type   Tissue
Bornean orangutan    SRR2176206, SRR2176207                    2            Paired end        Testis
Bornean orangutan    In lab (3405): SRR10392514-SRR10392519    6            Paired end        Testis
Sumatran orangutan   SRR10393299-SRR10393304                   6            Paired end        Testis
Bornean orangutan    SRR306798                                 1            Single end        Liver
Gorilla              SRR306810                                 1            Single end        Testis
Gorilla              SRR3053573, SRR10393358                   2            Paired end        Testis
Gorilla              SRR306808                                 1            Single end        Liver
Chimpanzee           SRR306825                                 1            Single end        Testis
Chimpanzee           SRR2040590                                1            Paired end        Testis
Chimpanzee           SRR2040591                                1            Paired end        Testis
Chimpanzee           SRR306823                                 1            Single end        Liver
Bonobo               SRR306837                                 1            Single end        Testis
Bonobo               In lab (5013): SRR10392519-SRR10392521    3            Paired end        Testis
Bonobo               SRR306835                                 1            Single end        Liver
Human                SRR1090722                                1            Paired end        Testis
Human                SRR306825                                 1            Paired end        Testis
Human                SRR817512                                 1            Paired end        Testis
Human                SRR1100440                                1            Paired end        Testis
Human                SRR1071668                                1            Paired end        Liver


Table S10. Mean values of ampliconic gene copy numbers across at least three replicates.

Species       IID       BPY2   CDY    DAZ    HSFY   PRY    RBMY   TSPY   VCY    XKRY
Human         17        3.31   4.4    5.27   2.28   2.15   10.66  33.04  2.14   2.14
Human         23        2.76   3.47   4.51   1.58   2.49   4.95   35.29  2.12   1.55
Human         42        1.68   3.08   4.81   1.58   2      7.63   26.03  2.03   1.53
Human         46        3.67   4.63   4.49   2.28   2.03   6.82   36.69  2.09   2.2
Human         47        3.49   3.85   4.36   2.04   2.18   11.03  32.88  3.17   1.96
Human         48        2.15   2.87   2.78   2.09   2.78   9.32   32.24  2.09   2.01
Human         67        3.3    4.36   4.33   2.29   2.08   9.69   39.09  2.03   2.17
Human         7         3.22   4.26   4.31   2.05   2.12   10.41  33.03  2.38   1.95
Human         72        2.88   3.5    2.99   1.95   1.59   9.18   31.24  1.67   1.93
Human         8         3.34   4.54   4.37   2.27   2.21   10.71  35.72  2.38   2.21
Chimpanzee    Bandit    1.84   4.59   3.46   NA     NA     6.25   11.61  1.88   NA
Chimpanzee    BigDaddy  2.19   5.47   4.44   NA     NA     12.01  11.54  2.25   NA
Chimpanzee    Budda     2.2    5.52   4.33   NA     NA     9.11   15.85  2.09   NA
Chimpanzee    Conan     2.03   5.28   4.3    NA     NA     9.65   17.62  2.05   NA
Chimpanzee    Cordova   2.03   4.98   4.09   NA     NA     10.46  15.15  2.03   NA
Gorilla       KB10845   1.97   9.82   1.96   7.84   2.04   13.71  5.87   NA     2.01
B. orangutan  KB13383   16.01  49.64  12.04  5.56   10.29  4.03   43.56  NA     19.24
Gorilla       KB14216   2      6.01   2.07   3.97   1.96   15.86  6      NA     1.98
Gorilla       KB15257   2.11   6.22   2.07   4.14   2.06   14.43  8.21   NA     2.05
Gorilla       KB15813   1.97   7.9    2.06   5.98   1.99   13.8   7.88   NA     2
Bonobo        KB1843    1.12   3.09   2.19   NA     NA     31.52  51.86  NA     NA
B. orangutan  KB3042    21.07  39.31  12.59  4.4    7.55   4.19   47.7   NA     14.86
Gorilla       KB3456    2.17   7.81   2.17   5.93   2.08   15.91  7.93   NA     1.99
Gorilla       KB3512    1.95   6.13   2.02   4.09   2.08   14.1   6      NA     2.03
Gorilla       KB3781    1.98   8.32   1.97   6.1    2.02   16.14  8.08   NA     1.98
Bonobo        KB3841    1.06   2.98   1.86   NA     NA     28.8   48.07  NA     NA
B. orangutan  KB4204    12.77  30.15  9.73   4.36   6.77   2.83   29.08  NA     15.21
Bonobo        KB4229    2.15   3.25   2.13   NA     NA     28.32  44.64  NA     NA
S. orangutan  KB4650    13.38  35.75  13.84  5.98   10.69  2.8    31.96  NA     23.19
S. orangutan  KB4661    12.43  33.16  11.55  4.95   9.9    2.78   21.31  NA     20.49
Gorilla       KB4987    1.96   5.99   1.98   4.03   1.97   13.9   6.04   NA     1.97
Gorilla       KB4988    2.04   9.87   2.01   7.97   1.99   13.85  6      NA     1.96
Gorilla       KB4989    2.01   8.07   2.02   6.06   2.05   12.03  6.06   NA     2.06
S. orangutan  KB5390    13.64  37.72  12.39  6.47   10.17  2.98   22.83  NA     25.52
B. orangutan  KB5418    14.83  34.99  10.52  4.86   10.33  3.39   32.23  NA     17.17
B. orangutan  KB5419    13.89  30.04  9.69   4.71   8.46   2.96   30.44  NA     13.78
S. orangutan  KB5565    13.69  39.58  11.84  5.26   9.59   3.08   14.04  NA     21.06
Gorilla       KB5712    2.02   5.91   1.96   4.95   2.09   13.73  5.84   NA     1.95
S. orangutan  KB5883    25.33  34.3   12.97  5.49   10.2   2.79   29.55  NA     22.19
B. orangutan  KB6109    21.54  46.61  10.65  6.21   7.98   4.05   35.3   NA     19.81
Gorilla       KB6319    2.1    6.09   2      4.14   2.13   18.06  8.1    NA     2.01
Gorilla       KB7026    2.13   12.17  2.08   8.19   2.07   14.21  6.08   NA     2.03
Bonobo        KB7032    1.96   2.99   2.06   NA     NA     28.59  52.19  NA     NA
Bonobo        KB7781    1.06   3.09   2.03   NA     NA     29.57  48.41  NA     NA
Bonobo        KB7782    0.84   2.49   1.65   NA     NA     23.13  39.31  NA     NA
Gorilla       KB7801    2.04   8.34   2.11   6.23   2.12   14.7   8.24   NA     2.09
Bonobo        KB7998    0.98   2.93   1.87   NA     NA     22.53  38.09  NA     NA
B. orangutan  KB9002    13.28  30.38  9.81   5.58   8.79   2.67   28.62  NA     13.99
Chimpanzee    Moose     2.05   5.25   4.09   NA     NA     11.1   25.45  2.07   NA
Chimpanzee    Neptune   2.3    5.5    4.31   NA     NA     10.86  23.67  2.13   NA
Chimpanzee    Rock      1.97   4.94   3.82   NA     NA     11     20.96  1.99   NA
Chimpanzee    Zippy     2.21   5.42   4.35   NA     NA     12.18  21.51  2.08   NA


Table S11. Phenotypes of sperm across great apes. The sperm phenotypes were obtained from different studies. The phenotypes in bold were used to compare the ampliconic gene copy number trends.

Gorilla Homo Pan Pan Pongo Phenotype gorilla sapiens paniscus troglodytes pygmaeus Reference (Anderson, Nyholt, & Dixson, Head length (μm3) 6.8 4.5 4.7 4.7 4.9 2005) (Anderson Midpiece length(μm3) 13.2 5.9 8.9 6.3 8.9 et al., 2005) (Anderson Tail length(μm3) 32.3 46.5 54.5 49.4 46.7 et al., 2005) (Anderson Head volume(μm3) 62.3 28.2 29.4 31.9 38.6 et al., 2005) (Anderson Midpiece volume(μm3) 6.9 2.8 9.3 7.8 3.9 et al., 2005) (Anderson Tail volume(μm3) 5.1 5.4 9.1 7.8 7.3 et al., 2005) Multiple male- Multiple multiple male-multiple (Good et Mating system Polygynous Various female female Dispersed al., 2013) (Dixson & Anderson, Body weight (kg) 169 68 39.1 44.34 74.64 2004) (Dixson & Combined testes weight Anderson, (g) 29.6 40.5 135.2 118.8 35.3 2004) (Dixson & Circulating testosterone Anderson, (ng/ml) 4.1 5.7 1.2 4.3 2.4 2004) −0.622 0.430 (Dixson & (Small (Large Anderson, Residual testes weight Testis) −0.251 (S) Testis) 0.326 (L) −0.335 (S) 2004) (Wistuba et No of tubules 400 200 200 100 al., 2003) (Wistuba et No of stages per tubule 2.03 2.26 2.26 1.69 al., 2003) (Wistuba et Multistage tubules % 79.5 91 88 55 al., 2003)


(Møller, Ejaculation volume (ml) 0.522 4.3 1.8 2.01 1.1 1988) (Møller, Sperm conc (106ml-1) 142.86 63.5 456.1 370.16 61 1988) (Møller, Number of sperm (106) 83.8 255 821 656.83 67.1 1988) (Møller, Sperm Motility (%) 50 60 75 41.66 47 1988) (Møller, Body Weight (kg) 200 63.5 45 44.3 74.6 1988) (Møller, Testes weight (kg) 0.0296 0.0405 0.1188 0.0353 1988) (Fujii- Hanamoto, Matsubaya shi, Nakano, Kusunoki, Total Testicular & Enomoto, volume(ml) 7.6 121.7 27.7 2011) (Fujii- Spermatogenic Index (x Hanamoto 10-3) 0.67 149 10 et al., 2011)


Supplemental Figures

Figure S1. Relationship between median and variance of copy number across all great apes. The gray line shows the best fit from ordinary least squares regression (R2 = 0.1476; p-value = 0.1664).


Figure S2. Principal components analysis of the overall copy number of Y ampliconic gene families across great apes. Proportion of total variance explained by the first five principal components (PCs) is on the Y-axis. The first, second, third, fourth, and fifth PCs explained 68.7%, 22.8%, 6.5%, 1.2%, and 0.8% of the variation, respectively.


Figure S3. Across species, gene families with higher copy number have higher variance. In each of the scatterplots the X-axis represents natural log of median copy number and the Y-axis represents natural log of variance in copy number. The Spearman correlations were calculated using the cor.test() function in R and the p-values are in parentheses. The black line represents the linear function fitted to the given data points. The dots are color-coded to represent the six species, with missing dots indicating that the corresponding gene families are either lost or pseudogenized in that species.


Figure S4. Phenotypes related to sperm competition in great apes.

The X-axis represents species in decreasing order of the level of sperm competition and the Y-axis in each plot represents one phenotype: (A) sperm midpiece volume (Anderson et al., 2005), (B) residual testis weight (ResTW) after correcting for body weight (Dixson & Anderson, 2004), (C) sperm concentration (Møller, 1988), and (D) sperm motility (Møller, 1988). A summary of all available sperm phenotype values for great apes is provided in Table S11.


Figure S5. Copy number variation in CDY, RBMY, TSPY, and XKRY.

The X-axis represents species in decreasing order of their sperm competition and Y-axis represents the copy number of (A) CDY, (B) RBMY (C) TSPY, and (D) XKRY. The black points represent the variance in copy number for each species.


Figure S6. Transcriptome assembly pipeline.


Figure S7. Principal components analysis of great ape gene expression values (RLD-normalized). The gene expression values were normalized using regularized log transformation rlog() function in DESeq2 (Love, Huber, & Anders, 2014).


Figure S8. Principal components analysis of great ape gene expression values (VST-normalized). Gene expression values were normalized using Variance Stabilizing Transformation varianceStabilizingTransformation() function in DESeq2 (Love et al., 2014).


Figure S9. Heatmap depicting the Euclidean distances of gene expression values between pairs of sampled individuals.


References

Anderson, M. J., Nyholt, J., & Dixson, A. F. (2005). Sperm competition and the evolution

of sperm midpiece volume in mammals. Journal of Zoology, 267(02), 135.

Ardlie, K. G., DeLuca, D. S., Segrè, A. V., Sullivan, T. J., Young, T. R., Gelfand, E. T., …

Lockhart. (2015). The Genotype-Tissue Expression (GTEx) pilot analysis:

Multitissue gene regulation in humans. Science, 348(6235), 648–660.

Dixson, A. F., & Anderson, M. J. (2004). Sexual behavior, reproductive physiology and

sperm competition in male mammals. Physiology and Behavior, 83(2), 361–371.

Fujii-Hanamoto, H., Matsubayashi, K., Nakano, M., Kusunoki, H., & Enomoto, T. (2011).

A comparative study on testicular microstructure and relative sperm production in

gorillas, chimpanzees, and orangutans. American Journal of Primatology, 73(6),

570–577.

Good, J. M., Wiebe, V., Albert, F. W., Burbano, H. A., Kircher, M., Green, R. E., …

Pääbo, S. (2013). Comparative population genomics of the ejaculate in humans and

the great apes. Molecular Biology and Evolution, 30(4), 964–976.

Hahn, M. W., Demuth, J. P., & Han, S.-G. (2007). Accelerated rate of gene gain and loss

in primates. Genetics, 177(3), 1941–1949.

Hallast, P., Maisano Delser, P., Batini, C., Zadik, D., Rocchi, M., Schempp, W., …

Jobling, M. A. (2016). Great ape Y Chromosome and mitochondrial DNA

phylogenies reflect subspecies structure and patterns of mating and dispersal.

Genome Research, 26(4), 427–439.

Han, M. V., Thomas, G. W. C., Lugo-Martinez, J., & Hahn, M. W. (2013). Estimating

gene gain and loss rates in the presence of error in genome assembly and

annotation using CAFE 3. Molecular Biology and Evolution, 30(8), 1987–1997.

Locke, D. P., Hillier, L. W., Warren, W. C., Worley, K. C., Nazareth, L. V., Muzny, D. M.,


… Wilson, R. K. (2011). Comparative and demographic analysis of orang-utan

genomes. Nature, 469(7331), 529–533.

Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and

dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550.

Møller, A. P. (1988). Ejaculate quality, testes size and sperm competition in primates.

Journal of Human Evolution, 17(5), 479–488.

Rohlfs, R., & Nielsen, R. (2014). Phylogenetic ANOVA: The Expression Variance and

Evolution (EVE) model for quantitative trait evolution. https://doi.org/10.1101/004374

Rohlfs, R. V., Harrigan, P., & Nielsen, R. (2014). Modeling gene expression evolution

with an extended Ornstein-Uhlenbeck process accounting for within-species

variation. Molecular Biology and Evolution, 31(1), 201–211.

Rohlfs, R. V., & Nielsen, R. (2015). Phylogenetic ANOVA: The expression variance and

evolution model for quantitative trait evolution. Systematic Biology, 64(5), 695–708.

Wistuba, J., Schrod, A., Greve, B., Hodges, J. K., Aslam, H., Weinbauer, G. F., &

Luetjens, C. M. (2003). Organization of seminiferous epithelium in primates:

relationship to spermatogenic efficiency, phylogeny, and mating system. Biology of

Reproduction, 69(2), 582–591.


Appendix C

Supporting Material for Chapter 4

Supplemental Note S1. Evolutionary scenarios for palindromes

Palindrome 4. We used two different approaches to obtain information about the conservation of human and chimpanzee palindromes across great apes. First, we used the multiple sequence alignment of Y chromosome assemblies to obtain the coverage of these palindromes. In the case of P4, we observed 23,175 bp of alignment in the bonobo assembly (Table S1). This analysis gave us the percentage of P4 present in the bonobo Y assembly; however, it does not imply that there is a continuous 23-kb block of palindrome P4 on the bonobo Y. P4 could be highly fragmented due to Y chromosome degradation or rearrangements, and the multiple sequence alignment can still capture such homologous fragments of the palindromes. The longest blocks in the Y multiple sequence alignment that overlap with P4 are 2-4 kb in length, and these alignment blocks included sequences from gorilla and human Ys. In the remaining species, sequences homologous to P4 are mostly represented by gaps. Second, to identify the copy number of P4 homologs present in other species, we used LASTZ-based alignments (Harris, 2007). Non-overlapping 1-kb windows of the assembly were aligned to human P4 using LASTZ. The read depth of windows with >80% identity to P4 was used to estimate the copy number of P4. However, we did not find any 1-kb window in the bonobo Y that maps to human palindrome P4 with >80% identity. Since we did not find high-confidence windows aligning to P4 in bonobo, we concluded that it is highly fragmented in this species.
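The second approach can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function name, the input layout (one tuple of percent identity and read depth per 1-kb window), and the normalization by a single-copy baseline depth are all assumptions, and the example numbers are invented.

```python
from statistics import median


def estimate_copy_number(windows, single_copy_depth, min_identity=80.0):
    """Estimate palindrome copy number from window read depth.

    windows: list of (percent_identity, read_depth) tuples, one per
        1-kb window aligned to the palindrome (hypothetical layout).
    single_copy_depth: read depth expected for a single-copy region
        (assumed normalization; the text does not specify it).
    Returns None when no window exceeds the identity threshold, which
    corresponds to the situation described for P4 in bonobo.
    """
    depths = [depth for identity, depth in windows if identity > min_identity]
    if not depths:
        return None
    return median(depths) / single_copy_depth


# Invented example: three windows pass the >80% identity filter.
example_windows = [(95.0, 40), (85.0, 44), (70.0, 100), (90.0, 42)]
print(estimate_copy_number(example_windows, single_copy_depth=20))

# Invented example: no window passes the filter, so no estimate is made.
print(estimate_copy_number([(62.0, 35), (75.5, 18)], single_copy_depth=20))
```

Using the median rather than the mean keeps a few windows with unusually high depth from dominating the estimate, which matches the use of median read depth reported for Tables S3 and S4.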

Evolution of sequences homologous to human and chimpanzee palindromes.

Common ancestor of great apes. Partial sequences of all human and chimpanzee palindromes were present. P1, P2, P4, P5, P8, C2, C3, C4, and C17 were in a multi-copy state (Tables SN1A-B). P3, P6, and C1 might have been in a single- or multi-copy state (Tables SN1A-B).

Orangutan. Increase in copy number of P1, P2, P5, C3, and C4 (Fig. 1A, Tables SN1A-B). Loss of C3 (≈25% loss in coverage; Table S2) and P2 (≈27% loss in coverage; Table S1) segments when compared to other great apes.


Common ancestor of gorilla, human, bonobo, and chimpanzee. P3, P6, and C1 are in a multi-copy state either at this node or in the common ancestor of great apes (Tables SN1A-B).

Gorilla. Loss of copy number for P8, C2, and C17 compared to bonobo and orangutan (Fig. 1, Tables SN1A-B). C3 and C4 had more than two copies in human, chimpanzee and orangutan, however only two copies in gorilla (Tables SN1A-B). For C2, bonobo and orangutan have higher copy number when compared to gorilla. Loss of segments in C1 (≈15% loss in coverage) and C19 (≈40-60% loss in coverage) when compared to other great apes (Table S2).

Common ancestor of human, bonobo, and chimpanzee. All the palindromes are assumed to be multi-copy with the exception of C5, which could have been in a single-copy state (Tables SN1A-B). Gorilla and orangutan have more than two copies of P4 which is in two copies or lost in human, bonobo and chimpanzees (Tables SN1A-B).

Human. Only ≈30-35% of palindrome P3 is covered in other great apes (Table S1), so we assume that the remaining portion of P3 is human-specific. Humans also lost most of the sequences homologous to palindrome C2; we observe some sequences homologous to C2 on the human Y, but they are degraded and not visible as an alignment in the human and chimpanzee Y dot plot (Hughes et al., 2010). Therefore, we assume that C2 was deleted in human.

Common ancestor of bonobo and chimpanzee. Gain of a C2 segment: bonobo and chimpanzee share 85% coverage, whereas the other great apes cover 20-30% of C2, which implies a Pan-genus-specific gain of C2 sequences (Table S2). The C1 and C5 groups are present in more than two copies in both chimpanzee (Hughes et al., 2010) and bonobo, whereas other species have two or fewer copies of these palindromes (Fig. 1A). The Pan genus lost P4: sequences homologous to P4 are present on the bonobo and chimpanzee Y, but they are degraded and not visible as an alignment in the human and chimpanzee Y dot plot (Hughes et al., 2010). Therefore, we assume that P4 was deleted.

Bonobo. Bonobo lost copies of P1, P2, P6, P7, C3, C4, C18, and C19 (Fig. 1A, Tables SN1A-B). It also experienced loss of segments in C18 (≈30-60% loss in coverage; Table S2) and P7 (≈60% loss in coverage; Table S1) when compared to other great apes.


Chimpanzee. Chimpanzee gained copies of P1, P2, and P5 as these palindromes share homology with C3 and C4, which are in multi-copy in chimpanzee (Hughes et al., 2010). Chimpanzee also gained a segment of C1, a palindrome which has <50% coverage with the other great ape Y chromosomes (Table S2).

Table SN1A. Reconstructing human palindrome evolution using maximum parsimony. The values from extant species were taken from Tables S3-4 and rounded to the following numbers of copies: <1.34 => "1", 1.33-1.66 => "1-2", 1.66-2.5 => "2", >2.5 => "M" ("more than two").

              P1      P2      P3#    P4     P5     P6     P7     P8
Bonobo        1-2     1-2     1-2    0      2      1-2    1      2
Chimpanzee    M       M       1$     0      M      2      2      2
BC*           1-2-M   1-2-M   1      0      2-M    2      1-2    2
Human         2       2       2      2      2      2      2      2
BCH**         2       2       1-2    2      2      2      2      2
Gorilla       2       2       2      M      2      1-2    1      1
BCHG***       2       2       2      2-M    2      2      1-2    1-2
Orangutan     M       M       1      M      M      1      1      2
GA****        2-M     2-M     1-2    M      2-M    1-2    1      2

*Bonobo-chimpanzee common ancestor
**Bonobo-chimpanzee-human common ancestor
***Bonobo-chimpanzee-human-gorilla common ancestor
****Common ancestor of great apes
$We conservatively assigned the copy number of P3 as 1 in chimpanzee to make sure we do not inflate its combined copy number.
#We can conservatively assume that P3 became multi-copy in BCHG, but it might instead have been multi-copy in the common ancestor of great apes and lost its multi-copy status in orangutan.
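The rounding rule in the caption of Table SN1A can be written as a small function. The sketch below is illustrative only: the caption's ranges overlap slightly at their boundaries (1.33-1.34 and 1.66), so the exact handling of boundary values here is an assumption. The inputs are the bonobo, gorilla, and orangutan copy numbers for P1-P8 reported in Table S3.

```python
def copy_number_class(value):
    """Map an estimated copy number to the classes used in Table SN1A.

    Boundary handling (which class 1.34, 1.66, and 2.5 fall into) is
    an assumption; the caption does not resolve it.
    """
    if value == 0:
        return "0"
    if value < 1.34:
        return "1"
    if value <= 1.66:
        return "1-2"
    if value <= 2.5:
        return "2"
    return "M"  # "more than two"


# Copy numbers for P1..P8, copied from Table S3.
table_s3 = {
    "Bonobo":    [1.64, 1.46, 1.62, 0, 2.13, 1.42, 0.70, 1.98],
    "Gorilla":   [2.19, 2.13, 1.87, 2.74, 2.04, 1.42, 1.16, 1.19],
    "Orangutan": [6.52, 13.13, 1.05, 5.67, 7.29, 1.06, 1.04, 1.80],
}

classes = {species: [copy_number_class(v) for v in values]
           for species, values in table_s3.items()}
for species, row in classes.items():
    print(species, " ".join(row))
```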

Table SN1B. Reconstructing chimpanzee palindrome evolution using maximum parsimony. The values from extant species were taken from Tables S3-4 and rounded to the following numbers of copies: <1.34 => "1", 1.33-1.66 => "1-2", 1.66-2.5 => "2", >2.5 => "M" ("more than two").

              C1 group   C2 group   C3 group   C4 group   C5 group   C17    C18    C19
Chimpanzee    M          M          M          M          M          2      2      2
Bonobo        M          M          1          1-2        M          2      1      1
BC*           M          M          1-M        M          M          2      2      1-2
Human         2          0          M          M          1          2      2      2
BCH**         2-M        M          M          M          1-M        2      2      2
Gorilla       2          2          2          2          1          1      1      1
BCHG***       2          2-M        2-M        2-M        1          1-2    1      1-2
Orangutan     1          M          M          M          1          2      1      1
GA****        1-2        M          M          M          1          2      1      1

*Bonobo-chimpanzee common ancestor
**Bonobo-chimpanzee-human common ancestor
***Bonobo-chimpanzee-human-gorilla common ancestor
****Common ancestor of great apes


Supplemental Note S2. Augustus predictions

AUGUSTUS (Stanke & Waack, 2003) predicted 219 genes on the bonobo Y assembly, of which 25 complete or partial genes represent homologs of known human protein-coding genes. In the case of Sumatran orangutan, AUGUSTUS predicted 90 genes, of which 33 complete or partial genes represent homologs of known human protein-coding genes. We required gene predictions (1) to have start and stop codons and (2) to be present on contigs that align to the human or chimpanzee Y. After applying these requirements, we did not find any novel genes on the Y chromosome of orangutan; however, we found two candidates, SUZ12 and PSMA6, which have >95% identity and >90% coverage to the gene homologs on the autosomes of bonobo.

A possible transposition of the autosomal SUZ12 gene (located on human chromosome 17) onto the bonobo Y chromosome was predicted (Table SN2) based on the limited number of introns (one intron), in contrast to its autosomal homolog, which has 15 introns (NM_015355). The SUZ12 gene has no matches to the human and gorilla Y; however, the first 121 bp of its predicted sequence align to the chimpanzee (palindromes C2, C11, and C15) and orangutan Y. When we aligned testis RNA-seq data to the predicted SUZ12 gene on the bonobo Y chromosome, the first exon with the start codon was not expressed (Fig. SN2A), whereas the second exon was expressed (Fig. SN2B). The single nucleotide variants in the RNA-seq reads mapping to the second exon are consistent with the variants of the SUZ12 gene present on chromosome 17. Thus, we concluded that the translocated SUZ12 was pseudogenized on the bonobo Y.

The PSMA6 gene was also predicted in bonobo; it shared 99.6% identity with its homolog on the chimpanzee Y and 97.8% identity with its homolog on the human Y. However, there were no homologous sequences in the orangutan and gorilla Y assemblies. The PSMA6 gene on the human Y was annotated as a pseudogene in the database (Gene ID: 5687). Therefore, we concluded that PSMA6 is also a pseudogene on the bonobo Y. Thus, no novel genes, as compared to the human Y chromosome genes, were found on the bonobo and orangutan Y chromosomes.


Table SN2. Gene annotation of SUZ12 homolog as predicted by AUGUSTUS

Seqname     Source     Feature       Start   End     Strand
Contig591   AUGUSTUS   gene          74      61572   -
Contig591   AUGUSTUS   transcript    74      61572   -
Contig591   AUGUSTUS   stop_codon    74      76      -
Contig591   AUGUSTUS   CDS           74      2353    -
Contig591   AUGUSTUS   CDS           61453   61572   -
Contig591   AUGUSTUS   start_codon   61570   61572   -
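The single-intron structure mentioned in the text can be read directly off the CDS coordinates in Table SN2. The sketch below is a minimal illustration (the helper name and tuple layout are assumptions of this sketch): it derives the intron as the gap between the two CDS segments and sums the coding length, using 1-based inclusive coordinates as in the table.

```python
# CDS segments of the predicted SUZ12 homolog on Contig591 (Table SN2),
# as (start, end) pairs in 1-based inclusive coordinates.
cds_segments = [(74, 2353), (61453, 61572)]


def introns_between(segments):
    """Return the gaps between consecutive CDS segments as introns."""
    ordered = sorted(segments)
    return [
        (previous_end + 1, next_start - 1)
        for (_, previous_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


introns = introns_between(cds_segments)
cds_length = sum(end - start + 1 for start, end in cds_segments)
intron_lengths = [end - start + 1 for start, end in introns]

print("introns:", introns)
print("intron lengths:", intron_lengths)
print("total CDS length:", cds_length)
```

Because the gene is on the minus strand, the segment at 61453-61572 (which contains the start codon) is the first exon in transcript order.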


Supplemental Figure SN2A. IGV (Robinson et al., 2011) view of the first exon of the SUZ12 homolog on the bonobo Y. Testis-specific RNA-seq reads (SRA ID: SRR306837) were mapped to the bonobo Y assembly using BWA-MEM (Li, 2013).


Supplemental Figure SN2B. IGV view of the second exon of the SUZ12 homolog on the bonobo Y. Testis-specific RNA-seq reads (SRA ID: SRR306837) were mapped to the bonobo Y assembly using BWA-MEM (Li, 2013).


Supplemental Tables

Table S1. The sequence coverage (percentage) of human palindromes P1-8 (arms) across great apes. The repeats annotated by RepeatMasker (SMIT & A., 2004) were excluded in the percentage calculations.

Human        Length, bp                                               Coverage, percentage
palindrome   Chimpanzee  Bonobo   Gorilla  Orangutan  Human*          Chimpanzee  Bonobo  Gorilla  Orangutan
P1           362895      340679   334030   249906     608650          59.62       55.97   54.88    41.06
P2           50379       50144    49092    29106      76387           65.95       65.64   64.27    38.10
P3           64460       62335    68254    55671      179793          35.85       34.67   37.96    30.96
P4           77345       76543    83948    72509      93979           82.30       81.45   89.33    77.15
P5           158548      158405   157040   148243     166929          94.98       94.89   94.08    88.81
P6           35555       35436    34535    32729      36832           96.53       96.21   93.76    88.86
P7           4439        1134     4385     4134       5414            81.99       20.95   80.99    76.36
P8           15027       13698    12688    12783      16728           89.83       81.89   75.85    76.42
Total        768648      738374   743972   605081     1184712         64.88       62.33   62.80    51.07

*Palindrome arm length
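The coverage percentages in Table S1 are consistent with dividing the aligned length in each species by the human palindrome arm length. The following minimal sketch reproduces two entries; the function name is illustrative, and treating the tabulated lengths as already repeat-excluded is an assumption.

```python
def coverage_percent(aligned_bp, arm_length_bp):
    """Percentage of a palindrome arm covered by aligned sequence."""
    return round(100.0 * aligned_bp / arm_length_bp, 2)


# Values copied from Table S1.
p1_chimpanzee = coverage_percent(362895, 608650)  # P1, chimpanzee
p5_gorilla = coverage_percent(157040, 166929)     # P5, gorilla

print(p1_chimpanzee, p5_gorilla)
```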


Table S2. The sequence coverage (percentage) of chimpanzee palindromes C1-19 (arms) across great apes. The repeats annotated by RepeatMasker (SMIT & A., 2004) were excluded from the calculations. The palindromes were clustered into five homology groups: C1 (C1+C6+C8+C10+C14+C16), C2 (C2+C11+C15), C3 (C3+C12), C4 (C4+C13), and C5 (C5+C7+C9). The numbers in bold represent the palindrome with highest coverage within each group, which we used as the representative coverage of that homology group.

Length, bp Coverage, percentage

Chimpanzee palindrome Human Bonobo Gorilla Orangutan Chimpanzee* Human Bonobo Gorilla Orangutan

C1 36118 35368 25476 37460 66916 53.98 52.85 38.07 55.98

C2 35233 111830 27957 38673 141680 24.87 78.93 19.73 27.30

C3 58885 59610 57211 38076 82274 71.57 72.45 69.54 46.28

C4 119067 120124 119349 116986 140401 84.80 85.56 85.01 83.32

C5 113353 107475 113915 110252 136579 82.99 78.69 83.41 80.72

C6 16015 15745 6517 17378 58166 27.53 27.07 11.20 29.88

C7 45695 45809 46167 45408 137290 33.28 33.37 33.63 33.07

C8 21783 21047 12228 22766 64725 33.65 32.52 18.89 35.17

C9 46890 46809 47450 46382 139044 33.72 33.66 34.13 33.36

C10 16096 15957 6926 17487 59322 27.13 26.90 11.68 29.48

C11 24198 105870 27840 38567 123692 19.56 85.59 22.51 31.18

C12 46539 47451 46260 26584 76418 60.90 62.09 60.54 34.79

C13 117844 118277 117312 116043 132034 89.25 89.58 88.85 87.89

C14 19998 19943 11045 21285 47853 41.79 41.68 23.08 44.48

C15 23050 95239 25619 36280 111841 20.61 85.16** 22.91 32.44

C16 40950 33549 28037 41545 76391 53.61 43.92 36.70 54.38

C17 15364 15072 12961 12978 17516 87.71 86.05 74.00 74.09

C18 6686 2582 4741 6304 6888 97.07 37.49 68.83 91.52

C19 155747 156318 57117 122860 168341 92.52 92.86 33.93 72.98

Total 426178 488431 303092 383879 637282 72.96 81.67 57.36 66.48

*Palindrome arm length
**In the case of cluster C2, different palindromes had the highest coverage for different species, and we considered C15 as a representative because it had the highest coverage for more than one species, while other palindromes had the highest coverage for only one species.


Table S3. The copy number for sequences homologous to human palindromes. The numbers for bonobo, gorilla, and orangutan were obtained based on median read depth of 1-kb windows homologous to human or chimpanzee palindromes. The copy numbers for chimpanzee and human were obtained by examining the dot plot of human and chimpanzee Y (Hughes et al., 2010). In brackets, we indicate the known homologs of chimpanzee and human palindromes in human and chimpanzee, respectively (Hughes et al., 2010).

Species      P1          P2       P3*            P4     P5       P6       P7       P8
Human        2           2        2              2      2        2        2        2
Chimpanzee   >2(C3+C4)   >2(C3)   >2(C1 parts)   0      >2(C4)   2(C19)   2(C18)   2(C17)
Bonobo       1.64        1.46     1.62           0      2.13     1.42     0.70     1.98
Gorilla      2.19        2.13     1.87           2.74   2.04     1.42     1.16     1.19
Orangutan    6.52        13.13    1.05           5.67   7.29     1.06     1.04     1.80

*Note: We assume that some parts of human palindrome P3 in chimpanzee are multi-copy (those that share homology with C1), while others are single-copy.


Table S4. The copy number for sequences homologous to chimpanzee palindromes. The numbers for bonobo, gorilla, and orangutan were obtained based on median read depth of 1-kb windows homologous to human or chimpanzee palindromes. The copy numbers for chimpanzee and human were obtained by examining the dot plot of human and chimpanzee Y (Hughes et al., 2010). In brackets, we indicate the known homologs of chimpanzee and human palindromes in human and chimpanzee, respectively (Hughes et al., 2010).

Species      C1 group      C2 group   C3 group    C4 group    C5 group   C17     C18     C19
Human        2(P3 parts)   0          >2(P1,P2)   >2(P5,P1)   1          2(P8)   2(P7)   2(P6)
Chimpanzee   >2            >2         >2          >2          >2         2       2       2
Bonobo       13.29         8.42       1.18        1.41        9.31       2.27    0.80    1.25
Gorilla      2.28          2.48       2.13        2.07        1.30       1.19    1.15    1.28
Orangutan    1.02          6.07       12.13       7.85        1.07       1.87    1.02    1.02


Table S5. Gene birth and death rates (in events per millions of years) on the Y chromosome of great apes, as predicted using the Iwasaki and Takagi gene reconstruction model. BC - common ancestor of bonobo and chimpanzee; BCH - common ancestor of bonobo, chimpanzee, and human; BCHG - common ancestor of bonobo, chimpanzee, human, and gorilla. GA - common ancestor of great apes.

                 X-degenerate genes        Ampliconic genes
Branch           Birth rate   Death rate   Birth rate   Death rate
Bonobo-BC        1.00x10^-4   1.00x10^-4   1.00x10^-4   0.182
Chimpanzee-BC    1.00x10^-4   8.00x10^-2   1.00x10^-4   1.00x10^-4
Human-BCH        1.90x10^-5   1.90x10^-5   1.90x10^-5   1.90x10^-5
Gorilla-BCHG     1.43x10^-5   1.43x10^-5   1.43x10^-5   1.43x10^-5
Orangutan-GA     7.14x10^-6   2.06x10^-2   7.14x10^-6   7.14x10^-6
Macaque-Root     3.45x10^-6   3.45x10^-6   3.45x10^-6   9.92x10^-3
BC-BCH           2.35x10^-5   4.89x10^-2   2.35x10^-5   9.54x10^-2
BCH-BCHG         5.71x10^-5   5.71x10^-5   5.71x10^-2   5.71x10^-5
BCHG-GA          1.43x10^-5   1.43x10^-5   1.43x10^-5   1.43x10^-5
GA-Root          6.67x10^-6   4.04x10^-3   6.67x10^-6   6.67x10^-6


Table S6. The coordinates of palindromes on the panTro6 Y chromosome. The coordinates were obtained from Tomaszkiewicz et al. (2016) and updated to the current version of the chimpanzee Y.

Palindrome  Start     End       Amplicon   Approx. arm length (half of palindrome)  Left_arm_end (Approx)
C1          1759451   2053069   Pink       146809                                   1906260
C2          2298081   2984818   Blue       343368.5                                 2641450
C3          3587737   3925944   Red        169103.5                                 3756841
C4          4669973   5444881   Gold       387454                                   5057427
C5          8651546   9099775   Turquoise  224114.5                                 8875661
C6          9099776   9383863   Pink       142043.5                                 9241820
C7          9383864   9832093   Turquoise  224114.5                                 9607979
C8          9832094   10116181  Pink       142043.5                                 9974138
C9          10116182  10564411  Turquoise  224114.5                                 10340297
C10         10564412  10851248  Pink       143418                                   10707830
C11         11060051  11674261  Blue       307105                                   11367156
C12         12651797  12963129  Red        155666                                   12807463
C13         13340000  14084660  Gold       372330                                   13712330
C14         14798116  15024316  Pink       113100                                   14911216
C15         15493746  16056815  Blue       281534.5                                 15775281
C16         16477310  16791733  Pink       157211.5                                 16634522
C17         21591500  21671300  -          39900                                    21631400
C18         23517939  23546819  -          14440                                    23532379
C19         23807052  24577396  -          385172                                   24192224
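The two derived columns in Table S6 appear to follow directly from the start and end coordinates: the approximate arm length is half of the palindrome span, and the approximate left arm end is the start plus that half span. The sketch below reproduces the tabulated values under that reading. Rounding a half-base position up is an assumption inferred from the table values, and `palindrome_arm` is a hypothetical helper name.

```python
from typing import Tuple


def palindrome_arm(start: int, end: int) -> Tuple[float, int]:
    """Return (approximate arm length, approximate left arm end).

    ASSUMPTIONS: the arm length is (end - start) / 2, and the left arm end
    is start plus the half span, rounded up when the span is odd. Both
    rules are inferred from the values in Table S6.
    """
    if end <= start:
        raise ValueError("end must be greater than start")
    span = end - start
    arm_length = span / 2
    left_arm_end = start + (span + 1) // 2  # integer round-half-up
    return arm_length, left_arm_end


# Palindrome C1 from Table S6
c1_arm_length, c1_left_arm_end = palindrome_arm(1759451, 2053069)
```

Integer arithmetic is used for the left arm end so that the result does not depend on floating-point rounding behavior.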


Table S7. The coordinates of palindromes on the hg38 Y chromosome. The coordinates were obtained from Tomaszkiewicz et al. (2016).

Palindrome Start End Left_arm_end (Approx)

P1 23359067 26311550 24822577

P2 23061889 23358813 23208197

P3 21924954 22661453 22208730

P4 18450291 18870104 18640356

P5 17455877 18450126 17951255

P6 16159590 16425757 16269541

P7 15874906 15904894 15883575

P8 13984498 14058230 14019652

Table S8. The coordinates of X-degenerate genes on the hg38 Y chromosome. The locations are based on the NCBI RefSeq annotation.

GENE START END

SRY 2786855 2787699

AMELY 6865918 6874027

DBY 12904831 12920478

EIF1AY 20575725 20593154

NLGN4Y 14523746 14843726

PRKY 7273972 7381547

KDM5D 19705417 19744939

TBL1Y 6910686 7091683

TMSB4Y 13703567 13706024

USP9Y 12701231 12860839

UTY 13248379 13480673

ZFY 2935477 2982506


Table S9. The coordinates of X-degenerate genes on the panTro6 Y chromosome. The locations are based on the NCBI RefSeq annotation.

GENE START END

SRY 26210218 26211127

AMELY 25814614 25822712

DBY (DDX3Y) 20429349 20442635

EIF1AY 17534922 17551649

NLGN4Y 22946482 23238101

PRKY 25143378 25246224

KDM5D 18097144 18135348

TBL1Y 25425481 25499431

USP9Y 20238918 20383654

UTY 20869562 21028971

ZFY 26012708 26038802

Table S10. ENCODE experiment and sample IDs for the samples analyzed.

Experiment   Unfiltered BAM  Target     Tissue
ENCSR000ALB  ENCFF735TGN     H3K27Ac    HUVEC
ENCSR000AKL  ENCFF322MOQ     H3K4me1    HUVEC
ENCSR000ALG  ENCFF261CBZ     Control    HUVEC
ENCSR000EOQ  ENCFF042VZB     DNase-seq  HUVEC
ENCSR112ALD  ENCFF319GEZ     CREB1      HepG2
ENCSR136ZQZ  ENCFF807LQS     H3K27Ac    Testis
ENCSR956VQB  ENCFF077NRU     H3K4me1    Testis
ENCSR215WNN  ENCFF511LQO     Control    Testis
ENCSR729DRB  ENCFF639PHQ     DNase-seq  Testis


Table S11. Annotation of possible novel genes on bonobo Y (post filtering).

Gene ID  Gene annotation  Percentage identity  Query length  Alignment length  e-value    Gene coverage (alignment length/query length)
g80      SUZ12_HUMAN      99.187               799           738               0          92.37
g32      PSA6_MOUSE       95.732               164           164               2.16E-117  100.00
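The gene coverage column in Table S11 is the alignment length divided by the query length, expressed as a percentage. A minimal sketch of that arithmetic follows; the function name `gene_coverage` and the rounding to two decimals are illustrative choices, not taken from the source pipeline.

```python
def gene_coverage(alignment_length: int, query_length: int) -> float:
    """Percentage of the query covered by the alignment, to two decimals."""
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    return round(100 * alignment_length / query_length, 2)


# Values for g80 (SUZ12_HUMAN) and g32 (PSA6_MOUSE) from Table S11
g80_coverage = gene_coverage(738, 799)
g32_coverage = gene_coverage(164, 164)
```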


Supplemental Figures

Figure S1. Reconstructed gene content of great apes. The first six rows show the gene content of great apes and macaque, which was used as input for the model of Iwasaki and Takagi (Iwasaki & Takagi, 2007). The remaining rows were reconstructed by the model. BC - common ancestor of bonobo and chimpanzee; BCH - common ancestor of bonobo, chimpanzee, and human; BCHG - common ancestor of bonobo, chimpanzee, human, and gorilla; GA - common ancestor of great apes. We determined the presence of RPS4Y2 and MXRA5Y in bonobo and orangutan from AUGUSTUS gene predictions and Y chromosome-specific testis transcriptome assembly results (Vegesna et al., 2020). The presence of RPS4Y2 in bonobo was confirmed by gene prediction (100% identity with chimpanzee RPS4Y2) and by assembled transcript sequences (99.6% identity with chimpanzee RPS4Y2). MXRA5Y, which is pseudogenized in human and chimpanzee (Hughes et al., 2012), was missing in orangutan (no gene prediction or transcript found) and pseudogenized in bonobo (the gene prediction was annotated as the X chromosome homolog MXRA5 and lacked the first three exons of MXRA5). In gorilla, we did not find an MXRA5Y transcript in the transcriptome assembly, and a BLAT search with the human MXRA5Y gene (NC_000024.10:11952465-11993293) returned a 12-kb hit, which is around 20% of the gene. We therefore assumed that the gene is lost in gorilla as well.


References

Harris, R. S. (2007). Improved pairwise alignment of genomic DNA. https://etda.libraries.psu.edu/catalog/7971

Hughes, J. F., Skaletsky, H., Brown, L. G., Pyntikova, T., Graves, T., Fulton, R. S., Dugan, S., Ding, Y., Buhay, C. J., Kremitzki, C., Wang, Q., Shen, H., Holder, M., Villasana, D., Nazareth, L. V., Cree, A., Courtney, L., Veizer, J., Kotkiewicz, H., … Page, D. C. (2012). Strict evolutionary conservation followed rapid gene loss on human and rhesus Y chromosomes. Nature, 483(7387), 82–87.

Hughes, J. F., Skaletsky, H., Pyntikova, T., Graves, T. A., van Daalen, S. K. M., Minx, P. J., Fulton, R. S., McGrath, S. D., Locke, D. P., Friedman, C., Trask, B. J., Mardis, E. R., Warren, W. C., Repping, S., Rozen, S., Wilson, R. K., & Page, D. C. (2010). Chimpanzee and human Y chromosomes are remarkably divergent in structure and gene content. Nature, 463(7280), 536–539.

Iwasaki, W., & Takagi, T. (2007). Reconstruction of highly heterogeneous gene-content evolution across the three domains of life. Bioinformatics, 23(13), i230–i239.

Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. In arXiv [q-bio.GN]. arXiv. http://arxiv.org/abs/1303.3997

Robinson, J. T., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G., & Mesirov, J. P. (2011). Integrative genomics viewer. Nature Biotechnology, 29(1), 24–26.

Smit, A. F. A. (2004). RepeatMasker Open-3.0. http://www.repeatmasker.org. https://ci.nii.ac.jp/naid/10029514778/

Stanke, M., & Waack, S. (2003). Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics, 19 Suppl 2, ii215–ii225.

Vegesna, R., Tomaszkiewicz, M., Ryder, O. A., Campos-Sánchez, R., Medvedev, P., DeGiorgio, M., & Makova, K. D. (2020). Ampliconic genes on the great ape Y chromosomes: Rapid evolution of copy number but conservation of expression levels. In Review.


VITA

Rahulsimham Vegesna

Education:

PhD in Bioinformatics and Genomics (2013-2020) The Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA.

Master of Science in Bioinformatics (PSM) (2009-2011) The University of Texas at El Paso, El Paso, TX, USA.

Bachelor of Technology in Bioinformatics (2005-2009) Sathyabama University, Chennai, TN, India.

Work Experience:

Associate Applications System Analyst (2012-2013) Bioinformatics and Computational Biology Department, M.D. Anderson Cancer Center, Houston, Texas, USA, 77230.

Fellowship and Awards:

The Huck Institutes of the Life Sciences Fellowship 2013-2014

Computation, Bioinformatics, and Statistics (CBIOS) Training Program 2017-2018

Selected Publications:

• Vegesna, R., Tomaszkiewicz, M., Medvedev, P., & Makova, K. D. (2019). Dosage regulation, and variation in gene expression and copy number of human Y chromosome ampliconic genes. PLoS genetics, 15(9), e1008369.

• Torres-García, W., Zheng, S., Sivachenko, A., Vegesna, R., Wang, Q., Yao, R., ... & Verhaak, R. G. (2014). PRADA: pipeline for RNA sequencing data analysis. Bioinformatics, 30(15), 2224-2226.

• Yoshihara, K., Wang, Q., Torres-Garcia, W., Zheng, S., Vegesna, R., Kim, H., & Verhaak, R. G. (2015). The landscape and therapeutic relevance of cancer-associated transcript fusions. Oncogene, 34(37), 4845-4854.

• Zheng, S., Fu, J., Vegesna, R., Mao, Y., Heathcock, L. E., Torres-Garcia, W., ... & Brennan, C. W. (2013). A survey of intragenic breakpoints in glioblastoma identifies a distinct subset associated with poor survival. Genes & development, 27(13), 1462-1472.

• Yoshihara, K., Shahmoradgoli, M., Martínez, E., Vegesna, R., Kim, H., Torres-Garcia, W., ... & Carter, S. L. (2013). Inferring tumour purity and stromal and immune cell admixture from expression data. Nature communications, 4(1), 1-11.

• Cancer Genome Atlas Research Network. (2013). Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature, 499(7456), 43.