The Pennsylvania State University
The Graduate School
EVOLUTION OF Y CHROMOSOME AMPLICONIC GENES IN GREAT APES
A Dissertation in
Bioinformatics and Genomics
by
Rahulsimham Vegesna
© 2020 Rahulsimham Vegesna
Submitted in Partial Fulfillment
of the Requirements
for the Degree of
Doctor of Philosophy
May 2020
The dissertation of Rahulsimham Vegesna was reviewed and approved by the following:
Paul Medvedev
Associate Professor of Computer Science & Engineering
Associate Professor of Biochemistry & Molecular Biology
Dissertation Co-Adviser
Co-Chair of Committee

Kateryna D. Makova
Pentz Professor of Biology
Dissertation Co-Adviser
Co-Chair of Committee

Michael DeGiorgio
Associate Professor of Biology and Statistics

Wansheng Liu
Professor of Animal Genomics

George H. Perry
Chair, Intercollege Graduate Degree Program in Bioinformatics and Genomics
Associate Professor of Anthropology and Biology
ABSTRACT
In addition to the sex-determining gene SRY and several other single-copy genes, the human Y chromosome harbors nine multi-copy gene families that are expressed exclusively in testis. In humans, these gene families are important for spermatogenesis, and their loss is observed in patients suffering from infertility. However, only five of the nine ampliconic gene families are found across great apes, while the others are missing or pseudogenized in some species. My research goal is to understand the evolution of the Y ampliconic gene families in humans and in non-human great ape species. The specific objectives I addressed in this dissertation are:
1. To test whether Y ampliconic gene expression levels depend on their copy number and whether there is gene dosage compensation to counteract the ampliconic gene copy number variation observed in humans. For the nine ampliconic gene families found in humans, the copy number and expression levels were estimated in 149 men. Across the Y ampliconic gene families, families with higher copy number have higher expression. Within individual gene families, however, copy number does not influence gene expression; rather, a high tolerance for variation in gene expression was observed in testis of presumably healthy men. We also found that expression of five Y ampliconic gene families is coordinated with that of their non-Y (i.e., X or autosomal) homologs. Indeed, five ampliconic gene families had consistently lower expression levels when compared to their non-Y homologs, suggesting dosage regulation, while the HSFY family had higher expression levels than its X homolog and thus lacked dosage regulation.
2. To test whether the Y ampliconic gene copy number and gene expression levels are conserved across great apes. For the ampliconic gene families found in great apes, the copy number and expression levels were estimated in independent datasets ranging from two to 14 samples per species. Our results indicate high variability in gene family size but conservation in gene expression levels in Y ampliconic gene families. This relationship was similar to what was observed in humans. However, for three gene families, size was positively correlated with gene expression levels across species, suggesting that, given sufficient evolutionary time, copy number influences gene expression on the Y chromosome.
3. To study the dynamics of gene (and gene family) loss and gain in great ape Y chromosomes. Given the assemblies and alignments of great ape Y chromosomes, we determined the gene content on the Y chromosome of bonobo and orangutan. We then reconstructed the evolutionary history of gene content across great apes and observed an increased rate of gene loss in the genus Pan (bonobo and chimpanzee) when compared to other great apes. The human palindromes P6 and P7, which are devoid of known ampliconic genes, are conserved across great apes. A potential reason for their conservation is the presence of possible gene expression regulators, rather than genes, on these palindromes.
The results of this dissertation significantly advance our understanding of Y chromosome evolution in great apes. They provide an overview of variation in gene copy number and expression levels of these highly similar gene families which have been a challenge to study previously.
Table of Contents
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGMENTS
Chapter 1
    Introduction
    References
Chapter 2
    Dosage regulation, and variation in gene expression and copy number at human Y chromosome ampliconic genes
    Abstract
    Introduction
    Results
        AmpliCoNE: Ampliconic Copy Number Estimator
        Y ampliconic gene copy number estimates
        Y ampliconic gene families with low copy number in humans are frequently deleted in non-human great apes
        Y ampliconic gene expression
        More copious gene families have higher gene expression levels
        Within a family, copy number and gene expression are not correlated
        Y haplogroups and ampliconic gene families
        The role of age in ampliconic gene expression
        Ampliconic gene dosage regulation
    Discussion
        Variability in Y ampliconic gene copy number
        Variability in Y ampliconic gene expression
        Dosage regulation of Y ampliconic genes
    Materials and Methods
        AmpliCoNE: Ampliconic Copy Number Estimator
        Simulation-based validation of AmpliCoNE
        Datasets
        Pipeline for human WGS analysis
        Experimental validation with droplet digital PCR (ddPCR)
        Estimating gene expression levels
        Human Y haplogroup determination
        Code availability
    References
Chapter 3
    Ampliconic genes on the great ape Y chromosomes: Rapid evolution of copy number but conservation of expression levels
    Abstract
    Introduction
    Results
        Dynamic evolution of Y ampliconic gene copy number
        Conservation of Y ampliconic gene expression in great apes
        The relationship between copy number and gene expression levels
        Y ampliconic gene copy number variation and phenotypes related to sperm competition
    Discussion
    Materials and Methods
        DNA samples
        ddPCR assays for ampliconic gene copy number estimation in bonobo and orangutan
        Construction of Y-specific great ape phylogenetic trees
        Analysis of conservation in copy number across great ape species
        RNA-Seq datasets
        Transcriptome assembly of Y ampliconic genes in great apes
        Estimating gene expression levels from RNA-Seq datasets
        Testing for conservation in gene expression levels
        Availability of data and materials
    References
Chapter 4
    Dynamic evolution of great ape Y chromosomes
    Introduction
    Results
        Conservation of human and chimpanzee palindrome sequences
        What drives conservation of gene-free palindromes P6 and P7?
        Evolution of gene content in great apes
        Gene conversion between X and Y chromosomes of great apes
    Discussion
        Palindromes
        Genes
        Gene conversion
    Materials and Methods
        Human and chimpanzee palindrome arm sequence conservation
        Palindrome sequence read depth in bonobo, gorilla, and orangutan
        Search for regulatory factors in human palindromes P6 and P7
        Analysis of genes homologous to human Y genes
        Novel genes in bonobo and orangutan assemblies
        Reconstruction of gene content of great apes
        Gene conversion events between the X and Y chromosomes
    References
Chapter 5
    Summary
    Significance and future work
    Major contributions
    References
Appendix A
    Supplemental Tables
    Supplemental Figures
    References
Appendix B
    Supplemental Note 1. CAFE Simulations
    Supplemental Note 2. EVE Simulations
    Supplemental Tables
    Supplemental Figures
    References
Appendix C
    Supplemental Note S1. Evolutionary scenarios for palindromes
    Supplemental Note S2. Augustus predictions
    Supplemental Tables
    Supplemental Figures
    References
LIST OF TABLES
Table 3-1. Differences in Y ampliconic gene copy numbers across species as evaluated with ANOVA and CAFE. To determine which ampliconic gene families vary in their copy number across great ape species, we performed conventional one-way ANOVA (F-statistic and p-value are shown). To identify significant expansions or contractions of gene family size across great apes, we performed CAFE analysis. Significant p-values (Bonferroni-corrected p-value cutoff of 0.05/9 ≈ 0.006; nine gene families) are in bold.
Table 4-1. Gene conversion between X and Y chromosomes of great apes using GENECONV. The first column lists the gene symbols, with the stratum each gene belongs to in brackets. The values in the following column represent the number of gene conversion events with significant p-values (<0.05) after correction for multiple comparisons across all sequence pairs. The values in brackets represent the number of gene conversion events with significant p-values (<0.05) for pairwise comparisons corrected for the length of the alignment.
LIST OF FIGURES
Figure 2-1. Correlation in copy number and expression levels across Y ampliconic gene families. The gene families are clustered based on correlation coefficients. (A) Correlation in copy numbers among 167 individuals. (B) Correlation in gene expression levels among 149 individuals.
Figure 2-2. Relationship between copy number and expression levels for nine Y ampliconic gene families. The copy number (X-axis) and gene expression (Y-axis) values for 149 individuals are presented on a natural log scale. The dots are values for different men, and boxplots are the distribution of values for individual gene families. Both the dots and boxplots are color-coded by their respective gene families. The black line represents the linear function (copy number ~ expression) fitted to the points on the plot. The coefficient of determination (R²) for the linear model is 0.25.
Figure 2-3. The distribution of ampliconic gene copy numbers and expression levels across Y haplogroups. For each plot, the x-axis shows Y haplogroups: E - African (N=22, yellow), I - European (N=24, green), J - Western Asian (N=11, blue), and R - European (N=85, red), and the y-axis shows copy number estimates or gene expression levels. The black dashed line represents the overall mean copy number or expression value for all the samples on each plot. The permutation-based significance of pairwise haplogroup comparisons is shown with stars (*** < 0.001, ** < 0.01, * < 0.05 P-value). The one-way ANOVA test p-values are printed at the top of each plot. A Bonferroni-corrected cutoff for nine tests (0.05/9 ≈ 0.006) is used to identify significance of ANOVA.
Figure 2-4. Possible differences in expression between Y ampliconic genes and their non-Y homologs. (A) Neofunctionalization: Ampliconic gene family, after moving to the Y or divergence from the X, obtained a new beneficial function. The gene family might be under positive selection and its expression might be independent from its non-Y homolog. (B) Subfunctionalization: Ampliconic gene family, after moving to the Y or divergence from the X, acquired a (testis-specific) sub-function. Because subfunctionalization is division of labour, the sum of ampliconic gene expression and non-Y homolog expression should be equal to the ancestral expression. The average expression of an ampliconic gene family could be lower or higher than that of its homologs, depending on the subfunction. (C) Relaxed selection: Ampliconic gene family, due to its multi-copy nature and presence of gene conversion, evolves at a faster rate than its non-Y homolog and is not under selection.
Figure 2-5. Expression differences between Y ampliconic genes and their non-Y homologs. Each plot compares expression levels of a Y ampliconic gene family (the sum of expression of all copies of a gene family, blue) to those of its homologs on the X chromosome (green) or autosomes (yellow). The gene names are shown on the x-axis and normalized expression levels on the y-axis.
Figure 2-6. Individual-level relationship between Y ampliconic genes and their non-Y homologs. Each plot compares expression levels of a Y ampliconic gene family (the sum of expression of all copies of a gene family) to those of its non-Y homologs in each individual (N=149). Each dot represents an individual. The black line represents the linear regression fit to the data, and the respective equation is at the top of each plot. The R-squared value is at the bottom right of each plot.
Figure 3-1. Y ampliconic genes in great apes. A. Venn diagram showing gene content comparison among great ape species. B. Plot of the first two principal components (PCs) of Y ampliconic gene copy numbers across great ape species (the first and second PCs explained 68.7% and 22.8% of the variation, respectively; Fig. S2 shows variation explained by the other components).
Figure 3-2. Variation in copy number of Y ampliconic gene families in great apes. Box plots summarizing the distribution of copy numbers of the six great ape species across nine Y ampliconic gene families. The gene families are separated into individual plots with the gene family name at the top. Within each plot, the X-axis represents six species (bonobo, chimpanzee, human, gorilla, Bornean orangutan, and Sumatran orangutan), and the Y-axis represents copy number. The black dot within each boxplot is the median value per species.
Figure 3-3. Larger Y ampliconic gene families are more variable across great apes. The six scatter plots represent the relationship between median copy number and variance for each of the species, and the species name is present at the top of each plot. The X-axis represents the natural logarithm of median copy number and the Y-axis the natural logarithm of variance in copy number. The Spearman correlations were calculated using the cor.test() function in R and the p-values are in parentheses. The black line represents the linear function fitted to the given data points. The dots are color-coded to represent the nine gene families, with missing dots indicating gene family absence in that species.
Figure 3-4. Results of CAFE analysis identifying Y ampliconic gene families with significant shifts in gene copy number when compared to their ancestors. For each gene family with a significant difference in copy number, the phylogenetic tree representing the estimated copy number at internal nodes is shown. Significant shifts are highlighted in blue (contraction) and red (expansion). The copy numbers at the internal nodes were predicted by CAFE.
Figure 3-5. Summary of gene expression levels across great apes. In the dot plot below, the X-axis represents nine ampliconic gene families and the Y-axis represents their expression levels. The plot represents testis-specific expression of 12 great ape samples. Each dot within a gene family represents expression levels of an individual and the color of the dot denotes the species it belongs to. Missing dots represent gene families that are considered missing or pseudogenized, and their expression levels are excluded from the gene expression analysis (Table S5).
Figure 3-6. Relationship between copy number and gene expression of Y ampliconic gene families in great ape species. The five scatter plots represent the relationship between expression and copy number for each of the five species; the name of the species is present at the top of each plot. In each of the scatter plots, the X-axis represents the natural logarithm of median copy number and the Y-axis represents the natural logarithm of median gene expression. The Spearman correlations were calculated using the cor.test() function in R and the p-values are in parentheses. The black line is the linear function fitted to the given data points. The dots are color-coded to represent the nine gene families, with missing dots corresponding to the gene families that are pseudogenized, deleted, or not expressed in that species.
Figure 3-7. Relationship between copy number and gene expression across species. In each of the scatter plots, the X-axis represents the natural logarithm of median copy number and the Y-axis represents the natural logarithm of median gene expression. The Spearman correlations were calculated using the cor.test() function in R and the p-values are in parentheses. The black line represents the linear function fitted to the given data points. The dots are color-coded to represent the five species. The five scatter plots represent the relationship between expression and copy number for each of the five gene families, with the name of the gene family present at the top of each plot. Only the gene families that are present in all species are shown here.
Figure 4-1. Evolution of sequences homologous to human and chimpanzee palindromes. (A) Heatmaps showing coverage for each palindrome in each species in the multi-species alignment, and box plots representing copy number (natural log) of 1-kb windows which have homology with human or chimpanzee palindromes. (B) The great ape phylogenetic tree with evolution of human (shown in blue) and chimpanzee (purple) palindromic sequences overlaid on it. Palindrome names in bold indicate that their sequences were present in ≥2 copies. Positive (+) and negative (-) signs indicate gain and loss of palindrome sequence (possibly only partial), respectively. Arrows represent gain (↑) or loss (↓) of palindrome copy number. If several equally parsimonious scenarios were possible, we assumed a later date of acquisition of the multi-copy state for a palindrome (Note S1).
Figure 4-2. IGV screenshots of peaks in DNase-seq, H3K4me1 and H3K27ac on Palindrome P6. A. Peaks on both arms of the palindrome are shown within the blue and grey boxes. B. Zoomed-in view of peaks on the left arm of palindrome P6. The coverage track represents the depth of coverage and the peaks track represents the peaks identified by macs2. C. Zoomed-in view of peaks on the right arm of palindrome P6. The coverage track represents the depth of coverage and the peaks track represents the peaks identified by macs2.
Figure 4-3. IGV screenshot of CREB1 peaks on Palindrome P7. Peaks on both arms of the palindrome are shown. The coverage track represents the depth of coverage and the peaks track represents the peaks identified by macs2.
Figure 4-4. Evolution of Y chromosome gene content in great apes. The reconstructed history of gene birth and death for X-degenerate (blue) and ampliconic (red) genes was overlaid on the great ape phylogenetic tree (not drawn to scale), using macaque as an outgroup. The rates of gene birth and death (in events per million years) are shown in parentheses (for complete data see Fig. S3). The list at the root includes the genes that were present in the common ancestor of great apes and macaque. In addition to most of the genes on the human Y, the macaque Y harbors the X-degenerate MXRA5Y gene, which we found to be deleted in orangutan and pseudogenized in bonobo, chimpanzee, gorilla, and human. We currently cannot find a full-length copy of the VCY gene in bonobo. TXLNGY and DDX3Y are also known as CYorf15B and DBY, respectively.
ACKNOWLEDGMENTS
I would like to thank Paul Medvedev, Kateryna Makova, Shashikant Cooduvalli, Michael DeGiorgio, Steve Schaeffer, Mary Poss, Wansheng Liu, and Francesca Chiaromonte for their time and effort during the last six and a half years. I would like to thank past and present members of the Medvedev lab and the Makova lab for their support through the highs and lows of my PhD journey and for being there to celebrate even the small victories. I would also like to thank the Huck administrative staff and bx administrators for their help throughout my stay. I would also like to thank members of GenoMIX and IGSA, who helped me grow as a person. I learned a lot from every one of you, and thank you for making my PhD journey memorable.
I would also like to thank the funding sources that supported me financially over the years, including the PSU-NIH funded CBIOS Predoctoral Training Program, grants from the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R01GM130691 (to K.D.M.), and the National Science Foundation (NSF) awards DBI-1356529 (to P.M.), IIS-1453527, IIS-1421908, and CCF-1439057 (to P.M.). In addition, funds were made available through the Clinical and Translational Sciences Institute, the Institute for Cyber Science, and the Eberly College of Sciences at Penn State, and through the Pennsylvania Department of Health using Tobacco Settlement and CURE Funds. The findings and conclusions do not necessarily reflect the view of the above funding agencies.
Finally, I would like to thank my family and friends for their continuous support and for believing in me at every point of my life.
Chapter 1
Introduction
Hominidae (great apes) is a family of primates that shared a common ancestor 13 million years ago (Glazko & Nei, 2003). The members of this family include four extant genera: chimpanzees (Pan), gorillas (Gorilla), humans (Homo), and orangutans (Pongo). These primates are classified as great apes because of their intelligence, strength, large size, and absence of a tail (“Fowler’s Zoo and Wild Animal Medicine, Volume 8 | ScienceDirect,” n.d.). Geographically, non-human great apes are localized to equatorial Africa (chimpanzees and gorillas) and Southeast Asia (orangutans), and they are under significant threat due to severe hunting, habitat loss, and the spread of infectious diseases (“Fowler’s Zoo and Wild Animal Medicine, Volume 8 | ScienceDirect,” n.d.). The International Union for Conservation of Nature has classified these non-human great apes as endangered (Fruth et al., 2016; Iucn & IUCN, 2016a, 2016b, 2016c, 2017). There are multiple efforts to conserve these great ape species either in captivity or in the wild (“Fowler’s Zoo and Wild Animal Medicine, Volume 8 | ScienceDirect,” n.d.). One of the most important goals of conservation of endangered species is to preserve genetic variation and reproductive fitness to ensure their recovery as self-sustaining, healthy populations (Comizzoli, Songsasen, & Wildt, 2010). The Y chromosome of great apes harbors gene families linked to spermatogenesis (Skaletsky et al., 2003), which could influence the reproductive success of great apes. There is a substantial gap in our understanding of the basic biology of these multi-copy gene families on the Y chromosome, which this dissertation helps to address.
Great ape sex chromosomes consist of an autosome-like X and a hemizygous Y, which originated from a pair of homologous chromosomes approximately 160-190 million years ago (MYA) (Luo, Yuan, Meng, & Ji, 2011; Veyrunes et al., 2008). One of the proto-sex chromosomes acquired the testis-determining factor (SRY), turning it into a proto-Y, followed by a series of inversions that prevented the Y chromosome from recombining with the X (Lahn & Page, 1999; Ross et al., 2005). Lack of recombination resulted in massive gene loss and degradation of the Y (Bellott et al., 2014; Skaletsky et al., 2003). However, the Y chromosome counteracted this process by acquiring large repeat structures (amplicons), which enable gene conversion and non-allelic homologous recombination within the Y (Betrán, Demuth, & Williford, 2012). The ampliconic regions consist of intrachromosomal duplications in the form of inverted and tandem repeats (Skaletsky et al., 2003). Y chromosome assemblies are available for human, chimpanzee, and gorilla, but are missing for bonobo and orangutans (Hughes et al., 2010; Skaletsky et al., 2003; Tomaszkiewicz et al., 2016). Cytogenetic studies showed differences in gene content, gene order, and size of the Y chromosome among great apes (Gläser et al., 1998; Weber, Schempp, & Wiesner, 1986; Yunis & Prakash, 1982).
The repeats in the ampliconic region include tandem repeats and palindromic structures composed of highly similar inverted repeats (arms) around a relatively short unique sequence (spacer). The close proximity of repeats to each other enables gene conversion and non-allelic homologous recombination, which removes deleterious mutations and lowers the diversity of the repeats and the genes within them (Betrán et al., 2012). As a result, the arms within a palindrome are 99.9% identical to each other, which makes the paralogous genes on the arms very similar to each other in sequence (Skaletsky et al., 2003). The genes in the ampliconic region are present as multi-copy gene families (Skaletsky et al., 2003). There are nine protein-coding ampliconic gene families on the human Y: BPY2 (basic protein Y2), CDY (chromodomain Y), DAZ (deleted in azoospermia), HSFY (heat-shock transcription factor Y), PRY (PTP-BL related Y), RBMY (RNA-binding motif Y), TSPY (testis-specific Y), VCY (variable charge), and XKRY (X Kell blood-related Y). In non-human great apes, some of these gene families are missing: HSFY, PRY, and XKRY are missing in chimpanzee and possibly also in bonobo, and the VCY gene family is missing in bonobo, gorilla, and orangutans (Hallast & Jobling, 2017). A byproduct of intraspecific non-allelic homologous recombination within the ampliconic region is variation in the copy number of ampliconic genes within a species (Hughes et al., 2010; Lucotte et al., 2018; Oetjens, Shen, Emery, Zou, & Kidd, 2016; Repping et al., 2006; Schaller et al., 2010; Skov & Schierup, 2017; Tomaszkiewicz et al., 2016; Ye et al., 2018). We lack information about the evolutionary conservation of copy number of ampliconic gene families across great apes because the ampliconic gene copy numbers of the orangutan species are still unknown. For the species for which data are available, the TSPY and RBMY gene families have higher variation in copy number than other gene families among humans, chimpanzees, and gorillas (Hughes et al., 2010; Lucotte et al., 2018; Oetjens et al., 2016; Repping et al., 2006; Schaller et al., 2010; Skov & Schierup, 2017; Tomaszkiewicz et al., 2016; Ye et al., 2018).
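The palindrome structure described above (two inverted-repeat arms around a spacer) can be illustrated with a minimal sketch. The sequences below are made-up toy strings, not Y chromosome data, and the helper names `reverse_complement` and `arm_identity` are hypothetical. The sketch assumes ungapped arms of equal length; real arm comparisons require sequence alignment.

```python
# Toy illustration of a palindrome: left arm + spacer + right arm, where the
# right arm is the reverse complement (inverted repeat) of the left arm.
# All sequences are invented for illustration; names are hypothetical.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def arm_identity(left_arm: str, right_arm: str) -> float:
    """Fraction of identical bases between the left arm and the reverse
    complement of the right arm (ungapped, equal-length arms only)."""
    flipped = reverse_complement(right_arm)
    if len(left_arm) != len(flipped):
        raise ValueError("this ungapped sketch needs arms of equal length")
    matches = sum(a == b for a, b in zip(left_arm, flipped))
    return matches / len(left_arm)


left = "ACGTTGCAAG"                 # toy left arm
spacer = "TTT"                      # toy unique spacer
right = reverse_complement(left)    # perfect inverted repeat of the left arm
palindrome = left + spacer + right

print(palindrome)
print(arm_identity(left, right))
```

In this toy case the arms are perfect inverted repeats; a substitution in one arm lowers the identity, and gene conversion between arms is the process that would restore it.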
Ampliconic genes are expressed predominantly or exclusively in testis and play an important role in spermatogenesis (Skaletsky et al., 2003). Loss or partial deletion of ampliconic gene copies is linked to infertility in humans. Screening for deletions in the Y chromosome of individuals with infertility identified three azoospermia factor regions (AZFa, AZFb, and AZFc), which are active during different phases of spermatogenesis (P. H. Vogt et al., 1996), and complete or partial deletion of these regions is linked to azoospermia and arrest of spermatogenesis (Carvalho, Zhang, & Lupski, 2011; Krausz et al., 2011; Navarro-Costa, Plancha, & Gonçalves, 2010; Repping et al., 2002; Peter H. Vogt, 1996). The exact function of individual gene families is understudied (Navarro-Costa, 2012). With respect to ampliconic gene expression, we have limited information: studies showed that testis-specific expression of Y ampliconic genes was acquired prior to their amplification on the Y (Bellott et al., 2014). The expression levels of Y genes decreased when compared to ancestral expression levels of single proto-sex chromosome alleles in primates (Cortez et al., 2014). A recent study investigating the expression of Y ampliconic genes during male meiosis found that gene families with high variation in copy number also have high expression levels at different stages of sperm development (Lucotte et al., 2018). Apart from this, very little is known about the expression of ampliconic genes and how copy number variation impacts gene expression.
Despite the importance of Y chromosome ampliconic gene families for infertility and reproductive fitness, there is still a substantial gap in our understanding of the basic biology of ampliconic gene families on the Y chromosome. For example, some questions of interest are: what is the effect of variation in ampliconic gene copy number on gene expression levels; what mechanisms has the testis adopted to handle differences in gene copy number between individuals; are ampliconic gene copy number and expression levels conserved across great apes; and do the dynamics of gene loss or gain on great ape Y chromosomes explain the differences in mating patterns across great apes? My research goal is to understand the evolution of the Y ampliconic gene families in humans and in non-human great ape species. In this dissertation, my specific objectives are to explore the variability of ampliconic genes at the DNA and RNA levels using high-throughput next-generation sequencing technologies. In Chapter 2, I test whether Y ampliconic gene expression levels depend on their copy number, and whether there is gene dosage regulation to counteract the ampliconic gene copy number variation observed in humans, using the Genotype-Tissue Expression (GTEx) dataset (Ardlie et al., 2015). In Chapter 3, as a follow-up to Chapter 2, we test whether the Y ampliconic gene copy number and gene expression levels are conserved across great apes. The goal is to test for significant gain or loss of copies and of gene expression levels across great ape species and to test whether the variation observed can explain phenotypes related to sperm competition in great apes. In Chapter 4, we study the dynamics of gene (and gene family) loss and gain in great ape Y chromosomes using Y chromosome assemblies of great apes. Chapter 5 presents conclusions from the dissertation.
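As a rough illustration of the kind of test outlined for Chapter 2 (whether expression depends on copy number within a gene family), the sketch below computes a Spearman rank correlation in plain Python. It is a stand-in under stated assumptions, not the dissertation's code: the figure captions indicate that the reported correlations were computed with R's cor.test(), the function names `ranks`, `pearson`, and `spearman` are hypothetical, and the per-individual values are invented.

```python
# Minimal Spearman rank correlation (Pearson correlation of average ranks).
# Hypothetical helper names; the data at the bottom are invented examples.
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented copy numbers and expression values for five individuals.
copy_number = [30, 35, 28, 40, 33]
expression = [12.1, 9.8, 11.5, 10.2, 13.0]
print(spearman(copy_number, expression))
```

A rank-based statistic is a natural choice here because copy number and expression values are not expected to be normally distributed; a significance test (as cor.test() provides) would be needed on top of the coefficient.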
References
Ardlie, K. G., DeLuca, D. S., Segrè, A. V., Sullivan, T. J., Young, T. R., Gelfand, E. T., … Lockhart. (2015). The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science, 348(6235), 648–660.
Bellott, D. W., Hughes, J. F., Skaletsky, H., Brown, L. G., Pyntikova, T., Cho, T. J., … Page, D. C. (2014). Mammalian Y chromosomes retain widely expressed dosage-sensitive regulators. Nature, 508(7497), 494–499.
Betrán, E., Demuth, J. P., & Williford, A. (2012). Why Chromosome Palindromes? International Journal of Evolutionary Biology, 2012(Figure 2), 1–14.
Carvalho, C. M. B., Zhang, F., & Lupski, J. R. (2011). Structural variation of the human genome: mechanisms, assays, and role in male infertility. Systems Biology in Reproductive Medicine, 57(1-2), 3–16.
Comizzoli, P., Songsasen, N., & Wildt, D. E. (2010). Protecting and extending fertility for females of wild and endangered mammals. Cancer Treatment and Research, 156, 87–100.
Cortez, D., Marin, R., Toledo-Flores, D., Froidevaux, L., Liechti, A., Waters, P. D., … Kaessmann, H. (2014). Origins and functional evolution of Y chromosomes across mammals. Nature, 508(7497), 488–493.
Fowler’s Zoo and Wild Animal Medicine, Volume 8 | ScienceDirect. (n.d.). Retrieved September 4, 2019, from https://www.sciencedirect.com/book/9781455773978/fowlers-zoo-and-wild-animal-medicine-volume-8
Fruth, B., Hickey, J. R., André, C., Furuichi, T., Hart, J., Hart, T., … Others. (2016). Pan paniscus. The IUCN Red List of Threatened Species 2016: e. T15932A102331567.
Gläser, B., Grützner, F., Willmann, U., Stanyon, R., Arnold, N., Taylor, K., … Schempp, W. (1998). Simian Y chromosomes: species-specific rearrangements of DAZ, RBM, and TSPY versus contiguity of PAR and SRY. Mammalian Genome: Official Journal of the International Mammalian Genome Society, 9(3), 226–231.
Glazko, G. V., & Nei, M. (2003). Estimation of divergence times for major lineages of primate species. Molecular Biology and Evolution, 20(3), 424–434.
Hallast, P., & Jobling, M. A. (2017). The Y chromosomes of the great apes. Human Genetics, 136(5), 511–528.
Hughes, J. F., Skaletsky, H., Pyntikova, T., Graves, T. A., van Daalen, S. K. M., Minx, P. J., … Page, D. C. (2010). Chimpanzee and human Y chromosomes are remarkably divergent in structure and gene content. Nature, 463(7280), 536–539.
Iucn, & IUCN. (2016a). Gorilla gorilla: Maisels, F., Bergl, R.A. & Williamson, E.A. IUCN Red List of Threatened Species. https://doi.org/10.2305/iucn.uk.2018-2.rlts.t9404a136250858.en
Iucn, & IUCN. (2016b). Pan troglodytes: Humle, T., Maisels, F., Oates, J.F., Plumptre, A. & Williamson, E.A. IUCN Red List of Threatened Species. https://doi.org/10.2305/iucn.uk.2016-2.rlts.t15933a17964454.en
Iucn, & IUCN. (2016c). Pongo pygmaeus: Ancrenaz, M., Gumal, M., Marshall, A.J., Meijaard, E., Wich, S.A. & Husson, S. IUCN Red List of Threatened Species. https://doi.org/10.2305/iucn.uk.2016-1.rlts.t17975a17966347.en
Iucn, & IUCN. (2017). Pongo abelii: Singleton, I., Wich, S.A., Nowak, M., Usher, G. & Utami-Atmoko, S.S. IUCN Red List of Threatened Species. https://doi.org/10.2305/iucn.uk.2017-3.rlts.t121097935a115575085.en
Krausz, C., Chianese, C., Giachini, C., Guarducci, E., Laface, I., & Forti, G. (2011). The Y chromosome-linked copy number variations and male fertility. Journal of Endocrinological Investigation, 34(5), 376–382.
Lahn, B. T., & Page, D. C. (1999). Four evolutionary strata on the human X chromosome. Science, 286(5441), 964–967.
Lucotte, E. A., Skov, L., Jensen, J. M., Macià, M. C., Munch, K., & Schierup, M. H. (2018). Dynamic Copy Number Evolution of X- and Y-Linked Ampliconic Genes in Human Populations. Genetics, 209(3), 907–920.
Luo, Z.-X., Yuan, C.-X., Meng, Q.-J., & Ji, Q. (2011). A Jurassic eutherian mammal and divergence of marsupials and placentals. Nature, 476(7361), 442–445.
Navarro-Costa, P. (2012). Sex, rebellion and decadence: the scandalous evolutionary history of the human Y chromosome. Biochimica et Biophysica Acta, 1822(12), 1851–1863.
Navarro-Costa, P., Plancha, C. E., & Gonçalves, J. (2010). Genetic dissection of the AZF regions of the human Y chromosome: Thriller or filler for male (In)fertility? Journal of Biomedicine & Biotechnology, 2010. https://doi.org/10.1155/2010/936569
Oetjens, M. T., Shen, F., Emery, S. B., Zou, Z., & Kidd, J. M. (2016). Y-Chromosome Structural Diversity in the Bonobo and Chimpanzee Lineages. Genome Biology and Evolution, 8(7), 2231–2240.
Repping, S., Skaletsky, H., Lange, J., Silber, S., van der Veen, F., Oates, R. D., … Rozen, S. (2002). Recombination between Palindromes P5 and P1 on the Human Y Chromosome Causes Massive Deletions and Spermatogenic Failure. American Journal of Human Genetics, 71(4), 906–922.
Repping, S., van Daalen, S. K. M., Brown, L. G., Korver, C. M., Lange, J., Marszalek, J. D., … Rozen, S. (2006). High mutation rates have driven extensive structural polymorphism among human Y chromosomes.
Nature Genetics, 38(4), 463–467. Ross, M. T., Grafham, D. V., Coffey, A. J., Scherer, S., McLay, K., Muzny, D., … Bentley, D. R. (2005). The DNA sequence of the human X chromosome. Nature, 434(7031), 325–337. Schaller, F., Fernandes, A. M., Hodler, C., Münch, C., Pasantes, J. J., Rietschel, W., & Schempp, W. (2010). Y Chromosomal Variation Tracks the Evolution of Mating
5
Systems in Chimpanzee and Bonobo. PloS One, 5(9), e12482. Skaletsky, H., Kuroda-Kawaguchi, T., Minx, P. J., Cordum, H. S., Hillier, L., Brown, L. G., … Page, D. C. (2003). The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature, 423(6942), 825–837. Skov, L., & Schierup, M. H. (2017). Analysis of 62 hybrid assembled human Y chromosomes exposes rapid structural changes and high rates of gene conversion. PLoS Genetics, 13(8), 1–20. Tomaszkiewicz, M., Rangavittal, S., Cechova, M., Campos Sanchez, R., Fescemyer, H. W., Harris, R., … Makova, K. D. (2016). A time- and cost-effective strategy to sequence mammalian Y Chromosomes: an application to the de novo assembly of gorilla Y. Genome Research, 26(4), 530–540. Veyrunes, F., Waters, P. D., Miethke, P., Rens, W., McMillan, D., Alsop, A. E., … Marshall Graves, J. A. (2008). Bird-like sex chromosomes of platypus imply recent origin of mammal sex chromosomes. Genome Research, 18(6), 965–973. Vogt, P. H. (1996). Human Y Chromosome Function in Male Germ Cell Development. In Advances in Developmental Biology (1992) (pp. 191–257). Vogt, P. H., Edelmann, A., Kirsch, S., Henegariu, O., Hirschmann, P., Kiesewetter, F., … Haidl, G. (1996). Human Y chromosome azoospermia factors (AZF) mapped to different subregions in Yq11. Human Molecular Genetics, 5(7), 933–943. Weber, B., Schempp, W., & Wiesner, H. (1986). An evolutionarily conserved early replicating segment on the sex chromosomes of man and the great apes. Cytogenetic and Genome Research, 43(1-2), 72–78. Ye, D., Zaidi, A., Tomaszkiewicz, M., Anthony, K., Liebowitz, C., DeGiorgio, M., … Makova, K. D. (2018). High levels of copy number variation of ampliconic genes across major human Y haplogroups. Genome Biology and Evolution, (May). https://doi.org/10.1093/gbe/evy086 Yunis, J., & Prakash, O. (1982). The origin of man: a chromosomal pictorial legacy. Science, 215(4539), 1525–1530.
Chapter 2
Dosage regulation, and variation in gene expression and copy number at human Y chromosome ampliconic genes
The material in this chapter was previously published in 2019 as a research article by R. Vegesna, M. Tomaszkiewicz, P. Medvedev, and K.D. Makova, appearing in PLoS Genetics 15(9): e1008369. Supporting information that accompanied this article is provided in Appendix A.
Abstract
The Y chromosome harbors nine multi-copy ampliconic gene families expressed exclusively in testis. The gene copies within each family are >99% identical to each other, which poses a major challenge in evaluating their copy number. Recent studies demonstrated high variation in Y ampliconic gene copy number among humans. However, how this variation affects expression levels in human testis remains understudied. Here, we developed a novel computational tool, Ampliconic Copy Number Estimator (AmpliCoNE), that uses read sequencing depth information to estimate Y ampliconic gene copy number per family. We applied this tool to whole-genome sequencing data of 149 men with matched testis expression data whose samples are part of the Genotype-Tissue Expression (GTEx) project. We found that the Y ampliconic gene families with low copy number in humans were deleted or pseudogenized in non-human great apes, suggesting relaxation of functional constraints. Among the Y ampliconic gene families, higher copy number leads to higher expression. Within the Y ampliconic gene families, copy number does not influence gene expression; rather, a high tolerance for variation in gene expression was observed in testis of presumably healthy men. No differences in gene expression levels were found among major Y haplogroups. Age positively correlated with expression levels of the HSFY and PRY gene families in the African subhaplogroup E1b, but not in the European subhaplogroups R1b and I1. We also found that expression of five out of six Y ampliconic gene families is coordinated with that of their non-Y (i.e. X or autosomal) homologs. Indeed, five ampliconic gene families had consistently lower expression levels when compared to their non-Y homologs, suggesting dosage regulation, while the HSFY family had higher expression levels than its X homolog and thus lacked dosage regulation.
Introduction
The human Y chromosome harbors 10.2 million bases (Mb) of ampliconic regions containing nine protein-coding multi-copy gene families (Skaletsky et al., 2003). These genes are important not only because of their association with male infertility (Repping et al., 2002; Skaletsky et al., 2003) but also because they might hold the key to understanding the evolutionary forces that have shaped the Y chromosome. Ampliconic gene families show a high level of copy number variability (Lucotte et al., 2018; Skov, Danish Pan Genome Consortium, & Schierup, 2017; Ye et al., 2018) and, possibly, a similar variability in gene expression levels. Understanding the relationship between these two kinds of variability is an important step in the study of these genes. Yet, there has been no comprehensive investigation to date that explores expression of these gene families and its connection to copy number at a large, population-level scale.
Studying ampliconic gene families has been a considerable challenge because they exhibit a much higher intra-familial sequence similarity than other gene families. The majority (eight out of nine) of Y ampliconic gene families are located in palindromes, which are structures composed of highly similar inverted repeats (arms) around a relatively short unique sequence (spacer). The arms within a palindrome are 99.9% identical to each other, which results in a high sequence identity among paralogous genes located on the arms (Skaletsky et al., 2003). The ninth family, TSPY, is present as an array of tandem repeats outside of palindromes (Skaletsky et al., 2003); however, its genes still share sequence identity of >99%. It has been hypothesized that the Y chromosome has acquired its ampliconic structure as a way of facilitating gene conversion (Rozen et al., 2003), which can overcome the decay due to a lack of interchromosomal recombination (Betrán, Demuth, & Williford, 2012; B. Charlesworth & Charlesworth, 2000).
Why these ampliconic gene families are preserved on the Y chromosome remains an open question. It has been suggested that this is due to sexual antagonism eventually leading to increased male reproductive fitness (Bellott et al., 2014; Betrán et al., 2012; Rozen et al., 2003). Sexual antagonism is expected to lead to the accumulation of genes and mutations benefiting males on the Y chromosome (D. Charlesworth & Charlesworth, 1980). Consistent with the sexual antagonism hypothesis, all ampliconic genes on the Y are expressed exclusively or predominantly in testis. However, it is also possible that these genes have recently evolved under relaxed functional constraints. The ability to analyze the expression levels of Y ampliconic genes at a large scale can help in exploring their potential functional constraints via comparing their testis expression level to that of their non-Y homologs (when available). For instance, if a Y ampliconic gene family undergoes neofunctionalization, then its resulting expression level is expected to be independent of and potentially higher than that for its non-Y homologs (which we assume retained the ancestral function).
In support of some functional constraints is the observation that the loss or partial deletion of Y ampliconic gene copies is linked to infertility in humans. For example, TSPY copy number was linked to both infertility (Giachini et al., 2009) and sperm count (Giachini et al., 2009; C. Krausz et al., 2011; Csilla Krausz, Giachini, & Forti, 2010). The long arm of the human Y chromosome includes three azoospermia factor regions (AZFa, AZFb, and AZFc), which cover most of the ampliconic gene families and are active during different phases of spermatogenesis (Vogt et al., 1996). Complete or partial deletion of these regions is linked to azoospermia and arrest of spermatogenesis (Carvalho, Zhang, & Lupski, 2011; C. Krausz et al., 2011; Navarro-Costa, Plancha, & Gonçalves, 2010; Repping et al., 2002; Vogt et al., 1996). Presumably, a copy number decrease linked with infertility is accompanied by a reduction in gene expression of the affected Y ampliconic gene families; however, this has yet to be demonstrated.
Recent studies indicated high variation in Y ampliconic gene copy number in healthy men (Lucotte et al., 2018; Skov et al., 2017; Ye et al., 2018). Skov and colleagues (Skov et al., 2017) studied Y ampliconic gene copy number variation in 62 men of Danish descent and identified multiple copy number changes across all nine gene families among unrelated individuals, as well as de novo copy number differences for the TSPY and VCY gene families between a father and a son. Ye and colleagues (Ye et al., 2018) assessed Y ampliconic gene copy number variation in 100 individuals from around the world. They observed that the size of a gene family is correlated with its variation in copy number: larger families, such as TSPY and RBMY, have higher levels of variation; however, the variation appears to be independent of the Y haplogroup. Two men rarely had the same Y ampliconic gene copy number profile and, when they did, this was likely a result of homoplasy. Lucotte and colleagues (Lucotte et al., 2018) used the data from the Simons Genome Diversity Project (Mallick et al., 2016) and observed substantial variation in copy number in six out of nine human Y ampliconic gene families. Teitz and colleagues (Teitz, Pyntikova, Skaletsky, & Page, 2018) assessed copy number of full-length Y chromosome amplicons located in the AZFc region in men sequenced by the 1000 Genomes Project. Their results suggest that selection has preserved the ancestral ampliconic copy number on the Y chromosome in diverse human lineages (Teitz et al., 2018).
These multiple studies of copy number notwithstanding, there has been little investigation of gene expression in Y ampliconic genes. A recent study investigating the expression of Y ampliconic genes during male meiosis found that gene families with high variation in copy number also have high expression levels at different stages of sperm development (Lucotte et al., 2018). Other than the results of this single study, there is a big gap in our understanding of variation in expression of Y ampliconic genes among humans, even though gene expression could be a better predictor of genes’ functions than copy number. Additionally, previous studies have reported that aging affects gene expression (Vinuela et al., 2016; Yang et al., 2015).
Even less is known about how variation in copy number of Y ampliconic genes affects their gene expression. Most parsimoniously, a gain of a complete gene copy should lead to an increase in gene expression levels, unless the extra copy obtains a new function through neofunctionalization, has decreased functional demands due to subfunctionalization or is lost due to pseudogenization. Indeed, this parsimonious hypothesis was supported by the data from the 1000 Genomes Project, where most genes overlapping multiallelic copy number variations (CNVs) display a positive correlation between copy number and gene expression (Handsaker et al., 2015). However, studies across different model organisms have reported that differences in copy number result in either increased, decreased or unchanged expression levels among individuals in a population (Henrichsen, Chaignat, & Reymond, 2009). This more complex relationship can be caused by several scenarios during duplication. For instance, a tandem duplication event may not include regulatory elements, may physically disrupt topologically associated domains (TADs), which prevents the interaction of the gene with its enhancer in 3D space
(Lupiáñez et al., 2015; Spielmann, Lupiáñez, & Mundlos, 2018), or may result in a new copy acting as a negative feedback loop to reduce transcription (Henrichsen et al., 2009). Moreover, a non-tandem duplication may occur to a site that is not transcriptionally active (Henrichsen et al., 2009). Which of these parsimonious or more complex scenarios occurs on the human Y chromosome ampliconic genes has not been explored.
In this study, we explored the above questions by analyzing the largest data set available to date consisting of expression data from testis, along with matched whole-genome sequencing data, from 170 men, as generated by the Genotype Tissue Expression (GTEx) consortium (Carithers et al., 2015). Simultaneously, we developed a novel computational tool, AmpliCoNE, to estimate the copy number of an ampliconic gene family from sequencing data. Such estimation is complicated by the presence of multiple highly similar gene copies in the reference, which makes conventional tools inapplicable (Medvedev, Stanciu, & Brudno, 2009). Custom strategies have been developed and shown to be effective (Cortez et al., 2014; Handsaker et al., 2015; Lucotte et al., 2018; Oetjens, Shen, Emery, Zou, & Kidd, 2016; Skov et al., 2017; Sudmant et al., 2010), but we did not identify any existing software that could be run directly on ampliconic Y chromosome gene families.
Using AmpliCoNE, we explored whether variation in Y ampliconic gene expression levels could be explained by variation in gene copy number, Y haplogroup, and the individual’s age. We correlated the copy numbers of Y ampliconic gene families, as estimated with AmpliCoNE, with their expression levels in testis and studied how this correlation is affected by Y haplogroups. Additionally, we investigated how testis-specific expression of Y ampliconic genes diverged from that of their non-Y homologs during evolution.
Results
AmpliCoNE: Ampliconic Copy Number Estimator
AmpliCoNE is composed of two programs. The first (AmpliCoNE-build) is executed only once to process the reference genome. It takes the locations of all the gene copies in the reference genome, grouped by family, and determines which positions in the genes are informative (i.e. where read depth is an effective predictor of copy number) and which positions in the reference can be used as a control (where copy number variation is infrequent and the read depth has limited noise). The second (AmpliCoNE-count) is then executed separately for every sample. It parses read alignments and measures the GC-corrected read depth at the informative positions. It then aggregates this information at the family level and reports the copy number for each gene family, using the read depth at control positions as a baseline. We provide further details in the Methods.
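The counting step described above can be illustrated with a minimal Python sketch. It is not AmpliCoNE's implementation: the function names (`gc_correct`, `estimate_family_copy_number`), the per-bin GC scaling, the input layout, and the choice to sum per-copy mean depths before dividing by the control baseline are all assumptions made for illustration, and the depth values are invented.

```python
from collections import defaultdict
from statistics import mean


def gc_correct(depths, gc_bins):
    """Rescale depths so that every GC bin has the same mean depth.

    Simplifying assumption: the correction is learned from the same
    positions it is applied to; a real tool would likely use
    genome-wide windows instead.
    """
    by_bin = defaultdict(list)
    for depth, gc in zip(depths, gc_bins):
        by_bin[gc].append(depth)
    overall = mean(depths)
    bin_mean = {gc: mean(values) for gc, values in by_bin.items()}
    return [
        depth * overall / bin_mean[gc] if bin_mean[gc] > 0 else 0.0
        for depth, gc in zip(depths, gc_bins)
    ]


def estimate_family_copy_number(informative_depths_by_copy, control_depths):
    """Estimate a family's copy number from (GC-corrected) read depths.

    informative_depths_by_copy: {reference copy name: depths at its
        informative positions} (hypothetical input layout).
    control_depths: depths at control positions, assumed single-copy.
    """
    baseline = mean(control_depths)
    if baseline <= 0:
        raise ValueError("control depth must be positive")
    # Reads from all copies in the sample are spread over the reference
    # copies, so per-copy mean depths are summed before normalizing.
    total = sum(mean(d) for d in informative_depths_by_copy.values())
    return total / baseline


# Invented depths: two reference copies, each at 1.5x the single-copy
# baseline, suggest about three copies in the sample.
control = [30, 31, 29, 30]
family = {"copy_1": [45, 44, 46], "copy_2": [45, 46, 44]}
print(estimate_family_copy_number(family, control))
```

Normalizing by control positions rather than by a genome-wide average keeps the baseline tied to regions where copy number is expected to be stable, which is the role the text assigns to the control positions.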
To evaluate AmpliCoNE’s accuracy, we ran it on simulated data and whole-genome short-read data from the Genome in a Bottle (GIAB) consortium (Zook et al., 2014). Using the hg38 human genome reference, we simulated three datasets with varying copy numbers of RBMY, TSPY, and VCY gene families and kept the copy numbers for the remaining six gene families constant (i.e. with the copy number found in the reference). AmpliCoNE estimated ampliconic copy numbers correctly 100% of the time in the simulated datasets (Table S1). We then compared gene family copy numbers between different GIAB experimental runs (technical replicates) for the same human sample (Table S2), as well as between a father and a son (which can be treated as biological replicates because copy number differences between generations are expected to be rare (Skov et al., 2017)). AmpliCoNE consistently predicted copy numbers with a difference of less than 0.5 copies per family. We tested AmpliCoNE at different depths of coverage and showed that it can predict similar copy numbers (estimates with difference of less than 0.5) even for datasets with the Y chromosome sequencing depth as low as 6x (Table S3). AmpliCoNE’s runtime is dependent on the number of reads it needs to process. For instance, it took AmpliCoNE 11 minutes to process the GTEx Y-chromosome-specific BAM file (~500 MB in size).
To measure the concordance between AmpliCoNE’s copy number estimates and complementary non-sequencing assays, we used droplet digital PCR (ddPCR). Both AmpliCoNE and ddPCR were applied to estimate Y ampliconic gene copy numbers for four males sequenced by the GIAB consortium (Tables 1 and S4) (Zook et al., 2014). The ddPCR estimates were identical to AmpliCoNE estimates for five out of nine gene families (BPY2, DAZ, HSFY, PRY, and XKRY) in all four samples. The CDY and RBMY family copy numbers differed between the two methods in only one and two individuals, respectively. The VCY and TSPY family copy number estimates differed in three and four individuals, respectively. Compared with ddPCR, AmpliCoNE consistently underestimated the copy number for the VCY gene family. Previous studies have indicated the presence of X-to-Y gene conversion between VCX and VCY (Iwase, Satta, Hirai, Hirai, & Takahata, 2010; Trombetta, Cruciani, Underhill, Sellitto, & Scozzari, 2010). We investigated this case in more detail and discovered that genes from the VCY family harbor only a very short (220-bp) sequence distinguishing them from their VCX paralogs. This sequence has a low sequencing depth even after GC correction, which results in the underestimation of the VCY copy number by AmpliCoNE. TSPY is known to have many highly similar pseudogene copies that may themselves vary in copy number, which can potentially confound both AmpliCoNE and ddPCR estimates. These caveats notwithstanding, AmpliCoNE’s biases in estimating copy numbers for TSPY and VCY are consistent across samples and thus should not affect our results in a systematic way.
Y ampliconic gene copy number estimates
Using AmpliCoNE, we estimated copy numbers of Y chromosome ampliconic genes in 170 presumably healthy men whose genomes were sequenced in their entirety as part of the GTEx project (Carithers et al., 2015). These individuals (Table S5) were selected because they had matched testis expression data. The individuals belonged to major haplogroups B, E, G, I, J, L, O, Q, R, and T (Table 2). The majority of the samples in the dataset had European or African Y haplogroups, with a few Asian haplogroups present. We also used AmpliCoNE to estimate the copy numbers of X-degenerate genes, which are expected to be 1 in healthy samples. Three samples had copy number estimates close to zero for two or more ampliconic gene families, or had less than one copy for several X-degenerate genes, which could suggest an individual with a disease or could result from a technical artifact; these samples were therefore removed from the downstream analysis. As a result, we retained 167 samples.
Gene families with higher median copy number had higher variation when compared to gene families with lower median copy number (R² = 0.91; Fig. S1). RBMY and TSPY were the largest gene families and displayed the highest variation in copy number (5-14 and 20-64 copies for RBMY and TSPY, respectively). HSFY, PRY, VCY, and XKRY were the smallest gene families, which on average had two copies per individual, and displayed low variation in copy number. We observed a positive correlation in copy number among the BPY2, CDY, and DAZ gene families, which could be explained by their co-localization on palindrome P1; duplication or deletion involving P1 can affect the copy numbers of all three gene families (Fig. 1A).
Figure 2-1. Correlation in copy number and expression levels across Y ampliconic gene families. The gene families are clustered based on correlation coefficients. (A) Correlation in copy numbers among 167 individuals. (B) Correlation in gene expression levels among 149 individuals.
Y ampliconic gene families with low copy number in humans are frequently deleted in non-human great apes
We expected gene families with lower median copy number to have a higher probability of being completely deleted due to random rearrangements. Therefore, we aimed to test whether the gene families with lower copy number in humans had a higher chance of being deleted in non-human great ape species. It is known from previous studies that the VCY gene family is missing in bonobo, gorilla, and orangutan, whereas the HSFY, PRY, and XKRY families are missing in bonobo and chimpanzee (Hallast & Jobling, 2017). Consistent with our hypothesis, the HSFY, PRY, VCY, and XKRY gene families had low copy numbers in humans (Fig. S1; Table S6).
Y ampliconic gene expression
To explore the relationship between ampliconic gene copy number and expression levels, we analyzed testis expression data from the same 167 men whose Y ampliconic gene copy number was estimated with AmpliCoNE. After removing outliers (see Materials and Methods), we retained 149 samples and obtained expression levels for each gene family (the sum of expression of all the gene copies within each family) in each of them. We found that, similar to our observation for copy numbers (Fig. S1), families with higher gene expression levels had higher variation in gene expression (R² = 0.99; Fig. S2). The TSPY family had the highest gene expression level and the highest variation in expression across individuals, and XKRY had the lowest (Table S6; Fig. S2). The XKRY gene family could be considered not expressed (its expression levels are zero) in 58 individuals and expressed at very low levels (with DESeq2 normalized read count < 10) in the remaining 91 individuals. The DAZ, HSFY, and RBMY gene families had similar median expression levels and variance among themselves (Table S6; Fig. S2). Within our dataset, we found two sets of ampliconic gene families whose expression levels were positively correlated with each other (Fig. 1B). The first set included BPY2, CDY, HSFY, and PRY, and the second set included DAZ, TSPY, RBMY, and VCY (Fig. 1B). The expression of these sets of gene families could be co-regulated or might have cell-type specificity.
More copious gene families have higher gene expression levels
When we investigated the relationship between expression levels and copy number among all 149 individuals across nine ampliconic gene families, we found that more copious gene families tended to have higher expression levels in comparison to the less copious gene families (Fig. 2). Indeed, the expression levels were positively correlated with estimates of copy numbers (Spearman's rank correlation rho = 0.43; P-value < 2.2 × 10⁻¹⁶). The DAZ, HSFY, and VCY gene families appeared to be outliers in this analysis, as they had gene expression levels similar to the RBMY gene family even though their median copy number estimates were approximately half or less than half of that of the RBMY gene family. The DAZ gene family had similar gene copy number yet higher expression levels when compared to the CDY gene family. The XKRY family consistently had very low expression levels, even though its median copy number per individual was two.
Figure 2-2. Relationship between copy number and expression levels for nine Y ampliconic gene families. The copy number (X-axis) and gene expression (Y-axis) values for 149 individuals are presented on a natural log scale. The dots are values for different men, and boxplots are the distribution of values for individual gene families. Both the dots and boxplots are color-coded by their respective gene families. The black line represents the linear function (copy number ~ expression) fitted to the points on the plot. The coefficient of determination (R²) for the linear model is 0.25.
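Spearman's rank correlation, used above to relate copy number to expression, can be computed as in the following dependency-free Python sketch. This is not the analysis code used in the study; the helper names are illustrative and the paired values are invented. Ties receive average ranks, and rho is the Pearson correlation of the two rank vectors.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Invented copy number / expression pairs, for illustration only.
copy_number = [2, 4, 6, 30]
expression = [1, 3, 2, 10]
print(spearman_rho(copy_number, expression))
```

Because only ranks enter the calculation, the result is unchanged by the log transformation used for plotting.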
Within a family, copy number and gene expression are not correlated
Next, we tested whether copy number, as measured for each individual, is positively correlated with gene expression levels, again measured for each individual, within the same gene family. There was no significant correlation in any of the nine families studied (all P-values were above the Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006; Fig. S3; Table S7). To control for genetic variation on the Y, we next compared copy number estimates to gene expression levels for individuals with the same Y subhaplogroup. We focused on the European R1b and I1a, and the African E1b subhaplogroups because they had more than 10 individuals in our dataset (77, 15 and 22, respectively; Table 2). We still found no significant correlations between copy number and expression levels in any of the nine gene families for individuals from any of these three subhaplogroups (all P-values were above the Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006; Figs. S4-S6; Table S7).
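The Bonferroni cutoffs quoted in this chapter follow from dividing the significance level by the number of tests. The short Python sketch below shows that arithmetic; the helper names are hypothetical and the P-values in the example are placeholders, not results from this study.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test significance cutoff under Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag each named test whose P-value is below alpha / number of tests."""
    cutoff = bonferroni_cutoff(alpha, len(p_values))
    return {name: p < cutoff for name, p in p_values.items()}


# Nine gene families tested once each.
print(round(bonferroni_cutoff(0.05, 9), 4))
# Nine gene families times six pairwise haplogroup comparisons.
print(round(bonferroni_cutoff(0.05, 54), 5))

# Placeholder P-values for three hypothetical tests.
print(significant_after_bonferroni(
    {"family_A": 0.001, "family_B": 0.02, "family_C": 0.2}
))
```

The text rounds 0.05/9 to 0.006; the unrounded value is about 0.0056, which matters only for P-values that fall between the two.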
Y haplogroups and ampliconic gene families
We further asked whether the major Y haplogroup can at least in part explain the variation we observed in copy number and in gene expression levels of Y chromosome ampliconic genes. We focused our analysis on major haplogroups R (European), I (European), E (African), and J (Western Asian) because they were represented by at least 10 samples in our dataset (Table 2). Using one-way ANOVA, we found that the copy numbers of the BPY2 (P = 2.34 × 10⁻³), RBMY (P = 2.97 × 10⁻⁸), and TSPY (P = 1.07 × 10⁻²²) gene families had significant differences among the four major Y haplogroups analyzed (Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006; Table 3). The remaining six gene families did not display significant differences among Y haplogroups (Table 3). When we compared the mean copy number differences between haplogroups in a pairwise fashion using a permutation test (1 million permutations; nine gene families were tested for six pairs: R vs. E, R vs. I, R vs. J, I vs. E, I vs. J, and E vs. J; thus we performed 9 × 6 = 54 tests; Bonferroni-corrected P-value cutoff of 0.05/54 ≈ 0.00093), TSPY differed significantly in copy numbers (Fig. 3) between major European (R and I) vs. African (E) or vs. Western Asian (J) haplogroups (P = 0 for R vs. E; P = 0 for I vs. E; P = 0 for R vs. J; P = 0.3 × 10⁻⁵ for I vs. J; Table S8). RBMY copy numbers differed significantly between the European haplogroup R and the African (E) or Western Asian (J) haplogroups (P = 6.94 × 10⁻⁴ for R vs. E; P = 0 for R vs. J; Table S8). No significant differences between the two major European haplogroups (R and I) were observed (Table S8). In contrast, we found that gene expression levels of all nine Y ampliconic gene families were not significantly different among major Y haplogroups (all P-values were above the P-value cutoff of 0.05/9 ≈ 0.006; one-way ANOVA; Table 3).
We observed a trend suggesting differences in expression values among haplogroups for the BPY2 and DAZ gene families, but these differences were small in scale. Nevertheless, out of the nine gene families, BPY2 (P = 0.056) and DAZ (P = 0.01) had low P-values for the ANOVA analysis (Table 3, Fig. 3) and for the permutation test comparing mean expression levels between haplogroups (P = 7.09 × 10⁻³ for E vs. R for BPY2; P = 1.36 × 10⁻² for E vs. R for DAZ; P cutoff of 0.05/54 ≈ 0.00093; Table S9). When we compared the trend in copy number and gene expression differentiation among haplogroups, we observed that in the TSPY gene family both copy number and gene expression levels were lower for the European haplogroups (I, R) than for the African (E) or Western Asian (J) haplogroups (Fig. 3). This trend was statistically significant for copy number, but not significant for gene expression. Analyzing a larger sample size might lead to finding this trend to be significant also for gene expression.
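The pairwise haplogroup comparisons above rely on a permutation test for a difference in means. The Python sketch below shows one common two-sided form of that test; it is an illustration under stated assumptions, not the code used in the study. The function name, the default of 10,000 permutations (the study reports 1 million), the fixed seed, and the example values are all invented for the sketch.

```python
import random


def permutation_p_value(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Group labels are shuffled n_perm times; the P-value is the fraction
    of shuffles whose absolute mean difference is at least as large as
    the observed one.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def abs_mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = abs_mean_diff(pooled)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Small tolerance guards against floating-point rounding.
        if abs_mean_diff(pooled) >= observed - 1e-12:
            hits += 1
    return hits / n_perm


# Invented copy numbers for two hypothetical haplogroups.
haplogroup_x = [28, 30, 31, 29, 27, 30]
haplogroup_y = [36, 38, 35, 40, 37, 39]
print(permutation_p_value(haplogroup_x, haplogroup_y, n_perm=2000))
```

A returned value of 0 means that no permuted difference reached the observed one, so the estimate is bounded by roughly 1/n_perm rather than being exactly zero.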
Figure 2-3. The distribution of ampliconic gene copy numbers and expression levels across Y haplogroups. For each plot the x-axis shows Y haplogroups: E - African (N=22, yellow), I - European (N=24, green), J - Western Asian (N=11, blue), and R - European (N=85, red), and the y-axis shows copy number estimates or gene expression levels. The black dashed line represents the overall mean copy number or expression values for all the samples on each plot. The permutation-based significance of pairwise haplogroup comparisons is shown with stars (*** P < 0.001, ** P < 0.01, * P < 0.05). The one-way ANOVA test P-values are printed at the top of each plot. The Bonferroni-corrected cutoff for nine tests (0.05/9 ≈ 0.006) is used to identify significance of ANOVA.
The role of age in ampliconic gene expression
To examine the potential role of aging in determining Y ampliconic gene expression, we compared the ages of individuals at the time of sample collection to the ampliconic gene expression levels and found no statistically significant relationship (nine gene families were tested for correlation, which results in a Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006; Fig. S7; Table S10). Next, to perform a similar analysis for individuals with the same subhaplogroup, we limited our analysis to individuals with the European R1b and I1a, and African E1b subhaplogroups (77, 15, and 22 individuals, respectively). For the R1b and I1a subhaplogroups we found no significant relationship between age and expression levels for any of the nine Y ampliconic gene families studied (Figs. S8-S9; Table S10). However, for the African E1b subhaplogroup, gene families HSFY (Spearman correlation = 0.57; P = 0.0061) and PRY (Spearman correlation = 0.61; P = 0.0028) had a positive correlation between expression levels and age, which was significant after Bonferroni correction (Fig. S10; Table S10). A larger dataset of African haplogroups should be studied to validate this relationship.
Ampliconic gene dosage regulation
The presence of homologs outside of the Y for two groups of Y ampliconic gene families allows us to study evolution of their gene expression levels (Bhowmick, Satta, & Takahata, 2007). In particular, the CDY and DAZ genes were copied to the Y chromosome from autosomes (Bhowmick et al., 2007); the HSFY, RBMY, TSPY, VCY, and XKRY gene families have homologs on the X and were likely present on the ancestral autosomes giving rise to the two sex chromosomes (Bhowmick et al., 2007). In the analyses below, we assume that the testis-specific expression of Y ampliconic genes was acquired prior to their amplification on the Y (Bellott et al., 2014) and that their autosomal or X-chromosomal homologs have maintained ancestral expression levels, i.e. they possess expression levels of Y ampliconic genes prior to their Y linkage (L. Gu & Walters, 2017). The latter assumption is based on the overall slower rates of evolution of X-chromosomal and autosomal genes as compared to their Y-chromosomal homologs.
We envision three possible scenarios for gene expression evolution of Y ampliconic gene families that have non-Y homologs (Fig. 4). First, because of sexual antagonism, a gene on the Y could obtain beneficial mutations and diverge in function from its non-Y homolog to acquire new functions in testis (i.e. neo-functionalization). The expression of such a
gene family would be independent from, and potentially higher than that for, its non-Y homologs (scenario A). Second, a gene family on the Y could retain function of the non-Y homolog but acquire testis-specific expression (i.e. sub-functionalization). In this case, either the non-Y copy represents the ancestral expression levels and the Y copies are expected to maintain low expression levels, or the sum of expression from the Y and non-Y copies is regulated to be at levels similar to those of the non-Y copy in the ancestor (scenario B). In this case, the expression of both Y and non-Y homologs might be downregulated. Third, genes on the Y might be under relaxed selective constraints and thus have low expression levels (scenario C) (Lynch & Conery, 2000). Below we test these three scenarios by comparing expression levels of both Y and non-Y ampliconic gene homologs in testis tissue.
In addition to the analysis of such overall differences in the expression level (Fig. 4), we can also examine the relationship between the Y ampliconic genes’ and their non-Y homologs’ gene expression across individuals, which should further assist in determining a particular evolutionary scenario (Fig. S11). If the expression levels of Y ampliconic genes are higher than those of their non-Y homologs, and across individuals the expression levels of these two groups of genes are positively correlated, then this pattern is consistent with neo-functionalization of the Y ampliconic genes.
Figure 2-4. Possible differences in expression between Y ampliconic genes and their non-Y homologs. (A) Neofunctionalization: Ampliconic gene family, after moving to the Y or divergence from the X, obtained a new beneficial function. The gene family might be under positive selection and its expression might be independent from its non-Y homolog. (B) Subfunctionalization: Ampliconic gene family, after moving to the Y or divergence from the X, acquired a (testis-specific) sub-function. Because subfunctionalization is division of labour, the sum of ampliconic gene expression and non-Y homolog expression should be equal to the ancestral expression. The average expression of the ampliconic gene family could be lower or higher than that of its homologs and depends on the subfunction. (C) Relaxed selection: Ampliconic gene family, due to its multi-copy nature and presence of gene conversion, evolves at a faster rate than its non-Y homolog and is not under selection.
This is because higher expression levels of ampliconic genes than those at the ancestral state suggest independent expression of Y ampliconic genes from their non-Y homologs, and a positive correlation between Y ampliconic genes and their non-Y homologs suggests coregulation, e.g. they might share similar transcription factors (Yu, Luscombe, Qian, & Gerstein, 2003). A combination of these two patterns suggests an acquisition of a new function (scenario A) (Fig. S11A). If the expression levels of Y ampliconic genes are higher than those of their non-Y homologs, and across individuals the expression levels of these two groups of genes are negatively correlated, then the data are compatible with neo- or sub-functionalization (scenario A or B). Indeed, the observed negative correlation could be explained by neo-functionalization, where ampliconic genes acquired a new function and inhibit the expression of the non-Y homologs. Alternatively, the negative correlation could be explained by sub-functionalization, where ampliconic genes acquired new transcription factors which limit their expression to a few cell types, and the negative correlation is due to the differences in the abundance of cell types in which ampliconic genes are expressed (Fig. S11B). If the expression levels of Y ampliconic genes are lower than those of their non-Y homologs, and across individuals the expression
levels of these two groups of genes are positively correlated, then this pattern is consistent with any of the three scenarios A-C. This is because the lower expression levels of Y ampliconic genes could be due to downregulation of gene expression by the Y chromosome to accommodate the multi-copy state of ampliconic genes (Lan & Pritchard, 2016), evolution of which could still be compatible with any of the three scenarios A-C (Fig. S11C). If the expression levels of the Y ampliconic genes are lower than those of their non-Y homologs, and across individuals the expression levels of these two groups of genes are negatively correlated, then the data are compatible with scenario A or B. This is because negative correlation eliminates the scenario of relaxed selection, i.e. scenario C (Fig. S11D). Finally, if we observe no correlation in expression levels between Y ampliconic genes and their non-Y homologs, then we can conclude that their expression is independent from each other, which could be a result of neo-functionalization, sub-functionalization, or random drift in expression levels under relaxed selection.
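The decision rules laid out in the preceding paragraphs can be summarized as a small lookup table. The sketch below is only a restatement of those rules in Python; the function name and the string labels ("higher", "lower", "positive", "negative", "none") are hypothetical conventions chosen for illustration.

```python
def compatible_scenarios(y_vs_homolog, correlation):
    """Return the set of scenarios (A, B, C) compatible with an observed pattern.

    y_vs_homolog: "higher" or "lower" (Y family expression relative to non-Y homologs).
    correlation:  "positive", "negative", or "none" (across individuals).
    A = neo-functionalization, B = sub-functionalization, C = relaxed selection.
    """
    if correlation == "none":
        # Independent expression cannot distinguish among the three scenarios.
        return {"A", "B", "C"}
    table = {
        ("higher", "positive"): {"A"},
        ("higher", "negative"): {"A", "B"},
        ("lower", "positive"): {"A", "B", "C"},
        ("lower", "negative"): {"A", "B"},  # negative correlation excludes C
    }
    return table[(y_vs_homolog, correlation)]

# Example: lower Y expression with a negative correlation across individuals.
print(sorted(compatible_scenarios("lower", "negative")))
```

Expressing the rules this way makes it explicit that only one of the four observable patterns (higher expression with positive correlation) points to a single scenario.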
To test these scenarios, we first compared testis expression levels between Y ampliconic gene families CDY and DAZ, which were copied to the Y from autosomes, and their autosomal homologs (Fig. 5). The CDY autosomal homologs CDYL and CDYL2 are ubiquitously expressed, and the DAZ autosomal homolog DAZL has testis-specific expression (Ardlie et al., 2015; Bhowmick et al., 2007; Dorus, Gilbert, Forster, Barndt, & Lahn, 2003; Vangompel & Xu, 2011). The expression levels of CDY (the sum of expression levels for the whole gene family) were 89% lower than those for their autosomal homologs (the sum of expression of CDYL and CDYL2), and for DAZ they were 63% lower than those for their autosomal homolog DAZL (Fig. 5). Next, we tested whether the expression levels for Y ampliconic genes and their autosomal homologs are regulated at the level of each individual. For each gene family, we examined a potential correlation in gene expression levels between the Y ampliconic genes and their non-Y homologs. We observed a significant negative correlation between CDY and CDYL+CDYL2 expression levels (Spearman correlation=-0.31; P=2×10⁻⁴), which indicates that, across individuals, whenever the CDY expression level increases, the CDYL+CDYL2 expression levels decrease (Fig. 6). In the case of DAZ, a positive correlation in expression levels (Spearman correlation=0.57; P=0) was observed between DAZ and its autosomal homolog DAZL (Fig. 6). Lower expression of CDY and DAZ than their non-Y homologs could be a result of downregulation of gene expression by the Y chromosome to maintain the multi-copy state; however, the negative correlation in CDY vs. CDYL+CDYL2 expression levels indicates
the presence of either neo- or subfunctionalization. DAZ could have undergone any of the three scenarios, which are difficult to differentiate based on the available data.
We next examined how testis-specific gene expression of the HSFY, RBMY, TSPY, VCY, and XKRY gene families diverged from that of their X homologs. Most of the X homologs of ampliconic genes (except for those of VCY and XKRY) are expressed in multiple tissues along with testis. The XKRX gene, an X homolog of the XKRY gene family, is not expressed in testis and we omitted this gene family from our analysis (Table S11). Three Y gene families studied (RBMY, TSPY, and VCY) on average had lower expression levels in comparison to their X homologs (66%, 75%, and 71% lower, respectively; Fig. 5). HSFY was the only gene family that on average had higher expression in comparison to its homologs on the X (35% higher than X homologs). This could imply that HSFY might have acquired a new function which is selected for in testis (scenario A). At the level of the studied individuals, all four studied gene families exhibited positive correlation in gene expression levels between their Y ampliconic and X homolog genes, suggesting a potential co-regulation (Fig. 6). This correlation was particularly strong for the HSFY and VCY gene families (Spearman correlation of 0.69 and 0.84, respectively). The observed higher expression of HSFY than of its X homologs, as well as positive correlation in gene expression levels between these two groups of genes, is a strong indicator of neofunctionalization. In the case of RBMY, TSPY, and VCY, it is challenging to differentiate among the three scenarios we propose based on the available data.
Figure 2-5. Expression differences between Y ampliconic genes and their non-Y homologs. Each plot compares expression levels of Y ampliconic gene family (the sum of expression of all copies of a gene family, blue) to their homologs on the X chromosome (green) or autosomes (yellow). The gene names are shown on the x-axis and normalized expression levels on the y-axis.
Figure 2-6. Individual-level relationship between Y ampliconic genes and their non-Y homologs. Each plot compares expression levels of Y ampliconic gene family (the sum of expression of all copies of a gene family) to their non-Y homologs in each individual (N=149). Each dot represents an individual. The black line represents the linear regression fit to the data and the respective equation is at the top of each plot. The R-squared value is at the bottom right of each plot.
Discussion
Ampliconic genes constitute the majority (80%) of protein-coding genes present on the human Y chromosome and play an important role in spermatogenesis (Skaletsky et al., 2003). Yet, very little is known about the significance of Y ampliconic gene copy number
variation in determining their expression levels in humans. Here we analyzed both copy number and testis-specific expression of ampliconic gene families in 149 presumably healthy men. Our goal was to understand the relationship between copy number variation and expression levels while accounting for Y chromosome haplogroups.
Variability in Y ampliconic gene copy number
Our results indicate that smaller Y ampliconic gene families maintain lower variation in copy number and, as the size of gene families increases, variation in copy number also increases, in agreement with previous studies (Lucotte et al., 2018; Skov et al., 2017; Ye et al., 2018). The parsimonious explanation for this observation is that a greater number of gene copies leads to loss or gain of gene copies because of a higher probability of rearrangements via replication slippage and/or non-allelic homologous recombination (NAHR) (Jobling, 2008; Lambert et al., 1999; PJ Hastings James R Lupski, 2010). On the human Y, the larger gene families are either spread across multiple palindromes (e.g., RBMY) or are arranged as a tandem array (TSPY), and such arrangements can result in multiple scenarios of NAHR which will lead to gain or loss of gene copies. BPY2 has two functional copies on palindrome P1 and one copy outside of palindromes, and such an arrangement can also result in NAHR.
We found that the large TSPY and RBMY gene families have not only a high level of variation in copy number, but also a significantly different number of gene copies among the major Y haplogroups analyzed. An earlier study also found significant differences in copy number for these two gene families among human Y haplogroups across the world and suggested that this observation cannot be explained by selection (Ye et al., 2018). However, an explanation based on selection might warrant further investigation. Indeed, a recent molecular analysis of infertile men indicated a positive correlation between the number of RBMY copies and sperm count and motility (Yan et al., 2017). Moreover, RBMY is a male-specific oncogene (Tsuei et al., 2011). Therefore it will be of interest to investigate whether variation in RBMY copy number across Y haplogroups influences these two disease-related phenotypes and might be subject to natural selection. Similarly, TSPY is a candidate proto-oncogene which can regulate its own expression via a positive feedback loop in gonadoblastoma and a variety of somatic cancers (Kido & Lau, 2014). Thus, additional studies should be performed to test whether variation in TSPY copy number across haplogroups is associated with differential predisposition to gonadoblastoma.
The smaller Y ampliconic gene families (HSFY, PRY, VCY, and XKRY) have lower variation in copy number compared with larger families. These gene families, for which the average family size is only two copies, are each present on an individual palindrome (the two copies are present as inverted repeats on opposite palindrome arms). Recombination between inverted repeats is expected to result in an inversion keeping copy number constant (W. Gu, Zhang, & Lupski, 2008). In addition, the presence of only two copies increases the chances of a complete gene family elimination due to Muller’s ratchet or of rearrangements which involve the whole palindrome. Consistent with this prediction, we find these gene families to be deleted or pseudogenized in several great ape species (Hallast & Jobling, 2017).
Thus, the copy number of ampliconic genes is an important factor in determining the survival of a gene family on the Y chromosome. Too few copies can lead to a complete loss of a gene family (see the preceding paragraph), whereas too many copies can lead to frequent NAHR which can rapidly increase or decrease copy number (Connallon & Clark, 2010). Consistent with this expectation, it was suggested that the human Y chromosome evolves under selection to maintain an optimal copy number for its amplicons in diverse human lineages (Teitz et al., 2018). Most likely both random genetic drift and natural selection contribute to determining the Y chromosome ampliconic gene copy number. Drift leads to smaller-scale changes in copy numbers, whereas selection might act at removing extreme copy numbers because too few copies might lead to infertility and too many copies might lead to genetic instability and thus both are selected against. Variation in Y ampliconic gene copy number in subfertile and infertile males should be investigated in future studies and should shed additional light on the balance between these two evolutionary processes.
Note that in the present study we only examined complete gene copy gains or losses, but insertions and deletions inside a gene can also affect gene expression and functionality, and might be linked to infertility (Poli, Iriarte, Iudica, Zanier, & Coco, 2015). The effects of such smaller CNVs are more robustly evaluated from long-read data and we leave this exploration to future work.
Variability in Y ampliconic gene expression
Here we studied the expression levels of the Y ampliconic gene families in testis tissue of presumably healthy individuals. The vast majority of cells in testis are germline cells in the
seminiferous ducts, where spermatogenesis takes place. We primarily captured Y chromosome gene expression in spermatogonia prior to meiosis and throughout different spermatogenesis stages after meiosis (Larson, Kopania, & Good, 2018; Sin, Ichijima, Koh, Namiki, & Namekawa, 2012); this is because Y transcription is silenced at other stages of spermatogenesis due to meiotic sex chromosome inactivation (Handel, 2004; Larson et al., 2018) and postmeiotic sex chromosome repression (Larson et al., 2018; Sin et al., 2012). As a tissue, testis is a mixture of germline cells at different stages of development, Sertoli cells, myoid peritubular cells, and interstitial Leydig cells. Thus, the expression values generated using testis tissue as a source represent cumulative gene expression of germline cells at different stages of spermatogenesis with a mixture of somatic cells. This potential limitation notwithstanding, our results indicate substantial variation in expression levels for Y ampliconic genes in testis among men and suggest that different levels of Y ampliconic genes’ expression are tolerated by presumably healthy individuals.
When we compared copy number of ampliconic genes to their gene expression values, we found that across gene families the gene families with higher median copy number had higher expression levels. This is consistent with an observation made by Lucotte and colleagues (Lucotte et al., 2018) who reported on the expression of Y ampliconic genes at different stages of spermatogenesis with respect to variation in their copy number. Overall, the Y chromosome has higher copy number of genes for those gene families whose median expression levels are higher in testis, however it is important to note that this relationship might be different at individual cell types in testis and should be studied further.
When we examined the relationship between copy number and expression within a gene family, our analysis revealed that expression of Y ampliconic gene families is independent of their copy number. Moreover, no significant differences in Y ampliconic gene expression levels were observed among Y haplogroups, even though we found significant differences among Y haplogroups in copy number for some gene families (BPY2, TSPY, and RBMY). This suggests that testis tissue might have evolved the ability to tolerate different Y ampliconic gene copy numbers, and also variable Y ampliconic gene expression levels.
Approximately 77% of all protein-coding genes in the human genome are expressed in testis (Djureinovic et al., 2014), and some of these genes could regulate expression of the
Y ampliconic genes. Understanding the 3D organization and chromatin structure on the Y is expected to aid in identifying the genomic regions and genes that ampliconic genes interact with and are regulated by in the genome. Future studies analyzing expression data at different stages of spermatogenesis in individuals with different Y ampliconic gene copy numbers will assist in deciphering the role of copy number variation in determining gene expression in more detail. Additionally, our findings should be confirmed by studies of gene expression at the protein level.
A man’s advanced age has significant negative impact on reproduction (Harris, Fronczak, Roth, & Meacham, 2011). Semen parameters such as daily sperm production, total sperm count, and sperm viability are negatively correlated with age (Gunes, Hekim, Arslan, & Asci, 2016). However, within our dataset, we observed mixed results regarding age effects on Y ampliconic gene expression: age did not influence variation in gene expression of these genes in individuals with European Y subhaplogroups I1a and R1b, however HSFY and PRY expression had a positive correlation with age in individuals with an African subhaplogroup E1b. These findings should be validated with a larger data set to examine the role of Y ampliconic genes in changes in spermatogenesis with age.
Dosage regulation of Y ampliconic genes
The Y chromosome degradation, which is common across eutherian mammals, has resulted in the loss of the majority of genes originally present on the proto-sex autosomal pair (Vicoso & Bachtrog, 2009). To balance the loss of genes on the Y in males, the mammalian X chromosome adapted its expression levels by inactivating one of its copies and increasing the expression of the other copy in females (Nguyen & Disteche, 2005; Straub & Becker, 2007; Vicoso & Bachtrog, 2009). We wondered whether a similar process evolved at Y ampliconic genes that have non-Y homologs, namely whether the expression of Y ampliconic genes and their non-Y homologs has been co-regulated. Alternatively, Y ampliconic genes might have evolved new functions, and thus potentially high expression levels, independent of their non-Y homologs. Yet another alternative would be overall low expression levels because of the relaxation of functional constraints on the Y ampliconic genes. The precise functions of Y ampliconic genes have been under-characterized (Table S12) due to the repeated nature of the Y chromosome and scarcity of testable orthologs in model organisms. While Y ampliconic genes have
testis-specific expression likely as a result of sexual antagonism, the majority of non-Y homologs of Y ampliconic genes have ubiquitous expression.
Recently, a multi-step model for preservation of tandem duplicate genes was presented. According to this model, the expression of gene duplicates is downregulated immediately after the duplication event, followed by dosage sharing which could lead to functional adaptations such as sub- or neofunctionalization (Lan & Pritchard, 2016). Knowing that non-Y homologs of Y ampliconic genes are expressed in testis (except for XKRX), we compared the expression levels of closely related homologs of ampliconic genes on both autosomes and the X chromosome to the sum of expression levels for all the copies of a Y gene family. We demonstrated that, with the exception of the HSFY family, Y ampliconic gene families have consistently lower expression levels when compared to their non-Y homologs, thus not elevating the overall expression level of the family. We term this phenomenon dosage regulation of Y ampliconic genes. Lower expression of Y ampliconic gene families could be an adaptation of the Y to maintain the multi-copy state of ampliconic gene families. By lowering the expression of the whole gene family, the Y can buffer sudden loss or gain of gene copies. In addition to dosage regulation, the gene family should be expressed at optimal levels to maintain its functionality during spermatogenesis. Lower optimal expression of Y ampliconic gene families compared to their non-Y homologs could be a result of subfunctionalization (e.g., testis specificity in expression), which benefits germline cell development. Alternatively, such low expression could be a result of relaxed selection, and, in agreement with this possibility, Y ampliconic genes show a higher ratio of nonsynonymous to synonymous substitution rates than single-copy X-degenerate genes on the Y (Betrán et al., 2012). Finally, a gene family could be under positive selection or undergoing neofunctionalization even in its low-level expression state.
The expression of ampliconic gene families is important for spermatogenesis because of an association between gene deletions and infertility, but relaxed selection can facilitate rapid differentiation of ampliconic gene function.
We found that expression levels of the CDY ampliconic genes and those of their autosomal homologs are negatively correlated among individual men. This suggests that the CDY gene family might not be expressed at the same time during spermatogenesis as its autosomal homologs or that there is a coordinated downregulation of CDY expression with a rise in CDYL and CDYL2 expression (and vice versa). In humans, the CDYL and CDYL2
autosomal genes produce the ubiquitously expressed long transcripts, but lost the testis-specific short transcript which is now produced by CDY (Dorus et al., 2003). The combined tissue expression patterns of CDY, CDYL, and CDYL2 in human recapitulate the expression patterns of CDYL and CDYL2 in mouse or rabbit, which do not have CDY on their Y chromosome (Dorus et al., 2003).
In contrast with CDY, we found that expression levels of DAZ, HSFY, and VCY gene families are strongly positively correlated with their non-Y homolog expression across individuals, which suggests a co-regulation in gene expression levels of these ampliconic gene families and their homologs (the RBMY and TSPY families also show positive correlation, however it is not strong). When we examine the linear relationship between ampliconic gene families and their homologs among individual men, the Y ampliconic gene expression increases at a slower pace when compared to the expression of their non-Y homologs, except for HSFY where the expression increases at a similar rate for both Y and non-Y homologs (Fig. 6).
The VCY gene family is the most commonly lost gene family among great apes; however, in our dataset the expression of this gene family is higher than for most other gene families on the Y and is higher than is predicted from its copy number (Fig. 2). The homologs of VCY on the X chromosome (VCX, VCX2, VCX3A, and VCX3B) are expressed in testis (Lahn & Page, 2000; Uhlén et al., 2015), and we show that they are expressed at higher levels than the VCY family itself. In addition, there is high sequence identity (>95%) between the VCX and VCY gene families, which could imply that both VCX and VCY have been under selection to maintain function of the gene family; however, to balance the expression of the multi-copy VCX family, VCY might have lowered its expression. The role of both VCX and VCY in ribosome assembly in spermatogenesis has been suggested (Zou et al., 2003). The loss of VCY in great ape species might have been compensated by functionally similar VCX family expression in testis. The expression levels of the VCX family across great apes must be studied to understand its role in the loss of VCY.
A recent study found multiple distinct clusters of full-length Y ampliconic gene transcripts, likely originating from different copies of the same family (Sahlin, Tomaszkiewicz, Makova, & Medvedev, 2018). Therefore, the presence of multiple full-length transcripts (Sahlin et al., 2018) and low expression levels for Y ampliconic gene families (the present study)
suggest that individual gene copies within a family are downregulated to accommodate the expression of the whole gene family on the Y chromosome and outside of it (on autosomes and on the X). This hypothesis needs to be examined in future studies in which expression levels of individual gene copies will be evaluated with long-read sequencing technology. It will also be important to decipher the isoforms and their expression levels for Y ampliconic genes and their non-Y homologs to understand whether Y ampliconic genes and their homologs express the same isoforms, or whether Y ampliconic genes express their own, unique, testis-specific isoforms.
It is essential to note that, in addition to evolution of expression levels of the whole gene family including its non-Y homologs, the Y ampliconic genes can diverge to acquire additional male-specific functionality because they are present on the Y, which is susceptible to accumulating genetic differences dictated by sexual antagonism. In other words, Y ampliconic genes could have gained secondary functions independent of their functions on the proto-sex chromosomes. This scenario might be exemplified by the case of the HSFY family, whose expression levels have increased in comparison to its X-chromosome homologs. This pattern suggests that this gene family underwent neofunctionalization. The exact function of HSFY is unknown, but its role in transcription regulation has been suggested because it harbors a DNA-binding domain (Shinka et al., 2004). In fact, it was shown that HSFY and HSFX share only this DNA-binding domain but not the rest of their sequences and thus indeed might have diverged in their functions (Shinka et al., 2004). Moreover, HSFY has stage-specific expression during spermatogenesis, suggesting that it acquired a function different from that of the heat shock proteins it is homologous to (Shinka et al., 2004). The loss of HSFY was linked to infertility (Kichine et al., 2012; Kinoshita et al., 2006; Shinka et al., 2004). In another study, underexpression of HSFY was linked to arrest of maturation of nascent germ cells to motile sperm (Stahl, Mielnik, Barbieri, Schlegel, & Paduch, 2012). According to our study, the expression of the HSFY gene family was positively correlated with age in the African E1b Y haplogroup; however, such a relationship was not found in the R1b haplogroup. Further studies addressing transcription regulation by the HSFY family in individuals of varying age across different Y haplogroups are required to understand the HSFY functionality in more detail.
We assume that non-Y homologs have retained the ancestral expression state because of the overall fast evolutionary rate on the Y chromosome (Makova & Li, 2002). However, the X chromosome and autosomes have also been evolving, albeit slower than the Y. Evolutionary changes acquired by non-Y homologs since they diverged from the Y homologs have not been addressed in this study due to the lack of ancestral expression data. To address this, future studies should identify species which have orthologs of human ampliconic genes in a single-copy state on their Y chromosome. In the case of CDY and DAZ gene families, future studies should identify species in which these genes’ orthologs are present in a single-copy state on their autosomes and absent from the sex chromosomes. Once such species are identified, their testis-specific expression data for these genes could be used as the ancestral expression state.
Materials and Methods
AmpliCoNE: Ampliconic Copy Number Estimator
To estimate copy number in highly-similar multi-copy gene families, several strategies have been proposed. One can align each read to all possible locations in the reference genome (Alkan et al., 2009), identify sites in the reference genome that uniquely distinguish and tag paralogs of interest (Handsaker et al., 2015; Oetjens et al., 2016; Sudmant et al., 2010; Teitz et al., 2018), use simulated reads for mock genomes with human gene cDNAs at different gene copy counts to obtain a theoretical function of the coverage distribution with respect to copy number (Cortez et al., 2014), or customize the reference to keep a single copy of each gene family (Lucotte et al., 2018; Skov et al., 2017). While these strategies were effective in their respective papers, we could not find software that could work on human Y ampliconic genes. We therefore combine the ideas from these strategies into AmpliCoNE, a tool for estimating copy number in highly-similar multi-copy gene families. The Results section contains an overview of AmpliCoNE, but we provide more details here.
Determining control and informative positions
To calibrate a baseline for read depth at copy number of one, AmpliCoNE uses single copy regions unique to the Y (AmpliCoNE also provides an option to use X-degenerate regions as a control, but we do not describe it in the manuscript). These regions are identified by AmpliCoNE-build as positions in the Y chromosome such that the k-mer starting at the position does not match any other location in the reference. We used a k-mer of size 101, since this is the length of the shortest reads that we use from GTEx, and we allowed up to two edits in matching to other locations. AmpliCoNE-build computes these positions by using the GEM-mappability tool (v1.315) (Derrien et al., 2012) and extracting those locations with a mappability of 1.
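The idea behind control positions can be illustrated with a simplified sketch. The code below is not AmpliCoNE-build and does not call GEM-mappability; it is a toy stand-in that assumes exact k-mer matching on the forward strand only, whereas the procedure described above allows up to two edits. The function name `unique_kmer_starts` and the toy sequences are hypothetical.

```python
from collections import Counter

def unique_kmer_starts(reference, target, k):
    """Return 0-based start positions in `target` whose k-mer occurs exactly
    once across all sequences in `reference` (exact, forward-strand matches).

    reference: dict mapping sequence name -> sequence string.
    target:    name of the sequence for which positions are reported.
    """
    counts = Counter()
    for seq in reference.values():
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    seq = reference[target]
    return [i for i in range(len(seq) - k + 1) if counts[seq[i:i + k]] == 1]

# Toy reference (hypothetical); k=4 is used instead of 101 for readability.
toy_reference = {"chrY": "ACGTACGTTTGA", "chr1": "GGACGTAA"}
control_starts = unique_kmer_starts(toy_reference, "chrY", k=4)
print(control_starts)
```

In this toy case the k-mers ACGT and CGTA occur more than once, so the positions where they start on chrY are excluded from the control set.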
AmpliCoNE-build pre-determines which positions of the genome will later be used for measuring read depth. A position within a gene family is said to be informative if (1) the location is not within an annotated high-copy repeat region (e.g. a transposable element) or a short tandem repeat, (2) the k-mer starting at that location is specific to its gene family of origin, and (3) the k-mer is non-repetitive within its gene. For (1), AmpliCoNE-build takes repeat annotations as input, such as ones generated by RepeatMasker (“RepeatMasker Home Page,” n.d.) and Tandem Repeats Finder (Benson, 1999). This step is necessary since reads are notoriously hard to align to these regions. For (2) and (3), AmpliCoNE-build uses a strategy similar to the one used by Teitz and colleagues to annotate Y chromosome amplicons (Teitz et al., 2018). It extracts all the 101-mers from the Y chromosome and maps them back to the reference using Bowtie2 (Langmead & Salzberg, 2012). It allows Bowtie2 to generate up to 15 alignments per k-mer (-k 15) but discards alignments that have more than two mismatches. A k-mer is then determined to be specific to the gene family if all its alignments fall within the gene regions in the family. It is determined to be non-repetitive within its gene if the number of alignments equals the number of genes in the family. Using only gene-specific locations is crucial for AmpliCoNE's accuracy, since non-specific locations would add biased noise to later read depth estimates.
Computing read depth and calling copy number
AmpliCoNE-count takes an alignment file of male reads to the male reference as input. It only retains alignments that are part of a properly mapping read pair and have at least 88 perfect matches in the first 90 bp of the read. This threshold is designed to retain only reliable alignments and is intended to match the criteria used for determining control positions. For each non-repeat-masked position $i$ on the Y chromosome, AmpliCoNE-count then computes the number of alignments starting at $i$, which we refer to as the read depth $D_i$.
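The three criteria for an informative position can be sketched as a small predicate. This is an illustrative sketch rather than the AmpliCoNE-build code: the function name, the half-open interval convention, and the input layout (a list of alignment start coordinates per k-mer, already filtered to at most two mismatches) are assumptions.

```python
def is_informative(alignment_starts, family_gene_regions, k, in_repeat=False):
    """Check the three criteria for one k-mer.

    alignment_starts: start coordinates of all retained alignments of the k-mer.
    family_gene_regions: (start, end) half-open intervals, one per gene copy
        in the family (hypothetical input format).
    in_repeat: True if the position lies in an annotated repeat, criterion (1).
    """
    if in_repeat:
        return False

    def within_family(start):
        # The whole k-mer must fall inside one gene region of the family.
        return any(g_start <= start and start + k <= g_end
                   for g_start, g_end in family_gene_regions)

    # Criterion (2): every alignment falls within the family's gene regions.
    specific = all(within_family(s) for s in alignment_starts)
    # Criterion (3): exactly one alignment per gene copy in the family.
    non_repetitive = len(alignment_starts) == len(family_gene_regions)
    return specific and non_repetitive


genes = [(100, 400), (1000, 1300)]  # toy family with two gene copies
print(is_informative([150, 1050], genes, k=101))        # one hit per copy
print(is_informative([150, 1050, 5000], genes, k=101))  # one hit outside the family
```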
It is known that the GC bias affects the depth of reads generated using Illumina technology (Bentley et al., 2008). Therefore, AmpliCoNE-count adjusts the read depth by using a sliding-window-based GC correction method (Yoon, Xuan, Makarov, Ye, & Sebat, 2009).
Concretely, AmpliCoNE-count first collects the read depths for the control positions and notes the GC percentage of the 501 bp window centered at those positions. The read depths are then binned according to their GC percentage, using 100 bins: for a given bin $b$, we calculate $\mu_b$, the mean read depth of the control locations with a GC percentage belonging to $b$. We also let $\mu$ be the mean read depth over all control positions. The GC-corrected read depth for position $i$ is then calculated as $\mu D_i / \mu_b$, where $b$ is the GC bin to which position $i$ belongs.
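The correction $\mu D_i / \mu_b$ can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the AmpliCoNE-count implementation: the function name and array layout are hypothetical, GC percentages are assumed to be precomputed per position, and positions whose bin contains no control position are left as NaN.

```python
import numpy as np


def gc_correct(depth, gc_percent, control_idx, n_bins=100):
    """Return (corrected depth, mu) using binned GC correction.

    depth: read depth D_i per position.
    gc_percent: GC percentage (0-100) of the window centered at each position.
    control_idx: integer indices of the control positions.
    """
    depth = np.asarray(depth, dtype=float)
    gc_percent = np.asarray(gc_percent, dtype=float)
    # Assign each position to one of n_bins equal-width GC bins.
    bins = np.minimum((gc_percent / 100.0 * n_bins).astype(int), n_bins - 1)

    control_depth = depth[control_idx]
    control_bins = bins[control_idx]
    mu = control_depth.mean()  # mean depth over all control positions

    corrected = np.full(depth.shape, np.nan)
    for b in np.unique(control_bins):
        mu_b = control_depth[control_bins == b].mean()  # mean control depth in bin b
        if mu_b > 0:
            mask = bins == b
            corrected[mask] = mu * depth[mask] / mu_b
    return corrected, mu


depth = [10, 10, 20, 20, 30]
gc = [40.5, 40.2, 60.1, 60.9, 60.3]
corrected, mu = gc_correct(depth, gc, control_idx=np.array([0, 1, 2, 3]))
print(corrected, mu)
```

In the toy input, the last position has a depth 1.5 times the mean of the control positions in its GC bin, so its corrected depth is 1.5 times $\mu$.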
For each gene, AmpliCoNE-count computes the gene copy count as the mean GC-corrected read depth for all informative locations in the gene, divided by the mean control read depth μ. To obtain the final copy count for each family, AmpliCoNE-count reports the total sum of the copy counts of all the genes in the family.
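The copy count calculation described above amounts to a ratio of means followed by a sum. The sketch below assumes GC-corrected depths are already available; the function names, gene labels, and dictionary layout are hypothetical.

```python
import numpy as np


def gene_copy_count(corrected_depth, informative_idx, mu):
    """Mean GC-corrected depth over a gene's informative positions, divided by mu."""
    values = np.asarray(corrected_depth, dtype=float)[informative_idx]
    return float(values.mean() / mu)


def family_copy_count(corrected_depth, informative_by_gene, mu):
    """Sum of per-gene copy counts over all annotated genes of a family."""
    return sum(gene_copy_count(corrected_depth, idx, mu)
               for idx in informative_by_gene.values())


# Toy data: the control depth mu is 15; the two annotated genes show
# corrected depths of about 2x and 3x the control depth.
corrected_depth = np.array([15.0, 15.0, 30.0, 30.0, 45.0, 45.0, 15.0])
informative_by_gene = {"geneA": [2, 3], "geneB": [4, 5]}
total = family_copy_count(corrected_depth, informative_by_gene, mu=15.0)
print(total)
```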
Simulation-based validation of AmpliCoNE
To evaluate the accuracy of AmpliCoNE, we ran simulations. There are nine TSPY genes (six functional and three pseudogenes), six RBMY genes, and two VCY genes in the hg38 reference. We added different numbers of copies of these three ampliconic gene families to the Y chromosome (Table S1) to simulate reads. The total numbers of gene copies in the three custom references used to generate the simulated reads were 22/7/4 copies (for TSPY/RBMY/VCY, respectively) in set 1, 29/12/2 copies in set 2, and 23/9/3 copies in set 3. Using wgsim [0.3.2] (lh, n.d.), we simulated 666 million paired-end reads of length 101 bp and insert size of 260 bp (the exact parameters were "-d 260 -N 666873346 -1 101 -2 101 -S 9 -e 0 -r 0 -R 0"). The reads from the three simulated datasets were aligned to the hg38 reference genome using BWA-MEM [v0.7.15] (Li, 2013). The SAM files were sorted and PCR duplicates were removed using the PICARD toolkit [v1.128] (“Website,” n.d.). Finally, samtools [v1.3.1] (Li et al., 2009) was used to index the alignments. The sorted, indexed BAM files were presented as input to AmpliCoNE-count.
Datasets
We used mRNA sequencing data for 170 testis samples with matched whole-genome sequencing (WGS) data from the GTEx project (Carithers et al., 2015). The GTEx RNA-seq libraries were generated with the Illumina TruSeq protocol and whole-genome sequencing was performed with paired-end reads ranging from 100 bp to 150 bp in length with target insert size of 350-370 bp (Carithers et al., 2015). As the validation dataset for AmpliCoNE, we used WGS data from four males (depth of coverage ranging from ~45-50x in HG002 and HG003, ~300x in HG005 and ~100x in HG006) sequenced by the GIAB consortium (Zook et al., 2014).
Pipeline for human WGS analysis
The Y-chromosome-specific alignments of the GTEx dataset were extracted from dbGAP using the SRA toolkit (Leinonen, Sugawara, Shumway, & on behalf of the International Nucleotide Sequence Database Collaboration, 2010). From the alignments, we extracted the reads and aligned them to the hg38 reference genome using BWA-MEM [v0.7.15] (Li, 2013). The SAM files were sorted and PCR duplicates were removed using the PICARD toolkit [v1.128] (“Website,” n.d.). Finally, samtools [v1.3.1] (Li et al., 2009) was used to index the read alignment files. The generated BAM files were presented as input to AmpliCoNE-count to estimate ampliconic copy number.
AmpliCoNE-build requires the locations, in the reference genome, of all the gene copies for each ampliconic gene family. While the locations of functional copies are already annotated in hg38, these annotations do not include highly similar pseudogenized copies. Such copies must be included since they affect the read mappings. For each family, we therefore took an arbitrary annotated copy of a gene and used BLAT (W. James Kent, 2002) to find all sites aligning with >99% identity (Table S13). These locations were given as input to AmpliCoNE-build.
Experimental validation with droplet digital PCR (ddPCR)
To validate the in silico ampliconic gene copy number counts in four individuals sequenced by the GIAB consortium (Zook et al., 2014), we acquired their DNA (NA24385, NA24149, NA24631, and NA24694) from Coriell and performed ddPCR for all nine Y ampliconic gene families. To infer the copy number of these gene families, we used SRY, a single-copy gene on the Y chromosome, and RPP30, a two-copy autosomal gene, as references. We ran ddPCR for each sample in triplicate using EvaGreen dsDNA dye (Bio-Rad) on the Bio-Rad QX200 droplet digital platform with the protocol and primers from our previous publication (Tomaszkiewicz et al., 2016). The results were analyzed using QuantaSoft software. Subsequently, the concentration (copies/µL) of each ampliconic gene family of interest was divided by the concentration of the references, SRY and RPP30 (Table S4).
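The normalization against a reference gene can be sketched as below. The text states only that target concentrations were divided by reference concentrations; scaling by the reference's known copy number (1 for SRY, 2 for RPP30) is an assumption made here for illustration, and the function name and numeric values are hypothetical.

```python
def copies_relative_to_reference(target_conc, reference_conc, reference_copies):
    """Estimate target copy number per genome from ddPCR concentrations.

    target_conc, reference_conc: concentrations in copies/uL.
    reference_copies: known copies of the reference per genome
        (assumed 1 for SRY and 2 for RPP30 in a male sample).
    """
    if reference_conc <= 0:
        raise ValueError("reference concentration must be positive")
    return reference_copies * target_conc / reference_conc


# Hypothetical concentrations for one sample and one gene family.
target = 350.0                                                           # copies/uL
vs_sry = copies_relative_to_reference(target, 10.0, reference_copies=1)
vs_rpp30 = copies_relative_to_reference(target, 20.0, reference_copies=2)
print(vs_sry, vs_rpp30)
```

When both references behave as expected, the two estimates should agree, which provides a simple consistency check.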
Estimating gene expression levels
Gene expression estimates were obtained using the kallisto-DESeq2 pipeline described below. The standard human (hg38) RefSeq transcripts obtained from the UCSC Genome Browser (W. J. Kent et al., 2002) were used as reference. We generated an index for the reference using the kallisto [v0.43.0] index function with default parameters (Bray, Pimentel, Melsted, & Pachter, 2016). For each sample we obtained read counts per transcript using the kallisto quant (--bias, --seed=9, --bootstrap-samples=100) function. The hg38 refFlat file containing the transcript-to-gene mapping information was obtained from the UCSC Genome Browser (W. J. Kent et al., 2002) annotation database and was used to convert the transcript-level read counts to gene-level expression levels using the tximport package [v1.2.0] (Soneson, Love, & Robinson, 2015). Since there were no replicates for the samples, we set the 170 sample IDs as different conditions in the design, and the gene-level read counts for the 170 RNA-seq samples were normalized using DESeq2 [v1.14.1] (Love, Huber, & Anders, 2014). Additionally, read counts transformed with the vst (Variance Stabilizing Transformation) function in DESeq2 were used to check for outliers. To identify outliers in the dataset, we performed Principal Component Analysis using the prcomp() function on the vst-normalized read counts. When we plotted the first and second principal components, we found 21 samples outside the main cluster of the remaining 149 samples (Fig. S12). We also followed the steps described in the DESeq2 vignettes and plotted the heatmap of sample-to-sample distances for the top 1,000 highly expressed genes to identify outliers visually, and we found the same 21 samples as outliers. Thus, we filtered out these 21 samples and used the expression values for the nine ampliconic families in the remaining 149 samples in the downstream analysis.
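The PCA step can be sketched in Python with NumPy. The analysis described above used prcomp() in R and identified outliers visually; the SVD-based projection below mirrors prcomp with centering and no scaling, while the distance-based flagging rule (and its threshold) is purely an illustrative assumption, as are the function names and the simulated data.

```python
import numpy as np


def first_two_pcs(expr):
    """Project samples onto the first two principal components.

    expr: samples x genes matrix of variance-stabilized expression values.
    Columns are centered but not scaled.
    """
    centered = expr - expr.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :2] * s[:2]


def flag_outliers(pcs, n_mads=5.0):
    """Flag samples far from the median in the PC1-PC2 plane (assumed rule)."""
    dist = np.linalg.norm(pcs - np.median(pcs, axis=0), axis=1)
    med = np.median(dist)
    mad = np.median(np.abs(dist - med))
    return dist > med + n_mads * mad


# Simulated data: 30 samples, 50 genes, with sample 0 shifted strongly.
rng = np.random.default_rng(0)
expr = rng.normal(size=(30, 50))
expr[0] += 25.0
pcs = first_two_pcs(expr)
flags = flag_outliers(pcs)
print(np.flatnonzero(flags))
```

A numeric rule like this would only complement, not replace, the visual inspection of the PCA plot and the sample-to-sample distance heatmap described in the text.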
We summed the expression values for all the gene copies within a gene family to obtain family-level expression values.
Human Y haplogroup determination
Yhaplo [v1.0.11] (Poznik & David Poznik, 2016) was used to predict the Y haplogroup of the samples. The version of Yhaplo we used expects SNP coordinates consistent with the hg19 (Church et al., 2011) version of the human reference. The Y-chromosome-specific BAM files downloaded from dbGAP were aligned to the hg19 version of the human reference using BWA-MEM. We directly converted the downloaded BAM files into pileup format using the samtools mpileup function. A custom script was used to convert the pileup file into Yhaplo-compatible input format. We annotated the Y haplogroup for all the samples in the dataset using Yhaplo default parameters.
Code availability
Code used in the manuscript is available on GitHub: https://github.com/makovalab-psu/GTEx_Testis_Analysis. Steps to install and use AmpliCoNE are available at https://github.com/makovalab-psu/AmpliCoNE-tool
References
Alkan, C., Kidd, J. M., Marques-Bonet, T., Aksay, G., Antonacci, F., Hormozdiari, F., … Eichler, E. E. (2009). Personalized copy number and segmental duplication maps using next-generation sequencing. Nature Genetics, 41(10), 1061–1067. Ardlie, K. G., DeLuca, D. S., Segrè, A. V., Sullivan, T. J., Young, T. R., Gelfand, E. T., … Lockhart. (2015). The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science, 348(6235), 648–660. Bellott, D. W., Hughes, J. F., Skaletsky, H., Brown, L. G., Pyntikova, T., Cho, T.-J., … Page, D. C. (2014). Mammalian Y chromosomes retain widely expressed dosage- sensitive regulators. Nature, 508(7497), 494–499. Benson, G. (1999). Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research, 27(2), 573–580. Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., Milton, J., Brown, C. G., … Smith, A. J. (2008). Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456(7218), 53. Betrán, E., Demuth, J. P., & Williford, A. (2012). Why Chromosome Palindromes? International Journal of Evolutionary Biology, 2012(Figure 2), 1–14. Bhowmick, B. K., Satta, Y., & Takahata, N. (2007). The origin and evolution of human ampliconic gene families and ampliconic structure. Genome Research, 17(4), 441– 450. Bray, N. L., Pimentel, H., Melsted, P., & Pachter, L. (2016). Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology, 34(5), 525–527. Carithers, L. J., Ardlie, K., Barcus, M., Branton, P. A., Britton, A., Buia, S. A., … GTEx
Consortium. (2015). A Novel Approach to High-Quality Postmortem Tissue Procurement: The GTEx Project. Biopreservation and Biobanking, 13(5), 311–319. Carvalho, C. M. B., Zhang, F., & Lupski, J. R. (2011). Structural variation of the human genome: mechanisms, assays, and role in male infertility. Systems Biology in Reproductive Medicine, 57(1-2), 3–16. Charlesworth, B., & Charlesworth, D. (2000). The degeneration of Y chromosomes. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 355(1403), 1563–1572. Charlesworth, D., & Charlesworth, B. (1980). Sex differences in fitness and selection for centric fusions between sex-chromosomes and autosomes. Genetical Research, 35(02), 205. Church, D. M., Schneider, V. A., Graves, T., Auger, K., Cunningham, F., Bouk, N., … Hubbard, T. (2011). Modernizing reference genome assemblies. PLoS Biology, 9(7), e1001091. Connallon, T., & Clark, A. G. (2010). Gene duplication, gene conversion and the evolution of the Y chromosome. Genetics, 186(1), 277–286. Cortez, D., Marin, R., Toledo-Flores, D., Froidevaux, L., Liechti, A., Waters, P. D., … Kaessmann, H. (2014). Origins and functional evolution of Y chromosomes across mammals. Nature, 508(7497), 488–493. Derrien, T., Estellé, J., Marco Sola, S., Knowles, D. G., Raineri, E., Guigó, R., & Ribeca, P. (2012). Fast computation and applications of genome mappability. PloS One, 7(1), e30377. Djureinovic, D., Fagerberg, L., Hallström, B., Danielsson, A., Lindskog, C., Uhlén, M., & Pontén, F. (2014). The human testis-specific proteome defined by transcriptomics and antibody-based profiling. Molecular Human Reproduction, 20(6), 476–488. Dorus, S., Gilbert, S. L., Forster, M. L., Barndt, R. J., & Lahn, B. T. (2003). The CDY- related gene family: coordinated evolution in copy number, expression profile and protein sequence. Human Molecular Genetics, 12(14), 1643–1650. Giachini, C., Nuti, F., Turner, D. J., Laface, I., Xue, Y., Daguin, F., … Krausz, C. (2009). 
TSPY1 Copy Number Variation Influences Spermatogenesis and Shows Differences among Y Lineages. The Journal of Clinical Endocrinology and Metabolism, 94(10), 4016–4022. Gu, L., & Walters, J. R. (2017). Evolution of Sex Chromosome Dosage Compensation in Animals: A Beautiful Theory, Undermined by Facts and Bedeviled by Details.
Genome Biology and Evolution, 9(9), 2461–2476. Gunes, S., Hekim, G. N. T., Arslan, M. A., & Asci, R. (2016). Effects of aging on the male reproductive system. Journal of Assisted Reproduction and Genetics, 33(4), 441– 454. Gu, W., Zhang, F., & Lupski, J. R. (2008). Mechanisms for human genomic rearrangements. PathoGenetics, 1(1), 4. Hallast, P., & Jobling, M. A. (2017). The Y chromosomes of the great apes. Human Genetics, 136(5), 511–528. Handel, M. A. (2004). The XY body: a specialized meiotic chromatin domain. Experimental Cell Research, 296(1), 57–63. Handsaker, R. E., Van Doren, V., Berman, J. R., Genovese, G., Kashin, S., Boettger, L. M., & McCarroll, S. A. (2015). Large multiallelic copy number variations in humans. Nature Genetics, 47(3), 296–303. Harris, I. D., Fronczak, C., Roth, L., & Meacham, R. B. (2011). Fertility and the aging male. Reviews in Urology, 13(4), e184–e190. Henrichsen, C. N., Chaignat, E., & Reymond, A. (2009). Copy number variants, diseases and gene expression. Human Molecular Genetics, 18(R1), R1–R8. Iwase, M., Satta, Y., Hirai, H., Hirai, Y., & Takahata, N. (2010). Frequent gene conversion events between the X and Y homologous chromosomal regions in primates. BMC Evolutionary Biology, 10, 225. Jobling, M. A. (2008). Copy number variation on the human Y chromosome. Cytogenetic and Genome Research, 123(1-4), 253–262. Kent, W. J. (2002). BLAT--the BLAST-like alignment tool. Genome Research, 12(4), 656– 664. Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A. M., & Haussler, a. D. (2002). The Human Genome Browser at UCSC. Genome Research, 12(6), 996–1006. Kichine, E., Rozé, V., Di Cristofaro, J., Taulier, D., Navarro, A., Streichemberger, E., … Mitchell, M. J. (2012). HSFY genes and the P4 palindrome in the AZFb interval of the human Y chromosome are not required for spermatocyte maturation. Human Reproduction , 27(2), 615–624. Kido, T., & Lau, Y.-F. C. (2014). 
The Y-located gonadoblastoma gene TSPY amplifies its own expression through a positive feedback loop in prostate cancer cells. Biochemical and Biophysical Research Communications, 446(1), 206–211.
Kinoshita, K., Shinka, T., Sato, Y., Kurahashi, H., Kowa, H., Chen, G., … Nakahori, Y. (2006). Expression analysis of a mouse orthologue of HSFY, a candidate for the azoospermic factor on the human Y chromosome. The Journal of Medical Investigation: JMI, 53(1-2), 117–122. Krausz, C., Chianese, C., Giachini, C., Guarducci, E., Laface, I., & Forti, G. (2011). The Y chromosome-linked copy number variations and male fertility. Journal of Endocrinological Investigation, 34(5), 376–382. Krausz, C., Giachini, C., & Forti, G. (2010). TSPY and Male Fertility. Genes, 1(2), 308– 316. Lahn, B. T., & Page, D. C. (2000). A human sex-chromosomal gene family expressed in male germ cells and encoding variably charged proteins. Human Molecular Genetics, 9(2), 311–319. Lambert, S., Saintigny, Y., Delacote, F., Amiot, F., Chaput, B., Lecomte, M., … Lopez, B. S. (1999). Analysis of intrachromosomal homologous recombination in mammalian cell, using tandem repeat sequences. Mutation Research, 433(3), 159–168. Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357–359. Lan, X., & Pritchard, J. K. (2016). Coregulation of tandem duplicate genes slows evolution of subfunctionalization in mammals. Science, 352(6288), 1009–1013. Larson, E. L., Kopania, E. E. K., & Good, J. M. (2018). Spermatogenesis and the Evolution of Mammalian Sex Chromosomes. Trends in Genetics: TIG, 34(9), 722–732. Leinonen, R., Sugawara, H., Shumway, M., & on behalf of the International Nucleotide Sequence Database Collaboration. (2010). The Sequence Read Archive. Nucleic Acids Research, 39(Database), D19–D21. Li H. (2011). wgsim-Read simulator for next generation sequencing. Github Repository. [online] http://github.com/lh3/wgsim.
Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Retrieved from http://arxiv.org/abs/1303.3997 Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., … 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079. Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550.
Lucotte, E. A., Skov, L., Jensen, J. M., Coll Macià, M., Munch, K., & Schierup, M. H. (2018). Dynamic Copy Number Evolution of X- and Y-Linked Ampliconic Genes in Human Populations. Genetics. https://doi.org/10.1534/genetics.118.300826 Lupiáñez, D. G., Kraft, K., Heinrich, V., Krawitz, P., Brancati, F., Klopocki, E., … Mundlos, S. (2015). Disruptions of topological chromatin domains cause pathogenic rewiring of gene-enhancer interactions. Cell, 161(5), 1012–1025. Lynch, M., & Conery, J. S. (2000). The evolutionary fate and consequences of duplicate genes. Science, 290(5494), 1151–1155. Makova, K. D., & Li, W.-H. (2002). Strong male-driven evolution of DNA sequences in humans and apes. Nature, 416(6881), 624–626. Mallick, S., Li, H., Lipson, M., Mathieson, I., Gymrek, M., Racimo, F., … Reich, D. (2016). The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature, 538(7624), 201–206. Medvedev, P., Stanciu, M., & Brudno, M. (2009). Computational methods for discovering structural variation with next-generation sequencing. Nature Methods, 6(11s), S13. Navarro-Costa, P. (2012). Sex, rebellion and decadence: the scandalous evolutionary history of the human Y chromosome. Biochimica et Biophysica Acta, 1822(12), 1851–1863. Navarro-Costa, P., Plancha, C. E., & Gonçalves, J. (2010). Genetic dissection of the AZF regions of the human Y chromosome: Thriller or filler for male (In)fertility? Journal of Biomedicine & Biotechnology, 2010. https://doi.org/10.1155/2010/936569 Nguyen, D. K., & Disteche, C. M. (2005). Dosage compensation of the active X chromosome in mammals. Nature Genetics, 38(1), 47–53. Oetjens, M. T., Shen, F., Emery, S. B., Zou, Z., & Kidd, J. M. (2016). Y-Chromosome Structural Diversity in the Bonobo and Chimpanzee Lineages. Genome Biology and Evolution, 8(7), 2231–2240. Hastings, P. J., Lupski, J. R., Rosenberg, S. M., & Ira, G. (2010). Mechanisms of change in gene copy number. Nature Reviews. Genetics, 10(8), 551–564. Poli, M.
N., Iriarte, P. F., Iudica, C., Zanier, J. H. M., & Coco, R. (2015). New Sequence Variations in Spermatogenesis Candidates Genes. JBRA Assisted Reproduction, 19(4), 216–222. Poznik, G. D., & David Poznik, G. (2016). Identifying Y-chromosome haplogroups in arbitrarily large samples of sequenced or genotyped men. https://doi.org/10.1101/088716
RepeatMasker Home Page. (n.d.). Retrieved June 20, 2018, from http://www.repeatmasker.org Repping, S., Skaletsky, H., Lange, J., Silber, S., van der Veen, F., Oates, R. D., … Rozen, S. (2002). Recombination between Palindromes P5 and P1 on the Human Y Chromosome Causes Massive Deletions and Spermatogenic Failure. American Journal of Human Genetics, 71(4), 906–922. Rozen, S., Skaletsky, H., Marszalek, J. D., Minx, P. J., Cordum, H. S., Waterston, R. H., … Page, D. C. (2003). Abundant gene conversion between arms of palindromes in human and ape Y chromosomes. Nature, 423(6942), 873–876. Sahlin, K., Tomaszkiewicz, M., Makova, K. D., & Medvedev, P. (2018). Deciphering highly similar multigene family transcripts from Iso-Seq data with IsoCon. Nature Communications, 9(1), 4601. Shinka, T., Sato, Y., Chen, G., Naroda, T., Kinoshita, K., Unemi, Y., … Nakahori, Y. (2004). Molecular characterization of heat shock-like factor encoded on the human Y chromosome, and implications for male infertility. Biology of Reproduction, 71(1), 297–306. Sin, H.-S., Ichijima, Y., Koh, E., Namiki, M., & Namekawa, S. H. (2012). Human postmeiotic sex chromatin and its impact on sex chromosome evolution. Genome Research, 22(5), 827–836. Skaletsky, H., Kuroda-Kawaguchi, T., Minx, P. J., Cordum, H. S., Hillier, L., Brown, L. G., … Page, D. C. (2003). The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature, 423(6942), 825–837. Skov, L., Danish Pan Genome Consortium, & Schierup, M. H. (2017). Analysis of 62 hybrid assembled human Y chromosomes exposes rapid structural changes and high rates of gene conversion. PLoS Genetics, 13(8), e1006834. Soneson, C., Love, M. I., & Robinson, M. D. (2015). Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research, 4, 1521. Spielmann, M., Lupiáñez, D. G., & Mundlos, S. (2018). Structural variation in the 3D genome. Nature Reviews. Genetics, 19(7), 453–467. Stahl, P. 
J., Mielnik, A. N., Barbieri, C. E., Schlegel, P. N., & Paduch, D. A. (2012). Deletion or underexpression of the Y-chromosome genes CDY2 and HSFY is associated with maturation arrest in American men with nonobstructive azoospermia. Asian Journal of Andrology, 14(5), 676–682. Straub, T., & Becker, P. B. (2007). Dosage compensation: the beginning and end of
generalization. Nature Reviews. Genetics, 8(1), 47–57. Sudmant, P. H., Kitzman, J. O., Antonacci, F., Alkan, C., Malig, M., Tsalenko, A., … Eichler, E. E. (2010). Diversity of human copy number variation and multicopy genes. Science, 330(6004), 641–646. Teitz, L. S., Pyntikova, T., Skaletsky, H., & Page, D. C. (2018). Selection Has Countered High Mutability to Preserve the Ancestral Copy Number of Y Chromosome Amplicons in Diverse Human Lineages. American Journal of Human Genetics, 103(2), 261–275. Tomaszkiewicz, M., Rangavittal, S., Cechova, M., Campos Sanchez, R., Fescemyer, H. W., Harris, R., … Makova, K. D. (2016). A time- and cost-effective strategy to sequence mammalian Y Chromosomes: an application to the de novo assembly of gorilla Y. Genome Research, 26(4), 530–540. Trombetta, B., Cruciani, F., Underhill, P. A., Sellitto, D., & Scozzari, R. (2010). Footprints of X-to-Y gene conversion in recent human evolution. Molecular Biology and Evolution, 27(3), 714–725. Tsuei, D.-J., Lee, P.-H., Peng, H.-Y., Lu, H.-L., Su, D.-S., Jeng, Y.-M., … Chang, M.-H. (2011). Male germ cell-specific RNA binding protein RBMY: a new oncogene explaining male predominance in liver cancer. PloS One, 6(11), e26948. Uhlén, M., Fagerberg, L., Hallström, B. M., Lindskog, C., Oksvold, P., Mardinoglu, A., … Pontén, F. (2015). Proteomics. Tissue-based map of the human proteome. Science, 347(6220), 1260419. Vangompel, M. J. W., & Xu, E. Y. (2011). The roles of the DAZ family in spermatogenesis: More than just translation? Spermatogenesis, 1(1), 36–46. Vicoso, B., & Bachtrog, D. (2009). Progress and prospects toward our understanding of the evolution of dosage compensation. Chromosome Research: An International Journal on the Molecular, Supramolecular and Evolutionary Aspects of Chromosome Biology, 17(5), 585–602. Vinuela, A., Brown, A. A., Buil, A., Tsai, P.-C., Davies, M. N., Bell, J. T., … Small, K. (2016). 
Age-dependent changes in mean and variance of gene expression across tissues in a twin cohort. https://doi.org/10.1101/063883 Vogt, P. H., Edelmann, A., Kirsch, S., Henegariu, O., Hirschmann, P., Kiesewetter, F., … Haidl, G. (1996). Human Y chromosome azoospermia factors (AZF) mapped to different subregions in Yq11. Human Molecular Genetics, 5(7), 933–943. Website. (n.d.). Retrieved June 13, 2018, from http://broadinstitute.github.io/picard/ Yang, J., The GTEx Consortium, Huang, T., Petralia, F., Long, Q., Zhang, B., … Tu, Z.
(2015). Synchronized age-related gene expression changes across multiple tissues in human and the link to complex diseases. Scientific Reports, 5(1). https://doi.org/10.1038/srep15145 Yan, Y., Yang, X., Liu, Y., Shen, Y., Tu, W., Dong, Q., … Yang, Y. (2017). Copy number variation of functional RBMY1 is associated with sperm motility: an azoospermia factor-linked candidate for asthenozoospermia. Human Reproduction , 32(7), 1521– 1531. Ye, D., Zaidi, A. A., Tomaszkiewicz, M., Anthony, K., Liebowitz, C., DeGiorgio, M., … Betran, E. (2018). High Levels of Copy Number Variation of Ampliconic Genes across Major Human Y Haplogroups. Genome Biology and Evolution, 10(5), 1333–1350. Yoon, S., Xuan, Z., Makarov, V., Ye, K., & Sebat, J. (2009). Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Research, 19(9), 1586–1592. Yu, H., Luscombe, N. M., Qian, J., & Gerstein, M. (2003). Genomic analysis of gene expression relationships in transcriptional regulatory networks. Trends in Genetics: TIG, 19(8), 422–427. Zook, J. M., Chapman, B., Wang, J., Mittelman, D., Hofmann, O., Hide, W., & Salit, M. (2014). Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nature Biotechnology, 32(3), 246. Zou, S. W., Zhang, J. C., Zhang, X. D., Miao, S. Y., Zong, S. D., Sheng, Q., & Wang, L. F. (2003). Expression and localization of VCX/Y proteins and their possible involvement in regulation of ribosome assembly during spermatogenesis. Cell Research, 13(3), 171–177.
Chapter 3
Ampliconic genes on the great ape Y chromosomes: Rapid evolution of copy number but conservation of expression levels
This chapter will be submitted as a research article by R. Vegesna, M. Tomaszkiewicz, O.A. Ryder, R. Campos-Sánchez, P. Medvedev, M. DeGiorgio and K.D. Makova. In this chapter, M. Tomaszkiewicz performed all the wet-lab experimental work. Supporting information is provided in Appendix B.
Abstract
Multi-copy ampliconic gene families on the Y chromosome play an important role in spermatogenesis. Thus, studying their genetic variation in endangered great ape species is critical. We estimated the sizes (copy number) of nine Y ampliconic gene families in population samples of chimpanzee, bonobo, and orangutan with droplet digital PCR, combined these estimates with published data for human and gorilla, and produced genome-wide testis gene expression data for great apes. Analyzing this comprehensive dataset within an evolutionary framework, we, first, found high inter- and intraspecific variation in gene family size, with larger families exhibiting higher variation as compared to smaller families, a pattern consistent with random genetic drift. Second, for four gene families, we observed significant interspecific size differences, sometimes even between the sister species chimpanzee and bonobo. Sperm competition and mating structure, which differ drastically among great apes, might have affected these patterns. Third, despite substantial variation in copy number, Y ampliconic gene families’ expression levels did not differ significantly among species, suggesting dosage regulation. Fourth, for three gene families, size was positively correlated with gene expression levels across species, suggesting that, given sufficient evolutionary time, copy number influences gene expression. Our results indicate high variability in size but conservation in gene expression levels in Y ampliconic gene families, significantly advancing our understanding of Y chromosome evolution in great apes. The Y gene copy number estimation protocols developed here can be used to trace male-biased dispersal and thus aid in conservation efforts of endangered great apes.
Introduction
Great apes (family Hominidae) include four genera: Pongo (Bornean, Sumatran and Tapanuli orangutans), Gorilla (eastern and western gorillas), Pan (common chimpanzee and bonobo), and Homo (humans). They shared a common ancestor approximately 13 million years ago (MYA) (Glazko & Nei, 2003). All great apes but humans are endangered species (Fruth et al., 2016; Iucn & IUCN, 2016a, 2016b, 2016c, 2017). Therefore, understanding genetic variation within and among species, and preserving the reproductive fitness of these animals, is of utmost importance. Some of the clues to addressing these pressing questions lie in the analysis of the sex chromosomes of great apes. Sex chromosomes harbor genes linked to spermatogenesis and influencing fertility and reproduction (Ross et al., 2005; Skaletsky et al., 2003); however, we lack a key understanding of the diversity of these genes across great apes.
The great ape sex chromosomes, X and Y, originated from a pair of homologous chromosomes in the common ancestor of therian mammals 160-190 MYA (Luo, Yuan, Meng, & Ji, 2011; Veyrunes et al., 2008). Over time, the X chromosome has retained most of its ancestral gene content facilitated by continued recombination in females (Mueller et al., 2013), whereas the Y chromosome underwent a series of inversions and lost most of its genes due to the lack of recombination with the X (Charlesworth & Charlesworth, 1980; Skaletsky et al., 2003). Additionally, sexual antagonism led to accumulation of genes and mutations beneficial to males on the Y (Fisher, 1931). Since the split of great apes from their common ancestor, the Y chromosome continued to diverge. Cytogenetic studies have demonstrated that the Y chromosome differs in size, gene content, and gene order among great ape species (Gläser et al., 1998). To date, among great apes, only the human, chimpanzee and gorilla Y chromosomes have been sequenced and assembled, with the gorilla assembly being in draft state (Hughes et al., 2010; Skaletsky et al., 2003; Tomaszkiewicz et al., 2016). The Y chromosomes of other great ape species are yet to be deciphered.
The same sequence regions are present on the Y chromosomes of great apes studied to date. These include the pseudoautosomal region (PAR) and the male-specific region of the Y (MSY) (Hughes et al., 2012; Skaletsky et al., 2003). The PAR can recombine with the X and thus is identical to the homologous region on it (Graves & Marshall Graves, 1995; Lahn & Page, 1999). The MSY in great apes is interspersed with heterochromatic (Cechova et al., 2019) and euchromatic sequences of different sizes. The euchromatic MSY portion consists of single-copy X-degenerate and X-transposed regions, and of highly repetitive ampliconic regions (Hughes et al., 2012, 2010; Skaletsky et al., 2003; Tomaszkiewicz et al., 2016). The X-degenerate regions constitute the remnants of the ancient proto-sex chromosomes, while the X-transposed region (so far found only in human) represents a recent transposition from the X to the Y. The ampliconic regions harbor protein-coding multi-copy gene families that are expressed in testis and are associated with spermatogenesis and male fertility (Hughes et al., 2012, 2010; Rozen et al., 2003; Skaletsky et al., 2003; Tomaszkiewicz et al., 2016). In humans, these are BPY2 (basic protein Y2), CDY (chromodomain Y), DAZ (deleted in azoospermia), HSFY (heat-shock transcription factor Y), PRY (PTP-BL related Y), RBMY (RNA-binding motif Y), TSPY (testis-specific Y), VCY (variable charge), and XKRY (X Kell blood-related Y) (Skaletsky et al., 2003). Five of these nine gene families (BPY2, CDY, DAZ, RBMY, and TSPY) are shared among great apes studied so far; however, information about the presence or absence of the other four gene families in different great ape species remains incomplete (reviewed in Hallast & Jobling, 2017). The majority of ampliconic gene families in human and chimpanzee are located in palindromes, large inverted repeats common on the Y chromosome (Hughes et al., 2010; Rozen et al., 2003; Skaletsky et al., 2003). 
(The exception to this pattern is the TSPY gene family, which in humans is present as a tandem array outside of palindromes (Skaletsky et al., 2003).) The presence of palindromes facilitates gene conversion between ampliconic genes, which removes deleterious mutations and lowers sequence diversity within Y ampliconic gene families (Hallast, Balaresque, Bowden, Ballereau, & Jobling, 2013; Rozen et al., 2003). Understanding the evolutionary dynamics of these gene families across great apes is essential for preserving genetic diversity and survival of these endangered species. However, a comprehensive investigation of the evolution of copy number and expression of Y ampliconic gene families has been lacking to date.
Several studies indicated intraspecific variation in Y chromosome ampliconic gene copy number in great apes (Hughes et al., 2010; Lucotte et al., 2018; Oetjens, Shen, Emery, Zou, & Kidd, 2016; Repping et al., 2006; Schaller et al., 2010; Skov & Schierup, 2017; Tomaszkiewicz et al., 2016; Ye et al., 2018). High variation in gene copy number for the RBMY and TSPY gene families was observed in humans, chimpanzees, and gorillas (Lucotte et al., 2018; Oetjens et al., 2016; Skov & Schierup, 2017; Tomaszkiewicz et al., 2016; Vegesna, Tomaszkiewicz, Medvedev, & Makova, 2019; Ye et al., 2018). Chimpanzees also exhibit high copy number variation in the DAZ gene family (Oetjens et al., 2016; Schaller et al., 2010), and gorillas—in the CDY and HSFY gene families (Tomaszkiewicz et al., 2016). Targeted fluorescence in situ hybridization (FISH) intraspecific analysis of the DAZ and CDY gene families identified no variation in Bornean orangutan, but two variants in Sumatran orangutan (Greve et al., 2011). Thus, the precise range of copy number variation for Y ampliconic gene families remains unknown in either of these two orangutan species. The available information on copy number variation for Y ampliconic gene families in bonobos is currently limited to two individuals (Oetjens et al., 2016). Thus, we are critically missing data on ampliconic gene copy number variation in orangutans and bonobos. Moreover, variation in Y ampliconic gene copy number in great apes has never been analyzed in an evolutionary framework, which could contribute to conservation genetics evaluations.
The evolution of gene expression of Y ampliconic gene families in great apes has remained even more understudied. Recently, we demonstrated dosage regulation of human Y ampliconic gene expression in testis when compared to their homologs on the X or autosomes (Vegesna et al., 2019). Additionally, expression levels and Y haplogroup or gene copy number of an individual were not significantly associated with each other (Vegesna et al., 2019). However, across gene families, we observed a positive correlation between the copy number and expression levels (Vegesna et al., 2019), which was also shown in another study examining expression of Y ampliconic gene families at different stages of spermatogenesis (Lucotte et al., 2018). Apart from these few studies, little is known about variation and evolution in expression of Y ampliconic gene families in great apes. Moreover, the relationship between copy number and expression levels for the Y ampliconic genes in non-human great ape species remains unexplored.
Previous studies suggested that evolution of the Y chromosome could reflect different mating patterns and social structure in great apes (Hallast et al., 2016; Hughes et al., 2010; Schaller et al., 2010). Great apes exhibit substantial variation in mating systems, which can result in different levels of sperm competition. Bonobos and chimpanzees have a multimale-multifemale, i.e. polygynandrous, mating system, where female promiscuity results in high levels of sperm competition. In contrast, gorillas have a unimale-multifemale, i.e. polygynous, mating system, which results in low levels of sperm competition. Orangutans and humans fall in between (Harrison & Chivers, 2007; Wistuba et al., 2003). The roving male polygynous mating system in orangutans, and mating systems ranging from monogamous to polygynous in humans, result in levels of sperm competition that are lower than those in chimpanzees/bonobos and higher than those in gorillas (Harrison & Chivers, 2007; Wistuba et al., 2003). Using testis size and several sperm phenotypes as proxies for sperm competition (Harcourt, Harvey, Larson, & Short, 1981), one can examine the association between genetic variation on the Y chromosome and varying levels of sperm competition across great ape species. Given the importance of Y ampliconic gene families in spermatogenesis, it is reasonable to hypothesize that their copy number and/or expression levels differ among great apes with different mating patterns and levels of sperm competition, and are associated with sperm phenotypes; however, this has not been explored previously.
In this study, we present the first comprehensive analysis of the evolution of Y ampliconic gene copy number and expression levels across great apes. Using droplet digital PCR (ddPCR), we estimated copy number of ampliconic gene families in nonhuman great apes. We tested whether the copy number of Y ampliconic gene families is conserved across great apes and identified species that have experienced a significant gain or loss in copy number. Additionally, we generated testis expression data for bonobo and Bornean orangutan, thus augmenting such data we and others previously generated for gorilla, orangutan, chimpanzee, and human (Ardlie et al., 2015; Brawand et al., 2011; Carelli et al., 2016; Fungtammasan et al., 2016; Ruiz-Orera et al., 2015; Tomaszkiewicz et al., 2016). We assembled the transcripts of Y ampliconic gene families and tested whether their expression is conserved across great apes. Given the Y ampliconic gene families’ copy number and expression data, we investigated the evolutionary relationship between them. Finally, we examined whether the variation in copy number or gene expression can explain phenotypes related to sperm competition. Our results highlight the important role of ampliconic genes in shaping Y chromosome evolution and evolution of great apes in general.
Results
Dynamic evolution of Y ampliconic gene copy number
Overall copy number and variance. To evaluate copy number of Y ampliconic genes, we used a ddPCR protocol similar to the one utilized in previous studies from our group (Tomaszkiewicz et al., 2016; Ye et al., 2018) (see Materials and Methods). With ddPCR, template DNA is fractionated into multiple droplets within which PCR takes place, and each droplet is analyzed to determine copy number in a sample (B. J. Hindson et al., 2011). ddPCR differs from quantitative real-time PCR in that it estimates the absolute quantity of DNA without generating a standard curve (B. J. Hindson et al., 2011). It serves as a more economical alternative to whole-genome sequencing for copy number evaluation of targeted genomic regions or gene families, while providing similar copy number estimates (Vegesna et al., 2019), and is particularly attractive in the absence of a Y chromosome reference (which is the case for bonobo and for Sumatran and Bornean orangutans).
Our samples included seven bonobos, nine chimpanzees, seven Bornean orangutans, and five Sumatran orangutans. To the best of our knowledge, all samples came from wild-born, unrelated individuals. Additionally, we used Y ampliconic gene copy number estimates generated by our group previously for 10 humans with African ancestry (Ye et al., 2018) and 14 wild-born gorillas (Tomaszkiewicz et al., 2016). Summarizing the data generated in our study and previous findings (reviewed in (Hallast & Jobling, 2017)), we show that eight ampliconic gene families (BPY2, CDY, DAZ, HSFY, PRY, RBMY, TSPY, and XKRY) are shared exclusively among human, gorilla, and both species of orangutan (Fig. 1A). We demonstrate that the XKRY gene family is present in both Bornean and Sumatran orangutans (Fig. 1A). HSFY, PRY, and XKRY are pseudogenized in chimpanzee and bonobo, and VCY is absent in bonobo, gorilla, and both species of orangutan examined (Fig. 1A). For gene families that are pseudogenized or absent in a species, we assigned a gene count of zero in our analysis.
Figure 3-1. Y ampliconic genes in great apes. A. Venn diagram showing gene content comparison among great ape species. B. Plot of the first two principal components (PCs) of Y ampliconic gene copy numbers across great ape species (the first and second PCs explained 68.7% and 22.8% of the variation, respectively; Fig. S2 shows variation explained by the other components).
When we compared the overall number of Y ampliconic gene copies (i.e. the sum of copies of all gene families) among species (Table S1), we found that the two orangutan species had the highest numbers (median≈130 genes), and chimpanzee (median=40.9) and gorilla (median=44.1) had the lowest numbers, whereas the numbers for human (median=64.3) and bonobo (median=82.8) were in-between. When we computed the inter-individual variance in the number of Y ampliconic gene copies within each species (Table S1), we observed that Bornean orangutan (variance=455) and Sumatran orangutan (variance=136) had the highest variances, whereas gorilla (variance=13.0) had the lowest, with human’s (variance=33.7), chimpanzee’s (variance=42.5) and bonobo’s (variance=87.5) variances being in-between. In general, the higher the total number of Y ampliconic gene copies a species had, the higher was its variance in copy number (Fig. S1). After principal component analysis (PCA) of copy number estimates in individual great ape samples, species formed well-separated clusters (Figs. 1B and S2).
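The per-species summaries described above (the total copy number per individual, then its median and inter-individual variance) can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: the function name `summarize_total_copy_number`, the species labels, and all numbers are hypothetical, and the use of the sample variance (n - 1 denominator) is an assumption.

```python
from statistics import median, variance


def summarize_total_copy_number(copies_by_species):
    """Return {species: (median_total, variance_total)}.

    copies_by_species maps a species name to a list of individuals; each
    individual is a dict {gene_family: copy_number}.
    """
    summary = {}
    for species, individuals in copies_by_species.items():
        # Total ampliconic copy number of one individual = sum over gene families.
        totals = [sum(individual.values()) for individual in individuals]
        # statistics.variance is the sample variance (n - 1 denominator);
        # this choice is an assumption, the text does not state which was used.
        summary[species] = (median(totals), variance(totals))
    return summary


# Hypothetical integer values for illustration only (real ddPCR estimates
# are generally non-integer); these are not data from the study.
example = {
    "species_A": [
        {"TSPY": 18, "RBMY": 11, "DAZ": 4},
        {"TSPY": 20, "RBMY": 12, "DAZ": 4},
        {"TSPY": 16, "RBMY": 10, "DAZ": 4},
    ],
    "species_B": [
        {"TSPY": 6, "RBMY": 8, "DAZ": 2},
        {"TSPY": 7, "RBMY": 8, "DAZ": 2},
        {"TSPY": 5, "RBMY": 8, "DAZ": 2},
    ],
}

for species, (med, var) in summarize_total_copy_number(example).items():
    print(species, med, var)
```

In this toy input, the species with the larger total also has the larger variance, which is the qualitative pattern reported across great apes.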
Copy number and its variance in individual gene families. Separating the data by Y ampliconic gene family and species (Fig. 2), we observed a positive correlation between gene family copy number and its variance in each species (Fig. 3) and in each gene family (Fig. S3). The TSPY gene family had consistently higher copy number and variance than other Y chromosome ampliconic gene families in bonobo, chimpanzee, and Sumatran orangutan, and had the second highest (after CDY) copy number and variance in Bornean orangutan (Fig. 3). The RBMY gene family also had high copy number and variance in all great ape species except for the two orangutan species. In contrast, the VCY family, present only in human and chimpanzee, had consistently low copy numbers (Figs. 2-3).
Figure 3-2. Variation in copy number of Y ampliconic gene families in great apes. Box plots summarizing the distribution of copy numbers of the six great ape species across nine Y ampliconic gene families. The gene families are separated into individual plots with the gene family name at the top. Within each plot, the X-axis represents six species (bonobo, chimpanzee, human, gorilla, Bornean orangutan, and Sumatran orangutan), and the Y-axis represents copy number. The black dot within each boxplot is the median value per species.
Most ampliconic gene families had more copies in orangutans than in other species (Figs. 2-3). For example, the PRY and XKRY gene families each had at least eight copies in orangutans (Bornean orangutan: eight copies for PRY and 15 copies for XKRY; Sumatran orangutan: 10 copies for PRY and 22 copies for XKRY). In contrast, in Homininae these gene families were either lost (in bonobo and chimpanzee) or had a median size of only two copies (in human and gorilla). Also, each of the BPY2, CDY, and DAZ gene families had more than twice as many gene copies in orangutans as in other great ape species.
Gene families lost in some species (HSFY, PRY, VCY, and XKRY) usually had few copies and low variation in the closely related species. For instance, the HSFY, PRY, and XKRY gene families were pseudogenized in bonobo and chimpanzee, and human had a low copy number (on average two copies) for these gene families (Figs. 2-3). Similarly, the VCY gene family was lost in the majority of great ape species, except for chimpanzee and human, in which it had a low copy number (on average two copies; Figs. 2-3).
Copy number differences between recently diverged species. As an initial investigation into how quickly Y ampliconic gene families evolve, we tested whether copy numbers for individual gene families differed significantly between recently diverged sister species. Two pairs of closely related species were included in this comparison: chimpanzee and bonobo, separated ~0.77-1.8 MYA (Hey, 2010; Yu et al., 2003), and Sumatran and Bornean orangutans, separated ~0.4 MYA (Locke et al., 2011). Five gene families were tested in bonobo vs. chimpanzee, and eight in Sumatran vs. Bornean orangutans (Fig. 1A), with a permutation test in which we compared the mean copy number difference between the two species (permuting species labels with one million permutations; bonobo vs. chimpanzee Bonferroni-corrected p-value cutoff of 0.05/5 = 0.01; Sumatran vs. Bornean orangutan Bonferroni-corrected p-value cutoff of 0.05/8 = 0.00625). Between bonobo and chimpanzee, each of the five gene families tested exhibited a significant difference in its copy number (p-values of 1.67×10⁻³, <10⁻⁶, <10⁻⁶, <10⁻⁶, and <10⁻⁶ for the BPY2, CDY, DAZ, RBMY, and TSPY gene families, respectively). In contrast, between the two orangutan species, we found a significant difference in copy number only for XKRY (p-value=1.32×10⁻³; see Table S2 for p-values for the other gene families). Thus, significant differences in Y ampliconic gene copy number do exist, even between closely related species.
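The label-permutation test and Bonferroni cutoff described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in the study: the function names, the input values, the default of 10,000 permutations (the text reports one million), the two-sided comparison, and the add-one p-value estimator are all choices made here for illustration.

```python
import random


def permutation_test_mean_diff(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for a difference in mean copy number.

    Species labels are permuted by shuffling the pooled values and splitting
    them into groups of the original sizes.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # permute species labels
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one estimator (an assumption here) keeps the p-value above zero.
    return (hits + 1) / (n_perm + 1)


def bonferroni_cutoff(alpha, n_tests):
    """Per-test significance cutoff after Bonferroni correction."""
    return alpha / n_tests


# Hypothetical copy numbers for one gene family in two sister species;
# these are not data from the study.
species_1 = [48, 50, 47, 49, 51, 46, 52]
species_2 = [18, 17, 19, 20, 16, 18, 17, 19, 18]

p_value = permutation_test_mean_diff(species_1, species_2)
cutoff = bonferroni_cutoff(0.05, 5)  # five gene families tested
print(p_value < cutoff)
```

With small samples, the number of distinct label assignments bounds how small an exact permutation p-value can be, so the number of random permutations mainly controls the precision of the estimate.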
Figure 3-3. Larger Y ampliconic gene families are more variable across great apes. The six scatter plots represent the relationship between median copy number and variance for each of the species, and the species name is present at the top of each plot. The X-axis represents the natural logarithm of median copy number and the Y-axis represents the natural logarithm of variance in copy number. The Spearman correlations were calculated using the cor.test() function in R and the p-values are in parentheses. The black line represents the linear function fitted to the given data points. The dots are color-coded to represent the nine gene families, with missing dots indicating gene family absence in that species.
Evolution of copy number across species. Building upon this observation, we tested whether copy number in Y ampliconic gene families is conserved across great ape species and identified species with significant gain or loss of gene copies. Simple ANOVA performed on all the copy number values showed a significant difference in gene copy number for each gene family analyzed across great apes except for VCY (Table 1). We next tested whether these differences were still observed after taking the phylogenetic relationship among great apes into consideration. For this purpose, we used CAFE (Han, Thomas, Lugo-Martinez, & Hahn, 2013), a tool that implements a stochastic birth-and-death process to model the expansion and contraction of gene family sizes over a phylogeny, and ran it with the Y-chromosome-specific phylogenetic tree (see Materials and Methods) and the median copy number per species for each of the families as input. We performed simulations to validate the use of CAFE for the given dataset (see Supplementary Note 1). CAFE estimated the rate of birth/death (λ) of ampliconic genes as 5.03×10⁻⁵ events per thousand years.
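For the conventional one-way ANOVA mentioned above, the F statistic is the ratio of the between-species mean square to the within-species mean square. The sketch below computes it from first principles; the function name `one_way_anova_f` and the input values are hypothetical, and obtaining a p-value would additionally require the F distribution (for example via `scipy.stats.f_oneway`, which returns both the statistic and the p-value).

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups.

    Each group holds the copy numbers of one gene family in one species.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of species means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individuals around their own species mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical copy numbers for three species (not data from the study).
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(f_stat, df_b, df_w)  # F = 3.0 with 2 and 6 degrees of freedom
```

This test treats species as independent groups, which is why the phylogeny-aware CAFE analysis is used as a follow-up.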
CAFE predicted that two of the nine Y ampliconic gene families tested had a significant expansion or contraction in their size (RBMY, p=0.001; and XKRY, p=0.001; Table 1 and Fig. 4; Bonferroni-corrected p-value cutoff of 0.05/9=0.006; 9 gene families) and two additional gene families had low p-values even if non-significant after correcting for multiple testing (CDY, p=0.018; and TSPY, p=0.009; Table 1 and Fig. 4). For these four gene families, CAFE also provided p-values for each branch of the phylogenetic tree with a significant gain or loss of gene copies when compared to its immediate ancestral node (Bonferroni-corrected p-value cutoff of 0.05/10=0.005; 10 nodes in great ape phylogenetic tree). Three interesting patterns emerged from this analysis (Fig. 4 and Table S3). First, the TSPY gene family, which had consistently high variation in copy number across great apes (Figs. 2-3), had the largest number of branches with significant differences in copy number across the phylogeny. We observed significant lineage-specific reductions in its family size in chimpanzee (from 30 copies inferred in the immediate ancestral node to 18 copies in chimpanzee, p=1.39×10⁻⁹), gorilla (from 21 to six copies, p=5.07×10⁻⁵), and Sumatran orangutan (from 27 to 23 copies, p=1.01×10⁻³), and significant expansions in Bornean orangutan (from 27 to 32 copies; p=2.68×10⁻⁴) and bonobo (from 30 to 48 copies, p=3.89×10⁻⁷).
Table 3-1. Differences in Y ampliconic gene copy numbers across species as evaluated with ANOVA and CAFE. To determine which ampliconic gene families vary in their copy number across great ape species, we performed conventional one-way ANOVA (F-statistic and p-value are shown). To identify significant expansions or contractions of gene family size across great apes, we performed CAFE analysis. Significant p-values (Bonferroni-corrected p-value cutoff of 0.05/9=0.006; nine gene families) are in bold.
Gene family    ANOVA F-value    ANOVA p-value    CAFE p-value
BPY2           80.84            1.26×10⁻²¹       0.345
CDY            178.23           6.52×10⁻²⁹       0.014
DAZ            366.80           7.51×10⁻³⁶       0.331
HSFY           26.74            7.55×10⁻⁹        0.488
PRY            357.70           1.11×10⁻²⁴       0.209
RBMY           162.17           5.08×10⁻²⁸       0.001
TSPY           82.98            7.38×10⁻²²       0.008
VCY*           1.19             0.290            0.187
XKRY           481.47           1.08×10⁻²⁶       0.001
*For VCY, power is limited because we only used the data from two species (chimpanzee and human). This gene family is absent in the other great ape species analyzed (see text for details).
Second, two gene families (CDY and XKRY) showed significant expansions in the branch leading to the two orangutan species, and one of them (XKRY) also exhibited significant differences between the two orangutan species. In the case of CDY, the node connecting the two orangutan species gained a significant number of copies when compared to the common ancestor of great apes (from 15 to 35 copies, p=1.86×10⁻³). In the XKRY gene family, there was also a significant gain in gene copies in the common ancestor of Bornean and Sumatran orangutans when compared to the common ancestor of great apes (from six to 18 copies, p=4.31×10⁻³). Additionally, Bornean orangutan lost gene copies (from 18 to 15 copies, p=2.86×10⁻³) and Sumatran orangutan gained gene copies (from 18 to 22 copies; p=6.46×10⁻⁴) when compared to their common ancestor. And third, intriguingly, in the RBMY gene family, bonobo gained gene copies (from 17 to 29 copies, p=2.02×10⁻⁷), while chimpanzee lost gene copies (from 17 to 11 copies, p=9.61×10⁻⁴), when compared to their common ancestor.
Figure 3-4. Results of CAFE analysis identifying Y ampliconic gene families with significant shifts in gene copy number when compared to their ancestors. For each gene family with a significant difference in copy number, the phylogenetic tree representing the estimated copy number at internal nodes is shown. Significant shifts are highlighted in blue (contraction) and red (expansion). The copy numbers at the internal nodes were predicted by CAFE.
To evaluate the influence of intraspecific variation on our analysis of gene copy number evolution, we ran CAFE while using the data for five randomly subsampled individuals (instead of a single summary value, i.e. median) per species (an approximately star phylogeny among individuals of the same species was assumed, see Materials and Methods). The sample size of five here corresponds to the smallest sample size we have per species (for Sumatran orangutan). This procedure was performed 100 times. We observed that differences in chimpanzee, bonobo and gorilla lineages were still significant most of the time (in 83-100 times out of 100, Table S4). In the case of orangutans, the shift in CDY copy number was supported in 100 out of 100 replicates, and in the TSPY and XKRY gene families, the shift was supported in 12-68 out of 100 replicates (Table S4). Thus, intraspecific variability in the studied samples does not affect the robustness of our results, except for the orangutan-specific observations for the TSPY and XKRY families.
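The subsampling procedure above (five individuals drawn per species, repeated 100 times, counting how often a shift remains supported) can be outlined as follows. This is only a sketch of the resampling loop: the function name `subsample_support` is hypothetical, and the callback `is_shift_supported` is a placeholder for running CAFE on each subsample, which is not reproduced here.

```python
import random


def subsample_support(copies_by_species, is_shift_supported,
                      n_per_species=5, n_replicates=100, seed=0):
    """Count replicates in which a copy number shift remains supported.

    copies_by_species: {species: [copy number of each individual]}
    is_shift_supported: placeholder callback taking one subsample dict and
        returning True/False; in the study this role is played by CAFE.
    """
    rng = random.Random(seed)
    supported = 0
    for _ in range(n_replicates):
        # Draw the same number of individuals from every species, without
        # replacement, so that species with more samples are not favored.
        subsample = {
            species: rng.sample(values, n_per_species)
            for species, values in copies_by_species.items()
        }
        if is_shift_supported(subsample):
            supported += 1
    return supported


# Hypothetical copy numbers and a toy criterion (not data or code from the
# study): the shift counts as supported when the subsample means differ by
# more than five copies.
toy_data = {
    "species_A": [48, 50, 47, 49, 51, 46, 52],
    "species_B": [18, 17, 19, 20, 16, 18, 17, 19, 18],
}


def toy_criterion(subsample):
    mean_a = sum(subsample["species_A"]) / len(subsample["species_A"])
    mean_b = sum(subsample["species_B"]) / len(subsample["species_B"])
    return abs(mean_a - mean_b) > 5


print(subsample_support(toy_data, toy_criterion))
```

Fixing the subsample size at the smallest per-species sample size, as in the text, keeps every species equally represented in each replicate.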
Conservation of Y ampliconic gene expression in great apes
Expression levels of Y ampliconic gene families. To study evolution of Y ampliconic gene expression, we evaluated expression levels for these gene families in great ape testis samples. We assembled complete or partial reference Y ampliconic gene transcripts using RNA-Seq datasets that were either publicly available or generated by us (Dataset S1), and used them to estimate the ampliconic gene expression across great apes (Table S5; see Materials and Methods). Namely, reference transcript data for bonobo were generated using RNA-Seq datasets produced in house, and for Bornean orangutan using a mix of publicly available and in-house RNA-Seq datasets, whereas such data for other species were retrieved from publications (Ardlie et al., 2015; Brawand et al., 2011; Carelli et al., 2016; Fungtammasan et al., 2016; Ruiz-Orera et al., 2015; Tomaszkiewicz et al., 2016). The endangered status of great ape species poses a particular challenge for the collection of tissues from these animals. Because of this hurdle, in this study we were able to include only two to three sampled individuals per species (Fig. 5; we excluded the results for Sumatran orangutan because only one sample was available for this species; see Materials and Methods).
Our analysis of Y ampliconic gene expression levels led to the following observations (Fig. 5). The BPY2, PRY and XKRY gene families had consistently low expression levels across great apes. In contrast, the TSPY and HSFY gene families had comparatively high expression levels across species, and intra- and interspecific variation in expression levels was also high for these two gene families. In comparison, the CDY, DAZ and RBMY gene families had intermediate expression levels and limited intra- and interspecific variation. Surprisingly, in chimpanzee and human, the expression levels for the VCY family were higher than those for the BPY2 and CDY gene families, even though the VCY family was lost in the majority of great apes, whereas the BPY2 and CDY gene families were conserved across great apes (Fig. 1A).
Figure 3-5. Summary of gene expression levels across great apes. In the dot plot, the X-axis represents nine ampliconic gene families and the Y-axis represents their expression levels. The plot represents testis-specific expression of 12 great ape samples. Each dot within a gene family represents expression levels of an individual and the color of the dot denotes the species it belongs to. Missing dots represent gene families that are considered missing or pseudogenized, and their expression levels are excluded from the gene expression analysis (Table S5).
Evolution of gene expression. Using this dataset (Fig. 5, Table S5), we next evaluated whether expression levels of the Y ampliconic gene families were conserved across great ape species. To examine this, we performed phylogenetic ANOVA, which conducts an ANOVA-like test while taking the phylogenetic relationship of great apes into consideration (see Materials and Methods). This test was carried out for five gene families (BPY2, CDY, DAZ, RBMY, and TSPY), which are present in all the great ape species analyzed (Fig. 1A). Phylogenetic ANOVA was performed by applying the EVE model (Rohlfs & Nielsen, 2015) separately to each of the five Y ampliconic gene families, and identified that all five of them had conserved expression, i.e. with no branches experiencing significant speedup or slowdown in expression evolution, across great apes (Table S6). To test the validity of our conclusions given the small sample size and particular gene copy numbers, we performed simulations for different parameters under the EVE model and generated gene expression levels with the sample sizes identical to those in our study (Supplemental Note S2). In 95 out of 100 simulations, we were able to predict conservation of gene expression correctly.
The relationship between copy number and gene expression levels
We studied the relationship between Y ampliconic gene copy number and expression levels across five great ape species (with Sumatran orangutan excluded due to the lack of multiple testis samples). When we analyzed the median copy number of each family from our copy number dataset (Fig. 2) together with the median gene expression levels of these families from our gene expression dataset (Fig. 5), we observed that the correlation between them for the majority of species was positive, consistent with previous results in humans (Vegesna et al., 2019). However, there were also some differences (Fig. 6). In bonobo and chimpanzee, we identified a strong positive correlation between gene copy number and expression levels across gene families (bonobo: Spearman correlation rho=0.9, p=0.083; chimpanzee: rho=0.94, p=0.017). In human and gorilla, we also observed a positive correlation, but it was weaker (human: rho=0.58, p=0.108; gorilla: rho=0.59, p=0.126). In the case of Bornean orangutan, we did not observe a positive correlation (rho=-0.05, p=0.93), and one of the reasons that might explain this finding is the high variation in Y ampliconic gene copy number in this species (Fig. 3).
Next, we studied the relationship between copy number and gene expression separately for each gene family (Fig. 7). Previous studies in humans showed that within a Y ampliconic gene family the variation in gene copy number does not correlate with their gene expression (Vegesna et al., 2019). Here we tested whether the longer divergence times between great ape species enabled gene copy number to influence expression levels. We observed that, across species, there was a strong (but statistically non-significant) positive correlation of gene expression level with gene count for DAZ (rho=0.9, p=0.083) and TSPY (rho=0.9, p=0.083), and a moderate and non-significant correlation for RBMY (rho=0.6, p=0.35). There was no such trend observed for BPY2 (rho=0.2, p=0.783) with all species having similar expression levels except for gorilla, which had comparatively low expression levels. A negative but non-significant relationship was observed for CDY (rho=-0.5, p=0.45), with chimpanzee and human having higher expression levels but fewer gene copies in comparison to gorilla and Bornean orangutan. In general, we might be lacking power to detect significant associations between gene expression levels and copy number of Y ampliconic genes because of the small sample size. However, we did observe a trend of copy number influencing gene expression levels in three out of five gene families tested.
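The limited power noted above follows directly from the number of data points: with five species, even a near-perfect rank correlation cannot reach a small p-value. The sketch below computes Spearman's rho and an exact two-sided p-value by enumerating all rank permutations. It is an illustration with hypothetical inputs and function names, and it assumes an exact test; whether the cor.test() calls in the original analysis used exact or asymptotic p-values is not stated in the text.

```python
from itertools import permutations


def ranks(values):
    """Average ranks (1-based), so that tied values share a rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_exact(x, y):
    """Spearman rho and exact two-sided p-value (feasible only for small n)."""
    rank_x, rank_y = ranks(x), ranks(y)
    rho = pearson(rank_x, rank_y)
    total = hits = 0
    for permuted in permutations(rank_y):
        total += 1
        if abs(pearson(rank_x, permuted)) >= abs(rho) - 1e-12:
            hits += 1
    return rho, hits / total


# Hypothetical median copy numbers and expression levels for five species;
# only one adjacent pair of ranks is swapped (not data from the study).
copy_number = [6, 18, 30, 32, 48]
expression = [1.0, 2.0, 3.0, 5.0, 4.0]
rho, p_value = spearman_exact(copy_number, expression)
print(round(rho, 2), round(p_value, 3))  # 0.9 0.083
```

With five points there are only 120 possible rank orderings, so a rho of 0.9 corresponds to 10 of 120 orderings at least as extreme in absolute value, and only a perfect monotonic relationship (2 of 120, p≈0.017) falls below 0.05.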
Figure 3-6. Relationship between copy number and gene expression of Y ampliconic gene families in great ape species. The five scatter plots represent the relationship between expression and copy number for each of the five species; the name of the species is present at the top of each plot. In each of the scatter plots the X-axis represents the natural logarithm of median copy number and the Y-axis represents the natural logarithm of median gene expression. The Spearman correlations were calculated using the cor.test() function in R and the p-values are in parentheses. The black line is the linear function fitted to the given data points. The dots are color-coded to represent the nine gene families, with missing dots corresponding to the gene families that are pseudogenized, deleted, or not expressed in that species.
Figure 3-7. Relationship between copy number and gene expression across species. In each of the scatter plots the X-axis represents the natural logarithm of median copy number and the Y-axis represents the natural logarithm of median gene expression. The Spearman correlations were calculated using the cor.test() function in R and the p-values are in parentheses. The black line represents the linear function fitted to the given data points. The dots are color-coded to represent the five species. The five scatter plots represent the relationship between expression and copy number for each of the five gene families, with the name of the gene family present at the top of each plot. Only the gene families that are present in all species are shown here.
Y ampliconic gene copy number variation and phenotypes related to sperm competition
We next tested whether the differences observed in copy number were linked to presence/absence or degree of sperm competition in great apes. Variation in the overall copy number of all Y ampliconic genes in different great ape species suggested a possible association: monoandrous gorillas had the lowest copy number and its variance, whereas great apes with polyandrous or dispersed mating patterns (e.g., chimpanzees, bonobos, and orangutans) had higher copy number and variance (Table S1). We examined a potential association between copy number variation in four Y ampliconic gene families showing significant differences among species (CDY, RBMY, XKRY, and TSPY) and four different phenotypes related to sperm competition: residual testis weight (Dixson & Anderson, 2004), sperm midpiece volume (Anderson, Nyholt, & Dixson, 2005), sperm concentration (Møller, 1988), and sperm motility (Møller, 1988). While the small number of species examined precluded us from examining associations statistically, the overall trends could be examined. Variation in the four sperm phenotypes was in line with expectations related to the degree of sperm competition in the studied species (Fig. S4). Among the four gene families examined (Fig. S5), RBMY and TSPY exhibited copy number variation that might be associated with sperm competition. It has been demonstrated that RBMY copy number is positively associated with sperm motility in humans (Yan et al., 2017) (although this finding was recently disputed (Shi, Louzada, et al., 2019)). Across great apes, we observed that bonobo, the species with the highest sperm motility, concentration, and volume, had the highest copy number (and its variance) of RBMY genes (Figs. S3-S5). In contrast, orangutans, with the lowest sperm concentration and volume and one of the lowest levels of sperm motility, had the lowest copy number and its variance of RBMY genes (Figs. S3-S5). Both chimpanzees and gorillas, with relatively low sperm motility, also displayed low RBMY copy numbers. Thus, RBMY copy number variation might be associated with sperm motility, a sperm-competition-related phenotype, in great apes. TSPY copy number was shown to be positively associated with sperm concentration, one of the sperm-competition-related phenotypes, in humans (Giachini et al., 2009). Though a clear association with sperm concentration was not detected, we found that TSPY had low copy number (and its variance) in gorilla, the species with minimal sperm competition, and higher copy number in the other species studied (Fig. S3), which have higher levels of sperm competition. Importantly, in bonobo, the species with the highest sperm concentration, we observed the highest number of TSPY gene copies (Figs. S4-S5). Thus, TSPY copy number variation might be associated with sperm competition in great apes.
Discussion
To study evolution of multi-copy gene families on the Y chromosome in the extant great apes, we analyzed copy number and expression levels of Y ampliconic genes across six great ape species. For the first time, we estimated variation in copy number of Y ampliconic gene families in a large sample of bonobos (previously only two individuals were compared (Oetjens et al., 2016)) and in two species of orangutan. We also generated testis expression data for Y ampliconic gene families for bonobo and Bornean orangutans, thus providing unique datasets that were previously missing from publicly available resources. Combining these new and previously published data, we investigated the evolution of Y ampliconic gene copy number and of their expression levels, as well as the evolutionary relationship between them, and examined whether the observed variation in copy number could explain phenotypes related to sperm competition and mating structure. Some of our tests showed strong trends but lacked statistical significance, likely because they were underpowered due to a small sample size. Increasing the sample size for this type of study is currently extremely challenging because of the limited availability of samples from endangered species.
Loss of some Y ampliconic gene families in all non-human great ape species examined. Combining data from this and other studies (Cortez et al., 2014; Hughes et al., 2010; Oetjens et al., 2016; Skaletsky et al., 2003; Tomaszkiewicz et al., 2016), we have demonstrated that, compared with human, all other great ape species examined lack one (VCY missing in gorilla and both species of orangutan) or several (HSFY, PRY, and XKRY pseudogenized in bonobo and chimpanzee) Y ampliconic gene families (Fig. 1A). We discovered that gene families with low copy number in some species are frequently lost or pseudogenized in other species. For instance, we found that PRY and XKRY, which were pseudogenized in bonobo and chimpanzee, have low copy number in other great apes examined. Similarly, a recent study from our group showed that Y ampliconic gene families with low copy numbers in humans (Vegesna et al., 2019) were pseudogenized or lost in some great ape species (HSFY, PRY, VCY, and XKRY) due to the lack of recombination, Muller’s ratchet, and/or non-allelic homologous recombination (NAHR) on the Y. Regardless of the mechanism, it is a fact that not all human Y ampliconic gene families are essential in all great ape species.
Our study indicates that low expression levels could also be an important predictor of a gene family's non-essentiality on the Y. Consistent with this hypothesis, the lowly expressed PRY and XKRY gene families were pseudogenized in chimpanzee and bonobo. However, gene families such as VCY and HSFY, which were also lost in several great ape species, had relatively high expression levels. VCY and its homolog VCX, which is present on the X chromosome, are highly similar in sequence (at least in humans), and thus the loss of VCY in some species could be compensated by VCX (Vegesna et al., 2019). HSFY could have undergone neofunctionalization in humans (Vegesna et al., 2019). Therefore, in other great apes that lost HSFY, its homologs could also compensate for its ancestral function. Expression of paralogous genes present on other chromosomes should be studied together with expression of Y ampliconic gene families in future studies.
Y ampliconic gene families: Life on palindromes and tandem repeats. We observed a positive relationship between copy number and its variance for Y ampliconic gene families in great apes. This finding echoes recent studies in humans (Vegesna et al., 2019; Ye et al., 2018) and points toward a similar organization of these gene families in repeats across great ape species and in their common ancestor. In human and chimpanzee, most Y ampliconic gene families (except for TSPY, which is organized as a tandem repeat) are located in palindromes (Hughes et al., 2010; Skaletsky et al., 2003). Palindromes also exist on the gorilla Y chromosome (Tomaszkiewicz et al., 2016). The high copy number variation in ampliconic genes in bonobo and orangutans found here suggests that their Y chromosomes also have a repetitive structure, in agreement with cytogenetic findings (Gläser et al., 1998).
The presence of palindromes on the Y chromosome in great apes enables frequent rearrangements via NAHR and gene conversion among different palindrome arms, leading to the observed high variation within species. Gene conversion leads to conservation of gene sequences and their rescue from accumulation of deleterious mutations. A study of palindrome P8 in healthy humans showed diverse palindromic structures carrying from one to four copies of VCY (Shi, Massaia, et al., 2019). Large-scale chromosomal rearrangements also contribute to the dynamic copy number evolution of Y ampliconic gene families across great apes, as shown by previous cytogenetic analyses (Gläser et al., 1998; Repping et al., 2006; Shi, Louzada, et al., 2019), as well as by an example in the next paragraph.
Interesting patterns were observed for median copy numbers for each of the five Y ampliconic gene families present in bonobo and chimpanzee (Fig. 3). For low-copy-number families, we found a total of six gene copies (one BPY2, three CDY, and two DAZ gene copies) in bonobo. For the same gene families in chimpanzee, six gene copies (one BPY2, three CDY, and two DAZ gene copies) are present on three palindromes of the short arm and five gene copies (one BPY2, two CDY, and two DAZ genes) are located on the three palindromes of the long arm (Hughes et al., 2010). The differences in copy number between bonobo and chimpanzee are consistent with a deletion of the three palindromes bearing five gene copies on the long Y arm in the bonobo lineage after its divergence from the bonobo-chimpanzee common ancestor. This hypothesis is strengthened by the results of a cytogenetic study that mapped the CDY and DAZ gene families to the short arm of the bonobo Y, but to both short and long arms of the chimpanzee Y (Schaller et al., 2010). This pattern was in contrast to that observed for high-copy-number gene families, TSPY and RBMY (Fig. 3). We found that bonobo Y had approximately three times more TSPY and RBMY gene copies than the chimpanzee Y (Table S1). Consistent with this finding, a cytogenetic study reported high amplification of RBMY and TSPY gene families via segmental duplications of a large euchromatic segment in bonobo (Gläser et al., 1998).
Evolutionary forces affecting copy number variation among great apes. Our test of conservation of gene copy number across great ape species indicated significant differences in copy numbers of CDY, RBMY, TSPY, and XKRY. Chimpanzee had lost, whereas bonobo had gained, TSPY and RBMY gene copies when compared to their common ancestor. In the case of gorilla, we observed a significant loss of TSPY gene copies. Additionally, interspecific differences in copy number in our dataset were correlated with interspecific differences in copy number variance (Fig. 3). What can explain the differences observed among species?
Whereas a detailed analysis is outside the scope of our study, we can speculate about the evolutionary forces driving copy number variation of Y ampliconic gene families in great apes. If a certain range of copy number were beneficial, then directional selection would limit variation in copy number even for large gene families. We did not observe such a pattern. In contrast, if variability in copy numbers were beneficial, then diversifying selection would enhance variation in copy number even for small gene families. This pattern was also not found. Instead we observed that variance in copy number was approximately proportional to the size of gene family (Fig. 3), suggesting that larger families have more opportunities for rearrangements resulting in high copy number variability. Thus, our data overall are consistent with random genetic drift resulting from frequent copy number changes of repetitive regions being the major driver of Y ampliconic gene families’ evolution. A similar conclusion was reached when a larger dataset of humans representing multiple haplogroups was studied (Ye et al., 2018). Nevertheless, selection might be operating within certain great ape species, or at particular gene families. Note that high variability observed for copious gene families might mask signatures of diversifying selection, and vice versa, low variability observed for gene families with low copy number might mask signatures of directional selection.
Y ampliconic gene copy number and sperm competition and morphology. Sperm competition could be the selective force behind Y ampliconic gene number evolution. We examined a potential link between variation of Y ampliconic genes and levels of sperm competition across great apes using four sperm phenotypes positively associated with sperm competition—residual testis weight, sperm midpiece volume, sperm concentration, and sperm motility. The higher the number of copulations a female has with males with large testes, the larger is the sperm midpiece, where the mitochondria reside and fuel sperm motility (Dixson & Anderson, 2004; Møller, 1988). Also, polyandrous primates evolved to produce more sperm that swim faster, e.g. chimpanzees produce 223 times more sperm than gorilla and 14 times more sperm than orangutan (Fujii-Hanamoto, Matsubayashi, Nakano, Kusunoki, & Enomoto, 2011; Nascimento et al., 2008).
Our results suggest that sperm competition might contribute to determining TSPY copy number evolution across great apes. TSPY exhibited copy number variation that might be associated with sperm competition: its copy number was the lowest in gorilla (which has minimal sperm competition), and was highest in bonobo and chimpanzee, which experience high levels of sperm competition; the copy number was intermediate in human (orangutan was an outlier). We also hypothesize that potential selection for sperm motility, a sperm-related phenotype, is important for shaping RBMY copy number variation, as its copy number variation followed a similar trend to the one observed for TSPY (Figs. S4-S5). Four of the five ampliconic gene families shared among great apes (Fig. 1A), namely BPY2, CDY, DAZ, and RBMY, are located within two azoospermia factor regions (AZFb and AZFc). Mutations or deletions of AZFs result in spermatogenic failure phenotypes, including abnormal sperm morphology, in humans (Choi et al., 2007; de Vries et al., 2001; Gläser et al., 1998; Lu et al., 2014; Vogt et al., 1996; Yan et al., 2017). Thus, copy number of these four gene families might contribute to determining sperm morphology. Consistent with this hypothesis, gorillas and humans have the lowest copy number for these gene families (Table S7) and the highest rate of sperm abnormalities (Seuánez, 1980), whereas both species of orangutans have the highest copy number (Table S7) and the lowest rate of sperm pleomorphism (Seuánez, 1980).
Y ampliconic gene copy number and mating patterns in great apes. In species with female dominance, such as bonobo, dominant females pick their mates and enable their male progeny to access other females, which benefits spreading the genes on the X chromosome rather than the Y chromosome. In contrast, in species with male dominance, such as chimpanzee, males have to maintain the hierarchy, so there is more pressure to conserve the Y chromosome. We hypothesize that the reduced burden on the Y chromosome in species with female dominance results in a smaller and less conserved Y ampliconic region than in species with male dominance. If our prediction is correct, then we expect the chimpanzee Y chromosome to retain more gene families and to be more conserved, i.e. have lower variance in copy number. Though the number of Y ampliconic gene families was the same for chimpanzee and bonobo (Fig. 1A), chimpanzee exhibited lower variance in their overall copy number than bonobo (Table S1), consistent with the predictions resulting from male- vs. female-dominant social structures. However, this result might be affected by the fact that only one chimpanzee subspecies, western chimpanzee, known to have low diversity on the Y (Hallast et al., 2016), was included in the present study.
In bonobo, from the five gene families studied, we observed an increase in copy number in two gene families (TSPY and RBMY) and a decrease in gene copy number in the other three gene families (BPY2, CDY, and DAZ). The loss of gene copies in BPY2, CDY, and DAZ could have been facilitated by the low selective pressure to maintain these genes on the bonobo Y due to female dominance. The lower selective pressure on the Y could have enabled the loss of gene copies in these three families despite their higher gene count in their closest relative, chimpanzee (Table S1).
In the case of the two orangutan species studied, they have diverged more recently and their mating and social patterns remain similar (Delgado & Van Schaik, 2000). In fact, hybrids between Sumatran and Bornean orangutans are fertile (Gläser et al., 1998). As a result, though we observed differences in copy number between the two species for the TSPY and XKRY gene families (Fig. 4), these differences were unstable when we considered variation among individuals (and not a median value, Table S4). High variation in copy number in orangutans (Figs. 3 and S3) could be explained by the fact that their common ancestor experienced a significant gain in gene copy number for several gene families (CDY and XKRY; Fig. 4), and thus there are more opportunities for different copy numbers to be created and be subjected to random genetic drift.
Evolution of gene expression, and the relationship between gene expression and copy number. Our analysis suggests that the Y ampliconic gene families present in all great ape species studied exhibit limited interspecific variation in gene expression, and this variation was not significant with our EVE model analysis, either due to the small sample size or reflecting overall conserved expression levels for these gene families. A larger study is needed to distinguish between these two possibilities. If interspecific variation is indicated in such a study, then it would echo an earlier study, in which genome-wide variation in gene expression across different tissues in human and chimpanzee was investigated, and genes with high intraspecific variation were found to exhibit high interspecific variation, particularly in testis (Khaitovich, Enard, Lachmann, & Pääbo, 2006; Khaitovich et al., 2005). If, in contrast, conserved expression levels are confirmed, then DNA methylation might play an important role in dosage regulation of duplicated genes (Chang & Liao, 2012), and future studies should investigate the upstream regions of Y ampliconic genes for DNA methylation patterns.
In humans, testis tissue tolerates high variation in gene expression levels and undergoes dosage regulation to maintain overall conserved gene expression in the presence of gene copy number variation (Vegesna et al., 2019). However, across species, we observed a mixed pattern in the relationship between gene expression levels and copy number: DAZ, RBMY, and TSPY showed a positive correlation, BPY2 displayed no association, and CDY had a negative correlation (Fig. 7). These results imply that, given enough evolutionary time, copy number could influence gene expression levels in some ampliconic gene families.
We observed a positive correlation between copy number and gene expression across gene families in each great ape species except Bornean orangutan. This correlation was stronger in bonobo and chimpanzee, species experiencing high levels of sperm competition, suggesting that in these species it can be particularly important biologically. In general, evolution of copy number and gene expression, although related to each other, might follow different time scales. Our results suggest that evolution of copy number is faster than evolution of gene expression. Rapid, back-and-forth changes in copy number for Y ampliconic genes eventually influence the direction in which gene expression levels shift over longer periods of time. Future studies should decipher isoform sequences of Y ampliconic genes, as well as of their X-chromosomal and autosomal counterparts, and analyze their differential expression, thereby examining the dynamic evolution of male fertility genes in greater detail.
It is important to note that factors such as age and cellular composition of the testis tissues could influence the estimated expression levels of the Y ampliconic gene families. These factors have to be addressed in future studies. Also, further refinements of reference transcriptomes for each species will aid in obtaining more accurate estimates of expression levels.
Y ampliconic genes and conservation genetics. Copy number variation of Y ampliconic gene families can be used to trace male dispersal and could ultimately aid conservation efforts targeted at preserving endangered great ape species. Indeed, Y chromosome markers provide a potential tool for evaluating historical patterns of male migration in great apes, similar to approaches applied to humans (Hammer et al., 1998; Underhill et al., 2000). Differential dispersal of males vs. females influences population structure at local scales and across landscapes (Underhill & Kivisild, 2007), and becomes critically relevant to reserve and corridor design for long-term conservation of wild populations of great apes. Studies using uniparentally-inherited sex-specific markers can assist in the identification of genetically distinct populations or even species, which should be subsequently preserved as reservoirs of genetic diversity. For example, a recent study of evolutionary history of orangutans using sequences of mtDNA and Y chromosome allowed the identification of a new orangutan species, Pongo tapanuliensis (Nater et al., 2017). Here we presented the first study exploring variation in copy number and expression of Y ampliconic genes across most great ape species (only omitting Tapanuli orangutan, which was recently discovered). To evaluate copy number of Y ampliconic genes, we used ddPCR assays, which were previously demonstrated to be highly accurate and reproducible (C. M. Hindson et al., 2013; Taylor, Laperriere, & Germain, 2017; Vegesna et al., 2019). We presented ampliconic gene copy number variation in bonobos and orangutans for the first time, and showed that orangutans have the highest copy number and the highest variation in copy number across great apes. We observed significant differences in copy number in four out of nine Y ampliconic gene families. 
To obtain the gene expression dataset, we assembled transcripts and estimated expression levels of Y ampliconic genes using publicly available and in-house-generated testis-specific RNA-Seq datasets. The analysis of this dataset indicated conserved evolution, i.e. none of the Y ampliconic gene families had significant shifts in their expression levels between species, despite substantial and significant variation in their copy number. We observed a positive correlation between copy number and expression levels for the DAZ, RBMY, and TSPY gene families, in contrast to the results in human, where such correlation was not observed for any Y ampliconic gene family (Vegesna et al., 2019). Thus, copy number can influence gene expression given sufficient evolutionary time. We showed that variation in copy number for the TSPY and RBMY families might be associated with interspecific differences in sperm competition across great apes. More specifically, we observed that bonobo, a species with the highest sperm competition, concentration, and motility, had the highest copy number and variance of copy number of the RBMY and TSPY gene families, which were shown to be associated with human sperm motility and concentration, respectively, in other studies (Giachini et al., 2009; Yan et al., 2017). Our results have important implications for understanding Y chromosome evolution in endangered great apes and should aid conservation efforts aimed at restoring their genetic diversity.
Materials and Methods
DNA samples
DNA samples (Table S8) from seven bonobos, six Bornean orangutans, and four Sumatran orangutans, as well as blood samples from an additional Bornean orangutan (KB5405) and an additional Sumatran orangutan (KB5565) were provided by the San Diego Zoological Society. We extracted DNA from the latter two samples using the DNeasy Blood and Tissue Kit (Qiagen). DNA samples from nine western chimpanzees were provided by Mark Shriver at Pennsylvania State University.
ddPCR assays for ampliconic gene copy number estimation in bonobo and orangutan
The PCR protocol and primers for EvaGreen-based ddPCR assays were designed according to the parameters specified in (Tomaszkiewicz et al., 2016), using great ape species-specific sequences of a two-copy RPP30 and a single-copy SRY as references. BWA-MEM alignments (version 0.7.10) (H. Li, 2013) of raw Illumina reads from several male orangutan and bonobo datasets (RNA-Seq datasets present in Table S9, whole-genome-sequence datasets listed below) to the reference gene sequences were visualized in Integrative Genomics Viewer (IGV) (version 2.3.72) (Thorvaldsdóttir,
Robinson, & Mesirov, 2013) and consensus sequences were retrieved in order to design the primers. The primers for evaluating the copy number of BPY2, CDY, DAZ, HSFY, PRY, RBMY, and TSPY gene families in Bornean and Sumatran orangutans were designed in the protein-coding regions using the gene sequences previously published for Sumatran orangutan (Cortez et al., 2014) and RNA-seq datasets generated in-house for the Bornean orangutan (Dataset S2 and see below). Primers for orangutan XKRY (Dataset S2) were designed using the published Sumatran orangutan whole-genome sequencing dataset (SRR10393305, without gene annotation) and their male-specific presence in genomic DNA of both orangutan species was confirmed by PCR. Primers for SRY were designed using a previously published Sumatran orangutan gene sequence (Cortez et al., 2014), and primers for RPP30 were designed using the Sumatran orangutan reference female genome (ponAbe3). To estimate the copy number of the BPY2, CDY, DAZ, RBMY, and TSPY gene families in bonobo and chimpanzee, we used previously published primers for gorilla and human (Tomaszkiewicz et al., 2016) that identically matched the chimpanzee Y-specific sequence, and the publicly available whole-genome male bonobo dataset (SRR740905). The VCY and SRY primers for chimpanzees were designed using the chimpanzee Y chromosome reference sequence (Hughes et al., 2010). The SRY primers for bonobo were designed using the whole-genome male bonobo dataset generated in house (Dataset S2). The primers for RPP30 were designed using the chimpanzee (panTro6) and bonobo (panpan1.1) reference female genomes (Dataset S2). To complete our copy number dataset, we also used the previously published (by our group) copy number data from 14 wild-born gorillas (Tomaszkiewicz et al., 2016) and 10 humans with African Y haplogroup (E) (Ye et al. 2018). Each sample was run in at least three replicates (Dataset S3) and the mean value was calculated across the replicates (Table S10).
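The replicate averaging described above can be illustrated with a minimal sketch. It assumes the common ratio-based ddPCR calculation (target concentration divided by reference concentration, scaled by the reference copy number: two for RPP30, one for SRY); the source does not spell out this formula, and the function names and concentrations below are hypothetical.

```python
from statistics import mean


def copy_number(target_conc, reference_conc, reference_copies):
    """Ratio-based copy number estimate (an assumed, standard ddPCR formula).

    target_conc and reference_conc are droplet-derived concentrations in the
    same units; reference_copies is 2 for RPP30 or 1 for SRY.
    """
    if reference_conc <= 0:
        raise ValueError("reference concentration must be positive")
    return reference_copies * target_conc / reference_conc


def mean_copy_number(replicates, reference_copies):
    """Average the per-replicate estimates, as done across >= 3 replicates."""
    return mean(copy_number(t, r, reference_copies) for t, r in replicates)


# Hypothetical (target, reference) concentrations for three replicates,
# using single-copy SRY as the reference. Illustrative values only.
replicates = [(600.0, 100.0), (630.0, 105.0), (570.0, 95.0)]
sry_based = mean_copy_number(replicates, reference_copies=1)
```

With a two-copy reference such as RPP30, the same ratio would be multiplied by two, so both references should give comparable estimates for the same sample.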
Construction of Y-specific great ape phylogenetic trees
Hallast and colleagues identified 54,611 positions with intra- and interspecific single-nucleotide variants by comparing a 750,616-bp region across the Y chromosomes of great apes (Hallast et al., 2016). From this dataset, we picked sequences of one individual per species (uniformly at random) and generated a distance matrix (pairwise nucleotide differences) using MEGA7 (Kumar, Stecher, & Tamura, 2016). Using the mutation rate on the Y chromosome of humans (8.88×10⁻¹⁰ mutations per position per year (Helgason et al., 2015)) as a constant for all great apes, and the number of nucleotide differences
among species obtained from the distance matrix above, we estimated the time in years since the most recent common ancestor (TMRCA), where time is defined as (Walsh, 2001)
$$\mathrm{TMRCA} = \frac{1}{2\,(\text{mutation rate})} \times \log_e\left(\frac{\text{length of sequences}}{\text{no. of matches}}\right) \tag{1}$$
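As a worked illustration of Equation (1), the sketch below evaluates the TMRCA for a single pair of sequences. The mutation rate and region length are taken from the text; the number of pairwise differences is a hypothetical value chosen only for illustration, and the function name is an assumption.

```python
import math


def tmrca_years(seq_length, n_matches, mutation_rate=8.88e-10):
    """Equation (1): TMRCA = ln(length / matches) / (2 * mutation rate)."""
    if not 0 < n_matches <= seq_length:
        raise ValueError("n_matches must be in (0, seq_length]")
    return math.log(seq_length / n_matches) / (2.0 * mutation_rate)


# Region length from the text; the difference count is hypothetical.
length = 750_616
differences = 3_000
estimate = tmrca_years(length, length - differences)
print(f"Estimated TMRCA: {estimate:,.0f} years")
```

Identical sequences (no differences) give a TMRCA of zero, and the estimate grows with the fraction of mismatched positions.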
MEGA7 was used to estimate an unrooted maximum likelihood tree from the same set of sequences employed to compute TMRCA. We generated two separate trees, the first one including all great ape species analyzed (bonobo, chimpanzee, human, gorilla, Sumatran orangutan, and Bornean orangutan) and the second one excluding Sumatran orangutan. The second tree was used in gene expression analysis, in which we lacked expression data from multiple Sumatran orangutan individuals. Both trees were converted to rooted trees using reroot() function from ape package in R (Paradis, Claude, & Strimmer, 2004). As a final step, we recalibrated the trees to represent the branch lengths as TMRCA (in thousands of years) using the chronos() function in the ape package (Paradis et al., 2004). We used the makeChronosCalib() function to set the TMRCA for each common ancestor node of great apes and passed it as a parameter to the chronos() function, which enforces a molecular clock. Finally, we encoded the trees in Newick format for downstream analysis. The two Newick-formatted trees were ((((Bonobo:2432,Chimp:2432):5641,Human:8073):4633,Gorilla:12706):15346.37,(Borangutan:578,Sorangutan:578):27474.37) and ((((Bonobo:2432,Chimp:2432):5641,Human:8073):4633,Gorilla:12706):15351.43,Borangutan:28057.43).
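Because the calibration enforces a molecular clock, every tip in each tree should lie at the same distance from the root. The following sketch checks this for the first tree using a small hand-written Newick reader; it is an illustrative check written for this purpose, not part of the original pipeline, and it handles only the simple label:length syntax used here.

```python
def parse(s, i=0):
    """Parse the subtree starting at s[i]; return ((name, length, children), next_i)."""
    children = []
    if s[i] == "(":
        i += 1
        while True:
            child, i = parse(s, i)
            children.append(child)
            if s[i] == ",":
                i += 1
            elif s[i] == ")":
                i += 1
                break
            else:
                raise ValueError(f"unexpected character at position {i}")
    j = i
    while j < len(s) and s[j] not in ",():":
        j += 1
    name, i = s[i:j], j
    length = 0.0  # the root carries no branch length
    if i < len(s) and s[i] == ":":
        j = i + 1
        while j < len(s) and s[j] not in ",()":
            j += 1
        length, i = float(s[i + 1:j]), j
    return (name, length, children), i


def tip_depths(newick):
    """Return {tip name: root-to-tip distance} for a Newick string."""
    s = "".join(newick.split()).rstrip(";")
    root, _ = parse(s)
    depths = {}

    def walk(node, depth):
        name, length, children = node
        depth += length
        if not children:
            depths[name] = depth
        for child in children:
            walk(child, depth)

    walk(root, 0.0)
    return depths


TREE_ALL = ("((((Bonobo:2432,Chimp:2432):5641,Human:8073):4633,Gorilla:12706)"
            ":15346.37,(Borangutan:578,Sorangutan:578):27474.37)")
depths = tip_depths(TREE_ALL)
is_ultrametric = max(depths.values()) - min(depths.values()) < 1e-6
```

The same function can be applied to the second tree to confirm that dropping the Sumatran orangutan leaves the remaining tips equidistant from the root.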
Analysis of conservation in copy number across great ape species
We used CAFE v4.2 (Computational Analysis of gene Family Evolution) (Han et al., 2013) to study the evolution of ampliconic gene family size. CAFE uses a birth-and-death stochastic process to model gene gain or loss along each lineage of a given phylogenetic tree (De Bie, Cristianini, Demuth, & Hahn, 2006). For each of the nine ampliconic gene families, the median gene copy number in each great ape species and the first phylogenetic tree from the previous section were provided as input to CAFE. Using maximum likelihood, CAFE estimated the rate parameter λ (rate of gene birth and death), and gene copy numbers of each gene family at the internal nodes of the phylogenetic tree. Based on these maximum likelihood estimates, CAFE computed gene-family-specific p-values for tests of significant gain or loss of copy number in individual gene families in any particular extant and ancestral great ape species. For each gene family with a significant gene-family-specific p-value (significance cutoff of 0.05 was used), CAFE also provided a p-value for every branch on the phylogenetic tree, which indicates the significance of the shift in gene family copy number along the branch. Based on these branch-specific p-values, we identified gene families that have undergone expansions or contractions along branches on the phylogenetic tree (Bonferroni-corrected p-value cutoff of 0.05/10=0.005; 10 nodes in great ape phylogenetic tree).
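The branch-level cutoff above is a plain Bonferroni correction. A minimal sketch of how such a cutoff could be applied is shown below; the branch labels and p-values are hypothetical placeholders rather than CAFE output, and the function name is an assumption.

```python
def significant_branches(branch_p_values, alpha=0.05, n_tests=10):
    """Return the branches whose p-value falls below the Bonferroni cutoff."""
    cutoff = alpha / n_tests  # 0.05 / 10 = 0.005, as in the text
    return [name for name, p in branch_p_values.items() if p < cutoff]


# Hypothetical branch-specific p-values for one gene family.
example_p_values = {
    "branch_1": 0.0004,
    "branch_2": 0.0200,
    "branch_3": 0.0049,
    "branch_4": 0.6100,
}
hits = significant_branches(example_p_values)
```

Note that a branch with p = 0.02 would pass an uncorrected 0.05 threshold but is not reported after correction.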
RNA-Seq datasets
Testis-specific RNA-Seq datasets were obtained for human, bonobo, chimpanzee, gorilla, Bornean orangutan, and Sumatran orangutan (Table S9). The datasets were either publicly available (Ardlie et al., 2015; Brawand et al., 2011; Carelli et al., 2016; Fungtammasan et al., 2016; Ruiz-Orera et al., 2015; Tomaszkiewicz et al., 2016) or generated in-house. The public RNA-Seq datasets included sequences of strand-specific, paired-end libraries (SRR2040590 and SRR2040591 for chimpanzee, SRR2176206 and SRR2176207 for Bornean orangutan, SRR10393299-SRR10393304 for Sumatran orangutan, SRR3053573 and SRR10393358 for gorilla, and SRR1090722 and SRR1077753 for human) and of unstranded libraries (SRR306837 for bonobo, SRR306825 for chimpanzee, and SRR306810 for gorilla) (Ardlie et al., 2015; Brawand et al., 2011; Carelli et al., 2016; Ruiz-Orera et al., 2015). Testis-specific expression data for three humans with African Y-chromosome haplogroup E (SRR817512, SRR1100440, SRR1102852) were retrieved from the GTEx project (Ardlie et al., 2015). An additional bonobo RNA-Seq library was generated from total RNA extracted from the bonobo testis sample (individual ID 5013, from San Diego Zoological Society) with the RNeasy Mini Kit (Qiagen) and subsequently treated with DNase I (Ambion). Ribosomal RNA was depleted with the RiboZero Gold rRNA removal kit (Epicentre). The cDNA library was generated with the RNA ScriptSeq v2 RNA-Seq library preparation kit (Epicentre) and quantified with Qubit (Life Technologies) and Bioanalyzer (Agilent 2100). RNA sequencing was carried out on MiSeq using a 151-bp paired-end sequencing protocol (approximately 100 million reads were generated). Raw sequence data were deposited in the NCBI Sequence Read Archive under accession numbers SRR10392519-SRR10392521. An additional Bornean orangutan RNA-Seq dataset was generated as follows. RNA was extracted from testis tissue from a sample provided by San Diego Zoological Society (individual ID 3405).
As in (Ardlie et al., 2015; Brawand et al., 2011; Carelli et al., 2016; Fungtammasan et al., 2016; Ruiz-Orera et al., 2015; Tomaszkiewicz et al., 2016), RNA was extracted, its integrity was verified with Bioanalyzer, and libraries prepared with the TruSeq RNA Sample Prep kit (Illumina) were sequenced on HiSeq2500. Raw sequence data were deposited in the NCBI Sequence Read Archive under accession numbers SRR10392514-SRR10392519.
Female liver RNA-Seq datasets were obtained from publicly available datasets (SRR306835 for bonobo, SRR306823 for chimpanzee, SRR306808 for gorilla, SRR306798 for orangutan, and SRR1071668 for human) (Ardlie et al., 2015; Brawand et al., 2011; Carithers et al., 2015). They were used to filter out female transcripts during transcriptome assembly.
Transcriptome assembly of Y ampliconic genes in great apes
The reference genomes for great apes, namely gorGor5 (Gordon et al., 2016), panPan1 (Prüfer et al., 2012), panTro5 (Chimpanzee Sequencing and Analysis Consortium, 2005), PonAbe2 (Locke et al., 2011), and hg38, were downloaded from the UCSC Genome Browser (Kent et al., 2002). The transcriptome assembly pipeline was adapted from (Tomaszkiewicz et al., 2016). The RNA-Seq reads (from the previous section) were first checked for the presence of TruSeq adapters, and then we removed the adapters and low-quality regions using Trimmomatic[v0.36] (Bolger, Lohse, & Usadel, 2014). For each great ape species, the testis RNA-Seq reads were first mapped to the respective female reference genome (reference genome excluding the Y chromosome in case of human and chimpanzee) with Tophat2[v2.1.1] (Kim et al., 2013), and the unmapped reads (enriched for male-specific transcripts) were assembled with Trinity[v2.4.0] (Grabherr et al., 2011; Haas et al., 2013) and SOAPdenovo-Trans[v1.03] (Xie et al., 2014) with k-mer size of 25 bp. Other parameters were set based on read length and insert size required for each particular RNA-Seq dataset. The resulting contigs were aligned to the respective female reference genomes with BLAT[v36x2] (Kent, 2002), and contigs that aligned at >90% of their length with 100% identity were excluded from subsequent steps. Next, we aligned female liver RNA-Seq reads to the filtered contigs using Bowtie[v1.1.2] (Langmead, 2010) and removed contigs that were covered at over 90% of their length by mapped female liver RNA-Seq reads. The coverage information of the contigs was obtained using BEDTools[v2.26.0-87-g6f9c61f] (Quinlan & Hall, 2010). We combined the contigs from both the Trinity and SOAPdenovo-Trans assemblers and used CD-HIT[v4.7] (Fu, Niu, Zhu, Wu, & Li, 2012; W. Li & Godzik, 2006) to remove redundant sequences. We next scaffolded the remaining contigs using SSPACE[v3.0] (Boetzer, Henkel, Jansen, Butler, & Pirovano, 2011). We further mapped testis and female liver RNA-Seq reads to the gene scaffolds with Bowtie[v1.1.2] (Langmead, 2010) and retained only male-specific gene scaffolds (with at least 80% of the sequence covered by male-specific reads and no more than 20% of the sequence covered by female-specific reads). From the filtered male-specific scaffolds we generated consensus sequences using minimus2[v3.1.0] from the AMOS consortium (Sommer, Delcher, Salzberg, & Pop, 2007). Annotation of the final transcripts was performed using nucleotide and protein databases using BLAST[v2.6.0+] (Altschul, Gish, Miller, Myers, & Lipman, 1990). The above pipeline (Fig. S6) was run for each great ape species separately, and the longest transcript representing each ampliconic gene family was obtained for each species (Dataset S1). The transcripts for genes with low expression were not assembled due to lack of reads covering these genes.
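The male-specificity filter on scaffolds (at least 80% of the sequence covered by testis reads and no more than 20% covered by female liver reads) can be sketched as follows. The per-base depth lists stand in for coverage output such as that produced by BEDTools; the function names and the example depths are hypothetical.

```python
def covered_fraction(per_base_depth):
    """Fraction of positions with at least one mapped read."""
    if not per_base_depth:
        return 0.0
    return sum(1 for depth in per_base_depth if depth > 0) / len(per_base_depth)


def is_male_specific(male_depth, female_depth, min_male=0.80, max_female=0.20):
    """Apply the 80% male / 20% female coverage thresholds from the text."""
    return (covered_fraction(male_depth) >= min_male
            and covered_fraction(female_depth) <= max_female)


# Hypothetical 10-bp scaffold: 9/10 positions covered by testis reads,
# 1/10 positions covered by female liver reads.
male = [5, 7, 6, 4, 3, 8, 9, 2, 1, 0]
female = [0, 0, 0, 0, 0, 0, 0, 0, 0, 3]
keep = is_male_specific(male, female)
```

A scaffold that is well covered by both read sets is rejected, which is the intended behavior for transcripts with X-linked or autosomal homologs.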
Estimating gene expression levels from RNA-Seq datasets
To obtain gene expression levels of Y ampliconic gene families, we used the human RefSeq database downloaded from the UCSC Genome Browser (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/, date: October 2016) as a reference, along with the longest ampliconic gene transcript assembled for the available gene families (see previous section). We generated an index for the reference using the salmon[v0.14.1] index function (Patro, Duggal, Love, Irizarry, & Kingsford, 2017) with k-mer size 31 (-k 31 --keepDuplicates). Standard pipelines such as Tophat2 (Kim et al., 2013) and RSEM (B. Li & Dewey, 2011), which are optimized for aligning reads to a reference from the same species, could not be used in our case, so we developed a new pipeline. For each sample, using the salmon quant (-l A -p 8 --validateMappings) function, we obtained read counts per transcript for the available testis samples of each species. The transcript-level read counts were converted to gene-level counts using the tximport package[v1.2.0] (Soneson, Love, & Robinson, 2015). The gene-level read counts for the RNA-Seq samples were normalized using DESeq2[v1.14.1] (Love, Huber, & Anders, 2014). The resulting dataset included paired-end and single-end RNA-Seq datasets with or without replicates; for the sake of consistency, we processed all the datasets as single-end files and used only one replicate per sample to minimize batch effects. PCA of the expression data shows distinct clustering of each species, except for one bonobo sample, which appears to be closer to chimpanzee (Figs. S7-S8). Because we are aligning reads across great ape species, the reads from apes that are closest to human (chimpanzee and bonobo) are
expected to align better to the human reference than the reads from more distant species (e.g. orangutans). However, we expect that our use of k-mer-based pseudo-alignment instead of standard whole-read alignment reduced reference bias. To test this hypothesis, we employed the rlog() function in DESeq2 (Love et al., 2014) to normalize read counts and then used the dist() function in R to calculate the Euclidean distance between samples. When viewed as a heat map, with the exception of bonobos, the distances between the samples are consistent with the phylogenetic relationships among great ape species (Fig. S9), which is also supported by the dendrograms relating the rows and columns of the heat map. PCA also showed distinct clustering of the samples by species (Figs. S7-S8).
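The sample-to-sample distance calculation can be illustrated with a short Python sketch. It is a simplified stand-in rather than the analysis itself: it assumes a plain log2(count + 1) transform in place of DESeq2's rlog(), and the sample names and counts are made up.

```python
import math


def log_transform(counts):
    """log2(count + 1) per gene; a simple stand-in for DESeq2's rlog()."""
    return [math.log2(c + 1.0) for c in counts]


def euclidean(a, b):
    """Euclidean distance between two equal-length expression vectors."""
    if len(a) != len(b):
        raise ValueError("expression vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(samples):
    """Pairwise distances; samples maps sample name -> normalized gene counts."""
    names = sorted(samples)
    logged = {n: log_transform(samples[n]) for n in names}
    matrix = [[euclidean(logged[r], logged[c]) for c in names] for r in names]
    return names, matrix


# Made-up counts for three hypothetical samples and four genes.
toy = {
    "human_1": [100, 20, 5, 0],
    "chimpanzee_1": [90, 25, 5, 1],
    "orangutan_1": [10, 200, 50, 30],
}
names, dist = distance_matrix(toy)
for name, row in zip(names, dist):
    print(name, [round(d, 2) for d in row])
```

In this toy input the human and chimpanzee samples are closer to each other than either is to the orangutan sample, which is the kind of pattern a distance heat map makes visible.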
Testing for conservation in gene expression levels
To test whether the expression levels for Y ampliconic gene families are conserved across great ape species, we used the EVE model (expression variance and evolution model) (Rohlfs, Harrigan, & Nielsen, 2014). This model parametrizes the ratio of population to evolutionary expression variance (β) taking phylogeny into account, i.e. it provides an implementation of phylogenetic ANOVA. Similar to the F statistic in ANOVA (a measure of the ratio of variation between groups to variation within groups), the β parameter estimated by the EVE model represents the ratio of within-species expression variation to phylogenetically corrected between-species expression variation. The EVE model is based on an Ornstein-Uhlenbeck (OU) process (Rohlfs & Nielsen, 2015), which models a random walk with a pull toward an optimal value. In the OU process employed by the EVE model, genetic drift (σ²) is explained by the random walk, the strength of selection (α) by the directional pull, and optimal gene expression (θ) at the species level by the optimal value. The EVE model also has a parameter that captures the variation of expression within species (τ), which is used to estimate β. Given the Y chromosome-specific phylogenetic tree (the second tree without Bornean orangutan was used; see section on ‘Construction of Y-specific great ape phylogenetic trees’) and the expression values for the five Y ampliconic gene families found in all great apes from multiple samples per species, the EVE model estimated the above-mentioned parameters and used them to calculate the β parameter for each gene family i (βi) and for all the gene families together (βshared). The EVE model was then used to test whether the ratio βi for each gene i was similar to that of all the genes evolving neutrally in the phylogeny (i.e. βi=βshared, where i indexes each gene in the dataset). Deviations from this expectation are suggestive of selection. We tested whether the βi parameter for any one
gene family deviates from this expectation (i.e. βi≠βshared). If βi > βshared, then there is more variation within species than between species at gene i compared to expected, which could be suggestive of diversifying selection within species. Conversely, if βi < βshared, then there is more variation between species than within species at gene i compared to expected, which could be indicative of directional selection along extant or ancestral branches of the phylogeny. We used the -S parameter in the EVE model to perform the expression divergence/diversity test on each of the five gene families (-n 5) separately, using the ampliconic gene expression values from the previous section and the Y chromosome-specific phylogenetic tree as inputs. The EVE model calculated the likelihood ratio between the null and alternative hypotheses (H0: βi=βshared vs. Ha: βi≠βshared). The likelihood ratio test statistic asymptotically follows a chi-square distribution with one degree of freedom, which makes it possible to convert the likelihood ratios to p-values. The p-values were used to infer whether the expression levels of the gene families tested are conserved across great ape species after taking their phylogenetic relationships into account.
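The conversion from a likelihood ratio to a p-value can be sketched as follows. The sketch assumes the test statistic is defined as twice the difference in log-likelihoods between the alternative and null models; the function names and log-likelihood values are hypothetical and do not reflect the output format of the EVE software.

```python
import math


def lrt_statistic(loglik_null, loglik_alt):
    """Likelihood ratio test statistic: 2 * (lnL_alt - lnL_null)."""
    return 2.0 * (loglik_alt - loglik_null)


def chi2_sf_1df(stat):
    """Upper-tail probability of a chi-square distribution with 1 d.f.

    For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    """
    if stat <= 0.0:
        return 1.0
    return math.erfc(math.sqrt(stat / 2.0))


# Made-up log-likelihoods for one hypothetical gene family.
stat = lrt_statistic(loglik_null=-52.3, loglik_alt=-49.1)
p_value = chi2_sf_1df(stat)
print(f"LRT statistic = {stat:.2f}, p = {p_value:.4f}")
```

Using the closed-form expression for one degree of freedom avoids an external statistics dependency; for other degrees of freedom a library routine for the chi-square survival function would be needed.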
Availability of data and materials
Transcript sequences are presented in Dataset S1. Primer sequences are presented in Dataset S2. ddPCR replicates of copy numbers are shown in Dataset S3, and expression data are presented in Table S5. Code used in the manuscript is available on GitHub: https://github.com/makovalab-psu/Y-AmpGene_CN_GE_GreatApes.git
References
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local
alignment search tool. Journal of Molecular Biology, 215(3), 403–410.
Anderson, M. J., Nyholt, J., & Dixson, A. F. (2005). Sperm competition and the evolution
of sperm midpiece volume in mammals. Journal of Zoology, 267(02), 135.
Ardlie, K. G., DeLuca, D. S., Segrè, A. V., Sullivan, T. J., Young, T. R., Gelfand, E. T., …
Lockhart. (2015). The Genotype-Tissue Expression (GTEx) pilot analysis:
Multitissue gene regulation in humans. Science, 348(6235), 648–660.
Boetzer, M., Henkel, C. V., Jansen, H. J., Butler, D., & Pirovano, W. (2011). Scaffolding
pre-assembled contigs using SSPACE. Bioinformatics, 27(4), 578–579.
Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for
Illumina sequence data. Bioinformatics, 30(15), 2114–2120.
Brawand, D., Soumillon, M., Necsulea, A., Julien, P., Csárdi, G., Harrigan, P., …
Kaessmann, H. (2011). The evolution of gene expression levels in mammalian
organs. Nature, 478(7369), 343–348.
Carelli, F. N., Hayakawa, T., Go, Y., Imai, H., Warnefors, M., & Kaessmann, H. (2016).
The life history of retrocopies illuminates the evolution of new mammalian genes.
Genome Research, 26(3), 301–314.
Carithers, L. J., Ardlie, K., Barcus, M., Branton, P. A., Britton, A., Buia, S. A., … GTEx
Consortium. (2015). A Novel Approach to High-Quality Postmortem Tissue
Procurement: The GTEx Project. Biopreservation and Biobanking, 13(5), 311–319.
Cechova, M., Harris, R. S., Tomaszkiewicz, M., Arbeithuber, B., Chiaromonte, F., &
Makova, K. D. (2019). High satellite repeat turnover in great apes studied with short-
and long-read technologies. Molecular Biology and Evolution.
https://doi.org/10.1093/molbev/msz156
Chang, A. Y.-F., & Liao, B.-Y. (2012). DNA methylation rebalances gene dosage after
mammalian gene duplications. Molecular Biology and Evolution, 29(1), 133–144.
Charlesworth, D., & Charlesworth, B. (1980). Sex differences in fitness and selection for
centric fusions between sex-chromosomes and autosomes. Genetical Research,
35(2), 205–214.
Chimpanzee Sequencing and Analysis Consortium. (2005). Initial sequence of the
chimpanzee genome and comparison with the human genome. Nature, 437(7055),
69–87.
Choi, J., Koh, E., Suzuki, H., Maeda, Y., Yoshida, A., & Namiki, M. (2007). Alu sequence
variants of the BPY2 gene in proven fertile and infertile men with Sertoli cell-only
phenotype. International Journal of Urology, Vol. 14, pp. 431–435.
https://doi.org/10.1111/j.1442-2042.2007.01741.x
Cortez, D., Marin, R., Toledo-Flores, D., Froidevaux, L., Liechti, A., Waters, P. D., …
Kaessmann, H. (2014). Origins and functional evolution of Y chromosomes across
mammals. Nature, 508(7497), 488–493.
De Bie, T., Cristianini, N., Demuth, J. P., & Hahn, M. W. (2006). CAFE: a computational
tool for the study of gene family evolution. Bioinformatics, 22(10), 1269–1271.
Delgado, R. A., Jr, & Van Schaik, C. P. (2000). The behavioral ecology and conservation
of the orangutan (Pongo pygmaeus): a tale of two islands. Evolutionary
Anthropology: Issues, News, and Reviews, 9(5), 201–218.
de Vries, J. W., Repping, S., Oates, R., Carson, R., Leschot, N. J., & van der Veen, F.
(2001). Absence of deleted in azoospermia (DAZ) genes in spermatozoa of infertile
men with somatic DAZ deletions. Fertility and Sterility, 75(3), 476–479.
Dixson, A. F., & Anderson, M. J. (2004). Sexual behavior, reproductive physiology and
sperm competition in male mammals. Physiology and Behavior, 83(2), 361–371.
Fisher, R. A. (1931). The evolution of dominance. Biological Reviews of the
Cambridge Philosophical Society, 6(4), 345–368.
Fruth, B., Hickey, J. R., André, C., Furuichi, T., Hart, J., Hart, T., … Others. (2016). Pan
paniscus. The IUCN Red List of Threatened Species 2016: e. T15932A102331567.
Fujii-Hanamoto, H., Matsubayashi, K., Nakano, M., Kusunoki, H., & Enomoto, T. (2011).
A comparative study on testicular microstructure and relative sperm production in
gorillas, chimpanzees, and orangutans. American Journal of Primatology, 73(6),
570–577.
Fu, L., Niu, B., Zhu, Z., Wu, S., & Li, W. (2012). CD-HIT: accelerated for clustering the
next-generation sequencing data. Bioinformatics, 28(23), 3150–3152.
Fungtammasan, A., Tomaszkiewicz, M., Campos-Sánchez, R., Eckert, K. A., DeGiorgio,
M., & Makova, K. D. (2016). Reverse Transcription Errors and RNA–DNA
Differences at Short Tandem Repeats. Molecular Biology and Evolution, 33(10),
2744–2758.
Giachini, C., Nuti, F., Turner, D. J., Laface, I., Xue, Y., Daguin, F., … Krausz, C. (2009).
TSPY1 copy number variation influences spermatogenesis and shows differences
among Y lineages. The Journal of Clinical Endocrinology and Metabolism, 94(10),
4016–4022.
Gläser, B., Grützner, F., Willmann, U., Stanyon, R., Arnold, N., Taylor, K., … Schempp,
W. (1998). Simian Y chromosomes: species-specific rearrangements of DAZ, RBM,
and TSPY versus contiguity of PAR and SRY. Mammalian Genome: Official Journal
of the International Mammalian Genome Society, 9(3), 226–231.
Glazko, G. V., & Nei, M. (2003). Estimation of divergence times for major lineages of
primate species. Molecular Biology and Evolution, 20(3), 424–434.
Gordon, D., Huddleston, J., Chaisson, M. J. P., Hill, C. M., Kronenberg, Z. N., Munson,
K. M., … Eichler, E. E. (2016). Long-read sequence assembly of the gorilla genome.
Science, 352(6281), aae0344.
Grabherr, M. G., Haas, B. J., Yassour, M., Levin, J. Z., Thompson, D. A., Amit, I., …
Regev, A. (2011). Full-length transcriptome assembly from RNA-Seq data without a
reference genome. Nature Biotechnology, 29(7), 644–652.
Graves, J. A. M. (1995). The origin and function of the
mammalian Y chromosome and Y-borne genes - an evolving understanding.
BioEssays: News and Reviews in Molecular, Cellular and Developmental Biology,
17(4), 311–320.
Greve, G., Alechine, E., Pasantes, J. J., Hodler, C., Rietschel, W., Robinson, T. J., &
Schempp, W. (2011). Y-Chromosome variation in hominids: intraspecific variation is
limited to the polygamous chimpanzee. PloS One, 6(12), e29311.
Haas, B. J., Papanicolaou, A., Yassour, M., Grabherr, M., Blood, P. D., Bowden, J., …
Regev, A. (2013). De novo transcript sequence reconstruction from RNA-seq using
the Trinity platform for reference generation and analysis. Nature Protocols, 8(8),
1494–1512.
Hallast, P., Balaresque, P., Bowden, G. R., Ballereau, S., & Jobling, M. A. (2013).
Recombination dynamics of a human Y-chromosomal palindrome: rapid GC-biased
gene conversion, multi-kilobase conversion tracts, and rare inversions. PLoS
Genetics, 9(7), e1003666.
Hallast, P., & Jobling, M. A. (2017). The Y chromosomes of the great apes. Human
Genetics, 136(5), 511–528.
Hallast, P., Maisano Delser, P., Batini, C., Zadik, D., Rocchi, M., Schempp, W., …
Jobling, M. A. (2016). Great ape Y Chromosome and mitochondrial DNA
phylogenies reflect subspecies structure and patterns of mating and dispersal.
Genome Research, 26(4), 427–439.
Hammer, M. F., Karafet, T., Rasanayagam, A., Wood, E. T., Altheide, T. K., Jenkins, T.,
… Zegura, S. L. (1998). Out of Africa and back again: nested cladistic analysis of
human Y chromosome variation. Molecular Biology and Evolution, 15(4), 427–441.
Han, M. V., Thomas, G. W. C., Lugo-Martinez, J., & Hahn, M. W. (2013). Estimating
gene gain and loss rates in the presence of error in genome assembly and
annotation using CAFE 3. Molecular Biology and Evolution, 30(8), 1987–1997.
Harcourt, A. H., Harvey, P. H., Larson, S. G., & Short, R. V. (1981). Testis weight, body
weight and breeding system in primates. Nature, 293(5827), 55–57.
Harrison, M. E., & Chivers, D. J. (2007). The orang-utan mating system and the
unflanged male: A product of increased food stress during the late Miocene and
Pliocene? Journal of Human Evolution, 52(3), 275–293.
Helgason, A., Einarsson, A. W., Guðmundsdóttir, V. B., Sigurðsson, Á., Gunnarsdóttir,
E. D., Jagadeesan, A., … Stefánsson, K. (2015). The Y-chromosome point mutation
rate in humans. Nature Genetics, 47(5), 453–457.
Hey, J. (2010). The divergence of chimpanzee species and subspecies as revealed in
multipopulation isolation-with-migration analyses. Molecular Biology and Evolution,
27(4), 921–933.
Hindson, B. J., Ness, K. D., Masquelier, D. A., Belgrader, P., Heredia, N. J., Makarewicz,
A. J., … Colston, B. W. (2011). High-throughput droplet digital PCR system for
absolute quantitation of DNA copy number. Analytical Chemistry, 83(22), 8604–
8610.
Hindson, C. M., Chevillet, J. R., Briggs, H. A., Gallichotte, E. N., Ruf, I. K., Hindson, B.
J., … Tewari, M. (2013). Absolute quantification by droplet digital PCR versus
analog real-time PCR. Nature Methods, 10(10), 1003–1005.
Hughes, J. F., Skaletsky, H., Brown, L. G., Pyntikova, T., Graves, T., Fulton, R. S., …
Page, D. C. (2012). Strict evolutionary conservation followed rapid gene loss on
human and rhesus Y chromosomes. Nature, 483(7387), 82–86.
Hughes, J. F., Skaletsky, H., Pyntikova, T., Graves, T. A., van Daalen, S. K. M., Minx, P.
J., … Page, D. C. (2010). Chimpanzee and human Y chromosomes are remarkably
divergent in structure and gene content. Nature, 463(7280), 536–539.
IUCN. (2016a). Gorilla gorilla: Maisels, F., Bergl, R.A. & Williamson, E.A. IUCN
Red List of Threatened Species. https://doi.org/10.2305/iucn.uk.2018-
2.rlts.t9404a136250858.en
IUCN. (2016b). Pan troglodytes: Humle, T., Maisels, F., Oates, J.F., Plumptre, A.
& Williamson, E.A. IUCN Red List of Threatened Species.
https://doi.org/10.2305/iucn.uk.2016-2.rlts.t15933a17964454.en
IUCN. (2016c). Pongo pygmaeus: Ancrenaz, M., Gumal, M., Marshall, A.J.,
Meijaard, E., Wich, S.A. & Husson, S. IUCN Red List of Threatened Species.
https://doi.org/10.2305/iucn.uk.2016-1.rlts.t17975a17966347.en
IUCN. (2017). Pongo abelii: Singleton, I., Wich, S.A., Nowak, M., Usher, G. &
Utami-Atmoko, S.S. IUCN Red List of Threatened Species.
https://doi.org/10.2305/iucn.uk.2017-3.rlts.t121097935a115575085.en
Kent, W. J. (2002). BLAT--the BLAST-like alignment tool. Genome Research, 12(4),
656–664.
Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A. M., &
Haussler, D. (2002). The human genome browser at UCSC. Genome Research,
12(6), 996–1006.
Khaitovich, P., Enard, W., Lachmann, M., & Pääbo, S. (2006). Evolution of primate gene
expression. Nature Reviews. Genetics, 7(9), 693–702.
Khaitovich, P., Hellmann, I., Enard, W., Nowick, K., Leinweber, M., Franz, H., … Pääbo,
S. (2005). Parallel patterns of evolution in the genomes and transcriptomes of
humans and chimpanzees. Science, 309(5742), 1850–1854.
Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., & Salzberg, S. L. (2013).
TopHat2: accurate alignment of transcriptomes in the presence of insertions,
deletions and gene fusions. Genome Biology, 14(4), R36.
Kumar, S., Stecher, G., & Tamura, K. (2016). MEGA7: Molecular Evolutionary Genetics
Analysis Version 7.0 for Bigger Datasets. Molecular Biology and Evolution, 33(7),
1870–1874.
Lahn, B. T., & Page, D. C. (1999). Four evolutionary strata on the human X
chromosome. Science, 286(5441), 964–967.
Langmead, B. (2010). Aligning short sequencing reads with Bowtie. Current Protocols in
Bioinformatics / Editorial Board, Andreas D. Baxevanis ... [et al.], Chapter 11, Unit
11.7.
Li, B., & Dewey, C. N. (2011). RSEM: accurate transcript quantification from RNA-Seq
data with or without a reference genome. BMC Bioinformatics, 12(1), 323.
Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with
BWA-MEM. arXiv:1303.3997 [q-bio.GN]. https://arxiv.org/abs/1303.3997
Li, W., & Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large
sets of protein or nucleotide sequences. Bioinformatics, 22(13), 1658–1659.
Locke, D. P., Hillier, L. W., Warren, W. C., Worley, K. C., Nazareth, L. V., Muzny, D. M.,
… Wilson, R. K. (2011). Comparative and demographic analysis of orang-utan
genomes. Nature, 469(7331), 529–533.
Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and
dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550.
Lu, C., Jiang, J., Zhang, R., Wang, Y., Xu, M., Qin, Y., … Wang, X. (2014). Gene copy
number alterations in the azoospermia-associated AZFc region and their effect on
spermatogenic impairment. Molecular Human Reproduction, 20(9), 836–843.
Lucotte, E. A., Skov, L., Jensen, J. M., Macià, M. C., Munch, K., & Schierup, M. H.
(2018). Dynamic Copy Number Evolution of X- and Y-Linked Ampliconic Genes in
Human Populations. Genetics, 209(3), 907–920.
Luo, Z.-X., Yuan, C.-X., Meng, Q.-J., & Ji, Q. (2011). A Jurassic eutherian mammal and
divergence of marsupials and placentals. Nature, 476(7361), 442–445.
Møller, A. P. (1988). Ejaculate quality, testes size and sperm competition in primates.
Journal of Human Evolution, 17(5), 479–488.
Mueller, J. L., Skaletsky, H., Brown, L. G., Zaghlul, S., Rock, S., Graves, T., … Page, D.
C. (2013). Independent specialization of the human and mouse X chromosomes for
the male germ line. Nature Genetics, 45(9), 1083–1087.
Nascimento, J. M., Shi, L. Z., Meyers, S., Gagneux, P., Loskutoff, N. M., Botvinick, E. L.,
& Berns, M. W. (2008). The use of optical tweezers to study sperm competition and
motility in primates. Journal of the Royal Society, Interface / the Royal Society,
5(20), 297–302.
Nater, A., Mattle-Greminger, M. P., Nurcahyo, A., Nowak, M. G., de Manuel, M., Desai,
T., … Krützen, M. (2017). Morphometric, Behavioral, and Genomic Evidence for a
New Orangutan Species. Current Biology: CB, 27(22), 3576–3577.
Oetjens, M. T., Shen, F., Emery, S. B., Zou, Z., & Kidd, J. M. (2016). Y-Chromosome
Structural Diversity in the Bonobo and Chimpanzee Lineages. Genome Biology and
Evolution, 8(7), 2231–2240.
Paradis, E., Claude, J., & Strimmer, K. (2004). APE: Analyses of Phylogenetics and
Evolution in R language. Bioinformatics, 20(2), 289–290.
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon
provides fast and bias-aware quantification of transcript expression. Nature
Methods, 14(4), 417–419.
Prüfer, K., Munch, K., Hellmann, I., Akagi, K., Miller, J. R., Walenz, B., … Pääbo, S.
(2012). The bonobo genome compared with the chimpanzee and human genomes.
Nature, 486(7404), 527–531.
Quinlan, A. R., & Hall, I. M. (2010). BEDTools: a flexible suite of utilities for comparing
genomic features. Bioinformatics, 26(6), 841–842.
Repping, S., van Daalen, S. K. M., Brown, L. G., Korver, C. M., Lange, J., Marszalek, J.
D., … Rozen, S. (2006). High mutation rates have driven extensive structural
polymorphism among human Y chromosomes. Nature Genetics, 38(4), 463–467.
Rohlfs, R. V., Harrigan, P., & Nielsen, R. (2014). Modeling gene expression evolution
with an extended Ornstein-Uhlenbeck process accounting for within-species
variation. Molecular Biology and Evolution, 31(1), 201–211.
Rohlfs, R. V., & Nielsen, R. (2015). Phylogenetic ANOVA: The expression variance and
evolution model for quantitative trait evolution. Systematic Biology, 64(5), 695–708.
Ross, M. T., Grafham, D. V., Coffey, A. J., Scherer, S., McLay, K., Muzny, D., …
Bentley, D. R. (2005). The DNA sequence of the human X chromosome. Nature,
434(7031), 325–337.
Rozen, S., Skaletsky, H., Marszalek, J. D., Minx, P. J., Cordum, H. S., Waterston, R. H.,
… Page, D. C. (2003). Abundant gene conversion between arms of palindromes in
human and ape Y chromosomes. Nature, 423(6942), 873–876.
Ruiz-Orera, J., Hernandez-Rodriguez, J., Chiva, C., Sabidó, E., Kondova, I., Bontrop, R.,
… Albà, M. M. (2015). Origins of De Novo Genes in Human and Chimpanzee. PLoS
Genetics, 11(12), e1005721.
Schaller, F., Fernandes, A. M., Hodler, C., Münch, C., Pasantes, J. J., Rietschel, W., &
Schempp, W. (2010). Y Chromosomal Variation Tracks the Evolution of Mating
Systems in Chimpanzee and Bonobo. PloS One, 5(9), e12482.
Seuánez, H. N. (1980). Chromosomes and spermatozoa of the African great apes.
Journal of Reproduction and Fertility. Supplement, Suppl 28, 91–104.
Shi, W., Louzada, S., Grigorova, M., Massaia, A., Arciero, E., Kibena, L., … Xue, Y.
(2019). Evolutionary and functional analysis of RBMY1 gene copy number variation
on the human Y chromosome. Human Molecular Genetics, 28(16), 2785–2798.
Shi, W., Massaia, A., Louzada, S., Handsaker, J., Chow, W., McCarthy, S., … Tyler-
Smith, C. (2019). Birth, expansion, and death of VCY-containing palindromes on the
human Y chromosome. Genome Biology, 20(1), 207.
Skaletsky, H., Kuroda-Kawaguchi, T., Minx, P. J., Cordum, H. S., Hillier, L., Brown, L. G.,
… Page, D. C. (2003). The male-specific region of the human Y chromosome is a
mosaic of discrete sequence classes. Nature, 423(6942), 825–837.
Skov, L., & Schierup, M. H. (2017). Analysis of 62 hybrid assembled human Y
chromosomes exposes rapid structural changes and high rates of gene conversion.
PLoS Genetics, 13(8), 1–20.
Sommer, D. D., Delcher, A. L., Salzberg, S. L., & Pop, M. (2007). Minimus: a fast,
lightweight genome assembler. BMC Bioinformatics, 8, 64.
Soneson, C., Love, M. I., & Robinson, M. D. (2015). Differential analyses for RNA-seq:
transcript-level estimates improve gene-level inferences. F1000Research, 4, 1521.
Taylor, S. C., Laperriere, G., & Germain, H. (2017). Droplet Digital PCR versus qPCR for
gene expression analysis with low abundant targets: from variable nonsense to
publication quality data. Scientific Reports, 7(1), 2409.
Thorvaldsdóttir, H., Robinson, J. T., & Mesirov, J. P. (2013). Integrative Genomics
Viewer (IGV): high-performance genomics data visualization and exploration.
Briefings in Bioinformatics, 14(2), 178–192.
Tomaszkiewicz, M., Rangavittal, S., Cechova, M., Campos Sanchez, R., Fescemyer, H.
W., Harris, R., … Makova, K. D. (2016). A time- and cost-effective strategy to
sequence mammalian Y Chromosomes: an application to the de novo assembly of
gorilla Y. Genome Research, 26(4), 530–540.
Underhill, P. A., & Kivisild, T. (2007). Use of Y chromosome and mitochondrial DNA
population structure in tracing human migrations. Annual Review of Genetics, 41,
539–564.
Underhill, P. A., Shen, P., Lin, A. A., Jin, L., Passarino, G., Yang, W. H., … Oefner, P. J.
(2000). Y chromosome sequence variation and the history of human populations.
Nature Genetics, Vol. 26, pp. 358–361. https://doi.org/10.1038/81685
Vegesna, R., Tomaszkiewicz, M., Medvedev, P., & Makova, K. D. (2019). Dosage
regulation, and variation in gene expression and copy number of human Y
chromosome ampliconic genes. PLoS Genetics, 15(9), e1008369.
Veyrunes, F., Waters, P. D., Miethke, P., Rens, W., McMillan, D., Alsop, A. E., …
Marshall Graves, J. A. (2008). Bird-like sex chromosomes of platypus imply recent
origin of mammal sex chromosomes. Genome Research, 18(6), 965–973.
Vogt, P. H., Edelmann, A., Kirsch, S., Henegariu, O., Hirschmann, P., Kiesewetter, F., …
Haidl, G. (1996). Human Y chromosome azoospermia factors (AZF) mapped to
different subregions in Yq11. Human Molecular Genetics, 5(7), 933–943.
Walsh, B. (2001). Estimating the time to the most recent common ancestor for the Y
chromosome or mitochondrial DNA for a pair of individuals. Genetics, 158(2), 897–
912.
Wistuba, J., Schrod, A., Greve, B., Hodges, J. K., Aslam, H., Weinbauer, G. F., &
Luetjens, C. M. (2003). Organization of seminiferous epithelium in primates:
relationship to spermatogenic efficiency, phylogeny, and mating system. Biology of
Reproduction, 69(2), 582–591.
Xie, Y., Wu, G., Tang, J., Luo, R., Patterson, J., Liu, S., … Wang, J. (2014).
SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads.
Bioinformatics, 30(12), 1660–1666.
Yan, Y., Yang, X., Liu, Y., Shen, Y., Tu, W., Dong, Q., … Yang, Y. (2017). Copy number
variation of functional RBMY1 is associated with sperm motility: an azoospermia
factor-linked candidate for asthenozoospermia. Human Reproduction, 32(7), 1521–
1531.
Ye, D., Zaidi, A., Tomaszkiewicz, M., Anthony, K., Liebowitz, C., DeGiorgio, M., …
Makova, K. D. (2018). High levels of copy number variation of ampliconic genes
across major human Y haplogroups. Genome Biology and Evolution, (May).
https://doi.org/10.1093/gbe/evy086
Yu, N., Jensen-Seaman, M. I., Chemnick, L., Kidd, J. R., Deinard, A. S., Ryder, O., … Li,
W.-H. (2003). Low nucleotide diversity in chimpanzees and bonobos. Genetics,
164(4), 1511–1518.
Chapter 4 Dynamic evolution of great ape Y chromosomes
This chapter will be submitted as a research article by M. Cechova, R. Vegesna, M. Tomaszkiewicz, R.S. Harris, D. Chen, S. Rangavittal, P. Medvedev, and K.D. Makova. In this chapter, R. Vegesna, the author of this dissertation, performed the analyses related to the gene content and palindromes of the bonobo and orangutan Y chromosome assemblies. M. Tomaszkiewicz performed all the wet-lab experimental work. M. Michalovova, R.S. Harris, and S. Rangavittal assembled the Y chromosomes. R.S. Harris and D. Chen helped with the multiple sequence alignment of great ape sex chromosomes. Supporting information is provided in Appendix C.
Introduction
Great apes include humans and their closest living relatives: chimpanzees, gorillas, and orangutans. The male-specific sex chromosome, the Y, is involved in sex determination and male fertility, but has been understudied in great apes. The Y chromosome assemblies are available for chimpanzee, human, and gorilla (Hughes et al., 2010; Skaletsky et al., 2003; Tomaszkiewicz et al., 2016). Cytogenetic studies of great ape Y chromosomes demonstrated substantial interspecific variation in their structure and gene content (Gläser et al., 1998). We have a limited understanding of the evolutionary history in terms of gene and palindrome content of great ape Y chromosomes. This shortcoming is addressed in this study.
Studies of the human Y chromosome have identified variation in the copy number of genes present in its ampliconic region (Lucotte et al., 2018; Skov & Schierup, 2017; Vegesna, Tomaszkiewicz, Medvedev, & Makova, 2019; Ye et al., 2018). This variation results from intraspecific rearrangements among the amplicons on the Y chromosome (Lange et al., 2013). The existence of variation in ampliconic gene copy number across great apes suggests the presence of similar repetitive sequences, which undergo frequent rearrangements, on all great ape Y chromosomes (Vegesna et al., 2020; Chapter 3). The most frequent repetitive structures observed on the Y chromosomes of great apes studied to date are palindromes: large inverted repeat structures (arms) which are separated by a
relatively short sequence (spacer) in between (Hughes et al., 2010; Skaletsky et al., 2003). These palindromic structures enable gene conversion and non-allelic homologous recombination (NAHR), which help to overcome the lack of recombination within the male-specific region of the Y (Rozen et al., 2003). However, very little is known about palindromes outside of the three great ape species with available Y chromosome assemblies (Hughes et al., 2010; Skaletsky et al., 2003; Tomaszkiewicz et al., 2016).
The conservation of ampliconic structures on Y chromosomes has been linked to Muller's ratchet and genetic hitchhiking (Betrán, Demuth, & Williford, 2012). However, genes present in the ampliconic regions have testis-specific expression, whereas the genes outside the amplicons are expressed ubiquitously (Skaletsky et al., 2003). This implies that, apart from sequence conservation, amplicons could have a secondary function: regulating the genes located within them so that they are selectively expressed in testis. Not all palindromes on the human Y are conserved in chimpanzee or gorilla (Hughes et al., 2010; Tomaszkiewicz et al., 2016). There are nineteen palindromes on the chimpanzee Y, of which only seven are homologous to human palindromes and the remaining twelve are chimpanzee-specific (Hughes et al., 2010). The gorilla Y chromosome assembly has sequences homologous to all human palindromes and to nine out of the twelve chimpanzee-specific palindromes; these sequences are likely to form palindromes in gorilla as well, given their high read depth in a male gorilla individual (Tomaszkiewicz et al., 2016). Of the human palindromes conserved in chimpanzee and gorilla, two (P6 and P7) do not harbor protein-coding genes (Skaletsky et al., 2003). Studying these palindromes should aid in understanding the potential function of amplicons on the Y chromosomes of great apes.
The mammalian Y and X chromosomes originated from a pair of autosomes; however, compared to the X chromosome, the Y has degraded and lost the majority of its genes (Bellott et al., 2014; Skaletsky et al., 2003). On the one hand, one study has shown that, due to the lack of recombination, the Y chromosome has been under constant purifying selection in humans, removing deleterious mutations and maintaining the reproductive fitness of the population (Wilson Sayres, Lohmueller, & Nielsen, 2014). On the other hand, beneficial mutations can be fixed in a population via selective sweeps, with the fixation of deletion or pseudogenization events via genetic hitchhiking as a byproduct (Hughes et al., 2005; Perry, Tito, & Verrelli, 2007). Reconstructing the
evolutionary history of the genes on the great ape Y chromosomes will provide us with insights into the evolution of their gene content. We can deduce whether the observed gene content is a result of a constant rate of gene birth and death across great apes, or whether there are, for instance, species-specific increases in deletion rates on the Y chromosome, potentially associated with species-specific differences in mating preferences and social structure.
Recombination between the X and the Y chromosomes is limited to the pseudoautosomal regions (Graves, 1995; Lahn & Page, 1999). However, there are cases in which regions of the Y that normally do not recombine with the X (i.e. X-degenerate regions) undergo gene conversion with the X (Rosser, Balaresque, & Jobling, 2009; Trombetta, Cruciani, Underhill, Sellitto, & Scozzari, 2010). Genes such as PRKY and VCY share higher similarity with their X homologs as a result of X-Y gene conversion in the human lineage (Rosser et al., 2009; Trombetta et al., 2010). To better understand the evolution and conservation of genes on the Y, there is a need to identify regions that undergo X-Y gene conversion on great ape Y chromosomes.
In this study, we assembled the Y chromosomes of bonobo and Sumatran orangutan. Within these assemblies we identified the palindromes which are shared with human and chimpanzee. Independent of the assemblies, we identified new factors other than genes that could determine the conservation of palindromes across great apes. Using a model developed by Iwasaki and Takagi (Iwasaki & Takagi, 2007), we reconstructed the evolutionary history of genes across great apes. By comparing the X and Y chromosomes of human, chimpanzee, bonobo, gorilla, and orangutan, we identified conserved regions and searched for X-Y gene conversion events within them.
Results
Conservation of human and chimpanzee palindrome sequences
Did the palindromes now present on the human Y (P1-P8) and chimpanzee Y (C1-C19) evolve before or after the great ape lineages split? To answer this question, we identified the proportions of human and chimpanzee palindrome sequences that aligned to bonobo, orangutan and gorilla Ys in our multi-species alignments (Fig. 1A; Table S1-2). Among human palindromes, P5 and P6 were the most conserved (covered by 89-97% of other great ape Y assemblies), whereas the majority of P3 sequences were human-specific
(covered by only 31-38% of other great ape Y assemblies). Nevertheless, the common ancestor of great apes most likely already had substantial lengths of sequences homologous to P1, P2, and P4-P8, and some sequences of P3 (Fig. 1B). Chimpanzee palindromes C17, C18, and C19 are homologous to human palindromes P8, P7, and P6, respectively (Hughes et al., 2010). Therefore, below we focused on the other chimpanzee palindromes and, following (Hughes et al., 2010), divided them into five homologous groups: C1 (C1+C6+C8+C10+C14+C16), C2 (C2+C11+C15), C3 (C3+C12), C4 (C4+C13), and C5 (C5+C7+C9) (Table S2). The palindromes in the C3, C4, and C5 groups had substantial proportions (usually 70-90%) of their sequences covered by alignments with other great ape Ys (Fig. 1A). In contrast, most of the C2 sequences (85%) were shared with bonobo, and a substantial proportion of C1 sequences was chimpanzee-specific. Nonetheless, the common ancestor of great apes likely already had large amounts of sequences homologous to group C3, C4, and C5 palindromes, and also some sequences homologous to group C1 and C2 palindromes (Fig. 1B).
To determine whether the bonobo, orangutan, and gorilla sequences homologous to human or chimpanzee palindromes were multi-copy (i.e. present in more than one copy), and thus could form palindromes, in the common ancestor of great apes, we obtained their read depths from whole-genome sequencing of their respective males (Fig. 1A; see Methods). This approach was used because we expect that some palindromes were collapsed in our Y assemblies and, hence, the copy number within the assembly may be unreliable. Additionally, we used the data on the homology between human and chimpanzee palindromes summarized from the literature (Hughes et al., 2010; Skaletsky et al., 2003b; Tomaszkiewicz et al., 2016) (Table S3-4). Using maximum parsimony reconstruction, we concluded (Note S1) that sequences homologous to P4, P5, P8, C4, and partial sequences homologous to P1, P2, C2, and C3 were multi-copy in the common ancestor of great apes (Fig. 1B). Sequences homologous to P3, P6, and C1 were multi-copy in the human-gorilla common ancestor, those homologous to P7 were multi-copy in the human-chimpanzee common ancestor, and those homologous to C5 were multi-copy in the bonobo-chimpanzee common ancestor (Fig. 1B; Note S1).
Figure 4-1. Evolution of sequences homologous to human and chimpanzee palindromes. (A) Heatmaps showing coverage for each palindrome in each species in the multi-species alignment, and box plots representing copy number (natural log) of 1-kb windows which have homology with human or chimpanzee palindromes. (B) The great ape phylogenetic tree with evolution of human (shown in blue) and chimpanzee (purple) palindromic sequences overlaid on it. Palindrome names in bold indicate that their sequences were present in ≥2 copies. Positive (+) and negative (-) signs indicate gain and loss of palindrome sequence (possibly only partial), respectively. Arrows represent gain (↑) or loss (↓) of palindrome copy number. If several equally parsimonious scenarios were possible, we assumed a later date of acquisition of the multi-copy state for a palindrome (Note S1).
What drives conservation of gene-free palindromes P6 and P7?
Palindromes are hypothesized to evolve to allow ampliconic genes to withstand high mutation rates on the Y via gene conversion in the absence of interchromosomal recombination (Betrán et al., 2012; Rozen et al., 2003). Two human palindromes, P6 and P7, do not harbor any genes; however, large proportions of their sequences are present and are multi-copy (Fig. 1A, Table S3) in most great ape species we examined (with the exceptions of P7, which is absent from bonobo, and of P6 and P7, which are single-copy in orangutan).
In fact, P6 is the most conserved human palindrome (Fig. 1A) and was present in the multi-copy state in the human-gorilla common ancestor (Fig. 1B). We hypothesized that conservation of P6 and P7 might be explained by their role in regulation of gene expression.
Using ENCODE data (Davis et al., 2018), we identified candidate open chromatin and protein-binding sites in P6 and P7 (Fig. 2-3). In P6, we found markers for open chromatin (DNase I hypersensitive sites) and histone modifications H3K4me1 and H3K27ac, associated with enhancers (Pennacchio et al., 2013), in human umbilical vein endothelial cells (HUVEC). In P7, we found markers for cAMP-responsive element binding protein 1 (CREB1), which participates in transcription regulation (Mayr & Montminy, 2001), in the human liver cancer cell line HepG2. Interestingly, we did not identify any open-chromatin or enhancer marks in P6 and P7 in testis, suggesting that the sites we found above regulate genes expressed outside of this tissue.
Evolution of gene content in great apes
Utilizing sequence assemblies and testis expression data (Vegesna et al., 2020), we evaluated gene content and the rates of gene birth and death on the Y chromosomes of five great ape species. First, we examined the presence/absence of homologs of human Y chromosome genes (16 X-degenerate genes and nine ampliconic gene families, a total of 25 gene families; for multi-copy gene families, we were not studying copy number variation, but only presence/absence of a family in a species; Fig. S1). Such data were previously available for the chimpanzee Y, in which seven out of 25 human Y gene families became pseudogenized or deleted (Hughes et al., 2010), and for the gorilla Y, in which only one gene family (VCY) out of 25 is absent (Tomaszkiewicz et al., 2016). Here, we compiled the data for bonobo and orangutan. From the 25 gene families present on the human Y, the bonobo Y lacked seven (HSFY, PRY, TBL1Y, TXLNGY, USP9Y, VCY, and XKRY) and the orangutan Y lacked five (TXLNGY, CYorf15A, PRKY, USP9Y, and VCY). Second, we annotated putative new genes in our bonobo and orangutan Y assemblies (Note S2). Our results suggest that the bonobo and orangutan Y chromosomes, similar to the chimpanzee (Hughes et al., 2010) and gorilla (Tomaszkiewicz et al., 2016) Ys, do not harbor novel genes. As a result, we obtained the complete information about gene family content on the Y chromosome in five great ape species.
Figure 4-2. IGV screenshots of peaks in DNase-seq, H3K4me1 and H3K27ac on palindrome P6. A. Peaks on both arms of the palindrome are shown within the blue and grey boxes. B. Zoomed-in view of peaks on the left arm of palindrome P6. C. Zoomed-in view of peaks on the right arm of palindrome P6. In B and C, the coverage track represents the depth of coverage and the peaks track represents the peaks identified by macs2.
Figure 4-3. IGV screenshot of CREB1 peaks on palindrome P7. Peaks on both arms of the palindrome are shown. The coverage track represents the depth of coverage and the peaks track represents the peaks identified by macs2.
Using this information, we reconstructed gene content at ancestral nodes and asked whether the rates of gene birth and death varied across the great ape phylogeny. For this analysis, we employed the evolutionary model developed by Iwasaki and Takagi (Iwasaki
& Takagi, 2007) and used the macaque Y chromosome as an outgroup (Hughes et al., 2012). Because X-degenerate and ampliconic genes might exhibit different trends, we analyzed them separately (Fig. 4, Table S5). Considering gene births, none were observed for X-degenerate genes, and only one (VCY, in the human-chimpanzee-bonobo common ancestor) was observed for ampliconic genes, leading to overall low gene birth rates. Considering gene deaths, three ampliconic gene families and three X-degenerate genes were lost by the chimpanzee-bonobo common ancestor, leading to death rates of 0.095 and 0.049 events/MY, respectively. Bonobo lost an additional ampliconic gene, whereas chimpanzee lost an additional X-degenerate gene, leading to death rates of 0.182 and 0.080 events/MY, respectively. In contrast, no deaths of either ampliconic or X-degenerate genes were observed in human and gorilla. Orangutan did not experience any deaths of X-degenerate genes, but lost four ampliconic genes. Its ampliconic gene death rate (0.021 events/MY) was still lower than that in the bonobo or in the bonobo-chimpanzee common ancestor. To summarize, across great apes, the Pan genus exhibited the highest death rates for both X-degenerate and ampliconic genes.
Figure 4-4. Evolution of Y chromosome gene content in great apes. The reconstructed history of gene birth and death for X-degenerate (blue) and ampliconic (red) genes was overlaid on the great ape phylogenetic tree (not drawn to scale), using macaque as an outgroup. The rates of gene birth and death (in events per million years) are shown in parentheses (for complete data see Fig. S3). The list at the root includes the genes that were present in the common ancestor of great apes and macaque. In addition to most of the genes on the human Y, the macaque Y harbors the X-degenerate MXRA5Y gene, which we found to be deleted in orangutan and pseudogenized in bonobo, chimpanzee, gorilla, and human. We currently cannot find a full-length copy of the VCY gene in bonobo. TXLNGY and DDX3Y are also known as CYorf15B and DBY, respectively.
Gene conversion between X and Y chromosomes of great apes
To study the presence of gene conversion between X-degenerate genes and their homologs on the X chromosome in great apes, we generated a multiple-sequence alignment of the X and Y chromosomes (see Methods). From this alignment, we extracted alignment blocks specific to 12 single-copy X-degenerate genes (including both exons and introns) on the human Y (from the 16 human X-degenerate genes we excluded CYorf15A, CYorf15B, RPS4Y1, and RPS4Y2 because parts of these genes have repeats on the Y and have homologs on the X chromosome, making detection of gene conversion difficult). We further retained only alignment blocks greater than 50 bp and tested for the presence of gene conversion in them using GENECONV (Sawyer, 1999). The resulting alignment blocks covered 10-75% of sequence among the 12 X-degenerate genes, and gene conversion tracts were up to 410 bp in length. Gene conversion events were observed in NLGN4Y for all species (Table 1). From the multiple-species alignments, the total number of gene conversion events ranged from nine to 13 across great ape species, except for orangutan, which had only two high-confidence events. From pairwise chromosome X and Y alignments in each species, seven to 38 gene conversion events were observed across great ape species. Gene conversion events were most frequent in PRKY, which was previously shown to undergo gene conversion in humans (Rosser et al., 2009). Our results indicate that this gene also undergoes gene conversion in other great apes except for orangutan. No X-Y gene conversion events were detected for EIF1AY, KDM5D, SRY, TMSB4Y, USP9Y, and UTY in any great ape species studied.
Table 4-1. Gene conversion between X and Y chromosomes of great apes using GENECONV. The first column lists the gene symbols, with the stratum each gene belongs to in brackets. The values in the following columns represent the number of gene conversion events with significant p-values (<0.05) corrected for multiple comparisons across all sequence pairs. The values in brackets represent the number of gene conversion events with significant p-values (<0.05) for pairwise comparisons corrected for the length of the alignment.
Gene (stratum)   Bonobo    Chimpanzee  Human     Gorilla   Orangutan
AMELY (4)        1 (1)     1 (1)       1 (1)     1 (1)     0
DBY (3)          0 (1)     0 (1)       0         0         0 (1)
EIF1AY (3)       0         0           0         0         0
KDM5D (2)        0         0           0         0         0
NLGN4Y (4)       2 (6)     2 (7)       3 (7)     1 (8)     1 (3)
PRKY (5)         8 (25)    10 (27)     4 (17)    8 (17)    0
SRY (1)          0         0           0         0         0
TBL1Y (4)        1 (4)     0 (2)       1 (7)     0 (3)     0 (2)
TMSB4Y (3)       0         0           0         0         0
USP9Y (3)        0         0           0         0         0
UTY (3)          0         0           0         0         0
ZFY (3)          0         0           0         0         1 (1)
Total            12 (37)   13 (38)     9 (32)    10 (29)   2 (7)
Discussion
Palindromes
We found that multi-copy sequences are abundant in great ape Ys and that many of them were present in the common ancestor of great apes. In fact, substantial portions of most human palindromes (five out of eight), and of most chimpanzee palindrome groups (three out of five), were likely multi-copy (and thus potentially palindromic) in the common ancestor of great apes, suggesting conservation over >13 MY of evolution. Moreover, two of the three rhesus macaque palindromes are conserved with human palindromes P4 and P5 (Hughes et al., 2012), indicating conservation over >25 MY. On the other hand, our study also found species-specific amplification or loss of palindromes and other repetitive sequences, indicating that Y ampliconic sequence evolution is highly dynamic. A vivid example of this pattern is an extensive amplification of sequences homologous to P1, P2, P4, P5, and C3-C4 in orangutan. We also found that sizeable proportions of species-specific sequences in the bonobo, gorilla and orangutan Ys are multi-copy. Future studies will establish which of them form palindromes as opposed to tandem repeats. Regardless, repetitive sequences constitute a biologically significant component of great ape Y chromosomes and their multi-copy state might be selected for.
Previous studies (e.g., reviewed in (Betrán et al., 2012; Rozen et al., 2003; Trombetta & Cruciani, 2017)) focused on the role of Y-Y gene conversion in preserving Y ampliconic gene families, which are critical for spermatogenesis and fertility (Skaletsky et al., 2003), and suggested that this phenomenon constitutes the major adaptive role of palindromic sequences. Our findings suggest that some human palindromes (P6 and P7), instead of harboring genes, possess regions that regulate the expression of genes transcribed outside of testis and thus likely located on non-Y chromosomes, and that such regulatory regions might drive conservation of gene-free palindromes. This observation should be examined in more detail in the future, but could potentially shift the paradigm in our understanding of Y chromosome functions. Indeed, our results imply that, in addition to carrying genes important for male sex determination and spermatogenesis, the Y chromosome participates in regulating gene expression in the genome. This echoes findings in Drosophila (e.g., (Lemos et al., 2010)) and in the mouse Y chromosome, which contains
a small, gene-free region that interacts with the rest of the genome (Kaufmann et al., 2015).
Genes
We found that the gene content in the common ancestor of great apes likely was the same as is currently found in gorilla, and included eight ampliconic and 16 X-degenerate genes (Fig. 4). Analyzing the data on ampliconic gene content (Fig. 4), palindrome sequence (Fig. 1B), and ampliconic gene copy number (Fig. 2 in Chapter 3) evolution jointly, we can infer which ampliconic genes were present in the multi-copy state in the common ancestor of great apes. Our results suggest that the common ancestor of great apes had sequences homologous to P1, P2, P4, P5 and P8 in multi-copy state (Fig. 1B), which in human carry DAZ, BPY2, CDY, HSFY, XKRY, and VCY (Tomaszkiewicz et al., 2017). Except for VCY, which was acquired by the human-chimpanzee common ancestor, the remaining five genes were likely present as multi-copy gene families in the common ancestor of great apes, because three of them (DAZ, BPY2, and CDY) are present as multi-copy in all great ape species (Vegesna et al., 2020), and the other two (HSFY and XKRY) are present as multi-copy in all great ape species but chimpanzee and bonobo (Vegesna et al., 2020), in which they were lost (Fig. 4).
Our comprehensive analysis of great ape Y gene content has allowed us to investigate rates of gene birth and death on the Y. We discovered that there is only one gene family that was born throughout the whole great ape phylogeny—VCY was acquired by the common ancestor of human and chimpanzee. As a result, except for this branch, we found uniformly low rates of gene birth. A low rate of ampliconic gene birth contradicts predictions of high birth rate made in previous studies for such genes (Trombetta & Cruciani, 2017), but suggests that great ape radiation does not provide sufficient time for gene acquisition by ampliconic regions. For new genes to survive on the Y chromosome, they should be beneficial to males. Ampliconic regions on the Y chromosomes of several other mammals indeed acquired such genes (Chang et al., 2013; Soh et al., 2014).
We expected to observe a high death rate for X-degenerate genes, but a low death rate for ampliconic genes, because the former genes do not undergo Y-Y gene conversion and thus should accumulate deleterious mutations, whereas the latter genes are multi-copy (all copies need to be deleted or pseudogenized for the gene family to die) and can be rescued by Y-Y gene conversion. Unexpectedly, the rates of gene death did not differ
between ampliconic and X-degenerate genes. Indeed, ~44.4% of ampliconic gene families were either deleted or pseudogenized, as compared with ~43.8% of X-degenerate genes, across the great ape Y phylogenetic tree. These values were 67% and 50%, respectively, if we included the macaque Y. While our data did not support our hypothesis, other findings suggest that death of ampliconic genes is a gradual process. Indeed, we have recently shown that ampliconic gene families dead in some great ape species have reduced copy number in other species (Vegesna et al., 2019, 2020), lowering the chances for Y-Y gene conversion. These observations suggest that such genes are on the way to becoming non-essential and are at death’s door in great apes.
The rates of gene death varied among great ape species. In particular, we observed high rates of death in the Pan lineage—in the common ancestor of bonobo and chimpanzee, but also in the bonobo and chimpanzee lineages. Thus, the evolutionary forces driving such a high rate of gene death have likely been operating in the Pan lineage continuously since its divergence from the human lineage. What evolutionary forces could explain this observation? First, gene-disrupting or gene deletion mutations could be hitchhiking in haplotypes with positively selected mutations. Positive selection might be acting in the Pan lineage due to its polyandrous mating pattern and sperm competition. No gene deaths in the human and gorilla lineages, experiencing no sperm competition, and low gene death rates in orangutan, experiencing limited sperm competition, are consistent with this explanation.
Gene conversion
The suppression of recombination between the X and Y chromosomes has occurred in at least five distinct steps (strata) (Lahn & Page, 1999; Ross et al., 2005). The youngest stratum retains the highest X-Y sequence similarity of ~95% (Ross et al., 2005). However, there are cases where genes on the male-specific part of the Y undergo gene conversion with the X (e.g., PRKY and VCY), which results in regions with higher similarity with the X chromosome (Rosser et al., 2009; Trombetta et al., 2010). We demonstrated X-Y gene conversion at several X-degenerate genes (AMELY, NLGN4Y, PRKY, and TBL1Y) in all great apes (Table 1). Among the genes with X-Y gene conversion, AMELY, NLGN4Y, and TBL1Y belong to stratum 4 and PRKY belongs to stratum 5 (Hughes et al., 2012; Lahn & Page, 1999; Ross et al., 2005). PRKY has a higher frequency of gene conversion events when compared to the other three genes, which is consistent with the requirement of high sequence similarity (>92%) for gene conversion to take place (Chen, Cooper,
Chuzhanova, Férec, & Patrinos, 2007). These conclusions are based on a multiple sequence alignment of X and Y chromosomes of great apes. The Y chromosome assemblies of human and chimpanzee are of a higher quality when compared to those of other great apes, which could influence the count and length of gene conversion events predicted. In the future, gene conversion in ampliconic genes should also be studied. This would require obtaining the fully resolved (i.e. non-collapsed) ampliconic sequences in the assemblies of the Y chromosome.
Materials and Methods
Human and chimpanzee palindrome arm sequence conservation
The multiple sequence alignment of the Y chromosomes of bonobo, chimpanzee, human, gorilla and orangutan was generated using the reference-free whole-genome aligner cactus (Paten et al., 2011). We used default parameters and did not provide a guide tree to cactus for the generation of the multiple sequence alignment. The cactus output (.hal) file was converted to MAF twice using the hal2maf function (--noAncestors --refGenome hg_Y --maxRefGap 100 --maxBlockLen 10000) in HAL tools (Hickey, Paten, Earl, Zerbino, & Haussler, 2013). The first time we set --refGenome to ‘human’, which results in a human-centric MAF file (where the first line represents sequences from the human Y), and the second time we set the --refGenome parameter to ‘chimpanzee’ to obtain chimpanzee-based MAF files. The coordinates of the human and chimpanzee palindromes (PanTro4) were obtained from a previous publication (Tomaszkiewicz et al., 2016) (Table S6-7). Chimpanzee palindrome coordinates were converted to the PanTro6 version using the liftOver utility from the UCSC Genome Browser (Kent et al., 2002). For the few chimpanzee palindromes for which liftOver failed to convert the coordinates, we generated a dotplot of the chimpanzee Y against itself using Gepard (Krumsiek, Arnold, & Rattei, 2007) and identified the locations of palindromes from the dotplot.
Using a custom Python script, which makes use of AlignIO from Biopython (Cock et al., 2009), we parsed the MAF files to obtain all the alignment blocks that overlapped given palindrome arm coordinates and, for each species, summed the number of sites covered by these parsed alignment blocks. For each alignment block within the palindrome arm, we considered only those sequences which had less than 5% gaps and counted the number of aligned nucleotides per species at sites where the palindrome arm
is unmasked (softmasked with RepeatMasker, with lower-case nucleotides representing repeats). Next, we calculated the number of non-repetitive characters in the whole palindrome arm and used it to calculate the percentage of non-repetitive alignment within the palindrome arm per species.
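The per-species coverage calculation described above can be sketched as follows. This is only an illustration under stated assumptions, not the script used in this study: it assumes the alignment blocks overlapping a palindrome arm have already been parsed (for example with AlignIO) into a reference row plus one row per species, and it applies the 5% gap threshold to each species row within a block. The function name `nonrepetitive_coverage` and the data layout are hypothetical.

```python
def nonrepetitive_coverage(blocks, arm_unmasked_len, max_gap_frac=0.05):
    """Return {species: percent of unmasked reference sites that are aligned}.

    blocks: list of (ref_row, {species: row}) tuples; all rows in a block
            have the same length, '-' marks a gap, and lower-case letters
            in ref_row mark soft-masked repeats.
    arm_unmasked_len: number of upper-case (non-repetitive) bases in the arm.
    """
    covered = {}
    for ref_row, species_rows in blocks:
        for species, row in species_rows.items():
            # Skip rows that are too gappy (assumed per-row threshold).
            if row.count("-") / len(row) >= max_gap_frac:
                continue
            # Count columns with an unmasked reference base and an aligned base.
            n = sum(
                1
                for r, s in zip(ref_row, row)
                if r.isupper() and s != "-"
            )
            covered[species] = covered.get(species, 0) + n
    return {sp: 100.0 * n / arm_unmasked_len for sp, n in covered.items()}
```

In this sketch a species with no qualifying rows is simply absent from the result; a real analysis would probably report it as 0%.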
Palindrome sequence read depth in bonobo, gorilla, and orangutan
We used a previously established pipeline for identifying human and chimpanzee palindrome sequences in Y assemblies (Tomaszkiewicz et al., 2016). The bonobo, gorilla, and orangutan Y contigs were broken into non-overlapping 1-kb windows and each window was aligned to human (hg38) and (separately) chimpanzee (panTro6) Y chromosome sequences using LASTZ (version 1.04.00) (--scores=human_primate.q, --seed=match12, --markend) (Harris, 2007). The substitution scores were set identical to those used for published LASTZ alignments of primates (Miller et al., 2007). The best alignment for each window was retrieved from the LASTZ output. All the windows with alignments of >800 bp were considered homologous to human or chimpanzee Y palindromes and used in the downstream analysis.
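The windowing and filtering steps above might look like the following minimal sketch. It is not the published pipeline: the tuple layout of the alignment records is hypothetical, "best alignment" is assumed to mean the highest-scoring one, and keeping a trailing window shorter than 1 kb is also an assumption.

```python
def split_into_windows(contig_len, size=1000):
    """Yield (start, end) for non-overlapping windows; 0-based, end-exclusive.

    A trailing partial window is kept (an assumption of this sketch).
    """
    for start in range(0, contig_len, size):
        yield start, min(start + size, contig_len)


def best_hits(alignments, min_len=800):
    """Keep the best alignment per window, then require aligned length > min_len.

    alignments: iterable of (window_id, target_name, aligned_len, score).
    """
    best = {}
    for win, target, aln_len, score in alignments:
        # Assumed criterion: the highest score wins for each window.
        if win not in best or score > best[win][2]:
            best[win] = (target, aln_len, score)
    return {w: v for w, v in best.items() if v[1] > min_len}
```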
The whole-genome male sequencing reads (paired-end reads of length 251 bp) from orangutan (ID PR00251), gorilla (ID KB3781), and bonobo (ID 1991-0051) were mapped to the human and (separately) chimpanzee Ys using bwa mem (version 0.7.17-r1188) (Li, 2013) and the resulting output files were sorted and indexed using samtools (version 1.9) (Li et al., 2009). Using the computeGCBias and correctGCBias functions from deepTools (version 3.3.0) (Ramírez et al., 2016), we performed GC correction of the sequencing read depths. Finally, using the bedtools (v2.27.1) (Quinlan & Hall, 2010) ‘coverage’ function, we obtained the read depth of the windows homologous to human and chimpanzee palindrome sequences identified as described in the previous paragraph. We compared the read depths of palindromic sequence windows to that of windows overlapping X-degenerate genes (Table S8-9). In the case of chimpanzee, windows mapping to all the palindromes with similar repeats were grouped into C1 (C1, C6, C8, C10, C14, and C16), C2 (C2, C11, and C15), C3 (C3 and C12), C4 (C4 and C13), and C5 (C5, C7 and C9) palindrome groups.
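One way to express the comparison of palindromic and X-degenerate read depths is as a log-scaled relative depth, sketched below. The normalization is an assumption made for illustration: the text does not state how the two sets of depths were compared, so dividing by the median depth of X-degenerate windows (taken here as a single-copy baseline) and the function name `log_copy_number` are hypothetical. The chimpanzee palindrome groups are copied from the text.

```python
import math
from statistics import median

# Chimpanzee palindrome groups as listed in the text.
CHIMP_GROUPS = {
    "C1": ["C1", "C6", "C8", "C10", "C14", "C16"],
    "C2": ["C2", "C11", "C15"],
    "C3": ["C3", "C12"],
    "C4": ["C4", "C13"],
    "C5": ["C5", "C7", "C9"],
}
# Reverse lookup: individual palindrome -> group name.
PALINDROME_TO_GROUP = {
    p: group for group, members in CHIMP_GROUPS.items() for p in members
}


def log_copy_number(window_depths, xdeg_depths):
    """Natural log of window depth relative to the median X-degenerate depth.

    Windows with zero depth are skipped because log(0) is undefined.
    """
    baseline = median(xdeg_depths)  # assumed single-copy baseline
    return [math.log(d / baseline) for d in window_depths if d > 0]
```

Under this assumption a value near 0 corresponds to a single-copy window and a value near ln(2) to a two-copy window.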
Search for regulatory factors in human palindromes P6 and P7
We checked for the presence of sites specific to DNA-binding proteins on human palindromes P6 and P7, which could imply the presence of functionality related to gene regulation. Initially we used the ENCODE track (http://genome.ucsc.edu/ENCODE/) at the UCSC Genome Browser (Kent et al., 2002) to search for epigenetic modifications in the palindrome regions (Dunham et al., 2012). In particular, we used the track from the Bernstein Lab at the Broad Institute containing H3K27Ac and H3K4Me1 data on seven cell lines from ENCODE, from which we identified human umbilical vein endothelial cells (HUVEC) as having signals in palindrome P6. Later we extended our search to the ENCODE data portal (https://www.encodeproject.org/), from which we downloaded the BAM files to look for the presence of peaks (Table S10) (Davis et al., 2018). We used the “Search by region” page under Data in the ENCODE data portal (https://www.encodeproject.org/region-search/) with the GRCh38 setting to search for files related to the coordinates of palindromes P6 and P7. The resulting files were visualized in the UCSC Genome Browser to identify signals in the peaks, and from this information we identified signals in the human liver cancer cell line HepG2 for P7. The ENCODE data processing pipeline by default filters out reads with low mapping quality; as a result, we did not find any peaks/signals in the majority of the datasets (https://www.encodeproject.org/pipelines/). Thus, the current version of ENCODE tracks cannot be used as a source for studying epigenetic modifications in palindromes. To validate the two cell types in which the signal was observed, HUVEC and HepG2, we manually downloaded the unfiltered BAM files (HUVEC: H3K27Ac, H3K4Me1 and DNase-seq; Testis: H3K27Ac, H3K4Me1 and DNase-seq; and HepG2: CREB1; Table S10), which include low-mapping-quality reads.
We performed peak calling using macs2 (version 2.1.4; -f BAM --broad --broad-cutoff 0.05) (Gaspar, 2018) and the control samples were used whenever available. We used the Integrative Genomics Viewer (IGV) (version 2.4.19) (Robinson et al., 2011) to visualize the data.
Analysis of genes homologous to human Y genes
For the bonobo and orangutan Y chromosomes, some data on the X-degenerate gene content were obtained from a recent review (Hallast & Jobling, 2017) (Note S2), and data on the ampliconic gene content were obtained from a recent publication (Vegesna et al. 2020; Chapter 3). However, information about RPS4Y2 remained missing. We used gene predictions (see
next paragraph) and testis-specific transcriptome assemblies from the same publication to check for the presence of the RPS4Y2 and MXRA5Y genes (Note S2) (Vegesna et al. 2020; Chapter 3). The gene content of the macaque Y chromosome was used as an outgroup to great apes (Hughes et al., 2012).
Novel genes in bonobo and orangutan assemblies
We used AUGUSTUS (Stanke & Waack, 2003) (--species=human --softmasking=on --codingseq=on) to predict genes in the Y chromosome assemblies of bonobo and orangutan. From the list of predicted genes we retained those that have a start and a stop codon. Using blastp from BLAST (2.9.0; -db uniprot_sprot.pep -max_target_seqs 1 -evalue 1e-5 -num_threads 10) (Altschul, Gish, Miller, Myers, & Lipman, 1990) we annotated the predicted gene sequences. Based on the blast annotation, we classified all the genes which were not annotated by blastp as candidate de novo genes. From the annotated genes, we retained those that had >90% identity to the sequence in the UniProt (Boutet et al., 2016) protein database and covered at least 90% of the gene sequence in the blast output. To make sure that the predicted genes are on the Y chromosome and do not represent misassemblies, we performed an extra filtering step in which we used only those genes that are present on contigs which align to either human or chimpanzee Y chromosomes. The resulting gene annotations which are not found on the human Y chromosome were assigned as candidate novel genes translocated to the Y (Table S11). The longest predicted transcript sequences for genes annotated as Y-chromosome-specific were obtained without the above filters for validation of missing gene content in bonobo and orangutan.
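The filtering logic above can be sketched as follows. This is an illustration only and rests on several assumptions: predicted coding sequences and blastp results are taken to be already loaded into dictionaries, the start codon is assumed to be ATG, and the function names and data layout are hypothetical rather than part of AUGUSTUS or BLAST.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_start_and_stop(cds):
    """True if a coding sequence begins with ATG and ends with a stop codon."""
    cds = cds.upper()
    return len(cds) >= 6 and cds.startswith("ATG") and cds[-3:] in STOP_CODONS


def classify(genes, hits, min_identity=90.0, min_coverage=90.0):
    """Split complete predicted genes into de novo candidates and annotated genes.

    genes: {gene_id: cds}
    hits:  {gene_id: (percent_identity, percent_coverage)} for genes with a
           blastp hit; genes without a hit are absent from this dict.
    Returns (de_novo_candidates, well_annotated) as sorted lists of gene ids.
    """
    complete = [g for g, cds in genes.items() if has_start_and_stop(cds)]
    # No blastp annotation at all: candidate de novo gene.
    de_novo = sorted(g for g in complete if g not in hits)
    # Annotated genes must pass both thresholds (>90% identity, >=90% coverage).
    annotated = sorted(
        g
        for g in complete
        if g in hits and hits[g][0] > min_identity and hits[g][1] >= min_coverage
    )
    return de_novo, annotated
```

The additional step of keeping only genes on contigs that align to the human or chimpanzee Y is not shown here.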
Reconstruction of gene content of great apes
Once we obtained the complete Y chromosome gene content in great apes, we converted the gene content into binary values which represent the presence or absence of a gene in a species. The complete loss or pseudogenization of a gene or gene family is represented by zero and the remaining cases are represented as ones. Using the model developed by Iwasaki and Takagi (Iwasaki & Takagi, 2007), we reconstructed the evolutionary history of genes (separately ampliconic and X-degenerate) across great apes. The phylogenetic tree of great apes (Locke et al., 2011), along with the table representing the presence or absence of genes, was used as an input to generate the rate of gene birth and gene death for each branch of the tree, and the reconstructed gene content at each internal node of
the tree. To obtain the rate in the units of events per million years, the rate of gene birth, as well as the rate of gene death, was divided by the length of the branches (in million years) (Iwasaki & Takagi, 2007).
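The binary encoding and the rescaling to events per million years can be illustrated with the short sketch below. It does not implement the Iwasaki and Takagi model itself; the status labels ("present", "deleted", "pseudogene") and both function names are hypothetical, and the per-branch rate is assumed to come from the model's output.

```python
def to_binary(status_table):
    """Map each gene status to 1 (present) or 0 (deleted or pseudogenized).

    status_table: {species: {gene: status}} with hypothetical status labels.
    """
    absent = {"deleted", "pseudogene"}
    return {
        species: {gene: 0 if status in absent else 1 for gene, status in genes.items()}
        for species, genes in status_table.items()
    }


def per_million_years(branch_rate, branch_length_my):
    """Rescale a per-branch birth or death rate to events per million years."""
    return branch_rate / branch_length_my
```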
Gene conversion events between the X and Y chromosomes
We softmasked the X and Y chromosomes of great apes using RepeatMasker (SMIT & A., 2004) (RepeatMasker -pa 63 -xsmall -species Primates ${assembly}.rmsk.fa). Progressive cactus (Paten et al., 2011) was used to align the chromosomes. A guide tree which pairs the X and Y chromosomes from the same species was used (((chimpY, chimpX),(bonoboY, bonoboX),(humanY, humanX),(gorillaY, gorillaX),(sorangY, sorangX))). The resulting alignment output from cactus was converted to MAF format using hal2maf (Hickey et al., 2013) with the human Y as a reference (--noAncestors --refGenome humanY --maxRefGap 100 --maxBlockLen 10000). Then we parsed the alignment blocks (retaining blocks longer than 50 bp; the range of gene conversion tracts in human is 55-290 bp, as was observed in previous studies (Chen et al., 2007; Jeffreys & May, 2004)) which fall within the coordinates of X-degenerate genes (excluding CYorf15A, CYorf15B, RPS4Y1, and RPS4Y2, see Results). We did not perform additional filtering based on repeat content within the alignment blocks. For each block, the alignments which constitute the sequences from both the X and Y chromosomes for a species were printed into a FASTA file. We used GENECONV (Sawyer, 1999) (/w9 /lp -nolog) to identify gene conversion events based on the multiple sequence alignment files. By default, the output of GENECONV constitutes a global list of high-confidence gene conversion events after multiple testing for each sequence pair. We also used the /lp parameter, with which a second list of significant gene conversion events for all possible pairwise comparisons is produced. We used a p-value cutoff of 0.05, as GENECONV provides p-values after correcting for multiple comparisons. We used pairwise comparisons to address gene conversion in cases where a chromosome was represented by more than one sequence in the alignment.
From the GENCONV output, we parsed events that constitute gene conversion between the X and Y chromosome from the same species and retained events which are longer than 50 bp.
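The final filtering step can be illustrated with a short sketch. It is an assumption-laden example rather than the parser used here: the Event record, the field names, and the rule that a sequence name ends in its chromosome letter (as in the guide tree labels, e.g. humanX) are all hypothetical, and no attempt is made to read the actual GENECONV output format.

```python
from dataclasses import dataclass


@dataclass
class Event:
    seq1: str       # e.g. "humanX" (hypothetical naming, following the guide tree)
    seq2: str       # e.g. "humanY"
    begin: int      # 1-based inclusive start in the alignment
    end: int        # 1-based inclusive end in the alignment
    p_value: float  # p-value already corrected for multiple comparisons


def split_name(name):
    """Split 'humanX' into ('human', 'X'); assumes the last character is the chromosome."""
    return name[:-1], name[-1]


def is_same_species_xy(ev):
    """True if the event pairs the X and the Y chromosome of one species."""
    sp1, chr1 = split_name(ev.seq1)
    sp2, chr2 = split_name(ev.seq2)
    return sp1 == sp2 and {chr1, chr2} == {"X", "Y"}


def filter_events(events, min_len=50, alpha=0.05):
    """Keep same-species X-Y events longer than min_len bp with p < alpha."""
    kept = []
    for ev in events:
        length = ev.end - ev.begin + 1
        if is_same_species_xy(ev) and length > min_len and ev.p_value < alpha:
            kept.append(ev)
    return kept


events = [
    Event("humanX", "humanY", 100, 260, 0.01),    # same species, 161 bp, significant
    Event("humanX", "chimpY", 100, 260, 0.01),    # different species
    Event("gorillaX", "gorillaY", 10, 40, 0.01),  # only 31 bp
    Event("chimpY", "chimpX", 5, 300, 0.20),      # not significant
]
kept = filter_events(events)
print([(e.seq1, e.seq2) for e in kept])
```

If a chromosome is represented by more than one sequence in a block, the naming rule in split_name would need to be adapted.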
References
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local
alignment search tool. Journal of Molecular Biology, 215(3), 403–410.
Bellott, D. W., Hughes, J. F., Skaletsky, H., Brown, L. G., Pyntikova, T., Cho, T. J.,
Koutseva, N., Zaghlul, S., Graves, T., Rock, S., Kremitzki, C., Fulton, R. S., Dugan,
S., Ding, Y., Morton, D., Khan, Z., Lewis, L., Buhay, C., Wang, Q., … Page, D. C.
(2014). Mammalian Y chromosomes retain widely expressed dosage-sensitive
regulators. Nature, 508(7497), 494–499.
Betrán, E., Demuth, J. P., & Williford, A. (2012). Why chromosome palindromes?
International Journal of Evolutionary Biology, 2012, 207958.
Boutet, E., Lieberherr, D., Tognolli, M., Schneider, M., Bansal, P., Bridge, A. J., Poux, S.,
Bougueleret, L., & Xenarios, I. (2016). UniProtKB/Swiss-Prot, the Manually
Annotated Section of the UniProt KnowledgeBase: How to Use the Entry View.
Methods in Molecular Biology, 1374, 23–54.
Chang, T.-C., Yang, Y., Retzel, E. F., & Liu, W.-S. (2013). Male-specific region of the
bovine Y chromosome is gene rich with a high transcriptomic activity in testis
development. Proceedings of the National Academy of Sciences of the United
States of America, 110(30), 12373–12378.
Chen, J.-M., Cooper, D. N., Chuzhanova, N., Férec, C., & Patrinos, G. P. (2007). Gene
conversion: mechanisms, evolution and human disease. Nature Reviews. Genetics,
8(10), 762–775.
Cheung, V. G., Nayak, R. R., Wang, I. X., Elwyn, S., Cousins, S. M., Morley, M., &
Spielman, R. S. (2010). Polymorphic cis- and trans-regulation of human gene
expression. PLoS Biology, 8(9). https://doi.org/10.1371/journal.pbio.1000480
Cock, P. J. A., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., Friedberg,
I., Hamelryck, T., Kauff, F., Wilczynski, B., & de Hoon, M. J. L. (2009). Biopython:
freely available Python tools for computational molecular biology and bioinformatics.
Bioinformatics, 25(11), 1422–1423.
Davis, C. A., Hitz, B. C., Sloan, C. A., Chan, E. T., Davidson, J. M., Gabdank, I., Hilton,
J. A., Jain, K., Baymuradov, U. K., Narayanan, A. K., Onate, K. C., Graham, K.,
Miyasato, S. R., Dreszer, T. R., Strattan, J. S., Jolanki, O., Tanaka, F. Y., & Cherry,
J. M. (2018). The Encyclopedia of DNA elements (ENCODE): data portal update.
Nucleic Acids Research, 46(D1), D794–D801.
Dunham, I., Kundaje, A., Aldred, S. F., Collins, P. J., Davis, C. A., Doyle, F., Epstein, C.
B., Frietze, S., Harrow, J., Kaul, R., Khatun, J., Lajoie, B. R., Landt, S. G., Lee, B.
K., Pauli, F., Rosenbloom, K. R., Sabo, P., Safi, A., Sanyal, A., … Lochovsky, L.
(2012). An integrated encyclopedia of DNA elements in the human genome. Nature,
489(7414), 57–74.
Gaspar, J. M. (2018). Improved peak-calling with MACS2. In bioRxiv (p. 496521).
https://doi.org/10.1101/496521
Gläser, B., Grützner, F., Willmann, U., Stanyon, R., Arnold, N., Taylor, K., Rietschel, W.,
Zeitler, S., Toder, R., & Schempp, W. (1998). Simian Y chromosomes: species-
specific rearrangements of DAZ, RBM, and TSPY versus contiguity of PAR and
SRY. Mammalian Genome: Official Journal of the International Mammalian Genome
Society, 9(3), 226–231.
Graves, J. A. M. (1995). The origin and function of the mammalian Y chromosome and
Y‐borne genes – an evolving understanding. BioEssays: News and Reviews in
Molecular, Cellular and Developmental Biology, 17(4), 311–320.
Hallast, P., & Jobling, M. A. (2017). The Y chromosomes of the great apes. Human
Genetics, 136(5), 511–528.
Harris, R. S. (2007). Improved pairwise alignment of genomic DNA.
https://etda.libraries.psu.edu/catalog/7971
Hickey, G., Paten, B., Earl, D., Zerbino, D., & Haussler, D. (2013). HAL: a hierarchical
format for storing and analyzing multiple genome alignments. Bioinformatics,
29(10), 1341–1342.
Hughes, J. F., Skaletsky, H., Brown, L. G., Pyntikova, T., Graves, T., Fulton, R. S.,
Dugan, S., Ding, Y., Buhay, C. J., Kremitzki, C., Wang, Q., Shen, H., Holder, M.,
Villasana, D., Nazareth, L. V., Cree, A., Courtney, L., Veizer, J., Kotkiewicz, H., …
Page, D. C. (2012). Strict evolutionary conservation followed rapid gene loss on
human and rhesus Y chromosomes. Nature, 483(7387), 82–87.
Hughes, J. F., Skaletsky, H., Pyntikova, T., Graves, T. A., van Daalen, S. K. M., Minx, P.
J., Fulton, R. S., McGrath, S. D., Locke, D. P., Friedman, C., Trask, B. J., Mardis, E.
R., Warren, W. C., Repping, S., Rozen, S., Wilson, R. K., & Page, D. C. (2010).
Chimpanzee and human Y chromosomes are remarkably divergent in structure and
gene content. Nature, 463(7280), 536–539.
Hughes, J. F., Skaletsky, H., Pyntikova, T., Minx, P. J., Graves, T., Rozen, S., Wilson, R.
K., & Page, D. C. (2005). Conservation of Y-linked genes during human evolution
revealed by comparative sequencing in chimpanzee. Nature, 437(7055), 100–103.
Iwasaki, W., & Takagi, T. (2007). Reconstruction of highly heterogeneous gene-content
evolution across the three domains of life. Bioinformatics, 23(13), i230–i239.
Jeffreys, A. J., & May, C. A. (2004). Intense and highly localized gene conversion activity
in human meiotic crossover hot spots. Nature Genetics, 36(2), 151–156.
Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A. M., &
Haussler, D. (2002). The Human Genome Browser at UCSC. In Genome
Research (Vol. 12, Issue 6, pp. 996–1006). https://doi.org/10.1101/gr.229102
Krumsiek, J., Arnold, R., & Rattei, T. (2007). Gepard: a rapid and sensitive tool for
creating dotplots on genome scale. Bioinformatics, 23(8), 1026–1028.
Kuroda-Kawaguchi, T., Skaletsky, H., Brown, L. G., Minx, P. J., Cordum, H. S.,
Waterston, R. H., Wilson, R. K., Silber, S., Oates, R., Rozen, S., & Page, D. C.
(2001). The AZFc region of the Y chromosome features massive palindromes and
uniform recurrent deletions in infertile men. In Nature Genetics (Vol. 29, Issue 3, pp.
279–286). https://doi.org/10.1038/ng757
Lahn, B. T., & Page, D. C. (1999). Four Evolutionary Strata on the Human
X Chromosome. Science, 286(5441), 964–967.
Lange, J., Noordam, M. J., van Daalen, S. K. M., Skaletsky, H., Clark, B. A., Macville, M.
V., Page, D. C., & Repping, S. (2013). Intrachromosomal homologous
recombination between inverted amplicons on opposing Y-chromosome arms.
Genomics, 102(4), 257–264.
Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with
BWA-MEM. In arXiv [q-bio.GN]. arXiv. http://arxiv.org/abs/1303.3997
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G.,
Abecasis, G., Durbin, R., & 1000 Genome Project Data Processing Subgroup.
(2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25,
2078–2079.
Locke, D. P., Hillier, L. W., Warren, W. C., Worley, K. C., Nazareth, L. V., Muzny, D. M.,
Yang, S.-P., Wang, Z., Chinwalla, A. T., Minx, P., Mitreva, M., Cook, L., Delehaunty,
K. D., Fronick, C., Schmidt, H., Fulton, L. A., Fulton, R. S., Nelson, J. O., Magrini,
V., … Wilson, R. K. (2011). Comparative and demographic analysis of orang-utan
genomes. Nature, 469(7331), 529–533.
Lucotte, E. A., Skov, L., Jensen, J. M., Macià, M. C., Munch, K., & Schierup, M. H.
(2018). Dynamic Copy Number Evolution of X- and Y-Linked Ampliconic Genes in
Human Populations. Genetics, 209(3), 907–920.
Mayr, B., & Montminy, M. (2001). Transcriptional regulation by the phosphorylation-
dependent factor CREB. Nature Reviews. Molecular Cell Biology, 2(8), 599–609.
Miller, W., Rosenbloom, K., Hardison, R. C., Hou, M., Taylor, J., Raney, B., Burhans, R.,
King, D. C., Baertsch, R., Blankenberg, D., Kosakovsky Pond, S. L., Nekrutenko, A.,
Giardine, B., Harris, R. S., Tyekucheva, S., Diekhans, M., Pringle, T. H., Murphy, W.
J., Lesk, A., … Kent, W. J. (2007). 28-Way vertebrate alignment and conservation
track in the UCSC Genome Browser. In Genome Research (Vol. 17, Issue 12, pp.
1797–1808). https://doi.org/10.1101/gr.6761107
Oetjens, M. T., Shen, F., Emery, S. B., Zou, Z., & Kidd, J. M. (2016). Y-Chromosome
Structural Diversity in the Bonobo and Chimpanzee Lineages. Genome Biology and
Evolution, 8(7), 2231–2240.
Paten, B., Earl, D., Nguyen, N., Diekhans, M., Zerbino, D., & Haussler, D. (2011).
Cactus: Algorithms for genome multiple sequence alignment. Genome Research,
21(9), 1512–1528.
Pennacchio, L. A., Bickmore, W., Dean, A., Nobrega, M. A., & Bejerano, G. (2013).
Enhancers: five essential questions. Nature Reviews. Genetics, 14(4), 288–295.
Perry, G. H., Tito, R. Y., & Verrelli, B. C. (2007). The evolutionary history of human and
chimpanzee Y-chromosome gene loss. Molecular Biology and Evolution, 24(3),
853–859.
Quinlan, A. R., & Hall, I. M. (2010). BEDTools: a flexible suite of utilities for comparing
genomic features. Bioinformatics, 26(6), 841–842.
Ramírez, F., Ryan, D. P., Grüning, B., Bhardwaj, V., Kilpert, F., Richter, A. S., Heyne, S.,
Dündar, F., & Manke, T. (2016). deepTools2: a next generation web server for
deep-sequencing data analysis. Nucleic Acids Research, 44(W1), W160–W165.
Robinson, J. T., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G.,
& Mesirov, J. P. (2011). Integrative genomics viewer. Nature Biotechnology, 29(1),
24–26.
Rosser, Z. H., Balaresque, P., & Jobling, M. A. (2009). Gene Conversion between the X
Chromosome and the Male-Specific Region of the Y Chromosome at a
Translocation Hotspot. American Journal of Human Genetics, 85(1), 130–134.
Ross, M. T., Grafham, D. V., Coffey, A. J., Scherer, S., McLay, K., Muzny, D., Platzer,
M., Howell, G. R., Burrows, C., Bird, C. P., Frankish, A., Lovell, F. L., Howe, K. L.,
Ashurst, J. L., Fulton, R. S., Sudbrak, R., Wen, G., Jones, M. C., Hurles, M. E., …
Bentley, D. R. (2005). The DNA sequence of the human X chromosome. Nature,
434(7031), 325–337.
Rozen, S., Skaletsky, H., Marszalek, J. D., Minx, P. J., Cordum, H. S., Waterston, R. H.,
Wilson, R. K., & Page, D. C. (2003). Abundant gene conversion between arms of
palindromes in human and ape Y chromosomes. Nature, 423(6942), 873–876.
Sawyer, S. A. (1999). GENECONV: a computer package for the statistical detection of
gene conversion. Distributed by the author, Department of Mathematics,
Washington University in St. Louis. St. Louis.
Skaletsky, H., Kuroda-Kawaguchi, T., Minx, P. J., Cordum, H. S., Hillier, L., Brown, L. G.,
Repping, S., Pyntikova, T., Ali, J., Bieri, T., Chinwalla, A., Delehaunty, A.,
Delehaunty, K., Du, H., Fewell, G., Fulton, L., Fulton, R., Graves, T., Hou, S.-F., …
Page, D. C. (2003a). The male-specific region of the human Y chromosome is a
mosaic of discrete sequence classes. Nature, 423, 825–837.
Skaletsky, H., Kuroda-Kawaguchi, T., Minx, P. J., Cordum, H. S., Hillier, L., Brown, L. G.,
Repping, S., Pyntikova, T., Ali, J., Bieri, T., Chinwalla, A., Delehaunty, A.,
Delehaunty, K., Du, H., Fewell, G., Fulton, L., Fulton, R., Graves, T., Hou, S.-F., …
Page, D. C. (2003b). The male-specific region of the human Y chromosome is a
mosaic of discrete sequence classes. Nature, 423(6942), 825–837.
Skov, L., & Schierup, M. H. (2017). Analysis of 62 hybrid assembled human Y
chromosomes exposes rapid structural changes and high rates of gene conversion.
PLoS Genetics, 13(8), 1–20.
Smit, A. F. A. (2004). RepeatMasker Open-3.0. http://www.repeatmasker.org.
https://ci.nii.ac.jp/naid/10029514778/
Soh, Y. Q. S., Alföldi, J., Pyntikova, T., Brown, L. G., Graves, T., Minx, P. J., Fulton, R.
S., Kremitzki, C., Koutseva, N., Mueller, J. L., Rozen, S., Hughes, J. F., Owens, E.,
Womack, J. E., Murphy, W. J., Cao, Q., de Jong, P., Warren, W. C., Wilson, R. K.,
… Page, D. C. (2014). Sequencing the mouse Y chromosome reveals convergent
gene acquisition and amplification on both sex chromosomes. Cell, 159(4), 800–
813.
Spicuglia, S., & Vanhille, L. (2012). Chromatin signatures of active enhancers. Nucleus,
3(2), 126–131.
Stanke, M., & Waack, S. (2003). Gene prediction with a hidden Markov model and a new
intron submodel. Bioinformatics, 19 Suppl 2, ii215–ii225.
Tomaszkiewicz, M., Medvedev, P., & Makova, K. D. (2017). Y and W Chromosome
Assemblies: Approaches and Discoveries. Trends in Genetics: TIG, 33(4), 266–282.
Tomaszkiewicz, M., Rangavittal, S., Cechova, M., Campos Sanchez, R., Fescemyer, H.
W., Harris, R., Ye, D., O’Brien, P. C. M., Chikhi, R., Ryder, O. A., Ferguson-Smith,
M. A., Medvedev, P., & Makova, K. D. (2016). A time- and cost-effective strategy to
sequence mammalian Y Chromosomes: an application to the de novo assembly of
gorilla Y. Genome Research, 26(4), 530–540.
Trombetta, B., & Cruciani, F. (2017). Y chromosome palindromes and gene conversion.
Human Genetics, 136(5), 605–619.
Trombetta, B., Cruciani, F., Underhill, P. A., Sellitto, D., & Scozzari, R. (2010). Footprints
of X-to-Y gene conversion in recent human evolution. Molecular Biology and
Evolution, 27(3), 714–725.
Vegesna, R., Tomaszkiewicz, M., Medvedev, P., & Makova, K. D. (2019). Dosage
regulation, and variation in gene expression and copy number of human Y
chromosome ampliconic genes. PLoS Genetics, 15(9), e1008369.
Vegesna, R., Tomaszkiewicz, M., Ryder, O. A., Campos-Sánchez, R., Medvedev, P.,
DeGiorgio, M., & Makova, K. D. (2020). Ampliconic genes on the great ape Y
chromosomes: Rapid evolution of copy number but conservation of expression
levels. In Review.
Wilson Sayres, M. A., Lohmueller, K. E., & Nielsen, R. (2014). Natural selection reduced
diversity on human Y chromosomes. PLoS Genetics, 10(1), e1004064.
Ye, D., Zaidi, A. A., Tomaszkiewicz, M., Anthony, K., Liebowitz, C., DeGiorgio, M.,
Shriver, M. D., & Makova, K. D. (2018). High Levels of Copy Number Variation of
Ampliconic Genes across Major Human Y Haplogroups. Genome Biology and
Evolution, 10(5), 1333–1350.
Chapter 5
Summary
In chapter two, we developed AmpliCoNE, a method that estimates the copy number of ampliconic genes on the Y chromosome from whole-genome sequencing data. Using this method, we established the relationship between the copy number and gene expression of Y ampliconic genes in 149 men. This analysis revealed a high tolerance for variation in ampliconic gene expression in testis, as well as dosage regulation of ampliconic genes that compensates for the variation in their copy number.
In chapter three, as a follow-up to chapter two, we estimated the Y ampliconic gene copy number and expression levels in great apes and examined their conservation across species. We observed significant differences in gene copy number between species, whereas the overall expression levels were conserved. When we studied the relationship between copy number and expression across species, we observed a positive correlation in three gene families, which implies that copy number does influence gene expression over long evolutionary times. As in humans, within ampliconic gene families we did not observe a relationship between copy number and gene expression, which strengthens our observation of dosage regulation of Y ampliconic genes. We observed significant interspecific size differences, sometimes even between sister species (chimpanzee and bonobo). We hypothesize that sperm competition and mating structure, which differ drastically among great apes, might have affected these patterns.
Finally, in chapter four we reconstructed the evolutionary history of gene content on great ape Y chromosomes to study whether the rate at which the Y chromosome gained or lost genes is constant across great apes or whether there is a species-specific increase. We presented the conservation of human and chimpanzee palindromes in other great apes and attempted to reconstruct the palindromic structure in the great ape common ancestor. We showed that not only genes but also transcription factors can influence the conservation of palindromes. We also identified gene conversion events between X and Y across great apes, which had previously been shown only in humans and chimpanzees.
Significance and future work
In humans, there are nine ampliconic gene families (BPY2, CDY, DAZ, HSFY, PRY, RBMY, TSPY, VCY, and XKRY), which play an important role in spermatogenesis and are expressed exclusively in testis (Skaletsky et al., 2003). Loss of ampliconic gene families is linked to infertility (Carvalho, Zhang, & Lupski, 2011; Krausz et al., 2011; Navarro-Costa, Plancha, & Gonçalves, 2010; Repping et al., 2002; Vogt, 1996). Recent studies reported high variation in ampliconic gene copy number in healthy men (Lucotte et al., 2018; Skov, Danish Pan Genome Consortium, & Schierup, 2017; Ye et al., 2018). However, very little was known about how differences in ampliconic gene copy number influence their expression. We filled this gap by defining the relationship between Y ampliconic gene expression levels and variation in gene copy number, Y haplogroup, and the individual’s age. We also described the presence of dosage compensation, which aids in understanding how the Y chromosome adapts to dynamic shifts in copy number.
Similar to the variation observed in humans, there is known variation in the copy number of ampliconic genes across great apes (Hughes et al., 2010; Oetjens, Shen, Emery, Zou, & Kidd, 2016; Repping et al., 2006; Schaller et al., 2010; Tomaszkiewicz et al., 2016). However, we lacked information about the relationship between ampliconic gene copy number and expression levels (Hughes et al., 2012, 2010; Skaletsky et al., 2003; Tomaszkiewicz et al., 2016). We addressed this question by estimating the ampliconic gene copy number and expression levels on great ape Y chromosomes. On the one hand, Y ampliconic genes are linked to spermatogenesis, and on the other hand, there is known variation in the mating patterns of great apes (Harrison & Chivers, 2007; Wistuba et al., 2003). We tested whether Y ampliconic genes’ copy number and expression levels are conserved across great apes and demonstrated that gene families such as TSPY and RBMY have species-specific differences which can explain phenotypes linked to mating patterns across great apes. These results shed light on how Y chromosome genes might be involved in achieving different mating patterns among great apes.
As a result of sexual antagonism, ampliconic gene families might have accumulated on the Y chromosome to increase male reproductive fitness (Bellott et al., 2014; Betrán, Demuth, & Williford, 2012; Rozen et al., 2003). However, only five of the nine human ampliconic gene families are found across all great ape species (Hallast & Jobling, 2017); the remaining four are lost or pseudogenized in some species. We annotated the Y chromosome assemblies of great apes to study whether they have acquired new genes, and we reconstructed the evolution of gene content on the Y to understand whether there is species-specific loss or gain of genes. We also identified factors that influence the survival of palindromes, which can help us better understand the evolution of ampliconic gene families, as these are mostly located in palindromes on the Y.
Non-human great ape species are confined to two geographic regions, Africa (gorillas, chimpanzees, and bonobos) and several islands of Indonesia (orangutans), and all are endangered or critically endangered. There is a need to restore their populations. Ampliconic genes make up the majority of protein-coding genes on the Y, and insights into their role in male fertility should help improve male reproductive fitness, which influences the survival of great ape populations. This dissertation should be viewed as a starting point for learning about the evolution of ampliconic genes and their variation in great apes.
Below is a list of possible future directions. They are to
1. obtain complete Y ampliconic gene sequences from long reads rather than short reads; the latter cannot address small indels and deleterious mutations.
2. look at expression levels of ampliconic orthologs that moved from sex chromosomes onto autosomes and determine the role they play in dosage compensation of gene expression.
3. obtain cell-type-specific expression data, because testis tissue is a mixture of different cell types.
4. study expression data at different stages of spermatogenesis with different copies of genes, because ampliconic genes are linked to spermatogenesis.
5. develop tools to study transcription factors in palindromic regions which remain understudied.
6. study Y regulatory sequences which can benefit male fitness, just as the Y has accumulated genes beneficial to males.
Major contributions
1. Presented a method to estimate the copy number of human Y ampliconic genes.
2. Presented a study of Y ampliconic copy number and expression among great apes.
3. Studied the relationship between ampliconic gene expression and copy number in humans and other great apes.
4. Demonstrated conservation of ampliconic genes with respect to sperm competition and mating structure of great apes.
5. Provided information on gene and palindrome content of the common ancestor of great apes.
References
Bellott, D. W., Hughes, J. F., Skaletsky, H., Brown, L. G., Pyntikova, T., Cho, T.-J., …
Page, D. C. (2014). Mammalian Y chromosomes retain widely expressed dosage-
sensitive regulators. Nature, 508(7497), 494–499.
Betrán, E., Demuth, J. P., & Williford, A. (2012). Why Chromosome Palindromes?
International Journal of Evolutionary Biology, 2012, 1–14.
Carvalho, C. M. B., Zhang, F., & Lupski, J. R. (2011). Structural variation of the human
genome: mechanisms, assays, and role in male infertility. Systems Biology in
Reproductive Medicine, 57(1-2), 3–16.
Hallast, P., & Jobling, M. A. (2017). The Y chromosomes of the great apes. Human
Genetics, 136(5), 511–528.
Harrison, M. E., & Chivers, D. J. (2007). The orang-utan mating system and the
unflanged male: A product of increased food stress during the late Miocene and
Pliocene? Journal of Human Evolution, 52(3), 275–293.
Hughes, J. F., Skaletsky, H., Brown, L. G., Pyntikova, T., Graves, T., Fulton, R. S., …
Page, D. C. (2012). Strict evolutionary conservation followed rapid gene loss on
human and rhesus Y chromosomes. Nature, 483(7387), 82–86.
Hughes, J. F., Skaletsky, H., Pyntikova, T., Graves, T. A., van Daalen, S. K. M., Minx, P.
J., … Page, D. C. (2010). Chimpanzee and human Y chromosomes are remarkably
divergent in structure and gene content. Nature, 463(7280), 536–539.
Krausz, C., Chianese, C., Giachini, C., Guarducci, E., Laface, I., & Forti, G. (2011). The
Y chromosome-linked copy number variations and male fertility. Journal of
Endocrinological Investigation, 34(5), 376–382.
Lucotte, E. A., Skov, L., Jensen, J. M., Coll Macià, M., Munch, K., & Schierup, M. H.
(2018). Dynamic Copy Number Evolution of X- and Y-Linked Ampliconic Genes in
Human Populations. Genetics. https://doi.org/10.1534/genetics.118.300826
Navarro-Costa, P., Plancha, C. E., & Gonçalves, J. (2010). Genetic Dissection of the
AZF Regions of the Human Y Chromosome: Thriller or Filler for Male (In)fertility?
Journal of Biomedicine & Biotechnology, 2010, 1–18.
Oetjens, M. T., Shen, F., Emery, S. B., Zou, Z., & Kidd, J. M. (2016). Y-Chromosome
Structural Diversity in the Bonobo and Chimpanzee Lineages. Genome Biology and
Evolution, 8(7), 2231–2240.
Repping, S., Skaletsky, H., Lange, J., Silber, S., van der Veen, F., Oates, R. D., …
Rozen, S. (2002). Recombination between Palindromes P5 and P1 on the Human
Y Chromosome Causes Massive Deletions and Spermatogenic Failure. American
Journal of Human Genetics, 71(4), 906–922.
Repping, S., van Daalen, S. K. M., Brown, L. G., Korver, C. M., Lange, J., Marszalek, J.
D., … Rozen, S. (2006). High mutation rates have driven extensive structural
polymorphism among human Y chromosomes. Nature Genetics, 38(4), 463–467.
Rozen, S., Skaletsky, H., Marszalek, J. D., Minx, P. J., Cordum, H. S., Waterston, R. H.,
… Page, D. C. (2003). Abundant gene conversion between arms of palindromes in
human and ape Y chromosomes. Nature, 423(6942), 873–876.
Schaller, F., Fernandes, A. M., Hodler, C., Münch, C., Pasantes, J. J., Rietschel, W., &
Schempp, W. (2010). Y Chromosomal Variation Tracks the Evolution of Mating
Systems in Chimpanzee and Bonobo. PloS One, 5(9), e12482.
Skaletsky, H., Kuroda-Kawaguchi, T., Minx, P. J., Cordum, H. S., Hillier, L., Brown, L. G.,
… Page, D. C. (2003). The male-specific region of the human Y chromosome is a
mosaic of discrete sequence classes. Nature, 423(6942), 825–837.
Skov, L., Danish Pan Genome Consortium, & Schierup, M. H. (2017). Analysis of 62
hybrid assembled human Y chromosomes exposes rapid structural changes and
high rates of gene conversion. PLoS Genetics, 13(8), e1006834.
Tomaszkiewicz, M., Rangavittal, S., Cechova, M., Campos Sanchez, R., Fescemyer, H.
W., Harris, R., … Makova, K. D. (2016). A time- and cost-effective strategy to
sequence mammalian Y Chromosomes: an application to the de novo assembly of
gorilla Y. Genome Research, 26(4), 530–540.
Vogt, P. H. (1996). Human Y Chromosome Function in Male Germ Cell Development. In
Advances in Developmental Biology (1992) (pp. 191–257).
Wistuba, J., Schrod, A., Greve, B., Keith Hodges, J., Aslam, H., Weinbauer, G. F., &
Marc Luetjens, C. (2003). Organization of Seminiferous Epithelium in Primates:
Relationship to Spermatogenic Efficiency, Phylogeny, and Mating System. Biology
of Reproduction, 69(2), 582–591.
Ye, D., Zaidi, A. A., Tomaszkiewicz, M., Anthony, K., Liebowitz, C., DeGiorgio, M., …
Makova, K. D. (2018). High Levels of Copy Number Variation of Ampliconic Genes
across Major Human Y Haplogroups. Genome Biology and Evolution, 10(5), 1333–
1350.
Appendix A
Supporting Material for Chapter 2
Supplemental Tables
Table S1. Copy number counts in the simulated sets. For each ampliconic gene family, the table gives the total number of gene copies present on the Y chromosome reference used to simulate paired-end FASTQ files (Expected) and the copy number estimated with AmpliCoNE (Observed). The gene families in bold (TSPY, RBMY, and VCY) have custom copy numbers in each set.
Gene family  Set 1 Expected  Set 1 Observed  Set 2 Expected  Set 2 Observed  Set 3 Expected  Set 3 Observed
TSPY 22 22.0 29 29.0 23 23.3
RBMY 7 6.99 12 11.9 9 8.94
VCY 4 4.01 2 1.92 3 3.11
BPY2 3 3.02 3 2.98 3 2.92
CDY 4 3.93 4 4.07 4 3.96
DAZ 4 4.01 4 4.02 4 4.00
HSFY 2 2.08 2 2.04 2 1.93
PRY 2 1.96 2 2.01 2 1.98
XKRY 2 1.95 2 1.92 2 1.99
Table S2. Ampliconic gene copy number estimates across technical and biological replicates in three male samples from the GIAB consortium: Ashkenazim Son (HG002), Ashkenazim Father (HG003), and Chinese Father (HG006). The ampliconic gene copy number was estimated for two sequencing runs per individual, which are shown as separate columns in the table.
Gene family  HG002 (Run1)  HG002 (Run2)  HG003 (Run1)  HG003 (Run3)  HG006 (141008_D00360)  HG006 (141015_D00360)
BPY2 2.93 2.93 2.92 2.74 1.82 1.99
CDY 4.34 3.97 4.08 3.96 3.05 3.14
DAZ 3.86 3.80 3.86 3.87 1.91 1.78
HSFY 2.01 2.24 2.20 2.18 2.28 2.01
PRY 2.02 1.98 1.84 2.02 2.03 2.04
RBMY 5.83 5.87 6.01 5.44 7.00 6.95
TSPY 40.0 40.2 39.7 39.8 19.2 19.4
VCY 1.45 1.23 1.58 1.13 1.38 1.21
XKRY 1.96 1.98 2.16 1.99 1.81 1.77
Table S3. Ampliconic gene copy number estimates at different depths of coverage of the Y chromosome in Ashkenazim Son (HG002 Run1) from the GIAB consortium. The reads were subsampled using the samtools view function.
Gene family 21x 17x 11x 6x
BPY2 2.93 2.92 2.84 2.89
CDY 4.34 4.34 4.24 4.22
DAZ 3.86 3.85 3.89 3.84
HSFY 2.01 1.96 1.96 1.85
PRY 2.02 2.05 2.05 2.11
RBMY 5.83 5.73 5.69 5.70
TSPY 40.0 40.1 39.7 39.7
VCY 1.45 1.40 1.40 1.64
XKRY 1.96 1.86 1.90 1.77
Table S4. ddPCR-based ampliconic gene copy number estimates for four males from the GIAB consortium. Replicates a, b and c are the copy numbers from the three replicate experiments that were performed, and their mean value is used as the final estimate of the copy number. N/A - not available.
Individual ID  Gene  Replicate a  Replicate b  Replicate c  Mean copy number
NA24149 DAZ 3.92 3.92 4.12 3.99
NA24149 PRY 2.05 2.05 2.24 2.11
NA24149 VCY 2.24 2.22 2.24 2.24
NA24149 BPY2 2.99 3.08 2.93 3.00
NA24149 CDY 3.55 3.64 3.84 3.68
NA24149 HSFY 1.93 1.92 1.78 1.88
NA24149 RBMY 6.11 6.81 6.56 6.49
NA24149 TSPY 44.6 44.5 43.7 44.3
NA24149 XKRY 1.98 1.93 1.90 1.93
NA24385 DAZ 3.72 3.90 3.80 3.8
NA24385 PRY 1.94 1.91 1.91 1.92
NA24385 VCY 2.03 2.09 2.06 2.06
NA24385 BPY2 2.87 2.72 2.86 2.82
NA24385 CDY 3.43 3.49 3.49 3.47
NA24385 HSFY 1.88 1.85 1.81 1.85
NA24385 RBMY 6.87 6.53 6.75 6.72
NA24385 TSPY 42.0 N/A 42.7 42.4
NA24385 XKRY 1.73 1.78 1.81 1.77
NA24631 DAZ 2.01 1.97 2.04 2.01
NA24631 PRY 2.05 2.10 2.17 2.11
NA24631 VCY 2.08 2.09 2.10 2.09
NA24631 BPY2 1.85 1.96 1.89 1.9
NA24631 CDY 2.83 2.91 2.81 2.85
NA24631 HSFY 1.92 1.98 1.93 1.95
NA24631 RBMY 7.75 7.95 7.94 7.88
NA24631 TSPY 21.9 22.2 21.7 22.0
NA24631 XKRY 1.96 1.90 1.86 1.91
NA24694 DAZ 1.88 1.83 1.87 1.86
NA24694 PRY 2.01 2.03 2.01 2.02
NA24694 VCY 2.09 2.09 2.06 2.08
NA24694 BPY2 1.92 1.95 1.93 1.94
NA24694 CDY 2.92 2.90 2.90 2.91
NA24694 HSFY 1.97 2.02 1.89 1.96
NA24694 RBMY 8.06 8.01 7.79 7.95
NA24694 TSPY 21.8 22.1 22.0 22.0
NA24694 XKRY 1.83 1.95 1.96 1.91
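The mean copy number column in Table S4 can be reproduced with a few lines of code. The sketch below is illustrative only: the function name is hypothetical, and it assumes that a replicate marked N/A is simply skipped when averaging. The two example rows are taken from the table (NA24149 DAZ and NA24385 TSPY).

```python
def mean_copy_number(replicates):
    """Average the available ddPCR replicates, skipping entries marked 'N/A'."""
    values = [float(r) for r in replicates if r != "N/A"]
    if not values:
        return None  # no usable replicate
    return sum(values) / len(values)


daz = mean_copy_number(["3.92", "3.92", "4.12"])   # NA24149, DAZ
tspy = mean_copy_number(["42.0", "N/A", "42.7"])   # NA24385, TSPY (one replicate missing)
print(f"{daz:.2f} {tspy:.2f}")
```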
Table S5. Sample IDs of the 170 GTEx samples used initially and retained after filtering for outliers in the gene expression analysis.
SAMPLE ID STATUS
GTEX-111CU RETAINED
GTEX-111FC RETAINED
GTEX-111VG RETAINED
GTEX-111YS RETAINED
GTEX-117XS RETAINED
GTEX-117YW RETAINED
GTEX-117YX FILTERED
GTEX-1192W FILTERED
GTEX-11DXY FILTERED
GTEX-11DXZ RETAINED
GTEX-11EI6 RETAINED
GTEX-11EQ8 RETAINED
GTEX-11EQ9 RETAINED
GTEX-11GS4 RETAINED
GTEX-11LCK RETAINED
GTEX-11NUK RETAINED
GTEX-11NV4 RETAINED
GTEX-11P7K RETAINED
GTEX-11P82 RETAINED
GTEX-11TT1 RETAINED
GTEX-11TUW RETAINED
GTEX-11UD2 FILTERED
GTEX-11WQC RETAINED
GTEX-11WQK RETAINED
GTEX-11ZUS RETAINED
GTEX-1212Z RETAINED
GTEX-12BJ1 FILTERED
GTEX-12C56 RETAINED
GTEX-12WSH RETAINED
GTEX-12WSI RETAINED
GTEX-12WSL RETAINED
GTEX-12WSM RETAINED
GTEX-12ZZY RETAINED
GTEX-13111 RETAINED
GTEX-13112 RETAINED
GTEX-131XE RETAINED
GTEX-131XF FILTERED
GTEX-132QS RETAINED
GTEX-1399Q RETAINED
GTEX-139TT RETAINED
GTEX-13FHP RETAINED
GTEX-13FLW RETAINED
GTEX-13FTW RETAINED
GTEX-13G51 FILTERED
GTEX-13N2G RETAINED
GTEX-13NYB RETAINED
GTEX-13NZA RETAINED
GTEX-13NZB RETAINED
GTEX-13O1R RETAINED
GTEX-13O21 RETAINED
GTEX-13O61 RETAINED
GTEX-13OVH RETAINED
GTEX-13OVL RETAINED
GTEX-13OW5 RETAINED
GTEX-13OW6 RETAINED
GTEX-13OW8 FILTERED
GTEX-13QJ3 FILTERED
GTEX-13VXU RETAINED
GTEX-145MF FILTERED
GTEX-145MH RETAINED
GTEX-145MO RETAINED
GTEX-14753 RETAINED
GTEX-147F4 RETAINED
GTEX-147JS RETAINED
GTEX-14A6H RETAINED
GTEX-14ABY RETAINED
GTEX-N7MS RETAINED
GTEX-NPJ8 FILTERED
GTEX-O5YT RETAINED
GTEX-OOBK RETAINED
GTEX-P4PQ FILTERED
GTEX-P4QS RETAINED
GTEX-PLZ5 RETAINED
GTEX-PLZ6 FILTERED
GTEX-PW2O RETAINED
GTEX-Q2AH RETAINED
GTEX-Q2AI RETAINED
GTEX-QDVN FILTERED
GTEX-QEG4 RETAINED
GTEX-QEG5 RETAINED
GTEX-QLQ7 RETAINED
GTEX-QLQW RETAINED
GTEX-QMRM RETAINED
GTEX-QV31 RETAINED
GTEX-QV44 RETAINED
GTEX-R55C RETAINED
GTEX-R55D RETAINED
GTEX-R55E RETAINED
GTEX-REY6 RETAINED
GTEX-RM2N RETAINED
GTEX-RN64 RETAINED
GTEX-RUSQ RETAINED
GTEX-RWSA RETAINED
GTEX-S33H RETAINED
GTEX-S3XE RETAINED
GTEX-S4Q7 RETAINED
GTEX-S4Z8 FILTERED
GTEX-S7PM RETAINED
GTEX-S7SE RETAINED
GTEX-S95S RETAINED
GTEX-SIU7 FILTERED
GTEX-SUCS RETAINED
GTEX-T5JC RETAINED
GTEX-T6MN RETAINED
GTEX-T8EM RETAINED
GTEX-TKQ1 RETAINED
GTEX-TKQ2 RETAINED
GTEX-U3ZH RETAINED
GTEX-U3ZM RETAINED
GTEX-U4B1 RETAINED
GTEX-U8T8 RETAINED
GTEX-U8XE RETAINED
GTEX-UPJH RETAINED
GTEX-V1D1 RETAINED
GTEX-V955 RETAINED
GTEX-VJYA RETAINED
GTEX-WFG8 RETAINED
GTEX-WFON RETAINED
GTEX-WH7G RETAINED
GTEX-WHSB RETAINED
GTEX-WHSE RETAINED
GTEX-WK11 RETAINED
GTEX-WOFM RETAINED
GTEX-WVLH RETAINED
GTEX-WY7C FILTERED
GTEX-WYJK RETAINED
GTEX-WZTO RETAINED
GTEX-X261 RETAINED
GTEX-X3Y1 RETAINED
GTEX-X4XX FILTERED
GTEX-X5EB RETAINED
GTEX-XAJ8 RETAINED
GTEX-XBEC RETAINED
GTEX-XBED RETAINED
GTEX-XGQ4 RETAINED
GTEX-XMK1 RETAINED
GTEX-XPT6 RETAINED
GTEX-XPVG RETAINED
GTEX-XQ3S RETAINED
GTEX-Y111 RETAINED
GTEX-Y3I4 RETAINED
GTEX-Y5V6 RETAINED
GTEX-Y8E4 RETAINED
GTEX-Y9LG RETAINED
GTEX-YEC3 RETAINED
GTEX-YEC4 RETAINED
GTEX-YF7O RETAINED
GTEX-YFCO RETAINED
GTEX-YJ89 RETAINED
GTEX-Z93S RETAINED
GTEX-ZA64 RETAINED
GTEX-ZAB4 RETAINED
GTEX-ZAB5 RETAINED
GTEX-ZDTT FILTERED
GTEX-ZDYS RETAINED
GTEX-ZLFU FILTERED
GTEX-ZPU1 RETAINED
GTEX-ZQUD RETAINED
GTEX-ZT9W RETAINED
GTEX-ZT9X RETAINED
GTEX-ZTSS RETAINED
GTEX-ZTX8 RETAINED
GTEX-ZUA1 RETAINED
GTEX-ZV7C RETAINED
GTEX-ZVTK RETAINED
GTEX-ZVZP RETAINED
GTEX-ZY6K FILTERED
GTEX-ZYFC RETAINED
GTEX-ZYT6 RETAINED
GTEX-ZZ64 RETAINED
Table S6. Median, standard deviation (SD), and range of copy number (CN, N=167) and gene expression (GE, N=149) values per ampliconic gene family.
Gene Family CN.Median CN.SD CN.Range GE.Median GE.SD GE.Range
BPY2 3.76 0.69 1.28-8.19 147.91 69.48 39-395
CDY 4.68 0.49 3.05-7.58 259.71 144.42 44-838
DAZ 4.39 0.81 2.05-10.91 1210.02 397.21 485-2791
HSFY 2.29 0.22 1.84-3.06 941.04 381.06 322-2359
PRY 2.16 0.16 1.65-2.8 38.48 21.85 5-122
RBMY 9.46 1.36 4.83-14.42 1017.46 310.5 465-2763
TSPY 35.7 6.34 22.4-65.69 3272.97 915.9 1860-7569
VCY 1.78 0.36 1.02-2.99 1304.96 774.94 547-8431
XKRY 2.21 0.22 1.58-2.82 2.18 2.08 0-9
Table S7. Copy number and gene expression correlation values. Spearman correlation coefficients (r) and P-values for each gene family were calculated using the cor.test() function in R. The P-value cutoff after Bonferroni correction for nine tests is 0.05/9 ≈ 0.006.
All samples R1b (European) E1b (African) I1a (European)
Gene Family r P-value r P-value r P-value r P-value
BPY2 0.05 0.55 0.00 0.979 0.07 0.74 0.12 0.658
CDY -0.02 0.839 0.04 0.712 -0.21 0.34 -0.17 0.532
DAZ 0.08 0.351 0.15 0.195 0.15 0.51 -0.09 0.763
HSFY 0.10 0.212 0.03 0.821 0.43 0.05 0.16 0.558
PRY 0.03 0.702 0.04 0.707 -0.06 0.79 -0.25 0.375
RBMY 0.10 0.231 -0.12 0.298 -0.22 0.32 0.27 0.327
TSPY 0.19 0.022 0.00 0.991 -0.27 0.22 0.14 0.611
VCY -0.06 0.451 -0.19 0.102 0.23 0.30 0.24 0.397
XKRY -0.07 0.374 -0.11 0.349 0.07 0.77 -0.53 0.040
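The correlation and cutoff logic behind Table S7 can be sketched in a few lines. This is an illustrative Python sketch, not the analysis code used for the table (which relied on cor.test() in R): the helper names `average_ranks` and `spearman_r` are hypothetical, and only the coefficient is computed, not its P-value. The P-values in `ALL_SAMPLES_P` are copied from the "All samples" column above.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman_r(x, y):
    """Spearman's r, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# "All samples" P-values copied from Table S7.
ALL_SAMPLES_P = {
    "BPY2": 0.55, "CDY": 0.839, "DAZ": 0.351, "HSFY": 0.212, "PRY": 0.702,
    "RBMY": 0.231, "TSPY": 0.022, "VCY": 0.451, "XKRY": 0.374,
}

# Bonferroni correction for nine tests: 0.05 / 9.
BONFERRONI_CUTOFF = 0.05 / len(ALL_SAMPLES_P)
significant = [fam for fam, p in ALL_SAMPLES_P.items() if p < BONFERRONI_CUTOFF]
```

With the values above, `significant` is empty: the smallest P-value (TSPY, 0.022) is still above the corrected cutoff.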
Table S8. P-values from permutation tests for copy number differences between haplogroup pairs. Given two haplogroups, to test whether the difference in copy number between them is significant, we compared the true difference in mean copy number between the haplogroups to the differences in means from 1 million random permutations (in which the haplogroup assignments of the two haplogroups were randomly rearranged). The P-value represents the fraction of permuted mean differences that are larger than the one observed in our actual data. P-values that pass a Bonferroni-corrected cutoff for 54 tests (0.05/54 = 0.00093) are highlighted in bold.
Copy Number E vs. I E vs. J E vs. R I vs. R J vs. I J vs. R
BPY2 4.44×10^-3 2.69×10^-2 1.70×10^-3 0.771 0.138 0.444
CDY 1.12×10^-3 0.533 1.74×10^-2 0.119 1.05×10^-2 0.314
DAZ 5.04×10^-2 0.155 5.30×10^-3 0.481 0.304 0.839
HSFY 0.866 0.476 0.697 0.839 0.393 0.292
PRY 5.36×10^-2 0.288 2.01×10^-2 0.873 0.686 0.730
RBMY 0.401 3.53×10^-3 6.94×10^-4 8.31×10^-2 3.29×10^-3 0.00
TSPY 0.00 0.440 0.00 3.47×10^-2 3.00×10^-6 0.00
VCY 0.638 0.456 0.536 0.989 0.334 0.206
XKRY 0.301 0.483 0.463 0.595 0.113 0.187
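The permutation test described in the caption of Table S8 can be illustrated as follows. This is a minimal Python sketch under stated assumptions, not the code used to produce the table: the function name `permutation_p_value` is hypothetical, the absolute difference in means is assumed as the test statistic, the copy number values in the usage lines are made up, and the number of permutations is reduced from 1 million to keep the example fast.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Fraction of label permutations whose mean difference exceeds the observed one."""
    rng = random.Random(seed)
    n_a = len(group_a)
    # Assumption: the statistic is the absolute difference in group means.
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    exceed = 0
    for _ in range(n_permutations):
        # Shuffling the pooled values is equivalent to reassigning haplogroup labels.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff > observed:
            exceed += 1
    return exceed / n_permutations


# Hypothetical copy numbers for two haplogroups (illustration only).
clearly_different = permutation_p_value([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
similar = permutation_p_value([1, 2, 3, 4], [1, 2, 3, 4])
```

Because a permuted difference must be strictly larger than the observed one to count, a pair of completely separated groups yields a P-value of 0, which is how entries such as 0.00 can arise in the table.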
Table S9. P-values from permutation tests for gene expression differences between haplogroup pairs. Given two haplogroups, to test whether the difference in gene expression between them is significant, we compared the true difference in mean gene expression between the haplogroups to the differences in means from 1 million random permutations (in which the haplogroup assignments were randomly rearranged). The P-value represents the fraction of permuted mean differences that are larger than the one observed in our actual data. None of the P-values pass a Bonferroni-corrected cutoff for 54 tests (0.05/54 = 0.00093).
Gene Expression E vs. I E vs. J E vs. R I vs. R J vs. I J vs. R
BPY2 3.12×10^-2 0.148 7.09×10^-3 0.759 0.963 0.798
CDY 0.517 0.416 0.124 0.522 0.755 0.934
DAZ 0.715 0.534 1.36×10^-2 3.04×10^-2 0.829 0.171
HSFY 0.329 0.233 9.84×10^-2 0.592 0.500 0.716
PRY 0.309 0.749 0.116 0.755 0.724 0.521
RBMY 0.341 0.451 0.114 0.995 0.269 7.65×10^-2
TSPY 0.307 0.177 0.825 0.203 6.10×10^-2 7.08×10^-2
VCY 0.542 0.275 0.234 0.406 0.922 0.629
XKRY 0.077 0.906 0.353 0.192 0.141 0.571
Table S10. Correlation between gene expression and age. Spearman correlation coefficients (r) and P-values for each gene family were calculated using the cor.test() function in R. P-values that pass a Bonferroni-corrected cutoff for nine tests (0.05/9 ≈ 0.006) are highlighted in bold.
All samples R1b (European) E1b (African) I1a (European)
Gene Family r P-value r P-value r P-value r P-value
BPY2 -0.05 0.522 -0.19 9.02×10^-2 0.52 1.31×10^-2 0.07 0.815
CDY 0.01 0.901 -0.1 0.399 0.45 3.6×10^-2 0.23 0.400
DAZ -0.18 2.63×10^-2 -0.11 0.327 -0.35 0.107 -0.38 0.157
HSFY 0.09 0.282 -0.03 0.817 0.57 6.10×10^-3 0.09 0.761
PRY 0.05 0.553 -0.12 0.292 0.61 2.80×10^-3 0.11 0.703
RBMY -0.1 0.226 0.03 0.825 -0.06 0.789 -0.30 0.282
TSPY -0.06 0.486 0.02 0.840 0.03 0.904 -0.11 0.703
VCY 0.06 0.490 0.12 0.278 0.17 0.441 -0.44 0.105
XKRY 0.16 4.54×10^-2 0.16 0.172 0.21 0.343 0.19 0.486
Table S11. Ampliconic gene homologs and their expression patterns. The homologs of ampliconic genes were obtained from a recent review [1]. The expression pattern was obtained from the Human Protein Atlas (HPA) [2]. Tissue-enriched: expression in one tissue is at least five-fold higher than that in all other tissues/cell lines. Tissue-enhanced: five-fold higher average transcripts per million (TPM) in one or more tissues/cell lines compared to the mean TPM of all tissues/cell lines. Ubiquitous: ≥ 1 TPM in all tissues/cell lines. Mixed: detected in at least one tissue/cell line and in none of the above categories.
Ampliconic gene family | Homolog | Chromosome | Homolog expression status
CDY | CDYL | chr6 | Ubiquitous
CDY | CDYL2 | chr16 | Mixed
DAZ | DAZL | chr3 | Testis-enriched
DAZ | BOLL | chr2 | Testis-enriched
HSFY | HSFX1 | chrX | Mixed
HSFY | HSFX2 | chrX | Testis-enriched
RBMY | RBMX | chrX | Ubiquitous
RBMY | RBMX2 | chrX | Ubiquitous
RBMY | RBMXL1 | chr1 | Ubiquitous
RBMY | RBMXL2 | chr11 | Testis-enriched
RBMY | RBMXL3 | chrX | Testis-enriched
TSPY | TSPYL1 | chr6 | Ubiquitous
TSPY | TSPYL2 | chrX | Ubiquitous
TSPY | TSPYL4 | chr6 | Ubiquitous
TSPY | TSPYL5 | chr8 | Testis-enhanced
TSPY | TSPYL6 | chr2 | Testis-enriched
VCY | VCX | chrX | Testis-enriched
VCY | VCX2 | chrX | Testis-enriched
VCY | VCX3A | chrX | Testis-enriched
VCY | VCX3B | chrX | Testis-enriched
XKRY | XKRX | chrX | Skin-enhanced
Expression pattern adapted from the HPA database [2].
Table S12. Predicted function of ampliconic gene families. The table was adapted from Table 1 of Paulo Navarro-Costa’s review on ampliconic gene families [1]. All the gene families are linked to spermatogenesis and infertility.
Gene Symbol | Name | Function
BPY2 | Basic charge, Y-linked, 2 | Proposed role in cytoskeletal regulation.
CDY | Chromodomain protein, Y-linked | Transcriptional co-repressor with histone acetyltransferase activity. Postulated role in gene expression regulation and chromatin remodeling.
DAZ | Deleted in azoospermia | RNA-binding protein, regulates pre-meiotic transcript transport/storage and translation initiation.
HSFY | Heat shock transcription factor, Y-linked | Testis-predominant transcription factor.
PRY | PTPN13-like, Y-linked | Putative pro-apoptotic signaling molecule.
RBMY1 | RNA binding motif protein, Y-linked, family 1 | RNA-binding protein involved in RNA splicing and metabolism, signal transduction and meiotic regulation.
TSPY | Testis specific protein, Y-linked | Presumable androgen-dependent regulator of early germ cell development. Putative functions in cell cycle regulation.
VCY | Variable charge, Y-linked | Unknown, possible nucleoli-related regulator of ribosomal assembly. Involved in the cytoskeletal network [3]. Possible relationship in the pathogenesis of male infertility [4].
XKRY | XK, Kell blood group complex subunit related, Y-linked | Putative multipass transmembrane protein involved in gamete interaction.
Table S13. Numbers of high identity alignments to the references for each Y ampliconic gene family. BLAT was used to find all sites with >99% identity. For TSPY there are three pseudogene copies in the reference which share identity with parts of functional copies of TSPY genes.
Gene family Copies in the reference
BPY2 3
CDY 4
DAZ 4
HSFY 2
PRY 2
RBMY 6
TSPY 9 (6+3 pseudo)
VCY 2
XKRY 2
Supplemental Figures
Figure S1. Variation in copy number of ampliconic gene families. In the dot plot, the X-axis represents the natural log of the median copy number and the Y-axis represents the natural log of the variance in copy number for the 167 individuals analyzed. The blue line represents the linear regression fit (median ~ var) with an R² value of 0.91. Each dot is colored by ampliconic gene family, as described in the legend.
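The log-log relationship summarized in Figure S1 can be approximated from the copy number medians and standard deviations listed in Table S6. The Python sketch below is illustrative only: it squares the rounded SD values to obtain variances, and the helper `r_squared` is a hypothetical name. Because the inputs are rounded, the result should land near the reported R² but is not guaranteed to match it exactly.

```python
import math

# (CN.Median, CN.SD) per gene family, copied from Table S6 (N=167).
TABLE_S6_CN = {
    "BPY2": (3.76, 0.69), "CDY": (4.68, 0.49), "DAZ": (4.39, 0.81),
    "HSFY": (2.29, 0.22), "PRY": (2.16, 0.16), "RBMY": (9.46, 1.36),
    "TSPY": (35.7, 6.34), "VCY": (1.78, 0.36), "XKRY": (2.21, 0.22),
}


def r_squared(x, y):
    """Coefficient of determination of a simple least-squares line through (x, y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # For one predictor, R^2 equals the squared Pearson correlation.
    return sxy * sxy / (sxx * syy)


log_median = [math.log(median) for median, _ in TABLE_S6_CN.values()]
log_variance = [math.log(sd ** 2) for _, sd in TABLE_S6_CN.values()]
r2 = r_squared(log_median, log_variance)
```

Since R² of a single-predictor regression is symmetric in the two variables, the value does not depend on which axis is treated as the response.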
Figure S2. Variation in gene expression of ampliconic gene families. In the dot plot, the X-axis represents the natural log of the median normalized gene expression values and the Y-axis represents the natural log of the variance in gene expression for the 149 individuals analyzed. The blue line represents the linear regression fit (median ~ var) with an R² value of 0.99. Each dot is colored by ampliconic gene family, as described in the legend.
Figure S3. The relationship between expression level and copy number (N=149). Within each scatter plot, the X-axis represents the copy number values and the Y-axis represents the normalized gene expression values. The Spearman correlations were calculated using the cor.test() function in R and the P-values are shown in brackets. The gray line represents the linear function fitted to the given data points. The nine scatter plots represent the relationship between expression and copy number for each of the nine ampliconic gene families. There is no significant relationship in any of the nine gene families (Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006).
Figure S4. The relationship between gene expression and copy number for individuals with an R1b (European) subhaplogroup (N=77). Within each scatter plot, the X-axis represents the copy number values and the Y-axis represents the normalized gene expression values. The Spearman correlations were calculated using the cor.test() function in R and the P-values are shown in brackets. The gray line represents the linear function fitted to the given data points. The nine scatter plots represent the relationship between expression and copy number of the ampliconic gene families. There is no significant relationship in any of the nine gene families (Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006).
Figure S5. The relationship between gene expression and copy number for individuals with an I1a (European) subhaplogroup (N=15). Within each scatter plot, the X-axis represents the copy number values and the Y-axis represents the normalized gene expression values. The Spearman correlations were calculated using the cor.test() function in R and the P-values are shown in brackets. The gray line represents the linear function fitted to the given data points. The nine scatter plots represent the relationship between expression and copy number of the ampliconic gene families. There is no significant relationship in any of the nine gene families (Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006).
Figure S6. The relationship between gene expression and copy number for individuals with an E1b (African) subhaplogroup (N=22). Within each scatter plot, the X-axis represents the copy number values and the Y-axis represents the normalized gene expression values. The Spearman correlations were calculated using the cor.test() function in R and the P-values are shown in brackets. The gray line represents the linear function fitted to the given data points. The nine scatter plots represent the relationship between expression and copy number of the ampliconic gene families. There is no significant relationship in any of the nine gene families (Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006).
Figure S7. The relationship between gene expression and age in the individuals analyzed (N=149). The nine scatter plots represent the nine ampliconic gene families, with their names as the titles of their respective plots. Within each scatter plot, the Y-axis represents the age and the X-axis represents the gene expression values. The Spearman correlations were calculated using the cor.test() function in R and the P-values are shown in brackets. There is no significant relationship between age and expression in any of the nine families (Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006). The gray line represents the linear function fitted to the points in the plot.
Figure S8. The relationship between gene expression and age for individuals with an R1b (European) subhaplogroup (N=77). The nine scatter plots represent the nine ampliconic gene families, with their names as the titles of their respective plots. Within each scatter plot, the Y-axis represents the age and the X-axis represents the gene expression values. The Spearman correlations were calculated using the cor.test() function in R and the P-values are shown in brackets. There is no significant relationship between age and expression in any of the nine families (Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006). The gray line represents the linear function fitted to the points in the plot.
Figure S9. The relationship between gene expression and age for individuals with an I1a (European) haplogroup (N=15). The nine scatter plots represent the nine ampliconic gene families, with their names as the titles of their respective plots. Within each scatter plot, the Y-axis represents the age and the X-axis represents the gene expression values. The Spearman correlations were calculated using the cor.test() function in R and the P-values are shown in brackets. There is no significant relationship between age and expression in any of the nine families (Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006). The gray line represents the linear function fitted to the points in the plot.
Figure S10. The relationship between gene expression and age for individuals with an E1b (African) haplogroup (N=22). The nine scatter plots represent the nine ampliconic gene families, with their names as the titles of their respective plots. Within each scatter plot, the Y-axis represents the age and the X-axis represents the gene expression values. The Spearman correlations were calculated using the cor.test() function in R and the P-values are shown in brackets. There is a significant relationship between age and expression in the HSFY and PRY families (Bonferroni-corrected P-value cutoff of 0.05/9 ≈ 0.006). The gray line represents the linear function fitted to the points in the plot.
Figure S11. Combining expression level differences with the individual-level relationship between Y ampliconic gene families and their non-Y homologs can better explain the possible scenarios of evolution for the former. Within each row (A-D), the plot on the left represents the expression level differences between Y ampliconic genes (blue boxplot) and their non-Y homologs (orange boxplot), the plot in the middle represents the individual-level relationship between Y ampliconic genes (X-axis) and their non-Y homologs (Y-axis), and the expected scenarios of evolution are shown on the right. Assuming non-Y homologs represent ancestral expression levels, higher expression of Y ampliconic genes implies independent expression (A,B) and lower expression implies dosage regulation (C,D). A negative correlation between ampliconic genes and their non-Y homologs suggests a lack of co-regulation (B,D), and a positive correlation suggests co-regulation of gene expression (A,C).
Figure S12. PCA plot of all 170 samples using Variance Stabilizing Transformation (VST) normalized read counts. All points with a PC1 value (X-axis) greater than 20 were filtered out.
References
1. Navarro-Costa P. Sex, rebellion and decadence: the scandalous evolutionary history of the human Y chromosome. Biochim Biophys Acta. 2012 Dec;1822(12):1851–63.
2. Uhlén M, Fagerberg L, Hallström BM, Lindskog C, Oksvold P, Mardinoglu A, et al. Proteomics. Tissue-based map of the human proteome. Science. 2015 Jan 23;347(6220):1260419.
3. Wong EYM, Tse JYM, Yao K-M, Lui VCH, Tam P-C, Yeung WSB. Identification and characterization of human VCY2-interacting protein: VCY2IP-1, a microtubule-associated protein-like protein. Biol Reprod. 2004 Mar;70(3):775–84.
4. Tse JYM, Wong EYM, Cheung ANY, O WS, Tam PC, Yeung WSB. Specific expression of VCY2 in human male germ cells and its involvement in the pathogenesis of male infertility. Biol Reprod. 2003 Sep;69(3):746–51.
Appendix B
Supporting Material for Chapter 3
Supplemental Note 1. CAFE Simulations
Simulations to test gene family size. The great ape dataset includes copy number estimates of nine gene families from six species. We tested whether nine gene families (n=9) were sufficient to estimate the rate of gene birth and death, because uncertainty in the rate parameter could influence p-value estimates generated by CAFE (Han, Thomas, Lugo-Martinez, & Hahn, 2013). To test this, we used the gene family copy numbers and the phylogenetic tree from the original CAFE article (Hahn, Demuth, & Han, 2007). However, we only used primate-specific data (human, chimpanzee, and macaque) in our simulations. Our goal was to test the reproducibility of birth and death rate (λ) estimates by CAFE. First, we fixed the phylogenetic tree representing the three primate species to be used in all simulations as (((Chimp:6,Human:6):18,Macaque:24)), and then applied different combinations of gene family copy numbers as input to estimate λ. As a filtering step, we removed all the gene families that had >200 gene copies cumulatively across the three species. This filtering step was used to remove excessively large gene families and retain gene families whose copy numbers are in the range observed for Y chromosome ampliconic gene families. Next, we filtered out all gene families that had the same gene count in all species to ensure there was variation across the species. These filtering steps reduced the gene family count from ~9,800 to 2,445. From this set we picked 30 gene families uniformly at random, and applied CAFE to this set to estimate λ. We repeated this step 20 times (Table SN1). Next, from each set of 30 gene families, we subsampled 5, 10, and 15 gene families uniformly at random and applied CAFE to each subsample to estimate λ. We observed that with different sizes and combinations of gene families CAFE predicted different values for λ (Table SN1). Next, we picked 100 gene families with the highest variation in copy number across species and performed the same analyses as above. For the input sizes of 5, 10, 15, and 30 considered, the estimated rate of gene birth and death was identical (λ=0.041667) in all but one replicate at N=5 (Table SN2). This result suggests that CAFE requires gene families with high variation in their gene count across species to predict a consistent λ value, and the majority of the ampliconic genes considered here have high variation in gene count across great ape species, thereby making CAFE an ideal approach to study their evolution.
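The filtering and subsampling steps described above can be sketched as follows. This Python sketch only prepares the gene family sets; it does not run CAFE, and the function names (`filter_families`, `draw_replicate`), the input layout (a dictionary of per-species copy numbers), and the toy values are assumptions made for illustration.

```python
import random


def filter_families(counts, max_total=200):
    """Keep families with at most max_total copies overall and some variation across species.

    counts maps a family name to a dict of {species: copy number}.
    """
    kept = {}
    for family, per_species in counts.items():
        values = list(per_species.values())
        if sum(values) > max_total:
            continue  # excessively large family
        if len(set(values)) == 1:
            continue  # identical count in all species, no variation
        kept[family] = per_species
    return kept


def draw_replicate(families, rng, full_size=30, sub_sizes=(5, 10, 15)):
    """Draw one set of full_size families, then subsample each smaller size from it."""
    full = rng.sample(sorted(families), full_size)
    subsets = {size: rng.sample(full, size) for size in sub_sizes}
    subsets[full_size] = full
    return subsets


# Hypothetical toy input (not data from the CAFE article).
toy_counts = {
    "too_large": {"Human": 150, "Chimp": 40, "Macaque": 30},
    "no_variation": {"Human": 3, "Chimp": 3, "Macaque": 3},
    "kept": {"Human": 4, "Chimp": 2, "Macaque": 7},
}
filtered = filter_families(toy_counts)

rng = random.Random(42)
family_names = {f"fam{i}": None for i in range(100)}  # placeholder family names
replicate = draw_replicate(family_names, rng)
```

Each entry of `replicate` would then be written out as one CAFE input; repeating `draw_replicate` 20 times mirrors the number of replicates reported in Table SN1.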
Simulations to include multiple samples per species in CAFE analyses. CAFE typically takes the mean/median value as a representation of copy number of gene families in a species. However, we observed high intraspecific variation in copy number of gene families within our dataset, and wished to leverage this information in our predictions of gene family copy number evolution. We therefore tested whether CAFE could reproduce the significant differences observed in great ape ampliconic gene families when we provide it with multiple individuals per species instead of the default application of a single value (mean or median copy number) per gene family as input. With the presence of multiple samples per species, we assumed that CAFE might take the variation in copy number into consideration while estimating p-values. Because the exact phylogenetic relationship among the samples is unknown in terms of the time since their common ancestor, we estimated the time since the most recent common ancestor (TMRCA) for each species using an external data set (Hallast et al., 2016). We followed the same steps used in the generation of the phylogenetic tree of great apes (see Materials and Methods), except that here we calculated the TMRCA for all the combinations of samples available, and took the median TMRCA value within a species as the representative TMRCA for the species.
For each species, we added five branches at the tips (i.e. approximately a star phylogeny) of the great ape phylogenetic tree, such that each species was represented by five individuals (Supplemental Figure SA). The lengths of these five branches are represented by the TMRCA of lineages within each species, with a difference of one thousand years between each internal node. That is, the last split for each individual within a species was within four thousand years since the TMRCA of all five individuals, and this mimics a star phylogeny as the TMRCA ranging from 75 to 550 thousand years is much larger than four thousand years. From our dataset, we picked five individuals per species uniformly at random and provided their ampliconic gene family copy numbers as input to CAFE along with the updated phylogenetic trees (with star phylogenies at the tips) to test for significant gains or losses in copy number. This procedure was repeated 100 times. In these simulations, for each gene family we looked for significant shifts (gains or losses) in gene copy number at each branch along the phylogenetic tree, which is indicated by a p-value (Bonferroni-corrected p-value < 0.005; 0.05/10 by correcting for the 10 external and ancestral branches present on the phylogenetic tree) and compared them to the original run (Table S4) of CAFE. All significant observations identified in the analysis based on median copy number were also significant when we used multiple samples per species. However, within the set of 100 simulated replicates, the frequency with which each branch had a significant shift was not the same. On the ancestral orangutan branch, the CDY gene family displayed a significant shift in all 100 simulations. On the bonobo and chimpanzee branches, the RBMY gene family showed a significant shift in 100 and 83 simulations, respectively. The TSPY gene family had significant shifts on multiple branches, with 99 simulations showing significance in bonobo and chimpanzee, 84 in gorilla, 57 in Bornean orangutan, and 39 in Sumatran orangutan. Finally, the XKRY gene family had 55 simulations with a shift in the ancestral branch of orangutans and 68 with a shift in Sumatran orangutan. The XKRY-specific shift in Bornean orangutan was significant only 12 times out of 100 simulations.
Table SN1. λ values based on randomly selected gene families. Each column indicates the number of gene families used to estimate the rate of gene birth and death, and each row represents a different replicate.
N=5 N=10 N=15 N=30
0.011158 0.041667 0.034038 0.017218
0.035354 0.035568 0.022321 0.013378
0.003766 0.010327 0.009695 0.011261
0.007554 0.027215 0.0206 0.017613
0.041667 0.024575 0.017532 0.012252
0.007613 0.00628 0.008843 0.010396
0.006239 0.006899 0.011568 0.010448
0.026245 0.013568 0.041667 0.038559
0.041667 0.041667 0.041667 0.023911
0.004424 0.005405 0.006028 0.009473
0.013036 0.009137 0.012186 0.011632
0.009044 0.010184 0.00975 0.010824
0.004717 0.007606 0.008536 0.023141
0.041667 0.041667 0.032196 0.018089
0.01187 0.011803 0.010866 0.012213
0.004724 0.005352 0.008868 0.009709
0.005236 0.014412 0.010689 0.017878
0.041667 0.041667 0.036225 0.038344
0.004755 0.01066 0.012642 0.009939
0.017927 0.018403 0.020957 0.016235
Table SN2. λ values based on gene families with high copy number variation across species. Each column indicates the number of gene families used to estimate the rate of gene birth and death, and each row represents a different replicate.
N=5 N=10 N=15 N=30
0.031804 0.041667 0.041667 0.041667
0.041667 0.041667 0.041667 0.041667
0.041667 0.041667 0.041667 0.041667
0.041667 0.041667 0.041667 0.041667
0.041667 0.041667 0.041667 0.041667
0.041667 0.041667 0.041667 0.041667
0.041667 0.041667 0.041667 0.041667
0.041667 0.041667 0.041667 0.041667
0.041667 0.041667 0.041667 0.041667
0.041667 0.041667 0.041667 0.041667
Supplemental Figure SA. The great ape phylogenetic tree with five individuals sampled per species (star phylogeny) used in CAFE analysis. Significant shifts were assessed for each of the internal edges numbered 1-10 (edges from the original phylogenetic tree), whereas shifts were not examined for external edges representing the star phylogenies relating sampled individuals within a species.
Supplemental Note 2. EVE Simulations
Because of the difficulty in obtaining testis samples to generate expression data, we had a small sample size for each species except for humans. Currently, the EVE model (R. V. Rohlfs & Nielsen, 2015) cannot handle missing values, and so we limited our analysis to the five gene families that are found in all great ape species. In our dataset, we have two to three individuals sampled per species, except for humans, for which we had testis expression measured for more than one hundred sampled individuals. To ensure that sample sizes were comparable across the sampled great apes, we used three (the maximum sample size in non-human species) human testis expression samples that were sampled uniformly at random from the set with African Y-specific haplogroup E from the GTEx dataset (Ardlie et al., 2015). We chose samples with African ancestry as they are expected to have high genetic variation in humans. In total, we had 12 testis samples (three human, two bonobo, three chimpanzee, two gorilla, and two B. orangutan). We tested whether the EVE model performs well with five gene families and two or three individuals sampled per species.
The built-in function -H in the EVE model simulates expression values when provided with a phylogenetic tree, the number of individuals sampled per species, and the parameter values for selection strength (α), drift (σ²), ratio of within- to between-species variance (β), and optimal expression value (θ). The within-species variance (τ²), or population variance, is represented by τ² = βσ²/(2α) (R. V. Rohlfs, Harrigan, & Nielsen, 2014). For the simulations, we matched sample sizes to empirical sample sizes (3, 2, 3, 2, and 2 for human, bonobo, chimpanzee, gorilla, and B. orangutan, respectively), employed the great ape phylogenetic tree generated using whole-genome sequencing data (Locke et al., 2011), and considered the parameter values from the simulation studies presented in the EVE model article (R. Rohlfs & Nielsen, 2014) (σ²=5, α=3.0, β=6, and θ=100).
Using -H mode in the EVE model, we simulated gene expression values by varying one of the four parameter values (σ²=5, α=3.0, τ²=6, and θ=100) from 0.5 to 60 (0.5, 1, 3, 5, 10, 15, 20, 40, 60) at a time while keeping the other three parameters, sample size, and phylogenetic tree fixed. Based on the combination of parameters, the EVE model outputs an expression value matrix of size m x n, where m is the number of sampled expressed genes (m=5) and n is the total number of individuals sampled across all species (n=12). Using this matrix as input, it tests whether β_i=β_shared for each simulated gene i, i=1,2,...,m. For each combination of parameters, we simulated gene expression values for 100 independent replicates. Because the parameters for the model were predefined and we did not introduce external variation, we expect the EVE model to predict that all of the simulated gene expression values are conserved across species and that the p-values for the tests of whether β_i=β_shared should be non-significant (i.e. p>0.05). If the number of genes and sample sizes are insufficient, then we expect the EVE model to predict that expression of some or all genes is significantly different from the shared expression. Using the significance cutoff of 0.05, we summarized the number of times a gene was differentially expressed across great apes in the 100 replicates of each combination of parameters. We plotted the percentage of replicates that have their p-value above the threshold of 0.05 as a heatmap (Figures SB-SD). We observed that 95% of replicates had a non-significant p-value, and the remaining 5% of replicates could result from random sampling of expression values from a multivariate normal distribution by the EVE model. Based on our simulation results, we are confident that for our sample sizes, the EVE model can predict the difference in gene expression variance correctly 95 out of 100 times.
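The one-parameter-at-a-time design and the replicate summary described above can be sketched as follows. This Python sketch does not call the EVE software: the function names are hypothetical, the baseline uses β=6 as the starting value, the within-species variance is assumed to follow τ² = βσ²/(2α), and the p-values in the last line are made up to show the summary step.

```python
# Baseline parameter values and sweep values as stated in the text.
BASELINE = {"sigma2": 5.0, "alpha": 3.0, "beta": 6.0, "theta": 100.0}
SWEEP_VALUES = [0.5, 1, 3, 5, 10, 15, 20, 40, 60]


def within_species_variance(beta, sigma2, alpha):
    """tau^2 = beta * sigma^2 / (2 * alpha), assuming the relation given above."""
    return beta * sigma2 / (2 * alpha)


def one_at_a_time_grid(baseline, sweep_values):
    """List (varied_parameter, settings) pairs, changing one parameter at a time."""
    grid = []
    for name in baseline:
        for value in sweep_values:
            settings = dict(baseline)  # all other parameters stay at baseline
            settings[name] = float(value)
            grid.append((name, settings))
    return grid


def fraction_non_significant(p_values, cutoff=0.05):
    """Share of replicates whose p-value is above the significance cutoff."""
    return sum(p > cutoff for p in p_values) / len(p_values)


grid = one_at_a_time_grid(BASELINE, SWEEP_VALUES)
baseline_tau2 = within_species_variance(
    BASELINE["beta"], BASELINE["sigma2"], BASELINE["alpha"]
)
# Hypothetical p-values for four replicates of one simulated gene.
example_fraction = fraction_non_significant([0.50, 0.20, 0.01, 0.90])
```

Each entry of `grid` corresponds to one cell of the heatmaps, with `fraction_non_significant` applied to the 100 replicate p-values obtained for that cell.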
Supplemental Figure SB. Proportion of replicates that have their p-value above the threshold of 0.05 when the selection (α) parameter was varied from 0.5 to 60 while the other parameters were fixed (σ²=5, β=6, and θ=100).
Supplemental Figure SC. Proportion of replicates that have their p-value above the threshold of 0.05 when the drift (σ²) parameter was varied from 0.5 to 60 while the other parameters were fixed (α=3.0, β=6, and θ=100).
Supplemental Figure SD. Proportion of replicates that have their p-value above the threshold of 0.05 when the within-species variation parameter (τ²) was varied from 0.5 to 60 while the other parameters were fixed (σ²=5, α=3.0, and θ=100).
Supplemental Tables
Table S1. Summary of ampliconic gene copy numbers across great apes. The median, variance, and range (minimum and maximum) of copy number are given for each gene family and for all families together ("Species" row) in each species studied.
Gene Families  Bonobo: Median Variance Min Max  Chimpanzee: Median Variance Min Max
BPY2 1.06 0.27 0.84 2.15 2.05 0.02 1.84 2.3
CDY 2.99 0.06 2.49 3.25 5.28 0.1 4.59 5.52
DAZ 2.03 0.04 1.65 2.19 4.3 0.1 3.46 4.44
HSFY 0 0 0 0 0 0 0 0
PRY 0 0 0 0 0 0 0 0
RBMY 28.59 11.3 22.53 31.52 10.86 3.27 6.25 12.18
TSPY 48.07 31.98 38.09 52.19 17.62 25.5 11.54 25.45
VCY 0 0 0 0 2.07 0.01 1.88 2.25
XKRY 0 0 0 0 0 0 0 0
Species 82.77 87.48 66.39 89.78 40.93 42.5 29.64 50.01
Gene Families  Human: Median Variance Min Max  Gorilla: Median Variance Min Max
BPY2 3.26 0.4 1.68 3.67 2.01 0.01 1.95 2.17
CDY 4.06 0.4 2.87 4.63 7.86 3.57 5.91 12.17
DAZ 4.37 0.58 2.78 5.27 2.02 0 1.96 2.17
HSFY 2.07 0.07 1.58 2.29 5.95 2.34 3.97 8.19
PRY 2.13 0.1 1.59 2.78 2.06 0 1.96 2.13
RBMY 9.5 3.93 4.95 11.03 14.16 2.15 12.03 18.06
TSPY 33.04 12.54 26.03 39.09 6.07 1.16 5.84 8.24
VCY 2.11 0.15 1.67 3.17 0 0 0 0
XKRY 1.98 0.06 1.53 2.21 2 0 1.95 2.09
Species 64.32 33.74 50.37 69.34 44.11 13.03 37.82 48.96
Gene Families  B. Orangutan: Median Variance Min Max  S. Orangutan: Median Variance Min Max
BPY2 14.83 13.3 12.77 21.54 13.64 29.29 12.43 25.33
CDY 34.99 66.68 30.04 49.64 35.75 6.69 33.16 39.58
DAZ 10.52 1.36 9.69 12.59 12.39 0.85 11.55 13.84
HSFY 4.86 0.49 4.36 6.21 5.49 0.36 4.95 6.47
PRY 8.46 1.79 6.77 10.33 10.17 0.16 9.59 10.69
RBMY 3.39 0.41 2.67 4.19 2.8 0.02 2.78 3.08
TSPY 32.23 56.38 28.62 47.7 22.83 50.47 14.04 31.96
VCY 0 0 0 0 0 0 0 0
XKRY 15.21 6.11 13.78 19.81 22.19 3.95 20.49 25.52
Species 128.32 455.34 110.9 160.36 131.72 136.12 116.57 142.83
Table S2. P-values from permutation tests for copy number differences between Sumatran and Bornean orangutans. Given two species, we tested whether the difference in copy number between the species is significant. We compared the true difference in mean copy number between the species to the difference in mean of one million random permutations (randomly rearranged the species assignment of the two species). The p-value represents the fraction of permuted mean differences that are larger than the one we observed in our actual data. The p-values that pass a Bonferroni corrected cutoff for eight tests (0.05/8 = 0.00625) are highlighted in bold.
Gene family p-value
BPY2 0.863
CDY 0.747
DAZ 0.02
HSFY 0.195
PRY 0.042
RBMY 0.094
TSPY 0.028
XKRY 0.001
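The permutation procedure described for Table S2 can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code used in this study: the function name `permutation_p_value`, the use of the absolute difference in means, the `>=` comparison, the seed, and the reduced number of permutations (20,000 instead of one million) are all assumptions made for the example. The input values are the XKRY copy numbers of the Bornean and Sumatran orangutans listed in Table S10.

```python
import random
from statistics import fmean


def permutation_p_value(group_a, group_b, n_perm=20_000, seed=0):
    """Fraction of label permutations whose absolute difference in means
    is at least as large as the observed one (assumed two-sided test)."""
    observed = abs(fmean(group_a) - fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign species labels
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return hits / n_perm


# XKRY copy numbers per individual, taken from Table S10.
bornean_xkry = [19.24, 14.86, 15.21, 17.17, 13.78, 19.81, 13.99]
sumatran_xkry = [23.19, 20.49, 25.52, 21.06, 22.19]

p = permutation_p_value(bornean_xkry, sumatran_xkry)
passes_bonferroni = p < 0.05 / 8  # cutoff for eight tests
print(p, passes_bonferroni)
```

The exact value printed depends on the seed, the number of permutations, and whether a one-sided or two-sided difference is used, so it need not reproduce the tabulated p-value digit for digit.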
Table S3. The branch-level p-values showing the presence of significant shift in copy number when compared to its immediate ancestor in the great ape phylogenetic tree. The columns represent the branches in the great ape phylogenetic tree and the rows represent gene families with significant expansions or contractions in one or more branches as predicted by CAFE.
Gene family  Bonobo  Chimpanzee  (Bonobo, Chimp)  Human  (Bonobo, Chimp, Human)  Gorilla  (Bonobo, Chimp, Human, Gorilla)  Orangutans  Bornean orangutan  Sumatran orangutan
CDY 0.344 0.158 0.577 0.644 0.246 0.476 0.042 1.86×10^-3 0.814 0.298
RBMY 2.02×10^-7 9.61×10^-4 0.105 0.312 0.522 0.4 0.292 0.195 0.557 0.557
TSPY 1.39×10^-9 3.89×10^-7 0.324 0.115 0.032 5.07×10^-5 0.802 0.43 2.68×10^-4 1.01×10^-3
XKRY 0.5 0.5 0.026 0.79 0.716 0.843 0.135 4.31×10^-3 2.86×10^-3 6.46×10^-4
Table S4. Summary of CAFE results with five individuals per species added as star phylogeny. Each number in the table represents the number of times out of the 100 simulations in which CAFE estimated a significant shift in copy number (p<0.005). For each simulation, ampliconic gene copy numbers from five random individuals per species were used in CAFE analysis to capture the copy number variation within each species. Columns represent the branches of the phylogenetic tree. Rows represent ampliconic gene families. The numbers in bold are the branches with observed significant shifts in copy number when median copy number per species was used in CAFE analysis (Table S3).
Gene family  Bonobo  Chimp  (Bonobo, Chimp)  Human  (Bonobo, Chimp, Human)  Gorilla  (Bonobo, Chimp, Human, Gorilla)  Orangutans  Bornean orangutan  Sumatran orangutan
BPY2 1 0 1 0 0 0 15 15 4 0
CDY 15 0 0 3 9 0 15 100 5 2
DAZ 15 0 0 0 0 7 5 11 0 9
HSFY 0 0 13 0 14 1 0 0 0 3
PRY 0 0 15 0 0 0 6 12 0 12
RBMY 100 83 0 2 0 6 3 14 0 22
TSPY 99 99 0 15 15 84 15 15 57 39
VCY 4 11 0 4 0 1 0 0 0 0
XKRY 0 0 8 0 1 0 15 55 12 68
Table S5. Gene expression values for Y ampliconic gene families across great apes. Numbers represent read counts after normalization. The read counts for Y ampliconic gene families missing in great apes are represented as NA.
Gene family  Bonobo  Bonobo  Chimpanzee  Chimpanzee  Chimpanzee  Human  Human  Human  Gorilla  Gorilla  B. orangutan  B. orangutan  S. orangutan
Sample  SRR306837  SRR10392521 (this study)  SRR2040590  SRR2040591  SRR306825  SRR1100440  SRR817512  SRR1102852  SRR3053573  SRR306810  SRR10392517 (this study)  SRR2176206  SRR10393300
BPY2 25.79 79.63 30.49 65.6 58.19 39.66 78.9 62.82 0 4.42 40.17 4.71 36.65
CDY 115.5 97.32 201.21 388.88 793.28 77.89 258.37 157.29 52.09 85.44 107.11 53.64 2040.28
DAZ 274.72 183.58 646.94 420.81 274.12 611.46 493.15 630.82 277.79 251.9 1071.1 644.59 134.39
HSFY NA NA NA NA NA 219.13 436.96 588.14 1820.86 290.2 2155.58 558.95 5558.86
PRY NA NA NA NA NA 13.98 26.42 38.36 0 0 0 0.94 0
RBMY 590.93 453.42 934.47 453.48 356.82 492.76 519.57 575.19 403.67 474.34 66.94 46.11 0
TSPY 913.87 1431.04 530.45 1376.91 722.83 2329.14 1620.16 2013.06 883.3 212.13 1332.18 604.12 598.65
VCY NA NA 386.37 288.1 73.51 883.38 862.83 1179.4 NA NA NA NA NA
XKRY NA NA NA NA NA 0.57 0.7 0.96 0 0 0 0 0
Table S6. EVE-model-based likelihood ratios and p-values showing no significant shift in gene expression of shared ampliconic gene families across great apes.
Gene family  Likelihood ratio  p-value
BPY2 0.1064296 0.744
CDY 0.2014877 0.654
DAZ 0.2407138 0.623
RBMY 0.01832644 0.892
TSPY 0.02637181 0.871
Table S7. Sum of median copy number values of shared genes within AZFb and AZFc regions (BPY2, CDY, DAZ and RBMY) across great apes. Gorillas have the lowest copy number and both species of orangutans have the highest copy number of these genes.
Species Sum of median copy number values
Gorilla 14.04
Human 21.19
Chimpanzee 22.49
Bonobo 34.67
B. orangutan 63.73
S. orangutan 64.58
Table S8. All the great ape copy number samples used in the study.
Species: IIDs
Gorilla: KB3781, KB10845, KB14216, KB15257, KB15813, KB3456, KB3512, KB4987, KB4988, KB4989, KB5712, KB6319, KB7026, KB7801
Chimpanzee: Bandit, BigDaddy, Conan, Moose, Rock, Budda, Cordova, Neptune, Zippy
Bornean orangutan: KB13383, KB3042, KB4204, KB5418, KB5419, KB6109, KB9002
Bonobo: KB1843, KB3841, KB4229, KB7032, KB7781, KB7782, KB7998
Sumatran orangutan: KB4650, KB4661, KB5390, KB5565, KB5883
Human: 7, 8, 17, 23, 42, 46, 47, 48, 67, 72
Table S9. List of RNA-Seq samples.
Species  NCBI SRA ID  Replicates  Sequencing Type  Tissue
Bornean orangutan  SRR2176206, SRR2176207  2  Paired End  Testis
Bornean orangutan  In lab (3405): SRR10392514-SRR10392519  6  Paired End  Testis
Sumatran orangutan  SRR10393299-SRR10393304  6  Paired End  Testis
Bornean orangutan  SRR306798  1  Single End  Liver
Gorilla  SRR306810  1  Single End  Testis
Gorilla  SRR3053573, SRR10393358  2  Paired End  Testis
Gorilla  SRR306808  1  Single End  Liver
Chimpanzee  SRR306825  1  Single End  Testis
Chimpanzee  SRR2040590  1  Paired End  Testis
Chimpanzee  SRR2040591  1  Paired End  Testis
Chimpanzee  SRR306823  1  Single End  Liver
Bonobo  SRR306837  1  Single End  Testis
Bonobo  In lab (5013): SRR10392519-SRR10392521  3  Paired End  Testis
Bonobo  SRR306835  1  Single End  Liver
Human  SRR1090722  1  Paired End  Testis
Human  SRR306825  1  Paired End  Testis
Human  SRR817512  1  Paired End  Testis
Human  SRR1100440  1  Paired End  Testis
Human  SRR1071668  1  Paired End  Liver
Table S10. Mean values of ampliconic gene copy numbers across at least three replicates.
Species IID BPY2 CDY DAZ HSFY PRY RBMY TSPY VCY XKRY
Human 17 3.31 4.4 5.27 2.28 2.15 10.66 33.04 2.14 2.14
Human 23 2.76 3.47 4.51 1.58 2.49 4.95 35.29 2.12 1.55
Human 42 1.68 3.08 4.81 1.58 2 7.63 26.03 2.03 1.53
Human 46 3.67 4.63 4.49 2.28 2.03 6.82 36.69 2.09 2.2
Human 47 3.49 3.85 4.36 2.04 2.18 11.03 32.88 3.17 1.96
Human 48 2.15 2.87 2.78 2.09 2.78 9.32 32.24 2.09 2.01
Human 67 3.3 4.36 4.33 2.29 2.08 9.69 39.09 2.03 2.17
Human 7 3.22 4.26 4.31 2.05 2.12 10.41 33.03 2.38 1.95
Human 72 2.88 3.5 2.99 1.95 1.59 9.18 31.24 1.67 1.93
Human 8 3.34 4.54 4.37 2.27 2.21 10.71 35.72 2.38 2.21
Chimpanzee Bandit 1.84 4.59 3.46 NA NA 6.25 11.61 1.88 NA
Chimpanzee BigDaddy 2.19 5.47 4.44 NA NA 12.01 11.54 2.25 NA
Chimpanzee Budda 2.2 5.52 4.33 NA NA 9.11 15.85 2.09 NA
Chimpanzee Conan 2.03 5.28 4.3 NA NA 9.65 17.62 2.05 NA
Chimpanzee Cordova 2.03 4.98 4.09 NA NA 10.46 15.15 2.03 NA
Gorilla KB10845 1.97 9.82 1.96 7.84 2.04 13.71 5.87 NA 2.01
B. orangutan KB13383 16.01 49.64 12.04 5.56 10.29 4.03 43.56 NA 19.24
Gorilla KB14216 2 6.01 2.07 3.97 1.96 15.86 6 NA 1.98
Gorilla KB15257 2.11 6.22 2.07 4.14 2.06 14.43 8.21 NA 2.05
Gorilla KB15813 1.97 7.9 2.06 5.98 1.99 13.8 7.88 NA 2
Bonobo KB1843 1.12 3.09 2.19 NA NA 31.52 51.86 NA NA
B. orangutan KB3042 21.07 39.31 12.59 4.4 7.55 4.19 47.7 NA 14.86
Gorilla KB3456 2.17 7.81 2.17 5.93 2.08 15.91 7.93 NA 1.99
Gorilla KB3512 1.95 6.13 2.02 4.09 2.08 14.1 6 NA 2.03
Gorilla KB3781 1.98 8.32 1.97 6.1 2.02 16.14 8.08 NA 1.98
Bonobo KB3841 1.06 2.98 1.86 NA NA 28.8 48.07 NA NA
B. orangutan KB4204 12.77 30.15 9.73 4.36 6.77 2.83 29.08 NA 15.21
Bonobo KB4229 2.15 3.25 2.13 NA NA 28.32 44.64 NA NA
S. orangutan KB4650 13.38 35.75 13.84 5.98 10.69 2.8 31.96 NA 23.19
S. orangutan KB4661 12.43 33.16 11.55 4.95 9.9 2.78 21.31 NA 20.49
Gorilla KB4987 1.96 5.99 1.98 4.03 1.97 13.9 6.04 NA 1.97
Gorilla KB4988 2.04 9.87 2.01 7.97 1.99 13.85 6 NA 1.96
Gorilla KB4989 2.01 8.07 2.02 6.06 2.05 12.03 6.06 NA 2.06
S. orangutan KB5390 13.64 37.72 12.39 6.47 10.17 2.98 22.83 NA 25.52
B. orangutan KB5418 14.83 34.99 10.52 4.86 10.33 3.39 32.23 NA 17.17
B. orangutan KB5419 13.89 30.04 9.69 4.71 8.46 2.96 30.44 NA 13.78
S. orangutan KB5565 13.69 39.58 11.84 5.26 9.59 3.08 14.04 NA 21.06
Gorilla KB5712 2.02 5.91 1.96 4.95 2.09 13.73 5.84 NA 1.95
S. orangutan KB5883 25.33 34.3 12.97 5.49 10.2 2.79 29.55 NA 22.19
B. orangutan KB6109 21.54 46.61 10.65 6.21 7.98 4.05 35.3 NA 19.81
Gorilla KB6319 2.1 6.09 2 4.14 2.13 18.06 8.1 NA 2.01
Gorilla KB7026 2.13 12.17 2.08 8.19 2.07 14.21 6.08 NA 2.03
Bonobo KB7032 1.96 2.99 2.06 NA NA 28.59 52.19 NA NA
Bonobo KB7781 1.06 3.09 2.03 NA NA 29.57 48.41 NA NA
Bonobo KB7782 0.84 2.49 1.65 NA NA 23.13 39.31 NA NA
Gorilla KB7801 2.04 8.34 2.11 6.23 2.12 14.7 8.24 NA 2.09
Bonobo KB7998 0.98 2.93 1.87 NA NA 22.53 38.09 NA NA
B. orangutan KB9002 13.28 30.38 9.81 5.58 8.79 2.67 28.62 NA 13.99
Chimpanzee Moose 2.05 5.25 4.09 NA NA 11.1 25.45 2.07 NA
Chimpanzee Neptune 2.3 5.5 4.31 NA NA 10.86 23.67 2.13 NA
Chimpanzee Rock 1.97 4.94 3.82 NA NA 11 20.96 1.99 NA
Chimpanzee Zippy 2.21 5.42 4.35 NA NA 12.18 21.51 2.08 NA
Table S11. Phenotypes of sperm across great apes. The sperm phenotypes were obtained from different studies. The phenotypes in bold were used to compare the ampliconic gene copy number trends.
Phenotype  Gorilla gorilla  Homo sapiens  Pan paniscus  Pan troglodytes  Pongo pygmaeus  Reference
Head length (μm)  6.8  4.5  4.7  4.7  4.9  (Anderson, Nyholt, & Dixson, 2005)
Midpiece length (μm)  13.2  5.9  8.9  6.3  8.9  (Anderson et al., 2005)
Tail length (μm)  32.3  46.5  54.5  49.4  46.7  (Anderson et al., 2005)
Head volume (μm^3)  62.3  28.2  29.4  31.9  38.6  (Anderson et al., 2005)
Midpiece volume (μm^3)  6.9  2.8  9.3  7.8  3.9  (Anderson et al., 2005)
Tail volume (μm^3)  5.1  5.4  9.1  7.8  7.3  (Anderson et al., 2005)
Mating system  Polygynous  Various  Multiple male-multiple female  Multiple male-multiple female  Dispersed  (Good et al., 2013)
Body weight (kg)  169  68  39.1  44.34  74.64  (Dixson & Anderson, 2004)
Combined testes weight (g)  29.6  40.5  135.2  118.8  35.3  (Dixson & Anderson, 2004)
Circulating testosterone (ng/ml)  4.1  5.7  1.2  4.3  2.4  (Dixson & Anderson, 2004)
Residual testes weight  −0.622 (Small Testis)  −0.251 (S)  0.430 (Large Testis)  0.326 (L)  −0.335 (S)  (Dixson & Anderson, 2004)
No of tubules  400  200  200  100  (Wistuba et al., 2003)
No of stages per tubule  2.03  2.26  2.26  1.69  (Wistuba et al., 2003)
Multistage tubules %  79.5  91  88  55  (Wistuba et al., 2003)
Ejaculation volume (ml)  0.522  4.3  1.8  2.01  1.1  (Møller, 1988)
Sperm conc (10^6 ml^-1)  142.86  63.5  456.1  370.16  61  (Møller, 1988)
Number of sperm (10^6)  83.8  255  821  656.83  67.1  (Møller, 1988)
Sperm Motility (%)  50  60  75  41.66  47  (Møller, 1988)
Body Weight (kg)  200  63.5  45  44.3  74.6  (Møller, 1988)
Testes weight (kg)  0.0296  0.0405  0.1188  0.0353  (Møller, 1988)
Total Testicular volume (ml)  7.6  121.7  27.7  (Fujii-Hanamoto, Matsubayashi, Nakano, Kusunoki, & Enomoto, 2011)
Spermatogenic Index (×10^-3)  0.67  149  10  (Fujii-Hanamoto et al., 2011)
Supplemental Figures
Figure S1. Relationship between median and variance of copy number across all great apes. The gray line shows the best fit from ordinary least squares regression (R^2 = 0.1476; p-value = 0.1664).
Figure S2. Principal components analysis of the overall copy number of Y ampliconic gene families across great apes. Proportion of total variance explained by the first five principal components (PCs) is on the Y-axis. The first, second, third, fourth, and fifth PCs explained 68.7%, 22.8%, 6.5%, 1.2%, and 0.8% of the variation, respectively.
Figure S3. Across species, gene families with higher copy number have higher variance. In each of the scatterplots the X-axis represents natural log of median copy number and the Y-axis represents natural log of variance in copy number. The Spearman correlations were calculated using the cor.test() function in R and the p-values are in parentheses. The black line represents the linear function fitted to the given data points. The dots are color-coded to represent the six species, with missing dots indicating that the corresponding gene families are either lost or pseudogenized in that species.
Figure S4. Phenotypes related to sperm competition in great apes.
The X-axis represents species in decreasing order of the level of sperm competition, and the Y-axis in each plot represents a phenotype: (A) sperm midpiece volume (Anderson et al., 2005), (B) residual testis weight (ResTW) after correcting for the body weight (Dixson & Anderson, 2004), (C) sperm concentration (Møller, 1988), and (D) sperm motility (Møller, 1988). A summary of the values of all available sperm phenotypes for great apes is provided in Table S11.
Figure S5. Copy number variation in CDY, RBMY, TSPY, and XKRY.
The X-axis represents species in decreasing order of the level of sperm competition, and the Y-axis represents the copy number of (A) CDY, (B) RBMY, (C) TSPY, and (D) XKRY. The black points represent the variance in copy number for each species.
Figure S6. Transcriptome assembly pipeline.
Figure S7. Principal components analysis of great ape gene expression values (RLD-normalized). The gene expression values were normalized using regularized log transformation rlog() function in DESeq2 (Love, Huber, & Anders, 2014).
Figure S8. Principal components analysis of great ape gene expression values (VST-normalized). Gene expression values were normalized using Variance Stabilizing Transformation varianceStabilizingTransformation() function in DESeq2 (Love et al., 2014).
Figure S9. Heatmap depicting the Euclidean distances of gene expression values between pairs of sampled individuals.
References
Anderson, M. J., Nyholt, J., & Dixson, A. F. (2005). Sperm competition and the evolution
of sperm midpiece volume in mammals. Journal of Zoology, 267(02), 135.
Ardlie, K. G., DeLuca, D. S., Segrè, A. V., Sullivan, T. J., Young, T. R., Gelfand, E. T., …
Lockhart. (2015). The Genotype-Tissue Expression (GTEx) pilot analysis:
Multitissue gene regulation in humans. Science, 348(6235), 648–660.
Dixson, A. F., & Anderson, M. J. (2004). Sexual behavior, reproductive physiology and
sperm competition in male mammals. Physiology and Behavior, 83(2), 361–371.
Fujii-Hanamoto, H., Matsubayashi, K., Nakano, M., Kusunoki, H., & Enomoto, T. (2011).
A comparative study on testicular microstructure and relative sperm production in
gorillas, chimpanzees, and orangutans. American Journal of Primatology, 73(6),
570–577.
Good, J. M., Wiebe, V., Albert, F. W., Burbano, H. A., Kircher, M., Green, R. E., …
Pääbo, S. (2013). Comparative population genomics of the ejaculate in humans and
the great apes. Molecular Biology and Evolution, 30(4), 964–976.
Hahn, M. W., Demuth, J. P., & Han, S.-G. (2007). Accelerated rate of gene gain and loss
in primates. Genetics, 177(3), 1941–1949.
Hallast, P., Maisano Delser, P., Batini, C., Zadik, D., Rocchi, M., Schempp, W., …
Jobling, M. A. (2016). Great ape Y Chromosome and mitochondrial DNA
phylogenies reflect subspecies structure and patterns of mating and dispersal.
Genome Research, 26(4), 427–439.
Han, M. V., Thomas, G. W. C., Lugo-Martinez, J., & Hahn, M. W. (2013). Estimating
gene gain and loss rates in the presence of error in genome assembly and
annotation using CAFE 3. Molecular Biology and Evolution, 30(8), 1987–1997.
Locke, D. P., Hillier, L. W., Warren, W. C., Worley, K. C., Nazareth, L. V., Muzny, D. M.,
… Wilson, R. K. (2011). Comparative and demographic analysis of orang-utan
genomes. Nature, 469(7331), 529–533.
Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and
dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550.
Møller, A. P. (1988). Ejaculate quality, testes size and sperm competition in primates.
Journal of Human Evolution, 17(5), 479–488.
Rohlfs, R., & Nielsen, R. (2014). Phylogenetic ANOVA: The Expression Variance and
Evolution (EVE) model for quantitative trait evolution. https://doi.org/10.1101/004374
Rohlfs, R. V., Harrigan, P., & Nielsen, R. (2014). Modeling gene expression evolution
with an extended Ornstein-Uhlenbeck process accounting for within-species
variation. Molecular Biology and Evolution, 31(1), 201–211.
Rohlfs, R. V., & Nielsen, R. (2015). Phylogenetic ANOVA: The expression variance and
evolution model for quantitative trait evolution. Systematic Biology, 64(5), 695–708.
Wistuba, J., Schrod, A., Greve, B., Hodges, J. K., Aslam, H., Weinbauer, G. F., &
Luetjens, C. M. (2003). Organization of seminiferous epithelium in primates:
relationship to spermatogenic efficiency, phylogeny, and mating system. Biology of
Reproduction, 69(2), 582–591.
Appendix C
Supporting Material for Chapter 4
Supplemental Note S1. Evolutionary scenarios for palindromes
Palindrome 4. We used two different approaches to obtain information about the conservation of human and chimpanzee palindromes across great apes. First, we used the multiple sequence alignment of Y chromosome assemblies to obtain the coverage of these palindromes. In the case of P4, we observed 23,175 bp of alignment in the bonobo assembly (Table S1). This analysis gave us the percentage of P4 present in the bonobo Y assembly; however, it does not imply that there is a continuous 23-kb block of palindrome P4 on the bonobo Y. P4 could be highly fragmented due to Y chromosome degradation or rearrangements, and the multiple sequence alignment can still capture such homologous fragments of the palindromes. The longest blocks in the Y multiple sequence alignment that overlap with P4 are 2-4 kb in length, and these alignment blocks included sequences from gorilla and human Ys. In the remaining species, sequences homologous to P4 are mostly represented by gaps. Second, to identify the copy number of P4 homologs present in other species, we used LASTZ-based alignments (Harris, 2007). Non-overlapping 1-kb windows of the assembly were aligned to human P4 using LASTZ. The read depth of windows with >80% identity to P4 was used to estimate the copy number of P4. However, we did not find any 1-kb window in the bonobo Y that maps to human palindrome P4 with >80% identity. Since we did not find high-confidence windows aligning to P4 in bonobo, we concluded that it is highly fragmented in this species.
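The window-based copy number estimate described above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function names, the dictionary layout of a window, the toy identity and depth values, and the normalization by a single-copy baseline depth are hypothetical and are not taken from the pipeline used here. The 1-kb window size, the >80% identity threshold, and the use of the median depth follow the text and the captions of Tables S3-4.

```python
from statistics import median


def non_overlapping_windows(seq_len, size=1000):
    """Split a sequence of length seq_len into non-overlapping windows
    (half-open intervals); the last window may be shorter."""
    return [(start, min(start + size, seq_len))
            for start in range(0, seq_len, size)]


def estimate_copy_number(windows, single_copy_depth, min_identity=80.0):
    """Median read depth of windows passing the identity filter, divided
    by a single-copy baseline depth (the baseline is an assumption).
    Returns None when no window passes, as for P4 in bonobo."""
    depths = [w["depth"] for w in windows if w["identity"] > min_identity]
    if not depths:
        return None
    return median(depths) / single_copy_depth


# Hypothetical windows: identity (%) to the palindrome and read depth.
toy_windows = [
    {"identity": 95.0, "depth": 60.0},
    {"identity": 85.0, "depth": 64.0},
    {"identity": 70.0, "depth": 200.0},  # fails the >80% identity filter
]
copy_number = estimate_copy_number(toy_windows, single_copy_depth=30.0)
print(non_overlapping_windows(2500), copy_number)
```

Returning `None` when no window passes the filter keeps the "no high-confidence windows" case distinct from an estimated copy number of zero.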
Evolution of sequences homologous to human and chimpanzee palindromes.
Common ancestor of great apes. Partial sequences of all human and chimpanzee palindromes were present. P1, P2, P4, P5, P8, C2, C3, C4, and C17 were in multi-copy state (Tables SN1A-B). P3, P6, and C1 might have been in single- or multi-copy state (Tables SN1A-B).
Orangutan. Increase in copy number of P1, P2, P5, C3, and C4 (Fig. 1A, Tables SN1A-B). Loss of C3 (≈25% loss in coverage, Table S2) and P2 (≈27% loss in coverage, Table S1) segments when compared to other great apes.
Common ancestor of gorilla, human, bonobo, and chimpanzee. P3, P6, and C1 are in a multi-copy state either at this node or in the common ancestor of great apes (Tables SN1A-B).
Gorilla. Loss of copy number for P8, C2, and C17 compared to bonobo and orangutan (Fig. 1, Tables SN1A-B). C3 and C4 had more than two copies in human, chimpanzee, and orangutan, but only two copies in gorilla (Tables SN1A-B). For C2, bonobo and orangutan have higher copy number when compared to gorilla. Loss of segments in C1 (≈15% loss in coverage) and C19 (≈40-60% loss in coverage) when compared to other great apes (Table S2).
Common ancestor of human, bonobo, and chimpanzee. All the palindromes are assumed to be multi-copy with the exception of C5, which could have been in a single-copy state (Tables SN1A-B). Gorilla and orangutan have more than two copies of P4 which is in two copies or lost in human, bonobo and chimpanzees (Tables SN1A-B).
Human. Only ≈30-35% of palindrome P3 is covered in other great apes (Table S1), so we assume that the remaining portion of P3 is human-specific. Humans also lost most of the sequences homologous to palindrome C2; we observe some sequences homologous to C2 on the human Y, but they are degraded and not visible as an alignment in the human and chimpanzee Y dot plot (Hughes et al., 2010). Therefore, we assume that C2 was deleted in human.
Common ancestor of bonobo and chimpanzee. Gain of C2 segment: both bonobo and chimpanzee share 85% coverage, whereas the other great apes cover 20-30% of C2, which implies a Pan-genus-specific gain of C2 sequences (Table S2). C1 and C5 groups are present in more than two copies in both chimpanzee (Hughes et al., 2010) and bonobo, whereas other species have two or fewer copies of these palindromes (Fig. 1A). The Pan genus lost P4: sequences homologous to P4 are present in the bonobo and chimpanzee Ys, but they are degraded and not visible as an alignment in the human and chimpanzee Y dot plot (Hughes et al., 2010). Therefore, we assume that P4 was deleted.
Bonobo. Bonobo lost copies of P1, P2, P6, P7, C3, C4, C18, and C19 (Fig. 1A, Tables SN1A-B). It also experienced loss of segments in C18 (≈30-60% loss in coverage, Table S2) and P7 (≈60% loss in coverage, Table S1) when compared to other great apes.
Chimpanzee. Chimpanzee gained copies of P1, P2, and P5 as these palindromes share homology with C3 and C4, which are in multi-copy in chimpanzee (Hughes et al., 2010). Chimpanzee also gained a segment of C1, a palindrome which has <50% coverage with the other great ape Y chromosomes (Table S2).
Table SN1A. Reconstructing human palindrome evolution using maximum parsimony. The values from extant species were taken from Tables S3-4 and rounded to the following numbers of copies: <1.34 => "1", 1.33-1.66 => "1-2", 1.66-2.5 => "2", >2.5 => "M" ("more than two").
Species P1 P2 P3# P4 P5 P6 P7 P8
Bonobo 1-2 1-2 1-2 0 2 1-2 1 2
Chimpanzee M M 1$ 0 M 2 2 2
BC* 1-2-M 1-2-M 1 0 2-M 2 1-2 2
Human 2 2 2 2 2 2 2 2
BCH** 2 2 1-2 2 2 2 2 2
Gorilla 2 2 2 M 2 1-2 1 1
BCHG*** 2 2 2 2-M 2 2 1-2 1-2
Orangutan M M 1 M M 1 1 2
GA**** 2-M 2-M 1-2 M 2-M 1-2 1 2
*Bonobo-chimpanzee common ancestor
**Bonobo-chimpanzee-human common ancestor
***Bonobo-chimpanzee-human-gorilla common ancestor
****Common ancestor of great apes
$We conservatively assigned the copy number of P3 as 1 in chimpanzee to make sure we do not inflate its combined copy number.
#We can conservatively assume that P3 became multi-copy in BCHG, but it might instead have been multi-copy in the common ancestor of great apes and lost its multi-copy status in orangutan.
Table SN1B. Reconstructing chimpanzee palindrome evolution using maximum parsimony. The values from extant species were taken from Tables S3-4 and rounded to the following numbers of copies: <1.34 => "1", 1.33-1.66 => "1-2", 1.66-2.5 => "2", >2.5 => "M" ("more than two").
Species  C1 group  C2 group  C3 group  C4 group  C5 group  C17  C18  C19
Chimpanzee M M M M M 2 2 2
Bonobo M M 1 1-2 M 2 1 1
BC* M M 1-M M M 2 2 1-2
Human 2 0 M M 1 2 2 2
BCH** 2-M M M M 1-M 2 2 2
Gorilla 2 2 2 2 1 1 1 1
BCHG*** 2 2-M 2-M 2-M 1 1-2 1 1-2
Orangutan 1 M M M 1 2 1 1
GA**** 1-2 M M M 1 2 1 1
*Bonobo-chimpanzee common ancestor
**Bonobo-chimpanzee-human common ancestor
***Bonobo-chimpanzee-human-gorilla common ancestor
****Common ancestor of great apes
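The rounding rule stated in the captions of Tables SN1A-B can be written as a small Python function. This is an illustrative sketch: the function name `copy_number_class` is hypothetical, and two details are assumptions because the stated ranges overlap at their edges and do not mention absent palindromes, namely that boundary values are assigned to the lower category and that an estimate of exactly 0 is reported as "0". The example values are the bonobo estimates for human palindromes P1-P8 from Table S3.

```python
def copy_number_class(estimate):
    """Map a read-depth-based copy number estimate to a copy number class."""
    if estimate == 0:
        return "0"    # palindrome absent (assumed convention)
    if estimate < 1.34:
        return "1"
    if estimate <= 1.66:
        return "1-2"  # ambiguous between one and two copies
    if estimate <= 2.5:
        return "2"
    return "M"        # "more than two"


# Bonobo estimates for P1..P8 (Table S3).
bonobo_p = [1.64, 1.46, 1.62, 0, 2.13, 1.42, 0.70, 1.98]
bonobo_classes = [copy_number_class(x) for x in bonobo_p]
print(bonobo_classes)
```

Applied to these eight values, the function returns the classes listed in the bonobo row of Table SN1A.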
Supplemental Note S2. Augustus predictions
AUGUSTUS (Stanke & Waack, 2003) predicted 219 genes on the bonobo Y assembly, of which 25 complete or partial genes represent homologs of known human protein-coding genes. In the case of Sumatran orangutan, AUGUSTUS predicted 90 genes, of which 33 complete or partial genes represent homologs of known human protein-coding genes. We required gene predictions (1) to have start and stop codons, and (2) to be present on contigs that align to human or chimpanzee Y. After applying these requirements, we did not find any novel genes on the Y chromosome of orangutan; however, in bonobo we found two candidates, SUZ12 and PSMA6, which have >95% identity and >90% coverage to the gene homologs on the autosomes of bonobo.
A possible transposition of the autosomal SUZ12 gene (located on human chromosome 17) onto the bonobo Y chromosome was predicted (Table SN2) based on the limited number of introns (one intron), in contrast to its autosomal homolog, which has 15 introns (NM_015355). The SUZ12 gene has no matches to human and gorilla Y, however the first 121 bp of its predicted sequence align to chimpanzee (palindromes C2, C11 and C15) and orangutan Y. However, when we aligned testis RNA-seq data to the predicted SUZ12 gene on the bonobo Y chromosome, the first exon with the start codon was not expressed (Fig. SN2A), whereas the second exon was expressed (Fig. SN2B). The single nucleotide variants in the RNA-seq reads mapping to the second exon are consistent with the variants of the SUZ12 gene present on chromosome 17. Thus, we concluded that the translocated SUZ12 was pseudogenized on bonobo Y.
The PSMA6 gene was also predicted in bonobo; it shared 99.6% identity with its homolog on the chimpanzee Y and 97.8% identity with its homolog on the human Y. However, there were no homologous sequences in the orangutan and gorilla Y assemblies. The PSMA6 gene on the human Y is annotated as a pseudogene in the Entrez database (Gene ID: 5687). Therefore, we concluded that PSMA6 is also a pseudogene on the bonobo Y. Thus, no novel genes, as compared to the human Y chromosome genes, were found on the bonobo and orangutan Y chromosomes.
Table SN2. Gene annotation of SUZ12 homolog as predicted by AUGUSTUS
Seqname Source Feature Start End Strand
Contig591 AUGUSTUS gene 74 61572 -
Contig591 AUGUSTUS transcript 74 61572 -
Contig591 AUGUSTUS stop_codon 74 76 -
Contig591 AUGUSTUS CDS 74 2353 -
Contig591 AUGUSTUS CDS 61453 61572 -
Contig591 AUGUSTUS start_codon 61570 61572 -
Supplemental Figure SN2A. IGV (Robinson et al., 2011) view of the first exon of the SUZ12 homolog on bonobo Y. Testis-specific RNA-seq reads (SRA ID: SRR306837) were mapped to the bonobo Y assembly using BWA-MEM (Li, 2013).
Supplemental Figure SN2B. IGV view of the second exon of the SUZ12 homolog on bonobo Y. Testis-specific RNA-seq reads (SRA ID: SRR306837) were mapped to the bonobo Y assembly using BWA-MEM (Li, 2013).
Supplemental Tables
Table S1. The sequence coverage (percentage) of human palindromes P1-8 (arms) across great apes. The repeats annotated by RepeatMasker (SMIT & A., 2004) were excluded in the percentage calculations.
Length, bp Coverage, percentage
Human palindrome Chimpanzee Bonobo Gorilla Orangutan Human* Chimpanzee Bonobo Gorilla Orangutan
P1 362895 340679 334030 249906 608650 59.62 55.97 54.88 41.06
P2 50379 50144 49092 29106 76387 65.95 65.64 64.27 38.10
P3 64460 62335 68254 55671 179793 35.85 34.67 37.96 30.96
P4 77345 76543 83948 72509 93979 82.30 81.45 89.33 77.15
P5 158548 158405 157040 148243 166929 94.98 94.89 94.08 88.81
P6 35555 35436 34535 32729 36832 96.53 96.21 93.76 88.86
P7 4439 1134 4385 4134 5414 81.99 20.95 80.99 76.36
P8 15027 13698 12688 12783 16728 89.83 81.89 75.85 76.42
Total 768648 738374 743972 605081 1184712 64.88 62.33 62.80 51.07
*Palindrome arm length
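The coverage percentages in Table S1 appear to be the aligned length in each species divided by the human palindrome arm length. The following Python sketch reproduces that arithmetic for a few rows; the function name `coverage_percent` and the rounding to two decimals are assumptions made for illustration.

```python
def coverage_percent(aligned_bp, arm_length_bp):
    """Aligned length as a percentage of the palindrome arm length,
    rounded to two decimals (rounding is an assumption)."""
    return round(100 * aligned_bp / arm_length_bp, 2)


# (aligned length in bp, human arm length in bp) taken from Table S1.
p2_chimpanzee = coverage_percent(50379, 76387)
p1_chimpanzee = coverage_percent(362895, 608650)
p7_bonobo = coverage_percent(1134, 5414)
print(p2_chimpanzee, p1_chimpanzee, p7_bonobo)
```

For these three rows the computed values agree with the tabulated percentages.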
Table S2. The sequence coverage (percentage) of chimpanzee palindromes C1-19 (arms) across great apes. The repeats annotated by RepeatMasker (SMIT & A., 2004) were excluded from the calculations. The palindromes were clustered into five homology groups: C1 (C1+C6+C8+C10+C14+C16), C2 (C2+C11+C15), C3 (C3+C12), C4 (C4+C13), and C5 (C5+C7+C9). The numbers in bold represent the palindrome with highest coverage within each group, which we used as the representative coverage of that homology group.
Length, bp Coverage, percentage
Chimpanzee palindrome Human Bonobo Gorilla Orangutan Chimpanzee* Human Bonobo Gorilla Orangutan
C1 36118 35368 25476 37460 66916 53.98 52.85 38.07 55.98
C2 35233 111830 27957 38673 141680 24.87 78.93 19.73 27.30
C3 58885 59610 57211 38076 82274 71.57 72.45 69.54 46.28
C4 119067 120124 119349 116986 140401 84.80 85.56 85.01 83.32
C5 113353 107475 113915 110252 136579 82.99 78.69 83.41 80.72
C6 16015 15745 6517 17378 58166 27.53 27.07 11.20 29.88
C7 45695 45809 46167 45408 137290 33.28 33.37 33.63 33.07
C8 21783 21047 12228 22766 64725 33.65 32.52 18.89 35.17
C9 46890 46809 47450 46382 139044 33.72 33.66 34.13 33.36
C10 16096 15957 6926 17487 59322 27.13 26.90 11.68 29.48
C11 24198 105870 27840 38567 123692 19.56 85.59 22.51 31.18
C12 46539 47451 46260 26584 76418 60.90 62.09 60.54 34.79
C13 117844 118277 117312 116043 132034 89.25 89.58 88.85 87.89
C14 19998 19943 11045 21285 47853 41.79 41.68 23.08 44.48
C15 23050 95239 25619 36280 111841 20.61 85.16** 22.91 32.44
C16 40950 33549 28037 41545 76391 53.61 43.92 36.70 54.38
C17 15364 15072 12961 12978 17516 87.71 86.05 74.00 74.09
C18 6686 2582 4741 6304 6888 97.07 37.49 68.83 91.52
C19 155747 156318 57117 122860 168341 92.52 92.86 33.93 72.98
Total 426178 488431 303092 383879 637282 72.96 81.67 57.36 66.48
*Palindrome arm length
**In the case of cluster C2, different palindromes had highest coverage for different species and we considered C15 as a representative because it had the highest coverage for more than one species, while other palindromes had highest coverage for only one species.
Table S3. The copy number for sequences homologous to human palindromes. The numbers for bonobo, gorilla, and orangutan were obtained based on median read depth of 1-kb windows homologous to human or chimpanzee palindromes. The copy number for chimpanzee and human were obtained by examining the dotplot of human and chimpanzee Y(Hughes et al., 2010). In brackets, we indicate the known homologs of chimpanzee and human palindromes in human and chimpanzee, respectively (Hughes et al., 2010).
Palindrome/Species  P1  P2  P3*  P4  P5  P6  P7  P8
Human  2  2  2  2  2  2  2  2
Chimpanzee  >2(C3+C4)  >2(C3)  >2(C1 parts)  0  >2(C4)  2(C19)  2(C18)  2(C17)
Bonobo 1.64 1.46 1.62 0 2.13 1.42 0.70 1.98
Gorilla 2.19 2.13 1.87 2.74 2.04 1.42 1.16 1.19
Orangutan 6.52 13.13 1.05 5.67 7.29 1.06 1.04 1.80
*Note: We assume that some parts of human palindrome P3 in chimpanzee are multi-copy (those that share homology with C1), while others are single-copy.
Table S4. The copy number for sequences homologous to chimpanzee palindromes. The numbers for bonobo, gorilla, and orangutan were obtained based on median read depth of 1-kb windows homologous to human or chimpanzee palindromes. The copy number for chimpanzee and human were obtained by examining the dotplot of human and chimpanzee Y(Hughes et al., 2010). In brackets, we indicate the known homologs of chimpanzee and human palindromes in human and chimpanzee, respectively (Hughes et al., 2010).
Palindrome/Species  C1 group  C2 group  C3 group  C4 group  C5 group  C17  C18  C19
Human  2(P3 parts)  0  >2(P1,P2)  >2(P5,P1)  1  2(P8)  2(P7)  2(P6)
Chimpanzee >2 >2 >2 >2 >2 2 2 2
Bonobo 13.29 8.42 1.18 1.41 9.31 2.27 0.80 1.25
Gorilla 2.28 2.48 2.13 2.07 1.30 1.19 1.15 1.28
Orangutan 1.02 6.07 12.13 7.85 1.07 1.87 1.02 1.02
Table S5. Gene birth and death rates (in events per million years) on the Y chromosome of great apes, as predicted using the Iwasaki and Takagi gene reconstruction model. BC - common ancestor of bonobo and chimpanzee; BCH - common ancestor of bonobo, chimpanzee, and human; BCHG - common ancestor of bonobo, chimpanzee, human, and gorilla; GA - common ancestor of great apes.
X-degenerate genes Ampliconic genes
Branch  Birth rate  Death rate  Birth rate  Death rate
Bonobo-BC 1.00×10^-4 1.00×10^-4 1.00×10^-4 0.182
Chimpanzee-BC 1.00×10^-4 8.00×10^-2 1.00×10^-4 1.00×10^-4
Human-BCH 1.90×10^-5 1.90×10^-5 1.90×10^-5 1.90×10^-5
Gorilla-BCHG 1.43×10^-5 1.43×10^-5 1.43×10^-5 1.43×10^-5
Orangutan-GA 7.14×10^-6 2.06×10^-2 7.14×10^-6 7.14×10^-6
Macaque-Root 3.45×10^-6 3.45×10^-6 3.45×10^-6 9.92×10^-3
BC-BCH 2.35×10^-5 4.89×10^-2 2.35×10^-5 9.54×10^-2
BCH-BCHG 5.71×10^-5 5.71×10^-5 5.71×10^-2 5.71×10^-5
BCHG-GA 1.43×10^-5 1.43×10^-5 1.43×10^-5 1.43×10^-5
GA-Root 6.67×10^-6 4.04×10^-3 6.67×10^-6 6.67×10^-6
Table S6. The coordinates of palindromes on panTro6 Y chromosome. The coordinates were obtained from Tomaszkiewicz et al. 2016 and updated to the current version of chimpanzee Y.
Palindrome  Start  End  Amplicon  Approx. arm length (half of palindrome)  Left_arm_end (Approx)
C1 1759451 2053069 Pink 146809 1906260
C2 2298081 2984818 Blue 343368.5 2641450
C3 3587737 3925944 Red 169103.5 3756841
C4 4669973 5444881 Gold 387454 5057427
C5 8651546 9099775 Turquoise 224114.5 8875661
C6 9099776 9383863 Pink 142043.5 9241820
C7 9383864 9832093 Turquoise 224114.5 9607979
C8 9832094 10116181 Pink 142043.5 9974138
C9 10116182 10564411 Turquoise 224114.5 10340297
C10 10564412 10851248 Pink 143418 10707830
C11 11060051 11674261 Blue 307105 11367156
C12 12651797 12963129 Red 155666 12807463
C13 13340000 14084660 Gold 372330 13712330
C14 14798116 15024316 Pink 113100 14911216
C15 15493746 16056815 Blue 281534.5 15775281
C16 16477310 16791733 Pink 157211.5 16634522
C17 21591500 21671300 39900 21631400
C18 23517939 23546819 14440 23532379
C19 23807052 24577396 385172 24192224
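The two derived columns of Table S6 follow from the start and end coordinates: the approximate arm length is half of the palindrome span, and the approximate left arm end is the midpoint of the palindrome. A minimal Python sketch of this arithmetic is given below; the function name `arm_geometry` is hypothetical, and rounding the midpoint half up is an assumption chosen because it matches the tabulated values for the rows checked.

```python
def arm_geometry(start, end):
    """Return (approximate arm length, approximate left arm end)."""
    arm_length = (end - start) / 2
    # Midpoint of the palindrome, rounded half up with integer arithmetic
    # (Python's round() would instead round halves to the nearest even number).
    left_arm_end = (start + end + 1) // 2
    return arm_length, left_arm_end


# Palindromes C1, C2, and C3 from Table S6.
c1 = arm_geometry(1759451, 2053069)
c2 = arm_geometry(2298081, 2984818)
c3 = arm_geometry(3587737, 3925944)
print(c1, c2, c3)
```

The same midpoint relationship does not hold for the hg38 coordinates in Table S7, so this sketch applies only to Table S6.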
Table S7. The coordinates of palindromes on hg38 Y chromosome. The coordinates were obtained from Tomaszkiewicz et al. 2016.
Palindrome Start End Left_arm_end (Approx)
P1 23359067 26311550 24822577
P2 23061889 23358813 23208197
P3 21924954 22661453 22208730
P4 18450291 18870104 18640356
P5 17455877 18450126 17951255
P6 16159590 16425757 16269541
P7 15874906 15904894 15883575
P8 13984498 14058230 14019652
Table S8. The coordinates of X-degenerate genes on the hg38 Y chromosome. The locations are based on the NCBI RefSeq annotation.
GENE START END
SRY 2786855 2787699
AMELY 6865918 6874027
DBY 12904831 12920478
EIF1AY 20575725 20593154
NLGN4Y 14523746 14843726
PRKY 7273972 7381547
KDM5D 19705417 19744939
TBL1Y 6910686 7091683
TMSB4Y 13703567 13706024
USP9Y 12701231 12860839
UTY 13248379 13480673
ZFY 2935477 2982506
Table S9. The coordinates of X-degenerate genes on the panTro6 Y chromosome. The locations are based on the NCBI RefSeq annotation.
GENE START END
SRY 26210218 26211127
AMELY 25814614 25822712
DBY (DDX3Y) 20429349 20442635
EIF1AY 17534922 17551649
NLGN4Y 22946482 23238101
PRKY 25143378 25246224
KDM5D 18097144 18135348
TBL1Y 25425481 25499431
USP9Y 20238918 20383654
UTY 20869562 21028971
ZFY 26012708 26038802
Table S10. List of ENCODE experiment and sample ids for the samples analyzed.
Experiment   Unfiltered BAM  Target     Tissue
ENCSR000ALB  ENCFF735TGN     H3K27Ac    HUVEC
ENCSR000AKL  ENCFF322MOQ     H3K4me1    HUVEC
ENCSR000ALG  ENCFF261CBZ     Control    HUVEC
ENCSR000EOQ  ENCFF042VZB     DNase-seq  HUVEC
ENCSR112ALD  ENCFF319GEZ     CREB1      HepG2
ENCSR136ZQZ  ENCFF807LQS     H3K27Ac    Testis
ENCSR956VQB  ENCFF077NRU     H3K4me1    Testis
ENCSR215WNN  ENCFF511LQO     Control    Testis
ENCSR729DRB  ENCFF639PHQ     DNase-seq  Testis
Table S11. Annotation of possible novel genes on bonobo Y (post filtering).
Gene ID  Gene annotation  Percentage identity  Query length  Alignment length  e-value    Gene coverage (Alignment length/Query length)
g80      SUZ12_HUMAN      99.187               799           738               0          92.37
g32      PSA6_MOUSE       95.732               164           164               2.16E-117  100.00
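The gene coverage column of Table S11 is defined in its header as alignment length divided by query length, and the tabulated values read as percentages. A minimal sketch of that calculation follows. The function name `gene_coverage` is hypothetical, and rounding to two decimals is an assumption based on the reported values.

```python
def gene_coverage(alignment_length, query_length):
    """Percentage of the query covered by the alignment, to two decimals."""
    return round(100 * alignment_length / query_length, 2)


# (alignment length, query length) copied from Table S11.
candidates = {"g80": (738, 799), "g32": (164, 164)}
for gene_id, (aln_len, query_len) in candidates.items():
    print(gene_id, gene_coverage(aln_len, query_len))
```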
Supplemental Figures
Figure S1. Reconstructed gene content of great apes. The first six rows show the gene content of great apes and macaque, which was used as input for the model of Iwasaki and Takagi (Iwasaki & Takagi, 2007). The other rows were reconstructed by the model. BC - common ancestor of bonobo and chimpanzee; BCH - common ancestor of bonobo, chimpanzee, and human; BCHG - common ancestor of bonobo, chimpanzee, human, and gorilla; GA - common ancestor of great apes. We defined the presence of RPS4Y2 and MXRA5Y in bonobo and orangutan based on AUGUSTUS gene predictions and Y chromosome-specific testis transcriptome assembly results (Vegesna et al., 2020). The presence of the RPS4Y2 gene in bonobo was confirmed through gene prediction (100% identity with chimpanzee RPS4Y2) and assembled transcript sequences (99.6% identity with chimpanzee RPS4Y2). MXRA5Y, which is pseudogenized in human and chimpanzee (Hughes et al., 2012), was missing in orangutan (no gene prediction or transcript found) and pseudogenized in bonobo (the gene prediction was annotated as the X chromosome homolog MXRA5 and lacked the first three exons of MXRA5). In gorilla, we did not find an MXRA5Y transcript in the transcriptome assembly, and a BLAT search with the human MXRA5Y gene (NC_000024.10:11952465-11993293) returned a 12-kb hit, which is around 20% of the gene. We therefore assumed that the gene is lost in gorilla as well.
References
Harris, R. S. (2007). Improved pairwise alignment of genomic DNA.
https://etda.libraries.psu.edu/catalog/7971
Hughes, J. F., Skaletsky, H., Brown, L. G., Pyntikova, T., Graves, T., Fulton, R. S.,
Dugan, S., Ding, Y., Buhay, C. J., Kremitzki, C., Wang, Q., Shen, H., Holder, M.,
Villasana, D., Nazareth, L. V., Cree, A., Courtney, L., Veizer, J., Kotkiewicz, H., …
Page, D. C. (2012). Strict evolutionary conservation followed rapid gene loss on
human and rhesus Y chromosomes. Nature, 483(7387), 82–87.
Hughes, J. F., Skaletsky, H., Pyntikova, T., Graves, T. A., van Daalen, S. K. M., Minx, P.
J., Fulton, R. S., McGrath, S. D., Locke, D. P., Friedman, C., Trask, B. J., Mardis, E.
R., Warren, W. C., Repping, S., Rozen, S., Wilson, R. K., & Page, D. C. (2010).
Chimpanzee and human Y chromosomes are remarkably divergent in structure and
gene content. Nature, 463(7280), 536–539.
Iwasaki, W., & Takagi, T. (2007). Reconstruction of highly heterogeneous gene-content
evolution across the three domains of life. Bioinformatics, 23(13), i230–i239.
Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with
BWA-MEM. In arXiv [q-bio.GN]. arXiv. http://arxiv.org/abs/1303.3997
Robinson, J. T., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G.,
& Mesirov, J. P. (2011). Integrative genomics viewer. Nature Biotechnology, 29(1),
24–26.
Smit, A. F. A. (2004). RepeatMasker Open-3.0. http://www.repeatmasker.org.
https://ci.nii.ac.jp/naid/10029514778/
Stanke, M., & Waack, S. (2003). Gene prediction with a hidden Markov model and a new
intron submodel. Bioinformatics, 19 Suppl 2, ii215–ii225.
Vegesna, R., Tomaszkiewicz, M., Ryder, O. A., Campos-Sánchez, R., Medvedev, P.,
DeGiorgio, M., & Makova, K. D. (2020). Ampliconic genes on the great ape Y
chromosomes: Rapid evolution of copy number but conservation of expression
levels. In Review.
Harris, R. S. (2007). Improved pairwise Alignmnet of genomic DNA.
https://etda.libraries.psu.edu/catalog/7971
Hughes, J. F., Skaletsky, H., Brown, L. G., Pyntikova, T., Graves, T., Fulton, R. S.,
Dugan, S., Ding, Y., Buhay, C. J., Kremitzki, C., Wang, Q., Shen, H., Holder, M.,
Villasana, D., Nazareth, L. V., Cree, A., Courtney, L., Veizer, J., Kotkiewicz, H., …
Page, D. C. (2012). Strict evolutionary conservation followed rapid gene loss on
human and rhesus y chromosomes. Nature, 483(7387), 82–87.
Hughes, J. F., Skaletsky, H., Pyntikova, T., Graves, T. A., van Daalen, S. K. M., Minx, P.
J., Fulton, R. S., McGrath, S. D., Locke, D. P., Friedman, C., Trask, B. J., Mardis, E.
R., Warren, W. C., Repping, S., Rozen, S., Wilson, R. K., & Page, D. C. (2010).
Chimpanzee and human Y chromosomes are remarkably divergent in structure and
gene content. Nature, 463(7280), 536–539.
Iwasaki, W., & Takagi, T. (2007). Reconstruction of highly heterogeneous gene-content
evolution across the three domains of life. Bioinformatics , 23(13), i230–i239.
Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with
BWA-MEM. In arXiv [q-bio.GN]. arXiv. http://arxiv.org/abs/1303.3997
Rahulsimham Vegesna, Marta Tomaszkiewicz, Oliver A. Ryder, Rebeca Campos-
Sánchez, Paul Medvedev, Michael DeGiorgio, and Kateryna D. Makova. (2020).
Ampliconic genes on the great ape Y chromosomes: Rapid evolution of copy
number but conservation of expression levels. In Review.
Robinson, J. T., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G.,
& Mesirov, J. P. (2011). Integrative genomics viewer. Nature Biotechnology, 29(1),
212
24–26.
SMIT, & A., F. A. (2004). Repeat-Masker Open-3.0. Http://www.repeatmasker.org.
https://ci.nii.ac.jp/naid/10029514778/
Stanke, M., & Waack, S. (2003). Gene prediction with a hidden Markov model and a new
intron submodel. Bioinformatics , 19 Suppl 2, ii215–ii225.
213
VITA Rahulsimham Vegesna
Education:
PhD in Bioinformatics and Genomics (2013-2020) The Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA.
Master of Science in Bioinformatics (PSM) (2009-2011) The University of Texas at El Paso, El Paso, TX, USA.
Bachelor of Technology in Bioinformatics (2005-2009) Sathyabama University, Chennai, TN, India.
Work Experience:
Associate Applications System Analyst (2012-2013) Bioinformatics and Computational Biology Department, M.D. Anderson Cancer Center, Houston, Texas, USA, 77230.
Fellowship and Awards:
The Huck Institutes of the Life Sciences Fellowship 2013-2014
Computation, Bioinformatics, and Statistics (CBIOS) Training Program 2017-2018
Selected Publications:
• Vegesna, R., Tomaszkiewicz, M., Medvedev, P., & Makova, K. D. (2019). Dosage regulation, and variation in gene expression and copy number of human Y chromosome ampliconic genes. PLoS genetics, 15(9), e1008369.
• Torres-García, W., Zheng, S., Sivachenko, A., Vegesna, R., Wang, Q., Yao, R., ... & Verhaak, R. G. (2014). PRADA: pipeline for RNA sequencing data analysis. Bioinformatics, 30(15), 2224-2226.
• Yoshihara, K., Wang, Q., Torres-Garcia, W., Zheng, S., Vegesna, R., Kim, H., & Verhaak, R. G. (2015). The landscape and therapeutic relevance of cancer-associated transcript fusions. Oncogene, 34(37), 4845-4854.
• Zheng, S., Fu, J., Vegesna, R., Mao, Y., Heathcock, L. E., Torres-Garcia, W., ... & Brennan, C. W. (2013). A survey of intragenic breakpoints in glioblastoma identifies a distinct subset associated with poor survival. Genes & development, 27(13), 1462-1472.
• Yoshihara, K., Shahmoradgoli, M., Martínez, E., Vegesna, R., Kim, H., Torres-Garcia, W., ... & Carter, S. L. (2013). Inferring tumour purity and stromal and immune cell admixture from expression data. Nature communications, 4(1), 1-11.
• Cancer Genome Atlas Research Network. (2013). Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature, 499(7456), 43.