Reference Genome and Transcriptome Informed by the Sex Chromosome

Olney et al. Biology of Sex Differences (2020) 11:42 https://doi.org/10.1186/s13293-020-00312-9 RESEARCH Open Access Reference genome and transcriptome informed by the sex chromosome complement of the sample increase ability to detect sex differences in gene expression from RNA-Seq data Kimberly C. Olney1,2, Sarah M. Brotman1,3, Jocelyn P. Andrews1,4, Valeria A. Valverde-Vesling1 and Melissa A. Wilson1,2,5* Abstract Background: Human X and Y chromosomes share an evolutionary origin and, as a consequence, sequence similarity. We investigated whether the sequence homology between the X and Y chromosomes affects the alignment of RNA-Seq reads and estimates of differential expression. We tested the effects of using reference genomes and reference transcriptomes informed by the sex chromosome complement of the sample’s genome on the measurements of RNA-Seq abundance and sex differences in expression. Results: The default genome includes the entire human reference genome (GRCh38), including the entire sequence of the X and Y chromosomes. We created two sex chromosome complement informed reference genomes. One sex chromosome complement informed reference genome was used for samples that lacked a Y chromosome; for this reference genome version, we hard-masked the entire Y chromosome. For the other sex chromosome complement informed reference genome, to be used for samples with a Y chromosome, we hard- masked only the pseudoautosomal regions of the Y chromosome, because these regions are duplicated identically in the reference genome on the X chromosome. We analyzed the transcript abundance in the whole blood, brain cortex, breast, liver, and thyroid tissues from 20 genetic female (46, XX) and 20 genetic male (46, XY) samples. Each sample was aligned twice: once to the default reference genome and then independently aligned to a reference genome informed by the sex chromosome complement of the sample, repeated using two different read aligners, HISAT and STAR. We then quantified sex differences in gene expression using featureCounts to get the raw count estimates followed by Limma/Voom for normalization and differential expression. We additionally created sex chromosome complement informed transcriptome references for use in pseudo-alignment using Salmon. Transcript (Continued on next page) * Correspondence: [email protected] 1School of Life Sciences, Arizona State University, PO Box 874501, Tempe, AZ 85287-4501, USA 2Center for Evolution and Medicine, Arizona State University, Tempe, AZ 85282, USA Full list of author information is available at the end of the article © The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. Olney et al. Biology of Sex Differences (2020) 11:42 Page 2 of 18 (Continued from previous page) abundance was quantified twice for each sample: once to the default target transcripts and then independently to target transcripts informed by the sex chromosome complement of the sample. Conclusions: We show that regardless of the choice of the read aligner, using an alignment protocol informed by the sex chromosome complement of the sample results in higher expression estimates on the pseudoautosomal regions of the X chromosome in both genetic male and genetic female samples, as well as an increased number of unique genes being called as differentially expressed between the sexes. We additionally show that using a pseudo-alignment approach informed on the sex chromosome complement of the sample eliminates Y-linked expression in female XX samples. Keywords: RNA-Seq, Sex chromosomes, Differential expression, Transcriptome, Mapping, Alignment, Pseudo- alignment, Quantification Author summary susceptibility, are partially driven by sex-specific gene The human X and Y chromosomes share an evolution- regulation [1–4]. There are reported sex differences in ary origin and sequence homology, including regions of gene expression across human tissues [5–7] and while 100% identity; this sequence homology can result in some may be attributed to hormones and environment, reads misaligning between the sex chromosomes, X and there are documented genome-wide sex differences in Y. We hypothesized that misalignment of reads on the expression based solely on the sex chromosome comple- sex chromosomes would confound estimates of tran- ment [8]. However, accounting for the sex chromosome script abundance if the sex chromosome complement of complement of the sample in quantifying gene expres- the sample is not accounted for during the alignment sion has been limited due to shared sequence homology step. For example, because of shared sequence similarity, between the sex chromosomes, X and Y, that can con- X-linked reads could misalign to the Y chromosome. found gene expression estimates. This is expected to result in reduced expression for re- The X and Y chromosomes share an evolutionary ori- gions between X and Y that share high levels of hom- gin: mammalian X and Y chromosomes originated from ology. For this reason, we tested the effect of using a a pair of indistinguishable autosomes ~ 180–210 million default reference genome versus a reference genome in- years ago that acquired the sex-determining genes [9– formed by the sex chromosome complement of the sam- 11]. The human X and Y chromosomes formed in two ple on estimates of transcript abundance in human different segments: (a) the one that is shared across all RNA-Seq samples from the whole blood, brain cortex, mammals called the X-conserved region (XCR) and (b) breast, liver, and thyroid tissues of 20 genetic female (46, the X-added region (XAR) that is shared across all eu- XX) and 20 genetic male (46, XY) samples. We found therian animals [11]. The sex chromosomes, X and Y, that using a reference genome with the sex chromosome previously recombined along their entire lengths, but complement of the sample resulted in higher measure- due to recombination suppression from Y chromosome- ments of X-linked gene transcription for both male and specific inversions [10, 12], now only recombine at the female samples and more differentially expressed genes tips in the pseudoautosomal regions (PAR) PAR1 and on the X and Y chromosomes. We additionally investi- PAR2 [9–11]. PAR1 is ~ 2.78 million bases (Mb), and gated the use of a sex chromosome complement in- PAR2 is ~ 0.33 Mb; these sequences are 100% identical formed transcriptome reference index for alignment-free between X and Y [11, 13, 14] (Fig. 1a). The PAR1 is a quantification protocols. We observed no Y-linked ex- remnant of the XAR Ross et al. [11] and shared among pression in female XX samples only when the transcript eutherians, while the PAR2 is recently added and quantification was performed using a transcriptome ref- human-specific [14]. Other regions of high sequence erence index informed on the sex chromosome comple- similarity between X and Y include the X-transposed re- ment of the sample. We recommend that future studies gion (XTR) with 98.78% homology [15] (Fig. 1a). The requiring aligning RNA-Seq reads to a reference genome XTR formed from an X chromosome to Y chromosome or pseudo-alignment with a transcriptome reference duplication event following the human-chimpanzee di- should consider the sex chromosome complement of vergence [11, 16]. Thus, the evolution of the X and Y their samples prior to running default pipelines. chromosomes has resulted in a pair of chromosomes that are diverged, but still share some regions of high se- Background quence similarity. Sex differences in aspects of human biology, such as de- To infer which genes or transcripts are expressed, velopment, physiology, metabolism, and disease RNA-Seq reads can be aligned to a reference genome. Olney et al. Biology of Sex Differences (2020) 11:42 Page 3 of 18 The abundance of reads mapped to a transcript is re- Conversely, female XX samples were aligned to a reference flective of the amount of expression of that transcript. genome where the entirety of the Y chromosome is hard- RNA-Seq methods rely on aligning reads to an available masked (Fig. 1c). We tested two different read aligners, high-quality reference genome sequence, but this re- HISAT [31]andSTAR[32], to account for variation be- mains a challenge due to the intrinsic complexity in the tween alignment methods and measured differential ex- transcriptome of regions with a high level of homology pression using Limma/Voom [33]. We found that using a [17]. By default, the GRCh38 version of the human refer- sex chromosome complement informed reference genome ence genome includes both the X and Y chromosomes, for aligning RNA-Seq reads increased expression estimates which is used to align RNA-Seq reads from both male on the pseudoautosomal regions of the X chromosome in XY and female XX samples. It is known that sequence both male XY and female XX samples and uniquely identi- reads from DNA will misalign along the sex chromo- fied differentially expressed genes.

Reference Genome and Transcriptome Informed by the Sex Chromosome

Reference Genome Sequence of the Model Plant Setaria

Y Chromosome Dynamics in Drosophila

The Bacteria Genome Pipeline (BAGEP): an Automated, Scalable Workflow for Bacteria Genomes with Snakemake

Lawrence Berkeley National Laboratory Recent Work

Telomere-To-Telomere Assembly of a Complete Human X Chromosome W

Human Genome Reference Program (HGRP)

De Novo Genome Assembly Versus Mapping to a Reference Genome

Practical Guideline for Whole Genome Sequencing Disclosure

Y Chromosome

Reference Genomes and Common File Formats Overview

Informatics and Clinical Genome Sequencing: Opening the Black Box

Using DECIPHER V2.0 to Analyze Big Biological Sequence Data in R by Erik S