Gene Birth Contributes to Structural Disorder Encoded by Overlapping Genes
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/229690; this version posted June 13, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Gene birth contributes to structural disorder encoded by overlapping genes S. Willis and J. Masel∗ Department of Ecology and Evolutionary Biology, University of Arizona ∗Corresponding Author: [email protected] Abstract The same nucleotide sequence can encode two protein products in different reading frames. Overlapping gene regions encode higher levels of intrinsic structural disorder (ISD) than non-overlapping genes (39% vs. 25% in our viral dataset). This might be because of the intrinsic properties of the genetic code, because one member per pair was recently born de novo in a process that favors high ISD, or because high ISD relieves increased evolutionary constraint imposed by dual-coding. Here we quantify the relative contributions of these three alternative hypotheses. We estimate that the recency of de novo gene birth explains 32% or more of the elevation in ISD in overlapping regions of viral genes. While the two reading frames within a same-strand overlapping gene pair have markedly different ISD tendencies that must be controlled for, their effects cancel out to make no net contribution to ISD. The remaining elevation of ISD in the older members of overlapping gene pairs, presumed due to the need to alleviate evolutionary constraint, was already present prior to the origin of the overlap. Same-strand overlapping gene birth events can occur in two different frames, favoring high ISD either in the ancestral gene or in the novel gene; surprisingly, most de novo gene birth events contained completely within the body of an ancestral gene favor high ISD in the ancestral gene (23 phylogenetically independent events vs. 1). This can be explained by mutation bias favoring the frame with more start codons and fewer stop codons. I. INTRODUCTION to only 23% in non-overlapping regions. In this work we explore three non-mutually-exclusive Protein-coding genes sometimes overlap, i.e. the hypotheses why this might be, and quantify the extent same nucleotide sequence encodes different proteins in of each. Two have previously been considered: that different reading frames. Most of the overlapping pairs elevated ISD in overlapping genes is a mechanism that of genes that have been characterized to date are found relieves evolutionary constraint, and that elevated ISD in viral, bacterial and mitochondrial genomes, with is a holdover from the de novo gene birth process. emerging research showing that they may be common We add consideration of a third, previously-unexplored in eukaryotic genomes as well [10], [20], [31], [32], hypothesis — that elevated ISD with dual-coding may [37]. Studying overlapping genes can shed light on the be the result of an artifact of the genetic code — to processes of de novo gene birth [36] . the mix. Overlapping genes tend to encode proteins with Overlapping genes are particularly evolutionarily higher intrinsic structural disorder (ISD) than those constrained because a mutation in an overlapping re- encoded by non-overlapping genes [36]. The term gion simultaneously affects both of the two (or occa- disorder applies broadly to proteins which, at least in sionally more) genes involved in that overlap. Because the absence of a binding partner, lack a stable secondary 70% of mutations that occur in the third codon and tertiary structure. There are different degrees of dis- position∼ are synonymous, versus only 5% and 0% order: molten globules, partially unstructured proteins of mutations in the first and second codon∼ positions and random coils with regions of disorder spanning respectively [38], a mutation that is synonymous in one from short (less than 30 residues in length) to long. reading frame is highly likely to be nonsynonymous Disorder can be shown experimentally or predicted in another, so to permit adaptation, overlapping genes from amino acid sequences using software [15]. [36] must be relatively tolerant of nonsynonymous changes. estimated, using the latter approach, that 48% of amino Demonstrating the higher constraint on overlapping acids in overlapping regions exhibit disorder, compared regions, they have lower genetic diversity and dN =dS bioRxiv preprint doi: https://doi.org/10.1101/229690; this version posted June 13, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. than non-overlapping regions in RNA viruses [25], overlapping reading frames will tend to encode higher [39], [40], [43]. ISD than the ancestral overlapping reading frames. High ISD can alleviate the problem of constraint. Under the constraint hypothesis, ancestral overlap- Amino acid substitutions that maintain disorder have ping proteins will still have elevated ISD relative to a reasonable chance of being tolerated, in contrast to non-overlapping proteins, even if it is less elevated than the relative fragility of a well-defined three-dimensional that of novel overlapping proteins. Elevated ISD in the structure. This expectation is confirmed by the higher ancestral member of the gene pair might have already evolutionary rates observed for disordered proteins [4], been there at the moment of gene birth, or it might have [13]. The known elevation of evolutionary constraint on subsequently evolved, representing two subhypotheses overlapping genes is usually invoked as the sole [47], within the overall hypothesis of constraint. The overlap- [53] or at least dominant [36] explanation for their high ping gene pairs that we observe are those that have been ISD. Given the strength of the evidence for constraint retained; if either member of the overlapping pair was [25], [39], [40], [43], we attribute to constraint, by a born with low ISD, then constraint makes it difficult to process of elimination, what cannot be explained by adapt to a changing environment, and that pair is less our other two hypotheses. likely to be retained. When the ancestral member of the The second hypothesis that we consider is that high pair already has high ISD at the moment at which the ISD in overlapping genes is an artifact of the process novel gene is functionalized, long-term maintenance of of de novo gene birth [36]. There is no plausible path both genes in the face of constraint is more likely. The by which two non-overlapping genes could re-encode ”already there” or ”preadaptation” [52] version of the an equivalent protein sequence as overlapping; instead, constraint hypothesis predicts that the pre-overlapping an overlapping pair arises either when a second gene ancestors of today’s ancestral overlapping genes had is born de novo within an existing gene, or when the higher ISD than other genes, perhaps because these boundaries of an existing gene are extended to create gene pairs are the ones to have been retained. While overlap [38]. In the latter case of ”overprinting” [7], these ancestral sequences are not available, as a proxy [19], [36], the extended portion of that gene, if not we use homologous sequences from basal lineages the whole gene, is born de novo. One overlapping whose most recent common ancestry predates the origin protein-coding sequence is therefore always evolution- of the overlap. For simplicity, we refer to these se- arily younger than the other; we refer to these as quences as ”pre-overlapping” to distinguish them from ”novel” versus ”ancestral” overlapping genes or por- ”non-overlapping” genes never known to have overlap. tions of genes. Genes may eventually lose their overlap The preadaptation version of the constraint hypothesis through a process of gene duplication followed by predicts higher ISD in pre-overlapping genes than in subfunctionalization [19], enriching overlapping genes non-overlapping genes. for relatively young genes that have not yet been Finally, here we also consider the possibility that the through this process. However, gene duplication may be high ISD observed in overlapping genes might simply inaccessible to many viruses (in particular, many RNA, be an artifact of the genetic code [21]. We perform for ssDNA, and retroviruses), due to intrinsic geometric the first time the appropriate control, by predicting what constraints on maximum nucleotide length [6], [9], the ISD would be if codons were read from alternative [11]. reading frames of existing non-overlapping genes. Any Young genes are known to have higher ISD than DNA sequence can be read in three reading frames on old genes, with high ISD at the moment of gene birth each of the two strands, for a total of 6 reading frames. facilitating the process [52], perhaps because cells tol- We focus only on same-strand overlap, due to superior erate them better [48]. Domains that were more recently availability of reliable data on same-strand overlapping born de novo also have higher ISD [3], [5], [14], [26]. gene pairs. We classify the reading frame of each gene High ISD could be helpful in itself in creating novel in an overlapping pair relative to its counterpart; if function, or it could be a byproduct of a hydrophilic gene A is in the +1 frame with respect to gene B, this amino acid composition whose function is simply the means that gene B is in the +2 frame with respect to avoidance of harmful protein aggregation [16], [22]. gene A. We use the +0 frame designation just for non- Regardless of the cause of high ISD in young genes, the overlapping or pre-overlapping genes in their original ”facilitate birth” hypothesis makes a distinct prediction frame.