Biological Function in the Twilight Zone of Sequence Conservation
Edinburgh Research Explorer

Citation for published version: Ponting, C 2017, 'Biological function in the twilight zone of sequence conservation', BMC Biology. https://doi.org/10.1186/s12915-017-0411-5
Digital Object Identifier (DOI): 10.1186/s12915-017-0411-5
Document Version: Publisher's PDF, also known as Version of record
Published In: BMC Biology

Ponting BMC Biology (2017) 15:71
DOI 10.1186/s12915-017-0411-5

REVIEW Open Access

Chris P. Ponting

Correspondence: [email protected]
MRC Human Genetics Unit, The Institute of Genetics and Molecular Medicine, University of Edinburgh, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, UK

© Ponting et al. 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Abstract

Strong DNA conservation among divergent species is an indicator of enduring functionality. With weaker sequence conservation we enter a vast ‘twilight zone’ in which sequence subject to transient or lower constraint cannot be distinguished easily from neutrally evolving, non-functional sequence. Twilight zone functional sequence is illuminated instead by principles of selective constraint and positive selection using genomic data acquired from within a species’ population. Application of these principles reveals that despite being biochemically active, most twilight zone sequence is not functional.

Function versus conservation versus constraint

Functionality of most human protein coding, and some non-coding, sequence is clearly implied when it is conserved across diverse mammalian species. This has been a rule-of-thumb by which to infer whether a sequence is functional without the benefit of experimental data. Conservation, however, is not a faithful indicator of functionality. High sequence conservation could reflect a relatively brief period of neutral evolution over which few mutations accumulated. Just because approximately 98% of human DNA is conserved in chimpanzee, for example, this does not imply that this amount of sequence conveys function. Conversely, poor conservation of a sequence does not imply that it is devoid of function. After all, low conservation could also be explained by frequent episodes when rare mutations are brought to high frequency and fixation within a population by positive selection. Thresholding on percentage nucleotide sequence identity thus fails to neatly separate functional from non-functional sequence. This means that as sequence conservation diminishes we drop into a ‘twilight zone’ [1, 2] in which DNA cannot immediately be ascribed as either functional or non-functional.

Population genetics principles illuminate the functionality of sequence in the twilight zone. These can be used to assess whether sequence evolution has been constrained, meaning that it exhibits a slower rate of change than predicted by a model of neutral evolution; selective constraint is inferred by considering the degree by which allele frequencies are depressed across extant populations [3–5]. Conversely, functional sequence subject to positive selection exhibits a rate of change greater than seen for neutrally evolving sequence.

Sequence conservation and constraint are not the only benchmark by which to evaluate functionality. High throughput experimental assays are providing genome-wide assessments of functional sequence. Armed with this experimental information, can we now reveal the extent of functional sequence and associated molecular and cellular biology present in the twilight zone of low sequence conservation? Here I review instances where sequence is functional despite its low conservation, focusing principally on our own and other mammalian species. I conclude that population genomics-based approaches to predict function are paramount because, counterintuitively, experiments are not perfect predictors of function.

A twilight zone protein-coding gene

The 2310003L06Rik gene exemplifies the rapidity with which a locus can evolve (Fig. 1). Little is known about its function, except that in mouse gene expression is specific to the tongue. With regards to evolution, it is a member of the secretory calcium-binding phosphoprotein (SCPP) gene family [6, 7] located in a tandem array on mouse chromosome 5, including those encoding enamel matrix proteins, milk caseins and salivary proteins, which mostly arose by local gene duplication and subsequent divergence during early mammalian evolution. In four respects, this gene is not well conserved: (i) it is present only in theria (marsupials and placental mammals) but not in monotremes; (ii) its amino acid sequence varies greatly, with a 3.7-fold difference in length between mouse and dog; (iii) it contains lineage-specific repeats and insertions or deletions; and (iv) in some lineages, such as the Catarrhini (including human), it has acquired open-reading frame disruptions and thus has become a pseudogene. Nucleotide sequence similarities between closely related species, such as mouse and rat, differ little between exons and introns and its protein sequence has evolved at a rate near to that of synonymous sites, often used as a neutral rate proxy. Of all its many features, conservation is evident only in these orthologues’ initiating codon, their common number of exons and their splice sites.

Fig. 1. Rapid evolution among 2310003L06Rik orthologues. (a) Nucleotide conservation is low across placental mammals, and the locus is not aligned with non-mammalian species. The mouse gene is incompletely predicted in Ensembl, and absent from other databases such as RefSeq and CCDS. (b) Rapid evolution of mammalian 2310003L06Rik orthologues. Open reading frames (shown in grey) are of highly variable length (amino acid numbers shown on the right) across mammalian species (phylogeny shown on left, not to scale). Human, chimpanzee and macaque genomes contain nucleotide substitutions that truncate the open-reading frame (“!”; pseudogene indicated in black). Deletions in the squirrel (Spermophilus tridecemlineatus) orthologue, relative to the dog, are indicated by “Δ”, and repeats in the dog sequence are shown by “R”. Aligned protein sequences are indicated by dotted blue and brown lines. There is no significant sequence similarity evident between mouse and opossum (Monodelphis domestica) orthologues (blastp E > 0.2). dN/dS is explained in the legend to Fig. 2

Fig. 2. Trends regarding sequence constraint. Protein-coding sites that are highly constrained (dN/dS → 0) tend to fall within secondary structures within intracellular proteins expressed in many tissues, whereas the less-numerous sites that are evolving either near neutrally (dN/dS ≈ 1) or in response to positive selection (dN/dS > 1) tend to lie in disordered regions or in loops in secreted proteins that are expressed in a tissue-restricted manner [73–75]. The median value of dN/dS for human and mouse orthologues is 0.095 [76]. Inferences of positive selection (for example using PAML [77]) can be in error due to sequence misalignment [78, 79], or when alignments are short [80], or when dN exceeds dS because of chance fluctuations. dN/dS (also written as Ka/Ks or ω) [81] is the ratio of the number of nonsynonymous substitutions per nonsynonymous site (dN) to the number of synonymous substitutions per synonymous site (dS)

Functional sites that are neither well conserved nor constrained fall into two classes that differ in the rate by which they accumulate mutations relative to the neutral rate (Fig. 2). Sites in the first class evolve rapidly due to positive selection and adaptation. This is when rare mutations confer reproductive advantage leading to their rise in frequency and their fixation in that population faster than neutral mutations. In mammals, most positively selected substitution events occur outside of DNA that encodes protein [8]. Nevertheless, they are particularly concentrated in the ~1% of genomic sequence that is

four-times more human nucleotides present in copy number increased regions than in single nucleotide variant sites, copy number gain of human genes appears to be