Biological Function in the Twilight Zone of Sequence Conservation Chris P

Ponting BMC Biology (2017) 15:71 DOI 10.1186/s12915-017-0411-5 REVIEW Open Access Biological function in the twilight zone of sequence conservation Chris P. Ponting DNA cannot immediately be ascribed as either func- Abstract tional or non-functional. Population genetics principles Strong DNA conservation among divergent species is illuminate the functionality of sequence in the twilight an indicator of enduring functionality. With weaker zone. These can be used to assess whether sequence sequence conservation we enter a vast ‘twilight zone’ evolution has been constrained, meaning that it exhibits in which sequence subject to transient or lower aslowerrateofchangethanpredictedbyamodelof constraint cannot be distinguished easily from neutral evolution; selective constraint is inferred by neutrally evolving, non-functional sequence. Twilight considering the degree by which allele frequencies are zone functional sequence is illuminated instead by depressed across extant populations [3–5]. Conversely, principles of selective constraint and positive selection functional sequence subject to positive selection exhibits a using genomic data acquired from within a species’ rate of change greater than seen for neutrally evolving population. Application of these principles reveals that sequence. despite being biochemically active, most twilight zone Sequence conservation and constraint are not the only sequence is not functional. benchmark by which to evaluate functionality. High throughput experimental assays are providing genome- wide assessments of functional sequence. Armed with this Function versus conservation versus constraint experimental information, can we now reveal the extent of Functionality of most human protein coding, and some functional sequence and associated molecular and cellular non-coding, sequence is clearly implied when it is con- biology present in the twilight zone of low sequence conserved across diverse mammalian species. This has been servation? Here I review instances where sequence is func- a rule-of-thumb by which to infer whether a sequence is tional despite its low conservation, focusing principally on functional without the benefit of experimental data. Con- our own and other mammalian species. I conclude that servation, however, is not a faithful indicator of functional- population genomics-based approaches to predict function ity. High sequence conservation could reflect a relatively are paramount because, counterintuitively, experiments are brief period of neutral evolution over which few mutations not perfect predictors of function. accumulated. Just because approximately 98% of human DNA is conserved in chimpanzee, for example, this does not imply that this amount of sequence conveys function. A twilight zone protein-coding gene Conversely, poor conservation of a sequence does not The 2310003L06Rik gene exemplifies the rapidity with imply that it is devoid of function. After all, low conserva- which a locus can evolve (Fig. 1). Little is known about tion could also be explained by frequent episodes when its function, except that in mouse gene expression is rare mutations are brought to high frequency and fix- specific to the tongue. With regards to evolution, it is a ation within a population by positive selection. Thresh- member of the secretory calcium-binding phosphopro- olding on percentage nucleotide sequence identity thus tein (SCPP) gene family [6, 7] located in a tandem array fails to neatly separate functional from non-functional on mouse chromosome 5, including those encoding en- sequence. This means that as sequence conservation di- amel matrix proteins, milk caseins and salivary proteins, minishes we drop into a ‘twilight zone’ [1, 2] in which which mostly arose by local gene duplication and subse- quent divergence during early mammalian evolution. In Correspondence: [email protected] four respects, this gene is not well conserved: (i) it is MRC Human Genetics Unit, The Institute of Genetics and Molecular Medicine, University of Edinburgh, Western General Hospital, Crewe Road, present only in theria (marsupials and placental mammals) Edinburgh EH4 2XU, UK but not in monotremes; (ii) its amino acid sequence varies © Ponting et al. 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Ponting BMC Biology (2017) 15:71 Page 2 of 9 Fig. 1. Rapid evolution among 2310003L06Rik orthologues. a Nucleotide conservation is low across placental mammals, and the locus is not aligned with non-mammalian species. The mouse gene is incompletely predicted in Ensembl, and absent from other databases such as RefSeq and CCDS. b Rapid evolution of mammalian 2310003L06Rik orthologues. Open reading frames (shown in grey) are of highly variable length (amino acid numbers shown on the right) across mammalian species (phylogeny shown on left, not to scale). Human, chimpanzee and macaque genomes contain nucleotide substitutions that truncate the open-reading frame (“!”; pseudogene indicated in black). Deletions in the squirrel (Spermophilus tridecemlineatus) orthologue, relative to the dog, are indicated by “Δ”, and repeats in the dog sequence are shown by “R”. Aligned protein sequences are indicated by dotted blue and brown lines. There is no significant sequence similarity evident between mouse and opossum (Monodelphis domestica) orthologues (blastp E > 0.2). dN/dS is explained in the legend to Fig. 2 greatly, with a 3.7-fold difference in length between mouse in these orthologues’ initiating codon, their common and dog; (iii) it contains lineage-specific repeats and inser- number of exons and their splice sites. tions or deletions; and (iv) in some lineages, such as the Functional sites that are neither well conserved nor Catarrhini (including human), it has acquired open-reading constrained fall into two classes that differ in the rate by frame disruptions and thus has become a pseudogene. Nu- which they accumulate mutations relative to the neutral cleotide sequence similarities between closely related spe- rate (Fig. 2). Sites in the first class evolve rapidly due to cies, such as mouse and rat, differ little between exons and positive selection and adaptation. This is when rare mu- introns and its protein sequence has evolved at a rate near tations confer reproductive advantage leading to their to that of synonymous sites, often used as a neutral rate rise in frequency and their fixation in that population fas- proxy. Of all its many features, conservation is evident only ter than neutral mutations. In mammals, most positively Fig. 2. Trends regarding sequence constraint. Protein-coding sites that are highly constrained (dN/dS → 0) tend to fall within secondary structures within intracellular proteins expressed in many tissues, whereas the less-numerous sites that are evolving either near neutrally (dN/dS ≈ 1) or in response to positive selection (dN/dS > 1) tend to lie in disordered regions or in loops in secreted proteins that are expressed in a tissue-restricted manner [73–75]. The median value of dN/dS for human and mouse orthologues is 0.095 [76]. Inferences of positive selection (for example using PAML [77]) can be in error due to sequence misalignment [78, 79], or when alignments are short [80], or when dN exceeds dS because of chance fluctuations. dN/dS (also written as Ka/Ks or ω) [81] is the ratio of the number of nonsynonymous substitutions per nonsynonymous site (dN) to the number of synonymous substitutions per synonymous site (dS) Ponting BMC Biology (2017) 15:71 Page 3 of 9 selected substitution events occur outside of DNA that four-times more human nucleotides present in copy num- encodes protein [8]. Nevertheless, they are particularly ber increased regions than in single nucleotide variant concentrated in the ~1% of genomic sequence that is sites, copy number gain of human genes appears to be protein-coding and their density is low overall in the non- under little or no selection [23]. protein-coding portion [8]. The second class of functional To summarise the hallmarks of a rapidly evolving gene, yet poorly conserved sequence evolves by weak negative I return to the 2310003L06Rik protein-coding locus selection [9]. Such sites accumulate substitutions, on aver- (Fig. 1). It is a member of a large multi-gene family age, slower than the neutral rate and show a low degree of (namely SCPP genes) whose genes duplicated and be- constraint. Variants at these sites that have only a slight came pseudogenes rapidly over mammalian evolution; deleterious effect on fitness can become fixed in popula- it encodes a secreted protein, which means perhaps that tions, which is a consequence of natural selection being it is more likely to be engaged in inter-specific conflict unable to discriminate effectively between slightly deleteri- between host and pathogens; this protein’sstructureis ous and neutral mutations [3, 10] (discussed below). apparently flexible and disordered, which is less likely Has a poorly conserved homologous sequence diverged to evolve by purifying

Biological Function in the Twilight Zone of Sequence Conservation Chris P

Molecular Evolution and Nucleotide Sequences of the Maize Plastid Genes for the Cy Subunit of CFI (Atpa)And the Proteolipid Subunit of Cfo (Atph)

Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting Mathieu Blanchette and Martin Tompa

The for Report 07-08

Conserved Sequence Human Genome Transcription

The Most Conserved Genome Segments for Life Detection on Earth and Other Planets

Massive Turnover of Functional Sequence in Human and Other Mammalian Genomes

X CRG Annual Symposium

A Conserved Heptamer Motif for Ribosomal RNA Transcription

Novel Protein Domains and Repeats in Drosophila Melanogaster: Insights Into Structure, Function, and Evolution

Intron Evolution As a Population-Genetic Process

Conserved Sequence (B Lymphocyte-Specific Gene Regulation/Promoters) TRISTRAM G

Big Knowledge from Big Data in Functional Genomics