Genesis of Non-Coding RNA Genes in Human Chromosome 22—A Sequence Connection with Protein Genes Separated by Evolutionary Time
Non-coding RNA
Perspective

Nicholas Delihas
Department of Microbiology and Immunology, Renaissance School of Medicine, Stony Brook University, Stony Brook, New York, NY 11794-5222, USA; [email protected]

Received: 16 July 2020; Accepted: 1 September 2020; Published: 3 September 2020

Abstract: A small phylogenetically conserved sequence of 11,231 bp, termed FAM247, is repeated in human chromosome 22 by segmental duplications. This sequence forms part of diverse genes that span evolutionary time, the protein genes being the earliest as they are present in zebrafish and/or mice genomes, and the long noncoding RNA genes and pseudogenes the most recent as they appear to be present only in the human genome. We propose that the conserved sequence provides a nucleation site for new gene development at evolutionarily conserved chromosomal loci where the FAM247 sequences reside. The FAM247 sequence also carries information in its open reading frames that provides protein exon amino acid sequences; one exon plays an integral role in immune system regulation, specifically, the function of ubiquitin-specific protease (USP18) in the regulation of interferon. An analysis of this multifaceted sequence and the genesis of genes that contain it is presented.

Keywords: de novo gene birth; gene evolution; protogene; long noncoding RNA genes; pseudogenes; USP18; GGT5; Alu sequences

1. Introduction

The genesis of genes has been a major topic of interest for several decades [1,2]. One mechanism of gene formation is by duplication of existing genes [1,3]. This is considered one of the major processes in protein gene development, but it has also been shown that there is a prevalence of gene birth from noncoding DNA via de novo processes [4–7]; this pathway also significantly contributes to new protein gene formation [4,7].
Working with yeast Saccharomyces cerevisiae genomic segments, Carvunis et al. [4] formulated an evolutionary model for the de novo development of protein genes in genetic regions where there are no annotated genes but where small open reading frames are translated. These regions are considered protogene elements that can develop into functional genes. With respect to long noncoding RNA (lncRNA) genes, Ulitsky and Bartel [8] have provided a comprehensive background on lncRNA transcripts and genes that includes a discussion of mechanisms of lncRNA gene origins, where some gene birth processes may be similar to those that operate in protein gene formation.

In this treatise, we analyze the development of long intergenic noncoding RNA (lincRNA) genes and pseudogenes from an evolutionarily conserved ancestral sequence. This is a repeat element found in human chromosome 22. It was previously termed clincRNA [9] and is now officially termed FAM247 by the HUGO Gene Nomenclature Committee (https://www.genenames.org/), and it constitutes the FAM247A-D gene family. The FAM247A gene sequence has previously been used as a guide to finding homologous sequences [9]. Hereafter, FAM247 is used in place of FAM247A.

Non-coding RNA 2020, 6, 36; doi:10.3390/ncrna6030036 www.mdpi.com/journal/ncrna

We previously proposed that FAM247 carries information to form a nucleation site for gene development [9]. This is best exemplified by the formation of pseudogenes through the addition of extraneous chromosomal sequences to specific sites on FAM247, a process that is described in this paper. This model can perhaps be considered analogous to the model of de novo protein gene development via a protogene element [4], with the FAM247 sequence serving as the protogene. During the formation of these pseudogenes, sequences that consist of full or partial copies of unprocessed parent protein genes are added to the FAM247 protogene, as well as sequences from other unrelated genomic regions, to form the final gene sequence.

Both the pseudogenes and the FAM247A-D lincRNA family genes appear to be human-specific. The FAM247 sequence is also found in the protein genes USP18 (ubiquitin-specific protease) and GGT5 (gamma-glutamyltransferase). Both of these genes date back in evolutionary time, USP18 over 350 million years ago (MYA) and GGT5 over 90 MYA. Thus, the FAM247 sequence has formed a part of genes through much of vertebrate evolution, and it continues to do so. The FAM247 sequence is of particular interest because of its evolutionary conservation and its presence in diverse genes, from humans to zebrafish. In addition, the FAM247 sequence carries information in its open reading frames, which are found to form exon sequences of proteins, the most important being the carboxy terminal amino acid sequence of USP18.

2. Background on Conserved Linked Sequences

FAM247 is present in different segmental duplications or low copy repeats (LCR22) in human chromosome 22 (chr22) as part of phylogenetically conserved linked gene sequences [9]. Figure 1 is a representation of these linked sequences, and it shows conserved nearest neighbor sequence signatures found in humans. The linked gene sequences are repeated in chr22 and generate gene families. The signatures are also representative of ancestral primate linked sequences, e.g., the sequence arrangement in Figure 1b is present in the Rhesus Old World monkey (Macaca mulatta), where FAM247 and spacer sequences are linked to GGT1 on chr10. The spacer sequence (3953 bp) depicted in Figure 1 is also conserved in primates. It is present in Pan troglodytes (chimpanzee), Papio anubis (olive baboon), Pongo abelii (Sumatran orangutan), and Macaca mulatta (Rhesus monkey) genomes, but it does not encode genes or form part of genes. This conservation indicates it may have a function. FAM247 is the common denominator in Figure 1a,b. In Figure 1c, the FAM247 sequence depicted is embedded in the USP18 gene.

Figure 1. Schematic representation of evolutionarily conserved linked sequences with different colors depicting different sequences and family genes, as described in [9]. (a,b) Conserved sequences that are found linked to GGT sequences. (c) The FAM247 sequence (green highlight) is embedded in the USP18 gene.

Table 1 contains a list of human gene families that are found in repeat units shown in Figure 1 and indicates the sequence or chromosomal locus of origin. For example, GGT represents the locus of origin of GGT1, the gamma-glutamyltransferase and gamma-glutamyltransferase light chain genes and their respective