
non-coding RNA

Perspective

Genesis of Non-Coding RNA Genes in Human Chromosome 22: A Sequence Connection with Protein Genes Separated by Evolutionary Time

Nicholas Delihas

Department of Microbiology and Immunology, Renaissance School of Medicine, Stony Brook University, Stony Brook, New York, NY 11794-5222, USA; [email protected]

 Received: 16 July 2020; Accepted: 1 September 2020; Published: 3 September 2020 

Abstract: A small phylogenetically conserved sequence of 11,231 bp, termed FAM247, is repeated in human chromosome 22 by segmental duplications. This sequence forms part of diverse genes that span evolutionary time, the protein genes being the earliest as they are present in zebrafish and/or mice genomes, and the long noncoding RNA genes and pseudogenes the most recent as they appear to be present only in the human genome. We propose that the conserved sequence provides a nucleation site for new gene development at evolutionarily conserved chromosomal loci where the FAM247 sequences reside. The FAM247 sequence also carries information in its open reading frames that provides protein amino acid sequences; one exon plays an integral role in immune system regulation, specifically, the function of ubiquitin-specific protease (USP18) in the regulation of interferon. An analysis of this multifaceted sequence and the genesis of genes that contain it is presented.

Keywords: de novo gene birth; gene evolution; protogene; long noncoding RNA genes; pseudogenes; USP18; GGT5; Alu sequences

1. Introduction

The genesis of genes has been a major topic of interest for several decades [1,2]. One mechanism of gene formation is by duplication of existing genes [1,3]. This is considered one of the major processes in protein gene development, but it has also been shown that there is a prevalence of gene birth from noncoding DNA via de novo processes [4–7]; this pathway also significantly contributes to new protein gene formation [4,7]. Working with yeast genomic segments, Carvunis et al. [4] formulated an evolutionary model for the de novo development of protein genes in genetic regions where there are no annotated genes but where small open reading frames are present. These regions are considered protogene elements that can develop into functional genes. With respect to long noncoding RNA (lncRNA) genes, Ulitsky and Bartel [8] have provided a comprehensive background on lncRNA transcripts and genes that includes a discussion of mechanisms of lncRNA gene origins where some gene birth processes may be similar to those that operate in protein gene formation.

In this treatise, we analyze the development of long intergenic noncoding RNA (lincRNA) genes and pseudogenes by an evolutionarily conserved ancestral sequence. This is a repeat element found in human chromosome 22. It was previously termed clincRNA [9] and is now officially termed FAM247 by the HUGO Gene Nomenclature Committee (https://www.genenames.org/), and it constitutes the FAM247A-D gene family. The FAM247A gene sequence has previously been used as a guide to finding homologous sequences [9]. Hereafter, FAM247 is used in place of FAM247A.

Non-coding RNA 2020, 6, 36; doi:10.3390/ncrna6030036 www.mdpi.com/journal/ncrna

We previously proposed that FAM247 carries information to form a nucleation site for gene development [9]. This is best exemplified by the formation of pseudogenes through the addition of extraneous chromosomal sequences to specific sites on FAM247, a process that is described in this paper. This is a model that perhaps can be considered analogous to the model of de novo protein gene development via a protogene element [4], with the FAM247 sequence serving as the protogene. During the formation of these pseudogenes, sequences that consist of full or partial copies of unprocessed parent protein genes are added to the FAM247 protogene, as well as sequences that are added from other unrelated genomic regions to form the final gene sequence.

Both the pseudogenes and the FAM247A-D lincRNA family genes appear to be human-specific. The FAM247 sequence is also found in protein genes USP18 (ubiquitin-specific protease) and GGT5 (gamma-glutamyltransferase). Both these genes date back in evolutionary time, USP18 over 350 million years ago (MYA) and GGT5 over 90 MYA. Thus, the FAM247 sequence has formed a part of genes through much of vertebrate evolution, and it continues to do so. The FAM247 sequence is of particular interest because of its evolutionary conservation and its presence in diverse genes, from humans to zebrafish. In addition, the FAM247 sequence carries information in its open reading frames, which are found to form exon sequences of proteins, the most important being the carboxy terminal amino acid sequence of USP18.

2. Background on Conserved Linked Sequences

FAM247 is present in different segmental duplications or low copy repeats (LCR22) in human chromosome 22 (chr22) as part of phylogenetically conserved linked gene sequences [9]. Figure 1 is a representation of these linked sequences, and it shows conserved nearest neighbor sequence signatures found in humans. The linked gene sequences are repeated in chr22 and generate gene families. The signatures are also representative of ancestral primate linked sequences, e.g., the sequence arrangement in Figure 1b is present in the Rhesus Old World monkey (Macaca mulatta), where FAM247 and spacer sequences are linked to GGT1 on chr10. The spacer sequence (3953 bp) depicted in Figure 1 is also conserved in primates. It is present in Pan troglodytes (chimpanzee), Papio anubis (olive baboon), Pongo abelii (Sumatran orangutan), and Macaca mulatta (Rhesus monkey) genomes, but it does not encode genes or form part of genes. This conservation indicates it may have a function. FAM247 is the common denominator in Figure 1a,b. In Figure 1c, the FAM247 sequence depicted is embedded in the USP18 gene.

Figure 1. Schematic representation of evolutionarily conserved linked sequences with different colors depicting different sequences and family genes, as described in [9]. (a,b) Conserved sequences that are found linked to GGT sequences. (c) The FAM247 sequence (green highlight) is embedded in the USP18 gene.


Table 1 contains a list of human gene families that are found in repeat units shown in Figure 1 and indicates the sequence or chromosomal locus of origin. For example, GGT represents the locus of origin of GGT1, the gamma-glutamyltransferase and gamma-glutamyltransferase light chain genes and their respective pseudogenes; FAM247 is the sequence/locus of origin of GGT5.

Table 1. Human genes and gene families found in linked sequences: FAM230–FAM247–GGT, FAM230–USP18.

| Gene/Gene Family | Type | Locus Origin * |
|---|---|---|
| FAM230A-J | lincRNA | FAM230 |
| FAM247A-D | lincRNA | FAM247 |
| POM121L9P, POM121L10P | pseudogene | FAM247 |
| BCRP3 | pseudogene | FAM247 |
| GGT1, GGT2 | protein | GGT |
| GGTLC2 | protein | GGT |
| GGTLC3 | protein | GGT |
| GGT3P | pseudogene | GGT |
| GGT4P | pseudogene | GGT |
| GGTLC4P | pseudogene | GGT |
| GGTLC5P | pseudogene | GGT |
| GGT5 | protein | FAM247 |
| USP18 | protein | FAM247 |

* Some FAM230 family members, such as FAM230J, do not have the linked sequence signatures.

A description of genes is as follows. POM121L9P and POM121L10P are members of the POM121 transmembrane nucleoporin like 1 pseudogene family. BCRP3 is a member of the BCR pseudogene family. BCR, a gene of 137,529 bp, is the activator of RhoGEF and GTPase and was formerly termed breakpoint cluster region protein. The BCR gene is important clinically as it is associated with the production of the Philadelphia chromosome in chronic myelogenous leukemia [10,11]. POM121L9P and BCRP3 stem from the FAM247 sequence at chromosomal loci where the GGT sequence is found, as represented in Figure 1b. USP18 is the ubiquitin-specific peptidase gene, a member of the deubiquitinating protease family; the protein product plays a major role in interferon regulation [12], and it has multiple functions [13].

3. lincRNA Gene Families

The FAM230 lincRNA and FAM247 lincRNA gene families were named by the HUGO Gene Nomenclature Committee (https://www.genenames.org/) [14]. These genes exemplify how segmental duplications or low copy repeats in chromosome 22 are a driving element in the genesis and proliferation of lincRNA gene families. Ten FAM230 and five FAM247 genes are present in chr22 low copy repeats (LCR22) that originate from sequence duplications [9]. FAM230 family genes differ from one another in sequence, transcript sequence and exon number, and RNA expression in various fetal developing tissues [9,15,16]. Their functions are not known. FAM230 sequences are also present in primates, but these sequences are annotated as predicted protein genes or pseudogenes, not as lincRNA genes, e.g., more than eleven genes that contain the FAM230 sequence in chimpanzee are annotated as protein genes, and two FAM230 sequences in Rhesus monkey and olive baboon are annotated as pseudogenes. An example is LOC106992440, which is found in the Rhesus monkey, annotated as an uncharacterized pseudogene, and resides in a chr locus that is homologous to that of human FAM230D linked to USP18 in humans; the Rhesus LOC106992440 has 58% identity with the human lincRNA FAM230D. In this and other cases, the detection of an experimental transcript that verifies the computational prediction of a protein gene or pseudogene is essential. Evolutionarily, the FAM230 sequence may have originated in the Rhesus monkey or Old World monkeys as the FAM230 sequence is present in the Rhesus genome but is not found in the genome of the Prosimian primate ancestor, Philippine tarsier.

The FAM247 lincRNA gene family may have newly formed in humans as there are few or no differences in gene sequence or RNA transcript expression in somatic and fetal developing tissues [9,15]. A homologous sequence to FAM247 is present in chimpanzee and is linked to GGT2. It has the full-length FAM247 sequence [9], but the chimpanzee sequence has not been annotated as a gene. Segments of sequences homologous to FAM247 are found in other primate genomes (gorilla, orangutan, Rhesus monkey, Philippine tarsier), and the FAM247 sequence may date back evolutionarily to the house mouse and zebrafish (discussed below).

4. The FAM247 Sequence is Present in Diverse Genes

A significant property of the FAM247 sequence is that it forms part of diverse genes. Sequences homologous to FAM247 form genes that include lincRNA genes, pseudogenes, and protein genes (Figure 2). These genes stem from phylogenetically conserved nearest neighbor gene loci, where the FAM247 sequence is linked to adjacent genes that form signatures containing gene families, e.g., FAM230E-FAM247C-GGT3P [9] present in segmental duplication LCR22A and FAM230B-FAM247A-GGT2 in LCR22D. Other than the FAM247 lincRNA family genes, which contain the entire 11,231 bp, only segments of FAM247 are found to be part of other genes. The ends of these segments may represent sequence breaks, i.e., bp positions ~5958 and ~8000–8200 (Figure 2, see numbers above green highlighted FAM247). These are regions that contain Alu sequences (Figure 2, caption), and may provide sites for attachment of other sequences to FAM247. FAM247 contains a total of fifteen Alu elements.

Figure 2. Protein genes, lincRNA genes, and pseudogenes that stem from the FAM247 sequence and contain different sections of the FAM247 sequence (shown in bp position numbers above FAM247 highlighted in green). Analysis of FAM247 by the RepeatMasker program (http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker) shows the presence of Alu sequences in the FAM247 sequence at bp positions 6007–6285 and 8063–8374, the regions close to breaks. The segment that includes bp positions 8063–8374 of FAM247 has seven Alu elements in tandem repeats.
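The proximity of the proposed sequence breaks to the Alu-containing regions can be checked with simple interval arithmetic. The following is a minimal sketch that uses only the bp positions quoted in the text and in the Figure 2 caption. The approximate break at ~5958 is treated as a single position and the ~8000–8200 break as a range, which is an assumption made for illustration. The function names are hypothetical and not part of any published tool.

```python
# Coordinates are 1-based, inclusive bp positions on the 11,231 bp FAM247 sequence.
ALU_REGIONS = [(6007, 6285), (8063, 8374)]    # Alu-containing regions (Figure 2 caption)
BREAK_REGIONS = [(5958, 5958), (8000, 8200)]  # approximate breaks (assumed as point/range)


def gap_between(a, b):
    """Return the distance in bp between two closed intervals (0 if they overlap)."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end < b_start:
        return b_start - a_end
    if b_end < a_start:
        return a_start - b_end
    return 0


def nearest_alu_gap(break_region, alu_regions=ALU_REGIONS):
    """Distance from a break region to the closest listed Alu-containing region."""
    return min(gap_between(break_region, alu) for alu in alu_regions)


if __name__ == "__main__":
    for region in BREAK_REGIONS:
        print(region, "-> distance to nearest Alu region:", nearest_alu_gap(region), "bp")
```

Under these assumptions, the first break lies a few tens of bp upstream of the first Alu-containing region and the second break range overlaps the second region, which is in line with the statement that the Alu sequences are close to the breaks.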

4.1. USP18

A comparison of USP18 chromosomal coordinates at loci in different species shows that the position is evolutionarily conserved relative to adjacent genes (Figure 3). This provides nearest neighbor signatures where evolutionary history (genetic synteny) and origins of USP18 can be assessed. Two neighbor genes, PEX26 (peroxisomal biogenesis factor 26) and TUBA8 (tubulin alpha 8), are in homologous loci that show a conserved orientation with respect to each other in the chromosomes of mice (Mus musculus, house mouse) and primates (Figure 3a–c). In zebrafish (Danio rerio, a member of the Cyprinidae family of freshwater fish), the tubulin gene (termed tuba8l4, tubulin alpha 8 like 4) appears to have moved to a different chromosome and developed into two genes, TUBAa and TUBAb; this results in PEX26 and USP18 as immediate neighbor genes in zebrafish chr4 (Figure 3d). The nearest neighbor history of the USP18 gene locus and the display of evolutionary conservation of gene positions relative to each other are consistent with a common lineage of the USP18 gene. Previously, Ulitsky and Bartel [8] provided an interesting analysis of human, mouse, and zebrafish vertebrate genomes with respect to the concept of synteny of genetic loci and the lineage of lincRNA genes through evolution.
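The nearest neighbor comparison described above can be illustrated with a small sketch. The gene orders below are assumptions inferred from the description in the text (PEX26 and TUBA8 adjacent to USP18 in mouse and primates, and the tubulin gene absent from the zebrafish locus); they are not coordinates taken from the NCBI gene maps, and gene orientation is ignored. The function names are hypothetical.

```python
# ASSUMED gene orders for illustration only, inferred from the text.
GENE_ORDER = {
    "human": ["PEX26", "TUBA8", "USP18"],
    "mouse": ["PEX26", "TUBA8", "USP18"],
    "zebrafish": ["PEX26", "USP18"],  # tubulin gene relocated to another chromosome
}


def neighbors(order, gene):
    """Return the set of genes immediately adjacent to `gene` in an ordered locus."""
    i = order.index(gene)
    return {order[j] for j in (i - 1, i + 1) if 0 <= j < len(order)}


def shared_genes_same_order(order_a, order_b):
    """True if the genes common to both loci appear in the same relative order."""
    common = set(order_a) & set(order_b)
    return [g for g in order_a if g in common] == [g for g in order_b if g in common]


if __name__ == "__main__":
    for species, order in GENE_ORDER.items():
        print(species, "USP18 neighbors:", sorted(neighbors(order, "USP18")))
    print("human vs zebrafish order conserved:",
          shared_genes_same_order(GENE_ORDER["human"], GENE_ORDER["zebrafish"]))
```

A fuller analysis would work from annotated chromosomal coordinates and strand information rather than hand-entered lists.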

Figure 3. Genes that are adjacent to USP18 are found in different species and are shown in the chromosomal and gene maps. (a–d) Drawings of gene arrangements are taken directly from the NCBI website: https://www.ncbi.nlm.nih.gov/gene [17].

To analyze the phylogenetic relatedness of USP18 gene nt and aa sequences, sequences were aligned from zebrafish, the house mouse, three primate species, and humans. The resultant percent sequence identities mimic evolutionary distances between the species (Table 2), with a linear change in nt and aa sequences with time between the primate and mouse species (not shown). The pattern shows a continuum of gene nt and protein aa sequence change with evolutionary time and is consistent with a common lineage of the USP18 gene that dates to an ancestor of zebrafish, more than 350 million years ago (MYA). This parallels the nearest neighbor gene history of USP18.
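The relationship between percent identity and evolutionary age can be examined directly from the values reported in Table 2. The sketch below is one possible way to do this and is not the analysis performed for the paper: it computes a Pearson correlation and a least-squares line over the human-to-mouse rows, excluding zebrafish because the text refers to linearity between the primate and mouse species. The function and variable names are illustrative.

```python
from math import sqrt

# (species, USP18 nt % identity, USP18 aa % identity, approximate age in MYA), from Table 2
TABLE_2 = [
    ("human", 100, 100, 0),
    ("chimpanzee", 99, 99, 6),
    ("Rhesus monkey", 92, 94, 25),
    ("Philippine tarsier", 66, 80, 50),
    ("mouse", 51, 71, 90),
    ("zebrafish", 39, 31, 350),
]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def least_squares(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Restrict to the primate and mouse rows (assumption: zebrafish is left out of the fit).
rows = [row for row in TABLE_2 if row[0] != "zebrafish"]
ages = [row[3] for row in rows]
nt_identity = [row[1] for row in rows]
aa_identity = [row[2] for row in rows]

r_nt = pearson_r(ages, nt_identity)
r_aa = pearson_r(ages, aa_identity)
slope_nt, intercept_nt = least_squares(ages, nt_identity)

if __name__ == "__main__":
    print(f"nt identity vs age: r = {r_nt:.3f}, slope = {slope_nt:.3f} % per MY")
    print(f"aa identity vs age: r = {r_aa:.3f}")
```

With only five data points and approximate divergence times, such a fit is indicative at best.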


Table 2. USP18 gene and protein sequence identities and evolutionary time between species.

| Species | USP18 Gene nt Sequence %Identity | USP18 aa Sequence %Identity | Evolutionary Age (MYA) * |
|---|---|---|---|
| human | 100% | 100% | 0 MYA |
| chimpanzee | 99% | 99% | 6 MYA |
| Rhesus monkey | 92% | 94% | 25 MYA |
| Philippine tarsier | 66% | 80% | 50 MYA |
| mouse | 51% | 71% | 90 MYA |
| zebrafish | 39% | 31% | 350 MYA |

* approximate age in million years ago (MYA) [18].

4.2. USP18 Exon 11

Both human exon 11, which encodes the last 14 aa (the carboxy terminal end) of the USP18 peptidase, and the 3′UTR of the USP18 mRNA sequence are provided by the FAM247 sequence [8]. The identity between the FAM247 nt sequence and the human/primate exon 11 nt sequences is 100%, with the exception of that of Philippine tarsier (Figure 4). The sequence of the carboxy terminal exon is more stable than that of the entire gene (compare with Table 2). The identities of the USP18 3′UTR sequences from various species, compared to FAM247 (Table 3), show that the 3′UTR sequence is also conserved in primates, but to a lesser extent than that of exon 11, and is more similar to the USP18 gene.

Figure 4. Alignment of the USP18 terminal exon nt sequences from seven species compared with the FAM247 sequence. Data were obtained using the EBI Clustal Omega sequence alignment and phylogeny programs. The EMBL-EBI Clustal Omega Multiple Sequence Alignment program [19] at website http://www.ebi.ac.uk/Tools/msa/clustalo/ was used for nt sequence alignment. (a) Phylogenetic tree of USP18 terminal exon sequences from seven species and the FAM247 sequence. (b) The percent identities created using Clustal 2.1. (c) Alignment of nt sequences. USP18 gene sequences were accessed from the NCBI website https://www.ncbi.nlm.nih.gov/gene [17].


The nt sequence similarity of 52% between FAM247 and the zebrafish last exon (Figure 4b), the presence of a number of invariant nt positions (Figure 4c), and the similarity with the 3′UTR sequence (53%; Table 3) suggest that this part of the FAM247 sequence was present in the USP18 sequence of zebrafish. The invariant nt residues of exon 11, e.g., positions nt 5–9 (Figure 4c), may relate to the functional importance of the USP18 carboxy terminal end in the regulation of the immune system by USP18 [12,13]. These invariant nt positions may give a picture, albeit a small one, of what the ancient FAM247 looked like.
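The two quantities used in this argument, pairwise percent identity and invariant alignment positions, can be computed from any gapped alignment. The sketch below is a generic illustration and not a reimplementation of Clustal: the identity matrix reported by Clustal may use a different denominator, and the sequences shown are short hypothetical strings, not the USP18 or FAM247 data.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent of aligned columns, excluding columns with a gap, that are identical."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        matches += a == b
    return 100.0 * matches / compared if compared else 0.0


def invariant_columns(aligned, gap="-"):
    """1-based alignment columns where all sequences share the same non-gap residue."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have the same length")
    positions = []
    for i in range(length):
        column = {seq[i].upper() for seq in aligned}
        if len(column) == 1 and gap not in column:
            positions.append(i + 1)
    return positions


if __name__ == "__main__":
    # Hypothetical toy alignment, for illustration only.
    toy = ["ACGTACGT-A", "ACGAACGTTA", "ACGTTCGTTA"]
    print(percent_identity(toy[0], toy[1]))
    print(invariant_columns(toy))
```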

Table 3. Nucleotide sequence identities of 3′UTR of USP18 from different species.

| Source of nt Sequence | % Identity Relative to FAM247 3′ End |
|---|---|
| FAM247A 3′ end nt 10,653–11,231 | 100 |
| USP18 3′UTR human | 99.8 |
| USP18 3′UTR chimp | 98.6 |
| USP18 3′UTR Rhesus monkey | 90.2 |
| USP18 3′UTR Philippine tarsier | 71.9 |
| USP18 3′UTR mouse | 49.5 |
| USP18 3′UTR zebrafish | 53.1 |

Figure 5 shows the USP18 aa sequence percent identity, sequence alignments, and a phylogenetic tree produced from an alignment of USP18 terminal exon aa sequences from different species with the translated aa sequence of FAM247. Eight of the 14 amino acid residues that form the terminal exon are totally conserved from primates to zebrafish, together with the FAM247 translated aa sequence (Figure 5, bottom). The USP18 carboxy terminal peptide sequence interacts with the IFNAR2 interferon receptor, and this sequence is an important regulator of IFN signaling [12]; in addition, the carboxyl end sequence functions in deISGylation [13,20,21]. A mutation in L365 in the exon 11 sequence 359QETAYLL365VYMKMEC372 abolishes deISGylation and IFNAR2 binding [20,21]; L365 is one of the evolutionarily conserved amino acids of exon 11 (Figure 5). The mutation may alter the protein conformation that is necessary for USP18 to function. On the other hand, the high number of aa residues conserved, relative to the FAM247 translated aa sequence, further supports the proposal that the FAM247 sequence was present in zebrafish USP18.
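The residue numbering of the exon 11 peptide can be made explicit with a short sketch. It uses the 14 aa sequence and the 359–372 numbering given in the text; the helper names are hypothetical, and the substituted residue in the usage example is an arbitrary placeholder, since the text does not state which amino acid replaces L365.

```python
EXON11_PEPTIDE = "QETAYLLVYMKMEC"  # human USP18 residues 359-372, as written in the text
FIRST_RESIDUE = 359                # numbering of the first residue in the peptide


def residue_at(position, peptide=EXON11_PEPTIDE, first=FIRST_RESIDUE):
    """Return the one-letter amino acid at a given protein residue number."""
    index = position - first
    if not 0 <= index < len(peptide):
        raise ValueError(f"position {position} is outside the peptide")
    return peptide[index]


def substitute(position, new_residue, peptide=EXON11_PEPTIDE, first=FIRST_RESIDUE):
    """Return the peptide with one residue replaced (illustrative point substitution)."""
    index = position - first
    if not 0 <= index < len(peptide):
        raise ValueError(f"position {position} is outside the peptide")
    return peptide[:index] + new_residue + peptide[index + 1:]


if __name__ == "__main__":
    print(len(EXON11_PEPTIDE), residue_at(365))
    # "A" is a hypothetical replacement chosen only to show the call.
    print(substitute(365, "A"))
```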

Figure 5. Alignment of the USP18 terminal exon amino acid sequences from seven species compared with the FAM247 translated amino acid sequence. Data were obtained using the EBI Clustal Omega sequence alignment and phylogeny programs. The EMBL-EBI Clustal Omega Multiple Sequence Alignment program [19] at website http://www.ebi.ac.uk/Tools/msa/clustalo/ was used for aa sequence alignment. (Top) Phylogenetic tree of USP18 terminal exon aa sequences from seven species and the FAM247 aa sequence. (Middle) The percent identities were created using Clustal 2.1. (Bottom) Alignment of aa sequences.



4.3. GGT5

The human GGT5 protein gene resides in chromosomal segmental duplication LCR22G and is linked to pseudogene POM121L9P with a spacer sequence, and the pseudogene GGTLC4P is situated between GGT5 and POM121L9P (Figure 6) [9]. The GGT5 nearest gene/sequence arrangement is more complex than that of the signatures shown in Figure 1b. GGT5 is an anomaly as its sequence does not stem from a GGT locus, as other GGT family members do, but from the chromosomal site containing the FAM247 sequence [9]. GGT5 carries a sequence homologous to the 5′ half of the FAM247A sequence, bp positions 1–5958 (Figures 2 and 6), and POM121L9P contains part of the 3′ half of FAM247 (bp 5949–8219). The FAM247 fragments end where there are or were Alu sequences. The GGTLC4P pseudogene derives its sequence from GGT (Figure 6).
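As a small worked check of the coordinates quoted above, the following sketch computes the length of each FAM247 segment, its share of the 11,231 bp sequence, and the overlap between the two segments. Only the bp positions stated in the text are used; the function names are illustrative.

```python
FAM247_LENGTH = 11_231  # bp, full-length FAM247

# 1-based, inclusive bp positions on FAM247, as quoted in the text.
SEGMENTS = {
    "GGT5": (1, 5958),
    "POM121L9P": (5949, 8219),
}


def segment_length(segment):
    """Length in bp of a closed interval (start, end)."""
    start, end = segment
    return end - start + 1


def overlap_length(a, b):
    """Number of bp shared by two closed intervals (0 if they do not overlap)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start + 1)


if __name__ == "__main__":
    for name, seg in SEGMENTS.items():
        share = 100.0 * segment_length(seg) / FAM247_LENGTH
        print(f"{name}: {segment_length(seg)} bp ({share:.1f}% of FAM247)")
    print("shared bp:", overlap_length(SEGMENTS["GGT5"], SEGMENTS["POM121L9P"]))
```

The two segments share a short stretch around position ~5958, the same region identified earlier as a likely sequence break.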

Figure 6. The genes linked to GGT5 in LCR22G with nearest neighbor arrangements (top schematic) and the source of sequences found in human-linked genes GGT5–GGTLC4P–POM121L9P.

GGT5 and POM121L9P appear to have been formed at very different evolutionary times. FAM247 is part of the GGT5 genes that are in nonhuman primates, including Philippine tarsier. In addition, FAM247 provides the sequence found in exon 1 of GGT5. There is a significant similarity between the FAM247 nt sequence and that of the mouse GGT5 exon 1 (Figure 7, top). There is not enough evidence to suggest that the mouse GGT5 contains the entire 5′ half of the FAM247 sequence, but the alignment of the mouse exon 1 nt sequence with FAM247 shows that a significant number of nucleotides are invariant (Figure 7, bottom). Although there is invariance in 50 out of 173 nt between the FAM247 sequence and zebrafish GGT5 exon 1, the zebrafish exon sequence shows significant differences, which makes it difficult to further assess a sequence similarity. The exon 1 data are consistent with the GGT5 gene having formed with the FAM247 sequence before the evolutionary appearance of primates, with the sequence already appearing in mice.

Non-coding RNA 2020, 6, 36 9 of 13

the mouse exon 1 nt sequence with FAM247 shows that a significant number of nucleotides are invariant (Figure7, bottom). Although there is invariance in 50 out of 173 nt between the FAM247 sequence and zebrafish GGT5 exon 1, the zebrafish exon sequence shows significant differences, which makes it difficult to further assess a sequence similarity. The exon 1 data are consistent with the formation of the GGT5 gene with the FAM247 sequence that occurred before the evolutionary appearance of primates Non-Codingand appearing RNA 2020 in, mice.6, x FOR PEER REVIEW 9 of 13
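The percent identity values in Figure 7 and the 50-of-173 figure quoted above come down to counting identical columns in an alignment. A minimal sketch in Python, assuming one common convention (identical columns divided by columns in which neither sequence has a gap); this is not necessarily the exact formula used by Clustal, and the function name and the two toy sequences are hypothetical, not the real FAM247 or GGT5 exon 1 data:

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of identical positions among columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # gap column: excluded from the comparison (assumed convention)
        compared += 1
        if x == y:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared


# Toy aligned fragments (hypothetical, for illustration only).
toy_fam247 = "ATGGC-TTAGC"
toy_exon1 = "ATGACGTT-GC"
toy_identity = percent_identity(toy_fam247, toy_exon1)

# The zebrafish value quoted in the text: 50 invariant nt out of 173.
zebrafish_invariant_pct = 100.0 * 50 / 173

print(f"toy identity: {toy_identity:.1f}%")
print(f"zebrafish exon 1 invariance: {zebrafish_invariant_pct:.1f}%")
```

Under this convention the zebrafish comparison corresponds to roughly 29% invariant positions, which is in line with the text's caution about assessing that similarity further.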

Figure 7. (Top) The percent identity of the FAM247 sequence with that of GGT5 exon 1 sequences from four species. Data were obtained using Clustal 2.1. (Bottom) Alignment of the GGT5 exon nucleotide sequences from the four species, compared with the FAM247 sequence, positions 192–567. Data were obtained using the EBI Clustal Omega sequence alignment and phylogeny programs. The EMBL-EBI Clustal Omega Multiple Sequence Alignment program [19] is at website http://www.ebi.ac.uk/Tools/msa/clustalo/.

4.4. Pseudogene POM121L9P

POM121L9P has a very different sequence compared to the other POM121LP family pseudogenes, and it is a unique sequence. A schematic of the compositional make-up of the POM121L9P gene shows that it contains most of the sequence homologous to the putative parent gene, protein gene POM121L1 (LOC101929738 putative POM121-like protein 1, 2379 bp) on its 5′ side, and the BCRP1 pseudogene sequence (that is homologous to the 3′ section of the BCR gene that includes BCR terminal exons 19–23) on its 3′ side (Figure 8). FAM247 may have formed a nucleation site for the addition of these motifs, which are copies of sequences from different regions of the genome, and developed the POM121L9P gene. The sequence motifs are found attached to 5′ and 3′ ends of FAM247 at FAM247 bp positions where there are Alu sequences (FAM247 bp positions 5949 and 8219). Additionally, the complete POM121L-1 sequence has an Alu sequence at bp positions 2309–2379 (the end of POM121L-1 is position 2304, at the attachment site with FAM247). Alu sequences may have facilitated the addition of POM121L-1 to FAM247. The BCR sequence addition to POM121L9P is more complex as there is an undefined sequence between the two (bp positions 4479–5779 on POM121L9P), and there are no Alu sequences detected in the BCR sequence at the junction site. The human POM121L9P pseudogene RNA transcript is highly expressed in somatic testis tissue, and there is a broad expression of circular RNAs in developing fetal tissues with major expression in lung and adrenal tissues [15,16]. Its functions are not known, but they should be of interest in view of the strong RNA expression levels.

Figure 8. A schematic of the compositional make-up of the pseudogene POM121L9P. The numbers under the motifs shown represent the bp positions on the POM121L9P sequence. The FAM247 sequence that forms part of POM121L9P consists of FAM247 positions 5949–8219, where there are Alu sequences at both ends.

In a homologous nearest neighbor gene arrangement that is present in chimpanzee chr22, the genes are annotated as glutathione hydrolase light chain 2 gene (LOC749018) and putative POM121-like protein 1 gene (LOC112206778); these are linked to GGT5 through the spacer sequence (Figure 9). Thus, the human pseudogenes GGTLC4P and POM121L9P sequences are annotated as protein genes in the homologous chromosomal loci of chimpanzee. This is another example of human ncRNA gene sequences that are annotated as protein genes in nonhuman primates, but the isolation of protein products from the chimpanzee genes is essential to add any significance to it. Of importance is that 69% of the POM121L9P sequence is present in the chimpanzee genome with 98% identity at the genomic region where there is evolutionary synteny with the comparable chromosomal locus that resides in chimpanzee. There are no FAM247 or POM121L9P sequences that have been found linked to GGT5 in Rhesus. It appears that the development of the POM121L9P sequence may have begun in the chimpanzee but with a partial sequence.

Figure 9. Nearest neighbor gene arrangements in human and chimpanzee chromosomal loci where the GGT5 gene resides.

4.5. Pseudogenes BCRP3 and POM121L10P

Human pseudogenes BCRP3 and POM121L10P are linked to GGT1 in the gene/sequence arrangement GGT1-spacer-BCRP3-POM121L10P, which is present in chr22 LCR22H. FAM247 forms part of the two pseudogenes: BCRP3, which has the FAM247 positions 33–5958, and POM121L10P, which has positions 5957–8219 (Figure 2). Thus, parts of the 5′ and 3′ regions of FAM247 are found in these linked genes, which is similar to the presence of FAM247 in genes GGT5 and POM121L9P.

BCRP3 is a member of the BCRP pseudogene family consisting of eight pseudogenes, all of which contain the homologous sequence of the 3′ end sequence of the BCR protein gene except for BCRP8. BCRP3 is one of the family members that differs as it contains additional sequence motifs (Figure 10) and is the only BCRP family member that contains the FAM247 sequence. The BCRP3 gene appears to have a unique sequence. The compositional make-up of BCRP3 shows that its 5′ side has the FAM247 sequence, which is followed by a 4255-bp segment of the immunoglobulin lambda locus (IGL) and the 3′ end of the BCR sequence (Figure 10). The IGL sequence (nt positions 590,381–594,292 from the Homo sapiens immunoglobulin lambda locus (IGL) on chromosome 22) [17] is homologous to the IGL locus V segments and three C segments, which are not known to encode immunoglobulin proteins. The IGL sequence has an Alu sequence at the junction with FAM247, which may relate to the process of attachment of IGL to FAM247 during the maturation of the pseudogene. The IGL sequence is made up entirely of 8 Alu sequences, 7 LINE elements, other transposable elements, and small repeats (http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker). In terms of RNA expression, the pseudogene shows a broad expression of linear RNA in 27 normal somatic tissues and a broad expression of circular RNA in developing fetal tissues [15,16].

Figure 10. Sequence motif of the BCRP3 gene. Positions 6226–10480 of BCRP3 span the IGL insert. The total length of BCRP3 is 20446 bp.

The POM121L10P sequence is linked to BCRP3 on chr22. It also contains the FAM247 sequence (Figure 2). POM121L10P is compositionally made up of nearly the entire sequence of the related pseudogene POM121L1P but has a 1062-bp sequence at its 3′ end that consists of a copy of the 3′ end of the BCR gene. POM121L10P also appears to be a unique gene construct. The POM121L10P linear RNA transcript is strongly expressed in testes; circular RNAs are broadly expressed in fetal tissues [15,16]. Thus, both this gene and BCRP3 show a robust RNA expression. It should be pointed out that there are additional POM121LP pseudogene family members that carry the FAM247 sequence, but they are not addressed here.

In the Rhesus genome, some genes/neighbor sequences display synteny. The Rhesus GGT1 is linked to the spacer sequence and followed by the FAM247 sequence, which is similar to that of the human GGT1 gene/sequence arrangement. Rhesus gene LOC107000612, annotated as a “breakpoint cluster region protein-like”, is situated close to GGT1. This is part of the homologous chromosomal region where the pseudogene BCRP3 resides in the human genome; approximately 78% of the human BCRP3 sequence is present in the Rhesus genome at this locus. The BCRP3 sequence has not been detected in the early primate Philippine tarsier. Thus, the earliest appearance of the BCRP3 sequence is in the Rhesus species, and the sequence appears to have matured into a pseudogene in humans. There was a large chromosomal expansion of the Rhesus monkey genome between genes GGT1 and GGT5. The chromosomal length between genes GGT1 and GGT5 in Philippine tarsier is 2872 bp; in the Rhesus monkey, it is 216,200 bp. Thus, there is a 75-fold sequence expansion between GGT1 and GGT5 in Rhesus. Segments of the BCRP3 gene may have formed with this chromosomal expansion. This may account for the source of the BCRP3 sequence in Rhesus, but again, the sequence is not found in Philippine tarsier.

Using Figures 8 and 10, a model of de novo gene development from protogene sequences can be visualized whereby the FAM247 sequence, which is present in genomic regions that display synteny, forms nucleation sites where other genomic sequences are added during the maturation process to complete the pseudogene structure.
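The segment sizes and the fold expansion quoted in this section can be checked with simple interval arithmetic. The sketch below is only illustrative: it assumes that the reported bp positions are 1-based and inclusive (a convention not stated in the text), and the helper name `span_length` is hypothetical.

```python
def span_length(start: int, end: int) -> int:
    """Length in bp of an interval, assuming 1-based inclusive coordinates."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return end - start + 1


# IGL insert in BCRP3, positions taken from the Figure 10 caption.
igl_insert_bp = span_length(6226, 10480)

# FAM247 portions reported for the two linked pseudogenes (Section 4.5).
fam247_in_bcrp3_bp = span_length(33, 5958)
fam247_in_pom121l10p_bp = span_length(5957, 8219)

# GGT1-to-GGT5 chromosomal distances quoted in the text.
tarsier_gap_bp = 2872
rhesus_gap_bp = 216_200
fold_expansion = rhesus_gap_bp / tarsier_gap_bp

print(igl_insert_bp, fam247_in_bcrp3_bp, fam247_in_pom121l10p_bp, round(fold_expansion, 1))
# prints: 4255 5926 2263 75.3
```

With inclusive coordinates, the Figure 10 positions give the 4255-bp IGL segment stated in the text, and the Rhesus-to-tarsier ratio agrees with the reported 75-fold expansion.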


5. Conclusions

Both the FAM247 lincRNA gene family and pseudogenes appear to have the FAM247 sequence as a foundation for gene development; however, the mechanism of formation and the compositional make-up of lincRNA genes and pseudogenes differ greatly. The FAM247 family (as well as the FAM230 lincRNA gene family) was formed by gene duplication, and family members display sequences that are “variations on a theme”. Although pseudogenes BCRP3, POM121L9P, and POM121L10P contain duplications of parts of, or entire, parent protein genes, they were formed differently, by a de novo process of addition of large unrelated genomic sequences to the FAM247 sequence, with the resultant formation of unique pseudogene sequences. Alu elements are present in FAM247 at sites of attachment, and these may contribute to the process of sequence addition, possibly by Alu/Alu recombination. As these pseudogenes are unique, with large sequences unrelated to the parent protein genes, the question is whether they should be called pseudogenes. How the USP18 and GGT5 protein genes developed is not known, but a putative ancient FAM247 sequence was likely involved. A separate but important aspect of the FAM247 sequence in cellular and molecular functions is that it contributes the amino acid sequence for protein exons, the first exon of GGT5 and the last exon of USP18. The functions of the carboxy terminal aa sequence of USP18 are of major significance because of its important role in the regulation of the immune system. A search for possible nucleation elements similar to FAM247 is important in order to determine the prevalence of this type of protogene. An analysis of repeat sequences that are part of phylogenetically conserved nearest-neighbor genes/sequences in human chromosomes that have a large number of segmental duplications, e.g., chr 15, 16, and X [22], may help find other gene-forming elements. BLAT searches with the Ensembl program using lncRNA gene sequences as a query can help locate sequences shared with other genes, including protein genes. Use of the NCBI BLAST/align two sequences program may reveal small sequence segments that are present in diverse genes.

Funding: This research received no external funding.

Conflicts of Interest: The author declares no conflict of interest.

References

1. Ohno, S. Gene duplication and the uniqueness of vertebrate genomes circa 1970–1999. Semin. Cell Dev. Biol. 1999, 10, 517–522.
2. Jacob, F. Evolution and tinkering. Science 1977, 196, 1161–1166.
3. Wang, W.; Yu, H.; Long, M. Duplication-degeneration as a mechanism of gene fission and the origin of new genes in Drosophila species. Nat. Genet. 2004, 36, 523–527.
4. Carvunis, A.R.; Rolland, T.; Wapinski, I.; Calderwood, M.A.; Yildirim, M.A.; Simonis, N.; Charloteaux, B.; Hidalgo, C.A.; Barbette, J.; Santhanam, B.; et al. Proto-genes and de novo gene birth. Nature 2012, 487, 370–374.
5. McLysaght, A.; Guerzoni, D. New genes from non-coding sequence: The role of de novo protein-coding genes in eukaryotic evolutionary innovation. Philos. Trans. R. Soc. B Biol. Sci. 2015, 370, 20140332.
6. Schlotterer, C. Genes from scratch: The evolutionary fate of de novo genes. Trends Genet. 2015, 31, 215–219.
7. Van Oss, S.B.; Carvunis, A.R. De novo gene birth. PLoS Genet. 2019, 15, e1008160.
8. Ulitsky, I.; Bartel, D.P. lincRNAs: Genomics, evolution, and mechanisms. Cell 2013, 154, 26–46.
9. Delihas, N. Formation of human long intergenic non-coding RNA genes, pseudogenes, and protein genes: Ancestral sequences are key players. PLoS ONE 2020, 15, e0230236.
10. Nowell, P.; Hungerford, D. A minute chromosome in human chronic granulocytic leukemia. Science 1960, 132, 1497.
11. De Klein, A.; van Kessel, A.G.; Grosveld, G.; Bartram, C.R.; Hagemeijer, A.; Bootsma, D.; Spurr, N.K.; Heisterkamp, N.; Groffen, J.; Stephenson, J.R. A cellular oncogene is translocated to the Philadelphia chromosome in chronic myelocytic leukaemia. Nature 1982, 300, 765–767.
12. Arimoto, K.I.; Löchte, S.; Stoner, S.A.; Burkart, C.; Zhang, Y.; Miyauchi, S.; Wilmes, S.; Fan, J.B.; Heinisch, J.J.; Li, Z.; et al. STAT2 is an essential adaptor in USP18-mediated suppression of type I interferon signaling. Nat. Struct. Mol. Biol. 2017, 24, 279–289.
13. Honke, N.; Shaabani, N.; Zhang, D.E.; Hardt, C.; Lang, K.S. Multiple functions of USP18. Cell Death Dis. 2016, 7, e2444.
14. Bruford, E.A.; Braschi, B.; Denny, P.; Jones, T.E.M.; Seal, R.L.; Tweedie, S. Guidelines for human gene nomenclature. Nat. Genet. 2020, 52, 754–758.
15. Szabo, L.; Morey, R.; Palpant, N.J.; Wang, P.L.; Afari, N.; Jiang, C.; Parast, M.M.; Murry, C.; Laurent, L.C.; Salzman, J. Statistically based splicing detection reveals neural enrichment and tissue-specific induction of circular RNA during human fetal development. Genome Biol. 2015, 16, 126.
16. Fagerberg, L.; Hallström, B.M.; Oksvold, P.; Kampf, C.; Djureinovic, D.; Odeberg, J.; Habuka, M.; Tahmasebpoor, S.; Danielsson, A.; Edlund, K.; et al. Analysis of the human tissue-specific expression by genome-wide integration of transcriptomics and antibody based proteomics. Mol. Cell. Proteom. 2014, 13, 397–406.
17. O’Leary, N.A.; Wright, M.W.; Brister, J.R.; Ciufo, S.; Haddad, D.; McVeigh, R.; Rajput, B.; Robbertse, B.; Smith-White, B.; Ako-Adjei, D.; et al. Reference sequence (RefSeq) database at NCBI: Current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016, 44, D733–D745.
18. Siepel, A. Phylogenomics of primates and their ancestral populations. Genome Res. 2009, 19, 1929–1941.
19. Madeira, F.; Park, Y.M.; Lee, J.; Buso, N.; Gur, T.; Madhusoodanan, N.; Basutkar, P.; Tivey, A.R.N.; Potter, S.C.; Finn, R.D.; et al. The EMBL-EBI Search and Sequence Analysis Tools APIs in 2019. Nucleic Acids Res. 2019, 47, W636–W641.
20. Malakhov, M.P.; Malakhova, O.A.; Kim, K.I.; Ritchie, K.J.; Zhang, D.E. Protein ISGylation Modulates the JAK-STAT Signaling Pathway. J. Biol. Chem. 2002, 277, 9976–9981.
21. Dauphinee, S.M.; Richer, E.; Eva, M.M.; McIntosh, F.; Paquet, M.; Dangoor, D.; Burkart, C.; Zhang, D.E.; Gruenheid, S.; Gros, P. Contribution of increased ISG15, ISGylation and deregulated type I IFN signaling in Usp18 mutant mice during the course of bacterial infections. Genes Immun. 2014, 15, 282–292.
22. Redaelli, S.; Maitz, S.; Crosti, F.; Sala, E.; Villa, N.; Spaccini, L.; Selicorni, A.; Rigoldi, M.; Conconi, D.; Dalprà, L.; et al. Refining the phenotype of recurrent rearrangements of chromosome 16. Int. J. Mol. Sci. 2019, 20, 1095.

© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).