Genomics 84 (2004) 229–238 www.elsevier.com/locate/ygeno

Eleven daughters of NANOG

H. Anne F. Booth and Peter W.H. Holland*

Department of Zoology, University of Oxford, South Parks Road, Oxford, OX1 3PS, UK

Received 4 December 2003; accepted 25 February 2004; available online 15 April 2004

Abstract

Nanog is a recently discovered ANTP class homeobox gene. Mouse Nanog is expressed in the inner cell mass and in embryonic stem cells and has roles in self-renewal and maintenance of pluripotency. Here we describe the location, genomic organization, and relative ages of all NANOG pseudogenes, comprising ten processed pseudogenes and one tandem duplicate. These are compared to the original, intact human NANOG gene. Eleven is an unusually high number of pseudogenes for a homeobox gene and must reflect expression in the human germ line. A pseudogene orthologous to NANOGP4 was found in chimpanzee and an expressed pseudogene in macaque. Examining pseudogenes of differing ages gives insight into pseudogene decay, which involves an excess of deletion mutations over insertions. The mouse genome has two processed pseudogenes, which are not clear orthologues of the primate pseudogenes.
© 2004 Elsevier Inc. All rights reserved.

Keywords: Homeobox; Pseudogene; Molecular evolution; Enk

Homeobox genes are characterized by possession of a recognizable 60- or 63-amino-acid motif within the encoded protein. The homeobox gene superfamily is extremely diverse, particularly within animal genomes, and can be subdivided into several major classes, each containing numerous gene families. Among the best known are the Hox, En, Dlx, Evx, NK, and Msx gene families within the ANTP class, and the Pax, Gsc, and Otx gene families within the PRD class. Recently, Wang and colleagues [1] described the cloning of a novel member of the ANTP class in mouse, which they named Enk (early embryo specific expression NK family gene), referring to expression in inner cell mass cells of the blastocyst and sequence similarity to several NK-related homeobox genes. The same mouse gene was also cloned by Chambers and colleagues [2] and by Mitsui and colleagues [3]; these groups showed that the encoded protein plays key roles in self-renewal and maintenance of pluripotency in inner cell mass and embryonic stem cells. Reflecting these properties, these authors designated the gene Nanog, from Tír nan Óg, the mythical Celtic land of the ever young. Chambers and colleagues [2] and Mitsui and colleagues [3] identified a single human orthologue of the mouse Nanog gene, mapping to 12p13.31.

As of February 2004, the Mouse Genome Informatics web site and National Center for Biotechnology Information (NCBI) LocusLink list Nanog as the approved symbol for the mouse gene (LocusID 71950), with Enk as an alternative symbol. For the human gene, NCBI LocusLink lists NANOG as the gene symbol (LocusID 79923), with FLJ12581 as an alternative symbol. Because the symbol Enk might cause confusion with the unrelated enkephalin gene, we refer to the mouse and human orthologues as Nanog and NANOG, respectively.

Here we report the existence of ten processed NANOG pseudogenes in the complete human genome, plus one duplication pseudogene, and present full descriptions of their locations, organization, and relative ages. We also describe one pseudogene each from chimpanzee and macaque (more are expected) and two in the complete mouse genome. We comment on the process of pseudogene decay and propose explanations for the abundance of Nanog pseudogenes and for the particular abundance of primate pseudogenes. First, however, we clarify the sequence and organization of the functional human NANOG gene, to allow comparison to the pseudogenes.

Supplementary data for this article may be found on ScienceDirect.
* Corresponding author. Fax: +44-01865-271184. E-mail address: [email protected] (P.W.H. Holland).

0888-7543/$ - see front matter © 2004 Elsevier Inc. All rights reserved. doi:10.1016/j.ygeno.2004.02.014
H.A.F. Booth, P.W.H. Holland / Genomics 84 (2004) 229–238

Results and discussion

Structure of human NANOG

We subjected the NCBI human genome assembly to tBLASTn searches using homeodomain sequences encoded by three human ANTP class genes: EN2, HOXA1, and NKX1.1. Among the retrieved sequences, three were homeodomains from unnamed homeobox genes. A second tBLASTn search of the human genome using one of these unidentified homeodomains revealed several other highly similar sequences. One of these was an unnamed homeobox gene mapping to chromosome 12 at 12p13.31 (chromosomal position 7.84 Mb) and located on a genomic contig with RefSeq Accession No. NT_009714. This gene is clearly the human NANOG gene. Supporting evidence includes two 2.1-kb cDNAs with GenBank Accession Nos. AK022643 (RefSeq mRNA NM_024865) deposited by T. Isogai (Chiba, Japan; unpublished) and AB093576 deposited by S. Yamanaka (Nara, Japan [3]) and noted to be the human orthologue of mouse Nanog.

To confirm exon–intron organization, we searched GenBank and the NCBI human EST database and examined the UniGene cluster for NANOG (Hs.326290). This yielded 33 EST sequences and a cDNA, in addition to the two cDNAs mentioned above. Manual comparisons indicated that 28 ESTs derive from the functional NANOG gene; of these, 20 match the known cDNA closely, 2 have intron sequences (and therefore derive from unspliced message), and 4 (from the same library) are hybrids with an unlinked gene (CDT1), probably artifactually. NANOG ESTs are primarily from human NT2 teratocarcinoma cells, germ cell tumors, testis tumors, colon tumor, and adult marrow. The remaining 5 ESTs and the cDNA derive from pseudogenes and are discussed later. The NCBI contig annotation describes four exons; our analyses confirm this, with all splice sites obeying the GT–AG donor–acceptor rule [4]. The homeobox is located in exons 2 and 3 of the four-exon gene, being interrupted between homeodomain residues 44 and 45. When an intron occurs within a homeobox, the most frequent site is between codons 44 and 45 [5]. There are numerous other examples of this intron location in the ANTP class, including several (but not all) members of the Evx, Dlx, Lbx, Not, Gsx, and Hox gene families (e.g., Drosophila lab, pb, Abd-B). The human NANOG gene structure is shown diagrammatically in Fig. 1A.

The genomic and mRNA sequences of NANOG are not identical at the DNA level. In addition to several synonymous substitutions, there is one nonsynonymous difference in exon 2, causing a difference at position 82: lysine in the genomic contig and asparagine in the deposited full cDNA sequences. Lys and Asn are not biochemically similar amino acids, so this difference was unexpected. To test if this is a naturally occurring polymorphism or a sequencing error, we used PCR to amplify human NANOG exon 2 from the genomic DNA of two individuals. We found that one individual encoded Lys at position 82; the other encoded Asn. This confirms there is naturally occurring variation in the NANOG protein sequence. Other nonsynonymous substitutions were predicted from EST sequences, but since each was present in just a single EST (and usually near sequence ends) we consider these to be probable sequencing errors.

The longest cDNA clones reported (GenBank Accession Nos. AK022643 and AB093576) begin with an adenine, flanked by pyrimidines (C upstream in the genomic sequence, T downstream). These are characteristics of transcriptional start sites [6], consistent with the cDNA being full length. The 5′ untranslated region (5′ UTR) is 216 nucleotides. Examining 1 kb of genomic sequence upstream of the cDNA start, we detect no TATA box and no SP1 box, but three putative CAAT boxes. The ATG annotated as the start methionine has six of nine matches to the Kozak consensus sequence CCRCCATGG (NANOG has CTAACATGA). More significantly, the mouse orthologue of NANOG (GenBank Accession Nos. AK010332, AF507043, AY278951, and AB093574) and a rat orthologue (predicted from a genomic contig with RefSeq Accession No. NW_047696) each show conservation of this methionine, but no conservation of translated sequence 5′ of this site. Strangely, a similar cDNA sequence reported from macaque (GenBank Accession No. AB062943) has a leucine instead of a methionine at the equivalent position; this is discussed later.

The open reading frame in human NANOG has the potential to code for a protein of 305 amino acids, before a 983-nucleotide 3′ UTR (supplementary data). There is no canonical polyadenylation signal, but two overlapping sequences in the expected position differ from AATAAA by only a single substitution. The 3′ UTR contains an Alu repetitive element (most similar to the AluY subfamily). This Alu element proved useful in relative dating of pseudogene formation (see later).

Fig. 1. (A) Human NANOG gene structure. The four exons (labeled 1–4) are represented by horizontal bars; the 5′ and 3′ UTRs are white and the protein coding region is black with the homeobox colored blue. The three introns are represented by pink lines. The length in nucleotides is written underneath each of the exons and introns (with exons 1 and 4 being split into coding and noncoding/UTR regions). Note that the 3′ UTR is 967 nucleotides in the version of genomic sequence deposited, but 983 nucleotides in the deposited cDNA sequences. The start (ATG) and stop (TGA) codons are labeled and represented by a green and a red vertical line, respectively. (B) Nine processed pseudogenes compared to human NANOG protein. One processed pseudogene, NANOGP11, is not shown, as it does not include sequence derived from the NANOG open reading frame. Chromosomal positions, RefSeq Accession Nos., and GenBank Accession Nos. are shown. The NANOG protein is represented by a black horizontal bar with the homeodomain colored blue. The start codon is represented by a green vertical line and the stop codon by a red vertical line. Intron positions are shown by pink triangles. For each pseudogene the black horizontal bar represents sequence similarity to the NANOG open reading frame, which can move between pseudogene reading frames (rf1, rf2, rf3) as a result of insertions and deletions. Insertions are represented by turquoise vertical lines and deletions by the absence of sequence similarity to NANOG. Below each insertion is a plus sign and below each deletion is a minus sign, followed by the number of nucleotide bases inserted/deleted at that point. Substitution mutations are not shown, except where these introduce stop codons into the reading frame homologous to NANOG. Red vertical lines indicate stop codons; these are shown when they occur in the reading frame homologous to NANOG or when they are the first stop codon encountered in rf1. In four of the pseudogenes (NANOGP2, P5, P6, P9) insertions or deletions have caused reading-frame shifts before a stop codon was encountered in rf1. In these cases, the gray dashed lines indicate continuation of rf1 until the first stop codon is encountered.

Human NANOG pseudogenes

In the tBLASTn searches of the human genome assembly (described previously), ten highly significant hits were obtained in addition to the known ANTP class homeobox genes and NANOG. All these hits encoded complete or partial homeodomains with high sequence similarity to NANOG. These sequences could represent either functional members of a Nanog gene family or pseudogenes. Analysis revealed that one sequence possesses regions homologous to the NANOG introns and exons; we propose this is a duplication pseudogene and designate it NANOGP1. The other nine sequences lack introns, and eight have in-frame stop codons, deletions, or frameshifts producing stop codons (Fig. 1B). We conclude that these are processed pseudogenes and designate them NANOGP2 to NANOGP10. To determine if there are additional NANOG pseudogenes in the human genome, lacking the homeobox, we conducted a BLASTn search using the full human NANOG mRNA sequence. This revealed one additional match comprising a short sequence with high similarity to part of the NANOG 3′ UTR, on a different chromosome. We deduce that this is the remnant of an eleventh NANOG processed pseudogene. This is a remarkable number, considering the paucity of other ANTP class homeobox pseudogenes within the human genome.

A duplication pseudogene, NANOGP1

Duplication pseudogenes arise by tandem duplication of genomic DNA containing a functional gene, with disablement of one of the duplicates either during or after the duplication event. Thus a duplication pseudogene generally shares the same, or similar, exon–intron organization as its functional counterpart. Gene and pseudogene are also usually physically linked. NANOGP1 fulfils both criteria. NANOG and NANOGP1 map to 12p13.31, in the same orientation (Fig. 2A), separated by an intergenic distance of 96.3 kb (regarding NANOGP1 as the region of homology to NANOG). The DNA sequence at the NANOGP1 locus has clear homology to all exons and introns of NANOG (97% nucleotide identity to the NANOG coding sequence). If NANOGP1 were expressed, and processed using the same exon–intron junctions as NANOG, the resulting mRNA could not encode a full-length protein, because of a C to T transition at nucleotide position 25 of the predicted coding region. This changes CAA (Gln) to TAA (a stop codon), after just 8 deduced amino acids (Fig. 2B). We have no evidence, however, that NANOGP1 is processed in the same way as NANOG.

Examination of the cDNA and EST sequences obtained from GenBank reveals that one cDNA and two ESTs derive from NANOGP1; all three are incorrectly ascribed to NANOG. The cDNA sequence (GenBank Accession No. AK097770; FLJ40451) and one of the ESTs (GenBank Accession No. BF773088) derive from an mRNA with a transcriptional start site almost 20 kb upstream of the expected start site (as predicted by comparison to NANOG). This first NANOGP1 exon is entirely noncoding and is spliced directly to exon 2, missing the sequence homologous to NANOG exon 1 (Fig. 2B). The second exon of NANOGP1 contains the first in-frame methionine, which is homologous to an internal methionine of the NANOG protein. Farther downstream, the terminal exon is also aberrantly spliced, using an in-frame splice acceptor site. The consequence of these differences between NANOG and NANOGP1 splicing is that the NANOGP1 mRNA includes an open reading frame largely homologous to NANOG, except for deletion of the N-terminus and deletion of an internal section within the putative protein. Despite this, we suggest that this transcript does not encode a functional protein. Our reasoning is that there are up to nine nonsynonymous substitutions in NANOGP1. It is not possible to give an exact number, nor to estimate accurately the nonsynonymous to synonymous substitution ratio (Ka/Ks), due to the high polymorphism of each sequence coupled with their close similarity. For example, analysis of mRNA sequences using Li's 1993 method implemented in DAMBE [7] gives a Ka/Ks ratio of 1.1 (neutral evolution, consistent with a pseudogene), but use of genomic sequences gives 3.2. There is a second NANOGP1 EST (GenBank Accession No. AA366389) showing a different splicing pattern, but the shortness of this sequence limits analysis.

NANOGP1 does not have an AluY element at the site where one is integrated in the NANOG 3′ UTR region; hence this Alu element inserted after the tandem duplication event. NANOGP1 does, however, have three other Alu elements in the region equivalent to the 3′ UTR. Interestingly, this region of NANOGP1 has undergone a second duplication, in a reverse orientation, because a homologous sequence is present on the opposite DNA strand of the second intron of NANOGP1 (Fig. 2B). This extra duplicated region contains only two of the three NANOGP1 Alu elements.

To determine if other genes were affected by the tandem duplication event, we examined the surrounding genomic sequence using NIX analysis (http://www.hgmp.mrc.ac.uk/Registered/Webapp/nix/) and the NCBI MapViewer for human (http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi). Two genes were predicted: SLC2A14 (also called GLUT14), located 17.7 kb downstream of NANOG, and SLC2A3 (GLUT3), located 19.1 kb downstream of NANOGP1 (Fig. 2A). These two SLC2 (solute carrier family 2) genes and predicted proteins have high sequence similarity (95% amino acid identity), and both are on the DNA strand opposite that of NANOG and NANOGP1. We conclude that the event that generated the NANOGP1 pseudogene involved tandem duplication of approximately 100 kb of genomic DNA, encompassing two genes: NANOG and a member of the SLC2 gene family.

It has been previously argued that SLC2A14 and SLC2A3 are tandem duplicates [8]. These authors also noted that SLC2A14 has a derived gene expression pattern, being specifically expressed in testis, while SLC2A3 retains the ancestral role for these genes. This implies an interesting asymmetry in the fates of the two neighboring duplicated genes. In the case of NANOG, the more distal descendant of the duplication (farthest from the centromere) retains the ancestral role, while the proximal descendant has accumulated degenerative mutations. In contrast, the more proximal descendant of the SLC2A duplication retains the ancestral role, while the distal descendant has diverged in expression and putative function.

Fig. 2. (A) Genes affected by tandem duplication of human NANOG. The tandem duplicates NANOG and NANOGP1 are represented by black boxes and the tandem duplicates SLC2A14 and SLC2A3 are represented by red boxes. The NANOGP1 box represents the region of homology to NANOG, not the full NANOGP1 transcript. NANOG and NANOGP1 are on the same DNA strand, while SLC2A14 and SLC2A3 are on the opposite strand (indicated by arrows underneath the genes). The intergenic distances are shown in kilobases and a scale bar of human chromosome 12 in megabases is drawn underneath. (B) Comparison of human NANOG and NANOGP1 transcripts. Exons are represented by horizontal bars: the 5′ and 3′ UTRs are white with the Alu repetitive elements colored orange and labeled a, b, c, and d; the protein coding regions are black with the homeoboxes colored blue. The introns are represented by pink lines and not drawn to scale. Lengths of exons and introns are shown in nucleotides. Start and stop codons are represented by green and red vertical lines, respectively. The diagram of NANOGP1 shows the transcript as predicted by comparison of the DNA sequence at the NANOGP1 locus to two mRNA (one cDNA and one EST) sequences. The gray boxes show regions homologous to the exons of NANOG but which are spliced out in this transcript. If NANOGP1 was also processed like NANOG (for which there is no evidence), a premature stop codon would be encountered in exon 1 (vertical red line in gray box). The two arrows underneath NANOGP1 indicate that part of the NANOGP1 terminal exon has been duplicated and inserted into the second intron in the reverse orientation.

Ten processed pseudogenes, NANOGP2 to NANOGP11

Processed pseudogenes arise by retrotransposition of mRNA and are recognizable by the absence of introns and 5′ promoter sequences and (often) the remnants of a polyadenylation tract and flanking direct repeats [9]. Because they arise from integration of transcribed mRNA, they are unlikely to be located physically close to the parent gene and are usually on other chromosomes. We detected ten processed pseudogenes of human NANOG, which we designate NANOGP2 to NANOGP11 (see Fig. 1B for NANOGP2 to NANOGP10 including Accession Nos.).

NANOGP2 to NANOGP10 each include sequence homologous to the NANOG coding region (including homeodomain), but most do not have the potential to produce a functional protein because of critical mutations (Fig. 1B). NANOGP7 and NANOGP8 are exceptional as they contain no insertions or deletions in the region of homology to the NANOG open reading frame and (in the case of NANOGP8) very few substitutions (6/915 sites). NANOGP7 has an in-frame stop codon close to the start of the reading frame, but NANOGP8 has a complete open reading frame, closely similar to NANOG. NANOGP8 is also unique in being the only pseudogene to possess an Alu element in the 3′ UTR homologous to the one in NANOG. It is theoretically possible that NANOGP8 is a retrogene rather than a pseudogene, but this is unlikely as no ESTs have been reported for NANOGP8. In contrast, ESTs exist for NANOGP5 and NANOGP11, even though neither includes a substantial open reading frame. The two ESTs for NANOGP5 (GenBank Accession Nos. BG398802 and BF994088) were cloned from kidney and placenta, suggesting that this processed pseudogene has integrated near an enhancer active in these tissues. The single NANOGP11 EST (GenBank Accession No. BG193822) includes the region of sequence similarity to NANOG, plus 5′ flanking sequence. This EST, however, was obtained by "random activation of gene expression" in the cell line HT1080 [10]; there is no evidence that the transcript is produced under normal physiological conditions. The EST is noted on the NCBI MapViewer, with no associated open reading frame or other supporting data.

NANOGP11 is unusual in that no trace of the NANOG coding region remains. Instead, NANOGP11 consists of just 370 nucleotides of similarity to the NANOG 3′ UTR, located within one of the introns of the SYNE1 gene (also known as enaptin or nesprin 1) on human chromosome 6 at 6q25 (152.90 Mb; genomic contig RefSeq Accession No. NT_025741; BAC clone GenBank Accession No. AL589963). We deduce that NANOGP11 is a processed pseudogene that either has degenerated radically through the action of large deletion mutations or was derived originally by integration of a partial reverse transcription product [11]. NANOGP3 may also be a partial integrant.

NANOGP3, NANOGP8, NANOGP9, and NANOGP10 were each erroneously annotated as putative genes on LocusLink (LOC340217, LOC283741, LOC349386, and LOC349372, respectively). These entries have now been corrected in response to this work. The homeoboxes of NANOGP4 and NANOGP10 were previously noted by Pollard and Holland [12] and referred to provisionally as NEW1 and NEW2, respectively.

Seven of the processed pseudogenes have indels in the region homologous to the functional NANOG coding sequence. In every case, there is a clear predominance of deletion mutations over insertion mutations (Fig. 1B); this pattern has been noted previously for other pseudogenes [11,13,14].

Relative ages of the human NANOG pseudogenes

It is clear that NANOGP8 is the most recently arisen pseudogene, as it is the only one to possess an Alu element in the 3′ UTR homologous to the one in NANOG. We infer that this Alu element inserted very recently, after the tandem duplication event producing NANOGP1 and after generation of eight of the processed pseudogenes (NANOGP2 through to NANOGP7, NANOGP9, and NANOGP10), but before generation of NANOGP8.

Estimates of pseudogene age (and relative age) may be obtained by counting the total number of mutations that have occurred in the region corresponding to the NANOG coding sequence and dividing by a rate of neutral mutation estimated for pseudogenes. For this calculation, we assume (i) the functional human NANOG gene retains the ancestral condition, to which the pseudogene sequences may be compared, (ii) each pseudogene became nonfunctional at its time of origin, (iii) mutation rates are equal regardless of chromosomal position, (iv) autosomes and the X chromosome accumulate neutral mutations at the same rate [15], (v) no sites have been hit by two mutations, (vi) insertions or deletions greater than one nucleotide in length were caused by single events, (vii) there has been no gene conversion, and (viii) the rate of neutral mutation has remained constant over time.

The last assumption could be violated in many evolutionary analyses, because the neutral mutation rate varies between taxa and between different types of genetic target. Here we use a rate of mutation estimated from comparison of human, chimpanzee, and gorilla pseudogenes [15,16]. Therefore, we have reasonable confidence in our estimates back to 6 million years (my), being an approximate estimated date of divergence for these species [17]. Beyond this date, changes to the rate of mutation could lead to severe underestimates or (more likely) overestimates of pseudogene age; hence, we view older dates with caution. The rate of mutation we have used, 1.25 × 10⁻⁹ mutations per site per year, includes the contributions of insertions and deletions. Because large deletions reduce the opportunity for substitutions (because target DNA is removed), in our calculations we have scaled the total number of mutations according to the length of the sequence remaining after the deletions were each adjusted to a unit size of 1; this is particularly important for NANOGP3 as most of the coding region is missing. In each case, we compared the pseudogene sequence to the NANOG cDNA sequence, across the coding region only. We do not include NANOGP11 in these calculations, as all homology to the coding region has been deleted.

The most recently generated processed pseudogene, NANOGP8, has 6 mutations across 915 sites, giving an estimated age of 5.2 million years: approximately the time of the human–chimpanzee divergence. All the other pseudogenes give much older dates, ranging from 22 (NANOGP1) to over 150 my ago (NANOGP6). For the reasons outlined above, these latter estimates may not be accurate. Instead, we prefer to use these figures as a guide to relative age of each pseudogene. We conclude that NANOGP8 is the most recently formed pseudogene, NANOGP1 is the second youngest, followed by NANOGP4, then NANOGP7, then NANOGP2 and NANOGP9, then NANOGP10, then NANOGP3 and NANOGP5, and finally NANOGP6, being the oldest. The relative age of NANOGP11 is not deduced.

Rodent and primate Nanog genes

Orthologues of human NANOG have previously been reported from mouse and rat [1–3]. The mouse and rat orthologues map to chromosomes 6 and 4, respectively, in each case in chromosomal regions syntenic to the human NANOG map position.

We conducted database searches to identify orthologues of human NANOG in other species. No clear homologues were detected in any invertebrate genome (e.g., insects, nematode), nor in fish, despite several complete genome sequences being available for these taxa. It is not yet clear if this reflects secondary loss in certain evolutionary lineages or derived origin of the Nanog gene specifically in the mammalian lineage. We did, however, detect DNA sequences with high similarity to Nanog from the crab-eating macaque (Macaca fascicularis) and common chimpanzee (Pan troglodytes). The macaque sequence is a brain cDNA (GenBank Accession No. AB062943) with high sequence similarity to human NANOG; for various reasons we conclude this sequence is derived from an expressed pseudogene (discussed later). The chimpanzee sequence is genomic (located on a clone with GenBank Accession No. AC096875), and this is clearly a processed pseudogene (discussed later). We also detected two Nanog pseudogenes in the mouse genome (also discussed later).

The proteins encoded by the human, mouse, and rat cDNAs, and that deduced from the macaque pseudogene open reading frame, are aligned in Fig. 3. Sequence conservation is most pronounced across the homeodomain, there being four other short regions with five or more contiguous invariant residues. One of the most interesting features of the alignment is the presence of a conserved structure in the protein region C-terminal to the homeodomain as previously noted [2,3]. In all four species, this region has tryptophan repeated every five residues. The tryptophan spacing is conserved, although in human and macaque the fourth tryptophan is replaced by a glutamine. The precise number of repeat units is also slightly variable, with 11 in rat, 10 in mouse, and 9 in human and macaque (including the variant fourth repeat). The function of the repeat structure is unknown.
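The dating arithmetic described under "Relative ages of the human NANOG pseudogenes" can be checked with a minimal sketch. It divides the number of mutations per site by the neutral rate quoted in the text (1.25 × 10⁻⁹ mutations per site per year) and uses the NANOGP8 figures given there (6 mutations across 915 sites). The function name `estimate_age_years` is ours, and the sketch deliberately omits the scaling for deletions that the text applies to the other pseudogenes, so it should be read as an illustration rather than the authors' exact procedure.

```python
# Neutral mutation rate quoted in the text (mutations per site per year).
NEUTRAL_RATE = 1.25e-9


def estimate_age_years(mutations: int, sites: int, rate: float = NEUTRAL_RATE) -> float:
    """Return an age estimate in years: (mutations per site) / (mutations per site per year).

    Assumption: no correction for deleted sequence is applied here.
    """
    if sites <= 0 or rate <= 0:
        raise ValueError("sites and rate must be positive")
    return (mutations / sites) / rate


# NANOGP8: 6 mutations across 915 sites, as reported in the text.
nanogp8_age = estimate_age_years(mutations=6, sites=915)
print(f"NANOGP8: {nanogp8_age / 1e6:.1f} million years")  # prints "NANOGP8: 5.2 million years"
```

The result agrees with the 5.2 million years stated for NANOGP8. For pseudogenes carrying large deletions, the `sites` argument would need to reflect the remaining sequence length as described in the text.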

Fig. 3. Alignment of the full-length protein sequences of human (Hs) NANOG and mouse (Mm) and rat (Rn) Nanog and that deduced from the macaque (Mf) pseudogene. The first 39 residues of the macaque sequence (italics) are upstream of the first methionine. Asterisks indicate conservation of a residue between all sequences. Homeodomains are shown in blue and other conserved regions of 5 or more contiguous residues are shown in red. The repeated tryptophans are highlighted in yellow. Note the leucine codon instead of the initiator methionine in the macaque pseudogene. For color see online version.
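The regular tryptophan spacing highlighted in Fig. 3 (one tryptophan every five residues) is the kind of pattern that can be located programmatically. The sketch below is one possible approach, not the method used by the authors: it finds the longest run of a chosen residue recurring at a fixed spacing. The function name `longest_spaced_run` and the toy sequence are hypothetical; the toy sequence is not the NANOG protein.

```python
def longest_spaced_run(seq: str, residue: str = "W", spacing: int = 5) -> tuple[int, int]:
    """Return (start_index, count) of the longest run of `residue` recurring every `spacing` positions.

    Returns (-1, 0) if the residue does not occur at all.
    """
    positions = {i for i, aa in enumerate(seq) if aa == residue}
    best_start, best_count = -1, 0
    for start in sorted(positions):
        # Only begin counting at the first member of a run.
        if start - spacing in positions:
            continue
        count = 0
        pos = start
        while pos in positions:
            count += 1
            pos += spacing
        if count > best_count:
            best_start, best_count = start, count
    return best_start, best_count


# Hypothetical toy sequence (NOT the NANOG protein): four W-X-X-X-X units.
toy = "MSG" + "WNNQT" * 4 + "AAA"
start, count = longest_spaced_run(toy)
print(start, count)  # prints "3 4"
```

A strict search like this would break a run at a variant unit, such as the glutamine that replaces the fourth tryptophan in human and macaque; tolerating such variants would require accepting a set of residues rather than a single one.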

Rodent and primate Nanog pseudogenes A macaque brain cDNA with sequence similarity to human NANOG has been deposited with GenBank (Ac- We detected two pseudogenes of Nanog in the current cession No. AB062943) by K. Hashimoto (Tokyo, Japan; assembly of the mouse genome. Both are deduced to be unpublished). Although the putative protein sequence processed pseudogenes because their sequences cross exon encoded by this cDNA aligns well with human NANOG boundaries. Comparison to the human processed pseudo- and with mouse and rat Nanog, with no in-frame stop genes, and molecular phylogenetic analysis (not shown), codons, several unusual features lead us to conclude that suggests that neither is a direct orthologue of a particular it is derived from an expressed pseudogene. First, the 5V human pseudogene. For this reason, we denote the mouse UTR is exceptionally long, suggestive of an aberrant sequences NanogPa and NanogPb, to avoid inadvertent transcriptional start site. Second, and consistent with this implications about equivalence to the human pseudogenes. suggestion, the macaque cDNA was cloned from brain. Mouse NanogPa is located on the X chromosome, while Mouse Nanog expression has not been detected in brain, NanogPb is located on chromosome 12. The two pseudo- but only in preimplantation embryos and marrow [1–3]. genes have been critically disrupted by mutations, but in Third, the 3V UTR is unusual, extending 772 nucleotides different ways (Fig. 4A). NanogPa is missing the 5V end of beyond the expected end of the Nanog mRNA. Fourth, the gene, including the N-terminal part of the coding the start methionine (ATG) of human NANOG and mouse region and part of the homeobox; in addition, a large and rat Nanog is not conserved in the macaque cDNA, inversion has occurred. We experimentally verified the which has CTA in the equivalent position (the first in- existence of this inversion by PCR (data not shown). In frame methionine is downstream of this point). 
The CTA NanogPb, the 3V part of the gene is missing, and the plus 150 nucleotides upstream shows 96% identity to a remaining sequence is heavily disrupted by deletions, sequence found over 800 nucleotides 5V of the human substitutions, and an insertion. NANOG transcriptional start site, suggesting this is an 236 H.A.F. Booth, P.W.H. Holland / Genomics 84 (2004) 229–238

Fig. 4. (A) Two mouse pseudogenes compared to the mouse Nanog protien. Note the inverted sequence in NanogPa; the 3V half of the region homologous to Nanog exon 4 has inverted so it runs in the opposite direction and is situated 5V of the rest of the pseudogene. For key to color coding and symbols, see legend for Fig. 1B. For color see online version. (B) Human (Hs) NANOGP4 has a direct orthologue in chimpanzee (Pt). The human NANOG protein is shown, with human NANOGP4 and chimpanzee NanogP4 underneath. Chimpanzee NanogP4 has three of four stop codons and two of three deletions identical to human NANOGP4. For key to color coding and symbols, see legend for Fig. 1B. For color see online version.

integrant of an aberrantly spliced transcript. The cDNA includes an additional 626 nucleotides at the extreme 5′ end. Fifth, there are two unusual nonsynonymous substitutions in the homeobox of the macaque cDNA (coding for Arg at homeodomain position 41 and for Leu at position 60). Finally, and most importantly, when the extreme 5′ UTR and 3′ UTR sequences are searched against the human genome using BLASTn, they have very high sequence identity to contiguous regions of a human genomic contig (RefSeq Accession No. NT_022184) mapping to human chromosome 2 at 2p23. In human, this genomic location does not contain the NANOG gene nor any NANOG pseudogene. These observations suggest that the macaque cDNA is derived from a Nanog pseudogene that has been integrated near a brain enhancer, producing an aberrant transcript containing additional flanking 5′ and 3′ sequences. We also note that the macaque cDNA contains an Alu element; this is not the same as the Alu integration in human NANOG and NANOGP8, because it is a different Alu subfamily and in a different position.

A Nanog pseudogene was also detected in chimpanzee by BLAST searches of the htgs (high-throughput genomic sequence) division of GenBank (located on a clone with Accession No. AC096875). Sequence comparison and phylogenetic analysis clearly show that this is a direct orthologue of human NANOGP4 on chromosome 7. This is most noticeable by comparison of the positions of stop codons and deletions, which are almost identical between the chimpanzee and the human pseudogenes (Fig. 4B). Orthology between human NANOGP4 and chimpanzee NanogP4 is consistent with the age deductions made above, because NANOGP4 (like most of the human NANOG pseudogenes) is deduced to have originated before the evolutionary divergence of the human and chimpanzee lineages.
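The orthology argument rests on shared disabling mutations: a stop codon or deletion found at the same alignment position in both species most plausibly arose before the two lineages diverged. The comparison can be illustrated with a minimal Python sketch. The coordinates below are invented placeholders, chosen only to reproduce the counts given in the Fig. 4 legend (read here as three of four stop codons and two of three deletions being shared); they are not the real NANOGP4 positions, and the function name `shared_fraction` is hypothetical.

```python
from fractions import Fraction


def shared_fraction(reference, other):
    """Count how many features of `reference` also occur in `other`.

    Returns a tuple (shared_count, reference_count, fraction_shared).
    """
    reference, other = set(reference), set(other)
    shared = reference & other
    return len(shared), len(reference), Fraction(len(shared), len(reference))


# Hypothetical alignment coordinates, NOT the real NANOGP4 data.
# Stop codons are given as codon positions; deletions as (start, length).
human_stops = {52, 118, 201, 263}
chimp_stops = {52, 118, 201}
human_dels = {(30, 3), (145, 1), (220, 12)}
chimp_dels = {(30, 3), (220, 12)}

stops = shared_fraction(human_stops, chimp_stops)
dels = shared_fraction(human_dels, chimp_dels)

print(f"stop codons shared: {stops[0]} of {stops[1]}")
print(f"deletions shared:   {dels[0]} of {dels[1]}")
```

Matching on exact position (and, for deletions, exact length) is a deliberately strict criterion; a real analysis would take the positions from a curated alignment against the intact NANOG sequence.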

Summary

The human NANOG homeobox gene was duplicated by a tandem duplication event covering approximately 100 kb of chromosome 12, which also encompassed a member of the SLC2 gene family. The tandem duplicate of NANOG is most likely nonfunctional, although expressed; we designate it NANOGP1. The human genome also contains ten processed pseudogenes of NANOG that we designate NANOGP2 to NANOGP11. The most recently formed, NANOGP8, was generated approximately 5.2 million years ago; we also deduce the relative ages of nine other processed pseudogenes. One of the processed pseudogenes, NANOGP5, is expressed; NANOGP11 can be expressed under certain conditions.

Eleven is an extremely high number of pseudogenes for a homeobox gene and highly unusual in the ANTP gene class. The ANTP homeobox gene class contains over 90 functional genes in the human genome, yet, other than the NANOG pseudogenes, these genes have generated only seven processed pseudogenes (six from VENTX2 and one from MSX2 [18]; unpublished analyses). Processed pseudogenes are generated by reverse transcription and integration of mRNA, possibly through the activity of LINE element-derived reverse transcriptase [19]. Retrotransposition activity may affect many mRNA species, and in many cell types, but it is only in cells contributing to the germ line that such integrants would become stably incorporated into the genome and inherited by the next generation. Therefore, only genes expressed in the germ line can generate processed pseudogenes.

We argue, therefore, that the reason for the predominance of human NANOG processed pseudogenes is related to the unusual expression profile of this gene. The mouse Nanog gene is expressed in the inner cell mass, from which the entire embryo, including the germ line, is derived. It is very likely that the human orthologue is expressed similarly. Indirect evidence for this comes from the fact that human NANOG is expressed in teratocarcinoma cells, which in many characteristics are similar to inner cell mass or primitive ectoderm cells. High levels of NANOG expression in the earliest embryonic lineages would give the necessary opportunities for reverse transcription and germ-line integration.

It is intriguing that the human genome has ten processed NANOG pseudogenes, whereas the complete mouse genome has just two Nanog processed pseudogenes. These numbers must reflect different rates of Nanog pseudogene formation and/or removal in the primate and rodent evolutionary lineages. Consistent with this observation, Zhang et al. [20,21] report that the mouse genome has only half as many processed pseudogenes as the human genome. These authors propose that this is because the higher nucleotide substitution, insertion, and deletion rates of rodents lead to more rapid pseudogene decay. This could be augmented by the fact that transposable elements have been more active in the mouse lineage [22]. We suggest a third reason could be that the faster early embryonic development and shorter prereproductive life span of rodents result in a shorter absolute time of Nanog gene expression and less opportunity for retrotransposition into the germ-line genome.

Acknowledgments

We thank Rebecca Furlong and Ruth Younger for advice on genome analysis and a reviewer for helpful suggestions. This research was funded by the BBSRC.

References

[1] S.-H. Wang, M.-S. Tsai, M.-F. Chiang, H. Li, A novel NK-type homeobox gene, ENK (early embryo specific NK), preferentially expressed in embryonic stem cells, Gene Expr. Patterns 3 (2003) 99–103.
[2] I. Chambers, et al., Functional expression cloning of Nanog, a pluripotency sustaining factor in embryonic stem cells, Cell 113 (2003) 643–655.
[3] K. Mitsui, et al., The homeoprotein Nanog is required for maintenance of pluripotency in mouse epiblast and ES cells, Cell 113 (2003) 631–642.
[4] R. Breathnach, P. Chambon, Organization and expression of eucaryotic split genes coding for proteins, Annu. Rev. Biochem. 50 (1981) 349–383.
[5] T.R. Bürglin, A comprehensive classification of homeobox genes, in: D. Duboule (Ed.), Guidebook to the Homeobox Genes, Oxford Univ. Press, Oxford, 1994, pp. 25–71.
[6] B. Lewin, Genes, vol. VII, Oxford Univ. Press, Oxford, 2000.
[7] X. Xia, Z. Xie, DAMBE: data analysis in molecular biology and evolution, J. Hered. 92 (2000) 371–373.
[8] X. Wu, H.H. Freeze, GLUT14, a duplicon of GLUT3, is specifically expressed in testis as alternative splice forms, Genomics 80 (2002) 553–557.
[9] E.F. Vanin, Processed pseudogenes: characteristics and evolution, Annu. Rev. Genet. 19 (1985) 253–272.
[10] J.J. Harrington, et al., Creation of genome-wide protein expression libraries using random activation of gene expression, Nat. Biotechnol. 19 (2001) 440–445.
[11] R. Ophir, D. Graur, Patterns and rates of indel evolution in processed pseudogenes from humans and murids, Gene 205 (1997) 191–202.
[12] S.L. Pollard, P.W.H. Holland, Evidence for 14 homeobox gene clusters in human genome ancestry, Curr. Biol. 10 (2000) 1059–1062.
[13] D. Graur, Y. Shuali, W.-H. Li, Deletions in processed pseudogenes accumulate faster in rodents than in humans, J. Mol. Evol. 28 (1989) 279–285.
[14] Z. Zhang, M. Gerstein, Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes, Nucleic Acids Res. 31 (2003) 5338–5348.
[15] M.W. Nachman, S.L. Crowell, Estimate of the mutation rate per nucleotide in humans, Genetics 156 (2000) 297–304.
[16] R. Martínez-Arias, et al., Sequence variability of a human pseudogene, Genome Res. 11 (2001) 1071–1085.
[17] F.H. Pough, C.M. Janis, J.B. Heiser, Vertebrate Life, Prentice Hall, Upper Saddle River, NJ, 1999.
[18] R.F. Furlong, P.W.H. Holland, Were vertebrates octoploid? Philos. Trans. R. Soc. London B Biol. Sci. 357 (2002) 531–544.

[19] C. Esnault, J. Maestre, T. Heidmann, Human LINE retrotransposons generate processed pseudogenes, Nat. Genet. 24 (2000) 363–367.
[20] Z. Zhang, P.M. Harrison, Y. Liu, M. Gerstein, Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome, Genome Res. 13 (2003) 2541–2558.
[21] Z. Zhang, N. Carriero, M. Gerstein, Comparative analysis of processed pseudogenes in the mouse and human genomes, Trends Genet. 20 (2004) 62–67.
[22] R.H. Waterston, et al., Initial sequencing and comparative analysis of the mouse genome, Nature 420 (2002) 520–562.
