Genomics 92 (2008) 101–106


Genomics

journal homepage: www.elsevier.com/locate/ygeno

Comparative genomics reveals gene-specific and shared regulatory sequences in the spermatid-expressed mammalian Odf1, Prm1, Prm2, Tnp1, and Tnp2 genes

Kenneth C. Kleene ⁎, Jana Bagarova

Department of Biology, University of Massachusetts at Boston, Boston, MA 02125, USA

ARTICLE INFO

Article history:
Received 6 January 2008
Accepted 1 May 2008
Available online 17 June 2008

Keywords:
Comparative genomics
Translational regulation
Spermatid
Transition protein
Outer dense fiber 1
TATA box
CREMτ
Noncanonical poly(A) signal
Retroposon

ABSTRACT

The comparative genomics of the Odf1, Prm1, Prm2, Tnp1, and Tnp2 genes in 13–21 diverse mammalian species reveals striking similarities and differences in the sequences that probably function in the transcriptional and translational regulation of gene expression in haploid spermatogenic cells, spermatids. The 5′ flanking regions contain putative TATA boxes and cAMP-response elements (CREs), but the TATA boxes and CREs exhibit gene-specific sequences, and an overwhelming majority of CREs differ from the consensus sequence. The 5′ and 3′ UTRs contain highly conserved gene-specific sequences including canonical and noncanonical poly(A) signals and a suboptimal context for the Tnp2 translation initiation codon. The conservation of the 5′ UTR is unexpected because mRNA translation in spermatids is thought to be regulated primarily by the 3′ UTR. Finally, all of the genes contain a single intron, implying that retroposons are rarely created from mRNAs that are expressed in spermatids.

© 2008 Elsevier Inc. All rights reserved.

Introduction

The haploid, differentiation phase of spermatogenesis in mammals is a striking system of developmental regulation of gene expression at the transcriptional and posttranscriptional levels. Many new mRNAs are synthesized in early haploid cells, round spermatids, but changes in the structure of chromatin inactivate transcription in late haploid cells, elongated spermatids [1]. Thus, many of the proteins that are first synthesized during the final stages of differentiation are encoded by mRNAs that are transcribed in round spermatids and stored for up to 5 days in an inert state before translation is activated in elongated spermatids. Familiar examples include the mRNAs encoding transition proteins 1 and 2 (TNP1 and TNP2), protamines 1 and 2 (PRM1 and PRM2), and the sperm mitochondria-associated cysteine-rich protein (SMCP) [2]. The Prm1, Prm2, Tnp1, and Tnp2 genes encode basic chromosomal proteins that replace the histones during chromatin remodeling and are arguably members of a gene family [3]. The outer dense fiber 1 (Odf1) mRNA, which encodes an abundant protein in the outer dense fibers of the sperm tail, is repressed briefly before translation is activated in round spermatids [4]. The importance of delaying translation is demonstrated by reports that premature translation of the Prm1 and Tnp2 mRNAs in round spermatids in transgenic mice impairs male fertility [5,6].

The deposition of >7×10^6 sequence reads from the genomes of >30 mammalian species in the public databases creates opportunities to use comparative genomics to identify conserved sequences that regulate gene expression in spermatids. Comparison of the rat, mouse, and human genomes demonstrates that the likelihood of identifying short regulatory elements increases dramatically with the number of species [7]. In addition, comparative genomics can provide insights into the evolutionary pressures on sperm development, which differ from the selective pressures on female reproductive success and somatic cells [8,9]. The atypical selective pressures on spermatogenesis can produce regulatory mechanisms that are unique to spermatogenic cells [9–12].

A previous study from this lab demonstrated that the Smcp mRNA contains highly conserved sequences in the 5′ and 3′ UTRs in seven species and four orders of mammals [13]. The present study uses comparative genomics to look for conserved sequences potentially involved in transcriptional and posttranscriptional regulation in the Odf1, Prm1, Prm2, Tnp1, and Tnp2 genes in a much more diverse set of mammalian genomes. The findings suggest that all five genes use conserved TATA boxes, cyclic AMP response elements (CREs), and sequences in both the 5′ and the 3′ UTR to regulate expression at the transcriptional and posttranscriptional levels, while at the same time displaying many previously unrecognized gene-specific differences in these sequences.

☆ Supplementary data for this article may be found on ScienceDirect.
⁎ Corresponding author. Fax: +1 617 287 6650.
E-mail address: [email protected] (K.C. Kleene).
0888-7543/$ – see front matter © 2008 Elsevier Inc. All rights reserved. doi:10.1016/j.ygeno.2008.05.001

Results

Strategy

The Tnp1, Tnp2, Prm1, Prm2, and Odf1 genes were identified in various mammalian genomes at the ENSEMBL Web site using TBLASTN and amino acid queries. Table S1 lists the names and taxonomic classifications of species that provided sequences for this study. The positions of the splice sites were determined from consensus splice sites, intron positions, and protein sequences in other species. The positions of the exons predicted by ENSEMBL were often useful. The amino acid sequences of the proteins were determined by virtual translation of a theoretical coding sequence created by deleting the intron (not shown), and the protein sequences from various species were aligned with CLUSTALW (Fig. S1). The sequences of the 5′ and 3′ UTRs were determined from the sequences upstream and downstream of the coding sequence and transcription start sites and poly(A) sites (discussed below).

The scope of this study was limited by the unfinished state of many genomes. Thus, it is unclear whether genes that could not be identified are absent from the genomes of specific species and whether all of the genes are orthologous. However, most of the genes studied here appear to be orthologues based on combined evidence from the sequence of the predicted protein, the position of the intron, the conserved sequences in the flanking and nontranslated regions, and the absence of paralogues with divergent sequences (discussed below).

Fig. S1 indicates that genes encoding ODF1, PRM1, PRM2, TNP1, and TNP2 are present in species and orders of mammals in which these proteins had not been identified previously, such as tenrec (Afrosoricida), rabbit and pika (Lagomorpha), hedgehog (Erinaceomorpha), elephant (Proboscidea), bat (Chiroptera), shrew (Soricomorpha), tree shrew (Scandentia), and armadillo (Cingulata). We were unable to identify Prm2, Tnp1, and Tnp2 genes in platypus and opossum using queries that detected homologues in eutherian mammals. Fig. S1 also demonstrates that ODF1 and TNP1 are much more conserved than are TNP2, PRM1, and PRM2.

The hedgehog Prm1 gene contains an extra base near the 3′ end of the first exon, producing a reading frame shift that changes a plausible PRM1 sequence, MARYRCCRSQSRSRCSRRRYRRRRCRRRRRRSCRRRRRRACCRYRYRRY, into a C-terminal sequence, MARYRCCRSQSRSRCSRRRYRRRRCRRRRRRSCRRRRRRGDVLPLQVPPVLTGCCPPAASPKTL, which is unlike those of other PRM1s. The hedgehog Prm1 may be a sequencing error, or a pseudogene, because the promoter lacks CREs as noted below. Perhaps PRM1 is not a major basic chromosomal protein in hedgehog sperm.

Structure and number of gene copies

While a single copy of most genes was identified, two identical copies of the bull Prm1, Prm2, and Tnp2 genes and rabbit Tnp1 gene were identified. These genes represent either genome assembly errors or very recent gene duplications. All of the genes in which splice junctions could be identified contain a single intron. The Tnp1, Tnp2, Prm1, and Prm2 introns split the first and second bases of a codon, and the Odf1 introns split the second and third bases. Essentially all of the introns contain GT–AG splice sites (not shown) that comprise the vast majority of mammalian splice junctions [14]. The exceptions are the mouse and rat Odf1 genes, in which 15 codons and the 5′ splice site that are present in other species have been deleted. As pointed out previously [15], both rodent Odf1 genes contain a consensus AG–3′ splice junction with an oligopyrimidine tract, but the 5′ splice junction contains GA instead of the usual GT. The hedgehog Odf1 gene also appears to contain a GA–AG splice junction at the same position as in rat and mouse, but there is no cDNA sequence to confirm this splice site. GA–AG splice junctions were not noted in an exhaustive catalog of splice sites [14].

Intron-containing mammalian genes have generated thousands of paralogues, usually nonfunctional, by a pathway in which processed mRNAs are reverse transcribed and cDNA copies are inserted into genomic DNA [7,16]. These genes, known as retroposons or processed pseudogenes, can be easily identified by the absence of introns and a remnant of the poly(A) tail 15–20 nt downstream of the poly(A) signal. The observation that all of the genes studied here contain introns supports the contention that genes that are expressed in late meiotic and haploid spermatogenic cells do not generate retroposons [13,17,18].

The dog Tnp2 expressed sequence tags (ESTs) contain two variants that differ in an 18-nt INDEL in the coding region (e.g., CO608132 and CO605672).

Conserved 5′ flanking sequences

The binding sites for two factors that are thought to be important in transcription in spermatids were analyzed: the TATA-binding protein (TBP) and the spermatogenic cell-specific isoform of the cyclic AMP response element modulator, CREMτ [19].

Primer extension and S1 nuclease protection experiments in mouse or rat revealed that all five genes strongly prefer a single transcription start site, 22–32 nt downstream of an AT-rich element that resembles a TATA box [20–23]. Interestingly, these elements display conserved differences in each gene in both the AT-rich core and the surrounding bases: Odf1, AAGCTTTAAAGTAA; Prm1, TCTATAAGAG; Prm2, CTTTATATA; Tnp1, AAGGCCTTAAATAC; and Tnp2, CTATATAACC (Fig. S2 and Table 1).

Table 1
Gene-specific TATA boxes and poly(A) signal-associated motifs

Gene (a)         No. of sequences (b)   Motif (c)
Odf1 TATA box    18                     A[CT]AGAACACAAGC[TC]TTAAAGTAAGTGAA[TC]CATGTG[TG]GCCTCA
Prm1 TATA box    14                     G[CT][GA]TCTATAAGAGG[CG]C[GC][AC][GA]G[AG]
Prm2 TATA box    13                     [GT][CGT]CCCCTTTATATA[CT][AG]AG[CT][CT]C[CG]C
Tnp1 TATA box    18                     ACAATGGC[TC]AAGGCCTTAAATACCCAG[AG]CTC[CT][CT][GA]GCC
Tnp2 TATA box    16                     G[CG]CCC[AG][AG]CTATATAACCAGGGGCTG[CT][CG][AG]
Odf1 3′ UTR      18                     AAGTA[CG][GA]TAGGAAACTGAATACATAACTGCAA[TC]
Prm1 3′ UTR      17                     CAGGAGCCTGCTAA[CG]GAACAATGCC[GA]CCTGTCAATAAATGTTGAAAA
Prm2 3′ UTR      13                     GAGT[AG]AAATG[AG][GA]C[AC]AAAGTCA[CT]CTGCC[AC]AATAAAGCTTGA
Tnp1 3′ UTR      17                     GAATGTATATGTTGGCTGTTTCTCCCCAACATCTCAATAAAA[TC]TTTGAAA
Tnp2 3′ UTR      13                     [AG]GTGATTTCTA[TC]GCAACA[TC]T[GA]ATTAAAGCTTGTAC[AC]C

(a) Name of gene and motif. The 3′ UTR motif includes the sequences surrounding the poly(A) signal.
(b) Number of sequences used to compile the motif.
(c) Motifs identified with MEME. The AT-rich core of the putative TATA boxes and poly(A) signals have been highlighted in boldface. Bases in brackets are alternative bases at each site in the order of frequency of occurrence.

Transcription Factor Element Search software (TESS) revealed that a perfect match to the CRE, TGACGTCA [19], is present only in the rat Tnp1 5′ flanking region (Fig. S2). However, elements differing at one site from the consensus sequence are present in 15/16 Odf1 genes, 8/17 Tnp1 genes, and 7/13 Prm1 genes, and a more sensitive search using FUZZNUC detects 92 additional elements differing from the CRE consensus at two sites that are not detected by TESS. Of course, some of these nonconsensus CRE sites may be spurious, such as the CRE sites in the Prm2 5′ UTR. However, the fact that many of these putative CRE sites are present in conserved positions in various genes (nt ~124 and ~142 in Prm1, nt ~90 and 171 in Tnp1, nt ~184 in Tnp2, and nt ~153 in Odf1) suggests that they are functional. In addition, 4 of the genes exhibit preferences for specific CRE variants: TGAGGTCA at nt ~153 in 16/17 Odf1 genes, TGAGGTCC at nt ~127 in 14/16 Prm1 genes, TGCCATCA at nt ~177–191 in 13/16 Tnp2 genes, and TGACATCA, TGACAGCA, TGATGACA, or TGATGCCA at nt ~171 in 17/19 Tnp1 genes. Prm2 exhibits little preference for a specific variant. Careful inspection of the CLUSTAL alignments in Fig. S2 reveals additional conserved motifs in the 5′ flanking regions of all 5 genes.

Conserved 5′ UTR sequences

The CLUSTAL alignments in Fig. S2 demonstrate that each 5′ UTR contains conserved sequences. For example, the Prm1 5′ UTR contains a 28-nt sequence that is almost perfectly conserved (only five base substitutions in 14 species and seven orders) close to the transcription start site, a region that is often important in translational control. The corresponding sequence of the platypus and opossum Prm1 5′ UTR exhibits lower conservation (not shown). Although the Tnp2 5′ terminus contains few conserved nucleotide sites, it is generally GC rich and contains an 18- to 25-nt AG-rich segment, and the Odf1 5′ UTR contains an AT-rich element. Prm2 and Tnp1 5′ UTRs also contain conserved segments, each of which differs from those in the other genes.

The contexts of the initiation codons are also conserved (Fig. S2). The initiation codons of the Odf1, Prm1, and Prm2 mRNAs are close to the optimal context for the initiation of translation, with an A or G in the −3 position, where the A of the ATG codon is defined as +1, and the initiation codons of the Tnp2 mRNAs are in an adequate context, CCCATGG [24,25]. The G in the +4 position is important only if there is a pyrimidine in the −3 position [24]. The human Tnp2 5′ UTR is the only mRNA that contains an upstream AUG codon, a feature that would be expected to inhibit translation of the Tnp2 coding sequence (Fig. S2).

Poly(A) sites and signals

The 3′ termini of mRNAs are defined by cleavage of pre-mRNAs and addition of the poly(A) tail by the poly(A) polymerase at the poly(A) site in the nucleus. To identify the poly(A) sites of various mRNAs, the complete 3′ UTR and ~60 nt of 3′ flanking sequence of Tnp1, Tnp2, Prm1, Prm2, and Odf1 were used as a query with BLASTN (word size 11) to search the EST database to determine the site at which the 3′ ends of the ESTs diverge from genomic DNA. The number of hits was set high to avoid biasing the results toward ESTs with long 3′ UTR sequences.

Initial BLASTN searches produced an unexpectedly large number of hits to ESTs from somatic tissues such as human brain and placenta and bovine female 6-month-old liver, suggesting that the expression of these genes is not limited to spermatids, as is commonly assumed. The expression of PRM1 at high levels in somatic cells would be notable because PRM1 appears to be toxic to transcriptionally active nuclei [5]. Alternatively, the source of some of these ESTs might be labeled incorrectly. Subsequent BLASTN searches used to compile the information in Fig. S3 included ENTREZ query limits (testis OR testicle OR spermatid).

A variable number of ESTs terminate upstream of the poly(A) signal (Fig. S3), presumably because the sequences of mRNAs determined from the 5′ ends of the cDNAs are incomplete, particularly for the longer mRNAs in this study. As a result, the vast majority of human Prm2 and bull Odf1 ESTs lack full-length 3′ ends.

All five genes use poly(A) sites, 7–22 nt downstream of several variants of the hexanucleotide poly(A) signal, AAUAAA, demonstrating that these signals are actually functional (Fig. S3). Some genes use multiple poly(A) sites, whereas others strongly prefer a single poly(A) site; both situations are common in mammalian mRNAs [26]. The Tnp1, Prm1, and Prm2 3′ UTRs contain the canonical AAUAAA poly(A) signal; the Tnp2 3′ UTRs contain the most common variant, AUUAAA (except for AGUAAA in pig); and the Odf1 3′ UTRs, including opossum, contain the noncanonical signal AAUACA. The 3′ termini of the rat and mouse Tnp1 mRNAs contain the noncanonical signal AAUAAC at the same position as the AAUAAA in other species, but the poly(A) sites in Fig. S3 indicate that this signal is nonfunctional in mouse and that the functional AAUAAA signal is 12–14 nt farther downstream.

All of the 3′ flanking regions contain short U-rich, G-rich, and GU-rich tracts downstream of the poly(A) sites. These elements are important in poly(A) site selection, but the sequences are degenerate and cannot be adequately described by a consensus sequence [27]. The human Prm1 3′ UTR contains an upstream AAUAAA poly(A) signal, but there are no ESTs with 3′ termini <25 nt downstream of this poly(A) signal (not shown), presumably because the downstream signals are absent.

Conservation of 3′ UTRs

All five mRNA species exhibit conserved sequences in the 3′ UTR (Fig. S4 and Table 1). A conserved sequence of variable length is present immediately upstream of the poly(A) signal in the Prm1, Prm2, Tnp1, and Tnp2 mRNAs, but other motifs are often present farther upstream (Fig. S4). The Prm1 3′ UTR contains two conserved elements just upstream of the Prm1 poly(A) signal in mouse, rat, human, and bull, one of which regulates temporal translation of the human growth hormone coding region in transgenic mice [28]. Fig. S4 indicates that both sequences are conserved in eight orders of mammals, including a marsupial, the opossum, and a monotreme, the platypus. In addition, there are short, conserved segments immediately downstream of the poly(A) signal in all five mRNAs. MEME motifs for the gene-specific sequences surrounding the poly(A) signal are presented in Table 1. Careful inspection of the motifs near the poly(A) signals in Table 1 and the CLUSTAL alignments in Fig. S4 indicates that the motifs are mRNA-specific.

Species-specific differences in expression and regulatory regions

Comparative genomics in this study provides several examples of species-specific differences that may be related to differences in selective pressures in various lineages.

First, the numbers of testis ESTs corresponding to various mRNAs can provide an estimate of the relative levels of mRNAs in various species. This approach has advantages over comparing mRNA levels in various mammals by Northern blots, because the results are unaffected by differences in the similarity of the probes to mRNAs in different species. The frequencies of ESTs in various species indicate striking differences in relative levels of expression of various mRNAs (Table 2).

Table 2
Relative numbers of Odf1, Prm1, Prm2, Tnp1, and Tnp2 ESTs in various mammalian species

              mRNA (b)
Species (a)   Odf1     Prm1   Prm2   Tnp1   Tnp2
Bull          5 (c)    43     2      0      3
Dog           47       7      0      9      267
Human         16       24     188    26     8
Mouse         98       253    252    222    126
Pig           35       54     17     71     45

(a) Species name.
(b) Name of mRNA corresponding to ESTs.
(c) Number of ESTs corresponding to each mRNA. The number of ESTs was determined by BLAST searches using the full-length query for each mRNA in each species and BLASTN (word size 11) and limits (spermatid OR testis OR testicle) to restrict the hits to testis.
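The relative EST frequencies in Table 2 can be tabulated with a few lines of code. The following is a minimal Python sketch and is not part of the original analysis: the counts are transcribed from Table 2, and the names EST_COUNTS, est_pair, and est_fraction are hypothetical helpers introduced here only for illustration.

```python
# EST counts transcribed from Table 2 (testis-restricted BLASTN hits).
# The dictionary layout and helper names are illustrative assumptions.
EST_COUNTS = {
    "Bull":  {"Odf1": 5,  "Prm1": 43,  "Prm2": 2,   "Tnp1": 0,   "Tnp2": 3},
    "Dog":   {"Odf1": 47, "Prm1": 7,   "Prm2": 0,   "Tnp1": 9,   "Tnp2": 267},
    "Human": {"Odf1": 16, "Prm1": 24,  "Prm2": 188, "Tnp1": 26,  "Tnp2": 8},
    "Mouse": {"Odf1": 98, "Prm1": 253, "Prm2": 252, "Tnp1": 222, "Tnp2": 126},
    "Pig":   {"Odf1": 35, "Prm1": 54,  "Prm2": 17,  "Tnp1": 71,  "Tnp2": 45},
}


def est_pair(species, gene_a, gene_b):
    """Return the raw EST counts (gene_a, gene_b) for one species."""
    counts = EST_COUNTS[species]
    return counts[gene_a], counts[gene_b]


def est_fraction(species, gene_a, gene_b):
    """Return gene_a's share of the combined gene_a + gene_b EST count.

    Returns None when neither mRNA has any ESTs, so that no division
    by zero occurs.
    """
    a, b = est_pair(species, gene_a, gene_b)
    total = a + b
    return None if total == 0 else a / total


if __name__ == "__main__":
    for species in EST_COUNTS:
        prm = est_pair(species, "Prm1", "Prm2")
        tnp = est_pair(species, "Tnp1", "Tnp2")
        print(f"{species}: Prm1:Prm2 = {prm[0]}:{prm[1]}, "
              f"Tnp1:Tnp2 = {tnp[0]}:{tnp[1]}")
```

Reporting the share of the combined count, rather than a simple quotient, keeps pairs with a zero count (such as dog Prm1 and Prm2, 7:0) well defined. Raw EST counts depend on library depth in each species, so only within-species comparisons between mRNAs are meaningful in this sketch.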

For example, the relative numbers of ESTs corresponding to the Prm1 and Prm2 mRNAs are 253:252, 43:2, and 24:188 in mouse, bull, and human, respectively, and the numbers of Tnp1 and Tnp2 ESTs differ greatly, 222:126 and 9:267, in mouse and dog, respectively.

Second, the guinea pig Odf1 gene exhibits two differences from those in other mammals. The guinea pig Odf1 5′ flanking region lacks a TATA box. In addition, despite the fact that the guinea pig Odf1 3′ UTR is similar in length to other mammals and contains the noncanonical AAUACA poly(A) signal that is characteristic of this 3′ UTR, the conserved 3′ UTR sequences in other species have been replaced by 10 copies of an 8-nucleotide element, usually CAGTAGGA, in guinea pig.

Third, there are several additional examples of genes that exhibit less striking differences, such as the hedgehog Prm1 5′ flanking region, which exhibits little similarity to the 5′ flanking regions of other species, including the absence of putative CRE sites and the shift in the position of the polyadenylation signals in rat and mouse Tnp1 3′ UTRs noted above.

Discussion

It seems likely that the Prm1, Prm2, Tnp1, Tnp2, and Odf1 genes identified here are orthologues based on combined evidence from the sequences of the encoded proteins; the positions of the introns; the conserved 5′ flanking, 5′ UTR, and 3′ UTR sequences; and the absence of retroposon paralogues. In general, the results demonstrate the presence of both conserved elements in the 5′ flanking region, 5′ UTR, and 3′ UTR in each gene and gene-specific sequences of the conserved elements in different genes.

Intuitively, the Prm1, Prm2, Tnp1, Tnp2, and Odf1 genes would be expected to spawn many retroposons, intronless paralogues created by reverse transcriptase copying mature mRNAs, because these genes are usually highly expressed in spermatids, and the creation of retroposons requires insertion into DNA in the germ line. However, none of the genes identified here is a retroposon. Comparative genomics also demonstrates that the Smcp and Prm3 genes have not created retroposons in 16 and 12 species of mammals, respectively ([13,18], Bagarova and Kleene, unpublished) and Southern blots indicate that a variety of genes that are expressed specifically in meiotic and haploid spermatogenic cells are single copy [17]. The observation that the somatic glyceraldehyde-3-phosphate dehydrogenase (GAPDH) mRNA has generated ~200 processed pseudogenes, while the spermatid-specific GAPDH mRNA has generated none [7,29,30], provides a striking example of the deficit in creation of retroposons from mRNAs that are expressed in haploid spermatogenic cells. Knockouts of the Mili2 and Dnmt3L genes reveal mechanisms that repress expression of LINE-1 retroposons in meiotic spermatogenic cells [31,32], which probably suppress the creation of retroposons from cellular mRNAs [33]. A systematic examination of the frequency of creation of retroposons from mRNAs that are expressed in various somatic cells and male and female germ cells would provide insights into the pathways and cell types that produce retroposons from mRNAs.

The promoters of all five genes exhibit similarities in a strong preference for transcription start sites 22–32 nt downstream of putative TATA boxes and the presence of putative CRE sites, but the TATA boxes and CRE sites exhibit conserved gene-specific differences. The TATA boxes exhibit conserved gene-specific sequences in the AT-rich core and the surrounding sequence. It should be emphasized here that the proteins that bind to these putative TATA boxes are unknown, and candidates include the classic TBP, which is grossly overexpressed in spermatogenic cells and is therefore predicted to bind sequences that differ from the consensus TATA box [34–36]; the spermatid-specific TBP-related factor 2 [37]; or one of the many uncharacterized testis-specific transcription factors [10].

An alternatively spliced isoform of CREM, CREMτ, is essential for transcription of many genes in haploid spermatogenic cells, and gel mobility shift and footprinting assays demonstrate that recombinant CREMτ binds to CRE in the Tnp1 and Odf1 5′ flanking regions [19,38]. Interestingly, the vast majority of putative CRE sites identified here (122/123) differ by at least one base from the consensus sequence, TGACGTCA. This is puzzling because these genes are often expressed at very high levels and the binding of recombinant CREMτ to elements that differ from the consensus sequence at one or two bases is drastically reduced [19]. This dilemma can be explained by hypothesizing that CREMτ forms a heterodimer with another transcription factor that increases binding to nonconsensus CRE sites. This hypothesis is consistent with evidence that binding of TFIIA to TBP introduces conformational changes in DNA and increases the affinity of TBP for nonconsensus TATA boxes [39]. It is relevant to note that CREMτ introduces bends in DNA and binds two transcriptional cofactors, one of which, Tisp40, increases the affinity of CREMτ for consensus CRE sites [40–42], but the effects of these cofactors on the binding of CREMτ to nonconsensus CRE sites have not been examined.

The conserved sequences in the 5′ UTR include the sequences surrounding the initiator codon and the body of the 5′ UTR. The 5′ terminus and the context of the initiation codon of the Smcp mRNA are highly conserved [13]. The high degree of conservation of the Prm1, Prm2, Tnp1, Tnp2, and Odf1 5′ UTRs was unexpected because delays in Prm1, Tnp2, and Smcp mRNA translation are primarily mediated by the 3′ UTR in transgenic mice [5,6,43]. However, the Smcp 5′ UTR alone produces short delays in translational activation and reduces the proportion of polysomal mRNA in transgenic mice [44]. It is also possible that the 5′ UTR, possibly in conjunction with interactions with the 3′ UTR, fine-tunes the efficiency of translation of each mRNA.

The context of the initiation codons also exhibits striking conservation. The initiator codons of the Prm1, Prm2, Tnp1, Odf1, and Smcp RNAs are in a strong context, and the Smcp initiator codon is preceded by one of the most conserved sequences in the entire mRNA, GAAG [13], which includes conserved bases in the −4, −2, and −1 positions that are thought to have little effect on translation [24,25]. The Tnp2 initiation codon is in an adequate context, CCCATGG, in which ~50% of ribosomes initiate translation at the AUG codon closest to the 5′ cap, and ~50% initiate at downstream AUG codons [24]. The conserved contexts of the initiation codons may modulate the efficiency of translation of these mRNAs, probably by facilitating the identification of initiation codons [45]. In fact, MFOLD predicts little secondary structure surrounding the initiator codons and that the initiator codons are not base paired in all five mRNAs in the mouse (data not shown).

The conservation of the 3′ UTRs includes sequences in the body of the 3′ UTR, especially the segment just upstream of the poly(A) signal and the poly(A) signal itself. The translational control element (TCE) and its upstream complement in the Prm1 3′ UTR [28] are highly conserved, with virtually identical sequences present in the 3′ UTRs of a marsupial, a monotreme, and 13 placental mammals. The selective pressure that conserves the Prm1 TCE is presumably the deleterious effect of premature PRM1 expression on male fertility [5]. In addition, the Odf1, Prm2, Tnp1, Tnp2, and Smcp mRNAs [13] also display gene-specific conserved motifs both upstream and downstream of the poly(A) signal.

It was unexpected that the type of poly(A) signal is strongly conserved. The Prm1, Prm2, and Tnp1 3′ UTRs contain canonical AAUAAA poly(A) signals; the Tnp2 3′ UTRs display the most common variant, AUUAAA; the Odf1 3′ UTR displays a noncanonical signal, AAUACA; and the Smcp mRNA displays two or three canonical and noncanonical signals ([13], Bagarova and Kleene, unpublished). Noncanonical poly(A) signals are common in mRNAs in spermatogenic cells and are usually thought to regulate differences in the 3′ ends of transcripts of genes that are expressed in both somatic and spermatogenic cells [12,46]. However, analysis of the 3′ termini of ESTs does not indicate that the Tnp2 and Odf1 mRNAs use alternative poly(A) sites in somatic and spermatogenic cells (data not shown), regardless of the uncertainty of whether the sources of the somatic ESTs have been identified correctly. The idea that the noncanonical poly(A) signals are necessary for positioning the poly(A) site raises the question why a stronger signal, AAUAAA, is not observed. Conceivably, the poly(A) signals function in cytoplasmic gene regulation, perhaps in conjunction with motifs close to the poly(A) signal.

This study also reveals striking examples of differences in mRNA expression levels that might be caused by the vagaries of sexual selection on gene expression in spermatogenic cells [9]. The huge differences in the ratio of Prm1 and Prm2 ESTs reported here agree with differences in the ratio of PRM1 and PRM2 proteins in the sperm of various mammals [47], but we were surprised by the magnitude of the differences in the ratio of Tnp1 and Tnp2 ESTs in mouse and dog. Unfortunately, comparisons of the levels of TNP1 and TNP2 in spermatids of various mammals are not available. The differences in relative levels of various mRNAs may exemplify a phenomenon in which the levels of expression of mRNAs in testis exhibit species-specific differences due to sexual selection [9,48,49]. It is also possible that the ratio of TNP1 and TNP2 has minimal selective consequences because these genes are partially redundant [3].

The guinea pig Odf1 gene exhibits two differences from the other Odf1 genes: the TATA box is lacking and the segments of the 3′ UTR upstream of the noncanonical poly(A) signal that are conserved in 18 other species have been replaced by 10 copies of an 8-nucleotide element, which is absent from all other Odf1 3′ UTRs. Further work will be necessary to determine whether the guinea pig Odf1 gene is a sequencing artifact, a nonexpressed pseudogene, or a functional gene. However, the total transformation of the Odf1 3′ UTR is difficult to explain as a sequencing artifact and implies strong selective pressure, which may be related to sexual selection or more general peculiarities of gene sequences in this species [50]. Nevertheless, there are no striking abnormalities of the ultrastructure of the guinea pig outer dense fibers [51]. The hedgehog Prm1 gene may be a pseudogene based on the absence of CRE sites in the promoter and the atypical protein sequence, which raises questions about the identity of the basic chromosomal proteins in this species' sperm.

This study highlights one of the advantages of studying gene expression in mammalian spermatogenic cells: the huge amount of genomic sequence information, which facilitates identifying conserved regulatory elements. This partially offsets the disadvantage of a system in which transgenic animals provide the only rigorous technique for analyzing the functions of specific sequences. The presence of TATA boxes, CRE sites, and conserved sequences near the 5′ and 3′ mRNA termini suggests that all five genes utilize common mechanisms to regulate transcription and translation, while the gene-specific differences in these and other elements suggest that each gene is regulated individually.

Materials and methods

The Tnp1, Tnp2, Prm1, Prm2, and Odf1 genes in various mammalian genomes were initially identified in the ENSEMBL databases (http://www.ensembl.org/Multi/blastview and http://www.ensembl.org/index.html) using TBLASTN and conserved amino-terminal queries (PRM1, MARYRCCRS; PRM2, MVRYRMRSLSE; TNP1, MSTSRKLKTHG; TNP2, MDTKTQSLPITHTQPHSNSRP; and ODF1, MAALSCLLDSVRRDIKKV). The carboxy-terminal exon of the Odf1 gene was identified with a conserved C-terminal segment (PCDPCNPCYPCGSRFSCRKMIL). These short queries minimize spurious “hits” to nonhomologous DNA sequences encoding low-complexity amino acid sequences. Later searches used full-length protein queries with the low-complexity filter turned off. Platypus and opossum Prm1 genes were identified with CAA81445 and NP_001028139 [52,53]. In some cases, homologous sequences were distinguished from nonhomologous sequences as long hits containing the N-terminal methionine, regardless of the apparent similarity to the query indicated by the colored bars. The nucleotide sequences identified initially in the ENSEMBL database were later used to identify the corresponding accession numbers in the NCBI database. Protein and DNA sequences were aligned with CLUSTALW (http://www.ebi.ac.uk/Tools/clustalw/index.html).

NCBI EST databases were searched with BLASTN to determine the number of ESTs corresponding to each mRNA in dog, bull, mouse, human, and pig using full-length mRNA queries, word size 11, and an ENTREZ limited query (testis OR testicle OR spermatid), and the maximum number of hits was set at 500 to avoid limiting the hits to the longest alignments. … 3′ nontranslated regions and ~60 nt downstream of the poly(A) signal to determine the site at which the 3′ ends of the ESTs diverge from genomic DNA. Transcription factor binding sites were predicted initially with TESS (http://www.cbil.upenn.edu/cgi-bin/tess/tess) [54], and putative nonconsensus CRE sites were subsequently identified with FUZZNUC (http://www.biotools.umassmed.edu) using the CRE consensus sequence, TGACGTCA, and allowing up to two mismatches. MEME (http://meme.sdsc.edu/meme/meme.html) was used to search 5′ flanking regions, 5′ UTRs, and 3′ UTRs for conserved motifs.

Acknowledgment

This work was supported by NSF Grants BCE-0348497 and MCB-0642128.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.ygeno.2008.05.001.

References

[1] A.L. Kierszenbaum, L.L. Tres, Structural and transcriptional features of the mouse spermatid genome, J. Cell Biol. 65 (1975) 258–270.
[2] K.C. Kleene, PolyA shortening accompanies the activation of translation of five mRNAs during spermiogenesis in the mouse, Development 106 (1989) 367–373.
[3] M.L. Meistrich, B. Mohapatra, C.R. Shirley, M. Zhao, Roles of transition nuclear proteins in spermiogenesis, Chromosoma 111 (2003) 483–488.
[4] P. Burfeind, S. Hoyer-Fender, Transcription and translation of the outer dense fiber gene Odf1 during spermiogenesis in the rat: a study by in situ analyses and polysome fractionation, Mol. Reprod. Dev. 45 (1996) 10–20.
[5] K. Lee, H.S. Haugen, C.H. Clegg, R.E. Braun, Premature translation of protamine 1 mRNA arrests spermatid differentiation and causes dominant sterility in transgenic mice, Proc. Natl. Acad. Sci. U. S. A. 92 (1995) 12451–12455.
[6] K. Tseden, O. Topaloglu, D. Bohm, S. Wolf, C. Muller, I. Adham, A. Meinhardt, A. Dev, G. Schluter, W. Engel, K. Nayernia, Premature translation of transition protein 2 mRNA causes sperm abnormalities and male infertility, Mol. Reprod. Dev. 74 (2007) 273–279.
[7] Rat Genome Sequencing Consortium, Genome sequence of the Brown Norway rat yields insights into mammalian evolution, Nature 428 (2004) 493–521.
[8] T. Birkhead, Promiscuity: An Evolutionary History of Sperm Competition, Harvard University Press, Cambridge MA, 2001.
[9] K.C. Kleene, Sexual selection, genetic conflict, selfish genes and the atypical patterns of gene expression in spermatogenic cells, Dev. Biol. 277 (2005) 16–26.
[10] K.C. Kleene, A possible meiotic function of the peculiar patterns of gene expression in mammalian spermatogenic cells, Mech. Dev. 106 (2001) 3–23.
[11] S. Kimmins, P. Sassone-Corsi, Chromatin remodelling and epigenetic features of germ cells, Nature 434 (2005) 583–589.
[12] D. Liu, J.M. Brockman, B.G. Dass, L.N. Hutchins, P. Singh, J.R. McCarrey, C.C. MacDonald, J.H. Graber, Systematic variation in mRNA 3′-processing signals during mouse spermatogenesis, Nucleic Acids Res. 35 (2006) 234–246.
[13] S.K. Hawthorne, G.T. Goodarzi, J. Bagarova, K.E. Gallant, R.R. Busanelli, W. Olend, K.C. Kleene, Comparative genomics of the sperm-mitochondria associated cysteine-rich protein gene, Genomics 87 (2006) 382–391.
[14] N. Sheth, X. Roca, M.L. Hastings, T. Roeder, A.R. Krainer, R. Sachidanandam, Comprehensive splice-site analysis using comparative genomics, Nucleic Acids Res. 34 (2006) 3955–3967.
[15] P. Burfeind, B. Belgardt, C. Szpirer, S. Hoyer-Fender, Structure and chromosomal assignment of a gene encoding the major protein of rat sperm outer dense fibres, Eur. J. Biochem. 216 (1993) 497–505.
[16] A.M. Weiner, P.L. Deininger, A. Efstradiatis, Nonviral retroposons: genes, pseudogenes, and transposable elements generated by the reverse flow of genetic information, Annu. Rev. Biochem. 55 (1986) 631–661.
[17] X. Zhong, K.C. Kleene, cDNA copies of the testis-specific lactate dehydrogenase LDH-C mRNA are present in spermatogenic cells in mice, but processed pseudogenes are not derived from mRNAs that are expressed in haploid and late meiotic spermatogenic cells, Mamm. Genome 10 (1999) 6–12.
[18] P. Grzmil, D. Boinska, K.C. Kleene, I. Adham, G. Schlüter, M. Kämper, B. Buyandelger, A. Meinhardt, S. Wolf, W. Engel, Prm3, the fourth gene in the mouse protamine gene cluster, encodes a conserved acidic protein that affects sperm motility, Biol. Reprod. 78 (2008) 958–967.
[19] V. Delmas, F. van der Hoorn, B. Mellstrom, B. Jegou, P. Sassone-Corsi, Induction of CREM activator proteins in spermatids: down-stream targets and implications for haploid germ cell differentiation, Mol. Endocrinol. 93 (1993) 1502–1514.
[20] J. Peschon, R.R. Behringer, R.L. Brinster, R.D. Palmiter, Spermatid-specific expression of protamine 1 in transgenic mice, Proc. Natl. Acad. Sci. U. S. A. 84 (1987) 5316–5319.
[21] P.A. Johnson, J.J. Peschon, P.C. Yelick, R.D. Palmiter, N.B. Hecht, Sequence homologies in the mouse protamine 1 and 2 genes, Biochim. Biophys. Acta 950 (1988) 45–53.
[22] M.A. Heidaran, C.A. Kozak, W.S.
Kistler, Nucleotide sequence of the Stp-1 gene The positions of poly(A) sites were determined by BLASTN searches of ESTs using the complete coding for rat spermatid nuclear transition protein 1 TP1: homology with 106 K.C. Kleene, J. Bagarova / Genomics 92 (2008) 101–106

protamine P1 and assignment of the mouse Stp-1 gene to 1, Gene 75 [39] A.R. Heib, W.A. Halsey, M.D. Betterton, T.T. Perkins, J.F. Klugel, J.A. Goodrich, TFIIA (1989) 39–46. changes the conformation of the DNA in TBP/TATA complexes and increases their [23] K.C. Kleene, J. Gerstel, D. Shih, Nucleotide sequence of the gene encoding transition kinetic stability, J. Mol. Biol. 372 (2007) 619–632. protein 2, Gene 95 (1990) 301–302. [40] R.P. de Groot, V. Delmas, P. Sassone-Corsi, DNA bending by transcription factors [24] M. Kozak, Context effects and inefficient translation at non-AUG codons in CREM and CREB, Oncogene 9 (1994) 463–468. eucaryotic cell-free translation systems, Mol. Cell. Biol. 9 (1989) 5073–5080. [41] G.M. Fimia, D. De Cesare, P. Sassone-Corsi, CBP-independent activation of CREM [25] M. Kozak, Not every polymorphism close to the AUG codon can be explained and CREB by the LIM-only protein ACT, Nature 398 (1999) 165–169. by invoking context effects on initiation of translation, Blood 101 (2003) [42] I. Nagamori, K. Yomogida, P.D. Adams, P. Sassone-Corsi, H. Nojima, Transcription 1202–1203. factors, cAMP-responsive element modulator CREM, and Tisp40, act in concert in [26] L.B. Tian, J. Hu, H. Zhang, C.S. Lutz, A large-scale analysis of mRNA polyA of human postmeiotic transcriptional regulation, J. Biol. Chem. 281 (2006) 15073–15081. and mouse genes, Nucleic Acids Res. 33 (2005) 201–212. [43] R.E. Braun, J. Peschon, R.R. Behringer, R.L. Brinster, R.D. Palmiter, Protamine 3′- [27] M.I. Zarudnaya, I.M. Kolomiets, A.L. Potyahaylo, D.M. Hovorun, Downstream untranslated sequences regulate temporal translational control and subcellular elements of mammalian pre-mRNA polyA signals: primary, secondary and higher- localization of growth hormone in spermatids of transgenic mice, Genes Dev. 3 order structures, Nucleic Acids Res. 31 (2003) 1375–1386. (1989) 793–802. [28] J. Zhong, A.H.F.M. Peters, K. Kafer, R.E. Braun, A highly conserved sequence [44] S.K. 
Hawthorne, R.R. Busanelli, K.C. Kleene, The 5′UTR and 3′UTR of the sperm essential for translation repression of the protamine 1 messenger RNA in murine mitochondria-associated cysteine-rich protein mRNA regulate translation in spermatids, Biol. Reprod. 64 (2001) 1784–1789. spermatids by multiple mechanisms in transgenic mice, Dev. Biol. 87 (2006) [29] M. Piechaczyk, J.M. Blanchard, A. Riaad El-Sabaty, L. Marty, P. Jeanteur, Unusual 382–391. abundance of vertebrate 3-phosphate dehydrogenase pseudogenes, Nature 312 [45] M.C. Ganoza, B.G. Louis, Potential secondary structure at the translational start (1984) 469–471. domain of eukaryotic and prokaryotic mRNAs, Biochimie 76 (1994) 4289–4439. [30] J.E. Welch, P.R. Brown, D.A. O'Brien, E.M. Eddy, Genomic organization of a mouse [46] H. Wang, B.L. Sartini, C.F. Millette, D.L. Kilpatrick, A developmental switch in glyceraldehyde 3-phosphate dehydrogenase gene Gapd-s expressed in post- transcription factor isoforms during spermatogenesis controlled by alternative meiotic spermatogenic cells, Dev. Genet. 16 (1995) 179–189. messenger RNA 3′-end formation, Biol. Reprod. 75 (2006) 318–323. [31] M.A. Carmell, A. Girard, H.J. van de Kant, D. Bour'chis, T.H. Bestor, D.G. de Rooij, G.J. [47] M. Corzett, J. Mazrimas, R. Balhorn, Protamine 1:protamine 2 stoichiometry in the Hannon, MIWI2 is essential for spermatogenesis and repression of transposons in sperm of eutherian mammals, Mol. Reprod. Dev. 61 (2002) 519–527. the mouse male germline, Dev. Cell 2 (2007) 503–514. [48] C.D. Meiklejohn, J. Parsch, J.M. Ranz, D.L. Hartl, Rapid evolution of male-biased [32] D. Bourc'his, T.H. Bestor, Meiotic catastrophe and retrotransposon reactivation in gene expression in Drosophila, Proc. Natl. Acad. Sci. U. S. A. 100 (2003) 9894-9849. male germ cells lacking Dnmt3L, Nature 431 (2004) 96–99. [49] J.M. Ranz, C.I. Castillo-Davis, C.D. Meiklejohn, D.L. Hartl, Sex-dependent gene [33] H.H. 
Kazazian, Mobile elements: drivers of genome evolution, Science 303 (2004) expression and evolution of the Drosophila transcriptome, Science 300 (2003) 1626–1632. 1742–1745. [34] E.E. Schmidt, Transcriptional promiscuity in testis, Curr. Biol. 6 (1996) 768–769. [50] D. Graur, W.A. Hide, W.H. Li, Is the guinea pig a rodent? Nature 351 (1991) [35] E.E. Schmidt, U. Schibler, High accumulation of components of the RNA 649–652. polymerase II transcription machinery in rodent spermatids, Development 121 [51] D.W. Fawcett, The anatomy of the mammalian spermatozoon with particular (1995) 2373–2383. reference to the guinea pig, Z. Zellforsch. Mikrosk. Anat. 67 (1965) 279–296. [36] S.P. Persengiev, S. Robert, D.L. Kilpatrick, Transcription of the TATA binding protein [52] J.D. Retief, R.J. Winkfein, G.H. Dixon, Evolution of the monotremes: the sequences gene is highly up-regulated during spermatogenesis, Mol. Endocrinol. 10 (1996) of the protamine P1 genes of platypus and echidna, Eur. J. Biochem. 218 (1993) 742–747. 457–461. [37] P.G. Zhang, T.L. Pentilla, P.L. Morris, M. Teichman, R.G. Roeder, Spermiogenesis [53] J.D. Retief, C. Krajewski, M. Westerman, R.J. Winkfein, G.H. Dixon, Molecular deficiency in mice lacking the Trf2 gene, Science 292 (2001) 1153–1155. phylogeny and evolution of marsupial protamine P1 genes, Proc. R. Soc. London B, [38] M.K. Kistler, P. Sassone-Corsi, W.S. Kistler, Identification of a functional cyclic Biol. Sci. 259 (1995) 7–14. adenosine 3′,5′-monophosphate response element in the 5′-flanking region of the [54] J. Schug, Using TESS to predict transcription factor binding sites in DNA sequence, gene for transition protein 1 TP1, a basic chromosomal protein of mammalian in: A.D. Baxevanis (Ed.), Current Protocols in Bioinformatics, Wiley, New York, spermatids, Biol. Reprod. 51 (1994) 1322–1329. 2003, Chap. 2.6.