Mammalian Genome 10, 390–396 (1999).

Incorporating Mouse Genome © Springer-Verlag New York Inc. 1999

Exon structure of the DNA-binding domain from C. elegans to mammals

Colin F. Fletcher,1 Nancy A. Jenkins,1 Neal G. Copeland,1 Ali Z. Chaudhry,2 Richard M. Gronostajski2

1 Mammalian Genetics Laboratory, ABL-Basic Research Program, NCI-Frederick Cancer Research and Development Center, Frederick, Maryland 21702, USA
2 The Lerner Research Institute, Cleveland Clinic Foundation, 9500 Euclid Avenue, Dept. of Cancer Biology NB40, Cleveland, Ohio 44195 and Case Western Reserve University, Dept. of Biochemistry, Cleveland, Ohio, USA

Received: 2 November 1998 / Accepted: 9 December 1998

Abstract. The Nuclear Factor I (NFI) family of DNA-binding proteins is essential for adenovirus DNA replication and the transcription of many cellular genes. Mammals have four genes encoding NFI proteins, C. elegans has only a single NFI gene, and prokaryotes have none. To assess the relationship between members of this unusually small family of transcription/replication factors, we mapped the chromosomal locations of the four murine NFI genes and analyzed the exons encoding the DNA-binding domains of the mouse, Amphioxus, and C. elegans NFI genes. The four murine NFI genes are on Chrs 4 (Nfia and Nfib), 8 (Nfix), and 10 (Nfic), suggesting early duplication of the genes and dispersal throughout the genome. The DNA-binding domains of all four NFI genes are encoded by large (532 bp) exons with identical splice acceptor and donor sites in each. In contrast, the C. elegans nfi-1 gene has four phased introns interrupting this DNA-binding, domain-encoding exon, and the last exon extends 213 bp past the splice site used in all four murine genes. In addition, the introns present in C. elegans nfi-1 are missing from the NFI genes of Amphioxus and all mammalian genomes examined. This analysis of the exon structure of the C. elegans and murine NFI genes indicates that the murine genes were probably generated by duplication of a C. elegans-like ancestral gene, but that significant changes have occurred in the genomic organization of either the C. elegans or murine NFI genes during evolution.

The nucleotide sequence data reported in this paper have been submitted to GenBank and have been assigned the accession numbers: AF059517 (xNFI-1), AF059518 (xNFI-2), AF059519 (xNFI-3), AF059520 (xNFI-4), AF059521 (amphi-NFI), AF111263 (Nfia), AF111264 (Nfib), AF111265 (Nfic), AF111266 (Nfix).

Correspondence to: R.M. Gronostajski

Introduction

The Nuclear Factor I (NFI) gene family encodes a set of highly conserved site-specific DNA-binding proteins required for both adenovirus DNA replication (Nagata et al. 1983; Bosher et al. 1991; Armentero et al. 1994; Coenjaerts and van der Vliet 1994) and the transcription of a variety of cellular and viral genes (Goldberg et al. 1992; Cardinaux et al. 1994; Furlong et al. 1996; Krebs et al. 1996; Spitz et al. 1997). cDNAs encoding NFI proteins have been cloned from a number of species and define a set of four paralogous genes (NFI-A, NFI-B, NFI-C, and NFI-X) that are conserved throughout vertebrate evolution (Rupp et al. 1990; Kruse et al. 1991). Transcripts from NFI genes have been cloned from human (Santoro et al. 1988; Apt et al. 1994; Qian et al. 1995; Kulkarni and Gronostajski 1996), mouse (Inoue et al. 1990; Nebl and Cato 1995; Chaudhry et al. 1997), rat (Paonessa et al. 1988; Osada et al. 1996), hamster (Gil et al. 1988), pig (Meisterernst et al. 1989), chicken (Rupp et al. 1990; Kruse et al. 1991) and Xenopus (Roulet et al. 1995; Puzianowska-Kuznicka and Shi 1996) and have been shown to be alternatively spliced, yielding as many as seven proteins from each gene (Kruse and Sippel 1994; Wenzelides et al. 1996). While the number of distinct transcripts from each gene is large, the same four NFI genes are present in all vertebrate species examined to date and, unlike the Hox gene family (Carroll 1995), no extra-familial homologs have been detected. The apparent genetic simplicity of this gene family makes it a useful system to investigate gene-family expansion during metazoan evolution.

NFI proteins function by binding to duplex DNA containing the consensus sequence TTGG(C/A)N5(G/T)CCAA (Gronostajski et al. 1985; Gronostajski 1986, 1987; Meisterernst et al. 1988) and modulating DNA replication and transcription. The DNA-binding activity of NFI proteins resides in a highly homologous N-terminal DNA-binding domain (Mermod et al. 1989; Gounari et al. 1990; Novak et al. 1992), while the C-terminal regions of the proteins are more divergent. Multiple C-terminal domains of each NFI gene product are generated by alternative splicing (Kruse and Sippel 1994). The N-terminal DNA-binding domains of all NFI proteins appear to bind to the same DNA consensus sequence with apparently equal affinities (Meisterernst et al. 1988; Goyal et al. 1990) and are sufficient to stimulate adenovirus DNA replication in vitro (Mermod et al. 1989; Gounari et al. 1990). In contrast, activation of transcription requires both the DNA-binding domain and C-terminal transactivation domains (Mermod et al. 1989). The finding that NFI genes are differentially expressed during mouse development and cellular differentiation suggests that NFI proteins may play an important role in [...] during development (Kulkarni and Gronostajski 1996; Chaudhry et al. 1997). Since NFI proteins differ in their ability to transactivate NFI-responsive promoters, the precise mechanisms of stimulation of replication and transcription may differ among the NFI proteins (Apt et al. 1994; Krebs et al. 1996).

Although four NFI genes are present in mammals, only a single NFI gene has been detected in the nematode C. elegans, and the exon structure of the NFI gene in C. elegans differs considerably from those of some previously characterized mammalian NFI genes. Also, no NFI-like genes have been detected in any viral, fungal, bacterial, or archaebacterial genomes sequenced to date. These data suggest that the NFI gene family either arose during metazoan evolution or was lost independently in a number of lineages. To examine the relationship between members of the NFI gene family, we have mapped the locations of the NFI genes in the mouse genome, characterized the exons encoding most of the DNA-binding domain of the mouse genes, and compared the DNA-binding domain-encoding exons of the C. elegans NFI gene

C.F. Fletcher et al.: Structure and divergence of NFI genes 391

(nfi-1) with the mammalian and Amphioxus exons. Our studies indicate that while the mammalian and C. elegans genes likely share a common ancestral gene, significant changes have occurred in exon organization during evolution of the NFI gene family.

Materials and methods

Degenerate PCR and cloning of genomic and cDNA fragments. Degenerate PCR to detect NFI genes was performed with either of two sets of primers: Deg 1 (5′TTCCGGATGARTTYCAYCITTYATYGARGC3′) and Deg 2 (5′AATCGATRTGRTGBGGCTGIAYRCAIAG3′) are primers used previously to clone NFI cDNAs from chicken (Rupp et al. 1990) and human (Kulkarni and Gronostajski 1996) genomic DNA and do not amplify the C. elegans nfi-1 gene (data not shown), while Deg 3 (5′TAYAMITGGTTYMAYCTICARG3′) and Deg 4 (5′CKYTCICCRTCIGTRCTYTCS3′) were designed from the regions most highly conserved between C. elegans nfi-1 and vertebrate NFI genes and amplify NFI genes from both C. elegans and vertebrates. Amplification was performed in 20- to 50-μl reactions containing 10–300 ng of genomic DNA, 3 μM degenerate primers, 200 μM dNTPs, 50 mM Tris-HCl, pH 8.2, 1.5 mM MgCl2, 50 mM KCl, and 20 μg/ml BSA. Cycle times were 94°C for 5 min; 35 cycles of 94°C for 1 min, 50°C for 2 min, and 72°C for 3 min; and 72°C for 7 min for reactions used to clone PCR products. PCR products were repaired with T4 polymerase, cloned and sequenced. Amphioxus genomic DNA was a gift from Peter Holland (Univ. Reading, UK).

Genomic fragments containing the 2nd exons of the murine Nfia and Nfic genes were obtained by screening a mouse strain 129 genomic library (Stratagene) with oligonucleotide probes specific for exon 2 of these genes (Chaudhry et al. 1997) with standard techniques. Exon 2 of Nfix was obtained from Bac116M11 (Research Genetics), shown previously to contain the Nfix gene (Fletcher et al. 1996), and exon 2 of Nfib was obtained from Bac175E2, which was identified by probing Bac filter arrays (Research Genetics) with an Nfib-specific probe. Sequencing was performed by the Lerner Research Institute Molecular Biotechnology Core Facility.

Clones of cDNAs from the C. elegans nfi-1 gene were obtained from Y. Kohara (C. elegans cDNA sequencing/expression project, CREST and Gene Network Lab, National Institute of Genetics, Mishima, Japan), TIGR, and by RT-PCR (Life Sciences) of total RNA of N2 worms with standard techniques. Primers from putative exons of the nfi-1 gene identified by the C. elegans Genome Sequencing Consortium (U21308) were used for PCR amplification of RT product in the buffer described above, and cDNA fragments were cloned and sequenced.

Sequence comparison of NFI genes. To assess the sequence divergence of the NFI genes, sequences bounded by the Deg 3 and Deg 4 primers within the highly conserved NFI DNA-binding domain were aligned and analyzed with both the ProtDist and ProtPars programs of Phylip [Ver. 3.572 (Felsenstein 1989)] and ClustalW (Thompson et al. 1994). Trees were examined for step length using MacClade [Ver. 3.07 (Maddison and Maddison 1992)] and displayed using TreeView (Page 1996). Only the distance tree of ProtDist is shown, but the clustering of the NFI homologs was relatively consistent in the different analyses.

Interspecific backcross mapping. Interspecific backcross progeny were generated by mating (C57BL/6J × Mus spretus)F1 females and C57BL/6J males as described (Copeland and Jenkins 1991). In total, 205 backcross mice were used to map the Nfia, Nfib, Nfic, and Nfix loci (see text for details). DNA isolation, restriction enzyme digestion, agarose gel electrophoresis, Southern blot transfer, and hybridization were performed essentially as described (Jenkins et al. 1982). The Nfia probe, a 0.677-kb EcoRI fragment of human cDNA, was labeled with [α-32P]dCTP by random priming (Stratagene), and washing was to a final stringency of 1.0 × SSCP, 0.1% SDS, 65°C. Fragments of 9.0 and 8.0 kb were detected in EcoRV-digested C57BL/6J DNA, and fragments of 10.5 and 9.0 kb were detected in M. spretus DNA. Fragments of 2.8, 1.5, and 1.3 kb were detected in TaqI-digested C57BL/6J DNA, and fragments of 4.1, 1.5, and 1.3 kb were detected in M. spretus DNA. The presence or absence of the 10.5- and 4.1-kb M. spretus-specific fragments was followed in backcross mice. The Nfib probe, a 0.793-kb EcoRI fragment of human cDNA, was labeled as above, and washing was to a final stringency of 1.0 × SSCP, 0.1% SDS, 65°C. Fragments of 7.1 and 7.0 kb were detected in KpnI-digested C57BL/6J DNA, and fragments of 9.4 and 7.0 kb were detected in M. spretus DNA. The presence or absence of the 9.4-kb M. spretus-specific KpnI fragment was followed in backcross mice. The Nfic probe, a 0.631-kb EcoRI fragment of human cDNA, was labeled with [α-32P]dCTP as above, and washing was to a final stringency of 0.8 × SSCP, 0.1% SDS, 65°C. Fragments of 7.4 and 3.5 kb were detected in BamHI-digested C57BL/6J DNA, and fragments of 7.4 and 4.9 kb were detected in M. spretus DNA. The presence or absence of the 4.9-kb M. spretus-specific BamHI fragment was followed in backcross mice. The Nfix probe, a 1.472-kb EcoRI fragment of human cDNA, was labeled as above, and washing was to a final stringency of 0.8 × SSCP, 0.1% SDS, 65°C. Fragments of 3.9 and 2.6 kb were detected in PvuII-digested C57BL/6J DNA, and fragments of 6.0 and 2.6 kb were detected in M. spretus DNA. The presence or absence of the 6.0-kb M. spretus-specific PvuII fragment was followed in backcross mice. The map locations of several other loci used to position the Nfi loci on our interspecific backcross have been previously described (Copeland and Jenkins 1991; Fletcher et al. 1996, 1997). Recombination distances were calculated as described by Green (1981), using the computer program MapManager. Gene order was determined by minimizing the number of recombination events across the chromosome.

Results

Mapping of the mouse NFI genes. The Nfi loci are distributed in the mouse genome and map to regions of conserved synteny with their locations in the human genome (Fig. 1). Nfia and Nfib are located in the middle region of mouse Chr 4. Nfia maps to the region of synteny with human chromosome 1p, consistent with its localization at 1p31.2-31.3 (Qian et al. 1995). Nfib maps to the region of synteny with human chromosome 9p, consistent with its localization at 9p24.1. While 67 mice were analyzed for each marker, up to 190 mice were typed for some pairs of markers. Each locus was analyzed in pairwise combinations for recombination frequencies. The ratios of the total number of mice exhibiting recombinant chromosomes to the total number of mice analyzed for each pair of loci and the most likely gene order are: centromere–Tyrp1–1/158–Nfib–4/110–Ifna–0/122–Jun–5/160–Nfia–0/156–Pgm2–15/149–Cyp4a10.

Nfic is located in the proximal region of mouse Chr 10 in a region syntenic to human chromosome 19p, consistent with its position at human 19p13.3 but differing from its earlier predicted location on mouse Chr 8 (Qian et al. 1995). The most likely gene order was determined by haplotype analysis of 141 animals typed for all markers and pairwise analysis of up to 185 animals typed for some pairs of markers. The ratios of the total number of mice exhibiting recombinant chromosomes to the total number of mice analyzed for each pair of loci and the most likely gene order are: centromere–Col6a1–2/158–Nfic–0/173–Lmet1–10/175–Igf1.

Nfix is located in the middle region of Chr 8 in a region syntenic to human chromosome 19p, consistent with its position at human 19p13.3 (Qian et al. 1995) and mouse 8C1-2 (Scherthan et al. 1994). The most likely gene order was determined by haplotype analysis of 144 animals typed for all markers and pairwise analysis of up to 187 animals typed for some pairs of markers. The ratios of the total number of mice exhibiting recombinant chromosomes to the total number of mice analyzed for each pair of loci and the most likely gene order are: centromere–Il15–3/172–Lyl1–0/187–Junb–0/168–Nfix–10/167–Gnao.

Analysis of alternative splicing and exon structure of the C. elegans nfi-1 gene. Since a single NFI gene in C. elegans was identified by the C. elegans Genome Sequencing Consortium (now designated nfi-1, Fig. 2A), we asked whether this was likely the only NFI gene in C. elegans and assessed the accuracy of the splicing pattern predicted by Genefinder. By three criteria, the nfi-1 gene is the only NFI homolog present in C. elegans: (i) PCR with degenerate primers Deg 3 and Deg 4, which efficiently amplify all four mouse and human NFI genes (Kulkarni and

Fig. 1. Genes encoding the NFI transcription factors are dispersed in the mouse genome. Nfia and Nfib were mapped to mouse Chr 4 by interspecific backcross analysis. Partial linkage maps of Chr 4, 8, and 10 are shown, indicating the locations of Nfia, Nfib, Nfic, and Nfix in relation to linked genes. Recombination distances between loci in centimorgans (cM) are shown to the left of the chromosome, and the positions of loci in human chromosomes, where known, are shown to the right. References for human map positions can be obtained from GDB (Genome Data Base, http://gdbwww.gdb.org), a computerized database of human linkage information maintained by The William H. Welch Medical Library of The Johns Hopkins University (Baltimore, Md.).
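The recombination distances plotted in Fig. 1 follow from the recombinant/total ratios reported in Results, for example 1/158 for the Tyrp1–Nfib interval on Chr 4. The sketch below illustrates that arithmetic under the usual backcross assumption that the distance in cM is 100 × recombinants/total, with a binomial standard error. The function names are hypothetical, the upper limit used for intervals with no recombinants is a common convention rather than something stated in the paper, and this is not the MapManager implementation.

```python
import math

# Recombinant/total ratios for the Chr 4 intervals, as listed in Results.
CHR4_INTERVALS = [
    ("Tyrp1", "Nfib", 1, 158),
    ("Nfib", "Ifna", 4, 110),
    ("Ifna", "Jun", 0, 122),
    ("Jun", "Nfia", 5, 160),
    ("Nfia", "Pgm2", 0, 156),
    ("Pgm2", "Cyp4a10", 15, 149),
]


def map_distance_cm(recombinants, total):
    """Return (distance, standard error) in cM for one backcross interval."""
    if total <= 0 or not 0 <= recombinants <= total:
        raise ValueError("need total > 0 and 0 <= recombinants <= total")
    p = recombinants / total              # recombination fraction
    se = math.sqrt(p * (1.0 - p) / total)  # binomial standard error
    return 100.0 * p, 100.0 * se


def upper_limit_cm(total, confidence=0.95):
    """Upper confidence limit in cM when no recombinants were seen.

    Assumed convention: exact binomial bound 1 - (1 - confidence)**(1/total).
    """
    return 100.0 * (1.0 - (1.0 - confidence) ** (1.0 / total))


if __name__ == "__main__":
    for left, right, r, n in CHR4_INTERVALS:
        if r == 0:
            limit = upper_limit_cm(n)
            print(f"{left}-{right}: 0/{n}, upper 95% limit {limit:.1f} cM")
        else:
            d, se = map_distance_cm(r, n)
            print(f"{left}-{right}: {r}/{n} = {d:.1f} +/- {se:.1f} cM")
```

Intervals with zero recombinants (such as Ifna–Jun) give a point estimate of 0 cM, so only an upper bound is informative for them.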

Fig. 2. Exons of the nfi-1 gene and the number of NFI genes in C. elegans. A: The line represents the nfi-1 gene with confirmed exons shown as black boxes, putative exons as white boxes, alternative splices as angled brackets, and predicted or confirmed translation start sites for the nfi-1, col-6 and CEESH3 gene products shown as right-angled arrows. The exons encoding the predicted NFI DNA-binding domain are shown by the square bracket marked DNA-BD. Transcripts with splice a were detected by RT-PCR with exon 2 and exon 7 primers and delete putative exons 4–6. Transcripts either containing or lacking splice b were detected with PCR primers from exons 7 and 11. Alternative splice b deletes 16 aa N-terminal to the DNA-binding domain from the translated nfi-1 protein. Alternative splice c is present in the cDNA from plasmid yk42f10 but is absent from the cDNA from plasmid CEESQ09 (wEST02096) and generates a protein with a 3-aa deletion from the C-terminal region of the nfi-1 protein. B: PCR products of C. elegans cDNA and genomic DNA with primers Deg 3 and Deg 4 were analyzed on a 2% agarose gel, stained with ethidium bromide and imaged. Lanes: M, 123-bp markers; 1, -DNA PCR; 2, cDNA; 3, genomic DNA. Bars on the left indicate sizes of 123 markers, and arrows on the right indicate the size of PCR products of cDNA (365 bp) and genomic DNA (586 bp). C: The Yac polytene filter from the C. elegans Genomic Sequencing Consortium was probed with a fragment generated by PCR with primers Deg 3 and Deg 4, washed, and subjected to autoradiography. The spots on the right indicate the positions of the indicated Yacs that map to the genomic location of the nfi-1 gene. No additional spots were detected at levels of detection equal to 0.1% of the indicated spots. The spots appear slightly smeared owing to the long exposure shown.
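Panels B and C depend on the degenerate primers Deg 3 and Deg 4, whose sequences are given in Materials and methods in IUPAC ambiguity codes. To illustrate what "degenerate" means in practice, the sketch below counts and enumerates the distinct oligonucleotides in each primer pool. It assumes that inosine (I) acts as a single universal base and adds no degeneracy; the helper names are hypothetical, and the sequences are transcribed from the Methods text.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes. Inosine (I) is kept as one symbol,
# on the assumption that it pairs with any base.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "I": "I",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

DEG3 = "TAYAMITGGTTYMAYCTICARG"  # Deg 3, 5' to 3'
DEG4 = "CKYTCICCRTCIGTRCTYTCS"   # Deg 4, 5' to 3'


def degeneracy(primer):
    """Number of distinct sequences represented by a degenerate primer."""
    count = 1
    for base in primer.upper():
        count *= len(IUPAC[base])
    return count


def expand(primer):
    """Yield every concrete sequence in the degenerate primer pool."""
    choices = [IUPAC[base] for base in primer.upper()]
    for combo in product(*choices):
        yield "".join(combo)


if __name__ == "__main__":
    for name, seq in (("Deg 3", DEG3), ("Deg 4", DEG4)):
        print(f"{name}: length {len(seq)} nt, degeneracy {degeneracy(seq)}")
```

Under these assumptions each pool contains 2^6 = 64 sequences, since both primers carry six two-fold ambiguous positions.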

Gronostajski 1996; Chaudhry et al. 1997), yields a single 586-bp product predicted from the genomic nfi-1 gene in worm genomic DNA and a single 365-bp product predicted for spliced nfi-1 mRNA from RT product of worm RNA (Fig. 2B); (ii) hybridization of Southern blots of total worm genomic DNA and the C. elegans Genomic Yac grid with a probe from the conserved nfi-1 DNA-binding domain yielded signals consistent with only a single nfi-1 gene (Fig. 2C and data not shown); and (iii) more than 99% of the C. elegans genome has been sequenced, and nfi-1 is the only NFI homolog detected. Given these criteria, we concluded that nfi-1 was likely to be the only NFI homolog in C. elegans and examined cDNAs from the nfi-1 gene by RT-PCR of RNA from a mixed worm population (Fig. 2A). Using previously cloned cDNAs and our RT-PCR products from primers within predicted exons 1–19, we verified the splicing of exons 2–3 and 6–19 (black boxes, Fig. 2A), but detected no products from exons 1, 4, and 5. We did, however, isolate a PCR product of splicing from predicted exon 3 to predicted exon 7, bypassing exons 4–6 (Fig. 2A, alternative splice a). Most relevant to the comparison of exons encoding the DNA-binding domain of nfi-1 (predicted exons 7–11), we found that ∼50% of the RT-PCR products generated with primers from exons 7 and 13 showed an alternative splice, not predicted by Genefinder, with the splice acceptor site located at the exact position seen for the DNA-binding, domain-encoding exons of the

Fig. 3. Comparison of the DNA-binding, domain-encoding exons of the four murine and single C. elegans NFI genes. Shown are the sequences flanking the DNA-binding domain encoding exons of mouse NFI (mNFI-A, Nfia), NFI-B (mNFI-B, Nfib), NFI-C (mNFI-C, Nfic), NFI-X (mNFI-X, Nfix) and C. elegans NFI (CeNFI, nfi-1). Brackets extending from the sequences show the sizes of the mouse NFI exons (top, 1 exon of 532 bp, 177.3 aa) and the C. elegans exons (bottom, 5 exons covering 742 bp and extending 213 bp 3′ to the end of the murine exons). The small triangles at the bottom denote the approximate locations of the additional four introns and their phasing relative to splice acceptor sites in the mouse and C. elegans genes. Amino acids are shown in three-letter code below the DNA sequence.

human and porcine NFI-C genes and rat NFI-A gene (Fig. 2A, alternative splice b; see also Fig. 3). Thus, predicted exon 7 is expressed either as a single exon or as two separate exons with an internal 48-nt intron excised. Excision of the 48-nt intron generates an in-frame deletion of 16 aa in the predicted nfi-1 protein (Fig. 3).

While the DNA-binding domains of the human and porcine NFI-C gene, the rat NFI-A gene, and all four murine NFI genes (see below) are encoded on a single 532-nt exon (Meisterernst et al. 1989; Xu et al. 1997), the nfi-1 gene has four additional introns inserted into the region encoding the DNA-binding domain (Fig. 2A, bracket, and Fig. 3) and the final exon encoding the DNA-binding domain of nfi-1 extends 213 nt past the splice donor sites used in the mammalian genes (Fig. 3). Both cloned cDNAs (see Materials and methods) and RT-PCR products from predicted exons 7–11 confirmed this predicted structure of the nfi-1 DNA-binding domain-encoding exons (Fig. 2B and data not shown). In contrast, cDNAs from the TIGR and Kohara cDNA libraries showed that the predicted splicing of exons 17–19 of nfi-1 was probably incorrect and that two alternative 3′ splice acceptor sites are used at a revised exon 18, generating two proteins that differ by three amino acids (alternative splice c). Thus, we confirmed only a subset of the exons predicted by Genefinder (∼80%), showed that some exon and splice calls were most likely incorrect or incomplete, demonstrated alternative splicing of the C. elegans nfi-1 gene, and demonstrated that the exons encoding the nfi-1 DNA-binding domain were strikingly different from those of the previously published human and porcine NFI-C and rat NFI-A genes.

Comparison of the 2nd exons of the murine Nfi genes and the C. elegans nfi-1 gene. The DNA-binding domains of the human and porcine NFI-C genes and rat NFI-A gene had previously been shown to be encoded by a single large 2nd exon (Meisterernst et al. 1989; Xu et al. 1997). This contrasted sharply with the finding that the C. elegans nfi-1 gene contained four phased introns inserted within this DNA-binding, domain-encoding exon (Figs. 2 and 3). To compare this exon region from all four NFI genes within a single mammalian species, we cloned the DNA-binding, domain-encoding exons of all four murine NFI genes. When the corresponding exons of the four murine NFI genes were isolated and sequenced, it was clear that each of the four genes had identical exon structures (Fig. 3). The exons were all 532 nt in length and had identical splice junctions (phase 0 at the 5′ end of the exon and phase 1 at the 3′ end). The same aspartate residue is encoded at the 5′ ends of all four mouse exons, but in keeping with gene-specific differences within their DNA-binding domains, different residues are encoded at the 3′ ends of these exons of the four NFI genes. As expected, the arginine residue encoded by the 3′ end of this exon in murine NFI-C is identical to that found in the human and porcine NFI-C genes (Meisterernst et al. 1989). No detectable sequence homology exists within the intron sequences flanking the exons, with the exception of the canonical ag and gt splice sites. This lack of homology in flanking regions suggests that it is unlikely that gene conversion is maintaining the homogeneity of the splice junctions in the four NFI genes (see Discussion).

As discussed above, the extension of the final exon encoding the C. elegans nfi-1 DNA-binding domain 213 nt 3′ of the splice sites found in the murine NFI genes was unexpected, and is not owing to a single point mutation in the splice donor site, since neither of the gt canonical residues is found at the appropriate positions (Fig. 3). There is no detectable sequence homology between the protein encoded by this 213-nt extension and any of the mouse NFI proteins (not shown). In addition, the splice acceptor site at the 5′ end of the DNA-binding domain is used in only a fraction of nfi-1 mRNAs, since we have shown that the intron upstream of the DNA-binding domain region can be skipped to generate an in-frame readthrough product (Fig. 2A, alternative splice a). Extensive cloning of vertebrate NFI cDNAs shows that they all contain a 10- to 12-aa translated exon prior to the DNA-binding domain exon, which is not required for DNA binding (Kruse and Sippel 1994). In contrast to our data on the nfi-1 gene, there is little or no evidence of alternative splicing of this exon in the mammalian NFI genes (see Discussion).

Lack of detectable “nfi-1-like” introns within the DNA-binding domain-encoding exons of other mammalian, Xenopus and Amphioxus NFI genes. Since the Deg 1 and Deg 2 primers used previously to amplify the DNA-binding domains of human NFI genes (Kulkarni and Gronostajski 1996) flanked the four introns in the nfi-1 gene, we used these primers to test whether NFI genes from other vertebrates contained these “nfi-1-like” introns. When genomic DNA from human, pig, mouse, rat, hamster and Xenopus was used in PCR reactions, a strong 486-bp product was obtained, indicating that “nfi-1-like” exons were not detectable (data not shown). Cloning of individual members from these mixed PCR products showed that all four mouse and human NFI genes were efficiently amplified, as were multiple Xenopus NFI gene products (Kulkarni and Gronostajski 1996; Chaudhry et al. 1997). Since vertebrate introns are frequently larger than those present in C. elegans, and larger introns within this region might inhibit amplification, some larger PCR products might have been missed in these analyses. However, since we successfully obtained clones of all four human, four mouse, and multiple Xenopus NFI genes by degenerate PCR with these primers, these data are highly suggestive that the NFI

Fig. 4. Protein distance analysis of the DNA-binding domains of NFI proteins. Sequences of NFI proteins encoded by DNA bounded by the Deg 3 and Deg 4 primers (106 aa) were subjected to protein distance analysis with the ProtDist program. The bar marked 0.1 changes/residue shows the branch length equal to 10 changes per 100 residues, and the tree is displayed with the C. elegans NFI protein as the outgroup (ceNFI). The brackets on the right show groupings containing the indicated genes. The tree length is 89 steps and was the 2nd shortest tree obtained by both jumbling and bootstrap analyses with protein parsimony and distance techniques. An 88-step tree has Amphioxus branching alone after the three Xenopus sequence cluster from CeNFI, but was inconsistent with parsimony trees generated by ClustalW and ProtPars. The sequences analyzed were (top to bottom): C. elegans nfi-1 (CeNFI, U21308); Amphioxus NFI (amphi-NFI, AF059521); Xenopus NFI-3 (xNFI-3, AF059519); Xenopus NFI-1 (xNFI-1, AF059517); Xenopus NFI-C1 (xNFI-C1, L43149); chicken NFI-C (cNFI-C, X51483); chicken NFI-X (cNFI-X, X61225); human NFI-C (hNFI-C, X12492); porcine NFI-C (pNFI-C, X12764); murine NFI-C (mNFI-C, U57635); rat NFI-C (rNFI-C, AF112458); chicken NFI-A (cNFI-A, X51486); rat NFI-A (rNFI-A, X13167); human NFI-A (hNFI-A, U07809); murine NFI-A (mNFI-A, U57633); Xenopus NFI-X2 (xNFI-X2, Z34463); murine NFI-X (mNFI-X, U57636); hamster NFI-X (haNFI-X, J04123); human NFI-X (hNFI-X, U07811); rat NFI-X (rNFI-X, AB012235); chicken NFI-B (cNFI-B, X51485); Xenopus NFI-2 (xNFI-2, AF059518); Xenopus NFI-B1 (xNFI-B1, L43146); hamster NFI-B (haNFI-B, J04122); Xenopus NFI-4 (xNFI-4, AF059520); murine NFI-B (mNFI-B, U57634); rat NFI-B (rNFI-B, AF112457); human NFI-B (hNFI-B, U07810).

genes from all of the species examined lack nfi-1-like introns interrupting their DNA-binding domain-encoding exons.

Amphioxus is a non-vertebrate cephalochordate and is a useful transition species between the simple metazoan C. elegans and mammals (Sidow 1996). For example, while most vertebrates have four Hox gene clusters, Amphioxus has only a single cluster (Garcia-Fernandez and Holland 1994). To investigate the number of NFI genes in Amphioxus and determine if they possessed nfi-1-like introns, we amplified Amphioxus genomic DNA with the Deg 3 and Deg 4 primers. Only a single 365-bp product was detected, which was identical in size to the product obtained from mouse genomic DNA and cDNA (data not shown), but smaller than the 586-bp product obtained from C. elegans genomic DNA (Fig. 2B). When multiple clones were generated from the Amphioxus PCR product, two very similar NFI DNA sequences were obtained (not shown but used in Fig. 4). These two products have 9 nt differences, all in 3rd base positions of codons, and predict identical proteins. In addition, the 365-bp PCR product was digested to completion with the restriction enzyme BanII and yielded a single product consistent with the two distinct sequences obtained (not shown). These data indicate that Amphioxus probably contains only a single NFI gene, and that the Amphioxus NFI gene detected by amplification with Deg 3 and Deg 4 lacks nfi-1-like introns.

Sequence comparison of DNA-binding domains of NFI genes. To address the pathway of generation of the four vertebrate NFI genes, we subjected NFI protein sequences to protein distance and parsimony analyses. To eliminate uncertainty in the treatment of gaps in alignments, we used only contiguous sequences from within the highly conserved NFI DNA-binding domain which are available from a number of species, and included the Amphioxus and Xenopus sequences obtained in our degenerate PCR studies. Shown are the results of protein distance measurements with the branches drawn by the UPGMA method and outgrouped to the distant C. elegans nfi-1 sequence (Fig. 4). The tree has a length of 89 steps and shows several interesting features. The Amphioxus sequence clusters with three Xenopus NFI sequences, two obtained from degenerate PCR and one published sequence (Fig. 4, Amphioxus and Xenopus bracket). This grouping is consistent with these sequences branching earlier than the four mammalian NFI family members. However, the incorrect inclusion of a known Xenopus NFI-C homolog in this cluster (whose identity was clearly indicated by protein homology outside of the DNA-binding domain) illustrates one difficulty in defining the early divergence of the NFI genes.

The vertebrate NFI-C gene family is represented in the next cluster, which is completely consistent with the grouping of the sequences established with larger portions of the protein sequences, with the exception of inclusion of the chicken NFI-X gene (Fig. 4, NFI-C bracket, cNFI-X) and exclusion of the Xenopus NFI-C gene discussed above. The branch length and improper placement of cNFI-X in this grouping suggest that this region of the DNA-binding domains of the cNFI-C and cNFI-X genes has diverged more than is seen in the NFI-A and NFI-B family members (which are clustered more consistently and have shorter branch lengths). Misplacement of cNFI-X was seen with both distance measurements and bootstrapped parsimony analyses with Phylip and ClustalW (data not shown). We also observed more sequence divergence within the NFI-C gene family than within the NFI-A, -B and -X families, as assessed by longer branch lengths in the NFI-C group. The cluster containing all of the sequenced NFI-A genes is next, followed by the NFI-X and NFI-B groups, all showing high consistency of clustering. The absence of branch lengths for most of the vertebrate NFI-A, -X and -B homologs is owing to sequence identity of this domain within each of the gene families. While there are clearly four paralogous NFI genes in chicken, rat, mouse, and human, there appear to be five NFI genes in Xenopus. However, this apparent increase in gene number may be due to polymorphism between NFI homologs isolated from outbred Xenopus populations, rather than the actual presence of five NFI genes. It is also possible that polyploidization of Xenopus has generated a complex set of NFI genes in this species (Sidow 1996).

In summary, the Amphioxus NFI homolog clusters with the Xenopus NFI genes, including a known Xenopus NFI-C homolog. The four mouse NFI DNA-binding domains show strong and consistent clustering with the other vertebrate NFI genes, demonstrating the high degree of conservation within each paralogous NFI group.

Discussion

Mapping of mouse NFI genes and mouse-human synteny. These mapping results further define paralogous regions of the mouse genome. A number of gene families, including Jun proto-oncogenes, Hu RNA-binding proteins, cAMP phosphodiesterases (type 4), janus kinases, prostaglandin E receptors, Rfx transcription factors, and now the NFI transcription factors, have members localized to human 19p13, 9p, and 1p. The mouse homologs of these genes are distributed in regions of conserved synteny with the human chromosomes [human 1p:mouse 4; human 19p13:mouse 8, 9, 10, 17; and human 9p:mouse 4; (Fletcher et al. 1997) and Fletcher, Copeland and Jenkins, unpublished observations]. These data suggest that several very large chromosomal duplications underlie the expansion of the Nfi gene family in vertebrates.

Phylogenetic differences in the DNA-binding domain-encoding exons of NFI genes. Our detailed comparison of the C. elegans nfi-1 gene and mouse NFI genes (Figs. 2–3) and our phylogenetic comparison of NFI homologs (Fig. 4) demonstrate major changes in the organization and composition of these exons in different species. In all mammals examined, the DNA-binding domain appears composed of a single, unusually large 532-nt exon. In the C. elegans nfi-1 gene, there are four phased introns interrupting this exon, and the final exon of the set extends 213 nt past the end of the mammalian exons. However, the conservation of the exact splice acceptor site at the beginning of the DNA-binding, domain-encoding exons in the nfi-1 and mouse genes demonstrates the clear relatedness of the nematode and mammalian NFI genes.

sites as the four mammalian genes. However, if an extension of the DNA-binding, domain-encoding exon to a region similar to that seen in nfi-1 were observed, it would suggest that this extension is the more primitive state. Thus, it will be of great interest to obtain genomic clones of the Amphioxus NFI gene and determine its structure.

Our sequence comparison of the NFI proteins is consistent with a model where the Amphioxus NFI gene was derived from the C. elegans nfi-1 gene after removal of four introns, and is ancestral to the mammalian NFI genes. This scenario is consistent with several studies that have placed Amphioxus between C. elegans and vertebrates in terms of genome evolution (Garcia-Fernandez and Holland 1994; Sidow 1996). Since the vertebrate NFI-C proteins are more divergent than the other three NFI proteins, it is tempting to suggest that NFI-C represents the earliest of the vertebrate NFI genes, with the other homologs being derived from it (Fig. 4). However, the very low degree of divergence among all four mammalian NFI proteins in this region makes this conclusion tentative, as the increased diversity of the NFI-C members may be owing to different selection pressure at this locus from that at the other NFI loci. The lack of homology between nfi-1 and vertebrate NFI genes in exons outside of the DNA-binding domain makes comparison of these regions of the genes uninformative (data not shown). Thus, while our data clearly demonstrate major changes in the exon organization of the NFI DNA-binding domain in different
In species, it will probably be necessary to isolate NFI genes from addition, we have evidence that the C. elegans and vertebrate NFI other early metazoans and transitional species to more precisely proteins have similar or identical DNA-binding specificity and define the pathway of evolution of the NFI gene family. affinity (A. Velyvis and R.M. Gronostajski unpublished). It is uncertain whether the introns in nfi-1 are remnants of the ancestral Acknowledgments. We thank Debra Gilbert for excellent technical assis- NFI gene or were inserted into nfi-1 after duplication of the an- tance, Christine Campbell for reading the manuscript, Feng Qian for gifts cestral gene to form the four vertebrate genes. Both scenarios are of human NFI gene probes, TIGR and Yuji Kohara for C. elegans nfi-1 possible and would conform to the two major proposed models of cDNAs, Peter Holland for Amphioxus genomic DNA, and Hans Baumeis- exon evolution, the so-called “introns early” vs. “introns late” ter and Frank Margolis for rat NFI cDNA sequences. This research was models (Keese and Givvs 1992; Gilbert et al. 1997; Stoltzfus et al. supported, in part, by the National Cancer Institute, DHHS, under contract NO1-CO-74101 with ABL, and by the National Institute for Child Health 1997). and Development grant HD34908 and National Science Foundation grant The size of the vertebrate NFI DNA-binding domain-encoding MCB-9612367 to R.M. Gronostajski. exons is unusually large (532 nt), suggesting that it may have been generated by fusion of smaller ancestral exons (Long et al. 1995). References If this were so, the introns-early model would suggest that the mammalian NFI DNA-binding domain was generated by the fu- Apt D, Liu Y, Bernard H-U (1994) Cloning and functional analysis of sion of smaller functional subdomains of an ancestral NFI gene. 
spliced isoforms of human nuclear factor I-X: interference with tran- Consistent with this model, it has recently been shown that the rat scriptional activation by NFI/CTF in a cell-type specific manner. NFI-A DNA-binding domain can be subdivided functionally into a Nucleic Acids Res 22, 3825–3833 C-terminal “specificity” domain that confers site-specific recogni- Armentero MT, Horwitz M, Mermod N (1994) Targeting of DNA poly- merase to the adenovirus origin of DNA replication by interaction with tion, and an N-terminal “affinity” domain that increases the bind- nuclear factor I. Proc Natl Acad Sci (USA) 91, 11537–11541 ing affinity of the specificity domain >100-fold (Dekker et al. Bosher J, Leith IR, Temperley SM, Wells M, Hay RT (1991) The DNA- 1996). The N-terminal affinity domain is an extremely basic alpha- binding domain of nuclear factor I is sufficient to cooperate with the helical domain composed of residues 1–78 (∼ the first 1/3rd) of the adenovirus type 2 DNA-binding protein in viral DNA replication. J Gen rat NFI-A DNA-binding domain. This region of nfi-1 shows less Virol 72, 2975–2980 homology with the vertebrate proteins than does the specificity Cardinaux JR, Chapel S, Wahli W (1994) Complex organization of CTF/ domain. The extension of the final exon encoding the nfi-1 DNA- NF-I, C/EBP, and HNF3 binding sites within the promoter of the liver- binding domain 213 nt past the conserved splice donor sites of the specific vitellogenin gene. J Biol Chem 269, 32947–32956 four murine NFI genes (Fig. 3) suggests loss of an intron present Carroll S (1995) Homeotic genes and the evolution of arthropods and chordates. Nature 376, 479–485 in all four mammalian NFI genes from the C. elegans gene. 
How- Chaudhry AZ, Lyons CE, Gronostajski RM (1997) Expression patterns of ever, this 213-nt extension shows no homology to any portions of the four Nuclear Factor I genes during mouse embryogenesis indicate a the mammalian NFI genes (data not shown), and thus its relation- potential role in development. Dev Dyn 208, 313–325 ship to the vertebrate genes is unclear. It will be of interest to Coenjaerts FE, van der Vliet PC (1994) Early dissociation of nuclear factor determine whether the DNA-binding domains of the C. elegans or I from the origin during initiation of adenovirus DNA replication studied mammalian NFI proteins can be dissected into five functionally by origin immobilization. Nucleic Acids Res 22, 5235–5240 independent subdomains that would be consistent with an early, Copeland NG, Jenkin NA (1991) Development and applications of a mo- multi-exon origin of the NFI DNA-binding domain. lecular genetic linkage map of the mouse genome. Trends Genet 7, The absence of nfi-1-like introns from the Amphioxus NFI gene 113–118 Dekker J, van Oosterhout JA, van der Vliet PC (1996) Two regions within is consistent with the absence of these introns in a single ancestral the DNA binding domain of nuclear factor I interact with DNA and cephalochordate NFI gene, which then was duplicated to generate stimulate adenovirus DNA replication independently. Mol Cell Biol 16, the vertebrate/mammalian NFI genes. While we have not yet 4073–4080 mapped the splice acceptor and donor sites of this Amphioxus Felsenstein J (1989) PHYLIP—Phylogeny Inference Package (Version gene, one simple prediction is that it would share the same splice 3.2). Cladistics 5, 164–166 396 C.F. Fletcher et al.: Structure and divergence of NFI genes

Fletcher C, Lutz C, O’Sullivan T, Shaughnessy J Jr, Hawkes R et al. (1996) Absence epilepsy in tottering mutant mice is associated with calcium channel defects. Cell 87, 607–617
Fletcher CF, Okano HJ, Gilbert DJ, Yang Y, Yang C et al. (1997) Mouse chromosomal locations of nine genes encoding homologs of human paraneoplastic neurologic disorder antigens. Genomics 45, 313–319
Furlong EE, Rein T, Martin F (1996) YY1 and NF1 both activate the human p53 promoter by alternatively binding to a composite element, and YY1 and E1A cooperate to amplify p53 promoter activity. Mol Cell Biol 16, 5933–5945
Garcia-Fernandez J, Holland PWH (1994) Archetypal organization of the amphioxus Hox gene cluster. Nature 370, 563–566
Gil G, Smith JR, Goldstein JL, Slaughter CA, Orth K et al. (1988) Multiple genes encode nuclear factor 1-like proteins that bind to the promoter for 3-hydroxy-3-methylglutaryl-coenzyme A reductase. Proc Natl Acad Sci USA 85, 8963–8967
Gilbert W, de Souza SJ, Long M (1997) Origin of genes. Proc Natl Acad Sci USA 94, 7698–7703
Goldberg H, Helaakoski T, Garrett LA, Karsenty G, Pellegrino A et al. (1992) Tissue-specific expression of the mouse α2(I) collagen promoter - studies in transgenic mice and in tissue culture cells. J Biol Chem 267, 19622–19630
Gounari F, De Francesco R, Schmitt J, van der Vliet P, Cortese R et al. (1990) Amino-terminal domain of NFI binds to DNA as a dimer and activates adenovirus DNA replication. EMBO J 9, 559–566
Goyal N, Knox J, Gronostajski R (1990) Analysis of multiple forms of nuclear factor I in human and murine cell lines. Mol Cell Biol 10, 1041–1048
Green EL (1981) Linkage, recombination, and mapping. In Genetics and Probability in Animal Breeding Experiments. ed. (New York: Macmillan) pp 77–113
Gronostajski R (1986) Analysis of nuclear factor I binding to DNA using degenerate oligonucleotides. Nucleic Acids Res 14, 9117–9132
Gronostajski R (1987) Site-specific DNA binding of nuclear factor I: effect of the spacer region. Nucleic Acids Res 15, 5545–5559
Gronostajski R, Adhya S, Nagata K, Guggenheimer R, Hurwitz J (1985) Site-specific DNA binding of nuclear factor I: analyses of cellular binding sites. Mol Cell Biol 5, 964–971
Inoue T, Tamura T, Furuichi T, Mikoshiba K (1990) Isolation of complementary DNAs encoding a cerebellum-enriched nuclear factor I family that activates transcription from the mouse myelin basic protein promoter. J Biol Chem 265, 19065–19070
Jenkins NA, Copeland NG, Taylor BA, Lee BK (1982) Organization, distribution, and stability of endogenous ecotropic murine leukemia virus DNA in chromosomes of Mus musculus. J Virol 43, 26–36
Keese P, Gibbs A (1992) Origins of genes: "Big bang" or continuous creation. Proc Natl Acad Sci USA 89, 9489–9493
Krebs CJ, Dey B, Kumar G (1996) The cerebellum-enriched form of nuclear factor I is functionally different from ubiquitous nuclear factor I in glial-specific promoter regulation. J Neurochem 66, 1354–1361
Kruse U, Sippel A (1994) The genes for transcription factor nuclear factor I give rise to corresponding splice variants between vertebrate species. J Mol Biol 238, 860–865
Kruse U, Qian F, Sippel AE (1991) Identification of a fourth Nuclear Factor I gene in chicken by cDNA cloning: NFIX. Nucleic Acids Res 19, 6641
Kulkarni S, Gronostajski RM (1996) Altered expression of the developmentally regulated NFI gene family during phorbol ester-induced differentiation of human leukemic cells. Cell Growth Differ 7, 501–510
Long M, Rosenberg C, Gilbert W (1995) Intron phase correlations and the evolution of the intron/exon structure of genes. Proc Natl Acad Sci USA 92, 12495–12499
Maddison W, Maddison D (1992) MacClade: analysis of phylogeny and character evolution. (Sunderland: Sinauer Assoc. Inc.)
Meisterernst M, Gander I, Rogge L, Winnacker EL (1988) A quantitative analysis of nuclear factor I/DNA interactions. Nucleic Acids Res 16, 4419–4435
Meisterernst M, Rogge L, Foeckler R, Karaghiosoff M, Winnacker E-L (1989) Structural and functional organization of a porcine gene coding for nuclear factor I. Biochemistry 28, 8191–8200
Mermod N, O’Neill E, Kelly T, Tjian R (1989) The proline-rich transcriptional activator of CTF/NF-1 is distinct from the replication and DNA binding domain. Cell 58, 741–753
Nagata K, Guggenheimer RA, Hurwitz J (1983) Specific binding of a cellular DNA replication protein to the origin of replication of adenovirus DNA. Proc Natl Acad Sci USA 80, 6177–6181
Nebl G, Cato A (1995) NFI/X proteins: a class of NFI family of transcription factors with positive and negative regulatory domains. Cell Mol Biol Res 41, 85–95
Novak A, Goyal N, Gronostajski RM (1992) Four conserved cysteine residues are required for the DNA binding activity of nuclear factor I. J Biol Chem 267, 12986–12990
Osada S, Daimon S, Nishihara T, Imagawa M (1996) Identification of DNA binding-site preferences for nuclear factor I-A. FEBS Lett 390, 44–46
Page RDM (1996) TREEVIEW: An application to display phylogenetic trees on personal computers. CABIOS 12, 357–358
Paonessa G, Gounari F, Frank R, Cortese R (1988) Purification of a NFI-like DNA binding protein from rat liver and cloning of the corresponding cDNA. EMBO J 7, 3115–3123
Puzianowska-Kuznicka M, Shi YB (1996) Nuclear factor I as a potential regulator during postembryonic organ development. J Biol Chem 271, 6273–6282
Qian F, Kruse U, Lichter P, Sippel AE (1995) Chromosomal localization of the four genes NFIA, B, C, and X for the human Nuclear Factor I by FISH. Genomics 28, 66–73
Roulet E, Armentero MT, Krey G, Corthesy B, Dreyer C et al. (1995) Regulation of the DNA-binding and transcriptional activities of Xenopus laevis NFI-X by a novel C-terminal domain. Mol Cell Biol 15, 5552–5562
Rupp R, Kruse U, Multhaup G, Gobel U, Beyreuther K et al. (1990) Chicken NFI/TGGCA proteins are encoded by at least three independent genes: NFI-A, NFI-B and NFI-C with homologues in mammalian genomes. Nucleic Acids Res 18, 2607–2616
Santoro C, Mermod N, Andrews P, Tjian R (1988) A family of human CCAAT-box-binding proteins active in transcription and DNA replication: cloning and expression of multiple cDNAs. Nature 334, 218–224
Scherthan H, Seisenberger C, Greulich K, Winnacker E-L (1994) Mapping of the murine Nuclear Factor I/X gene (Nfix) to mouse chromosome 8 C1-2 by FISH. Genomics 22, 247–249
Sidow A (1996) Gen(om)e duplications in the evolution of early vertebrates. Curr Opin Genet Dev 6, 715–722
Spitz F, Salminen M, Demignon J, Kahn A, Daegelen D et al. (1997) A combination of MEF3 and NFI proteins activates transcription in a subset of fast-twitch muscles. Mol Cell Biol 17, 656–666
Stoltzfus A, Logsdon JM Jr, Palmer JD, Doolittle WF (1997) Intron "sliding" and the diversity of intron positions. Proc Natl Acad Sci USA 94, 10739–10744
Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673–4680
Wenzelides S, Altmann H, Wendler W, Winnacker E-L (1996) CTF5 - a new transcriptional activator of the NFI/CTF family. Nucleic Acids Res 24, 2416–2421
Xu M, Osada S, Imagawa M, Nishihara T (1997) Genomic organization of the rat nuclear factor I-A gene. J Biochem 122, 795–801