J Mol Evol (1999) 48:517–527

© Springer-Verlag New York Inc. 1999

Diversification Pattern of the HMG and SOX Family Members During Evolution

Stéphan Soullier,1 Philippe Jay,1 Francis Poulat,1 Jean-Marc Vanacker,2 Philippe Berta,1 Vincent Laudet2

1 ERS155 du CNRS, Centre de Recherche en Biochimie Macromoléculaire, CNRS, BP5051, Route de Mende, 34293 Montpellier Cedex 5, France
2 UMR 319 du CNRS, Oncologie Moléculaire, Institut de Biologie de Lille, Institut Pasteur de Lille, 1 rue Calmette, 59019 Lille Cedex, France

Received: 20 July 1998 / Accepted: 19 October 1998

Abstract. From a database containing the published HMG sequences, we constructed an alignment of the HMG box functional domain based on sequence identity. Due to the large number of sequences (more than 250) and the short size of this domain, several data sets were used. This analysis reveals that the HMG box superfamily can be separated into two clearly defined subfamilies: (i) the SOX/MATA/TCF family, which clusters proteins able to bind to specific DNA sequences; and (ii) the HMG/UBF family, which clusters members which bind nonspecifically to DNA. The appearance and diversification of these subfamilies largely predate the split between the yeast and the metazoan lineages. Particular emphasis was placed on the analysis of the SOX subfamily. For the first time our analysis clearly identified the SOX subfamily as structured in six groups of genes named SOX5/6, SRY, SOX2/3, SOX14, SOX4/22, and SOX9/18. The validity of these clusters is confirmed by their functional characteristics and their sequences outside the HMG box. In sharp contrast, there are only a few robust branching patterns inside the UBF/HMG family, probably because of the much more ancient diversification of this family than the diversification of the SOX family. The only consistent groups that can be detected by our analysis are HMG box 1, vertebrate HMG box 2, insect SSRP, and plant HMG. The various UBF boxes cannot be clustered together and their diversification appears to be extremely ancient, probably before the appearance of metazoans.

Key words: HMG box, SOX, Sry, Molecular phylogeny

1 Present address: UPR1142 du CNRS, Institut de Génétique Humaine, 141 rue de la Cardonille, 34396 Montpellier Cedex 5, France
2 Present address: UMR 49 du CNRS, Ecole Normale Supérieure de Lyon, 46 allée d’Italie, 69364, Lyon Cedex 07, France
Correspondence to: Dr. P. Berta; e-mail: [email protected]

Introduction

The high-mobility group (HMG) proteins were first defined by their electrophoretic behavior on SDS-PAGE, as a series of nonhistone proteins able to interact nonspecifically with DNA. Among this heterogeneous group, HMG1 and HMG2 proteins, also called “classical” HMG, contain two 80-amino-acid domains responsible for DNA binding and termed “HMG boxes.” Since its identification, this domain was found in a large number of apparently unrelated factors that are all able to bind DNA. Thus, the HMG box is now considered as a signature of a large superfamily of proteins (Grosschedl et al. 1994).

The HMG box superfamily contains usual transcription factors which bind to specific DNA sequences, such as the T-cell transcription factors TCF1 and LEF1, the mating-type proteins of several fungi, the mammalian male sex-determining factor SRY, and the numerous SOX factors. The superfamily also contains a number of nonhistone proteins. The classical HMG proteins, which probably play both a structural and a functional role in chromatin (Lehming et al. 1994; Carballo et al. 1983; Nightingale et al. 1996; Onate et al. 1994; Singh and Dixon 1990; Zappavigna et al. 1996; Bonne et al. 1984), bind to single- and double-stranded DNA sequences and also to nonclassical DNA structures such as adduct-modified DNA (Billings et al. 1992; Farid et al. 1996), four-way junctions (Bianchi et al. 1989), and B–Z DNA junctions (Hamada and Bustin, 1985). Other members of the superfamily such as the mitochondrial transcription factor MTTF1, the nucleolar UBF, the structure specific recognition protein SSRP1, and the yeast nuclear nonhistone protein NHP6A also have a relaxed specificity for DNA and are able to recognize structural motifs on the DNA molecule. The number of HMG boxes in the various members of the family is variable, from one in most cases to five or six in the mouse or Xenopus UBF factors. Thus, the HMG box protein superfamily is characterised by an extreme functional and structural diversity.

Previous phylogenetic analysis has shown that the HMG box superfamily is extremely ancient (Laudet et al. 1993b; Griess et al. 1993). In fact, members of this family are known in all metazoan phyla, in plants, and in yeast but also in unicellular eukaryotes such as Trypanosoma (Erondu and Donelson 1992). A member of the classical HMG has been transduced in the Chilo iridescent virus (Schnitzler et al. 1994). Previous phylogenetical studies have classified the superfamily into two families: one comprising proteins with a single sequence-specific HMG box, such as the yeast mating-type genes, the TCF group of transcription factors, and the SOX proteins; and the other encompassing relatively non-sequence-specific DNA binding proteins with multiple HMG boxes, such as the classical HMG proteins, UBF, and mTTF1 (Laudet et al. 1993b; Griess et al. 1993). These analyses revealed striking differences in the rate of accumulation of mutations in the various HMG boxes, the SRY gene evolving the fastest within the superfamily (Pamilo and O’Neill 1997; Whitfield et al. 1993).

Since the beginning of these studies the size of the superfamily has increased considerably and the various trees published so far have never been tested for their robustness using statistical resampling methods such as the bootstrap analysis (Swofford and Olsen 1990). In the present paper, we have reconstructed the phylogeny of all published HMG box sequences using both distance and parsimony analysis. The validity of the various branches of the trees was tested using the bootstrap method and all branches below a threshold value of 60% were considered as nonvalid and then collapsed. This analysis allowed us to confirm the split of the HMG box superfamily into two widely distinct families: the MATA/TCF/SOX family and the HMG/UBF family. Furthermore, inside these two families, a clear and robust structure can be observed in the case of the SOX genes. The structure of the superfamily, and of the various families, subfamilies, and groups, is also fully confirmed by functional and structural considerations.

Materials and Methods

Construction of the Database and Sequence Alignments

Sequences were extracted from the EMBL, GenBank, and NBRF data libraries using both the FASTA program and the search for sequences identical to a given signature available (9-1 and 9-2 options in the CITI-II Bisance/Infobiogen network). To perform the FASTA search, the DNA binding domain sequences of Saccharomyces MTHMG-1, Mus Sox4, Homo TCF1, Mus Sox12, Neurospora MATA1, Tetrahymena HMGC, and Bos HMG1-2 were used. For the signature search program we used a consensus sequence corresponding to the most conserved part of the HMG box and encompassing the following sequence: PX(M/L)XN(A/S/T)X(I/M/S/L)S(K/Q/E)X(L/R)GXX(W/S), in which X is any amino acid.

Protein names and GenBank accession numbers of the full-length HMG box used in this study are as follows: ROX1, X60458; STE11, Z11156; MAT, X64195; MATA1, M54787; MATA1M, X07642; TCF1, X59869; TCF3, X62870; TCF4, X62871; LEF1, X58636; HBP1, U09551; SRY, X53772, X86384, X86383, X86382, X86380, X86386, Z30265, Z30646, U15569, X55491, L29551, L29547, L29544, L29543, L29549, L29548, L29552, L29542, S46279; SOX3, X71135, S69429, U12467; SOX19, X79821; SOX2, U12532; SOXRET, L07335; SOX9, Z46629, Z18958, U12533; SOX18, L35032; SOX4, X70683, X70298; SOX11, U23752, U12534; SRA3, L12021; SOX20, AB006768; SOX70D, U68056; SOX14, X65667; SOX21, U66141; SOX6, U32614; SOX22, U35612; SOX5, S83308, X65657; SOXLZ, D61689, D61688; UBF, X53461, X57561, X60831, M61725; MTTF-1, M62810; MLH, M87306; YD9395, Z46727; NOR90-1, X56687; HMG1, X12597, M21683, X12796, Y00463, D14314, U21933, L06453; HMG2, M83665, J02895, X67668, M80574, D30765; HMGT, X02666; PMS1, U13695; HMGX, L07107; YD8119, Z48008; SSRP1, M86737, S50213, L08825; DEF1, D14315; HMGZ, X71139; HMGD, M77023; HMG1B, M93254; HMG1PS, L08048; HMG1R, D14718; HMG2A, X63463; MTEST750, Z31299; HMGT2, L32954; HMG, X81456, L22300, D13491, Z28410, X58282, L28094, X76774, X58245, Z21703, L12169; S4664, D41834; S2676, D40599; HMGC, M63424; MTHMG-1, M73753; IXR1, L16900; CIIDBP, L08814; DSP1, U13881; NHP6A, X15317; and NHP6B, X15318.

The sequences of the HMG boxes were aligned using the ED program of the MUST package (Philippe 1993). This program allows a color visualization of the aligned sequences and the alignment is done by eye. To avoid errors we also aligned the sequences with the Clustal V program, which we used previously to generate an independent alignment of the HMG domain (Laudet et al. 1993a; Higgins and Sharp 1988). A printed version of the complete alignment of the HMG boxes is available upon request to S.S.

Regions of the alignment which are equivocally aligned (such as the very N-terminal and C-terminal parts of our alignment, which contain some residues outside the HMG box itself) as well as long insertions present in only one or two closely related sequences were excluded from the analysis. As numerous sequences were isolated by PCR analysis and were thus incomplete, only full-size sequences of the box were treated. For SOX genes, incomplete sequences were also used to perform a separate analysis as described under Results, below. The initial full-length alignment of the 144 HMG box sequences contained 75 sites, all variable, among which 74 were informative. Complete exclusion of all positions containing gaps was also done, and in that case only 52 informative sites remained.

Phylogenetical Reconstruction Procedures

Distance matrices were calculated using a Boolean matrix: any change including gaps is considered as 1; identical amino acids are considered as 0. Tree reconstructions were performed using the neighbor-joining program available on MUST together with bootstrap analysis using 1000 replicates (Swofford and Olsen 1990). We thus generated only a nonbootstrapped distance tree from this alignment. In the case of SOX factors three overlapping data sets were treated, depending on the size of the known sequences: an N-term data set (“SOX N-term”; AA 14–75 in our alignment), a central data set (“SOX central”; AA 26–83), and a C-term data set (“SOX C-term”; AA 33–91). The use of these three SOX data sets allowed us to test the effect of the sampling of sequences. Parsimony analysis was performed using the 3.1 version of the PAUP software (Swofford 1991). On each parsimony analysis data set, 100 bootstrap replicates were performed. All analyses were performed on a Macintosh Power PC 6100/66.

Results

A HMG Box Database

Our first aim was to generate a list of published HMG box sequences through a database search (GenBank and EMBL) with both the FASTA and the sequence signature procedures (see Materials and Methods). We used different members of the family to find all possible sequences. The results of this exhaustive search gave more than 250 incomplete and full-length HMG box sequences.

The sequences from the resulting list (see Materials and Methods and Table 2 for gene names and accession numbers) were aligned using the MUST computer package. For proteins containing more than one HMG box, each box was treated separately. This allowed us to compare the evolution of classical HMGs or UBF box sequences. Some gaps were included in the alignment. Interestingly, when positions of these gaps were compared with the three-dimensional structures already determined (Werner et al. 1995; Weir et al. 1993; Read et al. 1993; Jones et al. 1994), insertion/deletion events appeared to be located mainly between the α-helices. This constitutes a good argument in favor of this alignment since correlations between three-dimensional structure and position of insertion/deletion events have been demonstrated in other cases (Laudet et al. 1992). For example, at positions 42 and 43 of our alignment an insertion of two amino acids in box 1 of the vertebrate classical HMG proteins was observed. This insertion lies between helix 1 and helix 2 in the three-dimensional structure. Furthermore, the precise position of the gaps does not alter the topology of the robust branches of the tree (see below) and its effect may thus be considered marginal.

The sequence of the HMG box is highly variable, with no strict conservation at any position between all the superfamily members. Weak signatures based on the major amino acid present at each position have been proposed (Baxevanis and Landsman 1995; Landsman and Bustin 1993). These signatures remain extremely degenerate and may lead to erroneous identification of some proteins as HMG box-containing proteins. Some signatures for major subgroups such as SRY, UBF-1-3-6, HMG box 2, and HMG box 1 have also been proposed and are interesting tools of identification (Baxevanis and Landsman 1995; Landsman and Bustin 1993; Kolodrubetz 1990). Whether such signatures hold outside vertebrate or arthropod species remains to be determined.

Two Families of HMG Box Proteins

Because the number of sequences (144) in the complete data set largely exceeds the number of informative sites (74), this data set cannot be treated adequately by bootstrapping. To circumvent this problem, four independent data sets of 30 sequences (designated 30-1 to 30-4 in Table 1) that are representative of the major groups identified in the analysis based on the complete data set were used. We also chose to change the number of sequences in a given group between the various data sets in order to check the influence of the sampling procedure on the results. The bootstrap sampling procedure was systematically applied to our various data sets (Swofford and Olsen 1990). All branches which were supported by bootstrap values below 60% were collapsed to allow a discussion only on supported topologies.

The topologies and bootstrap values of the distance trees for the four independent data sets allowed the classification of the HMG superfamily into two families: (i) the MATA/TCF/SOX family and (ii) the UBF/HMG family (data not shown). This classification has a clear functional implication since it discriminates HMG box proteins with only one box, able to recognize specific DNA sequences, from proteins with several boxes, which mainly recognize DNA nonspecifically. This partition of the superfamily was also observed by other authors (Laudet et al. 1993a; Griess et al. 1993; Baxevanis and Landsman 1995). To exclude the possibility that these two families were the result of a sampling artifact during the choice of 30 sequences, a tree with one representative of each gene was generated (“locus” tree in Fig. 1). In that case, only one orthologue of vertebrate, arthropod, or plant genes was conserved for the analysis (“Locus” data set in Table 1). All the branches supported by bootstrap values under 60% were collapsed. This tree gave an identical topology to those obtained with the four sets of 30 sequences.

The MATA/TCF/SOX family can be divided into three subfamilies: yeast mating-type genes (MATA), T-cell transcription factors (TCF), and Sry-related genes (Sox). Two sequences (Saccharomyces ROX1 and Rattus HBP1) do not fall into any of these subfamilies (Balasubramanian et al. 1993; Lesage et al. 1994). From the topology of the distance tree, although not supported by significant bootstrap values, it is possible that ROX1 could be located inside the MATA subfamily that clusters genes linked by common functional characteristics (see Discussion). The UBF/HMG family, however, is less well resolved. This could be due to the extremely ancient dichotomy that led to the major subfamilies (see Discussion). No clear and robust subfamilies can be assigned to the UBF/HMG family, and only small groups of genes are clearly clustered together. In fact, 12 groups of genes are supported by bootstrap values above 60%: vertebrate HMG box 2, NHP6A, plant HMG, SSRP1, HMG box 1, MTTF1 box 2, UBF box 1, UBF box 3, MTTF1 box 1, UBF boxes 2a/5, UBF box 2/4, and Arabidopsis/Catharanthus HMG, which is distinct from the other plant HMGs (Fig. 1; data not shown). Eight sequences cannot be grouped with the others in the “locus” data set: Tetrahymena MLH, Trypanosoma HMG boxes 1 and 2, Homo PMS1, the transduced HMG box protein from Chilo iridescent virus, Tetrahymena HMGC, Saccharomyces MTHMG-1, and Saccharomyces IXR1 (Fig. 1; data not shown).

Table 1. List of the data sets used in this study (a)

Name            No. sequences   No. sites used   No. inform. sites   Figure
Total           144             75               74                  Not shown
Locus           62              75               73                  1
30-1            30              73               70                  Not shown
30-2            30              72               70                  Not shown
30-3            30              74               74                  Not shown
30-4            30              76               76                  Not shown
HMG             79              77               75                  Not shown
SOX
  Full          41              78               64                  2
  N-terminal    48              56               47                  Not shown
  Central       80              53               46                  Not shown
  C-terminal    55              54               43                  Not shown

(a) The number of sequences and number of sites are indicated, as well as the figure illustrating the results. All the alignments, trees, and bootstrap values obtained with these data sets are available from S.S. upon request.

Noteworthily, the various boxes present in a given protein (such as MTTF1, classical HMGs, or UBF) appear to be the result of ancient dichotomies. For example, the plant HMGs contain only one box, whereas the metazoan ones contain two boxes. The relationship between plant and animal HMGs is not visualized in our tree since plant HMGs are not the sister group of animal HMG box 1 and box 2. However, functional data suggest that the homologous status of these proteins is highly probable (Ner 1992; Grasser and Feix 1991; Stros and Dixon 1993) and that duplication of the unique HMG box of plant HMGs took place very early during the evolution of animals.

In contrast to vertebrates, the classical HMG gene is present as a single copy in Drosophila as well as in the sea urchin Strongylocentrotus. Inside the classical HMG, a typical vertebrate-specific duplication leading to paralogous HMG1 and HMG2 genes present in all vertebrates is observed. This pattern is further complicated by the existence of HMG-related genes described in human, mouse, and chicken (Mus MTEST750, Gallus HMG2A boxes 1 and 2, Homo HMG2R, etc.; data not shown). Some of these genes seem functional, whereas others probably represent pseudogenes (Stros and Dixon 1993).

The tree in Fig. 1 clearly indicates that Drosophila HMGD and HMGZ (as well as their close relatives Chironomus HMG1A and -1B; data not shown) are not homologues of classical HMG genes. These genes encode only one box and are closely related to human and Drosophila SSRP1 proteins which are implicated in DNA recombination (Bruhn et al. 1992, 1993). The same situation is true for Babesia HMG, which is related to Saccharomyces NHP6, and for Arabidopsis and Catharanthus HMG, which form a group distinct from classical plant HMGs. Thus, a lot of molecules termed HMG proteins with sequence identity with the classical HMGs appear to belong to different groups than the bona fide classical HMGs and are likely to perform different functions.

Parsimony analysis on the four data sets gave the same results as distance analysis, although a global reduction of the bootstrap values was observed (data not shown). The split of the superfamily into the MATA/SOX and the HMG/UBF families was thus confirmed. Within these families, MATA, TCF, SOX, HMG-2, NHP6, plant HMG, SSRP1, HMG-1, UBF1, UBF-3, UBF2/4, and UBF2a/5 are observed and supported by bootstrap values above 50%.

Diversification Inside the SOX Subfamilies

Previous analyses suggested the existence of several groups of SOX genes based either on functional characteristics or on the analysis of sequence identity values between isolated boxes encoded by these genes (Pevny and Lovell-Badge 1997; Collignon et al. 1996; Wright et al. 1993; Griffiths 1991; Spotila et al. 1994a, b). In order to determine if a phylogenetic signal would allow the definition of several groups in these sequences, complete

Fig. 1. Unrooted phylogenetic neighbor-joining “locus” tree obtained with the NJ program of the MUST package. Only one orthologue in vertebrates, arthropods, or plants was conserved in the analysis. All the branches supported by bootstrap values under 600 (of 1000 replicates, i.e., 60%) have been collapsed. The brackets indicate the various groups or the two families MATA/SOX and HMG/UBF. The dashed brackets suggest possible groupings not well supported by the bootstrap analysis.

Fig. 2. Unrooted phylogenetic neighbor-joining tree of complete SOX HMG box sequences showing the different subgroups of the SOX family. All the branches connecting these subgroups are supported by strong bootstrap values. The brackets indicate the various groups. Sequences from the MATA and TCF families were added as the external group.

HMG box SOX sequences were analyzed (see Table 1). was removed (and after the deletion of the incomplete A robust signal for a partition into six groups was ob- Drosophila SOX14) was generated (data not shown). served: SRY (821 bootstrap replicates), SOX2/3 (989 Again, the same partition into the groups SOX5/6, bootstrap replicates), SOX9/18 (922 bootstrap repli- SOX4/22, SOX2/3, and SRY was obtained when com- cates), SOX4/22 (894 bootstrap replicates), SOX 5/6 pared with the previous tree using the HMG box alone. (1000 bootstrap replicates), and Drosophila SOX14 (Fig. The only difference was the clustering of the Drosophila 2). The latter gene could be distantly related to the SOX70 gene with the vertebrate SOX9 gene, but with a SOX4/22 group (722 bootstrap replicates). The relation- very low support and the breakup of the SOX9/18 group. ships between the groups are not resolved. We tested whether the observed groups are robust Surprisingly a tree reconstructed from an alignment of despite a sampling modification, using partial SOX complete sequences of the SOX genes reported so far led HMG box sequences. Three data sets, based on N ter- to the same classification (not shown). Since this may be minal-, central-, and C terminal-located sequences, were the result of a strong signal present inside the HMG box generated (see Table 1). These data sets are overlapping itself, a tree from an alignment in which the HMG box both for the sequence sampling and for the part of the 523 HMG box used and present in the database. The se- ized in Trypanosoma, a unicellular (Erondu quence portions used span positions 13–76 of the align- and Donelson 1992). Furthermore, the UBF/HMG fam- ment for the N-terminal, 25–84 for the central, and 32– ily contains known members in the three major phyla of 92 for the C-terminal data sets (Table 1). Interestingly, in eukaryotes: fungi, plants, and animals. 
The MATA/TCF/ the three cases the same six groups of SOX genes (SRY, SOX family is known only in fungi and animals. From SOX2/3, SOX9/18, SOX4/22, SOX 5/6, and Drosophila these data it appears clear that the superfamily was pre- SOX14) were observed (data not shown). These groups sent long before the separation among the three phyla. are supported by bootstrap values above 60%. The Dro- From the important structural and functional role of sophila SOX14 gene remained isolated, distantly related ‘‘classical’’ HMG proteins such as the one found in Try- to the SOX4/22 group. The turtle Herdmania SOX1 gene panosoma, it is tempting to speculate that the HMG box appeared to be weakly related to Drosophila SOX14. superfamily could be specific to eukaryotes and even Table 2 shows the distribution of the various SOX se- considered a synaptomorphy of this kingdom. Such a quences in these six robust groups. As in our previous model is easily testable by searching for HMG box pro- analysis, species-specific clusters of genes such as the teins in other organisms such as eubacteria and archae- reptilian AD, SRA, AES, TG, and LG gene clones in bacteria. Screening of completely sequenced archaebac- Alligator, Chelydra, and Tarentola were observed. Since terial and eubacterial genomes (reviewed by Garrett the relationships within the SOX groups are in many 1996) revealed no clear signature of HMG box proteins cases unresolved due to extreme sequence conservation but these proteins may be difficult to identify in such and to the small length of the sequences, it is difficult to divergent organisms. know if these species-specific clusters correspond to a The question of the age of the dichotomy is more biological reality. difficult to resolve. From the available data, classical HMGs are probably the most ancient proteins of the family and are the only HMG proteins found in a protist Discussion lineage so far. 
Thus, it could be hypothesized that the root of our tree is not located between the two families but, rather, within the UBF/HMG family. Figure 3 illus- Age, Root, and Boxes trates two possible positions of the root and the important Our analysis clearly revealed an early split of the HMG functional differences that exist between the two models. box protein superfamily into two subfamilies. This struc- In the first model (Fig. 3A), the root separates the two ture, well supported by bootstrap values, confirmed pre- families and it is impossible to know whether DNA bind- vious work (Laudet et al. 1993a; Griess et al. 1993; Bax- ing specificity has been acquired by MATA/TCF/SOX evanis and Landsman 1995). Interestingly, this split factors or lost by UBF/HMG factors. In the second separates DNA binding specific transcription factors model (Fig. 3B), the basic dichotomy would have oc- with a single HMG box (MATA/TCF/SOX family) and curred relatively late during the evolution of unicellular nonspecific DNA binding factors, which generally con- eukaryotes and members of the MATA/TCF/SOX family tain at least two or more boxes (UBF/HMG family). should have evolved from them. In such a case it is clear Nevertheless, due to the wide variability existing within that the DNA binding specificity evolves from a nonspe- the superfamily, the application of these two criteria cific binding framework. This model assumes that the (specific DNA binding and number of boxes) is not suf- UBF/HMG family is paraphyletic and that its structure in ficient to debate about the distribution of a protein to one families cannot be solved in our analysis. This model or the other families. For example, plant HMG proteins also suggests that the MATA/TCF/SOX factors (which contain only one box as members of the SSRP1 group appeared later than UBF/HMG factors) should be found (Grasser and Feix 1991; Laux and Goldberg 1991). 
Also, in a smaller number of unicellular eukaryotes than UBF/ the mitochondrial transcription factor MTTF1 has been HMG family members. MATA/TCF/SOX factors have demonstrated to exhibit some specific DNA binding, al- not been found yet in plants but since the precise - though this specificity appears relaxed compared to tionships among plant, fungi, and animals have not been ‘‘classical’’ transcription factors (Ikeda et al. 1994; correctly resolved, it is difficult to know if this could be Fisher and Clayton 1988). It is worth noting that the an argument for their relatively younger age. members of the MATA/TCF/SOX family are also able to The homologous relationships that could exist be- bind nonspecifically to altered DNA structures (Ferrari et tween the plant HMG and the classical HMG box 1 and al. 1992). But, overall, the basic dichotomy between 2 groups is interesting to discuss in this context. Al- MATA/TCF/SOX and UBF/HMG is well supported at though not discernible in phylogenetic trees due to the both the phylogenetical and the functional levels. small size of the HMG box, these proteins may indeed The question of the age of the superfamily and of the belong to a common subfamily as suggested by available basic dichotomy is of importance. Interestingly, a ‘‘clas- functional data and intron position (Erondu and Donel- sical’’ HMG protein with two boxes has been character- son 1992; Grasser and Feix 1991; Laux and Goldberg 524

Table 2. SOX-related sequences used in this studya Table 2. Continued.

Accession Accession Factor Species SOX group No. Factor Species SOX group No.

SOXA Homo SOX2/3 X71136 SOXB Homo X71137 SRA9 Chelydra L12009 SOX10 Homo X65666 SRA10 Chelydra L12015 SOX12 Homo X73039 MG42 Tarentola M86337 SOX14 Mus Z18963 MG43 Tarentola M86338 SOX15 Mus X70909 MG44 Tarentola M86339 SOX16 Mus L29084 LG27 Eublepharis M86335 M2P Mus H.C. LG28 Eublepharis M86336 M4P Mus H.C. SOX13 Xenopus X65656 M5P Mus H.C. SOX1 Herdmania DROSOX14 X79248 M6P Mus H.C. M8P Mus H.C. a List of data sets for partial SOX HMG box-related sequences. The CH2 Gallus M86321 gene names and GenBank accession numbers are indicated. Phyloge- CH3 Gallus M86322 netic relationships are indicated in the ‘‘SOX group’’ column. H.C., CH4 Gallus M86325 Hans Clevers (personal communication). CH7 Gallus M86327 CH1 Gallus M86320 CH31 Gallus M86323 1991). This is different for Drosophila HMGD protein, CH32 Gallus M86324 CH60 Gallus M86326 and this is a strong argument to suggest that this protein, SOXLF2 Larus Z23085 which belongs to the SSRP1 group, is not an arthropod SOXLF3 Larus Z23086 homologue of ‘‘classical’’ HMG proteins. Nothing is AMA1 Alligator M86317 known about the structure of the plant HMG genes. If the AMA2 Alligator M86318 plant HMGs have introns inside the box at a position AMA3 Alligator M86319 SOX11 Xenopus X65654 similar to that of the classical animal ones, this would be SOX2 Herdmania X79249 a strong argument in favor of their real homologous re- SOX3 Herdmania X79250 lationship. In that case, since the Trypanosoma HMG DM10 Drosophila M86328 protein contains two boxes, this would suggest that plant DM17 Drosophila M86329 HMGs have specifically lost one of their boxes (Erondu DM23 Drosophila M86330 DM33 Drosophila M86331 and Donelson 1992; Grasser and Feix 1991; Laux and DM36 Drosophila M86332 Goldberg 1991). 
DM64 Drosophila M86333 In addition, this would suggest that the internal di- SOX6 Homo SOX5/6 X65663 chotomy that led to the appearance of two boxes in one SOX5 Mus X65657 protein predates the split between Trypanosoma and the SOX13 Mus Z18962 SOX12 Xenopus X65655 lineage that conducts to animals. In such a case, the SOX5 Xenopus X65653 internal duplication could be one of the most ancient SOX8 Homo SOX9/18 X65664 events that took place during the evolution of the HMG SOX15 Drosophila X65668 box superfamily and the root can be placed in order to SOX7 Mus X65660 separate HMG box 1 and HMG box 2. This idea should SOX8 Mus Z18957 SOX10 Mus Z18959 be tested by isolating and analyzing classical HMG SOX17 Mus L29085 members from unicellular eukaryotes. SOX12 Mus SOX4/22 Z18961 M1P Mus H.C. SOXLF4 Larus Z23087 SOX Gene Diversification SOXLF5 Larus Z23089 SOXLF6 Larus Z23088 As discussed above, the MATA/TCF/SOX family ap- AES6 Alligator M86316 pears to be relatively younger than the UBF/HMG one. AES1 Alligator M86313 AES2 Alligator M86314 Within this subfamily, MATA genes are found exclu- AES4 Alligator M86315 sively in fungi, whereas TCF and SOX appear to be ADW2 Alligator M86310 restricted to animals. Thus whereas the appearance of the ADW4 Alligator M86311 MATA/TCF/SOX family predates the yeast–animal lin- ADW5 Alligator M86312 eage split, the diversification within this family appears SRA1 Chelydra L12020 SRA4 Chelydra L12013 to be restricted to animals. In line with this notion, no SRA5 Chelydra L12014 TCF or SOX genes (apart from homologues of MATA SRA6 Chelydra L12010 subfamily members) are found in the complete genome SRA7 Chelydra L12012 sequence of Saccharomyces cerevisiae. Thus, our analy- SRA8 Chelydra L12011 sis allows us to suggest that two widely separated bursts 525

Fig. 3. Two possible scenarios illustrating the evolution of the HMG box-containing protein superfamily. A The root is placed between the MATA/TCF/SOX and the UBF/HMG groups. In this case, it is impossible to know whether DNA binding specificity has been acquired by MATA/TCF/SOX factors or lost by UBF/HMG factors. B In this case, the ancestor protein is not sequence specific and this property was acquired by the MATA/TCF/SOX group.

of diversification took place during the evolution of the HMG box superfamily: one extremely early, in the UBF/HMG superfamily, which provided a large diversity of nonspecific DNA binding factors; and the second later, specifically in metazoans, which gave rise to many transcription factors. The latter diversification burst is reminiscent of the one that took place during the formation of some other transcription factor superfamilies such as nuclear receptors, Hox, and Ets (Laudet et al. 1992, 1993b; Holland and Garcia-Fernandez 1996; Laudet 1997).

Within the SOX family, several robust groups of genes are observed: SRY, SOX2/3, SOX9/18, SOX4/22, SOX5/6, and Drosophila SOX14. Interestingly, the definition of SOX groups is confirmed in an analysis performed with sequences outside the HMG box. The SOX9/18 group is the only nonmonophyletic one. Thus there are clear structural arguments in favor of the grouping pattern observed in the study done with the box only. SOX11 and SOX4, located in the same group, both contain transactivating domains located in their C-terminal part, with an HMG box situated in the N-terminal moiety of the protein (Jay et al. 1995). Together with SOX22, these two proteins are encoded by a gene with only one coding exon, whereas the SOX9 and SOX17 genes, located in a separate group, contain one intron interrupting their HMG domain-encoding region (Foster et al. 1994; Kanai et al. 1996). Also, SOX4, SOX11, and SOX22 contain in their extreme N-terminal part the conserved sequence MVQQ, the function of which is not clear (Jay et al. 1997).

The position of the SRY gene itself appears to be difficult to define phylogenetically. In most of our analyses the SRY gene does not appear to be monophyletic since the marsupial Sry gene is often not directly connected to Sry from eutherian mammals. This is likely due to the extremely high rate of mutation accumulation observed within the Sry gene (Pamilo and O’Neill 1997; Whitfield et al. 1993; Tucker and Lundrigan 1993). All the Sry sequences, both inside and outside the box, evolved rapidly, and this is clearly visualized in our trees (compare the dichotomy between human and mouse Sry with human and mouse, or even chicken, SOX9). This rapid divergence is not specific to regions outside the box and thus cannot be attributed to a reduced functional pressure specific to these regions. The HMG box is not the only functionally important part of Sry (Poulat et al. 1997), and the high evolutionary speed of this gene should be interpreted as a specific adaptation of the protein to its function rather than as the result of a relaxed selective pressure. The fact that the mammalian Sry genes accumulate more nonsynonymous than synonymous mutations is in accordance with this view (Pamilo and O’Neill 1997; Whitfield et al. 1993; Tucker and Lundrigan 1993).

The organization of the SOX family into six groups suggests that each of these groups may have distinct and specific functions. To date, very few reports have analyzed the specific function of individual SOX genes. The effect of the disruption of a given gene has been described, for example, for SOX4 (Schilham et al. 1996), but the precise network of their target genes and even the precise DNA sequence that an individual SOX protein may bind are still unclear. Future experiments will allow us to decipher the functional differences that may exist between the six SOX gene groups.

The SOX genes have been described in a wide variety of organisms, from Drosophila to human. Several SOX genes have been found in the C. elegans genome, but a detailed study of them has not yet been reported. Nevertheless, most of these SOX genes can be placed in the various SOX groups defined in this study. Thus, we can already reasonably assume that, apart from SRY, which is specific to mammals, the other SOX groups appeared earlier than the arthropod/vertebrate split. This situation is comparable to what was observed in other gene families (Laudet et al. 1992, 1993b; Holland and Garcia-Fernandez 1996; Laudet 1997). In these cases it has been demonstrated that the major groups of genes originated early during the evolution of metazoans but cannot be found outside this phylum. Thus it is tempting to speculate that SOX genes as well as Ets or nuclear receptors are specific to metazoans. In this case it would be interesting to study them in early metazoans such as nematodes, platyhelminths, and coelenterates. Delineation of the function of these genes in these organisms will allow us to infer what could have been their basic biological role.

Acknowledgments. We thank Jacques Demaille and Dominique Stéhelin for constant support, Catherine Hänni for advice since the very beginning of this work, Jean Derancourt for computer facilities, Pascale Crépieux for critical reading of the manuscript, and members of the Endocrin’os (Lille) and Human Molecular Genetics (Montpellier) groups for help and friendship. This work was funded by the CNRS, IPL, and ACC-SV1 (No. 9501014) from the Ministère de l’Education Nationale et de l’Enseignement Supérieur.

References

Balasubramanian B, Lowry CV, Zitomer RS (1993) The Rox1 repressor of the Saccharomyces cerevisiae hypoxic genes is a specific DNA-binding protein with a high-mobility-group motif. Mol Cell Biol 13:6071–6078
Baxevanis AD, Landsman D (1995) The HMG-1 box protein family: classification and functional relationships. Nucleic Acids Res 23:1604–1613
Bianchi ME, Beltrame M, Paonessa G (1989) Specific recognition of cruciform DNA by nuclear protein HMG1. Science 243:1056–1059
Billings PC, Davis RJ, Engelsberg BN, Skov KA, Hughes EN (1992) Characterization of high mobility group protein binding to cisplatin-damaged DNA. Biochem Biophys Res Commun 188:1286–1294
Bonne AC, Harper F, Sobezak J, De Recondo (1984) Rat liver HMG1: a physiological nucleosome assembly factor. EMBO J 3:1193–1199
Bruhn SL, Pil PM, Essigmann JM, Housman DE, Lippard SJ (1992) Isolation and characterization of human cDNA clones encoding a high mobility group box protein that recognizes structural distortions to DNA caused by binding of the anticancer agent cisplatin. Proc Natl Acad Sci USA 89:2307–2311
Bruhn SL, Housman DE, Lippard SJ (1993) Isolation and characterization of cDNA clones encoding the Drosophila homolog of the HMG-box SSRP family that recognizes specific DNA structures. Nucleic Acids Res 21:1643–1646
Carballo M, Puigdomenech P, Palau J (1983) DNA and histone H1 interact with different domains of HMG 1 and 2 proteins. EMBO J 2:1759–1764
Collignon J, Sockanathan S, Hacker A, Cohen-Tannoudji M, Norris D, Rastan S, Stevanovic M, Goodfellow PN, Lovell-Badge R (1996) A comparison of the properties of Sox-3 with Sry and two related genes, Sox-1 and Sox-2. Development 122:509–520
Erondu NE, Donelson JE (1992) Differential expression of two mRNAs from a single gene encoding an HMG1-like DNA binding protein of African trypanosomes. Mol Biochem Parasitol 51:111–118
Farid RS, Bianchi ME, Falciola L, Engelsberg BN, Billings PC (1996) Differential binding of HMG1, HMG2, and a single HMG box to cisplatin-damaged DNA. Toxicol Appl Pharmacol 141:532–539
Ferrari S, Harley VR, Pontiggia A, Goodfellow PN, Lovell-Badge R, Bianchi ME (1992) SRY, like HMG1, recognizes sharp angles in DNA. EMBO J 11:4497–4506
Fisher RP, Clayton DA (1988) Purification and characterization of human mitochondrial transcription factor 1. Mol Cell Biol 8:3496–3509
Foster JW, Graves JA (1994) An SRY-related sequence on the marsupial X chromosome: implications for the evolution of the mammalian testis-determining gene. Proc Natl Acad Sci USA 91:1927–1931
Foster JW, Dominguez SM, Guioli S, Kowk G, Weller PA, Stevanovic M, Weissenbach J, Mansour S, Young ID, Goodfellow PN, Schafer AJ (1994) Campomelic dysplasia and autosomal sex reversal caused by mutations in an SRY-related gene. Nature 372:525–530
Garrett RA (1996) Methanococcus jannaschii and the golden fleece. Curr Biol 6:1377–1380
Grasser KD, Feix G (1991) Isolation and characterization of maize cDNAs encoding a high mobility group protein displaying a HMG-box. Nucleic Acids Res 19:2573–2577
Griess EA, Rensing RA, Grasser KD, Maier UG, Feix G (1993) Phylogenetic relationships of HMG box DNA-binding domains. J Mol Evol 37:204–210
Griffiths R (1991) The isolation of conserved DNA sequences related to the human sex-determining region Y gene from the lesser black-backed gull (Larus fuscus). Proc R Soc Lond B Biol Sci 244:123–128
Grosschedl R, Giese K, Pagel J (1994) HMG domain proteins: architectural elements in the assembly of nucleoprotein structures. Trends Genet 10:94–100
Hamada H, Bustin M (1985) Hierarchy of binding sites for chromosomal proteins HMG 1 and 2 in supercoiled deoxyribonucleic acid. Biochemistry 24:1428–1433
Higgins DG, Sharp PM (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73:237–244
Holland PWH, Garcia-Fernandez J (1996) HOX genes and chordate evolution. Dev Biol 173:382–395
Ikeda S, Sumiyoshi H, Oda T (1994) DNA binding properties of recombinant human mitochondrial transcription factor 1. Cell Mol Biol 40:489–493
Jay P, Goze C, Marsollier C, Taviaux S, Hardelin JP, Koopman P, Berta P (1995) The human SOX11 gene: cloning, chromosomal assignment and tissue expression. Genomics 29:541–545
Jay P, Sahly I, Gozé C, Taviaux S, Poulat F, Couly G, Abitbol M, Berta P (1997) SOX22 is a new member of the SOX gene family, mainly expressed in human nervous tissue. Hum Mol Genet 6:1069–1077
Jones DN, Searles MA, Shaw GL, Churchill ME, Ner SS, Keeler J, Travers AA, Neuhaus D (1994) The solution structure and dynamics of the DNA-binding domain of HMG-D from Drosophila melanogaster. Structure 2:609–627
Kanai Y, Kanai AM, Noce T, Saido TC, Shiroishi T, Hayashi Y, Yazaki K (1996) Identification of two Sox17 messenger RNA isoforms, with and without the high mobility group box region, and their differential expression in mouse spermatogenesis. J Cell Biol 133:667–681
Kolodrubetz D (1990) Consensus sequence for HMG-1 like DNA binding domains. Nucleic Acids Res 18:5565
Landsmann D, Bustin M (1993) A signature for the HMG-1 box DNA-binding proteins. BioEssays 15:539–546

Laudet V (1997) Evolution of the nuclear receptor superfamily: early diversification from an ancestral orphan receptor. J Mol Endocrinol 19:207–226
Laudet V, Hänni C, Coll J, Catzefils F, Stehelin D (1992) Evolution of the nuclear receptor gene superfamily. EMBO J 11:1003–1013
Laudet V, Niel C, Coll J, Duterque-Coquillaud M, Leprince D, Stehelin D (1993b) Evolution of the ets gene family. Biochem Biophys Res Commun 190:8–14
Laudet V, Stehelin D, Clevers H (1993a) Ancestry and diversity of the HMG box superfamily. Nucleic Acids Res 21:2493–2501
Laux T, Goldberg RB (1991) A plant DNA binding protein shares highly conserved sequence motifs with HMG-box proteins. Nucleic Acids Res 19:4769
Lehming N, Thanos D, Brickman JM, Ma J, Maniatis T, Ptashne M (1994) An HMG-like protein that can switch a transcriptional activator to a repressor. Nature 371:175–179
Lesage F, Hugnot JP, Amri EZ, Grimaldi P, Barhanin J, Lazdunski M (1994) Expression cloning in K+ transport defective yeast and distribution of HBP1, a new putative HMG transcriptional regulator. Nucleic Acids Res 22:3685–3688
Lundrigan BL, Tucker PK (1994) Tracing paternal ancestry in mice, using the Y-linked, sex-determining locus, Sry. Mol Biol Evol 11:483–492
Ner SS (1992) HMGs everywhere. Curr Biol 2:208–210
Nightingale K, Dimitrov S, Reeves R, Wolffe AP (1996) Evidence for a shared structural role for HMG1 and linker histones B4 and H1 in organizing chromatin. EMBO J 15:548–561
Onate SA, Prendergast P, Wagner JP, Nissen M, Reeves R, Pettijohn DE, Edwards DP (1994) The DNA-bending protein HMG-1 enhances binding to its target DNA sequences. Mol Cell Biol 14:3376–3391
Pamilo P, O’Neill RJ (1997) Evolution of the Sry genes. Mol Biol Evol 14:49–55
Pevny LH, Lovell-Badge R (1997) SOX genes find their feet. Curr Opin Genet Dev 7:338–344
Philippe H (1993) MUST, a computer package of Management Utilities for Sequences and Trees. Nucleic Acids Res 21:5264–5272
Poulat F, de Santa Barbara P, Desclozeaux M, Soullier S, Moniot B, Bonneaud N, Boizet B, Berta P (1997) The human testis determining factor SRY binds a nuclear factor containing PDZ protein interaction domains. J Biol Chem 272:7167–7172
Read CM, Cary PD, Crane RC, Driscoll PC, Norman DG (1993a) Solution structure of a DNA-binding domain from HMG1. Nucleic Acids Res 21:3427–3436
Schilham MW, Oosterwegel MA, Moerer P, Ya J, de Boer PA, van de Wetering M, Verbeek S, Lamers WH, Kruisbeek AM, Cumano A, Clevers HC (1996) Defects in cardiac outflow tract formation and pro-B-lymphocyte expansion in mice lacking Sox-4. Nature 380:711–714
Schnitzler P, Hug M, Handermann M, Janssen W, Koonin EV, Delius H, Darai C (1994) Identification of genes encoding proteins, non-histone chromosomal HMG protein homologue, and a putative GTP phosphohydrolase in the genome of Chilo Iridescent virus. Nucleic Acids Res 22:158–166
Singh J, Dixon GH (1990) High mobility group proteins 1 and 2 function as general class II transcription factors. Biochemistry 29:6295–6302
Spotila LD, Kaufer NF, Schoenbach L, Roy SW, Gilbert W (1994a) Sequence analysis of the ZFY and Sox genes in the turtle, Chelydra serpentina. Mol Phylogenet Evol 3:1–9
Spotila JR, Spotila LD, Kaufer NFJ (1994b) Molecular mechanisms of TSD in reptiles: a search for the magic bullet. Exp Zool 270:117–127
Stros M, Dixon GH (1993) A retropseudogene for non-histone chromosomal protein HMG-1. Biochim Biophys Acta 1172:231–235
Swofford DL (1991) Phylogenetical analysis using parsimony, version 3.1. Computer program distributed by the Illinois Natural History Survey, Champaign
Swofford DL, Olsen GJ (1990) Phylogeny reconstruction. In: Hillis DM, Morris C (eds) Molecular systematics. Sinauer Associates, Sunderland, MA, pp. 411–501
Tucker PK, Lundrigan BL (1993) Rapid evolution of the sex determining locus in Old World mice and rats. Nature 364:715–717
Weir HM, Kraulis PJ, Hill CS, Raine AR, Laue ED, Thomas JO (1993) Structure of the HMG box motif in the B-domain of HMG1. EMBO J 12:1311–1319
Werner MH, Huth JR, Gronenborn AM, Clore GM (1995) Molecular basis of human 46X,Y sex reversal revealed from the three-dimensional solution structure of the human SRY-DNA complex. Cell 81:705–714
Whitfield LS, Lovell-Badge R, Goodfellow PN (1993) Rapid sequence evolution of the mammalian sex-determining gene SRY. Nature 364:713–715
Wright EM, Snopek B, Koopman P (1993) Seven new members of the Sox gene family expressed during mouse development. Nucleic Acids Res 21:744
Zappavigna V, Falciola L, Citterich MH, Mavilio F, Bianchi ME (1996) HMG1 interacts with HOX proteins and enhances their DNA binding and transcriptional activation. EMBO J 15:4981–4991