Diversification Pattern of the HMG and SOX Family Members During Evolution
J Mol Evol (1999) 48:517–527 © Springer-Verlag New York Inc. 1999

Stéphan Soullier,1 Philippe Jay,1 Francis Poulat,1 Jean-Marc Vanacker,2 Philippe Berta,1 Vincent Laudet2

1 ERS155 du CNRS, Centre de Recherche en Biochimie Macromoléculaire, CNRS, BP5051, Route de Mende, 34293 Montpellier Cedex 5, France
2 UMR 319 du CNRS, Oncologie Moléculaire, Institut de Biologie de Lille, Institut Pasteur de Lille, 1 rue Calmette, 59019 Lille Cedex, France

Received: 20 July 1998 / Accepted: 19 October 1998

1 Present address: UPR1142 du CNRS, Institut de Génétique Humaine, 141 rue de la Cardonille, 34396 Montpellier Cedex 5, France
2 Present address: UMR 49 du CNRS, Ecole Normale Supérieure de Lyon, 46 allée d'Italie, 69364, Lyon Cedex 07, France
Correspondence to: Dr. P. Berta; e-mail: [email protected]

Abstract. From a database containing the published HMG protein sequences, we constructed an alignment of the HMG box functional domain based on sequence identity. Due to the large number of sequences (more than 250) and the short size of this domain, several data sets were used. This analysis reveals that the HMG box superfamily can be separated into two clearly defined subfamilies: (i) the SOX/MATA/TCF family, which clusters proteins able to bind to specific DNA sequences; and (ii) the HMG/UBF family, which clusters members which bind non specifically to DNA. The appearance and diversification of these subfamilies largely predate the split between the yeast and the metazoan lineages. Particular emphasis was placed on the analysis of the SOX subfamily. For the first time our analysis clearly identified the SOX subfamily as structured in six groups of genes named SOX5/6, SRY, SOX2/3, SOX14, SOX4/22, and SOX9/18. The validity of these gene clusters is confirmed by their functional characteristics and their sequences outside the HMG box. In sharp contrast, there are only a few robust branching patterns inside the UBF/HMG family, probably because of the much more ancient diversification of this family than the diversification of the SOX family. The only consistent groups that can be detected by our analysis are HMG box 1, vertebrate HMG box 2, insect SSRP, and plant HMG. The various UBF boxes cannot be clustered together and their diversification appears to be extremely ancient, probably before the appearance of metazoans.

Key words: HMG box; SOX proteins; Sry; Molecular phylogeny

Introduction

The high-mobility group (HMG) proteins were first defined by their electrophoretic behavior on SDS-PAGE, as a series of nonhistone proteins able to interact nonspecifically with DNA. Among this heterogeneous group, HMG1 and HMG2 proteins, also called "classical" HMG, contain two 80-amino acid domains responsible for DNA binding and termed "HMG boxes." Since its identification, this domain was found in a large number of apparently unrelated factors that are all able to bind DNA. Thus, the HMG box is now considered as a signature of a large superfamily of proteins (Grosschedl et al. 1994).

The HMG box superfamily contains usual transcription factors which bind to specific DNA sequences, such as the T-cell transcription factors TCF1 and LEF1, the mating-type proteins of several fungi, the mammalian male sex-determining factor SRY, and the numerous SOX factors. The superfamily also contains a number of nonhistone proteins. The classical HMG proteins, which probably play both a structural and a functional role in chromatin (Lehming et al. 1994; Carballo et al. 1983; Nightingale et al. 1996; Onate et al. 1994; Singh and Dixon 1990; Zappavigna et al. 1996; Bonne et al. 1984), bind to single- and double-stranded DNA sequences and also to nonclassical DNA structures such as adduct-modified DNA (Billings et al. 1992; Farid et al. 1996), four-way junctions (Bianchi et al. 1989), and B–Z DNA junctions (Hamada and Bustin, 1985). Other members of the superfamily such as the mitochondrial transcription factor MTTF1, the nucleolar transcription factor UBF, the structure specific recognition protein SSRP1, and the yeast nuclear nonhistone protein NHP6A also have a relaxed specificity for DNA and are able to recognize structural motifs on the DNA molecule. The number of HMG boxes in the various members of the family is variable, from one in most cases to five or six in the mouse or Xenopus UBF factors. Thus, the HMG box protein superfamily is characterised by an extreme functional and structural diversity.

Previous phylogenetic analysis has shown that the HMG box superfamily is extremely ancient (Laudet et al. 1993b; Griess et al. 1993). In fact, members of this family are known in all metazoan phyla, in plants, and in yeast but also in unicellular eukaryotes such as Trypanosoma (Erondu and Donelson 1992). A member of the classical HMG has been transduced in the Chilo iridescens virus (Schnitzler et al. 1994). Previous phylogenetical studies have classified the superfamily into two families: one comprising proteins with a single sequence-specific HMG box, such as the yeast mating-type genes, the TCF group of transcription factors, and the SOX proteins; and the other encompassing relatively non-sequence-specific DNA binding proteins with multiple HMG boxes, such as the classical HMG proteins, UBF, and mTTF1 (Laudet et al. 1993b; Griess et al. 1993). These analyses revealed striking differences in the rate of accumulation of mutations in the various HMG boxes, the SRY gene evolving the fastest within the superfamily (Pamilo and O'Neill 1997; Whitfield et al. 1993).

Since the beginning of these studies the size of the superfamily has increased considerably and the various trees published so far have never been tested for their robustness using statistical resampling methods such as the bootstrap analysis (Swofford and Olsen 1990). In the present paper, we have reconstructed the phylogeny of all published HMG box sequences using both distance and parsimony analysis. The validity of the various branches of the trees was tested using the bootstrap method and all branches below a threshold value of 60% were considered as nonvalid and then collapsed. This analysis allowed us to confirm the split of the HMG box superfamily into two widely distinct families: the MATA/TCF/SOX family and the HMG/UBF family. The structure of the superfamily, and of the various families, subfamilies, and groups, is also fully confirmed by functional and structural considerations.

Materials and Methods

Construction of the Database and Sequence Alignments

Sequences were extracted from the EMBL, GenBank, and NBRF data libraries using both the FASTA program and the search for sequences identical to a given signature available (9-1 and 9-2 option in the CITI-II Bisance/Infobiogen network). To perform the FASTA search, DNA binding domain sequences of Saccharomyces MTHMG-1, Mus Sox4, Homo TCF1, Mus Sox12, Neurospora MATA1, Tetrahymena HMGC, and Bos HMG1-2 were used. For the signature search program we used a consensus sequence corresponding to the most conserved part of the HMG box and encompassing the following sequence: PX(M/L)XN(A/S/T)X(I/M/S/L)S(K/Q/E)X(L/R)GXX(W/S), in which X is any amino acid.

Protein names and GenBank accession numbers of the full-length HMG box used in this study are as follows: ROX1, X60458; STE11, Z11156; MAT, X64195; MATA1, M54787; MATA1M, X07642; TCF1, X59869; TCF3, X62870; TCF4, X62871; LEF1, X58636; HBP1, U09551; SRY, X53772, X86384, X86383, X86382, X86380, X86386, Z30265, Z30646, U15569, X55491, L29551, L29547, L29544, L29543, L29549, L29548, L29552, L29542, S46279; SOX3, X71135, S69429, U12467; SOX19, X79821; SOX2, U12532; SOXRET, L07335; SOX9, Z46629, Z18958, U12533; SOX18, L35032; SOX4, X70683, X70298; SOX11, U23752, U12534; SRA3, L12021; SOX20, AB006768; SOX70D, U68056; SOX14, X65667; SOX21, U66141; SOX6, U32614; SOX22, U35612; SOX5, S83308, X65657; SOXLZ, D61689, D61688; UBF, X53461, X57561, X60831, M61725; MTTF-1, M62810; MLH, M87306; YD9395, Z46727; NOR90-1, X56687; HMG1, X12597, M21683, X12796, Y00463, D14314, U21933, L06453; HMG2, M83665, J02895, X67668, M80574, D30765; HMGT, X02666; PMS1, U13695; HMGX, L07107; YD8119, Z48008; SSRP1, M86737, S50213, L08825; DEF1, D14315; HMGZ, X71139; HMGD, M77023; HMG1B, M93254; HMG1PS, L08048; HMG1R, D14718; HMG2A, X63463; MTEST750, Z31299; HMGT2, L32954; HMG, X81456, L22300, D13491, Z28410, X58282, L28094, X76774, X58245, Z21703, L12169; S4664, D41834; S2676, D40599; HMGC, M63424; MTHMG-1, M73753; IXRI, L16900; CIIDBP, L08814; DSP1, U13881; NHP6A, X15317; and NHP6B, X15318.

The sequences of the HMG boxes were aligned using the ED program of the MUST package (Philippe 1993). This program allows a color visualization of the aligned sequences and the alignment is done by eye. To avoid errors we also aligned the sequences with the Clustal V program, which we used previously to generate an independent alignment of the HMG domain (Laudet et al. 1993a; Higgins and Sharp 1988). A printed version of the complete alignment of the HMG boxes is available upon request to S.S.

Regions of the alignment which are equivocally aligned (such as the very N-terminal and C-terminal parts of our alignment, which contain some residues outside the HMG box itself) as well as long insertions present in only one or two closely related sequences were excluded from the analysis. As numerous sequences were isolated by PCR analysis and were thus incomplete, only full-size sequences of the box were treated. For SOX genes uncompleted sequences were also used to perform a separate analysis as described under Results, below. The initial full-length alignment of the 144 HMG box sequences contained 75 sites, all variable, among which 74 were informative.