COMPUTATIONAL STRUCTURAL AND FUNCTIONAL PROTEOMICS

THE ALPHA-GALACTOSIDASE SUPERFAMILY: SEQUENCE BASED CLASSIFICATION OF ALPHA-GALACTOSIDASES AND RELATED GLYCOSIDASES

Naumoff D.G.

State Institute for Genetics and Selection of Industrial Microorganisms, Moscow, Russia, e-mail: [email protected]

Keywords: α-galactosidase, melibiase, glycoside hydrolase, GH-D clan, GH31 family, GHX family, COG1649, classification, protein phylogeny

Summary

Motivation: About 1 % of the genes in genomes code for enzymes with glycosidase activities. On the basis of sequence similarity, all known glycosidases have been classified into 90 families. In many cases, proteins of different families share a common evolutionary origin, which makes it necessary to combine the corresponding families into a superfamily.

Results: Using the PSI-BLAST program, we found significant sequence similarity among several glycosidase families, two of which include enzymes with α-galactosidase activity. Sequence homology, a common catalytic mechanism, folding similarities, and the composition of the active center allowed us to group three of these families (GH27, GH31, and GH36) into the α-galactosidase superfamily. Phylogenetic analysis of this superfamily revealed a polyphyletic origin of the GH36 family, which could be divided into four families. Glycosidases of the α-galactosidase superfamily have a distant relationship with proteins belonging to families GH13, GH70, and GH77 of glycosidases, as well as with two families of predicted glycosidases.

Introduction

Glycoside hydrolases, or glycosidases (EC 3.2.1.-), are a widespread group of enzymes that hydrolyze the glycosidic bonds between two carbohydrates or between a carbohydrate and an aglycone moiety. The large multiplicity of these enzymes is a consequence of the extensive variety of their natural substrates: di-, oligo-, and polysaccharides. Comparative analysis of 300 amino acid sequences of glycosidases known at the beginning of the 1990s showed that they could be classified into 36 families. Recent progress in genome sequencing has resulted in the accumulation of a huge number of enzymatically uncharacterized proteins: about 1 % of all genes encode enzymes with predicted glycosidase activities. Currently, more than ten thousand sequences of glycosidases and their homologues are known. They are grouped into 91 families: GH1–GH95 (except GH21, GH40, GH41, and GH60). Several glycosidases do not have any homologues.
They are included in a group of non-classified glycoside hydrolases. Glycosidases catalyze hydrolysis of the glycosidic bond of their substrates via two general mechanisms, leading to either inversion or overall retention of the anomeric configuration at the cleavage point. Some related families of glycosidases that share the same molecular mechanism of hydrolysis have been combined into clans. Currently, 14 clans (GH-A–GH-L) are described, and in total they contain 46 families (see Carbohydrate-Active Enzymes server, http://afmb.cnrs-mrs.fr/CAZY/).

Melibiases, or α-galactosidases [E.C. 3.2.1.22], are glycosidases that cleave, with overall retention of the anomeric configuration, the terminal non-reducing α-D-galactose residues in α-D-galactosides, including galactose oligosaccharides, galactomannans, and galactolipids. On the basis of sequence similarity, all α-galactosidases have been classified into four families of glycosidases: GH4, GH27, GH36, and GH57. Families GH4 and GH57 mostly include other glycosidases. The majority of known α-galactosidases belong to the GH27 and GH36 families, which form clan GH-D. Proteins of this clan have distant sequence similarity with representatives of several other families of glycosidases (Naumoff, 2001; Rigden, 2002). The recently established tertiary structure of several members of the GH27 family is similar to the structure of retaining glycosidases from the GH13 family (clan GH-H). Glycosidases of both families consist of the N-terminal catalytic (β/α)8-barrel domain and the C-terminal β-sandwich domain.

Results and Discussion

Sequences of the proteins belonging to the GH27 and GH36 families, according to the Carbohydrate-Active Enzymes classification, were used for BLAST screening of the GenPept database of amino acid sequences at the NCBI server. The resulting database was enlarged by translation of nucleic acid sequences found by screening genomic sequences with Genomic BLAST. In total, we analyzed more than 300 proteins.

Family GH27 includes representatives from Eukaryota (Alveolata, Fungi, Metazoa, Mycetozoa, Viridiplantae) and Bacteria (Actinobacteria, Bacteroidetes, Fibrobacteres, Firmicutes, Proteobacteria). They possess α-galactosidase, isomalto-dextranase [E.C. 3.2.1.94], α-N-acetylgalactosaminidase [E.C. 3.2.1.49], and galactosyltransferase [E.C. 2.4.1.-] activities. Multiple sequence alignment of the full-length sequences of proteins from the GH27 family showed that the proteins generally have both domains characteristic of the family; only three enzymatically uncharacterized proteins contain solely the catalytic N-terminal domain. Some (mostly prokaryotic) proteins have additional domains, which we grouped into eight families by sequence homology.

Pairwise sequence comparisons showed that the majority of GH27 proteins have higher than 30 % identity, meeting the criterion of glycosidase subfamilies. All these proteins were grouped into subfamily 27a. Another subfamily, 27b, included five enzymatically uncharacterized proteins from plants and bacteria. Two fungal proteins, including one α-galactosidase, were considered to be the only representatives of subfamily 27c. A unique isomalto-dextranase from Arthrobacter globiformis and two other bacterial proteins do not belong to any of the subfamilies. The largest subfamily, 27a, included three subgroups, each containing sequences with no less than 50 % identity. The subgroups comprised proteins of yeasts, plants, and chordates, respectively.
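The identity thresholds used for subfamilies (higher than 30 %) and subgroups (no less than 50 %) can be illustrated with a minimal Python sketch. It is not the procedure used in this work: the identity definition (matches divided by the number of ungapped aligned columns), the single-linkage grouping rule, and the toy sequences and names (`percent_identity`, `single_linkage_groups`, `toy_alignment`) are all assumptions made for illustration.

```python
from __future__ import annotations

from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumption: identity = matches / ungapped columns; other definitions exist.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def single_linkage_groups(aligned: dict[str, str], threshold: float) -> list[set[str]]:
    """Group sequences, linking any two whose identity exceeds the threshold.

    Assumption: single-linkage clustering, implemented with union-find.
    """
    parent = {name: name for name in aligned}

    def find(name: str) -> str:
        # Follow parent links to the representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for p, q in combinations(aligned, 2):
        if percent_identity(aligned[p], aligned[q]) > threshold:
            parent[find(p)] = find(q)

    groups: dict[str, set[str]] = {}
    for name in aligned:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=lambda g: sorted(g))


# Hypothetical toy alignment (not real GH27 sequences).
toy_alignment = {
    "seqA": "MKTLLVAGAW",
    "seqB": "MKTLLVSGAW",
    "seqC": "MKSLIVSGCW",
    "seqD": "QRPEHNDYFC",
}

subfamilies = single_linkage_groups(toy_alignment, 30.0)
```

Raising the threshold splits the groups further, which mirrors how subgroups sit inside a subfamily; a complete-linkage rule would be stricter than the single-linkage rule assumed here.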
Phylogenetic analysis of the GH27 family was used to study the evolutionary relationships of its members. Trees constructed by the neighbor-joining and maximum parsimony methods (PHYLIP package) were topologically similar: all subfamilies (27a–27c) appear to form monophyletic groups with bootstrap values higher than 90 %. Eukaryotic proteins compose five distinct clusters of branches on the phylogenetic trees.

PSI-BLAST searches (with an E-value threshold of 0.001 or 0.01) using a few randomly selected divergent representatives of the GH27 family as query sequences revealed some representatives of the GH31 and GH36 families of glycosidases during the first or second iteration. Further iterations yielded members of clan GH-H (which includes families GH13, GH70, and GH77). We also found a number of enzymatically uncharacterized hypothetical bacterial proteins from several genome projects. Sequence analysis allowed us to group them into two distinct families. One of them is known as COG1649. The other includes a unique α-glucosidase [E.C. 3.2.1.20], SusB from Bacteroides thetaiotaomicron, which belongs to the group of non-classified glycoside hydrolases. We have found the latter family for the first time and named it the GHX family. Statistically significant similarity of GH27 glycosidases with members of the other protein families was found only within the N-terminal catalytic (β/α)8-barrel domain.

Families GH31 and GH36 include representatives from Archaea, Bacteria, and Eukaryota. In addition to the α-galactosidase activity, α-N-acetylgalactosaminidase, stachyose synthase [E.C. 2.4.1.67], and raffinose synthase [E.C. 2.4.1.82] activities have been described for some members of the GH36 family. Family GH31 includes retaining enzymes with α-glucosidase [E.C. 3.2.1.20], glucoamylase [E.C. 3.2.1.3], sucrase-isomaltase [E.C. 3.2.1.10 and E.C. 3.2.1.48], α-xylosidase [E.C. 3.2.1.-], α-glucan lyase [E.C. 4.2.2.13], and isomaltosyltransferase [E.C. 2.4.1.-] activities.

Multiple protein sequence alignment showed that two key Asp residues, which play the roles of nucleophile and proton donor in the enzyme active center, are located at homologous sites of the catalytic domain in proteins of the GH27, GH31, and GH36 families. Based on sequence homology, the composition of the active center, the common catalytic mechanism with overall retention of the α-D-glycopyranoside anomeric configuration of the substrate, and the predicted common (β/α)8 TIM-barrel tertiary structure of the catalytic domain, we combined the GH27, GH31, and GH36 families into the α-galactosidase superfamily (Fig.).

Phylogenetic analysis of proteins from the α-galactosidase superfamily showed that GH27 and GH31 appear to be monophyletic families, whereas the GH36 family is polyphyletic. Sequence analysis allowed us to distinguish four monophyletic subgroups within the GH36 family. We suggest considering these subgroups as four different families of glycosidases (GH36A–GH36D) belonging to the α-galactosidase superfamily (Fig.). Family GH36A includes proteins from Fungi and several phyla of Bacteria (Actinobacteria, Bacteroidetes, Firmicutes, Proteobacteria, Spirochaetes). Family GH36B contains only bacterial proteins (Actinobacteria, Proteobacteria, Spirochaetes, Thermotogales, Thermus). Among members of the GH36A and GH36B families, only the α-galactosidase activity has been shown. Family GH36C is composed of proteins from Archaea (Crenarchaeota), Bacteria (Actinobacteria, Bacteroidetes), and Eukaryota (Alveolata, Fungi, Viridiplantae). In addition to the α-galactosidase activity, stachyose synthase and raffinose synthase activities have been described for this family. Family GH36D contains Clostridium perfringens α-N-acetylgalactosaminidase and a few enzymatically uncharacterized proteins from Bacteria (Firmicutes, Proteobacteria).
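The criterion behind splitting a family into monophyletic subgroups can be sketched in a few lines of Python. This is an illustration only, not the analysis performed with PHYLIP: it assumes a rooted tree written as nested tuples, and the leaf labels (`p1`, `q1`, `r1`, ...) and function names are hypothetical. A set of sequences is called monophyletic here when it is exactly the leaf set of some subtree; the sketch tests only that property and does not distinguish polyphyly from paraphyly.

```python
def leaves(tree):
    """Return the set of leaf labels under a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    result = frozenset()
    for child in tree:
        result |= leaves(child)
    return result


def clades(tree):
    """Yield the leaf set of every subtree (clade) of a rooted tree."""
    yield leaves(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def is_monophyletic(tree, group):
    """True if `group` is exactly the leaf set of some clade of `tree`."""
    target = frozenset(group)
    return any(clade == target for clade in clades(tree))


# Hypothetical rooted toy tree: the "p" sequences fall into two separate clades.
toy_tree = ((("p1", "p2"), ("q1", "q2")), (("p3", "p4"), "r1"))

whole_p_family = is_monophyletic(toy_tree, {"p1", "p2", "p3", "p4"})
first_p_subgroup = is_monophyletic(toy_tree, {"p1", "p2"})
second_p_subgroup = is_monophyletic(toy_tree, {"p3", "p4"})
```

In this toy tree the full "p" set is not a clade while each of its two subgroups is, which is the pattern that motivates treating subgroups as separate families. Trees from neighbor-joining are unrooted, so a rooting (for example, by an outgroup) is assumed before such a test.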
On the basis of the available experimental data for SusB from B. thetaiotaomicron, protein sequence homology with glycosidases from the α-galactosidase superfamily, and the gene context, we propose to consider the GHX family a new family of glycosidases. However, given its distant sequence similarity with α-galactosidases and the absence of experimental data on the molecular mechanism and the composition of the active center for the GHX family, we have decided not to include the GHX family in the α-galactosidase superfamily at this point. Statistically significant sequence similarity with glycosidases from the α-galactosidase superfamily and clan GH-H (which we propose to name the α-glucosidase superfamily) allows us to predict some glycosidase activities for proteins of the COG1649 family.

Fig. Proposed classification of (β/α)8-barrel-type retaining α-D-glycopyranosidases.


Our data strongly support a common evolutionary origin of proteins from the α-galactosidase and α-glucosidase superfamilies, as well as the GHX and COG1649 families (Fig.). Homology of glycosidases from these two superfamilies has been proposed recently, as well as their distant relationship with some other retaining α-D-glycopyranosidases, representing families GH38, GH57, and GH66 (Henrissat, 1998; Imamura et al., 2001; Janeček, 1998; Rigden, 2002). The results of this work, including the multiple sequence alignment and the updated classification of the α-galactosidase superfamily, may be obtained by e-mail.

Acknowledgements

This work was supported by a grant of the Russian President for young scientists (MK 118.2003.04).

References

Henrissat B. Glycosidase families // Biochem. Soc. Trans. 1998. V. 26. P. 153–156.
Imamura H., Fushinobu S., Jeon B.-S., Wakagi T., Matsuzawa H. Identification of the catalytic residue of Thermococcus litoralis 4-α-glucanotransferase through mechanism-based labeling // Biochem. 2001. V. 40. P. 12400–12406.
Janeček Š. Sequence of archaeal Methanococcus jannaschii α-amylase contains features of families 13 and 57 of glycosyl hydrolases: a trace of their common ancestor? // Folia Microbiol. 1998. V. 43. P. 123–128.
Naumoff D.G. Sequence analysis of glycosylhydrolases: β-fructosidase and α-galactosidase superfamilies // Glycoconjugate J. 2001. V. 18. P. 109.
Rigden D.J. Iterative database searches demonstrate that families 27, 31, 36 and 66 share a common evolutionary origin with family 13 // FEBS Lett. 2002. V. 523. P. 17–22.


BGRS'2004