View of the Hydrolysis Ratio Is Unlikely to Be Predictable from Se- Coverage and Distribution of the Curated Set of 400 Ex- Quence Alone
Total Page:16
File Type:pdf, Size:1020Kb
Aspeborg et al. BMC Evolutionary Biology 2012, 12:186 http://www.biomedcentral.com/1471-2148/12/186 RESEARCH ARTICLE Open Access Evolution, substrate specificity and subfamily classification of glycoside hydrolase family 5 (GH5) Henrik Aspeborg1†, Pedro M Coutinho3†, Yang Wang1, Harry Brumer III1,2 and Bernard Henrissat3* Abstract Background: The large Glycoside Hydrolase family 5 (GH5) groups together a wide range of enzymes acting on β-linked oligo- and polysaccharides, and glycoconjugates from a large spectrum of organisms. The long and complex evolution of this family of enzymes and its broad sequence diversity limits functional prediction. With the objective of improving the differentiation of enzyme specificities in a knowledge-based context, and to obtain new evolutionary insights, we present here a new, robust subfamily classification of family GH5. Results: About 80% of the current sequences were assigned into 51 subfamilies in a global analysis of all publicly available GH5 sequences and associated biochemical data. Examination of subfamilies with catalytically-active members revealed that one third are monospecific (containing a single enzyme activity), although new functions may be discovered with biochemical characterization in the future. Furthermore, twenty subfamilies presently have no characterization whatsoever and many others have only limited structural and biochemical data. Mapping of functional knowledge onto the GH5 phylogenetic tree revealed that the sequence space of this historical and industrially important family is far from well dispersed, highlighting targets in need of further study. The analysis also uncovered a number of GH5 proteins which have lost their catalytic machinery, indicating evolution towards novel functions. Conclusion: Overall, the subfamily division of GH5 provides an actively curated resource for large-scale protein sequence annotation for glycogenomics; the subfamily assignments are openly accessible via the Carbohydrate-Active Enzyme database at http://www.cazy.org/GH5.html. Keywords: Protein evolution, Enzyme evolution, Functional prediction, Glycogenomics, Glycoside hydrolase family 5, Phylogenetic analysis, Subfamily classification Background material with significant potential to address energy and Carbohydrates, in the form of mono-, di-, oligo-, and material needs. polysaccharides, as well as glycoconjugates, play funda- A striking feature of carbohydrates is their remarkable mental roles in all forms of life [1]. Beyond their role in structural complexity, due to a rich diversity of mono- energy storage, carbohydrates are central to diverse bio- saccharide building blocks, and the possibility of numer- logical processes such as host-pathogen interactions, sig- ous stereo- and regiospecific linkages [3], which give rise nal transduction, inflammation, intracellular trafficking, to both simple linear and complex, highly branched diseases, and differentiation/development. Not least, as molecules [1]. A decade of investments in genomics and structural components of terrestrial biomass, carbohy- proteomics has greatly improved our interpretation of drates comprise approximately 75% of the carbon fixed the molecular language of the cell, but deciphering the annually by primary production [2]. Sugar-rich plant cell complex carbohydrate-based information in the biomo- walls, seeds, and tubers thus represent a renewable lecular landscape is still in its infancy. Indeed, glycomics has been identified both as “the last frontier of molecu- ” “ * Correspondence: [email protected] lar and cellular biology [4] as well as an emerging tech- †Equal contributors nology that will change the world” [5]. 3 Architecture et Fonction des Macromolécules Biologiques, Aix-Marseille Functional analysis of glycans and glycoconjugates is Université, CNRS, UMR 7257, 163 Avenue de Luminy, Marseille 13288, France Full list of author information is available at the end of the article complicated by the fact that they are not direct genetic © 2012 Aspeborg et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Aspeborg et al. BMC Evolutionary Biology 2012, 12:186 Page 2 of 16 http://www.biomedcentral.com/1471-2148/12/186 products, but are instead synthesized, recognized, modi- study, which notably also suggested the merger of A5 fied, and degraded by a plethora of carbohydrate-active and A6 [19]. Finally, A10 was the most recently defined enzymes (CAZymes) and binding proteins. In the syn- GH5 subfamily [20], while new subfamilies that pres- thetic direction, phosphosugar-dependent glycosyltrans- ently lack a unique identifier have also been suggested ferases (GTs) catalyze the formation of glycosidic [21,22]. Family GH5 belongs to clan GH-A, which pres- linkages, whereas their breakdown is mediated by glyco- ently groups 19 GH families to form the largest set of side hydrolases (GHs) and polysaccharide lyases (PLs), evolutionarily related GH families described in CAZy with the assistance of carbohydrate esterases (CEs). The thus far (a clan is a group of families that arise from a structural diversity of carbohydrates is reflected in an common but very distant ancestor; despite weak se- abundance of CAZyme-encoding genes, which comprise quence similarity, clan members share conserved protein 1-3% of the genome of most organisms [6]. Expanding fold and catalytic machinery). and harnessing knowledge of the complexity of the Families such as GH5 were originally defined with a “CAZome” is thus essential to understanding the com- very small number of sequences. With the accumulation plexity of the glycome. of an increasing body of sequence data, the relationship The protein sequence-based classification of CAZymes between the original families has sometimes changed was initiated in 1991 as a complement to the long- enough to merit reexamination of family membership. standing Enzyme Commission (EC) number system [7], Very recently, detailed three-dimensional structural ana- which is based solely on enzyme activities [8]. Given the lysis led to the reclassification of several GH5 sequences prevalence of convergent evolution of enzymes that into family GH30 based on the organization of second- cleave glycosidic bonds, as well as the demonstrable ary structural elements around the conserved (β/α)8 fold catalytic promiscuity of individual enzymes, sequence- of the catalytic module [23]. based classification has proven to be a robust way to Given the continuing expansion in sequence numbers unify information on enzyme structure, specificity, and and the partial GH5/GH30 reclassification, it is clear mechanism, which provides enormous predictive power that a global re-analysis of the subfamily division of GH5 [9]. Initially motivated by a need to delineate cellulases is now needed. The rapid accumulation of genomic data (EC 3.2.1.4) into distinct structural families [10], the first in the past decade revealed a complex and varied se- incarnation of the GH family classification, as such, quence space, with the consequence that a substantial comprised 35 GH families [8]. The number of families portion of GH5 family members are currently not increased steadily with the growing interest in Glycobiol- assigned to any subfamily. This situation will only be- ogy so that, as of August 2012, 130 sequence-based fam- come worse as the rate of (meta)genomic sequencing ilies of GHs have been defined in the continuously continues to increase with phenomenal rapidity. Further, updated CAZy database [11]. this flood of data will cause an increasing reliance on Presently, one of the largest GH families is GH5, his- computer-based annotation, which necessarily requires a torically known as “cellulase family A” as it was the first robust framework to produce meaningful functional pre- cellulase family described [10]. GH5 exemplifies a family dictions. The division of CAZyme families into subfam- with a large variety of specificities: it currently contains ilies based on phylogenetic analysis has been applied as close to 20 experimentally determined enzyme activities a successful approach to meet this challenge: Subfamily denoted with an EC number. The abundance of GH5 classification of GH13, GH30 and all of the PL families enzymes in different ecological niches has been high- has demonstrated that the majority of the defined sub- lighted by their frequent identification in metagenomes families were monospecific, thus indicating a signifi- of diverse microbial communities [12-14], as well as the cantly better correlation of substrate specificity between genomes of individual organisms [11]. As with other sequences at the subfamily level than the family level CAZyme families [15], GH5 members are commonly [23-25]. Significantly, the division into subfamilies allows found to be encoded as parts of multi-modular polypep- the identification of currently uncharacterized subfam- tide chains containing other catalytic, substrate-binding, ilies that can subsequently be analyzed biochemically and functionally unidentified or yet to be described and structurally to potentially unveil new activities. modules. Hence, we present here an improved, robust subfamily Within the large GH5 family, a discernible diversity of classification for GH5 by employing a large-scale ana-