<<

SUPPLEMENTARY MATERIAL

The glycan alphabet is not universal: a hypothesis

Jaya Srivastava1*, P. Sunthar2 and Petety V. Balaji1

1Department of Biosciences and Bioengineering, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India

2Department of Chemical Engineering, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India

*Corresponding author Email: [email protected]

1

CONTENTS

Data Description Methods Methods in detail Figure S1 Number of organisms with different number of strains sequenced Figure S2 Biosynthesis pathways Figure S3 Proteome sizes for different number of Figure S4 Prevalence of monosaccharides in species versus that in genomes Figure S5 Bit score distribution plots for hits of various pairs of profiles Table S1 Tools and databases used in this study Flowchart S1 Procedure used to generate HMM profiles Flowchart S2 Precedence rules for assigning annotation to proteins that are hits to two or more profiles and/or BLASTp queries References References cited in this file

MS-EXCEL file provided separately: Supplementary Data.xlsx

Worksheet1 Details of HMM profiles Worksheet2 Details of BLASTp queries Worksheet3 Prevalence of monosaccharides in genomes / species Worksheet4 Abbreviated names of monosaccharides Worksheet5 Enzyme types, enzymes and groups Worksheet6 Precursors of various monosaccharides

2

METHODS

Databases and software: Protein sequences and 3D structures were obtained from UniProt and PDB (Table S1). Completely sequenced genomes of 12939 prokaryotes, of which 303 are archaeal, were obtained from the NCBI RefSeq database. These genomes are spread across 3384 species belonging to 1194 genera (Figure S1). Gene neighborhood was analyzed using feature tables taken from NCBI for the respective genomes. BLASTp, MUSCLE, HMMER and CD-Hit (Table S1) were installed and used locally. Default values were used for all parameters except when stated otherwise. Word size was set to 2 for BLASTp to prioritize global alignments over local alignments. Thresholds for HMM profiles were set based on the best 1 domain bit score rather than e-values since the former is independent of database size.

Searching genomes for monosaccharide biosynthesis pathways: Pathways for the biosynthesis of 55 monosaccharides have been elucidated to date (Table 2, Figure S2). HMM profiles were generated using carefully curated sets of homologs for 57 families of enzymes that catalyse various steps of 55 monosaccharides (Supplementary_data.xlsx:Worksheet1). Sequences were used directly as BLASTp queries when the number of enzymes characterized experimentally is not sufficient for a HMM profile (Supplementary_data.xlsx:Worksheet2). In-house python scripts were used to scan genomes to identify homologs. Presence of a homolog for each and every enzyme of the biosynthetic pathway of a monosaccharide is taken as evidence of the utilization of this monosaccharide by the organism. On the other hand, absence of a homolog for even one enzyme of the pathway is interpreted as the absence of the corresponding monosaccharide from the organism’s glycan alphabet.

Choice of precursors: -1-phosphate, fructofuranose-6-phosphate and -7-phosphate are precursors for many of the monosaccharides (Supplementary_data.xlsx:Worksheet6). Fructofuranose- 6-phosphate and sedoheptulose-7-phosphate are intermediates in the

3

glycolytic pathways viz., Embden-Meyerhof pathway and phosphate pathway, respectively, and these enzymes are not considered for the search. Pathways for biosynthesis of UDP-Glc2NAc and GDP- have been considered separately since Glc2NAc and mannose are glycan building blocks as well as intermediates in the biosynthesis of several other monosaccharides. Hence biosynthesis steps of UDP-Glc2NAc and GDP- mannose were excluded from those of their derivatives. An additional pathway for UDP-glucose biosynthesis was considered to analyze its ubiquity since UDP-glucose is part of both anabolic and catabolic pathways. The biosynthesis of CMP-Leg5Ac7Ac starting from N-acetyl-glucosamine- 1-phosphate has also been considered because of the uncommon guanylyltransferase.

Generation of HMM profiles: An HMM profile was generated for each step of a biosynthesis pathway except where mentioned otherwise. Profiles were generated in two steps (Flowchart S1). The extended dataset was created to account for sequence divergence. In some cases, no additional sequences satisfying the aforementioned criteria were found, hence there is no Extend dataset. Each profile was given an annotation based on the enzyme activities of proteins that were used to generate the profile and an identifier of the format GPExxxxx; here GPE stands for Glycosylation Pathway Enzyme and xxxxx is a unique 5-digit number (Supplementary_data.xlsx:Worksheet1).

Setting thresholds for HMM profiles: Thresholds for HMM profiles were set as described below (profile-wise details are given in Supplementary_data.xlsx:Worksheet1):

Using ROC curves: TrEMBL database was used to generate ROC curves. Several of the TrEMBL entries have been assigned molecular function electronically based on UniRule and SAAS (Table S1). It is assumed that these annotations are correct while generating ROC curves. True positives, false positives and false negatives were identified by comparing TrEMBL annotation with profile annotation.

4

Using bit-score scatter plots: Members of some enzyme families differ in their molecular function while retaining significant global sequence similarity e.g., C4- and C3-aminotransferases. Consequently, annotations of several TrEMBL sequences belonging to such families are incomplete e.g., DegT/DnrJ/EryC1/StrS aminotransferase family protein. In such cases, bit score scatter-plots were used to set thresholds (Figure S5). Scatter plot was also used to set threshold in case of hydrolysing and non-hydrolysing NDP- Hex2NAc C2 epimerases since many TrEMBL hits are just annotated as NDP-Hex2NAc C2 epimerases.

Using 푇푒푥푝 and 푇푒푥푡푒푛푑 as thresholds: 푇푒푥푝 or 푇푒푥푡푒푛푑 was used as the threshold for some profiles for one of these two reasons: (i) Sequences used to generate the profile are a subset of the sequences used to generate another profile; the latter set of enzymes has broader substrate specificity than those of the former set. For instance, sequences used for generating GPE02430 [(d)TDP-4-keto-6-deoxyglucose 3-/3,5-epimerase] and GPE02530 (NDP- 3-/3,5-/5-epimerase) are homologs but the former set has narrow specificity. 푇푒푥푡푒푛푑 was set as threshold for GPE02430 as lowering the threshold would make this profile less specific. (ii) For some profiles such as GPE50010 [nucleotide sugar formyltransferase], very few TrEMBL entries that score < 푇푒푥푝 have been assigned molecular function and hence ROC curve could not be generated.

The case of GPE00530: Scanning TrEMBL database with GPE00530 (Glucose-1-phosphate uridylyltransferase family 2) using the default threshold of HMMER (e-value = 10) resulted in 2693 hits with matching annotation and their scores ranged continuously from 705 to 303 bits and then from 57 to 41 bits. It was not possible to generate an ROC curve because of this discontinuity. Hence, 303 bits was set as the threshold.

Profile annotations with broader substrate / product specificities: Many sequence homologs catalyse the “same” reaction but with (slightly) different substrate specificities. Sequence changes that confer such differential specificities are subtle and often unknown. HMM profiles of such families 5

lack the ability to discriminate between sequences with varying substrate specificities. Two products, a major product and a minor product, are formed in certain enzyme catalysed reactions (Tello et al. 2006; Polizzi et al. 2012; Li et al. 2017). It is possible that only the major product has been characterized while assaying an enzyme with broader substrate specificity. Another possibility is that only a subset of possible substrates has been assayed for. Hence, substrate specificities are broad in annotations of some of the profiles. As opposed to these, some profiles of aminotransferases and reductases are generated from enzymes which differ from each other with respect to the product formed viz., orientation (equatorial or axial) of the newly formed/added -OH / -NH2 group. Profile for 3,4-ketoisomerase is also of this type. UDP-GlcA decarboxylase (UXS) converts UDP-GlcA to UDP-4-keto , which is further reduced to UDP-xylose. UDP-4-keto xylose is a minor product for human UXS whereas it is a major product for E. coli UXS (Polizzi et al. 2012). Both these enzymes are used to generate the profile GPE20030 (Supplementary_data.xlsx:Worksheet1).

Pathway steps associated with more than one HMM profile: Some steps are associated with more than one profile for one of these two reasons: (i) Non- orthologous enzymes known to catalyse the same reaction e.g., phosphomannoisomerases. (ii) Two or more profiles are generated, one with narrow and the other(s) with broad substrate specificity. Enzymes used for the former are a subset of enzymes used for the latter type of profiles e.g., aminotransferases. The process flow adopted to assign annotation for a sequence which satisfies thresholds for more than one profile is shown in Flowchart S2.

Finding homologs using BLASTp instead of HMM profiles: HMM profiles were generated only when four or more experimentally characterized enzymes are available (two exceptions are discussed below). Global alignment and sequence similarity were used as the criteria to infer homology based on BLASTp search. The default values were set to be >= 90% query coverage and >=30% sequence similarity. However, these values were upwardly revised when query sequences belonged to homologous

6

families that are functionally divergent (Supplementary_data.xlsx:Worksheet2). Specifically, similarity and coverage cut-offs were revised by performing an all-against-all BLASTp search of all experimentally characterized sequences of monosaccharide biosynthesis pathways.

B. cereus PdeG (Q81A42_1-328) is a retaining UDP-Glc2NAc 4,6- dehydratase (Hwang et al. 2015). It shares higher sequence similarity with inverting UDP-Glc2NAc 4,6-dehydratases than with retaining dehydratases. The sequence of PdeG was compared with TrEMBL hits for the HMM profile of inverting UDP-Glc2NAc 4,6-dehydratases (GPE05331), based on which the sequence similarity cut-off for PdeG was set to 70%. The threshold for GPE05331 was set such as to exclude PdeG (Figure S5).

Criteria for finding homologs of UDP-2,4-diacetamido-2,4,6-trideoxy-β-L- hydrolase and UDP-4-amino-6-deoxy-Glc2NAc acetyltransferase: Four experimentally characterized enzymes are known for each of these two families. However, BLASTp approach was used instead of generating an HMM profile. This is because a suitable bit score threshold could not be arrived, which in turn, was because several of the TrEMBL entries obtained as hits are annotated as CMP-N-acetylneuraminic acid synthetase or equivalent (for hydrolase), or O-acetyltransferase or equivalent (for acetyltransferase).

Uncertainties in prediction: Any description of molecular function of a protein is stratified and includes specifying the type of reaction catalysed, substrate(s) used, etc. A vast majority of sequences conceptually translated from genome sequences are assigned molecular function based on sequence homology to experimentally characterized proteins. Even though experimental validation is available for only a small fraction of proteins due to practical constraints, such studies have shown that homology-based assignments are generally valid and deviations typically pertain to the extent of substrate specificity, metal ion dependency and such. Nevertheless, caution is warranted with increasing sequence divergence and one has to be

7

on the lookout for homologs that have acquired new molecular function as a result of mutation of a handful of key residues (neo-functionalization). In view of this, in the present study, HMM and BLASTp thresholds have been chosen with higher stringency and assignment of substrate(s) and product(s) has been made conservatively by manually curating false positives and false negatives from the Swiss-Prot database, details of which are given below:

(1) Both GDP- and GDP-6-deoxytalose are assigned as products of the same pathway, because their biosynthesis proceeds through the same pathway with the exception of the last step being catalysed by homologous 4-reductases. It is not possible to infer if product specificity of enzymes in this family is absolute or partial i.e., one is a major product and other, a minor product, due to inadequate experimental data. An identical situation is seen in the pathways for the biosynthesis of CDP-cillose and CDP-cereose, and for CDP-abequose and CDP-paratose. In view of this, prevalence data will be the same for the two monosaccharides of a pair (Supplementary_data.xlsx:Worksheet3).

(2) Non-hydrolyzing NDP-Hex2NAc C2-epimerases (GPE02030) are part of biosynthesis pathways of different monosaccharides. The extent of substrate specificity of the experimentally characterized members of this family is not known since not all enzymes have been assayed using all possible substrates. In literature, substrate specificity is arrived at based on the genomic context and the same approach has been followed in the present study as well. For example, hits for GPE02030 profile are treated as Man2NAc synthesis pathway enzymes, unless other enzymes of L- Fuc2NAc, L-Qui2NAc or Man2NAc3NAcA pathway are also present.

(3) Some monosaccharides are precursors for other monosaccharides and hence, genomes predicted to have the pathway for the latter monosaccharide will also have the precursor monosaccharide. Following are the precursor- final product monosaccharide pairs encountered in this study: (i) L- Rha2NAc → L-Qui2NAc, (ii) L-Rhamnose → 6-Deoxy-L-, (iii) → Fucofuranose, (iv) Paratose → Tyvelose, (v)

8

Galactofuranose, (vi) GlcA → GalA, (vii) L-Ara4N → L-Ara4NFo, (viii) Per → Per4Ac, (ix) Man2NAc → Man2NAcA, (x) Glc2NAcA → Gal2NAcA, and (xi) Bac2Ac4Ac → Leg5Ac7Ac.

(4) The pathway for the synthesis of L- is an extension of the pathway for the synthesis of xylose. However, most genomes predicted to have xylose pathway also have L-arabinose pathway. This is because UDP- sugar C4-epimerase family members (GPE02230) catalyse C4- epimerization of glucose, GlcA, Glc2NAc, Glc2NacA and xylose. Assigning substrate specificity solely based on sequence similarity is not possible. The challenge is compounded by the fact that some of these enzymes show broad substrate specificity while the rest are only specific to a single substrate. Not all enzymes have been assayed for all potential substrates.

Data Availability: Results of genome scan which include predictions of monosaccharides as well as the biosynthesis pathway enzymes predicted in 12939 genomes is available at http://www.bio.iitb.ac.in/glycopathdb/.

9

Figure S1 The number of species for which different number of strains are sequenced. Six or fewer strains are sequenced for most of the species. On the other hand, more than 50 strains are sequenced for 29 species. Escherichia coli and Salmonella enterica have the highest number of sequenced strains (714 and 602, respectively). Genus and species names are not known for 45 endosymbionts; only their host name is known e.g., Legionella endosymbiont. Each such case is considered as a distinct species.

10

4 Mutase Glucose-1-phosphate C1 4 (d)TDP- (d)TDP-fucose C1

ThymidylylT (Pyr to Fur) fucofuranose

)

0 6 4

GPE01130 A GPE09510

3 1

(

C

2 3

- -

e

@

1

1

s

: :

4 H a

C 4

(d)TDP-glucose 1 2

O

t

7

F

c

3

7

x

u E GPE05430 5

4,6-Dehydratase (R) A

6

0

d

e Q

Retention of config @ C5 O

R - 4 4-AminoT 4 (d)TDP-4-keto-6-deoxy-glucose C1 Eq –NH2@C4 3,4-Ketoisomerase 3,4-Ketoisomerase GPE01830 GPE07510 Eq –OH@C4 Ax –OH@C4 GPE01230

(d)TDP-4-amino- e s (d)TDP-3-keto-6-

a (d)TDP-3-keto-6- 4

r 0

0 6-deoxy-glucose C1

3 3

e 4 4 5

4 deoxy-glucose C1 deoxy-galactose C1

m

2 2

i 0

0 3

p E E 2 1 G

E GPE01910 9 P P 3-AminoT (E) 3-AminoT (E) 2

- GPE01910 - 1 P G G 1 - F E 5 GPE01230 : 1 5 , Eq –NH @C3 Eq –NH @C3 : T o 2 GPE01230 2 3 3 l ( 0 3 D y ) 4 rm 0 8 W t o 1 -a 0 B C e in y (d)TDP-3-amino-6- 4 X c m lT (d)TDP-3-amino-6- C 9 A m i Q n

4 a 0 4 0 4 T - o

C C 3 C deoxy-galactose 1 3 4 )

deoxy-glucose 1 o (

8 2

@

n 1

(d)TDP-4-keto- 1

2

i

0 0

H

E E 1 m

AcetylT AcetylT N P

L-rhamnose C4 Q6T1W7:1-192 P

A –

- G

Q6TFC6:1-265 G x

(3-amino) Q12KT8:1-152 (3-amino) 4 A 4 (d)TDP-Qui4NAc C1 (d)TDP-Qui4NFo E5 ) 4 O F1 4 E - 4 C ( R 6 6 4 4 1 e 62 : (d)TDP-Qui3NAc C1 (d)TDP-Fuc3NAc C1 e A d 5 1- s 4 x u 6: 31 a C 0 – c 1 8 t 3 O t - @ a 2 (d)TDP-4-amino-6- c 3 H s 7 H @ e 0 u 6 d O 0 C (A 4C e – E 4 ) deoxy-galactose 1 P R q - G 4 E AcetylT Q8FBQ3:1-224 4-Epimerase (4-amino) (d)TDP- (d)TDP-6-deoxy- 1 Q2SYH7:1-363 L-rhamnose C4 L-talose 1C 4 4 (d)TDP-Fuc4NAc C1

D-enantiomer L-enantiomer HMM profile BLASTp query D-enantiomer / form unless mentioned otherwise

Figure S2a (d)TDP-linked monosaccharides derived from glucose-1-phosphate. Abbreviated names are used for some of the monosaccharides. Full names of these are given in Supplementary_data.xlsx:Worksheet4.

11

4 Glucose-1-phosphate C1

GPE00330 CytidylylT

4 CDP-glucose C1

GPE05230 4,6-Dehydratase (R) Retention of config @ C5 GPE40110 C-MeT (E) 3-Epimerase CDP-4-keto-6- CDP-4-keto-3C-methyl- 4C 4C 1 CDP-4-keto-6-deoxy 1 4 GPE02530 deoxy-glucose C1 Eq –Me, Ax-OH@C3 6-deoxy-gulose Ax –OH@C4 4-Reductase GPE06230 4 GPE05630 3-Dehydratase -R e e s E a q d t 4 – u 4 c C O c CDP-6-deoxy gulose C1 u H t CDP-6-deoxy-D3,4-glucocene @ @ a d H GPE06230 s e C e O 4 R – - x P26395:1-330 3,4-Glucoseen 4 A Q66DP5:1-329 reductase GPE02530

CDP-3,6-dideoxy-L-glycero- 5-Epimerase glycero 4 CDP-3,6-dideoxy-D- - CDP-cereose C1 CDP-cillose 1C 4 D-glycero-4-hexulose 4 C1 4 D-glycero-4-hexulose C1

GPE06230 4-Reductase 4 e - s R Eq –OH@C4 a E e t 4 q d c C – u u @ O c d H GPE06230 H t 1C e O @ a CDP-L-ascarylose 4 R – C s - x 4 e 4 A

2-Epimerase 4C 4 4C CDP-abequose 1 CDP-paratose C1 CDP-tyvelose 1 P14169:1-338

D-enantiomer L-enantiomer HMM profile BLASTp query D-enantiomer / pyranose form unless mentioned otherwise

Figure S2b CDP-linked monosaccharides derived from glucose-1-phosphate. Abbreviated names are used for some of the monosaccharides. Full names of these are given in Supplementary_data.xlsx:Worksheet4.

12

4 Glucose-1-phosphate C1

GPE00430 UridylylT GPE00530 GPE09510 4,6-Dehydratase (R) 4-Epimerase Mutase UDP-galacto- UDP-4-keto-6- 4 4 UDP-glucose C1 UDP-galactose C1 4 Retention of config @ C5 deoxy-glucose C1 GPE02230 (pyranose GPE05510 to furanose) 2 5 3 3 , - ) 4 5 :1 E - - 8 ( R E 6-Dehydrogenase N p 4 T 4 E e Y o C q d im GPE03130 2 n G – u G i @ e 2 P O c GPE03430 0 m H E H t r A N 0 @ a a 0 A – 5 s s A - 7 C e e 4 q 1 4 4C E 0 ( , Decarboxylase 1 E 4 ) UDP-GlcA C1 UDP-4-keto xylose GPE20030 GPE01830 1 4-Epimerase 4-AminoT (A) UDP-4-amino-6- UDP-L-rhamnose C4 GPE01230 GPE02230 Ax –NH @C4 4C 2 deoxy-glucose 1 4 UDP-xylose C1 4C 4 AcetylT UDP-GalA 1 UDP-L-Ara4N C1 (4-amino) 4-Epimerase FormylT GPE02230 GPE50010 Q5UR11:1-213 (4-amino) 4C 4 UDP-L-arabinose 1 UDP-L-Ara4NFo UDP-Qui4NAc C1 4 C1 D-enantiomer L-enantiomer HMM profile BLASTp query D-enantiomer / pyranose form unless mentioned otherwise

Figure S2c UDP-linked monosaccharides derived from glucose-1-phosphate. Abbreviated names are used for some of the monosaccharides. Full names of these are given in Supplementary_data.xlsx:Worksheet4.

13

GPE01330 4 Fructofuranose-6-phosphate Glucosamine-6-phosphate C1 GPE07030 2-AmidoT GPE07130 M lT GPE07230 y u Isomease t 0 G t e 3 P a GPE07330 1 G E c 8 P 0 s A 0 E 9 e Q5SIM4:1-254 E 0 1 P 9 3 G 6 P29954:1-387 3 0 0

1 4C C4 Mannose-6-phosphate 1 GDP-L-galactose GPE09230 4 3,5 Glc2NAc-6-phosphate Glucosamine-1-phosphate C1 -e GPE09330 Mutase 4C R p 1 4N im GPE09630 RR er AcetylT 5: as 1-3 e 4 Mutase GPE08510 50 Mannose-1-phosphate C1 (2-amino) GPE09430 GPE00620 (GlmU-CTD) GPE00720 GuanylylT 4 Glc2NAc-1-phosphate C1 4 6-dehydrogenase 4 GDP-ManA C1 GDP-mannose C1 GPE03210 GPE00831 GPE03430 GPE05020 4,6-dehydratase (R) UridylylT UridylylT Retention of config @ C5 GPE00830 (GlmU-NTD) 4-AminoT (E) 4 4 C1 GDP-Per C1 GDP-4-keto-6-deoxy mannose 4 Eq –NH2@C4 UDP-Glc2NAc C1 AcetylT Q7DBF7:1-221 GPE01430 Eq 4- –OH O85353:1-215 se 3,5-epimerase, re @C (4-amino) ta duc 4 a G tas dr 4-reductase (A) PE0 e 4 y 4- 65 (E) GDP-Per4Ac C1 h 30 re 10 de 56 GPE06030 Ax –OH@C4 du - E0 Ax ct 3 P –O as G H@ e 1C C4 (A 4 GDP-L-fucose 4 ) GDP-rhamnose C1 GDP-4-keto-3,6- 4 4C dideoxy mannose C1 GDP-6-deoxy-talose 1 5-epimerase, 4-reductase (A) GPE06030 Ax –OH@C4

1 GDP-L-colitose C4

D-enantiomer L-enantiomer HMM profile BLASTp query D-enantiomer / pyranose form unless mentioned otherwise

Figure S2d GDP- and UDP-linked monosaccharides derived from fructofuranose-6- phosphate. Abbreviated names are used for some of the monosaccharides. Full names of these are given in Supplementary_data.xlsx:Worksheet4.

14

4-Oxidase, 5,6 dehydratase, 2-Epimerase 4-reductase (Non-hydrolyzing) 4 UDP-6-deoxy-GlcNAc-5,6-ene UDP-Glc2NAc UDP-Man2NAc C1 A0A0A1G406:1-341 4C GPE02030 C5 1 4-Oxidase, g @ onfi of c GPE03030 6-Dehydrogenase 5,6-reductase, ion (I) Q3ESA4:1-320 vers se In rata 5-epimerase GPE05331 hyd 6- 1 De D 4C C4 ,6- e UDP-Man2NAcA 1 4 hy UDP-2-acetamido-2,6- dr og dideoxy-L-arabino-4-hexulose en 3-E as GPE01530 4-AminoT pim e UDP-Gal2NAcA GPE01830 era Eq –OH@C4 se 4 GPE01230 , GPE03030 C1 3 4-r - ed UDP-4-amino-2-acetamido- e Eq uc 4-Epimerase p –O ta 1 im H se 4C 2,4,6-trideoxy-L-altrose C4 @C (E 1 GPE02230 e 4 ) r UDP-Glc2NAcA a s Q8L347:1-290 AcetylT e , GPE08410 A GPE03310 3-dehydrogenase 4 1 (4-Amino) x - C – r UDP-L-Rha2NAc 4 O e H d 4 @ u UDP-3-keto-Glc2NAcA C1 UDP-2,4-diacetamido- C c 2-epimerase 4 t GPE02030 1 a C4 s (Non-hydrolyzing) GPE01910 2,4,6-trideoxy-L-altrose e 3-AminoT (E) ( GPE01230 Eq –NH2@C3 A O25093:209-517 ) UDP-L-Qui2NAc 1C A0A1D3UJQ1:1-321 4 4C Hydrolase P95700:1-371 UDP-Glc2NAc3NH2A 1 Q0P8U5:1-274 Q9XC60:1-372 Q3ESA1:1-366 A9IH93:1-190 AcetylT G3XD01:1-191 (3-amino) 2,4-diacetamido-2,4,6- 1 UDP-6-deoxy-L-talose C4 1 4C trideoxy-L-altrose C4 UDP-Glc2NAc3NAcA 1 2-Epimerase Nonulosonic acid GPE02030 2-Epimerase GPE04030 (Non-hydrolyzing) GPE02030 synthetase (Non-hydrolyzing) 1 2C UDP-L-Fuc2NAc C4 L-Pse 5 4 UDP-Man2NAc3NAcA C1 GPE00231 CytidylylT

2 CMP-L-Pse5Ac7Ac C5 D-enantiomer L-enantiomer HMM profile BLASTp query D-enantiomer / pyranose form unless mentioned otherwise

Figure S2e CMP- and UDP-linked monosaccharides derived from UDP-Glc2NAc (1 of 2). Abbreviated names are used for some of the monosaccharides. Full names of these are given in Supplementary_data.xlsx:Worksheet4.

15

4 4 UDP-Qui2NAc C1 C1 4 Glc2NAc-1-phosphate UDP-Gal2NAc C1 Q0P8S9:1-341 GuanylylT Eq –OH@C4 Q6TP29:1-309 GPE02230 4-Epimerase Q81A43:1-303 4-Reductase (E) 4C GPE02110 GDP-Glc2NAc 1 Q0P8T8:1-323 2-epimerase Retention of config @ C5 G8UHQ7:1-325 4,6-dehydratase (R) (Hydrolyzing) 4,6-dehydratase (R) UDP-4-keto-6- 4C 4 Man2NAc 1 UDP-Glc2NAc 4 C1 C1 GPE05332 deoxy-Glc2NAc GDP-4-keto-6- 4 Nonulosonic acid Q81A42:1-328 deoxy-Glc2NAc C1 GPE04030 GPE01710 4-AminoT (E) synthetase GPE01830 GPE01710 Eq –NH2@C4 4-AminoT (E) GPE01230 GPE01830 2C Eq –NH2@C4 Neu5Ac 5 GPE01230 UDP-4-amino-6- GPE00210 4 CytidylylT C1 GDP-4-amino-6- GPE00231 ) deoxy-Glc2NAc A 4 ( P71063:1-216 deoxy-Glc2NAc C1 2 e CMP-Neu5Ac C5 s B0V6J7:1-219 AcetylT a AcetylT ) t Q0P8V9:1-263 (A c Q0P9D1:1-195 T u 5 (4-Amino) o 9 Q5FAE1:197-403 (4-Amino) in d 2 C4 - m e 1 4 -a 2@ r : C 4 4 NH - 8 1 C1 – 4 1 UDP-Bac2Ac4Ac GDP-Bac2Ac4Ac Ax A K 4 3 GPE01710 C H 0 2-epimerase @ A GPE01830 H 0 O A (hydrolyzing) GPE01230 – x GPE02110 A 4 UDP-4-amino-Fuc2NAc C1 2,4-diacetamido-2,4,6- 4 C1 4 UDP-Fuc2NAc C1 Q814Z5:1-371 Synthase trideoxy-mannose GPE04030 Nonulosonic acid 4 UDP-yelosamine C1 synthetase 2 Leg C5 GPE00231 CytidylylT

2 CMP-Leg5Ac7Ac C5

D-enantiomer L-enantiomer HMM profile BLASTp query D-enantiomer / pyranose form unless mentioned otherwise

Figure S2f CMP- and UDP-linked monosaccharides derived from UDP-Glc2NAc (2 of 2). CMP-Leg5Ac7Ac may be biosynthesized through GDP-linked or UDP-linked intermediates. Abbreviated names are used for some of the monosaccharides. Full names of these are given in Supplementary_data.xlsx:Worksheet4.

16

4 Sedoheptulose-7-phosphate C1

P63224:1-192 Isomerase

4 D-glycero-a/b-D-manno-heptose-7-phosphate C1

Kinase Kinase P76658:1-318 Q9AGY8:1-341

D-glycero-a-D-manno- D-glycero-b-D-manno- 4 4C heptose 1,7-bisphosphate C1 heptose 1,7-bisphosphate 1 Q9AGY5:1-179 Phosphatase P63228:1-191 Phosphatase D-glycero-a-D-manno- D-glycero-b-D-manno- 4 4 heptose 1-phosphate C1 heptose 1-phosphate C1

Q9AGY6:1-230 Guanylyltransferase P76658:344-477 AdenylylT

GDP-D-glycero-a-D- ADP-D-glycero-b- 4 4 manno-heptose C1 D-manno-heptose C1

GPE05020 4,6-Dehydratase (R) P76910:1-310 6-Epimerase Retention of config @ C5 ADP-L-glycero-b-D- GDP-4-keto-6-deoxy-a- 3-Epimerase GDP-4-keto-6-deoxy- 4 manno C1 4 -heptose D-arabino-heptose a-D-lyxo-heptose C1 4 GPE02530 C1 4-Reductase (E) GPE06510 4-Reductase (E) GPE06510 Eq –OH@C4 Eq –OH@C4

GDP-6-deoxy-a- GDP-6-deoxy-a-D- 4 4C D-altro-heptose C1 manno-heptose 1

D-enantiomer L-enantiomer HMM profile BLASTp query Pyranose form unless mentioned otherwise

Figure S2g ADP- and GDP-linked heptoses derived from sedoheptulose-7-phosphate. Abbreviated names are used for some of the monosaccharides. Full names of these are given in Supplementary_data.xlsx:Worksheet4.

17

Figure S3 Variations in the proteome size of organisms which encode the same number of monosaccharides. Only the smallest and largest proteome sizes are shown. As can be seen, the number of monosaccharides used by an organism is independent of the proteome size. For instance, Helicobacter pylori PNG84A (proteome size = 1353) uses the same number of monosaccharides (7) as Sorangium cellulosum So0157-2 (proteome size = 10480).

18

A

B

19

Figure S4 (A) The prevalence of each monosaccharide as percentages of the genomes analyzed in this study (viz., 12939) and the number of species covered by these genomes (viz., 3384; Figure S1(a)). The diagonal line is manually drawn to facilitate visualization of deviations. (B) Zoomed in view of the region near the origin in (A). Data for most of the monosaccharides lie on the diagonal suggesting that the sequencing of a large number of strains for a few species has not biased the outcome, with the exception of (d)TDP-Fuc4NAc and UDP-L-Qui2NAc. (d)TDP-Fuc4NAc (a point below the diagonal line) is present in fewer species but represents a larger fraction of genomes since 679 strains of E. coli contain this monosaccharide. Conversely, presence of UDP-L-Qui2NAc (a point above the diagonal line) is highly strain specific. Abbreviated names are used for some of the monosaccharides. Full names of these are given in Supplementary_data.xlsx:Worksheet4.

20

Figure S5 Setting bit score thresholds for HMM profiles with varying substrate specificities. TrEMBL database was scanned using the profiles shown along the X- and Y-axes in the above scatter plots; for these scans, default values set by HMMer were used for all the parameters. Hits that are common to a pair of profiles (shown along X- and Y-axes) were chosen and bit scores of such hits were plotted against each other. Bit score thresholds (indicated by red lines) were chosen such that a protein is a hit for only one of the two profiles. Threshold was revised for GPE05331 set to exclude PdeG.

21

Table S1 Tools and databases used in this study

Tool / Version / URL Reference Database Release

Tools installed and run locally on a Linux platform

(Altschul et BLASTp 2.2.31+ ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.2.31/ al. 1990)

HMMER 3.1b2 http://hmmer.org/download.html (Eddy 2011)

(Madeira et MUSCLE 3.8.31 https://www.ebi.ac.uk/Tools/msa/muscle/ al. 2019) (Huang et al. CD-Hit 4.6 http://weizhongli-lab.org/cd-hit/ 2010)

Directly accessed from the website or FTP site

(UniProt UniProt 2018_07 https://www.uniprot.org/ Consortium 2019) (NCBI Resource Genome 2019_03 https://www.ncbi.nlm.nih.gov/genome/ Coordinators 2018) (NCBI Not Resource Pubmed https://www.ncbi.nlm.nih.gov/pubmed/ applicable Coordinators 2018) (Knudsen CATH-Plus 4.2 http://www.cathdb.info/ and Wiuf 2010) Not (Berman et PDB https://www.rcsb.org/ applicable al. 2000)

Used through the TrEMBL database

(UniProt Not UniRule https://www.uniprot.org/help/unirule Consortium applicable 2019) (UniProt Not SAAS https://www.uniprot.org/help/saas Consortium applicable 2019)

22

Flowchart S1 Procedure used to generate HMM profiles

1. Generation of Exp dataset and Exp profile, and setting 푇푒푥푝 Step 1a Consider only those enzymes which are characterized by direct enzyme activity assay Step 1b Remove redundancy (80% sequence identity cutoff) and obtain a multiple sequence alignment (MSA) Step 1c Use the MSA as input to generate an HMM profile Step 1d Score Exp dataset sequences against this HMM profile Step 1e Set the bit score of the lowest scoring sequence as the bit score threshold

for Exp dataset, 푇푒푥푝

2. Generation of Extend dataset and Extend profile, and setting 푇푒푥푡푒푛푑 Step 2a Add sequences that meet any of the following criteria to the Exp dataset

(i) SwissProt entries satisfying the threshold 푇푒푥푝

(ii) SwissProt entries scoring < 푇푒푥푝 provided they show conservation of active site residues. Active site residues were collated based on site directed mutagenesis studies or ligand-bound 3D structures (iii) TrEMBL entries for which molecular function has been inferred from experiments other than direct enzyme assays viz., complementation assays, phenotypic studies, etc. (iv) TrEMBL entries with solved 3D structure (v) FunFam members (CATH database) but only in the case of CDP-glucose 4,6-dehydratase (FunFam 20603) and phosphomannoisomerase family 3 (FunFam 54112) Step 2b Remove redundancy (80% sequence identity cutoff) and obtain a multiple sequence alignment (MSA) Step 2c Use the MSA as input to generate an HMM profile Step 2d Score Extend dataset sequences against this HMM profile Step 2e Set the bit score of the lowest scoring sequence as the bit score threshold

for Extend dataset, 푇푒푥푡푒푛푑

23

Flowchart S2 Precedence rules for assigning annotation to proteins that are hits to two or more profiles and/or BLASTp queries # Case 1 of 14 # specific_aminoTs = [GPE01710, GPE01430, GPE01530] # C3_C4_aminoTs = [GPE01910, GPE01830] # EXPECTED: for a protein which is a hit for one of the specific_aminoTs is expected to be a hit in C3_C4_aminoTs as well as GPE01230

IF (hit for any one of specific_aminoTs) THEN IF (hit for any one of C3_C4_aminoTs) THEN IF (hit for GPE01230) THEN Pass (i.e., this is as expected) ELSE Alert: Hit for one of C3_C4_aminoTs but not GPE01230 ENDIF ELSE Alert: Hit for one of specific_aminoTs but not C3_C4_aminoTs ENDIF ENDIF IF (hit for any one of C3_C4_aminoTs) THEN IF (hit for GPE01230) THEN Pass (as expected) ELSE Alert: Hit for one of C3_C4_aminoTs but not GPE01230 ENDIF ENDIF # Case 2 of 14 IF a protein is a hit for any one of specific_aminoTs, it should be assigned that annotation IF (hit for any one of specific_aminoTs) THEN Assign annotation ENDIF ENDIF # Case 3 of 14 # For a protein which is a hit for one of C3_C4_aminoTs but not any of specific_aminoTs, it should be assigned the former IF (hit for one of C3_C4_aminoTs and not for any of specific_aminoTs) THEN Assign GPE01830/GPE01910 annotation ENDIF

24

# Case 4 of 14 # For a protein which is a hit for GPE00210 # GPE00210 is used only in combination with GPE00231 # A protein which is a hit for GPE00210 is expected to be a hit for GPE00231 also IF (hit for GPE00210) THEN IF (hit for GPE00231) THEN Assign GPE00210 annotation ELSE Alert: Hit for GPE00210 but not GPE00231 ENDIF ENDIF # Case 5 of 14 # For a protein which is a hit for GPE02430 # GPE02430 is used only in combination with GPE02530 # A protein which is a hit for GPE02430 is expected to be a hit for GPE02530 also IF (hit for GPE02430) THEN IF (hit for GPE002530) THEN Assign GPE02430 annotation ELSE Alert: Hit for GPE02430 but not GPE02530 ENDIF ENDIF # Case 6 of 14 # For a protein that is a hit for GPE03130 # GPE03130 is used only in combination with GPE03430 # A protein which is a hit for GPE03130 is expected to be a hit for GPE03430 also IF (hit for GPE03130) THEN IF (hit for GPE03430) THEN Assign GPE03130 annotation ELSE Alert: Hit for GPE03130 but not GPE03430 ENDIF ENDIF

25

# Case 7 of 14 # For a protein that is a hit for GPE03210 # GPE03210 is used only in combination with GPE03430 # A protein which is a hit for GPE03210 is expected to be a hit for GPE03430 also IF (hit for GPE03210) THEN IF (hit for GPE03430) THEN Assign GPE03210 annotation ELSE Alert: Hit for GPE03210 but not GPE03430 ENDIF ENDIF # Case 8 of 14 # For a protein that is a hit for GPE09130 # GPE09130 is used only in combination with GPE09630 # A protein which is a hit for GPE09130 is expected to be a hit for GPE09630 also IF (hit for GPE09130) THEN IF (hit for GPE09630) THEN Assign GPE09130 annotation ELSE Alert: Hit for GPE09130 but not GPE09630 ENDIF ENDIF # Case 9 of 14 # For a protein that is a hit for GPE09230 # GPE09230 is used only in combination with GPE09330 and GPE09630 # A protein which is a hit for GPE09230 is expected to be a hit for GPE09630 also IF (hit for GPE09230) THEN IF (hit for GPE09630) THEN Assign GPE09230 annotation ELSE Alert: Hit for GPE09230 but not GPE09630 ENDIF ENDIF

26

# Case 10 of 14 # For a protein that is a hit for GPE09330 and GPE09630 # GPE09330 is used only in combination with GPE09630 # A protein can be a hit for GPE09330 or GPE09630, but not for both (non- orthologous) IF (hit for GPE09330 AND hit for GPE09630) THEN Alert: Hit for GPE09330 and GPE09630 ENDIF # Case 11 of 14 # For a protein that is a hit for GPE00620 and GPE00720 # GPE00620 is used only in combination with GPE00720 # A protein can be a hit for GPE00620 or GPE00720, but not for both (non- orthologous) IF (hit for GPE00620 AND hit for GPE00720) THEN Alert: Hit for GPE00620 and GPE00720 ENDIF # Case 12 of 14 # Isomerases: GPE07030, GPE07130, GPE07230, and GPE07330 # A protein can be a hit for any one of the above four profiles (non-orthologous) For GPE07030, GPE07130, GPE07230 and GPE07330 IF (hit for more than one) Alert: Hit for (list all profiles which appear as hits from above list)] # Case 13 of 14 # For a protein that is a hit for GPE00430 and GPE00530 # GPE00430 is used only in combination with GPE00530 # A protein can be a hit for GPE00430 or GPE00530, but not for both (non- orthologous) IF (hit for GPE00430 AND hit for GPE00530) THEN Alert: Hit for GPE00430 and GPE00530 ENDIF # Case 14 of 14 # For a protein that is a hit for GPE05332 and Q81A42:1-328 # GPE05332 is used only in combination with Q81A42:1-328 # A protein can be a hit for GPE05332 or Q81A42:1-328, but not for both (non- orthologous) IF (hit for GPE05332 AND hit for Q81A42:1-328) THEN Alert: Hit for GPE05332 and Q81A42:1-328 ENDIF

27

REFERENCES CITED IN THIS FILE

1. Polizzi, S.J., Walsh, R.M. Jr, Peeples, W.B., Lim, J.M., Wells, L., Wood, Z.A. (2012) Human UDP-α-D-xylose synthase and Escherichia coli ArnA conserve a conformational shunt that controls whether xylose or 4-keto-xylose is produced. , 51:8844-8855.

2. Tello, M., Jakimowicz, P., Errey, J.C., Freel Meyers, C.L., Walsh, C.T., Buttner, M.J., Lawson, D.M., Field, R.A. (2006) Characterisation of Streptomyces spheroides NovW and revision of its functional assignment to a dTDP-6-deoxy-D-xylo-4- hexulose 3-epimerase. Chem. Commun (Camb)., 1079-1081.

3. Li, Z., Mukherjee, T., Bowler, K., Namdari, S., Snow, Z., Prestridge, S., Carlton, A., Bar-Peled, M. (2017) A four-gene operon in Bacillus cereus produces two rare spore- decorating . J. Biol. Chem., 292:7636-7650.

4. Hwang, S., Aronov, A., Bar-Peled, M. (2015) The Biosynthesis of UDP-D-QuiNAc in Bacillus cereus ATCC 14579. PLoS One, 10:e0133790.

5. Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25:3389-3402.

6. Eddy, S.R. (2009) A new generation of homology search tools based on probabilistic inference. Genome Inform., 23:205-211.

7. Madeira, F., Park, Y.M., Lee, J., Buso, N., Gur, T., Madhusoodanan, N., Basutkar, P., Tivey, A.R.N., Potter, S.C., Finn, R.D., Lopez, R. (2019) The EMBL-EBI search and sequence analysis tools APIs in 2019. Nucleic Acids Res., 47(W1):W636-W641.

8. Fu, L., Niu, B., Zhu, Z., Wu, S., Li, W. (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 28:3150-3152.

9. UniProt Consortium. (2019) UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47(D1):D506-D515.

10. NCBI Resource Coordinators. (2018) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 46(D1):D8‐D13.

11. Dawson, N.L., Lewis, T.E., Das, S., Lees, J.G., Lee, D., Ashford, P., Orengo, C.A., Sillitoe, I. (2017) CATH: an expanded resource to predict protein function through structure and sequence. Nucleic Acids Res., 45(D1):D289-D295.

28

12. Berman, H.M., Battistuz, T., Bhat, T.N., Bluhm, W.F., Bourne, P.E., Burkhardt, K., Feng, Z., Gilliland, G.L., Iype, L., Jain, S., Fagan, P., Marvin, J., Padilla, D., Ravichandran, V., Schneider, B., Thanki, N., Weissig, H., Westbrook, J.D., Zardecki, C. (2000) The Protein Data Bank. Acta Crystallogr. D Biol. Crystallogr., 58:899-907.

29